Parsers¶

Parsers are responsible for two distinct stages:

- turning raw log lines into structured records
- turning structured message text into templates

This page is the reference for both the built-in parser implementations and the protocols they satisfy.

>>> from anomalog.parsers import BGLParser, Drain3Parser, IdentityTemplateParser
>>> BGLParser.name
'bgl'
>>> Drain3Parser.name
'drain3'
>>> IdentityTemplateParser("demo").inference("node 7 failed")
('node 7 failed', [])
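The two stages can be pictured as two small interfaces. The sketch below is a hypothetical stand-in, not the package's actual protocol definitions: the names `StructuredParserLike`, `TemplateParserLike`, `EchoTemplateParser`, and `run_template_stage` are invented for illustration, and only the method names `parse_line`, `train`, and `inference` come from this page.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator
from typing import Protocol


class StructuredParserLike(Protocol):
    """Stage 1 (hypothetical): raw log line -> structured record, or None."""

    def parse_line(self, raw_line: str) -> object | None: ...


class TemplateParserLike(Protocol):
    """Stage 2 (hypothetical): message text -> (template, parameters)."""

    def train(self, untemplated_text_iterator: Callable[[], Iterator[str]]) -> None: ...

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]: ...


class EchoTemplateParser:
    """Stand-in that mimics the identity behaviour shown in the doctest above."""

    def train(self, untemplated_text_iterator: Callable[[], Iterator[str]]) -> None:
        # Stateless: there is nothing to learn from the stream.
        return None

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]:
        # The text is its own template and carries no parameters.
        return unstructured_text, []


def run_template_stage(
    parser: TemplateParserLike, lines: list[str]
) -> list[tuple[str, list[str]]]:
    # Train first, then infer a template for every line.
    parser.train(lambda: iter(lines))
    return [parser.inference(line) for line in lines]


result = run_template_stage(EchoTemplateParser(), ["node 7 failed"])
```

Because the interfaces are structural, any object with matching methods can be passed to `run_template_stage` without inheriting from a base class.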

`anomalog.parsers`¶

Public parser package.

`AITADSParser` `dataclass` ¶

Bases: StructuredParser

Parse the canonical JSONL alert stream derived from AIT-ADS.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.

`parse_line(raw_line)` ¶

Parse one canonical AIT-ADS alert row.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Canonical JSONL row emitted by the AIT-ADS source.	required

Returns:

Type	Description
`BaseStructuredLine \| None`	BaseStructuredLine \| None: Parsed canonical alert, or `None` when the row is malformed.
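The "return `None` when the row is malformed" contract can be illustrated with plain JSON decoding. This is a minimal sketch under assumptions: `parse_jsonl_row` is a hypothetical helper, it returns a plain `dict` rather than a `BaseStructuredLine`, and the actual AIT-ADS field names are not shown on this page.

```python
from __future__ import annotations

import json


def parse_jsonl_row(raw_line: str) -> dict | None:
    """Return the decoded row, or None when the row is malformed."""
    try:
        row = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    # A JSONL record is expected to be an object; anything else is
    # treated as malformed in this sketch.
    if not isinstance(row, dict):
        return None
    return row
```

Returning `None` instead of raising lets a caller skip bad rows while streaming through a large file.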

`BGLParser` `dataclass` ¶

Bases: StructuredParser

Parse Blue Gene/L log lines into structured fields with anomaly flag.

The BGL corpus encodes anomaly state in the optional leading dash, so this parser preserves that dataset convention directly in the shared anomalous field while keeping the original message tail for template mining.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.

`parse_line(raw_line)` ¶

Parse a single BGL line; return None for unparseable lines.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw BGL log line to parse.	required

Examples:

>>> sample = (
...     "- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
...     "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
...     "RAS KERNEL INFO cache parity corrected"
... )
>>> parsed = BGLParser().parse_line(sample)
>>> (parsed.entity_id, parsed.anomalous)  # dash prefix => normal
('R02-M1-N0-C:J12-U11', 0)

Returns:

Type	Description
`BaseStructuredLine \| None`	BaseStructuredLine \| None: Parsed structured record, or `None` when the line does not match the expected format.
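The label convention can be sketched without the library. The code below is an illustration under assumptions, not the `BGLParser` implementation: `BglLikeRecord` and `parse_bgl_like` are hypothetical names, the field layout is inferred from the sample line above, and how much of the tail the real parser keeps as message text is not stated on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BglLikeRecord:
    """Hypothetical record; field names mirror the doctest above."""

    entity_id: str
    anomalous: int
    untemplated_message_text: str


def parse_bgl_like(raw_line: str) -> BglLikeRecord | None:
    # Assumed layout: label, epoch, date, node, timestamp, node, message tail.
    parts = raw_line.rstrip("\n").split(" ", 6)
    if len(parts) < 7:
        return None  # Too few fields to be a BGL-style line.
    label, _epoch, _date, node, _timestamp, _node_repeat, tail = parts
    # A "-" label marks a normal line; any other label marks an alert.
    anomalous = 0 if label == "-" else 1
    return BglLikeRecord(
        entity_id=node, anomalous=anomalous, untemplated_message_text=tail
    )


sample = (
    "- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
    "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
    "RAS KERNEL INFO cache parity corrected"
)
record = parse_bgl_like(sample)
```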

`Drain3Parser` ¶

Bases: TemplateParser

Drain3-based template miner with Prefect asset caching.

Instances accept an optional dataset name plus explicit config and cache paths so trained state can be persisted per dataset.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry name for the built-in Drain3 parser.
`is_identity_parser`	`ClassVar[bool]`	Always `False`; Drain3 mines templates rather than preserving the raw text.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str \| None`	Optional dataset name used to scope persisted Drain3 state.	`None`
`config_file`	`Path \| None`	Optional Drain3 config file override.	`None`
`cache_path`	`Path \| None`	Optional explicit cache directory override.	`None`

`cache_file_path` `property` ¶

Return the resolved cache file path for this parser instance.

Raises:

Type	Description
`ValueError`	If the parser has not been bound to a dataset yet.

`resolved_cache_path` `property` ¶

Return the on-disk cache directory for this parser instance.

Raises:

Type	Description
`ValueError`	If the parser has not been bound to a dataset yet.

`inference(unstructured_text)` ¶

Return template and parameters for a single unstructured log line.

Parameters:

Name	Type	Description	Default
`unstructured_text`	`UntemplatedText`	Raw untemplated log line to match against the trained miner.	required

Returns:

Type	Description
`tuple[LogTemplate, ExtractedParameters]`	tuple[LogTemplate, ExtractedParameters]: Matched template and extracted parameter values.

Raises:

Type	Description
`ValueError`	If the parser has not been trained yet.

`train(untemplated_text_iterator, *, asset_deps=None)` ¶

Train Drain3 on the dataset's untemplated message stream.

Parameters:

Name	Type	Description	Default
`untemplated_text_iterator`	`Callable[[], Iterator[UntemplatedText]]`	Zero-argument iterator factory over untemplated message text.	required
`asset_deps`	`list[Asset] \| None`	Optional upstream asset dependencies to include in the training cache key.	`None`
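`train` takes a zero-argument factory rather than a bare iterator. One plausible reason, stated here as an assumption rather than something this page confirms, is that a factory can be called again to restart the stream, whereas an exhausted iterator cannot be rewound. The helpers `make_text_factory` and `count_two_passes` below are hypothetical.

```python
from collections.abc import Callable, Iterator


def make_text_factory(messages: list[str]) -> Callable[[], Iterator[str]]:
    """Build a zero-argument factory; each call returns a fresh iterator."""

    def factory() -> Iterator[str]:
        return iter(messages)

    return factory


def count_two_passes(
    untemplated_text_iterator: Callable[[], Iterator[str]],
) -> tuple[int, int]:
    # Hypothetical consumer that needs to read the stream twice.
    first = sum(1 for _ in untemplated_text_iterator())
    second = sum(1 for _ in untemplated_text_iterator())
    return first, second


factory = make_text_factory(["node 7 failed", "node 8 failed"])
passes = count_two_passes(factory)

# A bare iterator, by contrast, is already empty on the second pass.
bare = iter(["node 7 failed", "node 8 failed"])
bare_passes = (sum(1 for _ in bare), sum(1 for _ in bare))
```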

`HDFSV1Parser` `dataclass` ¶

Bases: StructuredParser

Parse HDFS v1 log lines into structured fields.

HDFS anomaly datasets are block-centric, so this parser prefers the block id mentioned in the log message as the entity_id; when no block is present it falls back to the logging component so entity-based grouping still works.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.

`parse_line(raw_line)` ¶

Parse a single HDFS v1 line; return None for unparseable lines.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw HDFS log line to parse.	required

Examples:

>>> line = (
...     "081109 203518 143 INFO dfs.DataNode$DataXceiver: "
...     "Receiving block blk_-160 src: /10.0.0.1:54106 "
...     "dest: /10.0.0.2:50010"
... )
>>> parsed = HDFSV1Parser().parse_line(line)
>>> parsed.entity_id, parsed.anomalous, parsed.untemplated_message_text[:13]
('blk_-160', None, 'INFO dfs.Data')

Returns:

Type	Description
`BaseStructuredLine \| None`	BaseStructuredLine \| None: Parsed structured record, or `None` when the line does not match the expected format.
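The block-id preference with a component fallback can be sketched as follows. This is an illustration under assumptions, not the `HDFSV1Parser` source: `entity_id_for` and the `BLOCK_ID` pattern are hypothetical, and the message layout `<LEVEL> <component>: <text>` is inferred from the example above.

```python
from __future__ import annotations

import re

# Assumed shape of an HDFS block id, e.g. "blk_-160" or "blk_42".
BLOCK_ID = re.compile(r"blk_-?\d+")


def entity_id_for(message_text: str) -> str | None:
    """Prefer a block id; fall back to the logging component."""
    match = BLOCK_ID.search(message_text)
    if match:
        return match.group(0)
    # Assumed message layout: "<LEVEL> <component>: <text>".
    tokens = message_text.split()
    if len(tokens) < 2:
        return None
    return tokens[1].rstrip(":")
```

With the fallback in place, lines that mention no block still land in a stable group keyed by component.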

`IdentityTemplateParser` `dataclass` ¶

Bases: TemplateParser

No-op template parser that returns the input string as its template.

This parser is useful when experiments should operate on exact message text rather than mined abstractions, or when tests need deterministic, side-effect-free template inference.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the identity parser.
`is_identity_parser`	`ClassVar[bool]`	Always `True`; the parser returns the raw text unchanged.
`dataset_name`	`str \| None`	Optional dataset identifier kept only for parity with the shared template-parser contract.

`inference(unstructured_text)` ¶

Return the raw text as the template with no parameters.

Parameters:

Name	Type	Description	Default
`unstructured_text`	`UntemplatedText`	Raw log text to treat as its own template.	required

Examples:

>>> IdentityTemplateParser("demo").inference("hello")
('hello', [])

Returns:

Type	Description
`tuple[LogTemplate, ExtractedParameters]`	tuple[LogTemplate, ExtractedParameters]: Raw text and an empty parameter list.

`train(untemplated_text_iterator, *, asset_deps=None)` ¶

Ignore the training stream because identity inference is stateless.

Parameters:

Name	Type	Description	Default
`untemplated_text_iterator`	`Callable[[], Iterator[UntemplatedText]]`	Iterator factory accepted for contract compatibility.	required
`asset_deps`	`list[Asset] \| None`	Ignored upstream asset dependencies accepted for interface compatibility.	`None`

`OpenStackDeepLogParser` `dataclass` ¶

Bases: StructuredParser

Parse labelled OpenStack rows used by the DeepLog reproduction preset.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.

`parse_line(raw_line)` ¶

Parse one labelled OpenStack row into the shared structured schema.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw labelled OpenStack row from the preprocessed stream.	required

Returns:

Type	Description
`BaseStructuredLine \| None`	BaseStructuredLine \| None: Structured row, or `None` when the labelled OpenStack row is malformed.

`ParquetStructuredSink` `dataclass` ¶

Bases: StructuredSink

StructuredSink backed by partitioned Parquet datasets.

Provides efficient iteration, windowing helpers, and label-aware counts for downstream anomaly workflows.

Attributes:

Name	Type	Description
`dataset_name`	`str`	Dataset identifier used to scope parquet cache paths.
`raw_dataset_path`	`Path`	Materialised raw log file parsed into parquet.
`parser`	`StructuredParser`	Parser used to convert raw lines into structured records.
`cache_paths`	`CachePathsConfig`	Data/cache roots used for parquet output.
`cache_dir`	`ClassVar[str]`	Dataset-local cache directory name for structured parquet artifacts.

`count_entities_by_label(label_for_group)` ¶

Return counts of normal and total distinct entity ids.

Parameters:

Name	Type	Description	Default
`label_for_group`	`Callable[[str], int \| None]`	Lookup that maps each entity id to its anomaly label.	required

Returns:

Name	Type	Description
`EntityLabelCounts`	`EntityLabelCounts`	Normal and total distinct entity counts.
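A minimal sketch of the counting logic, assuming that label `0` means normal and that an entity whose label is `None` counts toward the total only. `count_normal_and_total` is a hypothetical helper that returns a plain tuple rather than an `EntityLabelCounts`.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable


def count_normal_and_total(
    entity_ids: Iterable[str],
    label_for_group: Callable[[str], int | None],
) -> tuple[int, int]:
    # Deduplicate first so repeated rows for one entity count once.
    distinct = set(entity_ids)
    # Assumption: 0 means normal; None (unlabelled) is not counted as normal.
    normal = sum(1 for entity_id in distinct if label_for_group(entity_id) == 0)
    return normal, len(distinct)


labels = {"blk_1": 0, "blk_2": 1}
counts = count_normal_and_total(["blk_1", "blk_2", "blk_1", "blk_3"], labels.get)
```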

`count_rows()` ¶

Return total number of structured rows.

Returns:

Name	Type	Description
`int`	`int`	Total number of structured rows.

`entity_chronology_index_path()` ¶

Return the sidecar path storing entity chronology metadata.

Returns:

Name	Type	Description
`Path`	`Path`	JSONL sidecar path used for entity chronology ordering.

`entity_count_path()` ¶

Return the sidecar path storing the total distinct entity count.

Returns:

Name	Type	Description
`Path`	`Path`	JSON sidecar path for the total distinct entity count.

`inline_label_cache_path()` ¶

Return the sidecar path storing sparse inline labels.

Returns:

Name	Type	Description
`Path`	`Path`	JSONL sidecar path used for sparse inline anomaly labels.

`iter_entity_sequences()` ¶

Yield sequences grouped by entity in deterministic bucket order.

Returns:

Type	Description
`Callable[[], Iterator[Collection[StructuredLine]]]`	Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing entity-grouped windows ordered by bucket and then by the materialised chronology index within each bucket.
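The ordering described above (bucket first, then chronology within the bucket) can be sketched in memory. Everything here is an assumption made for illustration: the CRC32 bucketing, the default bucket count, the `(line_order, entity_id)` row shape, and the use of an entity's first `line_order` as its chronology key are not taken from the sink's implementation.

```python
import zlib
from collections import defaultdict


def bucket_for(entity_id: str, bucket_count: int) -> int:
    # crc32 is stable across processes, unlike the built-in hash() for str.
    return zlib.crc32(entity_id.encode("utf-8")) % bucket_count


def entity_sequences(
    rows: list[tuple[int, str]], bucket_count: int = 4
) -> list[list[tuple[int, str]]]:
    """Group hypothetical (line_order, entity_id) rows by entity."""
    grouped: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for row in sorted(rows):
        grouped[row[1]].append(row)
    # Order entities by bucket, then by the line_order of their first row.
    ordered = sorted(
        grouped,
        key=lambda e: (bucket_for(e, bucket_count), grouped[e][0][0]),
    )
    return [grouped[e] for e in ordered]
```

A process-independent hash matters here: it is what makes the emitted order reproducible between runs.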

`iter_entity_sequences_from_line_order(min_line_order)` ¶

Yield entity sequences whose rows occur at or after a raw cutoff.

Parameters:

Name	Type	Description	Default
`min_line_order`	`int`	Inclusive raw-entry cutoff used to filter out the train prefix before entity grouping.	required

Returns:

Type	Description
`Callable[[], Iterator[Collection[StructuredLine]]]`	Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing entity-grouped rows for the suffix at or after the requested cutoff.

`iter_fixed_window_sequences(window_size, step_size=None)` ¶

Yield sequences of fixed window size over ordered rows.

Parameters:

Name	Type	Description	Default
`window_size`	`int`	Number of rows in each emitted window.	required
`step_size`	`int \| None`	Optional step between successive windows. Defaults to `window_size`.	`None`

Returns:

Type	Description
`Callable[[], Iterator[Collection[StructuredLine]]]`	Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing fixed-size row windows.
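The interaction between `window_size` and `step_size` is easiest to see on a small list. The sketch below is a hypothetical in-memory version, not the sink's parquet-backed implementation; in particular, dropping a trailing partial window is an assumption, since this page does not say how the remainder is handled.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence


def fixed_windows(
    rows: Sequence[str], window_size: int, step_size: int | None = None
) -> Iterator[list[str]]:
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # step_size defaults to window_size, which gives non-overlapping windows.
    step = window_size if step_size is None else step_size
    if step <= 0:
        raise ValueError("step_size must be positive")
    # Assumption: a trailing partial window is dropped rather than emitted.
    for start in range(0, len(rows) - window_size + 1, step):
        yield list(rows[start : start + window_size])


rows = ["a", "b", "c", "d", "e"]
tumbling = list(fixed_windows(rows, window_size=2))
sliding = list(fixed_windows(rows, window_size=3, step_size=1))
```

A `step_size` smaller than `window_size` produces overlapping windows; a larger one skips rows between windows.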

`iter_structured_lines(columns=None, *, filter_expr=None, batch_size=None)` ¶

Iterate over StructuredLine objects with optional column projection.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	Optional projected column names to load from parquet.	`None`
`filter_expr`	`Expression \| None`	Optional PyArrow dataset filter expression.	`None`
`batch_size`	`int \| None`	Optional scanner batch size override.	`None`

Returns:

Type	Description
`Callable[[], Iterator[StructuredLine]]`	Callable[[], Iterator[StructuredLine]]: Callable producing projected structured rows from the parquet dataset.

`iter_structured_lines_in_source_order(filter_expr=None)` ¶

Iterate over structured rows in raw-entry order.

The parquet dataset is partitioned by entity buckets, so this merges the bucket-local scans by line_order to recover the original raw-entry chronology without materialising the entire dataset in memory.

Parameters:

Name	Type	Description	Default
`filter_expr`	`Expression \| None`	Optional dataset filter applied before bucket-local source-order merging. This is used by the split-aware suffix replay paths to avoid rescanning the train prefix when only the test suffix is needed.	`None`

Returns:

Type	Description
`Callable[[], Iterator[StructuredLine]]`	Callable[[], Iterator[StructuredLine]]: Zero-argument callable that yields structured rows ordered by `line_order`.
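Merging bucket-local scans by `line_order` is a k-way merge of already-sorted streams. A minimal sketch using the standard library's `heapq.merge`, assuming each bucket scan is itself sorted by `line_order`; the `Row` type and `merge_in_source_order` helper are hypothetical stand-ins for the sink's internals.

```python
import heapq
from collections.abc import Iterable, Iterator
from typing import NamedTuple


class Row(NamedTuple):
    line_order: int
    entity_id: str


def merge_in_source_order(bucket_scans: Iterable[Iterable[Row]]) -> Iterator[Row]:
    # heapq.merge is lazy: it holds only one pending row per bucket, so the
    # full dataset never has to be loaded at once. It relies on each
    # bucket-local scan already being sorted by line_order.
    return heapq.merge(*bucket_scans, key=lambda row: row.line_order)


bucket_0 = [Row(0, "node-a"), Row(3, "node-a")]
bucket_1 = [Row(1, "node-b"), Row(2, "node-b"), Row(4, "node-b")]
merged = [row.line_order for row in merge_in_source_order([bucket_0, bucket_1])]
```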

`iter_time_window_sequences(time_span_ms, step_span_ms=None)` ¶

Yield sequences grouped by sliding time windows.

Parameters:

Name	Type	Description	Default
`time_span_ms`	`int`	Width of each window in milliseconds.	required
`step_span_ms`	`int \| None`	Optional step between successive windows. Defaults to `time_span_ms`.	`None`

Returns:

Type	Description
`Callable[[], Iterator[Collection[StructuredLine]]]`	Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing time-window grouped rows.
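A small in-memory sketch of sliding time windows. The details are assumptions rather than documented behaviour: rows are hypothetical `(timestamp_ms, message)` pairs sorted by timestamp, windows are half-open intervals anchored at the first timestamp, and empty windows are skipped.

```python
from __future__ import annotations

from collections.abc import Iterator


def time_windows(
    rows: list[tuple[int, str]],
    time_span_ms: int,
    step_span_ms: int | None = None,
) -> Iterator[list[tuple[int, str]]]:
    # step_span_ms defaults to time_span_ms, giving back-to-back windows.
    step = time_span_ms if step_span_ms is None else step_span_ms
    if time_span_ms <= 0 or step <= 0:
        raise ValueError("spans must be positive")
    if not rows:
        return
    start, last = rows[0][0], rows[-1][0]
    while start <= last:
        # Half-open interval [start, start + time_span_ms).
        window = [row for row in rows if start <= row[0] < start + time_span_ms]
        if window:  # Assumption: empty windows are skipped.
            yield window
        start += step


events = [(0, "a"), (400, "b"), (1000, "c"), (2500, "d")]
tumbling = list(time_windows(events, time_span_ms=1000))
overlapping = list(time_windows(events, time_span_ms=1000, step_span_ms=500))
```

The list comprehension rescans all rows for every window, which is fine for a sketch but would be replaced by a streaming scan on a real dataset.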

`load_entity_chronology_index()` ¶

Load the materialised chronology sidecar, if it exists.

Returns:

Type	Description
`dict[str, EntityChronologyKey]`	dict[str, EntityChronologyKey]: Chronology metadata keyed by entity id.

`load_entity_count()` ¶

Load the total distinct entity count sidecar, if present.

Returns:

Type	Description
`int \| None`	int \| None: Total entity count when the sidecar exists, otherwise `None`.

`load_inline_label_cache()` ¶

Load sparse inline labels directly from parquet batches.

Returns:

Type	Description
`tuple[dict[int, int], dict[str, int]]`	tuple[dict[int, int], dict[str, int]]: Sparse per-line and per-group anomaly labels.

`structured_cache_path()` ¶

Return the structured parquet cache path for this sink.

Returns:

Name	Type	Description
`Path`	`Path`	Directory containing the materialised structured dataset.

`structured_data_cache(dataset_name)` ¶

Return the cache directory for this dataset.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name whose parquet cache should be used.	required

Returns:

Name	Type	Description
`Path`	`Path`	Structured-parquet cache directory for the dataset.

`timestamp_bounds()` ¶

Return min and max timestamps present in the dataset.

Returns:

Type	Description
`tuple[int \| None, int \| None]`	tuple[int \| None, int \| None]: Minimum and maximum timestamps, if any.

`write_structured_lines(_workers=None, *, refresh_cache=False)` ¶

Parse raw logs and persist structured lines to Parquet.

Parameters:

Name	Type	Description	Default
`_workers`	`int \| None`	Reserved worker-count override. Currently unused by this sink implementation.	`None`
`refresh_cache`	`bool`	Whether to force Prefect to ignore any cached materialisation result and rebuild the parquet cache.	`False`

Returns:

Name	Type	Description
`bool`	`bool`	Whether any anomalous rows were observed during parsing.

`SpellTemplateParser` `dataclass` ¶

Bases: TemplateParser

Spell-based template parser for DeepLog-style key extraction.

This parser trains Spell on the provided text stream, then performs inference by matching lines against the mined templates. Training now delegates to the upstream spellpy parser directly, which keeps the implementation small while still avoiding the old raw CSV bottleneck.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.
`is_identity_parser`	`ClassVar[bool]`	Always `False`; Spell mines a canonical template representation from the raw text.
`dataset_name`	`str \| None`	Optional dataset name used for cache paths.
`tau`	`float`	Spell similarity threshold passed to Spell training.
`max_lcs_comparisons_per_line`	`int \| None`	Maximum number of LCS comparisons spellpy may perform for one line before it falls back to creating or reusing a less-specific template.
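The role of `tau` can be illustrated with a simplified longest-common-subsequence (LCS) match. This is a sketch of the general idea behind Spell, not spellpy's implementation: `lcs_tokens` and `matches_template` are hypothetical helpers, the `<*>` wildcard token and the default of `0.5` are assumptions, and the acceptance rule used here is that the shared tokens must cover at least `tau` of the incoming line.

```python
def lcs_tokens(a: list[str], b: list[str]) -> list[str]:
    """Longest common subsequence of two token lists (classic DP)."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    result: list[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def matches_template(line: str, template: str, tau: float = 0.5) -> bool:
    line_tokens = line.split()
    # Wildcards carry no literal text, so they are ignored when comparing.
    template_tokens = [t for t in template.split() if t != "<*>"]
    common = lcs_tokens(line_tokens, template_tokens)
    # Accept when the shared tokens cover at least tau of the line.
    return len(common) >= tau * len(line_tokens)
```

Each LCS computation is quadratic in the token counts, which suggests why a cap such as `max_lcs_comparisons_per_line` can be useful on large template sets.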

`inference(unstructured_text)` ¶

Infer template and extracted parameters for one log line.

Parameters:

Name	Type	Description	Default
`unstructured_text`	`UntemplatedText`	Raw line to match.	required

Returns:

Type	Description
`tuple[LogTemplate, ExtractedParameters]`	tuple[LogTemplate, ExtractedParameters]: Matched template and captured parameters, or a self-template fallback when unmatched.

Raises:

Type	Description
`ValueError`	If the parser has not been trained yet.

`train(untemplated_text_iterator, *, asset_deps=None)` ¶

Train Spell templates from the text stream.

Parameters:

Name	Type	Description	Default
`untemplated_text_iterator`	`Callable[[], Iterator[UntemplatedText]]`	Zero-argument iterator factory over untemplated message text.	required
`asset_deps`	`list[Asset] \| None`	Ignored upstream asset dependencies accepted for interface compatibility.	`None`

Raises:

Type	Description
`ModuleNotFoundError`	If optional `spellpy` is not installed.

`ThunderbirdParser` `dataclass` ¶

Bases: StructuredParser

Parse Thunderbird supercomputer log lines into structured fields.

Loghub's Thunderbird corpus uses a labelled raw-line format where the first token marks alert status (`-` for normal, any other tag for an alert) and the remaining header fields expose the event chronology plus the host and location tokens. The parser keeps the free-text tail as the message body for template mining, stripping an optional `component[pid]:` prefix when the raw line includes one. It also trims a trailing colon from bare message tails such as `mysql_install_db:` so the template miner sees the underlying command name rather than the punctuation artefact. The parser collapses the label into AnomaLog's binary anomaly flag.

The parser deliberately stays close to the observed raw structure so the downstream template miner sees the message body rather than a Thunderbird-specific normalisation of the header fields.
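The label collapse and message-tail clean-up described above can be sketched as two small helpers. These are hypothetical illustrations, not the `ThunderbirdParser` source: the regular expression is an assumed shape for the `component[pid]:` prefix, and "bare tail" is interpreted here as a single token ending in a colon.

```python
from __future__ import annotations

import re

# Assumed shape of a leading "component[pid]:" prefix.
COMPONENT_PREFIX = re.compile(r"^[^\s\[\]:]+\[\d+\]:\s*")


def binary_label(label_token: str) -> int:
    # "-" marks a normal line; any other tag is collapsed to 1.
    return 0 if label_token == "-" else 1


def clean_message_tail(tail: str) -> str | None:
    """Return the message body, or None when nothing is left."""
    stripped = COMPONENT_PREFIX.sub("", tail.strip(), count=1).strip()
    # A bare tail such as "mysql_install_db:" keeps the name, not the colon.
    if " " not in stripped and stripped.endswith(":"):
        stripped = stripped[:-1]
    return stripped or None
```

Returning `None` for an empty body matches the documented behaviour of skipping rows that have no message text.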

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in parser.

`analyse_line(raw_line)` `classmethod` ¶

Parse one Thunderbird line and report the reason when skipped.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw Thunderbird log line to inspect.	required

Returns:

Type	Description
`tuple[BaseStructuredLine \| None, str \| None]`	tuple[BaseStructuredLine \| None, str \| None]: Parsed structured row and an optional skip reason.

`parse_line(raw_line)` ¶

Parse a single Thunderbird line; return None for skipped rows.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw Thunderbird log line to parse.	required

Returns:

Type	Description
`BaseStructuredLine \| None`	BaseStructuredLine \| None: Parsed structured record, or `None` when the line is blank, malformed, or has no message body.

`raw_label_for_line(raw_line)` `classmethod` ¶

Return the raw anomaly label token for one Thunderbird line.

Parameters:

Name	Type	Description	Default
`raw_line`	`str`	Raw Thunderbird log line to inspect.	required

Returns:

Type	Description
`int \| None`	int \| None: `0` for normal rows, `1` for anomalous rows, or `None` when the line does not match the Thunderbird envelope.

Notes

The helper mirrors the raw-line label token even when the parser later skips the row because the message body is empty. That keeps raw-position window labels aligned with the original line stream.
