Parsers

Parsers are responsible for two distinct stages:

  • turning raw log lines into structured records
  • turning structured message text into templates

This page is the reference for both the built-in parser implementations and the protocols they satisfy.

>>> from anomalog.parsers import BGLParser, Drain3Parser, IdentityTemplateParser
>>> BGLParser.name
'bgl'
>>> Drain3Parser.name
'drain3'
>>> IdentityTemplateParser("demo").inference("node 7 failed")
('node 7 failed', [])

anomalog.parsers

Public parser package.

AITADSParser dataclass

Bases: StructuredParser

Parse the canonical JSONL alert stream derived from AIT-ADS.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

parse_line(raw_line)

Parse one canonical AIT-ADS alert row.

Parameters:

Name Type Description Default
raw_line str

Canonical JSONL row emitted by the AIT-ADS source.

required

Returns:

Type Description
BaseStructuredLine | None

BaseStructuredLine | None: Parsed canonical alert, or None when the row is malformed.

BGLParser dataclass

Bases: StructuredParser

Parse Blue Gene/L log lines into structured fields with anomaly flag.

The BGL corpus encodes anomaly state in the optional leading dash, so this parser preserves that dataset convention directly in the shared anomalous field while keeping the original message tail for template mining.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

parse_line(raw_line)

Parse a single BGL line; return None for unparseable lines.

Parameters:

Name Type Description Default
raw_line str

Raw BGL log line to parse.

required

Examples:

>>> sample = (
...     "- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
...     "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
...     "RAS KERNEL INFO cache parity corrected"
... )
>>> parsed = BGLParser().parse_line(sample)
>>> (parsed.entity_id, parsed.anomalous)  # dash prefix => normal
('R02-M1-N0-C:J12-U11', 0)

Returns:

Type Description
BaseStructuredLine | None

BaseStructuredLine | None: Parsed structured record, or None when the line does not match the expected format.

Drain3Parser

Bases: TemplateParser

Drain3-based template miner with Prefect asset caching.

Instances accept an optional dataset name plus explicit config and cache paths so trained state can be persisted per dataset.

Attributes:

Name Type Description
name ClassVar[str]

Registry name for the built-in Drain3 parser.

is_identity_parser ClassVar[bool]

Always False; Drain3 mines templates rather than preserving the raw text.

Parameters:

Name Type Description Default
dataset_name str | None

Optional dataset name used to scope persisted Drain3 state.

None
config_file Path | None

Optional Drain3 config file override.

None
cache_path Path | None

Optional explicit cache directory override.

None

cache_file_path property

Return the resolved cache file path for this parser instance.

Raises:

Type Description
ValueError

If the parser has not been bound to a dataset yet.

resolved_cache_path property

Return the on-disk cache directory for this parser instance.

Raises:

Type Description
ValueError

If the parser has not been bound to a dataset yet.

inference(unstructured_text)

Return template and parameters for a single unstructured log line.

Parameters:

Name Type Description Default
unstructured_text UntemplatedText

Raw untemplated log line to match against the trained miner.

required

Returns:

Type Description
tuple[LogTemplate, ExtractedParameters]

tuple[LogTemplate, ExtractedParameters]: Matched template and extracted parameter values.

Raises:

Type Description
ValueError

If the parser has not been trained yet.

train(untemplated_text_iterator, *, asset_deps=None)

Train Drain3 on the dataset's untemplated message stream.

Parameters:

Name Type Description Default
untemplated_text_iterator Callable[[], Iterator[UntemplatedText]]

Zero-argument iterator factory over untemplated message text.

required
asset_deps list[Asset] | None

Optional upstream asset dependencies to include in the training cache key.

None

HDFSV1Parser dataclass

Bases: StructuredParser

Parse HDFS v1 log lines into structured fields.

HDFS anomaly datasets are block-centric, so this parser prefers the block id mentioned in the log message as the entity_id; when no block is present it falls back to the logging component so entity-based grouping still works.
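
The block-id preference described above can be sketched as follows. This is a minimal illustration and not the parser's actual code: the regular expression and the helper name `pick_entity_id` are assumptions made for the example.

```python
import re

# Assumed pattern: "blk_" followed by an optionally negative integer.
_BLOCK_ID = re.compile(r"blk_-?\d+")


def pick_entity_id(message: str, component: str) -> str:
    """Prefer the first block id in the message; otherwise use the component."""
    match = _BLOCK_ID.search(message)
    return match.group(0) if match else component


with_block = pick_entity_id(
    "Receiving block blk_-160 src: /10.0.0.1:54106", "dfs.DataNode$DataXceiver"
)
without_block = pick_entity_id("Starting periodic scan", "dfs.DataBlockScanner")
print(with_block, without_block)
```

Falling back to the component means every parsed row still carries some entity_id, so entity-based grouping never has to handle a missing key.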

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

parse_line(raw_line)

Parse a single HDFS v1 line; return None for unparseable lines.

Parameters:

Name Type Description Default
raw_line str

Raw HDFS log line to parse.

required

Examples:

>>> line = (
...     "081109 203518 143 INFO dfs.DataNode$DataXceiver: "
...     "Receiving block blk_-160 src: /10.0.0.1:54106 "
...     "dest: /10.0.0.2:50010"
... )
>>> parsed = HDFSV1Parser().parse_line(line)
>>> parsed.entity_id, parsed.anomalous, parsed.untemplated_message_text[:13]
('blk_-160', None, 'INFO dfs.Data')

Returns:

Type Description
BaseStructuredLine | None

BaseStructuredLine | None: Parsed structured record, or None when the line does not match the expected format.

IdentityTemplateParser dataclass

Bases: TemplateParser

No-op template parser that returns the input string as its template.

This parser is useful when experiments should operate on exact message text rather than mined abstractions, or when tests need deterministic, side-effect-free template inference.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the identity parser.

is_identity_parser ClassVar[bool]

Always True; the parser returns the raw text unchanged.

dataset_name str | None

Optional dataset identifier kept only for parity with the shared template-parser contract.

inference(unstructured_text)

Return the raw text as the template with no parameters.

Parameters:

Name Type Description Default
unstructured_text UntemplatedText

Raw log text to treat as its own template.

required

Examples:

>>> IdentityTemplateParser("demo").inference("hello")
('hello', [])

Returns:

Type Description
tuple[LogTemplate, ExtractedParameters]

tuple[LogTemplate, ExtractedParameters]: Raw text and an empty parameter list.

train(untemplated_text_iterator, *, asset_deps=None)

Ignore the training stream because identity inference is stateless.

Parameters:

Name Type Description Default
untemplated_text_iterator Callable[[], Iterator[UntemplatedText]]

Iterator factory accepted for contract compatibility.

required
asset_deps list[Asset] | None

Ignored upstream asset dependencies accepted for interface compatibility.

None
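
The template-parser contract documented on this page (an inference method returning a template plus parameters, and a train method taking a zero-argument iterator factory) can be illustrated with a small standalone class. This is a sketch only: `DigitMaskingTemplateParser` is hypothetical, it does not subclass the real TemplateParser (whose abstract requirements are not shown here), and plain `str` and `list[str]` stand in for LogTemplate and ExtractedParameters.

```python
from __future__ import annotations

import re
from collections.abc import Callable, Iterator

_INTEGER = re.compile(r"\b\d+\b")


class DigitMaskingTemplateParser:
    """Hypothetical parser that masks integer tokens with <*>."""

    name = "digit-mask"
    is_identity_parser = False

    def train(
        self,
        untemplated_text_iterator: Callable[[], Iterator[str]],
        *,
        asset_deps: list | None = None,
    ) -> None:
        # Stateless: the factory is accepted but never called.
        return None

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]:
        # Collect the masked values first, then build the template.
        parameters = _INTEGER.findall(unstructured_text)
        template = _INTEGER.sub("<*>", unstructured_text)
        return template, parameters


lines = ["node 7 failed", "node 12 failed"]
parser = DigitMaskingTemplateParser()
parser.train(lambda: iter(lines))
result = parser.inference("node 7 failed")
print(result)
```

Passing a factory rather than an iterator lets a trainer walk the stream more than once, because each call produces a fresh iterator.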

OpenStackDeepLogParser dataclass

Bases: StructuredParser

Parse labelled OpenStack rows used by the DeepLog reproduction preset.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

parse_line(raw_line)

Parse one labelled OpenStack row into the shared structured schema.

Parameters:

Name Type Description Default
raw_line str

Raw labelled OpenStack row from the preprocessed stream.

required

Returns:

Type Description
BaseStructuredLine | None

BaseStructuredLine | None: Structured row, or None when the labelled OpenStack row is malformed.

ParquetStructuredSink dataclass

Bases: StructuredSink

StructuredSink backed by partitioned Parquet datasets.

Provides efficient iteration, windowing helpers, and label-aware counts for downstream anomaly workflows.

Attributes:

Name Type Description
dataset_name str

Dataset identifier used to scope parquet cache paths.

raw_dataset_path Path

Materialised raw log file parsed into parquet.

parser StructuredParser

Parser used to convert raw lines into structured records.

cache_paths CachePathsConfig

Data/cache roots used for parquet output.

cache_dir ClassVar[str]

Dataset-local cache directory name for structured parquet artifacts.

count_entities_by_label(label_for_group)

Return counts of normal and total distinct entity ids.

Parameters:

Name Type Description Default
label_for_group Callable[[str], int | None]

Lookup that maps each entity id to its anomaly label.

required

Returns:

Name Type Description
EntityLabelCounts EntityLabelCounts

Normal and total distinct entity counts.
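
A rough sketch of this kind of count, assuming that an entity is normal only when its label is exactly 0 and that unlabelled entities (None) count towards the total only. The `Counts` tuple and `count_entities` helper are hypothetical stand-ins, not the sink's real types.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from typing import NamedTuple


class Counts(NamedTuple):
    """Hypothetical stand-in for EntityLabelCounts."""

    normal: int
    total: int


def count_entities(
    entity_ids: Iterable[str],
    label_for_group: Callable[[str], int | None],
) -> Counts:
    # Deduplicate first so repeated rows for one entity count once.
    distinct = set(entity_ids)
    # Assumption: only an explicit 0 label marks an entity as normal.
    normal = sum(1 for entity_id in distinct if label_for_group(entity_id) == 0)
    return Counts(normal=normal, total=len(distinct))


labels = {"blk_1": 0, "blk_2": 1}
counts = count_entities(["blk_1", "blk_2", "blk_1", "blk_3"], labels.get)
print(counts)
```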

count_rows()

Return total number of structured rows.

Returns:

Name Type Description
int int

Total number of structured rows.

entity_chronology_index_path()

Return the sidecar path storing entity chronology metadata.

Returns:

Name Type Description
Path Path

JSONL sidecar path used for entity chronology ordering.

entity_count_path()

Return the sidecar path storing the total distinct entity count.

Returns:

Name Type Description
Path Path

JSON sidecar path for the total distinct entity count.

inline_label_cache_path()

Return the sidecar path storing sparse inline labels.

Returns:

Name Type Description
Path Path

JSONL sidecar path used for sparse inline anomaly labels.

iter_entity_sequences()

Yield sequences grouped by entity in deterministic bucket order.

Returns:

Type Description
Callable[[], Iterator[Collection[StructuredLine]]]

Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing entity-grouped windows ordered by bucket and then by the materialised chronology index within each bucket.

iter_entity_sequences_from_line_order(min_line_order)

Yield entity sequences whose rows occur at or after a raw cutoff.

Parameters:

Name Type Description Default
min_line_order int

Inclusive raw-entry cutoff used to filter out the train prefix before entity grouping.

required

Returns:

Type Description
Callable[[], Iterator[Collection[StructuredLine]]]

Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing entity-grouped rows for the suffix at or after the requested cutoff.

iter_fixed_window_sequences(window_size, step_size=None)

Yield sequences of fixed window size over ordered rows.

Parameters:

Name Type Description Default
window_size int

Number of rows in each emitted window.

required
step_size int | None

Optional step between successive windows. Defaults to window_size.

None

Returns:

Type Description
Callable[[], Iterator[Collection[StructuredLine]]]

Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing fixed-size row windows.
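
The interaction between window_size and step_size can be sketched over a plain list. This is an illustration under assumptions: the helper `fixed_windows` is hypothetical, and it drops a trailing partial window, which may or may not match the sink's behaviour.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence


def fixed_windows(
    rows: Sequence[str],
    window_size: int,
    step_size: int | None = None,
) -> Iterator[Sequence[str]]:
    """Yield full windows of window_size rows, advancing by step_size."""
    # Mirror the documented default: the step falls back to the window size.
    step = window_size if step_size is None else step_size
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step_size must be positive")
    # Assumption: a trailing partial window is dropped.
    for start in range(0, len(rows) - window_size + 1, step):
        yield rows[start : start + window_size]


tumbling = list(fixed_windows(list("abcdef"), 3))
sliding = list(fixed_windows(list("abcde"), 3, step_size=2))
print(tumbling)
print(sliding)
```

With the default step the windows tile the rows without overlap; a smaller step makes consecutive windows share rows.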

iter_structured_lines(columns=None, *, filter_expr=None, batch_size=None)

Iterate over StructuredLine objects with optional column projection.

Parameters:

Name Type Description Default
columns Sequence[str] | None

Optional projected column names to load from parquet.

None
filter_expr Expression | None

Optional PyArrow dataset filter expression.

None
batch_size int | None

Optional scanner batch size override.

None

Returns:

Type Description
Callable[[], Iterator[StructuredLine]]

Callable[[], Iterator[StructuredLine]]: Callable producing projected structured rows from the parquet dataset.

iter_structured_lines_in_source_order(filter_expr=None)

Iterate over structured rows in raw-entry order.

The parquet dataset is partitioned by entity buckets, so this merges the bucket-local scans by line_order to recover the original raw-entry chronology without materialising the entire dataset in memory.

Parameters:

Name Type Description Default
filter_expr Expression | None

Optional dataset filter applied before bucket-local source-order merging. This is used by the split-aware suffix replay paths to avoid rescanning the train prefix when only the test suffix is needed.

None

Returns:

Type Description
Callable[[], Iterator[StructuredLine]]

Callable[[], Iterator[StructuredLine]]: Zero-argument callable that yields structured rows ordered by line_order.
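
The merge described above is a k-way merge of sorted streams. A minimal sketch with the standard library, assuming each bucket already yields rows in ascending line_order; the `Row` dataclass and `merge_in_source_order` helper are hypothetical and stand in for the real StructuredLine scans.

```python
import heapq
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a structured row."""

    line_order: int
    entity_id: str


def merge_in_source_order(buckets: Iterable[Iterable[Row]]) -> Iterator[Row]:
    # heapq.merge is lazy: it holds one pending row per bucket, so the full
    # dataset is never held in memory at once.
    return heapq.merge(*buckets, key=lambda row: row.line_order)


bucket_a = [Row(0, "a"), Row(3, "a")]
bucket_b = [Row(1, "b"), Row(2, "b")]
merged = [row.line_order for row in merge_in_source_order([bucket_a, bucket_b])]
print(merged)
```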

iter_time_window_sequences(time_span_ms, step_span_ms=None)

Yield sequences grouped by sliding time windows.

Parameters:

Name Type Description Default
time_span_ms int

Width of each window in milliseconds.

required
step_span_ms int | None

Optional step between successive windows. Defaults to time_span_ms.

None

Returns:

Type Description
Callable[[], Iterator[Collection[StructuredLine]]]

Callable[[], Iterator[Collection[StructuredLine]]]: Callable producing time-window grouped rows.
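
One way to picture sliding time windows is over bare timestamps. This sketch makes several assumptions that the reference does not confirm: windows are half-open intervals anchored at the earliest timestamp, and empty windows are skipped. The helper `time_windows` is hypothetical.

```python
from __future__ import annotations

from bisect import bisect_left
from collections.abc import Iterator


def time_windows(
    timestamps_ms: list[int],
    time_span_ms: int,
    step_span_ms: int | None = None,
) -> Iterator[list[int]]:
    """Yield timestamps falling in [start, start + time_span_ms) per window."""
    # Mirror the documented default: the step falls back to the span.
    step = time_span_ms if step_span_ms is None else step_span_ms
    if time_span_ms <= 0 or step <= 0:
        raise ValueError("time_span_ms and step_span_ms must be positive")
    if not timestamps_ms:
        return
    ordered = sorted(timestamps_ms)
    start, last = ordered[0], ordered[-1]
    while start <= last:
        lo = bisect_left(ordered, start)
        hi = bisect_left(ordered, start + time_span_ms)
        if lo < hi:  # assumption: empty windows are not emitted
            yield ordered[lo:hi]
        start += step


tumbling = list(time_windows([0, 400, 1000, 1500, 2600], 1000))
overlapping = list(time_windows([0, 400, 1000], 1000, step_span_ms=500))
print(tumbling)
print(overlapping)
```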

load_entity_chronology_index()

Load the materialised chronology sidecar, if it exists.

Returns:

Type Description
dict[str, EntityChronologyKey]

dict[str, EntityChronologyKey]: Chronology metadata keyed by entity id.

load_entity_count()

Load the total distinct entity count sidecar, if present.

Returns:

Type Description
int | None

int | None: Total entity count when the sidecar exists, otherwise None.

load_inline_label_cache()

Load sparse inline labels directly from parquet batches.

Returns:

Type Description
tuple[dict[int, int], dict[str, int]]

tuple[dict[int, int], dict[str, int]]: Sparse per-line and per-group anomaly labels.

structured_cache_path()

Return the structured parquet cache path for this sink.

Returns:

Name Type Description
Path Path

Directory containing the materialised structured dataset.

structured_data_cache(dataset_name)

Return the cache directory for this dataset.

Parameters:

Name Type Description Default
dataset_name str

Dataset name whose parquet cache should be used.

required

Returns:

Name Type Description
Path Path

Structured-parquet cache directory for the dataset.

timestamp_bounds()

Return min and max timestamps present in the dataset.

Returns:

Type Description
tuple[int | None, int | None]

tuple[int | None, int | None]: Minimum and maximum timestamps, if any.

write_structured_lines(_workers=None, *, refresh_cache=False)

Parse raw logs and persist structured lines to Parquet.

Parameters:

Name Type Description Default
_workers int | None

Reserved worker-count override. Currently unused by this sink implementation.

None
refresh_cache bool

Whether to force Prefect to ignore any cached materialisation result and rebuild the parquet cache.

False

Returns:

Name Type Description
bool bool

Whether any anomalous rows were observed during parsing.

SpellTemplateParser dataclass

Bases: TemplateParser

Spell-based template parser for DeepLog-style key extraction.

This parser trains Spell on the provided text stream, then performs inference by matching lines against the mined templates. Training now delegates to the upstream spellpy parser directly, which keeps the implementation small while still avoiding the old raw CSV bottleneck.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

is_identity_parser ClassVar[bool]

Always False; Spell mines a canonical template representation from the raw text.

dataset_name str | None

Optional dataset name used for cache paths.

tau float

Spell similarity threshold passed to Spell training.

max_lcs_comparisons_per_line int | None

Maximum number of LCS comparisons spellpy may perform for one line before it falls back to creating or reusing a less-specific template.
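
To make the role of tau concrete, the sketch below applies the textbook Spell matching rule: a line matches a template when their longest common subsequence covers at least tau times the line's token count. This is an illustration of the general technique, not the spellpy implementation; both helper functions and the default of 0.5 are assumptions.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def matches_template(line: str, template: str, tau: float = 0.5) -> bool:
    """Hypothetical check: LCS length must reach tau * number of line tokens."""
    line_tokens = line.split()
    # Wildcard slots carry no constant text, so they are ignored.
    template_tokens = [tok for tok in template.split() if tok != "<*>"]
    return lcs_length(line_tokens, template_tokens) >= tau * len(line_tokens)


print(matches_template("node 7 failed", "node <*> failed"))
print(matches_template("disk full", "node <*> failed"))
```

Each LCS computation costs time proportional to the product of the two token counts, which is why a cap such as max_lcs_comparisons_per_line is useful when the template set grows large.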

inference(unstructured_text)

Infer template and extracted parameters for one log line.

Parameters:

Name Type Description Default
unstructured_text UntemplatedText

Raw line to match.

required

Returns:

Type Description
tuple[LogTemplate, ExtractedParameters]

tuple[LogTemplate, ExtractedParameters]: Matched template and captured parameters, or a self-template fallback when unmatched.

Raises:

Type Description
ValueError

If the parser has not been trained yet.

train(untemplated_text_iterator, *, asset_deps=None)

Train Spell templates from the text stream.

Parameters:

Name Type Description Default
untemplated_text_iterator Callable[[], Iterator[UntemplatedText]]

Zero-argument iterator factory over untemplated message text.

required
asset_deps list[Asset] | None

Ignored upstream asset dependencies accepted for interface compatibility.

None

Raises:

Type Description
ModuleNotFoundError

If optional spellpy is not installed.

ThunderbirdParser dataclass

Bases: StructuredParser

Parse Thunderbird supercomputer log lines into structured fields.

Loghub's Thunderbird corpus uses a labelled raw-line format where the first token marks alert status (- for normal, any other tag for an alert) and the remaining header fields expose the event chronology plus the host and location tokens. The parser keeps the free-text tail as the message body for template mining, stripping an optional component[pid]: prefix when the raw line includes one. It also trims a trailing colon from bare message tails such as mysql_install_db: so the template miner sees the underlying command name rather than the punctuation artefact. The parser collapses the label into AnomaLog's binary anomaly flag.

The parser deliberately stays close to the observed raw structure so the downstream template miner sees the message body rather than a Thunderbird-specific normalisation of the header fields.
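
The label collapse and message-tail clean-up described above can be sketched as two small helpers. Both are hypothetical: the regular expression for the component[pid]: prefix and the rule that treats a single token as a "bare" tail are assumptions, not the parser's actual logic.

```python
import re

# Assumed shape of the optional prefix, e.g. "sshd[1234]: ".
_COMPONENT_PID = re.compile(r"^[\w./-]+\[\d+\]:\s*")


def collapse_label(tag: str) -> int:
    """Map the first token to a binary flag: '-' is normal, anything else is an alert."""
    return 0 if tag == "-" else 1


def clean_message(tail: str) -> str:
    """Strip a leading component[pid]: prefix and a trailing colon on bare tails."""
    stripped = _COMPONENT_PID.sub("", tail.strip(), count=1)
    # Assumption: a "bare" tail is a single token with no whitespace.
    if " " not in stripped and stripped.endswith(":"):
        stripped = stripped[:-1]
    return stripped


print(collapse_label("-"), collapse_label("VAPI"))
print(clean_message("sshd[1234]: session opened for user root"))
print(clean_message("mysql_install_db:"))
```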

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in parser.

analyse_line(raw_line) classmethod

Parse one Thunderbird line and report the reason when skipped.

Parameters:

Name Type Description Default
raw_line str

Raw Thunderbird log line to inspect.

required

Returns:

Type Description
tuple[BaseStructuredLine | None, str | None]

tuple[BaseStructuredLine | None, str | None]: Parsed structured row and an optional skip reason.

parse_line(raw_line)

Parse a single Thunderbird line; return None for skipped rows.

Parameters:

Name Type Description Default
raw_line str

Raw Thunderbird log line to parse.

required

Returns:

Type Description
BaseStructuredLine | None

BaseStructuredLine | None: Parsed structured record, or None when the line is blank, malformed, or has no message body.

raw_label_for_line(raw_line) classmethod

Return the raw anomaly label token for one Thunderbird line.

Parameters:

Name Type Description Default
raw_line str

Raw Thunderbird log line to inspect.

required

Returns:

Type Description
int | None

int | None: 0 for normal rows, 1 for anomalous rows, or None when the line does not match the Thunderbird envelope.

Notes

The helper mirrors the raw-line label token even when the parser later skips the row because the message body is empty. That keeps raw-position window labels aligned with the original line stream.