Parsers
Parsers are responsible for two distinct stages:
- turning raw log lines into structured records
- turning the untemplated message text of those records into templates
This page is the reference for both the built-in parser implementations and the protocols they satisfy.
>>> from anomalog.parsers import BGLParser, Drain3Parser, IdentityTemplateParser
>>> BGLParser.name
'bgl'
>>> Drain3Parser.name
'drain3'
>>> IdentityTemplateParser("demo").inference("node 7 failed")
('node 7 failed', [])
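The two stages can be pictured as two small protocols chained together. The sketch below is illustrative only: `Record`, `StructuredParserLike`, `TemplateParserLike`, the two toy parsers, and `run_pipeline` are hypothetical names invented for this example, not the package's actual definitions. Only the method names `parse_line`, `train`, and `inference` are taken from this page.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a structured record."""

    entity_id: str
    untemplated_message_text: str
    anomalous: int | None


@runtime_checkable
class StructuredParserLike(Protocol):
    """Stage 1: raw log line -> structured record, or None to skip the line."""

    def parse_line(self, raw_line: str) -> Record | None: ...


@runtime_checkable
class TemplateParserLike(Protocol):
    """Stage 2: message text -> (template, extracted parameters)."""

    def train(self, untemplated_text_iterator: Callable[[], Iterator[str]]) -> None: ...

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]: ...


class ToyStructuredParser:
    """Toy stage-1 parser for '<entity> <message>' lines (assumed format)."""

    def parse_line(self, raw_line: str) -> Record | None:
        entity, _, message = raw_line.strip().partition(" ")
        if not message:
            return None  # lines without a message body are skipped
        return Record(entity, message, None)


class ToyTemplateParser:
    """Toy stage-2 parser that treats purely numeric tokens as parameters."""

    def train(self, untemplated_text_iterator: Callable[[], Iterator[str]]) -> None:
        pass  # nothing to learn in this toy version

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]:
        tokens = unstructured_text.split()
        params = [t for t in tokens if t.isdigit()]
        template = " ".join("<*>" if t.isdigit() else t for t in tokens)
        return template, params


def run_pipeline(lines, structured, templates):
    """Run stage 1, train stage 2 on the message text, then template each record."""
    records = [r for r in map(structured.parse_line, lines) if r is not None]
    templates.train(lambda: (r.untemplated_message_text for r in records))
    return [templates.inference(r.untemplated_message_text) for r in records]


results = run_pipeline(
    ["node7 retry 3 failed", "garbage"],
    ToyStructuredParser(),
    ToyTemplateParser(),
)
```

Here the second input line has no message body, so stage 1 drops it and stage 2 only ever sees `"retry 3 failed"`.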
anomalog.parsers
Public parser package.
AITADSParser
dataclass
Bases: StructuredParser
Parse the canonical JSONL alert stream derived from AIT-ADS.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
parse_line(raw_line)
Parse one canonical AIT-ADS alert row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Canonical JSONL row emitted by the AIT-ADS source. | required |
Returns:
| Type | Description |
|---|---|
| BaseStructuredLine \| None | Parsed canonical alert, or None. |
BGLParser
dataclass
Bases: StructuredParser
Parse Blue Gene/L log lines into structured fields with anomaly flag.
The BGL corpus encodes anomaly state in the optional leading dash, so this
parser preserves that dataset convention directly in the shared anomalous
field while keeping the original message tail for template mining.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
parse_line(raw_line)
Parse a single BGL line; return None for unparseable lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw BGL log line to parse. | required |
Examples:
>>> sample = (
... "- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
... "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
... "RAS KERNEL INFO cache parity corrected"
... )
>>> parsed = BGLParser().parse_line(sample)
>>> (parsed.entity_id, parsed.anomalous) # dash prefix => normal
('R02-M1-N0-C:J12-U11', 0)
Returns:
| Type | Description |
|---|---|
| BaseStructuredLine \| None | Parsed structured record, or None for unparseable lines. |
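The leading-dash convention can be sketched with a single regular expression. This is a rough approximation and not the parser's actual implementation: the function name `sketch_parse_bgl`, the assumed header layout (label, epoch, date, node, timestamp, node, message tail), and the returned dictionary keys are all assumptions made for illustration.

```python
import re

# Assumed BGL header layout: label, epoch seconds, date, node, timestamp,
# node again, then the free-text message tail.
_BGL_LINE = re.compile(
    r"^(?P<label>\S+)\s+(?P<epoch>\d+)\s+\S+\s+(?P<node>\S+)"
    r"\s+\S+\s+\S+\s+(?P<message>.+)$"
)


def sketch_parse_bgl(raw_line):
    """Hypothetical helper: return a dict of fields, or None if unparseable."""
    match = _BGL_LINE.match(raw_line.strip())
    if match is None:
        return None
    return {
        "entity_id": match["node"],
        # A bare dash marks a normal line; any other label counts as an alert.
        "anomalous": 0 if match["label"] == "-" else 1,
        "message": match["message"],
    }


normal = sketch_parse_bgl(
    "- 1117838570 2005.06.03 R02-M1-N0-C:J12-U11 "
    "2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 "
    "RAS KERNEL INFO cache parity corrected"
)
```

Whether the real parser keeps `RAS KERNEL INFO` as part of the message tail is not stated on this page; the sketch keeps it only for simplicity.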
Drain3Parser
Bases: TemplateParser
Drain3-based template miner with Prefect asset caching.
Instances accept an optional dataset name plus explicit config and cache paths so trained state can be persisted per dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry name for the built-in Drain3 parser. |
| is_identity_parser | ClassVar[bool] | Always False. |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str \| None | Optional dataset name used to scope persisted Drain3 state. | None |
| config_file | Path \| None | Optional Drain3 config file override. | None |
| cache_path | Path \| None | Optional explicit cache directory override. | None |
cache_file_path
property
Return the resolved cache file path for this parser instance.
Raises:
| Type | Description |
|---|---|
| ValueError | If the parser has not been bound to a dataset yet. |
resolved_cache_path
property
Return the on-disk cache directory for this parser instance.
Raises:
| Type | Description |
|---|---|
| ValueError | If the parser has not been bound to a dataset yet. |
inference(unstructured_text)
Return template and parameters for a single unstructured log line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| unstructured_text | UntemplatedText | Raw untemplated log line to match against the trained miner. | required |
Returns:
| Type | Description |
|---|---|
| tuple[LogTemplate, ExtractedParameters] | Matched template and extracted parameter values. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the parser has not been trained yet. |
train(untemplated_text_iterator, *, asset_deps=None)
Train Drain3 on the dataset's untemplated message stream.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| untemplated_text_iterator | Callable[[], Iterator[UntemplatedText]] | Zero-argument iterator factory over untemplated message text. | required |
| asset_deps | list[Asset] \| None | Optional upstream asset dependencies to include in the training cache key. | None |
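One plausible reason for passing a zero-argument factory rather than a bare iterator is that a factory can be re-invoked to restart the stream, whereas an iterator is exhausted after one pass. The snippet below only illustrates that general Python behaviour; `count_two_passes` is a hypothetical helper and says nothing about how many passes the real `train` makes.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator


def count_two_passes(
    untemplated_text_iterator: Callable[[], Iterator[str]],
) -> tuple[int, int]:
    """Hypothetical helper: consume the stream twice and count lines per pass."""
    first_pass = sum(1 for _ in untemplated_text_iterator())
    second_pass = sum(1 for _ in untemplated_text_iterator())
    return first_pass, second_pass


messages = ["node 7 failed", "node 8 failed", "link up"]

# A real factory builds a fresh iterator on every call.
restartable = count_two_passes(lambda: iter(messages))

# A "factory" that hands back the same iterator is empty on the second pass.
shared = iter(messages)
exhausted = count_two_passes(lambda: shared)
```

The first call sees all three messages on both passes; the second sees them only once.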
HDFSV1Parser
dataclass
Bases: StructuredParser
Parse HDFS v1 log lines into structured fields.
HDFS anomaly datasets are block-centric, so this parser prefers the block id
mentioned in the log message as the entity_id; when no block is present it
falls back to the logging component so entity-based grouping still works.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
parse_line(raw_line)
Parse a single HDFS v1 line; return None for unparseable lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw HDFS log line to parse. | required |
Examples:
>>> line = (
... "081109 203518 143 INFO dfs.DataNode$DataXceiver: "
... "Receiving block blk_-160 src: /10.0.0.1:54106 "
... "dest: /10.0.0.2:50010"
... )
>>> parsed = HDFSV1Parser().parse_line(line)
>>> parsed.entity_id, parsed.anomalous, parsed.untemplated_message_text[:13]
('blk_-160', None, 'INFO dfs.Data')
Returns:
| Type | Description |
|---|---|
| BaseStructuredLine \| None | Parsed structured record, or None for unparseable lines. |
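The block-id-first, component-fallback rule for `entity_id` might look roughly like the following. The helper name `sketch_entity_id` and the exact block-id pattern are assumptions made for this illustration; the real parser may recognise block ids differently.

```python
import re

# Assumed block-id shape: "blk_" followed by an optionally negative integer.
_BLOCK_ID = re.compile(r"blk_-?\d+")


def sketch_entity_id(component: str, message: str) -> str:
    """Hypothetical helper: prefer the first block id, else the component."""
    match = _BLOCK_ID.search(message)
    return match.group(0) if match else component


with_block = sketch_entity_id(
    "dfs.DataNode$DataXceiver",
    "Receiving block blk_-160 src: /10.0.0.1:54106 dest: /10.0.0.2:50010",
)
without_block = sketch_entity_id("dfs.FSNamesystem", "Number of transactions: 0")
```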
IdentityTemplateParser
dataclass
Bases: TemplateParser
No-op template parser that returns the input string as its template.
This parser is useful when experiments should operate on exact message text rather than mined abstractions, or when tests need deterministic, side-effect-free template inference.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the identity parser. |
| is_identity_parser | ClassVar[bool] | Always True. |
| dataset_name | str \| None | Optional dataset identifier kept only for parity with the shared template-parser contract. |
inference(unstructured_text)
Return the raw text as the template with no parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| unstructured_text | UntemplatedText | Raw log text to treat as its own template. | required |
Returns:
| Type | Description |
|---|---|
| tuple[LogTemplate, ExtractedParameters] | Raw text and an empty parameter list. |
train(untemplated_text_iterator, *, asset_deps=None)
Ignore the training stream because identity inference is stateless.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| untemplated_text_iterator | Callable[[], Iterator[UntemplatedText]] | Iterator factory accepted for contract compatibility. | required |
| asset_deps | list[Asset] \| None | Ignored upstream asset dependencies accepted for interface compatibility. | None |
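A stateless identity parser can be sketched in a few lines. `SketchIdentityParser` below is a hypothetical re-implementation for illustration, not the class shipped in `anomalog.parsers`. It shows that `train` can accept the factory without ever calling it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchIdentityParser:
    """Hypothetical stand-in for an identity template parser."""

    dataset_name: str | None = None

    def train(self, untemplated_text_iterator, *, asset_deps=None) -> None:
        # Stateless: the factory is accepted but never invoked.
        return None

    def inference(self, unstructured_text: str) -> tuple[str, list[str]]:
        # The raw text is its own template; there are no parameters.
        return unstructured_text, []


def exploding_factory():
    """Fails loudly if anything tries to read the training stream."""
    raise AssertionError("the identity parser should never read the stream")


parser = SketchIdentityParser("demo")
parser.train(exploding_factory)
result = parser.inference("node 7 failed")
```

Because `train` never calls the factory, the example runs without triggering the assertion inside `exploding_factory`.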
OpenStackDeepLogParser
dataclass
Bases: StructuredParser
Parse labelled OpenStack rows used by the DeepLog reproduction preset.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
parse_line(raw_line)
Parse one labelled OpenStack row into the shared structured schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw labelled OpenStack row from the preprocessed stream. | required |
Returns:
| Type | Description |
|---|---|
| BaseStructuredLine \| None | Structured row, or None. |
ParquetStructuredSink
dataclass
Bases: StructuredSink
StructuredSink backed by partitioned Parquet datasets.
Provides efficient iteration, windowing helpers, and label-aware counts for downstream anomaly workflows.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_name | str | Dataset identifier used to scope parquet cache paths. |
| raw_dataset_path | Path | Materialised raw log file parsed into parquet. |
| parser | StructuredParser | Parser used to convert raw lines into structured records. |
| cache_paths | CachePathsConfig | Data/cache roots used for parquet output. |
| cache_dir | ClassVar[str] | Dataset-local cache directory name for structured parquet artifacts. |
count_entities_by_label(label_for_group)
Return counts of normal and total distinct entity ids.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| label_for_group | Callable[[str], int \| None] | Lookup that maps each entity id to its anomaly label. | required |
Returns:
| Type | Description |
|---|---|
| EntityLabelCounts | Normal and total distinct entity counts. |
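As a rough mental model, the count can be computed from a stream of entity ids and a label lookup. The names `SketchEntityLabelCounts` and `sketch_count_entities_by_label` are hypothetical, and so is the rule that only a label of exactly `0` counts as normal. This page does not say how a `None` label is treated, so the sketch simply does not count it as normal.

```python
from typing import NamedTuple


class SketchEntityLabelCounts(NamedTuple):
    """Hypothetical result shape: normal and total distinct entities."""

    normal_entities: int
    total_entities: int


def sketch_count_entities_by_label(entity_ids, label_for_group):
    """Count distinct entities and how many of them are labelled normal."""
    distinct = set(entity_ids)
    # Assumption: label 0 means normal; 1 or None is not counted as normal.
    normal = sum(1 for entity in distinct if label_for_group(entity) == 0)
    return SketchEntityLabelCounts(normal, len(distinct))


labels = {"blk_1": 0, "blk_2": 1}
counts = sketch_count_entities_by_label(
    ["blk_1", "blk_2", "blk_1", "blk_3"],
    labels.get,  # returns None for the unlabelled "blk_3"
)
```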
count_rows()
Return total number of structured rows.
Returns:
| Type | Description |
|---|---|
| int | Total number of structured rows. |
entity_chronology_index_path()
Return the sidecar path storing entity chronology metadata.
Returns:
| Type | Description |
|---|---|
| Path | JSONL sidecar path used for entity chronology ordering. |
entity_count_path()
Return the sidecar path storing the total distinct entity count.
Returns:
| Type | Description |
|---|---|
| Path | JSON sidecar path for the total distinct entity count. |
inline_label_cache_path()
Return the sidecar path storing sparse inline labels.
Returns:
| Type | Description |
|---|---|
| Path | JSONL sidecar path used for sparse inline anomaly labels. |
iter_entity_sequences()
Yield sequences grouped by entity in deterministic bucket order.
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[Collection[StructuredLine]]] | Callable producing entity-grouped windows ordered by bucket and then by the materialised chronology index within each bucket. |
iter_entity_sequences_from_line_order(min_line_order)
Yield entity sequences whose rows occur at or after a raw cutoff.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_line_order | int | Inclusive raw-entry cutoff used to filter out the train prefix before entity grouping. | required |
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[Collection[StructuredLine]]] | Callable producing entity-grouped rows for the suffix at or after the requested cutoff. |
iter_fixed_window_sequences(window_size, step_size=None)
Yield sequences of fixed window size over ordered rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| window_size | int | Number of rows in each emitted window. | required |
| step_size | int \| None | Optional step between successive windows. | None |
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[Collection[StructuredLine]]] | Callable producing fixed-size row windows. |
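Fixed-size windowing over ordered rows can be sketched as below. `sketch_fixed_windows` is a hypothetical helper working on an in-memory list. Two behaviours are assumptions of this sketch rather than documented facts: a missing `step_size` falls back to `window_size` (non-overlapping windows), and a trailing window shorter than `window_size` is dropped.

```python
def sketch_fixed_windows(rows, window_size, step_size=None):
    """Return a zero-argument factory yielding fixed-size windows of rows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Assumption: without an explicit step, windows do not overlap.
    step = window_size if step_size is None else step_size
    if step <= 0:
        raise ValueError("step_size must be positive")

    def factory():
        # Assumption: only full windows are emitted.
        for start in range(0, len(rows) - window_size + 1, step):
            yield rows[start:start + window_size]

    return factory


rows = list(range(5))
tumbling = list(sketch_fixed_windows(rows, 2)())
sliding = list(sketch_fixed_windows(rows[:3], 2, step_size=1)())
```

Returning a factory instead of a generator mirrors the `Callable[[], Iterator[...]]` return type, so callers can replay the windows more than once.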
iter_structured_lines(columns=None, *, filter_expr=None, batch_size=None)
Iterate over StructuredLine objects with optional column projection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | Sequence[str] \| None | Optional projected column names to load from parquet. | None |
| filter_expr | Expression \| None | Optional PyArrow dataset filter expression. | None |
| batch_size | int \| None | Optional scanner batch size override. | None |
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[StructuredLine]] | Callable producing projected structured rows from the parquet dataset. |
iter_structured_lines_in_source_order(filter_expr=None)
Iterate over structured rows in raw-entry order.
The parquet dataset is partitioned by entity buckets, so this merges the
bucket-local scans by line_order to recover the original raw-entry
chronology without materialising the entire dataset in memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filter_expr | Expression \| None | Optional dataset filter applied before bucket-local source-order merging. This is used by the split-aware suffix replay paths to avoid rescanning the train prefix when only the test suffix is needed. | None |
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[StructuredLine]] | Zero-argument callable that yields structured rows ordered by line_order. |
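Merging bucket-local scans by `line_order` is a k-way merge of already-sorted streams, which the standard library's `heapq.merge` performs lazily. The sketch below uses hypothetical names (`Row`, `sketch_source_order`) and in-memory lists in place of parquet scans, and it assumes each bucket is already sorted by `line_order`.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical minimal row carrying only what the merge needs."""

    line_order: int
    entity_id: str


def sketch_source_order(buckets):
    """Return a factory that lazily merges sorted buckets by line_order."""

    def factory():
        # heapq.merge holds one pending row per bucket, not the whole dataset.
        return heapq.merge(
            *(iter(bucket) for bucket in buckets),
            key=lambda row: row.line_order,
        )

    return factory


buckets = [
    [Row(0, "a"), Row(3, "a")],  # bucket for entity "a"
    [Row(1, "b"), Row(2, "b")],  # bucket for entity "b"
]
merged = [row.line_order for row in sketch_source_order(buckets)()]
```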
iter_time_window_sequences(time_span_ms, step_span_ms=None)
Yield sequences grouped by sliding time windows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| time_span_ms | int | Width of each window in milliseconds. | required |
| step_span_ms | int \| None | Optional step between successive windows. | None |
Returns:
| Type | Description |
|---|---|
| Callable[[], Iterator[Collection[StructuredLine]]] | Callable producing time-window grouped rows. |
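Sliding time windows can be sketched over `(timestamp_ms, payload)` pairs. `sketch_time_windows` is a hypothetical helper, and several details are assumptions: windows are half-open intervals anchored at the first timestamp, a missing `step_span_ms` falls back to `time_span_ms`, and empty windows are skipped. The quadratic scan is chosen for clarity, not efficiency.

```python
def sketch_time_windows(rows, time_span_ms, step_span_ms=None):
    """Return a factory yielding rows grouped by sliding time windows.

    `rows` is a list of (timestamp_ms, payload) pairs sorted by timestamp.
    """
    # Assumption: without an explicit step, windows do not overlap.
    step = time_span_ms if step_span_ms is None else step_span_ms
    if time_span_ms <= 0 or step <= 0:
        raise ValueError("window span and step must be positive")

    def factory():
        if not rows:
            return
        start, last = rows[0][0], rows[-1][0]
        while start <= last:
            # Half-open window [start, start + time_span_ms).
            window = [r for r in rows if start <= r[0] < start + time_span_ms]
            if window:  # assumption: empty windows are not emitted
                yield window
            start += step

    return factory


events = [(0, "a"), (500, "b"), (1500, "c"), (3200, "d")]
tumbling = list(sketch_time_windows(events, 1000)())
overlapping = list(sketch_time_windows(events, 1000, step_span_ms=500)())
```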
load_entity_chronology_index()
load_entity_count()
Load the total distinct entity count sidecar, if present.
Returns:
| Type | Description |
|---|---|
| int \| None | Total entity count when the sidecar exists, otherwise None. |
load_inline_label_cache()
structured_cache_path()
Return the structured parquet cache path for this sink.
Returns:
| Type | Description |
|---|---|
| Path | Directory containing the materialised structured dataset. |
structured_data_cache(dataset_name)
timestamp_bounds()
write_structured_lines(_workers=None, *, refresh_cache=False)
Parse raw logs and persist structured lines to Parquet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| _workers | int \| None | Reserved worker-count override. Currently unused by this sink implementation. | None |
| refresh_cache | bool | Whether to force Prefect to ignore any cached materialisation result and rebuild the parquet cache. | False |
Returns:
| Type | Description |
|---|---|
| bool | Whether any anomalous rows were observed during parsing. |
SpellTemplateParser
dataclass
Bases: TemplateParser
Spell-based template parser for DeepLog-style key extraction.
This parser trains Spell on the provided text stream, then performs
inference by matching lines against the mined templates. Training
delegates to the upstream spellpy parser directly, which keeps the
implementation small while still avoiding the old raw CSV bottleneck.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
| is_identity_parser | ClassVar[bool] | Always False. |
| dataset_name | str \| None | Optional dataset name used for cache paths. |
| tau | float | Spell similarity threshold passed to Spell training. |
| max_lcs_comparisons_per_line | int \| None | Maximum number of LCS comparisons spellpy may perform for one line before it falls back to creating or reusing a less-specific template. |
inference(unstructured_text)
Infer template and extracted parameters for one log line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| unstructured_text | UntemplatedText | Raw line to match. | required |
Returns:
| Type | Description |
|---|---|
| tuple[LogTemplate, ExtractedParameters] | Matched template and captured parameters, or a self-template fallback when unmatched. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the parser has not been trained yet. |
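Matching a line against mined templates, with a self-template fallback, can be approximated as follows. This is a deliberately simplified stand-in and not spellpy's algorithm: it checks whether a template's constant tokens occur in order within the line (a subsequence test) rather than computing a full LCS, and the `tau` rule used here (constants must cover at least `tau` of the line's tokens) is an assumption. `sketch_spell_inference` and `WILDCARD` are hypothetical names.

```python
WILDCARD = "<*>"


def sketch_spell_inference(line, templates, tau=0.5):
    """Return (template, parameters), or (line, []) when nothing matches."""
    tokens = line.split()
    best = None  # (constant_count, template, parameters)
    for template in templates:
        constants = [t for t in template.split() if t != WILDCARD]
        # Assumed threshold: constants must cover at least tau of the line.
        if len(constants) < tau * len(tokens):
            continue
        params, matched = [], 0
        for token in tokens:
            if matched < len(constants) and token == constants[matched]:
                matched += 1  # token aligns with the next constant
            else:
                params.append(token)  # everything else is a parameter
        fully_matched = matched == len(constants)
        if fully_matched and (best is None or len(constants) > best[0]):
            best = (len(constants), template, params)
    if best is None:
        return line, []  # self-template fallback
    return best[1], best[2]


templates = ["Receiving block <*> src: <*>", "Deleting block <*>"]
hit = sketch_spell_inference("Receiving block blk_1 src: /10.0.0.1", templates)
miss = sketch_spell_inference("totally new message", templates)
```

When several templates match, the sketch prefers the one with the most constant tokens, on the reasoning that it is the most specific.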
train(untemplated_text_iterator, *, asset_deps=None)
Train Spell templates from the text stream.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| untemplated_text_iterator | Callable[[], Iterator[UntemplatedText]] | Zero-argument iterator factory over untemplated message text. | required |
| asset_deps | list[Asset] \| None | Ignored upstream asset dependencies accepted for interface compatibility. | None |
Raises:
| Type | Description |
|---|---|
| ModuleNotFoundError | If the optional spellpy dependency is not installed. |
ThunderbirdParser
dataclass
Bases: StructuredParser
Parse Thunderbird supercomputer log lines into structured fields.
Loghub's Thunderbird corpus uses a labelled raw-line format where the first
token marks alert status (- for normal, any other tag for an alert) and
the remaining header fields expose the event chronology plus the host and
location tokens. The parser keeps the free-text tail as the message body
for template mining, stripping an optional component[pid]: prefix
when the raw line includes one. It also trims a trailing colon from bare
message tails such as mysql_install_db: so the template miner sees the
underlying command name rather than the punctuation artefact. The parser
collapses the label into AnomaLog's binary anomaly flag.
The parser deliberately stays close to the observed raw structure so the downstream template miner sees the message body rather than a Thunderbird-specific normalisation of the header fields.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in parser. |
analyse_line(raw_line)
classmethod
Parse one Thunderbird line and report the reason when skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw Thunderbird log line to inspect. | required |
Returns:
| Type | Description |
|---|---|
| tuple[BaseStructuredLine \| None, str \| None] | Parsed structured row and an optional skip reason. |
parse_line(raw_line)
Parse a single Thunderbird line; return None for skipped rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw Thunderbird log line to parse. | required |
Returns:
| Type | Description |
|---|---|
| BaseStructuredLine \| None | Parsed structured record, or None for skipped rows. |
raw_label_for_line(raw_line)
classmethod
Return the raw anomaly label token for one Thunderbird line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_line | str | Raw Thunderbird log line to inspect. | required |
Returns:
| Type | Description |
|---|---|
| int \| None | |
Notes
The helper mirrors the raw-line label token even when the parser later skips the row because the message body is empty. That keeps raw-position window labels aligned with the original line stream.
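The label and message-tail rules described for Thunderbird can be sketched independently of the full header parse. Both helpers below are hypothetical, as is the mapping of the label token to `0`, `1`, or `None` (for a blank line). The prefix pattern assumes the `component[pid]:` form with a numeric pid, and the sample lines are made up for illustration.

```python
import re

# Assumed prefix shape: "component[pid]: " with a numeric pid.
_COMPONENT_PREFIX = re.compile(r"^\S+\[\d+\]:\s+")


def sketch_raw_label(raw_line):
    """Hypothetical helper: 0 for a '-' first token, 1 otherwise, None if blank."""
    tokens = raw_line.split(maxsplit=1)
    if not tokens:
        return None
    return 0 if tokens[0] == "-" else 1


def sketch_clean_tail(tail):
    """Strip an optional component[pid]: prefix, then a trailing colon."""
    body = _COMPONENT_PREFIX.sub("", tail.strip(), count=1)
    return body.rstrip(":")


normal_label = sketch_raw_label("- 1131566461 2005.11.09 dn228 ...")
alert_label = sketch_raw_label("VAPI 1131566461 2005.11.09 dn228 ...")
cleaned = sketch_clean_tail("sshd[1234]: session opened for user root")
bare = sketch_clean_tail("mysql_install_db:")
```

Because `sketch_raw_label` looks only at the first token, it still returns a label for a line whose message body is empty, which is the alignment property the note above describes.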