Getting Started¶
Install AnomaLog, start from a checked-in preset, and turn raw logs into reproducible model inputs.
Install¶
Mental model¶
AnomaLog treats preprocessing as part of the research artifact rather than as setup code around the model.
The public API keeps the preprocessing stages explicit:
from anomalog import DatasetSpec
dataset = (
DatasetSpec("...") # (1)!
.from_source(...) # (2)!
.parse_with(...) # (3)!
.label_with(...) # (4)!
.template_with(...) # (5)!
.build()
)
- Choose a stable dataset name.
- Decide where the raw logs come from.
- Parse the dataset-specific log format.
- Attach anomaly labels when they are not already present inline.
- Choose the template miner.
.build() returns a templated dataset. Sequence construction happens
after that step, and representation happens after sequence construction:
- build a templated dataset
- group events into
TemplateSequencewindows - choose a representation for the detector family you want to run
That separation is deliberate. Sequence construction decides which events belong together in one example. Representation then decides how that example is encoded for a model.
A few terms used throughout the docs:
- A template is a canonical message pattern shared by many concrete log lines.
- A sequence is a grouped window of log events that becomes one model input.
- A representation is the model-facing form of that sequence, such as an ordered list or a count vector.
For the full stage-by-stage explanation, see Pipeline Concepts.
Start from a preset¶
The easiest starting point is a preset dataset specification from
anomalog.presets.
AnomaLog includes ready-made presets for benchmark datasets including BGL and HDFS v1.
Each preset is an ordinary DatasetSpec, so its preprocessing choices remain visible, inspectable, and modifiable:
>>> from anomalog.presets import bgl
>>> from anomalog.parsers import IdentityTemplateParser
>>> bgl.source.url
'https://zenodo.org/records/8196385/files/BGL.zip'
>>> bgl.template_parser.name
'drain3'
# Ablation: disable template mining and use raw log lines directly
>>> ablated_dataset = bgl.template_with(IdentityTemplateParser).build()
That matters because presets are not opaque shortcuts. They are checked-in builder definitions that you can inspect, keep fixed for a baseline, and modify one stage at a time for ablations.
Build a dataset¶
Build a templated dataset directly from the preset:
This materialises the preset pipeline and returns a templated dataset.
If you want explicit control instead of a preset, define a DatasetSpec directly:
from pathlib import Path
from anomalog import DatasetSpec
from anomalog.labels import CSVReader
from anomalog.parsers import Drain3Parser, HDFSV1Parser
from anomalog.sources import LocalZipSource
dataset = (
DatasetSpec("my-hdfs")
.from_source(
LocalZipSource(
Path("HDFS_v1.zip"),
raw_logs_relpath=Path("HDFS.log"),
),
)
.parse_with(HDFSV1Parser())
.label_with(
CSVReader(
relative_path=Path("preprocessed/anomaly_label.csv"),
entity_column="BlockId",
label_column="Label",
),
)
.template_with(Drain3Parser)
.build()
)
The same fluent builder is used in both cases. Presets simply provide a
checked-in starting DatasetSpec.
Group into sequences¶
Once .build() returns a templated dataset, choose how downstream models
should see the log stream.
For benchmarks such as BGL and HDFS, entity grouping is often the right starting point:
from anomalog import SplitLabel
from anomalog.presets import bgl
dataset = bgl.build()
sequences = dataset.group_by_entity().with_train_fraction(0.8)
for sequence in sequences:
if sequence.split_label is SplitLabel.TRAIN:
print(sequence.window_id, sequence.label, sequence.templates[:3])
0 0 [
"RAS KERNEL INFO instruction cache parity error corrected",
"RAS KERNEL INFO data cache parity error corrected",
"RAS KERNEL INFO data cache parity error corrected",
]
AnomaLog also supports fixed-size and time-based windows when the research question is not entity-centric:
fixed_sequences = dataset.group_by_fixed_window(window_size=128, step_size=64)
time_sequences = dataset.group_by_time_window(
time_span_ms=60_000,
step_span_ms=30_000,
)
All grouping modes produce TemplateSequence objects. See
Sequences for the full object shape and
Pipeline Concepts for grouping
tradeoffs.
Choose a representation¶
TemplateSequence is still model-agnostic. The representation layer converts a
sequence into the input shape expected by a detector.
from anomalog.representations import (
SequentialRepresentation,
TemplateCountRepresentation,
TemplatePhraseRepresentation,
)
builder = dataset.group_by_fixed_window(window_size=3).with_train_fraction(0.8)
sequential = builder.represent_with(SequentialRepresentation())
template_counts = builder.represent_with(TemplateCountRepresentation())
template_phrases = builder.represent_with(
TemplatePhraseRepresentation(phrase_ngram_min=1, phrase_ngram_max=2),
)
Use the representation that matches the model family:
SequentialRepresentationfor ordered template streamsTemplateCountRepresentationfor sparse template countsTemplatePhraseRepresentationfor sparse phrase counts extracted from template text
Custom representations are not limited to template text. They receive the full
TemplateSequence, so they can use event timing deltas, parameters, entity
IDs, or split metadata.
For more detail, see Representations and Pipeline Concepts.
What next¶
- Read Pipeline Concepts for the full stage-by-stage explanation and reproducibility model
- See Experiments for config-driven detector runs and result artifacts
- See Reference for the API pages and module map
- See Development for contributor setup and implementation-facing storage details