Development¶
This page covers local setup for contributors.
Clone the repository¶
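A typical first step looks like the following; the repository URL is a placeholder here, and the checkout directory name is assumed to match the project name:

```shell
git clone <repository-url>
cd anomalog
```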
Install development dependencies¶
AnomaLog uses uv for local development.
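A plausible sync command, assuming the dependency groups are named `dev` and `docs` as described below:

```shell
uv sync --group dev --group docs
```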
That installs the main package plus the dev and docs dependency groups.
Install pre-commit hooks¶
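Assuming the hooks are managed with the pre-commit tool (implied by the heading), install them into your local clone with:

```shell
uv run pre-commit install
```

You can also run all hooks against the whole tree once with `uv run pre-commit run --all-files`.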
Run the required checks¶
Before opening a pull request, run the same validation commands the project uses:
```shell
uv run ruff format
uv run ruff check --fix
uv run ty check
uv run pytest --doctest-modules --cov=anomalog --cov-context=test --cov-report term-missing tests
```
Build the documentation locally¶
If you want to iterate on the docs locally, use the live-reload serve command provided by your docs setup.
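If the docs site is built with MkDocs (an assumption; adjust to whatever tool the project actually uses), the serve command would be:

```shell
uv run mkdocs serve
```

This starts a local server with live reload, so edits under docs/ show up on refresh.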
Where to look in the codebase¶
- anomalog/ contains the library itself
- anomalog/_runtime/ contains internal orchestration code
- experiments/ contains the experiment runner layer
- docs/ contains the documentation site
- tests/ contains unit and integration tests
For the module map, see API Reference.
How structured storage works¶
The default sink is ParquetStructuredSink.
At a high level:
- raw lines are parsed into structured records
- records are written as a partitioned Parquet dataset
- partitioning is based on a stable entity-hash bucket
- entity-based grouping reads bucket partitions and then groups rows by entity
- time-based grouping re-merges rows from bucket partitions into global timestamp order with a heap
In plain terms:
- grouping by entity is efficient because each entity's rows land in a deterministic bucket partition
- grouping by time needs an extra merge step because time order spans multiple buckets
This is why the code in anomalog/parsers/structured/parquet/ matters if you are changing grouping behavior or storage layout.
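The two grouping paths above can be sketched in miniature. Everything here is illustrative (the bucket count, record shape, and helper names are made up, not the library's API); the point is the stable entity-hash bucketing and the heap merge for global time order:

```python
import hashlib
import heapq
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical partition count

def bucket_for(entity: str) -> int:
    # Stable hash: the same entity always lands in the same partition.
    return int(hashlib.sha1(entity.encode()).hexdigest(), 16) % NUM_BUCKETS

records = [  # (timestamp, entity, message)
    (1, "api-1", "start"),
    (2, "db-9", "connect"),
    (4, "db-9", "query"),
    (5, "api-1", "stop"),
]

# Write path: partition records by entity-hash bucket; each bucket
# stays time-ordered because the input stream is time-ordered.
buckets = defaultdict(list)
for rec in records:
    buckets[bucket_for(rec[1])].append(rec)

# Entity grouping: read only the one bucket the entity hashes to.
api_rows = [r for r in buckets[bucket_for("api-1")] if r[1] == "api-1"]

# Time grouping: heap-merge the per-bucket streams back into
# global timestamp order.
merged = list(heapq.merge(*buckets.values(), key=lambda r: r[0]))
```

Here `api_rows` holds api-1's records without touching other buckets, while `merged` recovers `[1, 2, 4, 5]` timestamp order across all buckets.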
How caching works¶
Caching is handled through the helpers in anomalog/cache/ and the internal runtime in anomalog/_runtime/.
The important design points are:
- dataset source materialisation is tied to the dataset root under data_root
- derived artifacts live under cache_root
- structured data writes are materialised against the raw log asset path
- template training is materialised against the trained parser output path
- local output existence is checked defensively after Prefect returns, because a cached completed state alone is not enough to guarantee the artifact still exists on disk
In practice, that means:
- if you keep the same raw logs and parser, the structured stage can be reused
- if you keep the same structured data and template setup, template mining can be reused
- if an expected local artifact has been deleted, the helper will rerun the work rather than trusting the cache state blindly
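The defensive existence check can be sketched as follows. This is a minimal illustration of the idea, not the library's implementation; the helper name `ensure_artifact` and its signature are invented for this example:

```python
import tempfile
from pathlib import Path

def ensure_artifact(path: Path, build, cache_says_complete: bool) -> Path:
    # Trust a cached "completed" state only when the artifact is
    # actually still on disk; otherwise rerun the work.
    if cache_says_complete and path.exists():
        return path
    build(path)
    return path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "structured.parquet"
    # The cache claims completion, but the file was deleted:
    # the helper rebuilds rather than trusting the cache state.
    ensure_artifact(out, lambda p: p.write_text("rows"), cache_says_complete=True)
    rebuilt = out.exists()
```

The design choice mirrors the point above: a completed cache state is a hint, and the filesystem is the source of truth.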
If you are debugging stale outputs or changing cache behavior, start with:
- anomalog/cache/__init__.py
- anomalog/_runtime/services.py
- anomalog/parsers/structured/parquet/sink.py
- anomalog/parsers/template/parsers.py