Sources

Sources define how a dataset is materialised locally before parsing starts.

Use them to distinguish between checked-in local data, local archives, and remote benchmark downloads while keeping the dataset pipeline explicit.

>>> from pathlib import Path
>>> from anomalog.sources import LocalDirSource, LocalZipSource, RemoteZipSource
>>> LocalDirSource.name, LocalZipSource.name, RemoteZipSource.name
('local_dir', 'local_zip', 'remote_zip')
>>> LocalDirSource(Path("logs"), raw_logs_relpath=Path("demo.log")).raw_logs_relpath
PosixPath('demo.log')

anomalog.sources

Dataset source abstractions for fetching log data.

AITADSScenarioSource dataclass

Bases: DatasetSource

Materialise one or more AIT-ADS scenarios into a canonical alert stream.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the built-in source.

scenario_names tuple[str, ...]

Ordered scenario names selected for materialisation.

base_source DatasetSource

Archive source that provides the extracted AIT-ADS files.

labels_relpath Path

Relative path to the published scenario label CSV.

labels_url str

Download URL for the label CSV when it is missing.

labels_md5_checksum str

Expected MD5 checksum for the label CSV.

raw_logs_relpath Path | None

Optional relative path for the derived JSONL stream.

materialise(*, dst_dir)

Materialise the archive, labels, and canonical alert stream.

Parameters:

Name Type Description Default
dst_dir Path

Dataset root used for archive extraction and the derived canonical alert stream.

required

Returns:

Name Type Description
Path Path

Materialised dataset root containing the upstream files plus the derived canonical alert stream.

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.

DatasetSource

Bases: Protocol

Protocol for sources that download or copy a dataset into a given directory.

Source implementations materialise a dataset root and then rely on the shared raw_logs_path() helper to validate the final raw-log location.

Attributes:

Name Type Description
name ClassVar[str]

Stable registry/config name for the source.

raw_logs_relpath Path | None

Optional raw-log path relative to the materialised dataset root. When omitted, <dataset_name>.log is used.

materialise(*, dst_dir)

Ensure the dataset exists under dst_dir and return the dataset root path.

Parameters:

Name Type Description Default
dst_dir Path

Target directory where the dataset should appear.

required

Returns:

Name Type Description
Path Path

Materialised dataset root directory.
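To illustrate the shape a source takes, here is a minimal sketch that mirrors the documented members with a local protocol. SourceLike, InlineTextSource, materialise_into, and the inline.log default are hypothetical names invented for this example; none of them are part of anomalog, and the real DatasetSource may differ in detail.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar, Optional, Protocol


class SourceLike(Protocol):
    """Hypothetical local mirror of the documented members (not the library class)."""

    name: ClassVar[str]
    raw_logs_relpath: Optional[Path]

    def materialise(self, *, dst_dir: Path) -> Path: ...


@dataclass(frozen=True)
class InlineTextSource:
    """Hypothetical source that writes fixed text into dst_dir."""

    name: ClassVar[str] = "inline_text"
    text: str = ""
    raw_logs_relpath: Optional[Path] = None

    def materialise(self, *, dst_dir: Path) -> Path:
        dst_dir.mkdir(parents=True, exist_ok=True)
        # "inline.log" is an assumption made for this sketch only.
        relpath = self.raw_logs_relpath
        if relpath is None:
            relpath = Path("inline.log")
        (dst_dir / relpath).write_text(self.text, encoding="utf-8")
        return dst_dir


def materialise_into(source: SourceLike, dst_dir: Path) -> Path:
    # Any object with the documented members satisfies the protocol structurally.
    return source.materialise(dst_dir=dst_dir)
```

Because the protocol is structural, InlineTextSource does not need to inherit from SourceLike for materialise_into to accept it.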

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.
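The checks listed above can be sketched with the standard library alone. The function below, resolve_raw_logs_path, is a hypothetical stand-in written for illustration; it follows the documented default (<dataset_name>.log) and the documented exceptions, but it is not the library's implementation and the order of checks is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_raw_logs_path(
    *,
    dataset_name: str,
    dataset_root: Path,
    raw_logs_relpath: Optional[Path] = None,
) -> Path:
    """Hypothetical sketch of the documented raw_logs_path() checks."""
    # Fall back to <dataset_name>.log when no relative path is configured.
    relpath = raw_logs_relpath
    if relpath is None:
        relpath = Path(f"{dataset_name}.log")
    if relpath.is_absolute():
        raise ValueError(f"raw_logs_relpath must be relative: {relpath}")

    root = dataset_root.resolve()
    candidate = (root / relpath).resolve()
    # Reject paths such as "../outside.log" that leave the dataset root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"raw_logs_relpath escapes the dataset root: {relpath}")

    if not candidate.exists():
        raise FileNotFoundError(candidate)
    if not candidate.is_file():
        raise IsADirectoryError(candidate)
    return candidate
```

Resolving both the root and the candidate before comparing them means symlinks and ".." segments are normalised first, so the containment check operates on real filesystem locations.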

LocalDirSource dataclass

Bases: DatasetSource

Use an existing local directory as the dataset source.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the source.

path Path

Existing directory treated as the dataset root.

raw_logs_relpath Path | None

Optional raw-log path relative to path.

materialise(*, dst_dir)

Validate directory existence and return the dataset root.

Parameters:

Name Type Description Default
dst_dir Path

Requested dataset destination. Ignored for local directory sources.

required

Returns:

Name Type Description
Path Path

Existing dataset root directory.

Raises:

Type Description
FileNotFoundError

If the configured path does not exist.

NotADirectoryError

If the configured path is not a directory.
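As a rough illustration of the behaviour described above, the sketch below validates an existing directory and ignores dst_dir. DirSourceSketch is a hypothetical class written for this example; it only mirrors the documented checks and is not the LocalDirSource implementation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirSourceSketch:
    """Hypothetical stand-in mirroring the documented LocalDirSource checks."""

    path: Path

    def materialise(self, *, dst_dir: Path) -> Path:
        # dst_dir is accepted for interface compatibility but never used:
        # the existing directory itself serves as the dataset root.
        if not self.path.exists():
            raise FileNotFoundError(self.path)
        if not self.path.is_dir():
            raise NotADirectoryError(self.path)
        return self.path
```

Nothing is copied in this sketch, so edits to files under path would be visible to later pipeline stages immediately.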

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.

LocalZipSource dataclass

Bases: DatasetSource

Use a local zip archive as the dataset source.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the source.

zip_path Path

Local zip archive to extract.

raw_logs_relpath Path | None

Optional raw-log path relative to the extracted dataset root.

md5_checksum str | None

Optional MD5 checksum used to verify the archive before extraction.

materialise(*, dst_dir)

Extract the zip file into dst_dir and return the dataset root.

Parameters:

Name Type Description Default
dst_dir Path

Destination directory for extracted dataset files.

required

Returns:

Name Type Description
Path Path

Extracted dataset root directory.

Raises:

Type Description
FileNotFoundError

If the configured zip archive does not exist.
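The verify-then-extract flow can be sketched with hashlib and zipfile. The helper names md5_of_file and extract_zip_sketch are hypothetical, and raising ValueError on a checksum mismatch is an assumption: the reference above documents only FileNotFoundError, so the library may signal a mismatch differently.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_zip_sketch(
    zip_path: Path, dst_dir: Path, md5_checksum: Optional[str] = None
) -> Path:
    """Hypothetical sketch: verify an optional MD5 checksum, then extract."""
    if not zip_path.is_file():
        raise FileNotFoundError(zip_path)
    if md5_checksum is not None and md5_of_file(zip_path) != md5_checksum.lower():
        # Assumption: the exception type for a mismatch is not documented.
        raise ValueError(f"MD5 mismatch for {zip_path}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dst_dir)
    return dst_dir
```

Checking the digest before opening the archive means a corrupted or substituted file is rejected without writing anything into dst_dir.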

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.

PostProcessedSource dataclass

Bases: DatasetSource

Materialise a base source and derive a raw log file from it.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the derived source.

base_source DatasetSource

Upstream source that materialises the archive or directory containing the source files.

post_process PostProcessFn

Function that derives the raw log file from the materialised base source root.

raw_logs_relpath Path | None

Relative path of the derived raw log file inside the materialised dataset root.

split_provenance property

Return provenance for recognised file-boundary split materialisers.

materialise(*, dst_dir)

Materialise the base source and derive the raw log file.

Parameters:

Name Type Description Default
dst_dir Path

Destination directory for the materialised dataset.

required

Returns:

Name Type Description
Path Path

Dataset root containing the derived raw log file.

Raises:

Type Description
FileNotFoundError

If the post-processing step fails to create the derived raw log file.
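To make the derive-then-check step concrete, here is a small sketch. The signature of PostProcessFn is not shown in this reference, so the two-argument callable used below (dataset root, output path) is purely an assumption, as are the names PostProcessSketchFn, concat_parts, derive_raw_log, and the part-*.log file pattern.

```python
from pathlib import Path
from typing import Callable

# Assumed callable shape for illustration; the real PostProcessFn may differ.
PostProcessSketchFn = Callable[[Path, Path], None]


def concat_parts(dataset_root: Path, out_path: Path) -> None:
    """Hypothetical post-processing step: join part-*.log files in name order."""
    parts = sorted(dataset_root.glob("part-*.log"))
    with out_path.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))


def derive_raw_log(
    dataset_root: Path,
    raw_logs_relpath: Path,
    post_process: PostProcessSketchFn,
) -> Path:
    """Run the post-processing step and confirm the derived file exists."""
    out_path = dataset_root / raw_logs_relpath
    out_path.parent.mkdir(parents=True, exist_ok=True)
    post_process(dataset_root, out_path)
    if not out_path.is_file():
        # Mirrors the documented failure mode: the step did not produce the file.
        raise FileNotFoundError(out_path)
    return out_path
```

The existence check after the callable returns catches post-processing functions that silently do nothing, which would otherwise surface later as a confusing parser error.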

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.

RemoteZipSource dataclass

Bases: DatasetSource

Download a dataset archive from a remote URL and extract it locally.

Attributes:

Name Type Description
name ClassVar[str]

Registry/config name for the source.

url str

Absolute HTTP(S) URL of the dataset archive.

md5_checksum str | None

Optional MD5 checksum used to verify the downloaded archive.

raw_logs_relpath Path | None

Optional raw-log path relative to the extracted dataset root.

materialise(*, dst_dir)

Download the archive, verify its checksum, and extract it into dst_dir.

Parameters:

Name Type Description Default
dst_dir Path

Destination directory for the extracted dataset.

required

Returns:

Name Type Description
Path Path

Extracted dataset root directory.
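The download step can be approximated with urllib from the standard library. In the sketch below, is_absolute_http_url and download_archive are hypothetical helpers written for this example; how RemoteZipSource actually validates the URL or performs the download is not shown in this reference, and raising ValueError for a rejected URL is an assumption.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def is_absolute_http_url(url: str) -> bool:
    """Return True only for URLs with an http/https scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def download_archive(url: str, dst_path: Path) -> Path:
    """Hypothetical sketch: stream a remote archive to dst_path."""
    if not is_absolute_http_url(url):
        # Assumption: the library's own error type is not documented here.
        raise ValueError(f"expected an absolute HTTP(S) URL, got: {url!r}")
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of buffering it in memory.
    with urllib.request.urlopen(url) as response, dst_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dst_path
```

Rejecting non-HTTP(S) schemes before any I/O keeps file:// and relative paths from being treated as remote archives by mistake.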

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name Type Description Default
dataset_name str

Dataset name used for the default log filename.

required
dataset_root Path

Materialised dataset root directory.

required

Returns:

Name Type Description
Path Path

Validated path to the raw log file inside the dataset root.

Raises:

Type Description
ValueError

If raw_logs_relpath is absolute or escapes the dataset root.

FileNotFoundError

If the resolved log path does not exist.

IsADirectoryError

If the resolved path is not a file.