Sources¶

Sources define how a dataset is materialised locally before parsing starts.

Use them to distinguish between checked-in local data, local archives, and remote benchmark downloads while keeping the dataset pipeline explicit.

>>> from pathlib import Path
>>> from anomalog.sources import LocalDirSource, LocalZipSource, RemoteZipSource
>>> LocalDirSource.name, LocalZipSource.name, RemoteZipSource.name
('local_dir', 'local_zip', 'remote_zip')
>>> LocalDirSource(Path("logs"), raw_logs_relpath=Path("demo.log")).raw_logs_relpath
PosixPath('demo.log')

`anomalog.sources`¶

Dataset source abstractions for fetching log data.

`AITADSScenarioSource` `dataclass` ¶

Bases: DatasetSource

Materialise one or more AIT-ADS scenarios into a canonical alert stream.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the built-in source.
`scenario_names`	`tuple[str, ...]`	Ordered scenario names selected for materialisation.
`base_source`	`DatasetSource`	Archive source that provides the extracted AIT-ADS files.
`labels_relpath`	`Path`	Relative path to the published scenario label CSV.
`labels_url`	`str`	Download URL for the label CSV when it is missing.
`labels_md5_checksum`	`str`	Expected MD5 checksum for the label CSV.
`raw_logs_relpath`	`Path | None`	Optional relative path for the derived JSONL stream.

`materialise(*, dst_dir)` ¶

Materialise the archive, labels, and canonical alert stream.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Dataset root used for archive extraction and the derived canonical alert stream.	required

Returns:

Name	Type	Description
`Path`	`Path`	Materialised dataset root containing the upstream files plus the derived canonical alert stream.

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.

`DatasetSource` ¶

Bases: Protocol

Download or copy a dataset into the given directory.

Source implementations materialise a dataset root and then rely on the shared `raw_logs_path()` helper to validate the final raw-log location.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Stable registry/config name for the source.
`raw_logs_relpath`	`Path | None`	Optional raw-log path relative to the materialised dataset root. When omitted, `<dataset_name>.log` is used.

`materialise(*, dst_dir)` ¶

Ensure the dataset exists under `dst_dir` and return the dataset root path.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Target directory where the dataset should appear.	required

Returns:

Name	Type	Description
`Path`	`Path`	Materialised dataset root directory.
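Because `DatasetSource` is a `Protocol`, any object with a matching `materialise()` method can in principle act as a source. The sketch below illustrates that structural contract only. `SourceLike` and `SyntheticSource` are hypothetical names invented for this example and are not part of `anomalog.sources`.

```python
from pathlib import Path
from typing import ClassVar, Optional, Protocol, runtime_checkable


@runtime_checkable
class SourceLike(Protocol):
    """Hypothetical mirror of the documented materialise() contract."""

    def materialise(self, *, dst_dir: Path) -> Path:
        ...


class SyntheticSource:
    """Toy source that writes a fixed log file instead of fetching one."""

    name: ClassVar[str] = "synthetic"  # assumed registry-style name
    raw_logs_relpath: Optional[Path] = None

    def materialise(self, *, dst_dir: Path) -> Path:
        # Make sure the dataset root exists, then place a log file in it.
        dst_dir.mkdir(parents=True, exist_ok=True)
        (dst_dir / "synthetic.log").write_text("INFO boot ok\n", encoding="utf-8")
        return dst_dir
```

The class does not inherit from the protocol; it satisfies it purely by having a compatible `materialise()` method.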

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.
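The checks listed above can be approximated with `pathlib` alone. This is a minimal sketch of the documented behaviour, not the library's implementation. The function name `resolve_raw_logs_path` and the order of the checks are assumptions.

```python
from __future__ import annotations

from pathlib import Path


def resolve_raw_logs_path(
    dataset_root: Path,
    dataset_name: str,
    raw_logs_relpath: Path | None = None,
) -> Path:
    """Hypothetical stand-in for the documented raw_logs_path() checks."""
    # Fall back to "<dataset_name>.log" when no relative path is configured.
    relpath = raw_logs_relpath if raw_logs_relpath is not None else Path(f"{dataset_name}.log")
    if relpath.is_absolute():
        raise ValueError(f"raw_logs_relpath must be relative: {relpath}")

    root = dataset_root.resolve()
    candidate = (root / relpath).resolve()
    # Reject paths such as "../outside.log" that leave the dataset root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"raw_logs_relpath escapes the dataset root: {relpath}")
    if not candidate.exists():
        raise FileNotFoundError(f"Raw log file not found: {candidate}")
    if not candidate.is_file():
        raise IsADirectoryError(f"Raw log path is not a file: {candidate}")
    return candidate
```

Resolving both paths before comparing them means symlinks and `..` segments are normalised first, so a relative path cannot leave the root by indirection.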

`LocalDirSource` `dataclass` ¶

Bases: DatasetSource

Use an existing local directory as the dataset source.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the source.
`path`	`Path`	Existing directory treated as the dataset root.
`raw_logs_relpath`	`Path | None`	Optional raw-log path relative to `path`.

`materialise(*, dst_dir)` ¶

Validate that the configured directory exists and return it as the dataset root.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Requested dataset destination. Ignored for local directory sources.	required

Returns:

Name	Type	Description
`Path`	`Path`	Existing dataset root directory.

Raises:

Type	Description
`FileNotFoundError`	If the configured path does not exist.
`NotADirectoryError`	If the configured path is not a directory.
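The two failure modes above map directly onto `pathlib` checks. The following sketch shows one way to express them. `check_local_dir` is a hypothetical helper name and does not come from the library.

```python
from pathlib import Path


def check_local_dir(path: Path) -> Path:
    """Hypothetical helper mirroring the documented LocalDirSource checks."""
    if not path.exists():
        raise FileNotFoundError(f"Dataset directory does not exist: {path}")
    if not path.is_dir():
        raise NotADirectoryError(f"Dataset path is not a directory: {path}")
    # Nothing is copied: the existing directory itself is the dataset root.
    return path
```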

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.

`LocalZipSource` `dataclass` ¶

Bases: DatasetSource

Use a local zip archive as the dataset source.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the source.
`zip_path`	`Path`	Local zip archive to extract.
`raw_logs_relpath`	`Path | None`	Optional raw-log path relative to the extracted dataset root.
`md5_checksum`	`str | None`	Optional MD5 checksum used to verify the archive before extraction.

`materialise(*, dst_dir)` ¶

Extract the zip file into dst_dir and return the dataset root.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Destination directory for extracted dataset files.	required

Returns:

Name	Type	Description
`Path`	`Path`	Extracted dataset root directory.

Raises:

Type	Description
`FileNotFoundError`	If the configured zip archive does not exist.
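As a rough illustration of checksum-then-extract, the sketch below uses only the standard library. The helper names `md5_of` and `extract_zip` are hypothetical. Raising `ValueError` on a checksum mismatch is also an assumption, since the reference above documents only `FileNotFoundError`.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Optional


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_zip(zip_path: Path, dst_dir: Path, md5_checksum: Optional[str] = None) -> Path:
    """Hypothetical sketch: verify an archive, then extract it into dst_dir."""
    if not zip_path.is_file():
        raise FileNotFoundError(f"Zip archive not found: {zip_path}")
    # Assumption: a mismatch is reported as ValueError before anything is extracted.
    if md5_checksum is not None and md5_of(zip_path) != md5_checksum.lower():
        raise ValueError(f"MD5 mismatch for {zip_path}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dst_dir)
    return dst_dir
```

Verifying before extraction keeps a corrupted or tampered archive from leaving partial files in the dataset root.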

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.

`PostProcessedSource` `dataclass` ¶

Bases: DatasetSource

Materialise a base source and derive a raw log file from it.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the derived source.
`base_source`	`DatasetSource`	Upstream source that materialises the archive or directory containing the source files.
`post_process`	`PostProcessFn`	Function that derives the raw log file from the materialised base source root.
`raw_logs_relpath`	`Path | None`	Relative path of the derived raw log file inside the materialised dataset root.

`split_provenance` `property` ¶

Return provenance for recognised file-boundary split materialisers.

`materialise(*, dst_dir)` ¶

Materialise the base source and derive the raw log file.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Destination directory for the materialised dataset.	required

Returns:

Name	Type	Description
`Path`	`Path`	Dataset root containing the derived raw log file.

Raises:

Type	Description
`FileNotFoundError`	If the post-processing step fails to create the derived raw log file.
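The materialise-then-derive flow can be pictured as a small wrapper around a base materialiser. This sketch is illustrative only. `DerivedLogSketch`, its field names, and the `(root, target)` callback signature are assumptions, because the real `PostProcessFn` signature is not shown in this reference.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Assumed callback shape: receives the materialised root and the path
# where the derived raw log file should be written.
PostProcess = Callable[[Path, Path], None]


@dataclass(frozen=True)
class DerivedLogSketch:
    """Hypothetical wrapper: materialise a base, then derive one log file."""

    materialise_base: Callable[[Path], Path]
    post_process: PostProcess
    raw_logs_relpath: Path

    def materialise(self, *, dst_dir: Path) -> Path:
        root = self.materialise_base(dst_dir)
        target = root / self.raw_logs_relpath
        self.post_process(root, target)
        # Fail loudly if the callback did not produce the promised file.
        if not target.is_file():
            raise FileNotFoundError(f"Post-processing did not create {target}")
        return root
```

Checking for the derived file right after the callback runs surfaces a faulty post-processing step at materialisation time rather than later during parsing.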

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.

`RemoteZipSource` `dataclass` ¶

Bases: DatasetSource

Download a dataset archive from a remote URL and extract it locally.

Attributes:

Name	Type	Description
`name`	`ClassVar[str]`	Registry/config name for the source.
`url`	`str`	Absolute HTTP(S) URL of the dataset archive.
`md5_checksum`	`str | None`	Optional MD5 checksum used to verify the downloaded archive.
`raw_logs_relpath`	`Path | None`	Optional raw-log path relative to the extracted dataset root.

`materialise(*, dst_dir)` ¶

Fetch, checksum, and extract the dataset into dst_dir.

Parameters:

Name	Type	Description	Default
`dst_dir`	`Path`	Destination directory for the extracted dataset.	required

Returns:

Name	Type	Description
`Path`	`Path`	Extracted dataset root directory.
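The fetch step can be sketched with the standard library. The example covers only URL validation and download. `require_http_url` and `download_archive` are hypothetical helper names, and the `dataset.zip` fallback filename is an assumption.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def require_http_url(url: str) -> str:
    """Accept only absolute http:// or https:// URLs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Expected an absolute HTTP(S) URL, got: {url!r}")
    return url


def download_archive(url: str, dst_dir: Path) -> Path:
    """Hypothetical sketch: stream a remote archive into dst_dir."""
    require_http_url(url)
    dst_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last URL path segment (assumed fallback).
    filename = Path(urlparse(url).path).name or "dataset.zip"
    archive_path = dst_dir / filename
    with urllib.request.urlopen(url) as response, archive_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    return archive_path
```

Checksum verification and extraction would follow the download, in the order the method summary gives: fetch, checksum, extract.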

`raw_logs_path(*, dataset_name, dataset_root)` ¶

Return the validated raw log path inside dataset_root.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Dataset name used for the default log filename.	required
`dataset_root`	`Path`	Materialised dataset root directory.	required

Returns:

Name	Type	Description
`Path`	`Path`	Validated path to the raw log file inside the dataset root.

Raises:

Type	Description
`ValueError`	If `raw_logs_relpath` is absolute or escapes the dataset root.
`FileNotFoundError`	If the resolved log path does not exist.
`IsADirectoryError`	If the resolved path is not a file.
