Sources

Sources define how a dataset is materialised locally before parsing starts.

Use them to distinguish between checked-in local data, local archives, and remote benchmark downloads while keeping the dataset pipeline explicit.

>>> from pathlib import Path
>>> from anomalog.sources import LocalDirSource, LocalZipSource, RemoteZipSource
>>> LocalDirSource.name, LocalZipSource.name, RemoteZipSource.name
('local_dir', 'local_zip', 'remote_zip')
>>> LocalDirSource(Path("logs"), raw_logs_relpath=Path("demo.log")).raw_logs_relpath
PosixPath('demo.log')

anomalog.sources

Dataset source abstractions for fetching log data.

DatasetSource

Bases: Protocol

Download or copy a dataset into the given directory.

materialise(*, dst_dir)

Ensure dataset exists under dst_dir and return the dataset root path.
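As an illustration of the `materialise` contract described above, here is a minimal sketch that uses only the standard library. `SourceLike` and `EmptyDirSource` are hypothetical names that are not part of anomalog, and whether anomalog accepts structurally typed sources (as opposed to explicit `DatasetSource` subclasses) is an assumption.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SourceLike(Protocol):
    """Hypothetical stand-in for the materialise() half of DatasetSource."""

    def materialise(self, *, dst_dir: Path) -> Path:
        """Ensure the dataset exists under dst_dir and return its root."""
        ...


class EmptyDirSource:
    """Toy source: creates dst_dir if needed and treats it as the dataset root."""

    def materialise(self, *, dst_dir: Path) -> Path:
        # Keyword-only argument, matching the documented signature.
        dst_dir.mkdir(parents=True, exist_ok=True)
        return dst_dir
```

Because the protocol is structural, `EmptyDirSource` satisfies `SourceLike` without inheriting from it; the built-in sources documented below instead list `DatasetSource` as an explicit base.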

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name | Type | Description | Default
dataset_name | str | Dataset name used for the default log filename. | required
dataset_root | Path | Materialized dataset root directory. | required

Returns:

Name | Type | Description
Path | Path | Validated path to the raw log file inside the dataset root.

Raises:

Type | Description
ValueError | If raw_logs_relpath is absolute or escapes the dataset root.
FileNotFoundError | If the resolved log path does not exist.
IsADirectoryError | If the resolved path is not a file.
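The three documented failure modes of `raw_logs_path` can be pictured with a short sketch. `resolve_raw_logs` is a hypothetical helper, not anomalog's implementation; the order of the checks is an assumption, and the way a default filename is derived from `dataset_name` is not shown because the documentation does not specify it.

```python
from pathlib import Path


def resolve_raw_logs(dataset_root: Path, raw_logs_relpath: Path) -> Path:
    """Hypothetical helper mirroring the documented checks."""
    # Absolute paths are rejected outright.
    if raw_logs_relpath.is_absolute():
        raise ValueError(f"raw_logs_relpath must be relative: {raw_logs_relpath}")

    root = dataset_root.resolve()
    candidate = (root / raw_logs_relpath).resolve()

    # After resolving ".." segments and symlinks, the path must stay
    # inside the dataset root (Path.is_relative_to needs Python 3.9+).
    if not candidate.is_relative_to(root):
        raise ValueError(f"raw_logs_relpath escapes the dataset root: {raw_logs_relpath}")

    if not candidate.exists():
        raise FileNotFoundError(candidate)
    if not candidate.is_file():
        raise IsADirectoryError(candidate)
    return candidate
```

Resolving both paths before comparing them is what catches inputs such as `../other.log`, which look relative but point outside the root.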

LocalDirSource dataclass

Bases: DatasetSource

Use an existing local directory as the dataset source.

materialise(*, dst_dir)

Validate directory existence and return the dataset root.

Parameters:

Name | Type | Description | Default
dst_dir | Path | Requested dataset destination. Ignored for local directory sources. | required

Returns:

Name | Type | Description
Path | Path | Existing dataset root directory.

Raises:

Type | Description
FileNotFoundError | If the configured path does not exist.
NotADirectoryError | If the configured path is not a directory.
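A rough sketch of the behaviour documented here, assuming a dataclass with a single `path` field. `DirSourceSketch` and the field name `path` are hypothetical; the documentation only shows that the directory is passed as the first positional argument.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirSourceSketch:
    """Hypothetical stand-in for a local directory source."""

    path: Path  # assumed field name for the configured directory

    def materialise(self, *, dst_dir: Path) -> Path:
        # dst_dir is accepted for interface compatibility but never used:
        # nothing is copied, the configured directory is the dataset root.
        if not self.path.exists():
            raise FileNotFoundError(self.path)
        if not self.path.is_dir():
            raise NotADirectoryError(self.path)
        return self.path
```

Returning the configured directory rather than `dst_dir` means checked-in data is read in place and never duplicated.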

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name | Type | Description | Default
dataset_name | str | Dataset name used for the default log filename. | required
dataset_root | Path | Materialized dataset root directory. | required

Returns:

Name | Type | Description
Path | Path | Validated path to the raw log file inside the dataset root.

Raises:

Type | Description
ValueError | If raw_logs_relpath is absolute or escapes the dataset root.
FileNotFoundError | If the resolved log path does not exist.
IsADirectoryError | If the resolved path is not a file.

LocalZipSource dataclass

Bases: DatasetSource

Use a local zip archive as the dataset source.

materialise(*, dst_dir)

Extract the zip file into dst_dir and return the dataset root.

Parameters:

Name | Type | Description | Default
dst_dir | Path | Destination directory for extracted dataset files. | required

Returns:

Name | Type | Description
Path | Path | Extracted dataset root directory.

Raises:

Type | Description
FileNotFoundError | If the configured zip archive does not exist.
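The extraction step can be approximated with the standard-library `zipfile` module. `extract_zip` is a hypothetical helper, not anomalog's code; in particular, returning `dst_dir` itself as the dataset root is an assumption, since the real source may return a subdirectory of it.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path: Path, dst_dir: Path) -> Path:
    """Hypothetical helper: extract zip_path into dst_dir and return dst_dir."""
    if not zip_path.exists():
        raise FileNotFoundError(zip_path)

    dst_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall sanitises member names, so entries cannot be
        # written outside dst_dir.
        archive.extractall(dst_dir)
    return dst_dir
```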

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name | Type | Description | Default
dataset_name | str | Dataset name used for the default log filename. | required
dataset_root | Path | Materialized dataset root directory. | required

Returns:

Name | Type | Description
Path | Path | Validated path to the raw log file inside the dataset root.

Raises:

Type | Description
ValueError | If raw_logs_relpath is absolute or escapes the dataset root.
FileNotFoundError | If the resolved log path does not exist.
IsADirectoryError | If the resolved path is not a file.

RemoteZipSource dataclass

Bases: DatasetSource

Download a dataset zip from a remote URL and extract it locally.

materialise(*, dst_dir)

Fetch, checksum, and extract the dataset into dst_dir.

Parameters:

Name | Type | Description | Default
dst_dir | Path | Destination directory for the extracted dataset. | required

Returns:

Name | Type | Description
Path | Path | Extracted dataset root directory.
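The fetch, checksum, and extract sequence might look roughly like the sketch below. Everything here is an assumption made for illustration: `fetch_verify_extract` is a hypothetical helper, SHA-256 is chosen only because the documentation does not name the checksum algorithm, and raising `ValueError` on a mismatch is a guess, since the page lists no exceptions for this method.

```python
import hashlib
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def fetch_verify_extract(url: str, sha256: str, dst_dir: Path) -> Path:
    """Hypothetical helper: download a zip, verify its digest, extract it."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "dataset.zip"
        digest = hashlib.sha256()

        # Stream the download in 1 MiB chunks, hashing while writing,
        # so large archives never need to fit in memory.
        with urllib.request.urlopen(url) as response, archive_path.open("wb") as out:
            for chunk in iter(lambda: response.read(1 << 20), b""):
                digest.update(chunk)
                out.write(chunk)

        # Verify before touching dst_dir so a corrupt download leaves no files.
        if digest.hexdigest() != sha256.lower():
            raise ValueError(f"checksum mismatch for {url}")

        dst_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dst_dir)
    return dst_dir
```

Checking the digest before extraction keeps a tampered or truncated archive from ever reaching the dataset directory.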

raw_logs_path(*, dataset_name, dataset_root)

Return the validated raw log path inside dataset_root.

Parameters:

Name | Type | Description | Default
dataset_name | str | Dataset name used for the default log filename. | required
dataset_root | Path | Materialized dataset root directory. | required

Returns:

Name | Type | Description
Path | Path | Validated path to the raw log file inside the dataset root.

Raises:

Type | Description
ValueError | If raw_logs_relpath is absolute or escapes the dataset root.
FileNotFoundError | If the resolved log path does not exist.
IsADirectoryError | If the resolved path is not a file.