Sources
Sources define how a dataset is materialised locally before parsing starts.
Use them to distinguish between checked-in local data, local archives, and remote benchmark downloads while keeping the dataset pipeline explicit.
>>> from pathlib import Path
>>> from anomalog.sources import LocalDirSource, LocalZipSource, RemoteZipSource
>>> LocalDirSource.name, LocalZipSource.name, RemoteZipSource.name
('local_dir', 'local_zip', 'remote_zip')
>>> LocalDirSource(Path("logs"), raw_logs_relpath=Path("demo.log")).raw_logs_relpath
PosixPath('demo.log')
anomalog.sources
Dataset source abstractions for fetching log data.
AITADSScenarioSource
dataclass
Bases: DatasetSource
Materialise one or more AIT-ADS scenarios into a canonical alert stream.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the built-in source. |
| scenario_names | tuple[str, ...] | Ordered scenario names selected for materialisation. |
| base_source | DatasetSource | Archive source that provides the extracted AIT-ADS files. |
| labels_relpath | Path | Relative path to the published scenario label CSV. |
| labels_url | str | Download URL for the label CSV when it is missing. |
| labels_md5_checksum | str | Expected MD5 checksum for the label CSV. |
| raw_logs_relpath | Path \| None | Optional relative path for the derived JSONL stream. |
materialise(*, dst_dir)
Materialise the archive, labels, and canonical alert stream.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Dataset root used for archive extraction and the derived canonical alert stream. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Materialised dataset root containing the upstream files plus the derived canonical alert stream. |
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
DatasetSource
Bases: Protocol
Download or copy a dataset into the given directory.
Source implementations materialise a dataset root and then rely on the
shared raw_logs_path() helper to validate the final raw-log location.
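The two-step contract described here (materialise a root, then locate the raw log inside it) can be illustrated with a small structural-typing sketch. Everything in it is hypothetical: `SourceLike`, `InMemoryDemoSource`, and `materialise_and_locate` are illustrative names rather than part of `anomalog`, and the `<name>.log` fallback is an assumption, not the library's documented default.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar, Optional, Protocol


class SourceLike(Protocol):
    """Hypothetical stand-in for the documented protocol, not the real class."""

    name: ClassVar[str]
    raw_logs_relpath: Optional[Path]

    def materialise(self, *, dst_dir: Path) -> Path:
        """Return the materialised dataset root."""
        ...


@dataclass(frozen=True)
class InMemoryDemoSource:
    """Toy source that writes one fixed log line into dst_dir."""

    name: ClassVar[str] = "demo"
    raw_logs_relpath: Optional[Path] = Path("demo.log")

    def materialise(self, *, dst_dir: Path) -> Path:
        dst_dir.mkdir(parents=True, exist_ok=True)
        (dst_dir / "demo.log").write_text("INFO service started\n")
        return dst_dir


def materialise_and_locate(source: SourceLike, dst_dir: Path) -> Path:
    """Materialise the source, then resolve its raw-log path under the root."""
    root = source.materialise(dst_dir=dst_dir)
    # Assumption: fall back to "<name>.log" when no relative path is configured.
    relpath = source.raw_logs_relpath or Path(f"{source.name}.log")
    return root / relpath


with tempfile.TemporaryDirectory() as tmp:
    log_path = materialise_and_locate(InMemoryDemoSource(), Path(tmp) / "dataset")
    first_line = log_path.read_text().splitlines()[0]

print(first_line)
```

Because the protocol is structural, the toy class never inherits from it; any object with matching attributes and a `materialise` method satisfies the annotation.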
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Stable registry/config name for the source. |
| raw_logs_relpath | Path \| None | Optional raw-log path relative to the materialised dataset root. When omitted, |
materialise(*, dst_dir)
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
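A validation routine with this error contract could look roughly like the sketch below. It is not the library's implementation: `resolve_raw_logs_path` is a hypothetical standalone function, the `<dataset_name>.log` default is an assumption, and because the `ValueError` condition is not spelled out above, the sketch assumes it signals a path that escapes the dataset root.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_raw_logs_path(
    *,
    dataset_name: str,
    dataset_root: Path,
    raw_logs_relpath: Optional[Path] = None,
) -> Path:
    """Resolve and validate a raw-log path under dataset_root (illustrative)."""
    # Assumption: the default filename is "<dataset_name>.log".
    relpath = raw_logs_relpath if raw_logs_relpath is not None else Path(f"{dataset_name}.log")
    root = dataset_root.resolve()
    candidate = (root / relpath).resolve()
    # Assumption: ValueError means the path points outside the dataset root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"{relpath} escapes dataset root {root}")
    if not candidate.exists():
        raise FileNotFoundError(candidate)
    if not candidate.is_file():
        raise IsADirectoryError(candidate)
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo.log").write_text("line\n")
    (root / "nested").mkdir()
    found = resolve_raw_logs_path(dataset_name="demo", dataset_root=root)

    # Record which exception each problematic relative path triggers.
    outcomes = {}
    cases = [
        ("escape", Path("../outside.log")),
        ("missing", Path("absent.log")),
        ("directory", Path("nested")),
    ]
    for label, rel in cases:
        try:
            resolve_raw_logs_path(
                dataset_name="demo", dataset_root=root, raw_logs_relpath=rel
            )
        except (ValueError, OSError) as exc:
            outcomes[label] = type(exc).__name__

print(found.name, outcomes)
```

Resolving both the root and the candidate before comparing them means `..` segments and symlinks cannot slip a path outside the root unnoticed.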
LocalDirSource
dataclass
Bases: DatasetSource
Use an existing local directory as the dataset source.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the source. |
| path | Path | Existing directory treated as the dataset root. |
| raw_logs_relpath | Path \| None | Optional raw-log path relative to |
materialise(*, dst_dir)
Validate directory existence and return the dataset root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Requested dataset destination. Ignored for local directory sources. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Existing dataset root directory. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the configured path does not exist. |
| NotADirectoryError | If the configured path is not a directory. |
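The existence and directory checks listed above map directly onto `pathlib` calls. The following is a minimal sketch of that check order, with `validate_dataset_dir` as a hypothetical helper name rather than anything exported by the library.

```python
import tempfile
from pathlib import Path


def validate_dataset_dir(path: Path) -> Path:
    """Return path unchanged if it is an existing directory (illustrative)."""
    if not path.exists():
        raise FileNotFoundError(f"dataset directory not found: {path}")
    if not path.is_dir():
        raise NotADirectoryError(f"dataset path is not a directory: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp)
    a_file = existing / "demo.log"
    a_file.write_text("line\n")

    accepted = validate_dataset_dir(existing) == existing

    # Record which exception each invalid path triggers.
    results = {}
    for label, candidate in [("missing", existing / "absent"), ("file", a_file)]:
        try:
            validate_dataset_dir(candidate)
        except OSError as exc:
            results[label] = type(exc).__name__

print(accepted, results)
```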
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
LocalZipSource
dataclass
Bases: DatasetSource
Use a local zip archive as the dataset source.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the source. |
| zip_path | Path | Local zip archive to extract. |
| raw_logs_relpath | Path \| None | Optional raw-log path relative to the extracted dataset root. |
| md5_checksum | str \| None | Optional checksum used to verify the archive before extraction. |
materialise(*, dst_dir)
Extract the zip file into dst_dir and return the dataset root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Destination directory for extracted dataset files. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Extracted dataset root directory. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the configured zip archive does not exist. |
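Verifying an archive before extracting it can be sketched with the standard library alone. The helper names `md5_of` and `extract_verified_zip` are hypothetical, and raising `ValueError` on a checksum mismatch is an assumption, since the text above does not say how a mismatch is reported.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_verified_zip(
    zip_path: Path, dst_dir: Path, md5_checksum: Optional[str] = None
) -> Path:
    """Check the optional checksum, then extract into dst_dir (illustrative)."""
    if not zip_path.exists():
        raise FileNotFoundError(zip_path)
    # Assumption: a mismatch is reported as ValueError.
    if md5_checksum is not None and md5_of(zip_path) != md5_checksum.lower():
        raise ValueError(f"MD5 mismatch for {zip_path}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dst_dir)
    return dst_dir


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive_path = tmp_path / "demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("demo.log", "INFO service started\n")

    root = extract_verified_zip(
        archive_path, tmp_path / "extracted", md5_of(archive_path)
    )
    extracted_text = (root / "demo.log").read_text()

    # A deliberately wrong checksum should stop extraction.
    try:
        extract_verified_zip(archive_path, tmp_path / "other", "0" * 32)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(extracted_text.strip(), mismatch_detected)
```

Checking the digest before opening the archive keeps a corrupted or substituted file from being partially extracted into the dataset root.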
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
PostProcessedSource
dataclass
Bases: DatasetSource
Materialise a base source and derive a raw log file from it.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the derived source. |
| base_source | DatasetSource | Upstream source that materialises the archive or directory containing the source files. |
| post_process | PostProcessFn | Function that derives the raw log file from the materialised base source root. |
| raw_logs_relpath | Path \| None | Relative path of the derived raw log file inside the materialised dataset root. |
split_provenance
property
Return provenance for recognised file-boundary split materialisers.
materialise(*, dst_dir)
Materialise the base source and derive the raw log file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Destination directory for the materialised dataset. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Dataset root containing the derived raw log file. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the post-processing step fails to create the derived raw log file. |
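The materialise-then-derive flow can be sketched as a composition of two callables followed by an existence check. The signature of `PostProcessFn` is not given above, so the `(dataset_root, derived_path)` shape used here is an assumption, and all function names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Assumed callback shape: (dataset_root, derived_log_path) -> None.
PostProcess = Callable[[Path, Path], None]


def materialise_with_post_process(
    *,
    materialise_base: Callable[[Path], Path],
    post_process: PostProcess,
    raw_logs_relpath: Path,
    dst_dir: Path,
) -> Path:
    """Materialise the base data, derive the raw log, and verify it exists."""
    root = materialise_base(dst_dir)
    derived = root / raw_logs_relpath
    post_process(root, derived)
    if not derived.is_file():
        raise FileNotFoundError(f"post-processing did not create {derived}")
    return root


def write_parts(dst_dir: Path) -> Path:
    """Toy base source: writes two part files into dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    (dst_dir / "part-1.txt").write_text("first\n")
    (dst_dir / "part-2.txt").write_text("second\n")
    return dst_dir


def concatenate_parts(root: Path, derived: Path) -> None:
    """Toy post-processor: joins the part files in name order."""
    parts = sorted(root.glob("part-*.txt"))
    derived.write_text("".join(part.read_text() for part in parts))


with tempfile.TemporaryDirectory() as tmp:
    root = materialise_with_post_process(
        materialise_base=write_parts,
        post_process=concatenate_parts,
        raw_logs_relpath=Path("merged.log"),
        dst_dir=Path(tmp) / "dataset",
    )
    merged = (root / "merged.log").read_text()

    # A post-processor that writes nothing should trigger FileNotFoundError.
    try:
        materialise_with_post_process(
            materialise_base=write_parts,
            post_process=lambda root, derived: None,
            raw_logs_relpath=Path("never.log"),
            dst_dir=Path(tmp) / "empty",
        )
        failure_detected = False
    except FileNotFoundError:
        failure_detected = True

print(merged.splitlines(), failure_detected)
```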
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
RemoteZipSource
dataclass
Bases: DatasetSource
Download a dataset archive from a remote URL and extract it locally.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | ClassVar[str] | Registry/config name for the source. |
| url | str | Absolute HTTP(S) URL of the dataset archive. |
| md5_checksum | str \| None | Optional checksum for the downloaded archive. |
| raw_logs_relpath | Path \| None | Optional raw-log path relative to the extracted dataset root. |
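The "absolute HTTP(S) URL" requirement on `url` can be checked before any download is attempted. The sketch below uses `urllib.parse.urlsplit`; `is_absolute_http_url` is a hypothetical helper, and whether the library performs an equivalent check is not stated here.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(url: str) -> bool:
    """Return True for URLs with an http/https scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


# Placeholder URLs for illustration only.
checks = {
    "https://example.org/data.zip": is_absolute_http_url("https://example.org/data.zip"),
    "ftp://example.org/data.zip": is_absolute_http_url("ftp://example.org/data.zip"),
    "/data.zip": is_absolute_http_url("/data.zip"),
    "https:///data.zip": is_absolute_http_url("https:///data.zip"),
}
print(checks)
```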
materialise(*, dst_dir)
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialised dataset root directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |