Sources
Sources define how a dataset is materialised locally before parsing starts.
Use them to distinguish between checked-in local data, local archives, and remote benchmark downloads while keeping the dataset pipeline explicit.
>>> from pathlib import Path
>>> from anomalog.sources import LocalDirSource, LocalZipSource, RemoteZipSource
>>> LocalDirSource.name, LocalZipSource.name, RemoteZipSource.name
('local_dir', 'local_zip', 'remote_zip')
>>> LocalDirSource(Path("logs"), raw_logs_relpath=Path("demo.log")).raw_logs_relpath
PosixPath('demo.log')
anomalog.sources
Dataset source abstractions for fetching log data.
DatasetSource
Bases: Protocol
Download or copy a dataset into the given directory.
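Because DatasetSource is a Protocol, a class with matching method signatures should satisfy it structurally, without inheriting from it. The following is a minimal sketch under stated assumptions: `SourceLike` is a hypothetical stand-in for the real protocol, `FixedDirSource` is an illustrative class, the `runtime_checkable` decorator is added only so the check can be shown with `isinstance`, and the `<dataset_name>.log` default filename is an assumed convention.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SourceLike(Protocol):
    """Hypothetical stand-in mirroring the two documented methods."""

    def materialise(self, *, dst_dir: Path) -> Path: ...

    def raw_logs_path(self, *, dataset_name: str, dataset_root: Path) -> Path: ...


class FixedDirSource:
    """Illustrative source that always points at one existing directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def materialise(self, *, dst_dir: Path) -> Path:
        # Nothing to download or copy: the data is already in place.
        return self.root

    def raw_logs_path(self, *, dataset_name: str, dataset_root: Path) -> Path:
        # Assumed convention: the default log file is "<dataset_name>.log".
        return dataset_root / f"{dataset_name}.log"


source = FixedDirSource(Path("logs"))
# isinstance() on a runtime-checkable protocol only checks that the methods exist.
is_source = isinstance(source, SourceLike)
```

Note that this sketch skips the path validation that the documented raw_logs_path performs.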
materialise(*, dst_dir)
Ensure dataset exists under dst_dir and return the dataset root path.
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialized dataset root directory. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |

Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
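The existence and file-type checks described for raw_logs_path can be sketched roughly as follows. This is an illustration rather than the library's code: `resolve_raw_logs` and its `relpath` parameter are hypothetical names, and the ValueError case is left out because its trigger is not described here.

```python
from pathlib import Path


def resolve_raw_logs(dataset_root: Path, relpath: Path) -> Path:
    """Hypothetical helper: join root and relative path, then validate the result."""
    candidate = dataset_root / relpath
    if not candidate.exists():
        raise FileNotFoundError(f"raw log not found: {candidate}")
    if not candidate.is_file():
        # The path exists but is a directory (or another non-file entry).
        raise IsADirectoryError(f"raw log path is not a file: {candidate}")
    return candidate
```

Checking existence before the file-type test keeps the two error cases distinct, so a caller can tell a missing log apart from a misconfigured path.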
LocalDirSource
dataclass
Bases: DatasetSource
Use an existing local directory as the dataset source.
materialise(*, dst_dir)
Validate directory existence and return the dataset root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Requested dataset destination. Ignored for local directory sources. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Existing dataset root directory. |

Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the configured path does not exist. |
| NotADirectoryError | If the configured path is not a directory. |
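The behaviour documented for this method (validate, ignore dst_dir, return the configured directory) might look roughly like the sketch below. `DirSourceSketch` and its `path` field are hypothetical names chosen for illustration; the real dataclass fields may differ.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DirSourceSketch:
    """Hypothetical stand-in for a local-directory source."""

    path: Path  # assumed field name for the configured directory

    def materialise(self, *, dst_dir: Path) -> Path:
        # dst_dir is accepted to match the shared interface but is not used.
        if not self.path.exists():
            raise FileNotFoundError(f"dataset directory not found: {self.path}")
        if not self.path.is_dir():
            raise NotADirectoryError(f"not a directory: {self.path}")
        return self.path


with tempfile.TemporaryDirectory() as tmp:
    source = DirSourceSketch(Path(tmp))
    root = source.materialise(dst_dir=Path("ignored"))
    same_dir = root == Path(tmp)
```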
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialized dataset root directory. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |

Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
LocalZipSource
dataclass
Bases: DatasetSource
Use a local zip archive as the dataset source.
materialise(*, dst_dir)
Extract the zip file into dst_dir and return the dataset root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dst_dir | Path | Destination directory for extracted dataset files. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Extracted dataset root directory. |

Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the configured zip archive does not exist. |
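Extraction of a local archive can be sketched with the standard-library `zipfile` module. The helper name `extract_local_zip` is hypothetical, and the sketch assumes the dataset root is the extraction directory itself; the real source may instead return a subdirectory of dst_dir.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_local_zip(zip_path: Path, dst_dir: Path) -> Path:
    """Hypothetical helper: extract zip_path into dst_dir and return dst_dir."""
    if not zip_path.exists():
        raise FileNotFoundError(f"zip archive not found: {zip_path}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dst_dir)
    # Assumption: the extraction directory doubles as the dataset root.
    return dst_dir


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    archive_path = work / "demo.zip"
    # Build a tiny archive so the example runs without external files.
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("demo.log", "first line\n")
    dataset_root = extract_local_zip(archive_path, work / "extracted")
    extracted_ok = (dataset_root / "demo.log").is_file()
```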
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialized dataset root directory. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |

Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |
RemoteZipSource
dataclass
Bases: DatasetSource
Download a dataset zip from a remote URL and extract it locally.
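A download-then-extract flow of this kind can be sketched with `urllib.request` and `zipfile` from the standard library. This is an assumption-laden illustration, not the library's implementation: `fetch_and_extract` is a hypothetical helper, caching and integrity checks are omitted, and the demo uses a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url: str, dst_dir: Path) -> Path:
    """Hypothetical helper: download a zip from url and extract it into dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryFile() as buffer:
        # Stream the response to a temporary file instead of holding it in memory.
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, buffer)
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            archive.extractall(dst_dir)
    # Assumption: the extraction directory doubles as the dataset root.
    return dst_dir


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    archive_path = work / "demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("demo.log", "first line\n")
    # A file:// URL stands in for a remote benchmark download.
    dataset_root = fetch_and_extract(archive_path.as_uri(), work / "extracted")
    downloaded_ok = (dataset_root / "demo.log").is_file()
```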
materialise(*, dst_dir)
raw_logs_path(*, dataset_name, dataset_root)
Return the validated raw log path inside dataset_root.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Dataset name used for the default log filename. | required |
| dataset_root | Path | Materialized dataset root directory. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| Path | Path | Validated path to the raw log file inside the dataset root. |

Raises:
| Type | Description |
|---|---|
| ValueError | If |
| FileNotFoundError | If the resolved log path does not exist. |
| IsADirectoryError | If the resolved path is not a file. |