policyML.raw_dataset_preprocessing

module policyML.raw_dataset_preprocessing

Raw data preparations for the policyML project.

Phase 1: Splitting the dataset into historic and new datasets (used for training).
Phase 2: Splitting the new dataset into monthly datasets (used for inference).

Functions

split_dataset_into_historic_and_new — Splits a training dataset into two disjoint datasets: historic and new.
monthly_split_new_dataset — Splits the new dataset into monthly datasets for inference.

split_dataset_into_historic_and_new(data_dir: Path = Path('../data/raw')) → None

Splits a training dataset into two disjoint datasets: historic and new.

The function loads a CSV file named 'trainset.csv' from the specified directory, performs a 50/50 random split, verifies disjointness, and saves the resulting subsets as 'historic_dataset.csv' and 'new_dataset.csv'.

Parameters

data_dir : Path — Directory where 'trainset.csv' is located and where the output files will be saved.

Raises

ValueError
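
For illustration, the load, split, verify, and save steps of split_dataset_into_historic_and_new can be sketched with the standard library alone. This is a minimal sketch, not the module's actual code: the function name split_rows_50_50 and the seed parameter are hypothetical, the library used by the real implementation is not stated here, and raising ValueError when the two subsets overlap is an assumption about what the Raises entry refers to.

```python
import csv
import random
from pathlib import Path


def split_rows_50_50(data_dir: Path, seed: int = 0) -> None:
    """Illustrative 50/50 split of trainset.csv (hypothetical helper)."""
    with open(data_dir / "trainset.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Shuffle row indices so the split is random but reproducible.
    # The seed is an assumption; the documented function takes no seed.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    historic_idx = set(indices[:half])
    new_idx = set(indices[half:])

    # Verify disjointness by row index. Raising ValueError here is an
    # assumption about the documented "Raises" entry.
    if historic_idx & new_idx:
        raise ValueError("historic and new subsets overlap")

    outputs = (
        ("historic_dataset.csv", historic_idx),
        ("new_dataset.csv", new_idx),
    )
    for name, idx in outputs:
        with open(data_dir / name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            # Write rows in their original file order.
            writer.writerows(rows[i] for i in sorted(idx))
```

With an odd number of rows, this sketch puts the extra row in the new dataset. Comparing row indices rather than row contents means duplicate rows in 'trainset.csv' do not trigger a false overlap.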

monthly_split_new_dataset(data_dir: Path = Path('../data/raw')) → None

Splits the new dataset into monthly datasets for inference.

This function is a placeholder and is not yet implemented. It is intended to take 'new_dataset.csv' and split it into one dataset per month.
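
Since the monthly split is not implemented, the following is only one possible shape for it, under several stated assumptions: that 'new_dataset.csv' has a date column (here hypothetically named 'date') holding ISO dates such as 2024-03-17, and that each month is written to a file named 'new_dataset_YYYY-MM.csv'. The function name split_new_dataset_by_month, the date_column parameter, and the returned list of paths are all illustrative and not part of the documented API.

```python
import csv
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def split_new_dataset_by_month(data_dir: Path, date_column: str = "date") -> list[Path]:
    """Hypothetical monthly split; column name and file naming are assumptions."""
    groups = defaultdict(list)
    with open(data_dir / "new_dataset.csv", newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            # Assumes ISO-formatted dates (YYYY-MM-DD) in the date column.
            parsed = datetime.strptime(row[date_column], "%Y-%m-%d")
            groups[parsed.strftime("%Y-%m")].append(row)

    written = []
    for month in sorted(groups):
        out_path = data_dir / f"new_dataset_{month}.csv"
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(groups[month])
        written.append(out_path)
    return written
```

Grouping on a year-month key keeps the same calendar month from different years in separate files.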