policyML.raw_dataset_preprocessing
module policyML.raw_dataset_preprocessing
Raw data preparations for the policyML project.
Phase 1: Split the dataset into historic and new datasets (used for training).
Phase 2: Split the new dataset into monthly datasets (used for inference).
Functions
- split_dataset_into_historic_and_new: Splits a training dataset into two disjoint datasets: historic and new.
- monthly_split_new_dataset: Splits the new dataset into monthly datasets for inference.
split_dataset_into_historic_and_new(data_dir: Path = Path('../data/raw')) → None
Splits a training dataset into two disjoint datasets: historic and new.
The function loads a CSV file named 'trainset.csv' from the specified directory, performs a 50/50 random split, verifies disjointness, and saves the resulting subsets as 'historic_dataset.csv' and 'new_dataset.csv'.
Parameters
- data_dir : Path. Directory where 'trainset.csv' is located and where the output files will be saved.
Raises
- ValueError
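The split described above can be sketched as follows. This is a minimal illustration using only the standard library, not the module's actual implementation: the helper names `split_rows_50_50` and `split_csv`, the `seed` parameter, and the choice to raise ValueError when the subsets overlap are all assumptions.

```python
import csv
import random
from pathlib import Path


def split_rows_50_50(rows, seed=0):
    """Randomly split rows into two disjoint halves (hypothetical helper)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    historic_idx = set(indices[:half])
    new_idx = set(indices[half:])
    # Sanity check on row positions. Raising ValueError here is an assumption
    # about how the real function reports a failed disjointness check.
    if historic_idx & new_idx:
        raise ValueError("historic and new subsets overlap")
    historic = [rows[i] for i in sorted(historic_idx)]
    new = [rows[i] for i in sorted(new_idx)]
    return historic, new


def split_csv(data_dir: Path, seed: int = 0) -> None:
    """Read 'trainset.csv' from data_dir and write the two subsets next to it."""
    with (data_dir / "trainset.csv").open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    historic, new = split_rows_50_50(rows, seed)
    outputs = (("historic_dataset.csv", historic), ("new_dataset.csv", new))
    for name, subset in outputs:
        with (data_dir / name).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
```

Splitting on shuffled row positions rather than on row contents keeps the two halves disjoint even when 'trainset.csv' contains duplicate rows.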
monthly_split_new_dataset(data_dir: Path = Path('../data/raw')) → None
Splits the new dataset into monthly datasets for inference.
This function is a placeholder and is not yet implemented. It is intended to read 'new_dataset.csv' and split it into monthly datasets.
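Since the function is not implemented, the following is only one possible shape for it, under stated assumptions: the dataset has a column named `date` holding ISO-formatted dates (`YYYY-MM-DD`), and each month is written to a file named `new_dataset_<YYYY-MM>.csv`. The column name, the file-name pattern, and both helper names are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_rows_by_month(rows, date_column="date"):
    """Group rows by the 'YYYY-MM' prefix of an ISO date column (assumed layout)."""
    by_month = defaultdict(list)
    for row in rows:
        month = row[date_column][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        by_month[month].append(row)
    return dict(by_month)


def write_monthly_files(data_dir: Path, date_column: str = "date") -> None:
    """Read 'new_dataset.csv' and write one CSV per month (hypothetical naming)."""
    with (data_dir / "new_dataset.csv").open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for month, subset in group_rows_by_month(rows, date_column).items():
        out_path = data_dir / f"new_dataset_{month}.csv"
        with out_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
```

If the real dataset stores dates in another format, or has no date column at all, the grouping key would need to be derived differently.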