get_eeg_split_table_kfold
- selfeeg.dataloading.load.get_eeg_split_table_kfold(partition_table: pd.DataFrame, kfold: int = 10, test_ratio: float = 0.2, test_split_mode: str or int = 2, val_split_mode: str or int = 2, exclude_data_id: list or dict = None, test_data_id: list or dict = None, stratified: bool = False, labels: array like = None, dataset_id_extractor: function = None, subject_id_extractor: function = None, split_tolerance=0.01, perseverance=1000, save: bool = False, save_path: str = None, seed: int = None) -> pd.DataFrame
Creates a table with multiple splits for cross-validation.
Test split, if calculated, is kept equal in every CV split. Split is done in the following way:
dataset is split in Train and Test sets
train set is split in Train and Validation sets
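As a rough illustration of how the two steps combine, the sketch below computes the expected train/validation/test proportions of a single fold. It assumes that each fold holds out 1/kfold of the non-test data for validation, as in standard k-fold cross-validation; `expected_ratios` is a hypothetical helper, not part of selfeeg.

```python
def expected_ratios(test_ratio=0.2, kfold=10):
    """Return the (train, validation, test) proportions of one CV fold."""
    remaining = 1.0 - test_ratio   # data left after the train/test split
    val = remaining / kfold        # assumption: one fold out of kfold is validation
    train = remaining - val
    return train, val, test_ratio

train, val, test = expected_ratios()
print(round(train, 2), round(val, 2), round(test, 2))  # 0.72 0.08 0.2
```

With the default arguments this gives the 0.72/0.08/0.2 proportions reported in the example at the end of this page.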
Test split is optional and can be done with the same modalities described in the
get_eeg_split_table function, i.e. by giving specific ID or by giving a split ratio. CV’s train/validation split cannot be done in this way, since this does not guarantee the preservation of the split ratio, which is the core of cross validation.
- Parameters:
partition_table (pd.DataFrame) –
A two-column dataframe where:
the first column is named ‘file_name’ and contains all the file names
the second column is named ‘N_samples’ and contains the number of samples which can be extracted from the file
This table can be automatically created with a custom setting with the provided function
get_eeg_partition_number().
kfold (int, optional) –
The number of folds to extract. Must be a number greater than or equal to 2.
Default = 10
test_ratio (float, optional) –
The fraction of data, with respect to the total number of samples (partitions) in the dataset, to include in the test set. Must be a number in [0,1]. If set to 0, the test split is skipped unless test_data_id is given.
Default = 0.2
test_split_mode (int or str, optional) –
The type of split to perform in the train/test split step. It can be one of the following:
any of [0, ‘d’, ‘set’, ‘dataset’]: split will be performed using dataset IDs, i.e. all files of the same dataset will be put in the same split set
any of [1, ‘s’, ‘subj’, ‘subject’]: split will be performed using subject IDs, i.e. all files of the same subject will be put in the same split set
any of [2, ‘file’, ‘record’]: split will be performed looking at single files
Default = 2
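The accepted values can be read as aliases of three grouping levels. The sketch below is only an illustration of that reading: `normalize_split_mode` and `group_key` are hypothetical helpers, and pairing the subject ID with its dataset ID is an assumption made here to keep subjects of different datasets distinct.

```python
# Aliases listed in the parameter description, mapped to their integer mode.
SPLIT_MODE_ALIASES = {
    0: 0, "d": 0, "set": 0, "dataset": 0,
    1: 1, "s": 1, "subj": 1, "subject": 1,
    2: 2, "file": 2, "record": 2,
}

def normalize_split_mode(mode):
    """Map any accepted alias to 0 (dataset), 1 (subject) or 2 (file)."""
    if mode not in SPLIT_MODE_ALIASES:
        raise ValueError(f"unknown split mode: {mode!r}")
    return SPLIT_MODE_ALIASES[mode]

def group_key(file_name, dataset_id, subject_id, mode):
    """Return the key that files must share to land in the same split set."""
    mode = normalize_split_mode(mode)
    if mode == 0:
        return (dataset_id,)
    if mode == 1:
        # Assumption: subjects are identified together with their dataset.
        return (dataset_id, subject_id)
    return (file_name,)

print(group_key("1_2_1_1.pickle", 1, 2, "subj"))  # (1, 2)
```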
val_split_mode (int or str, optional) –
The type of split to perform in the train/validation split step. Allowed inputs are the same as in test_split_mode.
Default = 2
exclude_data_id (list or dict, optional) –
Dataset ID to be excluded. It can be given in the following formats:
a list with all dataset IDs to exclude
a dictionary where keys are the dataset IDs and values are their related subject IDs. If a key has an empty value, then all the files with that dataset ID will be included
Note that, to work, the function must be able to identify the dataset or subject IDs from the file name in order to check whether they are in the given list or dict. Custom extraction functions can be given as arguments; however, if nothing is given, the function will try to extract IDs assuming that file names are in the format a_b_c_d.extension (the output of the BIDSalign library), where “a” is an integer with the dataset ID and “b” an integer with the subject ID. If this fails, all files will be considered as coming from the same dataset (id=0), and each file as coming from a different subject (id from 0 to N-1).
Also note that if the input argument is not a list or a dict, it will be automatically converted to a list. No checks about what is converted to a list will be performed.
Default = None
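The two accepted formats can be illustrated with a small matching routine. This is a sketch, not selfeeg's code: `default_ids` and `is_selected` are hypothetical helpers, file names are assumed to follow the a_b_c_d.extension convention, and an empty dictionary value is read as "every subject of that dataset".

```python
def default_ids(file_name):
    """Read dataset and subject IDs from an 'a_b_c_d.extension' file name."""
    parts = file_name.split("_")
    return int(parts[0]), int(parts[1])

def is_selected(file_name, data_id):
    """Tell whether a file matches a list of dataset IDs or a dataset->subjects dict."""
    dataset_id, subject_id = default_ids(file_name)
    if isinstance(data_id, dict):
        if dataset_id not in data_id:
            return False
        subjects = data_id[dataset_id]
        # Assumption: an empty value selects every subject of that dataset.
        return not subjects or subject_id in subjects
    return dataset_id in data_id

files = ["1_1_1_1.pickle", "1_2_1_1.pickle", "2_1_1_1.pickle", "3_4_1_1.pickle"]
print([f for f in files if is_selected(f, [2, 3])])
# ['2_1_1_1.pickle', '3_4_1_1.pickle']
print([f for f in files if is_selected(f, {1: [2], 3: []})])
# ['1_2_1_1.pickle', '3_4_1_1.pickle']
```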
test_data_id (list or dict, optional) –
Same as exclude_data_id but for the test split.
Default = None
stratified (bool, optional) –
Whether to apply stratification to the split or not. Might be used for the fine-tuning split (the typical phase where labels are involved). Stratification will preserve, if possible, the label ratio in the training, validation, and test sets. Works only when each file has a unique label, which must be given as input.
Default = False
labels (list or ArrayLike, optional) –
A list or 1d ArrayLike object with the label of each file listed in the partition table. Must be given if stratified is set to True. Indices of labels must match row indices in the partition table, i.e. label1 -> row1, label2 -> row2, etc.
Default = None
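Because labels are matched to rows by position, it is safer to build them from the file names in the table's own order. A minimal sketch with made-up file names and labels; `file_names` stands for the ‘file_name’ column of the partition table and `label_by_file` is a hypothetical lookup.

```python
# File names in the same order as the rows of the partition table.
file_names = ["1_1_1_1.pickle", "1_2_1_1.pickle", "1_3_1_1.pickle"]

# Hypothetical lookup, e.g. loaded from a metadata file, in arbitrary order.
label_by_file = {"1_3_1_1.pickle": 1, "1_1_1_1.pickle": 0, "1_2_1_1.pickle": 1}

# One label per row, aligned by position: labels[i] belongs to file_names[i].
labels = [label_by_file[name] for name in file_names]
print(labels)  # [0, 1, 1]
```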
dataset_id_extractor (function, optional) –
A custom function to be used to extract the dataset ID from the file name. It must accept only one argument, which is the file name (not the full path, only the file name).
Default = None
subject_id_extractor (function, optional) –
A custom function to be used to extract the subject ID from the file name. It must accept only one argument, which is the file name (not the full path, only the file name).
Default = None
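When file names do not follow the a_b_c_d.extension convention, both extractors can be supplied. The sketch below assumes a made-up naming scheme such as `ds003_sub-012_rest.edf`; returning integers is also an assumption, since the accepted return type is not stated here.

```python
import re

def dataset_id_extractor(file_name):
    """Extract the dataset ID from names like 'ds003_sub-012_rest.edf'."""
    return int(re.search(r"ds(\d+)", file_name).group(1))

def subject_id_extractor(file_name):
    """Extract the subject ID from names like 'ds003_sub-012_rest.edf'."""
    return int(re.search(r"sub-(\d+)", file_name).group(1))

print(dataset_id_extractor("ds003_sub-012_rest.edf"))  # 3
print(subject_id_extractor("ds003_sub-012_rest.edf"))  # 12
```

Both functions take only the file name, so they can be passed through the `dataset_id_extractor` and `subject_id_extractor` keyword arguments.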
split_tolerance (float, optional) –
Argument for the
get_subarray_closest_sum function. Sets the maximum accepted difference between the given split ratio and the one obtained with the selected subset. Must be a number in [0,1].
Default = 0.01
perseverance (int, optional) –
Argument for the
get_subarray_closest_sum function. Sets the maximum number of tries before the search stops for a split whose ratio is in the range [target_ratio - tolerance, target_ratio + tolerance].
Default = 1000
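To make the roles of split_tolerance and perseverance concrete, here is a simple randomized search with the same two knobs. It is not the implementation of get_subarray_closest_sum, whose algorithm is not described here; `find_subset` is a hypothetical stand-in that only shows how a tolerance and a maximum number of tries interact.

```python
import random

def find_subset(sizes, target_ratio, tolerance=0.01, perseverance=1000, seed=None):
    """Randomized search for items whose total is close to target_ratio of the sum."""
    rng = random.Random(seed)
    total = sum(sizes)
    best_idx, best_err = [], float("inf")
    for _ in range(perseverance):
        order = list(range(len(sizes)))
        rng.shuffle(order)
        picked, running = [], 0
        for i in order:
            # Add an item only if it does not overshoot the accepted range.
            if (running + sizes[i]) / total <= target_ratio + tolerance:
                picked.append(i)
                running += sizes[i]
        err = abs(running / total - target_ratio)
        if err < best_err:
            best_idx, best_err = sorted(picked), err
        if best_err <= tolerance:
            break  # good enough: stop before using all the tries
    return best_idx, best_err

sizes = [120, 80, 100, 60, 40, 100]  # samples per file (made-up numbers)
idx, err = find_subset(sizes, target_ratio=0.2, seed=0)
print(sum(sizes[i] for i in idx) / sum(sizes), err <= 0.01)
```

A tighter tolerance or coarser groups (for example, few large subjects) make an acceptable subset harder to find, which is when a higher number of tries matters.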
save (bool, optional) –
Whether to save the resulting DataFrame as a .csv file or not.
Default = False
save_path (str, optional) –
A custom path to be used instead of the current working directory. It is the string given to the
pandas.DataFrame.to_csv() method.
Default = None
seed (int, optional) –
An integer defining the seed to use. Set it to reproduce split results.
Default = None
- Returns:
EEGSplitKfold (pd.DataFrame) – Pandas DataFrame where the first column has the EEG file names, while the other columns have the assignment of each file in each CV split. Each split is stored in a column named “split_k”, with k from 1 to the given kfold argument. Each split assigns one of the following labels to a file:
-1 : the file is excluded
0 : the file is included in the training set
1 : the file is included in the validation set
2 : the file is included in the test set
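The integer codes can be turned back into named groups with a small lookup. The sketch below works on plain lists standing in for the file-name column and one “split_k” column; `group_by_split` is a hypothetical helper and the values are made up.

```python
SPLIT_NAMES = {-1: "excluded", 0: "train", 1: "validation", 2: "test"}

def group_by_split(file_names, codes):
    """Group file names by the split code assigned in one 'split_k' column."""
    groups = {name: [] for name in SPLIT_NAMES.values()}
    for file_name, code in zip(file_names, codes):
        groups[SPLIT_NAMES[code]].append(file_name)
    return groups

file_names = ["1_1_1_1.pickle", "1_2_1_1.pickle", "1_3_1_1.pickle", "2_1_1_1.pickle"]
split_1 = [0, 1, 2, -1]  # made-up content of the "split_1" column
print(group_by_split(file_names, split_1)["validation"])  # ['1_2_1_1.pickle']
```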
See also
get_split: extract a specific split from the output dataframe.
Warning
Some configurations may produce strange results. For example, if you want to do a 10 fold CV with a subject based split, but your dataset has only 5 subjects, the function will not throw an error, but some splits won’t have a validation split.
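Since the function does not raise an error in this situation, a quick check before calling it can catch the problem. A minimal sketch, assuming the subject (or dataset) ID of every file is already available in a list; `check_enough_groups` is a hypothetical helper, not part of selfeeg.

```python
import warnings

def check_enough_groups(group_ids, kfold):
    """Warn when there are fewer distinct groups than folds."""
    n_groups = len(set(group_ids))
    if n_groups < kfold:
        warnings.warn(
            f"only {n_groups} distinct groups for {kfold} folds: "
            "some splits may end up without a validation set"
        )
        return False
    return True

subject_ids = [1, 1, 2, 3, 3, 4, 5]  # made-up subject ID of each file
print(check_enough_groups(subject_ids, kfold=10))  # False
```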
Example
>>> import pickle
>>> import pandas as pd
>>> import selfeeg.dataloading as dl
>>> from selfeeg import utils
>>> labels = utils.create_dataset()
>>> def loadEEG(path):
...     with open(path, 'rb') as handle:
...         EEG = pickle.load(handle)
...     x = EEG['data']
...     return x
>>> EEGlen = dl.get_eeg_partition_number('Simulated_EEG', freq=128, window=2,
...                                      overlap=0.3, load_function=loadEEG)
>>> EEGsplit = dl.get_eeg_split_table_kfold(EEGlen, seed=1234)
>>> dl.check_split(EEGlen, dl.get_split(EEGsplit, 1))  # will return 0.72/0.08/0.2