get_eeg_split_table_kfold

selfeeg.dataloading.load.get_eeg_split_table_kfold(partition_table: pd.DataFrame, kfold: int = 10, test_ratio: float = 0.2, test_split_mode: str or int = 2, val_split_mode: str or int = 2, exclude_data_id: list or dict = None, test_data_id: list or dict = None, stratified: bool = False, labels: array like = None, dataset_id_extractor: function = None, subject_id_extractor: function = None, split_tolerance=0.01, perseverance=1000, save: bool = False, save_path: str = None, seed: int = None) → pd.DataFrame

Creates a table with multiple splits for cross-validation (CV).

The test split, if computed, is identical in every CV split. The split is done in two steps:

  1. the dataset is split into train and test sets

  2. the train set is split into train and validation sets

The test split is optional and can be defined in the same ways described for the get_eeg_split_table function, i.e. by giving specific IDs or by giving a split ratio. The CV train/validation split cannot be defined in this way, since this would not guarantee that the split ratio is preserved, which is the core of cross-validation.
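
The two steps above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it splits by file count instead of balancing the number of samples per file, and the helper name kfold_split_codes is hypothetical. The integer codes follow the ones listed under Returns (0 = train, 1 = validation, 2 = test).

```python
import random


def kfold_split_codes(n_files, kfold=5, test_ratio=0.2, seed=0):
    """Return one list of split codes per fold (0=train, 1=validation, 2=test).

    Hypothetical helper: files are treated as equally sized, which the
    real function does not assume.
    """
    rng = random.Random(seed)
    indices = list(range(n_files))
    rng.shuffle(indices)

    # Step 1: carve out the test set once; it is reused in every fold.
    n_test = round(n_files * test_ratio)
    test_idx = set(indices[:n_test])
    remaining = indices[n_test:]

    # Step 2: rotate the validation fold over the remaining files.
    folds = [remaining[k::kfold] for k in range(kfold)]
    table = []
    for k in range(kfold):
        val_idx = set(folds[k])
        codes = [
            2 if i in test_idx else 1 if i in val_idx else 0
            for i in range(n_files)
        ]
        table.append(codes)
    return table


splits = kfold_split_codes(n_files=50, kfold=5, test_ratio=0.2, seed=1234)
```

Every list in splits marks the same files with code 2, while each remaining file is marked as validation in exactly one fold.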

Parameters:
  • partition_table (pd.DataFrame) –

    A two-column DataFrame where:

    1. the first column is named ‘file_name’ and contains all the file names

    2. the second column is named ‘N_samples’ and contains the number of samples that can be extracted from each file

    This table can be created automatically, with custom settings, by the provided function get_eeg_partition_number().

  • kfold (int, optional) –

    The number of folds to extract. Must be greater than or equal to 2.

    Default = 10

  • test_ratio (float, optional) –

    The fraction of data, with respect to the total number of samples (partitions) in the dataset, to be included in the test set. Must be a number in [0,1]. 0 means that the test split is skipped unless test_data_id is given.

    Default = 0.2

  • test_split_mode (int or str, optional) –

    The type of split to perform in the train/test split step. It can be one of the following:

    1. any of [0, ‘d’, ‘set’, ‘dataset’]: the split will be performed using dataset IDs, i.e. all files of the same dataset will be put in the same split set

    2. any of [1, ‘s’, ‘subj’, ‘subject’]: the split will be performed using subject IDs, i.e. all files of the same subject will be put in the same split set

    3. any of [2, ‘file’, ‘record’]: the split will be performed at the level of single files

    Default = 2

  • val_split_mode (int or str, optional) –

    The type of split to perform in the train/validation split step. Allowed inputs are the same as for test_split_mode.

    Default = 2

  • exclude_data_id (list or dict, optional) –

    IDs of the data to be excluded. It can be given in the following formats:

    1. a list with all dataset IDs to exclude

    2. a dictionary where keys are the dataset IDs and values are the related subject IDs. If a key has an empty value, then all the files with that dataset ID will be selected

    Note that, to work, the function must be able to identify the dataset or subject IDs from the file name in order to check whether they are in the given list or dict. Custom extraction functions can be given as arguments; however, if nothing is given, the function will try to extract IDs assuming that file names are in the format a_b_c_d.extension (the output of the BIDSalign library), where “a” is an integer with the dataset ID and “b” an integer with the subject ID. If this fails, all files will be considered to belong to the same dataset (id=0), and each file to a different subject (id from 0 to N-1).

    Also note that if the input argument is not a list or a dict, it will be automatically converted to a list. No checks are performed on the result of this conversion.

    Default = None

  • test_data_id (list or dict, optional) –

    Same as exclude_data_id but for the test split.

    Default = None

  • stratified (bool, optional) –

    Whether to apply stratification to the split. It might be used for a fine-tuning split (the typical phase where labels are involved). Stratification will preserve, if possible, the label ratio in the training, validation, and test sets. It works only when each file has a unique label, which must be given as input.

    Default = False

  • labels (list or ArrayLike, optional) –

    A list or 1D ArrayLike object with the label of each file listed in the partition table. Must be given if stratified is set to True. Indices of labels must match row indices in the partition table, i.e. label1 -> row1, label2 -> row2, etc.

    Default = None

  • dataset_id_extractor (function, optional) –

    A custom function to be used to extract the dataset ID from the file name. It must accept only one argument, which is the file name (not the full path, only the file name).

    Default = None

  • subject_id_extractor (function, optional) –

    A custom function to be used to extract the subject ID from the file name. It must accept only one argument, which is the file name (not the full path, only the file name).

    Default = None

  • split_tolerance (float, optional) –

    Argument for the get_subarray_closest_sum function. Sets the maximum accepted difference between the requested split ratio and the ratio of the obtained subset. Must be a number in [0,1].

    Default = 0.01

  • perseverance (int, optional) –

    Argument for the get_subarray_closest_sum function. Sets the maximum number of attempts made before the function stops searching for a split whose ratio is in the range [target_ratio - tolerance, target_ratio + tolerance].

    Default = 1000

  • save (bool, optional) –

    Whether to save the resulting DataFrame as a .csv file.

    Default = False

  • save_path (str, optional) –

    A custom path to be used instead of the current working directory. It is the string given to the pandas.DataFrame.to_csv() method.

    Default = None

  • seed (int, optional) –

    An integer defining the seed to use. Set it to reproduce split results.

    Default = None
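
The dataset_id_extractor and subject_id_extractor arguments described above accept any function that takes a file name and returns an ID. The sketch below shows what such functions could look like. The first two mimic the a_b_c_d.extension convention mentioned for exclude_data_id; the third assumes a made-up "sub-<number>" naming scheme. All three function names and the sample file names are hypothetical.

```python
import re


def dataset_id_from_name(file_name):
    """Return the dataset ID: the first underscore-separated field, as an int."""
    return int(file_name.split("_")[0])


def subject_id_from_name(file_name):
    """Return the subject ID: the second underscore-separated field, as an int."""
    return int(file_name.split("_")[1])


def subject_id_from_sub_tag(file_name):
    """Return <number> from a hypothetical 'sub-<number>' tag in the file name."""
    match = re.search(r"sub-(\d+)", file_name)
    if match is None:
        raise ValueError(f"no subject ID found in {file_name!r}")
    return int(match.group(1))


# Made-up file names, used only to exercise the extractors.
dataset_id = dataset_id_from_name("3_12_1_1.pickle")
subject_id = subject_id_from_name("3_12_1_1.pickle")
tagged_subject_id = subject_id_from_sub_tag("sub-07_task-rest_eeg.edf")
```

Functions like these would then be passed as dataset_id_extractor=dataset_id_from_name and subject_id_extractor=subject_id_from_name. Each takes only the file name, not the full path.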

Returns:

EEGSplitKfold (pd.DataFrame) – A pandas DataFrame whose first column has the EEG file names, while the other columns have the set assigned to each file in each CV split. Each split is stored in a column named “split_k”, with k from 1 to the given kfold argument. Each split assigns one of the following labels to a file:

  1. -1 : the file is excluded

  2. 0 : the file is included in the training set

  3. 1 : the file is included in the validation set

  4. 2 : the file is included in the test set
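
The codes above can be turned into readable groups with a small helper. In this sketch plain lists stand in for the file-name column and one “split_k” column of the returned DataFrame; the helper name, file names and code values are made up for illustration.

```python
# Mapping from the integer codes listed above to readable set names.
SPLIT_NAMES = {-1: "excluded", 0: "train", 1: "validation", 2: "test"}


def summarize_split(file_names, split_codes):
    """Group file names by the set their split code assigns them to."""
    groups = {name: [] for name in SPLIT_NAMES.values()}
    for file_name, code in zip(file_names, split_codes):
        groups[SPLIT_NAMES[code]].append(file_name)
    return groups


# Made-up values standing in for the file-name column and the 'split_1' column.
file_names = [
    "1_1_1_1.pickle",
    "1_2_1_1.pickle",
    "1_3_1_1.pickle",
    "2_1_1_1.pickle",
    "2_2_1_1.pickle",
]
split_1 = [0, 0, 1, 2, -1]

groups = summarize_split(file_names, split_1)
```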

See also

get_split

extract a specific split from the output dataframe.

Warning

Some configurations may produce unexpected results. For example, if you run a 10-fold CV with a subject-based split on a dataset with only 5 subjects, the function will not raise an error, but some splits will have no validation set.
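
The situation in this warning can be reproduced with a toy example. The round-robin assignment below is only an illustration of why folds end up empty, not the algorithm used by the library, and both helper names are hypothetical. A check such as has_enough_groups could be run before requesting a subject-based or dataset-based split.

```python
def assign_groups_to_folds(group_ids, kfold):
    """Distribute unique group IDs (e.g. subjects) over kfold folds, round-robin."""
    unique_groups = sorted(set(group_ids))
    folds = [[] for _ in range(kfold)]
    for position, group in enumerate(unique_groups):
        folds[position % kfold].append(group)
    return folds


def has_enough_groups(group_ids, kfold):
    """Return True if every fold can receive at least one group."""
    return len(set(group_ids)) >= kfold


# Eight made-up files belonging to only five subjects.
subjects = [1, 1, 2, 2, 3, 4, 4, 5]

folds = assign_groups_to_folds(subjects, kfold=10)
# Fold numbers (1-based, like the split_k columns) left without any subject.
empty_folds = [k + 1 for k, fold in enumerate(folds) if not fold]
```

With five subjects and ten folds, half of the folds receive no subject, so the corresponding splits would have no validation set.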

Example

>>> import pickle
>>> import pandas as pd
>>> import selfeeg.dataloading as dl
>>> from selfeeg import utils
>>> labels = utils.create_dataset()
>>> def loadEEG(path):
...     with open(path, 'rb') as handle:
...         EEG = pickle.load(handle)
...     x = EEG['data']
...     return x
>>> EEGlen = dl.get_eeg_partition_number('Simulated_EEG', freq=128, window=2,
...                                      overlap=0.3, load_function=loadEEG)
>>> EEGsplit = dl.get_eeg_split_table_kfold(EEGlen, seed=1234)
>>> dl.check_split(EEGlen, dl.get_split(EEGsplit, 1))  # will return 0.72/0.08/0.2
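
The 0.72/0.08/0.2 ratios reported in the example follow from the default arguments (kfold=10, test_ratio=0.2). The arithmetic can be sketched as follows, assuming the validation set of each fold takes 1/kfold of the non-test data; the helper name is hypothetical, and the ratios actually obtained may deviate slightly from these targets, within split_tolerance.

```python
def expected_ratios(kfold=10, test_ratio=0.2):
    """Return the target (train, validation, test) ratios for one CV split."""
    remaining = 1.0 - test_ratio   # share of data left after the test split
    val = remaining / kfold        # one fold of the remaining data
    train = remaining - val        # the other kfold - 1 folds
    return train, val, test_ratio


train, val, test = expected_ratios()
print(round(train, 2), round(val, 2), round(test, 2))  # 0.72 0.08 0.2
```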