get_eeg_split_table

selfeeg.dataloading.load.get_eeg_split_table(partition_table: pd.DataFrame, test_ratio: float = 0.2, val_ratio: float = 0.2, test_split_mode: str or int = 2, val_split_mode: str or int = 2, exclude_data_id: list or dict = None, test_data_id: list or dict = None, val_data_id: list or dict = None, val_ratio_on_all_data: bool = True, stratified: bool = False, labels: ArrayLike = None, dataset_id_extractor: function = None, subject_id_extractor: function = None, split_tolerance=0.01, perseverance=1000, save: bool = False, save_path: str = None, seed: int = None) → pd.DataFrame

Creates a split table defining which files to use as the train, validation and test sets.

The split is done in two steps:

  1. The dataset is split into train and test sets.

  2. The train set is split into train and validation sets.

If specific IDs are given, the split uses them and ignores the corresponding split ratio; otherwise the split is random and follows the given ratio. Note that the test or validation set can be empty, for example when you want to split the dataset into only two subsets. For further details on how to use this function, see the introductory notebook provided in the documentation.

Parameters:
  • partition_table (pd.DataFrame) –

    A two-column DataFrame where:

    1. the first column is named ‘file_name’ and contains all the file names

    2. the second column is named ‘N_samples’ and contains the number of samples that can be extracted from each file

    This table can be created automatically, with custom settings, using the provided function get_eeg_partition_number().

  • test_ratio (float, optional) –

    The fraction of data, relative to the total number of samples (partitions) in the dataset, to include in the test set. Must be a number in [0,1]. If set to 0 and test_data_id is not given, the test split is skipped.

    Default = 0.2

  • val_ratio (float, optional) –

    The fraction of data to include in the validation set, relative either to the total number of samples (partitions) in the dataset or to those remaining after the test split (see the val_ratio_on_all_data argument). Must be a number in [0,1]. If set to 0 and val_data_id is not given, the validation split is skipped.

    Default = 0.2

  • test_split_mode (int or str, optional) –

    The type of split to perform in the train/test split step. It can be one of the following:

    1. any of [0, ‘d’, ‘set’, ‘dataset’]: the split is performed using dataset IDs, i.e. all files of the same dataset are put in the same split set

    2. any of [1, ‘s’, ‘subj’, ‘subject’]: the split is performed using subject IDs, i.e. all files of the same subject are put in the same split set

    3. any of [2, ‘file’, ‘record’]: the split is performed on single files

    Default = 2
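
    To illustrate what the three modes imply, here is a minimal sketch that groups file names by the unit that has to stay together in one split set. The helper group_files is hypothetical and not part of selfeeg. It assumes file names in the a_b_c_d.extension format, and it also assumes that subject IDs are only unique within a dataset, which is why the subject key includes the dataset ID.

```python
from collections import defaultdict


def group_files(file_names, mode):
    """Group file names by the unit that must stay in one split set.

    mode: 0 = dataset, 1 = subject, 2 = single file.
    Assumes names look like '<dataset>_<subject>_<c>_<d>.extension'.
    """
    groups = defaultdict(list)
    for name in file_names:
        dataset_id, subject_id = name.split("_")[:2]
        if mode == 0:
            key = (dataset_id,)
        elif mode == 1:
            # Assumption: a subject is identified by (dataset, subject).
            key = (dataset_id, subject_id)
        else:
            key = (name,)
        groups[key].append(name)
    return dict(groups)


files = ["1_1_1_1.pickle", "1_1_2_1.pickle", "1_2_1_1.pickle", "2_1_1_1.pickle"]
by_dataset = group_files(files, 0)
by_subject = group_files(files, 1)
by_file = group_files(files, 2)
print(len(by_dataset), len(by_subject), len(by_file))  # 2 3 4
```

    Coarser modes leave fewer units to distribute, so the achieved ratios can deviate more from the requested ones.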

  • val_split_mode (int or str, optional) –

    The type of split to perform in the train/validation split step. Allowed inputs are the same as for test_split_mode.

    Default = 2

  • exclude_data_id (list or dict, optional) –

    Dataset IDs to be excluded. They can be given in the following formats:

    1. a list with all dataset IDs to exclude

    2. a dictionary where keys are the dataset IDs and values are the corresponding subject IDs. If a key has an empty value, then all the files with that dataset ID are selected.

    Note that, for this to work, the function must be able to identify the dataset or subject IDs from the file name, so that it can check whether they are in the given list or dict. Custom extraction functions can be given as arguments; if none is given, the function tries to extract the IDs assuming that file names follow the format a_b_c_d.extension (the typical output of the BIDSAlign library), where “a” is an integer dataset ID and “b” is an integer subject ID. If this fails, all files are treated as belonging to the same dataset (id=0) and each file as coming from a different subject (id from 0 to N-1).

    Also note that if the input argument is not a list or a dict, it is automatically converted to a list. No checks are performed on what is converted.

    Default = None
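
    The two accepted formats and the default a_b_c_d.extension naming can be illustrated with a small sketch. The helpers default_ids and is_selected are hypothetical and only approximate the behaviour described above; in particular, the fallback to id=0 on parsing failure is not reproduced here (the sketch simply returns None).

```python
def default_ids(file_name):
    """Return (dataset_id, subject_id) parsed from 'a_b_c_d.extension', or None."""
    parts = file_name.split("_")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def is_selected(file_name, data_id):
    """Check a file against a list of dataset IDs or a {dataset: [subjects]} dict."""
    ids = default_ids(file_name)
    if ids is None:
        return False
    dataset_id, subject_id = ids
    if isinstance(data_id, dict):
        if dataset_id not in data_id:
            return False
        subjects = data_id[dataset_id]
        # An empty value selects every subject of that dataset.
        return not subjects or subject_id in subjects
    return dataset_id in data_id


print(default_ids("3_12_1_1.pickle"))               # (3, 12)
print(is_selected("3_12_1_1.pickle", [3, 5]))       # True
print(is_selected("3_12_1_1.pickle", {3: [1, 2]}))  # False
print(is_selected("3_12_1_1.pickle", {3: []}))      # True
```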

  • test_data_id (list or dict, optional) –

    Same as exclude_data_id but for the test split.

    Default = None

  • val_data_id (list or dict, optional) –

    Same as exclude_data_id but for the validation split.

    Default = None

  • val_ratio_on_all_data (bool, optional) –

    Whether to calculate the validation split size on the training set size only (False) or on the entire “considered” dataset (True), i.e., the size of all files except those listed in exclude_data_id.

    Default = True
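
    As a worked example of the two options, the sketch below computes the target sample counts of the two-step split. The helper split_sizes is hypothetical and uses plain rounding; since the actual split assigns whole files, subjects or datasets, the sizes obtained in practice can only approximate these targets.

```python
def split_sizes(n_samples, test_ratio=0.2, val_ratio=0.2, val_ratio_on_all_data=True):
    """Return target (train, val, test) sample counts for the two-step split."""
    # Step 1: carve the test set out of the whole considered dataset.
    n_test = round(n_samples * test_ratio)
    remaining = n_samples - n_test
    # Step 2: size the validation set on all data or on what is left.
    base = n_samples if val_ratio_on_all_data else remaining
    n_val = round(base * val_ratio)
    return remaining - n_val, n_val, n_test


sizes_all = split_sizes(1000, val_ratio_on_all_data=True)
sizes_train_only = split_sizes(1000, val_ratio_on_all_data=False)
print(sizes_all)         # (600, 200, 200)
print(sizes_train_only)  # (640, 160, 200)
```

    With the defaults, True yields a 60/20/20 split, while False yields 64/16/20.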

  • stratified (bool, optional) –

    Whether to apply stratification to the split or not. It might be used for a fine-tuning split (the typical phase where labels are involved). Stratification preserves, if possible, the label ratio across the training, validation, and test sets. It works only when each file has a unique label, which must be provided through the labels argument.

    Default = False

  • labels (list or ArrayLike, optional) –

    A list or 1d ArrayLike object with the label of each file listed in the partition table. Must be given if stratified is set to True. Indices of labels must match row indices in the partition table, i.e. label1 -> row1, label2 -> row2, etc.

    Default = None

  • dataset_id_extractor (function, optional) –

    A custom function to be used to extract the dataset ID from the file name. It must accept only one argument, which is the file name (not the file path, only the file name).

    Default = None

  • subject_id_extractor (function, optional) –

    A custom function to be used to extract the subject ID from the file name. It must accept only one argument, which is the file name (not the file path, only the file name).

    Default = None
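
    A minimal sketch of a pair of custom extractors follows, for a hypothetical naming scheme such as 'ds003_sub-017_ses-01.pickle'. The function names, the naming scheme, the integer return type and the fallback to 0 are all assumptions made for illustration; the only stated requirement is that each function takes the file name as its single argument.

```python
import re


def dataset_id_from_name(file_name):
    """Hypothetical extractor: 'ds003_sub-017_ses-01.pickle' -> 3."""
    match = re.match(r"ds(\d+)_", file_name)
    # Assumption: fall back to 0 when the pattern is not found.
    return int(match.group(1)) if match else 0


def subject_id_from_name(file_name):
    """Hypothetical extractor: 'ds003_sub-017_ses-01.pickle' -> 17."""
    match = re.search(r"sub-(\d+)", file_name)
    return int(match.group(1)) if match else 0


name = "ds003_sub-017_ses-01.pickle"
print(dataset_id_from_name(name))  # 3
print(subject_id_from_name(name))  # 17

# Possible usage, assuming a partition table called EEGlen:
# dl.get_eeg_split_table(EEGlen, test_split_mode=1,
#                        dataset_id_extractor=dataset_id_from_name,
#                        subject_id_extractor=subject_id_from_name)
```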

  • split_tolerance (float, optional) –

    Argument for the get_subarray_closest_sum function. Sets the maximum accepted difference between the given split ratio and the one obtained with the resulting subset. Must be a number in [0,1].

    Default = 0.01

  • perseverance (int, optional) –

    Argument for the get_subarray_closest_sum function. Sets the maximum number of attempts before the search stops looking for a split whose ratio is in the range [target_ratio - tolerance, target_ratio + tolerance].

    Default = 1000
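
    The interplay between split_tolerance and perseverance can be pictured with a simple randomized search. The sketch below is not the implementation of get_subarray_closest_sum; random_subset_near_ratio is a hypothetical stand-in that shuffles the items, greedily fills the subset without exceeding the target, and stops as soon as the achieved ratio is within the tolerance or the attempts run out. It assumes a non-empty list of positive sizes.

```python
import random


def random_subset_near_ratio(sizes, target_ratio, tolerance=0.01,
                             perseverance=1000, seed=None):
    """Search for indices whose summed size is close to target_ratio * total.

    Returns (indices, achieved_ratio) for the best attempt found.
    """
    rng = random.Random(seed)
    total = sum(sizes)
    target = target_ratio * total
    best, best_ratio = [], 0.0
    for _ in range(perseverance):
        order = list(range(len(sizes)))
        rng.shuffle(order)
        picked, running = [], 0
        for idx in order:
            # Add an item only if it does not push the sum past the target.
            if running + sizes[idx] <= target:
                picked.append(idx)
                running += sizes[idx]
        ratio = running / total
        if abs(ratio - target_ratio) < abs(best_ratio - target_ratio):
            best, best_ratio = sorted(picked), ratio
        # Stop early once the best attempt is within the tolerance.
        if abs(best_ratio - target_ratio) <= tolerance:
            break
    return best, best_ratio


sizes = [120, 80, 200, 150, 50, 100, 300]
indices, ratio = random_subset_near_ratio(sizes, 0.2, seed=1234)
print(indices, ratio)
```

    A tighter tolerance or a coarser split unit generally requires more attempts, which is what a larger perseverance allows.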

  • save (bool, optional) –

    Whether to save the resulting DataFrame as a .csv file or not.

    Default = False

  • save_path (str, optional) –

    A custom path to be used instead of the current working directory. It is the string given to the pandas.DataFrame.to_csv() method.

    Default = None

  • seed (int, optional) –

    An integer defining the seed to use. Set it to reproduce split results.

    Default = None

Returns:

EEGSplit (DataFrame) – A two-column pandas DataFrame. The first column contains the EEG file name, the second defines the split. Each file is assigned one of the following labels:

  1. -1 : the file is excluded

  2. 0 : the file is included in the training set

  3. 1 : the file is included in the validation set

  4. 2 : the file is included in the test set
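
The label codes above can be turned into readable groups with a few lines of plain Python. This is only a sketch: files_by_split and the example rows are hypothetical, and with a real split table the (file name, label) pairs could be obtained with EEGsplit.itertuples(index=False, name=None).

```python
SPLIT_NAMES = {-1: "excluded", 0: "train", 1: "validation", 2: "test"}


def files_by_split(rows):
    """Group (file_name, split_code) pairs by split name."""
    grouped = {name: [] for name in SPLIT_NAMES.values()}
    for file_name, code in rows:
        grouped[SPLIT_NAMES[code]].append(file_name)
    return grouped


# Hypothetical rows standing in for the two columns of the split table.
rows = [
    ("1_1_1_1.pickle", 0),
    ("1_2_1_1.pickle", 2),
    ("2_1_1_1.pickle", 1),
    ("3_1_1_1.pickle", -1),
]
grouped = files_by_split(rows)
print(grouped["test"])      # ['1_2_1_1.pickle']
print(grouped["excluded"])  # ['3_1_1_1.pickle']
```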

Example

>>> import pickle
>>> import pandas as pd
>>> import selfeeg.dataloading as dl
>>> from selfeeg import utils
>>> labels = utils.create_dataset()  # creates the 'Simulated_EEG' folder
>>> def loadEEG(path):
...     # load a pickled EEG dictionary and return only the data array
...     with open(path, 'rb') as handle:
...         EEG = pickle.load(handle)
...     x = EEG['data']
...     return x
>>> EEGlen = dl.get_eeg_partition_number('Simulated_EEG', freq=128, window=2,
...                                      overlap=0.3, load_function=loadEEG)
>>> EEGsplit = dl.get_eeg_split_table(EEGlen, seed=1234)  # default 60/20/20 split
>>> dl.check_split(EEGlen, EEGsplit)  # will return 60/20/20