EEGDataset

class selfeeg.dataloading.load.EEGDataset(EEGlen: pd.DataFrame, EEGsplit: pd.DataFrame, EEGpartition_spec: list, mode: str = 'train', supervised: bool = False, load_function: function = None, transform_function: function = None, label_function: function = None, optional_load_fun_args: list or dict = None, optional_transform_fun_args: list or dict = None, optional_label_fun_args: list or dict = None, multilabel_on_load: bool = False, label_on_load: bool = False, label_key: list = None, default_dtype=torch.float32)

Custom PyTorch Dataset class that manages different loading configurations.

It can be used for both the pretraining and the fine-tuning phases. Its main strength is that it accepts different ways to load, transform, and extract optional labels from the data without preallocating the entire dataset, which is especially useful in SSL experiments, where multiple large datasets are used. For further details on how to use this class, see the introductory notebook provided in the documentation.
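
The idea of serving samples without preallocating the dataset can be illustrated with a small, library-independent sketch. It is not the selfeeg implementation: the locate helper, the file names, and the partition counts are hypothetical. The only assumption taken from this page is that the number of partitions per file is known in advance, as in EEGlen.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical partition counts per file, in the spirit of EEGlen.
files = ["rec_a.pickle", "rec_b.pickle", "rec_c.pickle"]
partitions_per_file = [4, 2, 5]

# Running totals of the counts: [4, 6, 11].
cumulative = list(accumulate(partitions_per_file))


def locate(index):
    """Map a global sample index to (file name, partition index in that file)."""
    if not 0 <= index < cumulative[-1]:
        raise IndexError(index)
    # Position of the first running total strictly greater than index.
    file_pos = bisect_right(cumulative, index)
    first_index_of_file = cumulative[file_pos - 1] if file_pos > 0 else 0
    return files[file_pos], index - first_index_of_file


print(locate(0))   # ('rec_a.pickle', 0)
print(locate(5))   # ('rec_b.pickle', 1)
print(locate(10))  # ('rec_c.pickle', 4)
```

With a mapping like this, only the file that owns the requested index has to be opened, so memory use is bounded by one record at a time instead of the whole dataset.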

Parameters:
  • EEGlen (DataFrame) – DataFrame with the number of partitions per EEG record. Must be the output of the get_eeg_partition_number() function.

  • EEGsplit (DataFrame) – DataFrame with the train/test split info. Must be the output of get_eeg_split_table(), or a split extracted from the output of get_eeg_split_table_kfold() with the get_split() function.

  • EEGpartition_spec (list) – 3-element list with the inputs given to get_eeg_partition_number(), in [sampling_rate, window_length, overlap_percentage] format.

  • mode (string, optional) –

    Whether the dataset is intended for training, testing, or validation. It accepts only the following strings: ‘train’, ‘test’, ‘validation’.

    Default = ‘train’

  • supervised (bool, optional) –

    Whether the class __getitem__() method must return a label. Must be set to True during fine-tuning.

    Default = False

  • load_function ('function', optional) –

    A custom EEG file loading function. It will be used instead of the default:

    loadmat(ii, simplify_cells=True)['DATA_STRUCT']['data']

    which is the default output format for files preprocessed with the BIDSalign library. The function must take only one required argument, the full path to the EEG file; it will be called as load_function(fullpath, optional_arguments).

    The function can return one or two outputs: the first must be the EEG and the second, if present, its label. The assumed number of outputs depends on the parameter label_on_load, so if the function returns only the EEG, remember to set label_on_load to False. This function must also load the EEGs in the same way as during the get_eeg_partition_number() call.

    Default = None

  • transform_function ('function', optional) –

    A custom transformation to be applied after the EEG is loaded. It might be useful if there are portions of the signal to cut (usually the initial or the final part). The function must take only one required argument, the loaded EEG to transform; it will be called as transform_function(EEG, optional_arguments). This function must also transform the EEGs in the same way as during the get_eeg_partition_number() call.

    Default = None

  • label_function ('function', optional) –

    A custom function to be applied for the label extraction. It might be useful for the fine-tuning phase. Since an EEG file can have a single label or multiple labels, the function will be called with 2 required arguments:

    1. full path to the EEG file

    2. list with all the indices necessary to identify the extracted partition (if the EEG is a 2-D array, the list will contain only the starting and ending indices of the slice along the last axis; if the EEG is N-D, the list will also include all the other indices, from the first to the second-to-last axis)

    e.g. the function will be called in this way:

    label_function(full_path, [*first_axis_idx, start, end], optional args)

    It is strongly suggested to save EEG labels in a separate file. Otherwise the entire EEG file must be loaded every time a label is needed, which is what this module is designed to avoid.

    Default = None

  • optional_load_fun_args (list or dict, optional) –

    Optional arguments to give to the custom loading function. Can be a list or a dict.

    Default = None

  • optional_transform_fun_args (list or dict, optional) –

    Optional arguments to give to the EEG transformation function. Can be a list or a dict.

    Default = None

  • optional_label_fun_args (list or dict, optional) –

    Optional arguments to give to the custom label function. Can be a list or a dict.

    Default = None

  • multilabel_on_load (bool, optional) –

    Whether the custom loading function will also load an array of labels associated with the EEG file. In this case it is assumed that the number of labels is equal to the number of samples, i.e. the windows that can be extracted from the EEG according to EEGpartition_spec.

    Default = False

  • label_on_load (bool, optional) –

    Whether the custom loading function will also load a single label associated with the EEG file.

    Default = False

  • label_key (str or list of str, optional) –

    A single dictionary key, or a set of keys given as a list of strings, used to access a specific label if multiple were loaded. It might be useful if the loading function returns a dictionary of labels associated with the file, for example when you have a set of patient info but want to use only a specific entry.

    Default = None

  • default_dtype (torch.dtype) – The dtype to use when converting the loaded EEG to torch tensors. It is suggested to change the default float32 only if there are specific requirements, since float32 is faster on GPU devices.
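
The calling conventions described for load_function, transform_function, and label_function can be sketched as follows. All function names, the file layout, and the "_labels.npy" side file are hypothetical; only the order of the required arguments follows the descriptions above. The sketch assumes 2-D EEGs stored as NumPy arrays and calls the functions by hand instead of going through EEGDataset.

```python
import os
import pickle
import tempfile

import numpy as np


def load_eeg(path, key="data"):
    """Load one EEG array; path is the only required argument."""
    with open(path, "rb") as handle:
        record = pickle.load(handle)
    return record[key]


def crop_edges(eeg, n_start=0, n_end=0):
    """Drop samples at the beginning and end of the last (time) axis."""
    stop = eeg.shape[-1] - n_end
    return eeg[..., n_start:stop]


def window_label(path, indices, label_dir):
    """Read the label of one window from a side file, without loading the EEG.

    For a 2-D EEG, indices is [start, end] along the last axis.
    """
    start, end = indices[-2], indices[-1]
    name = os.path.splitext(os.path.basename(path))[0] + "_labels.npy"
    per_sample_labels = np.load(os.path.join(label_dir, name))
    # Majority label among the samples in the window (an illustrative choice).
    return int(np.bincount(per_sample_labels[start:end]).argmax())


# Build a tiny synthetic record to exercise the three functions.
tmp_dir = tempfile.mkdtemp()
eeg_path = os.path.join(tmp_dir, "subject1.pickle")
with open(eeg_path, "wb") as handle:
    pickle.dump({"data": np.zeros((8, 1000))}, handle)
np.save(os.path.join(tmp_dir, "subject1_labels.npy"),
        np.array([0] * 600 + [1] * 400))

eeg = load_eeg(eeg_path, key="data")                 # load_function(fullpath, ...)
eeg = crop_edges(eeg, n_start=100, n_end=100)        # transform_function(EEG, ...)
label = window_label(eeg_path, [500, 756], tmp_dir)  # label_function(path, [start, end], ...)

print(eeg.shape)  # (8, 800)
print(label)      # 1
```

With EEGDataset, the extra arguments (key, n_start, n_end, label_dir) would go in optional_load_fun_args, optional_transform_fun_args, and optional_label_fun_args. How a list versus a dict is unpacked (presumably positional versus keyword) is an assumption worth checking against the source. Whatever loading and cropping is used here should match what was passed to get_eeg_partition_number(), as noted above.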

Example

>>> import pickle
>>> import selfeeg.dataloading as dl
>>> from selfeeg import utils
>>> labels = utils.create_dataset()
>>> def loadEEG(path):
...     with open(path, 'rb') as handle:
...         EEG = pickle.load(handle)
...     x = EEG['data']
...     return x
>>> EEGlen = dl.get_eeg_partition_number('Simulated_EEG', freq=128, window=2,
...                                      overlap=0.3, load_function=loadEEG)
>>> EEGsplit = dl.get_eeg_split_table(EEGlen, seed=1234)  # default 60/20/20
>>> TrainSet = dl.EEGDataset(EEGlen, EEGsplit, [128, 2, 0.3], load_function=loadEEG)
>>> print(len(TrainSet))
>>> print(TrainSet.__getitem__(10).shape)  # will return torch.Size([8, 256])
>>> print(TrainSet.file_path)  # will return 'Simulated_EEG/1_11_3_1.pickle'
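
The last dimension of the shape printed above follows from the partition specification: at 128 Hz, a 2 s window holds 128 × 2 = 256 time points. The sketch below reproduces that arithmetic and adds a generic sliding-window count. The count formula and the rounding of the overlap are assumptions made for illustration and may differ from what get_eeg_partition_number() does; window_samples and window_count are hypothetical helpers.

```python
def window_samples(sampling_rate, window_length):
    """Number of time points in one window."""
    return round(sampling_rate * window_length)


def window_count(n_samples, sampling_rate, window_length, overlap):
    """Generic sliding-window count (assumed formula, not taken from selfeeg)."""
    width = window_samples(sampling_rate, window_length)
    step = width - round(width * overlap)  # assumed rounding of the overlap
    if n_samples < width:
        return 0
    return (n_samples - width) // step + 1


spec = [128, 2, 0.3]  # [sampling_rate, window_length, overlap_percentage]
print(window_samples(spec[0], spec[1]))  # 256
print(window_count(128 * 60, *spec))     # 42 windows for a 60 s record
```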

This image summarizes how to set up the main arguments of the EEGDataset class:

[Figure: DatasetClassScheme.jpeg]

preload_dataset()

preload_dataset eagerly loads the entire dataset to allow faster batch creation. The dataset will be stored in two torch tensors: x_preload for the EEG data and, if supervised is set to True, y_preload for the labels.

In case a tensor conversion is not possible, a tuple will be created instead.

Warning

As widely reported, eagerly loading the data, i.e. pre-loading the entire dataset in Dataset.__init__, significantly increases the overall memory usage. Do not pre-load the entire dataset if it is really large or if you plan to use multiple workers, as each worker will hold a reference to its own Dataset. See https://discuss.pytorch.org/t/what-data-does-each-worker-process-hold-does-it-hold-the-full-dataset-object-or-only-a-batch-of-it/160136
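
A rough size estimate can help decide whether preloading is affordable. The sketch below is back-of-the-envelope arithmetic, not part of selfeeg: preload_size_gib is a hypothetical helper, it assumes all windows share one shape and are stored as float32 (4 bytes per value), and it treats one full copy per worker as a worst case.

```python
def preload_size_gib(n_windows, n_channels, n_samples,
                     bytes_per_value=4, n_copies=1):
    """Approximate memory, in GiB, of the preloaded EEG windows (labels ignored)."""
    total_bytes = n_windows * n_channels * n_samples * bytes_per_value * n_copies
    return total_bytes / 2**30


# 500,000 windows of shape (8, 256) stored as float32.
print(round(preload_size_gib(500_000, 8, 256), 2))              # 3.81
# Worst case assumed here: 4 workers, each holding a full copy.
print(round(preload_size_gib(500_000, 8, 256, n_copies=4), 2))  # 15.26
```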