EEGDataset
- class selfeeg.dataloading.load.EEGDataset(EEGlen: pd.DataFrame, EEGsplit: pd.DataFrame, EEGpartition_spec: list, mode: str = 'train', supervised: bool = False, load_function: function = None, transform_function: function = None, label_function: function = None, optional_load_fun_args: list or dict = None, optional_transform_fun_args: list or dict = None, optional_label_fun_args: list or dict = None, multilabel_on_load: bool = False, label_on_load: bool = False, label_key: list = None, default_dtype=torch.float32)
Custom PyTorch Dataset class that manages different loading configurations.
It can be used for both the pretraining and the fine-tuning phase. Its main strength is that it accepts different ways to load, transform, and extract optional labels from the data without preallocating the entire dataset, which is especially useful in SSL experiments, where multiple large datasets are used. For further details on how to use this class, see the introductory notebook provided in the documentation.
- Parameters:
EEGlen (DataFrame) – DataFrame with the number of partitions per EEG record. Must be the output of the get_eeg_partition_number() function.
EEGsplit (DataFrame) – DataFrame with the train/test split info. Must be the output of get_eeg_split_table(), or a split extracted from the output of the get_eeg_split_table_kfold function with the get_split function.
EEGpartition_spec (list) – 3-element list with the input given to get_eeg_partition_number(), in [sampling_rate, window_length, overlap_percentage] format.
mode (string, optional) –
Whether the dataset is intended for training, testing, or validation. It accepts only the following strings: ‘train’, ‘test’, ‘validation’.
Default = ‘train’
supervised (bool, optional) –
Whether the class __getitem__() method must return a label or not. Must be set to True during fine-tuning.
Default = False
load_function ('function', optional) –
A custom EEG file loading function. It will be used instead of the default:
loadmat(ii, simplify_cells=True)['DATA_STRUCT']['data']
which is the default output format for files preprocessed with the BIDSalign library. The function must take only one required argument, which is the full path to the EEG file (e.g. the function will be called in this way: load_function(fullpath, optional_arguments)).
The function can return one or two outputs, where the first must be the EEG and the second (if there is one) is its label. Note that the assumed number of outputs is based on the parameter label_on_load, so if the function returns only the EEG, remember to set label_on_load to False. Note also that this function must load the EEGs in the same way as during the get_eeg_partition_number call.
Default = None
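A minimal sketch of a loader that follows the contract described for load_function. It assumes, purely for illustration, that each record is a pickle file holding a dict with a 'data' entry and an optional 'label' entry; the function name, the return_label argument, and the file layout are hypothetical and not part of selfeeg.

```python
import pickle


def load_eeg(path, return_label=False):
    """Load one EEG record from a pickle file.

    Assumes (hypothetically) that the file stores a dict with a 'data'
    entry (the EEG) and, optionally, a 'label' entry.
    """
    with open(path, "rb") as handle:
        record = pickle.load(handle)
    eeg = record["data"]
    if return_label:
        # Two outputs: intended to be paired with label_on_load=True.
        return eeg, record["label"]
    # One output: intended to be paired with label_on_load=False.
    return eeg
```

Whichever loader is used, the same one should also be given to get_eeg_partition_number, so that both calls read the files identically.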
transform_function ('function', optional) –
A custom transformation to be applied after the EEG is loaded. Might be useful if there are portions of the signal to cut (usually the initial or the final part). The function must take only one required argument, which is the loaded EEG to transform (e.g. the function will be called in this way: transform_function(EEG, optional_arguments)). Note that this function must transform the EEGs in the same way as during the get_eeg_partition_number call.
Default = None
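As an illustration of the kind of transformation described for transform_function, the sketch below cuts the initial and final portion of a recording along the last axis. It assumes the loaded EEG is a NumPy array with time on the last axis; the function name and its optional arguments are hypothetical.

```python
import numpy as np


def trim_edges(eeg, freq=128, cut_start_s=1.0, cut_end_s=1.0):
    """Drop the first and last seconds of a recording along the last axis."""
    start = int(round(cut_start_s * freq))
    stop = eeg.shape[-1] - int(round(cut_end_s * freq))
    if stop <= start:
        raise ValueError("recording is shorter than the requested cuts")
    # Ellipsis keeps every leading axis (channels, trials, ...) untouched.
    return eeg[..., start:stop]
```

Because trimming changes the number of samples, and therefore the number of windows, the same function and arguments would need to be used in the get_eeg_partition_number call.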
label_function ('function', optional) –
A custom function to be applied for label extraction. Might be useful for the fine-tuning phase. Since an EEG file can have a single label or multiple labels, the function will be called with 2 required arguments:
full path to the EEG file
list with all indices necessary to identify the extracted partition (if the EEG is a 2-D array, the list will contain only the starting and ending indices of the slice along the last axis; if the EEG is N-D, the list will also include the indices of all the other axes, from the first to the second-to-last)
e.g. the function will be called in this way:
label_function(full_path, [*first_axis_idx, start, end], optional args)
It is strongly suggested to save EEG labels in a separate file, to avoid loading the entire EEG file every time, since avoiding such loads is the purpose of this entire module.
Default = None
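A minimal sketch of a label function with the two required arguments described for label_function. It assumes, hypothetically, that labels live in a JSON sidecar file next to each EEG file (same name without extension, plus '.labels.json') containing a list of annotated segments expressed in samples; neither the sidecar format nor the function name comes from selfeeg.

```python
import json
import os


def label_from_sidecar(full_path, index_list):
    """Return the label of the annotated segment containing the window centre.

    Assumes (hypothetically) a JSON sidecar named like the EEG file without
    its extension plus '.labels.json', holding a list of
    {"start": int, "end": int, "label": int} segments in samples.
    """
    # The last two entries are the slice bounds along the last axis;
    # any preceding entries index the other axes and are ignored here.
    start, end = index_list[-2], index_list[-1]
    centre = (start + end) // 2
    sidecar = os.path.splitext(full_path)[0] + ".labels.json"
    with open(sidecar, "r") as handle:
        segments = json.load(handle)
    for seg in segments:
        if seg["start"] <= centre < seg["end"]:
            return seg["label"]
    raise ValueError(f"no label covers samples {start}:{end} in {full_path}")
```

Only the small sidecar file is read on each call, so the EEG file itself is never reloaded just to obtain a label.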
optional_load_fun_args (list or dict, optional) –
Optional arguments to give to the custom loading function. Can be a list or a dict.
Default = None
optional_transform_fun_args (list or dict, optional) –
Optional arguments to give to the EEG transformation function. Can be a list or a dict.
Default = None
optional_label_fun_args (list or dict, optional) –
Optional arguments to give to the label extraction function. Can be a list or a dict.
Default = None
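The three optional_*_fun_args parameters accept either a list or a dict. A plausible reading is that a list is unpacked as extra positional arguments and a dict as keyword arguments; the helper below sketches that forwarding under this assumption. It is a hypothetical illustration, not the library's own code.

```python
def call_with_optional_args(func, required, optional=None):
    """Call func(required, ...) forwarding the optional arguments by type.

    Hypothetical helper, not part of selfeeg: a dict is unpacked as keyword
    arguments, any other sequence as positional arguments.
    """
    if optional is None:
        return func(required)
    if isinstance(optional, dict):
        return func(required, **optional)
    return func(required, *optional)
```

Under this assumption, a dict is the safer choice when the custom function has several optional arguments, because the call does not depend on their order.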
multilabel_on_load (bool, optional) –
Whether the custom loading function will also load an array of labels associated with the EEG file. In this case it is assumed that the number of labels is equal to the number of samples, i.e. the windows that can be extracted from the EEG according to EEGpartition_spec.
Default = False
label_on_load (bool, optional) –
Whether the custom loading function will also load a single label associated to the EEG file.
Default = False
label_key (str or list of str, optional) –
A single dictionary key, or a set of keys given as a list of strings, used to access a specific label if multiple were loaded. Might be useful if the loading function returns a dictionary of labels associated with the file, for example when you have a set of patient info but want to use only a specific item.
Default = None
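To make the role of label_key concrete, the sketch below selects one value from a dictionary of loaded labels. It assumes that, when several keys are given, they are applied in sequence to walk nested dictionaries; this interpretation, like the function name, is an assumption and not taken from the library.

```python
def select_label(labels, label_key):
    """Pick one entry from a (possibly nested) dict of labels.

    Hypothetical helper: a single string is treated as one key, a list of
    strings as a path of keys applied one after the other.
    """
    keys = [label_key] if isinstance(label_key, str) else list(label_key)
    selected = labels
    for key in keys:
        selected = selected[key]
    return selected
```

For example, with patient info such as {"age": 63, "diagnosis": {"group": 1}}, the key "age" selects 63 and the key list ["diagnosis", "group"] selects 1.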
default_dtype (torch.dtype) – The dtype to use when converting the loaded EEG to torch tensors. It is suggested to change the default float32 only if there are specific requirements, since float32 is faster on GPU devices.
Example
>>> import pickle
>>> import selfeeg.dataloading as dl
>>> from selfeeg import utils
>>> labels = utils.create_dataset()
>>> def loadEEG(path):
...     with open(path, 'rb') as handle:
...         EEG = pickle.load(handle)
...     x = EEG['data']
...     return x
>>> EEGlen = dl.get_eeg_partition_number('Simulated_EEG', freq=128, window=2,
...                                      overlap=0.3, load_function=loadEEG)
>>> EEGsplit = dl.get_eeg_split_table(EEGlen, seed=1234)  # default 60/20/20
>>> TrainSet = dl.EEGDataset(EEGlen, EEGsplit, [128, 2, 0.3], load_function=loadEEG)
>>> print(len(TrainSet))
>>> print(TrainSet.__getitem__(10).shape)  # will return torch.Size([8, 256])
>>> print(TrainSet.file_path)  # will return 'Simulated_EEG/1_11_3_1.pickle'
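The partition spec [128, 2, 0.3] used in the example explains the 256-sample windows it reports (128 Hz times 2 s). The sketch below works through the usual sliding-window arithmetic for such a spec. The function name, the rounding, and the counting rule are assumptions made for illustration and may differ in detail from what get_eeg_partition_number computes.

```python
def window_layout(n_samples, freq, window_s, overlap):
    """Return (window_len, step, n_windows) for a sliding-window partition.

    overlap is the fraction of a window shared with the next one.
    Hypothetical helper, not part of selfeeg.
    """
    window_len = round(freq * window_s)
    overlap_len = round(window_len * overlap)
    step = window_len - overlap_len
    if step <= 0:
        raise ValueError("overlap must leave a positive step between windows")
    if n_samples < window_len:
        return window_len, step, 0
    # One window at offset 0, plus one for every full step that still fits.
    n_windows = (n_samples - window_len) // step + 1
    return window_len, step, n_windows
```

For a hypothetical record of 1024 samples, window_layout(1024, 128, 2, 0.3) gives 256-sample windows that start every 179 samples, for a total of 5 windows.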
This image summarizes how to set up the main arguments of the EEGDataset class:
- preload_dataset()
preload_dataset eagerly loads the entire dataset to allow faster batch creation. The dataset will be stored in two torch tensors: x_preload for the EEG data and y_preload for the labels, if supervised is set to True. If a tensor conversion is not possible, a tuple will be created instead.
Warning
As reported by many, eagerly loading the data, i.e. pre-loading the entire data in Dataset.__init__, significantly increases the overall memory usage. Do not pre-load the entire dataset if it is really large or if you plan to use multiple workers, as each worker will hold a reference to its own Dataset. See https://discuss.pytorch.org/t/what-data-does-each-worker-process-hold-does-it-hold-the-full-dataset-object-or-only-a-batch-of-it/160136
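A rough way to judge whether preloading is affordable is to estimate the size of the preloaded EEG tensor before calling preload_dataset. The sketch below is a back-of-the-envelope calculation with a hypothetical function name; the copies argument models the pessimistic case in which every worker ends up with its own copy, which depends on the operating system and on how worker processes are started.

```python
def preload_bytes(n_windows, n_channels, window_len, bytes_per_value=4, copies=1):
    """Rough size in bytes of the preloaded EEG data, times the number of copies.

    bytes_per_value=4 corresponds to float32. Hypothetical helper, not part
    of selfeeg; labels and Python object overhead are ignored.
    """
    return n_windows * n_channels * window_len * bytes_per_value * copies
```

For instance, 100,000 windows of 8 channels by 256 samples in float32 take 819,200,000 bytes (about 0.76 GiB); with copies=5 (a main process plus four workers, worst case) the estimate grows to 4,096,000,000 bytes.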