{ "cells": [ { "cell_type": "markdown", "id": "8bdb80d3-737a-4306-8c46-3408f98851db", "metadata": {}, "source": [ "# Dataloading guide\n", "\n", "This section provides a brief introduction to the dataloading module and its main functionalities.\n", "\n", "In short, all functions and custom classes are designed to help you create an efficient PyTorch DataLoader to use during training. The main objective is to avoid loading the entire dataset at once and instead iteratively load (possibly overlapping) time windows called \"partitions\". A typical pipeline is based on the following steps:\n", "\n", "\n", "1) Define the **partition specs**, i.e. the EEGs' sampling rate, the window length and the overlap between consecutive windows.\n", "2) Call the **GetEEGPartitionNumber** function to extract the dataset length, i.e. the number of partitions which can be extracted from the EEG datasets, given the defined partition specs.\n", "3) Call the **GetEEGSplitTable** or the **GetEEGSplitTableKfold** function to split the data into train, validation and test sets.\n", "4) Pass the results of the previous steps to the custom PyTorch Dataset **EEGDataset**\n", "5) Optional: create a custom PyTorch Sampler **EEGSampler**\n", "6) Create a **PyTorch DataLoader** with the custom Dataset (and Sampler)" ] }, { "cell_type": "markdown", "id": "6103b7e9-d8cc-4889-83ce-ec0b5f5bfe97", "metadata": {}, "source": [ "First, let's import the dataloading module" ] }, { "cell_type": "code", "execution_count": 1, "id": "31f8ad6a", "metadata": {}, "outputs": [], "source": [ "import os\n", "import random\n", "import pickle\n", "import sys\n", "sys.path.append('..') # Needed when running this from the selfeeg/doc folder\n", "from selfeeg import dataloading as dl\n", "\n", "import numpy as np\n", "import torch\n", "from torch.utils.data import DataLoader\n", "\n", "# set seeds for reproducibility\n", "seed = 12\n", "torch.manual_seed( seed )\n", "np.random.seed( seed )\n",
"random.seed( seed )" ] }, { "cell_type": "markdown", "id": "6038e1ec-3373-4e5f-bc32-193da92c44b7", "metadata": {}, "source": [ "To provide a simple and executable tutorial, we will create a fake collection of EEG datasets (already aligned) which we will save in a folder \"Simulated_EEG\".\n", "Just to be clear, we will generate arrays of normally distributed random values with random length and save them. This is just to avoid downloading large datasets.\n", "\n", "To keep the size of the folder low, each file will be:\n", "1) a 2-channel EEG\n", "2) of random length between 1024 and 4096 samples\n", "3) stored with the name `\"{dataset_id}_{subject_id}_{session_id}_{trial_id}.pickle\"`. This naming will be useful for the split part." ] }, { "cell_type": "code", "execution_count": 2, "id": "669a9877-adc2-4741-b291-2ff0d5a11396", "metadata": {}, "outputs": [], "source": [ "# create the folder if it does not exist\n", "if not(os.path.isdir('Simulated_EEG')):\n", "    os.mkdir('Simulated_EEG')\n", "\n", "N=1000\n", "for i in range(N):\n", "    x = np.random.randn(2,np.random.randint(1024,4097))\n", "    y = np.random.randint(1,5)\n", "    sample = {'data': x, 'label': y}\n", "    dataset_id = (int(i//200)+1)\n", "    subject_id = (int( (i - 200*int(i//200)))//5+1)\n", "    session_id = (i%5+1)\n", "    trial_id = 1\n", "    file_name = f'Simulated_EEG/{dataset_id}_{subject_id}_{session_id}_{trial_id}.pickle'\n", "    with open(file_name, 'wb') as f:\n", "        pickle.dump(sample, f)" ] }, { "cell_type": "markdown", "id": "39788994-04b8-4554-b41e-ea9f56ea3e52", "metadata": {}, "source": [ "Now we have a folder with 1000 simulated EEGs coming from:\n", "1) 5 datasets (ID from 1 to 5);\n", "2) 40 subjects per dataset (ID from 1 to 40)\n", "3) 5 sessions per subject (ID from 1 to 5)\n", "\n", "Each file is a pickle file with a dictionary having keys:\n", "1) `'data'`: the numpy 2D array\n", "2) `'label'`: a fake label associated to the EEG file (from 1 to 4)" ] }, { "cell_type": "markdown", "id": "06d4ca58", "metadata": {}, "source": [ "## The 
GetEEGPartitionNumber function\n", "\n", "This function calculates the dataset length once the partition specs are defined. Let's suppose the data have a sampling rate of 128 Hz, and we want to extract 2-second samples with a 15% overlap. \n", "\n", "To complicate things, let's assume that we want to remove the last half second of each recording, for example because it often contains badly recorded data.\n", "\n", "
|   | full_path | file_name | N_samples |
|---|---|---|---|
| 0 | Simulated_EEG/1_10_1_1.pickle | 1_10_1_1.pickle | 15 |
| 1 | Simulated_EEG/1_10_2_1.pickle | 1_10_2_1.pickle | 5 |
| 2 | Simulated_EEG/1_10_3_1.pickle | 1_10_3_1.pickle | 7 |
| 3 | Simulated_EEG/1_10_4_1.pickle | 1_10_4_1.pickle | 12 |
| 4 | Simulated_EEG/1_10_5_1.pickle | 1_10_5_1.pickle | 6 |

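
As a rough illustration of how the partition specs translate into a number of windows, the sketch below counts the windows that fit in a single recording. `count_partitions` is a hypothetical helper, not part of selfeeg, and its rounding rules are assumptions, so its counts may differ from the `N_samples` values returned by GetEEGPartitionNumber.

```python
def count_partitions(n_samples, freq=128, window=2.0, overlap=0.15, remove_end=0.5):
    """Count sliding windows in one recording.

    Hypothetical helper for illustration only; it is not selfeeg's implementation.
    """
    usable = n_samples - int(remove_end * freq)  # drop the trailing half second
    win_len = int(window * freq)                 # window length in samples
    step = win_len - round(win_len * overlap)    # hop between consecutive windows
    if usable < win_len:
        return 0                                 # recording too short for one window
    return (usable - win_len) // step + 1


# Recording lengths within the range used for the simulated files
for length in (1024, 2048, 4096):
    print(length, count_partitions(length))
```

Under these assumptions a 2-second window at 128 Hz spans 256 samples and consecutive windows start 218 samples apart, so longer recordings contribute proportionally more partitions.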
|   | file_name | split_set |
|---|---|---|
| 0 | 1_10_1_1.pickle | 0 |
| 1 | 1_10_2_1.pickle | 0 |
| 2 | 1_10_3_1.pickle | 0 |
| 3 | 1_10_4_1.pickle | 0 |
| 4 | 1_10_5_1.pickle | 0 |
| ... | ... | ... |
| 995 | 5_9_1_1.pickle | 0 |
| 996 | 5_9_2_1.pickle | 0 |
| 997 | 5_9_3_1.pickle | 0 |
| 998 | 5_9_4_1.pickle | 0 |
| 999 | 5_9_5_1.pickle | 0 |

1000 rows × 2 columns
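
The file names listed in the table follow the `{dataset_id}_{subject_id}_{session_id}_{trial_id}.pickle` pattern used when the simulated files were created. The sketch below shows one way to recover those IDs from a name, for example to group files by dataset or subject when inspecting a split. `parse_eeg_filename` is a hypothetical helper written for this tutorial's naming scheme, not a selfeeg function.

```python
def parse_eeg_filename(file_name):
    """Split '{dataset}_{subject}_{session}_{trial}.pickle' into integer IDs.

    Hypothetical helper; it assumes exactly four underscore-separated integers.
    """
    stem = file_name.rsplit('.', 1)[0]  # drop the '.pickle' extension
    dataset_id, subject_id, session_id, trial_id = (int(p) for p in stem.split('_'))
    return {
        'dataset': dataset_id,
        'subject': subject_id,
        'session': session_id,
        'trial': trial_id,
    }


ids = parse_eeg_filename('1_10_3_1.pickle')
print(ids)
```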
\n", " EEGSampler( EEGDataset, Mode=0)\n",
"2. **Shuffled**: it returns a custom iterator, built as follows:\n",
" 1) Samples are shuffled at the file level;\n",
" 2) Samples of the same file are shuffled;\n",
" 3) Samples are rearranged based on the desired batch size and number of workers. This step exploits the parallelization properties of the PyTorch DataLoader and reduces the number of loading operations. To initialize the sampler in this mode, simply use the command EEGSampler( EEGDataset, BatchSize, Workers ) \n",
"\n",
"