Datasets

CV Datasets

CIFAR-100: 100-class dataset with image size 32 x 32 pixels. For each class, there are 500 training samples and 100 testing samples.
STL10: 10-class dataset with an image size of 96 x 96 pixels. Each class has 500 training samples and 800 testing samples. Besides the labeled set, there is an unlabeled set with 100,000 samples.
TissueMNIST: a medical dataset of human kidney cortex cells, segmented from 3 reference tissue specimens and organized into 8 categories. The 236,386 samples in total are split with a ratio of 7 : 1 : 2 into training (165,466 images), validation (23,640 images) and test (47,280 images) sets. Each gray-scale image is 28 x 28 pixels.
SemiAves: an Aves (bird) classification dataset in which 3,959 images of 200 bird species are labeled and 26,640 images are unlabeled. The validation and test sets contain 10 and 20 images, respectively, for each of the 200 categories in the labeled set.
EuroSAT: a dataset of Sentinel-2 satellite images covering 13 spectral bands, consisting of 10 classes with 27,000 labeled and geo-referenced samples. The training, validation and test sets follow a 6 : 2 : 2 split.
ImageNet: 1000-class, high-resolution (224 x 224 pixels) recognition dataset. The number of images within each class ranges from 732 to 1300. The validation set consists of 50,000 images, which are evenly distributed across classes.
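
The ratio-based splits quoted above (for example 6 : 2 : 2 for EuroSAT) can be illustrated with a short sketch. The helper name `split_indices`, the seed, and the choice of a plain random shuffle are assumptions made for illustration; this is not necessarily how the USB splits were actually produced.

```python
import random


def split_indices(num_samples, ratios=(6, 2, 2), seed=0):
    """Shuffle indices 0..num_samples-1 and cut them into consecutive parts by ratio.

    Hypothetical helper for illustration only; not part of USB.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)

    total = sum(ratios)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        if i == len(ratios) - 1:
            # The last part takes the remainder so that no sample is dropped.
            end = num_samples
        else:
            end = start + num_samples * ratio // total
        parts.append(indices[start:end])
        start = end
    return parts


# EuroSAT has 27,000 labeled samples and a 6:2:2 split.
train_idx, val_idx, test_idx = split_indices(27_000, ratios=(6, 2, 2))
print(len(train_idx), len(val_idx), len(test_idx))  # 16200 5400 5400
```

Note that the published sizes need not match an exact ratio: the TissueMNIST counts of 165,466, 23,640 and 47,280 sum to 236,386 but are only approximately 7 : 1 : 2.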

NLP Datasets

IMDB: a binary sentiment classification dataset. There are 25,000 reviews for training and 25,000 for testing. IMDB is class-balanced: positive and negative reviews are equal in number in both the training and test sets. For USB, we draw 12,500 samples and 1,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.
Amazon Review: a sentiment classification dataset with 5 classes (scores). Each class (score) contains 600,000 training samples and 130,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.
Yelp Review: a sentiment classification dataset with 5 classes (scores). Each class (score) contains 130,000 training samples and 10,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.
AG News: a news topic classification dataset containing 4 classes. Each class contains 30,000 training samples and 1,900 test samples. For USB, we draw 25,000 samples and 2,500 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.
Yahoo! Answer: a topic classification dataset with 10 categories. Each class contains 140,000 training samples and 6,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.
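
The per-class drawing described for the NLP datasets can be sketched as follows. The helper `draw_per_class`, the seed, and the toy labels are hypothetical and only illustrate the idea of sampling a fixed number of training and validation examples from each class without overlap; the actual USB preprocessing may differ.

```python
import random
from collections import defaultdict


def draw_per_class(labels, n_train, n_val, seed=0):
    """Draw n_train training and n_val validation indices from every class.

    Hypothetical helper for illustration only; not part of USB.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < n_train + n_val:
            raise ValueError(f"class {label!r} has only {len(pool)} samples")
        # Sample without replacement so the two subsets never overlap.
        picked = rng.sample(pool, n_train + n_val)
        train_idx.extend(picked[:n_train])
        val_idx.extend(picked[n_train:])
    return train_idx, val_idx


# Toy stand-in shaped like AG News, scaled down by 100:
# 4 classes with 300 samples each, drawing 250 + 25 per class.
toy_labels = [c for c in range(4) for _ in range(300)]
toy_train, toy_val = draw_per_class(toy_labels, n_train=250, n_val=25)
print(len(toy_train), len(toy_val))  # 1000 100
```

Sampling both subsets in a single draw per class keeps the training and validation sets disjoint, and the original test set is left untouched.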

Audio Datasets

GTZAN: collected for music genre classification, with 10 classes and 100 audio recordings for each class. The maximum length of the recordings is 30 seconds and the original sampling rate is 22,050 Hz. We split 7,000 samples for training, 1,500 for validation, and 1,500 for testing. All recordings are re-sampled at 16,000 Hz.
UrbanSound8k: contains 8,732 labeled sound events of urban sounds from 10 classes, with a maximum length of 4 seconds. The original sampling rate of the audio recordings is 44,100 Hz and we re-sample them to 16,000 Hz. It is originally divided into 10 folds; we use the first 8 folds (7,079 samples) as the training set, and the last two folds as the validation set (816 samples) and the testing set (837 samples), respectively.
FSDNoisy18k: a sound event classification dataset across 20 classes. It consists of a small amount of manually labeled data (1,772 samples) and a large amount of noisy data (15,813 samples), which is treated as unlabeled data in our paper. The original sampling rate is 44,100 Hz, and the length of the recordings lies between 3 seconds and 30 seconds. We use the provided testing set for evaluation, which contains 947 samples.
Keyword Spotting: one of the tasks in Superb[30], for classifying keywords. It contains speech utterances with a maximum length of 1 second and a sampling rate of 16,000 Hz. The training, validation, and testing sets contain 18,538, 2,577, and 2,567 recordings, respectively. For pre-processing, we remove the silence and unknown labels from the dataset.
ESC-50: a dataset containing 2,000 environmental audio recordings for 50 sound classes. The maximum length of the recordings is 5 seconds and the original sampling rate is 44,100 Hz. We split 1,200 samples as training data, 400 as validation data, and 400 as testing data. We also re-sample the audio recordings to 16,000 Hz during pre-processing.
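
The fold-based split described for UrbanSound8k (folds 1 to 8 for training, fold 9 for validation, fold 10 for testing) can be sketched as below. The function name `split_for_fold` and the example records are hypothetical; the sketch only assumes that each recording comes with its fold number.

```python
def split_for_fold(fold):
    """Map an UrbanSound8k fold number (1-10) to a split name.

    Hypothetical helper for illustration only; not part of USB.
    """
    if not 1 <= fold <= 10:
        raise ValueError(f"fold must be in 1..10, got {fold}")
    if fold <= 8:
        return "train"
    return "val" if fold == 9 else "test"


# Made-up (file name, fold) records standing in for the dataset metadata.
records = [("a.wav", 1), ("b.wav", 8), ("c.wav", 9), ("d.wav", 10)]

splits = {"train": [], "val": [], "test": []}
for name, fold in records:
    splits[split_for_fold(fold)].append(name)

print(splits)
# {'train': ['a.wav', 'b.wav'], 'val': ['c.wav'], 'test': ['d.wav']}
```

Splitting by fold rather than by random shuffling keeps the dataset's predefined fold boundaries intact.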