These datasets are routinely used for machine learning research in peer-reviewed academic journals and other publications, and they are an integral part of the field. Major advances in machine learning can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4][5] This list draws on multiple data repositories to aggregate high-quality datasets that have been shown to be of value to the machine learning research community, providing greater coverage of the topic than is otherwise available.

Image data

Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Facial recognition

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Facial Recognition Technology (FERET) | 11,338 images of 1,199 individuals in different positions and at different times | None | 11,338 | images | Classification, Facial Recognition | 2003 | [6] | United States Department of Defense
Pose, Illumination, and Expression (PIE) | 41,368 color images of 68 people in 13 different poses. | Images labeled with expressions | 41,368 | images, text | Classification, Facial Recognition | 2000 | [7] | R. Gross et al.
SCFace | Color images of faces at various angles | Location of facial features extracted. Coordinates of features given. | 4,160 | images, text | Classification, Facial Recognition | 2011 | [8] | M. Grgic et al.

Object detection

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling | 849 | labeled images, text | Object Recognition | 2014 | [9] | A. Janoch et al.
Microsoft Common Objects in Context | Complex everyday scenes of common objects in their natural context | Object highlighting, labeling, and classification into 91 object types | 2,500,000 | labeled images, text | Object Recognition | 2015 | [10] | T. Lin et al.

Aerial images

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented | 80 | images, segmented images | Aerial Classification, Object Detection | 2013 | [11] | J. Yuan et al.
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~150 | images with paths | People tracking, Aerial tracking | 2012 | [12] | M. Butenuth et al.

Other images

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling | 881,755 frames | labeled video, images, text | Classification | 2012 | [13] | M. Rohrbach et al.

Text data

Datasets consisting primarily of text for tasks such as sentiment analysis, translation, and cluster analysis.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Amazon reviews | Customer reviews from Amazon.com commerce. | Full text not given; features include words used, punctuation, length, etc. | 1,500 | English | Classification, Sentiment analysis | 2011 | [14] | Zhi Liu
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None | 42,230 / ~259,000 respectively | Text | Sentiment analysis, Clustering | 2011 | [15] | Ganesan, K., and Chengxiang Zhai

News articles

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NYSK Data Set | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn | Filtered and presented in XML format. | 10,421 | XML, Text | Sentiment analysis, topic extraction | 2013 | [16] | Dermouche, M. et al.
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English | Fine-grain categorization and topic codes | 810,000 articles | Documents | Classification, Clustering, Summarization | 2002 | [17] | T. Rose et al.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~500,000 | Emails | Network analysis, Sentiment analysis | 2004 (2015) | [18] | Klimt, B. and Y. Yang
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus, depending on whether or not a lemmatiser or stop-list was enabled. | | Emails | Classification | 2000 | [19] | Androutsopoulos, J. et al.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [20] | Go, A., R. Bhayani, and L. Huang
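
The distant-supervision labeling mentioned for Sentiment140 can be illustrated with a minimal sketch. The emoticon lists, the function name `distant_label`, and the choice to strip emoticons from the returned text are assumptions made for illustration; they are not taken from the dataset's own tooling.

```python
# Hypothetical emoticon lists; the lists actually used for Sentiment140 may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet):
    """Return (label, cleaned_text), or None when the emoticon signal is missing or mixed."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Neither kind or both kinds of emoticon present: too ambiguous to label.
        return None
    cleaned = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        # Assumption of this sketch: drop the emoticon so it cannot leak the label.
        cleaned = cleaned.replace(emoticon, "")
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
    return ("positive" if has_pos else "negative", cleaned)


if __name__ == "__main__":
    for text in ["Loving this phone :)", "Missed my flight :-(", "No emoticon here"]:
        print(text, "->", distant_label(text))
```

The point of the technique is that no human annotator is involved: the emoticon acts as a noisy stand-in for a sentiment label, which is how a corpus of this size can be labeled cheaply.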

Other text

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Legal Case Reports | Federal Court of Australia cases from 2006-2009. | None | 4,000 | Documents | Summarization, Citation analysis | 2012 | [21] | Galgani F., P. Compton, and A. Hoffmann

Sound data

Datasets of sounds and sound features.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale | 1,040 | text | Classification, Regression | 2013 | [22] | B. E. Sakar et al.
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers | Time series of mel-frequency cepstrum coefficients | 8,800 | text | Classification | 2010 | [23] | M. Bedda et al.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Geographical Original of Music Data Set | Audio features of music samples from different locations | Audio features extracted using MARSYAS software | 1,059 | text | Geographical Classification, Clustering | 2014 | [24] | F. Zhou et al.

Other sounds

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events, with metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound | Classification | 2014 | [25] | Salamon, J., C. Jacoby, and J.P. Bello

Signal data

Datasets containing electrical signal information that requires some form of signal processing for further analysis.


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second | 122 | Comma separated values | Classification | 1999 | [26] | Henri Begleiter
P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Comma separated values | Classification | 2008 | [27] | Hoffman, U. et al.
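
The sampling figures quoted for the EEG Database fit together arithmetically, as the short check below shows. It only restates the numbers given in the table (256 Hz, 1 second, 64 electrodes); the variable names are illustrative, and the per-recording total assumes every electrode is sampled for the full second.

```python
SAMPLING_RATE_HZ = 256   # samples per second, per electrode (from the table)
DURATION_S = 1           # length of one recording in seconds (from the table)
N_ELECTRODES = 64        # electrodes placed on the scalp (from the table)

# Time between consecutive samples, in milliseconds: 1000 / 256 = 3.90625,
# which the table rounds to 3.9 msec.
sample_interval_ms = 1000 / SAMPLING_RATE_HZ

# Samples collected from one electrode during one recording.
samples_per_electrode = SAMPLING_RATE_HZ * DURATION_S

# Total values in one recording, assuming all electrodes cover the full second.
values_per_recording = samples_per_electrode * N_ELECTRODES

if __name__ == "__main__":
    print(sample_interval_ms, samples_per_electrode, values_per_recording)
```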


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers | 55,909 IP addresses | Comma separated values | Classification | 2004 | [28] | Center for Applied Internet Data Analysis

Other signals

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used.
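
As a minimal sketch of this row-and-column layout, the snippet below holds a tiny table in plain Python, separates the attribute columns from a target column, and fits a one-variable least-squares line. The column names and values are invented for illustration and do not come from any dataset listed here.

```python
# Hypothetical toy table: each row is one observation, each column one attribute.
columns = ["rooms", "age_years", "median_value"]
rows = [
    [6.0, 65.0, 24.0],
    [6.5, 79.0, 21.5],
    [7.0, 61.0, 34.5],
    [5.5, 95.0, 16.0],
]


def split_features_target(rows, columns, target):
    """Separate the target column (what we predict) from the remaining attributes."""
    t = columns.index(target)
    features = [[v for i, v in enumerate(row) if i != t] for row in rows]
    targets = [row[t] for row in rows]
    return features, targets


def fit_line(x, y):
    """Ordinary least squares for a single attribute: y is approximated by slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    X, y = split_features_target(rows, columns, "median_value")
    rooms = [obs[0] for obs in X]  # use the first attribute as the single predictor
    print(fit_line(rooms, y))
```

For a classification task the same split applies; the target column would simply hold class labels instead of numeric values.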


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma separated values | Classification, Regression, Time Series | 2014 | [29] | Brown, M., M. Pelosi, and H. Dirska
Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [30] | Ross Quinlan


Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [31] | United States Census Bureau
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [32], [33] | United States Census Bureau

Other Multivariate

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes | None | 506 | Comma separated values | Regression | 1993 | [34] | Harrison, D. and Rubinfeld, D.L.


