User:Datakeeper/pageskeleton

List of datasets for machine learning research

This is a list of noteworthy datasets for machine learning research. It is limited to high-quality datasets that have been used in peer-reviewed publications such as academic journals, and it is not exhaustive.

Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.^[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.^[2]^[3]^[4]^[5] This list draws on multiple data repositories to aggregate high-quality datasets that have been shown to be of value to the machine learning research community, providing greater coverage of the topic than is otherwise available.

Image data

Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Facial recognition

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Facial Recognition Technology (FERET)	11,338 images of 1,199 individuals in different positions and at different times	None	11,338	images	Classification, Facial Recognition	2003	^[6]	United States Department of Defense
Pose, Illumination, and Expression (PIE)	41,368 color images of 68 people in 13 different poses.	Images labeled with expressions	41,368	images, text	Classification, Facial Recognition	2000	^[7]	R. Gross et al.
SCFace	Color images of faces at various angles	Location of facial features extracted. Coordinates of features given.	4,160	images, text	Classification, Facial Recognition	2011	^[8]	M. Grgic et al.

Object detection

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Berkeley 3-D Object Dataset	849 images taken in 75 different scenes. About 50 different object classes are labeled.	Object bounding boxes and labeling	849	labeled images, text	Object Recognition	2014	^[9]	A. Janoch et al.
Microsoft Common Objects in Context	Complex everyday scenes of common objects in their natural context	Object highlighting, labeling, and classification into 91 object types	2,500,000	labeled images, text	Object Recognition	2015	^[10]	T. Lin et al.

Aerial images

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Aerial Image Segmentation Dataset	80 high-resolution aerial images with spatial resolution ranging from 0.3 m to 1.0 m.	Images manually segmented	80	images, segmented images	Aerial Classification, Object Detection	2013	^[11]	J. Yuan et al.
KIT AIS Data Set	Multiple labeled training and evaluation datasets of aerial images of crowds.	Images manually labeled to show paths of individuals through crowds.	~ 150	images with paths	People tracking, Aerial tracking	2012	^[12]	M. Butenuth et al.

Other images

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
MPII Cooking Activities Dataset	Videos and images of various cooking activities.	Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling	881,755 frames	labeled video, images, text	Classification	2012	^[13]	M. Rohrbach et al.

Text data

Datasets consisting primarily of text for tasks such as sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Amazon reviews	Customer reviews from Amazon.com.	Full text not given; features include words used, punctuation, length, etc.	1,500	English	Classification, Sentiment analysis	2011	^[14]	Zhi Liu
OpinRank Review Dataset	Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.	None	42,230 / ~259,000 respectively	Text	Sentiment analysis, Clustering	2011	^[15]	Ganesan, K., and Chengxiang Zhai

News articles

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
NYSK Data Set	English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn	Filtered and presented in XML format.	10,421	XML, Text	Sentiment analysis, topic extraction	2013	^[16]	Dermouche, M. et al.
The Reuters Corpus Volume 1	Large corpus of Reuters news stories in English	Fine-grain categorization and topic codes	810,000 articles	Documents	Classification, Clustering, Summarization	2002	^[17]	T. Rose et al.

Messages

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Enron Email Dataset	Emails from employees at Enron organized into folders.	Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.	~ 500,000	Emails	Network analysis, Sentiment analysis	2004 (2015)	^[18]	Klimt, B. and Y. Yang
Ling-Spam Dataset	Corpus containing both legitimate and spam emails.	Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled.		Emails	Classification	2000	^[19]	Androutsopoulos, I. et al.

Tweets

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Sentiment140	Tweet data from 2009 including original text, time stamp, user and sentiment.	Classified using distant supervision from the presence of emoticons in the tweet.	1,578,627	Tweets, comma-separated values	Sentiment analysis	2009	^[20]	Go, A., R. Bhayani, and L. Huang
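
The distant-supervision labeling mentioned for Sentiment140 can be illustrated with a minimal Python sketch. The emoticon lists and the function name distant_label are assumptions made for this example, not the lists or code used to build the dataset: a tweet is labeled by the emoticons it contains, the emoticons are then stripped from the text, and tweets with no emoticon or with conflicting emoticons are skipped.

```python
# Hypothetical emoticon lists, chosen for illustration only; the lists
# actually used to build Sentiment140 may differ.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def distant_label(tweet):
    """Return (label, cleaned_text), or None when the tweet gives no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    # Skip tweets with no emoticon or with both kinds (ambiguous signal).
    if has_pos == has_neg:
        return None
    label = "positive" if has_pos else "negative"
    # Remove the emoticons so a classifier cannot simply memorize them.
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    # Collapse the whitespace left behind by the removal.
    return label, " ".join(cleaned.split())


if __name__ == "__main__":
    for text in ["great day :)", "missed the bus :(", "no emoticon here"]:
        print(repr(text), "->", distant_label(text))
```

Labels produced this way are noisy, which is why the Preprocessing column describes them as distant supervision rather than manual annotation.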

Other text

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Legal Case Reports	Federal Court of Australia cases from 2006-2009.	None	4,000	Documents	Summarization, Citation analysis	2012	^[21]	Galgani F., P. Compton, and A. Hoffmann

Sound data

Datasets of sounds and sound features.

Speech

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Parkinson Speech Dataset	Multiple recordings of people with and without Parkinson's disease.	Voice features extracted, disease scored by physician using the Unified Parkinson's Disease Rating Scale	1,040	text	Classification, Regression	2013	^[22]	B. E. Sakar et al.
Spoken Arabic Digits	Spoken Arabic digits from 44 male and 44 female speakers	Time series of mel-frequency cepstrum coefficients	8,800	text	Classification	2010	^[23]	M. Bedda et al.

Music

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Geographical Original of Music Data Set	Audio features of music samples from different locations	Audio features extracted using MARSYAS software	1,059	text	Geographical Classification, Clustering	2014	^[24]	F. Zhou et al.

Other sounds

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
UrbanSound	Labeled sound recordings of sounds like air conditioners, car horns and children playing.	Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file.	1,302	Sound (WAV)	Classification	2014	^[25]	Salamon, J., C. Jacoby, and J.P. Bello

Signal data

Datasets containing electrical signal information that requires signal processing for further analysis.

Medical

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
EEG Database	Study to examine EEG correlates of genetic predisposition to alcoholism	Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second	122	Comma separated values	Classification	1999	^[26]	Henri Begleiter
P300 Interface Dataset	Data from nine subjects collected using P300-based brain-computer interface for disabled subjects.	Split into four sessions for each subject. MATLAB code given.	1,224	Comma separated values	Classification	2008	^[27]	Hoffmann, U. et al.
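
The sampling figures quoted for the EEG Database (64 electrodes, 256 Hz, 1 second) can be checked with a short calculation. The sketch below is illustrative arithmetic only; the function and variable names are assumptions and do not come from the dataset's documentation.

```python
# Figures taken from the EEG Database row in the table above.
SAMPLING_RATE_HZ = 256
RECORDING_SECONDS = 1
ELECTRODES = 64


def recording_shape(rate_hz, seconds, electrodes):
    """Return (sample interval in ms, samples per electrode, total values)."""
    interval_ms = 1000.0 / rate_hz            # time between consecutive samples
    samples_per_electrode = rate_hz * seconds
    total_values = samples_per_electrode * electrodes
    return interval_ms, samples_per_electrode, total_values


interval_ms, samples, total = recording_shape(
    SAMPLING_RATE_HZ, RECORDING_SECONDS, ELECTRODES
)
print(f"sample interval: {interval_ms} ms")    # 3.90625 ms, quoted as 3.9 msec
print(f"samples per electrode: {samples}")     # 256
print(f"values per recording: {total}")        # 16384
```

Under these figures, each one-second recording holds 256 samples per electrode, or 16,384 values across all 64 electrodes.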

Electrical

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Witty Worm Dataset	Dataset detailing the spread of the Witty worm and the infected computers.	Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers	55,909 IP addresses	Comma separated values	Classification	2004	^[28]	Center for Applied Internet Data Analysis

Other signals

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be applied.
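
The row-and-column layout described above can be sketched in Python using only the standard library. The column names and values below are hypothetical toy data, not taken from any dataset in this list; the point is only to show how a comma-separated table splits into a feature matrix and a target column for regression or classification.

```python
import csv
import io

# Hypothetical toy table: each row is an observation, each column an attribute.
RAW = """rooms,age,median_value
6.5,65.2,24.0
6.4,78.9,21.6
7.1,61.1,34.7
"""


def load_table(text, target):
    """Split CSV text into feature names, feature rows X, and target values y."""
    rows = list(csv.DictReader(io.StringIO(text)))
    feature_names = [name for name in rows[0] if name != target]
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [float(row[target]) for row in rows]
    return feature_names, X, y


feature_names, X, y = load_table(RAW, target="median_value")
print(feature_names)  # attribute columns used as model inputs
print(X)              # one list of attribute values per observation
print(y)              # the value to predict for each observation
```

For a classification dataset the target column would hold class labels instead of numbers, so the float conversion of y would be dropped.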

Financial

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Dow Jones Index	Weekly data of stocks from the first and second quarters of 2011.	Calculated values included, such as percentage change and lags.	750	Comma separated values	Classification, Regression, Time Series	2014	^[29]	Brown, M., M. Pelosi, and H. Dirska
Statlog (Australian Credit Approval)	Credit card applications either accepted or rejected and attributes about the application.	Attribute names are removed as well as identifying information. Factors have been relabeled.	690	Comma separated values	Classification	1987	^[30]	Ross Quinlan

Census

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Adult Dataset	Census data from 1994 containing demographic features of adults and their income.	Cleaned and anonymized.	48,842	Comma separated values	Classification	1996	^[31]	United States Census Bureau
Census-Income (KDD)	Weighted census data from the 1994 and 1995 Current Population Surveys.	Split into training and test sets.	299,285	Comma separated values	Classification	2000	^[32], ^[33]	United States Census Bureau

Other multivariate

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Housing Data Set	Median home values of Boston with associated home and neighborhood attributes	None	506	Comma separated values	Regression	1993	^[34]	Harrison, D. and Rubinfeld, D.L.

References

[1] Wissner-Gross, A. "Datasets Over Algorithms". Edge.com. Retrieved 8 January 2016.
[2] Weiss, Gary M., and Foster Provost. "Learning when training data are costly: the effect of class distribution on tree induction." Journal of Artificial Intelligence Research (2003): 315-354.
[3] Turney, Peter. "Types of cost in inductive concept learning." (2000).
[4] Abney, Steven. Semisupervised learning for computational linguistics. CRC Press, 2007.
[5] Žliobaitė, Indrė, et al. "Active learning with evolving streaming data." Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2011. 597-612.
[6] Phillips, P. Jonathon, et al. "The FERET database and evaluation procedure for face-recognition algorithms." Image and vision computing 16.5 (1998): 295-306.
[7] Sim, Terence, Simon Baker, and Maan Bsat. "The CMU pose, illumination, and expression (PIE) database." Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on. IEEE, 2002.
[8] Grgic, Mislav, Kresimir Delac, and Sonja Grgic. "SCface–surveillance cameras face database." Multimedia tools and applications 51.3 (2011): 863-879.
[9] Karayev, S., et al. "A category-level 3-D object dataset: putting the Kinect to work." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2011.
[10] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in context." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 740-755.
[11] Yuan, Jiangye, Shaun S. Gleason, and Anil M. Cheriyadat. "Systematic benchmarking of aerial image segmentation." Geoscience and Remote Sensing Letters, IEEE 10.6 (2013): 1527-1531.
[12] Butenuth, Matthias, et al. "Integrating pedestrian simulation, tracking and event detection for crowd analysis." Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011.
[13] Rohrbach, Marcus, et al. "A database for fine grained activity detection of cooking activities." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[14] Used in: Liu, Sanya, et al. "Application of synergetic neural network in online writeprint identification." International Journal of Digital Content Technology and its Applications 5.3 (2011): 126-135.
[15] Ganesan, Kavita, and Chengxiang Zhai. "Opinion-based entity ranking." Information retrieval 15.2 (2012): 116-150.
[16] Dermouche, Mohamed, et al. "A Joint Model for Topic-Sentiment Evolution over Time." Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014.
[17] Rose, Tony, Mark Stevenson, and Miles Whitehead. "The Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's Language Resources." LREC. Vol. 2. 2002.
[18] Klimt, Bryan, and Yiming Yang. "Introducing the Enron Corpus." CEAS. 2004.
[19] Androutsopoulos, Ion, et al. "An evaluation of naive bayesian anti-spam filtering." arXiv preprint cs/0006013 (2000).
[20] Go, Alec, Richa Bhayani, and Lei Huang. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford 1 (2009): 12.
[21] Galgani, Filippo, Paul Compton, and Achim Hoffmann. "Combining different summarization techniques for legal text." Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 2012.
[22] Sakar, Betul Erdogdu, et al. "Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings." Biomedical and Health Informatics, IEEE Journal of 17.4 (2013): 828-834.
[23] Used in: Hammami, Nacereddine, and Mouldi Bedda. "Improved tree model for arabic speech recognition." Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on. Vol. 5. IEEE, 2010.
[24] Zhou, Fang, Q. Claire, and Ross D. King. "Predicting the geographical origin of music." Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014.
[25] Salamon, Justin, Christopher Jacoby, and Juan Pablo Bello. "A dataset and taxonomy for urban sound research." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
[26] Used in: Ingber, Lester. "Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography." Physical Review E 55.4 (1997): 4578.
[27] Hoffmann, Ulrich, et al. "An efficient P300-based brain–computer interface for disabled subjects." Journal of Neuroscience methods 167.1 (2008): 115-125.
[28] The CAIDA UCSD Dataset on the Witty Worm - March 19-24, 2004, http://www.caida.org/data/passive/witty_worm_dataset.xml.
[29] Brown, Michael Scott, Michael J. Pelosi, and Henry Dirska. "Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks." Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2013. 27-41.
[30] Quinlan, J. Ross. "Simplifying decision trees." International journal of man-machine studies 27.3 (1987): 221-234.
[31] Kohavi, Ron. "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid." KDD. Vol. 96. 1996.
[32] Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.
[33] Bay, Stephen D. "Multivariate discretization for set mining." Knowledge and Information Systems 3.4 (2001): 491-512.
[34] Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.
