User:Datakeeper/pageskeleton

List of datasets for machine learning research

This is a list of noteworthy datasets for machine learning research. It is limited to high-quality datasets that have been used in peer-reviewed publications such as academic journals, and it is not exhaustive.

Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4][5] This list aggregates, from multiple data repositories, high-quality datasets that have been shown to be of value to the machine learning research community, in order to provide greater coverage of the topic than is otherwise available.

Image data

Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Facial recognition

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Facial Recognition Technology (FERET) 11,338 images of 1,199 individuals in different positions and at different times None 11,338 images Classification, Facial Recognition 2003 [6] United States Department of Defense
Pose, Illumination, and Expression (PIE) 41,368 color images of 68 people in 13 different poses. Images labeled with expressions 41,368 images, text Classification, Facial Recognition 2000 [7] R. Gross et al.
SCFace Color images of faces at various angles Location of facial features extracted. Coordinates of features given. 4,160 images, text Classification, Facial Recognition 2011 [8] M. Grgic et al.

Object detection

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Berkeley 3-D Object Dataset 849 images taken in 75 different scenes. About 50 different object classes are labeled. Object bounding boxes and labeling 849 labeled images, text Object Recognition 2014 [9] A. Janoch et al.
Microsoft Common Objects in Context Complex everyday scenes of common objects in their natural context Object highlighting, labeling, and classification into 91 object types 2,500,000 labeled images, text Object Recognition 2015 [10] T. Lin et al.

Aerial images

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Aerial Image Segmentation Dataset 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. Images manually segmented 80 images, segmented images Aerial Classification, Object Detection 2013 [11] J. Yuan et al.
KIT AIS Data Set Multiple labeled training and evaluation datasets of aerial images of crowds. Images manually labeled to show paths of individuals through crowds. ~ 150 images with paths People tracking, Aerial tracking 2012 [12] M. Butenuth et al.

Other images

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
MPII Cooking Activities Dataset Videos and images of various cooking activities. Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling 881,755 frames labeled video, images, text Classification 2012 [13] M. Rohrbach et al.

Text data

Datasets consisting primarily of text for tasks such as sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Amazon reviews Customer reviews from Amazon.com commerce. Full text not given; features include words used, punctuation, length, etc. 1500 English Classification, Sentiment analysis 2011 [14] Zhi Liu
OpinRank Review Dataset Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. None 42,230 / ~259,000 respectively Text Sentiment analysis, Clustering 2011 [15] Ganesan, K., and Chengxiang Zhai

News articles

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
NYSK Data Set English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn Filtered and presented in XML format. 10,421 XML, Text Sentiment analysis, topic extraction 2013 [16] Dermouche, M. et al.
The Reuters Corpus Volume 1 Large corpus of Reuters news stories in English Fine-grain categorization and topic codes 810,000 articles Documents Classification, Clustering, Summarization 2002 [17] T. Rose et al.

Messages

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Enron Email Dataset Emails from employees at Enron organized into folders. Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. ~ 500,000 Emails Network analysis, Sentiment analysis 2004 (2015) [18] Klimt, B. and Y. Yang
Ling-Spam Dataset Corpus containing both legitimate and spam emails. Four versions of the corpus, differing in whether a lemmatiser or stop-list was enabled. Emails Classification 2000 [19] Androutsopoulos, I. et al.

Tweets

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Sentiment140 Tweet data from 2009 including original text, time stamp, user and sentiment. Classified using distant supervision from the presence of emoticons in tweets. 1,578,627 Tweets, comma-separated values Sentiment analysis 2009 [20] Go, A., R. Bhayani, and L. Huang
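
Distant supervision from emoticons, as mentioned for Sentiment140, can be illustrated with a minimal sketch. The emoticon lists, the function name label_by_emoticon, and the choices to strip emoticons and to discard tweets with no or mixed emoticons are assumptions made for illustration, not the procedure used by the dataset's creators.

```python
# Hypothetical emoticon lists; the real dataset may use different ones.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if no clear label.

    A tweet is labeled "positive" or "negative" only when it contains
    emoticons of exactly one polarity; the emoticon acts as a noisy label.
    """
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or both polarities: skip the tweet.
        return None
    label = "positive" if has_pos else "negative"
    text = tweet
    for e in POSITIVE + NEGATIVE:
        # Remove the emoticon so it is not also available as a feature.
        text = text.replace(e, "")
    return " ".join(text.split()), label


examples = ["great day :)", "missed the bus :(", "no emoticon here"]
labeled = [r for r in map(label_by_emoticon, examples) if r is not None]
```

Here labeled keeps only the two tweets that carry a single-polarity emoticon; the third is dropped because it offers no noisy label.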

Other text

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Legal Case Reports Federal Court of Australia cases from 2006-2009. None 4,000 Documents Summarization, Citation analysis 2012 [21] Galgani, F., P. Compton, and A. Hoffmann

Sound data

Datasets of sounds and sound features.

Speech

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Parkinson Speech Dataset Multiple recordings of people with and without Parkinson's Disease. Voice features extracted, disease scored by physician using the Unified Parkinson's Disease Rating Scale 1,040 text Classification, Regression 2013 [22] B. E. Sakar et al.
Spoken Arabic Digits Spoken Arabic digits from 44 male and 44 female speakers Time series of mel-frequency cepstrum coefficients 8,800 text Classification 2010 [23] M. Bedda et al.

Music

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Geographical Original of Music Data Set Audio features of music samples from different locations Audio features extracted using MARSYAS software 1,059 text Geographical Classification, Clustering 2014 [24] F. Zhou et al.

Other sounds

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
UrbanSound Labeled sound recordings of sounds like air conditioners, car horns and children playing. Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. 1,059 Sound (WAV) Classification 2014 [25] Salamon, J., C. Jacoby, and J.P. Bello

Signal data

Datasets containing electrical signal information that requires some form of signal processing for further analysis.

Medical

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
EEG Database Study to examine EEG correlates of genetic predisposition to alcoholism Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second 122 Comma-separated values Classification 1999 [26] Henri Begleiter
P300 Interface Dataset Data from nine subjects collected using a P300-based brain-computer interface for disabled subjects. Split into four sessions for each subject. MATLAB code given. 1,224 Comma-separated values Classification 2008 [27] Hoffmann, U. et al.

Electrical

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Witty Worm Dataset Dataset detailing the spread of the Witty worm and the infected computers. Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers 55,909 IP addresses Comma-separated values Classification 2004 [28] Center for Applied Internet Data Analysis

Other signals

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be applied.
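
The row-and-column layout described above can be sketched in a few lines of Python. This is only an illustration: the column names and values in SAMPLE_CSV are made up (they are not taken from any dataset in this list), and load_table is a hypothetical helper that assumes every column is numeric.

```python
import csv
import io

# Made-up sample rows for illustration only; not real dataset values.
SAMPLE_CSV = """rooms,age,distance,median_value
6.5,65.2,4.09,24.0
6.4,78.9,4.97,21.6
7.2,61.1,4.97,34.7
"""


def load_table(text, target):
    """Split comma-separated text into attribute rows and a target column.

    Each row is one observation; each column is one attribute. The column
    named by `target` is separated out as the value to predict.
    """
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # Remove the target from the row, keep the rest as attributes.
        targets.append(float(row.pop(target)))
        features.append([float(value) for value in row.values()])
    return features, targets


X, y = load_table(SAMPLE_CSV, "median_value")
```

With a numeric target such as this one, X and y would feed a regression model; a categorical target column would instead be kept as labels for classification.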

Financial

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Dow Jones Index Weekly data of stocks from the first and second quarters of 2011. Calculated values such as percentage change and lags are included. 750 Comma-separated values Classification, Regression, Time Series 2014 [29] Brown, M., M. Pelosi, and H. Dirska
Statlog (Australian Credit Approval) Credit card applications either accepted or rejected and attributes about the application. Attribute names are removed as well as identifying information. Factors have been relabeled. 690 Comma-separated values Classification 1987 [30] Ross Quinlan

Census

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Adult Dataset Census data from 1994 containing demographic features of adults and their income. Cleaned and anonymized. 48,842 Comma-separated values Classification 1996 [31] United States Census Bureau
Census-Income (KDD) Weighted census data from the 1994 and 1995 Current Population Surveys. Split into training and test sets. 299,285 Comma-separated values Classification 2000 [32], [33] United States Census Bureau

Other multivariate

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Housing Data Set Median home values of Boston with associated home and neighborhood attributes None 506 Comma-separated values Regression 1993 [34] Harrison, D. and Rubinfeld, D.L.

References

  1. ^ Wissner-Gross, A. "Datasets Over Algorithms". Edge.com. Retrieved 8 January 2016.
  2. ^ Weiss, Gary M., and Foster Provost. "Learning when training data are costly: the effect of class distribution on tree induction." Journal of Artificial Intelligence Research (2003): 315-354.
  3. ^ Turney, Peter. "Types of cost in inductive concept learning." (2000).
  4. ^ Abney, Steven. Semisupervised learning for computational linguistics. CRC Press, 2007.
  5. ^ Žliobaitė, Indrė, et al. "Active learning with evolving streaming data." Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2011. 597-612.
  6. ^ Phillips, P. Jonathon, et al. "The FERET database and evaluation procedure for face-recognition algorithms." Image and vision computing 16.5 (1998): 295-306.
  7. ^ Sim, Terence, Simon Baker, and Maan Bsat. "The CMU pose, illumination, and expression (PIE) database." Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on. IEEE, 2002.
  8. ^ Grgic, Mislav, Kresimir Delac, and Sonja Grgic. "SCface–surveillance cameras face database." Multimedia tools and applications 51.3 (2011): 863-879.
  9. ^ Karayev, S., et al. "A category-level 3-D object dataset: putting the Kinect to work." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2011.
  10. ^ Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 740-755.
  11. ^ Yuan, Jiangye, Shaun S. Gleason, and Anil M. Cheriyadat. "Systematic benchmarking of aerial image segmentation." Geoscience and Remote Sensing Letters, IEEE 10.6 (2013): 1527-1531.
  12. ^ Butenuth, Matthias, et al. "Integrating pedestrian simulation, tracking and event detection for crowd analysis." Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011.
  13. ^ Rohrbach, Marcus, et al. "A database for fine grained activity detection of cooking activities." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
  14. ^ Used in: Liu, Sanya, et al. "Application of synergetic neural network in online writeprint identification." International Journal of Digital Content Technology and its Applications 5.3 (2011): 126-135.
  15. ^ Ganesan, Kavita, and Chengxiang Zhai. "Opinion-based entity ranking." Information retrieval 15.2 (2012): 116-150.
  16. ^ Dermouche, Mohamed, et al. "A Joint Model for Topic-Sentiment Evolution over Time." Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014.
  17. ^ Rose, Tony, Mark Stevenson, and Miles Whitehead. "The Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's Language Resources." LREC. Vol. 2. 2002.
  18. ^ Klimt, Bryan, and Yiming Yang. "Introducing the Enron Corpus." CEAS. 2004.
  19. ^ Androutsopoulos, Ion, et al. "An evaluation of naive bayesian anti-spam filtering." arXiv preprint cs/0006013 (2000).
  20. ^ Go, Alec, Richa Bhayani, and Lei Huang. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford 1 (2009): 12.
  21. ^ Galgani, Filippo, Paul Compton, and Achim Hoffmann. "Combining different summarization techniques for legal text." Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 2012.
  22. ^ Sakar, Betul Erdogdu, et al. "Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings." Biomedical and Health Informatics, IEEE Journal of 17.4 (2013): 828-834.
  23. ^ Used in: Hammami, Nacereddine, and Mouldi Bedda. "Improved tree model for arabic speech recognition." Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on. Vol. 5. IEEE, 2010.
  24. ^ Zhou, Fang, Q. Claire, and Ross D. King. "Predicting the geographical origin of music." Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014.
  25. ^ Salamon, Justin, Christopher Jacoby, and Juan Pablo Bello. "A dataset and taxonomy for urban sound research." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
  26. ^ Used in: Ingber, Lester. "Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography." Physical Review E 55.4 (1997): 4578.
  27. ^ Hoffmann, Ulrich, et al. "An efficient P300-based brain–computer interface for disabled subjects." Journal of Neuroscience methods 167.1 (2008): 115-125.
  28. ^ The CAIDA UCSD Dataset on the Witty Worm - March 19-24, 2004, http://www.caida.org/data/passive/witty_worm_dataset.xml.
  29. ^ Brown, Michael Scott, Michael J. Pelosi, and Henry Dirska. "Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks." Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2013. 27-41.
  30. ^ Quinlan, J. Ross. "Simplifying decision trees." International journal of man-machine studies 27.3 (1987): 221-234.
  31. ^ Kohavi, Ron. "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid." KDD. Vol. 96. 1996.
  32. ^ Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.
  33. ^ Bay, Stephen D. "Multivariate discretization for set mining." Knowledge and Information Systems 3.4 (2001): 491-512.
  34. ^ Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.