Automated machine learning


Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. In a typical machine learning application, practitioners have a dataset of input data points to train on. The raw data may not be in a form that every algorithm can be applied to out of the box. An expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods to make the dataset amenable to machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning.[1][2] Automating the end-to-end process offers the advantages of producing simpler solutions, creating those solutions faster, and yielding models that often outperform models designed by hand. However, AutoML is not a silver bullet: it can introduce additional parameters of its own, called hyperhyperparameters, which may themselves require some expertise to set. It nonetheless makes machine learning easier for non-experts to apply.

Targets of automation

Automated machine learning can target various stages of the machine learning process:[2]

  • Automated data preparation and ingestion (from raw data and miscellaneous formats)
    • Automated column type detection; e.g., boolean, discrete numerical, continuous numerical, or text
    • Automated column intent detection; e.g., target/label, stratification field, numerical feature, categorical text feature, or free text feature
    • Automated task detection; e.g., binary classification, regression, clustering, or ranking
  • Automated feature engineering
  • Automated model selection
  • Hyperparameter optimization of the learning algorithm and featurization
  • Automated pipeline selection under time, memory, and complexity constraints
  • Automated selection of evaluation metrics / validation procedures
  • Automated problem checking
    • Leakage detection
    • Misconfiguration detection
  • Automated analysis of results obtained
  • User interfaces and visualizations for automated machine learning
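
As a rough illustration of the column type and task detection items above, the following sketch guesses a column's type from raw string values and then guesses a task from the target column. The function names, the set of boolean tokens, and the decision rules are assumptions made for this example, not the behaviour of any particular AutoML system.

```python
from typing import List, Optional

# Hypothetical set of tokens treated as boolean values.
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "0", "1"}


def detect_column_type(values: List[Optional[str]]) -> str:
    """Guess a column type from raw string cells (heuristic sketch)."""
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "empty"
    if {v.lower() for v in cleaned} <= BOOLEAN_TOKENS:
        return "boolean"
    try:
        numbers = [float(v) for v in cleaned]
    except ValueError:
        # At least one cell is not a number, so treat the column as text.
        return "text"
    if all(n.is_integer() for n in numbers):
        return "discrete numerical"
    return "continuous numerical"


def detect_task(target: Optional[List[Optional[str]]]) -> str:
    """Guess the learning task from the target column (heuristic sketch)."""
    if target is None:
        # No label column at all: only an unsupervised task is possible.
        return "clustering"
    kind = detect_column_type(target)
    if kind == "continuous numerical":
        return "regression"
    distinct = {v.strip().lower() for v in target if v is not None and v.strip()}
    if kind == "boolean" or len(distinct) == 2:
        return "binary classification"
    return "multiclass classification"
```

For example, `detect_column_type(["1.5", "2", "3"])` yields `"continuous numerical"`, and `detect_task(["0", "1", "1"])` yields `"binary classification"`. A production system would need more care, for instance with integer targets that have many distinct values.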


Notable platforms tackling various stages of AutoML:

Hyperparameter optimization and model selection

  • Auto-WEKA[3][4][5] is a Bayesian hyperparameter optimization layer on top of WEKA.
  • auto-sklearn[6][4][5] is a Bayesian hyperparameter optimization layer on top of scikit-learn.
  • ATM[7] is an open source software library under the Human Data Interaction project (HDI) at MIT. It is a distributed, scalable AutoML system designed with ease of use in mind.
  • H2O AutoML[4][5] provides automated data preparation, hyperparameter tuning via random search, and stacked ensembles in a distributed machine learning platform.
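
To make the idea of joint model selection and hyperparameter tuning concrete, here is a minimal sketch of random search over a combined space of algorithms and their hyperparameters. It uses only the Python standard library; the toy data, the two tiny classifiers, and the `SEARCH_SPACE` dictionary are assumptions for illustration. It does not reproduce the API of any platform listed above, and the Bayesian optimizers used by Auto-WEKA and auto-sklearn choose candidates more cleverly than uniform sampling.

```python
import random
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Dataset = List[Tuple[Point, int]]
Predictor = Callable[[Dataset, Point], int]

# Hypothetical joint search space: each algorithm carries its own hyperparameters.
SEARCH_SPACE: Dict[str, Dict[str, List[int]]] = {
    "knn": {"k": [1, 3, 5, 9, 15], "p": [1, 2]},
    "nearest_centroid": {"p": [1, 2]},
}


def make_data(n: int, rng: random.Random) -> Dataset:
    """Toy data: two Gaussian blobs, one per class."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def dist(a: Point, b: Point, p: int) -> float:
    """Minkowski distance raised to the power p (root omitted; ranking is unchanged)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b))


def knn_predict(train: Dataset, x: Point, k: int, p: int) -> int:
    nearest = sorted(train, key=lambda row: dist(row[0], x, p))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > len(nearest))  # majority vote among the neighbours


def centroid_predict(train: Dataset, x: Point, p: int) -> int:
    best_label, best_dist = 0, float("inf")
    for label in (0, 1):
        pts = [pt for pt, lab in train if lab == label]
        if not pts:
            continue
        centroid = (sum(q[0] for q in pts) / len(pts), sum(q[1] for q in pts) / len(pts))
        d = dist(centroid, x, p)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def build(algo: str, params: Dict[str, int]) -> Predictor:
    """Turn an (algorithm, hyperparameters) choice into a prediction function."""
    if algo == "knn":
        return lambda train, x: knn_predict(train, x, params["k"], params["p"])
    return lambda train, x: centroid_predict(train, x, params["p"])


def accuracy(predict: Predictor, train: Dataset, valid: Dataset) -> float:
    return sum(predict(train, x) == y for x, y in valid) / len(valid)


def random_search(train: Dataset, valid: Dataset, n_trials: int,
                  rng: random.Random) -> Tuple[float, str, Dict[str, int]]:
    """Sample an algorithm, then its hyperparameters; keep the best validation score."""
    best_score, best_algo, best_params = -1.0, "", {}
    for _ in range(n_trials):
        algo = rng.choice(sorted(SEARCH_SPACE))
        params = {name: rng.choice(options)
                  for name, options in SEARCH_SPACE[algo].items()}
        score = accuracy(build(algo, params), train, valid)
        if score > best_score:
            best_score, best_algo, best_params = score, algo, params
    return best_score, best_algo, best_params


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(300, rng)
    print(random_search(data[:200], data[200:], n_trials=20, rng=rng))
```

The search treats the choice of algorithm as just another hyperparameter, which is why the two problems are usually solved together. Scoring on a single held-out split keeps the sketch short; cross-validation would give a less noisy estimate.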

Full pipeline optimization

  • TPOT[8][9][4][5] is a Python library that automatically creates and optimizes full machine learning pipelines using genetic programming.
  • H2O Driverless AI[4][10][11] is an automated machine learning platform for automated visualization, feature engineering, model training, hyperparameter optimization, and explainability.
  • dotData Enterprise[12] is an automated machine learning platform developed by dotData for automated feature engineering, model training, hyperparameter optimization, explainability, and operationalization of AI and ML models.
  • TransmogrifAI[13][14][4] is a Scala/SparkML library created by Salesforce for automated data cleansing, feature engineering, model selection, and hyperparameter optimization.
  • RECIPE[15] is a framework based on grammar-based genetic programming that builds customized scikit-learn classification pipelines.
  • GA-Auto-MLC[16] and Auto-MEKAGGP[17] are freely available methods that perform automated multi-label classification on the MEKA software.[18]
  • ML-Plan[19] is an open-source AutoML tool based on Hierarchical Task Network planning, implemented to work with WEKA and scikit-learn algorithms. It has also been extended to support an unlimited number of preprocessing steps[20] using scikit-learn, and to handle multi-label classification[21] based on MEKA.
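
The evolutionary approach to pipeline optimization mentioned above can be sketched with a deliberately simplified genetic algorithm over linear pipelines. This is not TPOT's or RECIPE's actual algorithm, since those evolve tree-shaped or grammar-constrained pipelines. The step names, the `evolve` function, and in particular `toy_score` are hypothetical; `toy_score` is a placeholder for what would in practice be a cross-validated score of the fitted pipeline.

```python
import random
from typing import Callable, Tuple

Pipeline = Tuple[str, ...]  # zero or more preprocessing steps, then one model

# Hypothetical building blocks; real tools draw these from an ML library.
PREPROCESSORS = ["scale", "pca", "select_k_best", "polynomial"]
MODELS = ["tree", "knn", "logistic", "forest"]
MAX_STEPS = 3


def random_pipeline(rng: random.Random) -> Pipeline:
    n = rng.randint(0, MAX_STEPS)
    return tuple(rng.choice(PREPROCESSORS) for _ in range(n)) + (rng.choice(MODELS),)


def mutate(pipeline: Pipeline, rng: random.Random) -> Pipeline:
    """Add a step, drop a step, or swap the model."""
    steps, model = list(pipeline[:-1]), pipeline[-1]
    move = rng.choice(["add", "drop", "swap_model"])
    if move == "add" and len(steps) < MAX_STEPS:
        steps.insert(rng.randint(0, len(steps)), rng.choice(PREPROCESSORS))
    elif move == "drop" and steps:
        steps.pop(rng.randrange(len(steps)))
    else:
        # Also reached when an add or drop is not possible.
        model = rng.choice(MODELS)
    return tuple(steps) + (model,)


def crossover(a: Pipeline, b: Pipeline) -> Pipeline:
    """Combine the preprocessing of one parent with the model of the other."""
    return a[:-1] + (b[-1],)


def evolve(score: Callable[[Pipeline], float], generations: int,
           population_size: int, rng: random.Random) -> Pipeline:
    if population_size < 4:
        raise ValueError("population_size must be at least 4")
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: population_size // 2]  # survivors are kept unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b), rng))
        population = parents + children
    return max(population, key=score)


def toy_score(pipeline: Pipeline) -> float:
    """Placeholder fitness: NOT a real model evaluation."""
    target = ("scale", "select_k_best", "forest")
    matches = sum(a == b for a, b in zip(pipeline, target))
    return matches - 0.1 * abs(len(pipeline) - len(target))


if __name__ == "__main__":
    print(evolve(toy_score, generations=30, population_size=20, rng=random.Random(0)))
```

Because the better half of each generation survives unchanged, the best score in the population never decreases from one generation to the next. The expensive part in a real system is the fitness function, which has to train and validate every candidate pipeline.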

Deep neural network architecture search

See also


  1. Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013). Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. KDD '13 Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 847–855.
  2. Hutter F, Caruana R, Bardenet R, Bilenko M, Guyon I, Kegl B, and Larochelle H. "AutoML 2014 @ ICML". AutoML 2014 Workshop @ ICML. Retrieved 2018-03-28.
  3. Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K (2017). "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA". Journal of Machine Learning Research. 18 (25): 1–5.
  4. Truong A, Walters A, Goodsitt J, Hines K, Bruss B, Farivar R (2019). "Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools". arXiv:1908.05557 [cs.LG].
  5. Gijsbers P, LeDell E, Thomas J, Poirier S, Bischl B, Vanschoren J (2019). "An Open Source AutoML Benchmark" (PDF). arXiv:1907.00909 [cs.LG].
  6. Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F (2015). "Efficient and Robust Automated Machine Learning". Advances in Neural Information Processing Systems 28 (NIPS 2015): 2962–2970.
  7. Swearingen, Thomas; Drevo, Will; Cyphers, Bennett; Cuesta-Infante, Alfredo; Ross, Arun; Veeramachaneni, Kalyan (December 2017). "ATM: A distributed, collaborative, scalable system for automated machine learning". 2017 IEEE International Conference on Big Data (Big Data). IEEE: 151–162. doi:10.1109/bigdata.2017.8257923. ISBN 9781538627150.
  8. Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd L, Moore JH (2016). Automating biomedical data science through tree-based pipeline optimization. Proceedings of EvoStar 2016. Lecture Notes in Computer Science. 9597. pp. 123–137. arXiv:1601.07925. doi:10.1007/978-3-319-31204-0_9. ISBN 978-3-319-31203-3.
  9. Olson RS, Bartley N, Urbanowicz RJ, Moore JH (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of EvoBIO 2016. Gecco '16. pp. 485–492. arXiv:1603.06212. doi:10.1145/2908812.2908918. ISBN 9781450342063.
  10. Heller, Martin (6 November 2017). "Review: automates machine learning". Infoworld.
  11. Janofsky, Adam (21 August 2019). "Startup Aims to Democratize AI". WSJ. Retrieved 2019-08-21.
  12. Woodie, Alex (1 February 2019). Datanami.
  13. Shubha Nabar (2018-08-16). "Open Sourcing TransmogrifAI – Automated Machine Learning for Structured Data - Salesforce Engineering". Salesforce Engineering. Retrieved 2018-08-16.
  14. Kyle Wiggers (2018-08-16). "Salesforce open-sources TransmogrifAI, the machine learning library that powers Einstein". VentureBeat. Retrieved 2018-08-16. Once TransmogrifAI has extracted features from the dataset, it’s primed to begin automated model training. At this stage, it runs a cadre of machine learning algorithms in parallel on the data, automatically selects the best-performing model, and samples and recalibrates predictions to avoid imbalanced data.
  15. de Sá, Alex G. C.; Pinto, Walter José G. S.; Oliveira, Luiz Otavio V. B.; Pappa, Gisele L. (2017), "RECIPE: A Grammar-Based Framework for Automatically Evolving Classification Pipelines", Lecture Notes in Computer Science, Springer International Publishing, pp. 246–261, doi:10.1007/978-3-319-55696-3_16, ISBN 9783319556956
  16. de Sá, Alex G. C.; Pappa, Gisele L.; Freitas, Alex A. (2017). "Towards a Method for Automatically Selecting and Configuring Multi-label Classification Algorithms". Proceedings of the Genetic and Evolutionary Computation Conference Companion. GECCO '17. New York, NY, USA: ACM: 1125–1132. doi:10.1145/3067695.3082053. ISBN 9781450349390.
  17. de Sá, Alex G. C.; Freitas, Alex A.; Pappa, Gisele L. (2018). Auger, Anne; Fonseca, Carlos M.; Lourenço, Nuno; Machado, Penousal; Paquete, Luís; Whitley, Darrell (eds.). "Automated Selection and Configuration of Multi-Label Classification Algorithms with Grammar-Based Genetic Programming" (PDF). Parallel Problem Solving from Nature – PPSN XV. Lecture Notes in Computer Science. Springer International Publishing. 11102: 308–320. doi:10.1007/978-3-319-99259-4_25. ISBN 9783319992594.
  18. Read, Jesse; Reutemann, Peter; Pfahringer, Bernhard; Holmes, Geoff (January 2016). "Meka: A Multi-label/Multi-target Extension to Weka". J. Mach. Learn. Res. 17 (1): 667–671. ISSN 1532-4435.
  19. Mohr, Felix; Wever, Marcel; Hüllermeier, Eyke (3 July 2018). "ML-Plan: Automated machine learning via hierarchical planning". Machine Learning. 107 (8–10): 1495–1515. doi:10.1007/s10994-018-5735-z.
  20. Wever, Marcel; Mohr, Felix; Hüllermeier, Eyke. "ML-Plan for Unlimited-Length Machine Learning Pipelines" (PDF). ICML 2018 AutoML Workshop.
  21. Wever, Marcel; Mohr, Felix; Tornede, Alexander; Hüllermeier, Eyke. "Automating Multi-Label Classification Extending ML-Plan". 6th ICML Workshop on Automated Machine Learning.
  22. Haifeng J, Qingquan S, Xia H (2018). "Auto-Keras: Efficient Neural Architecture Search with Network Morphism". arXiv:1806.10282 [cs.LG].