Feature engineering

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.[1]
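
As a minimal illustration, the sketch below turns one raw record into features that a domain expert might consider informative. The record layout (a timestamp and an amount) and the three derived features are assumptions made for this example and do not come from the cited sources.

```python
import math
from datetime import datetime

def extract_features(record):
    """Turn one raw record (hypothetical schema) into a dict of features."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "hour": ts.hour,                              # time-of-day effect
        "is_weekend": ts.weekday() >= 5,              # Saturday=5, Sunday=6
        "log_amount": math.log1p(record["amount"]),   # compress a skewed scale
    }

raw = {"timestamp": "2021-04-17T22:15:00", "amount": 99.0}
features = extract_features(raw)
print(features)
```

None of these values is stored in the raw record; each encodes a guess about what matters for the prediction task.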

A feature is a property shared by independent units on which analysis or prediction is to be done.[2]

Features are used by predictive models and influence results.[3]

Feature engineering has been employed in Kaggle competitions[4] and machine learning projects.[5]

Process

The feature engineering process is iterative:[6]

  • Brainstorming or testing features;[7]
  • Deciding which features to create;
  • Creating the features;
  • Testing the impact of the created features on the task;
  • Improving the features if needed;
  • Repeating the process.
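
The steps above can be sketched as a loop: propose a candidate feature, create it, measure its impact on the task, and keep it only if the score improves. The toy data, the candidate features, and the scoring function (leave-one-out accuracy of a 1-nearest-neighbour classifier) are all assumptions chosen to keep the example self-contained.

```python
def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

# Toy raw data: (width, height) of rectangles; label 1 marks a large area.
raw = [(1, 8), (2, 2), (8, 1), (3, 3), (5, 5), (2, 9), (9, 3), (6, 4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

features = {
    "width": lambda w, h: w,
    "height": lambda w, h: h,
    "area": lambda w, h: w * h,      # candidate engineered feature
    "aspect": lambda w, h: w / h,    # candidate engineered feature
}

def score(names):
    rows = [[features[n](w, h) for n in names] for w, h in raw]
    return loo_accuracy(rows, labels)

selected = ["width", "height"]       # start from the raw columns
baseline = best_score = score(selected)

for candidate in ["area", "aspect"]:
    trial_score = score(selected + [candidate])
    if trial_score > best_score:     # keep only features that help the task
        selected.append(candidate)
        best_score = trial_score

print(baseline, best_score, selected)
```

In practice the score would come from a held-out validation set rather than from the same small sample used to build the features.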

Relevance

Features vary in significance.[8] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).[9]
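
One simple way to perform feature selection is a filter that ranks features by the absolute value of their correlation with the target and keeps the top k. The sketch below assumes this filter approach and uses made-up column names and data; it is only one of many selection techniques and is not prescribed by the cited sources.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(columns, target, k):
    """Keep the k features with the largest absolute correlation to the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]

columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "inverse": [6.0, 5.0, 4.5, 3.0, 2.0, 1.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "constant": [7.0] * 6,
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

kept = select_top_k(columns, target, k=2)
print(kept)
```

A correlation filter scores each feature in isolation, so it can discard features that are only useful in combination with others.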

Explosion

Feature explosion occurs when the number of identified features grows beyond what can be handled effectively. Common causes include:

  • Feature templates: generating features from templates instead of coding each new feature by hand
  • Feature combinations: combinations of features that cannot be represented by a linear system

Feature explosion can be limited with techniques such as regularization, kernel methods, and feature selection.[10]
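
To give a sense of scale, the sketch below counts how many product features result from combining n base features up to a given degree. Polynomial combination is used here as an assumed example of "feature combinations"; the feature names are placeholders.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(names, degree):
    """List every product of 1..degree base features, ignoring order."""
    terms = []
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms

# Three base features already yield nine terms at degree 2.
terms = polynomial_terms(["x1", "x2", "x3"], degree=2)

# Closed form for n features up to degree d: C(n + d, d) - 1.
counts = {n: comb(n + 3, 3) - 1 for n in (10, 100, 1000)}

print(len(terms), counts)
```

At degree 3 the count grows roughly with the cube of the number of base features, which is why some form of pruning or regularization becomes necessary.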

Automation

Automation of feature engineering is a research topic that dates back to the 1990s.[11] Machine learning software that incorporates automated feature engineering has been commercially available since 2016.[12] Related academic literature can be roughly separated into two types:

  • Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
  • Deep Feature Synthesis uses simpler methods.[citation needed]

MRDTL generates features in the form of SQL queries by successively adding clauses to the queries.[13] For instance, the algorithm might start out with:

SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id

The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392".[14]
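
The refinement step can be tried out with Python's built-in sqlite3 module. The sketch below assumes a minimal ATOM and MOLECULE schema and a handful of invented rows; it also adds t1.mol_id to the SELECT list so that each count can be tied to its molecule. It shows only how a condition changes the generated feature, not how MRDTL chooses which condition to add.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOLECULE (mol_id INTEGER PRIMARY KEY);
    CREATE TABLE ATOM (atom_id INTEGER PRIMARY KEY, mol_id INTEGER, charge REAL);
    INSERT INTO MOLECULE VALUES (1), (2);
    INSERT INTO ATOM VALUES
        (1, 1, -0.50), (2, 1, 0.10), (3, 1, -0.40),
        (4, 2, -0.45), (5, 2, 0.30);
""")

BASE = (
    "SELECT t1.mol_id, COUNT(*) FROM ATOM t1 "
    "LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id"
)

def feature(conditions):
    """Build the query from a list of conditions and return {mol_id: count}."""
    where = " WHERE " + " AND ".join(conditions) if conditions else ""
    sql = BASE + where + " GROUP BY t1.mol_id"
    return dict(conn.execute(sql).fetchall())

initial = feature([])                          # atoms per molecule
refined = feature(["t1.charge <= -0.392"])     # strongly negative atoms only

print(initial, refined)
```

Each further refinement appends another condition to the list, so the query text itself serves as the description of the feature.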

However, most MRDTL studies base their implementations on relational databases, which results in many redundant operations. These redundancies can be reduced with techniques such as tuple ID propagation.[15][16] Efficiency can be increased further by using incremental updates, which eliminate the redundancies.[17]
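
The idea behind tuple ID propagation can be sketched loosely as follows: instead of joining tables again for every candidate condition, the IDs of the target tuples are attached once to the rows of the other relations, and each condition is then evaluated by collecting those IDs. The data, the relation layout, and the helper names below are hypothetical, and the code is an illustration of the general idea rather than the algorithm from the cited papers.

```python
from collections import defaultdict

# Target relation (hypothetical): molecule id -> class label.
labels = {1: True, 2: False, 3: True}

# Non-target relation (hypothetical): (atom_id, mol_id, element, charge).
atoms = [
    (10, 1, "C", -0.50),
    (11, 1, "O", 0.10),
    (12, 2, "C", 0.20),
    (13, 3, "N", -0.45),
    (14, 3, "C", -0.60),
]

# Propagate once: every atom carries the ids of the target tuples it joins with.
atom_ids = {atom_id: {mol_id} for atom_id, mol_id, _, _ in atoms}

# Propagate one hop further, to a per-element view, by unioning the id sets.
element_ids = defaultdict(set)
for atom_id, _, element, _ in atoms:
    element_ids[element] |= atom_ids[atom_id]

def coverage(predicate):
    """Count (positive, negative) target tuples with an atom matching predicate."""
    covered = set()
    for atom in atoms:
        if predicate(atom):
            covered |= atom_ids[atom[0]]
    positives = sum(labels[mol_id] for mol_id in covered)
    return positives, len(covered) - positives

# Many candidate conditions reuse the same propagated ids; no join is repeated.
by_charge = coverage(lambda atom: atom[3] <= -0.392)
by_element = coverage(lambda atom: atom[2] == "C")

print(by_charge, by_element, dict(element_ids))
```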

Deep Feature Synthesis

The Deep Feature Synthesis algorithm beat 615 of 906 human teams in a competition.[18][19]
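
As described by Kanter and Veeramachaneni,[19] Deep Feature Synthesis builds features by stacking simple aggregation and transformation functions along the relationships between tables. The sketch below illustrates only that stacking idea on invented customer, order, and item data; the helper function and the table layout are assumptions, and the code is not the authors' implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: customers have orders, orders have items.
orders = [(1, "a"), (2, "a"), (3, "b")]                         # (order_id, customer_id)
items = [(1, 10.0), (1, 5.0), (2, 20.0), (3, 7.0), (3, 3.0)]    # (order_id, price)

def aggregate(pairs, agg):
    """Group (key, value) pairs by key and apply agg to each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Depth 1: SUM(items.price) for each order.
order_total = aggregate(items, sum)

# Depth 2: MEAN(orders.SUM(items.price)) for each customer.
customer_mean_total = aggregate(
    ((customer_id, order_total[order_id]) for order_id, customer_id in orders),
    mean,
)

print(order_total, customer_mean_total)
```

Swapping in other aggregation functions at either level yields further candidate features from the same two relationships.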

Libraries:

  • Featuretools[20]
  • OneBM[21]
  • ExploreKit[22]

OneBM helps data scientists reduce data exploration time by letting them try many ideas through trial and error in a short time. It also enables non-experts who are not familiar with data science to quickly extract value from their data with little effort, time, and cost.[23]

Feature stores

A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.[24]
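
The capabilities listed above can be sketched with a toy in-memory class: it stores feature code under a name and version, applies that code to raw data, and checks a simple usage policy before serving. The class, its method names, and the example feature are hypothetical and do not correspond to the API of any real feature store.

```python
class FeatureStore:
    """Toy in-memory feature store (hypothetical API, for illustration only)."""

    def __init__(self):
        # (name, version) -> (function, allowed consumers or None for "anyone")
        self._features = {}

    def register(self, name, version, fn, allowed_consumers=None):
        """Store the code that generates a feature under a name and version."""
        self._features[(name, version)] = (fn, allowed_consumers)

    def serve(self, name, version, raw, consumer):
        """Apply the stored code to raw data if the usage policy permits it."""
        fn, allowed = self._features[(name, version)]
        if allowed is not None and consumer not in allowed:
            raise PermissionError(f"{consumer} may not use {name} v{version}")
        return fn(raw)

store = FeatureStore()
store.register("basket_total", 1, lambda order: sum(order["prices"]))
store.register(
    "basket_total", 2,
    lambda order: sum(order["prices"]) - order.get("discount", 0.0),
    allowed_consumers={"pricing_model"},
)

order = {"prices": [10.0, 5.0], "discount": 2.0}
v1 = store.serve("basket_total", 1, order, consumer="churn_model")
v2 = store.serve("basket_total", 2, order, consumer="pricing_model")
print(v1, v2)
```

Keeping both versions registered lets a model trained on version 1 keep receiving the values it was trained with while newer models move to version 2.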

Feature stores can be standalone software tools or built into machine learning platforms. For example, Feast[25] is an open source feature store, while platforms like Uber's Michelangelo use feature stores as a component.[26]

References

  1. "Machine Learning and AI via Brain simulations". Stanford University. Retrieved 2019-08-01.
  2. "Discover Feature Engineering, How to Engineer Features and How to Get Good at It". Machine Learning Mastery. 25 September 2014. Retrieved 2015-11-11.
  3. "Feature Engineering: How to transform variables and create new ones?". Analytics Vidhya. 2015-03-12. Retrieved 2015-11-12.
  4. "Q&A with Xavier Conort". kaggle.com. 2013-04-10. Retrieved 12 November 2015.
  5. Domingos, Pedro (2012-10-01). "A few useful things to know about machine learning" (PDF). Communications of the ACM. 55 (10): 78–87. doi:10.1145/2347736.2347755. S2CID 2559675.
  6. "Big Data: Week 3 Video 3 - Feature Engineering". youtube.com.
  7. Jalal, Ahmed Adeeb (January 1, 2018). "Big data and intelligent software systems". International Journal of Knowledge-based and Intelligent Engineering Systems. 22 (3): 177–193. doi:10.3233/KES-180383.
  8. "Feature Engineering" (PDF). 2010-04-22. Retrieved 12 November 2015.
  9. "Feature engineering and selection" (PDF). Alexandre Bouchard-Côté. October 1, 2009. Retrieved 12 November 2015.
  10. "Feature engineering in Machine Learning" (PDF). Zdenek Zabokrtsky. Archived from the original (PDF) on 4 March 2016. Retrieved 12 November 2015.
  11. Knobbe, Arno J.; Siebes, Arno; Van Der Wallen, Daniël (1999). "Multi-relational Decision Tree Induction" (PDF). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. 1704. pp. 378–383. doi:10.1007/978-3-540-48247-5_46. ISBN 978-3-540-66490-1.
  12. "Its all about the features". Reality AI Blog. September 2017.
  13. "A Comparative Study Of Multi-Relational Decision Tree Learning Algorithm". CiteSeerX 10.1.1.636.2932.
  14. Leiva, Hector; Atramentov, Anna; Honavar, Vasant (2002). "Experiments with MRDTL – A Multi-relational Decision Tree Learning Algorithm" (PDF).
  15. Yin, Xiaoxin; Han, Jiawei; Yang, Jiong; Yu, Philip S. (2004). "CrossMine: Efficient Classification Across Multiple Database Relations". Proceedings of the 20th International Conference on Data Engineering. pp. 399–410. doi:10.1109/ICDE.2004.1320014. ISBN 0-7695-2065-0. S2CID 1183403.
  16. Frank, Richard; Moser, Flavia; Ester, Martin (2007). "A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions". Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science. 4702. pp. 430–437. doi:10.1007/978-3-540-74976-9_43. ISBN 978-3-540-74975-2.
  17. "How automated feature engineering works - The most efficient feature engineering solution for relational data and time series". Retrieved 2019-11-21.[promotional source?]
  18. "Automating big-data analysis".
  19. Kanter, James Max; Veeramachaneni, Kalyan (2015). "Deep Feature Synthesis: Towards Automating Data Science Endeavors". 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 1–10. doi:10.1109/DSAA.2015.7344858. ISBN 978-1-4673-8272-4. S2CID 206610380.
  20. "Featuretools | An open source framework for automated feature engineering Quick Start". www.featuretools.com. Retrieved 2019-08-22.
  21. Lam, Hoang Thanh; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017). "One button machine for automating feature engineering in relational databases". arXiv:1706.00327 [cs.DB].
  22. "ExploreKit: Automatic Feature Generation and Selection" (PDF).
  23. Lam, Hoang Thanh; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv:1706.00327 [cs.DB].
  24. "An Introduction to Feature Stores". Retrieved 2021-04-15.
  25. "Feast: Feature Store for Machine Learning". Retrieved 2021-04-15.
  26. "Meet Michelangelo: Uber's Machine Learning Platform". 5 September 2017. Retrieved 2021-04-15.

Further reading

  • Boehmke, Bradley; Greenwell, Brandon (2019). "Feature & Target Engineering". Hands-On Machine Learning with R. Chapman & Hall. pp. 41–75. ISBN 978-1-138-49568-5.
  • Zheng, Alice; Casari, Amanda (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly. ISBN 978-1-4919-5324-2.
  • Zumel, Nina; Mount, John (2020). "Data Engineering and Data Shaping". Practical Data Science with R (2nd ed.). Manning. pp. 113–160. ISBN 978-1-61729-587-4.