Data preprocessing refers to the manipulation or dropping of data before it is used, in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, ensuring the representation and quality of the data comes first, before any analysis is run. Data preprocessing is often the most important phase of a machine learning project, especially in computational biology.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Examples of data preprocessing include cleaning, instance selection, normalization, one-hot encoding, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.
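Two of the steps just listed, normalization and one-hot encoding, can be sketched in plain Python. This is a minimal illustration; the sample values and category names are invented, not data from the article:

```python
# Minimal sketch of two common preprocessing steps: min-max
# normalization and one-hot encoding. Sample values are illustrative.

def min_max_normalize(values):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(labels):
    """Map each categorical label to a 0/1 indicator vector."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories]
            for label in labels]

incomes = [20000, 50000, 80000]
print(min_max_normalize(incomes))              # [0.0, 0.5, 1.0]
print(one_hot_encode(["red", "green", "red"])) # [[0, 1], [1, 0], [0, 1]]
```

Real projects typically use library implementations of these steps rather than hand-rolled helpers, but the underlying transformations are this simple.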
Data preprocessing may affect the way in which the outcomes of the final data processing can be interpreted. This aspect should be carefully considered when interpretation of the results is a key point, such as in the multivariate processing of chemical data (chemometrics).
Tasks of data preprocessing
In this example, the dataset contains five adults, each recorded with a sex (male or female) and a pregnancy status. We can detect that adults 3 and 5 are impossible data combinations (male and pregnant).
We can perform data cleansing and delete such records from the table, on the grounds that their presence in the dataset is caused by user entry errors or data corruption. One reason to delete such data is that impossible combinations would distort calculations or data manipulation in the later steps of the data mining process.
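This deletion step can be sketched in plain Python. The five records below are invented to mirror the example, not the article's actual table:

```python
# Drop records with the impossible combination "male and pregnant".
# The five sample records are illustrative assumptions.
adults = [
    {"id": 1, "sex": "Female", "pregnant": "Yes"},
    {"id": 2, "sex": "Male",   "pregnant": "No"},
    {"id": 3, "sex": "Male",   "pregnant": "Yes"},  # impossible
    {"id": 4, "sex": "Female", "pregnant": "No"},
    {"id": 5, "sex": "Male",   "pregnant": "Yes"},  # impossible
]

def is_possible(record):
    """A male adult cannot be pregnant."""
    return not (record["sex"] == "Male" and record["pregnant"] == "Yes")

cleaned = [r for r in adults if is_possible(r)]
print([r["id"] for r in cleaned])  # [1, 2, 4]
```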
Alternatively, we can perform data editing: knowing that an adult is recorded as pregnant, we can assume that the adult is female and change the sex field accordingly. Editing the dataset in this way gives a cleaner basis for data manipulation in the later steps of the data mining process.
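The editing alternative can be sketched the same way, again with invented records:

```python
# Edit instead of delete: assume any record marked pregnant is female,
# treating the conflicting sex as an entry error. Sample data invented.
adults = [
    {"id": 3, "sex": "Male",   "pregnant": "Yes"},
    {"id": 4, "sex": "Female", "pregnant": "No"},
]

for record in adults:
    if record["pregnant"] == "Yes":
        record["sex"] = "Female"  # correct the presumed entry error

print(adults[0]["sex"])  # Female
```

Whether to delete or edit depends on how confident one is about which field is wrong; editing preserves sample size but bakes an assumption into the data.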
We can also use a form of data reduction and sort the data by sex; this simplifies the dataset and lets us choose which group to focus on.
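A sketch of that reduction step, sorting and then restricting to one group (records invented for illustration):

```python
# Sort by sex, then keep only one group to focus the analysis on.
# Sample records are illustrative assumptions.
adults = [
    {"id": 2, "sex": "Male"},
    {"id": 1, "sex": "Female"},
    {"id": 4, "sex": "Female"},
]

by_sex = sorted(adults, key=lambda r: r["sex"])
females = [r for r in by_sex if r["sex"] == "Female"]
print([r["id"] for r in females])  # [1, 4]
```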
Data preprocessing originated in data mining, where the idea is to aggregate existing information and search its content. It was later recognized that machine learning and neural networks also require a preprocessing step, and it has since become a universal technique used in computing in general.
Data preprocessing allows for the removal of unwanted data through data cleaning, leaving the user with a dataset that contains more valuable information for data manipulation later in the data mining process. Editing a dataset to correct data corruption or human error is a crucial step for obtaining accurate quantifiers such as the true positives, true negatives, false positives and false negatives of a confusion matrix, commonly used in medical diagnosis. Users can join data files together and use preprocessing to filter unnecessary noise from the data, which can allow for higher accuracy. A common workflow uses Python scripts with the pandas library, which can import data from a comma-separated values (CSV) file as a data frame; the data frame is then used for manipulations that would otherwise be challenging in a spreadsheet such as Excel. pandas is a powerful tool for data analysis and manipulation that makes data visualization, statistical operations and much more considerably easier. Many users perform the same tasks in the R programming language.
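The pandas workflow described above can be sketched as follows. The CSV content and column names are invented for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io
import pandas as pd  # third-party library: pip install pandas

# Invented CSV content standing in for a file on disk.
csv_text = io.StringIO(
    "id,sex,pregnant\n"
    "1,Female,Yes\n"
    "2,Male,No\n"
    "3,Male,Yes\n"
)

# read_csv parses the comma-separated values into a DataFrame.
df = pd.read_csv(csv_text)

# Filter out the impossible combination, as described earlier.
df = df[~((df["sex"] == "Male") & (df["pregnant"] == "Yes"))]
print(df["id"].tolist())  # [1, 2]
```

In practice one would pass a file path to `pd.read_csv` instead of a `StringIO` buffer; everything else stays the same.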
Users transform existing files into new ones for many reasons. Data preprocessing aims to fill in missing values, aggregate information, label data with categories (data binning) and smooth trajectories. More advanced techniques such as principal component analysis and feature selection rely on statistical formulas and are applied to complex datasets, such as those recorded by GPS trackers and motion capture devices.
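Two of these objectives, filling in missing values and data binning, can be sketched in plain Python. The age values, the mean-imputation choice, and the bin thresholds are all illustrative assumptions:

```python
# Fill missing values with the column mean, then bin into categories.
# Sample ages, imputation strategy and bin edges are all invented.
ages = [23, None, 35, 41, None, 58]

known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)  # simple mean imputation
filled = [a if a is not None else mean_age for a in ages]

def bin_age(age):
    """Label an age with a coarse category (data binning)."""
    if age < 30:
        return "young"
    elif age < 50:
        return "middle"
    return "senior"

print([bin_age(a) for a in filled])
# ['young', 'middle', 'middle', 'middle', 'middle', 'senior']
```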
Semantic data preprocessing
Complex problems call for more elaborate techniques for analyzing existing information. Instead of writing a simple script that aggregates different numerical values into one, it makes sense to focus on semantics-based data preprocessing. The idea is to build a dedicated ontology that explains, on a higher level, what the problem is about; Protégé is the standard tool for this purpose. A second, more advanced technique is fuzzy preprocessing, in which numerical values are grounded with linguistic information and raw data are transformed into natural language.
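A crisp simplification of that grounding step can be sketched in Python. Real fuzzy preprocessing assigns graded membership degrees rather than hard thresholds; the labels and cut-off values here are invented for illustration:

```python
# Ground raw numeric sensor values in linguistic terms, a crisp
# simplification of fuzzy preprocessing. Thresholds and labels
# are illustrative assumptions.
def to_linguistic(temperature_c):
    """Map a temperature reading to a natural-language label."""
    if temperature_c < 10:
        return "cold"
    elif temperature_c < 25:
        return "mild"
    return "hot"

readings = [4.0, 18.5, 31.2]
print([to_linguistic(t) for t in readings])  # ['cold', 'mild', 'hot']
```

A genuinely fuzzy variant would replace the hard thresholds with overlapping membership functions, so a reading of 24 °C could be, say, 0.8 "mild" and 0.2 "hot" at the same time.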
- Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Los Altos, California.
- Chicco D (December 2017). "Ten quick tips for machine learning in computational biology". BioData Mining. 10 (35): 35. doi:10.1186/s13040-017-0155-3. PMC 5721660. PMID 29234465.
- Oliveri, Paolo; Malegori, Cristina; Simonetti, Remo; Casale, Monica (2019). "The impact of signal preprocessing on the final interpretation of analytical outcomes – A tutorial". Analytica Chimica Acta. 1058: 9–17. doi:10.1016/j.aca.2018.10.055. PMID 30851858.
- Culmone, Rosario; Falcioni, Marco; Quadrini, Michela (2014). An ontology-based framework for semantic data preprocessing aimed at human activity recognition. SEMAPRO 2014: The Eighth International Conference on Advances in Semantic Processing. Alexey Cheptsov, High Performance Computing Center Stuttgart (HLRS). S2CID 196091422.
- Perez-Rey, David; Anguita, Alberto; Crespo, Jose (2006). OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data. Biological and Medical Data Analysis. Springer Berlin Heidelberg. pp. 262–272. doi:10.1007/11946465_24.
- Fernandez, F. Mary Harin; Ponnusamy, R. (2016). "Data Preprocessing and Cleansing in Web Log on Ontology for Enhanced Decision Making". Indian Journal of Science and Technology. Indian Society for Education and Environment. 9 (10). doi:10.17485/ijst/2016/v9i10/88899.