Noisy data

From Wikipedia, the free encyclopedia

Noisy data is data that is corrupted or distorted, or that has a low signal-to-noise ratio. Improper (or improperly documented) procedures for subtracting noise from data can create a false sense of accuracy or lead to false conclusions.

Definition

Data = true signal + noise.[1] Noisy data is data that contains a large amount of additional meaningless information, called noise.[2] The term is often used as a synonym for corrupt data.[2] It also covers data that machines cannot understand and interpret correctly, such as unstructured text. If not handled properly, noisy data can adversely affect the results of any data analysis and skew its conclusions. Statistical analysis is sometimes used to weed the noise out of noisy data.[2]
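As a minimal sketch of the additive model above, the following Python snippet simulates repeated readings of a single true value and compares the error of one reading with the error of their mean. The true value, the noise level, the Gaussian noise shape, and the function name `noisy_measurements` are all assumptions chosen for illustration; real noise need not be Gaussian or zero-mean.

```python
import random
import statistics

def noisy_measurements(true_value, noise_std, n, seed=0):
    """Return n simulated readings: true_value plus zero-mean Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_value + rng.gauss(0.0, noise_std) for _ in range(n)]

TRUE_VALUE = 10.0  # hypothetical quantity being measured
readings = noisy_measurements(TRUE_VALUE, noise_std=2.0, n=1000)

# Error of a single reading versus error of the average of all readings.
single_error = abs(readings[0] - TRUE_VALUE)
mean_error = abs(statistics.fmean(readings) - TRUE_VALUE)

print(f"error of one reading:      {single_error:.3f}")
print(f"error of the mean of 1000: {mean_error:.3f}")
```

Averaging helps here only because the simulated noise is independent and averages to zero. A systematic offset in the measuring tool would survive averaging unchanged.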

Sources of noise

Figure: In this example of an outlier and filtering, point t2 is an outlier. The smooth transitions to and from the outlier are produced by the filter; they are not valid data but additional noise. Presenting filtered results (the smoothed transitions) as actual measurements can lead to false conclusions.
Figure: This type of filter (a moving average) shifts the data to the right, that is, the filtered curve lags the input. The moving-average price at a given time is usually quite different from the actual price at that time.

Differences between real-world measured data and the true values arise from multiple factors affecting the measurement.[3]

Random noise is often a large component of the noise in data.[4] Random noise in a signal is quantified by the signal-to-noise ratio. Random noise that contains nearly equal amounts of a wide range of frequencies is called white noise (by analogy with colors of light combining to make white). Random noise is an unavoidable problem. It affects the data collection and data preparation processes, where errors commonly occur. Noise has two main sources: errors introduced by measurement tools, and random errors introduced by processing or by experts when the data is gathered.[5]
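The signal-to-noise ratio mentioned above is commonly defined as the ratio of mean signal power to mean noise power, often expressed in decibels. The sketch below computes it for a simulated sine wave with added Gaussian noise. The waveform, the noise level, and the helper names `mean_power` and `snr_db` are illustrative assumptions. The signal and noise can be separated this cleanly only because both are generated here; with real measurements the two usually have to be estimated.

```python
import math
import random

def mean_power(samples):
    """Mean of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical true signal: exactly 5 cycles of a unit-amplitude sine wave.
signal = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
# Hypothetical random noise: zero-mean Gaussian, standard deviation 0.1.
noise = [rng.gauss(0.0, 0.1) for _ in range(n)]
# What an instrument would record under the additive model.
measured = [s + e for s, e in zip(signal, noise)]

print(f"SNR: {snr_db(signal, noise):.1f} dB")
```

With these settings the signal power is 0.5 and the noise power is close to 0.01, so the ratio comes out near 17 dB. A lower value means the noise accounts for a larger share of what was measured.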

Improper filtering can add noise if the filtered signal is treated as if it were a directly measured signal. For example, convolution-type digital filters such as a moving average can have side effects such as lag or truncation of peaks. Differentiating digital filters amplify random noise in the original data.
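The side effects described above can be reproduced with a small sketch. It applies a trailing three-point moving average to a series with one sharp peak, and a first difference (a very simple differentiating filter) to a constant value with a small alternating ripple. The series, the window size, and the function names are hypothetical choices for illustration, not a recommended filter design.

```python
def trailing_moving_average(values, window):
    """Average each point with the window-1 points before it.

    The first few outputs use a shorter window because earlier
    points do not exist.
    """
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def first_difference(values):
    """Difference between each point and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

# A single sharp peak of height 10 at index 5.
raw = [0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0]
smooth = trailing_moving_average(raw, window=3)
print("raw:   ", raw)
print("smooth:", [round(v, 2) for v in smooth])

# A constant true value of 1.0 with a ripple of +/-0.1 standing in for noise.
ripple = [1.0 + (0.1 if i % 2 == 0 else -0.1) for i in range(8)]
diffs = first_difference(ripple)
print("largest ripple deviation:", max(abs(v - 1.0) for v in ripple))
print("largest difference:      ", max(abs(d) for d in diffs))
```

The filtered peak is cut from 10 to about 3.33 and spread over indices 5 to 7, so its center sits one sample later than the original. The true rate of change of the constant series is zero, yet the differences have roughly twice the amplitude of the ripple, so the entire differentiated output is noise.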

Outlier data is data that appears not to belong in the data set. It can be caused by human error, such as transposing numerals or mislabeling, or by programming bugs. If actual outliers are not removed from the data set, they corrupt the results to a degree that depends on the circumstances. If valid data is identified as an outlier and mistakenly removed, that also corrupts the results.
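One way to find candidate outliers, sketched below, is the modified z-score based on the median and the median absolute deviation (MAD). This is a common rule of thumb and only one of many possible methods. The cutoff of 3.5, the function name `flag_outliers`, and the sample readings (in which 21.1 stands in for a transposed 12.1) are assumptions for illustration. The function only flags points for review; it deletes nothing, since removing valid data also corrupts results.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        # No spread to scale by, so this rule cannot rank the points.
        return []
    return [
        i for i, x in enumerate(values)
        if 0.6745 * abs(x - med) / mad > threshold
    ]

# Hypothetical readings; index 4 looks like 12.1 with its digits transposed.
readings = [12.1, 11.9, 12.3, 12.0, 21.1, 12.2, 11.8]
suspects = flag_outliers(readings)
print("indices to review:", suspects)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value shifts them far less. Whether a flagged point is a real error still has to be decided from how the data was collected.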

Fraud: Individuals may deliberately skew data to influence the results toward a desired conclusion. Data that looks good, with few outliers, reflects well on the individual collecting it, so there may be an incentive to remove more data as outliers or to make the data look smoother than it is.

References

  1. ^ "What is the basic difference between noise and outliers in Data mining? - Quora". www.quora.com.
  2. ^ a b c "What is noisy data? - Definition from WhatIs.com".
  3. ^ "Noisy Data in Data Mining - Soft Computing and Intelligent Information Systems". sci2s.ugr.es.
  4. ^ R.Y. Wang, V.C. Storey, C.P. Firth, A Framework for Analysis of Data Quality Research, IEEE Transactions on Knowledge and Data Engineering 7 (1995) 623-640 doi: 10.1109/69.404034
  5. ^ X. Zhu, X. Wu, Class Noise vs. Attribute Noise: A Quantitative Study, Artificial Intelligence Review 22 (2004) 177-210 doi: 10.1007/s10462-004-0751-8