From Wikipedia, the free encyclopedia

In econometrics and statistics, a top-coded data set is one in which values above an upper bound are censored: they are recorded only as exceeding the bound, not at their actual value. This is often done to preserve the anonymity of survey participants. For example, if a survey included a person with wealth of $79 billion, that record would not be anonymous, because readers could infer that the person is very likely Bill Gates.

Example: Top-coding of wealth at 30,000

id   age   actual wealth   wealth variable in data set
1    26    24,778          24,778
2    32    26,750          26,750
3    45    26,780          26,780
4    64    35,469          30,000+
5    27    43,695          30,000+
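
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not a standard routine: the function name `top_code` and the choice to return a separate censoring flag are illustrative assumptions.

```python
def top_code(values, cap):
    """Top-code a list of numbers at `cap` (hypothetical helper).

    Returns the recorded values, with anything above the cap replaced by
    the cap, and a parallel list of flags marking the censored entries.
    """
    recorded = [min(v, cap) for v in values]
    censored = [v > cap for v in values]
    return recorded, censored


# Actual wealth values from the example table.
actual_wealth = [24778, 26750, 26780, 35469, 43695]
recorded, censored = top_code(actual_wealth, 30000)

# Render the data-set column, appending "+" to censored entries.
labels = [f"{r:,}+" if c else f"{r:,}" for r, c in zip(recorded, censored)]
print(labels)
```

Keeping the flag alongside the recorded value lets later analysis tell a true value of 30,000 apart from a censored one.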

Jacob S. Hacker and Paul Pierson argue that the practice of top-coding, or capping the reported maximum value on tax returns ostensibly to protect the earner's anonymity, complicates the analysis of the distribution of wealth in the United States.[1]

Implications for ordinary least squares

  • If the lower bound of the top-coded group (30,000 in the example above) is substituted for the censored values and the variable is used as a regressor, OLS is biased and inconsistent.
  • The top-coded group can be omitted from the regression entirely. Provided there are no systematic differences between the omitted group and the included groups, OLS is consistent and unbiased.
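  • The Tobit procedure accounts for top-coding when the censored variable is the dependent variable. Under its assumptions (normally distributed, homoskedastic errors) it gives consistent estimates.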
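
The first two points can be illustrated with a small simulation. This is a sketch under assumed conditions, none of which come from the source: a uniformly distributed regressor, a true slope of 2, normal noise, a cap of 70, and a hand-written one-variable OLS slope function `ols_slope`.

```python
import random


def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SLOPE = 2.0         # assumed data-generating slope
CAP = 70.0               # assumed top-coding threshold

x = [rng.uniform(0, 100) for _ in range(5000)]
y = [1.0 + TRUE_SLOPE * xi + rng.gauss(0, 5) for xi in x]

# Option 1: use the cap itself as the regressor value for censored cases.
x_capped = [min(xi, CAP) for xi in x]
slope_capped = ols_slope(x_capped, y)

# Option 2: omit the top-coded observations entirely.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi <= CAP]
slope_dropped = ols_slope([k[0] for k in kept], [k[1] for k in kept])

print("slope with capped regressor:", round(slope_capped, 3))
print("slope after dropping top-coded cases:", round(slope_dropped, 3))
```

In this setup the capped regression should overstate the slope, because all observations above the cap are piled at 70 while their outcomes keep rising. Dropping them should recover a slope close to 2, since the omitted group follows the same linear relationship as the included one.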

Further reading


  1. Hacker, Jacob S. and Paul Pierson (2010). Winner-Take-All Politics: How Washington Made the Rich Richer--And Turned Its Back on the Middle Class. Simon & Schuster. p. 13. ISBN 978-1-4165-8869-6.