Top-coded

In econometrics and statistics, a top-coded data set is one in which values of a variable above an upper bound are censored: they are recorded only as being at or above the bound. This is often done to preserve the anonymity of survey participants. For example, if a survey included a person with wealth of $51 billion, the record would not be anonymous, because readers could infer that the person is Bill Gates.

Example: Top-coding of wealth at 30,000

id	age	actual wealth	wealth variable in data set
1	26	24,778	24,778
2	32	26,750	26,750
3	45	26,780	26,780
4	64	35,469	30,000+
5	27	43,695	30,000+
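
The table above can be reproduced with a few lines of code. The following is a minimal sketch, not a reference implementation: the helper name `top_code` and the choice to return a separate list of censoring flags are assumptions made for illustration.

```python
def top_code(values, cap):
    """Replace every value above `cap` with `cap` and flag which ones were censored."""
    coded = [min(v, cap) for v in values]
    censored = [v > cap for v in values]
    return coded, censored


# Actual wealth for ids 1 to 5 in the example table.
wealth = [24_778, 26_750, 26_780, 35_469, 43_695]
coded, censored = top_code(wealth, 30_000)

for actual, value, flag in zip(wealth, coded, censored):
    # Censored entries are shown with a trailing "+", as in the table.
    label = f"{value:,}+" if flag else f"{value:,}"
    print(f"{actual:>7,} -> {label}")
```

Keeping the flags alongside the coded values lets later analysis distinguish a true value equal to the bound from a censored one.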

Jacob S. Hacker and Paul Pierson argue that the practice of top-coding, or capping the reported maximum value on tax returns ostensibly to protect the earner's anonymity, complicates the analysis of the distribution of wealth in the United States.^[1]

Implications for ordinary least squares

If the lower bound of the top-coded group (30,000 in the example above) is used as the regressor value for every top-coded observation, OLS is biased and inconsistent.
The top-coded group can be omitted from the regression entirely. Provided there are no systematic differences between the omitted group and the included groups, OLS is consistent and unbiased.
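When the top-coded variable is the dependent variable, the Tobit model (Tobin 1958) accounts for the censoring explicitly. Its maximum-likelihood estimates are consistent, provided the model's assumptions of normally distributed, homoskedastic errors hold.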
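
The first two points can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a general result: the data-generating process (a standard normal regressor, a true slope of 2, standard normal errors), the bound of 0.5, and the helper `ols_slope` are all hypothetical choices made for the example. It uses only the Python standard library and does not implement Tobit.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is repeatable
n, cap, true_slope = 20_000, 0.5, 2.0

# Assumed data-generating process: y = 1 + 2x + noise.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + rng.gauss(0, 1) for xi in x]

# Top-code the regressor: values above the bound are recorded as the bound.
x_capped = [min(xi, cap) for xi in x]

# Option 1: use the bound as the regressor value for top-coded rows.
slope_capped = ols_slope(x_capped, y)

# Option 2: omit the top-coded rows entirely.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi <= cap]
slope_dropped = ols_slope([k[0] for k in kept], [k[1] for k in kept])

print(f"true slope:             {true_slope:.2f}")
print(f"bound used as value:    {slope_capped:.2f}")
print(f"top-coded rows dropped: {slope_dropped:.2f}")
```

In this setup, using the bound as the regressor value tends to overstate the slope, because the top-coded rows have outcomes generated by regressor values larger than the bound. Dropping those rows selects on the regressor only, so the estimate should stay close to the true slope.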

References

[1] Hacker, Jacob S. and Paul Pierson (2010). Winner-Take-All Politics: How Washington Made the Rich Richer--And Turned Its Back on the Middle Class. Simon & Schuster. p. 13. ISBN 978-1-4165-8869-6.

Tobin, James (1958). "Estimation of Relationships for Limited Dependent Variables". Econometrica 26 (1), 24–36.

