In econometrics and statistics, a top-coded dataset is one in which values above an upper bound are censored: any value above the bound is reported only as being at or above it, so its exact value is unknown. This is often done to preserve the anonymity of survey participants. For example, if a survey recorded a person with wealth of $51 billion, that record would not be anonymous, because readers could infer that the person is Bill Gates.
Example: Top-coding of wealth
| id | age | income | coding |
|----|-----|--------|--------|
| 1 | 26 | 24778 | exact value |
| 2 | 32 | 26750 | exact value |
| 3 | 45 | 26780 | exact value |
| 4 | 32 | 30000+ | top coded |
| 5 | 45 | 30000+ | top coded |
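As a minimal sketch of how a table like this could be produced, the following Python snippet applies a top-code of 30000 to a list of incomes. The function name `top_code` and the two highest incomes are hypothetical, invented purely for illustration; only the first three values and the threshold come from the example.

```python
TOP_CODE = 30000  # threshold taken from the example table


def top_code(value, threshold=TOP_CODE):
    """Return (reported_value, is_top_coded) for a single observation."""
    if value >= threshold:
        # The exact value is discarded; only the threshold is reported.
        return threshold, True
    return value, False


# The last two incomes are made up: the real values behind "30000+"
# are unknown by construction.
true_incomes = [24778, 26750, 26780, 41200, 250000]
reported = [top_code(v) for v in true_incomes]

for row_id, (value, flagged) in enumerate(reported, start=1):
    label = f"{value}+" if flagged else str(value)
    status = "top coded" if flagged else "exact value"
    print(row_id, label, status)
```

After top-coding, the two highest incomes are indistinguishable from each other, which is what protects anonymity and also what discards information about the upper tail.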
Jacob S. Hacker and Paul Pierson argue that the practice of top-coding, or capping the reported maximum value on tax returns ostensibly to protect the earner's anonymity, complicates the analysis of the distribution of wealth in the United States.[1]
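When a top-coded variable enters a regression, the following points apply:
- If the top-coded observations are assigned the lower bound of the top-coded group (30000 in the example above) as their regressor value, OLS is biased and inconsistent.
- The top-coded group can be omitted from the regression entirely. Provided there are no systematic differences between the omitted group and the included groups, OLS is consistent and unbiased, although the smaller sample makes the estimates less precise.
- The Tobit model (Tobin 1958) treats top-coded values of the dependent variable as right-censored rather than exact. Under its assumptions of normally distributed, homoskedastic errors, its maximum-likelihood estimates are consistent.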
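The first two bullets above can be illustrated with a small simulation. The sketch below assumes a linear model with a true slope of 2.0, a normally distributed regressor, and a top-code at 30; all numbers are simulated assumptions, not data from any survey, and `ols_slope` is a helper written for this example. It does not cover the Tobit case.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (intercept included)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(0)
TRUE_SLOPE = 2.0   # assumed for the simulation
THRESHOLD = 30.0   # assumed top-code for the regressor

# Simulate the uncensored regressor and the outcome.
x_true = [random.gauss(25.0, 5.0) for _ in range(20000)]
y = [10.0 + TRUE_SLOPE * x + random.gauss(0.0, 5.0) for x in x_true]

# Option 1: use the threshold as the value for top-coded observations.
x_capped = [min(x, THRESHOLD) for x in x_true]
slope_capped = ols_slope(x_capped, y)

# Option 2: drop the top-coded observations entirely.
kept = [(x, yi) for x, yi in zip(x_true, y) if x < THRESHOLD]
slope_dropped = ols_slope([x for x, _ in kept], [yi for _, yi in kept])

print("capped: ", round(slope_capped, 3))
print("dropped:", round(slope_dropped, 3))
```

With these settings the capped regression tends to overstate the slope, because the top-coded observations have true regressor values above 30 but are all placed at 30. The regression on the retained observations stays close to 2.0.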
References
- Hacker, Jacob S. and Paul Pierson (2010). Winner-Take-All Politics: How Washington Made the Rich Richer--And Turned Its Back on the Middle Class. Simon & Schuster. p. 13. ISBN 978-1-4165-8869-6.
- Tobin, James (1958). "Estimation of relationships for limited dependent variables". Econometrica 26 (1), 24–36.