Ecological fallacy

An ecological fallacy (or ecological inference fallacy^[1]) is a logical fallacy in the interpretation of statistical data in which inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. The term is sometimes also used for the fallacy of division, which is not a statistical issue. This article concentrates on four common statistical ecological fallacies: confusion between ecological correlations and individual correlations, confusion between group average and total average, Simpson's paradox, and confusion between higher average and higher likelihood.

Correlation of Groups vs Correlation of Individuals

An ecological fallacy occurs when a correlation between variables at the individual level is inferred from the correlation of the same variables aggregated over the groups to which those individuals belong.

Examples

Assume that, at the individual level, being Protestant lowers one's tendency to commit suicide, but that the probability that one's neighbor commits suicide raises one's tendency to become Protestant. Then, even if there is a negative correlation between suicidal tendencies and Protestantism at the individual level, there can be a positive correlation at the aggregate level.

Similarly, even if wealth is positively correlated with the tendency to vote Republican at the individual level, wealthier states tend to vote Democratic. For example, in 2004, the Republican candidate, George W. Bush, won the fifteen poorest states, and the Democratic candidate, John Kerry, won 9 of the 11 wealthiest states. Yet 62% of voters with annual incomes over $200,000 voted for Bush, compared with only 36% of voters with annual incomes of $15,000 or less.^[2]

Formal Problem

The correlation of aggregate quantities (or ecological correlation) is not equal to the correlation of individual quantities. Denote by $X_{i},Y_{i}$ two quantities at the individual level. The formula for the covariance of the aggregate quantities in groups of size N is

 $cov(\sum _{jN}^{(j+1)N}Y_{i},\sum _{jN}^{(j+1)N}X_{i})=\sum _{jN}^{(j+1)N}cov(Y_{i},X_{i})+\sum _{i\neq k}cov(Y_{i},X_{k})$

where the last sum runs over all pairs of distinct individuals $i\neq k$ in group $j$. The covariance of two aggregated variables depends not only on the covariance of the two variables within the same individual but also on the covariances of the variables between different individuals. In other words, the correlation of aggregate variables takes into account cross-sectional effects which are not relevant at the individual level.
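The decomposition can be checked numerically. The following is a minimal sketch on simulated data, not taken from any study: the group size, the number of groups, the coefficients and the shared group "shock" are all assumptions chosen for illustration. By construction, the population covariance between $Y_{i}$ and $X_{i}$ for the same individual is negative (−0.5), while the covariance between different individuals of the same group is positive (+0.5), so the cross terms can dominate the aggregate covariance.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

def column(rows, i):
    """Values of individual i across all groups."""
    return [row[i] for row in rows]

random.seed(0)
N, G = 4, 200  # hypothetical: 4 individuals per group, 200 groups

# xs[g][i], ys[g][i]: values for individual i of group g.
# A shock shared by the whole group makes different individuals covary.
xs, ys = [], []
for _ in range(G):
    shock = random.gauss(0, 1)
    x_row = [shock + random.gauss(0, 1) for _ in range(N)]
    y_row = [-1.0 * x + 1.5 * shock + random.gauss(0, 1) for x in x_row]
    xs.append(x_row)
    ys.append(y_row)

# Left-hand side: covariance of the group totals.
aggregate = cov([sum(row) for row in ys], [sum(row) for row in xs])

# Right-hand side: same-individual terms plus cross-individual terms.
own = sum(cov(column(ys, i), column(xs, i)) for i in range(N))
cross = sum(cov(column(ys, i), column(xs, k))
            for i in range(N) for k in range(N) if i != k)

print(aggregate, own, cross)
```

Because the sample covariance is bilinear, `aggregate` equals `own + cross` up to floating-point error, whatever the simulated values are.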

The problem for correlations naturally entails a problem for regressions on aggregate variables: the correlation fallacy is therefore an important issue for a researcher who wants to measure causal impacts. Start with a regression model in which the outcome $Y_{i}$ is affected by $X_{i}$ :

 ${\begin{aligned}Y_{i}&=\alpha +\beta X_{i}+u_{i}\\cov[u_{i},X_{i}]&=0\\\end{aligned}}$

The regression model at the aggregate level is obtained by summing the individual equations:

 ${\begin{aligned}\sum _{jN}^{(j+1)N}Y_{i}&=N\alpha +\beta \sum _{jN}^{(j+1)N}X_{i}+\sum _{jN}^{(j+1)N}u_{i}\\cov[\sum _{jN}^{(j+1)N}u_{i},\sum _{jN}^{(j+1)N}X_{i}]&\neq 0\end{aligned}}$

Nothing prevents the regressors and the errors from being correlated at the aggregate level. Therefore, in general, running a regression on aggregate data does not estimate the same model as running a regression on individual data.

The aggregate model is correct if and only if $\forall i,cov[u_{i},\sum _{jN}^{(j+1)N}X_{i}]=0$ . This means that, controlling for $X_{i}$ , $\sum _{jN}^{(j+1)N}X_{i}$ does not determine $Y_{i}$ . Going back to the religion example, the aggregate model correctly measures Protestants' tendency to commit suicide if and only if, within each religion, one's tendency to commit suicide is not determined by the number of Protestants in one's state.
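The condition can be illustrated with a small noise-free sketch. The data-generating rule below is an assumption made for illustration: each outcome is `alpha + beta * x` plus a contextual term `gamma` times the group mean of `x`, and the names `slopes`, `gamma`, `group_means` and `offsets` are hypothetical. When `gamma` is zero, the group composition does not determine the individual outcome, and the regression on group totals recovers `beta`. When `gamma` is nonzero, the regression on totals returns `beta + gamma` instead.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def slopes(beta, gamma, alpha=1.0,
           group_means=(0.0, 1.0, 2.0, 3.0), offsets=(-1.0, 0.0, 1.0)):
    """Return (within-group slope, slope on group totals).

    Hypothetical data: y = alpha + beta * x + gamma * (group mean of x),
    with no noise, so both slopes are exact up to rounding.
    """
    within_x, within_y, total_x, total_y = [], [], [], []
    for m in group_means:
        xs = [m + d for d in offsets]  # offsets sum to zero, so the mean is m
        ys = [alpha + beta * x + gamma * m for x in xs]
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        # De-meaning within the group controls for the group-level term.
        within_x += [x - mean_x for x in xs]
        within_y += [y - mean_y for y in ys]
        total_x.append(sum(xs))
        total_y.append(sum(ys))
    return ols_slope(within_x, within_y), ols_slope(total_x, total_y)

no_context = slopes(beta=-1.0, gamma=0.0)    # both slopes equal beta
with_context = slopes(beta=-1.0, gamma=3.0)  # totals give beta + gamma
print(no_context, with_context)
```

With these assumed values the individual effect is negative in both cases, while the slope estimated from group totals becomes positive once the contextual term is present.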

Historical Examples of the Fallacy

An early example of the ecological fallacy was Émile Durkheim's 1897 study of suicide in France, although this has been debated by some.^[3]^[4] Another example is a 1950 paper by William S. Robinson that coined the term.^[5] For each of the 48 states and the District of Columbia in the US as of the 1930 census, he computed the illiteracy rate and the proportion of the population born outside the US. He showed that these two figures had a negative correlation of −0.53; in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy. However, when individuals were considered, the correlation was +0.12; that is, immigrants were on average more illiterate than native citizens. Robinson showed that the negative correlation at the level of state populations arose because immigrants tended to settle in states where the native population was more literate. He cautioned against drawing conclusions about individuals on the basis of population-level, or "ecological", data. In 2011, it was found that Robinson's calculations of the ecological correlations were based on the wrong state-level data. The correlation of −0.53 mentioned above is in fact −0.46.^[6]

Choosing between Aggregate and Individual Inference

There is nothing wrong with running regressions on aggregate data if one is interested in the aggregate model. For instance, for a state governor, it is correct to regress the crime rate on police force at the state level if one is interested in the policy implications of an increase in police force. However, an ecological fallacy would occur if a city council deduced the impact of an increase in police force on the crime rate at the city level from the correlation at the state level.

Choosing to run aggregate or individual regressions to understand the aggregate impact of some policy depends on the following trade-off: aggregate regressions lose individual-level data, but individual regressions add strong modeling assumptions. Some researchers suggest that the ecological correlation gives a better picture of the outcome of public policy actions, and thus recommend the ecological correlation over the individual-level correlation for this purpose (Lubinski & Humphreys, 1996). Other researchers disagree, especially when the relationships among the levels are not clearly modeled. To prevent an ecological fallacy, researchers with no individual data can first model what is occurring at the individual level, then model how the individual and group levels are related, and finally examine whether anything occurring at the group level adds to the understanding of the relationship. For instance, in evaluating the impact of state policies, it is helpful to know that policy impacts vary less among the states than do the policies themselves, suggesting that the policy differences are not well translated into results, despite high ecological correlations (Rose, 1973).

Group Average vs Total Average

An ecological fallacy also occurs when the average for a group is approximated by the average in the total population divided by the group's share of the population. Suppose one knows the proportion of Protestants and the suicide rate in the USA, but one does not have data linking religion and suicide at the individual level. If one is interested in the suicide rate of Protestants, it is a mistake to treat the total suicide rate divided by the proportion of Protestants as an unbiased estimate. This estimate implicitly assumes that the suicide rate of the other religions is zero.

Formally, denote by $Y$ the variable of interest and by $E[Y|{\text{Protestant}}]$ the mean of the group. We generally have

 $E[Y|{\text{Protestant}}]\neq {\frac {E[Y]}{P({\text{Protestant}})}}$

However, the law of total expectation says

 ${\begin{aligned}E[Y]={\color {Blue}E[Y|{\text{Protestant}}]}P({\text{Protestant}})+{\color {Blue}E[Y|{\text{Not Protestant}}]}(1-P({\text{Protestant}})).\end{aligned}}$

In this equation, the only quantities we do not know how to estimate are in blue. We know that $E[Y|{\text{Not Protestant}}]$ , as a probability of suicide, is between 0 and 1. We can plug these two bounds and our estimate of the total mean into the equation above to obtain bounds on $E[Y|{\text{Protestant}}]$ .
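The bounding argument can be written out as a short sketch. The function name `bound_group_mean` and the numbers in the example are hypothetical and not drawn from any data set. The function solves the law of total expectation for the group mean at the two extreme values of the other group's mean, then clips the result to the interval from 0 to 1, since the quantity is a probability.

```python
def bound_group_mean(total_mean, share):
    """Bounds on E[Y | group] given E[Y] and P(group), for Y in [0, 1].

    Solves E[Y] = E[Y|group] * share + E[Y|rest] * (1 - share)
    for E[Y|group], with E[Y|rest] set to its extremes 1 and 0.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    lower = (total_mean - (1 - share)) / share  # E[Y|rest] = 1
    upper = total_mean / share                  # E[Y|rest] = 0
    return max(0.0, lower), min(1.0, upper)

# Hypothetical inputs: overall rate 0.0001, group share 0.5.
low, high = bound_group_mean(total_mean=0.0001, share=0.5)
print(low, high)
```

The upper bound is the naive estimate obtained by dividing the total mean by the group's share, which shows why that estimate corresponds to assuming a rate of zero everywhere else.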

Simpson's Paradox

A striking ecological fallacy is Simpson's paradox. Simpson's paradox refers to the fact that, when comparing two populations divided into groups of different sizes, the average of some variable in the first population can be higher in every group and yet lower in the total population. Formally, when each value of Z refers to a different group and X refers to some treatment, it can happen that

 $\forall z,E[Y|Z=z,X=1]>E[Y|Z=z,X=0]{\text{ while }}E[Y|X=1]<E[Y|X=0]$

When $E[Y|Z=z,X=1]-E[Y|Z=z,X=0]$ does not depend on $Z$ , Simpson's paradox is exactly the omitted-variable bias for the regression of $Y$ on $X$ , where the regressor $X$ is a dummy variable and the omitted variable $Z$ is a categorical variable defining a group for each value it takes. The application is striking because the bias is large enough that the parameters have opposite signs.
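A small numerical instance makes the reversal concrete. The counts below are hypothetical, chosen only so that the treated units ($X=1$) are concentrated in the group with the lower success rate, and the labels `z0` and `z1` are illustrative names for the two values of $Z$.

```python
from fractions import Fraction

# Hypothetical counts: (successes, trials) for each (group z, treatment x).
counts = {
    ("z0", 1): (9, 10),
    ("z0", 0): (80, 100),
    ("z1", 1): (30, 100),
    ("z1", 0): (2, 10),
}

def rate(x, z=None):
    """Success rate E[Y | X=x] or, if z is given, E[Y | Z=z, X=x]."""
    cells = [cell for (group, treatment), cell in counts.items()
             if treatment == x and (z is None or group == z)]
    successes = sum(s for s, _ in cells)
    trials = sum(n for _, n in cells)
    return Fraction(successes, trials)

# Treatment looks better inside every group...
better_within = {z: rate(1, z) > rate(0, z) for z in ("z0", "z1")}
# ...but worse once the groups are pooled.
worse_overall = rate(1) < rate(0)

print(better_within, rate(1), rate(0), worse_overall)
```

Within each group the treated rate exceeds the untreated rate (9/10 against 8/10, and 3/10 against 2/10), yet the pooled rates are 39/110 for the treated and 82/110 for the untreated.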

Mean and median

A fourth example of ecological fallacy occurs when the average of a population is assumed to have an interpretation in terms of likelihood at the individual level.

For instance, if the average score of group A is greater than zero, it does not follow that a random individual of group A is more likely than not to have a positive score. Similarly, if a particular group of people is measured to have a lower average IQ than the general population, it is an error to conclude that a randomly selected member of the group is more likely than not to have an IQ below the general population's average. Mathematically, this comes from the fact that a distribution can have a positive mean but a negative median. This property is linked to the skewness of the distribution.

Consider the following numerical example:

Group A: 80% of people got 40 points and 20% of them got 95 points. The average score is 51 points.
Group B: 50% of people got 45 points and 50% got 55 points. The average score is 50 points.
If we pick one person at random from each of A and B, there are 4 possible outcomes:
- A - 40, B - 45 (B wins, 40% probability)
- A - 40, B - 55 (B wins, 40% probability)
- A - 95, B - 45 (A wins, 10% probability)
- A - 95, B - 55 (A wins, 10% probability)
Although Group A has a higher average score, 80% of the time a random individual of A will score lower than a random individual of B.
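The figures in this example can be reproduced by enumerating the four outcomes. This is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Score distributions from the example: score -> probability.
group_a = {40: Fraction(4, 5), 95: Fraction(1, 5)}
group_b = {45: Fraction(1, 2), 55: Fraction(1, 2)}

def mean(dist):
    """Expected score of a distribution given as score -> probability."""
    return sum(score * p for score, p in dist.items())

# Probability that an independent random member of B outscores one of A.
p_b_wins = sum(
    pa * pb
    for (a, pa), (b, pb) in product(group_a.items(), group_b.items())
    if b > a
)

print(mean(group_a), mean(group_b), p_b_wins)  # 51 50 4/5
```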

Legal Applications

The ecological fallacy was discussed in a court challenge to the 2004 Washington gubernatorial election, in which a number of illegal voters were identified after the election; their votes were unknown, because the vote was by secret ballot. The challengers argued that illegal votes cast in the election would have followed the voting patterns of the precincts in which they had been cast, and thus that adjustments should be made accordingly.^[7] An expert witness said this approach was like trying to figure out Ichiro Suzuki's batting average by looking at the batting average of the entire Seattle Mariners team, since the illegal votes were cast by an unrepresentative sample of each precinct's voters, and might be as different from the average voter in the precinct as Ichiro was from the rest of his team.^[8] The judge determined that the challengers' argument was an ecological fallacy, and rejected it.^[9]

References

Rose, D. D. (1973). National and local forces in state politics: The implications of multi-level policy analysis. American Political Science Review, volume 67, no. 4, pages 1162-1173.

Lubinski, D., & Humphreys, L. G. (1996). Seeing the forest from the trees: When predicting the behavior or status of groups, correlate means. Psychology, Public Policy, and Law, volume 2, pages 363-376.

[1] Charles Ess; Fay Sudweeks (2001). Culture, technology, communication: towards an intercultural global village. SUNY Press. p. 90. ISBN 978-0-7914-5015-4. The problem lies with the 'ecological fallacy' (or fallacy of division)—the impulse to apply group or societal level characteristics onto individuals within that group.

[2] Gelman, Andrew; Park, David; Shor, Boris; Bafumi, Joseph; Cortina, Jeronimo (2008). Red State, Blue State, Rich State, Poor State. Princeton University Press. ISBN 978-0-691-13927-2.

[3] Freedman, David A. 2002. The Ecological Fallacy. University of California.

[4] H. C. Selvin. 1965. "Durkheim's Suicide: Further Thoughts on a Methodological Classic", in R. A. Nisbet (ed.) Émile Durkheim pp. 113-136

[5] Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review. 15 (3): 351–357. doi:10.2307/2087176. JSTOR 2087176.

[6] The research note on this curious data glitch is published in the International Journal of Epidemiology (http://ije.oxfordjournals.org/content/early/2011/05/24/ije.dyr081.full%20). The data Robinson used and the corrections are available at http://www.ru.nl/mt/rob/downloads/

[7] George Howland Jr. (May 18, 2005). "The Monkey Wrench Trial: Dino Rossi's challenge of the 2004 election is on shaky legal ground. But if he prevails, watch litigation become an option in close races everywhere". Seattle Weekly.

[8] Christopher Adolph (May 12, 2005). "Report on the 2004 Washington Gubernatorial Election". Expert witness report to the Chelan County Superior Court in Borders et al v. King County et al.

[9] Borders et al. v. King County et al., transcript of the decision by Chelan County Superior Court Judge John Bridges, June 6, 2005, published: June 8, 2005
