Talk:G-test

WikiProject Statistics (Rated Start-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

Feasibility of Fisher Exact test

Before writing the words below, I ran several such calculations using this web-based application: http://home.clara.net/sisa/twoby2.htm with the Firefox browser: for examples in which all cells had values between 10,000 and 20,000 it took about 30 seconds to finish the calculations.

For example, a laptop with a 1.7 GHz Pentium and 1 GB of RAM, specifications not considered particularly high-end in 2006, can readily handle cases of the Fisher exact test in which each cell's value is around 10,000, using commonly available statistical software.
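To give a rough sense of why tables of this size are tractable, here is a minimal pure-Python sketch of a two-sided Fisher exact test for a 2×2 table. It is not the method used by the web application or by any particular statistical package; the function names `log_choose` and `fisher_exact_two_sided` and the relative tolerance are assumptions made for illustration. Working in log space with `math.lgamma` avoids overflow when cell counts are in the tens of thousands.

```python
from math import lgamma, exp


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first column total
    n = r1 + r2
    log_denom = log_choose(n, c1)

    def log_p(x):
        # Log-probability that the top-left cell equals x, margins fixed.
        return log_choose(r1, x) + log_choose(r2, c1 - x) - log_denom

    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible range of the top-left cell
    log_obs = log_p(a)
    tol = 1e-7                     # assumed tolerance for floating-point ties

    total = 0.0
    for x in range(lo, hi + 1):
        lp = log_p(x)
        if lp <= log_obs + tol:
            total += exp(lp)
    return min(1.0, total)


# Cells of roughly 10,000 each need only a few tens of thousands of loop steps.
p_large = fisher_exact_two_sided(10000, 12000, 11000, 10500)
```

The loop runs once per feasible value of the top-left cell, so the cost grows linearly with the smaller margin rather than with the number of possible tables.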

Reverted as off topic; not really about the G-test. Pete.Hurd 17:31, 31 July 2006 (UTC)

similarity to Kullback-Leibler divergence

Do the G-test and the Kullback-Leibler divergence measure the same thing, just from different points of view?

It seems that they are the same thing. --Memming (talk) 18:58, 8 April 2009 (UTC)
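A small numerical check may help here. With observed proportions $o_i = O_i/N$ and expected proportions $e_i = E_i/N$, the article's formula gives $G = 2N \sum_i o_i \ln(o_i/e_i)$, which is $2N$ times the Kullback-Leibler divergence of the observed distribution from the expected one. The sketch below assumes the observed and expected counts share the same total $N$; the function names and the example counts are made up for illustration.

```python
from math import log


def g_statistic(observed, expected):
    """G = 2 * sum(O_i * ln(O_i / E_i)), skipping cells with O_i = 0."""
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats, skipping p_i = 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical counts; both lists sum to N = 100.
observed = [30, 50, 20]
expected = [25.0, 50.0, 25.0]
n = sum(observed)

g = g_statistic(observed, expected)
kl = kl_divergence([o / n for o in observed], [e / n for e in expected])

# The two quantities agree up to the factor 2N.
print(g, 2 * n * kl)
```

So the G statistic is a rescaled KL divergence rather than a different quantity; the scaling by $2N$ is what makes it comparable to a chi-squared distribution.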

G^2

Note that the "G-test" is referred to as the G^2 (g-squared) test (at least in psychology-related statistics).

Humph. I've never seen that - please give a reference. seglea 23:29, 22 July 2005 (UTC)

To name a few references to G^2 in psychological stats (this is common in multinomial modeling work in the memory literature and is becoming more common in fitting other types of models as well):

Dodson, Holland, & Shimamura (1998). Using Excel to estimate parameters from observed data: An example from source memory data. Behavior Research Methods, Instruments, & Computers, 30(3), 517-526.

Batchelder & Riefer (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6(1), 57-86.

Bayen, Murnane, & Erdfelder (1996). Source Discrimination, Item Detection, and Multinomial Models of Source Monitoring. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(1), 197-215.

Erdfelder & Buchner (1998). Process-Dissociation Measurement Models: Threshold Theory or Detection Theory? Journal of Experimental Psychology: General, 127(1), 83-96.

fisher.g.test in GeneTS not G-test as described?

fisher.g.test implemented in GeneTS is an exact test for whether a time series is different from Gaussian white noise, not the alternative to the chi-square test as described.

Where does the 2 come from?

I've been trying to work out how Pearson's formula is an approximation for this test.

$\ln(1 + x) \approx x$ (for small x)

$G = 2 \sum_i O_i \ln (O_i / E_i)$

$= 2 \sum_i O_i \ln (1 + \frac{O_i-E_i}{E_i})$

$\approx 2 \sum_i O_i \frac{O_i-E_i}{E_i}$

$= 2 \sum_i \frac{O_i}{E_i} (O_i - E_i) - 2 \sum_i (O_i - E_i)$ (since $\sum_i (O_i - E_i) = 0$)

$= 2 \sum_i \left[ \frac{O_i}{E_i} (O_i - E_i) - (O_i - E_i) \right]$

$= 2 \sum_i \frac{O_i - E_i}{E_i} (O_i - E_i)$

$= 2 \sum_i \frac{(O_i - E_i)^2}{E_i}$

This is the formula for $\chi^2$, except that the factor of 2 is still there. What was my error? Thanks! — ciphergoth 14:11, 3 June 2006 (UTC)

Your approximation for ln(1+x) at the $\approx$ step wasn't good enough; it roughly works for each term, but its errors for positive and negative deviations reinforce each other, which is enough to account for the factor of 2. Taking the $-x^2/2$ term and one more approximation should get you there. --Henrygb 15:01, 9 March 2007 (UTC)

Then please tell me where my error is:

$\ln(1 + x) \approx x - \frac{x^2}{2}$ (for small x)

$G = 2 \sum_i O_i \ln (O_i / E_i)$

$= 2 \sum_i O_i \ln (1 + \frac{O_i-E_i}{E_i} )$

$\approx 2 \sum_i O_i ( \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{2 E_i^2})$

$= 2 \sum_i \frac{O_i}{2} ( 2 \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{E_i^2})$

$= \sum_i O_i ( 1 -1 + 2 \frac{O_i-E_i}{E_i} - \frac{(O_i-E_i)^2}{E_i^2})$

$= \sum_i O_i ( 1 - (1 - \frac{O_i-E_i}{E_i})^2 )$

$= \sum_i O_i - \sum_i O_i ( \frac{O_i}{E_i})^2$

$= n - \sum_i \frac{O_i^3}{E_i^2}$

You made a mistake in the second-to-last equality:
$1 - \frac{O_i-E_i}{E_i} = \frac{2E_i - O_i}{E_i} \neq \frac{O_i}{E_i}$ —Preceding unsigned comment added by 87.69.46.105 (talk) 07:38, 23 February 2008 (UTC)
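
For reference, the second-order expansion suggested above can be carried through as follows. This is only a sketch: the shorthand $\delta_i = O_i - E_i$ is introduced here for convenience and does not appear in the article, and the expansion assumes that every $|\delta_i| / E_i$ is small.

$G = 2 \sum_i (E_i + \delta_i) \ln (1 + \frac{\delta_i}{E_i})$

$\approx 2 \sum_i (E_i + \delta_i) ( \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2 E_i^2} )$

$= 2 \sum_i ( \delta_i + \frac{\delta_i^2}{E_i} - \frac{\delta_i^2}{2 E_i} - \frac{\delta_i^3}{2 E_i^2} )$

$= 2 \sum_i \delta_i + \sum_i \frac{\delta_i^2}{E_i} - \sum_i \frac{\delta_i^3}{E_i^2}$

$\approx \sum_i \frac{(O_i - E_i)^2}{E_i} = \chi^2$ (since $\sum_i \delta_i = 0$ and the cubic term is of higher order in $\delta_i / E_i$)

The first-order term contributes $2 \sum_i \delta_i^2 / E_i$ and the second-order term removes half of it, which is where the extra factor of 2 goes.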

More precise stating of distribution of G under null hypothesis

This sentence should be made more precise:

Given the null hypothesis that the observed frequencies result from random sampling from a distribution with the given expected frequencies, the distribution of G is approximately that of chi-squared, with the same number of degrees of freedom as in the corresponding chi-squared test.

Does it converge in distribution? So does the $\chi^2$ statistic, right? Is the asymptotic rate of convergence quicker for $G$ than for $\chi^2$? I don't have any references on $G$ so I'm afraid I won't be of any help answering these questions.

Andyrew609 19:39, 27 November 2006 (UTC)

splitting of the G statistics

I am currently going through Agresti's Categorical Data Analysis (2002), and on page 82 he gives a clean explanation of how to partition the G statistic (P.S.: be aware that in the 2007 edition of the book most of this section was cut, so don't bother looking for it there).

This partitioning is useful, so it might be worth noting in the article... Talgalili —Preceding unsigned comment added by Talgalili (talk • contribs) 17:38, 5 September 2007 (UTC)

Maybe a squeamish comment about notation

In the G formulae, it would be more correct to put the larger brackets after the summation operator so that they enclose the whole indexed expression being summed.

How to handle zero frequencies in observations?

Since in the formula

$G = 2\sum_{i} {O_i \cdot \ln(O_i/E_i) }$

the logarithm is used, how are terms handled where $O_i = 0$?

I guess you should use Pearson's in that case, as implicitly recommended for small $| O_i - E_i |$: “[T]he approximation to the theoretical chi-square distribution for the G-test is better than for the Pearson chi-squared tests in cases where for any cell |Oi − Ei | > Ei, and in any such case the G-test should always be used.”
l0b0 (talk) 15:45, 16 April 2009 (UTC)

If $O_i = 0$ the term ${O_i \cdot \ln(O_i/E_i) }$ in the sum should be counted as zero. Entropeter (talk) 20:25, 4 February 2012 (UTC)
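
The convention above is consistent with the limit $x \ln x \to 0$ as $x \to 0^+$. A minimal sketch of how an implementation might apply it is shown below; the function name `g_statistic` and the example counts are hypothetical, and rejecting non-positive expected counts is an assumption of this sketch rather than something stated in the article.

```python
from math import log


def g_statistic(observed, expected):
    """G = 2 * sum(O_i * ln(O_i / E_i)), treating cells with O_i = 0 as zero."""
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            # Assumption: an expected count of zero (or less) is not meaningful here.
            raise ValueError("expected counts must be positive")
        if o > 0:
            total += o * log(o / e)
        # When o == 0 the term is counted as zero, so nothing is added.
    return 2.0 * total


# Hypothetical example: the first cell is empty, so only the second contributes.
g = g_statistic([0, 10], [5.0, 5.0])
print(g)  # equals 2 * 10 * ln(2)
```

Skipping the empty cell avoids evaluating `log(0)`, which would otherwise raise an error.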

Better introduction

The opening paragraph never mentions that the "cells" it refers to are contingency table cells. Holopoj (talk) 21:01, 15 July 2013 (UTC)