Chi-square Test for a Contingency Table of Counts


     This bio-recipe shows how to test for independence in a contingency table of counts. To do that we will use a chi-square test which is also known under the name of Pearson's chi-square test. As in the bio-recipe AApairs, we will be using the total number of counts of amino acid pairs in the SwissProt database as an example.

     The null-hypothesis we want to test is that the observed number of pairs of amino acids corresponds to the number of pairs we would expect based on the frequencies of the two amino acid involved in the pair.

     The data we want to use for our null-hypothesis consists of the counts matrix C obtained from the bio-recipe AApairs. So we use the code from AApairs, especially the procedure Size that extracts the counts of the amino acid pairs from the data base.

Size := proc( x:string )
  t := SearchSeqDb(x);
  t[1,2]-t[1,1]+1
end:

define the location of the SwissProt database and load it
SwissProt := '~cbrg/DB/SwissProt.Z':
ReadDb(SwissProt);
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z
Pre-processing input (peptides)
163235 sequences within 163235 entries considered
Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787
 aminoacids)
M := 8:
C := CreateArray(1..M,1..M):
for i to M do
	for j to M do
		aapair := IntToA(i).IntToA(j):
		C[i,j] := Size(aapair):
	od:
od:

print a header line and the count matrix
for i to M do printf( '%7s', IntToAAA(i) ) od;
printf('\n');  print(C);
    Ala    Arg    Asn    Asp    Cys    Gln    Glu    Gly
 469536 246987 165610 234294  67935 182718 302217 342032
 236998 221709 128206 168203  49457 131368 216985 205625
 172817 119157 127122 123057  39732  96894 148390 174090
 238027 156099 125148 177308  46553 104121 226639 220279
  60995  52031  38877  49757  24430  37576  52023  76560
 193952 137005  97651 108404  35500 141753 156841 147336
 315161 221906 180628 218457  51900 161480 345095 238123
 313624 222980 162380 214222  64665 152035 250593 333941

     The test statistic that we want to use for our test is computed as the squared difference between the observed and the expected counts divided by the expected counts, summed over all elements in the matrix. The expected counts under the null-hypothesis correspond to the product of the amino acid frequencies times the total number of amino acid pairs. Let us now start to compute the test statistic by defining a matrix E with expected counts of amino acid pairs.

nrRow := length(C);
nrCol := length(C[1]);
E:= CreateArray(1..nrRow, 1..nrCol):
sumRow := [ seq(sum(C[i]), i=1..nrRow) ];
Ct := transpose(C):
sumCol := [ seq(sum(Ct[i]), i=1..nrCol) ];
sumC := sum(sumCol);
for i to nrRow do
  for j to nrCol do
    E[i,j] := sumRow[i] * sumCol[j] / sumC;
  od;
od;
printf( 'Expected counts of amino acid pairs:
' );
print(E);
nrRow := 8
nrCol := 8
sumRow := [2011329, 1358551, 1001259, 1294174, 392249, 1018442, 1732750, 1714440
]
sumCol := [2001110, 1377874, 1025622, 1293702, 380172, 1007945, 1698783, 1737986
]
sumC := 10523194
Expected counts of amino acid pairs:
 382478.04 263357.11 196030.15 247269.07  72663.39 192651.49 324693.39 332186.37
 258344.57 177884.40 132408.45 167017.75  49080.45 130126.34 219313.96 224375.09
 190401.26 131101.71  97585.70 123092.93  36172.54  95903.77 161635.50 165365.58
 246102.52 169455.08 126134.07 159103.36  46754.69 123960.10 208921.43 213742.74
  74590.79  51359.85  38229.76  48222.37  14170.80  37570.86  63321.64  64782.92
 193668.81 133351.60  99260.41 125205.38  36793.31  97549.61 164409.39 168203.49
 329502.94 226880.85 168879.00 213021.08  62599.15 165968.31 279721.75 286176.92
 326021.08 224483.39 167094.46 210770.08  61937.67 164214.52 276765.93 283152.88

     Now that we have the expected counts for the amino acid pairs, the next step is to compute the test statistic which corresponds to the squared differences between each element of the two matrices of C and E divided by the expected count, and summed over all elements in matrix C. The degree of freedom and the significance level are both important to determine the limit of the test statistic at which we are willing to either accept or reject our null-hypothesis. The level of significance has to be determined by the researcher. The degrees of freedom for this test can be computed as the product of the number of rows and the number of columns each subtracted by one.

testStat := 0:
for i to nrRow do
  for j to nrCol do
    testStat := testStat + (C[i,j] - E[i,j])^2 / E[i,j]:
  od:
od:
df := (nrRow - 1) * (nrCol - 1):
printf( 'Chi-square test statistic: %.2f\n', testStat);
printf( 'Degrees of freedom:        %.0f\n', df);
Chi-square test statistic: 144847.38
Degrees of freedom:        49

     With the test statistic, the degree of freedom, and the significance level which you have to decide on your own, you can get the test decision by looking at any table of the chi-square distribution.

© 2005 by Peter von Rohr, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Fri Sep 2 14:43:55 2005 by PvR