vignettes/samplesizeCMH-introduction.Rmd
samplesizeCMH-introduction.Rmd
The Cochran-Mantel-Haenszel test (CMH) is an inferential test for the association between two binary variables, while controlling for a third confounding nominal variable (Cochran 1954; Mantel and Haenszel 1959). Essentially, the CMH test examines the weighted association of a set of 2 \(\times\) 2 tables. A common odds ratio relating to the test statistic can also be generated (Mantel and Haenszel 1959). The CMH test is a common technique in the field of biostatistics, where it is often used for case-control studies.
This introduction briefly describes some of the terminology and
concepts surrounding stratified tables. Examples are given which show
some basic techniques for working with multidimensional tables in
R
. Functionality of the samplesizeCMH
package
is highlighted where it may augment the analysis.
Consider a contingency table comparing \(X\) and \(Y\) at some fixed level of \(Z\). The cross-section of the three-way table examining only one level of \(Z\) is called a partial table. On the other hand, the combined counts of \(X\) and \(Y\) across all levels of \(Z\), id est a simple two-way contingency table ignoring \(Z\), produce the marginal table. These concepts are described in depth by Agresti (2002, sec. 2.7.1).
We will use the Titanic
dataset in
the datasets
package to illustrate.
This dataset is a four-dimensional table which includes the
Class (1st, 2nd, 3rd, Crew),
Sex (Male, Female), Age (Child, Adult), and
Survival (No, Yes) of the passengers of the 1912 maritime
disaster. Use help("Titanic", "datasets")
to find more information.
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
For this illustration, we will remove the age dimension,
transforming the four-dimensional table into a three-dimensional table.
Let \(X\) = sex, \(Y\) = survival, and \(Z\) = class. This dimensionality reduction
is accomplished using the margin.table()
function in the base
package.
partial_tables <- margin.table(Titanic, c(2,4,1))
partial_tables
## , , Class = 1st
##
## Survived
## Sex No Yes
## Male 118 62
## Female 4 141
##
## , , Class = 2nd
##
## Survived
## Sex No Yes
## Male 154 25
## Female 13 93
##
## , , Class = 3rd
##
## Survived
## Sex No Yes
## Male 422 88
## Female 106 90
##
## , , Class = Crew
##
## Survived
## Sex No Yes
## Male 670 192
## Female 3 20
Each of the tables above is a partial table: survival by sex at a
fixed level of class. The tables can be flattened for easier
viewing using the ftable()
function in
the stats
package (not shown).
The code below shows the marginal table of survival by sex, ignoring
class. Again the dimensionality is reduced using the margin.table()
function.
marginal_table <- margin.table(Titanic, c(2,4))
marginal_table
## Survived
## Sex No Yes
## Male 1364 367
## Female 126 344
As an aside, we may get the table, row, or column proportions using
the prop.table()
function. Because the Titanic
dataset
is a multidimensional table, it must first be transformed into a
two-dimensional table using margin.table()
(as
was performed above). Failure to do so will produce unexpected
results.
# Table proportions
prop.table(marginal_table)
## Survived
## Sex No Yes
## Male 0.61971831 0.16674239
## Female 0.05724671 0.15629259
# Row proportions
prop.table(marginal_table, 1)
## Survived
## Sex No Yes
## Male 0.7879838 0.2120162
## Female 0.2680851 0.7319149
# Column proportions
prop.table(marginal_table, 2)
## Survived
## Sex No Yes
## Male 0.91543624 0.51617440
## Female 0.08456376 0.48382560
In comparing variables \(X\) and \(Y\) at a fixed \(j\) level of \(Z\), we may use a conditional odds ratio, described by Agresti (2002, sec. 2.7.4), to represent the point estimate of association between the to variables. We will denote it as \(\theta_{XY(j)}\). The marginal odds ratio would then refer to the odds ratio of \(X\) and \(Y\) generated by the marginal table. It follows that the marginal odds ratio would be denoted by \(\theta_{XY}\).
An odds ratio estimate (\(\hat\theta\)) can be calculated from a
table or matrix using the samplesizeCMH
package using the odds.ratio()
function. The odds.ratio()
function can take either a table of frequencies, probabilities, or
percents, as the results are algebraically equivalent.
Using proportions, we see how the ratio of the row odds \(o_1\) and \(o_2\) are estimated.
\[ \hat{\theta}= \frac{\hat{o}_1}{\hat{o}_2} = \frac{\hat{\pi}_{11} / \hat{\pi}_{12}}{\hat{\pi}_{21} / \hat{\pi}_{22}} = \frac{\hat{\pi}_{11}\hat{\pi}_{22}}{\hat{\pi}_{12}\hat{\pi}_{21}}. \]
And since row odds estimates are related row proportions, which are in turn related to to cell counts through the following,
\[ \hat{o} = \frac{\hat{\pi}_1}{1 - \hat{\pi}_1} = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{n_1 / n_+}{n_2 / n_+} = \frac{n_1}{n_2}, \]
the odds estimate, defined as \(\frac{\hat{\pi}_1}{\hat{\pi}_2}\), is equivalent to \(\frac{n_1}{n_2}\). Therefore,
\[ \hat{\theta}= \frac{\hat{\pi}_{11}\hat{\pi}_{22}}{\hat{\pi}_{12}\hat{\pi}_{21}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}. \]
Let’s first look at the marginal odds ratio of survival by sex using the Titanic data.
library(samplesizeCMH)
odds.ratio(marginal_table)
## [1] 10.14697
The conditional odds ratios can be calculated using the partial tables.
apply(partial_tables, 3, odds.ratio)
## 1st 2nd 3rd Crew
## 67.088710 44.067692 4.071612 23.263889
Obviously this is more informative than a simple marginal odds ratio. Based on what we see above, survival by sex appears to vary widely by class, where women in 1st class survive at a much higher rate than men, whereas 3rd class women had only slightly better chance of survival than their male counterparts.
We can produce a common (weighted) odds ratio using mantelhaen.test()
from the stats
package. Note that it
differs slightly from the marginal odds ratio above since it takes into
account the differential sizes of each partial table.
mantelhaen.test(partial_tables)
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: partial_tables
## Mantel-Haenszel X-squared = 360.33, df = 1, p-value < 2.2e-16
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 8.232629 14.185153
## sample estimates:
## common odds ratio
## 10.80653
The term conditional association refers to the association of the \(X\) and \(Y\) variables conditional on the level of \(Z\). Likewise, the marginal association refers to the overall association between \(X\) and \(Y\) while ignoring \(Z\).
The finding of conditional association does not imply marginal association, nor vice-versa. The use of the CMH test to control for the stratifying variable in analysis serves to avoid the well-documented phenomenon of the Simpson’s Paradox in which statistical significance may be found when considering the association between two variables, but where no such significance may be found after considering the stratification. Likewise, the reverse situation may arise where no association may be found between the binary variables, but may be observed when the third variable is introduced.
Refer to Agresti (2002, sec. 2.7.3) for more information on the content of this section.
Homogeneous association is when all the odds ratios between binary variables \(X\) and \(Y\) are equal for all \(j\) levels of variable \(Z\), such that \[\theta_{XY(1)}=\theta_{XY(2)}=...=\theta_{XY(j)}.\] (Agresti 2002, sec. 2.7.6)
The Breslow-Day test can be used to check the null hypothesis that
all odds ratios are equal. The Cochran-Mantel-Haenszel test can be
thought of as a special case of the Breslow-Day test wherein the common
odds ratio is assumed to be 1 (however, the calculations are not
equivalent). Using the Titanic
data, we can perform the
Breslow-Day test using BreslowDayTest()
from the DescTools
package.
library(DescTools)
BreslowDayTest(x = partial_tables, OR = 1)
##
## Breslow-Day test on Homogeneity of Odds Ratios
##
## data: partial_tables
## X-squared = 397.54, df = 3, p-value < 2.2e-16
Note the near agreement with the output from mantelhaen.test()
from the section above.