Chi Square

Test of Independence

Introduction

  • The chi square test is among the most frequently used tests of hypothesis in the social sciences
  • It is a nonparametric test, so it requires no assumption about the exact shape of the population distribution
  • It is appropriate for nominally measured variables
  • It can be used in the two-sample case and also when there are more than two samples

The Logic of Chi Square

The chi square test for independence

  1. Two variables are independent if, for all cases in the sample, the classification of a case into a particular category of one variable has no effect on the probability that the case will fall into any particular category of the second variable.

  2. To conduct a chi square test, the variables must first be organized into a bivariate table, also known as a contingency table.

  3. A contingency table is used to investigate whether two traits or characteristics are related. Each observation is classified according to two criteria. We use the usual hypothesis testing procedure.

Data Requirements

Your data must meet the following requirements:

Important

  1. Two categorical variables.
  2. Two or more categories (groups) for each variable.
  3. Independence of observations.
    • There is no relationship between the subjects in each group.
    • The categorical variables are not “paired” in any way (e.g. pre-test/post-test observations).
  4. Relatively large sample size.
    • Expected frequencies for each cell are at least 1.
    • Expected frequencies should be at least 5 for the majority (80%) of the cells.
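The sample size requirement can be checked mechanically once the expected frequencies are known. The following is a minimal Python sketch, assuming the expected count of a cell is (row total × column total) / grand total, as in Formula 2 of the Computation section. The function names and both example tables are hypothetical and serve only as illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_sample_size_rule(table, floor=1.0, usual=5.0, share=0.80):
    """Check: every expected count >= floor, and at least `share` of cells >= usual."""
    cells = [e for row in expected_counts(table) for e in row]
    all_above_floor = all(e >= floor for e in cells)
    share_at_usual = sum(e >= usual for e in cells) / len(cells)
    return all_above_floor and share_at_usual >= share


# Hypothetical observed counts, for illustration only.
balanced = [[30, 20], [25, 25]]
sparse = [[12, 8], [3, 2], [20, 15]]

print(meets_sample_size_rule(balanced))  # True: all four expected counts exceed 5
print(meets_sample_size_rule(sparse))    # False: only 4 of 6 expected counts reach 5
```

In the second table every expected count is above 1, but two of the six cells fall below 5, so fewer than 80% of the cells meet the usual threshold.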

Application of Chi-Square Test of Independence

We can use the chi-square statistic to formally test for a relationship between two nominal-scaled variables. To put it another way, is one variable independent of the other?

Ford Motor Company runs an assembly plant in Dearborn, Michigan. The plant operates three shifts per day, 5 days a week. The quality control manager wishes to compare the quality level on the three shifts. Vehicles are classified by quality level (acceptable, unacceptable) and shift (day, afternoon, night). Is there a difference in the quality level on the three shifts? That is, is the quality of the product related to the shift when it was manufactured? Or is the quality of the product independent of the shift on which it was manufactured?

A sample of 100 drivers, who were stopped for speeding violations, was classified by gender and whether or not the drivers were wearing a seat-belt when stopped. For this sample, is wearing a seat-belt related to gender?

Does a male released from federal prison make a different adjustment to civilian life if he returns to his hometown or if he goes elsewhere to live? The two variables are adjustment to civilian life and place of residence. Note that both variables are measured on the nominal scale.

Bivariate Tables

The idea of independence can be seen in bivariate tables, also known as contingency tables

  • Bivariate tables display joint classification of the cases on two variables
  • The categories of the independent variable are used as column headings
  • The categories of the dependent variable are used as row headings
  • The marginals are the univariate frequency distributions for each variable (i.e., the row total and the column total)

Bivariate Tables, cont.

  • To find the number of cells in a table, multiply the number of categories of the independent variable by the number of categories of the dependent variable
  • A bivariate table in which both variables have three categories has nine cells
  • If two variables are independent, the cell frequencies will be determined by random chance

Bivariate Tables, cont.

  • The null hypothesis states that the variables are independent
  • If the null hypothesis is true, the expected cell frequencies are what we would expect to find if only random chance were operating
  • The actual frequencies would differ little from the expected frequencies
  • Therefore, it is still the hypothesis of no difference, but this time the difference measured is between the observed frequencies and the expected frequencies

Independence

  • When the variables are independent of each other, there should be little difference between the observed frequencies and the expected frequencies
  • These slight differences would be due to chance alone
  • If the null is false (we reject the null), there should be large differences between the two

Computation of Chi Square

You need to compute a test statistic and compare it with a critical value:

  1. Compute Chi Square (obtained), the test statistic, using Formula 1
  2. Find Chi Square (critical value) to compare with your test statistic
  3. Chi Square (critical value) is found by looking in a chi square table (refer to your statistical table) for a particular alpha level and degrees of freedom
  4. Degrees of freedom: \(df\) = (number of rows - 1)(number of columns - 1)
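When no printed table is at hand, the critical value can be approximated numerically. The sketch below is one possible approach using only the Python standard library, not the lookup procedure described above: it evaluates the upper-tail probability of the chi square distribution with a series for the incomplete gamma function, then finds the critical value by bisection. The function names are illustrative assumptions.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """df = (number of rows - 1)(number of columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi2_critical(alpha, df):
    """Value whose upper-tail probability equals alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


df = degrees_of_freedom(2, 3)          # a table with 2 rows and 3 columns
print(df, round(chi2_critical(0.05, df), 3))
```

For a table with 2 rows and 3 columns at alpha = .05, this gives a value that agrees with the tabled 5.991 to three decimals.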

Computation, cont.

Formula 1 for Chi Square (obtained - test statistic):

\[ \chi^{2} = \sum \frac{(O - E)^{2}}{E} \]

The Chi Square statistic is

  1. the sum over all cells of

  2. the squared difference between the obtained value and the expected value, which is then

  3. divided by the expected frequency.
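The three steps of Formula 1 translate directly into a short loop. This is a minimal Python sketch; the function name and the small tables of observed and expected values are hypothetical and chosen only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum over all cells of (O - E)^2 / E."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


# Hypothetical observed and expected frequencies for a 2 x 2 table.
observed = [[10, 20], [30, 40]]
expected = [[12, 18], [28, 42]]

# Each cell contributes 4 / E here, because every |O - E| equals 2.
print(round(chi_square_statistic(observed, expected), 3))
```

When observed and expected frequencies are identical, every term is zero and the statistic is 0.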

Computation, cont.

  • You have to calculate an expected frequency for each cell in the table
  • Since marginals will be unequal in most cases, you need Formula 2 to compute the expected frequencies:

\[ E_{i,j} = \frac{(\text{Row}_{i}\ \text{Total}) \times (\text{Column}_{j}\ \text{Total})}{\text{Grand Total}} \]
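Formula 2 can be sketched in Python as follows. The function names are illustrative assumptions. The single-cell call uses the marginals quoted in Working Example 1 (row total 150, column total 130, grand total 350), and the full table is hypothetical.

```python
def expected_cell(row_total, column_total, grand_total):
    """Expected frequency of one cell: (row total * column total) / grand total."""
    return row_total * column_total / grand_total


def expected_frequencies(table):
    """Apply Formula 2 to every cell of a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[expected_cell(r, c, grand_total) for c in col_totals]
            for r in row_totals]


# Marginals taken from Working Example 1.
print(round(expected_cell(150, 130, 350), 1))  # 55.7

# Hypothetical 2 x 2 table of observed counts.
print(expected_frequencies([[30, 20], [25, 25]]))
```

The expected frequencies always reproduce the original row and column totals, which is a quick way to check a hand calculation.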

Examples:

Where can the Chi-Square Test be applied?

The following are situations where we can use the Chi-Square test:

  • Testing if there is an association between gender and preferred learning method (online/books).
  • Testing if there is an association between political ideology and opinion on tax reform.
  • Testing if there is an association between marital status and level of education.

How to write the hypotheses?

The null hypothesis (\(H_{0}\)) and alternative hypothesis (\(H_{1}\)) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:

\(H_{0}\): “[Variable 1] is independent of [Variable 2]”
\(H_{1}\): “[Variable 1] is not independent of [Variable 2]”

\(H_{0}\): “[Variable 1] is not associated with [Variable 2]”
\(H_{1}\): “[Variable 1] is associated with [Variable 2]”

Therefore, when we reject the null hypothesis, we conclude that the two variables are associated. The test does not show that one variable affects the other, and it does not tell us how the association arises between categories.

Working Example 1

Suppose that the state legislature is considering a bill to lower the legal drinking age to 18. A political scientist is interested in whether there is a relationship between party affiliation and attitude toward the bill. A random sample of 150 registered Republicans and 200 registered Democrats is asked for their opinion about the proposed bill.

Calculating the Expected Value for a Particular Cell
Ex: \(Cell_{11}\ =\ \frac{150 * 130}{350}\ =\ 55.7\)

Numbers in Black are obtained (\(f_{o}\))
Numbers in Purple are expected (\(f_{e}\))

The calculated value for the chi square statistic is compared to the critical value found in the chi square table.
Note: The distribution of the chi square statistic is not normal, and the critical values lie on one side only (the upper tail). If the obtained values are close to the expected values, the chi square statistic will approach 0. The more the obtained values differ from the expected values, the larger the value of chi square becomes. This is reflected in the values found in the table.
The degrees of freedom for the Chi Square Test of Independence are the product of (number of rows minus 1) and (number of columns minus 1).
\(\chi^{2}=5.62+0.27+3.11+4.22+0.20+2.33=15.75\)

In our study, we had two rows (Republicans and Democrats) and three columns (For, Undecided, Against). Therefore, the degrees of freedom for our study are (2-1)(3-1) = 1(2) = 2. Using an \(\alpha\) of .05, the critical value from the table is 5.991. Since our calculated chi square of 15.75 exceeds the critical value, we reject the null hypothesis and conclude that there is a relationship between political party and opinion on lowering the drinking age.
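The whole calculation can be traced in a few lines of Python. The observed counts are not reproduced in this section, so the counts below are an assumption: they are one set of values consistent with the stated row totals (150 and 200), the first column total (130), and the six cell contributions listed above. Treat them as illustrative, not as the original data.

```python
# ASSUMED observed counts (columns: For, Undecided, Against), chosen to be
# consistent with the marginals and cell contributions quoted in the text.
observed = {
    "Republicans": [38, 17, 95],
    "Democrats": [92, 18, 90],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Formula 2: expected frequency for each cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Formula 1: each cell contributes (O - E)^2 / E.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(rows, expected)
]
chi_square = sum(sum(row) for row in contributions)

df = (len(rows) - 1) * (len(col_totals) - 1)
critical_value = 5.991  # tabled value for alpha = .05 and df = 2

print("df =", df)
print("chi square =", round(chi_square, 2))
print("reject H0:", chi_square > critical_value)
```

With unrounded expected frequencies the statistic comes out slightly above the 15.75 obtained by hand, because the hand calculation rounds each expected frequency to one decimal place. The decision is unaffected.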

Working Example 2

PROBLEM STATEMENT

In the sample dataset, respondents were asked their gender and whether or not they were a cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current smoker. Suppose we want to test for an association between smoking behavior (nonsmoker, current smoker, or past smoker) and gender (male or female) using a Chi-Square Test of Independence (we’ll use α = 0.05).
Hypotheses:
\(H_{0}\): Gender is not associated with Smoking
\(H_{1}\): Gender is associated with Smoking

BEFORE THE TEST

Note

Before we test for “association”, it is helpful to understand what an “association” and a “lack of association” between two categorical variables looks like. One way to visualize this is using clustered bar charts. Let’s look at the clustered bar chart produced by the Crosstabs procedure. This is the chart that is produced if you use Smoking as the row variable and Gender as the column variable (running the syntax later in this example):
The “clusters” in a clustered bar chart are determined by the row variable (in this case, the smoking categories). The color of the bars is determined by the column variable (in this case, gender). The height of each bar represents the total number of observations in that particular combination of categories.
This type of chart emphasizes the differences within the categories of the row variable. Notice how within each smoking category, the heights of the bars (i.e., the number of males and females) are very similar. That is, there are an approximately equal number of male and female nonsmokers; approximately equal number of male and female past smokers; approximately equal number of male and female current smokers. If there were an association between gender and smoking, we would expect these counts to differ between groups in some way.

RUNNING THE TEST

  1. Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
  2. Select Smoking as the row variable, and Gender as the column variable.
  3. Click Statistics. Check Chi-square, then click Continue.
  4. (Optional) Check the box for Display clustered bar charts.
  5. Click OK.

Syntax

CROSSTABS
  /TABLES=Smoking BY Gender
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL
  /BARCHART.

NOTE: Refer to Tab 6 for output tables

Note

  1. The first table is the Case Processing summary (Table 1), which tells us the number of valid cases used for analysis. Only cases with nonmissing values for both smoking behavior and gender can be used in the test.

Note

  1. The next tables are the crosstabulation (Table 2a) and chi-square test results (Table 2b).


Important

The key result in the Chi-Square Tests table is the Pearson Chi-Square.

  • The value of the test statistic is 3.171.
  • The footnote for this statistic pertains to the expected cell count assumption (i.e., expected cell counts are all greater than 5): no cells had an expected count less than 5, so this assumption was met.
  • Because the test statistic is based on a 3x2 crosstabulation table, the degrees of freedom (df) for the test statistic is \(df = (R-1)(C-1) = (3-1)(2-1) = 2 \times 1 = 2\)
  • The corresponding p-value of the test statistic is p = 0.205.

Decision
Since the p-value is greater than our chosen significance level (\(\alpha\) = 0.05), we do not reject the null hypothesis. Rather, we conclude that there is not enough evidence to suggest an association between gender and smoking.

Based on the results, we can state the following:
- No association was found between gender and smoking behavior (\(\chi^{2}_{2}\) = 3.171, \(p\) = 0.205).
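The reported p-value can be checked by hand from the test statistic and its degrees of freedom. The sketch below is not the SPSS procedure; it assumes the closed-form upper-tail probability of the chi square distribution that holds when df is even, \(e^{-x/2}\sum_{i=0}^{df/2-1}(x/2)^{i}/i!\), and the function name is hypothetical.

```python
import math


def chi2_p_value_even_df(x, df):
    """Upper-tail probability for chi square; valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires an even, positive df")
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


alpha = 0.05
p_value = chi2_p_value_even_df(3.171, 2)

print(round(p_value, 3))   # 0.205
print(p_value > alpha)     # True, so the null hypothesis is not rejected
```

For df = 2 the formula reduces to \(e^{-x/2}\), which is why this particular p-value is easy to verify with a calculator.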

Past Exam Paper Questions

Solution