A Chi-squared Primer

ChiSquaredBlog

Introduction

During high school I feel like I had a pretty solid intuitive grasp of all the ideas we covered in Statistics class. There were minor things that I took for granted (for example, why we divide by \(n-1\) instead of \(n\) when calculating sample standard deviation), but generally a lot of the early material is pretty intuitive. This all changed toward the end though when we started learning about goodness of fit tests, and the required \(\chi^{2}\) distribution (still not sure if it’s correct to call it chi-square or chi-squared) that came along with it.

Unlike normal distributions, which show us the average and spread of a population, \(\chi^{2}\) didn’t have an immediately obvious "real world" interpretation. We were taught how to use it for various tests, but in a way that felt fairly prescriptive and that I forgot not long after finishing the class. I was able to answer questions on our exams, but there’s no way I would have recognized a situation in life that called for the \(\chi^{2}\) distribution if it deviated at all from the examples in class. I just really didn’t grasp it. Something about the tests and the distribution itself felt so arbitrary. Where did the distribution come from, why did it only have degrees of freedom as a parameter, and why was it used for these tests?

I’m happy to say that after a few weeks of on and off research and a ton of algebra, I feel I have a much better grasp of it and can answer these questions. (And to be clear I had a fantastic teacher. This was toward the end of the senior year of high school and I was mailing it in. I’m sure she would have been more than happy to sit down and explain these things had I asked.)

Early History

From what I gather, the early history of \(\chi^{2}\) isn’t really cut and dried, and it certainly doesn’t have a fun motivation like the gambling problems behind the normal distribution. After the standard normal distribution had been kicking around for about 150-200 years (discovered by De Moivre in the 1700s, nearly forgotten, and then "rediscovered" by Laplace in the 1800s), mathematicians started to realize its importance and ubiquity toward the end of the 19th century. Consequently, I think some mathematicians started toying around with the idea of squaring the normal distribution, but it wasn’t until Karl Pearson’s 1900 paper that the \(\chi^{2}\) distribution really started taking shape. I’m going to gloss over some of the finer points here, but after a number of people had toyed with the distribution in one form or another, Pearson is generally credited with "fathering" the \(\chi^{2}\) distribution in his preposterously titled 1900 paper "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling". However, I wasn’t able to get a copy of this paper (and I imagine I’d have some difficulty understanding it), so I don’t know exactly how he used it or what his motivating problem was.

It’s also worth mentioning that some other sources make it sound as though he may have only used it in the narrow context of his particular problem, and there was still some refining to be done. After several subsequent papers in the early 20th century, as well as an apparent academic feud with Ronald Fisher, it was Fisher who described the \(\chi^{2}\) distribution in the terms we use today.

I’d like to learn more about this feud and the specifics of each paper that moved us closer to the final form of the \(\chi^{2}\) distribution, but I’m very eager to start looking at some math. Let’s dive in.

What is the \(\chi^{2}\) distribution?

Put as simply as possible, the \(\chi^{2}\) distribution is what we get when we sum some number of squared observations drawn from the standard normal distribution. Put into more concrete terms, we define the \(\chi^{2}\) distribution with \(n\) degrees of freedom (denoted \(\chi^{2}\)(n) from here on unless otherwise noted) as:

\[\chi^{2}(n) = \sum_{i=1}^{n} (Y_{i})^{2}\]

where each \(Y_{i}\) is a standard normal variable.

Now it’s perfectly reasonable to see this and think, "well how on earth does this help us? You’ve given a definition without any real motivation and it feels like you pulled this out of thin air," which I concede is kind of a fair point. That notwithstanding, I do think it’s helpful to see the definition right from the outset for context. It’s also worth noting that a lot of math starts with simple experimentation and finds an application after the fact. With this in mind, let’s look a bit closer at the distribution and see if we can’t start getting a better feel for it.

1 degree of freedom

The case with one degree of freedom seems like the most obvious starting point. A quick simulation plotting a histogram of 10,000 squared observations gives us the following:

[Figure: histogram of 10,000 squared standard normal observations, with a kernel density overlay]

It’s a bit tough to see, but this is more or less what we would have expected. Most of the observations in the standard normal curve are clustered near 0, so when squared most observations will be small positive numbers. Note that the kernel density line over the top curls down toward \(x\) = 0, but this is a weird quirk of the kernel density drawing algorithm. In actuality it asymptotically approaches the y-axis.
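For those playing along at home, here’s a minimal sketch of that simulation in Python (I’m assuming numpy here; the seed and the 10,000 sample count are arbitrary choices). Feeding the squared draws to any histogram routine reproduces the figure above:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_normal(10_000)  # observations from the standard normal
squared = draws ** 2                 # observations from chi-squared with 1 degree of freedom

# Most of the mass piles up near zero:
# P(chi2(1) <= 1) = P(|Z| <= 1), which is about 0.683
print(np.mean(squared < 1))
```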

Now, for those curious about the pdf, it’s:

\[f(x) = \frac{1}{\sqrt{2\pi x}} e^{\frac{-x}{2}}\]

I’ll refer you to the wikipedia article about chi-squared proofs for this derivation. Apologies for not handling the proof myself, but I think the wikipedia article handles this particularly well and there’s no sense in replicating it here. Now let’s look into the case with multiple degrees of freedom.

Multiple degrees of freedom

Gonna move pretty quickly here, but a similar simulation to above yields the following distributions for various degrees of freedom (I hope it’s clear that "two degs" refers to 2 degrees of freedom):

[Figure: histograms of simulated \(\chi^{2}\) observations for various degrees of freedom]

Now this is starting to get interesting. I wouldn’t have guessed this, but all of the distributions appear to be centered right around their number of degrees of freedom, with a shape starting to resemble a normal distribution as the degrees of freedom get larger. I suppose I don’t know exactly what I would have guessed, but I would have imagined it would be centered at a smaller number given that most draws will be less than 1 (and so the squared draws would be even smaller). I guess the draws bigger in magnitude than 1 (remember, negative values will become positive when squared) balance out the larger number of smaller draws.
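The simulation behind these histograms is a one line tweak of the previous one: instead of squaring single draws, sum a batch of squared draws per observation. A sketch (numpy assumed; the degrees of freedom listed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

for df in (2, 3, 5, 10):
    # Each chi-squared observation is the sum of df squared standard normals
    obs = (rng.standard_normal((10_000, df)) ** 2).sum(axis=1)
    print(df, obs.mean())  # the empirical center lands near df
```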

Also once again, for those curious, this page has a proof for the more general case with n-degrees of freedom. I won’t be covering it here, but this does seem like a proof someone could follow themselves.

Mean and Variance

Before finishing up with our initial investigation, let’s quickly try and figure out the mean and variance of the \(\chi^{2}\) distribution. When encountering a new distribution, it’s usually helpful to get our hands around these two quantities as a starting point.

Now we’re going to use a nice little trick to get started here. If you remember this convenient form of the definition of variance, and letting \(X\) be a standard normal variable, we have:

\[\begin{aligned} Var(X) &= E(X^{2}) - (E(X))^{2} \notag \\ 1 &= E(X^{2}) - 0^{2} \notag \\ E(X^{2}) &= 1 \notag \end{aligned}\]

So already we see that the mean of a one degree of freedom \(\chi^{2}\) variable is 1. Now let’s look at variance.

\[\begin{aligned} Var(X^{2}) &= E(X^{4}) - (E(X^{2}))^{2} \notag \\ Var(X^{2}) &= E(X^{4}) - 1 \notag \end{aligned}\]

We can now determine \(E(X^{4})\) by the following integral:

\[\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{4} e^{\frac{-x^{2}}{2}} dx\]

Given how overly detailed this post has been already (and how much more we still have to go), I’ll leave this as an exercise to the reader. It’s a good chance to brush up on integration by parts. Just know that it equals 3, giving us:
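That said, if you’d rather not dust off integration by parts just yet, a quick Monte Carlo check (numpy assumed; sample size arbitrary) at least makes the value of 3 believable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# The fourth moment E(X^4) of a standard normal works out to 3
print(np.mean(x ** 4))
```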

\[Var(X^{2}) = 3 - 1 = 2\]

With values for both the mean and variance in the one degree of freedom case, it shouldn’t be too difficult to see the following values for the \(n\) degrees of freedom case:

\[\begin{aligned} mean(\chi^{2}(n)) &= n \notag \\ Var(\chi^{2}(n))&= 2n \notag \end{aligned}\]
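As a sanity check on both formulas, a small simulation (numpy assumed; the degrees of freedom chosen are arbitrary) confirms them nicely:

```python
import numpy as np

rng = np.random.default_rng(1)

for df in (1, 4, 9):
    draws = (rng.standard_normal((200_000, df)) ** 2).sum(axis=1)
    # Empirical mean should sit near df, empirical variance near 2 * df
    print(df, draws.mean(), draws.var())
```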

Alrighty, that should just about do it for initial investigations. With a few simulations and a few notebook pages of calculations, we know the general shape, mean, and variance for our new distribution.

A simple problem

Keeping in the gambling theme which helped us "discover" the normal distribution, consider the following (admittedly contrived) situation: you, a statistician, are playing a crazy casino game where you roll 600 dice and count the results. If you get over 100 4s, 5s, or 6s, then you win some money (the amount and odds here aren’t really important), otherwise you lose. You grab all 600 dice, and here are your results:

Number Observed
1 122
2 114
3 118
4 83
5 81
6 82

Not only did you not win, but you weren’t particularly close! "That’s absolutely terrible luck" you think to yourself as you walk away. However, you can’t shake the feeling that your luck seemed too bad. There’s just no way you’d have three separate numbers not even breaking into the 90s, and they happen to be the 3 that you needed. Smells \(very\) fishy. Might be time to look into this and see if there is any foul play.

A first pass

Remembering our tool from last time, we use the normal approximation of the binomial distribution to see if our low numbers seem too improbable to be true. Let’s look at the probability of frequencies \(as \; extreme\) as the ones we observed. (Also if you’re playing along at home remember that you should add or subtract 0.5 to the observed frequencies depending on what sort of tail probability you’re aiming for)

Number Observed Likelihood
1 122 0.9%
2 114 7.0%
3 118 2.8%
4 83 3.5%
5 81 2.1%
6 82 2.8%
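If you want to reproduce these numbers, here’s a sketch using only the Python standard library (the half-step adjustments are the continuity correction mentioned above):

```python
from math import erfc, sqrt

n, p = 600, 1 / 6
mu = n * p                   # expected count per face: 100
sd = sqrt(n * p * (1 - p))   # standard deviation: about 9.13

def upper_tail(k):
    # P(count >= k) under the normal approximation, continuity corrected
    z = (k - 0.5 - mu) / sd
    return 0.5 * erfc(z / sqrt(2))

def lower_tail(k):
    # P(count <= k) under the normal approximation, continuity corrected
    z = (k + 0.5 - mu) / sd
    return 0.5 * erfc(-z / sqrt(2))

for k in (122, 114, 118):
    print(k, f"{100 * upper_tail(k):.1f}%")  # 0.9%, 7.0%, 2.8%
for k in (83, 81, 82):
    print(k, f"{100 * lower_tail(k):.1f}%")  # 3.5%, 2.1%, 2.8%
```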

Well although it’s a bunch of small numbers, there isn’t necessarily one we’d call a smoking gun. While 0.9% is quite small, 1 in 100 isn’t \(that\) crazy. And yet you can’t shake the feeling that, taken as a whole, this distribution just seems very unlikely. Somehow too convenient for the house to be chance. Maybe there’s some way to quantify how much the whole distribution deviates from what it "should" be.

Investigating a simpler example

As is usually the case in math, it’s helpful to play around with a simpler example than the one we actually care about, to see what we find. With that in mind, let’s define the following:

\[\begin{gathered} X = Number \; of \; heads \; after \; n \; tosses \notag \\ Y = \left( \frac{X-np}{\sqrt{np(1-p)}} \right)^{2} \notag \end{gathered}\]

Note that \(Y\) is a sneaky way of writing a squared observation from the standard normal distribution (at least approximately, via the normal approximation to the binomial). (Now forgive me for doing this, but I’m going to omit a LOT of algebra here and in subsequent equations just because it’s quite a lot and quite messy.) If we expand and rearrange terms in \(Y\) we can get the following:

\[\begin{gathered} Y = \frac{(X-np)^{2}}{np} + \frac{((n-X)-n(1-p))^{2}}{n(1-p)} \end{gathered}\]

Then if we define \(O\) to mean the observed amount and \(E\) to be the expected amount, we can rewrite it like so:

\[\begin{gathered} Y = \frac{(O_{heads}-E_{heads})^{2}}{E_{heads}} + \frac{(O_{tails}-E_{tails})^{2}}{E_{tails}} \end{gathered}\]

Now this is interesting. Equation (2) tells us that if we sum our squared errors and divide them by their expected values, the result is distributed like a single observation from the standard normal distribution squared. A few things strike me about this: first, I would have expected that the entire fraction in each term be squared, not just the numerator. Second, summing two terms is distributed like drawing and squaring a single standard normal observation, not two.
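We can at least spot-check this claim numerically. The sketch below (numpy assumed; coin parameters arbitrary) simulates binomial counts, computes the two-term sum, and checks that it has the mean of 1 and variance of 2 we found earlier for a single squared standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 0.5
heads = rng.binomial(n, p, size=100_000)
tails = n - heads

# The two-term sum of squared errors over expected values
y = ((heads - n * p) ** 2 / (n * p)
     + (tails - n * (1 - p)) ** 2 / (n * (1 - p)))

# A single squared standard normal has mean 1 and variance 2
print(y.mean(), y.var())
```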

Either way, this is looking like a good lead. Starting with a binomial distribution we were able to arrive at a statistic we could easily determine based on observed and expected results, which has a distribution that should be easy enough to understand. Taking a bit of a leap of faith, I’d like to think we could generalize this result to something like the following for a random process with \(k\) discrete outcomes:

\[\begin{gathered} X_{i} = Observed \; trials \; with \; outcome \; i \notag \\ p_{i} = probability \; of \; success \; for \; outcome \; i \notag \\ n = total \; number \; of \; trials \notag \\ Z = \sum_{i=1}^{k} \frac{(X_{i} - np_{i})^{2}}{np_{i}}, Z \sim \chi^{2}(k-1) \end{gathered}\]

Where \(Z\) in equation (3) is distributed like \(\chi^{2}\) with \(k-1\) degrees of freedom as defined earlier on. Let’s move on and see what happens when we have 3 possible outcomes (rather than two, like in the heads/tails example).

Laying the groundwork

Let’s take a moment to define some variables and other things before we proceed. For what it’s worth I also think this is a good time to start being a little more generic with our variable use. Let’s define a 3 outcome system with outcome states 1, 2, and 3. Using similar definitions as above we have:

\[\begin{gathered} X_{1} + X_{2} + X_{3} = n \Longrightarrow n - X_{1} = X_{2} + X_{3}\notag \\ p_{1} + p_{2} + p_{3} = 1 \Longrightarrow 1-p_{1} = p_{2} + p_{3} \notag \\ Z = \frac{(X_{1}-np_{1})^{2}}{np_{1}} + \frac{(X_{2}-np_{2})^{2}}{np_{2}} + \frac{(X_{3}-np_{3})^{2}}{np_{3}} \end{gathered}\]

I’ve added a few helpful rearrangements to the above definitions that we will use later. With that out of the way, let’s turn our attention to the \(Z\) term.

Let’s begin by thinking of our 3 outcome system as a 2 outcome system; namely, rather than outcomes 1, 2, and 3, we have outcomes 1 and \(not \; 1\). (It might seem a bit silly to define a 3 outcome system and immediately compress it back to 2 outcomes, but I promise it will pay off.) With this in mind, we can rewrite \(Z\) the following way

\[Z = \frac{(X_{1}-np_{1})^{2}}{np_{1}} + \frac{((n-X_{1})-n(1-p_{1}))^{2}}{n(1-p_{1})} + \frac{(X_{2}-np_{2})^{2}}{np_{2}} + \frac{(X_{3}-np_{3})^{2}}{np_{3}} - \frac{((n-X_{1})-n(1-p_{1}))^{2}}{n(1-p_{1})}\]

which, using our alternative rearrangements in eqn. (4), can be rewritten as:

\[Z = \frac{(X_{1}-np_{1})^{2}}{np_{1}} + \frac{((n-X_{1})-n(1-p_{1}))^{2}}{n(1-p_{1})} + \frac{(X_{2}-np_{2})^{2}}{np_{2}} + \frac{(X_{3}-np_{3})^{2}}{np_{3}} - \frac{((X_{2}+X_{3})-n(p_{2}+p_{3}))^{2}}{n(p_{2}+p_{3})}\]

and if you remember our result from eq. (1), the first two terms can be replaced by the square of a standard normal variable, which we’ll call \(Y\) for right now. With this in mind, let us rewrite it like so:

\[\begin{aligned} Z &= Y^{2} + Q \notag \\ Q &= \frac{(X_{2}-np_{2})^{2}}{np_{2}} + \frac{(X_{3}-np_{3})^{2}}{np_{3}} - \frac{((X_{2}+X_{3})-n(p_{2}+p_{3}))^{2}}{n(p_{2}+p_{3})} \notag \\ Y &\sim N(0,1) \notag \end{aligned}\]

Now if we can show that \(Q\) is really a squared standard normal variable, then we will have shown our initial variable \(Z\) was the sum of two squared standard normal variables, and proven our own claim in eqn. (3).

Regretfully I’d like to point out one MAJOR disclaimer before we proceed: I’m definitely playing a little fast and loose with independence at the moment and don’t have a great explanation for why this isn’t an issue. Unlike in the previous example where we reduced the Z term to a single standard normal variable, (spoiler ahead) we will be reducing this example to the sum of two standard normal variables that aren’t obviously independent. The definition up above for the \(\chi^{2}\) distribution is built on summing independent squared standard normal variables, and in this multi outcome situation the number we observe for one outcome directly influences how many we can see for the others (i.e. if outcome 1 happens a ton there can’t be many 2s or 3s). I suspect the final variable \(Q\) we find will be independent from the variable \(Y^{2}\), but I might have to omit any details and just leave it as a hunch (or let computer simulations dictate if it’s good enough).
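Taking that last option seriously, here’s a simulation sketch (numpy assumed; the outcome probabilities are arbitrary choices) that computes \(Z\) for a 3 outcome system and compares it against the mean of 2 and variance of 4 we’d expect from \(\chi^{2}\)(2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
probs = np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(n, probs, size=100_000)

expected = n * probs
z = ((counts - expected) ** 2 / expected).sum(axis=1)

# If Z really behaves like chi2(2), its mean should be near 2 and variance near 4
print(z.mean(), z.var())
```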

A standard normal distribution in a trench coat

Before proceeding I have to admit that, similar to the normal distribution derivation, I was not able to tackle this one on my own. After getting to the definition of \(Q\) above, I got stuck and found a paper online that finished the proof in more or less the same manner. The calculation relied on a clever substitution that helps "simplify" the calculation (using the term simplify here loosely) and recognizing something as a standard normal variable that does not look like one (both things I likely couldn’t have done on my own). The document is located here and the reference proof is proof 6:

https://arxiv.org/pdf/1808.09171.pdf

I have filled in some of the detail the authors glossed over so hopefully I’m adding at least a little to the discussion here, although I will be relegating most of that to the appendix.

Now, in light of these disclaimers, note the following equality:

\[\begin{gathered} \frac{(X_{2}-np_{2})^{2}}{np_{2}} + \frac{(X_{3}-np_{3})^{2}}{np_{3}} - \frac{((X_{2}+X_{3})-n(p_{2}+p_{3}))^{2}}{n(p_{2}+p_{3})} = \left( \frac{p_{3}X_{2}-p_{2}X_{3}}{\sqrt{np_{2}p_{3}(p_{2}+p_{3})}} \right)^{2} \end{gathered}\]

Believe it or not, the term on the right hand side of eqn. 5 is actually a standard normal variable. To see why, let’s define a few more things and recall a few facts to make our lives easier:

\[\begin{gathered} k = \frac{1}{\sqrt{np_{2}p_{3}(p_{2}+p_{3})}} \notag \\ A = kp_{3}X_{2}-kp_{2}X_{3}\notag \\ E(X_{2}) = np_{2}, E(X_{3}) = np_{3} \notag \end{gathered}\]

First recall that \(n\), \(p_{2}\), and \(p_{3}\) are all constant terms, and so \(k\) is as well. Because of this, A must be a (at least approximately) normally distributed variable, because it is a linear combination of \(X_{2}\) and \(X_{3}\), which are themselves approximately normally distributed. Now let’s see what happens when we take the expectation of our variable A:

\[\begin{aligned} E(A) &= E(kp_{3}X_{2}-kp_{2}X_{3}) \notag \\ &= kp_{3}E(X_{2})-kp_{2}E(X_{3}) \notag \\ &= kp_{3}np_{2}-kp_{2}np_{3} \notag \\ &= 0 \notag \end{aligned}\]

Alright, so now we know that A is both normally distributed and has mean 0. Moving on to variance, don’t forget that our variables \(X_{2}\) and \(X_{3}\) aren’t independent! Based on our definitions above, we have:

\[\begin{aligned} Var(A) &= Var(kp_{3}X_{2}-kp_{2}X_{3}) \notag \\ &= (kp_{3})^{2} Var(X_{2}) + (kp_{2})^{2} Var(X_{3}) - 2k^{2}p_{2}p_{3}Cov(X_{2},X_{3}) \notag \\ &= \frac{p_{2}-p_{2}p_{3}}{p_{2}+p_{3}} + \frac{p_{3}-p_{2}p_{3}}{p_{2}+p_{3}} + \frac{2p_{2}p_{3}}{p_{2}+p_{3}} \notag \\ &= 1 \notag \end{aligned}\]

And so in conclusion, A is normally distributed with mean 0 and variance 1 and is therefore a standard normal variable. I have to say I honestly didn’t believe it when I first saw that this could possibly be a standard normal variable, but somehow a bunch of stuff seems to cancel and here we are. Finally, after showing that our variable \(Q\) above is actually a squared standard normal term, let’s write:
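And for good measure, we can simulate A directly (numpy assumed; the probabilities are arbitrary choices) and watch the mean 0, variance 1 claim hold up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p1, p2, p3 = 0.2, 0.3, 0.5
counts = rng.multinomial(n, [p1, p2, p3], size=100_000)
x2, x3 = counts[:, 1], counts[:, 2]

k = 1 / np.sqrt(n * p2 * p3 * (p2 + p3))
a = k * (p3 * x2 - p2 * x3)

# A should behave like a standard normal: mean near 0, variance near 1
print(a.mean(), a.var())
```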

\[\begin{gathered} Q = H^{2} \notag \\ Z = Y^{2} + H^{2} \notag \\ H,Y \sim N(0,1) \notag \end{gathered}\]

And so we have shown that our variable \(Z\) all along was really just the sum of two squared standard normal variables, and so is distributed like \(\chi^{2}\)(2).

(The curious reader can check the appendix for a more detailed explanation of the covariance term. It turns out multinomial distributions have a nice clean covariance between the different states.)

Bringing it home

I appreciate you sticking it out through that massive detour to show very explicitly how the three "squared error" terms can actually be written as two squared variables. Moreover, I hope you can appreciate how that process would generalize to the case with an arbitrary number \(n\) of summed squared standard normal variables (if not, the proof in the linked document shows how). The "off by one" mismatch between the number of errors we sum and the degrees of freedom of the \(\chi^{2}\) distribution we compare it to has always bothered me, and it’s nice to finally see a construction showing why this is the case. The prevailing intuition of "once you’ve calculated the first \(k-1\) variables, the \(k^{th}\) one is automatically determined" is indeed correct, but now we’ve put a little extra rigor behind this idea.

But what does this mean for our dice example above? Using our formula in eqn (3), with an expected count of \(600/6 = 100\) for each face, we have:

\[\frac{22^{2}+14^{2}+18^{2}+17^{2}+19^{2}+18^{2}}{100}=19.78\]

If we compare this to the \(\chi^{2}\)(5) distribution, we see that a statistic at least this large has a p-value of 0.00137, or just above 0.1%. While this isn’t quite the smoking gun we were hoping for, this does show that the distribution as a whole appears to be a bit less likely than any of the observations taken individually.
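For anyone who wants to reproduce that p-value, here’s a sketch using only the Python standard library. The survival function below is the standard closed form for \(\chi^{2}\) with 5 degrees of freedom specifically (if you have scipy available, scipy.stats.chi2.sf(19.78, 5) should agree):

```python
from math import erfc, exp, pi, sqrt

observed = [122, 114, 118, 83, 81, 82]
expected = 100  # 600 rolls, 6 equally likely faces

stat = sum((o - expected) ** 2 for o in observed) / expected
print(stat)  # 19.78

def chi2_sf_5df(x):
    # P(chi2(5) > x), written out in closed form (valid only for 5 degrees of freedom)
    return erfc(sqrt(x / 2)) + exp(-x / 2) * sqrt(2 * x / pi) * (1 + x / 3)

print(chi2_sf_5df(stat))  # ~0.00137
```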

Anywho, that’s all for now. Despite how verbose this post is, there are still additional ideas and concepts with the \(\chi^{2}\) distribution that we didn’t discuss. For instance, why do people say not to use it if one of your expected frequencies is less than 5? Or how do we determine degrees of freedom in instances where we have multiple dimensions in our observed data? These are further areas I’d like to explore, but I’ll leave it here for now.

Appendix

Coming soon

Next

Deriving the Normal Distribution