# Squared deviations

In probability theory and statistics, the variance is defined as the expected value (for a theoretical distribution) or the average value (for observed experimental data) of squared deviations from the mean. Computations for analysis of variance involve partitioning a sum of squared deviations. These computations are easier to follow after a detailed study of the quantity:

\( \operatorname{E}( X ^ 2 ). \)

It is well known that for a random variable \( X \) with mean \( \mu \) and variance \( \sigma^2 \):

\( \sigma^2 = \operatorname{E}( X ^ 2 ) - \mu^2 \) [1]

Therefore

\( \operatorname{E}( X ^ 2 ) = \sigma^2 + \mu^2. \)

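From the above, the following are easily derived for \( n \) independent observations of \( X \), using \( \operatorname{E}\left(\sum X\right) = n\mu \) and \( \operatorname{Var}\left(\sum X\right) = n\sigma^2 \):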

\( \operatorname{E}\left( \sum\left( X ^ 2\right) \right) = n\sigma^2 + n\mu^2 \)

\( \operatorname{E}\left( \left(\sum X \right)^ 2 \right) = n\sigma^2 + n^2\mu^2 \)
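Both expectations can be checked exactly for a small case. The sketch below assumes the n observations are independent draws from one distribution; the distribution (uniform on 1, 2, 3), the choice n = 3, and the helper name `expectation` are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction
from itertools import product

# Illustrative assumption: X is uniform on {1, 2, 3}; we take n = 3 independent draws.
values = [1, 2, 3]
n = 3

mu = Fraction(sum(values), len(values))                              # E(X)
sigma2 = Fraction(sum(v * v for v in values), len(values)) - mu**2   # E(X^2) - mu^2


def expectation(stat):
    """Exact expected value of stat(sample) over all equally likely samples of size n."""
    samples = list(product(values, repeat=n))
    return sum(Fraction(stat(s)) for s in samples) / len(samples)


e_sum_of_squares = expectation(lambda s: sum(x * x for x in s))
e_square_of_sum = expectation(lambda s: sum(s) ** 2)

print(e_sum_of_squares == n * sigma2 + n * mu**2)     # True
print(e_square_of_sum == n * sigma2 + n**2 * mu**2)   # True
```

Exact rational arithmetic is used so that the comparison is an equality rather than a floating-point approximation.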

## Sample variance

The sum of squared deviations needed to calculate variance (before deciding whether to divide by n or n − 1) is most easily calculated as

\( S = \sum x ^ 2 - \frac{\left(\sum x\right)^2}{n} \)

From the two derived expectations above, the expected value of this sum is

\( \operatorname{E}(S) = n\sigma^2 + n\mu^2 - \frac{n\sigma^2 + n^2\mu^2}{n} \)

which implies

\( \operatorname{E}(S) = (n - 1)\sigma^2. \)

This justifies the divisor n − 1 in the calculation of an unbiased sample estimate of \( \sigma^2 \): dividing \( S \) by n − 1 gives an estimator whose expected value is \( \sigma^2 \).
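As a rough check, the shortcut form of \( S \) can be compared with the definition (squared deviations from the sample mean), and \( \operatorname{E}(S) \) can be computed exactly for a small case. The function names, the sample data, and the uniform distribution on 1, 2, 3 are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product


def sum_sq_dev(xs):
    """Shortcut form: sum of x^2 minus (sum of x)^2 / n, computed exactly."""
    n = len(xs)
    return sum(Fraction(x) ** 2 for x in xs) - Fraction(sum(xs)) ** 2 / n


def sum_sq_dev_direct(xs):
    """Definition: sum of squared deviations from the sample mean."""
    mean = Fraction(sum(xs)) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


# Illustrative data, not taken from the text.
sample = [2, 4, 4, 5, 7]
assert sum_sq_dev(sample) == sum_sq_dev_direct(sample)

# Exact E(S) for n = 3 independent draws from a uniform distribution on {1, 2, 3}.
values, n = [1, 2, 3], 3
sigma2 = Fraction(2, 3)  # variance of that distribution
samples = list(product(values, repeat=n))
expected_S = sum(sum_sq_dev(s) for s in samples) / len(samples)

print(expected_S == (n - 1) * sigma2)  # True
```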

## Partition: analysis of variance

When data are available for \( k \) different treatment groups of size \( n_i \), where \( i \) varies from 1 to \( k \) and \( n = \sum_{i=1}^k n_i \) is the total number of observations, it is assumed that the expected mean of each group is

\( \operatorname{E}(\mu_i) = \mu + T_i \)

and the variance of each treatment group is unchanged from the population variance \( \sigma^2. \)

Under the null hypothesis that the treatments have no effect, each of the \( T_i \) will be zero.

It is now possible to calculate three sums of squares:

Individual

\( I = \sum x^2 \)

\( \operatorname{E}(I) = n\sigma^2 + n\mu^2 \) (under the null hypothesis)

Treatments

\( T = \sum_{i=1}^k \left(\left(\sum x\right)^2/n_i\right) \)

where the inner sum is taken over the \( n_i \) observations in group \( i \).

\( \operatorname{E}(T) = k\sigma^2 + \sum_{i=1}^k n_i(\mu + T_i)^2 \)

\( \operatorname{E}(T) = k\sigma^2 + n\mu^2 + 2\mu \sum_{i=1}^k (n_iT_i) + \sum_{i=1}^k n_i(T_i)^2 \)

Under the null hypothesis that the treatments cause no differences and all the \( T_i \) are zero, the expectation simplifies to

\( \operatorname{E}(T) = k\sigma^2 + n\mu^2. \)

Combination

\( C = \left(\sum x\right)^2/n \)

\( \operatorname{E}(C) = \sigma^2 + n\mu^2 \) (under the null hypothesis)

### Sums of squared deviations

Under the null hypothesis, the expected value of the difference of any pair of \( I \), \( T \), and \( C \) does not depend on \( \mu \), only on \( \sigma^2 \):

\( \operatorname{E}(I - C) = (n - 1)\sigma^2 \) (total squared deviations, also known as the total sum of squares)

\( \operatorname{E}(T - C) = (k - 1)\sigma^2 \) (treatment squared deviations, also known as the explained sum of squares)

\( \operatorname{E}(I - T) = (n - k)\sigma^2 \) (residual squared deviations, also known as the residual sum of squares)

The constants (n − 1), (k − 1), and (n − k) are normally referred to as the number of degrees of freedom.
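The differences \( I - C \), \( T - C \), and \( I - T \) coincide with the usual definitional sums of squares (total, between groups, and within groups). The sketch below checks this on made-up data; the group values and all variable names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical data: k = 3 treatment groups of unequal size.
groups = [[3, 5, 4], [8, 6], [1, 2, 2, 3]]
all_x = [x for g in groups for x in g]
n = len(all_x)

# The three uncorrected sums of squares from the text.
individual = sum(Fraction(x) ** 2 for x in all_x)                  # I
treatments = sum(Fraction(sum(g)) ** 2 / len(g) for g in groups)   # T
combination = Fraction(sum(all_x)) ** 2 / n                        # C

# Definitional sums of squared deviations.
grand_mean = Fraction(sum(all_x), n)
group_means = [Fraction(sum(g), len(g)) for g in groups]
total = sum((x - grand_mean) ** 2 for x in all_x)
between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

assert individual - combination == total      # I - C: total
assert treatments - combination == between    # T - C: treatment (explained)
assert individual - treatments == within      # I - T: residual
assert total == between + within              # the partition itself
```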

### Example

In a very simple example, 5 observations arise from two treatments. The first treatment gives three values, 1, 2, and 3, and the second treatment gives two values, 4 and 6.

\( I = \frac{1^2}{1} + \frac{2^2}{1} + \frac{3^2}{1} + \frac{4^2}{1} + \frac{6^2}{1} = 66 \)

\( T = \frac{(1 + 2 + 3)^2}{3} + \frac{(4 + 6)^2}{2} = 12 + 50 = 62 \)

\( C = \frac{(1 + 2 + 3 + 4 + 6)^2}{5} = 256/5 = 51.2 \)

Giving

Total squared deviations = 66 − 51.2 = 14.8 with 4 degrees of freedom.

Treatment squared deviations = 62 − 51.2 = 10.8 with 1 degree of freedom.

Residual squared deviations = 66 − 62 = 4 with 3 degrees of freedom.
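The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# The two treatment groups from the example.
groups = [[1, 2, 3], [4, 6]]
observations = [x for g in groups for x in g]
n, k = len(observations), len(groups)

individual = sum(Fraction(x) ** 2 for x in observations)           # I = 66
treatments = sum(Fraction(sum(g)) ** 2 / len(g) for g in groups)   # T = 62
combination = Fraction(sum(observations)) ** 2 / n                 # C = 256/5

rows = [
    ("Total", individual - combination, n - 1),
    ("Treatment", treatments - combination, k - 1),
    ("Residual", individual - treatments, n - k),
]
for name, ss, df in rows:
    print(f"{name}: {float(ss):.1f} with {df} degrees of freedom")
```

The printed sums of squared deviations are 14.8, 10.8, and 4.0, with 4, 1, and 3 degrees of freedom respectively.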

### Two-way analysis of variance

The following hypothetical example gives the yields of 15 plants subject to two different environmental variations, and three different fertilisers.

Fertiliser | Extra \( \mathrm{CO}_2 \) | Extra humidity |
---|---|---|
No fertiliser | 7, 2, 1 | 7, 6 |
Nitrate | 11, 6 | 10, 7, 3 |
Phosphate | 5, 3, 4 | 11, 4 |

Five sums of squares are calculated:

Factor | Calculation | Sum | Coefficient of \( \sigma^2 \) |
---|---|---|---|
Individual | \( 7^2+2^2+1^2 + 7^2+6^2 + 11^2+6^2 + 10^2+7^2+3^2 + 5^2+3^2+4^2 + 11^2+4^2 \) | 641 | 15 |
Fertiliser × Environment | \( \frac{(7+2+1)^2}{3} + \frac{(7+6)^2}{2} + \frac{(11+6)^2}{2} + \frac{(10+7+3)^2}{3} + \frac{(5+3+4)^2}{3} + \frac{(11+4)^2}{2} \) | 556.1667 | 6 |
Fertiliser | \( \frac{(7+2+1+7+6)^2}{5} + \frac{(11+6+10+7+3)^2}{5} + \frac{(5+3+4+11+4)^2}{5} \) | 525.4 | 3 |
Environment | \( \frac{(7+2+1+11+6+5+3+4)^2}{8} + \frac{(7+6+10+7+3+11+4)^2}{7} \) | 519.2679 | 2 |
Composite | \( \frac{(7+2+1+11+6+5+3+4+7+6+10+7+3+11+4)^2}{15} \) | 504.6 | 1 |

Factor | Sum | Coefficient of \( \sigma^2 \) | Total | Environment | Fertiliser | Fertiliser × Environment | Residual |
---|---|---|---|---|---|---|---|
Individual | 641 | 15 | 1 | | | | 1 |
Fertiliser × Environment | 556.1667 | 6 | | | | 1 | −1 |
Fertiliser | 525.4 | 3 | | | 1 | −1 | |
Environment | 519.2679 | 2 | | 1 | | −1 | |
Composite | 504.6 | 1 | −1 | −1 | −1 | 1 | |
Squared deviations | | | 136.4 | 14.668 | 20.8 | 16.099 | 84.833 |
Degrees of freedom | | | 14 | 1 | 2 | 2 | 9 |
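The five sums and the resulting squared deviations can be recomputed from the yield data. The sketch below is one possible arrangement, not a prescribed method: the dictionary layout and the helper `pooled_sum` are assumptions introduced for illustration, and each deviation is formed from the five sums using the +1 and −1 weights in the table.

```python
from fractions import Fraction

# Yields keyed by (fertiliser, environment), copied from the data table.
cells = {
    ("No fertiliser", "Extra CO2"): [7, 2, 1],
    ("No fertiliser", "Extra humidity"): [7, 6],
    ("Nitrate", "Extra CO2"): [11, 6],
    ("Nitrate", "Extra humidity"): [10, 7, 3],
    ("Phosphate", "Extra CO2"): [5, 3, 4],
    ("Phosphate", "Extra humidity"): [11, 4],
}


def pooled_sum(key_of):
    """Sum of (group total)^2 / (group size), grouping cells by key_of(cell_key)."""
    groups = {}
    for cell_key, ys in cells.items():
        groups.setdefault(key_of(cell_key), []).extend(ys)
    return sum(Fraction(sum(ys)) ** 2 / len(ys) for ys in groups.values())


individual = sum(Fraction(y) ** 2 for ys in cells.values() for y in ys)
interaction = pooled_sum(lambda key: key)       # one group per cell
fertiliser = pooled_sum(lambda key: key[0])     # pooled over environments
environment = pooled_sum(lambda key: key[1])    # pooled over fertilisers
composite = pooled_sum(lambda key: None)        # all 15 plants in one group

deviations = {
    "Total": individual - composite,
    "Environment": environment - composite,
    "Fertiliser": fertiliser - composite,
    "Fertiliser x Environment": interaction - fertiliser - environment + composite,
    "Residual": individual - interaction,
}
for name, ss in deviations.items():
    print(f"{name}: {float(ss):.3f}")
```

Rounded to three decimals, the printed values are 136.400, 14.668, 20.800, 16.099, and 84.833.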

## See also

Variance decomposition

Errors and residuals in statistics

Absolute deviation

## References

[1] Mood & Graybill: An introduction to the Theory of Statistics (McGraw Hill)
