PS 813 - Causal Inference
January 21, 2026
The classic estimation problem in statistics is to estimate some unknown population mean \(\mu\) from an i.i.d. sample of \(n\) observations \(Y_1, Y_2, \dotsc, Y_n\).
Our estimand: \(\mu\)
Our estimator: The sample mean \(\hat{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)
Our estimate: A particular realization of that estimator based on our observed sample (e.g. \(0.4\))
Note that our estimator is itself a random variable: it is a function of the \(Y_i\)s, which are random variables.
Is the expectation of \(\hat{\mu}\) equal to \(\mu\)?
First we pull out the constant.
\[E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n}E\left[\sum_{i=1}^n Y_i\right]\]
Next we use linearity of expectations
\[\frac{1}{n}E\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n}\sum_{i=1}^n E\left[Y_i\right]\]
Finally, under our i.i.d. assumption
\[\frac{1}{n}\sum_{i=1}^n E\left[Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n \mu}{n} = \mu\]
Therefore, the bias is \(\text{Bias}(\hat{\mu}) = E[\hat{\mu}] - \mu = 0\): the sample mean is an unbiased estimator of \(\mu\).
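As a sanity check, unbiasedness can be verified by simulation. The sketch below (all parameter values are illustrative, not from the notes) draws many i.i.d. samples and averages the resulting sample means:

```python
import random
import statistics

# Illustrative parameters (not from the notes): Y_i ~ Normal(mu, sigma).
random.seed(813)
mu, sigma, n, reps = 2.0, 1.5, 50, 20_000

# Compute the sample mean for many independent samples of size n.
estimates = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

# The average of the estimates should be very close to mu (unbiasedness).
mean_of_estimates = statistics.fmean(estimates)
print(mean_of_estimates)
```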
What is the variance of \(\hat{\mu}\)? Again, start by pulling out the constant.
\[Var(\hat{\mu}) = Var\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right]\]
We can simplify further using independence: the variance of a sum of independent random variables is the sum of the variances.
\[\frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right]\]
Finally, since the \(Y_i\) are identically distributed, each has the same variance \(\sigma^2\):
\[\frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]
Therefore, the variance is \(Var(\hat{\mu}) = \frac{\sigma^2}{n}\), which shrinks to zero as \(n\) grows.
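The \(\sigma^2/n\) formula can likewise be checked by simulation; the sketch below (parameter values are illustrative) compares the empirical variance of repeated sample means against \(\sigma^2/n\):

```python
import random
import statistics

# Illustrative parameters (not from the notes).
random.seed(1)
mu, sigma, n, reps = 0.0, 2.0, 25, 40_000

# Compute the sample mean for many independent samples of size n.
estimates = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

# Empirical variance of the estimator vs. the theoretical sigma^2 / n.
empirical_var = statistics.pvariance(estimates)
theoretical_var = sigma ** 2 / n
print(empirical_var, theoretical_var)
```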
Another way of talking about properties of estimators is to consider how they behave as the sample size gets large.
This requires us to talk about sequences of estimators indexed (typically) by sample size.
\[\{\bar{Y}_n\}_{n \ge 1} = \{\bar{Y}_1, \bar{Y}_2, \bar{Y}_3, \dotsc\}\]
Can we characterize the limit of this sequence of estimators?
A sequence of random variables \(\hat{\mu}_n\) is said to converge in probability to some value \(\mu\) if for every \(\epsilon > 0\),
\[\Pr(|\hat{\mu}_n - \mu| \ge \epsilon) \to 0\] as \(n \to \infty\).
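For the sample mean, this definition can be illustrated with a small Monte Carlo sketch (the \(\epsilon\), sample sizes, and replication count below are arbitrary choices): the estimated exceedance probability shrinks toward zero as \(n\) grows.

```python
import random
import statistics

# Illustrative parameters (not from the notes).
random.seed(7)
mu, sigma, eps, reps = 0.0, 1.0, 0.2, 5_000

def exceed_prob(n: int) -> float:
    """Monte Carlo estimate of Pr(|Ybar_n - mu| >= eps)."""
    hits = 0
    for _ in range(reps):
        ybar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
        if abs(ybar - mu) >= eps:
            hits += 1
    return hits / reps

# The probability of a large deviation falls as n grows.
probs = [exceed_prob(n) for n in (10, 50, 250)]
print(probs)
```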
A sequence of random variables \(X_n\) with CDFs \(F_n(x)\) is said to converge in distribution to a random variable \(X\) with CDF \(F(x)\) if
\[\lim_{n \to \infty} F_n(x) = F(x)\] at every point \(x\) where \(F\) is continuous.
The most famous convergence in distribution result is the central limit theorem
By the central limit theorem
\[\sqrt{n}(\hat{\mu}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]
And we’ll use this result to justify our asymptotic approximation of the sampling distribution of \(\hat{\mu}_n\)
\[\hat{\mu}_n \overset{a}{\sim} \mathcal{N}\bigg(\mu, \frac{\sigma^2}{n}\bigg)\]
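One way to see this approximation at work is a coverage check: under the asymptotic normal approximation, roughly 95% of sample means should land within \(\mu \pm 1.96\,\sigma/\sqrt{n}\), even when the underlying data are non-normal. The sketch below uses Exponential(1) data; all parameter choices are illustrative, not from the notes.

```python
import random
import statistics

# Illustrative setup (not from the notes): Exponential(1) has mean 1 and sd 1.
random.seed(42)
mu, sigma = 1.0, 1.0
n, reps = 100, 10_000
half_width = 1.96 * sigma / n ** 0.5

covered = 0
for _ in range(reps):
    ybar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    if abs(ybar - mu) <= half_width:
        covered += 1

# Fraction of sample means within mu +/- 1.96 * sigma / sqrt(n): close to 0.95.
print(covered / reps)
```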
Sometimes we’ll want to use shorthand to characterize the asymptotic behavior of a given term
Little-\(o_p\) notation denotes convergence in probability to zero: \(X_n = o_p(1)\) means \(X_n \xrightarrow{p} 0\)
Big-\(O_p\) notation denotes boundedness in probability (stochastic boundedness) of a sequence as \(n \to \infty\)
The most common place you will see Big-\(O_p\) notation is in discussions of the convergence rates of estimators
An estimator \(\hat{\mu}_n\) has a convergence rate \(r_n \to \infty\) if
\[r_n(\hat{\mu}_n - \mu) = O_p(1)\]
This implies that the estimation error \(\hat{\mu}_n - \mu\) is \(O_p(1/r_n)\); in other words, it is bounded in probability by a term of order \(1/r_n \to 0\).
From the central limit theorem, the most common convergence rate that you’ll encounter is \(\sqrt{n}\)
Suppose the central limit theorem holds w.r.t. \(\hat{\mu}_n\) (e.g. it’s the sample mean).
\[\sqrt{n}(\hat{\mu}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]
The limiting distribution is normal, with a finite variance that does not depend on \(n\)
And therefore
\[\sqrt{n}(\hat{\mu}_n - \mu) = O_p(1)\]
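This \(O_p(1)\) claim can be visualized by simulation: the spread of \(\sqrt{n}(\hat{\mu}_n - \mu)\) stays roughly constant (about \(\sigma\)) as \(n\) grows, even though the raw error \(\hat{\mu}_n - \mu\) shrinks. The sketch below uses illustrative parameter values.

```python
import random
import statistics

# Illustrative parameters (not from the notes).
random.seed(0)
mu, sigma, reps = 0.0, 1.0, 4_000

def scaled_error_sd(n: int) -> float:
    """Std. dev. of sqrt(n) * (Ybar_n - mu) across repeated samples."""
    errs = [
        n ** 0.5 * (statistics.fmean(random.gauss(mu, sigma) for _ in range(n)) - mu)
        for _ in range(reps)
    ]
    return statistics.pstdev(errs)

# Roughly constant near sigma = 1.0 for every n: the scaled error is O_p(1).
sds = [scaled_error_sd(n) for n in (25, 100, 400)]
print(sds)
```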
PS 813 - University of Wisconsin - Madison