Laplace's Rule of Succession
Earlier this year, the legendary 3b1b posted a video on the binomial distribution. The video is motivated by the following question:
Suppose an item on Amazon received $k$ good reviews out of a total of $n$ reviews. What is the estimate for its true quality?
The naive answer of $\frac{k}{n}$ lacks nuance (though it is the maximum likelihood estimate). For instance, even if $k=n$, it feels unlikely that the true quality is 100%, especially if $n$ is small. As 3b1b explains, if the true quality is 95% and $n=10$, then $k$ will still be $10$ with ~60% probability.
It turns out that the expected value of the true quality is $\frac{k+1}{n+2}$ (apologies to any frequentists out there; I know that in your view, the true quality is a universal truth and hence can't really have an "expectation"). Or, as 3b1b put it, "imagine there were two extra reviews, one good and one bad". This handy heuristic is called "Laplace's Rule of Succession", which 3b1b did not derive (but probably will in a future video).
I've never seen this rule before, so I'm intrigued as to why it holds! It seems arbitrary that we add two extra reviews. Why not one or three, and why is the rule valid for all $(n,k)$? So, here's a short attempt to derive it.
We start from the binomial distribution, which, for $n$ trials with success probability $p$ (what we've been calling the "true quality"), gives the probability of $k$ total successes:
$$P(k\;|\;p,n)={n\choose k}p^k (1-p)^{n-k}$$
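As a quick sanity check (a minimal sketch, assuming Python with scipy is available; the values of $n$, $k$ and $p$ are just illustrative), we can compare this formula against a library implementation and recover the ~60% figure quoted above:

```python
# Quick numerical check of the binomial PMF above, and of the earlier claim
# that a 95%-quality item gets 10 good reviews out of 10 about 60% of the
# time. The values of n, k, p are illustrative only.
from math import comb

from scipy.stats import binom

n, k, p = 10, 10, 0.95

manual = comb(n, k) * p**k * (1 - p)**(n - k)  # the formula above
library = binom.pmf(k, n, p)                   # scipy's binomial PMF

print(manual, library)  # both ~0.5987, i.e. roughly 60%
```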
We now convert this into a likelihood function, which describes how plausible each value of $p$ is given the observed $k$. Note that the likelihood function $L(p\;|\;k,n)$ is a function of $p$, in contrast to $P(k\;|\;p,n)$, which is a function of $k$.
$$L(p\;|\;k,n)={n\choose k}p^k (1-p)^{n-k}$$
The likelihood is not a probability: it doesn't integrate to 1 over $p$, and it is only meaningful relative to other likelihoods. Because only these ratios matter, we can drop the constant ${n\choose k}$ factor:
$$L(p\;|\;k,n)=p^k (1-p)^{n-k}$$
Under Bayesian inference, multiplying the likelihood function by the parameter's prior density and then normalising gives the posterior density. This means we can convert the above likelihood function into a probability density function as follows:
$$P_1(p\;|\;k,n)=c\cdot L(p\;|\;k,n)\cdot P_0(p)$$
where $P_1$ and $P_0$ are the posterior and prior densities, and $c$ is a normalising constant.
As with many Bayesian problems, we assume a uniform prior for $p$, which represents our total lack of knowledge about $p$ before we receive any reviews. Hence $P_0(p)=1$ and so we have
$$P_1(p\;|\;k,n)=c\cdot p^k (1-p)^{n-k}$$
implying that $c^{-1}=\int_0^1 p^k (1-p)^{n-k}\, dp$, so that the posterior integrates to 1.
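To make this concrete, here is a small sketch (assuming scipy, with arbitrary example values of $n$ and $k$) that computes the normalising integral numerically; it equals the beta function $B(k+1,\,n-k+1)$ that shows up in the derivation below, which is another way of saying the posterior is a $\mathrm{Beta}(k+1,\,n-k+1)$ density.

```python
# Sketch (assuming scipy) of the normalising constant: c^{-1} is the integral
# of p^k (1-p)^(n-k) over [0, 1], which equals the beta function B(k+1, n-k+1)
# used in the derivation below. n and k are arbitrary example values.
from scipy.integrate import quad
from scipy.special import beta

n, k = 10, 8

c_inv_numeric, _ = quad(lambda p: p**k * (1 - p)**(n - k), 0, 1)
c_inv_exact = beta(k + 1, n - k + 1)

print(c_inv_numeric, c_inv_exact)  # both ~0.00202 (= 1/495)
```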
And so, we finally arrive at the original problem. Given $n$ and $k$, what is our best estimate, or expected value, for $p$?
$$\begin{align*}\mathbb{E}(p)&=\int_0^1 p\cdot P_1(p\;|\;k,n)dp\\&=\frac{\int_0^1 p^{k+1} (1-p)^{n-k}dp}{\int_0^1 p^k (1-p)^{n-k}dp}\\&=\frac{B(k+2, \;n-k+1)}{B(k+1, \;n-k+1)}\\&=\frac{\Gamma(k+2)\;\Gamma(n-k+1)}{\Gamma(n+3)}\cdot\frac{\Gamma(n+2)}{\Gamma(k+1)\;\Gamma(n-k+1)}\\&=\frac{(k+1)!\;(n-k)!}{(n+2)!}\cdot\frac{(n+1)!}{k!\;(n-k)!}\\&=\frac{k+1}{n+2}\end{align*}$$
where we've used three identities:
- the beta function $B(x,y)=\int_0^1 t^{x-1}(1-t)^{y-1} dt$
- the beta function's link to the gamma function $B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
- the fact that the gamma function generalises the factorial: $\Gamma(x)=(x-1)!$ for $x\in\mathbb{N}$
And so $\mathbb{E}(p)=\frac{k+1}{n+2}$, the result we set out to show.
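As a closing sanity check, here is a short sketch (again assuming scipy, with arbitrary example values of $n$ and $k$) comparing the ratio of integrals from the derivation, the mean of the corresponding Beta posterior, and the rule of succession itself:

```python
# Final check of E(p) = (k+1)/(n+2), using arbitrary example values of n and k.
# With a uniform prior, the posterior is a Beta(k+1, n-k+1) distribution, so we
# can compare the ratio of integrals from the derivation, the Beta mean, and
# the rule of succession directly.
from scipy.integrate import quad
from scipy.stats import beta

n, k = 10, 8

numerator, _ = quad(lambda p: p**(k + 1) * (1 - p)**(n - k), 0, 1)
denominator, _ = quad(lambda p: p**k * (1 - p)**(n - k), 0, 1)

print(numerator / denominator)        # ~0.75
print(beta(k + 1, n - k + 1).mean())  # 0.75
print((k + 1) / (n + 2))              # 9/12 = 0.75
```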