In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. The beta-binomial distribution is the binomial distribution in which the probability of success is the same at each of the n trials but is itself randomly drawn from a beta distribution. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics to capture overdispersion in binomial-type data.
It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. Similarly, it contains the negative binomial distribution in the limit with large β and n. The beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution as the binomial and beta distributions are univariate versions of the multinomial and Dirichlet distributions respectively.
Motivation and derivation
As a compound distribution
The beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically tractable compound distribution where one can think of the parameter p in the binomial distribution as being randomly drawn from a beta distribution. Namely, if

$$X \sim \operatorname{Bin}(n, p)$$

then

$$P(X = k \mid p, n) = L(p \mid k) = {n \choose k} p^{k} (1-p)^{n-k}$$

where Bin(n,p) stands for the binomial distribution, and where p is a random variable with a beta distribution.
![{\displaystyle {\begin{aligned}\pi (p\mid \alpha ,\beta )&=\mathrm {Beta} (\alpha ,\beta )\\[5pt]&={\frac {p^{\alpha -1}(1-p)^{\beta -1}}{\mathrm {B} (\alpha ,\beta )}}\quad {\text{for }}0\leq p\leq 1,\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/b774219c2be5e5ef2db8114187ce996587b10529)
then the compound distribution is given by
![{\displaystyle {\begin{aligned}f(k\mid n,\alpha ,\beta )&=\int _{0}^{1}L(p\mid k)\pi (p\mid \alpha ,\beta )\,dp\\[6pt]&={n \choose k}{\frac {1}{\mathrm {B} (\alpha ,\beta )}}\int _{0}^{1}p^{k+\alpha -1}(1-p)^{n-k+\beta -1}\,dp\\[6pt]&={n \choose k}{\frac {\mathrm {B} (k+\alpha ,n-k+\beta )}{\mathrm {B} (\alpha ,\beta )}}.\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/6509c3d1e2dda9163ef5353ad514af33e0f9ae96)
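Using the properties of the beta function, this can alternatively be written

$$f(k\mid n,\alpha,\beta) = \frac{\Gamma(n+1)\,\Gamma(k+\alpha)\,\Gamma(n-k+\beta)}{\Gamma(n+\alpha+\beta)\,\Gamma(k+1)\,\Gamma(n-k+1)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}.$$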
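As a minimal sketch, the closed form f(k | n, α, β) = C(n, k) B(k + α, n − k + β) / B(α, β) can be evaluated with log-gamma arithmetic to avoid overflow. The function names `log_beta` and `betabinom_pmf` are illustrative choices, not part of any particular library.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """f(k | n, alpha, beta) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    if not 0 <= k <= n:
        return 0.0
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


# Illustrative parameter values (assumed for the example only).
pmf = [betabinom_pmf(k, 10, 2.0, 3.0) for k in range(11)]

# Special case alpha = beta = 1: discrete uniform on 0..n.
uniform = [betabinom_pmf(k, 10, 1.0, 1.0) for k in range(11)]

# Special case n = 1: Bernoulli with success probability alpha / (alpha + beta).
bernoulli_success = betabinom_pmf(1, 1, 2.0, 3.0)
```

The two special cases mirror the reductions stated in the introduction: the uniform case for α = β = 1 and the Bernoulli case for n = 1.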
Beta-binomial as an urn model
The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β, known as the Pólya urn model. Specifically, imagine an urn containing α red balls and β black balls, where random draws are made. If a red ball is observed, then two red balls are returned to the urn. Likewise, if a black ball is drawn, then two black balls are returned to the urn. If this is repeated n times, then the probability of observing k red balls follows a beta-binomial distribution with parameters n, α and β.
Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a binomial distribution and if the random draws are made without replacement, the distribution follows a hypergeometric distribution.
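The urn description can be checked exactly for small cases. The sketch below propagates the probability of each red-ball count through n draws using rational arithmetic and compares the result with the beta-binomial probabilities for integer α and β. The helper names and the example urn (2 red, 3 black, 6 draws) are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb, factorial


def polya_urn_distribution(n: int, red: int, black: int) -> list:
    """Exact distribution of the number of red draws in n draws from a Polya urn.

    After each draw the ball is returned together with one extra ball of the
    same colour, so the urn grows by one ball per draw.
    """
    probs = [Fraction(1)]  # probs[k] = P(k red balls drawn so far)
    for t in range(n):
        total = red + black + t  # balls in the urn before draw t + 1
        nxt = [Fraction(0)] * (t + 2)
        for k, p in enumerate(probs):
            p_red = Fraction(red + k, total)
            nxt[k + 1] += p * p_red
            nxt[k] += p * (1 - p_red)
        probs = nxt
    return probs


def beta_int(a: int, b: int) -> Fraction:
    """Beta function for positive integers: (a-1)! (b-1)! / (a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def betabinom_pmf_exact(k: int, n: int, alpha: int, beta: int) -> Fraction:
    """Beta-binomial probability for integer alpha and beta, as an exact fraction."""
    return comb(n, k) * beta_int(k + alpha, n - k + beta) / beta_int(alpha, beta)


urn = polya_urn_distribution(6, 2, 3)
closed_form = [betabinom_pmf_exact(k, 6, 2, 3) for k in range(7)]
```

Because both lists hold exact fractions, they can be compared with `==` rather than with a floating-point tolerance.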
Moments and properties
The first three raw moments are
![{\displaystyle {\begin{aligned}\mu _{1}&={\frac {n\alpha }{\alpha +\beta }}\\[8pt]\mu _{2}&={\frac {n\alpha [n(1+\alpha )+\beta ]}{(\alpha +\beta )(1+\alpha +\beta )}}\\[8pt]\mu _{3}&={\frac {n\alpha [n^{2}(1+\alpha )(2+\alpha )+3n(1+\alpha )\beta +\beta (\beta -\alpha )]}{(\alpha +\beta )(1+\alpha +\beta )(2+\alpha +\beta )}}\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/d8b08123d7cc1c1b79069bd5d3d3f78776de5945)
and the kurtosis is
![{\displaystyle \beta _{2}={\frac {(\alpha +\beta )^{2}(1+\alpha +\beta )}{n\alpha \beta (\alpha +\beta +2)(\alpha +\beta +3)(\alpha +\beta +n)}}\left[(\alpha +\beta )(\alpha +\beta -1+6n)+3\alpha \beta (n-2)+6n^{2}-{\frac {3\alpha \beta n(6-n)}{\alpha +\beta }}-{\frac {18\alpha \beta n^{2}}{(\alpha +\beta )^{2}}}\right].}](https://wikimedia.org/api/rest_v1/media/math/render/svg/8a0a324a1e2fa8215447cc6cf5761738050f371f)
Letting $\pi = \frac{\alpha}{\alpha+\beta}$, we note, suggestively, that the mean can be written as

$$\mu = \frac{n\alpha}{\alpha+\beta} = n\pi$$

and the variance as
![{\displaystyle \sigma ^{2}={\frac {n\alpha \beta (\alpha +\beta +n)}{(\alpha +\beta )^{2}(\alpha +\beta +1)}}=n\pi (1-\pi ){\frac {\alpha +\beta +n}{\alpha +\beta +1}}=n\pi (1-\pi )[1+(n-1)\rho ]\!}](https://wikimedia.org/api/rest_v1/media/math/render/svg/991ce686abc74a57c81097ad07c2b8eca60b5178)
where $\rho = \tfrac{1}{\alpha+\beta+1}$. The parameter $\rho$ is known as the "intra class" or "intra cluster" correlation. It is this positive correlation which gives rise to overdispersion.
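A quick numerical check of the mean and variance expressions is sketched below. It sums over the probability mass function directly and compares with nπ and nπ(1 − π)[1 + (n − 1)ρ], taking π = α/(α + β) and ρ = 1/(α + β + 1). The parameter values are arbitrary illustrations.

```python
from math import exp, lgamma


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Beta-binomial probability mass function via log-gamma."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_num = lgamma(k + alpha) + lgamma(n - k + beta) - lgamma(n + alpha + beta)
    log_den = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return exp(log_choose + log_num - log_den)


n, alpha, beta = 12, 2.0, 5.0  # illustrative values
pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]

# Moments obtained by brute-force summation over the support.
mean_exact = sum(k * p for k, p in enumerate(pmf))
var_exact = sum((k - mean_exact) ** 2 * p for k, p in enumerate(pmf))

# Moments from the closed-form expressions.
pi = alpha / (alpha + beta)
rho = 1.0 / (alpha + beta + 1.0)
mean_formula = n * pi
var_formula = n * pi * (1 - pi) * (1 + (n - 1) * rho)

# Ratio of the beta-binomial variance to the binomial variance n*pi*(1-pi).
overdispersion = var_formula / (n * pi * (1 - pi))
```

The ratio `overdispersion` equals 1 + (n − 1)ρ, which exceeds 1 whenever n > 1, since ρ is positive for any valid α and β.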
Point estimates
Method of moments
The method of moments estimates can be obtained by noting the first and second moments of the beta-binomial, namely
![{\displaystyle {\begin{aligned}\mu _{1}&={\frac {n\alpha }{\alpha +\beta }}\\[6pt]\mu _{2}&={\frac {n\alpha [n(1+\alpha )+\beta ]}{(\alpha +\beta )(1+\alpha +\beta )}}\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/8823da9a5ad741ae07796e17e601c4f0d325013b)
and setting these raw moments equal to the first and second raw sample moments respectively
![{\displaystyle {\begin{aligned}{\widehat {\mu }}_{1}&:=m_{1}={\frac {1}{N}}\sum _{i=1}^{N}X_{i}\\[6pt]{\widehat {\mu }}_{2}&:=m_{2}={\frac {1}{N}}\sum _{i=1}^{N}X_{i}^{2}\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/8ecd69c69958e11798bf6777604329cf654f18da)
and solving for α and β we get
![{\displaystyle {\begin{aligned}{\widehat {\alpha }}&={\frac {nm_{1}-m_{2}}{n({\frac {m_{2}}{m_{1}}}-m_{1}-1)+m_{1}}}\\[5pt]{\widehat {\beta }}&={\frac {(n-m_{1})(n-{\frac {m_{2}}{m_{1}}})}{n({\frac {m_{2}}{m_{1}}}-m_{1}-1)+m_{1}}}.\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/74abfcda5a1906399e3218f8a67446428ad1d557)
Note that these estimates can be nonsensically negative, which is evidence that the data are either undispersed or underdispersed relative to the binomial distribution. In these cases, the binomial distribution and the hypergeometric distribution, respectively, are alternative candidates.
Maximum likelihood estimation
While closed-form maximum likelihood estimates are impractical, the pmf consists of common functions (gamma and/or beta functions), so the estimates can easily be found via direct numerical optimization. Maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, which are described in (Minka 2003).
The R package VGAM through the function vglm, via maximum likelihood, facilitates the fitting of glm type models with responses distributed according to the beta-binomial distribution. Note also that there is no requirement that n is fixed throughout the observations.
Example
The following data give the number of male children among the first 12 children in 6115 families of size 13, taken from hospital records in 19th-century Saxony (Sokal and Rohlf, p. 59 from Lindsey). The 13th child is ignored to mitigate the effect of families non-randomly stopping when a desired gender is reached.
| Males | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Families | 3 | 24 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
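We note the first two sample moments are

$$m_1 = 6.23, \qquad m_2 = 42.31, \qquad n = 12$$

and therefore the method of moments estimates are

$$\widehat{\alpha} = 34.1350, \qquad \widehat{\beta} = 31.6085.$$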
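The sample moments and the method of moments estimates can be reproduced from the family counts in the table. The sketch below applies the estimator formulas from the "Method of moments" section. The variable and function names are illustrative.

```python
# Number of families with k males among the first 12 children, k = 0..12.
families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12  # trials (children counted) per family

N = sum(families)  # number of families

# First and second raw sample moments of the number of males.
m1 = sum(k * f for k, f in enumerate(families)) / N
m2 = sum(k * k * f for k, f in enumerate(families)) / N


def method_of_moments(m1: float, m2: float, n: int) -> tuple:
    """Solve the first two beta-binomial moment equations for (alpha, beta)."""
    denom = n * (m2 / m1 - m1 - 1) + m1
    alpha = (n * m1 - m2) / denom
    beta = (n - m1) * (n - m2 / m1) / denom
    return alpha, beta


alpha_hat, beta_hat = method_of_moments(m1, m2, n)
print(f"m1 = {m1:.4f}, m2 = {m2:.4f}")
print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```

A negative `alpha_hat` or `beta_hat` would signal data that are not overdispersed relative to the binomial, in which case the estimator should not be used.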
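The maximum likelihood estimates can be found numerically

$$\widehat{\alpha}_{\mathrm{mle}} = 34.09558, \qquad \widehat{\beta}_{\mathrm{mle}} = 31.5715$$

and the maximized log-likelihood is

$$\log \mathcal{L} = -12492.9$$

from which we find the AIC

$$\mathit{AIC} = 24989.74.$$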
The AIC for the competing binomial model is AIC = 25070.34 and thus we see that the beta-binomial model provides a superior fit to the data i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity (also known as "burstiness") in gender-proneness among mammalian offspring (i.e. overdispersion).
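The model comparison can be sketched numerically. The code below maximizes the beta-binomial log-likelihood for the Saxony counts with a crude coordinate pattern search over (logit μ, log M), where μ = α/(α + β) and M = α + β, and then computes AIC = 2p − 2 log L with p = 2 parameters for the beta-binomial and p = 1 for the binomial. The pattern search is a stand-in chosen to keep the example dependency-free; the source does not say which optimizer was used, and all function names are illustrative.

```python
from math import exp, lgamma, log

families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12  # children counted per family


def log_choose(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_loglik(alpha, beta):
    """Log-likelihood of the grouped counts under BetaBin(n, alpha, beta)."""
    return sum(
        f * (log_choose(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))
        for k, f in enumerate(families)
    )


def binom_loglik(p):
    """Log-likelihood of the grouped counts under Bin(n, p)."""
    return sum(
        f * (log_choose(n, k) + k * log(p) + (n - k) * log(1 - p))
        for k, f in enumerate(families)
    )


def fit_betabinom(alpha0=1.0, beta0=1.0, tol=1e-7):
    """Pattern search on x = (logit mu, log M); halves the step when stuck."""
    def to_params(x):
        mu = 1.0 / (1.0 + exp(-x[0]))
        M = exp(x[1])
        return mu * M, (1.0 - mu) * M

    x = [log(alpha0 / beta0), log(alpha0 + beta0)]
    best = betabinom_loglik(*to_params(x))
    step = 0.5
    while step > tol:
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                value = betabinom_loglik(*to_params(trial))
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    alpha, beta = to_params(x)
    return alpha, beta, best


alpha_hat, beta_hat, loglik_betabinom = fit_betabinom()

# The binomial maximum likelihood estimate of p is the overall proportion of males.
p_hat = sum(k * f for k, f in enumerate(families)) / (n * sum(families))
loglik_binom = binom_loglik(p_hat)

aic_betabinom = 2 * 2 - 2 * loglik_betabinom
aic_binom = 2 * 1 - 2 * loglik_binom
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
print(f"AIC beta-binomial = {aic_betabinom:.2f}, AIC binomial = {aic_binom:.2f}")
```

Searching in (logit μ, log M) rather than in (α, β) keeps both parameters positive without explicit constraints and decouples the two coordinates, which helps a coordinate-wise search along the otherwise strongly correlated α and β directions.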
The superior fit is evident especially among the tails
| Males | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Observed Families | 3 | 24 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
| Fitted Expected (Beta-Binomial) | 2.3 | 22.6 | 104.8 | 310.9 | 655.7 | 1036.2 | 1257.9 | 1182.1 | 853.6 | 461.9 | 177.9 | 43.8 | 5.2 |
| Fitted Expected (Binomial p = 0.519215) | 0.9 | 12.1 | 71.8 | 258.5 | 628.1 | 1085.2 | 1367.3 | 1265.6 | 854.2 | 410.0 | 132.8 | 26.1 | 2.3 |
Further Bayesian considerations
It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter: Let
![{\displaystyle {\begin{aligned}\pi (\theta \mid \mu ,M)&=\operatorname {Beta} (M\mu ,M(1-\mu ))\\[6pt]&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}\theta ^{M\mu -1}(1-\theta )^{M(1-\mu )-1}\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/9513115288dd1d5b478ac26cdae886a65641f368)
where
![{\displaystyle {\begin{aligned}\mu &={\frac {\alpha }{\alpha +\beta }}\\[6pt]M&=\alpha +\beta \end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/0710866719b771618db2827f0fd6bca15a88b1b7)
so that
![{\displaystyle {\begin{aligned}\operatorname {E} (\theta \mid \mu ,M)&=\mu \\[6pt]\operatorname {Var} (\theta \mid \mu ,M)&={\frac {\mu (1-\mu )}{M+1}}.\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/a60de1264421c9162224d9e63974b79c9c5f1c1c)
The posterior distribution ρ(θ | k) is also a beta distribution:
![{\displaystyle {\begin{aligned}\rho (\theta \mid k)&\propto \ell (k\mid \theta )\pi (\theta \mid \mu ,M)\\[6pt]&=\operatorname {Beta} (k+M\mu ,n-k+M(1-\mu ))\\[6pt]&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}{n \choose k}\theta ^{k+M\mu -1}(1-\theta )^{n-k+M(1-\mu )-1}\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/c43077d17fa635bd89822f79b4d7edd66616ce39)
And

$$\operatorname{E}(\theta \mid k) = \frac{k + M\mu}{n + M}$$
while the marginal distribution m(k|μ, M) is given by
![{\displaystyle {\begin{aligned}m(k\mid \mu ,M)&=\int _{0}^{1}\ell (k\mid \theta )\pi (\theta \mid \mu ,M)\,d\theta \\[6pt]&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}{n \choose k}\int _{0}^{1}\theta ^{k+M\mu -1}(1-\theta )^{n-k+M(1-\mu )-1}\,d\theta \\[6pt]&={\frac {\Gamma (M)}{\Gamma (M\mu )\Gamma (M(1-\mu ))}}{n \choose k}{\frac {\Gamma (k+M\mu )\Gamma (n-k+M(1-\mu ))}{\Gamma (n+M)}}.\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/eaa1e1bd62615cdbe3fef46e7a187b5237404ffa)
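Substituting back M and μ, in terms of $\alpha$ and $\beta$, this becomes:

$$m(k\mid \alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}{n \choose k}\frac{\Gamma(k+\alpha)\Gamma(n-k+\beta)}{\Gamma(n+\alpha+\beta)}$$

which is the expected beta-binomial distribution with parameters $n$, $\alpha$ and $\beta$.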
We can also use the method of iterated expectations to find the expected value of the marginal moments. Let us write our model as a two-stage compound sampling model. Let $k_i$ be the number of successes out of $n_i$ trials for event i:
![{\displaystyle {\begin{aligned}k_{i}&\sim \operatorname {Bin} (n_{i},\theta _{i})\\[6pt]\theta _{i}&\sim \operatorname {Beta} (\mu ,M),\ \mathrm {i.i.d.} \end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/21d6d5bcb66e8ee3097d5b0fbae6ebea6e69d28e)
We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:
![{\displaystyle \operatorname {E} \left({\frac {k}{n}}\right)=\operatorname {E} \left[\operatorname {E} \left(\left.{\frac {k}{n}}\right|\theta \right)\right]=\operatorname {E} (\theta )=\mu }](https://wikimedia.org/api/rest_v1/media/math/render/svg/9a111b29de189969aa34cc5790268d6bf7b03c49)
![{\displaystyle {\begin{aligned}\operatorname {var} \left({\frac {k}{n}}\right)&=\operatorname {E} \left[\operatorname {var} \left(\left.{\frac {k}{n}}\right|\theta \right)\right]+\operatorname {var} \left[\operatorname {E} \left(\left.{\frac {k}{n}}\right|\theta \right)\right]\\[6pt]&=\operatorname {E} \left[\left(\left.{\frac {1}{n}}\right)\theta (1-\theta )\right|\mu ,M\right]+\operatorname {var} \left(\theta \mid \mu ,M\right)\\[6pt]&={\frac {1}{n}}\left(\mu (1-\mu )\right)+{\frac {n-1}{n}}{\frac {(\mu (1-\mu ))}{M+1}}\\[6pt]&={\frac {\mu (1-\mu )}{n}}\left(1+{\frac {n-1}{M+1}}\right).\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/24f1793e7dd1ca9ebddb7e8986ba819b64c96cb1)
(Here we have used the law of total expectation and the law of total variance.)
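The variance decomposition can be verified numerically. The sketch below computes the within-event term E[var(k/n | θ)] and the between-event term var(E[k/n | θ]) from the beta prior moments, and compares their sum with the variance of k/n obtained by summing over the beta-binomial pmf with α = Mμ and β = M(1 − μ). The values of n, μ and M are arbitrary illustrations.

```python
from math import exp, lgamma


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Beta-binomial probability mass function via log-gamma."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_num = lgamma(k + alpha) + lgamma(n - k + beta) - lgamma(n + alpha + beta)
    log_den = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return exp(log_choose + log_num - log_den)


n, mu, M = 12, 0.3, 8.0  # illustrative values
alpha, beta = M * mu, M * (1 - mu)

# Prior variance of theta under Beta(M*mu, M*(1-mu)).
var_theta = mu * (1 - mu) / (M + 1)

# E[var(k/n | theta)] = E[theta (1 - theta)] / n, with
# E[theta (1 - theta)] = mu (1 - mu) - var(theta).
within = (mu * (1 - mu) - var_theta) / n

# var(E[k/n | theta]) = var(theta).
between = var_theta

# Closed form from the last line of the derivation.
closed_form = mu * (1 - mu) / n * (1 + (n - 1) / (M + 1))

# Brute-force moments of k/n under the marginal beta-binomial distribution.
pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]
mean_exact = sum((k / n) * p for k, p in enumerate(pmf))
var_exact = sum((k / n - mean_exact) ** 2 * p for k, p in enumerate(pmf))
```

As M grows, `between` shrinks toward zero and the total approaches the plain binomial variance μ(1 − μ)/n of a proportion.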
We want point estimates for $\mu$ and $M$. The estimated mean $\widehat{\mu}$ is calculated from the sample

$$\widehat{\mu} = \frac{\sum_{i=1}^{N} k_i}{\sum_{i=1}^{N} n_i}.$$
The estimate of the hyperparameter M is obtained using the moment estimates for the variance of the two-stage model:
![{\displaystyle s^{2}={\frac {1}{N}}\sum _{i=1}^{N}\operatorname {var} \left({\frac {k_{i}}{n_{i}}}\right)={\frac {1}{N}}\sum _{i=1}^{N}{\frac {{\widehat {\mu }}(1-{\widehat {\mu }})}{n_{i}}}\left[1+{\frac {n_{i}-1}{{\widehat {M}}+1}}\right]}](https://wikimedia.org/api/rest_v1/media/math/render/svg/bab04920b21118ce1c9e163ac2f8769fb699cdb3)
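Solving:

$$\widehat{M} = \frac{\widehat{\mu}(1-\widehat{\mu}) - s^2}{s^2 - \frac{\widehat{\mu}(1-\widehat{\mu})}{N}\sum_{i=1}^{N}\frac{1}{n_i}}$$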
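where

$$s^2 = \frac{N\sum_{i=1}^{N} n_i\,(\widehat{\theta}_i - \widehat{\mu})^2}{(N-1)\sum_{i=1}^{N} n_i}$$

with $\widehat{\theta}_i = k_i/n_i$ the observed proportion for event i.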
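Since we now have parameter point estimates, $\widehat{\mu}$ and $\widehat{M}$, for the underlying distribution, we would like to find a point estimate $\tilde{\theta}_i$ for the probability of success for event i. This is the weighted average of the event estimate $\widehat{\theta}_i = k_i/n_i$ and $\widehat{\mu}$. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior

$$\tilde{\theta}_i = \operatorname{E}(\theta \mid k_i) = \frac{k_i + \widehat{M}\widehat{\mu}}{n_i + \widehat{M}} = \frac{\widehat{M}}{n_i + \widehat{M}}\,\widehat{\mu} + \frac{n_i}{n_i + \widehat{M}}\,\frac{k_i}{n_i}.$$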
Shrinkage factors
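We may write the posterior estimate as a weighted average:

$$\tilde{\theta}_i = \widehat{B}_i\,\widehat{\mu} + (1 - \widehat{B}_i)\,\widehat{\theta}_i$$

where $\widehat{B}_i$ is called the shrinkage factor.

$$\widehat{B}_i = \frac{\widehat{M}}{\widehat{M} + n_i}$$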
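The empirical Bayes procedure of this section can be sketched end to end. The data below are hypothetical, invented purely for illustration. The weighted sample variance s² uses an n_i-weighted form, which is an assumption here, and the estimate of M is the algebraic solution of the moment equation for s² given above. Each shrunken estimate is then the prior mean and the raw proportion blended by the shrinkage factor M/(M + n_i).

```python
# Hypothetical data: successes k_i out of n_i trials for N = 5 events.
successes = [2, 12, 30, 4, 27]
trials = [10, 20, 100, 5, 30]
N = len(trials)

# Raw per-event proportions and the pooled mean estimate.
theta_hat = [k / n for k, n in zip(successes, trials)]
mu_hat = sum(successes) / sum(trials)

# Weighted sample variance of the proportions (assumed n_i-weighted form).
s2 = (
    N * sum(n * (t - mu_hat) ** 2 for n, t in zip(trials, theta_hat))
    / ((N - 1) * sum(trials))
)

# Moment estimate of M, solving
#   s2 = (1/N) * sum_i mu(1-mu)/n_i * [1 + (n_i - 1)/(M + 1)]   for M.
a = mu_hat * (1 - mu_hat)
M_hat = (a - s2) / (s2 - a / N * sum(1 / n for n in trials))

# Shrinkage factors and shrunken (posterior mean) estimates.
shrinkage = [M_hat / (M_hat + n) for n in trials]
theta_tilde = [
    b * mu_hat + (1 - b) * t for b, t in zip(shrinkage, theta_hat)
]

for n, t, b, tt in zip(trials, theta_hat, shrinkage, theta_tilde):
    print(f"n = {n:3d}  raw = {t:.3f}  B = {b:.3f}  shrunk = {tt:.3f}")
```

Events with few trials receive a shrinkage factor close to 1 and are pulled strongly toward the pooled mean, while events with many trials stay close to their raw proportion. A non-positive `M_hat` would indicate that the moment estimate is unusable for the data at hand.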
Related distributions
$\operatorname{BetaBin}(n, \alpha = 1, \beta = 1) \sim U(0, n)$, where $U(a, b)$ is the discrete uniform distribution.