Bayes inference II: random variables

E. Canuto, Former faculty, Politecnico di Torino, Torino, Italy

September 7, 2020

Draft

Acronyms

RV random variable

PDF   probability density function

PMF probability mass function

MLE  maximum likelihood estimate

MAP maximum a posteriori

LF likelihood function

SD standard deviation

Introduction

In Part I, the Bayes theorem was illustrated by several examples under the assumption of a finite set \Omega of outcomes and of their events. Here we extend the theorem to integer and real random variables (RV) [2]. The integer random variable K=\left \{ 0\leq k\leq n \right \} was already introduced in Part I to describe the possible outcomes of k heads (H) in n repeated trials of coin tossing. Also the set \left \{ H,T \right \} can be associated with the integer RV X=\left \{ 0\left ( \textup{tail} \right ),1\left ( \textup{head} \right )\right \}, P\left ( X=1 \right )=p. Consider now a finite sequence X=\left \{ X_{1},...,X_{i},...,X_{n} \right \} of independent RV X_{i}=\left \{ 0,1 \right \}. The random variable K satisfies K=\sum_{i=1}^{n}X_{i}, and from Part I the likelihood function (LF) of the outcome K=k is the Binomial distribution

P\left ( K=k/p\right )=P\left ( K=k;n,p \right )=\begin{pmatrix}n \\ k\end{pmatrix}p^{k}\left ( 1-p \right )^{n-k}\; (1)

In Part I we assumed that the real number 0<p<1 was unknown. Here we assume to know a prior probability density function (PDF) f\left ( p; q_{1},q_{2},...\right ), which implies that p becomes the outcome of a real RV, denoted with p so as not to abuse the notation P; q_{1},q_{2},... denote the hyper-parameters of the PDF. The simplest case is the uniform probability density f\left ( p; q_{1},q_{2},...\right )=1. Actually, we are looking for a generic PDF with the property that the product of the likelihood function and the prior PDF (proportional to the posterior PDF, see below) has the same form as the prior, so as to ease computations. The prior PDF is then said to be a conjugate prior of the likelihood function. The conjugate prior of the Binomial distribution is the Beta PDF f\left ( p; \alpha ,\beta \right )=\textup{Beta}\left ( p;\alpha ,\beta \right ), which is defined for 0\leq p\leq 1 and holds

\begin{matrix}\textup{Beta}\left ( p;\alpha ,\beta \right )=\frac{p^{\alpha }\left ( 1-p \right )^{\beta }}{B\left ( \alpha ,\beta \right )} \\ B\left ( \alpha ,\beta \right )=\int_{0}^{1}p^{\alpha }\left ( 1-p \right )^{\beta }dp=\frac{\Gamma \left ( \alpha+1 \right )\Gamma \left ( \beta +1\right )}{\Gamma \left ( \alpha+ \beta+2 \right )} \end{matrix}\: \; (2)

where \Gamma \left ( k \right )=(k-1)! for integer k, the conventional -1 has been dropped from the exponents, and B\left ( \alpha ,\beta \right ) is the normalizing constant. The uniform density follows from \alpha =\beta =0. The PDF in (2) is symmetric around p=0.5 for \alpha =\beta. As the exponents increase, (2) becomes narrower and approaches a Gaussian distribution, as Figure 1 shows.

  
Figure 1 - Symmetric Beta PDF for increasing parameter value.
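As a minimal numerical sketch (assuming Python with numpy and scipy is available), the exponent convention of (2) maps to the standard shape parameters of scipy.stats.beta through a +1 shift, so that the uniform prior \alpha =\beta =0 corresponds to scipy's beta(1, 1); the loop reproduces the narrowing of the symmetric case shown in Figure 1:

```python
import numpy as np
from scipy.stats import beta

def beta_pdf(p, a, b):
    """Beta PDF in the convention of (2): p^a (1 - p)^b / B(a, b)."""
    # scipy.stats.beta uses exponents (a - 1, b - 1), hence the +1 shift
    return beta.pdf(p, a + 1, b + 1)

p = np.linspace(0.0, 1.0, 501)
for a in (0, 2, 8, 32):   # symmetric case alpha = beta, as in Figure 1
    f = beta_pdf(p, a, a)
    print(f"alpha = beta = {a:2d}: max pdf = {f.max():6.2f} at p = {p[np.argmax(f)]:.2f}")
```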

Remark. Meaning of and difference between the Binomial and Beta distributions. Comparison of (2) with the Binomial distribution

\begin{matrix}\textup{Bin}\left ( k;n,p \right )=P\left ( K=k/p,n \right )=\begin{pmatrix}n \\ k\end{pmatrix}p^{k}\left ( 1-p \right )^{n-k} \\ \sum_{k=0}^{n}\textup{Bin}\left ( k;n,p \right )=1 \end{matrix}

shows their similarity, although they express different models. The Binomial RV K is discrete, as it accounts for the head count in n trials, having denoted P(head) (known or unknown) with p. The Beta distribution is the PDF of a bounded real RV, 0\leq p\leq 1, which we interpret as P(head). The Beta exponents may be interpreted as the head and tail counts that have been used to build up the PDF. In fact, as shown in Figure 1, by increasing the exponents the PDF narrows, implying smaller uncertainty.

The goal is to compute the posterior PDF g\left ( p/K=k \right ) of the parameter p given an outcome k of K, obtained from an outcome x of the sequence X. The Bayes theorem for random variables involves four probability functions (either discrete or real); the first three are listed below, the fourth being the posterior PDF itself in (4):

  1. the likelihood function of the observed outcomes, discrete in this case, P\left ( K=k/p\right ) in (1), conditioned on a generic outcome p of the parameter prior distribution,
  2. the prior PDF of the real-valued parameter p in (2), depending on the known parameters \left ( \alpha ,\beta \right ), called hyper-parameters,
  3. the discrete probability P\left ( K=k\right ) of the observed outcome k, whatever the parameter p; it is obtained from the law of total probability (see Part I), extended to discrete and real RV, and does not depend on p (it plays the role of a scale factor in the Bayes equation):

P\left ( K=k \right )=\int_{0}^{1}P\left ( K=k /p\right )f\left ( p;\alpha ,\beta \right )dp\: \; (3)

The posterior PDF holds

\begin{matrix}g\left ( p/K=k \right )=f\left ( p;\alpha ,\beta \right )\frac{P\left ( K=k/p\right )}{\int_{0}^{1}P\left ( K=k /p\right )f\left ( p;\alpha ,\beta \right )dp} \\ =\frac{p^{\alpha+k }\left ( 1-p \right )^{\beta+n-k }}{B\left ( \alpha +k,\beta +n-k\right )}\end{matrix}\; (4)

In Part I, the likelihood function was maximized to provide the MLE of p, equal to \hat{p}=k/n. The same can be done here by maximizing the posterior PDF with respect to the unknown p. The new estimate, known as the maximum a posteriori (MAP) estimate, holds

\begin{matrix}\frac{d}{dp}g\left ( p/K=k \right )\propto p^{\alpha +k-1}\left ( 1-p \right )^{\beta +n-k-1}\left (\left (\alpha +k \right ) \left ( 1-p \right )-\left ( \beta +n-k \right )p \right )\\ \hat{p}_{MAP}=\frac{\alpha +k}{\alpha +\beta +n}\end{matrix}\; (5)

The MAP estimate reduces to the MLE for \alpha =\beta =0, namely under the uniform prior density, which is intuitive since the uniform density does not depend on p. We can also say that (5) is the mode of the posterior PDF. The posterior Beta distribution in (4) for a prior with \alpha =\beta =1 and data n=20, k=\left \{ 5,10 \right \} is shown in Figure 2. The posterior PDF is narrower than the prior, meaning that uncertainty has been reduced. Although the prior is centered at p=0.5, the posterior becomes centered below p=0.5 when k=5 (biased coin), a reasonable result.

Figure 2 - Prior and two posterior Beta densities.
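A minimal sketch of the posterior update (4) and of the estimates (5), assuming plain Python and the no-minus-one exponent convention of the text; the illustrative values reproduce the case of Figure 2 (prior \alpha =\beta =1, data n=20, k=5):

```python
def posterior_params(alpha, beta, k, n):
    """Exponents of the posterior Beta in (4), no-minus-one convention."""
    return alpha + k, beta + n - k

def p_map(alpha, beta, k, n):
    """MAP estimate (5), i.e. the mode of the posterior Beta."""
    return (alpha + k) / (alpha + beta + n)

def p_mle(k, n):
    """MLE of Part I, recovered from (5) with alpha = beta = 0."""
    return k / n

# case of Figure 2: prior alpha = beta = 1, data n = 20, k = 5
print(posterior_params(1, 1, 5, 20))       # (6, 16)
print(p_map(1, 1, 5, 20), p_mle(5, 20))    # 0.2727...  0.25
```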

Remark 1. Posterior mean and variance. From (4), the posterior PDF is the function \textup{Beta}\left ( p;\alpha +k,\beta +n-k \right ), whose mean and variance hold

\begin{matrix}E\left \{ p/K=k \right \}=\frac{\alpha +k+1}{\alpha +\beta +n+2}, \: \; \lim_{n,k\rightarrow \infty }E\left \{ p/K=k \right \} =k/n\\ \lim_{n\rightarrow \infty } var\left \{ p/K=k \right \} =\frac{k\left ( n-k \right )}{n^{3}}=\frac{\hat{p}_{MLE}\left ( 1-\hat{p}_{MLE} \right )}{n} \end{matrix}

The limiting posterior variance depends on the MLE and on the number n of repetitions (the sample size), implying that the estimate uncertainty (the SD) decreases as 1/\sqrt{n}.

Remark 2. Iterative posterior. The posterior PDF and the MAP estimate can easily be updated when new data (new coin-tossing outcomes, (\Delta k,\Delta n)) become available: it is enough to replace (k,n) with (k+\Delta k,n+\Delta n).
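A two-line check of Remark 2, with purely illustrative values for the new data (\Delta k,\Delta n):

```python
def update(alpha, beta, k, n):
    """One Bayes update of the Beta exponents (Remark 2)."""
    return alpha + k, beta + n - k

a1, b1 = update(1, 1, 5, 20)                       # first batch: k = 5, n = 20
a2, b2 = update(a1, b1, 3, 10)                     # new data: dk = 3, dn = 10
assert (a2, b2) == update(1, 1, 5 + 3, 20 + 10)    # equals a single batch update
print(a2, b2)                                      # (9, 23)
```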

Remark 3. Prediction of a future outcome and zero count paradox. Given a posterior PDF of p=P\left ( \textup{head}\right )=P\left ( X_{i}=1 \right ),

g\left ( p/K=k \right )=\textup{Beta}\left ( p;a,b \right ),\; a=\alpha +k,\; b=\beta +n-k

we want the probability P\left ( X_{n+1}=1/K=k \right ) that trial n+1 provides head as the outcome. To this end, we employ the law of total probability

\begin{matrix}P\left ( X_{n+1}=1/K=k \right )=\int_{0}^{1}P(X_{n+1}=1/p)g\left ( p/K=k \right )dp= \\ =\int_{0}^{1}p\textup{Beta}\left ( \alpha +k,\beta +n-k \right )dp=E\left \{ p/K=k \right \}=\frac{\alpha +k+1}{\alpha +\beta +n+2} \end{matrix}\: (6)

The result in (6) is the identity P\left ( X_{n+1}=1/K=k \right)=E\left \{ p/K=k \right \}. Alternatively, one may adopt the MLE instead of the posterior mean, that is P\left ( X_{n+1}=1 \right)=\hat{p}_{MLE}\left ( k,n \right )=k/n. What is the difference? Consider the case k=0, which may occur for small n. The MLE fails to provide a nonzero estimate of p, a paradox known as the zero-count problem, where inference (induction) from observations fails. On the contrary, the posterior mean provides a nonzero estimate of p even when the prior of p is uniform, namely \alpha =\beta =0, in which case from (6) we have

\begin{matrix}E\left \{ p/K=k \right \}\left ( \alpha =\beta =k=0 \right )=\frac{1}{n+2} \\ E\left \{ p/K=k \right \}\left ( \alpha =\beta =0 \right )=\frac{k+1}{n+2} \; (\textup{Laplace's rule of succession}) \end{matrix}

Let us remark that under the uniform prior and k=0 the MAP estimate in (5) would also fail to provide a nonzero estimate, a fact which suggests the practice, known as add-one smoothing, of adding 1 to empirical counts. The addition of 2 in the denominator of (6) is justified by adding 1 to both the head count k and the tail count n-k.
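The predictive probability (6) and the zero-count comparison can be sketched as follows (the values k=0, n=5 are purely illustrative):

```python
def predict_head(alpha, beta, k, n):
    """Predictive probability (6): P(X_{n+1} = 1 / K = k) = posterior mean of p."""
    return (alpha + k + 1) / (alpha + beta + n + 2)

def p_mle(k, n):
    return k / n

# zero-count case: k = 0 heads in n = 5 tosses, uniform prior alpha = beta = 0
print(p_mle(0, 5))               # 0.0 -> zero-count paradox
print(predict_head(0, 0, 0, 5))  # 1/7, Laplace's rule (k + 1)/(n + 2)
```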

Remark 4. Prediction of future repeated trials

We aim to predict the number h of heads in m future repeated trials, after having observed k heads in n trials and having assumed a prior PDF of p. We define the random variable H=\sum_{i=n+1}^{n+m}X_{i} and repeat (6) by replacing P(X_{n+1}=1/p) with the Binomial distribution of parameters (m,h)

\begin{matrix}P\left ( H=h/K=k \right )=\int_{0}^{1}P(H=h/p)g\left ( p/K=k \right )dp= \\ =\begin{pmatrix} m \\ h\end{pmatrix} \int_{0}^{1}p^{h}\left ( 1-p \right )^{m-h}\textup{Beta}\left ( p;\alpha +k,\beta +n-k \right )dp=\\= \begin{pmatrix} m \\ h\end{pmatrix}\frac{1}{B(\alpha +k,\beta +n-k )} \int_{0}^{1}p^{h+\alpha +k}\left ( 1-p \right )^{m-h+\beta +n-k}dp= \\ =\begin{pmatrix} m \\ h\end{pmatrix}\frac{B(h+\alpha +k,m-h+\beta +n-k )}{B(\alpha +k,\beta +n-k )} \end{pmatrix} \end{matrix}\: (7)

The last expression, as a function of h, is an integer probability function known as the beta-binomial, \textup{BetaBin}(h;m,n,k,\alpha ,\beta ). Of course \sum_{h=0}^{m}P\left ( H=h/K=k \right )=1. The mean value holds

E\left \{ H/K=k \right \}=m\frac{\alpha +k+1}{\alpha +\beta +n+2}

which for m=1 provides the same expected value as in (6). Figure 3 shows the prediction probability of the head count in m=10 future trials under two batches of observed data, (k=9,n=20) and (k=45,n=100), with prior parameters \alpha =\beta =1. In the second case the prediction probability becomes narrower.

Figure 3 - Prediction probability of head counts out of m=10 trials, after n=(20,100) observed trials with head count k=(8,45).                                          
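Assuming a recent scipy is available, scipy.stats.betabinom evaluates (7) directly once the exponents used in the text are shifted by +1 into standard Beta shape parameters; the settings below correspond to the case of Figure 3 as quoted in the text:

```python
import numpy as np
from scipy.stats import betabinom

def predict_counts(m, alpha, beta, k, n):
    """Beta-binomial prediction (7); the +1 shifts map the exponents of the text
    to scipy's standard Beta shape parameters."""
    return betabinom(m, alpha + k + 1, beta + n - k + 1)

# settings of Figure 3: prior alpha = beta = 1, m = 10 future trials
for k, n in ((9, 20), (45, 100)):
    dist = predict_counts(10, 1, 1, k, n)
    print(f"k={k}, n={n}: mean = {dist.mean():.2f}, pmf =",
          np.round(dist.pmf(np.arange(11)), 3))
```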

Application to ballot

Head and tail outcomes can be replaced by two election candidates (C1,C2). Let us keep the RV notation X_{i}=\left \{ T=0\; (\textup{C1}),\; H=1\; (\textup{C2}) \right \}. A repeated trial of size n can be either an election poll or the election itself. Election polls are the available observations, which provide the candidate preference counts (k,n-k). We aim to compute the posterior probability that a candidate will be elected, given the poll data. Let us denote with p the probability that a vote is for candidate 2, P\left ( X_{i} =1\right )=p. The probability that candidate 2 will win is P\left ( p>1/2 \right ), which implies that p is a RV, as in Bayesian inference. As an alternative, we can compute the MLE \hat{p}_{MLE}=k/n, which would imply the election of C2 when \hat{p}_{MLE}>1/2. Unfortunately the estimate is uncertain (it is itself a RV) and must be accompanied by its variance. Based on the data in [1] we will employ both methods. The difficulty of Bayesian inference lies in the prior PDF of p. One way is to subdivide the available poll data into two parts: the oldest polls estimate the prior PDF, the newest estimates the likelihood function. Poll data from the same company are shown in Table 1.

Table 1 – Election polls, i=1,...,4

No | Date | Candidate 1, n_{i}-k_{i}, X_{i}=0 | Candidate 2, k_{i}, X_{i}=1 | Total, n_{i} | Used for
1 | September 4-7 | 344 | 284 | 628 | Prior PDF
2 | September 25-28 | 325 | 312 | 637 | Prior PDF
3 | October 17-20 | 339 | 346 | 685 | Prior PDF
4 | Partial sum | \beta+1=1008 | \alpha+1=942 | 1950 | Prior PDF
5 | Last poll | n_{4}-k_{4}=511 | k_{4}=556 | 1067 | Likelihood
6 | November 2 election (fraction) | 0.511 | 0.489 | 1 |

The exponents \left (\alpha+1=\sum_{i=1}^{3}k_{i},\; \beta+1=\sum_{i=1}^{3}\left (n_{i}-k_{i} \right ) \right ) of the prior PDF are estimated from the sum of the preferences in the oldest three polls (row 4 in Table 1). The exponents of the likelihood function correspond to the last poll preferences. The posterior PDF in (4) is a Beta PDF with parameters \left (\alpha+k_{4},\beta+n_{4}-k_{4} \right ). The prior PDF (blue line), the normalized LF (normalized so as to become a Beta PDF, red line) and the posterior PDF (green line) are shown in Figure 4. Due to the large data batches (about 1000 people each), the three functions are very narrow and have been plotted for 0.4\leq p\leq 0.6 only. The prior PDF is in favor of C1, the LF from the last poll is in favor of C2, and the posterior PDF is again in favor of C1, but to a lesser extent than the prior (see Table 2 below). The posterior SD is smaller than the prior SD. The posterior mean value of Remark 1 and the standard deviation (SD) hold

E\left \{ g\left ( p/K=k \right ) \right \}\cong 0.4965, SD\left \{ g \left ( p/K=k \right )\right \}\cong 0.0091

The posterior probability that p>1/2 (the probability that C2 is elected) holds P\left ( p>1/2 \right )\cong 0.351, a rather small value due to the probability mass concentrating close to the mean value, coherently with the small SD. This implies that, although the LF looks favorable to C2, the posterior PDF turns in favor of C1, because of the unfavorable prior. However, as shown below, the MLE computed from the four polls agrees with the Bayesian result, as expected with large data batches: the posterior PDF is proportional to the LF of the four pooled polls, whose mean value and mode coincide.


Figure 4 - The components of the Bayesian inference: the prior Beta PDF (in blue, from the oldest polls), the normalized likelihood function  (in red, from the last poll) and the posterior Beta whose mode is less than 1/2.                        
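The posterior figures quoted above (mean \cong 0.4965, SD \cong 0.0091, P\left ( p>1/2 \right )\cong 0.351) can be reproduced from the data of Table 1 with a short sketch (scipy assumed available; the +1 shifts convert the exponents of (4) into scipy's standard shape parameters):

```python
from scipy.stats import beta

# prior exponents from the three oldest polls (Table 1, row 4): alpha + 1 = 942, beta + 1 = 1008
alpha, beta_ = 942 - 1, 1008 - 1
# last poll (Table 1, row 5): k4 preferences for C2 out of n4 interviews
k4, n4 = 556, 1067

# posterior Beta of (4), shifted to scipy's standard shape parameters
post = beta(alpha + k4 + 1, beta_ + n4 - k4 + 1)
print(f"posterior mean = {post.mean():.4f}")   # ~0.4965
print(f"posterior SD   = {post.std():.4f}")    # ~0.0091
print(f"P(p > 1/2)     = {post.sf(0.5):.3f}")  # ~0.351
```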

A similar result can be found from the cumulative MLE of the four poll data in Table 1:

\hat{p}_{\textup{MLE, cum}}=\frac{\sum_{i=1}^{4}k_{i}}{\sum_{i=1}^{4}n_{i}}\cong 0.4965<0.5

which is the same value as the posterior mean. Figure 5 shows the cumulative MLE (horizontal green line) together with the sequence of single-poll MLEs. The winning probability of candidate 2 (C2) increases during the campaign, but the cumulative MLE shows that the preferences look insufficient to reach a winning probability larger than 1/2, a result in agreement with the Bayesian inference. The data from [1] refer to the 2004 US presidential election, restricted to the state of Ohio (November 2004). Strictly speaking it was not a two-candidate ballot, but the score of the other candidates was negligible. C2 did not win (see row 6 of Table 1).

Figure 5 - History of the winning probability of the candidate 2 (C2) from single poll data (pointwise blue) and cumulative data (pointwise green). The 1/2 threshold is in red.

Bayesian inference and MLE results are summarized in Table 2.

Table 2 – Prior, posterior and poll estimates of the C2 preference probability

No | Parameter | Symbol | Unit | Value | Remark
1 | Prior mean | E\left \{ f\left ( \right ) \right \} | fraction | 0.4830 | From i=1,2,3
2 | Prior SD | SD\left \{ f\left ( \right ) \right \} | fraction | 0.0113 | same
3 | Posterior mean | E\left \{ g\left ( \right ) \right \} | fraction | 0.4965 | From i=1,2,3,4
4 | Posterior SD | SD\left \{ g\left ( \right ) \right \} | fraction | 0.0091 | same
5 | Last poll MLE | \hat{p}_{MLE,4} | fraction | 0.521 | From i=4
6 | Cumulative poll MLE | \hat{p}_{MLE,cum} | fraction | 0.4965 | From i=1,2,3,4
7 | Voting result | p_{true} | fraction | 0.489 |

Remark. The progression of the MLE estimates in Figure 5, contrasted with the voting result (row 7 of Table 2), looks a bit odd, since the latter appears more coherent with the oldest polls than with those close to election day. The confirmation comes from the cumulative poll MLE (row 6 of Table 2), which is coherent with the voting result, but at the price of weighting past and recent data uniformly. We cannot investigate this oddity further, but clearly the poll data hide assumptions other than those employed here.

Multinomial inference

In the previous section, Bayesian inference aimed to infer the occurrence probability of one outcome out of two alternatives (binomial inference), like {head, tail} or {candidate 1, candidate 2}. Here we extend the inference to an arbitrary finite set \left \{ \omega _{1},...,\omega _{k}, ...,\omega _{m} \right \} of outcomes, like for instance the m=6 faces of a die. Let us associate with them the integer RV X_{i}=\left \{ 1, ..,k,...,m \right \} with generic outcome x_{i}. As in the binomial case, we assume to observe n independent repetitions of the same experiment or, which is equivalent, n extractions X= \left \{ X_{i},i=1,...,n \right \} of X_{i}, whose generic outcome is denoted by x=\left \{ x_{1},...,x_{i},...,x_{n} \right \}. Let us denote with p_{k} the occurrence probability of the outcome k, such that

\sum_{k=1}^{m}p_{k}=1; 0\leq p_{k}\leq 1\; (8)

and with n_{k} the number of occurrences of the outcome k in the observed sequence x. The likelihood function of the repeated trials is the probability of observing n_{k}, k=1,...,m repetitions of the outcome k in the specific sequence x. Let us introduce the row vector \mathbf{p}=\left [ p_{1},...,p_{k},...,p_{m} \right ]. It holds

P\left ( X=x/\mathbf{p} \right )=\prod_{k=1}^{m}p_{k}^{n_{k}}\: \; (9)

Remark 1. MLE. From (9) we can compute the MLE by taking the logarithm of the LF and adding the constraint (8) multiplied by a Lagrange multiplier \lambda, as follows

\begin{matrix}L( \mathbf{p},\lambda )=\textup{log}P\left ( X=x/\mathbf{p} \right)+\lambda \left ( 1-\sum_{k=1}^{m} p_{k}\right )= \\ =\sum_{k=1}^{m}n_{k}\textup{log}p_{k}+\lambda \left ( 1-\sum_{k=1}^{m} p_{k}\right ) \end{matrix}\: \; (10)

Setting the derivatives of the Lagrangian in (10) to zero we obtain

\begin{matrix}\frac{\partial L}{\partial p_{k}}=\frac{n_{k}}{p_{k}}-\lambda=0; \; \frac{\partial L}{\partial \lambda }=1-\sum_{k=1}^{m}p_{k}=0 \\ \frac{n_{k}}{p_{k}}=\lambda \Rightarrow n=\sum_{k=1}^{m}n_{k}=\lambda\sum_{k=1}^{m}p_{k} =\lambda \\ \hat{p}_{k,MLE}=\frac{n_{k}}{\lambda }=\frac{n_{k}}{n }\geqslant 0 \end{matrix}\: \; (11)

namely the MLE which satisfies the constraints in (8).

The conjugate prior of the LF is the multinomial Beta PDF, known as the Dirichlet PDF (see the Appendix). By dropping the conventional -1 from the exponents and by defining the exponent vector \mathbf{a }=\left [ \alpha _{1},..,\alpha _{k}, ...,\alpha _{m}\right ] (the row vector \mathbf{p}=\left [ p_{1},...,p_{k},...,p_{m}\right ] was defined above), we write

\begin{matrix}\textup{Dir}(\mathbf{p};\mathbf{a})=f\left (\mathbf{ p};\mathbf{a} \right )=\frac{1}{B\left (\mathbf{a} \right )}\prod_{k=1}^{m}p_{k}^{\alpha _{k}} \\B\left (\mathbf{a} \right )=\frac{\prod_{k=1}^{m}\Gamma \left (\alpha _{k} +1\right )}{\Gamma \left ( \sum_{k=1}^{m} \left (\alpha_{k}+1 \right )\right )} \end{matrix}\; \left (12 \right )

Composition of the prior PDF and the likelihood function provides, as in the Introduction, the posterior PDF in Dirichlet form:

\begin{matrix}\textup{Dir}(\mathbf{p};\mathbf{a+n})=g\left (\mathbf{ p}/X=x;\mathbf{a+n} \right )=\frac{1}{B\left (\mathbf{a+n} \right )}\prod_{k=1}^{m}p_{k}^{\alpha _{k}+n_{k}} \\B\left (\mathbf{a+n} \right )=\frac{\prod_{k=1}^{m}\Gamma \left (\alpha _{k}+n_{k} +1\right )}{\Gamma \left ( \sum_{k=1}^{m} \left (\alpha _{k}+n_{k}+1 \right )\right )} \end{matrix}\; \left (13 \right )

Remark 2. MAP estimate. It is obtained in the same way as the MLE, by maximizing the logarithm of the posterior PDF (13), augmented with the constraint (8) multiplied by \lambda. The estimate follows from (11) by replacing n_{k} with n_{k}+\alpha _{k}, that is

\hat{p}_{k,MAP}=\frac{n_{k}+\alpha _{k}}{n+\sum_{k=1}^{m}\alpha _{k} } \: \; (14)

Remark 3. Zero-count problem. The MLE provides a zero estimate when n_{k}=0. The MAP estimate provides a zero estimate only when the prior PDF is uniform, namely \alpha _{k}=0, and n_{k}=0.
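A minimal sketch of the multinomial MLE (11) and of the Dirichlet MAP estimate (14); the die counts below are illustrative and show the zero-count problem and its removal by a uniform prior with \alpha _{k}=1 (add-one smoothing):

```python
import numpy as np

def mle(counts):
    """Multinomial MLE (11): p_k = n_k / n."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def map_estimate(counts, alphas):
    """Dirichlet-multinomial MAP (14): (n_k + alpha_k) / (n + sum_k alpha_k)."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (counts + alphas) / (counts.sum() + alphas.sum())

# die example with m = 6 faces; face 6 never observed (zero-count problem)
counts = [3, 5, 4, 6, 2, 0]
print(mle(counts))                        # last entry is exactly 0
print(map_estimate(counts, np.ones(6)))   # add-one smoothing keeps every estimate positive
```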

Naive Bayesian classifiers

Consider m discrete RV \left \{ X_{k},k=1,...,m \right \}, called features, which are employed to classify elements of a population (documents, people, generic objects, ...) into n classes/categories C=\left \{ c_{i} , i=1,...,n\right \}. Let us denote an outcome of the feature RV with \mathbf{x}=[ x_{1},...,x_{k},...,x_{m} ]. Each feature may assume a finite, discrete but arbitrary set of values, X_{k}=\left \{ x_{k1}, ...,x_{kj},...,x_{kn(k)} \right \}. Given the Cartesian product X=X_{1}\times ...\times X_{k}\times ...\times X_{m}, a class c_{i}\subset X,\; c_{i}\cap c_{j}=\varnothing ,i\neq j, is a known subset of X such that \bigcup_{i=1}^{n}c_{i}=X. More generally, features may also assume real values. We aim to find the posterior probability P\left ( C=c_{i} /X=\mathbf{x}\right ) of classifying the feature outcome \mathbf{x} into c_{i}. The posterior can be rewritten by the Bayes theorem as

P\left ( C=c_{i} /X=\mathbf{x}\right )=P\left ( C=c_{i} \right )\frac{P\left ( X=\mathbf{x}/C=c_{i} \right )}{P\left ( X=\mathbf{x} \right )}=\frac{P\left ( C=c_{i},X=\mathbf{x} \right )}{P\left ( X=\mathbf{x} \right )}

where P\left ( C=c_{i},X=\mathbf{x} \right ) is the joint probability that the feature outcome \mathbf{x} and the class c_{i} occur together. The naive assumption concerns how this joint probability is computed. Let us factor the joint probability by repeated use of conditional probabilities

\begin{matrix}P\left ( C=c_{i},X=\mathbf{x} \right )=P\left ( x_{1}/C=c_{i},x_{2},...,x_{m}\right )P\left ( C=c_{i},x_{2},...,x_{m}\right )= \\ =P\left ( x_{1}/C=c_{i},x_{2},...,x_{m}\right )P\left ( x_{2}/C=c_{i},x_{3},...,x_{m}\right )P\left ( C=c_{i},x_{3},...,x_{m}\right )= \\ ... \\ =P\left ( x_{1}/C=c_{i},x_{2},...,x_{m}\right )...P\left ( x_{m-1}/C=c_{i},x_{m}\right )P\left ( x_{m}/C=c_{i}\right )P\left ( C=c_{i}\right ) \end{matrix}

The naive assumption simplifies the above expression by assuming that, given a category c_{i}, the m different features occur independently of one another (conditional independence). In other words, features related to different categories may be correlated, but within a well defined category they are independent, so that the joint probability reduces to

P\left ( C=c_{i},X=\mathbf{x} \right )=P\left ( C=c_{i} \right )\prod_{k=1}^{m}P\left ( x_{k}/C=c_{i} \right )

and the classifier selects the class c_{i} maximizing this product. In general the assumption is far from reality, except when the values of the class features are variations around a mean value typical of the class and the variations are uncorrelated across features.

For instance, suppose you want to classify a day from temperature, wind, humidity and outlook (sunny, cloudy, overcast, rainy, ...) in view of open-air activities. Classes may be: good, uncertain, bad. Of course the above features are in general correlated, but we assume them independent within a class.
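A compact sketch of such a classifier, with add-one smoothing for the class priors and conditional PMFs as discussed above; the feature values, classes and toy data are invented for illustration:

```python
import numpy as np
from collections import defaultdict

class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.m = len(X[0])                       # number of features
        self.values = [sorted({row[k] for row in X}) for k in range(self.m)]
        self.prior = {c: (sum(1 for t in y if t == c) + 1) / (len(y) + len(self.classes))
                      for c in self.classes}
        # conditional PMFs P(x_k = v / C = c), one table per (feature, class)
        self.cond = defaultdict(dict)
        for c in self.classes:
            rows = [row for row, t in zip(X, y) if t == c]
            for k in range(self.m):
                for v in self.values[k]:
                    n_kv = sum(1 for row in rows if row[k] == v)
                    self.cond[(k, c)][v] = (n_kv + 1) / (len(rows) + len(self.values[k]))
        return self

    def predict(self, x):
        # maximize log P(C=c) + sum_k log P(x_k / C=c)  (naive factorization)
        scores = {c: np.log(self.prior[c]) +
                     sum(np.log(self.cond[(k, c)].get(v, 1e-9)) for k, v in enumerate(x))
                  for c in self.classes}
        return max(scores, key=scores.get)

# toy data (hypothetical): features = (outlook, wind, humidity)
X = [("sunny", "weak", "low"), ("sunny", "strong", "high"), ("rainy", "strong", "high"),
     ("overcast", "weak", "low"), ("rainy", "weak", "high"), ("sunny", "weak", "high")]
y = ["good", "uncertain", "bad", "good", "bad", "good"]
clf = NaiveBayes().fit(X, y)
print(clf.predict(("sunny", "weak", "low")))   # expected: "good"
```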

References

[1] Basics of Bayesian Statistics, from http://www.stat.cmu.edu/~brian/463-663/week09/Chapter%2003.pdf.

[2] K.P. Murphy, Machine learning: a probabilistic perspective, The MIT Press, Cambridge, MA, 2012.

Appendix: discrete PMF and real PDF

We will use the Iverson bracket, which generalizes the Kronecker delta. Given a statement S which may be either true or false, we write

\left [ S \right ]=\left\{\begin{matrix}1, \textup{true} \\ 0, \textup{false} \end{matrix}\right.

Discrete probability mass functions (or distributions)

Bernoulli distribution \textup{Ber}(x;p). Given a binary RV X=\left \{ 0,1 \right \} , the generic outcome x and the parameter 0\leq p\leq 1, the PMF is defined as

\begin{matrix}\textup{Ber}(x;p)=P\left ( X=x \right )=p^{\left [ x=1 \right ]}\left ( 1-p \right )^{\left [ x=0 \right ]} \\ E\left \{ X \right \}=p,\; \textup{var}\left \{ X \right \}=p\left ( 1-p \right ) \end{matrix} \: \; (A1)

Binomial distribution \textup{Bin}(k;n,p). Given a sequence X=\left \{ X_{i},i=1,...,n \right \} of independent binary RV X_{i}=\left \{ 0,1 \right \}, we construct the RV K=\sum_{i=1}^{n}X_{i} with generic outcome 0\leq k\leq n. The PMF is defined as

\begin{matrix}\textup{Bin}(k;n,p)=P\left ( K=k \right )=\begin{pmatrix}n\\ k\end{pmatrix}p^{k}\left ( 1-p \right )^{n-k} \\ E\left \{ K \right \}=np,\; \textup{var}\left \{ K \right \}=np\left ( 1-p \right ) \end{matrix}\; (A2)

Categorical distribution \textup{Cat}(x;\mathbf{p},m), \mathbf{p}=[p_{1},...,p_{k},...,p_{m}]. Given a discrete RV X=\left \{ x_{1},...,x_{k},...,x_{m} \right \}, where the integer x_{k} denotes a class or category of different objects or features (e.g., die faces) with probability 0\leq p_{k}\leq 1 such that \sum_{k=1}^{m}p_{k}=1, the PMF is defined by

\begin{matrix}\textup{Cat}(x;\mathbf{p})=P\left ( X=x_{k} \right )=\prod_{k=1}^{m}p_{k}^{\left [ x=x_{k} \right ]}=\sum_{k=1}^{m} \left [ x=x_{k} \right ]p_{k}\\ \end{matrix}\; \left (A3 \right )

Multinomial distribution \textup{Mu}(\mathbf{n};n,m, \mathbf{p}), \mathbf{p}=[p_{1},...,p_{k},...,p_{m}]. Given a sequence X=\left \{ X_{i},i=1,...,n \right \} of n independent and identically distributed categorical RV X_{i}=\left \{ x_{1},...,x_{k},...,x_{m} \right \}, the PMF of the random vector \mathbf{N}=\left [ N_{1},...,N_{k},...,N_{m}\right ], where the RV N_{k} accounts for the occurrence number of the feature x_{k}, holds

\begin{matrix}\textup{Mu}(\mathbf{n};n,m,\mathbf{p})=P\left ( \mathbf{N}=\mathbf{n} \right )=\begin{pmatrix}n\\ n_{1}...n_{k}...n_{m}\end{pmatrix}\prod_{k=1}^{m}p_{k}^{n_{k}} \\ E\left \{ N_{k} \right \}=np_{k},\; \textup{var}\left \{ N_{k} \right \}=np_{k}\left ( 1-p_{k} \right ) \end{matrix}\: \; (A4)

The categorical PMF is equivalent to the multinomial distribution with n=1, that is

\textup{Cat}(x;\mathbf{p})=\textup{Mu}(\mathbf{n};1,m,\mathbf{p}); \; \mathbf{n}=[0 ...1...0]\: \; (A5)

Probability density functions

Gaussian or normal distribution. The univariate normal PDF holds

\begin{matrix}N\left ( x;\mu ,\sigma \right )=\frac{1}{\sigma \sqrt{2\pi }}\textup{exp}\left ( -\frac{\left ( x-\mu \right )^{2}}{2\sigma ^{2}} \right ) \\ E\left \{ x \right \}=\mu ,\textup{var}\left \{ x \right \}=\sigma ^{2} \end{matrix}\: \; (A6)

The standard normal PDF assumes \mu =0,\sigma =1. The multivariate Gaussian PDF of a RV \mathbf{x}=\left [ x_{1},...,x_{i},...,x_{n} \right ] holds

\begin{matrix}N\left ( \mathbf{x};\mathbf{\mu} ,S^{2} \right )=\frac{1}{\left (2\pi \right )^{n/2}\textup{det}S}\textup{exp}\left ( -\frac{1}{2}\left ( \mathbf{x}-\mathbf{\mu} \right )^{T}S ^{-2} \left ( \mathbf{x}-\mathbf{\mu} \right )\right) \\ E\left \{ \mathbf{x} \right \}=\mathbf{\mu} ,\textup{covar}\left \{ \mathbf{x} \right \}=S^{2} \end{matrix}\; (A7)

where the covariance matrix has been written with the exponent 2 since it is a nonnegative definite matrix, with nonnegative eigenvalues and a (non-unique) square root S such that S^{2}=S^{T}S. S^{2}\left ( i,i \right ) is the variance of x_{i}, and \textup{tr}\, S^{2}=\sum_{i=1}^{n}S^{2}(i,i) equals the sum of the eigenvalues.
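A short numerical illustration of the S^{2} notation in (A7), using a Cholesky factor as one possible square root; scipy is assumed available only for the comparison value:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
S2 = np.array([[2.0, 0.5],             # covariance matrix, written S^2 in (A7)
               [0.5, 1.0]])
S = np.linalg.cholesky(S2).T           # one (non-unique) square root with S^T S = S^2
assert np.allclose(S.T @ S, S2)

x = np.array([1.5, -0.5])
d = x - mu
pdf = np.exp(-0.5 * d @ np.linalg.inv(S2) @ d) / ((2 * np.pi) ** (len(x) / 2) * np.linalg.det(S))
print(pdf, multivariate_normal(mu, S2).pdf(x))   # the two values coincide
```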

Beta distribution. The univariate Beta PDF refers to the bounded RV 0\leq X\leq 1. By dropping -1 from the exponents, we write

\begin{matrix}\textup{Beta}\left ( x;\alpha, \beta \right )=\frac{x^{\alpha }\left ( 1-x \right )^{\beta }}{B\left ( \alpha ,\beta \right )},\; B\left ( \alpha ,\beta \right )=\frac{\Gamma \left ( \alpha +1 \right )\Gamma \left ( \beta +1 \right )}{\Gamma \left ( \alpha +\beta+2 \right )} \\ E\left \{ x \right \}= \frac{\alpha +1}{\alpha+\beta +2}; \;\textup{ var}\left ( x \right )=\frac{\left ( \alpha +1 \right )\left ( \beta +1 \right )}{\left ( \alpha +\beta+2 \right )^{2}\left ( \alpha +\beta+3 \right )} \end{matrix}\: \; (A8)

The multivariate Beta PDF, known as Dirichlet PDF, is defined over the simplex S_{m}=\left \{ \mathbf{x};0 \leq x_{k}\leq 1,\sum_{k=1}^{m} x_{k}=1\right \}. The PDF expression is

\begin{matrix}\textup{Dir}\left ( \mathbf{x}; \mathbf{a }=\left [ \alpha _{1},...,\alpha _{m} \right ]\right )=\frac{\prod_{k=1}^{m}x_{k}^{\alpha _{k}}\left [ \mathbf{x}\in S_{m} \right ]}{B\left ( \mathbf{a} \right )},\; B\left ( \mathbf{a} \right )=\frac{\prod_{k=1}^{m}\Gamma \left ( \alpha _{k}+1\right )}{\Gamma \left ( \sum_{k=1}^{m}\left ( \alpha _{k}+1\right ) \right )} \\ E\left \{ x_{k} \right \}=\frac{\alpha _{k}+1}{\alpha_{0} +m },\; \textup{var}\left \{ x_{k} \right \}=\frac{\left (\alpha _{k}+1 \right )\left ( \alpha_{0} +m-1-\alpha _{k} \right )}{\left ( \alpha_{0} +m \right )^{2}\left ( \alpha_{0} +m+1 \right ) },\: \; \alpha_{0} =\sum_{k=1}^{m}\alpha _{k} \end{matrix}\; (A9)

Under a uniform exponent \alpha _{k}=\alpha (so that \alpha _{0}=m\alpha), mean and variance in (A9) become

E\left \{ x_{k} \right \}=\frac{1}{m },\; \textup{var}\left \{ x_{k} \right \}=\frac{1}{m^{2}}\frac{1-1/m}{ \alpha +1+1/m } \; (A10)

Thus increasing \alpha and m increases precision.
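A Monte Carlo check of (A9)-(A10) under uniform exponents (numpy assumed; note that numpy's Dirichlet sampler takes the standard concentrations, i.e. the exponents plus one):

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 4, 3.0                      # uniform exponents alpha_k = alpha, alpha_0 = m * alpha
# standard concentrations = exponents + 1
samples = rng.dirichlet(np.full(m, alpha + 1.0), size=200_000)

mean_th = 1.0 / m                                                    # (A10)
var_th = (1.0 / m**2) * (1.0 - 1.0 / m) / (alpha + 1.0 + 1.0 / m)    # (A10)
print(samples.mean(axis=0), mean_th)   # each component ~ 0.25
print(samples.var(axis=0), var_th)     # each component ~ var_th
```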

Remark A1. Let us denote the random vector of the Dirichlet PDF with X=\left [ X_{1},...,X_{k},...,X_{m} \right ]; \mathbf{x} denotes the vector of outcomes. The vector is constrained by \sum_{k=1}^{m}X_{k}=1. For m=2, \textup{Dir}\left ( \mathbf{x=}\left [ x_{1},x_{2}=1-x_{1} \right ]; \mathbf{a }=\left [ \alpha _{1},\alpha _{2} \right ]\right) is the Beta distribution \textup{Beta}\left ( x_{1}=1-x_{2};\alpha_{1}, \alpha_{2} \right ).

Remark A2. Stick-breaking approach to Dirichlet RV generation. Consider a segment of length y=1, randomly divided into two intervals of lengths x_{1}=y_{1} and 1-y_{1}. Then further subdivide the segment of length 1-y_{1} into two segments of fractional lengths y_{2} and 1-y_{2}, which implies that segment k=2 has length x_{2}=y_{2}\left ( 1-y_{1} \right ) and the remaining segment has length r_{2}=\left ( 1-y_{2} \right )\left ( 1-y_{1} \right ). In other terms, y_{2} is the length of the cut referred to a unit segment. Proceeding in the same way, the last segment to be further subdivided, k=m-1, has length r_{m-2}=\prod_{k=1}^{m-2}\left ( 1-y_{k} \right ). The lengths of the last two segments hold

x_{m-1}=y_{m-1}\prod_{k=1}^{m-2} \left ( 1-y_{k} \right ),\; r_{m-1}=\prod_{k=1}^{m-1} \left ( 1-y_{k} \right )

The lengths of the m segments sum to one, namely \sum_{k=1}^{m-1}x_{k}+r_{m-1}=1. Since the sequence \left \{ y_{k},k=1,...,m-1 \right \} has been obtained by partitioning segments of unit (fractional) length, it can be obtained as the outcome of a sequence of Beta distributions

\textup{Beta}\left ( y_{1};\alpha_{1},\sum_{ k=2}^{m}\alpha _{k}\right ),\textup{Beta}\left ( y_{2};\alpha_{2},\sum_{ k=3}^{m}\alpha _{k}\right ),...,\textup{Beta}\left ( y_{m-1};\alpha_{m-1},\alpha_{m}\right )

and the Dirichlet outcome \mathbf{x} is found to be

\mathbf{x}=\left [ x_{1},x_{2}, ..,x_{k},...x_{m-1},r_{m-1} \right ]

Figure A1 - Left: the q-q plot of \textup{Beta}\left ( x_{1}=y_{1};2,4 \right ). Right: 200 samples of 3D Dirichlet distribution \textup{Dir}\left ( x_{1},x_{2},x_{3};2,2,2 \right ) lying on the plane x_{1}+x_{2}+x_{3}=1.
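A sketch of the stick-breaking construction, written in terms of the standard concentrations c_{k}=\alpha _{k}+1; it reproduces, for instance, samples of the \textup{Dir}\left ( x_{1},x_{2},x_{3};2,2,2 \right ) of Figure A1 and is checked against numpy's own Dirichlet sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

def dirichlet_stick_breaking(c, size):
    """Sample Dir(c) by stick breaking; c are standard concentrations (exponents + 1)."""
    m = len(c)
    x = np.empty((size, m))
    remaining = np.ones(size)
    for k in range(m - 1):
        y = rng.beta(c[k], np.sum(c[k + 1:]), size=size)   # fractional cut of the remaining stick
        x[:, k] = y * remaining
        remaining = remaining * (1.0 - y)
    x[:, -1] = remaining                                   # the last segment takes what is left
    return x

c = np.array([3.0, 3.0, 3.0])       # Dir(x1, x2, x3; 2, 2, 2) in the exponent convention of the text
xs = dirichlet_stick_breaking(c, 200_000)
print(xs.mean(axis=0))                          # ~ [1/3, 1/3, 1/3]
print(rng.dirichlet(c, 200_000).mean(axis=0))   # same moments from numpy's sampler
```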
Appendix: nucleic acid sequence

A categorical sequence in biology is the nucleic acid sequence (either DNA or RNA), where either acid is made of a chain of linked nucleotides, each one made of a phosphate group, a sugar and one of four canonical/primary nucleobases (nitrogenous bases): adenine (A), cytosine (C), guanine (G) and thymine (T), which differentiate the four nucleotides. Thymine (T) is replaced by uracil (U) in RNA. A sequence of three nucleotides (in total 4^{3}=64 combinations), known as a codon, is the genetic code for synthesizing a protein (a long chain of amino acids). Protein-coding frames are opened by a start codon, usually the triplet AUG, and end with a stop codon. In turn, codons are the core of the protein-coding genes, complex DNA portions capable of activating and regulating the production of the encoded protein. Mutations of genes may result in genetic disorders and relevant diseases.

Alignment of the amino acid sequence between a target 3D protein structure and experimental homologous structures is an essential tool for extrapolating and checking the 3D structure of proteins from the amino acid sequence. Protein structures are conserved, but not one-to-one related to the detected sequences, due to the variable 3D organization of the backbone hydrogen bonds.

The categorical RV is X=\left \{ x_{1} =A, x_{2}=C,x_{3}=G, x_{4}=T/U,x_{5}=Z\right \}, with m=5, where Z denotes a gap which may be artificially inserted to improve the alignment and to keep a uniform length. A typical sequence analysis is the multiple sequence alignment (MSA), in which r sequences of equal length n are aligned to discover conservation or mutation of the nucleotides at each sequence location i=1,...,n. The r aligned sequences form an r\times n matrix, where no column made only of gaps (Z) exists. Columns of high purity, i.e. dominated by a single nucleobase (denoted by a letter), are cornerstones for finding the best alignment. The quantity of interest is the distribution of the four nucleotides in a generic column (location) i=1,...,n. The occurrence count of the category x_{k} in a generic column i follows a binomial PMF. To this end, the RV X_{i} becomes a random vector \mathbf{X}_{i}=\left \{ X_{ji} ,j=1,...,r\right \}. Let us denote the number of occurrences (a RV) of x_{k} with N_{k}\left ( i \right )=\sum_{j=1}^{r}\left [ X_{ji}=x_{k} \right ]. We write the PMF

P\left ( N_{k}(i)=n_{k}(i) \right )=\begin{pmatrix}r\\ n_{k}(i)\end{pmatrix}p_{k}(i)^{n_{k}(i)}\left ( 1-p_{k}(i) \right )^{r-n_{k}(i)}

which becomes the likelihood function of the observed column \left \{ x_{ji},j=1,...,r \right \}. The MLE holds \hat{p}_{k}(i)=n_{k}(i)/r. The argument \hat{k}(i)=\textup{arg}\, \textup{max}_{k}\, \hat{p}_{k}(i) denotes the dominant category (nucleobase) of the column i. The sequence \hat{K}=\left \{ \hat{k}(i) \right \} of the column-wise dominant categories is known as the consensus sequence. Let us denote the corresponding maximal MLE with p_{\hat{k}(i)}=\textup{max}_{k}\, \hat{p}_{k}(i). Given a suitable threshold p_{min}<1 but close to 1, a consensus sequence such that p_{\hat{k}(i)}>p_{min} is a candidate for declaring the alignment acceptable. The frequency of the four nucleotides in protein-coding human genes is fairly uniform, p_{k}\cong 1/4; it becomes non-uniform in non-coding genes.
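A minimal sketch of the column-wise MLE and of the consensus sequence; the short alignment below is invented for illustration and would be replaced by real MSA data:

```python
import numpy as np

CATEGORIES = "ACGTZ"                   # four nucleobases plus the gap symbol Z (m = 5)

def consensus(alignment):
    """Column-wise MLE p_k(i) = n_k(i)/r and consensus sequence (dominant category per column)."""
    r, n = len(alignment), len(alignment[0])
    cons, purity = [], []
    for i in range(n):
        column = [seq[i] for seq in alignment]
        counts = np.array([column.count(c) for c in CATEGORIES])
        p_hat = counts / r                         # MLE of the column distribution
        cons.append(CATEGORIES[int(np.argmax(p_hat))])
        purity.append(p_hat.max())
    return "".join(cons), purity

# toy alignment (hypothetical sequences): r = 4 sequences of length n = 6
alignment = ["ACGTAC", "ACGTTC", "AZGTAC", "ACGCAC"]
seq, purity = consensus(alignment)
print(seq)                     # "ACGTAC"
print(np.round(purity, 2))     # column purities; columns below p_min are suspect
```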

Figure A2 shows 16 sequences of 16 columns and their column counts. The dominant counts define the consensus sequence CAAATCCGGTTCGAG.

Figure A2 - Left: the counts of each category (nucleobase) and the max column count. Right: 16 category sequences. The last column is pure.