# Bayes inference I: events


Enrico Canuto, Former Faculty, Politecnico di Torino, Torino, Italy

September 5, 2020

Draft

###### Independence

Definition and examples

Given a set $\Omega$ of possible outcomes $\omega$, two events (sets of outcomes) $A\subset \Omega$ and $B\subset \Omega$ are independent if the probability of their intersection $A\cap B$, i.e. the probability that both events occur, is the product of the event probabilities

$P\left(A\cap B\right)=P\left(A\right)P\left(B\right)\Rightarrow \log P\left(A\cap B\right)=\log P\left(A\right)+\log P\left(B\right)\; (1)$

The meaning is that the occurrence of one event does not affect the occurrence of the other. Event occurrence may be simultaneous or not. (1) can be extended to multiple independent events. The logarithmic identity shows that (1) is a linear relation between log-probabilities. The first-order differential expression is of course linear. Let $\bar{P}\left(A\right),\bar{P}\left(B\right)$ be nominal values and $dP\left(A\right),dP\left(B\right)$ small deviations. The first-order power expansion of (1) provides

$\begin{matrix}\log P\left(A\cap B\right)\cong \log\bar{P}\left(A\right)+\log\bar{P}\left(B\right)+\frac{dP\left(A\right)}{\bar{P}\left(A\right)}+\frac{dP\left(B\right)}{\bar{P}\left(B\right)} \\ d\log P\left(A\cap B\right)=\frac{dP\left(A\right)}{\bar{P}\left(A\right)}+\frac{dP\left(B\right)}{\bar{P}\left(B\right)} \end{matrix}\; (2)$

In other terms, the differential of the logarithm of (1) is linear in the fractional differentials of the independent event probabilities. A similar differential identity can be derived without the logarithm

$\frac{dP\left(A\cap B\right)}{\bar{P}\left(A\right)\bar{P}\left(B\right)}=\frac{dP\left(A\right)}{\bar{P}\left(A\right)}+\frac{dP\left(B\right)}{\bar{P}\left(B\right)}$

Remark. From (1) we have $P\left(A\cap B\right)\leq P\left(A\right)$, $P\left(A\cap B\right)\leq P\left(B\right)$. This is reasonable since the intersection event contains no more outcomes than either event, and is therefore rarer. A common and useful construction of independent events is the repetition of an experiment, with the care that each repetition does not affect the other ones (n repeated trials). Given the outcome set $\Omega$ of a single experiment and a set of events $\left\{E_{1},...,E_{k},...\subset \Omega\right\}$, the outcome set of n trials is the n-fold Cartesian product $\Omega_{n}=\Omega\times\cdots\times\Omega=\left\{\Omega,...,\Omega\right\}$. A generic event is $E=\left\{E_{1},...,E_{n}\right\}=\left\{E_{1},...,\Omega\right\}\cap...\cap\left\{\Omega,...,E_{n}\right\}$, which has been expressed as the intersection of the elementary events $\left\{\Omega,...,E_{k},...,\Omega\right\}$, $k=1,...,n$. Thus, if the elementary events can be assumed to be independent, we can write

$P\left(E\right)=P\left(E_{1}\right)\cdot ...\cdot P\left(E_{n}\right)$

Example 1. Card drawing. Given a deck of M cards equally divided into four colors (red, ...), two kinds of drawing are possible: with and without replacement. Consider the event $E=\left\{\textup{red},\textup{red}\right\}$ of two subsequent drawings. With replacement, drawings can be assumed to be independent, which implies: $P\left(E\right)=\left(1/4\right)\times\left(1/4\right)$. Without replacement, the probability of red at the second drawing depends on the card drawn first, red or other than red:

$\begin{matrix}P\left(\textup{red second};\; \textup{other color drawn first}\right)=\frac{M/4}{M-1} \\ P\left(\textup{red second};\; \textup{red drawn first}\right)=\frac{M/4-1}{M-1} \end{matrix}$

so that the drawings are no longer independent and $P\left(E\right)=\frac{1}{4}\times\frac{M/4-1}{M-1}$.
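The two drawing schemes of Example 1 can be compared with exact fractions. The following is a minimal sketch, assuming a deck with exactly M/4 red cards; the function name `p_two_reds` and the deck size `M = 40` are illustrative choices, not part of the original text.

```python
from fractions import Fraction

def p_two_reds(M, replacement):
    """Probability of {red, red} in two drawings from M cards, M/4 of them red."""
    reds = Fraction(M, 4)
    p_first = reds / M                       # red at the first drawing: 1/4
    if replacement:
        p_second = reds / M                  # deck restored: still 1/4
    else:
        p_second = (reds - 1) / (M - 1)      # one red card is missing
    return p_first * p_second

M = 40  # hypothetical deck size (a multiple of 4)
print("with replacement:   ", p_two_reds(M, replacement=True))
print("without replacement:", p_two_reds(M, replacement=False))
```

Without replacement the product is smaller than 1/16 for any finite M, and approaches 1/16 as M grows.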

Identity (1) is similar to the probability of the union $A\cup B$ (either event occurs) of two disjoint (mutually exclusive) events, i.e. such that $A\cap B=\emptyset$, that is

$\begin{matrix}P\left(A\cup B\right)=P\left(A\right)+P\left(B\right),\; P\left(A\cap B\right)=0 \\ dP\left(A\cup B\right)=dP\left(A\right)+dP\left(B\right) \end{matrix}\; (3)$

Identity (3) is again a linear relation, but this time the differential becomes the sum of the event probability differentials, in contrast to (2) where relative differentials appear. Mutually exclusive events contain different outcomes. (3) can be extended to multiple mutually exclusive events.

Example 2. Consider rolling a die: there are six outcomes, $\Omega=\left\{1,2,3,4,5,6\right\}=\left\{\omega_{i}\right\}$. The die is fair if outcome probabilities are equal, that is $P\left(\omega_{i}\right)=1/6$. Since outcomes are mutually exclusive, the probability of the event $E$ = 'even' is, by extending (3),

$P\left(E\right)=P\left(\omega_{2}\right)+P\left(\omega_{4}\right)+P\left(\omega_{6}\right)=1/2$

Now consider two different die throws (trials) and ask for the probability of the event $\left\{E,E\right\}=\left\{\textup{even},\textup{even}\right\}=\left\{E,\Omega\right\}\cap\left\{\Omega,E\right\}$. A reasonable assumption is that each throw does not affect the outcome of the other, as in card drawing with replacement (Example 1). This implies independence and $P\left\{E,E\right\}=1/4$. Recall that the outcome set of two repeated trials is the Cartesian product of the original set, namely $\Omega\times\Omega=\left\{\Omega,\Omega\right\}$.
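Example 2 can be verified by brute-force enumeration of the Cartesian product. This is a small sketch under the stated assumption of a fair die and equally likely ordered pairs; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Single throw: sum of the probabilities of the mutually exclusive outcomes 2, 4, 6.
p_even = Fraction(sum(1 for f in faces if f % 2 == 0), 6)

# Two throws: the outcome set is the Cartesian product, 36 equally likely pairs.
throws = list(product(faces, repeat=2))
both_even = sum(1 for a, b in throws if a % 2 == 0 and b % 2 == 0)
p_even_even = Fraction(both_even, len(throws))

print(p_even, p_even_even)
```

The count over the product set agrees with the product rule (1), as expected when the two throws do not affect each other.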

Bernoulli trials.

Let $\Omega=\left\{H=\textup{head, success},T=\textup{tail, failure}\right\}$ with $P(H)=p,P(T)=q=1-p$ be the outcome set of a single experiment to be repeated n times, identically and independently. A single sequence $E\left(k\; \textup{times}\; H;\; \textup{single repetition;}\; n\; \textup{trials}\right)$ showing k outcomes H and n-k outcomes T in a given order has probability $P\left(E\right)=p^{k}\left(1-p\right)^{n-k}$. Often the interest is in the probability of k successes out of n repetitions, whatever their order. To find it, we count the number of different sequences of k heads and n-k tails, which is given by the binomial coefficient

$\left(\begin{matrix}n\\ k\end{matrix}\right)=\frac{n!}{k!(n-k)!}$

Since sequences, being different, are mutually exclusive, the probability of the event $E\left(k;n,p\right)=E\left(k\; \textup{times}\; H,\; P(H)=p;\; \textup{all repetitions,}\; n\; \textup{trials}\right)$ is

$P\left(k;n,p\right)=P\left(E\left(k;n,p\right)\right)=\left(\begin{matrix}n\\ k\end{matrix}\right)p^{k}\left(1-p\right)^{n-k}\; (4)$

The expression in (4) is known as the Binomial probability distribution (PD) of the integer random variable (RV) $K\in\left\{0,1,...,n\right\}$. An integer random variable corresponds to a set of outcomes, in this case the n+1 sets of sequences showing k=0,1,...,n outcomes H, associated one-to-one with an interval of integer numbers. Because of mutual exclusion, it holds that:

$\sum_{k=0}^{n}P\left(k;n,p\right)=\sum_{k=0}^{n}\left(\begin{matrix}n\\ k\end{matrix}\right)p^{k}\left(1-p\right)^{n-k}=1\; (5)$
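Identities (4) and (5) are easy to check numerically. The sketch below uses only the standard library; the values `n = 10` and `p = 0.3` are arbitrary illustrative choices.

```python
from math import comb, isclose

def binomial_pmf(k, n, p):
    """Binomial probability (4): k successes out of n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # hypothetical number of trials and success probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)  # normalization (5), up to floating-point rounding

print(f"sum over k = {total:.12f}")
```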

The mean or expected value $\mathit{E}\left\{K\right\}$ of the PD in (4), namely the mean value of k when a large number N of repeated experiments, each defined by $\left\{k,n,p\right\}$, is performed, is $\mathit{E}\left\{K\right\}=np$. The variance is $\mathit{E}\left\{\left(K-\mathit{E}\left\{K\right\}\right)^{2}\right\}=np\left(1-p\right)=npq$. Moreover, for $n\rightarrow\infty$ and when $np=O\left(n\right)$ and $npq=O\left(n\right)$ are of the order of n, the binomial distribution is well approximated by the normal density function $N\left(np,\sqrt{npq}\right)$ defined by

$N\left(np,\sqrt{npq}\right)=\frac{1}{\sqrt{2\pi npq}}\textup{exp}\left(-\frac{1}{2}\frac{\left(x-np\right)^{2}}{npq}\right)\; (6)$

Remark. Be aware that $f(x)=N\left(np,\sqrt{npq}\right)$ in (6), where x is real, is a probability density, which, in order to approximate an integer distribution, must be converted into the probability $P\left(k-1/2\leq x<k+1/2\right)\cong f\left(k\right)\left(1/2+1/2\right)=f\left(k\right)$. Since the interval between two adjacent integers is one, we can replace the real x with the integer k in (6).
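The quality of the normal approximation can be explored numerically by evaluating the Gaussian density with mean np and standard deviation sqrt(npq) at the integers k, as suggested in the Remark. The following sketch assumes `n = 100` and `p = 0.5` purely for illustration.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p):
    """Binomial probability of k successes out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_density(x, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

n, p = 100, 0.5  # hypothetical values
mean, std = n * p, sqrt(n * p * (1 - p))

# Largest absolute gap between the exact PD and the density sampled at k.
max_err = max(abs(binomial_pmf(k, n, p) - normal_density(k, mean, std))
              for k in range(n + 1))
print(f"peak of the PD: {binomial_pmf(50, n, p):.5f}, max error: {max_err:.2e}")
```

The gap is small compared with the peak value of the distribution; repeating the run with a smaller n or with p close to 0 or 1 shows where the approximation degrades.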

Likelihood function and parameter estimation. An introduction to statistical inference.

Consider tossing a coin with outcome set $\Omega=\left\{H,T\right\}$. We want to test whether the coin is fair, i.e. whether $P(H)=p=1/2$. To this end, we toss the coin n times, assuming independent trials. Let us assume that we find k outcomes equal to H. What can we infer from this result about p? Let us recall from (4) the binomial probability $P\left(k;n,p\right)$ of finding k heads out of n repetitions, where the pair (k,n) is known and p is unknown. We also recall that $P\left(k;n,p\right)$ strictly depends on the assumptions of two exclusive outcomes and of independent repetitions: they have been converted into the mathematical model $P\left(k;n,p\right)$. In our hands we only have $P\left(k;n,p\right)$, the values (k,n) and the range $0<p<1$. We admit that $P\left(k;n,p\right)$ and (k,n) are a faithful representation of our coin behavior when tossed, and, since we have a degree of freedom p, we can use it to find the best fit by maximizing the probability $P\left(k;n,p\right)$ with respect to p. For such reasons, $P\left(k;n,p\right)$, regarded as a function of p, is known as the likelihood function of coin tossing, and the argument

$\hat{p}=\textup{argmax}_{0<p<1}P\left(k;n,p\right)$

is the best estimate we can obtain from model and data, known as the maximum likelihood estimate (MLE). By setting to zero the derivative of $P\left(k;n,p\right)$ with respect to p, we obtain

$\begin{matrix}\frac{d}{dp}P\left(k;n,p\right)=\left(\begin{matrix}n\\ k\end{matrix}\right)\left(kp^{k-1}\left(1-p\right)^{n-k}-\left(n-k\right)p^{k}\left(1-p\right)^{n-k-1}\right)= \\ =\left(\begin{matrix}n\\ k\end{matrix}\right)p^{k-1}\left(1-p\right)^{n-k-1}\left(k-np\right)=0 \end{matrix}$

By discarding the values $\hat{p}=\left\{0,1\right\}$, as they zero the likelihood function, the intuitive MLE is

$\hat{p}=k/n\; (7)$

Since k is an outcome of the Binomial random variable K, $\hat{p}$ itself becomes an outcome of the RV K/n, a scaled Binomial RV with mean value $E\left\{K/n\right\}=p$. In other terms, the MLE mean value equals the unknown parameter for any n. The variance $E\left\{\left(K-np\right)^{2}\right\}/n^{2}=p(1-p)/n$ converges to zero for large n. We can say that for $n\rightarrow\infty$ the probability that k/n lies within any given small distance of p approaches one: the MLE (7) asymptotically approaches the true parameter!
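The closed-form estimate k/n can be cross-checked by maximizing the likelihood numerically. The sketch below replaces the derivative argument with a plain grid search over 0 < p < 1; the data `k = 7`, `n = 20` and the grid step are hypothetical.

```python
from math import comb

def likelihood(p, k, n):
    """Binomial likelihood of p given k heads out of n tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 7, 20  # hypothetical experiment: 7 heads in 20 tosses

# Grid over the open interval (0, 1) with step 0.001.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: likelihood(p, k, n))

print(f"grid maximizer: {p_hat_grid}, closed form k/n: {k / n}")
```

A grid search is crude but makes no use of the analytical derivative, so it is an independent check of the result.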

###### Conditional probability

Definition

How can identity (1) be extended to the general case of dependent events? The solution is a new kind of probability, the conditional probability, either $P\left(A/B\right)$ or $P\left(B/A\right)$, such that (multiplication rule)

$P(A\cap B)=P\left(A\right)P\left(B/A\right)=P\left(B\right)P\left(A/B\right)\; (8)$

where $P\left(A/B\right)$ reads as the probability of the occurrence of event A given (known, assumed) the occurrence of event B. From (8) the usual definition follows

$\begin{matrix}P\left(B/A\right)=\frac{P(A\cap B)}{P\left(A\right)}\leq 1,\; P\left(A\right)>0 \\ P\left(A/B\right)=\frac{P(A\cap B)}{P\left(B\right)}\leq 1,\; P\left(B\right)>0 \end{matrix}\; (9)$

where conditional probabilities satisfy all the probability axioms. Identities in (8) and (1) immediately prove that under independence $P\left(B/A\right)=P\left(B\right)$ and $P\left(A/B\right)=P\left(A\right)$. (9) is a construction equation, but construction of conditional probabilities may be rather delicate, as the following examples show.

Law of total probability

Let $\left\{E_{1},...,E_{N}\right\}$ be a finite set of mutually exclusive events such that $E_{1}\cup...\cup E_{N}=\Omega$ and let $E\subset\Omega$ be a generic event. E can be developed as the union of mutually exclusive intersections $E=\left(E_{1}\cap E\right)\cup...\cup\left(E_{N}\cap E\right)$. Equations (3) and (8) allow the event probability to be computed as

$\begin{matrix}P\left(E\right)=P\left(E_{1}\cap E\right)+...+P\left(E_{N}\cap E\right)= \\ =P\left(E/E_{1}\right)P(E_{1})+...+P\left(E/E_{N}\right)P(E_{N}) \end{matrix}\; (10)$

Examples

Example 3, Step 1 [1]. A family has two children; we know that at least one of them is male. What is the probability that both children are male? Consider first the outcome space of newborns $\Omega_{1}=\left\{M=\textup{male},F=\textup{female}\right\}$, and assume that both outcomes have equal probability (=1/2). A naive answer would be $P\left(M\right)=1/2$, but it does not account for the fact that there are two children. In other terms, we should account for (condition on) the knowledge that at least one of the two children is male, and this knowledge must be formulated as an event. The outcome set of two children is

$\begin{matrix}\Omega_{2}=\Omega_{1}\times\Omega_{1}=\left\{\left(M,M\right),\left(M,F\right),\left(F,M\right),\left(F,F\right)\right\} \\ P(M,M)=P(M,F)=P(F,M)=P(F,F)=1/4 \end{matrix}$

We have to compute the probability of the outcome $M_{2}=\left(M,M\right)$, but conditioned on the knowledge that one or two children are male, which corresponds to the event $M_{\textup{at least one}}=\left(M,M\right)\cup\left(F,M\right)\cup\left(M,F\right)$. Thus, using the definition (9), we obtain

$\begin{matrix}P\left(M_{2}/M_{\textup{at least one}}\right)=\frac{P\left(M_{2}\cap M_{\textup{at least one}}\right)}{P\left(M_{\textup{at least one}}\right)}=\frac{1/4}{3/4}=1/3 \\ >P\left(M_{2}\right)=\frac{1}{4} \end{matrix}\; (11)$

As expected, the conditional probability is larger than the unconditional one.

Example 3, Step 2. Let us come back to the naive probability 1/2, which, being larger than 1/3, must be conditioned by a finer knowledge corresponding to an event $M_{?}\subset M_{\textup{at least one}}$. Consider for instance the event: to encounter one of the two children, who happens to be male. The event can be written as the union of its intersections with the four possible children pairs, that is

$M_{?}=\left(M_{?}\cap\left(M,M\right)\right)\cup\left(M_{?}\cap\left(M,F\right)\right)\cup\left(M_{?}\cap\left(F,M\right)\right)\cup\left(M_{?}\cap\left(F,F\right)\right)$

where $M_{?}\cap\left(M,M\right)=\left(M,M\right)$ is the encounter event when both children are males, whose probability is 1/4. By using the law of total probability (10), we derive the event probability

$P\left(M_{?}\right)=P\left(M_{?}/M,M\right)P\left(M,M\right)+P\left(M_{?}/M,F\right)P\left(M,F\right)+P\left(M_{?}/F,M\right)P\left(F,M\right)+P\left(M_{?}/F,F\right)P\left(F,F\right)=\left(1+1/2+1/2+0\right)\times 1/4=1/2$

and finally, by replacing the subscript ? with 'male encounter', we obtain the expected result

$P\left(M_{2}/M_{\textup{male encounter}}\right)=\frac{P\left(M_{2}\cap M_{\textup{male encounter}}\right)}{P\left(M_{\textup{male encounter}}\right)}=\frac{1/4}{1/2}=1/2\; (12)$

Where does the difference between (11) and (12) lie? For either mixed pair, (F,M) or (M,F), the probability of having that pair and encountering the male child (1/8) is smaller than the probability of the pair itself (1/4)! How delicate are the concept and practice of conditional probability, and the assessment of knowledge!
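Both steps of Example 3 can be reproduced by exact enumeration of the four children pairs. This is a sketch under the assumptions already made in the text: equally likely pairs, and an encounter that picks either child with probability 1/2. The helper name `p_encounter_male` is illustrative.

```python
from fractions import Fraction
from itertools import product

pairs = list(product("MF", repeat=2))   # (M,M), (M,F), (F,M), (F,F)
p_pair = Fraction(1, 4)                 # each pair is equally likely

# Step 1: evidence "at least one child is male".
p_at_least_one = sum(p_pair for pair in pairs if "M" in pair)
p_both_given_at_least_one = p_pair / p_at_least_one

# Step 2: evidence "a randomly encountered child is male".
def p_encounter_male(pair):
    """P(encountered child is male / pair), either child met with probability 1/2."""
    return Fraction(pair.count("M"), 2)

p_encounter = sum(p_encounter_male(pair) * p_pair for pair in pairs)  # total probability
p_both_given_encounter = p_encounter_male(("M", "M")) * p_pair / p_encounter

print(p_both_given_at_least_one, p_both_given_encounter)
```

The only difference between the two computations is the weight given to the mixed pairs: 1 in Step 1, 1/2 in Step 2.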

Example 4. Quality inspection. The output of a manufacturing line is classified as good (G), uncertain (U) and defective (D), with $\Omega=\left\{G,U,D\right\}$ and probabilities $P\left(G\right)=0.9,\; P(D)=0.08$. The output passes through an inspection machine that is only instructed to label defective parts. What is the probability that a part passed by the inspection machine (not labelled as defective) is good? The non-defective event is $D^{*}=\left\{G,U\right\}$. The sought probability refers to the conditioned event $G/D^{*}$ and is

$P\left(G/D^{*}\right)=\frac{P\left(G\cap D^{*}\right)}{P\left(D^{*}\right)}=\frac{P\left(G\right)}{1-P\left(D\right)}=\frac{0.9}{0.92}=0.978$

Example 5. Sampling. n=5 good (G) and m=2 defective (D) parts are mixed in a box. To find the defective parts, m=2 parts are randomly selected without replacement and checked. What is the probability of finding both defective parts? Let $D_{1}$ be the event of finding a defective part in the first test and $D_{2}$ in the second test. The target event is $D_{1}\cap D_{2}$ and its probability is $P\left(D_{1}\cap D_{2}\right)=P\left(D_{1}\right)P\left(D_{2}/D_{1}\right)=\frac{2}{7}\times\frac{1}{6}=\frac{1}{21}$
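The result of Example 5 can be confirmed by counting ordered draws without replacement. The sketch below assumes that all ordered pairs of distinct parts are equally likely; the labels and variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

parts = ["G"] * 5 + ["D"] * 2   # 5 good and 2 defective parts

# All ordered selections of two distinct parts (drawing without replacement).
draws = list(permutations(range(len(parts)), 2))
favourable = sum(1 for i, j in draws if parts[i] == "D" and parts[j] == "D")
p_both_defective = Fraction(favourable, len(draws))

print(favourable, len(draws), p_both_defective)
```

The count gives the same value as the multiplication rule (8).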

Example 6 [2]. A box contains one two-headed coin, with outcomes $\left\{H,H\right\}$, and n-1 fair coins, with outcomes $\left\{H,T\right\}$ and $P\left(H\right)=1/2$. One coin is randomly chosen and tossed, and the result is H. Let us denote the corresponding event by $H_{1}$. How many fair coins are in the box, if $P\left(H_{1}\right)=11/20$? Each coin is chosen with probability 1/n. Let us denote the choice of the two-headed coin as $R_{2}$, with $P(R_{2})=1/n$, and the choice of a fair coin as $R_{1}$, with $P(R_{1})=(n-1)/n$. The event $H_{1}$ is the union of the mutually exclusive events $H_{1}\cap R_{2}$ and $H_{1}\cap R_{1}$, whose probability is, from (10),

$P(H_{1})=P\left(H_{1}/R_{2}\right)P\left(R_{2}\right)+P\left(H_{1}/R_{1}\right)P\left(R_{1}\right)=1\times\frac{1}{n}+\frac{1}{2}\times\frac{n-1}{n}=\frac{n+1}{2n}$

which implies n=10, i.e. nine fair coins, to satisfy $P\left(H_{1}\right)=11/20$.
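The number of coins in Example 6 can also be found by a direct search over n. This is a minimal sketch: the function `p_head` applies the law of total probability to the choice between the two-headed coin and a fair one, and the search bound of 1000 is an arbitrary assumption.

```python
from fractions import Fraction

def p_head(n):
    """P(H) when one of n coins (one two-headed, n-1 fair) is chosen at random."""
    p_two_headed = Fraction(1, n)
    p_fair = Fraction(n - 1, n)
    return 1 * p_two_headed + Fraction(1, 2) * p_fair

target = Fraction(11, 20)
n = next(m for m in range(1, 1000) if p_head(m) == target)

print(f"coins in the box: {n}, fair coins: {n - 1}")
```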
###### Bayes theorem

Bayes theorem (1763)  at first sight is just a reformulation of identities in (8), but is the basis of statistical inference and prediction.

Inference aims to derive statistical properties, in  the form of parameters, of a probability model from experimental data collected from a population which is assumed to be coherent with the model. We  have already seen a first example and method of inference, MLE, applied to the binomial distribution of coin tossing repeated trials. Bayes theorem allows inference problems and methods to be cast under a rather generic formulation.

Prediction aims to predict the output of future experiments from past data of similar experiments in terms of probability model.

Let us rewrite (8) in the form of Bayes theorem by changing notations and by adding some nomenclature:

$\begin{matrix}P\left(M/E\right)=P\left(M\right)\frac{P\left(E/M\right)}{P\left(E\right)} \\ 0<P\left(E\right)\leq 1 \end{matrix}\; (13)$

The event $E$ is known as the evidence or measurement whereas M stands for model or hypothesis. Evidence may have been observed or assumed. $P\left(M\right)$ is known as the prior probability of the model, unconditioned by the current evidence (it may have been constructed from previous data). $P\left(E/M\right)$ is the likelihood, namely the probability of observing E under the model/hypothesis M (let us remember the sequence of heads and tails under the assumption of the head probability p). $P\left(E\right)$ is known as the marginal likelihood and can be written and constructed as follows:

$\begin{matrix}P\left(E\right)=P\left(E\cap M\right)+P\left(E\cap M^{*}\right)=P\left(E/M\right)P\left(M\right)+P\left(E/M^{*}\right)P\left(M^{*}\right) \\ P\left(M\right)+P\left(M^{*}\right)=1 \end{matrix}\; (14)$

where $M^{*}$ denotes the complement of M in the outcome set $\Omega$. Finally, $P\left(M/E\right)$ is the posterior probability of the model/hypothesis conditioned by the observed evidence.

Example 7. The three door contest [1].

The problem is also known as the Monty Hall problem. Behind one of three closed doors there is a prize. A contestant must select one of the three doors $\left\{1,2,3\right\}$; the host then opens one of the empty doors among the two not selected (if both the remaining doors are empty, he makes a random choice) and asks the contestant whether he wants to change his selection or not. Consider the three disjoint events $M_{i}=\textup{prize behind door}\; i$, with prior probabilities $P\left(M_{i}\right)=1/3$. Let us assume that the contestant chooses door 1 and the host opens door 3. What is the winning probability if he changes his selection (to door 2) or not? Of course, door numbering can be changed without affecting the result. Let us denote the evidence event ‘contestant selected door 1 and host opened door 3’ with $E_{3}$. We aim at the posterior probabilities $P\left(M_{1}/E_{3}\right)$, winning without changing selection, and $P\left(M_{2}/E_{3}\right)$, winning by changing selection. $P\left(E_{3}\right)$ is computed from the law of total probability

$\begin{matrix}P\left(E_{3}\right)=P\left(E_{3}/M_{1}\right)P\left(M_{1}\right)+P\left(E_{3}/M_{2}\right)P\left(M_{2}\right)+P\left(E_{3}/M_{3}\right)P\left(M_{3}\right)= \\ =1/2\times 1/3+1\times 1/3+0\times 1/3=1/2 \end{matrix}$

The following posterior probabilities prove that changing door is favorable:

$\begin{matrix}P\left(M_{2}/E_{3}\right)=P\left(M_{2}\right)\frac{P\left(E_{3}/M_{2}\right)}{P\left(E_{3}\right)}=\frac{1}{3}\frac{1}{1/2}=\frac{2}{3} \\ P\left(M_{1}/E_{3}\right)=P\left(M_{1}\right)\frac{P\left(E_{3}/M_{1}\right)}{P\left(E_{3}\right)}=\frac{1}{3}\frac{1/2}{1/2}=\frac{1}{3} \end{matrix}$

The reader is invited to list the nine possible combinations of prize door (three) and selected door (three), together with the result of changing and not changing selection. To show again the subtle role of evidence and of the formulation of the relevant event E, we look for an evidence such that it does not matter whether you change or not, which implies that $P\left(M_{2}/E\right)=P\left(M_{1}/E\right)$ and that $E\supset E_{3}$ (a coarser evidence). Assume that the evidence event is ‘the contestant selected door 1 and the host revealed that door 3 was empty (but he did not open the door)’. Indeed, ‘door 3 empty’ is a coarser information since it does not imply that ‘door 3 will be opened’: ‘door 2 could be opened’ as well, in the case that door 1 hides the prize. Since $P\left(E\right)=P\left(M_{1}\cup M_{2}\right)=2/3$ and $P\left(E\cap M_{i}\right)=P\left(M_{i}\right)=1/3$ for i=1,2, we obtain as expected

$\begin{matrix}P\left(M_{2}/E\right)=\frac{P\left(E\cap M_{2}\right)}{P\left(E\right)}=\frac{1}{3}\frac{1}{2/3}=\frac{1}{2} \\ P\left(M_{1}/E\right)=\frac{P\left(E\cap M_{1}\right)}{P\left(E\right)}=\frac{1}{3}\frac{1}{2/3}=\frac{1}{2} \end{matrix}$
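The two kinds of evidence of Example 7 can be compared by applying Bayes theorem with exact fractions. The sketch below encodes the host rule stated in the text (he opens an empty, non-selected door, at random if two are available); the function names `p_host_opens` and `bayes` are illustrative.

```python
from fractions import Fraction

doors = (1, 2, 3)
selected = 1                                   # door chosen by the contestant
prior = {m: Fraction(1, 3) for m in doors}     # P(M_i)

def p_host_opens(opened, prize):
    """P(host opens `opened` / prize behind `prize`), contestant on `selected`."""
    empty = [d for d in doors if d != selected and d != prize]
    return Fraction(1, len(empty)) if opened in empty else Fraction(0)

def bayes(likelihood):
    """Posterior P(M_i / E) from the likelihoods P(E / M_i) and the priors."""
    evidence = sum(likelihood[m] * prior[m] for m in doors)  # law of total probability
    return {m: likelihood[m] * prior[m] / evidence for m in doors}

# Evidence E_3: the host opened door 3.
posterior_opened = bayes({m: p_host_opens(3, m) for m in doors})
# Coarser evidence E: door 3 is known to be empty, nothing is said about the opening.
posterior_empty = bayes({1: Fraction(1), 2: Fraction(1), 3: Fraction(0)})

print(posterior_opened, posterior_empty)
```

Only the likelihood of the evidence under $M_{1}$ changes between the two cases (1/2 against 1), and that alone moves the posterior of door 2 from 2/3 to 1/2.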

Further examples

Example 8. Diagnostic test. A diagnostic test applied to affected people gives a false negative in 4% of the cases. Applied to unaffected people, it gives a false positive in 2% of the cases. 5% of the population to be tested is affected. Let T be the positive test event, A the affected event and A* its complement, with

$\begin{matrix}P\left(A\right)=0.05,\; P\left(A^{*}\right)=0.95 \\ P\left(T/A\right)=0.96,\; P\left(T/A^{*}\right)=0.02 \end{matrix}$

We want to know the posterior probabilities $P\left(A/T\right)$, that a person with a positive answer is affected, and $P\left(A^{*}/T^{*}\right)$, that a person with a negative answer is unaffected. First we compute $P\left(T\right)$ from the law of total probability:

$P\left(T\right)=P\left(T/A\right)P\left(A\right)+P\left(T/A^{*}\right)P\left(A^{*}\right)=0.96\times 0.05+0.02\times 0.95=0.067$

The posterior probabilities are

$\begin{matrix}P\left(A/T\right)=P\left(A\right)\frac{P\left(T/A\right)}{P\left(T\right)}=\frac{0.048}{0.067}=0.716 \\ P\left(A^{*}/T^{*}\right)=P\left(A^{*}\right)\frac{P\left(T^{*}/A^{*}\right)}{P\left(T^{*}\right)}=P\left(A^{*}\right)\frac{1-P\left(T/A^{*}\right)}{1-P\left(T\right)}=0.998 \end{matrix}$
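The arithmetic of Example 8 can be packaged in a small two-hypothesis Bayes function. This is a sketch only: the function `posterior` and its argument names are illustrative, and the numbers are those given in the example.

```python
def posterior(prior, likelihood, likelihood_complement):
    """P(M / E) for a hypothesis M and its complement M*.

    prior                 = P(M)
    likelihood            = P(E / M)
    likelihood_complement = P(E / M*)
    """
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return prior * likelihood / evidence

p_affected = 0.05              # P(A)
p_pos_given_affected = 0.96    # P(T / A), i.e. 4% false negatives
p_pos_given_unaffected = 0.02  # P(T / A*), i.e. 2% false positives

# Affected given a positive answer.
p_affected_given_pos = posterior(p_affected, p_pos_given_affected,
                                 p_pos_given_unaffected)
# Unaffected given a negative answer: hypothesis A*, evidence T*.
p_unaffected_given_neg = posterior(1 - p_affected, 1 - p_pos_given_unaffected,
                                   1 - p_pos_given_affected)

print(f"{p_affected_given_pos:.3f} {p_unaffected_given_neg:.3f}")
```

Lowering `p_affected` in the sketch shows how quickly the first posterior drops when the condition is rare, even with the same test.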

Example 9. Wrong evidence. A man tells the truth n<N times out of N about the outcome of a fair six-face die. Let $R_{4}$ be the event that the man reports a specific number, say f=4. Let $E_{4}$ be the event that f=4 occurs, with $P\left(E_{4}\right)=1/6$. Since the reported number may be false, the reporting event will occur more often than the f=4 occurrence, which means that $P\left(R_{4}\right)\geq 1/6$. What is the probability that the reported number is true, namely $P\left(E_{4}/R_{4}\right)$? Prior model: by assuming that the die is fair, $P\left(E_{4}\right)=1/6,\; P\left(E_{4}^{*}\right)=5/6$. The evidence from experience consists of the probabilities that f=4 is reported when it did not occur (false) and when it occurred (true); the first one implicitly assumes that a false report always names f=4:

$P\left(R_{4}/E_{4}^{*}\right)=1-n/N\; (\textup{false});\; P\left(R_{4}/E_{4}\right)=n/N\; (\textup{true})$

First we compute $P\left(R_{4}\right)$ from the law of total probability

$\begin{matrix}P\left(R_{4}\right)=P\left(R_{4}/E_{4}\right)P\left(E_{4}\right)+P\left(R_{4}/E_{4}^{*}\right)P\left(E_{4}^{*}\right)=\frac{5}{6}-\frac{2}{3}\frac{n}{N} \\ n\rightarrow N\Rightarrow P\left(R_{4}\right)=1/6\; \left(\textup{truth}\right);\; n\rightarrow 0\Rightarrow P\left(R_{4}\right)=5/6\; \left(\textup{falsehood}\right) \end{matrix}$

Finally

$\begin{matrix}P\left(E_{4}/R_{4}\right)=P\left(E_{4}\right)\frac{P\left(R_{4}/E_{4}\right)}{P\left(R_{4}\right)}=\frac{n}{5N-4n} \\ n\rightarrow N\Rightarrow P\left(E_{4}/R_{4}\right)=1 \end{matrix}\; (15)$
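The closed form (15) can be checked against a direct application of Bayes theorem with exact fractions. The sketch below follows the likelihoods stated in the example; the function name `p_true_given_report` and the test values of n and N are illustrative.

```python
from fractions import Fraction

def p_true_given_report(n, N):
    """P(E_4 / R_4) when the man tells the truth n times out of N."""
    truth = Fraction(n, N)                  # P(R_4 / E_4)
    p_e4 = Fraction(1, 6)                   # prior of a fair die
    # Law of total probability, with P(R_4 / E_4*) = 1 - n/N as in the example.
    p_r4 = truth * p_e4 + (1 - truth) * (1 - p_e4)
    return p_e4 * truth / p_r4

for n, N in [(1, 4), (3, 4), (4, 4)]:       # hypothetical truth rates
    print(n, N, p_true_given_report(n, N), Fraction(n, 5 * N - 4 * n))
```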

TBC

###### References

[1] J. Mitchell, Examples: conditional probability, http://www.ams.sunysb.edu/~jsbm/courses/311/conditioning.pdf

[2] Brilliant, Conditional probability-Problem solving, https://brilliant.org/wiki/conditional-probability-problem-solving/