##### Bayes inference I: events

Enrico Canuto, Former Faculty, Politecnico di Torino, Torino, Italy

September 5, 2020

Draft

###### Independence

*Definition and examples*

Given a set of possible outcomes $\Omega$, two events (sets of outcomes) $A \subseteq \Omega$ and $B \subseteq \Omega$ are *independent* if the probability of their intersection $A \cap B$, i.e. the probability that *both events occur,* is the product of the event probabilities

$$P(A \cap B) = P(A)\,P(B). \qquad (1)$$

The meaning is that the occurrence of either event does not affect the other. The occurrences may be simultaneous or not. (1) can be extended to multiple independent events. The logarithmic identity $\log P(A \cap B) = \log P(A) + \log P(B)$ shows that (1) is a linear relation in the logarithms, whose first-order differential expression is of course linear. Let $P_0(\cdot)$ be nominal values and $dP(\cdot)$ small deviations. First-order power expansion of (1) provides

$$\frac{dP(A \cap B)}{P_0(A \cap B)} = \frac{dP(A)}{P_0(A)} + \frac{dP(B)}{P_0(B)}. \qquad (2)$$

In other terms, the differential of (1) is *linear* in the *fractional differentials of the independent event probabilities*. A similar differential identity can be derived without logarithms: $dP(A \cap B) = P(B)\,dP(A) + P(A)\,dP(B)$, evaluated at the nominal probabilities.

*Remark.* From (1) we have $P(A \cap B) \le \min\{P(A), P(B)\}$. This is reasonable, since the intersection reduces the possible outcomes, making $A \cap B$ a *rarer* event. A common and useful construction of independent events is multiple experiment repetition, with the care that each experiment does not affect the other ones (*n repeated trials*). Given the outcome set $\Omega$ and a set of events $A_i \subseteq \Omega$, $i = 1, \ldots, n$, the outcome set of the repetitions is the *n*-fold Cartesian product $\Omega^n = \Omega \times \cdots \times \Omega$. A generic event is $A = A_1 \times \cdots \times A_n$, which can be expressed as the intersection of the elementary events $E_i = \Omega \times \cdots \times A_i \times \cdots \times \Omega$. Thus, if the elementary events can be assumed to be independent, we can write

$$P(A) = \prod_{i=1}^{n} P(A_i).$$

Example 1. Card drawing. Given a deck of $M$ fair cards of four colors (red, ...), two kinds of drawing are possible: drawing *with* and *without replacement*. Consider the event $R_1 \cap R_2$ of drawing a red card in two subsequent drawings. With replacement, the drawings can be assumed to be independent, which implies $P(R_1 \cap R_2) = P(R_1)\,P(R_2) = (1/4)^2 = 1/16$. Without replacement, the second outcome depends on the card drawn first, red or other than red:

$$P(R_1 \cap R_2) = \frac{M/4}{M} \cdot \frac{M/4 - 1}{M - 1} = \frac{1}{4} \cdot \frac{M - 4}{4(M - 1)} < \frac{1}{16}.$$
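As a numeric cross-check of Example 1, the two drawing schemes can be compared exactly with Python's `fractions`; the deck size `M = 40` (four colors of ten cards each) is an illustrative assumption, since the example leaves *M* generic.

```python
from fractions import Fraction

M = 40              # assumed deck size: four colors of M/4 = 10 cards each
red = M // 4        # number of red cards

# With replacement the drawings are independent: P(R1 ∩ R2) = P(R1) P(R2)
p_with = Fraction(red, M) ** 2

# Without replacement the second drawing sees one red card fewer
p_without = Fraction(red, M) * Fraction(red - 1, M - 1)

print(p_with, p_without)   # 1/16 3/52, the latter being smaller
```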

Identity (1) is similar to the probability of the union of two *disjoint (mutually exclusive) events*, i.e. $A \cap B = \emptyset$, *either event occurs*, that is

$$P(A \cup B) = P(A) + P(B). \qquad (3)$$

Identity (3) is again a *linear relation*, but this time the differential becomes the sum of the event probability differentials, $dP(A \cup B) = dP(A) + dP(B)$, in contrast to (2), where *relative differentials* appear. Mutually exclusive events contain different outcomes. (3) can be extended to multiple mutually exclusive events.

Example 2. Consider rolling a die; the outcomes are six: $\Omega = \{1, 2, 3, 4, 5, 6\}$. The die is *fair* if the outcome probabilities are *equal*, that is $P(i) = 1/6$. Since outcomes are *mutually exclusive*, the probability of the event *even*, $E = \{2, 4, 6\}$, holds, by extending (3), $P(E) = 3/6 = 1/2$. Now consider two different die throws (trials) and ask the probability of the event $E_1 \cap E_2$, both throws even. A reasonable assumption is that each throw does not affect the outcome of the other, as in card drawing with replacement (Example 1). This implies *independence* and $P(E_1 \cap E_2) = P(E_1)\,P(E_2) = 1/4$. Recall that the outcome set of two repeated trials is the Cartesian product of the original set, namely $\Omega \times \Omega$, with 36 outcomes.
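Example 2 can be verified by brute-force enumeration of the 36 equally likely pairs of the Cartesian product (a minimal sketch in Python):

```python
from fractions import Fraction
from itertools import product

omega = range(1, 7)        # fair die: each outcome has probability 1/6
even = {2, 4, 6}

# Single throw: extend (3) over the mutually exclusive outcomes of the event
p_even = Fraction(len(even), len(omega))

# Two independent throws: the outcome set is the Cartesian product (36 pairs)
pairs = list(product(omega, repeat=2))
p_both_even = Fraction(sum(a in even and b in even for a, b in pairs), len(pairs))

print(p_even, p_both_even)   # 1/2 1/4
```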

*Bernoulli trials.*

Let $\Omega = \{H, T\}$, with $P(H) = p$ and $P(T) = 1 - p$, be the outcome set of a single experiment to be repeated *n* times, identically and independently. A single outcome (sequence) showing *k* outcomes *H* and *n−k* outcomes *T*, in arbitrary order, has probability $p^k (1-p)^{n-k}$. Often the interest is to know the probability of *k* successes out of *n* repetitions. To find it, we count the number of different sequences of *k* heads and *n−k* tails, which is given by the *binomial coefficient*

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Since the sequences, being different, are mutually exclusive, the probability of the event '*k* successes out of *n* repetitions' holds

$$P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, \ldots, n. \qquad (4)$$

The expression in (4) is known as the *binomial probability distribution (PD)* of the integer random variable (RV) *K*. An *integer random variable* corresponds to a set of outcomes — in this case the *n+1* classes of repetitions showing *k* outcomes *H* — one-to-one associated to an interval of integer numbers. Because of mutual exclusion, it holds:

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1. \qquad (5)$$
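The binomial PD (4) and the normalization (5) can be checked exactly with `math.comb` and rational arithmetic; the values of *n* and *p* below are arbitrary:

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability (4) of k successes out of n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(1, 3)
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Mutual exclusion of the n+1 events implies the normalization (5)
print(sum(pmf))   # 1
```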

The mean or expected value of the PD in (4), namely the mean value of *k* when the *n*-trial experiment defined by (4) is repeated a large number *N* of times, holds $E[K] = np$. The variance holds $\mathrm{var}(K) = np(1-p)$. Moreover, for large *n* and when $k$ and $n - k$ are of the order of *n*, the binomial distribution is well approximated by the normal density function defined by

$$f(x) = \frac{1}{\sqrt{2\pi\, np(1-p)}} \exp\!\left(-\frac{(x - np)^2}{2np(1-p)}\right). \qquad (6)$$

*Remark.* Be aware that in (6), where *x* is real, $f(x)$ is a probability *density*, which, in order to approximate an integer distribution, must be converted into the probability $P(K = k) \simeq f(x)\,\Delta x$. Since the interval between two adjacent integers is one ($\Delta x = 1$), we can replace the real *x* with the integer *k* in (6).
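The Remark's recipe (replace the real *x* with the integer *k*, with $\Delta x = 1$) can be tested numerically; for *n* = 100 and *p* = 1/2 the normal density (6) already matches the binomial probabilities to about three decimals:

```python
from math import comb, exp, pi, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_density(x, n, p):
    """Normal density (6) with mean np and variance np(1-p)."""
    var = n * p * (1 - p)
    return exp(-(x - n * p) ** 2 / (2 * var)) / sqrt(2 * pi * var)

n, p = 100, 0.5
for k in (45, 50, 55):
    # With Δx = 1, f(k)·Δx approximates P(K = k)
    print(k, binom_pmf(k, n, p), normal_density(k, n, p))
```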

*Likelihood function and parameter estimation. An introduction to statistical inference.*

Consider tossing a coin with outcome set $\Omega = \{H, T\}$. We want to test whether the coin is fair, i.e. whether $p = P(H) = 1/2$. To this end, we toss the coin *n* times, assuming independent trials. Let us assume to find *k* outcomes equal to *H*. What can we *infer* from this result about *p*? Let us recall from (4) the binomial probability of finding *k* heads out of *n* repetitions, where the pair *(k, n)* is known and *p* is unknown. We also recall that (4) strictly depends on the assumptions of two exclusive outcomes and of independent repetitions: they have been converted into the mathematical model $L(p) = \binom{n}{k} p^k (1-p)^{n-k}$. In our hands we only have $L(p)$, the values *(k, n)* and the range $0 \le p \le 1$. We admit that $L(p)$ and *(k, n)* are a *faithful representation* of our coin behavior when tossed, and, since we have a *degree of freedom* *p*, we can use it to find the *best fit* by maximizing the probability with respect to *p*. For such reasons, $L(p)$ is known as the *likelihood function* of coin tossing, and the argument

$$\hat{p} = \arg\max_{0 \le p \le 1} L(p)$$

is the *best estimate* we can obtain from model and data, known as the *maximum likelihood estimate* (MLE). By setting to zero the derivative of $L(p)$, we obtain

$$\frac{dL}{dp} = \binom{n}{k} p^{k-1} (1-p)^{n-k-1} \left( k(1-p) - (n-k)p \right) = 0.$$

By discarding the values $p = 0$ and $p = 1$, as they zero the likelihood function, the intuitive MLE holds

$$\hat{p} = \frac{k}{n}. \qquad (7)$$

Since *k* is an outcome of the binomial random variable *K*, $\hat{p} = k/n$ itself becomes the outcome of the RV *K/n*, with mean value $E[K/n] = p$. In other terms, the MLE mean value equals the unknown parameter for any *n*: the estimate is *unbiased*. The variance $\mathrm{var}(K/n) = p(1-p)/n$ converges to zero for large *n*. We can say that for $n \to \infty$ the probability that $k/n = p$ approaches one: the MLE (7) asymptotically approaches the true parameter!
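The unbiasedness $E[K/n] = p$ and the variance $p(1-p)/n$ of the MLE (7) can be verified exactly from the PD (4); the values *n* = 12, *p* = 2/5 are arbitrary:

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 12, Fraction(2, 5)

# Moments of the estimator K/n, with K distributed as the binomial PD (4)
mean = sum(Fraction(k, n) * binom_pmf(k, n, p) for k in range(n + 1))
var = sum((Fraction(k, n) - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

print(mean == p)               # True: the MLE is unbiased for any n
print(var == p * (1 - p) / n)  # True: the variance vanishes as n grows
```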

###### Conditional probability

*Definition*

How can identity (1) be converted to the general case of *dependent events*? The solution is a new kind of probability, the *conditional probability*, either $P(A \mid B)$ or $P(B \mid A)$, such that (*multiplication rule*)

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A), \qquad (8)$$

where $P(A \mid B)$ reads as *the probability of the event A occurrence given (known, assumed) the occurrence of the event B*. From (8) the usual definition follows

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0, \qquad (9)$$

where *conditional probabilities* satisfy all the probability axioms. The identities in (8) and (1) immediately prove that under independence $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. (9) is a construction equation, but the construction of conditional probabilities may be rather delicate, as the following examples show.

*Law of total probability *

Let $\{A_1, \ldots, A_n\}$ be a finite set of mutually exclusive events such that $\bigcup_{i=1}^n A_i = \Omega$, and let *E* be a generic event. *E* can be developed as the union of intersections $E = \bigcup_{i=1}^n (E \cap A_i)$. Equation (8) allows to compute the event probability as

$$P(E) = \sum_{i=1}^{n} P(E \mid A_i)\,P(A_i). \qquad (10)$$
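The law of total probability (10) reduces to a weighted sum; a minimal helper (names are illustrative) makes the later examples mechanical:

```python
from fractions import Fraction

def total_probability(cond_probs, priors):
    """Law of total probability (10): P(E) = Σ P(E|Ai) P(Ai) over a partition."""
    assert sum(priors) == 1          # the Ai must partition the outcome set
    return sum(c * q for c, q in zip(cond_probs, priors))

# Toy partition {A1, A2, A3} with P(Ai) = 1/3 each
p_e = total_probability([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
                        [Fraction(1, 3)] * 3)
print(p_e)   # 1/3
```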

*Examples*

Example 3, Step 1 [1]. A family has two children; we know that at least one of them is male. Which is the probability that both children are male? Consider first the outcome space of a single newborn, $\{M, F\}$, and assume that both outcomes have equal probability (= 1/2). A naive answer would be 1/2, but it does not account that the children are two. In other terms, we should account (conditioning) that we *know* that *at least one of the two children is a male*, which knowledge must be formulated as an event. The outcome set of two children is $\{(M,M), (M,F), (F,M), (F,F)\}$, each pair having probability 1/4. We have to compute the probability of the outcome $(M,M)$ conditioned to the knowledge that *one or two children are male*, which corresponds to the event $B = \{(M,M), (M,F), (F,M)\}$, with $P(B) = 3/4$. Thus, using the definition (9) we obtain

$$P((M,M) \mid B) = \frac{P((M,M))}{P(B)} = \frac{1/4}{3/4} = \frac{1}{3}. \qquad (11)$$

As expected, the conditional probability is larger than the unconditional one (1/4).

Example 3, Step 2. Let us come back to the naive probability 1/2, which, being larger than 1/3, must correspond to conditioning by a *finer knowledge*, i.e. by a different event. Consider for instance the event *C*: *to encounter one of the two children, who happens to be male*. With the help of (10), *C* can be written as the union of its intersections with the four possible children pairs $(M,M)$, $(M,F)$, $(F,M)$, $(F,F)$, that is $C = \bigcup_j (C \cap A_j)$, where $C \cap A_{(M,M)}$ is the encounter event when both children are males, whose probability is $P(C \mid (M,M))\,P((M,M)) = 1 \cdot 1/4 = 1/4$. By using the law of total probability (10) we derive the event probability

$$P(C) = 1 \cdot \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{4} + 0 \cdot \frac{1}{4} = \frac{1}{2},$$

and finally we obtain the expected result

$$P((M,M) \mid C) = \frac{P(C \mid (M,M))\,P((M,M))}{P(C)} = \frac{1/4}{1/2} = \frac{1}{2}. \qquad (12)$$

Where does the difference between (11) and (12) lie? The probability of *encountering a male* when the pair is either (F,M) or (M,F) is smaller (1/8) than the probability that such a *pair includes a male* (1/4)! How delicate are the concept and practice of conditional probability, and the assessment of knowledge!
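Both steps of Example 3 can be checked by enumerating the four equally likely pairs; in Step 2 the encounter is modeled as picking one of the two children at random, so $P(\text{male encountered} \mid \text{pair}) = (\#\text{males})/2$:

```python
from fractions import Fraction
from itertools import product

pairs = list(product("MF", repeat=2))          # (M,M), (M,F), (F,M), (F,F), 1/4 each

# Step 1: condition on B = "at least one male" using definition (9)
b = [pair for pair in pairs if "M" in pair]
p_mm_given_b = Fraction(1, len(b))             # (M,M) is one of the 3 pairs in B
print(p_mm_given_b)                            # 1/3

# Step 2: condition on C = "an encountered child is male" via (10)
p_c = sum(Fraction(pair.count("M"), 2) * Fraction(1, 4) for pair in pairs)
p_mm_given_c = (Fraction(2, 2) * Fraction(1, 4)) / p_c
print(p_c, p_mm_given_c)                       # 1/2 1/2
```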

Example 4. Quality inspection. The output of a manufacturing line is classified as good (G), uncertain (U) and defective (D), with probabilities $P(G)$, $P(U)$ and $P(D)$, $P(G) + P(U) + P(D) = 1$. The output passes through an inspection machine that is only instructed to label *defective* parts. Which is the probability that the output of the inspection machine is *good*? The non-defective event is $G \cup U$. The searched probability refers to the conditioned event and holds

$$P(G \mid G \cup U) = \frac{P(G)}{P(G) + P(U)}.$$
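A numeric sketch of Example 4, with illustrative probabilities (the example leaves them generic):

```python
from fractions import Fraction

# Assumed line statistics: P(G), P(U), P(D) must sum to one
p_g, p_u, p_d = Fraction(90, 100), Fraction(8, 100), Fraction(2, 100)
assert p_g + p_u + p_d == 1

# The machine labels only defectives, so the surviving output
# is conditioned on the non-defective event G ∪ U
p_good_output = p_g / (p_g + p_u)
print(p_good_output)   # 45/49
```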

Example 5. Sampling. $n = 5$ good (G) and $m = 2$ defective (D) parts are mixed in a box. To find the defective parts, $m = 2$ parts are randomly selected without replacement and checked whether defective. Which is the probability of finding both defective parts? Let $D_1$ be the event of finding a defective part in the first test and $D_2$ in the second test. The target event is $D_1 \cap D_2$ and the probability holds

$$P(D_1 \cap D_2) = P(D_1)\,P(D_2 \mid D_1) = \frac{2}{7} \cdot \frac{1}{6} = \frac{1}{21}.$$
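Example 5 can be double-checked by enumerating all ordered draws of two parts out of seven:

```python
from fractions import Fraction
from itertools import permutations

box = ["G"] * 5 + ["D"] * 2                     # n = 5 good, m = 2 defective parts

draws = list(permutations(range(len(box)), 2))  # ordered draws without replacement
hits = sum(1 for i, j in draws if box[i] == "D" and box[j] == "D")
p = Fraction(hits, len(draws))

print(p, p == Fraction(2, 7) * Fraction(1, 6))  # 1/21 True
```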

Example 6 [2]. A box contains one two-headed coin, with the single outcome *H*, and $n - 1$ fair coins, with outcomes $\{H, T\}$ and $P(H) = 1/2$. One coin is randomly chosen and the toss result is *H*. Let us denote the corresponding event by *E*. How many fair coins are in the box, if $P(E) = 0.55$? Let us denote the random choice of the *i*-th coin as $R_i$, with $P(R_i) = 1/n$. The event *E* is the union of the events (event 1 *or* event 2 *or* ...) that a single coin has been selected *and* the toss result is *H*, that is $E = \bigcup_{i=1}^n (E \cap R_i)$, whose probability holds

$$P(E) = \frac{1}{n} \cdot 1 + \frac{n-1}{n} \cdot \frac{1}{2} = \frac{n+1}{2n},$$

which implies $n = 10$ (nine fair coins) to satisfy $P(E) = 0.55$.
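The box size of Example 6 can be recovered by searching for the *n* that matches the observed $P(E)$:

```python
from fractions import Fraction

def p_heads(n):
    """P(E) by total probability: one two-headed coin plus n-1 fair coins."""
    return Fraction(1, n) * 1 + Fraction(n - 1, n) * Fraction(1, 2)

n = next(n for n in range(2, 1000) if p_heads(n) == Fraction(55, 100))
print(n, n - 1)   # 10 coins in total, 9 of them fair
```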

###### Bayes theorem

Bayes theorem (1763) at first sight is just a reformulation of the identities in (8), but it is the basis of *statistical inference and prediction*.

*Inference* aims to derive statistical properties, in the form of parameters, of a probability model from experimental data collected from a population which is assumed to be coherent with the model. We have already seen a first example and method of inference, MLE, applied to the binomial distribution of coin tossing repeated trials. Bayes theorem allows inference problems and methods to be cast under a rather generic formulation.

*Prediction* aims to predict the output of future experiments from past data of similar experiments, in terms of a probability model.

Let us rewrite (8) in the form of Bayes theorem by changing notations and by adding some nomenclature:

$$P(M \mid E) = \frac{P(E \mid M)\,P(M)}{P(E)}.$$

The event *E* is known as the evidence or *measurement*, whereas *M* stands for *model* or hypothesis. Evidence may have been observed or assumed. $P(M)$ is known as the *prior probability* of the model *M*, unconditioned by the current evidence *E* (it may have been constructed from previous data). $P(E \mid M)$ is the *likelihood*, namely the probability of observing *E* under the model/hypothesis *M* (let us remember the sequence of heads and tails under the assumption of the head probability *p*). $P(E)$ is known as the *marginal likelihood* and can be written and constructed as follows:

$$P(E) = P(E \mid M)\,P(M) + P(E \mid M^*)\,P(M^*),$$

where $M^*$ denotes the complement of *M* in the outcome set $\Omega$. Finally, $P(M \mid E)$ is the *posterior probability* of the model/hypothesis conditioned by the observed evidence.
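The nomenclature above maps directly onto a one-line function (names are illustrative). A sanity check: a likelihood that does not discriminate between *M* and *M\** leaves the prior unchanged.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_compl):
    """Bayes theorem: P(M|E) from P(M), P(E|M) and P(E|M*)."""
    evidence = likelihood * prior + likelihood_compl * (1 - prior)  # marginal likelihood
    return likelihood * prior / evidence

# Non-informative evidence: P(E|M) = P(E|M*) implies P(M|E) = P(M)
print(posterior(Fraction(1, 3), Fraction(1, 2), Fraction(1, 2)))   # 1/3
```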

*Example 7. The three-door contest* [1].

The problem is also known as the *Monty Hall problem*. Behind one of three closed doors there is a prize. A contestant must *select* one of the three doors; the host then *opens* one of the empty doors (if both the remaining doors are empty, he makes a random choice) and asks the contestant whether he wants to *change or not* his selection. Consider the three disjoint events $D_i$, 'the prize is behind door *i*', with *prior probabilities* $P(D_i) = 1/3$. Let us assume that the contestant chooses door 1 and the host opens door 3. Which is the winning probability if he changes (to door 2) or not his selection? Of course, the door numbering can be changed without affecting the result. Let us denote the *evidence event* 'contestant selected door 1 and host opened door 3' with *E*. We aim at the *posterior probabilities* $P(D_1 \mid E)$, winning without changing selection, and $P(D_2 \mid E)$, winning by changing selection. $P(E)$ is computed from the law of total probability

$$P(E) = \sum_{i=1}^{3} P(E \mid D_i)\,P(D_i) = \frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} = \frac{1}{2}.$$

The following posterior probabilities prove that changing door is favorable:

$$P(D_1 \mid E) = \frac{P(E \mid D_1)\,P(D_1)}{P(E)} = \frac{(1/2)(1/3)}{1/2} = \frac{1}{3}, \qquad P(D_2 \mid E) = \frac{P(E \mid D_2)\,P(D_2)}{P(E)} = \frac{1 \cdot (1/3)}{1/2} = \frac{2}{3}.$$

The reader is suggested to list the nine possible combinations of prized door (3) and selected door (3), together with the result of changing and not changing selection. To show again the subtle role of the *evidence* and of the formulation of the relevant event *E*, we look for an evidence such that it does not matter whether you change or not, which implies $P(D_1 \mid E) = P(D_2 \mid E) = 1/2$ (*a coarser evidence*). Assume that the evidence event is 'the contestant selected door 1 and the host revealed that door 3 was empty (but he did not open the door)'. Indeed, 'door 3 empty' is a coarser information, since it does not imply that 'door 3 will be opened'; also 'door 2 could be opened' in the case door 1 is prized. Since $P(E \mid D_1) = P(E \mid D_2) = 1$ and $P(E \mid D_3) = 0$, we obtain $P(E) = 2/3$ and, as expected,

$$P(D_1 \mid E) = P(D_2 \mid E) = \frac{1 \cdot (1/3)}{2/3} = \frac{1}{2}.$$
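The Monty Hall posteriors follow from enumerating the prize position and the host's (possibly random) door choice, the contestant being fixed on door 1:

```python
from fractions import Fraction

prior = Fraction(1, 3)
# P(host opens door 3 | prize behind door i), contestant having selected door 1
host_opens_3 = {1: Fraction(1, 2),   # doors 2 and 3 empty: random choice
                2: Fraction(1),      # door 3 is the only empty unselected door
                3: Fraction(0)}      # the host never opens the prized door

p_e = sum(host_opens_3[i] * prior for i in (1, 2, 3))   # marginal likelihood
stay = host_opens_3[1] * prior / p_e                    # win by keeping door 1
change = host_opens_3[2] * prior / p_e                  # win by switching to door 2

print(p_e, stay, change)   # 1/2 1/3 2/3
```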

*Further examples*

Example 8. Diagnostic test. A diagnostic test applied to affected people is *false negative* in 4% of the cases. Applied to unaffected people, it is false positive in 2% of the cases. 5% of the population to be tested is affected. Let *T* be the positive test event, *A* the affected event and *A\** its complement, with $P(T \mid A) = 0.96$, $P(T \mid A^*) = 0.02$ and $P(A) = 0.05$. We want to know the *posterior probabilities* $P(A \mid T)$, that given a positive answer people are affected, and $P(A^* \mid T^*)$, that given a negative answer people are unaffected. First we compute $P(T)$ from the law of total probability:

$$P(T) = P(T \mid A)\,P(A) + P(T \mid A^*)\,P(A^*) = 0.96 \cdot 0.05 + 0.02 \cdot 0.95 = 0.067.$$

The posterior probabilities hold

$$P(A \mid T) = \frac{0.96 \cdot 0.05}{0.067} \simeq 0.716, \qquad P(A^* \mid T^*) = \frac{0.98 \cdot 0.95}{0.933} \simeq 0.998.$$
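Example 8 in executable form, with the same data (4% false negatives, 2% false positives, 5% prevalence):

```python
from fractions import Fraction

p_a = Fraction(5, 100)            # prevalence P(A)
p_t_a = 1 - Fraction(4, 100)      # P(T|A): 4% false negatives
p_t_na = Fraction(2, 100)         # P(T|A*): 2% false positives

p_t = p_t_a * p_a + p_t_na * (1 - p_a)           # law of total probability
p_a_t = p_t_a * p_a / p_t                        # P(A|T)
p_na_nt = (1 - p_t_na) * (1 - p_a) / (1 - p_t)   # P(A*|T*)

print(p_t, float(p_a_t), float(p_na_nt))   # 67/1000, ≈0.716, ≈0.998
```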

Example 9. Wrong evidence. A man tells the truth *n* < *N* times out of *N* about the outcome of a fair six-face die. Let *R* be the event that the man reports a specific number, say *f* = 4. Let *F* be the event that *f* = 4 occurs, with $P(F) = 1/6$. Since the reported number may be false, the reporting event will occur more often than the *f* = 4 occurrence, which means $P(R) > P(F)$. Which is the probability that the reported number is true, namely $P(F \mid R)$? Prior model: by assuming that the die is fair, $P(F) = 1/6$. The *evidence* from the experience are the probabilities that the reported number is true or false: $P(R \mid F) = n/N$ and $P(R \mid F^*) = (N - n)/N$. First we compute $P(R)$ from the law of total probability

$$P(R) = \frac{n}{N} \cdot \frac{1}{6} + \frac{N - n}{N} \cdot \frac{5}{6} = \frac{5N - 4n}{6N}.$$

Finally

$$P(F \mid R) = \frac{P(R \mid F)\,P(F)}{P(R)} = \frac{n}{5N - 4n}.$$
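Example 9 as a function of the truth ratio *n/N*; the classic case 'truth 3 times out of 4' gives the well-known 3/8:

```python
from fractions import Fraction

def p_true_report(n, N):
    """P(F|R) for a fair die and a reporter truthful n times out of N."""
    p_f = Fraction(1, 6)                     # prior of the specific face
    p_r_f = Fraction(n, N)                   # P(R|F): truthful report
    p_r_nf = Fraction(N - n, N)              # P(R|F*): false report of f
    p_r = p_r_f * p_f + p_r_nf * (1 - p_f)   # law of total probability
    return p_r_f * p_f / p_r

print(p_true_report(3, 4))   # 3/8, i.e. n / (5N - 4n)
```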

TBC

###### References

[1] J. Mitchell, Examples: conditional probability, http://www.ams.sunysb.edu/~jsbm/courses/311/conditioning.pdf

[2] Brilliant, Conditional probability-Problem solving, https://brilliant.org/wiki/conditional-probability-problem-solving/