Probability

Probability provides the mathematical tools we use to model and describe randomness. The likelihood of a random event is characterized either in terms of

  • Long-run frequency behavior, if viewed from a Frequentist perspective

  • Our degree of belief expressed as a probability, if viewed from a Bayesian perspective

(We will return to a comparison of these two paradigms later.)

Naturally, probability provides the foundation for statistics and machine learning, as all (good) data analyses seek to identify patterns in data while accounting for the random sampling variation and measurement error present in the data (populations) under consideration. Perhaps counterintuitively, while probability statements seem easy to express and interpret, our intuitions about randomness are often wrong. This is because we live only one realization of the randomness and never experience the counterfactual outcomes that could have occurred but didn’t actualize in our timeline. To protect ourselves against faulty intuition, we need to approach probability problems using a methodical and objective framework that computes event probabilities by enumerating all possible outcomes (using combinatorics).
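To see what such a framework looks like in practice, here is a minimal Python sketch (our own illustration, not part of the notes) that computes the probability of rolling a sum of 7 with two fair dice by enumerating every equally likely outcome:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# An event is just the subset of outcomes satisfying some condition
event = [(d1, d2) for d1, d2 in outcomes if d1 + d2 == 7]

# Probability = (favorable outcomes) / (total outcomes)
print(len(event) / len(outcomes))  # 6/36 ≈ 0.1667
```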

Note

EXERCISE

  1. What is the probability of drawing a Queen from a standard 52-card deck?

  2. What is the probability of a Queen or a King?

  3. What is the probability of a Queen or a spade?

Formalization

There are three axioms of probability.

For some sample space \(S\) with events \(A \subseteq S\) and \(B \subseteq S\), a probability function \(Pr\) satisfies:

  1. The probability of any event is nonnegative and at most 1, i.e.,

\[0 \leq Pr(s \in A) = Pr(A) \leq 1\]

where \(s \in S\) is an outcome in the sample space and \(Pr(A)\) is a common shorthand notational convenience.

  2. The probability of a _sure_ event (one that will absolutely happen) is 1, i.e.,

\[Pr(S) = 1\]
  3. If \(A\) and \(B\) are mutually exclusive, then:

\[Pr(A \cup B) = Pr(A) + Pr(B)\]

(Remember that two events are mutually exclusive if they cannot both be true at the same time – i.e., the two event sets are disjoint).

It’s amazing that it’s so simple, isn’t it? But as simple as these three axioms might be, every other property of probability that you’re familiar with can be derived from just these. E.g.,

  1. The sum of the probabilities of an event and its complement is 1, i.e.,

\[Pr(A) + Pr\left(A^C\right) = Pr(S) = 1\]
  2. The probability of an impossible event (note that \(S^C = \emptyset\)) is zero, i.e.,

\[Pr\left(S^C\right) = 0\]
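Both facts follow in one line from the axioms. For the first, \(A\) and \(A^C\) are mutually exclusive and \(A \cup A^C = S\), so axioms 2 and 3 give

\[Pr(A) + Pr\left(A^C\right) = Pr\left(A \cup A^C\right) = Pr(S) = 1\]

and setting \(A = S\) in this identity yields \(Pr\left(S^C\right) = 1 - Pr(S) = 0\), which is the second fact.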

Independence

Two events are independent (notated as \(A\bot B\)) if

\[Pr(A\cap B) = Pr(A)\times Pr(B)\]

or

\[Pr(A|B) = Pr(A)\]

where \(Pr(A|B)\) denotes a conditional probability, which gives the probability of an event occurring given the knowledge that another event has occurred. When considering the independence of two events, ask yourself: “Does knowing something about event \(A\) provide additional information about the likelihood of event \(B\)?”
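As a concrete illustration (a sketch of our own, not part of the exercises), drawing a Queen and drawing a spade from a well-shuffled deck are independent events, which we can verify by enumerating the deck in Python:

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # all 52 equally likely cards

queen = [c for c in deck if c[0] == "Q"]                      # event A
spade = [c for c in deck if c[1] == "spades"]                 # event B
both = [c for c in deck if c[0] == "Q" and c[1] == "spades"]  # A ∩ B

pr_a, pr_b = len(queen) / 52, len(spade) / 52
pr_ab = len(both) / 52
print(pr_ab, pr_a * pr_b)  # both equal 1/52 ≈ 0.0192, so A ⊥ B
```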

Note

PAIRED EXERCISE

Discuss with your neighbor what “knowing that \(A\) has occurred” tells us about the likelihood of \(B\) occurring:

  1. Under independence?

  2. Without independence?

How would you go about testing whether two events are indeed independent?

Note

QUESTION

What is the relationship between independence and mutual exclusivity?

Show Answer

Events are mutually exclusive if the occurrence of one excludes the occurrence of the other; mutually exclusive events cannot happen at the same time. For example, a single coin toss can come up heads or tails, but not both.

This means mutually exclusive events (with nonzero probabilities) are never independent, and independent events cannot be mutually exclusive: if \(A\) and \(B\) are disjoint, then \(Pr(A \cap B) = 0\) while \(Pr(A) \times Pr(B) > 0\).


Conditional Probability

It turns out that it is always true that

\[Pr(A \cap B) = Pr(A|B) \times Pr(B)\]

This rule is known as the chain rule, and we shall generalize it to more than two events shortly. But for now, rearranging this equation gives us

\[Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)}\]

which is the definition of conditional probability. Conditional probability is, in a sense, the counterpart of independence: it is most informative precisely when two events are not independent. If two events are independent, the condition carries no information and we simply write \(Pr(A|B) = Pr(A)\).
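To make the definition concrete, here is a small Python sketch of our own (assuming two fair dice) that computes a conditional probability directly from the definition by enumeration:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

# A: the sum is at least 10;  B: the first die shows a 6
a_and_b = [o for o in outcomes if sum(o) >= 10 and o[0] == 6]
b = [o for o in outcomes if o[0] == 6]

# Pr(A|B) = Pr(A ∩ B) / Pr(B), which reduces to counting within B
print(len(a_and_b) / len(b))  # 3/6 = 0.5
```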

Note

EXERCISE

Take a moment to think about this question:

  • Three types of fair coins are in an urn: HH, HT, and TT

  • You pull a coin out of the urn, flip it, and it comes up H

  • Q: what is the probability it comes up H if you flip it a second time?

Hint: write out the sample space!

When you’re ready, compare your solution to those around you.
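Once you have an analytic answer, a quick Monte Carlo simulation is a handy way to check it. Below is a minimal sketch of our own (not the official solution) that estimates the conditional probability directly:

```python
import random

# The three coins, described by their two faces
coins = [("H", "H"), ("H", "T"), ("T", "T")]

n, second_h = 0, 0
for _ in range(100_000):
    coin = random.choice(coins)             # draw a coin uniformly at random
    first = random.choice(coin)             # first flip
    if first != "H":
        continue                            # condition on the first flip being H
    n += 1
    second_h += random.choice(coin) == "H"  # second flip of the same coin

print(second_h / n)  # estimates Pr(second flip H | first flip H)
```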

Chain Rule

In probability theory, the chain rule provides a way to calculate probabilities sequentially for any number of events according to the pattern of conditional probabilities

\[Pr(A \cap B \cap C) = Pr(A| B \cap C) \times Pr(B \cap C) = Pr(A|B \cap C) \times Pr(B|C) \times Pr(C)\]

where, as before, \(Pr(A)\) is a shorthand notational convenience for \(Pr(s \in A)\).

The interesting thing about the chain rule is that the order of the events \(A\) and \(B\) doesn’t matter for the probability calculation since \(Pr(A \cap B) = Pr(B|A) \times Pr(A) = Pr(A|B) \times Pr(B)\). So you can use whatever order feels more natural or intuitive for any given problem. In this sense, probabilities are sort of agnostic about the “direction of time”.
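As a small worked instance (our own illustration, separate from the exercise below), here is the chain rule applied to drawing two hearts in a row from a standard 52-card deck without replacement:

```python
from fractions import Fraction

# Pr(first card is a heart) and Pr(second is a heart | first was a heart)
pr_h1 = Fraction(13, 52)
pr_h2_given_h1 = Fraction(12, 51)  # 12 hearts left among 51 cards

# Chain rule: Pr(H1 ∩ H2) = Pr(H2 | H1) * Pr(H1)
print(pr_h1 * pr_h2_given_h1)  # 1/17 ≈ 0.0588
```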

Note

EXERCISE

Calculate the probability of getting a Queen and a King if you draw two cards from a standard 52-card deck.

Law of Total Probability

The Law of Total Probability is a pretty intuitive idea, but it’s somewhat complex to express using mathematical notation. It will look especially challenging if you’re not familiar with summation notation, which expresses the sum of \(n\) numbers \(x_i\), \(i = 1, \cdots, n\), as \(\displaystyle \sum^n_{i=1} x_i\).

With this notation in mind, the Law of Total Probability guarantees that for a partition \(\{A_1, A_2, \cdots, A_n\}\) of a sample space \(S\) (i.e., a set of events such that \(\underset{i=1}{\overset{n}{\cup}} A_i = S\) and \(A_i \cap A_j = \emptyset\) for all \(i\) and \(j\) such that \(1 \leq i \neq j \leq n\)) and an event \(B \subseteq S\), we have that

\[\displaystyle Pr(B) = \sum^n_{i=1} Pr(B\cap A_i) = \sum^n_{i=1} Pr(B|A_i) Pr(A_i)\]
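As a quick numerical sanity check (a sketch of our own, using a standard 52-card deck), partition the deck by suit and let \(B\) be the event “the card is a face card”; summing \(Pr(B|A_i)\,Pr(A_i)\) over the four suits recovers \(Pr(B)\):

```python
from fractions import Fraction

# Partition the deck by suit: each A_i has probability 13/52
pr_suit = Fraction(13, 52)

# Within any suit, 3 of the 13 cards (J, Q, K) are face cards
pr_face_given_suit = Fraction(3, 13)

# Law of Total Probability: sum Pr(B|A_i) * Pr(A_i) over the partition
pr_face = sum(pr_face_given_suit * pr_suit for _ in range(4))
print(pr_face)  # 3/13, matching the direct count 12/52
```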

Bayes’ Rule

Bayes’ rule is a formula for computing the conditional probability of \(B|A\) from the reverse conditional probability of \(A|B\). Bayes’ rule follows directly from a re-expression and a subsequent re-application of the chain rule:

\[Pr(B|A) = \frac{Pr(A \cap B)}{Pr(A)} = \frac{Pr(A|B)Pr(B)}{Pr(A)}\]

Note

EXERCISE [Extra]

  1. Prove Bayes’ rule using the Chain Rule.

  2. Use the Law of Total Probability to express \(Pr(A)\) in terms of \(Pr(A|B_i)Pr(B_i)\), where \(B_i\) is a member of a partition of the sample space in question.

(We will discuss a generalization of Bayes’ rule that results in an entire branch of statistics known as Bayesian statistics tomorrow).

Medical Testing

Suppose we are interested in screening a population for some condition \(C\) and have a test \(T\) which predicts if the condition is present or not.

  • The positive predictive value of the test is the probability that an individual who tested positive (i.e., \(T^{+}\)) truly does have the condition (i.e., \(C^{+}\)):

    \(PV^{+} = Pr(C^{+} |T^{+})\)

  • The negative predictive value of the test is the probability that an individual who tested negative (i.e., \(T^{-}\)) truly does not have the condition (i.e., \(C^{-}\)):

    \(PV^{-} = Pr(C^{-} |T^{-} )\)

  • The sensitivity of the test is the probability the test detects the condition (i.e., \(T^{+}\)) when it should (i.e., when \(C^{+}\) is true):

    \(Pr(T^{+} |C^{+})\)

  • The specificity of the test is the probability the test does not detect the condition (i.e., \(T^{-}\)) when it shouldn’t (i.e., when \(C^{-}\) is true):

    \(Pr(T^{-} |C^{-})\)

  • And prevalence here refers to the overall rate at which the condition presents itself in the population being tested:

    \(Pr(C^{+})\)

  • And finally, note that \(Pr(T^{+} |C^{-} ) = 1 - \textrm{specificity}\)

Suppose we would like to know how much to trust a positive result \((T^+)\), i.e., we are interested in the positive predictive value \(PV^{+}\) of the test \((T)\). Using Bayes’ rule, we can calculate this as follows:

\begin{eqnarray} Pr(C^{+} |T^{+}) &=& \frac{Pr(T^{+}|C^{+}) Pr(C^{+})}{Pr(C^{+})Pr(T^{+}|C^{+})+Pr(C^{-})Pr(T^{+}|C^{-})} \\ &=& \frac{Pr(C^{+}) \times \textrm{sensitivity}}{Pr(C^{+}) \times \textrm{sensitivity}+(1-Pr(C^{+})) \times (1-\textrm{specificity})} \end{eqnarray}

So, if we were given a test with sensitivity of 0.84 and specificity of 0.77 and apply it to a condition with a prevalence of 0.20 in the population under examination, then

\[PV^{+} = \frac{(0.2)(0.84)}{(0.2)(0.84)+(0.8)(0.23)} = 0.48\]

and

\[PV^{-} = \frac{(0.8)(0.77)}{(0.8)(0.77)+(0.2)(0.16)} = 0.95\]
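These calculations are easy to script. Here is a minimal Python helper (a sketch of our own; the function name `predictive_values` is our invention, not from any library) that reproduces both numbers:

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Compute (PPV, NPV) from test characteristics via Bayes' rule."""
    # PV+ = Pr(C+ | T+)
    ppv_num = prevalence * sensitivity
    ppv = ppv_num / (ppv_num + (1 - prevalence) * (1 - specificity))
    # PV- = Pr(C- | T-)
    npv_num = (1 - prevalence) * specificity
    npv = npv_num / (npv_num + prevalence * (1 - sensitivity))
    return ppv, npv

ppv, npv = predictive_values(sensitivity=0.84, specificity=0.77, prevalence=0.20)
print(round(ppv, 2), round(npv, 2))  # 0.48 0.95
```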

Note

EXERCISE

Verify that the answer given for \(PV^{-}\) above is correct by deriving \(Pr(C^{-} |T^{-})\) using Bayes’ rule and calculating the resulting formula.