.. probability lecture

Statistics Concepts
===================

Statistics is the sister discipline to probability in mathematics.  Statistics
addresses the *inverse problem* of learning about probability distributions
from data, as opposed to the *forward problem* of generating data from
probability distributions.  The term **statistic** also has its own
definition -- a *statistic* is a "mathematical calculation of some data".
Since inference on probability distribution parameters relies upon
*statistics*, the term **Statistics** is an appropriate name for the
discipline.

PDFs and CDFs
-------------

As we have seen, there are two types of probability distributions:

* **Discrete distributions** define positive probabilities for specific
  outcomes :math:`x` of a *discrete-valued* random variable :math:`X`
  using a **probability mass function (PMF)**, i.e.,

  .. math:: Pr(X=x)

..

* **Continuous distributions**, on the other hand (perhaps unexpectedly and
  paradoxically), define :math:`Pr(X=x) = 0` for every potential outcome
  :math:`x` of a *continuous-valued* random variable :math:`X`, but define
  positive probabilities :math:`Pr(X=x \in E \subseteq \mathbb{R})` according
  to the area under the **probability density function (PDF)**
  :math:`f_X(X=x)` over the set :math:`E`, i.e.,

  .. math:: Pr(X=x \in E) = \underset{E}{\int} f(X=x)\; dx

  so that a *PDF* defines outcome prevalence in a relative (rather than
  absolute, i.e. probability) manner.  For example, if
  :math:`f(X=x_1) = 2f(X=x_2)` then in the long run :math:`x_1` will occur
  *twice as frequently* as :math:`x_2`.

For more explanation, see `this video about PDFs from Khan academy `_.

.. note::

   **EXERCISE**

   Have a look at the documentation for the `probability distribution
   functionality in SciPy `_, which shows how to work with probability
   distributions using Python.  Specifically, it shows how to use Python to
   evaluate *probability mass functions* and *probability density functions*
   -- something we haven't done yet.  For example, here's how to evaluate the
   *density* of the *gamma distribution* for a given outcome:

   >>> from scipy import stats
   >>> gamma_rv = stats.gamma(a=5, scale=10)
   >>> gamma_rv.pdf(10)
   0.0015328310048810102
   >>> poisson_rv = stats.poisson(mu=20)

   Once you've gotten the hang of this for the gamma distribution, try to
   generate analogous values for a *Poisson* distribution.  If you're running
   into trouble, consider what method of a `poisson_rv` object you should be
   calling.  (Hint: should a Poisson random variable have a probability
   *density* function?)  Once you've got that working, what would you say is
   the biggest difference between these values as associated with the gamma
   distribution compared to the Poisson distribution?

Both *discrete* and *continuous* distributions satisfy the
*axioms of probability*, e.g.,

1. :math:`0 \leq Pr(X=x) \leq 1` and :math:`0 \leq \underset{E \subseteq \mathbb{R}}{\int} f(X=x)\; dx \leq 1`

..

2. :math:`\sum_{x \in S_X} Pr(X=x) = 1` and :math:`\int_{-\infty}^{\infty} f(X=x) \; dx = 1`

and both can be represented as a **cumulative distribution function (CDF)**,
which is defined as

.. math:: F_X(X=x_0) = Pr(X=x \leq x_0)

.. plot:: ./scripts/pdf-cdf-plot.py

Notice that for *continuous distributions*

.. math:: F_X(X=x_0) = \underset{x \leq x_0}{\int} f(X=x)\; dx

so that the CDF accumulates the area under the PDF up to the outcome
:math:`x_0`.
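For a quick numerical check of this relationship -- a minimal sketch building
on the frozen ``gamma_rv`` object from the SciPy exercise above, and using
``scipy.integrate.quad`` for the numerical integration -- the ``cdf`` method
should agree with the area accumulated under the ``pdf``::

    from scipy import stats
    from scipy.integrate import quad

    gamma_rv = stats.gamma(a=5, scale=10)

    # CDF evaluated directly at x0 = 10
    direct = gamma_rv.cdf(10)

    # Area under the PDF from 0 up to x0 = 10 (gamma outcomes are non-negative)
    area, _abs_error = quad(gamma_rv.pdf, 0, 10)

    print(direct, area)  # the two values should agree, at roughly 0.0037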
.. note::

   **EXERCISE**

   The second (central) moment of a distribution -- the **variance** -- is a
   measure of the spread of the distribution and is defined as

   .. math:: Var[X] = E\left[(X-E[X])^2\right]

   and the **standard deviation**, which is measured on the original units of
   the random variable, is defined as :math:`\sigma_X = \sqrt{Var[X]}`.  How
   would you actually calculate the standard deviation of a random variable
   with a given discrete distribution, :math:`Pr(X=x)`?

   For more information, see: `Measures of spread (Khan academy) `_.

Joint Distributions
-------------------

When we're talking about random variables, we don't use the *set* notation
that we did for events, e.g., :math:`A \cap B`.  Instead, we specify the
distribution associated with two random variables :math:`X_1` and :math:`X_2`
as :math:`P(X_1, X_2)`, where :math:`P` specifies either a PMF or a PDF.  A
distribution such as this that is specified for two or more random variables
is called a **joint distribution**.  More generally, the *joint distribution*
of a collection of random variables :math:`X_i, \; i = 1, \cdots, n` is
defined by the distributional form of the *chain rule*, which is

.. math:: \displaystyle P\left(X_1, X_2, \cdots X_n\right) = \left(\prod_{i=2}^{n} P\left(X_i | X_{i-1}, \cdots X_1 \right)\right) \times P\left(X_1\right)

Further, just as with *events*, if the :math:`X_i` are *independent* of each
other then

.. math:: \displaystyle P\left(X_1, X_2, \cdots X_n\right) = \prod_{i=1}^n P\left(X_i\right)

Note that the mathematical *multiplication notation*
:math:`\displaystyle \prod_{i=1}^{n} c_i` for numbers
:math:`c_i, i = 1, \cdots, n` is just like the mathematical *summation
notation* :math:`\displaystyle \sum_{i=1}^{n} c_i` except that the
:math:`c_i` are *multiplied* together instead of being *added* together.

.. note::

   **EXERCISE**

   Write out the distributional chain rule defining
   :math:`P\left(X_1, X_2, X_3, X_4, X_5\right)` and give an account of how
   it might be interpreted.  E.g., "First we calculate the probability of
   :math:`X_1`..."

Linear Association
------------------

Linear association between two variables is encoded as the **covariance** of
the joint distribution of those two variables

.. math::

   Cov[X,Y] &= E[(X - E[X])(Y - E[Y])] \\
            &= \left[\underset{x,y \in S_X,S_Y}{\sum \text{ or } \int}\right] (x - E[X])(y - E[Y])\, P(X=x,Y=y) \left[dx\, dy\right]

where the brackets simply indicate the appropriate notational usage depending
on whether we're talking about discrete or continuous random variables.  Much
like with the standard deviation, it can be helpful to be on a more natural
scale, so we often use **correlation** (which varies from -1 to +1, with 0
indicating "no linear association") rather than covariance (which is measured
on the product of the two variables' units) to describe the strength of a
linear relationship:

.. math:: Corr[X,Y] = \frac{E[(X - E[X])(Y - E[Y])]}{\sigma_X\sigma_Y} = \frac{Cov[X,Y]}{\sigma_X\sigma_Y}

Marginal Distributions
----------------------

We have seen *marginal distributions* already -- they are simply
distributions of a single random variable.  However, recasting the
*Law of Total Probability* in terms of random variables :math:`X` and
:math:`Y`, we have for

* **discrete distributions**

  .. math:: \displaystyle Pr(X=x) = \sum_{y \in S_Y} Pr(X=x, Y=y) = \sum_{y \in S_Y} Pr(X=x|Y=y) Pr(Y=y)

* **continuous distributions**

  .. math:: \displaystyle f(X=x) = \int_{y \in S_Y} f(X=x, Y=y) \;dy = \int_{y \in S_Y} f(X=x|Y=y) f(Y=y) \;dy

which shows how **marginal distributions** :math:`Pr(X=x)` and :math:`f(X=x)`
can be derived from their higher order **joint distributions**
:math:`Pr(X, Y)` and :math:`f(X, Y)`, respectively.  Thus, a
**marginal distribution** of a (possibly not independent) *multivariate
(joint) distribution* is just the distribution of a single dimension (random
variable) of the multivariate (joint) random variable.  Marginal
distributions are the unpacked variables of joint distributions.  So, while
the *chain rule* allows us to build up joint distributions from conditional
(and marginal) distributions, the *law of total probability* allows us to
unpack joint distributions into marginal distributions, as illustrated in the
sketch below.
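To make these definitions concrete, here is a minimal sketch using a small,
made-up discrete joint PMF (the numbers are purely illustrative and are not
taken from the lecture).  Summing the joint table over one variable recovers
the marginals, and the covariance and correlation formulas above can be
evaluated directly from the same table::

    import numpy as np

    # A made-up joint PMF Pr(X=x, Y=y) with x in {0, 1} (rows) and y in {0, 1, 2}
    # (columns); the entries are non-negative and sum to 1.
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.05, 0.25, 0.30]])
    x_vals = np.array([0, 1])
    y_vals = np.array([0, 1, 2])

    # Law of total probability: sum out the other variable to get each marginal
    p_x = joint.sum(axis=1)   # Pr(X=x)
    p_y = joint.sum(axis=0)   # Pr(Y=y)

    # If X and Y were independent, the joint would equal the product of marginals
    print(np.allclose(joint, np.outer(p_x, p_y)))

    # Expected values of the marginals
    E_x = (x_vals * p_x).sum()
    E_y = (y_vals * p_y).sum()

    # Covariance and correlation computed directly from the joint PMF
    cov = sum(joint[i, j] * (x_vals[i] - E_x) * (y_vals[j] - E_y)
              for i in range(len(x_vals)) for j in range(len(y_vals)))
    sd_x = np.sqrt(((x_vals - E_x) ** 2 * p_x).sum())
    sd_y = np.sqrt(((y_vals - E_y) ** 2 * p_y).sum())
    print(p_x, p_y, cov, cov / (sd_x * sd_y))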
.. figure:: ./images/MultivariateNormal.png
   :scale: 75%
   :align: center
   :alt: joint-distribution
   :figclass: align-center

.. note::

   **EXERCISE**

   Draw the above plot, labeling it with all the concepts we've covered so
   far.

Statistics
----------

**Statistics** are often chosen for their correspondence to specific
distributional *parameters* for the purposes of estimating those parameters.
It's important to always remember the distinction between *statistics* and
*parameters*, though: statistics are numerical calculations made from sample
data, while parameters are mathematical manipulations carried out on
distributional forms.

A statistic that corresponds to the *population mean* is, unsurprisingly,
the **sample mean**:

.. math:: \bar{x} = \frac{1}{n}\sum_{j=1}^n x_j

However, alternative statistics with different robustness and behavior
profiles, such as the **sample median** and the **sample mode**, are
available for measuring centrality.

The statistic that corresponds to the *population variance* is the
**sample variance**:

.. math:: s^2 = \frac{1}{n-1} \sum_{j=1}^n (x_j - \bar{x})^2

But again, alternative statistics such as the **range** and
**inter-quartile range** are available for measuring spread.  And of course,
the sample *standard deviation* :math:`s = \sqrt{s^2}` is much easier to
interpret than the *sample variance*.

There are a couple of common choices for statistics that correspond to linear
association parameters.  The **Pearson correlation** coefficient measures the
linear relationship between two datasets.  The alternative
**Spearman correlation** is a nonparametric measure of the monotonicity of
the relationship between two datasets, which is just a fancy way of saying
that it calculates the correlation on the *ranks* rather than the original
values.  Here's how you can calculate these statistics using Python:

>>> from scipy.stats import pearsonr
>>> from scipy.stats import spearmanr
>>>
>>> pearsonr([1,2,3,4,5],[5,6,7,8,7])
(0.83205029433784372, 0.080509573298498519)
>>> spearmanr([1,2,3,4,5],[5,6,7,8,7])
(0.82078268166812329, 0.088587005313543812)

The first value in the above tuples is the correlation.  The second is a
*p-value* of a statistical *test* of the *null hypothesis* of no association.
The two tests are based on different distributional assumptions and as such
are, unsurprisingly, different.

A **spurious relationship** is said to exist between two or more random
variables that are not causally related to each other but have a relationship
due to a common **confounding factor**.
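As a small illustration of the different robustness profiles mentioned above
(a minimal sketch with made-up numbers, not data from the lecture), a single
extreme value shifts the sample mean and the Pearson correlation considerably
while leaving the sample median and the rank-based Spearman correlation
unchanged::

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = [1, 2, 3, 4, 5, 6]
    y_clean = [5, 6, 7, 8, 7, 9]
    y_outlier = [5, 6, 7, 8, 7, 90]   # same data, but the last value is now an extreme outlier

    # The sample mean chases the outlier; the sample median barely notices it
    print(np.mean(y_clean), np.mean(y_outlier))      # 7.0 vs 20.5
    print(np.median(y_clean), np.median(y_outlier))  # 7.0 vs 7.0

    # Pearson is computed on the original values, so the outlier drags it around;
    # Spearman is computed on ranks, which the outlier does not change at all here
    print(pearsonr(x, y_clean)[0], pearsonr(x, y_outlier)[0])
    print(spearmanr(x, y_clean)[0], spearmanr(x, y_outlier)[0])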
A Warning
---------

**Confounding** is just one of the many difficulties that will need to be
dealt with in real data.  When you actually begin working with real data
you'll see that things can be quite messy.  In fact, "messy" would be an
understatement for some of the **outliers** that will be present in your
data.  These outliers can drastically affect your calculated statistics and
hence your conclusions.  Wary and vigilant attention is required to suss out
these influential data points and decide what is to be done about them.

And what if you have **missing data** that's not even available to look at?
Will you impute the missing data?  If so, with how much sophistication?  Or
will you simply disregard samples with missing entries?  As you can see,
there are many questions and, unfortunately, very often too few answers...

.. note::

   **EXERCISE**

   List out some statistics you could calculate with the data in the above
   plot that you drew.

Further study
-------------

Most major statistical textbooks, for example (the free) `Elements of
Statistical Learning `_, will begin with an overview of the topics in this
section.