Statistics Essentials

Where are we going?

Within data science (and perhaps at its core) is the field of machine learning, which seeks to accomplish two objectives:

  • Supervised learning - learn a mapping from inputs \(x\) to outputs \(y\)

  • Unsupervised learning - given only \(x\), learn interesting patterns in \(x\)

These tasks are a form of artificial intelligence: they endow a computer with the capability to represent a general class of patterns.

Through that representation, the computer can predict outputs and identify patterns. Note that this differs from explicitly hardcoding a data relationship into a computer, as if the specific relationship were already known beforehand.

To identify which specific patterns (out of a general class of patterns) are present in the data, machine learning makes extensive use of linear algebra (the branch of mathematics that works directly with matrices) in conjunction with numerical optimization procedures.

This process of identifying a specific instance (out of a general class of patterns) that looks as similar to the data as possible is called “model fitting”. Once a machine has such a model representation of the data, it has learned the pattern in the data and can use it as part of other programmatic instructions designed to accomplish some objective.
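
As a minimal sketch of model fitting, assume the general class of patterns is the set of straight lines y = a + b*x, and that "as similar to the data as possible" means the smallest sum of squared errors. Neither choice comes from the text above, and the data and the names `fit_line` and `predict` are invented for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns the pair (a, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution of the one-variable least-squares problem.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Made-up data: points scattered around the line y = 1 + 2x.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]

# "Model fitting": pick the one line, out of all lines, closest to the data.
a, b = fit_line(xs, ys)


def predict(x):
    """Use the fitted model as part of other program logic."""
    return a + b * x
```

Here the fitted pair `(a, b)` is the "specific instance", and `predict` is how the rest of a program would use the learned pattern. Richer model classes generally have no closed-form solution, which is where numerical optimization comes in.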


Machine Learning versus Statistics

What is the difference between statistics and machine learning?

Statistics and Machine Learning are distinct traditions of quantitative analysis. Each developed toward its own objectives, shaped by the problem-solving methodologies and philosophies (primarily computational ones) that were available to it. Both disciplines are rooted in the common enterprise of “data analysis”, however, and have found common ground on which to reconcile and merge methodologies, so the line between the two has become increasingly blurred. Nonetheless, some general statements can be made about the traditional domains and expertise claimed by each discipline:

Statistics

  • utilizes confidence intervals, hypothesis tests, and optimal estimators

  • places paramount importance on characterizing uncertainty in estimation

  • bases methodological development on distribution and asymptotic theory
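
To make "characterizing uncertainty" concrete, here is a rough sketch of a confidence interval for a mean, using the large-sample normal approximation. The function name `mean_confidence_interval`, the simulated sample, and the choice of a z-based rather than t-based interval are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the population mean.

    Relies on the asymptotic normality of the sample mean, so it is
    only reasonable for fairly large samples.
    """
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return center - z * standard_error, center + z * standard_error


# Simulated data standing in for real measurements.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]

low, high = mean_confidence_interval(sample)
```

The interval reports a range of plausible values for the mean rather than a single estimate. Asking for a higher confidence level widens it.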

Machine Learning

  • utilizes nonparametric and complex models harnessed via regularization

  • places paramount importance on “out of sample” generalizability/performance

  • bases methodological development on empirical and computational techniques
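
The emphasis on “out of sample” performance can be illustrated with a small sketch. It assumes a deliberately flexible model (a one-nearest-neighbor predictor that memorizes its training data) and a simple 70/30 train/test split. The data and the names `nearest_neighbor_predict` and `mean_squared_error` are invented for illustration.

```python
import random


def nearest_neighbor_predict(train, x):
    """Predict y for x by copying the y of the closest training x."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mean_squared_error(train, data):
    """Average squared prediction error of the model on `data`."""
    errors = [(nearest_neighbor_predict(train, x) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


# Made-up noisy data around y = 2x, split into training and held-out parts.
rng = random.Random(1)
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0, 1.0)) for i in range(100)]
rng.shuffle(data)
train, test = data[:70], data[70:]

# In-sample error is zero because the model memorized every training point.
in_sample_mse = mean_squared_error(train, train)
# Out-of-sample error is what matters for new data, and it is not zero.
out_of_sample_mse = mean_squared_error(train, test)
```

A perfect in-sample fit says little about how the model behaves on data it has not seen. That gap is why performance is judged on held-out data, and why complex models are reined in with regularization.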

Objectives

The purpose of this short course is to (a) equip you with actual quantitative tools that you can apply to more effectively tackle problems you’re interested in using data, and (b) provide you with an appropriate foundation on which you can effectively build a synergistic data science skill set that leverages