Getting started¶
Where are we going?¶
We mentioned machine learning on the workshop landing page; here’s a figure that provides a little more perspective. Note the emphasis on sample-size “rules of thumb”. This reflects the empirical and computational nature of machine learning: it permits complex, sensitive models, which in turn need large amounts of data to produce meaningful (as opposed to spurious) fits. The figure also gives an indication of the vast and diverse set of tools and tasks that make up the machine learning world. If you’re interested in exploring this fascinating methodology space and learning how to apply these powerful tools (which, since you’re here, you probably are), then you’re going to need a fundamental understanding of, and a serviceable skill set in, programming, probability, and statistics. And if you’re just starting your journey (congratulations!), you’re in the right place!
http://scikit-learn.org/stable/tutorial/machine_learning_map
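To make the point about spurious fits concrete, here is a minimal sketch using only the Python standard library. It is an illustration of our own rather than anything taken from the figure: the function names, the choice of 200 noise features, and the sample sizes are all assumptions. It searches a set of pure-noise features for the one that correlates best with a random target. With few samples, some noise feature will usually look strongly related to the target; with many samples, that apparent relationship shrinks.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples, n_features=200, seed=0):
    """Largest |correlation| between a random target and pure-noise features.

    Every feature is independent of the target by construction, so any
    correlation found here is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Illustrative sample sizes; compare how the best "fit" changes.
    for n in (10, 100, 1000):
        print(n, round(best_spurious_correlation(n), 2))
```

The more candidate features (or model flexibility) you allow, the more data you need before the best-looking fit can be trusted.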
Industry Leading Tools¶
The above figure is part of Scikit-Learn’s extensive documentation and tutorial collection. Scikit-Learn is the leading open source collection of machine learning tools. There are some competitors (notably R, which has seen an increase in machine learning capabilities but is still tailored more towards the historical statistics and traditional data analysis communities), but it is safe to say that Scikit-Learn is the industry standard bearer. Scikit-Learn is a library for the Python programming language, itself an industry standard, which has seen widespread adoption and the growth of an incredibly vast and diverse ecosystem. Python is, and has been for quite some time now, one of the most popular and generally useful programming languages around. If you’re looking to build up your programming skill set, you cannot go wrong with Python. One final industry leader is the Anaconda distribution from Continuum Analytics (now Anaconda, Inc.). Anaconda is a platform that simplifies package management and deployment of open source data science capabilities. Data scientists love Anaconda.
Installing Python¶
Because (a) Python is the working language of the Data Science Immersive, and (b) you’ll need an installation of Python on your computer if you want to follow along with the examples in this workshop, we encourage you to install Anaconda (which comes with a working installation of Python out of the box).
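Once the installation has finished, a short script along these lines can confirm that Python runs and that a few commonly used packages are importable. This is a sketch, not an official check: the package list (`numpy`, `pandas`, `sklearn`) is our assumption about what the examples will need, and `check_environment` is a hypothetical helper name.

```python
import importlib.util
import sys


def check_environment(packages=("numpy", "pandas", "sklearn")):
    """Return a dict mapping each package name to True if it can be found.

    The default package list is an assumption, not a course requirement.
    find_spec only locates the package; it does not import it.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If a package is reported as missing, it can usually be added with Anaconda’s package manager.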
Installing an Editor¶
Data scientists need to be proficient at programming, and programming is usually done in a dedicated code editor. There are many editors to choose from. Unless you already have a programming editor that you are comfortable with, we recommend that you start with VSCode.
Another beginner-friendly editor is Atom, although GitHub retired the project in December 2022, so it no longer receives updates.
Note
Programming practices, editors, version control, software engineering, and other related topics are not part of the scope of this short course.
Installing Git (optional)¶
On Windows¶
Download the latest Git for Windows installer
When you’ve successfully started the installer, you should see the Git Setup wizard screen
Important
To use Git from the command prompt you must check the box
Open a Command Prompt (or Git Bash if that functionality was not enabled)
On macOS (OS X)¶
Install Homebrew (if it is not already installed) using the instructions here
Install Git from the command line
~$ brew install git
On Ubuntu¶
~$ sudo apt install git
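Whichever platform you are on, you can confirm that Git is reachable from the same environment your Python runs in. The following is a minimal sketch; `git_version` is a hypothetical helper name, and it relies only on the standard `git --version` command.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not on the PATH."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = git_version()
    print(version if version else "Git was not found on the PATH")
```

If this reports that Git was not found right after installing, opening a new terminal usually refreshes the PATH.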
Keeping your materials up to date¶
If you have Git installed on your machine, cd into the course materials directory and run:
~$ git pull
If you are not using Git:
Copy any downloaded materials you might have into a backup directory
Download the zip file from the repository and uncompress it
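The two manual steps above can also be scripted. Here is a minimal sketch using only the standard library; the function names and the timestamped folder naming are our own choices, and it assumes you have already downloaded the zip file yourself (no repository URL is hard-coded).

```python
import shutil
import zipfile
from datetime import datetime
from pathlib import Path


def backup_materials(materials_dir, backup_root):
    """Copy materials_dir into a new timestamped folder under backup_root."""
    source = Path(materials_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates the destination (and any missing parent folders).
    shutil.copytree(source, destination)
    return destination


def unpack_archive(zip_path, target_dir):
    """Extract a downloaded zip file into target_dir and return that path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `backup_materials("materials", "backups")` followed by `unpack_archive("materials.zip", "materials-new")` would keep your old copy and unpack the fresh one next to it; all three paths here are placeholders.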
At the end of the course you may wish to fork the repository to your personal GitHub.
A note on Jupyter¶
Jupyter notebooks mix interactive Python with prose written in Markdown. They are both powerful and an industry standard, so practice with them is useful.
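To see what “mixing Python with prose” means on disk, note that a notebook file (`.ipynb`) is a JSON document containing a list of cells. The sketch below builds a deliberately simplified two-cell notebook by hand; `make_notebook` and the file name `example.ipynb` are hypothetical, and notebooks saved by Jupyter itself carry additional metadata.

```python
import json


def make_notebook(markdown_text, code_text):
    """Build a minimal notebook (format version 4): one prose cell, one code cell."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "execution_count": None,  # not yet run
                "metadata": {},
                "outputs": [],
                "source": code_text,
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


if __name__ == "__main__":
    notebook = make_notebook("# My first notebook", "print('hello')")
    # Hypothetical file name; open the result with Jupyter to inspect it.
    with open("example.ipynb", "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
```

In practice you will create and edit notebooks through the Jupyter interface rather than by writing JSON directly.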
An in-depth understanding of Jupyter notebooks is not necessary for this course, but they are frequently used so it can be helpful to become familiar with the technology. Here are some resources to help you better understand this computing environment.