Docker Installation Guide

Docker is now firmly placed among the most important tools for a data science team. It is an open-source platform that performs operating-system-level containerization, and the resulting containers can hold multiple independently running applications. Docker creates a portable, reusable environment and is far easier than managing dedicated servers or virtual environments. In data science today, Docker is the industry standard for containerizing machine learning models and AI services.

The community package of the Docker Engine is called docker-ce. Fundamentally, a Docker container can be thought of as a running script with all the necessary components bundled within. The containers themselves sit on top of an operating system, as shown in the following diagram.

[Diagram: containers running on top of the host operating system]

The Docker container is a running process that is kept isolated from the host and from other containers. One of the important consequences of this isolation is that each container interacts with its own private filesystem. A Docker image includes everything needed to run an application: code, runtime libraries, and a private filesystem.
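This filesystem isolation can be sketched with two throwaway containers from the same image: a file written in the first container is gone in the second, because each container gets its own private filesystem (the alpine image and file path here are illustrative assumptions):

```shell
# Requires Docker installed. Each `docker run` starts a fresh, isolated container.
docker run --rm alpine sh -c 'echo hello > /tmp/f; ls /tmp'   # shows: f
docker run --rm alpine ls /tmp                                # /tmp is empty again
```

The `--rm` flag removes each container when it exits, so these commands leave nothing behind.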

Docker installation and setup

To install the Docker Engine, follow the guide for your operating system in the official Docker documentation.

Once the installation is complete, you should be able to run the “hello world” example:

~$ docker run hello-world

or, if root privileges are required:

~$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.

Congratulations! You have just run a container. It could have been anything from a fully functional Spark environment to a simple model that a colleague recently deployed. There are a number of arguments to the docker run command that we will get into, but this is the basis for running containers.

NVIDIA Docker

Important

This section on NVIDIA Docker is here for your reference, but it is not required for Galvanize trainings.

To use TensorFlow with a GPU, you need to ensure that the NVIDIA driver, CUDA, and additional required libraries are set up and versioned appropriately. Then you can install tensorflow-gpu. There is some amount of overhead involved in getting this ecosystem running smoothly, and there are ongoing maintenance requirements as the stable versions change over time. Alternatively, a GPU-enabled TensorFlow can be run via Docker.

The process is similar for PyTorch, Caffe, and (in general) any deep-learning framework that makes use of GPUs. The NVIDIA container toolkit, or simply nvidia-docker, is an incredibly convenient way to build and run GPU-accelerated Docker containers. Once it is installed, you can pull down the latest GPU version of TensorFlow (with Jupyter support) with:

~$ docker pull tensorflow/tensorflow:latest-gpu-jupyter

NVIDIA Docker and GPU computing are not required for this course or any in this specialization, but knowledge of Dockerized versions of TensorFlow and similar tools can save significant amounts of time (you will need to ensure that docker-ce is installed).