Thinking in terms of vectors and matrices


Learning Objectives
1 Become familiar with linear algebra’s basic data structures: scalar, vector, matrix, tensor
2 Create, manipulate, and generally begin to get comfortable with NumPy arrays

So you may be asking why?

xkcd-1838

https://xkcd.com/1838

Here are just a few reasons why a solid understanding of linear algebra is crutial for a practicing data scientist

  • Linear models can concisly be written in vector notation
  • Regularization often makes use of matrix norms
  • Matrix decompsitions are commonly used in recommender systems

Scalars, vectors, matrices and tensors

Without knowing anything about vectors or matrices there is already a good chance that you have some intuition for these concepts. Think of a spreadsheet with rows and columns. Within a given cell there exists some value—lets call it a scalar; scalars are the contents of vectors and matrices. If we think of the idea of a column and the elements contained therein we now have a basis for the concept of a vector. More specifically, this is referred to as a column vector. The elements of a row are accordingly referred to as a row vector.

We collectively refer to the columns and rows as matrix.

Note

A matrix with \(m\) rows and \(n\) columns is a \(m \times n\) matrix and we refer to \(m\) and \(n\) as dimensions.

If a matrix is a two dimensional representation of data then a tensor is the generalization of that representation to any number of dimensions. Lets say we copied our spreadsheet and created several new tabs then we are now working with a tensor.

Machine Learning Notation Description
Scaler \(x\) a single real number (ints, floats etc)
Vector \(\mathbf{x}\) or \(\mathbf{x}^{T}\) a 1D array of numbers (real, binary, integer etc)
Matrix \(\mathbf{X}\) a 2D array of numbers
Tensor \(\hat{f}\) an array generalized to n dimensions

Matrices are also tensors. If we were working with a \(4 \times 4\) matrix it can be described as a tensor of rank 2. The rank is the formal term for the number of dimensions.

Questions

  1. So what are the dimensions of the following matrix
\[\begin{split}x = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 2 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}\end{split}\]
ANSWER

The matrix dimensions are \(3 \times 4\)


  1. Given a spreadsheet that has 4 tabs and each tab has 2 rows with 3 columns how might we represent that data with a tensor?
ANSWER

The tensor would be of rank 3 and have the following dimensions \(2 \times 3 \times 4\). In NumPy this is written as a \(4 \times 2 \times 3\). You will soon be working with NumPy and you can refer back to this example.

>>> x = np.array([[[ 0,  1,  2],
                   [ 3,  4,  5]],
                  [[ 9, 10, 11],
                   [15, 16, 17]],
                  [[ 2, 10, 8],
                   [5, 4, 9]],
                  [[18, 19, 20],
                   [24, 25, 26]]])
>>> print(x.shape)
(4, 2, 3)

Notation

Scalers have the standard math notation

\[x = 1\]

Vectors are denoted by lower case bold letters such as \(\mathbf{x}\), and all vectors are assumed to be column vectors.

\[\begin{split}\mathbf{x} = \begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix}\end{split}\]

A superscript \(T\) denotes the transpose of a matrix or vector. This implies that \(\mathbf{x}^{T}\) is a row vector.

\[\mathbf{x}^{T} = \begin{pmatrix} 0 & 1 & 2 & 3 \end{pmatrix}\]

Upper-case bold letter denote.

\[\begin{split}\mathbf{X} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 2 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}\end{split}\]

An introduction to NumPy and Arrays

Sometimes we need to write concepts on paper or see them in action through code before we can effectively establish our understanding. We will be learning the through a widely used Python package called NumPy to help bring to life the essentials of linear algebra.

In order to get the most out of this resource and to ensure that you can actively follow along it is easiest if you install a working Python environment.

Important

Familiarity with the Python language is not a prerequisite for this primer. The included code blocks are minimal and you should be able to follow even without prior experience in Python.

Once Python is installed you can start an interactive Python environment by typing the command ipython into a terminal. NumPy is the de facto standard for numerical computing in Python and it comes installed as part of the Conda bundle. It is highly optimized and extremely useful for working with matrices. The standard matrix class in NumPy is called an array. We will first get comfortable working with arrays then we will ease our way into the essential concepts of linear algebra. NumPy will provide you with a tool explore all concepts presented here.

The standard syntax for importing the package NumPy into a Python environment is

>>> import numpy as np

Note

Examples of code (like the import statement above) are line by line, where each line begins with >>>. This means that you can copy the code that comes after the line indicator directly into your interpreter

Arrays and their attributes

Python is an object-oriented programming language. The main object in NumPy is the homogeneous, multidimensional array. An array is our programmatic way to represent vectors and matrices. An example is a matrix \(\mathbf{X}\)

\[\begin{split}\mathbf{X} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}\end{split}\]

and can be represented through NumPy as

>>> import numpy as np
>>> X = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> X
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Lets break down that code statement. First

>>> a = [1,2,3]

is a native Python data structure called a list. We could create a vector from this list using the NumPy array class.

>>> a = np.array([1,2,3])

So to create the above X matrix it is a list of lists where each row corresponds to a list.

Because our array version of \(\mathbf{X}\) is an object it contains methods and attributes.

  • The methods are functions that act on our matrix
  • the attributes are data that are related to our matrix.

Lets start with some useful attributes. The array \(\mathbf{X}\) has 2 dimensions. The number of dimensions in linear algebra terminology is referred to as rank. We get at rank with the ndim attribute.

>>> X.ndim
2

similarly we have access to the dimensions themselves through shape

>>> X.shape
(3, 3)

Note that the number of axes is also equal to the or the length of x.shape. To return an integer representing the total number of elements one may use size.

>>> X.size
9

Warning

If you want to work with a vector where the dimensions exist explicitly you need to use double brackets. Otherwise it will be a 1D matrix and sometimes it may not give you the result you were looking for.

>>> np.array([1,2,3]).shape
(3,)
>>> np.array([[1,2,3]]).shape
(1, 3)

Arrays and their methods

We have seen that arrays have built in attributes that are useful. They also have numerous built-in methods that make them particularly convenient. Note that methods always have parenthesis that may or may not enclose arguments.

>>> X.sum(axis=0)
array([12, 15, 18])
>>> X.sum(axis=1)
array([ 6, 15, 24])
>>> X.mean(axis=0)
array([ 4.,  5.,  6.])
>>> X.mean(axis=1)
array([ 2.,  5.,  8.])

Commonly used arrays can be created with functions that are part of the NumPy package. For example, to make a sequence of numbers, we can use arange. This is similar to the standard python function range that returns a list instead of an array. Look carefully at the following examples to see how it works.

>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(5,10)
array([5, 6, 7, 8, 9])
>>> np.arange(5,10,0.5)
array([ 5. ,  5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

Also we can recreate the first matrix by reshaping the output of arange.

>>> X = np.arange(1,10).reshape(3,3)
>>> X
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In that function we created an array with values from 1-10 then we reshaped it into a 2D array with 3 columns and 3 rows. Another similar function to arange is linspace which fills a vector with evenly spaced variables for a specified interval.

>>> x = np.linspace(0,5,5)
>>> x
array([ 0.  ,  1.25,  2.5 ,  3.75,  5.  ])

As a reminder you may access the Python documentation at anytime from the command line using

~$ pydoc numpy.linspace

The following plot visualizes linspace. It is an important function, but it less important that you understand the plotting portion of the code.

import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
N = 8
y = np.zeros(N)
x1 = np.linspace(0, 10, N, endpoint=True)
p1 = plt.plot(x1, y, 'o')
ax.set_xlim([-0.5,10.5])
plt.show()

(Source code, png, hires.png, pdf)

_images/linspace-example.png

Important

Did you notice that arange starts counting at zero? Python uses zero based indexing, so the initial element of a sequence has index 0.

This is a good time to introduce the idea that arrays may be made of different types of data, but they can only be one data type at a given time.

>>> x = np.array([1,2,3])
>>> x.dtype
dtype('int64')
>>> x = np.array([0.1,0.2,0.3])
>>> x
array([ 0.1,  0.2,  0.3])
>>> x.dtype
dtype('float64')
>>> x = np.array([1,2,3],dtype='float64')
>>> x.dtype
dtype('float64')

There are several convenience functions for making arrays that you should be aware of:

>>> x = np.zeros([3,4])
>>> x
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
>>> x = np.ones([3,4])
>>> x
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

Exercise

  1. Create the following matrix using a NumPy array (1 line)
\[\begin{split}A = \begin{pmatrix} 1 & 2 & \cdots & 10 \\ 11 & 12 & \cdots & 20 \\ \vdots & \ddots & \ddots & \vdots \\ 91 & 92 & \cdots & 100 \end{pmatrix}\end{split}\]
ANSWER
>>> a = np.arange(1,101).reshape(10,10)

  1. Use the array object to get the rank, number of elements, and dimensions
ANSWER
>>> print("Rank: {}\nSize: {}\nDimensions: {}".format(a.ndim,a.size,a.shape))
Rank: 2
Size: 100
Dimensions: (10, 10)

  1. Get the mean of the rows and columns
ANSWER
>>> print("Row means: {}".format(a.mean(axis=1)))
Row means: [  5.5  15.5  25.5  35.5  45.5  55.5  65.5  75.5  85.5  95.5]
>>> print("Column means: {}".format(a.mean(axis=0)))
Column means: [ 46.  47.  48.  49.  50.  51.  52.  53.  54.  55.]

  1. How do you create a vector that has exactly 50 points and spans the range 11 to 23?
ANSWER
>>> b = np.linspace(11,23,50)

  1. [extra] If you want a peak at whats to come see what happens when you do the following

    • np.log(a)
    • np.cumsum(a)
    • np.power(a,2)