## SVD for topic analysis

We can use SVD to determine what we call ***latent features***. This will be best demonstrated with an example.

### Example

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are sci-fi and the last two (Casablanca, Titanic) are romance. SVD will let us pull these two topics out of the ratings mathematically!

Let's do the computation with Python.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [19]:
# Compute SVD
from numpy.linalg import svd
U, sigma, VT = svd(M, full_matrices=False)

## Part 1

Describe in your own words what each matrix contains and how it might be used.

In [1]:
## U matrix
## print the shape and add your description

In [None]:
## sigma matrix
## print the shape and add your description

In [2]:
## VT matrix
## print the shape and add your description

## Part 2

Making use of the factorized version of our ratings

In [20]:
# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))
df_U = pd.DataFrame(U, index=users)
df_VT = pd.DataFrame(VT, columns=movies)

print(df_U)
print("--------------------------------------")
print(np.diag(sigma))
print("--------------------------------------")
print(df_VT)

          0     1     2     3     4
Alice -0.21  0.02  0.31  0.26  0.66
Bob   -0.55  0.06  0.53  0.46 -0.33
Cindy -0.50  0.07 -0.31 -0.20 -0.37
Dan   -0.62  0.08 -0.39 -0.24  0.36
Emily -0.12 -0.60  0.40 -0.52  0.20
Frank -0.04 -0.73 -0.42  0.53 -0.00
Greg  -0.06 -0.30  0.20 -0.26 -0.40
--------------------------------------
[[ 13.84   0.     0.     0.     0.  ]
 [  0.     9.52   0.     0.     0.  ]
 [  0.     0.     1.69   0.     0.  ]
 [  0.     0.     0.     1.02   0.  ]
 [  0.     0.     0.     0.     0.  ]]
--------------------------------------
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70
2   -0.78   0.62      0.03       -0.07    -0.07
3   -0.36  -0.48      0.79        0.05     0.05
4    0.00   0.00     -0.00       -0.71     0.71


Add your own description of how the matrices relate to each other.

## Trim the matrices to represent a factorization from only the top two factors
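
One possible way to trim the factorization is to keep the first two columns of `U`, the first two singular values, and the first two rows of `VT`. The sketch below is a minimal illustration under that assumption; the names `k`, `U_k`, `sigma_k`, `VT_k`, `M_approx`, and `error` are introduced here and are not part of the original notebook. It recomputes the SVD from `M` so that the unrounded factors are used.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2                  # number of latent factors to keep
U_k = U[:, :k]         # users x k
sigma_k = sigma[:k]    # the k largest singular values
VT_k = VT[:k, :]       # k x movies

# Rank-k approximation of the ratings matrix.
M_approx = U_k @ np.diag(sigma_k) @ VT_k
print(np.around(M_approx, 1))

# The Frobenius norm of the residual equals the root of the sum of the
# squared singular values that were dropped.
error = np.linalg.norm(M - M_approx)
print(error, np.sqrt(np.sum(sigma[k:] ** 2)))
```

Because the third and fourth singular values are small compared with the first two, and the fifth is zero, the rank-2 matrix should stay close to the original ratings.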

## Part 3: Does your approximate version of the matrix still reasonably reconstruct the original?

In [22]:
# Use this code but swap in your matrices
np.around(df_U.dot(np.diag(sigma)).dot(df_VT))

       Matrix  Alien  StarWars  Casablanca  Titanic
Alice     1.0    2.0       2.0         0.0      0.0
Bob       3.0    5.0       5.0         0.0      0.0
Cindy     4.0    4.0       4.0        -0.0     -0.0
Dan       5.0    5.0       5.0         0.0      0.0
Emily    -0.0    2.0      -0.0         4.0      4.0
Frank     0.0   -0.0      -0.0         5.0      5.0
Greg     -0.0    1.0      -0.0         2.0      2.0


## Part 4: Make some recommendations

Use cosine similarity to compare all other users to Alice (using movie profiles)

Use cosine similarity to compare all other movies to StarWars (using user profiles)
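
The two comparisons above could be sketched as follows. This is one interpretation, not the only valid one: it assumes that each user and each movie is represented by its coordinates in the two-factor latent space (rows of `U` and columns of `VT`, each scaled by the singular values). The helper `cosine_similarity` and the names `user_coords`, `movie_coords`, `user_sims`, and `movie_sims` are hypothetical and introduced only for this example.

```python
import numpy as np

movies = ['Matrix', 'Alien', 'StarWars', 'Casablanca', 'Titanic']
users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
# Assumption: profiles are latent coordinates scaled by the singular values.
user_coords = U[:, :k] * sigma[:k]       # one row per user
movie_coords = VT[:k, :].T * sigma[:k]   # one row per movie


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


alice = user_coords[users.index('Alice')]
user_sims = {name: cosine_similarity(alice, user_coords[i])
             for i, name in enumerate(users) if name != 'Alice'}

starwars = movie_coords[movies.index('StarWars')]
movie_sims = {title: cosine_similarity(starwars, movie_coords[j])
              for j, title in enumerate(movies) if title != 'StarWars'}

# Print each list from most to least similar.
for name, sim in sorted(user_sims.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {sim: .2f}")
for title, sim in sorted(movie_sims.items(), key=lambda kv: -kv[1]):
    print(f"{title:10s} {sim: .2f}")
```

The sign of each singular vector returned by `svd` is arbitrary, but flipping a factor flips it for every user and every movie at once, so the cosine similarities are unaffected.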

Provide a new vector of ratings and determine which is closest
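
For the new-vector step, one common approach is to project the new ratings onto the top two right singular vectors and compare the result with the existing users in that same space. The sketch below assumes "closest" means the existing user with the highest cosine similarity; the example vector `new_ratings` (a hypothetical user who has rated only Matrix) and the names `V_k`, `new_coords`, `user_coords`, `sims`, and `closest` are made up for illustration.

```python
import numpy as np

users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
V_k = VT[:k, :].T                 # movies x k

# Hypothetical new user: rated Matrix 5, has not seen anything else.
new_ratings = np.array([5, 0, 0, 0, 0])

# Map ratings into the latent space; the same projection applied to M
# gives the coordinates of the existing users.
new_coords = new_ratings @ V_k
user_coords = M @ V_k


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sims = {name: cosine_similarity(new_coords, user_coords[i])
        for i, name in enumerate(users)}
closest = max(sims, key=sims.get)

print(sims)
print("Closest existing user:", closest)
```

Since the new user has only rated a sci-fi movie, the closest match would be expected to come from the sci-fi group (Alice, Bob, Cindy, Dan) rather than the romance group.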