HW7: Movie Reviews (25 Points)
Overview / Logistics
The purpose of this assignment is to get you practice with Python dictionaries and machine learning for natural language processing. By the end of this assignment, you will have a system for automatically determining whether a movie review is positive or negative. Click here to download the starter code for this assignment.
What to submit: When you are finished, you should submit a file MovieReviews.py to Canvas, along with a note saying which two movie reviews you used in Part 3 and what scores you got for them.
The Problem
In class, we have seen that if we can "vectorize" our data by giving it coordinates, we can measure distances between data points in a meaningful way. We used this both to visualize data and to do supervised learning, or teaching the computer how to categorize data based on examples.
In this assignment, you will do supervised learning using a vectorized representation of text. Given a collection of 1000 positive and 1000 negative movie reviews from the early 2000s (citation), you will train a model to tell the difference between negative and positive reviews. You will examine this model to see what the telltale words are for positive and negative reviews, and you will then find a movie review of your own that you believe is positive or negative and score it with the model.
Background: Vectorizing Text with Binary Bag of Words (BBOW)
When we worked with images of digits, vectorizing was straightforward: every image had the same number of pixels, and we treated each pixel as a dimension. By contrast, it is not immediately obvious how to turn a text document into a vector in the same manner, since the documents in a collection don't all have the same number of words. In this assignment, we will explore a simple "binary bag of words" (BBOW) approach, in which we completely disregard the order of the words and simply keep track of which words occur in each document. For example, the phrase "first go left, then go right, then go left, then go right and right again" would simply be represented by the words ["go", "again", "left", "then", "right", "first", "and"], in no particular order. Even though this representation loses all information about the sequence, it works surprisingly well in many natural language processing tasks.
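For instance, here is one quick way to recover that word set in Python (a sketch; the punctuation handling is kept deliberately simple):

```python
phrase = "first go left, then go right, then go left, then go right and right again"
# Lowercase, split on whitespace, and strip trailing commas to get the set of words
words = {w.strip(",") for w in phrase.lower().split()}
print(words)  # {'go', 'again', 'left', 'then', 'right', 'first', 'and'}, in some order
```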
As usual, we will set up a matrix in which every row is a data point and every column is a dimension. In a BBOW representation, each row corresponds to a document, and each column corresponds to a word. We call the set of all words across all columns the vocabulary of our representation. For a particular document, we put a 1 in a column if the corresponding word occurs in that document, and a 0 otherwise. To demonstrate a data matrix in a BBOW representation, we show below a limerick by Kaitlyn Guenther in which every "document" is simply a single line of text. The data matrix then looks like this (where the columns of the vocabulary can be in an arbitrary order, but one which is consistent across documents):
| Document | there | once | was | a | wonderful | star | who | thought | she | would | go | very | far | until | fell | down | and | looked | like | clown | knew | never |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| There once was a wonderful star | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Who thought she would go very far | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Until she fell down | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| And looked like a clown | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| She knew she would never go far | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Code To Write
Part 1: BBOW Representation (15 Points)
Your task
In the first part of this assignment, you will create a bag of words representation from the set of all positive and negative reviews. You should loop through every word in every document (much like you looped through every word in every tweet in HW5) to determine the vocabulary. You should then build a sparse data matrix with a row for every document and a column for every word in the vocabulary. You should put a 1 in row i, column j if document i contains word j.
A Note on Sparse Matrices
If you've done this properly, you will end up with a vocabulary of over 50,000 words. However, each review only uses about 340 of these words on average, which means that each row is overwhelmingly zeros. It is wasteful to store all of these zeros in memory, and it also slows down computation. It's much better to use a data structure known as a sparse matrix, which has a mechanism to store only the values that are nonzero. Thankfully, the scipy library in Python makes this quite easy: you can initialize a sparse matrix as shown below, where N is the number of rows and d is the number of columns, and then index X as an ordinary 2D array.
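A minimal sketch, assuming scipy's lil_matrix (a sparse format that supports ordinary element assignment); the sizes here are just placeholders:

```python
from scipy import sparse

N, d = 2000, 50000           # placeholder sizes: number of documents, vocabulary size
X = sparse.lil_matrix((N, d))  # only nonzero entries are stored in memory
X[0, 5] = 1                  # X can be indexed like an ordinary 2D array
```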
Tips
- To figure out the words in the vocabulary, you can use a dictionary whose keys are lowercase words and whose values are simply True. You won't ever need to use the values; you just need a vocabulary that covers all words across all documents, so the keys of this dictionary are sufficient.
- After you create a list of the words in the vocabulary, you need to choose which column in your data matrix corresponds to which word. They can be in any order, but it's probably easiest to order the columns the same way the words are ordered in the list. To make it easy to determine each word's index in the list, you should create a dictionary whose key is the word and whose value is the index of that word in the list, as in the sketch after this list.
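Putting the two tips together, here is a minimal sketch of Part 1. The names documents, vocab, words, and word2index are placeholders for illustration, and documents is assumed to be a list of strings, one per review:

```python
from scipy import sparse

documents = ["This movie was great", "This movie was awful"]  # placeholder: the 2000 reviews

# Tip 1: collect the vocabulary as dictionary keys; the values are never used
vocab = {}
for doc in documents:
    for word in doc.lower().split():
        vocab[word] = True

# Tip 2: fix a column order and map each word to its column index
words = list(vocab.keys())
word2index = {w: j for j, w in enumerate(words)}

# Binary bag of words: one row per document, one column per vocabulary word
X = sparse.lil_matrix((len(documents), len(words)))
for i, doc in enumerate(documents):
    for word in doc.lower().split():
        X[i, word2index[word]] = 1
```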
Part 2: Classification Model And Telltale Words (5 Points)
Once you have the data matrix set up, you can build a model to predict whether a review is positive or negative. As it turns out, most of the words in the vocabulary don't tell us much about whether a review is positive or negative, so most of the dimensions in our vectorization don't matter. This means that an approach like k-nearest neighbors wouldn't perform well in this context, since the distances would be swamped by unimportant dimensions. Instead, we're going to do a regression to learn which dimensions are important.
Here, we need to predict a binary variable, which is 1 (positive) or 0 (negative), given the independent variables, which in our case are the words. A related type of regression called logistic regression is set up for exactly this binary scenario, and you will use it in this part of the assignment.
Your Task
Assuming you have set up a matrix X as your data matrix, and that the first 1000 rows of X correspond to positive reviews and the last 1000 rows of X correspond to negative reviews, the code below will accomplish this.
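A minimal sketch using scikit-learn's LogisticRegression and cross_val_score, matching the description that follows (the label vector y is an assumption based on the row ordering above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Labels: 1 for the first 1000 rows (positive), 0 for the last 1000 (negative)
y = np.zeros(2000)
y[0:1000] = 1

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(np.mean(scores))                     # fraction of correct predictions across folds

clf.fit(X, y)  # finally, train on all of the data so we can inspect the weights
```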
The code uses "5-fold cross-validation" to compute the model; that is, it splits the data up into 5 random subgroups, referred to as "folds." Each fold has 80% of the data, and it trains a model on that 80% and tests on the 20% that's been left out. The score at the end is a number between 0 and 1 which indicates the fraction of predictions that were correct over all folds (though we technically need a test set to accurately assess its performance, but that is beyond the scope of this assignment).
Once the model is trained, we can analyze it. If you have N words in your vocabulary (and hence N columns in your data matrix X), then clf.coef_.flatten() contains an array of N weights, one for each column. If the weight in a particular column is positive, the corresponding word contributes towards making a review positive. Conversely, if the weight in a particular column is negative, the corresponding word contributes towards making a review negative. To complete this task, print out the 15 words with the largest positive coefficients and the 15 words with the largest negative coefficients.
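One way to find these words is to sort the coefficients with numpy's argsort. This sketch assumes the words list from Part 1, ordered by column index:

```python
import numpy as np

w = clf.coef_.flatten()   # one weight per vocabulary word
order = np.argsort(w)     # column indices, from most negative to most positive weight

print("Most negative words:", [words[j] for j in order[:15]])
print("Most positive words:", [words[j] for j in order[-15:]])
```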
Part 3: Reviews of Your Choice (5 Points)
Now that you have a model, you should apply it to data beyond the training set to see how well it does. Go out on the internet and find one review that you think is very positive and one review that you think is very negative. Then load each review in and vectorize it according to your vocabulary. If a word in your review isn't in the vocabulary of the training set, simply ignore it.
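For example, here is a minimal sketch of the vectorization step, assuming the word2index dictionary from Part 1 and a plain-text file review.txt (a hypothetical filename):

```python
import numpy as np

# Load the review and build a binary vector over the training vocabulary
with open("review.txt") as f:
    review = f.read()

x = np.zeros(len(word2index))
for word in review.lower().split():
    if word in word2index:        # ignore words not seen in the training set
        x[word2index[word]] = 1
```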
Once you have a vectorization, you can see how positive or negative the model says it is. If your vector is x, you can simply write the line below.
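A dot product of x with the model's coefficients matches the description that follows (a sketch, reusing x and clf from above):

```python
score = x.dot(clf.coef_.flatten())  # sums the weights of the words that appear in the review
print(score)
```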
This will sum up every weight in the model associated with words that are used. If the review is positive, you should get a positive number back, and if the review is negative, you should get a negative number back.