Data Scientist, Square
Insight Fellow, 2013
UC Santa Cruz Christopher Moody, UC Santa Cruz PhD in Astrophysics and Insight Fellow, describes the data product he built in three weeks during the Insight Data Science Fellows Program.
Natural language isn't that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That's great for filtering out the cruft on the internet -- and there's a lot of that out there. What it doesn't do is understand the relationships between words and understand the similarities or dissimilarities.
That's where ThisPlusThat comes in -- a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through it's paces.
For example, word2vec allows you to evaluate a query like King - Man + Woman and get the result Queen. This means you can do some totally new searches.
Say I like the movie The Matrix, but it's been a long day, so I want something a bit dumbed down. I can enter the query: The Matrix - Thoughtful + Dumb and get Blade II.
If I like Mitt Romney but like my politicians to have a little less experience and a bit more celebrity, Mitt Romney - Experience + Celebrity will get me Sarah Palin.
You can even use this to detect the luxury brand in a line up -- great for fraud detection: Old Navy, Banana Republic, Macy's, Louis Vuitton. ThisPlusThat will pick out Louis Vuitton.
How It Works
word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all 'people,' whereas other dimensions could encode associations or dissociations with 'royalty' and 'gender'. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.
In the case of ThisPlusThat, I first tried applying it to movies using a library of movie reviews, then to startups using the data from CrunchBase, before finally settling on a generalized search. In the end, the algorithm was trained on the whole Wikipedia corpus, allowing a pretty good grasp of not only common words like 'smart' or 'American' but also loads of human concepts and real world objects, allowing us to manipulate proper nouns like "The Matrix" and "Mitt Romney".
Disambiguating words in the text was particularly difficult: is it python the snake, or Python the programming language? Wikipedia links might show the simple name "python" but then the url points to the canonical name, "python_programming_language". Fortunately, I could use Wikipedia links to canonicalize the names. Unfortunately, doing millions of simple search-and-replaces on the text turned out to be incredibly computationally intensive, even for just 42GB of text. The solution was to use Hadoop's Map functions to fan out the search-and-replace to many nodes, which helped to coax the data into the right shape quickly.
After churning through the text, word2vec spits out an enormous 8GB table of thousand-dimensional vectors. When you type in 'The Matrix - Thoughtful + Dumb' each word gets turned into a vector. The vector for 'Thoughtful' is subtracted from 'The Matrix', and 'Dumb' is added back. You then need to find the nearest word to that answer, which means sussing out the most similar vector in the whole table of vectors. I prototyped the search code using NumPy, but taking dot products on these huge matrices was computationally intense and delayed getting results to the user by 3-4 seconds. I tried speeding it up using Cython and Numba (before running into cryptic and undocumented bugs) and eventually settled on using the automatically parallelized Numexpr to compute vectors and distances. If you're interested, you can check out my code and notebook experiments.
When I stumbled across word2vec algorithm, I was immediately blown away. After building ThisPlusThat, and seeing some of the creative ways people are using the site, I'm excited to continue exploring the cutting edge of natural language algorithms.
Already a data scientist or engineer?
Find out more about our Advanced Workshops for Data Professionals. Register for two-day workshops in Apache Spark and Data Visualization, or sign up for workshop updates.