Predicting Movie Genres Based on Plot Summaries


Supervised text classification is a mature tool that has achieved great success in a wide range of applications such as sentiment analysis and topic classification. When applied to movies, most previous work has focused on predicting movie reviews or revenue, and little research has been done on predicting movie genres. Movie genres are still tagged through a manual process in which users send their suggestions by email to the Internet Movie Database (IMDb). As a plot summary conveys much information about a movie, in this project I explore different machine learning methods to classify movie genres from synopses. I first experiment with Naive Bayes using bag-of-words features. Next, I use pretrained word2vec embeddings to turn plot summaries into vectors, which are then used as inputs to an XGBoost classifier. Finally, I train a Gated Recurrent Unit (GRU) neural network for the genre tagging task.
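The word2vec-based pipeline represents each plot summary as the average of its word embeddings. A minimal sketch of that step, using toy hand-made vectors in place of the real pretrained word2vec embeddings (the words, dimensions, and values here are illustrative only):

```python
import numpy as np

# Toy "pretrained" embeddings; in the project these would be loaded
# from the word2vec model instead.
embeddings = {
    "spy": np.array([0.9, 0.1, 0.0]),
    "chase": np.array([0.7, 0.3, 0.1]),
    "love": np.array([0.0, 0.8, 0.5]),
}

def doc_vector(tokens, embeddings, dim=3):
    """Average the embeddings of known tokens; zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

v = doc_vector(["a", "spy", "chase"], embeddings)
```

The resulting fixed-length vector `v` is what would be fed to the XGBoost classifier.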

The rest of the report is organized as follows. Section [sec:Related-work] discusses related work. Section [sec:Data] describes the dataset used for this project. Section [sec:Methodology] outlines the models used in the experiments. Section [sec:Experiment] presents the experimental results. Finally, Section [sec:Conclusion] summarizes the paper and discusses directions for future work.

This project explores several machine learning methods to predict movie genres from plot summaries. The task is challenging because of the ambiguity inherent in multi-label classification. For example, predicting adventure and thriller against true labels of adventure and action yields a Jaccard Index of only 33.3%.
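The Jaccard Index from the example above can be computed directly as the overlap between the predicted and true genre sets:

```python
def jaccard_index(predicted, actual):
    """Jaccard Index between two label sets: |intersection| / |union|."""
    p, a = set(predicted), set(actual)
    if not p and not a:
        return 1.0  # both empty: perfect agreement by convention
    return len(p & a) / len(p | a)

# One shared genre (adventure) out of three distinct genres -> 1/3
score = jaccard_index({"adventure", "thriller"}, {"adventure", "action"})
```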

Word2vec+XGBoost performs poorly, as averaging the embedding vectors of the words in a document proves a weak representation. A future direction is to apply doc2vec to learn richer document representations. Experiments with both Naive Bayes and GRU networks show that combining a probabilistic classifier with a probability-threshold regressor works better than the k-binary transformation and ranking methods for this multi-label classification problem. Using a GRU network as the probabilistic classifier in this approach, the model achieves strong performance: a Jaccard Index of 50.0%, an F-score of 0.56, and a hit rate of 80.5%.
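The thresholding step in this approach can be sketched as follows; the function name and the fallback to the single most probable genre are illustrative assumptions, not the report's exact implementation:

```python
def predict_genres(genre_probs, threshold):
    """Select every genre whose predicted probability reaches the
    threshold; fall back to the single most probable genre so that
    each movie receives at least one label."""
    chosen = [g for g, p in genre_probs.items() if p >= threshold]
    if not chosen:
        chosen = [max(genre_probs, key=genre_probs.get)]
    return sorted(chosen)

# Example: probabilities from some classifier for one movie
probs = {"drama": 0.62, "comedy": 0.35, "thriller": 0.48}
labels = predict_genres(probs, threshold=0.4)
```

In the combined approach, `threshold` would itself be predicted per movie by the regressor rather than fixed globally.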

There are several potential directions for improving the GRU network. Since 46% of the words in the vocabulary occur fewer than 20 times in the training data, most word embeddings receive only a few weight updates; using pretrained embeddings might work better for these words. In addition, about 75% of tokens are mapped to “UNK”, so the GRU network learns a single shared embedding for all of them. Adapting the UNK embedding based on context might make the network more powerful. Finally, the data is highly skewed, biasing the model towards popular genres such as drama and comedy. Addressing this imbalance would further improve performance.
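The rare-word handling described above can be sketched as a simple frequency cutoff over the training tokens (the 20-occurrence cutoff matches the figure quoted in the text; the function names are illustrative):

```python
from collections import Counter

def build_vocab(token_lists, min_count=20):
    """Keep only words seen at least min_count times in training;
    everything else will be mapped to UNK."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    """Replace out-of-vocabulary tokens with the shared UNK symbol."""
    return [t if t in vocab else "UNK" for t in tokens]

# Toy demonstration with a cutoff of 2 instead of 20
docs = [["a", "spy", "story"], ["a", "love", "story"]]
vocab = build_vocab(docs, min_count=2)
cleaned = apply_unk(["a", "spy", "tale"], vocab)
```

Because every rare word collapses to the same UNK embedding, the network cannot distinguish them, which is the limitation a context-adaptive UNK embedding would address.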