Description
This is a machine learning exploration for:
- Finding similar movies based on their Wikipedia abstracts
- Automatically determining common topics from those abstracts
In total, 107320 Wikipedia movies were analyzed, with a vocabulary of 195317 words.
Learning Procedure
The movie similarities and topics were learned in Python. The process included a few steps:
- Extract the abstracts for all Wikipedia movies
- Calculate TF-IDF for all movies; the total number of words in the vocabulary is 195317
- Apply K-Nearest Neighbors (KNN) with cosine distance to each movie to find its similar movies
- Extract topics using Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), with 100 topics
- Add a UI to visualize the results; all results are loaded on start so that later manipulations are fast (loading takes a few seconds)
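The TF-IDF and KNN steps above can be sketched in plain Python. This is only a minimal illustration, not the code used for this project: the four abstracts and their titles are made up, the function names (tokenize, tfidf_vectors, nearest_neighbors) are hypothetical, and the smoothed IDF formula is one common variant chosen as an assumption. Because the vectors are normalized to unit length, the cosine similarity is a plain dot product, and the cosine distance is 1 minus that value.

```python
import math
import re
from collections import Counter

# Hypothetical sample data; the real input was 107320 Wikipedia abstracts.
abstracts = {
    "Space Raiders": "A crew of astronauts fights aliens on a distant space station.",
    "Star Patrol": "Astronauts on a space station defend Earth from invading aliens.",
    "Love in Paris": "Two strangers fall in love during a rainy summer in Paris.",
    "Paris Summer": "A romantic comedy about love and a summer wedding in Paris.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine_similarity(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def nearest_neighbors(vectors, index, k=2):
    """Return the k most similar documents as (index, similarity) pairs."""
    scored = [
        (cosine_similarity(vectors[index], vec), j)
        for j, vec in enumerate(vectors)
        if j != index
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(j, sim) for sim, j in scored[:k]]


titles = list(abstracts)
vectors = tfidf_vectors(list(abstracts.values()))
for j, sim in nearest_neighbors(vectors, 0, k=2):
    print(f"{titles[0]} -> {titles[j]} (similarity {sim:.2f})")
```

This brute-force search compares one movie against every other movie, which is fine for a handful of abstracts. At the scale described here, a sparse matrix library would be the more practical choice.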
As with any machine learning, the results contain some errors. The learning results are kept as is, with no attempt to fix errors.
For more details on each step, please take a look here.
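To give a feeling for the topic extraction step, here is a small sketch of NMF in plain Python. It is an illustration under assumptions, not the project's code: the tiny document-term matrix, the term list, and the helper names (matmul, transpose, frobenius_error, nmf) are all made up. It uses the classic multiplicative update rules for the squared-error objective and extracts 2 topics instead of 100. LDA is not covered by this sketch.

```python
import random


def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    cols_b = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_b] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H, i.e. the reconstruction error."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (documents x terms) into non-negative W (documents x topics)
    and H (topics x terms) using multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Hypothetical term weights for four documents (rows) over six terms (columns).
terms = ["space", "alien", "ship", "love", "wedding", "paris"]
V = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1, 2],
]

W, H = nmf(V, n_topics=2)
for topic, row in enumerate(H):
    top = sorted(range(len(terms)), key=lambda j: -row[j])[:3]
    print(f"topic {topic}:", [terms[j] for j in top])
```

Each row of H weights the terms of one topic, so the highest-weighted terms of a row describe that topic. Each row of W tells how strongly a document belongs to each topic.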
How to use
After the page has loaded, search for a movie or a topic. After finding the movie of interest, click its table row to find similar movies. Each similar movie includes a similarity level indicating how similar it is to the selected one.
Clicking on the movie link will open the Wikipedia page for that movie.