Name: Patrick Mugisha

FINAL PROJECT

Business understanding

The goal of this project is to predict movie success (the column named “imdb_score”).

Project description & introduction (business problem and context): For this final project, I analyze the IMDB movie dataset using a variety of business intelligence techniques and machine learning models. The goal is to predict movie success, using "imdb_score" as the y variable and potentially 27 other variables as predictors, across a total of 5043 movies released between 1916 and 2016.

Data understanding

Data understanding & transformation: data quality issues and my solutions

Describe data

The dataset contains 5043 movies released between 1916 and 2016, with "imdb_score" as the y variable and 27 other variables available as potential predictors.

Identifying data quality issues

There are 1287 rows with missing values. Dropping all of them would be an aggressive approach, so instead I will fill some of the missing values and drop the remaining rows.
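
A minimal sketch of this check, assuming the raw dataset is loaded with pandas (the file name movie_metadata.csv and the variable name df are illustrative):

```python
import pandas as pd

df = pd.read_csv("movie_metadata.csv")   # illustrative file name

# Nulls per column, largest first
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts[null_counts > 0])

# Number of rows containing at least one null
print("Rows with missing values:", df.isnull().any(axis=1).sum())
```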

Identify data types

The dataset contains a mix of numerical and categorical columns: some columns hold numbers, others hold strings.

Numerical columns: num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_user_for_reviews, budget, title_year, actor_2_facebook_likes, imdb_score, aspect_ratio, movie_facebook_likes

Categorical columns: color, director_name, actor_2_name, genres, actor_1_name, movie_title, actor_3_name, plot_keywords, movie_imdb_link, language, country, content_rating

Identify value counts for a selected list of columns considered important for predicting a movie's success (imdb_score).
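
A small sketch of this step, continuing with the df loaded above (the column list shown is illustrative):

```python
# Frequency of the most common values in a few candidate predictor columns
for col in ["content_rating", "country", "language", "color", "genres"]:
    print(df[col].value_counts().head(10), "\n")
```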

Data preparation

Handling duplicate rows

After dropping duplicate rows, 4998 of the original 5043 rows remain, meaning there were 45 duplicate rows.
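
A sketch of the duplicate handling, using the same df:

```python
before = len(df)
df = df.drop_duplicates()   # remove exact duplicate rows
print(f"Dropped {before - len(df)} duplicates, {len(df)} rows remain")
```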

Dropping unnecessary columns

Handling null values

Filling some of the null values before dropping the rest
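
A sketch of the fill-then-drop strategy; which columns are filled, and with what statistic, is illustrative rather than the exact choices made in the notebook:

```python
# Fill a few columns where a sensible default exists
df["content_rating"] = df["content_rating"].fillna(df["content_rating"].mode()[0])
df["aspect_ratio"] = df["aspect_ratio"].fillna(df["aspect_ratio"].median())
df["duration"] = df["duration"].fillna(df["duration"].median())

# Drop the remaining rows that still have nulls (e.g. missing gross or budget)
df = df.dropna()
print(len(df), "rows remain after handling nulls")
```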

Business intelligence

Variables that I think are good predictors of a movie's success are: duration, director_name, director_facebook_likes, actor_1_name, actor_1_facebook_likes, actor_2_name, actor_2_facebook_likes, actor_3_name, actor_3_facebook_likes, num_user_for_reviews, num_critic_for_reviews, num_voted_users, cast_total_facebook_likes, movie_facebook_likes, genres, content_rating, gross

What are the top 5 genres in terms of gross?

Adventure is the highest-grossing genre, followed by Comedy.
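
One way to reproduce this ranking is to approximate each movie's genre by the first entry of the pipe-separated genres string (the report later calls this main_genre) and sum gross by genre; the notebook's exact approach may differ. A sketch:

```python
# Take the first genre listed as the movie's main genre, then rank total gross
df["main_genre"] = df["genres"].str.split("|").str[0]
top5 = (df.groupby("main_genre")["gross"]
          .sum()
          .sort_values(ascending=False)
          .head(5))
print(top5)
```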

Classify movies as good, average, or bad according to their imdb_score

The movies seem to be evenly distributed between bad, average, and good.
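
A sketch of the binning; quantile bins give roughly equal thirds like the percentages reported below, though the notebook's exact binning call may differ:

```python
# Split imdb_score into three quantile-based classes
df["imdb_class"] = pd.qcut(df["imdb_score"], q=3, labels=["bad", "average", "good"])
print(df["imdb_class"].value_counts(normalize=True))   # share of bad / average / good
```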

Results from business intelligence

- The director with the most movies is Steven Spielberg, with 26 movies.
- The director whose movies have the highest total gross is Steven Spielberg, with 4.114233e+09.
- The director with the most online likes is Steven Spielberg, with 364000 Facebook likes.
- The most movies a main actor has played in is 49, by Robert De Niro.
- The actor with the highest-grossing movies is Johnny Depp, with 3.688020e+09.
- The movie with the most critic reviews is Skyfall, with 1500.
- The movie with the most votes is The Shawshank Redemption, with 1689764.
- Adventure is the highest-grossing genre, with 7.855534e+10.
- Towering Inferno has the highest imdb_score, with 9.5.
- When using 3 bins for imdb_score, I found 35.0% bad movies, 33.5% average movies, and 33.5% good movies.
- Doona Bae starred in the biggest-budget movie (The Host), with 1.221550e+10.
- The most successful movie a director (James Cameron) and primary actor (CCH Pounder) made together, in terms of gross, is Avatar, with 760505847.0.
- R is the most common rating, with 2098 movies.

Data visualization

Correlation analysis

Variables highly correlated with imdb_score are: movie_facebook_likes, num_user_for_reviews, gross, num_voted_users, director_facebook_likes, duration, and num_critic_for_reviews.

Key variables that are correlated to each other

Below are the Pearson correlations among the highly correlated variables; a high absolute correlation coefficient indicates a strong relationship between two variables, and a low one indicates a weak relationship.
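
A sketch of the computation, using the key variables listed above:

```python
# Pearson correlation matrix of the target and the key numeric predictors
key_vars = ["imdb_score", "movie_facebook_likes", "num_user_for_reviews", "gross",
            "num_voted_users", "director_facebook_likes", "duration",
            "num_critic_for_reviews"]
print(df[key_vars].corr())
```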

Regression

Full model

Comments about Full Model#1: This full model has MSE=0.698 and r-squared=0.368
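
A minimal sketch of the full model, assuming dfRM is the fully numeric regression frame described later in the report (the 70/30 split and random_state are illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = dfRM.drop(columns=["imdb_score"])
y = dfRM["imdb_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full_model = LinearRegression().fit(X_train, y_train)
pred = full_model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("r-squared:", r2_score(y_test, pred))
```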

Statsmodels

Building the full model (with all X variables) using statsmodels and interpreting the p-values.
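
A sketch of the statsmodels version, reusing X and y from the sketch above; the summary table reports a p-value for each coefficient, and small p-values suggest the variable contributes to explaining imdb_score:

```python
import statsmodels.api as sm

X_sm = sm.add_constant(X)    # add the intercept term
ols = sm.OLS(y, X_sm).fit()
print(ols.summary())         # coefficients, p-values, r-squared
```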

Full Model#2 (fewer X variables)

Comment on model#2: this model has MSE=0.758 and r-squared=0.313; it is no better than the first model.

Lasso Regression (Regularization)

Comment on the Lasso model: this regularized model has MSE=0.745 and r-squared=0.3245, still not better than the full model.
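
A sketch of the Lasso fit on the same train/test split as the full model; the alpha value here is illustrative, not the one used in the notebook:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

lasso = Lasso(alpha=0.1).fit(X_train, y_train)
pred = lasso.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("r-squared:", r2_score(y_test, pred))
```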

Feature Selection model

The two variables that f_regression determined to be the most important are duration and num_voted_users.

Comments on Feature Selection model#1: this model has MSE=0.815 and r-squared=0.262, still not better than the full model.
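
A sketch of the univariate selection with f_regression and k=2, reusing the split from the full-model sketch:

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

selector = SelectKBest(score_func=f_regression, k=2).fit(X_train, y_train)
print("Selected:", list(X_train.columns[selector.get_support()]))

model_k = LinearRegression().fit(selector.transform(X_train), y_train)
pred = model_k.predict(selector.transform(X_test))
print("MSE:", mean_squared_error(y_test, pred))
print("r-squared:", r2_score(y_test, pred))
```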

Feature Selection model#2

Comment on Feature Selection model#2: Feature Selection model#1 is marginally better than model#2; Feature Selection model#2 has MSE=0.815 and r-squared=0.2615.

Recursive Feature Selection (RFE): Another Feature Selection Method

RFE model

Comments on the RFE model: this model has MSE=1.076 and r-squared=0.025, which is VERY bad.
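
A sketch of RFE wrapped around a linear regression; the number of features kept is illustrative:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_train, y_train)
print("Kept:", list(X_train.columns[rfe.support_]))

pred = rfe.predict(X_test)    # predicts using only the retained features
print("MSE:", mean_squared_error(y_test, pred))
print("r-squared:", r2_score(y_test, pred))
```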

RandomForestRegressor for Feature Selection

Comments about the regression models: of all the regression models I built, the RandomForestRegressor was the best, with MSE=0.475 and r-squared=0.569.
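
A sketch of the random forest regressor and its feature importances; n_estimators is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf_reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = rf_reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("r-squared:", r2_score(y_test, pred))

importances = pd.Series(rf_reg.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))   # top 5 important variables
```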

**REGRESSION**

- We start by creating a new dataframe (a copy of dfr).
- We remove unnecessary columns.
- We split genres and create a new column (main_genre) so that each movie has one genre type.
- We use a label encoder to assign every genre type a number.
- We use a label encoder on the rest of the categorical variables to make them numerical.
- We make a new dataframe for the regression models (dfRM) and remove genres since we don't need it anymore.

FULL MODEL RESULTS
- Used the dfRM dataframe.
- Full model 1 evaluation: **MSE=0.6975**, **variance or r-squared=0.3676**.
- Statsmodels: **r-squared=0.383**.
- Full model with fewer X variables (13): **MSE=0.758**, **variance or r-squared=0.313**. It is no better than the first model; the fewer X variables used, the lower the r-squared becomes.

LASSO REGRESSION MODEL
- Used the dfRM dataframe.
- **MSE=0.745**, **variance or r-squared=0.325**. Still not better than the full model.

FEATURE SELECTION MODEL
- k=2; the two variables f_regression determined to be the most important are duration and num_voted_users.
- **MSE=0.8147** and **variance or r-squared=0.262**. Still not better than the full model.
- Feature selection model#2 with k=3: **MSE=0.8149** and **variance or r-squared=0.262**. Still not better than the full model.

RFE MODEL
- **MSE=1.075** and **variance or r-squared=0.0252**. A VERY bad model.

RandomForestRegressor for Feature Selection
- **MSE=0.475** and **variance or r-squared=0.569**. The best regression model so far.
- num_voted_users, duration, budget, num_user_for_reviews, and gross are the top 5 important variables.

Classification

The goal is to build a classification model to predict if a movie is good or bad.

Decision tree

Confusion Matrix explanation

141 movies were misclassified as bad movies

155 movies were misclassified as good movies

The decision tree model is 74% accurate

Therefore, we expect the model to be about 74% accurate when applied in a real-world situation.

True Positive Rate (Sensitivity) = 655/810 = 0.80

False Positive Rate = 141/330 = 0.427

True Negative Rate (Specificity) = 189/330 = 0.572

False Negative Rate = 155/810 = 0.191
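
A sketch of how the tree, its accuracy, and the rates above can be produced, assuming dfCM is the classification frame described later in the report (imdb_quality = 0 for bad, 1 for good); the split parameters are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

Xc = dfCM.drop(columns=["imdb_quality"])
yc = dfCM["imdb_quality"]
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.3, random_state=0)

tree_clf = DecisionTreeClassifier(random_state=0).fit(Xc_train, yc_train)
tree_pred = tree_clf.predict(Xc_test)
print("Accuracy:", accuracy_score(yc_test, tree_pred))

# Derive the four rates from the confusion matrix counts
tn, fp, fn, tp = confusion_matrix(yc_test, tree_pred).ravel()
print("Sensitivity (TPR):", tp / (tp + fn))
print("False Positive Rate:", fp / (fp + tn))
print("Specificity (TNR):", tn / (tn + fp))
print("False Negative Rate:", fn / (fn + tp))
```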

Visualizing decision tree

Simpler decision tree with less variables

Decision Tree interpretation

All 202 of those movies with num_voted_users <= 86645.5, duration <= 108.5, and budget not <= 15550000 (meaning greater than 15550000) are predicted as bad.
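
To read rules like this one directly from a fitted tree, scikit-learn's export_text prints the split paths; a sketch using tree_clf from the sketch above:

```python
from sklearn.tree import export_text

# Print the learned split rules down to a readable depth
print(export_text(tree_clf, feature_names=list(Xc_train.columns), max_depth=3))
```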

10-fold cross validation: the basic idea is that, rather than testing model quality only once, cross validation (here 10-fold CV) tests the model 10 times, each time with a different testing dataset.
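
A sketch of the 10-fold cross validation of the decision tree on the full classification data:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(random_state=0), Xc, yc, cv=10)
print("Mean accuracy:", scores.mean(), "+/-", scores.std())
```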

Random Forest (Ensemble model)

Building multiple decision trees (an ensemble of decision trees) with the purpose of improving model accuracy: RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0), where n_estimators is the number of decision trees in the forest. Combining different opinions is likely to lead to higher accuracy.

The RandomForestClassifier has an accuracy of 80.7%

num_voted_users and num_critic_for_reviews appear to be the two most important predictors.
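
A sketch of the ensemble fit and its importances; note that recent scikit-learn versions require min_samples_split to be at least 2, so that value differs from the older signature quoted above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest = RandomForestClassifier(n_estimators=10, max_depth=None,
                                min_samples_split=2, random_state=0)
forest.fit(Xc_train, yc_train)
print("Accuracy:", accuracy_score(yc_test, forest.predict(Xc_test)))

importances = pd.Series(forest.feature_importances_, index=Xc_train.columns)
print(importances.sort_values(ascending=False).head(5))   # most important predictors
```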

K-Nearest Neighbor (KNN)

When using KNN, I get only 65.26% accuracy.
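
A sketch of the KNN classifier on the same split; k=5 is scikit-learn's default and is illustrative only (Appendix 2 searches for the optimal k with GridSearch):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5).fit(Xc_train, yc_train)
print("Accuracy:", knn.score(Xc_test, yc_test))
```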

Appendix 1: 10 fold cross validation

Appendix 2: Search for the optimal k value (GridSearch)

Appendix 3: Model Evaluation with ROC

Logistic Regression

This Logistic Regression model has 72.1% accuracy
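
A sketch of the logistic model on the same split; max_iter is raised here because the default solver may not converge on unscaled features such as budget and gross:

```python
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=5000).fit(Xc_train, yc_train)
print("Accuracy:", logit.score(Xc_test, yc_test))
```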

Interpretation:

Comparing Algorithms

Interpretation and conclusion on Classification

Results from regression, **classification**, and clustering

**CLASSIFICATION**
- Used the dfCM dataset (a copy of dfRM).
- Created a new categorical column from imdb_score in order to build classification models: I converted imdb_score into 2 categories (or classes), 1~5 and 6~10, representing bad (0) and good (1) respectively.
- Variables correlated with imdb_quality (the new imdb_score) are: num_voted_users, duration, num_critic_for_reviews, num_user_for_reviews, and movie_facebook_likes.

DECISION TREE
- Accuracy = 0.74
- True Positive Rate (Sensitivity) = 655/810 = 0.80
- False Positive Rate = 141/330 = 0.427
- True Negative Rate (Specificity) = 189/330 = 0.572
- False Negative Rate = 155/810 = 0.191

Random Forest (Ensemble model)
- Accuracy = 0.807
- Top 5 important variables: num_voted_users, duration, num_critic_for_reviews, gross, and budget.

K-Nearest Neighbor (KNN)
- Accuracy = 0.653
- Using 10-fold cross validation: accuracy = 0.686

Logistic model
- Accuracy = 0.721

Comparing Algorithms
- The best model to use is the Random Forest; it has the largest area under the ROC curve (0.73).
- The 5 most important variables are num_voted_users, duration, num_critic_for_reviews, gross, and budget.

Clustering

Objective: analyze the data using the K-means algorithm, determine the optimal K value for K-means, and report the movie "profiles" based on the clustering analysis.
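
A sketch of the K-means workflow: the inertia over a range of k values suggests the elbow, then the final model is fit with k=2 and the per-cluster means give the profiles. Scaling and the use of the numeric dfRM frame are assumptions, not necessarily the notebook's exact setup:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(dfRM)   # standardize so large counts don't dominate

# Inertia for k = 1..10; the "elbow" in this curve suggests the optimal k
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
           for k in range(1, 11)]
print(inertia)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
clustered = dfRM.assign(cluster=km.labels_)
print(clustered["cluster"].value_counts())        # size of cluster 0 and cluster 1
print(clustered.groupby("cluster").mean())        # profile of each cluster
```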

Clustering analysis (k = 2)

Cluster 0 has 1871 observations.

Cluster 1 has 1929 observations.

Profile of each cluster

**CLUSTERING**

K-MEANS
- K=2
- Cluster 0 has 1871 observations.
- Cluster 1 has 1929 observations.

Profile of each cluster
- Movies in cluster 0, on average, are characterized by lower values of director_name, director_facebook_likes, actor_2_name, and actor_1_name (the name columns are label-encoded), and higher actor_1_facebook_likes, cast_total_facebook_likes, and budget.
- Movies in cluster 1, on average, are characterized by higher values of director_name, director_facebook_likes, actor_2_name, and actor_1_name, and lower cast_total_facebook_likes.

Storytelling

The objective is to develop useful insights from the business intelligence analysis (data visualization, correlation, pivot tables) and the models (regression, classification, and clustering).

- The director with the most movies is Steven Spielberg, with 26 movies.
- The director whose movies have the highest total gross is Steven Spielberg, with 4.114233e+09.
- The director with the most Facebook likes is Steven Spielberg, with 364000.
- The most movies a main actor has played in is 49, by Robert De Niro.
- The actor with the highest-grossing movies is Johnny Depp, with 3.688020e+09.
- The movie with the most critic reviews is Skyfall, with 1500.
- The movie with the most votes is The Shawshank Redemption, with 1689764.
- Adventure is the highest-grossing genre, with 7.855534e+10.
- Towering Inferno has the highest imdb_score, with 9.5.
- Doona Bae starred in the biggest-budget movie (The Host), with 1.221550e+10.
- The most successful movie a director (James Cameron) and primary actor (CCH Pounder) made together, in terms of gross, is Avatar, with 760505847.0.
- R is the most common rating, with 2098 movies.

A box office hit movie in terms of gross would be made by:
- director: Steven Spielberg
- actor_1_name: Johnny Depp
- genre: Adventure
- content_rating: R

**My best regression model**
- RandomForestRegressor for Feature Selection: MSE=0.475 and (variance or r-squared)=0.569

**My best classification model**
- Random Forest (Ensemble model): accuracy = 0.798

**Most important variables**
- num_voted_users, duration, budget, num_user_for_reviews, num_critic_for_reviews, and gross