Project: Investigate The Movie Database (TMDb)

Introduction

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Having looked at the data, we have information about a movie's cast, crew, release date, ratings, budget and revenue. From my study and investigation of this dataset, I want to figure out:

  1. How does the number of movies released per year change?
  2. How does the runtime of movies change over the years?
  3. What factors can make a movie successful?

What is the operational definition of success?

Success in general terms is when you achieve what you wanted to achieve. For a movie star, success would be a popular movie or a highly voted movie. For a production house, success would be a larger revenue. Assuming that this dataset has missing values, I am not expecting a perfect success recipe. A key thing to note is that inflation plays a major role in budget as well as revenue. Lucky for us, the final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

For this study, we will evaluate success in two ways:

  • Popularity (more popular, more successful)
  • Percentage ROI (Return on Investment) (higher ROI, more successful)

We will look into both aspects of being successful and evaluate the factors that can make a movie successful.
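
As a minimal sketch of the second metric, percentage ROI can be read as (revenue - budget) / budget × 100. The helper name `percent_roi` is hypothetical and only used for illustration; the figures are the inflation-adjusted budget and revenue that the dataset lists for Jurassic World.

```python
def percent_roi(revenue, budget):
    """Return on investment as a percentage of the budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (revenue - budget) / budget * 100

# Inflation-adjusted figures for Jurassic World, as listed in the dataset
roi = percent_roi(revenue=1.392446e9, budget=1.379999e8)
print(f"{roi:.0f}%")
```

With these inputs the ROI comes out at roughly 909%, meaning the movie earned back a little over ten times its budget.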

Potential factors that can affect the success:

  • Budget
    • Do high-budget movies do well, or do low-budget movies earn a higher ROI?
  • Votes
    • Are highly voted movies making more profit?
  • Cast and Director
    • Is there a group of directors who always make a good profit or get high ratings? Or are there certain actors who always produce a hit movie?
  • Release year
    • Are movies getting more successful in the later years?
  • Genres
    • How do movies perform across genres? Are there particular genres that people like the most? There may be combinations of director and genre, actor and genre, or actor, director and genre that work very well. However, that is outside the scope of this study.
  • Movie runtime
    • Are short films making more money? Or are longer movies getting better ratings?

For this study, we will examine budget, votes, release year and runtime as factors. When measuring success in terms of popularity, we will consider budget, runtime, release year and user votes. When measuring success in terms of ROI, we will consider release year, popularity, runtime, budget and user votes.

In [1]:
# First things first, import all the packages that I will be using

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
%matplotlib inline

Data Wrangling

Let's load the data and look at some general properties and descriptive statistics. Before anything else, I will print a few rows to get a feel for the data.

General Properties

In [2]:
# Load your data

df = pd.read_csv('./data/tmdb-movies.csv')
df.head(3)
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08

3 rows × 21 columns

So there are 21 columns in this dataset. From these three rows, I can see that cast is a bit tricky: it is a single string with all names separated by |. Although each of these rows shows a single director, I need to check whether the same scheme of joining names is used for directors as well. genres and production_companies seem to concatenate multiple strings the same way. Release date looks consistent in the mm/dd/yy format. However, we don't have the profit column that I wanted to look into, so I need to fix that as well. Before that, let's look at some basic info from the above data frame.
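
To illustrate how such pipe-separated columns could be handled, here is a small sketch on a toy frame (the two genre strings are copied from the rows above; the frame name `sample` is made up). It assumes a pandas version that provides `explode`, and it is not part of the cleaning done in this study.

```python
import pandas as pd

# Toy frame mimicking the pipe-separated 'genres' column
sample = pd.DataFrame({
    'original_title': ['Jurassic World', 'Insurgent'],
    'genres': ['Action|Adventure|Science Fiction|Thriller',
               'Adventure|Science Fiction|Thriller'],
})

# Split each string on '|' and give every genre its own row
exploded = sample.assign(genres=sample['genres'].str.split('|')).explode('genres')

# Count how many movies fall under each genre
genre_counts = exploded['genres'].value_counts()
print(genre_counts)
```

The same pattern should apply to cast and production_companies, since they appear to use the same separator.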

In [3]:
# see the column info and null values in the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Looking at the above info, I can see a few anomalies already. The id column has 10866 entries, so I am assuming there are around 10866 movies in the dataset. But the tagline, keywords and production_companies columns seem to be missing quite a few values, somewhere in the range of 1000-3000 rows each, or roughly 10-30% of the data. The homepage column has far fewer entries still. This is understandable, as a lot of movies might not have a website. It also doesn't seem to be a very important factor in the investigation. Since I am not sure whether the movies don't have a homepage or the dataset is simply missing them, I can't reliably check the effect of having a website on movie popularity or revenue. For this study, I will just drop the homepage column.
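
The missing shares quoted above can be checked with a quick calculation. This sketch only reuses the non-null counts printed by df.info(); the dictionary names are made up for illustration.

```python
# Non-null counts as reported by df.info() above
total_rows = 10866
non_null = {
    'homepage': 2936,
    'tagline': 8042,
    'keywords': 9373,
    'production_companies': 9836,
}

# Share of missing entries per column, in percent
missing_pct = {col: (total_rows - n) / total_rows * 100 for col, n in non_null.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct:.1f}% missing")
```

On the loaded frame, `df.isnull().mean() * 100` should give the same shares directly.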

Identification of missing and erroneous data and its cleaning

In [4]:
# Printing out some descriptive statistics for the data
df.describe()
Out[4]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

As we can see in the summary above, we have lots of 0 values in budget and revenue, and some in runtime. None of these should have a 0 value. Let's print a few of these rows to see if there is a pattern associated with the 0 values.

In [5]:
# get movies with 0 revenue 
movies_with_zero_revenue = df.query('revenue == 0')

# print out a few of these
movies_with_zero_revenue.head(5)
Out[5]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
48 265208 tt2231253 2.932340 30000000 0 Wild Card Jason Statham|Michael Angarano|Milo Ventimigli... NaN Simon West Never bet against a man with a killer hand. ... When a Las Vegas bodyguard with lethal skills ... 92 Thriller|Crime|Drama Current Entertainment|Lionsgate|Sierra / Affin... 1/14/15 481 5.3 2015 2.759999e+07 0.0
67 334074 tt3247714 2.331636 20000000 0 Survivor Pierce Brosnan|Milla Jovovich|Dylan McDermott|... http://survivormovie.com/ James McTeigue His Next Target is Now Hunting Him ... A Foreign Service Officer in London tries to p... 96 Crime|Thriller|Action Nu Image Films|Winkler Films|Millennium Films|... 5/21/15 280 5.4 2015 1.839999e+07 0.0
74 347096 tt3478232 2.165433 0 0 Mythica: The Darkspore Melanie Stone|Kevin Sorbo|Adam Johnson|Jake St... http://www.mythicamovie.com/#!blank/wufvh Anne K. Black NaN ... When Teela’s sister is murdered and a powerf... 108 Action|Adventure|Fantasy Arrowstorm Entertainment 6/24/15 27 5.1 2015 0.000000e+00 0.0
75 308369 tt2582496 2.141506 0 0 Me and Earl and the Dying Girl Thomas Mann|RJ Cyler|Olivia Cooke|Connie Britt... http://www.foxsearchlight.com/meandearlandthed... Alfonso Gomez-Rejon A Little Friendship Never Killed Anyone. ... Greg is coasting through senior year of high s... 105 Comedy|Drama Indian Paintbrush 6/12/15 569 7.7 2015 0.000000e+00 0.0
92 370687 tt3608646 1.876037 0 0 Mythica: The Necromancer Melanie Stone|Adam Johnson|Kevin Sorbo|Nicola ... http://www.mythicamovie.com/#!blank/y9ake A. Todd Smith NaN ... Mallister takes Thane prisoner and forces Mare... 0 Fantasy|Action|Adventure Arrowstorm Entertainment|Camera 40 Productions... 12/19/15 11 5.4 2015 0.000000e+00 0.0

5 rows × 21 columns

The movie with id 308369 seems to be rated well, so a 0 value in budget and revenue doesn't make sense. The other movies were released too, so they must have had some budget and some revenue. This looks like a case of missing data to me. Let's look at the budget as well.

In [6]:
# get movies with 0 budget 
movies_with_zero_budget = df.query('budget == 0')

# print out a few of these
movies_with_zero_budget.head(5)
Out[6]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
30 280996 tt3168230 3.927333 0 29355203 Mr. Holmes Ian McKellen|Milo Parker|Laura Linney|Hattie M... http://www.mrholmesfilm.com/ Bill Condon The man behind the myth ... The story is set in 1947, following a long-ret... 103 Mystery|Drama BBC Films|See-Saw Films|FilmNation Entertainme... 6/19/15 425 6.4 2015 0.0 2.700677e+07
36 339527 tt1291570 3.358321 0 22354572 Solace Abbie Cornish|Jeffrey Dean Morgan|Colin Farrel... NaN Afonso Poyart A serial killer who can see your future, a psy... ... A psychic doctor, John Clancy, works with an F... 101 Crime|Drama|Mystery Eden Rock Media|FilmNation Entertainment|Flynn... 9/3/15 474 6.2 2015 0.0 2.056620e+07
72 284289 tt2911668 2.272044 0 45895 Beyond the Reach Michael Douglas|Jeremy Irvine|Hanna Mangan Law... NaN Jean-Baptiste Léonetti NaN ... A high-rolling corporate shark and his impover... 95 Thriller Furthur Films 4/17/15 81 5.5 2015 0.0 4.222338e+04
74 347096 tt3478232 2.165433 0 0 Mythica: The Darkspore Melanie Stone|Kevin Sorbo|Adam Johnson|Jake St... http://www.mythicamovie.com/#!blank/wufvh Anne K. Black NaN ... When Teela’s sister is murdered and a powerf... 108 Action|Adventure|Fantasy Arrowstorm Entertainment 6/24/15 27 5.1 2015 0.0 0.000000e+00
75 308369 tt2582496 2.141506 0 0 Me and Earl and the Dying Girl Thomas Mann|RJ Cyler|Olivia Cooke|Connie Britt... http://www.foxsearchlight.com/meandearlandthed... Alfonso Gomez-Rejon A Little Friendship Never Killed Anyone. ... Greg is coasting through senior year of high s... 105 Comedy|Drama Indian Paintbrush 6/12/15 569 7.7 2015 0.0 0.000000e+00

5 rows × 21 columns

Again I can see that all of these have production companies attached, so it can't be a case of an actual 0 budget. This also looks like missing data to me. However, if 0 stands in for a missing budget, and there are lots of missing values, this will give us some false statistics. Since we want to assess the success of a movie based on revenue and budget, it will be a hard call to drop the rows with missing data. In case we have to drop the rows, let's see how many of them will be gone.

In [7]:
# Movie count without revenue, using id to count as some imdb_id values seem to be missing.
movies_with_zero_revenue.groupby('revenue').count()['id']
Out[7]:
revenue
0    6016
Name: id, dtype: int64
In [8]:
# Movie count without budget, using id to count as some imdb_id values seem to be missing.
movies_with_zero_budget.groupby('budget').count()['id']
Out[8]:
budget
0    5696
Name: id, dtype: int64
In [9]:
# Movies with both revenue and budget missing
movies_with_zero_revenue_and_budget = df.query('budget == 0 and revenue == 0')
movies_with_zero_revenue_and_budget.groupby('budget').count()['id']
Out[9]:
budget
0    4701
Name: id, dtype: int64

So we have a total of 6016 movies without revenue data, 5696 movies without budget data and 4701 movies missing both. Since I have two operational definitions for the success of movies, one based on ratings and the other on revenue, I will keep this data for now, but change all the 0 values to null. Since I am also considering the effect of runtime, let's check for missing data in that as well.
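
From these three counts we can also work out how many movies have both values present, using inclusion-exclusion. This is a small arithmetic sketch on the numbers printed above (before any rows are dropped); the variable names are made up.

```python
# Counts reported in the cells above
total_movies = 10866
zero_revenue = 6016
zero_budget = 5696
zero_both = 4701

# Inclusion-exclusion: movies missing at least one of the two values
zero_either = zero_revenue + zero_budget - zero_both

# Movies with both a budget and a revenue figure
complete = total_movies - zero_either

print(zero_either, complete)
```

So 7011 movies lack a budget, a revenue or both, which leaves 3855 movies for any analysis that needs both figures.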

In [10]:
# get movies with 0 runtime
movies_with_zero_runtime = df.query('runtime == 0')

# print out a few of these
movies_with_zero_runtime.head(5)
Out[10]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
92 370687 tt3608646 1.876037 0 0 Mythica: The Necromancer Melanie Stone|Adam Johnson|Kevin Sorbo|Nicola ... http://www.mythicamovie.com/#!blank/y9ake A. Todd Smith NaN ... Mallister takes Thane prisoner and forces Mare... 0 Fantasy|Action|Adventure Arrowstorm Entertainment|Camera 40 Productions... 12/19/15 11 5.4 2015 0.0 0.0
334 361931 tt5065822 0.357654 0 0 Ronaldo Cristiano Ronaldo http://www.ronaldothefilm.com Anthony Wonke Astonishing. Intimate. Definitive. ... Filmed over 14 months with unprecedented acces... 0 Documentary On The Corner Films|We Came, We Saw, We Conque... 11/9/15 80 6.5 2015 0.0 0.0
410 339342 tt2948712 0.097514 0 0 Anarchy Parlor Robert LaSardo|Jordan James Smith|Sara Fabel|T... NaN Kenny Gage|Devon Downs NaN ... Six young college hopefuls vacationing and par... 0 Horror NaN 1/1/15 15 5.6 2015 0.0 0.0
445 353345 tt3800796 0.218528 0 0 The Exorcism of Molly Hartley Sarah Lind|Devon Sawa|Gina Holden|Peter MacNei... NaN Steven R. Monroe NaN ... Taking place years after The Haunting of Molly... 0 Horror WT Canada Productions 10/9/15 52 5.0 2015 0.0 0.0
486 333653 tt4058368 0.176744 0 0 If There Be Thorns Heather Graham|Jason Lewis|Rachael Carpani|Mas... NaN Nancy Savoca NaN ... The third installment in V.C. Andrews’ bests... 0 TV Movie|Drama A+E Studios|Jane Startz Productions 4/5/15 11 5.4 2015 0.0 0.0

5 rows × 21 columns

In [11]:
# count of movies with 0 runtime
movies_with_zero_runtime.groupby('runtime').count()['id']
Out[11]:
runtime
0    31
Name: id, dtype: int64

A quick search on the internet shows that these movies had a runtime, and it definitely wasn't 0. Since there are just 31 such entries, we can remove them all without affecting our data much.

I have also not defined the ratings aspect as of now, so let's address that. This dataset provides three columns related to ratings:

  1. popularity
  2. vote_count
  3. vote_average

As per TMDb, here are the factors considered in popularity:

  • Number of votes for the day
  • Number of views for the day
  • Number of users who marked it as a "favourite" for the day
  • Number of users who added it to their "watchlist" for the day
  • Release date
  • Number of total votes
  • Previous day's score

This seems like a good measure of popularity for a particular movie. This dataset also contains the user vote_count and vote_average. With a large enough vote_count, the vote_average will also be a good measure of public opinion about the movie. In this dataset, however, vote counts range from 10 to 9767. One consideration is that a less popular movie will have a lower number of votes. However, we can't definitely say that. I am also going to keep vote_average as one of my variables and see how it affects the revenue.

So what all are we cleaning?

  • We don't need the imdb_id, homepage, tagline, overview, cast, keywords, director and production_companies columns
  • We will drop rows with zero runtime.
  • We will drop null in genres.
  • We will replace 0 in budget, revenue, budget_adj and revenue_adj with null.
  • We also don't want any duplicates, so let's drop duplicates as well
In [12]:
# Drop unnecessary columns
columns = ['imdb_id', 'homepage', 'tagline', 'overview', 'cast', 'director', 'keywords', 'production_companies']
df.drop(columns, axis=1, inplace=True)
In [13]:
# Our revised dataframe
df.head(1)
Out[13]:
id popularity budget revenue original_title runtime genres release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.985763 150000000 1513528810 Jurassic World 124 Action|Adventure|Science Fiction|Thriller 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
In [14]:
# Inplace remove the rows with 0 runtime
df.query('runtime != 0', inplace=True)
In [15]:
# drop null genres
df.dropna(subset = ['genres'], how='any', inplace=True)
In [16]:
# replace 0 in budget, revenue, budget_adj and revenue_adj

df['budget'] = df['budget'].replace(0, np.nan)
df['budget_adj'] = df['budget_adj'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
df['revenue_adj'] = df['revenue_adj'].replace(0, np.nan)
In [17]:
# remove duplicates
df.drop_duplicates(inplace=True)
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10812 entries, 0 to 10865
Data columns (total 13 columns):
id                10812 non-null int64
popularity        10812 non-null float64
budget            5165 non-null float64
revenue           4849 non-null float64
original_title    10812 non-null object
runtime           10812 non-null int64
genres            10812 non-null object
release_date      10812 non-null object
vote_count        10812 non-null int64
vote_average      10812 non-null float64
release_year      10812 non-null int64
budget_adj        5165 non-null float64
revenue_adj       4849 non-null float64
dtypes: float64(6), int64(4), object(3)
memory usage: 1.2+ MB

Summary of cleaning

From the info printed above, we can see that replacing 0 budgets and revenues with null leaves far fewer usable values: only 5165 of 10812 rows have a budget and 4849 have a revenue. This is a tradeoff we will make for our analysis. Our dataframe is now clean, with rows lacking a runtime or genre removed, unneeded columns dropped and 0's converted to null. Let's again look at some descriptive statistics on the dataframe.

In [19]:
# cleaned up data with null values for missing budget and revenue
df.describe()
Out[19]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10812.000000 10812.000000 5.165000e+03 4.849000e+03 10812.000000 10812.000000 10812.000000 10812.000000 5.165000e+03 4.849000e+03
mean 65558.945523 0.648730 3.076120e+07 8.923886e+07 102.421846 218.369404 5.975379 2001.288938 3.691521e+07 1.151009e+08
std 91662.645876 1.001976 3.891166e+07 1.620801e+08 30.871363 576.886018 0.934122 12.819746 4.196662e+07 1.988557e+08
min 5.000000 0.000065 1.000000e+00 2.000000e+00 2.000000 10.000000 1.500000 1960.000000 9.210911e-01 2.370705e+00
25% 10576.750000 0.209045 6.000000e+06 7.732325e+06 90.000000 17.000000 5.400000 1995.000000 8.108664e+06 1.046585e+07
50% 20500.500000 0.385298 1.700000e+07 3.185308e+07 99.000000 38.000000 6.000000 2006.000000 2.274082e+07 4.395666e+07
75% 74725.250000 0.716608 4.000000e+07 9.996575e+07 112.000000 146.000000 6.600000 2011.000000 5.008384e+07 1.316482e+08
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
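
The mean budget in this summary (about 3.08e+07) is roughly double the one in the first summary (about 1.46e+07). That is the expected effect of turning placeholder zeros into null: pandas skips null values when aggregating. A toy illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical budgets where 0 stands in for "unknown"
budgets = pd.Series([0, 0, 10_000_000, 30_000_000])

# Zeros are counted as real values and pull the mean down
mean_with_zeros = budgets.mean()

# NaN entries are skipped by pandas aggregations
mean_with_nan = budgets.replace(0, np.nan).mean()

print(mean_with_zeros, mean_with_nan)
```

Here the mean doubles from 10 million to 20 million once the two placeholder zeros are ignored.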

Exploratory Data Analysis

Now that we have our data cleaned up, let's start with our analysis and try to answer the questions posed above.

Movies Released per year

In [20]:
# count movies released per year
release_per_year = df.groupby('release_year').count()['id']
release_per_year.head()
Out[20]:
release_year
1960    32
1961    31
1962    32
1963    34
1964    42
Name: id, dtype: int64

We are working on the cleaned up dataset and we have our information for movies released per year. Let's plot this data to see the trend, which seems to be upward.

In [21]:
# plot movies released by year
plt.figure(figsize=(10, 6))
plt.plot(release_per_year)
# title and labels
plt.title('Movies released by Years')
plt.xlabel('Year')
plt.ylabel('Number of Movies');

The number of movies released is clearly going up over the years, except for a couple of small dips. The curve also seems much steeper after the year 2000.

Average runtime of movies by year

In [22]:
# compute minimum, maximum and mean runtime per year
# (select the column before aggregating so text columns are not aggregated)
min_runtime_per_year = df.groupby('release_year')['runtime'].min()
max_runtime_per_year = df.groupby('release_year')['runtime'].max()
mean_runtime_per_year = df.groupby('release_year')['runtime'].mean()

Let's plot the data for minimum, maximum and mean runtime in different years to see if anything noticeable happened.

In [23]:
# build the index location for x-axis
min_index = min_runtime_per_year.index
max_index = max_runtime_per_year.index
mean_index = mean_runtime_per_year.index

# set axes for the plot
x1, y1 = min_index, min_runtime_per_year
x2, y2 = max_index, max_runtime_per_year
x3, y3 = mean_index, mean_runtime_per_year

# create the plot
plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Minimum')
plt.plot(x2, y2, label = 'Maximum')
plt.plot(x3, y3, label = 'Mean')

# title and labels
plt.title('Runtime by Years')
plt.xlabel('Year')
plt.ylabel('Duration (minutes)');
plt.legend(loc='upper left')
Out[23]:
<matplotlib.legend.Legend at 0x111c10110>

We can see that the average duration of movies more or less remains the same. However, there seems to be a constant trend of very short films after the year 2000. The maximum length of movies also seems to increase overall, and we can see a few very lengthy movies in later years.
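
As a side note, the three per-year statistics can also be computed in one pass with `agg`. This is an alternative sketch on toy data (the frame `sample` is made up), not the code used for the plot above.

```python
import pandas as pd

# Toy data standing in for the cleaned dataframe
sample = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2001, 2001],
    'runtime': [90, 120, 80, 100, 150],
})

# Minimum, maximum and mean runtime per year in a single pass
runtime_stats = sample.groupby('release_year')['runtime'].agg(['min', 'max', 'mean'])
print(runtime_stats)
```

The result has one column per statistic, so calling `runtime_stats.plot()` should draw all three lines at once.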

What makes a successful movie

As we have two operational definitions for success, we will explore both aspects one by one.

Popularity as the measure of success

In my first analysis, I will consider popularity as a measure of success. Higher popularity would mean more success. We will try to answer the following questions:

  1. Does budget make a movie successful?
  2. Does runtime contribute to the success of a movie?
  3. How does success change with year of release?
  4. Does user rating play a role in popularity?

Role of budget on success (in terms of popularity)

Let's see what the popularity of movies looks like when we take the budget into consideration. I will plot the mean popularity against the budget first.

In [24]:
# plot budget vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_budget = df.groupby('budget')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_budget.index, mean_popularity_grouped_by_budget)

# title and labels
plt.title('Budget vs mean Popularity')
plt.xlabel('Budget (not adjusted for inflation)')
plt.ylabel('Popularity rating');

We can see that popularity seems to increase with budget in general. There is a huge dip at the end for movies with a very high budget, but the general trend suggests that the higher the budget, the higher the popularity. Let's look at the scatter of budget against popularity to gain more insight.

In [25]:
# plot budget vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['popularity'])

# title and labels
plt.title('Budget vs Popularity')
plt.xlabel('Budget adjusted for inflation')
plt.ylabel('Popularity rating');

From the above scatter plot, we can see that as the budget increases, the movies seem to be slightly more popular, but this is not a definitive trend. Some of the most popular movies lie in the middle of our budget scale, and we have popular movies at all budget levels. But most of the lower budget movies seem to be less popular. The dip in the mean popularity is explained by the low popularity and low movie count at the high-budget end.

So the budget of a movie might not be a direct factor in success, but movies with a higher budget seem to be more popular.

It doesn't however mean that higher budget makes a movie popular.
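
Grouping by the exact budget value produces one group per distinct budget, which makes the mean line noisy. One possible alternative, sketched here on made-up numbers, is to split budgets into quantile bands with `pd.qcut` and compare the mean popularity per band. The band labels and the frame `sample` are assumptions for illustration only.

```python
import pandas as pd

# Made-up budgets and popularity scores, for illustration only
sample = pd.DataFrame({
    'budget_adj': [1e6, 2e6, 5e6, 1e7, 2e7, 5e7, 1e8, 2e8],
    'popularity': [0.2, 0.3, 0.4, 0.5, 0.9, 1.1, 2.0, 3.0],
})

# Split budgets into four equally sized bands (quartiles)
sample['budget_band'] = pd.qcut(sample['budget_adj'], q=4,
                                labels=['low', 'mid-low', 'mid-high', 'high'])

# Mean popularity per band; observed=True keeps only bands that occur
mean_by_band = sample.groupby('budget_band', observed=True)['popularity'].mean()
print(mean_by_band)
```

Each band then holds a comparable number of movies, so a single outlier at the high-budget end has less influence on the picture.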

Role of runtime on success

Let's see what the popularity of movies looks like when we take the movie runtime into consideration. We will first look into the mean popularity and then look at the scatter plot to analyze further.

In [26]:
# plot runtime vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_runtime = df.groupby('runtime')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_runtime.index, mean_popularity_grouped_by_runtime)

# title and labels
plt.title('Runtime vs mean Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');

We can see that popularity initially increases as the runtime increases and then starts falling. There is a peak around 155-175 minutes of runtime. Let's analyze the distribution of runtime and popularity with a scatter plot now.

In [27]:
# plot runtime vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['runtime'], df['popularity'])

# title and labels
plt.title('Runtime vs Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');

We can see that movies with runtime around 120 to 140 are more popular. So short films are not doing that well. Very long ones are also not performing well. So runtime does seem to affect the popularity of movies. People like their movies around 2 hours long.

Movies which are around 120 to 140 minutes long seem to be more popular with audiences. The most popular movies run around 120 minutes. However, movies around 150-170 minutes have higher popularity on average.

Role of release year on success

Let's look at the change in popularity over the years. To get the trend for popularity, we can look at the mean popularity per year first. Later we will analyze the distribution of popularity from a scatter plot over years.

In [28]:
# plot mean popularity vs year
mean_popularity_grouped_by_year = df.groupby('release_year')['popularity'].mean()
mean_index = mean_popularity_grouped_by_year.index

# set axes for the plot
x1, y1 = mean_index, mean_popularity_grouped_by_year

plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean Popularity')

# title and labels
plt.title('Popularity by Years')
plt.xlabel('Year')
plt.ylabel('Popularity');
plt.legend(loc='upper left')
Out[28]:
<matplotlib.legend.Legend at 0x115785e90>

Popularity seems to be moving upwards as the years progress. This seems plausible, as newer movies are more easily discovered and hence will be more popular. Since mean values only measure central tendency, let's look at the individual values in a scatter plot.

In [29]:
# plot release year vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['popularity'])

# title and labels
plt.title('Release year vs Popularity')
plt.xlabel('Release year')
plt.ylabel('Popularity rating');

We can see that most movies over the years are not very popular, but there is a definite rise in popular movies over the years.

Movies released in the recent past seem to be more popular. So there is a good chance of success for newly released movies.

Given that a lot of new movies are still densely clustered at low popularity scores, release year does not ensure success as it might seem from the mean plot above.
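
The gap between the mean plot and the scatter plot is what we would expect when a few very popular titles pull the yearly mean upwards. A hedged sketch on made-up numbers, comparing mean and median for a single year:

```python
import pandas as pd

# Made-up popularity scores: one breakout hit among many modest titles
sample = pd.DataFrame({
    'release_year': [2014] * 5,
    'popularity': [0.2, 0.3, 0.3, 0.4, 10.0],
})

# Mean and median popularity per year
summary = sample.groupby('release_year')['popularity'].agg(['mean', 'median'])
print(summary)
```

In this toy year the mean is 2.24 while the median is only 0.3, so the typical movie is far less popular than the mean suggests.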

Role of user rating on success

Let's see how popularity is affected by user ratings. Although the two might seem to measure the same thing, let's check whether there is a relation between them. We will plot two trends, one that takes the number of user ratings into account and another that takes the average user rating. Again we will consider mean popularity and then look at the distribution in scatter plots.

In [30]:
# plot average vote vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_average = df.groupby('vote_average')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_vote_average.index, mean_popularity_grouped_by_vote_average)

# title and labels
plt.title('Average vote vs mean Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');

We can see that on average, movies with a higher vote average are more popular. Again we see a dip in popularity at the extreme of the average vote, where a highly voted movie is not very popular. But the general trend suggests that the better the vote average, the higher the popularity, and hence the more successful the movie. Now let's look at the distribution of vote average against popularity in a scatter plot.

In [31]:
# plot vote average vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['popularity'])

# title and labels
plt.title('Average vote vs Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');

While we can see that movies with higher average vote have a better chance of being popular, we have some highly voted movies that are not very popular.

While movies with a high vote average seem to be more popular, there is still a chance that a movie with a high vote average might not be popular. But a low vote average definitely suggests that a movie won't be popular.

Now let's look at how the number of votes cast relates to the popularity of a movie. We will look at the mean popularity first and then at the scatter plot to analyze further.

In [32]:
# plot vote count vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_count = df.groupby('vote_count')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_vote_count.index, mean_popularity_grouped_by_vote_count)

# title and labels
plt.title('Vote count vs mean Popularity')
plt.xlabel('Vote count')
plt.ylabel('Popularity rating');

We can infer from this graph that, on average, movies with more votes cast seem to be more popular. Let's look at the distribution of vote count and popularity to see if that also confirms our observation.

In [33]:
# plot vote count vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_count'], df['popularity'])

# title and labels
plt.title('Vote count vs Popularity')
plt.xlabel('vote count')
plt.ylabel('Popularity rating');

We can see in the graph that popularity increases as the number of votes increases. This makes sense: the more people watch a movie, the more people vote, and the more popular the movie becomes.

The more votes cast, the more popular the movie. This loop feeds itself: more votes mean more popularity, so more people watch the movie and more of them vote.
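
The visual impression from these plots could be backed by a number. One option, sketched here on made-up values, is the Pearson correlation coefficient from `DataFrame.corr()`; it measures linear association only and says nothing about which variable drives the other.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    'vote_count': [10, 50, 200, 1000, 5000],
    'vote_average': [5.0, 6.5, 5.5, 7.0, 6.0],
    'popularity': [0.1, 0.4, 0.9, 3.0, 12.0],
})

# Pairwise Pearson correlation coefficients (between -1 and 1)
corr = sample.corr()
print(corr['popularity'])
```

A value close to 1 between vote_count and popularity would support the observation above; a value near 0 would contradict it.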

Return on Investment as the measure of success

In the second part of my analysis, I will consider percentage profit or return on investment as a measure of success. Higher ROI would mean more success. We will try to answer the following questions:

  1. Is there any effect of release year on ROI?
  2. Do popular movies have better ROI?
  3. Does runtime of movie affect the ROI?
  4. Does budget affect ROI?
  5. Does the user rating affect ROI?

First things first, let's calculate ROI percent in our data.

In [34]:
# calculate the ROI as a percentage of the budget
df['roi'] = ((df['revenue_adj'] - df['budget_adj']) / df['budget_adj'])*100
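Because the formula divides by the budget, a row with a zero or missing budget would produce an infinite or undefined ROI. The following is a minimal sketch of a guarded version, assuming the same `budget_adj` and `revenue_adj` columns; the helper name `roi_percent` and the toy values are made up for illustration.

```python
import pandas as pd

def roi_percent(revenue, budget):
    """Return ROI in percent; rows with a non-positive or missing budget become NaN."""
    budget = budget.where(budget > 0)  # values that fail the test are replaced by NaN
    return (revenue - budget) / budget * 100

# toy values, NOT from the TMDb data
toy = pd.DataFrame({
    'budget_adj': [100.0, 0.0, 50.0],
    'revenue_adj': [250.0, 500.0, 25.0],
})
toy['roi'] = roi_percent(toy['revenue_adj'], toy['budget_adj'])
```

In the toy frame the second row gets a missing ROI instead of an infinite one, so it drops out of later means and plots rather than distorting them.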

Role of release year on success

In this part of our analysis, success is determined by ROI. First let's have a look at the ROI over years.

In [35]:
# plot year of release vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])

# title and labels
plt.title('ROI by Year of release')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');

It looks like ROI remains low for the most part. But for some reason there appear to be massive profits around 1985-1987. Let's look at a few rows with such high ROI to see if there is an error in the data.

In [36]:
movies_with_high_roi = df.query('roi > 100000')
movies_with_high_roi.head()
Out[36]:
id popularity budget revenue original_title runtime genres release_date vote_count vote_average release_year budget_adj revenue_adj roi
2449 2667 0.934621 25000.0 248000000.0 The Blair Witch Project 81 Horror|Mystery 7/14/99 522 6.3 1999 32726.321165 3.246451e+08 9.919000e+05
3581 59296 0.520430 1.0 1378.0 Love, Wedding, Marriage 90 Comedy|Romance 6/3/11 55 5.3 2011 0.969398 1.335831e+03 1.377000e+05
3608 50217 0.463510 93.0 2500000.0 From Prada to Nada 107 Comedy|Drama|Romance 1/28/11 47 5.2 2011 90.154018 2.423495e+06 2.688072e+06
6179 11338 0.132713 114.0 6700000.0 Into the Night 115 Comedy|Drama|Thriller 2/15/85 24 6.1 1985 231.096930 1.358201e+07 5.877093e+06
7447 23827 1.120442 15000.0 193355800.0 Paranormal Activity 86 Horror|Mystery 9/14/07 714 6.1 2007 15775.028739 2.033462e+08 1.288939e+06

This just doesn't look right. A movie with a runtime of 115 minutes can't have a budget of $114. Some other movies, such as those listed with budgets of $1 and $93, also have incorrect budget data, hence the unusual ROI. I will filter out ROIs of 1000% or more and then plot the same graph again. Before that, I will plot the mean ROI by year to see the trend after cleaning up the data.

In [37]:
# remove rows where ROI is unrealistic. For my study, I have decided that an ROI of 1000% or more is not realistic.

# query(..., inplace=True) filters df itself and returns None, so the result is not assigned
df.query('roi < 1000', inplace=True)
normal_roi_mean = df.groupby('release_year')['roi'].mean()
mean_index = normal_roi_mean.index

# set axes for the plot
x1, y1 = mean_index, normal_roi_mean

plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean ROI')

# title and labels
plt.title('Mean ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');
plt.legend(loc='upper left')
Out[37]:
<matplotlib.legend.Legend at 0x110faa090>
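Since the cutoff is a judgment call, it helps to know how many rows it removes. The following is a minimal sketch that filters into a copy and reports the number of dropped rows; the helper name `filter_roi` and the toy values are made up for illustration, and the default of 1000 only mirrors the threshold chosen in this study.

```python
import pandas as pd

def filter_roi(frame, max_roi=1000):
    """Return a filtered copy of frame and the number of rows removed."""
    kept = frame[frame['roi'] < max_roi].copy()  # rows with a missing ROI are dropped too
    return kept, len(frame) - len(kept)

# toy values, NOT from the TMDb data
toy = pd.DataFrame({
    'release_year': [1985, 1999, 2011, 2011],
    'roi': [5877093.0, 150.0, -40.0, 999.9],
})
kept, dropped = filter_roi(toy)
```

Working on a copy leaves the original frame untouched, so different cutoffs can be compared side by side before committing to one.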

Now the filtered data looks more reasonable. We can see that ROI goes down over the years. Around the 1970s there seem to be some outliers with great profit and very heavy loss, but the rest of the trend suggests a dip in ROI. The curve is going upwards from around 2000, but the slope is low. Let's also take a look at the ROI distribution using a scatter plot.

In [38]:
# plot release year vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])

# title and labels
plt.title('ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');

We can see that as the years pass, the number of movies with low ROI increases, which brings the average down and explains the previous plot.

ROI is going down on average as the years pass. This can be due to higher budgets or to more movies being released and not performing well. However, the number of movies with higher ROI is also increasing, as can be seen in the density of the scatter plot above. So movies released in later years have a lower chance of being successful if we look at the release year in isolation. Possible reasons are an increase in budgets and a flood of movies causing a decrease in revenue per movie.
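The yearly mean is sensitive to a handful of extreme values, so it can be worth comparing it with the median and the number of movies per year. The following is a minimal sketch under that assumption; the helper name `roi_summary_by_year` and the toy values are made up for illustration.

```python
import pandas as pd

def roi_summary_by_year(frame):
    """Mean, median and number of movies per release year."""
    return frame.groupby('release_year')['roi'].agg(['mean', 'median', 'count'])

# toy values, NOT from the TMDb data
toy = pd.DataFrame({
    'release_year': [1970, 1970, 1970, 2010, 2010, 2010],
    'roi': [900.0, 0.0, -90.0, 50.0, 40.0, -30.0],
})
summary = roi_summary_by_year(toy)
```

In the toy frame, 1970 has a mean of 270 but a median of 0, which shows how a single large value can lift the mean while the typical movie does no better.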

How does popularity affect the ROI?

To look at the effect of the popularity index on ROI, let's plot ROI against popularity. Again we will first take a look at the mean ROI grouped by popularity and then look at the scatter plot to analyze the trend further.

In [39]:
# plot popularity vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_popularity = df.groupby('popularity')['roi'].mean()

plt.plot(mean_roi_grouped_by_popularity.index, mean_roi_grouped_by_popularity)

# title and labels
plt.title('Popularity vs mean ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');

We don't have a very conclusive trend here, but as the popularity increases, the average ROI seems to go up. Let's look at the scatter plot to see the reason for this strange distribution.

In [40]:
# plot popularity vs ROI
plt.figure(figsize=(10, 6))
plt.scatter(df['popularity'], df['roi'])

# title and labels
plt.title('Popularity vs ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');

If we just look at the above graph, it suggests that as popularity increases, there is certainly an increase in ROI. However, movies which are not very popular also have a great ROI. One possible cause is that we have incorrect revenue and budget data, so the calculated ROI is not correct. But if we look at the highly popular movies, they seem to have a better ROI.

There isn't a very strong connection until the popularity goes beyond an index of about 3-4. After that we can see a positive relationship: the more popular a movie, the better the ROI.
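Popularity is a continuous value, so grouping by it directly gives almost one group per movie and the mean line ends up nearly as noisy as the raw data. Grouping by intervals instead gives a smoother picture. The following is a minimal sketch using `pd.cut`; the helper name `mean_roi_by_bin`, the bin edges and the toy values are made up for illustration.

```python
import pandas as pd

def mean_roi_by_bin(frame, column, edges):
    """Mean ROI within intervals of a continuous column."""
    bins = pd.cut(frame[column], bins=edges)  # right-inclusive intervals
    return frame.groupby(bins, observed=True)['roi'].mean()

# toy values, NOT from the TMDb data
toy = pd.DataFrame({
    'popularity': [0.2, 0.8, 1.5, 3.5, 5.0, 8.0],
    'roi': [10.0, 30.0, 50.0, 200.0, 300.0, 400.0],
})
binned = mean_roi_by_bin(toy, 'popularity', edges=[0, 1, 3, 10])
```

The same helper could be pointed at other continuous columns such as `budget_adj`, with edges chosen to suit the range of that column.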

Effect of runtime on ROI

We have seen earlier that a runtime between 120 and 140 minutes seems to be a common factor among popular movies. So my initial guess would be that these runtimes should have a higher ROI as well. Let's plot runtime against ROI and see what the trend suggests.

In [41]:
# plot runtime vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('runtime')['roi'].mean()

plt.plot(mean_roi.index, mean_roi)

# title and labels
plt.title('Runtime vs mean ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');

We can see that the mean ROI goes up towards the 140-180 minute mark. So on average, movies of this length have a higher ROI. Let's also look at the ROI distribution with runtime.

In [42]:
# plot runtime vs ROI
plt.figure(figsize=(10, 6))

plt.scatter(df['runtime'], df['roi'])

# title and labels
plt.title('Runtime vs ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');

We can see in this distribution that, in general, movies with a runtime of around 140-180 minutes do have a higher ROI. However, there is a good density of movies around the 80-120 minute mark with good ROI. Since this runtime range contains a lot of movies, many of them with low ROI, its mean ROI goes down.

I can see that there is a good density of high ROI movies in the 80-120 minute interval. Also, the 140-180 minute interval has a high mean ROI. These two intervals would be good runtimes to look at for high ROI movies.

Effect of budget on ROI

While budget is directly involved in calculating the ROI, I wanted to see if high budget movies are getting a better ROI, because a higher budget can enable a better cast and crew. Let's assess both the mean ROI and the ROI distribution against budget.

In [43]:
# plot budget vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['roi'])

# title and labels
plt.title('Budget vs ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');

We can see that there is a lot of density towards low budget and low ROI. As the budget is increased, there seems to be a slight increase in ROI. We can also see a nice chunk of low budget movies getting good ROI. Let's see how the mean ROI looks against budget.

In [44]:
# plot budget vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_budget = df.groupby('budget_adj')['roi'].mean()

plt.plot(mean_roi_grouped_by_budget.index, mean_roi_grouped_by_budget)

# title and labels
plt.title('Budget vs mean ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');

The mean ROI can also be seen going upwards with an increase in budget. The curve is not very smooth, but if we have to generalize, a higher budget can be a common factor in successful movies. I am discounting the high ROI movies with a very low budget because they are not realistic and look more like incorrect data.

The general trend suggests that for a higher ROI, one should either keep a very small budget or have a large budget. Both areas seem to have better ROIs. Since budget is the denominator when calculating ROI, a small budget with decent revenue results in a better ROI.
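The effect of the denominator can be shown with a small worked example. The numbers below are made up for illustration and the helper name `roi_pct` is hypothetical: two movies earn the same absolute profit of $1,000,000 but end up with very different ROI.

```python
def roi_pct(revenue, budget):
    """ROI in percent, using the same formula as in this study."""
    return (revenue - budget) / budget * 100

# same profit of 1,000,000 on two made-up budgets
small_budget_roi = roi_pct(revenue=2_000_000, budget=1_000_000)
large_budget_roi = roi_pct(revenue=101_000_000, budget=100_000_000)
```

The small budget movie doubles its money (100% ROI) while the large budget movie gains only 1%, even though both made the same profit in dollars.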

Effect of user rating on ROI

Lastly, let's look at the average user ratings and their relationship with ROI.

In [45]:
# plot vote average vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['roi'])

# title and labels
plt.title('Vote Average vs ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');

We can see that a lot of movies with better ROI have a larger vote average. So a high vote average looks like a characteristic of a successful movie. However, there is also a good density of non-successful movies with a high vote average. Let's look at the mean ROI to see if we can get a trend here.

In [46]:
# plot vote average vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('vote_average')['roi'].mean()

plt.plot(mean_roi.index, mean_roi)

# title and labels
plt.title('Vote Average vs mean ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');

Well, the mean ROI clears things up for vote average. We can see a very clear trend that ROI on average increases with the vote average. There are a few outliers, but a high vote average seems to be a requirement for the success of a movie.

Conclusions

My research goals were to answer three major questions:

  • How does number of movies released per year change?

    From my analysis, we clearly saw that the number of movies released per year shows an increasing trend. It more or less exploded after the year 2000.

  • How does runtime of movies change over years?

    We saw that as the years passed, the average duration of movies remained more or less the same. However, we now have a lot of short films, as well as a few very lengthy ones.

  • What factors can make a movie successful?

    We had two operational definitions of success, popularity and ROI. From my study of the given dataset, I can conclude that the following are the recipes for successful movies:

    1. Popular movies tend to have a high budget, a runtime of around 120-140 minutes, a recent release year, high vote counts and a high vote average.
    2. Movies with better ROIs generally have high popularity, early release years, a runtime of around 80-120 minutes or 140-180 minutes, high vote counts and a high vote average.

    So we can see that a high vote average and high vote counts are part of success either way. Recently released movies seem to be more popular, but ROI has been going down in recent years, so movies released in the past are more successful in terms of ROI. We don't have a clear runtime slot, but short movies don't seem to be doing well on either definition.

Limitations of this study

First and foremost, the study and conclusions are limited by the quality of the data. As seen in the wrangling phase, we had lots of missing values for budget and revenue. Later on, while plotting ROI, we saw that budget values might be wrong as well. I applied a filter based on my guess of a realistic ROI, but incorrect values can still pollute the result. Also, some assumptions were made about outliers and we proceeded to look at the general trend.

This study shows definitive results for movies released per year and runtime, but it doesn't guarantee a recipe for success for the third question. It just points out what things are common across successful movies. While these things are common, they might not be the reason for the success of a movie.