Project: Investigate The Movie Database (TMDb)¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Having looked at the data, we have information about a movie's cast, crew, release date, ratings, budget and revenue. From my study and investigation of this dataset, I want to figure out:

How does the number of movies released per year change?
How does the runtime of movies change over the years?
What factors can make a movie successful?

What is the operational definition of success?¶

Success in general terms is when you achieve what you wanted to achieve. For a movie star, success would be a popular movie or a highly voted movie. For a production house, success would be a larger revenue. Assuming that this dataset has missing values, I am not expecting a perfect success recipe. A key thing to note is that inflation plays a major role in budget as well as revenue. Lucky for us, the final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

For this study, we will evaluate success in two ways:

Popularity (more popular, more successful)
Percentage ROI (Return on Investment) (higher ROI, more successful)

We will look into both aspects of being successful and evaluate the factors that can make a movie successful.

Potential factors that can affect the success:

Budget
- Do high-budget movies do well? Or do low-budget movies get a better ROI?
Votes
- Are highly voted movies making more profit?
Cast and Director
- Is there a group of directors who always make a good profit or get high ratings? Or are there certain actors who always produce a hit movie?
Release year
- Are movies getting more successful in the later years?
Genres
- How are movies doing by genre? Are there particular genres that people like the most? There can be a combination of director and genre, or an actor and genre, or an actor, director and genre together that works very well. However, that is outside the scope of this study.
Movie runtime
- Are short films making more money? Or are longer movies getting better ratings?

For this study, we will relate success to budget, votes, release year and runtime. When success is measured by popularity, we will look at budget, runtime, release year and user votes. When success is measured by ROI, we will look at release year, popularity, runtime, budget and user votes.

# First things first, import all the packages that I will be using

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
%matplotlib inline

Data Wrangling¶

Let's load the data and look at some general properties and descriptive statistics. Before anything else, I will print a few rows to get a feel for the data.

General Properties¶

# Load your data

df = pd.read_csv('./data/tmdb-movies.csv')
df.head(3)

So there are 21 columns in this dataset. From these three rows, I can see that cast is a bit tricky: it is a single string with all names separated by |. Although each of these rows shows a single director, I need to check whether the same scheme of joining names is used for directors as well. genres and production_companies seem to concatenate multiple strings the same way. The release date looks consistent in the format mm/dd/yy. However, we don't have a profit column, which I wanted to look into, so I need to add that as well. Before that, let's look at some basic info from the above data frame.
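The pipe-separated columns described above can be unpacked into one value per row if they are ever needed. The following is a minimal sketch on a hypothetical two-row sample (the titles and genres are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the pipe-separated 'genres' column
sample = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'genres': ['Action|Adventure', 'Drama'],
})

# Split each string on '|' into a list, then give every genre its own row
exploded = sample.assign(genres=sample['genres'].str.split('|')).explode('genres')

# Count how often each individual genre appears
genre_counts = exploded['genres'].value_counts()
print(genre_counts)
```

The same pattern should apply to cast and production_companies, assuming they use the same separator throughout.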

# see the column info and null values in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Looking at the above info, I can see a few anomalies already. The id column has 10866 entries, so I am assuming there are around 10866 movies in the database. But the tagline, keywords and production_companies columns seem to be missing quite a few values, somewhere in the range of 1000-3000 rows each, or roughly 10-30% of the data. The homepage column has far fewer entries still. This is understandable, as a lot of movies might not have a website. It also doesn't seem to be a very important factor in the investigation. Since I am not sure whether the movies don't have a homepage or the dataset is just missing them, I can't reliably check the effect of having a website on movie popularity or revenue. For this study, I will just drop the homepage column.
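The missing-data percentages estimated above can also be computed directly rather than read off the info output. Here is a small sketch on a hypothetical four-row frame; the column names match the dataset, but the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries
sample = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'homepage': ['http://example.com', np.nan, np.nan, np.nan],
    'tagline': ['A tagline', 'Another', np.nan, 'Third'],
})

# isna() marks missing cells; the mean of a boolean column is the share of True values
missing_share = sample.isna().mean() * 100
print(missing_share.round(1))
```

Applied to the real data, the counts printed by df.info() imply that homepage is present in 2936 of 10866 rows, so about 73% of its values are missing.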

Identification of missing and erroneous data and its cleaning¶

# Printing out some descriptive statistics for the data
df.describe()

As we can see in the summary above, we have lots of 0 values in budget and revenue, and some in runtime. None of these should have a 0 value. Let's print a few of these rows to see if there is a pattern associated with the 0 values.

# get movies with 0 revenue 
movies_with_zero_revenue = df.query('revenue == 0')

# print out a few of these
movies_with_zero_revenue.head(5)

The movie with id 308369 seems to be rated well, so having a 0 value in budget and revenue doesn't make sense. The other movies were also released, so they must have had some budget and some revenue. This looks like a case of missing data to me. Let's look at the budget as well.

# get movies with 0 budget 
movies_with_zero_budget = df.query('budget == 0')

# print out a few of these
movies_with_zero_budget.head(5)

Again I can see that all of these have production companies associated with them, so they can't actually have a 0 budget. This also looks like missing data to me. However, if 0 is used as the budget for missing values, and there are lots of missing values, this will give us misleading statistics. Since we want to assess the success of a movie based on revenue and budget, it is a hard call to drop the rows with missing data. In case we have to drop the rows, let's see how many of them would be gone.
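To illustrate why zeros used as placeholders distort the statistics, here is a minimal sketch with a hypothetical budget series (the four values are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical budgets where 0 stands in for "unknown"
budget = pd.Series([0, 0, 10_000_000, 30_000_000])

# Zeros are treated as real budgets and drag the mean down
mean_with_zeros = budget.mean()

# After converting 0 to NaN, mean() skips the missing entries by default
mean_without_zeros = budget.replace(0, np.nan).mean()

print(mean_with_zeros, mean_without_zeros)
```

In this toy case the mean doubles once the placeholder zeros are excluded, which is the kind of distortion the text above is worried about.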

# Movie count without revenue, using id to count as some imdb_id seems to be missing.
movies_with_zero_revenue.groupby('revenue').count()['id']

revenue
0    6016
Name: id, dtype: int64

# Movie count without budget, using id to count as some imdb_id seems to be missing.
movies_with_zero_budget.groupby('budget').count()['id']

budget
0    5696
Name: id, dtype: int64

# Movies with both revenue and budget missing
movies_with_zero_revenue_and_budget = df.query('budget == 0 and revenue == 0')
movies_with_zero_revenue_and_budget.groupby('budget').count()['id']

budget
0    4701
Name: id, dtype: int64

So we have a total of 6016 movies without revenue data, 5696 movies without budget data and 4701 movies with both missing. Since I have two operational definitions of success, one based on ratings and the other on revenue, I will keep this data for now but change all the 0 values to null. Since I am also considering the effect of runtime, let's check for missing data in that as well.
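The same counts can be obtained without groupby by summing boolean masks, which also makes the overlap explicit. This is a sketch on a hypothetical five-row sample, not the real data:

```python
import pandas as pd

# Hypothetical sample: 0 marks a missing budget or revenue
sample = pd.DataFrame({
    'budget': [0, 0, 5, 7, 0],
    'revenue': [0, 3, 0, 9, 0],
})

zero_budget = sample['budget'] == 0
zero_revenue = sample['revenue'] == 0

n_zero_budget = int(zero_budget.sum())             # rows with no budget
n_zero_revenue = int(zero_revenue.sum())           # rows with no revenue
n_both = int((zero_budget & zero_revenue).sum())   # rows missing both
n_either = int((zero_budget | zero_revenue).sum()) # rows missing at least one
n_complete = len(sample) - n_either                # rows usable for ROI

print(n_zero_budget, n_zero_revenue, n_both, n_either, n_complete)
```

By the same inclusion-exclusion arithmetic, the counts above give 6016 + 5696 - 4701 = 7011 rows missing at least one value, leaving 10866 - 7011 = 3855 rows with both budget and revenue.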

# get movies with 0 runtime
movies_with_zero_runtime = df.query('runtime == 0')

# print out a few of these
movies_with_zero_runtime.head(5)

# count of movies with 0 runtime
movies_with_zero_runtime.groupby('runtime').count()['id']

runtime
0    31
Name: id, dtype: int64

A quick search on the internet shows that these movies had a runtime, and it definitely wasn't 0. Since there are just 31 such entries, we can remove them all without affecting our data much.

I have also not defined the ratings aspect as of now, so let's address that. This dataset provides three columns related to ratings:

popularity
vote_count
vote_average

As per TMDB, here are the factors considered in popularity:

Number of votes for the day
Number of views for the day
Number of users who marked it as a "favourite" for the day
Number of users who added it to their "watchlist" for the day
Release date
Number of total votes
Previous days score

This seems like a good measure of popularity for a particular movie. This dataset also contains user vote_count and vote_average. With a large enough vote_count, the vote_average will also be a good measure of public opinion about the movie. In this dataset, however, vote counts range from 10 to around 9500. One consideration is that a less popular movie will have a lower number of votes, but we can't say that definitively. I am also going to keep vote_average as one of my variables and see how it affects the revenue.

So what all are we cleaning?¶

We don't need the imdb_id, homepage, tagline, overview, cast, keywords, director and production_companies columns
We will drop rows with zero runtime.
We will drop null in genres.
We will replace 0 in budget, revenue, budget_adj and revenue_adj with null.
We also don't want any duplicates, so let's drop duplicates as well

# Drop unnecessary columns
columns = ['imdb_id', 'homepage', 'tagline', 'overview', 'cast', 'director', 'keywords', 'production_companies']
df.drop(columns, axis=1, inplace=True)

# Our revised dataframe
df.head(1)

# Inplace remove the rows with 0 runtime
df.query('runtime != 0', inplace=True)

# drop null genres
df.dropna(subset = ['genres'], how='any', inplace=True)

# replace 0 in budget, revenue, budget_adj and revenue_adj

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df['budget'] = df['budget'].replace(0, np.nan)
df['budget_adj'] = df['budget_adj'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
df['revenue_adj'] = df['revenue_adj'].replace(0, np.nan)

# remove duplicates
df.drop_duplicates(inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10812 entries, 0 to 10865
Data columns (total 13 columns):
id                10812 non-null int64
popularity        10812 non-null float64
budget            5165 non-null float64
revenue           4849 non-null float64
original_title    10812 non-null object
runtime           10812 non-null int64
genres            10812 non-null object
release_date      10812 non-null object
vote_count        10812 non-null int64
vote_average      10812 non-null float64
release_year      10812 non-null int64
budget_adj        5165 non-null float64
revenue_adj       4849 non-null float64
dtypes: float64(6), int64(4), object(3)
memory usage: 1.2+ MB

Summary of cleaning¶

From the info printed above, we can see that replacing the 0 budgets and revenues with null leaves only about half of the rows with usable values (5165 budgets and 4849 revenues out of 10812 rows). This is a tradeoff we will make for our analysis. Our dataframe is now clean: unneeded columns are dropped, rows with zero runtime or null genres are removed, duplicates are dropped and 0's in the budget and revenue columns are converted to null. Let's again look at some descriptive statistics on the dataframe.

# cleaned up data with null values for missing budget and revenue
df.describe()

Exploratory Data Analysis¶

Now that we have our data cleaned up, let's start with our analysis and try to figure out answers to the above posed questions.

Movies Released per year¶

# count movies released per year
release_per_year = df.groupby('release_year').count()['id']
release_per_year.head()

release_year
1960    32
1961    31
1962    32
1963    34
1964    42
Name: id, dtype: int64

We are working on the cleaned up dataset and we have our information for movies released per year. Let's plot this data to see the trend, which seems to be upward.

# plot movies released by year
plt.figure(figsize=(10, 6))
plt.plot(release_per_year)
# title and labels
plt.title('Movies released by Years')
plt.xlabel('Year')
plt.ylabel('Number of Movies');

Clearly the number of movies released is going up every year, except for a couple of small dips. The curve also becomes much steeper after the year 2000.

Average runtime of movies by year¶

# minimum, maximum and mean movie runtime per year
# (select the column before aggregating so the text columns are not aggregated)
min_runtime_per_year = df.groupby('release_year')['runtime'].min()
max_runtime_per_year = df.groupby('release_year')['runtime'].max()
mean_runtime_per_year = df.groupby('release_year')['runtime'].mean()

Let's plot the minimum, maximum and mean runtime for each year to see if anything noticeable happened.

# build the index location for x-axis
min_index = min_runtime_per_year.index
max_index = max_runtime_per_year.index
mean_index = mean_runtime_per_year.index

# set axes for the plot
x1, y1 = min_index, min_runtime_per_year
x2, y2 = max_index, max_runtime_per_year
x3, y3 = mean_index, mean_runtime_per_year

# create the plot
plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Minimum')
plt.plot(x2, y2, label = 'Maximum')
plt.plot(x3, y3, label = 'Mean')

# title and labels
plt.title('Runtime by Years')
plt.xlabel('Year')
plt.ylabel('Duration (minutes)')
plt.legend(loc='upper left');

We can see that the average duration of movies more or less remains the same. However, there seems to be a constant presence of very short films after the year 2000. The maximum length of movies also seems to increase overall, and we can see a few extremely lengthy movies in later years.
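The three per-year statistics used above can also be computed in a single aggregation call. The following sketch uses a hypothetical four-row sample in place of the cleaned dataframe:

```python
import pandas as pd

# Hypothetical sample with two movies per year
sample = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2001],
    'runtime': [90, 110, 80, 140],
})

# One groupby, three statistics: the result has one column per statistic
runtime_stats = sample.groupby('release_year')['runtime'].agg(['min', 'max', 'mean'])
print(runtime_stats)
```

Calling runtime_stats.plot() would then draw all three lines with a legend in one step, assuming matplotlib is available as in the rest of the notebook.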

What makes a successful movie¶

As we have two operational definitions for success, we will explore both aspects one by one.

Popularity as the measure of success¶

In my first analysis, I will consider popularity as a measure of success. Higher popularity would mean more success. We will try to answer the following questions:

Does budget make a movie successful?
Does runtime contribute to the success of a movie?
How does success change with year of release?
Does user rating play a role in popularity?

Role of budget on success (in terms of popularity)¶

Let's see what the popularity of movies looks like when we take the budget into consideration. I will plot the mean popularity against the budget first.

# plot budget vs average Popularity
plt.figure(figsize=(10, 6))
# group by the inflation-adjusted budget so the data matches the axis label
mean_popularity_grouped_by_budget = df.groupby('budget_adj')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_budget.index, mean_popularity_grouped_by_budget)

# title and labels
plt.title('Budget vs mean Popularity')
plt.xlabel('Budget adjusted for inflation')
plt.ylabel('Popularity rating');

We can see that popularity seems to increase with budget in general. There is a huge dip at the end for the movie with a very high budget, but the general trend suggests that the higher the budget, the higher the popularity. Let's look at the scatter of budget against popularity to gain more insight into this.

# plot budget vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['popularity'])

# title and labels
plt.title('Budget vs Popularity')
plt.xlabel('Budget adjusted for inflation')
plt.ylabel('Popularity rating');

From the above scatter plot, we can see that as the budget increases, the movies seem to be slightly more popular, but this is not a definitive trend. Some of the most popular movies lie in the middle of our budget scale. We can see that we have popular movies at all budget levels, but most of the lower budget movies seem to be less popular. The dip in the mean popularity is explained by the low popularity and low movie count at the highest budgets.

So the budget of a movie might not be a direct factor in success, but movies with a higher budget seem to be more popular.

However, this doesn't mean that a higher budget makes a movie popular.
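Grouping by the exact budget gives mostly one movie per group, so the mean line above is noisy. One possible refinement, sketched here on a hypothetical six-row sample with arbitrarily chosen bin edges and labels, is to bin the budget first and report the movie count next to the mean, along with a Pearson correlation:

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration
sample = pd.DataFrame({
    'budget_adj': [1e6, 2e6, 5e7, 6e7, 1.5e8, 2e8],
    'popularity': [0.2, 0.4, 1.0, 1.4, 2.0, 3.0],
})

# Assumed bin edges: up to 10M, 10M-100M, above 100M
budget_band = pd.cut(
    sample['budget_adj'],
    bins=[0, 1e7, 1e8, 1e9],
    labels=['low', 'mid', 'high'],
)

# Mean popularity and number of movies per budget band
band_stats = sample.groupby(budget_band, observed=True)['popularity'].agg(['mean', 'count'])
print(band_stats)

# Pearson correlation between budget and popularity (NaN pairs are ignored)
pearson_r = sample['budget_adj'].corr(sample['popularity'])
print(round(pearson_r, 2))
```

The count column makes it easy to spot bands where a mean rests on only a handful of movies, as with the dip at very high budgets. A correlation coefficient still only describes association, not cause.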

Role of runtime on success¶

Let's see what the popularity of movies looks like when we take the movie runtime into consideration. We will first look at the mean popularity and then at the scatter plot to analyze further.

# plot runtime vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_runtime = df.groupby('runtime')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_runtime.index, mean_popularity_grouped_by_runtime)

# title and labels
plt.title('Runtime vs mean Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');

We can see that popularity initially increases as the runtime increases and then starts falling. There is a peak at around 155-175 minutes of runtime. Let's analyze the distribution of runtime and popularity with a scatter plot now.

# plot runtime vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['runtime'], df['popularity'])

# title and labels
plt.title('Runtime vs Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');

We can see that movies with a runtime of around 120 to 140 minutes are more popular. Short films are not doing that well, and very long ones are not performing well either. So runtime does seem to be related to the popularity of movies. People like their movies around 2 hours long.

Movies which are around 120 to 140 minutes long seem to be more popular with audiences. The most popular movies run for around 120 minutes. However, movies of around 150-170 minutes have a higher mean popularity.

Role of release year on success¶

Let's look at the change in popularity over the years. To get the trend for popularity, we can look at the mean popularity per year first. Later we will analyze the distribution of popularity from a scatter plot over years.

# plot mean popularity vs year
mean_popularity_grouped_by_year = df.groupby('release_year')['popularity'].mean()
mean_index = mean_popularity_grouped_by_year.index

# set axes for the plot
x1, y1 = mean_index, mean_popularity_grouped_by_year

plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean Popularity')

# title and labels
plt.title('Popularity by Years')
plt.xlabel('Year')
plt.ylabel('Popularity')
plt.legend(loc='upper left');

Popularity seems to be moving upwards as the years progress. This seems plausible, as newer movies are more easily discovered and hence will be more popular. Since the mean only captures the central tendency, let's look at the individual values in a scatter plot.

# plot release year vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['popularity'])

# title and labels
plt.title('Release year vs Popularity')
plt.xlabel('Release year')
plt.ylabel('Popularity rating');

We can see that most movies over the years are not very popular, but there is a definite rise in popular movies over the years.

Movies which were released in the recent past seem to be more popular. So there is a good chance of success for newly released movies.

Given that a lot of new movies are still densely clustered at low popularity values, a recent release year does not ensure success, as it might seem from the mean plot above.

Role of user rating on success¶

Let's see how popularity is affected by user ratings. Although the two may seem to be the same thing, let's check whether there is a relation between them. We will plot two trends, one using the number of user ratings and the other using the average user rating. Again we will consider the mean popularity and then look at the distribution in scatter plots.

# plot average vote vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_average = df.groupby('vote_average')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_vote_average.index, mean_popularity_grouped_by_vote_average)

# title and labels
plt.title('Average vote vs mean Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');

We can see that on average, movies with a higher vote average are more popular. Again we see a dip in popularity at the extreme of the average vote, where a highly voted movie is not very popular. But the general trend suggests that the better the vote average, the higher the popularity, and hence the more successful the movie. Now let's look at the distribution of vote average against popularity in a scatter plot.

# plot vote average vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['popularity'])

# title and labels
plt.title('Average vote vs Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');

While we can see that movies with higher average vote have a better chance of being popular, we have some highly voted movies that are not very popular.

While movies with a high vote average seem to be more popular, there is still a chance that a movie with a high vote average might not be popular. But a low vote average is a strong sign that a movie won't be popular.

Now let's look at how the number of votes cast relates to the popularity of a movie. We will look at the mean popularity first and then at the scatter plot to analyze further.

# plot vote count vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_count = df.groupby('vote_count')['popularity'].mean()

plt.plot(mean_popularity_grouped_by_vote_count.index, mean_popularity_grouped_by_vote_count)

# title and labels
plt.title('Vote count vs mean Popularity')
plt.xlabel('Vote count')
plt.ylabel('Popularity rating');

We can infer from this graph that on average, movies with more votes cast seem to be more popular. Let's look at the distribution of vote count and popularity to see if that also confirms our observation.

# plot vote count vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_count'], df['popularity'])

# title and labels
plt.title('Vote count vs Popularity')
plt.xlabel('vote count')
plt.ylabel('Popularity rating');

We can see in the graph that popularity increases as the number of votes increases. This makes sense: the more people watch a movie, the more people will vote, and the more popular the movie will be.

The more votes cast, the more popular the movie is. This loop feeds itself, as more votes mean more popularity, which leads more people to watch the movie and vote.

Return on Investment as the measure of success¶

In the second part of my analysis, I will consider percentage profit or return on investment as a measure of success. Higher ROI would mean more success. We will try to answer the following questions:

Is there any effect of release year on ROI?
Do popular movies have better ROI?
Does runtime of movie affect the ROI?
Does budget affect ROI?
Does the user rating affect ROI?

First things first, let's calculate ROI percent in our data.

# calculate the ROI as a percentage of the (inflation-adjusted) budget
df['roi'] = ((df['revenue_adj'] - df['budget_adj']) / df['budget_adj'])*100
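As a quick sanity check of the ROI formula above, here is a worked example in plain Python. The helper name roi_percent and the input figures are hypothetical and only serve the illustration:

```python
def roi_percent(revenue, budget):
    """Return on investment as a percentage of the budget."""
    if budget <= 0:
        raise ValueError('budget must be positive')
    return (revenue - budget) / budget * 100


# A movie that cost 100 and earned 250 made 1.5 times its budget in profit
profit_case = roi_percent(250, 100)

# A movie that earned back only half its budget lost 50% of it
loss_case = roi_percent(50, 100)

# Breaking even gives an ROI of exactly zero
break_even_case = roi_percent(100, 100)

print(profit_case, loss_case, break_even_case)
```

Since revenue cannot be negative, the ROI defined this way can never fall below -100%, while it has no upper bound. This is why tiny erroneous budgets can produce enormous ROI values.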

Role of release year on success¶

In this part of our analysis, success is determined by ROI. First let's have a look at the ROI over years.

# plot year of release vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])

# title and labels
plt.title('ROI by Year of release')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');

It looks like ROI remains low for the most part. But for some reason, around 1985-1987, there were some massive profits. Let's look at a few rows with such high profits to see if there is some error in the data.

movies_with_high_roi = df.query('roi > 100000')
movies_with_high_roi.head()

This just doesn't look right. A movie with a runtime of 115 minutes can't have a budget of $114. Other movies also have incorrect budget data, hence the unusual ROI. I will filter out ROI values of 1000% or more, that is, a profit of ten or more times the budget. After filtering, I will first plot the mean ROI by year to see the trend and then plot the same scatter again.
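One side effect of a comparison filter such as roi < 1000 is worth spelling out: comparisons against NaN evaluate to False, so rows whose ROI is missing are removed along with the outliers. A minimal sketch on a hypothetical four-row sample:

```python
import numpy as np
import pandas as pd

# Hypothetical ROI values: one outlier and one missing value
sample = pd.DataFrame({'roi': [50.0, 250000.0, np.nan, -80.0]})

# NaN < 1000 is False, so the missing row is dropped together with the outlier
kept = sample.query('roi < 1000')

# Alternative that removes only the outlier and keeps rows with a missing ROI
kept_with_missing = sample[(sample['roi'] < 1000) | sample['roi'].isna()]

print(len(kept), len(kept_with_missing))
```

Which variant is appropriate depends on whether later plots that do not involve ROI should still see the movies without budget or revenue data.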

# Remove rows where the ROI is unrealistic. For this study, an ROI of 1000% or more is treated as not realistic.
# Note: rows with a missing ROI (null budget or revenue) are dropped by this filter as well.

df.query('roi < 1000', inplace=True)
normal_roi_mean = df.groupby('release_year')['roi'].mean()
mean_index = normal_roi_mean.index

# set axes for the plot
x1, y1 = mean_index, normal_roi_mean

plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean ROI')

# title and labels
plt.title('Mean ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)')
plt.legend(loc='upper left');

Now this filtered data looks more plausible. We can see that ROI is going down over the years. Around the 1970s, there seem to be some outliers with great profit and very heavy loss, but the rest of the trend suggests a dip in ROI. The curve is going upwards from around 2000, but the slope is low. Let's also take a look at the ROI distribution using a scatter plot.

# plot release year vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])

# title and labels
plt.title('ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');

We can see that as the years pass, the number of movies with low ROI increases. This brings the average down, which explains the previous plot.

ROI is going down on average as the years pass. This can be due to higher budgets or to more movies coming in and not performing well. However, we can see that the number of movies with higher ROI is also increasing, as shown by the density of the above scatter plot. So movies released in later years have a lower chance of being successful if we look at the release year in isolation. Possible reasons are an increase in budgets and a flood of movies causing a decrease in revenue per movie.

How does popularity affect the ROI?¶

To look at the effect of the popularity index on ROI, let's plot ROI against popularity. Again we will first take a look at the mean ROI grouped by popularity and then look at the scatter plot to analyze the trend further.

# plot popularity vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_popularity = df.groupby('popularity')['roi'].mean()

plt.plot(mean_roi_grouped_by_popularity.index, mean_roi_grouped_by_popularity)

# title and labels
plt.title('Popularity vs mean ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');

We don't have a very conclusive trend here, but as the popularity increases, the average ROI seems to go up. Let's look at the scatter plot to see the reason for this strange distribution.

# plot popularity vs ROI
plt.figure(figsize=(10, 6))
plt.scatter(df['popularity'], df['roi'])

# title and labels
plt.title('Popularity vs ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');

If we just look at the above graph, it suggests that as popularity increases, there is certainly an increase in ROI. However, movies which are not very popular can also have a great ROI. One possible cause is that we have incorrect revenue and budget data, so the calculated ROI is not correct. But if we look at the highly popular movies, they seem to have a better ROI.

There isn't a very strong connection until the popularity goes beyond an index of around 3-4. After that we can see a positive relationship, in the sense that the more popular a movie is, the better its ROI.
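The threshold effect described above could be checked by splitting the movies at a popularity cut-off and comparing the mean ROI on each side. This sketch uses a hypothetical sample and an assumed cut-off of 3, taken from the lower end of the 3-4 range mentioned in the text:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are invented for illustration
sample = pd.DataFrame({
    'popularity': [0.5, 1.0, 3.5, 5.0],
    'roi': [10.0, 30.0, 200.0, 400.0],
})

# Label each movie by which side of the assumed cut-off it falls on
sample['popularity_band'] = np.where(sample['popularity'] > 3, 'above 3', 'at or below 3')

# Mean ROI and number of movies in each band
band_roi = sample.groupby('popularity_band')['roi'].agg(['mean', 'count'])
print(band_roi)
```

On the real data the two bands are likely to be very unequal in size, so the count column matters when reading the means.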

Effect of runtime on ROI¶

We have seen earlier that a runtime between 120 and 140 minutes seems to be a common factor among popular movies. So my initial guess would be that these runtimes should have a higher ROI as well. Let's plot runtime against ROI and see what the trend suggests.

# plot runtime vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('runtime')['roi'].mean()

plt.plot(mean_roi.index, mean_roi)

# title and labels
plt.title('Runtime vs mean ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');

We can see that the mean ROI goes up towards the 140-180 minute mark. So on average, movies of this length have a higher ROI. Let's also look at the ROI distribution against runtime.

# plot runtime vs ROI
plt.figure(figsize=(10, 6))

plt.scatter(df['runtime'], df['roi'])

# title and labels
plt.title('Runtime vs ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');

We can see in this distribution that in general, movies of around 140-180 minutes do have a higher ROI. There is also a good density of movies with good ROI around the 80-120 minute mark. Since this runtime range contains a lot of movies, many with low ROI, its mean ROI is lower.

I can see that there is a good density of high ROI movies in the 80-120 minute interval. Also, the 140-180 minute interval has a high mean ROI. These two intervals would be good runtimes to look at for high ROI movies.

Effect of budget on ROI¶

While budget is directly involved in calculating the ROI, I wanted to see if high budget movies are getting a better ROI, because a higher budget can enable a better cast and crew. Let's assess both the mean ROI and the ROI distribution against budget.

# plot budget vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['roi'])

# title and labels
plt.title('Budget vs ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');

We can see that there is a lot of density towards low budget and low ROI. As the budget is increased, there seems to be a slight increase in ROI. We can also see a nice chunk of low budget movies getting good ROI. Let's see how the mean ROI looks against budget.

# plot budget vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_budget = df.groupby('budget_adj')['roi'].mean()

plt.plot(mean_roi_grouped_by_budget.index, mean_roi_grouped_by_budget)

# title and labels
plt.title('Budget vs mean ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');

The mean ROI can also be seen going upwards with an increase in budget. The curve is not very smooth, but if we have to generalize and look for a common relationship, a higher budget can be a common factor in successful movies. I am discounting the high ROI movies with very low budgets, because these values are not realistic and look more like incorrect data.

The general trend suggests that for a higher ROI, one should either keep a very small budget or have a large budget. Both areas seem to have better ROIs. Since budget is the denominator when calculating ROI, a small budget with decent revenue results in a better ROI.

Effect of user rating on ROI¶

Lastly, let's look at the average user ratings and their relationship with ROI.

# plot vote average vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['roi'])

# title and labels
plt.title('Vote Average vs ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');

We can see that a lot of movies with a better ROI have a higher vote average. So a high average vote looks like a characteristic of a successful movie. However, there is also a good density of non-successful movies with a high vote average. Let's look at the mean ROI to see if we can get a trend here.

# plot vote average vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('vote_average')['roi'].mean()

plt.plot(mean_roi.index, mean_roi)

# title and labels
plt.title('Vote Average vs mean ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');

Well, the mean ROI clears things up for vote average. We can see a very clear trend that ROI on average increases with the average vote. There are a few outliers, but a high average vote seems to go hand in hand with the success of a movie.

Conclusions¶

My research goals were to answer three major questions:

How does number of movies released per year change?

From my analysis, we clearly saw that the number of movies released per year is on an increasing trend. It rose sharply after the year 2000.

How does runtime of movies change over years?

As the years passed, the average duration of movies remained more or less the same. However, we now have a lot of short films, as well as a few very lengthy ones.

What factors can make a movie successful?

We had two operational definitions of success: popularity and ROI. From my study of this dataset, I can conclude that the following are the recipes for successful movies:
1. Popular movies tend to have a high budget, a runtime of around 120-140 minutes, a recent release year, high vote counts and a high vote average.
2. Movies with better ROIs generally have high popularity, early release years, a runtime of around 80-120 minutes or 140-180 minutes, high vote counts and a high vote average.
So a high vote average and high vote counts are part of success either way. Recently released movies seem to be more popular, but ROI has been declining in recent years, so older movies are more successful in terms of ROI. There is no clear runtime sweet spot, but short movies don't seem to do well under either definition.

Limitations of this study¶

First and foremost, the study and its conclusions are limited by the quality of the data. As seen in the wrangling phase, many values for budget and revenue were missing. Later, while plotting ROI, we saw that some budget values might be wrong as well. I filtered the data based on my own guess of a plausible ROI, but the remaining errors still pollute the results. Some assumptions were also made about outliers so that we could focus on the general trend.
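The cleaning steps this paragraph refers to could look roughly like the sketch below. It treats a zero budget or revenue as missing and, assuming the ROI filter is an upper bound, drops rows whose ROI exceeds a cutoff. The toy rows reuse budget and revenue figures that appear in the dataset, while the `max_roi` value is a hypothetical choice and not the threshold used in this study.

```python
import numpy as np
import pandas as pd

# Toy rows: plausible values, a missing (zero) budget, and a suspiciously tiny budget
toy = pd.DataFrame({
    'budget': [150_000_000.0, 0.0, 1.0],
    'revenue': [378_436_354.0, 0.0, 1378.0],
})

# Treat zeros as missing values rather than real amounts
toy[['budget', 'revenue']] = toy[['budget', 'revenue']].replace(0, np.nan)

# Percentage ROI; rows with a missing budget or revenue stay NaN
toy['roi'] = (toy['revenue'] - toy['budget']) / toy['budget'] * 100

# Hypothetical upper bound on believable ROI (in %); not the value used in the study
max_roi = 10_000
filtered = toy[toy['roi'] <= max_roi]
print(filtered)
```

Any such cutoff is a judgment call: a stricter bound removes more suspect rows but may also discard genuine low-budget hits.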

This study shows definitive results for the number of movies released per year and for runtime, but it doesn't guarantee a recipe for success when it comes to the third question. It only points out what is common across successful movies. While these traits are common, they might not be the reason for a movie's success.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	id	imdb_id	popularity	budget	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj
48	265208	tt2231253	2.932340	30000000	Wild Card	Jason Statham\|Michael Angarano\|Milo Ventimigli...	NaN	Simon West	Never bet against a man with a killer hand.	...	When a Las Vegas bodyguard with lethal skills ...	92	Thriller\|Crime\|Drama	Current Entertainment\|Lionsgate\|Sierra / Affin...	1/14/15	481	5.3	2015	2.759999e+07
67	334074	tt3247714	2.331636	20000000	Survivor	Pierce Brosnan\|Milla Jovovich\|Dylan McDermott\|...	http://survivormovie.com/	James McTeigue	His Next Target is Now Hunting Him	...	A Foreign Service Officer in London tries to p...	96	Crime\|Thriller\|Action	Nu Image Films\|Winkler Films\|Millennium Films\|...	5/21/15	280	5.4	2015	1.839999e+07
74	347096	tt3478232	2.165433	0	Mythica: The Darkspore	Melanie Stone\|Kevin Sorbo\|Adam Johnson\|Jake St...	http://www.mythicamovie.com/#!blank/wufvh	Anne K. Black	NaN	...	When Teela’s sister is murdered and a powerf...	108	Action\|Adventure\|Fantasy	Arrowstorm Entertainment	6/24/15	27	5.1	2015	0.000000e+00
75	308369	tt2582496	2.141506	0	Me and Earl and the Dying Girl	Thomas Mann\|RJ Cyler\|Olivia Cooke\|Connie Britt...	http://www.foxsearchlight.com/meandearlandthed...	Alfonso Gomez-Rejon	A Little Friendship Never Killed Anyone.	...	Greg is coasting through senior year of high s...	105	Comedy\|Drama	Indian Paintbrush	6/12/15	569	7.7	2015	0.000000e+00
92	370687	tt3608646	1.876037	0	Mythica: The Necromancer	Melanie Stone\|Adam Johnson\|Kevin Sorbo\|Nicola ...	http://www.mythicamovie.com/#!blank/y9ake	A. Todd Smith	NaN	...	Mallister takes Thane prisoner and forces Mare...	0	Fantasy\|Action\|Adventure	Arrowstorm Entertainment\|Camera 40 Productions...	12/19/15	11	5.4	2015	0.000000e+00

	id	imdb_id	popularity	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	revenue_adj
30	280996	tt3168230	3.927333	29355203	Mr. Holmes	Ian McKellen\|Milo Parker\|Laura Linney\|Hattie M...	http://www.mrholmesfilm.com/	Bill Condon	The man behind the myth	...	The story is set in 1947, following a long-ret...	103	Mystery\|Drama	BBC Films\|See-Saw Films\|FilmNation Entertainme...	6/19/15	425	6.4	2015	2.700677e+07
36	339527	tt1291570	3.358321	22354572	Solace	Abbie Cornish\|Jeffrey Dean Morgan\|Colin Farrel...	NaN	Afonso Poyart	A serial killer who can see your future, a psy...	...	A psychic doctor, John Clancy, works with an F...	101	Crime\|Drama\|Mystery	Eden Rock Media\|FilmNation Entertainment\|Flynn...	9/3/15	474	6.2	2015	2.056620e+07
72	284289	tt2911668	2.272044	45895	Beyond the Reach	Michael Douglas\|Jeremy Irvine\|Hanna Mangan Law...	NaN	Jean-Baptiste Léonetti	NaN	...	A high-rolling corporate shark and his impover...	95	Thriller	Furthur Films	4/17/15	81	5.5	2015	4.222338e+04
74	347096	tt3478232	2.165433	0	Mythica: The Darkspore	Melanie Stone\|Kevin Sorbo\|Adam Johnson\|Jake St...	http://www.mythicamovie.com/#!blank/wufvh	Anne K. Black	NaN	...	When Teela’s sister is murdered and a powerf...	108	Action\|Adventure\|Fantasy	Arrowstorm Entertainment	6/24/15	27	5.1	2015	0.000000e+00
75	308369	tt2582496	2.141506	0	Me and Earl and the Dying Girl	Thomas Mann\|RJ Cyler\|Olivia Cooke\|Connie Britt...	http://www.foxsearchlight.com/meandearlandthed...	Alfonso Gomez-Rejon	A Little Friendship Never Killed Anyone.	...	Greg is coasting through senior year of high s...	105	Comedy\|Drama	Indian Paintbrush	6/12/15	569	7.7	2015	0.000000e+00

	id	imdb_id	popularity	original_title	cast	homepage	director	tagline	...	overview	genres	production_companies	release_date	vote_count	vote_average	release_year
92	370687	tt3608646	1.876037	Mythica: The Necromancer	Melanie Stone\|Adam Johnson\|Kevin Sorbo\|Nicola ...	http://www.mythicamovie.com/#!blank/y9ake	A. Todd Smith	NaN	...	Mallister takes Thane prisoner and forces Mare...	Fantasy\|Action\|Adventure	Arrowstorm Entertainment\|Camera 40 Productions...	12/19/15	11	5.4	2015
334	361931	tt5065822	0.357654	Ronaldo	Cristiano Ronaldo	http://www.ronaldothefilm.com	Anthony Wonke	Astonishing. Intimate. Definitive.	...	Filmed over 14 months with unprecedented acces...	Documentary	On The Corner Films\|We Came, We Saw, We Conque...	11/9/15	80	6.5	2015
410	339342	tt2948712	0.097514	Anarchy Parlor	Robert LaSardo\|Jordan James Smith\|Sara Fabel\|T...	NaN	Kenny Gage\|Devon Downs	NaN	...	Six young college hopefuls vacationing and par...	Horror	NaN	1/1/15	15	5.6	2015
445	353345	tt3800796	0.218528	The Exorcism of Molly Hartley	Sarah Lind\|Devon Sawa\|Gina Holden\|Peter MacNei...	NaN	Steven R. Monroe	NaN	...	Taking place years after The Haunting of Molly...	Horror	WT Canada Productions	10/9/15	52	5.0	2015
486	333653	tt4058368	0.176744	If There Be Thorns	Heather Graham\|Jason Lewis\|Rachael Carpani\|Mas...	NaN	Nancy Savoca	NaN	...	The third installment in V.C. Andrews’ bests...	TV Movie\|Drama	A+E Studios\|Jane Startz Productions	4/5/15	11	5.4	2015

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10812.000000	10812.000000	5.165000e+03	4.849000e+03	10812.000000	10812.000000	10812.000000	10812.000000	5.165000e+03	4.849000e+03
mean	65558.945523	0.648730	3.076120e+07	8.923886e+07	102.421846	218.369404	5.975379	2001.288938	3.691521e+07	1.151009e+08
std	91662.645876	1.001976	3.891166e+07	1.620801e+08	30.871363	576.886018	0.934122	12.819746	4.196662e+07	1.988557e+08
min	5.000000	0.000065	1.000000e+00	2.000000e+00	2.000000	10.000000	1.500000	1960.000000	9.210911e-01	2.370705e+00
25%	10576.750000	0.209045	6.000000e+06	7.732325e+06	90.000000	17.000000	5.400000	1995.000000	8.108664e+06	1.046585e+07
50%	20500.500000	0.385298	1.700000e+07	3.185308e+07	99.000000	38.000000	6.000000	2006.000000	2.274082e+07	4.395666e+07
75%	74725.250000	0.716608	4.000000e+07	9.996575e+07	112.000000	146.000000	6.600000	2011.000000	5.008384e+07	1.316482e+08
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	id	popularity	budget	revenue	original_title	runtime	genres	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	roi
2449	2667	0.934621	25000.0	248000000.0	The Blair Witch Project	81	Horror\|Mystery	7/14/99	522	6.3	1999	32726.321165	3.246451e+08	9.919000e+05
3581	59296	0.520430	1.0	1378.0	Love, Wedding, Marriage	90	Comedy\|Romance	6/3/11	55	5.3	2011	0.969398	1.335831e+03	1.377000e+05
3608	50217	0.463510	93.0	2500000.0	From Prada to Nada	107	Comedy\|Drama\|Romance	1/28/11	47	5.2	2011	90.154018	2.423495e+06	2.688072e+06
6179	11338	0.132713	114.0	6700000.0	Into the Night	115	Comedy\|Drama\|Thriller	2/15/85	24	6.1	1985	231.096930	1.358201e+07	5.877093e+06
7447	23827	1.120442	15000.0	193355800.0	Paranormal Activity	86	Horror\|Mystery	9/14/07	714	6.1	2007	15775.028739	2.033462e+08	1.288939e+06