This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Having looked at the data, we have information about a movie's cast, crew, release date, ratings, budget and revenue. From my study and investigation of this dataset, I want to figure out:
Success in general terms is when you achieve what you wanted to achieve. For a movie star, success would be a popular movie or a highly voted movie. For a production house, success would be a larger revenue. Assuming that this dataset has missing values, I am not expecting a perfect success recipe. A key thing to note is that inflation plays a major role in budget as well as revenue. Lucky for us, the final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
For this study, we will evaluate success in two ways:
We will look into both aspects of being successful and evaluate the factors that can make a movie successful.
Potential factors that can affect the success:
For this study, we will measure success based on budget, votes, release year and runtime. While measuring success in terms of budget, we will consider votes, release year and runtime. And when measuring success in terms of votes, we will consider profit, release year and runtime.
# First things first, import all the packages that I will be using
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
%matplotlib inline
# Load your data
df = pd.read_csv('./data/tmdb-movies.csv')
df.head(3)
So there are 21 columns in this dataset. From these three rows, I can see that cast is a bit tricky: each value is a single string with all names separated by |. Although each director is shown as a single name in these rows, I need to check whether the same scheme of joining names is used for directors as well. genres and production_companies seem to concatenate multiple strings the same way. Release date looks consistent in the format mm/dd/yy. However, we don't have the profit column that I wanted to look into, so I need to fix that as well. Before that, let's just look at some basic info from the above data frame.
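As a rough illustration of how the |-separated columns could be handled, here is a minimal sketch on a made-up two-row frame. The titles and genre values are placeholders, not rows from the dataset, and splitting followed by `explode` is only one possible approach.

```python
import pandas as pd

# Toy rows that mimic the pipe-separated layout; titles and genres are made up.
sample = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'genres': ['Action|Adventure', 'Drama'],
})

# Split each string on '|' into a list, then give every genre its own row.
genres_long = (
    sample.assign(genres=sample['genres'].str.split('|'))
          .explode('genres')
)

# Count how often each genre appears across all movies.
genre_counts = genres_long['genres'].value_counts()
print(genre_counts)
```

The same pattern should apply to cast and production_companies if they follow the same scheme.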
# see the column info and null values in the dataset
df.info()
Looking at the above info, I can see a few anomalies already. The id column has 10866 entries, so I am assuming there are around 10866 movies in the database. But the tagline, keywords and production_companies columns seem to be missing quite a few values, somewhere in the range of 1000-3000 rows each. That would be around 10-30% missing data. However, homepage seems to have far fewer entries. This is understandable as a lot of movies might not have a website. Also, this doesn't seem to be a very important factor in the investigation. Since I am not sure whether the movies don't have a homepage or this dataset is just missing them, I can't reliably check the effect of having a website on movie popularity or revenue. For this study, I will just drop the homepage column.
# Printing out some descriptive statistics for the data
df.describe()
As we can see in the summary above, we have lots of 0 values in budget and revenue, and some in runtime. None of these should have a 0 value. Let's just print a few of these to see if there is a pattern associated with 0 values.
# get movies with 0 revenue
movies_with_zero_revenue = df.query('revenue == 0')
# print out a few of these
movies_with_zero_revenue.head(5)
The movie with id 308369 seems to be rated well, so having a 0 value in budget and revenue doesn't make sense. The other movies were also released, so they must have had some budget and some revenue. This looks like a case of missing data to me. Let's look at the budget as well.
# get movies with 0 budget
movies_with_zero_budget = df.query('budget == 0')
# print out a few of these
movies_with_zero_budget.head(5)
Again I can see that all of these have production companies associated with them, so it can't be a case of an actual 0 budget. This also looks like missing data to me. However, if 0 is used as the budget for missing values, and there are lots of missing values, this will give us some false statistics. Since we want to assess the success of a movie based on revenue and budget, it will be a hard call to drop the rows with missing data. In case we have to drop the rows, let's see how many of them will be gone.
# Movie count without revenue, using id to count as some imdb_id seems to be missing.
movies_with_zero_revenue.groupby('revenue').count()['id']
# Movie count without budget, using id to count as some imdb_id seems to be missing.
movies_with_zero_budget.groupby('budget').count()['id']
# Movies with both revenue and budget missing
movies_with_zero_revenue_and_budget = df.query('budget == 0 and revenue == 0')
movies_with_zero_revenue_and_budget.groupby('budget').count()['id']
So we have a total of 6016 movies without revenue data, 5696 movies without budget data and 4701 movies without either of these. Since I have two operational definitions for the success of movies, one based on ratings and the other on revenue, I will keep this data for now, but change all the 0 values to null. Since I am also considering the effect of runtime, let's check for missing data in that as well.
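The three counts above could also be obtained with boolean masks instead of groupby. The following is a minimal sketch on a made-up four-row frame; the numbers are placeholders and only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'budget': [0, 100, 0, 250],
    'revenue': [0, 0, 300, 500],
})

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy[['budget', 'revenue']] == 0).sum()

# Rows where both budget and revenue are 0.
both_zero = ((toy['budget'] == 0) & (toy['revenue'] == 0)).sum()

print(zero_counts)
print(both_zero)
```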
# get movies with 0 runtime
movies_with_zero_runtime = df.query('runtime == 0')
# print out a few of these
movies_with_zero_runtime.head(5)
# count of movies with 0 runtime
movies_with_zero_runtime.groupby('runtime').count()['id']
A quick search on the internet shows that these movies had a runtime, and it definitely wasn't 0. Since there are just 31 such entries, we can remove them all without affecting our data much.
I have also not defined the ratings aspect as of now, so let's address that. This dataset provides three columns related to ratings:
As per TMDB, here are the factors considered in popularity:
This seems like a good measure of popularity for a particular movie. This dataset also contains user vote_count and vote_average. With a good number of votes in vote_count, the vote_average will also be a good measure of public opinion about the movie. In this dataset, however, the number of votes ranges from 10 to around 9500. One consideration is that a less popular movie will have a lower number of votes, but we can't say that definitively. I am also going to keep vote_average as one of my variables and see how it affects the revenue.
# Drop unnecessary columns
columns = ['imdb_id', 'homepage', 'tagline', 'overview', 'cast', 'director', 'keywords', 'production_companies']
df.drop(columns, axis=1, inplace=True)
# Our revised dataframe
df.head(1)
# Inplace remove the rows with 0 runtime
df.query('runtime != 0', inplace=True)
# drop null genres
df.dropna(subset = ['genres'], how='any', inplace=True)
# replace 0 in budget, revenue, budget_adj and revenue_adj
df['budget'] = df['budget'].replace(0, np.nan)
df['budget_adj'] = df['budget_adj'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
df['revenue_adj'] = df['revenue_adj'].replace(0, np.nan)
# remove duplicates
df.drop_duplicates(inplace=True)
df.info()
From the info printed above, we can see that converting 0 budgets and revenues to null leaves far fewer non-null values in those columns. This is a tradeoff we will make for our analysis. Our dataframe is now clean: rows with a 0 runtime or a null genre are removed, unneeded columns are dropped, duplicates are removed and 0's are converted to null. Let's again look at some descriptive statistics on the dataframe.
# cleaned up data with null values for missing budget and revenue
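To see why converting 0 to null matters for the descriptive statistics, here is a small sketch with made-up budget values. pandas skips null values when computing a mean, so placeholder zeros no longer pull the average down.

```python
import numpy as np
import pandas as pd

# Made-up budgets where 0 stands in for a missing value.
budget = pd.Series([0, 0, 100, 300])

# With the zeros kept, they are averaged in like real budgets.
mean_with_zeros = budget.mean()

# After replacing 0 with NaN, the mean only uses the known budgets.
budget_clean = budget.replace(0, np.nan)
mean_without_zeros = budget_clean.mean()

# count() reports how many non-null values are left.
non_null = budget_clean.count()

print(mean_with_zeros, mean_without_zeros, non_null)
```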
df.describe()
# count movies released per year
release_per_year = df.groupby('release_year').count()['id']
release_per_year.head()
We are working on the cleaned up dataset and we have our information for movies released per year. Let's plot this data to see the trend, which seems to be upward.
# plot movies released by year
plt.figure(figsize=(10, 6))
plt.plot(release_per_year)
# title and labels
plt.title('Movies released by Years')
plt.xlabel('Year')
plt.ylabel('Number of Movies');
Clearly the number of movies released is going up every year, except for a couple of times where it took a small dip. The curve also seems much steeper after the year 2000.
# minimum, maximum and mean runtime per year
# select the runtime column before aggregating so that text columns are not aggregated
min_runtime_per_year = df.groupby('release_year')['runtime'].min()
max_runtime_per_year = df.groupby('release_year')['runtime'].max()
mean_runtime_per_year = df.groupby('release_year')['runtime'].mean()
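As an alternative sketch, the three per-year statistics could be computed in a single call with `agg`. The frame below is made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Made-up runtimes for two release years.
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2001, 2001],
    'runtime': [90, 120, 80, 100, 150],
})

# One groupby, three aggregations; the result has one column per statistic.
runtime_stats = toy.groupby('release_year')['runtime'].agg(['min', 'max', 'mean'])
print(runtime_stats)
```

Each column of the result can then be plotted against the index, in the same way as the three separate series.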
Let's plot the data for minimum, maximum and mean runtime in different years to see if anything noticeable happened.
# build the index location for x-axis
min_index = min_runtime_per_year.index
max_index = max_runtime_per_year.index
mean_index = mean_runtime_per_year.index
# set axes for the plot
x1, y1 = min_index, min_runtime_per_year
x2, y2 = max_index, max_runtime_per_year
x3, y3 = mean_index, mean_runtime_per_year
# create the plot
plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Minimum')
plt.plot(x2, y2, label = 'Maximum')
plt.plot(x3, y3, label = 'Mean')
# title and labels
plt.title('Runtime by Years')
plt.xlabel('Year')
plt.ylabel('Duration (minutes)');
plt.legend(loc='upper left')
We can see that the average duration of movies more or less remains the same. However, there seems to be a constant trend of short films after the year 2000. The maximum length of movies also seems to increase overall, and we can see a few very lengthy movies in later years.
As we have two operational definitions for success, we will explore both aspects one by one.
In my first analysis, I will consider popularity as a measure of success. Higher popularity would mean more success. We will try to answer the following questions:
Let's see what the popularity of movies looks like when we take the budget into consideration. I will plot the mean popularity against the budget first.
# plot budget vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_budget = df.groupby('budget')['popularity'].mean()
plt.plot(mean_popularity_grouped_by_budget.index, mean_popularity_grouped_by_budget)
# title and labels
plt.title('Budget vs mean Popularity')
plt.xlabel('Budget (not adjusted for inflation)')
plt.ylabel('Popularity rating');
We can see that popularity seems to increase with budget in general. Although there is a huge dip at the end for movies with a very high budget, the general trend suggests that the higher the budget, the higher the popularity. Let's look at the scatter of budget against popularity to gain more insight into this.
# plot budget vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['popularity'])
# title and labels
plt.title('Budget vs Popularity')
plt.xlabel('Budget (adjusted for inflation)')
plt.ylabel('Popularity rating');
From the above scatter plot, we can see that as the budget increases, the movies seem to be slightly more popular, but this is not a definitive trend. Some of the most popular movies lie in the middle of our budget scale. We can see that we have popular movies at all budget levels, but most of the lower budget movies seem to be less popular. The dip in the mean popularity is explained by the low popularity and low movie count at the high-budget end.
So the budget of a movie might not be a direct factor in success, but movies with a higher budget tend to be more popular.
It doesn't, however, mean that a higher budget makes a movie popular.
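One way to put a number on "slightly more popular" is a correlation coefficient. The sketch below uses made-up values, not the dataset, and the Pearson correlation is my own choice of measure rather than something used elsewhere in this study. It only describes a linear association and says nothing about causation.

```python
import pandas as pd

# Made-up budgets and popularity scores for illustration only.
toy = pd.DataFrame({
    'budget_adj': [1e6, 5e6, 2e7, 1e8],
    'popularity': [0.3, 0.9, 0.7, 2.5],
})

# Series.corr defaults to the Pearson correlation; rows with nulls are ignored.
r = toy['budget_adj'].corr(toy['popularity'])
print(round(r, 2))
```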
Let's see what the popularity of movies looks like when we take the movie runtime into consideration. We will first look at the mean popularity and then look at the scatter plot to analyze further.
# plot runtime vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_runtime = df.groupby('runtime')['popularity'].mean()
plt.plot(mean_popularity_grouped_by_runtime.index, mean_popularity_grouped_by_runtime)
# title and labels
plt.title('Runtime vs mean Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');
We can see that popularity initially increases with runtime and then starts falling. There is a peak at around 155-175 minutes of runtime. Let's analyze the distribution of runtime and popularity with a scatter plot now.
# plot runtime vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['runtime'], df['popularity'])
# title and labels
plt.title('Runtime vs Popularity')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Popularity rating');
We can see that movies with a runtime of around 120 to 140 minutes are more popular. So short films are not doing that well. Very long ones are also not performing well. So runtime does seem to affect the popularity of movies. People like their movies around 2 hours long.
Movies which are around 120 to 140 minutes long seem to be more popular with audiences. The most popular movies have a runtime of around 120 minutes. However, movies of around 150-170 minutes have a higher popularity on average.
Let's look at the change in popularity over the years. To get the trend for popularity, we can look at the mean popularity per year first. Later we will analyze the distribution of popularity from a scatter plot over years.
# plot mean popularity vs year
mean_popularity_grouped_by_year = df.groupby('release_year')['popularity'].mean()
mean_index = mean_popularity_grouped_by_year.index
# set axes for the plot
x1, y1 = mean_index, mean_popularity_grouped_by_year
plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean Popularity')
# title and labels
plt.title('Popularity by Years')
plt.xlabel('Year')
plt.ylabel('Popularity');
plt.legend(loc='upper left')
Popularity seems to be moving upwards as the years progress. This seems plausible, as newer movies are more easily discovered and hence will be more popular. Since the mean only measures central tendency, let's look at the individual values in a scatter plot.
# plot release year vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['popularity'])
# title and labels
plt.title('Release year vs Popularity')
plt.xlabel('Release year')
plt.ylabel('Popularity rating');
We can see that most movies over the years are not very popular, but there is a definite rise in popular movies over the years.
Movies which were released in the recent past seem to be more popular. So there is a good chance of success for newly released movies.
Given that a lot of new movies are still densely located at low popularity values, release year does not ensure success as it might seem from the mean plot above.
Let's see how popularity is affected by user ratings. Although the two seem to measure the same thing, let's check whether there is a relation between them. We will plot two trends, one that takes the number of user ratings into account and one that takes the average user rating. Again we will consider mean popularity first and then look at the distribution in scatter plots.
# plot average vote vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_average = df.groupby('vote_average')['popularity'].mean()
plt.plot(mean_popularity_grouped_by_vote_average.index, mean_popularity_grouped_by_vote_average)
# title and labels
plt.title('Average vote vs mean Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');
We can see that on average, movies with a higher vote average are more popular. Again we see a dip in popularity at the extreme of the average vote, where a highly voted movie is not very popular. But the general trend suggests that the better the vote average, the higher the popularity, and hence the more successful the movie. Now let's look at the distribution of vote average against popularity in a scatter plot.
# plot vote average vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['popularity'])
# title and labels
plt.title('Average vote vs Popularity')
plt.xlabel('Average vote')
plt.ylabel('Popularity rating');
While we can see that movies with higher average vote have a better chance of being popular, we have some highly voted movies that are not very popular.
While movies with a high vote average seem to be more popular, there is still a chance that a movie with a high vote average might not be popular. But a low vote average definitely indicates that a movie won't be popular.
Now let's look at how the number of votes cast relates to the popularity of a movie. We will look at the mean popularity first and then at the scatter plot to analyze further.
# plot vote count vs average Popularity
plt.figure(figsize=(10, 6))
mean_popularity_grouped_by_vote_count = df.groupby('vote_count')['popularity'].mean()
plt.plot(mean_popularity_grouped_by_vote_count.index, mean_popularity_grouped_by_vote_count)
# title and labels
plt.title('Vote count vs mean Popularity')
plt.xlabel('Vote count')
plt.ylabel('Popularity rating');
We can infer from this graph that on average, movies with more votes cast seem to be more popular. Let's look at the distribution of vote count and popularity to see if that also confirms our observation.
# plot vote count vs popularity
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_count'], df['popularity'])
# title and labels
plt.title('Vote count vs Popularity')
plt.xlabel('vote count')
plt.ylabel('Popularity rating');
We can see in the graph that popularity increases as the number of votes increases. This makes sense: the more people watch a movie, the more people will vote and the more popular the movie will be.
The more votes cast, the more popular the movie is. This loop feeds itself, as more votes mean more popularity, and then more people watch the movie and more of them vote.
In the second part of my analysis, I will consider percentage profit or return on investment as a measure of success. Higher ROI would mean more success. We will try to answer the following questions:
First things first, let's calculate ROI percent in our data.
# calculate the ROI per 100
df['roi'] = ((df['revenue_adj'] - df['budget_adj']) / df['budget_adj'])*100
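To make the formula concrete, here is a small worked sketch with made-up revenue and budget figures. The helper name `roi_percent` is hypothetical and only mirrors the column calculation above.

```python
def roi_percent(revenue, budget):
    """Return on investment as a percentage of the budget."""
    return (revenue - budget) / budget * 100

# A movie that earns three times its budget has an ROI of 200%.
typical = roi_percent(revenue=300.0, budget=100.0)

# A movie that earns half its budget has a negative ROI.
loss = roi_percent(revenue=50.0, budget=100.0)

# A tiny budget in the denominator inflates the ROI enormously.
tiny_budget = roi_percent(revenue=1_000_000.0, budget=100.0)

print(typical, loss, tiny_budget)
```

The last case shows why a wrongly recorded, very small budget would produce an extreme ROI value.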
In this part of our analysis, success is determined by ROI. First let's have a look at the ROI over years.
# plot year of release vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])
# title and labels
plt.title('ROI by Year of release')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');
It looks like ROI remains low for the most part. But for some reason, around 1985-1987, there were massive profits. Let's look at a few rows with such high profits to see if there is some error in the data.
movies_with_high_roi = df.query('roi > 100000')
movies_with_high_roi.head()
This just doesn't look right. A movie with a runtime of 115 minutes can't have a budget of $114. The other movies also have incorrect budget data, hence the unusual ROI. I will filter out ROIs greater than 1000% (a profit of more than 10 times the budget) and then plot the same graph again. Before that, I will plot the mean ROI by year to see the trend after cleaning up the data.
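One side effect of a filter such as roi < 1000 is worth keeping in mind: a comparison with a null value is False, so rows without an ROI (movies with a missing budget or revenue) are dropped along with the extreme values. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal ROI, an unrealistic ROI and a missing ROI.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'roi': [150.0, 250000.0, np.nan],
})

# NaN < 1000 evaluates to False, so the row with the missing ROI is removed too.
kept = toy.query('roi < 1000')
print(kept['id'].tolist())
```

For the ROI analysis this is harmless, since those rows carry no ROI information anyway.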
# remove data where ROI is unrealistic. For my study, I have decided that an ROI of more than 1000% is not realistic.
# query(..., inplace=True) modifies df and returns None, so the result is not assigned
df.query('roi < 1000', inplace=True)
normal_roi_mean = df.groupby('release_year')['roi'].mean()
mean_index = normal_roi_mean.index
# set axes for the plot
x1, y1 = mean_index, normal_roi_mean
plt.figure(figsize=(10, 6))
plt.plot(x1, y1, label = 'Mean ROI')
# title and labels
plt.title('Mean ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');
plt.legend(loc='upper left')
Now this filtered data looks more plausible. We can see that ROI is going down over the years. Around the 1970s, there seem to be some outliers with great profit and very heavy loss, but the rest of the trend suggests a dip in ROI. The curve is going upwards from around 2000, but the slope is low. Let's also take a look at the ROI distribution using a scatter plot.
# plot release year vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['release_year'], df['roi'])
# title and labels
plt.title('ROI by Years')
plt.xlabel('Year of release')
plt.ylabel('ROI (in %)');
We can see that as the years pass, the number of movies with low ROI increases. This brings the average down and explains the previous plot.
ROI is going down on average as the years pass. This can be due to higher budgets or to more movies coming in and not performing well. However, we can see that the number of movies with higher ROI is also increasing. This can be seen in the density of the above scatter plot. So movies released in later years have a lower chance of being successful if we just look at the release year in isolation. The possible reasons could be an increase in budgets and a flood of movies causing a decrease in revenue.
To look at the effect of popularity on ROI, let's plot ROI against popularity. Again we will first take a look at the mean ROI grouped by popularity and then look at the scatter plot to further analyze the trend.
# plot popularity vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_popularity = df.groupby('popularity')['roi'].mean()
plt.plot(mean_roi_grouped_by_popularity.index, mean_roi_grouped_by_popularity)
# title and labels
plt.title('Popularity vs mean ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');
We don't have a very conclusive trend here, but as the popularity increases, the average ROI seems to go up. Let's look at the scatter plot to see the reason for this strange distribution.
# plot popularity vs ROI
plt.figure(figsize=(10, 6))
plt.scatter(df['popularity'], df['roi'])
# title and labels
plt.title('Popularity vs ROI')
plt.xlabel('Popularity rating')
plt.ylabel('ROI (in %)');
If we just look at the above graph, it suggests that as popularity increases, there is an increase in ROI. However, movies which are not very popular also have a great ROI. One possible cause is that we have incorrect revenue and budget data, and hence the calculated ROI is not correct. But if we look at the highly popular movies, they seem to have a better ROI.
There isn't a very strong connection until the popularity goes beyond a value of about 3-4. After that we can see a positive relationship, in the sense that the more popular a movie, the better the ROI.
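Because popularity is a continuous value, grouping by its exact value puts almost every movie in its own group, which makes the mean plot noisy. A possible alternative, sketched here on made-up numbers with bin edges I picked arbitrarily, is to cut popularity into ranges first and average ROI within each range.

```python
import pandas as pd

# Made-up popularity and ROI values for illustration only.
toy = pd.DataFrame({
    'popularity': [0.2, 0.8, 1.5, 1.9, 3.2, 3.9],
    'roi': [10.0, 30.0, 50.0, 70.0, 200.0, 300.0],
})

# Arbitrary bin edges: (0, 1], (1, 2] and (2, 4].
bins = [0, 1, 2, 4]
toy['popularity_bin'] = pd.cut(toy['popularity'], bins=bins)

# Mean ROI per popularity range instead of per exact popularity value.
mean_roi_by_bin = toy.groupby('popularity_bin', observed=True)['roi'].mean()
print(mean_roi_by_bin)
```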
We have seen earlier that a runtime of 120-140 minutes seems to be a common factor among popular movies. So my initial guess would be that these runtimes should have a higher ROI as well. Let's plot runtime against ROI and see what the trend suggests.
# plot runtime vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('runtime')['roi'].mean()
plt.plot(mean_roi.index, mean_roi)
# title and labels
plt.title('Runtime vs mean ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');
We can see that the mean ROI goes up towards the 140-180 minute mark. So on average, movies of this length have a higher ROI. Let's also look at the ROI distribution against runtime.
# plot runtime vs ROI
plt.figure(figsize=(10, 6))
plt.scatter(df['runtime'], df['roi'])
# title and labels
plt.title('Runtime vs ROI')
plt.xlabel('Runtime (minutes)')
plt.ylabel('ROI (in %)');
We can see in this distribution that in general, movies with a runtime of around 140-180 minutes do have a higher ROI. However, there is a good density of movies around the 80-120 minute mark with good ROI. Since this runtime range contains a lot of movies, many of them with low ROI, its mean ROI goes down.
I can see that there is a good density of high ROI movies in the 80-120 minute interval. Also, the 140-180 minute interval has a high mean ROI. These two intervals would be good runtimes to look at for high ROI movies.
While budget is directly involved in calculating the ROI, I wanted to see if high budget movies are getting a better ROI, because a higher budget can enable a better cast and crew. Let's assess both the mean ROI and the ROI distribution against budget.
# plot budget vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['budget_adj'], df['roi'])
# title and labels
plt.title('Budget vs ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');
We can see that there is a lot of density towards low budget and low ROI. As the budget is increased, there seems to be a slight increase in ROI. We can also see a nice chunk of low budget movies getting good ROI. Let's see how the mean ROI looks against budget.
# plot budget vs average ROI
plt.figure(figsize=(10, 6))
mean_roi_grouped_by_budget = df.groupby('budget_adj')['roi'].mean()
plt.plot(mean_roi_grouped_by_budget.index, mean_roi_grouped_by_budget)
# title and labels
plt.title('Budget vs mean ROI')
plt.xlabel('Budget (adjusted for inflation in $)')
plt.ylabel('ROI (in %)');
The mean ROI can also be seen going upwards with an increase in budget. Although the curve is not very smooth, if we have to generalize and look for a common relationship, a higher budget can be a common factor in successful movies. I am discounting the high ROI movies with very low budgets because they are not realistic and look more like incorrect data.
The general trend suggests that for a higher ROI, either keep a very small budget or have a larger budget. Both areas seem to have better ROIs. Since budget is the denominator when calculating ROI, a small budget with decent revenue would result in a better ROI.
Lastly, let's look at the average user ratings and their relationship with ROI.
# plot vote average vs ROI
plt.figure(figsize=(12, 6))
plt.scatter( df['vote_average'], df['roi'])
# title and labels
plt.title('Vote Average vs ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');
We can see that a lot of movies with better ROI have a larger vote average. So high average vote looks like a characteristic of successful movie. However we can see that there is a good density of non-successful movies as well with high vote average. Let's look at the mean ROI to see if we can get a trend here.
# plot vote average vs average ROI
plt.figure(figsize=(10, 6))
mean_roi = df.groupby('vote_average')['roi'].mean()
plt.plot(mean_roi.index, mean_roi)
# title and labels
plt.title('Vote Average vs mean ROI')
plt.xlabel('Vote average')
plt.ylabel('ROI (in %)');
Well, the mean ROI clears things up for vote average. We can see a very clear trend that ROI on average increases with the average vote. There are a few outliers, but a high average vote seems to go hand in hand with the success of a movie.
My research goals were to answer three major questions:
How does number of movies released per year change?
From my analysis, we clearly saw that the number of movies released per year shows an increasing trend. It more or less exploded after the year 2000.
How does runtime of movies change over years?
We saw that as the years passed, the average duration of movies remained more or less the same. However, we now have a lot of short films, as well as a few very lengthy films.
What factors can make a movie successful?
We had two operational definitions of success, popularity and ROI. From my study of the given dataset, I can conclude that the following are the recipes for successful movies:
So we can see that a high vote average and a high vote count are part of success either way. Recently released movies seem to be more popular, but ROI has been going down in recent years, so movies released in the past are more successful in terms of ROI. We don't have a clear runtime slot, but short movies don't seem to be doing well under either definition.
First and foremost, the study and conclusions are limited by the quality of data. As seen in the wrangling phase, we had lots of missing values for budget and revenue. Later on while plotting ROI, we saw that budget values might be wrong as well. I did a filter based on my guess of ROI, but this still pollutes the result. Also some assumptions were made about outliers and we proceeded to look at the general trend.
This study shows definitive results for movies released per year and runtime, but it doesn't guarantee a recipe for success while evaluating the third question. It just points out what things are common across successful movies. While these things are common, they might not be the reason for the success of movie.