SQL Project: Analyzing Movie Data

Author: Sia Phulambrikar

In this project, we dive into a database of Movies, unraveling insights through SQL. I start with broad exploratory analysis about the dataset, and then delve deeper into genre-specific characteristics of movies, characters and dialogues. Read on as we uncover fascinating insights about Hollywood and the Silver Screen :)

Skills: SQL

1. Importing Dependancies and Loading Data

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import inspect

movie_engine = create_engine('sqlite:///movie_lines.db')
insp = inspect(movie_engine)
table_names = insp.get_table_names()
con = movie_engine.connect()

2. Exploratory Analysis:

How many movies are in the database?
How many movies does each genre contain?
What release years do the movies span?

Screenshot 2023-12-22 at 11.22.57 PM.png

Q1. How many movies are in the database?

query = '''
SELECT COUNT(movie_id) as n_movies
FROM movies
'''
pd.read_sql_query(query, con)

	n_movies
0	617

There are 617 movies in the database.

Q2. How many movies does the database contain per genre?

Display results for the top 3 genres.

query = '''
SELECT genres.name as genre_name, COUNT(movies.movie_id) as n_movies
FROM movies
JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
GROUP BY genres.genre_id
ORDER BY n_movies DESC
LIMIT 3
'''
pd.read_sql_query(query, con)

	genre_name	n_movies
0	drama	320
1	thriller	265
2	comedy	159

The database contains the most number of Drama movies, at 320, followed by Thriller at 265. The total number of genres in the database are 23 (same as the number of rows).

Q3. What release years do the movies span?

query = '''
SELECT MIN(year) as start_year, MAX(year) as end_year
FROM movies
'''
pd.read_sql_query(query, con)

	start_year	end_year
0	1927	2010

The dataset spans movies released between 1927-2010.

3. Analysis Questions: Digging Deeper

Which are the highest rated movies?
Which genre receives the best and worst average ratings?
Which are the most popular movies from each genre?
Are there movies with an all-male or all-female cast?

Understanding dialogues in movies

What are the average number of conversations per movie? Does this vary by genre?
Does the length of conversation vary by movie genre?

Delving into the Golden Age of Hollywood

Which were the most popular movies defining the Golden Age of Hollywood?
Which were the Top 5 characters during the Golden Age?

Q1. What are the Top 5 Highest Rated Movies?

query = '''
SELECT movie_id, title as movie_title, imdb_rating
FROM movies
ORDER BY imdb_rating DESC
LIMIT 5
'''
pd.read_sql_query(query, con)

	movie_id	movie_title	imdb_rating
0	457	neuromancer	9.3
1	203	the godfather	9.2
2	369	the godfather: part ii	9.0
3	463	one flew over the cuckoo’s nest	8.9
4	504	schindler’s list	8.9

Movies with the highest ratings are Neuromancer, The Godfather, The Godfather Part II, One Flew Over the Cuckoo’s Nest and Schindler’s List. The highest rated movie in the database, Neuromancer, has an IMDB rating of 9.3.

Q2. What genre recives the best and worst average ratings?

query = '''
WITH MovieGenres (movie_id, title, year, imdb_rating, imdb_votes, url, genre_id, genre_name)
AS 
    (SELECT movies.*, genres.*
    FROM movies
        JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
        JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
    )
SELECT genre_id, genre_name, AVG(imdb_rating) as avg_rating
FROM MovieGenres
GROUP BY genre_id
ORDER BY avg_rating DESC
'''
pd.read_sql_query(query,con)

	genre_id	genre_name	avg_rating
0	18	film-noir	8.400000
1	11	war	7.756522
2	5	biography	7.684000
3	21	history	7.547619
4	4	sport	7.325000
5	23	musical	7.275000
6	19	drama	7.236562
7	7	romance	7.050758
8	9	music	7.015385
9	6	crime	7.001379
10	16	mystery	6.968000
11	8	animation	6.962500
12	1	family	6.956250
13	24	adventure	6.851402
14	10	comedy	6.751572
15	3	western	6.741667
16	15	thriller	6.704906
17	12	sci-fi	6.665179
18	2	fantasy	6.612162
19	22	documentary	6.600000
20	17	short	6.560000
21	20	action	6.544304
22	14	adult	6.300000
23	13	horror	6.057143

From the above table, we see that Film Noir appears to be the most popular genre on average, with an average IMDB rating of 8.4. War and Biography movies also are rated quite highly on average. Genres rated lowest, on average, are Horror, Adult and Action, each with average ratings below 6.6.

Q3. Which are the best rated movies from the 5 most popular genres?

Here, we find the best rated movies focusing on the 5 most popular genres found above, which were Film Noir, War, Biography, History and Sport.

query = '''
WITH MovieGenres(movie_id, title, year, imdb_rating, imdb_votes, url, genre_id, genre_name)
AS
(SELECT movies.*, genres.*
FROM movies
    JOIN movie_genre_linking as link
    ON movies.movie_id = link.movie_id
    
    JOIN genres ON genres.genre_id = link.genre_id
)

SELECT genre_id, genre_name, title as movie_title, MAX(imdb_rating) as imdb_rating
FROM MovieGenres
GROUP BY genre_id
ORDER BY AVG(imdb_rating) DESC
LIMIT 5
'''
pd.read_sql_query(query,con)

	genre_id	genre_name	movie_title	imdb_rating
0	18	film-noir	sunset blvd.	8.7
1	11	war	schindler’s list	8.9
2	5	biography	schindler’s list	8.9
3	21	history	schindler’s list	8.9
4	4	sport	raging bull	8.4

On inspecting the highest rated movies from the most popular genres, we see some popular names pop up such as Schindler’s List, the highest rated War movie. However, Schindler’s List appears under multiple genres including Biography and History. This is because each movie is linked to multiple possible genres under the movie_genre_linking table. Thus, movies with high ratings like Neuromancer (9.3) and Godfather (9.2) appear mutliple times.

Q4. Are there movies with an all-male or all-female cast? If not, which movies have the highest male:female ratio?

#movies with an all-male cast
query = '''
SELECT movie_id, title as movie_title, n_female, n_male

FROM 
    (SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female, 
        COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male 
    FROM characters
    JOIN movies ON characters.movie_id = movies.movie_id
    GROUP by characters.movie_id
    )

WHERE n_female = 0 AND n_male > 0
ORDER BY n_male DESC
'''
pd.read_sql_query(query, con)

	movie_id	movie_title	n_female	n_male
0	247	apocalypse now	0	8
1	170	reservoir dogs	0	7
2	525	south park: bigger longer & uncut	0	7
3	536	dr. strangelove or: how i learned to stop worr…	0	7
4	37	the boondock saints	0	6
…	…	…	…	…
68	277	la battaglia di algeri	0	1
69	315	dark city	0	1
70	478	predator	0	1
71	488	red white black & blue	0	1
72	572	ticker	0	1

73 rows × 4 columns

We can see that Apocalypse Now had the largest all-male cast in our dataset with an 8:0 male:female ratio, followed by Reservoir Dogs at 7:0. A simple google search about the Apocalypse Now cast would reveal that the results are correct - the film did comprise of an all-male cast. In total, there are 73 such movies with an all-male cast in the dataset.

query = '''
SELECT movie_id, title AS movie_title, n_female, n_male

FROM 
    (SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female, 
        COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male 
    FROM characters
    JOIN movies ON characters.movie_id = movies.movie_id
    GROUP by characters.movie_id
    )

WHERE n_male = 0 AND n_female > 0
ORDER BY n_female DESC
'''
pd.read_sql_query(query, con)

	movie_id	movie_title	n_female
0	51	drop dead gorgeous	3
1	234	agnes of god	3
2	384	heavenly creatures	3
3	393	hellraiser iii: hell on earth	3
4	138	my mother dreams the satan’s disciples in new …	2
5	134	metropolis	1
6	459	the nightmare before christmas	1
7	500	salt of the earth	1

There are only 7 movies with an all female dataset. The top of this list is Drop Dead Gorgeous, a 1999 film which indeed has an all female cast. This is tied with Agnes of God, Heavenly Creatures and Hellraiser III: Hell on Earth.

Q5. What are the average number of conversations per movie? Which movies have the maximum number of conversations, and which have the minimum?

#average number of conversations overall
query = '''
WITH conversation_count (movie_id, n_conversations) 
AS 
(
SELECT movie_id, COUNT(conversation_id) as n_conversations
FROM conversations
GROUP BY movie_id
)

SELECT AVG(n_conversations) as avg_conversations
FROM conversation_count;
'''
pd.read_sql_query(query, con)

	avg_conversations
0	134.679092

#movies with the highest number of conversations
query = '''
WITH conversation_count (movie_id, n_conversations, title) 
AS 
(
SELECT movies.movie_id, COUNT(conversation_id) as n_conversations, movies.title
FROM conversations
INNER JOIN movies ON conversations.movie_id = movies.movie_id
GROUP BY conversations.movie_id
)

SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MAX(n_conversations) FROM conversation_count) 

UNION
SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MIN(n_conversations) FROM conversation_count)
;
'''
pd.read_sql_query(query, con)

	movie_id	n_conversations	title
0	289	338	casino
1	406	1	the jazz singer
2	602	1	what women want

The average number of conversations per movie is 134. The movie with the highest number of conversations (338) is Casino. The movies with the lowest (1) are the Jazz Singer and What Women Want. The Jazz Singer, interestingly, was one of the first movies to have a synchronized vocal and musical track, a milestone in the sound revolution. The movie was all-silent except just two minutes of vocal dialogue.

Q6. Does the length of conversations vary by movie genre?

query = '''
WITH line_counter 
AS
(SELECT conversation_id, COUNT(line_sort) as n_lines, lines.movie_id, genres.genre_id, genres.name
FROM lines
    JOIN movie_genre_linking 
    ON lines.movie_id = movie_genre_linking.movie_id

    JOIN genres
    ON movie_genre_linking.genre_id = genres.genre_id

GROUP BY conversation_id
)

SELECT genre_id, name, AVG(n_lines) as avg_lines
FROM line_counter
GROUP BY genre_id
ORDER BY avg_lines DESC
'''
pd.read_sql_query(query, con)

	genre_id	name	avg_lines
0	4	sport	13.279215
1	1	family	13.278655
2	6	crime	13.102732
3	8	animation	12.964912
4	2	fantasy	12.472645
5	3	western	12.229846
6	5	biography	11.513866
7	7	romance	10.608152
8	12	sci-fi	10.496909
9	9	music	9.912390
10	11	war	9.907377
11	13	horror	9.429785
12	15	thriller	8.814720
13	10	comedy	8.718647
14	18	film-noir	7.795222
15	16	mystery	7.543767
16	17	short	5.480176
17	20	action	4.967136
18	19	drama	4.791919
19	22	documentary	3.094340

The longest conversations, on average, occur in Sport genre movies, with Family and Crime movies being close runners-up. The shortest conversations, on the other hand, occur in Documentary films, followed by Drama and Action.

This could also be attributed to the way the dataset has been created - monologues by characters count as a single line in a “conversation”. Thus, in documentaries, where one documentary narrator might be speaking multiple lines at once in a monological fashion, this might be considered a single line for the dataset, leading to the least number of lines per conversation.

Q7. Which were the most popular movies defining the Golden Age of Hollywood?

The Golden Age of Hollywood was characterized by sound in movies, dominance of the “Big 5” studios, star power, glitz and glamour. The period popularly began in 1927 with the release of Jazz Singer, the first film to experiment with sound and faltered around 1950.

query = '''
SELECT title, year, imdb_rating
FROM movies
WHERE year >= 1927 AND year <= 1950
ORDER BY imdb_rating DESC
LIMIT 6
'''

pd.read_sql_query(query, con)

	title	year	imdb_rating
0	casablanca	1942	8.8
1	it’s a wonderful life	1946	8.7
2	sunset blvd.	1950	8.7
3	citizen kane	1941	8.6
4	all about eve	1950	8.5
5	the third man	1949	8.5

The most popular IMDB rated movies from this period were Casablanca, It’s a Wonderful Life, Sunset Blvd, Citizen Kane (which many refer to as the greatest movie ever made) and All About Eve. It’s curious that Jazz Singer doesn’t appear in this list. We can also look up the specific record for The Jazz Singer and compare its ratings to the Top 5 mentioned above.

query = '''
SELECT title, year, imdb_rating
FROM movies
WHERE title LIKE '%jazz%singer%'
'''
pd.read_sql_query(query, con)

	title	year	imdb_rating
0	the jazz singer	1927	6.8

Interestingly, IMDB rating for the Jazz Singer is 6.8, much lower than the top 5 movies from this era listed above.

Q8. Who were the most prominent characters from the Top 5 Golden Age Movies?

query = '''
SELECT title AS movie_title, imdb_rating, name as character_name, n_lines
FROM
    (WITH line_counter 
    AS
        (SELECT movies.movie_id, movies.title, movies.imdb_rating, characters.character_id, name, COUNT(line_id) as n_lines
        FROM movies
            JOIN characters ON movies.movie_id = characters.movie_id
            JOIN lines ON characters.character_id = lines.character_id
        WHERE year >= 1927 AND year <= 1950
        GROUP BY characters.character_id
        ORDER BY imdb_rating DESC
        )
    
    SELECT *, ROW_NUMBER() over (PARTITION BY movie_id ORDER BY n_lines DESC) AS line_rank
    FROM line_counter
    GROUP BY character_id
    ORDER BY imdb_rating DESC)
WHERE line_rank <=2
LIMIT 10
'''

pd.read_sql_query(query,con)

	movie_title	imdb_rating	character_name	n_lines
0	casablanca	8.8	RICK	281
1	casablanca	8.8	RENAULT	135
2	it’s a wonderful life	8.7	GEORGE	275
3	it’s a wonderful life	8.7	MARY	57
4	sunset blvd.	8.7	GILLIS	278
5	sunset blvd.	8.7	NORMA	158
6	citizen kane	8.6	KANE	221
7	citizen kane	8.6	LELAND	93
8	all about eve	8.5	MARGO	314
9	all about eve	8.5	EVE	232

The table above gives us the two most prominent characters from the most popular Golden Age movies. I determined the most prominent characters by ordering them by their number of dialogues in the movie. For the movie Casablanca, for instance, we get ‘Rick’ and ‘Renault’ as the two most prominent characters. While the dataset doesn’t include last names of characters, movie buffs might know that Rick Blaine was Humphrey Bogart’s famous character in Casablanca, while Renault refers to Captain Louis Renault from the movie.