SQL Project: Analyzing Movie Data

Author: Sia Phulambrikar

In this project, we dive into a database of Movies, unraveling insights through SQL. I start with broad exploratory analysis about the dataset, and then delve deeper into genre-specific characteristics of movies, characters and dialogues. Read on as we uncover fascinating insights about Hollywood and the Silver Screen :)

Skills: SQL

1. Importing Dependancies and Loading Data

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import inspect
movie_engine = create_engine('sqlite:///movie_lines.db')
insp = inspect(movie_engine)
table_names = insp.get_table_names()
con = movie_engine.connect()

2. Exploratory Analysis:

  • How many movies are in the database?
  • How many movies does each genre contain?
  • What release years do the movies span?
Screenshot 2023-12-22 at 11.22.57 PM.png
Screenshot 2023-12-22 at 11.22.57 PM.png

Q1. How many movies are in the database?

query = '''
SELECT COUNT(movie_id) as n_movies
FROM movies
'''
pd.read_sql_query(query, con)
n_movies
0 617

There are 617 movies in the database.

Q2. How many movies does the database contain per genre?

Display results for the top 3 genres.

query = '''
SELECT genres.name as genre_name, COUNT(movies.movie_id) as n_movies
FROM movies
JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
GROUP BY genres.genre_id
ORDER BY n_movies DESC
LIMIT 3
'''
pd.read_sql_query(query, con)
genre_name n_movies
0 drama 320
1 thriller 265
2 comedy 159

The database contains the most number of Drama movies, at 320, followed by Thriller at 265. The total number of genres in the database are 23 (same as the number of rows).

Q3. What release years do the movies span?

query = '''
SELECT MIN(year) as start_year, MAX(year) as end_year
FROM movies
'''
pd.read_sql_query(query, con)
start_year end_year
0 1927 2010

The dataset spans movies released between 1927-2010.

3. Analysis Questions: Digging Deeper

  • Which are the highest rated movies?
  • Which genre receives the best and worst average ratings?
  • Which are the most popular movies from each genre?
  • Are there movies with an all-male or all-female cast?
Understanding dialogues in movies
  • What are the average number of conversations per movie? Does this vary by genre?
  • Does the length of conversation vary by movie genre?
Delving into the Golden Age of Hollywood
  • Which were the most popular movies defining the Golden Age of Hollywood?
  • Which were the Top 5 characters during the Golden Age?

Q1. What are the Top 5 Highest Rated Movies?

query = '''
SELECT movie_id, title as movie_title, imdb_rating
FROM movies
ORDER BY imdb_rating DESC
LIMIT 5
'''
pd.read_sql_query(query, con)
movie_id movie_title imdb_rating
0 457 neuromancer 9.3
1 203 the godfather 9.2
2 369 the godfather: part ii 9.0
3 463 one flew over the cuckoo’s nest 8.9
4 504 schindler’s list 8.9

Movies with the highest ratings are Neuromancer, The Godfather, The Godfather Part II, One Flew Over the Cuckoo’s Nest and Schindler’s List. The highest rated movie in the database, Neuromancer, has an IMDB rating of 9.3.

Q2. What genre recives the best and worst average ratings?

query = '''
WITH MovieGenres (movie_id, title, year, imdb_rating, imdb_votes, url, genre_id, genre_name)
AS 
    (SELECT movies.*, genres.*
    FROM movies
        JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
        JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
    )
SELECT genre_id, genre_name, AVG(imdb_rating) as avg_rating
FROM MovieGenres
GROUP BY genre_id
ORDER BY avg_rating DESC
'''
pd.read_sql_query(query,con)
genre_id genre_name avg_rating
0 18 film-noir 8.400000
1 11 war 7.756522
2 5 biography 7.684000
3 21 history 7.547619
4 4 sport 7.325000
5 23 musical 7.275000
6 19 drama 7.236562
7 7 romance 7.050758
8 9 music 7.015385
9 6 crime 7.001379
10 16 mystery 6.968000
11 8 animation 6.962500
12 1 family 6.956250
13 24 adventure 6.851402
14 10 comedy 6.751572
15 3 western 6.741667
16 15 thriller 6.704906
17 12 sci-fi 6.665179
18 2 fantasy 6.612162
19 22 documentary 6.600000
20 17 short 6.560000
21 20 action 6.544304
22 14 adult 6.300000
23 13 horror 6.057143

From the above table, we see that Film Noir appears to be the most popular genre on average, with an average IMDB rating of 8.4. War and Biography movies also are rated quite highly on average. Genres rated lowest, on average, are Horror, Adult and Action, each with average ratings below 6.6.

Q4. Are there movies with an all-male or all-female cast? If not, which movies have the highest male:female ratio?

#movies with an all-male cast
query = '''
SELECT movie_id, title as movie_title, n_female, n_male

FROM 
    (SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female, 
        COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male 
    FROM characters
    JOIN movies ON characters.movie_id = movies.movie_id
    GROUP by characters.movie_id
    )

WHERE n_female = 0 AND n_male > 0
ORDER BY n_male DESC
'''
pd.read_sql_query(query, con)
movie_id movie_title n_female n_male
0 247 apocalypse now 0 8
1 170 reservoir dogs 0 7
2 525 south park: bigger longer & uncut 0 7
3 536 dr. strangelove or: how i learned to stop worr… 0 7
4 37 the boondock saints 0 6
68 277 la battaglia di algeri 0 1
69 315 dark city 0 1
70 478 predator 0 1
71 488 red white black & blue 0 1
72 572 ticker 0 1

73 rows × 4 columns

We can see that Apocalypse Now had the largest all-male cast in our dataset with an 8:0 male:female ratio, followed by Reservoir Dogs at 7:0. A simple google search about the Apocalypse Now cast would reveal that the results are correct - the film did comprise of an all-male cast. In total, there are 73 such movies with an all-male cast in the dataset.

query = '''
SELECT movie_id, title AS movie_title, n_female, n_male

FROM 
    (SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female, 
        COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male 
    FROM characters
    JOIN movies ON characters.movie_id = movies.movie_id
    GROUP by characters.movie_id
    )

WHERE n_male = 0 AND n_female > 0
ORDER BY n_female DESC
'''
pd.read_sql_query(query, con)
movie_id movie_title n_female n_male
0 51 drop dead gorgeous 3 0
1 234 agnes of god 3 0
2 384 heavenly creatures 3 0
3 393 hellraiser iii: hell on earth 3 0
4 138 my mother dreams the satan’s disciples in new … 2 0
5 134 metropolis 1 0
6 459 the nightmare before christmas 1 0
7 500 salt of the earth 1 0

There are only 7 movies with an all female dataset. The top of this list is Drop Dead Gorgeous, a 1999 film which indeed has an all female cast. This is tied with Agnes of God, Heavenly Creatures and Hellraiser III: Hell on Earth.

Q5. What are the average number of conversations per movie? Which movies have the maximum number of conversations, and which have the minimum?

#average number of conversations overall
query = '''
WITH conversation_count (movie_id, n_conversations) 
AS 
(
SELECT movie_id, COUNT(conversation_id) as n_conversations
FROM conversations
GROUP BY movie_id
)

SELECT AVG(n_conversations) as avg_conversations
FROM conversation_count;
'''
pd.read_sql_query(query, con)
avg_conversations
0 134.679092
#movies with the highest number of conversations
query = '''
WITH conversation_count (movie_id, n_conversations, title) 
AS 
(
SELECT movies.movie_id, COUNT(conversation_id) as n_conversations, movies.title
FROM conversations
INNER JOIN movies ON conversations.movie_id = movies.movie_id
GROUP BY conversations.movie_id
)

SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MAX(n_conversations) FROM conversation_count) 

UNION
SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MIN(n_conversations) FROM conversation_count)
;
'''
pd.read_sql_query(query, con)
movie_id n_conversations title
0 289 338 casino
1 406 1 the jazz singer
2 602 1 what women want

The average number of conversations per movie is 134. The movie with the highest number of conversations (338) is Casino. The movies with the lowest (1) are the Jazz Singer and What Women Want. The Jazz Singer, interestingly, was one of the first movies to have a synchronized vocal and musical track, a milestone in the sound revolution. The movie was all-silent except just two minutes of vocal dialogue.

Q6. Does the length of conversations vary by movie genre?

query = '''
WITH line_counter 
AS
(SELECT conversation_id, COUNT(line_sort) as n_lines, lines.movie_id, genres.genre_id, genres.name
FROM lines
    JOIN movie_genre_linking 
    ON lines.movie_id = movie_genre_linking.movie_id

    JOIN genres
    ON movie_genre_linking.genre_id = genres.genre_id

GROUP BY conversation_id
)

SELECT genre_id, name, AVG(n_lines) as avg_lines
FROM line_counter
GROUP BY genre_id
ORDER BY avg_lines DESC
'''
pd.read_sql_query(query, con)
genre_id name avg_lines
0 4 sport 13.279215
1 1 family 13.278655
2 6 crime 13.102732
3 8 animation 12.964912
4 2 fantasy 12.472645
5 3 western 12.229846
6 5 biography 11.513866
7 7 romance 10.608152
8 12 sci-fi 10.496909
9 9 music 9.912390
10 11 war 9.907377
11 13 horror 9.429785
12 15 thriller 8.814720
13 10 comedy 8.718647
14 18 film-noir 7.795222
15 16 mystery 7.543767
16 17 short 5.480176
17 20 action 4.967136
18 19 drama 4.791919
19 22 documentary 3.094340

The longest conversations, on average, occur in Sport genre movies, with Family and Crime movies being close runners-up. The shortest conversations, on the other hand, occur in Documentary films, followed by Drama and Action.

This could also be attributed to the way the dataset has been created - monologues by characters count as a single line in a “conversation”. Thus, in documentaries, where one documentary narrator might be speaking multiple lines at once in a monological fashion, this might be considered a single line for the dataset, leading to the least number of lines per conversation.

Q8. Who were the most prominent characters from the Top 5 Golden Age Movies?

query = '''
SELECT title AS movie_title, imdb_rating, name as character_name, n_lines
FROM
    (WITH line_counter 
    AS
        (SELECT movies.movie_id, movies.title, movies.imdb_rating, characters.character_id, name, COUNT(line_id) as n_lines
        FROM movies
            JOIN characters ON movies.movie_id = characters.movie_id
            JOIN lines ON characters.character_id = lines.character_id
        WHERE year >= 1927 AND year <= 1950
        GROUP BY characters.character_id
        ORDER BY imdb_rating DESC
        )
    
    SELECT *, ROW_NUMBER() over (PARTITION BY movie_id ORDER BY n_lines DESC) AS line_rank
    FROM line_counter
    GROUP BY character_id
    ORDER BY imdb_rating DESC)
WHERE line_rank <=2
LIMIT 10
'''

pd.read_sql_query(query,con)
movie_title imdb_rating character_name n_lines
0 casablanca 8.8 RICK 281
1 casablanca 8.8 RENAULT 135
2 it’s a wonderful life 8.7 GEORGE 275
3 it’s a wonderful life 8.7 MARY 57
4 sunset blvd. 8.7 GILLIS 278
5 sunset blvd. 8.7 NORMA 158
6 citizen kane 8.6 KANE 221
7 citizen kane 8.6 LELAND 93
8 all about eve 8.5 MARGO 314
9 all about eve 8.5 EVE 232

The table above gives us the two most prominent characters from the most popular Golden Age movies. I determined the most prominent characters by ordering them by their number of dialogues in the movie. For the movie Casablanca, for instance, we get ‘Rick’ and ‘Renault’ as the two most prominent characters. While the dataset doesn’t include last names of characters, movie buffs might know that Rick Blaine was Humphrey Bogart’s famous character in Casablanca, while Renault refers to Captain Louis Renault from the movie.