import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import inspect
movie_engine = create_engine('sqlite:///movie_lines.db')
insp = inspect(movie_engine)
table_names = insp.get_table_names()
con = movie_engine.connect()
query = '''
SELECT COUNT(movie_id) as n_movies
FROM movies
'''
pd.read_sql_query(query, con)
n_movies | |
---|---|
0 | 617 |
There are 617 movies in the database.
Display results for the top 3 genres.
query = '''
SELECT genres.name as genre_name, COUNT(movies.movie_id) as n_movies
FROM movies
JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
GROUP BY genres.genre_id
ORDER BY n_movies DESC
LIMIT 3
'''
pd.read_sql_query(query, con)
genre_name | n_movies | |
---|---|---|
0 | drama | 320 |
1 | thriller | 265 |
2 | comedy | 159 |
The database contains the most number of Drama movies, at 320, followed by Thriller at 265. The total number of genres in the database are 23 (same as the number of rows).
query = '''
SELECT MIN(year) as start_year, MAX(year) as end_year
FROM movies
'''
pd.read_sql_query(query, con)
start_year | end_year | |
---|---|---|
0 | 1927 | 2010 |
The dataset spans movies released between 1927-2010.
query = '''
SELECT movie_id, title as movie_title, imdb_rating
FROM movies
ORDER BY imdb_rating DESC
LIMIT 5
'''
pd.read_sql_query(query, con)
movie_id | movie_title | imdb_rating | |
---|---|---|---|
0 | 457 | neuromancer | 9.3 |
1 | 203 | the godfather | 9.2 |
2 | 369 | the godfather: part ii | 9.0 |
3 | 463 | one flew over the cuckoo’s nest | 8.9 |
4 | 504 | schindler’s list | 8.9 |
Movies with the highest ratings are Neuromancer, The Godfather, The Godfather Part II, One Flew Over the Cuckoo’s Nest and Schindler’s List. The highest rated movie in the database, Neuromancer, has an IMDB rating of 9.3.
query = '''
WITH MovieGenres (movie_id, title, year, imdb_rating, imdb_votes, url, genre_id, genre_name)
AS
(SELECT movies.*, genres.*
FROM movies
JOIN movie_genre_linking ON movies.movie_id = movie_genre_linking.movie_id
JOIN genres ON movie_genre_linking.genre_id = genres.genre_id
)
SELECT genre_id, genre_name, AVG(imdb_rating) as avg_rating
FROM MovieGenres
GROUP BY genre_id
ORDER BY avg_rating DESC
'''
pd.read_sql_query(query,con)
genre_id | genre_name | avg_rating | |
---|---|---|---|
0 | 18 | film-noir | 8.400000 |
1 | 11 | war | 7.756522 |
2 | 5 | biography | 7.684000 |
3 | 21 | history | 7.547619 |
4 | 4 | sport | 7.325000 |
5 | 23 | musical | 7.275000 |
6 | 19 | drama | 7.236562 |
7 | 7 | romance | 7.050758 |
8 | 9 | music | 7.015385 |
9 | 6 | crime | 7.001379 |
10 | 16 | mystery | 6.968000 |
11 | 8 | animation | 6.962500 |
12 | 1 | family | 6.956250 |
13 | 24 | adventure | 6.851402 |
14 | 10 | comedy | 6.751572 |
15 | 3 | western | 6.741667 |
16 | 15 | thriller | 6.704906 |
17 | 12 | sci-fi | 6.665179 |
18 | 2 | fantasy | 6.612162 |
19 | 22 | documentary | 6.600000 |
20 | 17 | short | 6.560000 |
21 | 20 | action | 6.544304 |
22 | 14 | adult | 6.300000 |
23 | 13 | horror | 6.057143 |
From the above table, we see that Film Noir appears to be the most popular genre on average, with an average IMDB rating of 8.4. War and Biography movies also are rated quite highly on average. Genres rated lowest, on average, are Horror, Adult and Action, each with average ratings below 6.6.
Here, we find the best rated movies focusing on the 5 most popular genres found above, which were Film Noir, War, Biography, History and Sport.
query = '''
WITH MovieGenres(movie_id, title, year, imdb_rating, imdb_votes, url, genre_id, genre_name)
AS
(SELECT movies.*, genres.*
FROM movies
JOIN movie_genre_linking as link
ON movies.movie_id = link.movie_id
JOIN genres ON genres.genre_id = link.genre_id
)
SELECT genre_id, genre_name, title as movie_title, MAX(imdb_rating) as imdb_rating
FROM MovieGenres
GROUP BY genre_id
ORDER BY AVG(imdb_rating) DESC
LIMIT 5
'''
pd.read_sql_query(query,con)
genre_id | genre_name | movie_title | imdb_rating | |
---|---|---|---|---|
0 | 18 | film-noir | sunset blvd. | 8.7 |
1 | 11 | war | schindler’s list | 8.9 |
2 | 5 | biography | schindler’s list | 8.9 |
3 | 21 | history | schindler’s list | 8.9 |
4 | 4 | sport | raging bull | 8.4 |
On inspecting the highest rated movies from the most popular genres, we see some popular names pop up such as Schindler’s List, the highest rated War movie. However, Schindler’s List appears under multiple genres including Biography and History. This is because each movie is linked to multiple possible genres under the movie_genre_linking table. Thus, movies with high ratings like Neuromancer (9.3) and Godfather (9.2) appear mutliple times.
#movies with an all-male cast
query = '''
SELECT movie_id, title as movie_title, n_female, n_male
FROM
(SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female,
COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male
FROM characters
JOIN movies ON characters.movie_id = movies.movie_id
GROUP by characters.movie_id
)
WHERE n_female = 0 AND n_male > 0
ORDER BY n_male DESC
'''
pd.read_sql_query(query, con)
movie_id | movie_title | n_female | n_male | |
---|---|---|---|---|
0 | 247 | apocalypse now | 0 | 8 |
1 | 170 | reservoir dogs | 0 | 7 |
2 | 525 | south park: bigger longer & uncut | 0 | 7 |
3 | 536 | dr. strangelove or: how i learned to stop worr… | 0 | 7 |
4 | 37 | the boondock saints | 0 | 6 |
… | … | … | … | … |
68 | 277 | la battaglia di algeri | 0 | 1 |
69 | 315 | dark city | 0 | 1 |
70 | 478 | predator | 0 | 1 |
71 | 488 | red white black & blue | 0 | 1 |
72 | 572 | ticker | 0 | 1 |
73 rows × 4 columns
We can see that Apocalypse Now had the largest all-male cast in our dataset with an 8:0 male:female ratio, followed by Reservoir Dogs at 7:0. A simple google search about the Apocalypse Now cast would reveal that the results are correct - the film did comprise of an all-male cast. In total, there are 73 such movies with an all-male cast in the dataset.
query = '''
SELECT movie_id, title AS movie_title, n_female, n_male
FROM
(SELECT characters.movie_id, movies.title, COUNT(CASE WHEN gender = 'F' THEN gender ELSE NULL END) as n_female,
COUNT(CASE WHEN gender = 'M' THEN gender ELSE NULL END) as n_male
FROM characters
JOIN movies ON characters.movie_id = movies.movie_id
GROUP by characters.movie_id
)
WHERE n_male = 0 AND n_female > 0
ORDER BY n_female DESC
'''
pd.read_sql_query(query, con)
movie_id | movie_title | n_female | n_male | |
---|---|---|---|---|
0 | 51 | drop dead gorgeous | 3 | 0 |
1 | 234 | agnes of god | 3 | 0 |
2 | 384 | heavenly creatures | 3 | 0 |
3 | 393 | hellraiser iii: hell on earth | 3 | 0 |
4 | 138 | my mother dreams the satan’s disciples in new … | 2 | 0 |
5 | 134 | metropolis | 1 | 0 |
6 | 459 | the nightmare before christmas | 1 | 0 |
7 | 500 | salt of the earth | 1 | 0 |
There are only 7 movies with an all female dataset. The top of this list is Drop Dead Gorgeous, a 1999 film which indeed has an all female cast. This is tied with Agnes of God, Heavenly Creatures and Hellraiser III: Hell on Earth.
#average number of conversations overall
query = '''
WITH conversation_count (movie_id, n_conversations)
AS
(
SELECT movie_id, COUNT(conversation_id) as n_conversations
FROM conversations
GROUP BY movie_id
)
SELECT AVG(n_conversations) as avg_conversations
FROM conversation_count;
'''
pd.read_sql_query(query, con)
avg_conversations | |
---|---|
0 | 134.679092 |
#movies with the highest number of conversations
query = '''
WITH conversation_count (movie_id, n_conversations, title)
AS
(
SELECT movies.movie_id, COUNT(conversation_id) as n_conversations, movies.title
FROM conversations
INNER JOIN movies ON conversations.movie_id = movies.movie_id
GROUP BY conversations.movie_id
)
SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MAX(n_conversations) FROM conversation_count)
UNION
SELECT *
FROM conversation_count
WHERE n_conversations = (SELECT MIN(n_conversations) FROM conversation_count)
;
'''
pd.read_sql_query(query, con)
movie_id | n_conversations | title | |
---|---|---|---|
0 | 289 | 338 | casino |
1 | 406 | 1 | the jazz singer |
2 | 602 | 1 | what women want |
The average number of conversations per movie is 134. The movie with the highest number of conversations (338) is Casino. The movies with the lowest (1) are the Jazz Singer and What Women Want. The Jazz Singer, interestingly, was one of the first movies to have a synchronized vocal and musical track, a milestone in the sound revolution. The movie was all-silent except just two minutes of vocal dialogue.
query = '''
WITH line_counter
AS
(SELECT conversation_id, COUNT(line_sort) as n_lines, lines.movie_id, genres.genre_id, genres.name
FROM lines
JOIN movie_genre_linking
ON lines.movie_id = movie_genre_linking.movie_id
JOIN genres
ON movie_genre_linking.genre_id = genres.genre_id
GROUP BY conversation_id
)
SELECT genre_id, name, AVG(n_lines) as avg_lines
FROM line_counter
GROUP BY genre_id
ORDER BY avg_lines DESC
'''
pd.read_sql_query(query, con)
genre_id | name | avg_lines | |
---|---|---|---|
0 | 4 | sport | 13.279215 |
1 | 1 | family | 13.278655 |
2 | 6 | crime | 13.102732 |
3 | 8 | animation | 12.964912 |
4 | 2 | fantasy | 12.472645 |
5 | 3 | western | 12.229846 |
6 | 5 | biography | 11.513866 |
7 | 7 | romance | 10.608152 |
8 | 12 | sci-fi | 10.496909 |
9 | 9 | music | 9.912390 |
10 | 11 | war | 9.907377 |
11 | 13 | horror | 9.429785 |
12 | 15 | thriller | 8.814720 |
13 | 10 | comedy | 8.718647 |
14 | 18 | film-noir | 7.795222 |
15 | 16 | mystery | 7.543767 |
16 | 17 | short | 5.480176 |
17 | 20 | action | 4.967136 |
18 | 19 | drama | 4.791919 |
19 | 22 | documentary | 3.094340 |
The longest conversations, on average, occur in Sport genre movies, with Family and Crime movies being close runners-up. The shortest conversations, on the other hand, occur in Documentary films, followed by Drama and Action.
This could also be attributed to the way the dataset has been created - monologues by characters count as a single line in a “conversation”. Thus, in documentaries, where one documentary narrator might be speaking multiple lines at once in a monological fashion, this might be considered a single line for the dataset, leading to the least number of lines per conversation.
The Golden Age of Hollywood was characterized by sound in movies, dominance of the “Big 5” studios, star power, glitz and glamour. The period popularly began in 1927 with the release of Jazz Singer, the first film to experiment with sound and faltered around 1950.
query = '''
SELECT title, year, imdb_rating
FROM movies
WHERE year >= 1927 AND year <= 1950
ORDER BY imdb_rating DESC
LIMIT 6
'''
pd.read_sql_query(query, con)
title | year | imdb_rating | |
---|---|---|---|
0 | casablanca | 1942 | 8.8 |
1 | it’s a wonderful life | 1946 | 8.7 |
2 | sunset blvd. | 1950 | 8.7 |
3 | citizen kane | 1941 | 8.6 |
4 | all about eve | 1950 | 8.5 |
5 | the third man | 1949 | 8.5 |
The most popular IMDB rated movies from this period were Casablanca, It’s a Wonderful Life, Sunset Blvd, Citizen Kane (which many refer to as the greatest movie ever made) and All About Eve. It’s curious that Jazz Singer doesn’t appear in this list. We can also look up the specific record for The Jazz Singer and compare its ratings to the Top 5 mentioned above.
query = '''
SELECT title, year, imdb_rating
FROM movies
WHERE title LIKE '%jazz%singer%'
'''
pd.read_sql_query(query, con)
title | year | imdb_rating | |
---|---|---|---|
0 | the jazz singer | 1927 | 6.8 |
Interestingly, IMDB rating for the Jazz Singer is 6.8, much lower than the top 5 movies from this era listed above.
query = '''
SELECT title AS movie_title, imdb_rating, name as character_name, n_lines
FROM
(WITH line_counter
AS
(SELECT movies.movie_id, movies.title, movies.imdb_rating, characters.character_id, name, COUNT(line_id) as n_lines
FROM movies
JOIN characters ON movies.movie_id = characters.movie_id
JOIN lines ON characters.character_id = lines.character_id
WHERE year >= 1927 AND year <= 1950
GROUP BY characters.character_id
ORDER BY imdb_rating DESC
)
SELECT *, ROW_NUMBER() over (PARTITION BY movie_id ORDER BY n_lines DESC) AS line_rank
FROM line_counter
GROUP BY character_id
ORDER BY imdb_rating DESC)
WHERE line_rank <=2
LIMIT 10
'''
pd.read_sql_query(query,con)
movie_title | imdb_rating | character_name | n_lines | |
---|---|---|---|---|
0 | casablanca | 8.8 | RICK | 281 |
1 | casablanca | 8.8 | RENAULT | 135 |
2 | it’s a wonderful life | 8.7 | GEORGE | 275 |
3 | it’s a wonderful life | 8.7 | MARY | 57 |
4 | sunset blvd. | 8.7 | GILLIS | 278 |
5 | sunset blvd. | 8.7 | NORMA | 158 |
6 | citizen kane | 8.6 | KANE | 221 |
7 | citizen kane | 8.6 | LELAND | 93 |
8 | all about eve | 8.5 | MARGO | 314 |
9 | all about eve | 8.5 | EVE | 232 |
The table above gives us the two most prominent characters from the most popular Golden Age movies. I determined the most prominent characters by ordering them by their number of dialogues in the movie. For the movie Casablanca, for instance, we get ‘Rick’ and ‘Renault’ as the two most prominent characters. While the dataset doesn’t include last names of characters, movie buffs might know that Rick Blaine was Humphrey Bogart’s famous character in Casablanca, while Renault refers to Captain Louis Renault from the movie.