Using data from the IMDb site, we look for recommended movies and TV dramas of 2019.

Preparing the data

First, download the data from the IMDb site, then use BeautifulSoup to extract only the fields we need from the downloaded pages.

import re
import numpy as np
from time import sleep
from time import time
from random import randint
from warnings import warn
from requests import get
from bs4 import BeautifulSoup
from IPython.display import clear_output
# Set up loop monitoring
start_time = time()
requests = 0
pages = [str(i) for i in range(1,2002,50)]
headers = {"Accept-Language": "en-US, en;q=0.5"}
# Assumption: no proxy is needed; replace None with a proxy dict if you use one
proxies = None
movieTitle = []
movieDate = []
movieRunTime = []
movieGenre = []
movieRating = []
movieScore = []
movieDescription = []
movieDirector = []
movieStars = []
movieVotes = []
movieGross = []

for page in pages:

    # Send the GET request
    response = get('https://www.imdb.com/search/title/?release_date=2019-01-01,2019-12-31&sort=num_votes,desc&start=' + page +
    '&ref_=adv_nxt', headers = headers, proxies = proxies)

    # Pause the loop
    sleep(randint(8,15))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

    # Warn on non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    # Break out if the number of requests exceeds the expected count
    if requests > 41:
        warn('Number of requests was greater than expected.')
        break

    # Parse the response content with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all 50 titles on a single page
    mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')
        
    for movie in mv_containers:
        movieFirstLine = movie.find("h3", class_="lister-item-header")
        movieTitle.append(movieFirstLine.find("a").text)
        movieDate.append(re.sub(r"[()]","", movieFirstLine.find_all("span")[-1].text))
        try:
            movieRunTime.append(movie.find("span", class_="runtime").text[:-4])
        except:
            movieRunTime.append(np.nan)
        movieGenre.append(movie.find("span", class_="genre").text.rstrip().replace("\n","").split(","))
        try:
            movieRating.append(movie.find("strong").text)
        except:
            movieRating.append(np.nan)
        try:
            movieScore.append(movie.find("span", class_="metascore").text.rstrip())
        except:
            movieScore.append(np.nan)
        movieDescription.append(movie.find_all("p", class_="text-muted")[-1].text.lstrip())
        movieCast = movie.find("p", class_="")
    
        try:
            casts = movieCast.text.replace("\n","").split('|')
            casts = [x.strip() for x in casts]
            casts = [casts[i].replace(j, "") for i,j in enumerate(["Director:", "Stars:"])]
            movieDirector.append(casts[0])
            movieStars.append([x.strip() for x in casts[1].split(",")])
        except:
            casts = movieCast.text.replace("\n","").strip()
            movieDirector.append(np.nan)
            movieStars.append([x.strip() for x in casts.split(",")])
    
        movieNumbers = movie.find_all("span", attrs={"name": "nv"})
    
        if len(movieNumbers) == 2:
            movieVotes.append(movieNumbers[0].text)
            movieGross.append(movieNumbers[1].text)
        elif len(movieNumbers) == 1:
            movieVotes.append(movieNumbers[0].text)
            movieGross.append(np.nan)
        else:
            movieVotes.append(np.nan)
            movieGross.append(np.nan)

movieData = [movieTitle, movieDate, movieRunTime, movieGenre, movieRating, movieScore, movieDescription,
             movieDirector, movieStars, movieVotes, movieGross]

Request:41; Frequency: 0.07282616601621636 requests/s
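
The search URL in the loop above is assembled by string concatenation. As a minimal sketch, the same query string could be built with the standard library's urllib.parse.urlencode, which keeps each parameter visible. The helper name build_search_url and the constant BASE_URL are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"

def build_search_url(start, year=2019):
    """Return the search URL whose first result is number `start` (hypothetical helper)."""
    params = {
        "release_date": f"{year}-01-01,{year}-12-31",
        "sort": "num_votes,desc",
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas unescaped, matching the hand-written URL
    return BASE_URL + "?" + urlencode(params, safe=",")

# One URL per page of 50 results, mirroring range(1, 2002, 50) in the notebook
urls = [build_search_url(start) for start in range(1, 2002, 50)]
print(len(urls))
print(urls[0])
```

The list holds one URL per page, so its length can be compared directly against the request limit used in the loop.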

import pandas as pd
df = pd.DataFrame(movieData).T
df.tail()

	0	1	2	3	4	5	6	7	8	9	10
2045	This Time with Alan Partridge	2019	NaN	[Comedy, Talk-Show]	8.0	NaN	Alan and the team host a special show dedicate…	Directors:Neil Gibbons, Rob Gibbons	[Steve Coogan, Susannah Fielding, Felicity Mon…	207	NaN
2046	Playing with Fire	2019–	NaN	[Drama]	6.3	NaN	Three prosperous women, including a mother and…	NaN	[Stars:Jason Day, Margarita Rosa de Francisco,…	207	NaN
2047	The Enemy Within	2019	42	[Drama]	8.4	NaN	Once Shepherd convinces Keaton that Anna Cruz …	Martha Mitchell	[Jennifer Carpenter, Morris Chestnut, Raza Jaf…	207	NaN
2048	Law & Order: Special Victims Unit	2019	40	[Crime, Drama, Mystery]	8.2	NaN	When defense attorney Nikki Staines is raped a…	Christopher Misiano	[Mariska Hargitay, Kelli Giddish, Ice-T, Peter…	207	NaN
2049	Happy!	2019	39	[Action, Comedy, Crime]	8.2	NaN	This is one F-d up family dinner.	Wayne Yip	[Christopher Meloni, Ritchie Coster, Lili Miro…	207	NaN
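
In the output above, column 7 still contains the prefix "Directors:" and column 8 the prefix "Stars:". The loop only strips the singular "Director:", and rows without a director fall into the except branch, where nothing is stripped. The following is a minimal sketch of a more tolerant parser, assuming the cast text has the shape "Label: names | Label: names" that the scraping loop expects. The function name split_cast is hypothetical.

```python
def split_cast(cast_text):
    """Split 'Director(s): A, B | Star(s): C, D' into (directors, stars) lists."""
    directors, stars = [], []
    for part in cast_text.replace("\n", "").split("|"):
        # Everything before the first colon is the label, the rest is the names
        label, _, names = part.partition(":")
        names = [n.strip() for n in names.split(",") if n.strip()]
        label = label.strip().lower()
        if label.startswith("director"):   # matches "Director" and "Directors"
            directors = names
        elif label.startswith("star"):     # matches "Star" and "Stars"
            stars = names
    return directors, stars

print(split_cast("Directors:Neil Gibbons, Rob Gibbons | Stars:Steve Coogan, Susannah Fielding"))
print(split_cast("\n    Stars:\nJason Day, \nMargarita Rosa de Francisco\n"))
```

When no director is listed the first list is simply empty, so the caller can decide whether to store np.nan, as the notebook does.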

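# Column names in order: title, release year, runtime, genre, IMDb score, Metascore,
# synopsis, director, cast, number of votes, gross
df.columns = ['タイトル','公開年','上映時間','ジャンル','IMDBスコア','メタスコア',
 'あらすじ','監督','出演者','投票数','興行収入']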

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2050 entries, 0 to 2049
Data columns (total 11 columns):
タイトル       2050 non-null object
公開年        2050 non-null object
上映時間       1770 non-null object
ジャンル       2050 non-null object
IMDBスコア    2050 non-null object
メタスコア      161 non-null object
あらすじ       2050 non-null object
監督         1744 non-null object
出演者        2050 non-null object
投票数        2050 non-null object
興行収入       99 non-null object
dtypes: object(11)
memory usage: 176.2+ KB

Extracting the top IMDb scores and Metascores

First, sort by IMDb score in descending order.

df.sort_values(by='IMDBスコア',ascending=False).head(10)

	タイトル	公開年	上映時間	ジャンル	IMDBスコア	メタスコア	あらすじ	監督	出演者	投票数	興行収入
149	Attack on Titan	2019	24	[Animation, Action, Adventure]	9.9	NaN	A look into Grisha’s memories shows Eren the m…	NaN	[Stars:Yûki Kaji, Hiroshi Tsuchida, Yasunori M…	6,097	NaN
27	Chernobyl	2019	72	[Drama, History]	9.9	NaN	Valery, Boris and Ulana risk their lives and r…	Johan Renck	[Douggie McMeekin, Jamie Sives, Michael Socha,…	42,574	NaN
137	Attack on Titan	2019	24	[Animation, Action, Adventure]	9.9	NaN	While one front is rained on by flames, the ot…	NaN	[Stars:Yûki Kaji, Daisuke Ono, Hiroshi Kamiya,…	6,492	NaN
118	Attack on Titan	2019	24	[Animation, Action, Adventure]	9.9	NaN	As the battle for Shiganshina draws to a close…	NaN	[Stars:Yûki Kaji, Hiroshi Kamiya, Yui Ishikawa…	7,571	NaN
157	Lucifer	2019	NaN	[Crime, Drama, Fantasy]	9.8	NaN	With murderous demons on the loose in Los Ange…	Eagle Egilsson	[Tom Ellis, Lauren German, Kevin Alejandro, D….	5,645	NaN
274	Dark	2019	NaN	[Crime, Drama, Mystery]	9.8	NaN	Armed with a plan to prevent the apocalypse, J…	NaN	[Stars:Louis Hofmann, Sebastian Rudolph, Maja …	2,565	NaN
230	Barry	2019	38	[Comedy, Crime, Drama]	9.8	NaN	An encounter that Barry never could have predi…	Bill Hader	[Bill Hader, Stephen Root, Sarah Goldberg, Ant…	3,451	NaN
265	Dark	2019	NaN	[Crime, Drama, Mystery]	9.8	NaN	On the day of the apocalypse, Clausen executes…	NaN	[Stars:Sandra Borgmann, Karoline Eichhorn, Car…	2,656	NaN
1444	The Promised Neverland	2019	NaN	[Animation, Fantasy, Horror]	9.8	NaN	Emma and the others continue to carry out thei…	NaN	[Stars:Sumire Morohoshi, Maaya Uchida, Mariya …	344	NaN
36	Chernobyl	2019	65	[Drama, History]	9.7	NaN	With untold millions at risk, Ulana makes a de…	Johan Renck	[Emily Watson, Matthew Needham, Nadia Clifford…	38,078	NaN
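
Every column in this DataFrame has dtype object, so the sort above compares the scores as strings. That happens to give the expected order for values such as 9.9 and 9.8, but a score of 10.0 (or a Metascore of 100) would be ranked below them. The sketch below uses toy values, not scraped data, to show the difference once the column is converted with pd.to_numeric.

```python
import pandas as pd

# Toy values for illustration only, not scraped data
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "score": ["9.9", "10.0", "8.7"],   # strings, as produced by .text in the scraper
})

# String sort: "10.0" is compared character by character and ends up last
as_text = toy.sort_values(by="score", ascending=False)["title"].tolist()

# Numeric sort: convert first; errors="coerce" turns unparsable values into NaN
toy["score"] = pd.to_numeric(toy["score"], errors="coerce")
as_number = toy.sort_values(by="score", ascending=False)["title"].tolist()

print(as_text)
print(as_number)
```

In the notebook this would correspond to applying pd.to_numeric to df['IMDBスコア'] and df['メタスコア'] before calling sort_values.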

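I wondered what "Attack on Titan" was, and on looking it up it turned out to be the anime 進撃の巨人 (Shingeki no Kyojin). When the number of votes is small, the IMDb score does not seem very reliable, so we should probably require a minimum number of votes.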

df['投票数'] = df['投票数'].str.replace(',','')
df['投票数'] = df['投票数'].apply(pd.to_numeric, errors="coerce").astype(int)
df1 = df[(df['投票数'] > 1e4)]
df1.sort_values(by='IMDBスコア',ascending=False).head(10)

	タイトル	公開年	上映時間	ジャンル	IMDBスコア	メタスコア	あらすじ	監督	出演者	投票数	興行収入
27	Chernobyl	2019	72	[Drama, History]	9.9	NaN	Valery, Boris and Ulana risk their lives and r…	Johan Renck	[Douggie McMeekin, Jamie Sives, Michael Socha,…	42574	NaN
36	Chernobyl	2019	65	[Drama, History]	9.7	NaN	With untold millions at risk, Ulana makes a de…	Johan Renck	[Emily Watson, Matthew Needham, Nadia Clifford…	38078	NaN
38	Chernobyl	2019	65	[Drama, History]	9.7	NaN	Valery creates a detailed plan to decontaminat…	Johan Renck	[Baltasar Breki Samper, Philip Barantini, Osca…	35558	NaN
2	Chernobyl	2019	330	[Drama, History]	9.6	NaN	In April 1986, an explosion at the Chernobyl n…	NaN	[Stars:Jessie Buckley, Jared Harris, Stellan S…	283090	NaN
40	Chernobyl	2019	67	[Drama, History]	9.6	NaN	Valery and Boris attempt to find solutions to …	Johan Renck	[June Watson, Josef Altin, Jessie Buckley, Ser…	33087	NaN
32	Chernobyl	2019	60	[Drama, History]	9.5	NaN	Plant workers and firefighters put their lives…	Johan Renck	[Jared Harris, Michael Shaeffer, Jessie Buckle…	40349	NaN
77	Our Planet	2019	403	[Documentary]	9.4	NaN	Documentary series focusing on the breadth of …	NaN	[Star:David Attenborough]	13600	NaN
78	Kota Factory	2019–	45	[Comedy, Drama]	9.3	NaN	Dedicated to Shrimati SL Loney ji, Shri Irodov…	NaN	[Stars:Mayur More, Ranjan Raj, Revathi Pillai,…	12923	NaN
92	Enaaya	2019–	25	[Drama]	9.2	NaN	A college student with father-estrangement iss…	NaN	[Stars:Mehwish Hayat, Azfar Rehman, Asad Siddi…	10249	NaN
50	When They See Us	2019–	296	[Biography, Drama, History]	9.1	NaN	Five teens from Harlem become trapped in a nig…	NaN	[Stars:Asante Blackk, Caleel Harris, Ethan Her…	25685	NaN

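"Chernobyl" is clearly rated extremely highly. It is a drama miniseries produced by HBO, or more precisely a co-production of HBO and Sky UK. Perhaps some Japanese TV station could try making a drama miniseries called "Fukushima". Next, we raise the vote threshold to more than 50,000.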

df2 = df[(df['投票数'] > 5e4)]
df2.sort_values(by='IMDBスコア',ascending=False).head(10)

	タイトル	公開年	上映時間	ジャンル	IMDBスコア	メタスコア	あらすじ	監督	出演者	投票数	興行収入
2	Chernobyl	2019	330	[Drama, History]	9.6	NaN	In April 1986, an explosion at the Chernobyl n…	NaN	[Stars:Jessie Buckley, Jared Harris, Stellan S…	283090	NaN
0	Avengers: Endgame	2019	181	[Action, Adventure, Sci-Fi]	8.7	78	After the devastating events of Avengers: Infi…	Directors:Anthony Russo, Joe Russo	[Robert Downey Jr., Chris Evans, Mark Ruffalo,…	461382	$835.78M
20	Love, Death & Robots	2019–	15	[Animation, Short, Comedy]	8.7	NaN	A collection of animated short stories that sp…	NaN	[Stars:Scott Whyte, Nolan North, Matthew Yang …	57518	NaN
17	Sex Education	2019–	45	[Comedy, Drama]	8.4	NaN	A teenage boy with a sex therapist mother team…	NaN	[Stars:Asa Butterfield, Gillian Anderson, Ncut…	67921	NaN
16	The Umbrella Academy	2019–	60	[Action, Adventure, Comedy]	8.1	NaN	A disbanded group of superheroes reunites afte…	NaN	[Stars:Ellen Page, Tom Hopper, David Castañeda…	73065	NaN
12	John Wick: Chapter 3 – Parabellum	2019	131	[Action, Crime, Thriller]	8.0	73	Super-assassin John Wick is on the run after k…	Chad Stahelski	[Keanu Reeves, Halle Berry, Ian McShane, Laure…	97315	$158.14M
11	Game of Thrones	2019	58	[Action, Adventure, Drama]	7.9	NaN	Jaime faces judgment and Winterfell prepares f…	David Nutter	[Peter Dinklage, Nikolaj Coster-Waldau, Emilia…	114494	NaN
10	Game of Thrones	2019	54	[Action, Adventure, Drama]	7.6	NaN	Jon and Daenerys arrive in Winterfell and are …	David Nutter	[Peter Dinklage, Nikolaj Coster-Waldau, Lena H…	116468	NaN
4	Game of Thrones	2019	82	[Action, Adventure, Drama]	7.6	NaN	The Night King and his army have arrived at Wi…	Miguel Sapochnik	[Peter Dinklage, Nikolaj Coster-Waldau, Emilia…	193573	NaN
18	How to Train Your Dragon: The Hidden World	2019	104	[Animation, Action, Adventure]	7.6	71	When Hiccup discovers Toothless isn’t the only…	Dean DeBlois	[Jay Baruchel, America Ferrera, F. Murray Abra…	64444	$160.80M
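
The filter-then-sort step has now been used twice, with thresholds of 1e4 and 5e4. As a minimal sketch, it could be wrapped in a small function so that only the threshold changes. The name top_by_score is hypothetical, the rows are toy data, and the sketch assumes both columns are already numeric.

```python
import pandas as pd

def top_by_score(frame, min_votes, n=10, votes_col="投票数", score_col="IMDBスコア"):
    """Keep rows with more than `min_votes` votes and return the `n` highest scored."""
    popular = frame[frame[votes_col] > min_votes]
    return popular.sort_values(by=score_col, ascending=False).head(n)

# Toy rows for illustration only, not scraped data
toy = pd.DataFrame({
    "タイトル": ["few votes", "many votes", "some votes"],
    "IMDBスコア": [9.9, 8.7, 9.6],
    "投票数": [300, 460000, 28000],
})

over_10k = top_by_score(toy, 1e4)["タイトル"].tolist()
over_50k = top_by_score(toy, 5e4)["タイトル"].tolist()
print(over_10k)
print(over_50k)
```

The title with the highest score but only 300 votes drops out at both thresholds, which is the effect the vote filter is meant to have.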

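Love, Death & Robots, Sex Education, and The Umbrella Academy are Netflix originals. Game of Thrones, like Chernobyl, is an HBO original and a foreign drama that is well known in Japan. Avengers: Endgame is not very popular in Japan, but it is a mega-hit worldwide. John Wick: Chapter 3 – Parabellum is an action movie starring Keanu Reeves, who is hugely popular in Japan; its Japanese release is scheduled for October. A fourth film has reportedly already been decided on. Next, we look for the highest Metascores.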

df.sort_values(by='メタスコア',ascending=False).head(10)

	タイトル	公開年	上映時間	ジャンル	IMDBスコア	メタスコア	あらすじ	監督	出演者	投票数	興行収入
1517	Portrait of a Lady on Fire	2019	119	[Drama, History, Romance]	8.5	95	On an isolated island in Brittany at the end o…	Céline Sciamma	[Noémie Merlant, Adèle Haenel, Luàna Bajrami, …	322	NaN
656	The Souvenir	2019	120	[Drama, Mystery, Romance]	6.9	92	A young film student in the early 80s becomes …	Joanna Hogg	[Neil Young, Tosin Cole, Jack McMullen, Tilda …	939	$0.93M
1298	The Lighthouse	2019	110	[Drama, Fantasy, Horror]	8.3	91	The story of an aging lighthouse keeper named …	Robert Eggers	[Robert Pattinson, Willem Dafoe]	396	NaN
216	Parasite	2019	132	[Drama, Thriller]	8.6	89	All unemployed, Ki-taek’s family takes peculia…	Joon-ho Bong	[Kang-ho Song, Sun-kyun Lee, Yeo-jeong Jo, Woo…	3687	NaN
135	Once Upon a Time … in Hollywood	2019	159	[Comedy, Drama]	9.6	88	A faded television actor and his stunt double …	Quentin Tarantino	[Leonardo DiCaprio, Brad Pitt, Margot Robbie, …	6600	NaN
1729	Cold Case Hammarskjöld	2019	128	[Documentary]	8.2	88	Danish director Mads Brügger and Swedish priva…	Mads Brügger	[Mads Brügger, Göran Björkdahl, Dag Hammarskjö…	267	NaN
1651	The Farewell	I 2019	98	[Comedy, Drama]	6.8	87	A Chinese family discovers their grandmother h…	Lulu Wang	[Awkwafina, Tzi Ma, Gil Perez-Abraham, Diana Lin]	283	NaN
159	Apollo 11	2019	93	[Documentary, History]	8.3	87	A look at the Apollo 11 mission to land on the…	Todd Douglas Miller	[Buzz Aldrin, Joan Ann Archer, Janet Armstrong…	5575	$8.88M
734	Synonyms	2019	123	[Drama]	7.1	86	A young Israeli man absconds to Paris to flee …	Nadav Lapid	[Tom Mercier, Quentin Dolmaire, Louise Chevill…	823	NaN
1642	Divine Love	2019	101	[Drama]	6.3	86	A woman who uses her bureaucratic job to convi…	Gabriel Mascaro	[Dira Paes, Julio Machado, Teca Pereira, Emíli…	286	NaN

Portrait of a Lady on Fire is a French film directed by Céline Sciamma that won the Best Screenplay award at the Cannes Film Festival. Let us look at its synopsis.

df['あらすじ'][1517]

'On an isolated island in Brittany at the end of the eighteenth century, a female painter is obliged to paint a wedding portrait of a young woman.'

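The synopsis is intriguing: at the end of the eighteenth century, a female painter is to paint the wedding portrait of a young woman on an isolated island in Brittany. The French release is on September 18, and whether it will be released in Japan is still undecided. "Parasite" is a Korean film, the first Korean film to win the Palme d'Or at the Cannes Film Festival. A Japanese release seems to have been decided for this one. Let us look at its synopsis.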

df['あらすじ'][216]

"All unemployed, Ki-taek's family takes peculiar interest in the wealthy and glamorous Parks for their livelihood until they get entangled in an unexpected incident."

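According to the synopsis, the Ki-taek family, all of them unemployed, take a peculiar interest in the lives of the wealthy and glamorous Parks until they get entangled in an unexpected incident. The title Parasite already hints at where the story is going, but it is a film I would like to see. "Once Upon a Time … in Hollywood" is directed by Tarantino, a familiar name in Japan as well, and co-stars Brad Pitt and DiCaprio, so many people are probably looking forward to its release. Let us look at this film's synopsis.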

df['あらすじ'][135]

"A faded television actor and his stunt double strive to achieve fame and success in the film industry during the final years of Hollywood's Golden Age in 1969 Los Angeles."

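According to the synopsis, a faded TV actor and his stunt double struggle to win fame and success in the film industry in 1969 Los Angeles, during the final years of Hollywood's Golden Age. The wiki, however, describes it as "a film depicting the Hollywood film industry against the backdrop of the 1969 murder of Hollywood actress Sharon Tate by the cult group known as the Charles Manson Family."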
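
The synopsis lookups above use hard-coded row labels (1517, 216, 135), which change whenever the data is scraped again. As a minimal sketch, the same lookup could be done by title instead. The helper name synopses_for is hypothetical and the rows are toy data. It returns a list because titles are not unique in this dataset: individual episodes share the title of their series, as the Chernobyl rows show.

```python
import pandas as pd

def synopses_for(frame, title, title_col="タイトル", synopsis_col="あらすじ"):
    """Return every synopsis whose title matches `title` exactly."""
    return frame.loc[frame[title_col] == title, synopsis_col].tolist()

# Toy rows for illustration only, not scraped data
toy = pd.DataFrame(
    {
        "タイトル": ["Film X", "Series Y", "Series Y"],
        "あらすじ": ["Synopsis of X.", "Episode one.", "Episode two."],
    },
    index=[135, 216, 1517],
)

film = synopses_for(toy, "Film X")
series = synopses_for(toy, "Series Y")
print(film)
print(series)
```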

Reference site: https://www.dataquest.io
Reference site: https://github.com/