Extracting the titles with the highest Metascore and IMDb score among the 50 most-voted releases of 2018 (Heisei 30)

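Using this site as a reference, I will try to learn about web scraping. Web scraping means extracting the information you need from websites. It appears to be an essential skill for data science, but scraping some sites can reportedly raise legal issues, so caution is required.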

Extracting IMDb data with BeautifulSoup

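We use BeautifulSoup to extract the movie data we need from the IMDb site. As a test case, we extract the title, release year, IMDb score, Metascore, number of votes, gross earnings and so on from the first page of results. First, download the web page.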

from requests import get
url = 'https://www.imdb.com/search/title/?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1&ref_=adv_nxt'
response = get(url)
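
The search URL above carries the date range, the sort order and the start offset as query parameters. The following is a minimal sketch of assembling that URL from its parts, which could make it easier to request another year or a later page. `build_search_url` is a hypothetical helper introduced here for illustration, and the offset of 51 for a second page is an assumption based on the 50 rows returned for the first page; the site's URL scheme may also have changed since this was written.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(year, start=1):
    """Build the search URL for titles released in `year`, sorted by votes.

    Hypothetical helper: it only reproduces the query string used in this post.
    """
    params = {
        "release_date": f"{year}-01-01,{year}-12-31",
        "sort": "num_votes,desc",
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas unescaped, as in the original URL
    return BASE_URL + "?" + urlencode(params, safe=",")


first_page = build_search_url(2018)
# Assumption: 50 results per page, so the second page would start at 51
second_page = build_search_url(2018, start=51)
print(first_page)
```

With `year=2018` and the default `start`, the helper returns the same URL string that is hard-coded above.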

Next, extract the required data with BeautifulSoup.
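
The extraction code that follows selects each result block with `find_all` and the two-class string `'lister-item mode-advanced'`. As a small offline sketch of how that lookup behaves: when a string containing several classes is passed to `class_`, BeautifulSoup matches it against the exact value of the `class` attribute, so the same classes in a different order are not matched, whereas a CSS selector via `select` ignores the order. The HTML snippet below is made up for illustration and is not real IMDb markup.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names used in this post
html = """
<div class="lister-item mode-advanced"><h3 class="lister-item-header"><a>First</a></h3></div>
<div class="mode-advanced lister-item"><h3 class="lister-item-header"><a>Second</a></h3></div>
<div class="lister-item"><h3 class="lister-item-header"><a>Third</a></h3></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class attribute string
exact = soup.find_all("div", class_="lister-item mode-advanced")
# CSS selector: both classes must be present, in any order
by_selector = soup.select("div.lister-item.mode-advanced")

exact_titles = [div.find("a").text for div in exact]
selector_titles = [div.find("a").text for div in by_selector]
print(exact_titles, selector_titles)
```

For the page scraped here the exact-string form was evidently sufficient, since it returned the 50 rows shown later.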

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
content = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
import re
import numpy as np

movieTitle = []
movieDate = []
movieRunTime = []
movieGenre = []
movieRating = []
movieScore = []
movieDescription = []
movieDirector = []
movieStars = []
movieVotes = []
movieGross = []
for movie in content:
	movieFirstLine = movie.find("h3", class_="lister-item-header")
	movieTitle.append(movieFirstLine.find("a").text)
	movieDate.append(re.sub(r"[()]","", movieFirstLine.find_all("span")[-1].text))
	try:
		# Drop the trailing " min" from the runtime text
		movieRunTime.append(movie.find("span", class_="runtime").text[:-4])
	except AttributeError:  # find() returned None: no runtime listed
		movieRunTime.append(np.nan)
	movieGenre.append(movie.find("span", class_="genre").text.rstrip().replace("\n","").split(","))
	try:
		movieRating.append(movie.find("strong").text)
	except AttributeError:  # no IMDb rating listed
		movieRating.append(np.nan)
	try:
		movieScore.append(movie.find("span", class_="metascore").text.rstrip())
	except AttributeError:  # no Metascore listed
		movieScore.append(np.nan)
	movieDescription.append(movie.find_all("p", class_="text-muted")[-1].text.lstrip())
	movieCast = movie.find("p", class_="")

	try:
		casts = movieCast.text.replace("\n","").split('|')
		casts = [x.strip() for x in casts]
		casts = [casts[i].replace(j, "") for i,j in enumerate(["Director:", "Stars:"])]
		movieDirector.append(casts[0])
		movieStars.append([x.strip() for x in casts[1].split(",")])
	except:
		casts = movieCast.text.replace("\n","").strip()
		movieDirector.append(np.nan)
		movieStars.append([x.strip() for x in casts.split(",")])

	movieNumbers = movie.find_all("span", attrs={"name": "nv"})

	if len(movieNumbers) == 2:
		movieVotes.append(movieNumbers[0].text)
		movieGross.append(movieNumbers[1].text)
	elif len(movieNumbers) == 1:
		movieVotes.append(movieNumbers[0].text)
		movieGross.append(np.nan)
	else:
		movieVotes.append(np.nan)
		movieGross.append(np.nan)

movieData = [movieTitle, movieDate, movieRunTime, movieGenre, movieRating, movieScore, movieDescription,
					movieDirector, movieStars, movieVotes, movieGross]    
import pandas as pd
df = pd.DataFrame(movieData).T
df.tail()
(index) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
45 | Bad Times at the El Royale | 2018 | 141 | [Crime, Drama, Mystery] | 7.1 | 60 | Circa 1969, several strangers, most with a sec… | Drew Goddard | [Jeff Bridges, Cynthia Erivo, Dakota Johnson, … | 90,157 | $17.84M
46 | Ralph Breaks the Internet | 2018 | 112 | [Animation, Adventure, Comedy] | 7.1 | 71 | Six years after the events of “Wreck-It Ralph,… | Directors:Phil Johnston, Rich Moore | [John C. Reilly, Sarah Silverman, Gal Gadot, T… | 89,846 | $201.09M
47 | The Ballad of Buster Scruggs | 2018 | 133 | [Comedy, Drama, Musical] | 7.3 | 79 | Six tales of life and violence in the Old West… | Directors:Ethan Coen, Joel Coen | [Tim Blake Nelson, Willie Watson, Clancy Brown… | 88,727 | NaN
48 | Tag | I 2018 | 100 | [Comedy] | 6.5 | 56 | A small group of former classmates organize an… | Jeff Tomsic | [Jeremy Renner, Ed Helms, Jake Johnson, Jon Hamm] | 86,858 | $54.55M
49 | A Simple Favor | 2018 | 117 | [Comedy, Crime, Drama] | 6.9 | 67 | Stephanie is a single mother with a parenting … | Paul Feig | [Anna Kendrick, Blake Lively, Henry Golding, A… | 84,723 | $53.54M
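
In the output above, rows with more than one director keep a `Directors:` prefix in column 7, because the loop only strips the singular label `Director:`. Entries with no director line at all fall into the `except` branch, which leaves the `Stars:` label inside the cast list. A possible way to handle these cases in one place is sketched below. `split_cast` is a hypothetical helper, and the `Label: names | Label: names` layout of the text is an assumption inferred from the parsing code in this post. It returns empty lists where the original code stores `np.nan`.

```python
def split_cast(cast_text):
    """Split text such as 'Directors: A, B | Stars: C, D' into (directors, stars).

    Hypothetical helper; the 'Label: names | Label: names' layout is assumed.
    """
    directors, stars = [], []
    for part in cast_text.replace("\n", "").split("|"):
        # Everything before the first colon is the label, the rest the names
        label, _, names = part.partition(":")
        people = [name.strip() for name in names.split(",") if name.strip()]
        label = label.strip()
        if label.startswith("Director"):  # matches "Director" and "Directors"
            directors = people
        elif label.startswith("Star"):  # matches "Star" and "Stars"
            stars = people
    return directors, stars


both = split_cast("\nDirectors:\nPhil Johnston, \nRich Moore\n| \nStars:\nJohn C. Reilly, \nSarah Silverman\n")
stars_only = split_cast("\nStars:\nMichiel Huisman, \nCarla Gugino\n")
print(both)
print(stars_only)
```

Matching on the start of the label rather than on a fixed string is what lets the singular and plural forms share one code path.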
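# Japanese column names, in order: title, release year, runtime, genre, IMDb score,
# Metascore, synopsis, director, cast, number of votes, gross earnings
df.columns = ['タイトル','公開年','上映時間','ジャンル','IMDBスコア','メタスコア',
              'あらすじ','監督','出演者','投票数','興行収入']
df.head()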
(index) | タイトル | 公開年 | 上映時間 | ジャンル | IMDBスコア | メタスコア | あらすじ | 監督 | 出演者 | 投票数 | 興行収入
0 | Avengers: Infinity War | 2018 | 149 | [Action, Adventure, Sci-Fi] | 8.5 | 68 | The Avengers and their allies must be willing … | Directors:Anthony Russo, Joe Russo | [Robert Downey Jr., Chris Hemsworth, Mark Ruff… | 678,522 | $678.82M
1 | Black Panther | 2018 | 134 | [Action, Adventure, Sci-Fi] | 7.3 | 88 | T’Challa, heir to the hidden but advanced king… | Ryan Coogler | [Chadwick Boseman, Michael B. Jordan, Lupita N… | 520,689 | $700.06M
2 | Deadpool 2 | 2018 | 119 | [Action, Adventure, Comedy] | 7.8 | 66 | Foul-mouthed mutant mercenary Wade Wilson (AKA… | David Leitch | [Ryan Reynolds, Josh Brolin, Morena Baccarin, … | 396,546 | $324.59M
3 | Bohemian Rhapsody | 2018 | 134 | [Biography, Drama, Music] | 8.0 | 49 | The story of the legendary rock band Queen and… | Bryan Singer | [Rami Malek, Lucy Boynton, Gwilym Lee, Ben Hardy] | 352,193 | $216.43M
4 | A Quiet Place | 2018 | 90 | [Drama, Horror, Sci-Fi] | 7.6 | 82 | In a post-apocalyptic world, a family is force… | John Krasinski | [Emily Blunt, John Krasinski, Millicent Simmon… | 308,397 | $188.02M

Extracting the highest IMDb score and Metascore

First, sort the rows by IMDb score in descending order.

df.sort_values(by='IMDBスコア',ascending=False).head(10)
(index) | タイトル | 公開年 | 上映時間 | ジャンル | IMDBスコア | メタスコア | あらすじ | 監督 | 出演者 | 投票数 | 興行収入
27 | The Haunting of Hill House | 2018– | 50 | [Drama, Horror, Mystery] | 8.7 | NaN | Flashing between past and present, a fractured… | NaN | [Stars:Michiel Huisman, Carla Gugino, Henry Th… | 117,975 | NaN
0 | Avengers: Infinity War | 2018 | 149 | [Action, Adventure, Sci-Fi] | 8.5 | 68 | The Avengers and their allies must be willing … | Directors:Anthony Russo, Joe Russo | [Robert Downey Jr., Chris Hemsworth, Mark Ruff… | 678,522 | $678.82M
15 | Spider-Man: Into the Spider-Verse | 2018 | 117 | [Animation, Action, Adventure] | 8.5 | 87 | Teen Miles Morales becomes Spider-Man of his r… | Directors:Bob Persichetti, Peter Ramsey, Rodne… | [Shameik Moore, Jake Johnson, Hailee Steinfeld… | 218,754 | $190.24M
16 | Green Book | 2018 | 130 | [Biography, Comedy, Drama] | 8.2 | 69 | A working-class Italian-American bouncer becom… | Peter Farrelly | [Viggo Mortensen, Mahershala Ali, Linda Cardel… | 212,883 | $85.08M
33 | Altered Carbon | 2018– | 60 | [Action, Drama, Sci-Fi] | 8.2 | NaN | Set in a future where consciousness is digitiz… | NaN | [Stars:Chris Conner, Renée Elise Goldsberry, J… | 103,821 | NaN
3 | Bohemian Rhapsody | 2018 | 134 | [Biography, Drama, Music] | 8.0 | 49 | The story of the legendary rock band Queen and… | Bryan Singer | [Rami Malek, Lucy Boynton, Gwilym Lee, Ben Hardy] | 352,193 | $216.43M
30 | Inugashima | 2018 | 101 | [Animation, Adventure, Comedy] | 7.9 | 82 | Set in Japan, Isle of Dogs follows a boy’s ody… | Wes Anderson | [Bryan Cranston, Koyu Rankin, Edward Norton, B… | 109,452 | $32.02M
10 | Mission: Impossible – Fallout | 2018 | 147 | [Action, Adventure, Thriller] | 7.8 | 86 | Ethan Hunt and his IMF team, along with some f… | Christopher McQuarrie | [Tom Cruise, Henry Cavill, Ving Rhames, Simon … | 236,555 | $220.16M
2 | Deadpool 2 | 2018 | 119 | [Action, Adventure, Comedy] | 7.8 | 66 | Foul-mouthed mutant mercenary Wade Wilson (AKA… | David Leitch | [Ryan Reynolds, Josh Brolin, Morena Baccarin, … | 396,546 | $324.59M
31 | Roma | 2018 | 135 | [Drama] | 7.8 | 96 | A year in the life of a middle-class family’s … | Alfonso Cuarón | [Yalitza Aparicio, Marina de Tavira, Diego Cor… | 108,104 | NaN
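
The scores, vote counts and gross figures were scraped as text, so the sort above compares strings rather than numbers. That gives the expected order as long as every score has the same number of digits, but a Metascore of `100` would sort below `96` as text. A minimal sketch of converting the columns with `pd.to_numeric` before sorting follows; the four-row `sample` frame is made-up data that only mimics the formats seen in the table, and reading the `M` suffix as millions of US dollars is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the scraped string formats
sample = pd.DataFrame({
    "タイトル": ["A", "B", "C", "D"],
    "メタスコア": ["96", "100", np.nan, "49"],
    "投票数": ["108,104", "520,689", "117,975", "352,193"],
    "興行収入": [np.nan, "$700.06M", np.nan, "$216.43M"],
})

# Metascore: plain digits, missing values stay NaN
sample["メタスコア"] = pd.to_numeric(sample["メタスコア"], errors="coerce")
# Votes: remove the thousands separator first
sample["投票数"] = pd.to_numeric(
    sample["投票数"].str.replace(",", "", regex=False), errors="coerce"
)
# Gross: strip "$" and "M"; the result is assumed to be in millions of USD
sample["興行収入"] = pd.to_numeric(
    sample["興行収入"].str.replace(r"[$M]", "", regex=True), errors="coerce"
)

# Numeric sort; rows with NaN are placed last by default
ranked = sample.sort_values(by="メタスコア", ascending=False)
print(ranked)
```

The same conversion could be applied to `df` itself before calling `sort_values`, keeping the original column names.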

Let's look at the synopsis of the top entry, "The Haunting of Hill House".

df['あらすじ'][27]
'Flashing between past and present, a fractured family confronts haunting memories of their old home and the terrifying events that drove them from it.'

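In other words, a family caught between past and present confronts the haunting memories of their old home and the terrifying events that drove them out of it. This drama appears to be a Netflix original. In Japan it is titled 「ザ・ホーンティング・オブ・ヒルハウス」.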

Next, sort the rows by Metascore in descending order.

df.sort_values(by='メタスコア',ascending=False).head(10)
(index) | タイトル | 公開年 | 上映時間 | ジャンル | IMDBスコア | メタスコア | あらすじ | 監督 | 出演者 | 投票数 | 興行収入
31 | Roma | 2018 | 135 | [Drama] | 7.8 | 96 | A year in the life of a middle-class family’s … | Alfonso Cuarón | [Yalitza Aparicio, Marina de Tavira, Diego Cor… | 108,104 | NaN
26 | Joouheika no okiniiri | 2018 | 119 | [Biography, Drama, History] | 7.6 | 90 | In early 18th century England, a frail Queen A… | Yorgos Lanthimos | [Olivia Colman, Emma Stone, Rachel Weisz, Nich… | 118,728 | $34.24M
1 | Black Panther | 2018 | 134 | [Action, Adventure, Sci-Fi] | 7.3 | 88 | T’Challa, heir to the hidden but advanced king… | Ryan Coogler | [Chadwick Boseman, Michael B. Jordan, Lupita N… | 520,689 | $700.06M
8 | A Star Is Born | 2018 | 136 | [Drama, Music, Romance] | 7.7 | 88 | A musician helps a young singer find fame as a… | Bradley Cooper | [Lady Gaga, Bradley Cooper, Sam Elliott, Greg … | 255,181 | $215.29M
21 | Hereditary | 2018 | 127 | [Drama, Horror, Mystery] | 7.3 | 87 | After the family matriarch passes away, a grie… | Ari Aster | [Toni Collette, Milly Shapiro, Gabriel Byrne, … | 154,803 | $44.07M
15 | Spider-Man: Into the Spider-Verse | 2018 | 117 | [Animation, Action, Adventure] | 8.5 | 87 | Teen Miles Morales becomes Spider-Man of his r… | Directors:Bob Persichetti, Peter Ramsey, Rodne… | [Shameik Moore, Jake Johnson, Hailee Steinfeld… | 218,754 | $190.24M
10 | Mission: Impossible – Fallout | 2018 | 147 | [Action, Adventure, Thriller] | 7.8 | 86 | Ethan Hunt and his IMF team, along with some f… | Christopher McQuarrie | [Tom Cruise, Henry Cavill, Ving Rhames, Simon … | 236,555 | $220.16M
25 | First Man | 2018 | 141 | [Biography, Drama, History] | 7.4 | 84 | A look at the life of the astronaut, Neil Arms… | Damien Chazelle | [Ryan Gosling, Claire Foy, Jason Clarke, Kyle … | 121,883 | $44.94M
22 | BlacKkKlansman | 2018 | 135 | [Biography, Crime, Drama] | 7.5 | 83 | Ron Stallworth, an African American police off… | Spike Lee | [John David Washington, Adam Driver, Laura Har… | 152,025 | $48.69M
4 | A Quiet Place | 2018 | 90 | [Drama, Horror, Sci-Fi] | 7.6 | 82 | In a post-apocalyptic world, a family is force… | John Krasinski | [Emily Blunt, John Krasinski, Millicent Simmon… | 308,397 | $188.02M

Let's look at the synopsis of the top entry, "Roma".

df['あらすじ'][31]
"A year in the life of a middle-class family's maid in Mexico City in the early 1970s."
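
Both synopsis lookups in this post hard-code the row label read off the sorted table (27 and 31). As a sketch, the label of the top row could instead be found programmatically with `idxmax` on a numeric copy of the score column and then used with `.loc`. The three-row `sample` frame below is made up for illustration; only its column names and row labels follow the ones used in this post.

```python
import numpy as np
import pandas as pd

# Made-up frame reusing the row labels and column names from this post
sample = pd.DataFrame(
    {"メタスコア": ["68", np.nan, "96"], "あらすじ": ["first", "second", "third"]},
    index=[0, 27, 31],
)

# Convert to numbers so the maximum is numeric; NaN is skipped by idxmax
scores = pd.to_numeric(sample["メタスコア"], errors="coerce")
top_label = scores.idxmax()

# Label-based lookup of the synopsis for that row
synopsis = sample.loc[top_label, "あらすじ"]
print(top_label, synopsis)
```

Using `.loc[row, column]` also avoids the chained `df[column][row]` form, which pandas generally discourages.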

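The film follows a year in the life of a maid working for a middle-class family in Mexico City in the early 1970s. This also appears to be a Netflix original. Critics rate it extremely highly, while ratings from general viewers are mixed. Speaking of Netflix, "American Factory", Obama's first documentary, is currently much talked about.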
