Pythonスクレーピング:データから必要な情報だけを抽出する方法

その買うを、もっとハッピーに。|ハピタス

今回はこのサイトのコードを参考にして、ウェブサイトからデータをダウンロードして、そのダウンロードしたデータの中から欲しいデータだけを抽出する方法を学習する。

スポンサーリンク

環境設定

先ずは今回のチュートリアルに必要なデータフォルダを作成する。

!mkdir pay && mkdir pay/tmp
!mkdir pay/tmp/capublicpay
スポンサーリンク

データ抽出

次に下のコードを実行して必要なデータをダウンロードして、そのダウンロードしたデータから欲しいデータだけを抽出する。

# The title of the highest paid California city government position in 2010
# note, the code below makes it easy to extend "years" to include multiple years
import csv
import os.path
import requests
from shutil import unpack_archive
LOCAL_DATADIR = "pay/tmp/capublicpay"
YEARS = range(2009,2017) # i.e. just 2010
def foosalary(row):
    return float(row['Total Wages']) if row['Total Wages'] else 0

for year in YEARS:
    bfname = '%s_City' % year
    url = 'https://publicpay.ca.gov/RawExport/%s.zip' % bfname
    zname = os.path.join("pay/tmp", bfname + '.zip')
    cname = os.path.join(LOCAL_DATADIR, bfname + '.csv')

    if not os.path.exists(zname):
        print("Downloading", url, 'to', zname)
        data = requests.get(url).content
        with open(zname, 'wb') as f:
            f.write(data)
    # done downloading, now unzip files
    print("Unzipping", zname, 'to', LOCAL_DATADIR)
    unpack_archive(zname, LOCAL_DATADIR, format = 'zip')

    with open(cname, encoding = 'latin-1') as f:
        # first four lines are:
        # “Disclaimer
        #
        # The information presented is posted as submitted by the reporting entity. The State Controller's Office is not responsible for the accuracy of this information.”
        data = list(csv.DictReader(f.readlines()))
        topitem = max(data, key = foosalary)
        print(topitem['Entity Name'], topitem['Department / Subdivision'],
            topitem['Position'], topitem['Total Wages'])
Downloading https://publicpay.ca.gov/RawExport/2009_City.zip to pay/tmp/2009_City.zip
Unzipping pay/tmp/2009_City.zip to pay/tmp/capublicpay
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-57-d7dd7a0d6456> in <module>()
     31         # The information presented is posted as submitted by the reporting entity. The State Controller's Office is not responsible for the accuracy of this information.”
     32         data = list(csv.DictReader(f.readlines()))
---> 33         topitem = max(data, key = foosalary)
     34         print(topitem['Entity Name'], topitem['Department / Subdivision'],
     35             topitem['Position'], topitem['Total Wages'])

<ipython-input-57-d7dd7a0d6456> in foosalary(row)
      8 YEARS = range(2009,2017) # i.e. just 2010
      9 def foosalary(row):
---> 10     return float(row['Total Wages']) if row['Total Wages'] else 0
     11 
     12 for year in YEARS:

KeyError: 'Total Wages'

Total WagesというKey nameに問題があるらしい。

ls pay/tmp/capublicpay
2009_City.csv

csvファイルの最初の1行をチェックしてみる。

a = !head -n 1 pay/tmp/capublicpay/2009_City.csv
print(a)
['"Year","EmployerType","EmployerName","DepartmentOrSubdivision","Position","ElectedOfficial","Judicial","OtherPositions","MinPositionSalary","MaxPositionSalary","ReportedBaseWage","RegularPay","OvertimePay","LumpSumPay","OtherPay","TotalWages","DefinedBenefitPlanContribution","EmployeesRetirementCostCovered","DeferredCompensationPlan","HealthDentalVision","TotalRetirementAndHealthContribution","PensionFormula","EmployerURL","EmployerPopulation","LastUpdatedDate","EmployerCounty","SpecialDistrictActivities","IncludesUnfundedLiability","SpecialDistrictType"']

あるいはpandasを使って表示させるという手もある。

import pandas as pd
a = pd.read_csv("pay/tmp/capublicpay/2009_City.csv", nrows=0)
print(a)
Empty DataFrame
Columns: [Year, EmployerType, EmployerName, DepartmentOrSubdivision, Position, ElectedOfficial, Judicial, OtherPositions, MinPositionSalary, MaxPositionSalary, ReportedBaseWage, RegularPay, OvertimePay, LumpSumPay, OtherPay, TotalWages, DefinedBenefitPlanContribution, EmployeesRetirementCostCovered, DeferredCompensationPlan, HealthDentalVision, TotalRetirementAndHealthContribution, PensionFormula, EmployerURL, EmployerPopulation, LastUpdatedDate, EmployerCounty, SpecialDistrictActivities, IncludesUnfundedLiability, SpecialDistrictType]
Index: []

[0 rows x 29 columns]

Key namesが変わってしまっているので、変わった名称を全て修正する。

# The title of the highest paid California city government position in 2010
# note, the code below makes it easy to extend "years" to include multiple years
import csv
import os.path
import requests
from shutil import unpack_archive
LOCAL_DATADIR = "pay/tmp/capublicpay"
YEARS = range(2009,2018) # i.e. just 2010
def foosalary(row):
    return float(row['TotalWages']) if row['TotalWages'] else 0

for year in YEARS:
    bfname = '%s_City' % year
    url = 'https://publicpay.ca.gov/RawExport/%s.zip' % bfname
    zname = os.path.join("pay/tmp", bfname + '.zip')
    cname = os.path.join(LOCAL_DATADIR, bfname + '.csv')

    if not os.path.exists(zname):
        print("Downloading", url, 'to', zname)
        data = requests.get(url).content
        with open(zname, 'wb') as f:
            f.write(data)
    # done downloading, now unzip files
    print("Unzipping", zname, 'to', LOCAL_DATADIR)
    unpack_archive(zname, LOCAL_DATADIR, format = 'zip')

    with open(cname, encoding = 'latin-1') as f:
        # first four lines are:
        # “Disclaimer
        #
        # The information presented is posted as submitted by the reporting entity. The State Controller's Office is not responsible for the accuracy of this information.”
        data = list(csv.DictReader(f.readlines()))
        topitem = max(data, key = foosalary)
        print(topitem['EmployerCounty'], topitem['DepartmentOrSubdivision'],
            topitem['Position'], topitem['TotalWages'])
Unzipping pay/tmp/2009_City.zip to pay/tmp/capublicpay
Los Angeles Administration Chief Administrative Officer 1171423
Unzipping pay/tmp/2010_City.zip to pay/tmp/capublicpay
Los Angeles Administration Chief Administrative Officer 765237
Unzipping pay/tmp/2011_City.zip to pay/tmp/capublicpay
Riverside City Manager City Manager 677172
Unzipping pay/tmp/2012_City.zip to pay/tmp/capublicpay
Orange City Managers City Manager 545394
Unzipping pay/tmp/2013_City.zip to pay/tmp/capublicpay
Los Angeles Police Police Officer III 500930
Unzipping pay/tmp/2014_City.zip to pay/tmp/capublicpay
Los Angeles Police Department Police Sergeant 592652
Unzipping pay/tmp/2015_City.zip to pay/tmp/capublicpay
Placer Fire Asst Fire Chief 509123.9
Unzipping pay/tmp/2016_City.zip to pay/tmp/capublicpay
Solano Fire Firefighter/56.3 1957089
Downloading https://publicpay.ca.gov/RawExport/2017_City.zip to pay/tmp/2017_City.zip
Unzipping pay/tmp/2017_City.zip to pay/tmp/capublicpay
Los Angeles Harbor Chief Port Pilot II 585165

585165ドルは日本円で6500万円なので、地方公務員の収入にしては高額だろう。ちなみにアメリカの公務員の場合、この収入だと年金も年5000万円以上支給される。とは言っても、州立大学のバスケやアメフトのコーチなんかはこの10倍以上の収入を得ている人達もいるらしく、例えばアラバマ大学のアメフトコーチは8億円近い収入を得ているという情報もあるほどで、上には上がいるとしか言えない。もちろん彼らも州立大学の職員なので地方公務員ということになる。

スポンサーリンク
スポンサーリンク