Working from this site, we'll learn `set` and `dict` by using them in actual programming.
Wordplay
Extended from code at http://programminghistorian.org/lessons/counting-frequencies
Suppose we have two books — maybe we’d like to see if they were written by the same author, or are otherwise similar. One approach to this is to evaluate the word use frequency in both texts, and then compute a “similarity” or “distance” measure between those two word frequencies. A related approach is to evaluate the frequency of one word being followed by another word (a “word pair”), and see the similarity in use of word pairs by the two texts. Or maybe we’re interested in seeing the set of all words that come after a given word in that text.
Here we’ll get some practice using dictionaries and sets, with such wordplay as our motivating example.
Some data to play with…
word_string1 = 'it was the best of times it was the worst of times '
word_string2 = 'it was the age of wisdom it was the age of foolishness'
word_string = word_string1 + word_string2
words1 = word_string1.split() #converts string with spaces to list of words
words2 = word_string2.split()
words = words1 + words2
print("words:", words)
Create some interesting dictionaries from these…
# Word frequencies
word_freq = {}
for w in words:
word_freq[w] = word_freq.get(w,0) + 1
print("Word frequencies:", word_freq)
# Word pair frequencies
word_pairs = {}
prev = words[0]
for w in words[1:]:
pair = (prev, w)
word_pairs[pair] = word_pairs.get(pair,0) + 1
prev = w
print("Pair frequencies:", word_pairs)
# What words follow each word?
word_after = {}
prev = words[0]
for w in words[1:]:
if prev not in word_after:
word_after[prev] = {w}
else:
word_after[prev].add(w)
prev = w
print("Words followed by\n" + str(word_after) + "\n")
# More Pythonic:
# zip is very handy for jointly processing the i_th element from multiple lists.
# Note that in python3 zip is a generator, so is very efficient.
word_after = {}
for w1, w2 in zip(words, words[1:]):
word_after.setdefault(w1, set()).add(w2)
print("Words followed by\n" + str(word_after) + "\n")
"""NOTE: In the above, `setdefault` is another useful method
for dictionaries that saves a lot of initialization code:
d.setdefault(key, val)
is equivalent to
if key not in d:
d[key] = val
d[key]
"""
Rewriting as procedures so we can use them later:
def get_freq(words):
word_freq = {}
for w in words:
word_freq[w] = word_freq.get(w,0) + 1
return word_freq
def get_pair_freq(words):
word_pairs = {}
prev = words[0]
for w in words[1:]:
pair = (prev, w)
word_pairs[pair] = word_pairs.get(pair,0) + 1
prev = w
return word_pairs
Suppose we want to identify the high frequency words (i.e., sort word frequency from high to low). Our dictionary is “backwards” so doesn’t directly answer this. An approach is to build an “inverse” data structure to make answering this question easier:
def sort_freq_dict(freq):
aux = [(freq[key], key) for key in freq] #list comprehension
aux.sort() #sort in place
aux.reverse() #reverse in place
return aux #return an (ordered) list, not a dictionary
#More pythonic:
def sort_freq_dict(freq):
return sorted([(freq[key], key) for key in freq], reverse=True)
words_by_frequency = sort_freq_dict(word_freq)
print(words_by_frequency)
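For comparison, the standard library's `collections.Counter` packages both the counting and the high-to-low sort; this is a sketch of an alternative, not part of the original lesson:

```python
from collections import Counter

words = 'it was the best of times it was the worst of times'.split()
counts = Counter(words)        # same counts as get_freq(words)
print(counts.most_common(3))   # top-3 (word, count) pairs, highest count first
```

Note one small difference: `most_common` orders ties by first appearance, whereas `sort_freq_dict` sorts `(count, word)` tuples, so ties break alphabetically (reversed).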
Next, we can build a similarity measure between two word frequencies. We’ll use a typical “geometric” notion of distance or similarity referred to as “cosine similarity.” We build this from vector measures of word frequency including the “norm” and the “dot product”, and then calculate a (normalized) cosine distance.
The mathematical details are not crucial here — but we do want to note how we use dictionary and set operations.
def freq_norm(freq):
""" Norm of frequencies in freq, as root-mean-square of freqs """
sum_square = 0
for w, num in freq.items(): #iterates over key, val in dictionary
sum_square += num**2
return sum_square**0.5
#More pythonic version:
def freq_norm(freq):
return sum(num*num for num in freq.values())**0.5 #comprehension on just dict values
def freq_dot(freq1, freq2):
""" Dot product of two word freqs, as sum of products of freqs for words"""
words_in_both = set(freq1) & set(freq2)
total = 0
for w in words_in_both:
total += freq1[w] * freq2[w]
return total
#More pythonic version (actually easier to understand!)
def freq_dot(freq1, freq2):
return sum(freq1[w] * freq2[w] for w in set(freq1) & set(freq2))
import math
#Returns similarity between two word (or pair) frequencies dictionaries.
# 1 indicates identical word (or pair) frequencies;
# 0 indicates completely different words (pairs)
def freq_similarity(freq1, freq2):
d = freq_dot(freq1, freq2)/(freq_norm(freq1)*freq_norm(freq2))
ang = math.acos(min(1.0, d))
return 1 - ang/(math.pi/2)
# Some quick tests/examples
x = {'arm':40, 'brick':2}
y = {'cat':3, 'arm':30}
print("Combined words:", set(x) | set(y))
print("freq_norm of", x, ":", freq_norm(x))
print("freq_norm of", y, ":", freq_norm(y))
print("freq_dot of", x, "and", y, ":", freq_dot(x,y))
print("freq_similarity:", freq_similarity(x,y))
# Try it out with our short phrases
words3 = "this is a random sentence good enough for any random day".split()
print("words1:", words1, "\nwords2:", words2, "\nwords3:", words3, "\n")
# build word and pair frequency dictionaries, and calculate some similarities
freq1 = get_freq(words1)
freq2 = get_freq(words2)
freq3 = get_freq(words3)
print("words1 vs. words2 -- word use similarity: ", freq_similarity(freq1, freq2))
print("words1 vs. words3 -- word use similarity: ", freq_similarity(freq1, freq3))
print("words3 vs. words3 -- word use similarity: ", freq_similarity(freq3, freq3))
# try that for similarity of WORD PAIR use...
pair1 = get_pair_freq(words1)
pair2 = get_pair_freq(words2)
pair3 = get_pair_freq(words3)
print("words1 vs. words2 -- pair use similarity: ", freq_similarity(pair1, pair2))
print("words1 vs. words3 -- pair use similarity: ", freq_similarity(pair1, pair3))
Try it with real books
with open('hamlet.txt', 'r', encoding='utf8') as f:
    hamlet = f.read().replace('\n', ' ').lower()  # ' ' (not '') so a line break can't fuse two words
hamlet_words = hamlet.split()
with open('macbeth.txt', 'r', encoding='utf8') as f:
    macbeth = f.read().replace('\n', ' ').lower()
macbeth_words = macbeth.split()
with open('alice_in_wonderland.txt', 'r', encoding='utf8') as f:
    alice = f.read().replace('\n', ' ').lower()
alice_words = alice.split()
print(len(hamlet_words), len(macbeth_words), len(alice_words))
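One caveat with a plain `split()`: punctuation stays attached, so `'time,'` and `'time'` count as different words. A hedged sketch of a normalization step (the helper name and translation table are assumptions, not part of the lesson):

```python
import string

def normalize_words(text):
    """Lowercase, delete ASCII punctuation, and split into words."""
    table = str.maketrans('', '', string.punctuation)  # maps each punctuation char to None
    return text.lower().translate(table).split()

print(normalize_words("The time, the time -- 'tis now!"))
# ['the', 'time', 'the', 'time', 'tis', 'now']
```

Running the frequency comparisons on normalized words would change the exact similarity numbers, though not the overall conclusion.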
With the text from those books in hand, let’s look at similarities…
hamlet_freq = get_freq(hamlet_words)
macbeth_freq = get_freq(macbeth_words)
alice_freq = get_freq(alice_words)
print("similarity of word freq between hamlet & macbeth:", freq_similarity(hamlet_freq, macbeth_freq))
print("similarity of word freq between alice & macbeth:", freq_similarity(alice_freq, macbeth_freq))
hamlet_pair = get_pair_freq(hamlet_words)
macbeth_pair = get_pair_freq(macbeth_words)
alice_pair = get_pair_freq(alice_words)
print("\nsimilarity of word pairs between hamlet & macbeth:", \
freq_similarity(hamlet_pair, macbeth_pair))
print("similarity of word pairs between alice & macbeth:", \
freq_similarity(alice_pair, macbeth_pair))
So we’ve confirmed that Macbeth is more similar to Hamlet than to Alice in Wonderland, both in word use and in word pair usage. Good to know!