Voice actor profiles - 驚異のアニヲタ社会復帰の予備

MikuHatsune 2015-05-17

　
A kind soul gave me a hand.
rvest で声優の男女データをスクレイピング #rstatsj - Qiita (scraping voice actor gender data with rvest)
It really pays to know people who are good at programming!!
　
I wanted to collect data such as voice actors' ages and agencies for analysis.
As usual, I lift the data from .lain.

h <- "http://lain.gr.jp/voicedb/profile/"
url <- paste(h, seq(5000), sep="")
write.table(url, "url.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

Then fetch the URLs in this file with wget.

wget -i url.txt
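
If wget is not at hand, the same download step could be sketched in Python. This is only a rough equivalent under assumptions: `fetch_bytes` and `download_all` are hypothetical names, the one-second pause is an arbitrary choice, and HTTP errors (for example profile numbers that do not exist) are simply left to propagate.

```python
import os
import time
import urllib.request


def fetch_bytes(url):
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=fetch_bytes, delay=1.0):
    """Save each URL under out_dir, named after the last path component."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(out_dir, url.rstrip("/").rsplit("/", 1)[-1])
        if not os.path.exists(path):  # skip pages already on disk
            body = fetch(url)  # fetch first so a failure leaves no empty file
            with open(path, "wb") as out:
                out.write(body)
            time.sleep(delay)  # pause between requests
        saved.append(path)
    return saved
```

Calling `download_all(urls, "profile")` with the URLs from url.txt would then write files named 1, 2, 3, ... into a `profile` directory, which is what the parsing script expects to list.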

　
After that, brute-force the fields that look useful out of the pages, using either the HTML tags or regular expressions.

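# Python
import os
import re

wd = "/profile/"
files = os.listdir(wd)
# One pattern per column; each expects the <dt> label and its <dd> value on the same line.
pat = [r"<dt>名前</dt><dd>.+</dd>",
       r"<dt>ローマ字</dt><dd>.+</dd>",
       r"<dt>誕生日</dt><dd>\d+年\d+月\d+日</dd>",
       r"<dt>身長</dt><dd>\d+cm</dd>",
       r"<dt>出身地</dt><dd>.+?</dd>",
       r"<dt>血液型</dt><dd>.+型</dd>"]

# Build lists rather than map objects so they can be reused and indexed on Python 3.
res = [re.compile(x) for x in pat]
header = ["name", "name_en", "birth", "height", "place", "blood", "production"]
w0 = open("cv_profile.txt", "w")
w0.write("\t".join(header) + "\n")
for fname in files:
	# On Python 3, pass encoding= here if the pages are not in the locale's default encoding.
	g = open(wd + fname, "r")
	tmp_res = [""] * len(header)
	flag = 0
	# Scan at most the first 1000 lines of each page.
	for i in range(1000):
		tmp = g.readline()
		if flag == 1:
			# The previous line was <dt>所属</dt>: skip the line just read and take the
			# one after it as the agency, dropping its last four characters.
			tmp = g.readline().strip()[:-4]
			tmp_res[-1] = tmp
			flag = 0

		d = [x.findall(tmp) for x in res]
		re_find = [len(m) for m in d]
		if sum(re_find) >= 1:
			for j in range(len(re_find)):
				if re_find[j] > 0:
					# Strip the trailing </dd> and keep the text after <dd>.
					tmp_res[j] = d[j][0][:-5].split("<dd>")[-1]

		if "<dt>所属</dt>" in tmp:
			flag = 1

	g.close()
	w0.write("\t".join(tmp_res) + "\n")

w0.close()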
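
To make the slicing in the loop above concrete, here is a small sketch that applies one of the patterns to a single line. The sample line is made up and only assumed to resemble the real pages, and `extract_dd` is a hypothetical helper that is not part of the script.

```python
import re

# Same shape as the patterns above: a <dt> label followed by its <dd> value.
HEIGHT = re.compile(r"<dt>身長</dt><dd>\d+cm</dd>")


def extract_dd(pattern, line):
    """Return the text inside <dd>...</dd> for the first match, or "" if none."""
    found = pattern.findall(line)
    if not found:
        return ""
    # found[0] is the whole "<dt>..</dt><dd>..</dd>" string.
    # [:-5] drops the trailing "</dd>"; split("<dd>")[-1] keeps only the value.
    return found[0][:-5].split("<dd>")[-1]


# Hypothetical sample line (an assumption about the page layout).
sample = "<dl><dt>身長</dt><dd>158cm</dd></dl>"
print(extract_dd(HEIGHT, sample))  # 158cm
```

One caveat with this approach: the greedy `.+` patterns run to the last `</dd>` on a line, so if several fields ever share one line, `split("<dd>")[-1]` keeps the last value rather than the one belonging to the label.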

　
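For now, with the data in hand, I at least plotted the distributions.
Young talent has been on the rise in recent years, so voice actors who are still minors show up here and there. Even so, the largest group is voice actors around 40.
There is a bump around 80, but because I do not account for voice actors who have passed away, the data implies that 100-year-old voice actors exist.

I went to a 種田梨沙 (Risa Taneda) talk show and was delighted. Plotting the headcount per agency, with enough agencies shown that 大沢事務所 (Osawa Office) makes the cut, looks like this.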

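dat <- read.delim("cv_profile.txt", stringsAsFactors=FALSE)
# Parse the full birth date; rows without a birthday become NA.
b <- as.Date(dat$birth, "%Y年%m月%d日")
Sys.Date() - b  # quick look at the raw differences in days
# Age in whole years, approximating a year as 365 days.
age <- as.numeric(floor(difftime(Sys.Date(), b, units="days")/365))
hist(age, main="Age distribution of voice actors", nclass=50)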
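
The R code above treats a year as 365 days, which is fine for a histogram. For comparison, a minimal Python sketch of the exact age in whole years, assuming birthdays come in the `YYYY年M月D日` form that the scraper captures. `parse_birth` and `age_on` are hypothetical names, and the birthday used is made up.

```python
import re
from datetime import date

BIRTH = re.compile(r"(\d+)年(\d+)月(\d+)日")


def parse_birth(text):
    """Parse 'YYYY年M月D日' into a date, or return None if it does not match."""
    m = BIRTH.fullmatch(text.strip())
    if m is None:
        return None
    year, month, day = (int(v) for v in m.groups())
    return date(year, month, day)


def age_on(birth, today):
    """Whole years from birth to today, counting a year only once the birthday has passed."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - (1 if before_birthday else 0)


# Made-up birthday, evaluated on the date of this post.
b = parse_birth("1990年5月18日")
print(age_on(b, date(2015, 5, 17)))  # 24
```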

# Number of voice actors per agency
pro <- sort(table(dat$production), decreasing=TRUE)
n <- 15
par(mar=c(2, 10, 2, 4))
b0 <- barplot(head(pro, n), horiz=TRUE, las=1)
text(head(pro, n), c(b0), head(pro, n), pos=4, xpd=TRUE)
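
The same per-agency count can be sketched in Python as a cross-check. This assumes the tab-separated layout written by the scraper, uses made-up rows in place of cv_profile.txt, and `top_agencies` is a hypothetical name. Unlike `table()`, which counts an empty agency field as a category of its own, this sketch skips such rows.

```python
import csv
import io
from collections import Counter


def top_agencies(tsv_file, n=15):
    """Return the n most common non-empty values of the production column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(row["production"] for row in reader if row["production"])
    return counts.most_common(n)


# Made-up rows standing in for cv_profile.txt.
sample = io.StringIO(
    "name\tproduction\n"
    "A\tAgency X\n"
    "B\tAgency Y\n"
    "C\tAgency X\n"
    "D\t\n"
)
print(top_agencies(sample, n=2))  # [('Agency X', 2), ('Agency Y', 1)]
```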