声優統計第七号：結婚したら声優は仕事が減るのか? 〜種田梨沙が結婚したら僕はもう…〜

MikuHatsune2015-12-29

この記事はR Advent Calendar 2015 の第29日目の記事です(勝手に参加
　
声優統計第七号でInterrupted time series analysis を使って、結婚と仕事の影響を調べた。
結論からすると、パラメータ的には仕事は減るが、半年で1本程度なのであまり気にしなくてよさそう。
アニメ出演本数のパターンは、種田梨沙は生天目仁美、主役については斉藤千和と似ていて、どちらも出演数が多いため結婚後も仕事の数は減るどころかむしろ増える傾向。
いつものように詳しいことは本誌を買ってくださいお願いします。
　
完売しました。

### interruptet time series の解説
lambda <- 7
trend <- c(0, 0.12)
level <- 3
t_intervention <- 15
s0 <- 1:30

It <- ifelse(s0<t_intervention, 0, 1)
Itime <- ifelse(s0<t_intervention, 0, s0-t_intervention)

a <- lambda + trend[1]*s0+
level*It+
trend[2]*Itime

plot(s0, a)
tmp <- colMeans(mapply(rpois, n=30, a))
par(mar=c(4, 4.5, 2, 2))
plot(s0, tmp, xlab="Time", ylab="Y", las=1, cex.axis=1.5, cex.lab=1.5, pch=16)
lines(s0[s0<t_intervention], a[s0<t_intervention], lty=2, lwd=2)
lines(s0[s0>=t_intervention], a[s0>=t_intervention], lty=2, lwd=2)
abline(v=t_intervention, lty=3, lwd=2)
y1 <- (2*lambda+level)/2 + c(1, -1)*1.2
for(i in y1) arrows(14, (2*lambda+level)/2, y1=i, length=0.12, lwd=3)
text(t_intervention-1, (2*lambda+level)/2, "level", font=4, cex=1.5, pos=2)
xy <- 2
y <- 1.5
arrows(t_intervention+xy, a[t_intervention+xy]-y, max(s0)-xy, a[max(s0)-xy]-y, length=0.12, lwd=3)
text((t_intervention+max(s0))/2, (a[t_intervention+xy]+a[max(s0)-xy])/2-y-0.3, "trend", font=4, pos=1, cex=1.5)
text(t_intervention, par()$usr[4], "Event", font=4, pos=3, xpd=TRUE, cex=1.5)

　
heatmap.2 の使い方だが、Colv = FALSE にするとクラスタリングの並び替えをしないらしい。これで時系列データがそのまま表示できる。
cellnote でセル内に文字が書けること、 notecol で色付けできるようだが、何故か並びがh$rowInd になっていてしかも転置されているっぽいので、一回 h というheatmap.2 オブジェクトを作ってからプロットし直さないといけないみたいで面倒だった。
　
アニメ出演本数

主役キャラ出演本数

setwd("/voicestat/C89/")
library(gplots)
dat <- read.delim("data.txt", header=FALSE, stringsAsFactors=FALSE)
dat$V4 <- ifelse(dat$V4>1, 1, dat$V4)
dat$V1 <- gsub(" \\(声優\\)", "", dat$V1)

# 主役キャラ出演本数については # のほうを実行する
taneda <- read.delim("taneda.txt", header=FALSE, stringsAsFactors=FALSE)
taneda_cast <- table(taneda$V2)
# taneda_cast <- table(subset(taneda, V4>0)$V2)

set.seed(1234)
taneda_casty <- mapply(function(y) mapply(function(x) sample(c(0, x%%2)) + x%/%2, y), taneda_cast)

prof <- read.table("marrige.txt", header=TRUE, stringsAsFactors=FALSE)
prof <- prof[order(prof$name),]

marrige <- as.Date(prof$marrige)

cast <- mapply(function(x) table(factor(x$V2, min(x$V2):2015)), split(dat, dat$V1))
# cast <- mapply(function(x) table(factor(subset(x, V4>0)$V2, min(x$V2):2015)), split(dat, dat$V1))

set.seed(1234)
casty <- mapply(function(y) mapply(function(x) sample(c(0, x%%2)) + x%/%2, y), cast)

ma <- rapply(casty, function(x) x%/%100, how="replace")
for(i in seq(ma)){
  x <- (as.numeric(format(marrige[i], "%m")) > 6) + 1
  y <- format(marrige[i], "%Y")
  ma[[i]][x, y] <- 1
  ma[[i]][(which(ma[[i]]==1)+1):length(ma[[i]])] <- 2
}

r <- mapply(function(x) x$length, lapply(sapply(ma, c), rle))
clab <- c(rev(-seq(max(r[1,]))), 0:max(r[3,]))
m1 <- m2 <- matrix(NA, nr=ncol(r), nc=length(clab), dimnames=list(colnames(r), clab))
for(i in seq(nrow(m1))){
  j <- as.character(c(rev(-seq(r[1,i])), 0:r[3,i]))
  m1[i, j] <- c(casty[[i]])                 # 出演アニメ数
  m2[i, j] <- ifelse(c(ma[[i]]) <= 1, 0, 1) # 結婚状態
}

# 種田梨沙をつけたす
m1 <- rbind(m1, NA)
m1[nrow(m1), as.character(-rev(seq(taneda_casty)-1))] <- c(taneda_casty)
rownames(m1)[nrow(m1)] <- "種田梨沙 ★"

m1 <- m1[,26:ncol(m1)]
m2 <- m2[,26:ncol(m2)]
m3 <- m1

sort(as.matrix(dist(m1))["種田梨沙",])[2]
cols <- grey(seq(1, 0, length=max(m1, na.rm=TRUE)+1))
h <- heatmap.2(m1, trace="none", col=cols, hclustfun=function(b){hclust(dist(b), method="ward")}, Colv=FALSE)

cellcols <- ifelse(c(t(m1[h$rowInd,]))>=9, cols[1], cols[15])
# cellcols <- ifelse(c(t(m1[h$rowInd,]))>=4, cols[1], cols[15])
notecex <- ifelse(c(t(m1[h$rowInd,]))>=10, 0.8, 1)

m3[m3==0] <- "."
heatmap.2(m1, trace="none", col=cols, hclustfun=function(b){hclust(dist(b), method="ward")}, key=FALSE, Colv=FALSE, lhei=c(0.05, 2), lwid=c(1, 5), margins=c(2, 5), colsep=31, sepcolor=2, cellnote=m3, notecol=cellcols, notecex=notecex)

　
RStan のモデルとしては3つ作った。

# marrige01.stan
data {
  int<lower=0> n_cv;
  int<lower=0> time;
  int<lower=0> cast[n_cv, time];
  int<lower=0, upper=1> marrige_status[n_cv, time];
  int s0[n_cv]; # NA でない最初
  int s2[n_cv]; # NA でない最後
}
parameters{
  real intercept;
  real beta_0;
  #real beta[n_cv];
}
model {
  for(i in 1:n_cv){
    for(j in s0[i]:s2[i]){
      cast[i, j] ~ poisson(intercept + beta_0*marrige_status[i, j]);
    }
  }
}

# marrige02.stan
data {
  int<lower=0> n_cv;
  int<lower=0> time;
  int<lower=0> cast[n_cv, time];
  int<lower=0, upper=1> marrige_status[n_cv, time];
  int s0[n_cv]; # NA でない最初
  int s2[n_cv]; # NA でない最後
}
parameters{
  real intercept;
  real beta_0;
  real beta[n_cv];
}
model {
  for(i in 1:n_cv){
    for(j in s0[i]:s2[i]){
      cast[i, j] ~ poisson(intercept + (beta_0+beta[i])*marrige_status[i, j]);
    }
  }
}

# marrige03.stan
data {
  int<lower=0> n_cv;
  int<lower=0> time;
  int<lower=0> cast[n_cv, time];
  int<lower=0, upper=1> marrige_status[n_cv, time];
  int s0[n_cv]; # NA でない最初
  int s1;       # 結婚直後のindex
  int s2[n_cv]; # NA でない最後
}
parameters{
  real intercept;
  real trend[2];
  real beta_0;
  real beta[n_cv];
}
model {
  for(i in 1:n_cv){
    for(j in s0[i]:s2[i]){
      cast[i, j] ~ poisson(intercept + trend[1]*i+(beta_0+beta[i])*marrige_status[i, j]+trend[2]*if_else(i <= s1, 0, i-s1));
    }
  }
}

Stanは欠損値のデータは別ループで回す必要があるため回避する。
結婚によってアニメ出演本数が減るか増えるかのパラメータ $\beta$ (本誌では $\alpha$ になっているけれども)をメタアナリシス風にプロットする。
　
アニメ出演本数

　
主役キャラ出演本数

　
ちなみに、収束の $\hat{R}$ がまったく1 に近づかなかったので、分散に事前分布を導入したがほんの少し改善しただけでやっぱり収束しなかった。

# marrige02_2.stan
data {
  int<lower=0> n_cv;
  int<lower=0> time;
  int<lower=0> cast[n_cv, time];
  int<lower=0, upper=1> marrige_status[n_cv, time];
  int s0[n_cv]; # NA でない最初
  int s2[n_cv]; # NA でない最後
}
parameters{
  real intercept;
  real beta_0;
  real beta[n_cv];
  real<lower=0> sigma_alpha[n_cv]; # 分散
}
model {
  for(i in 1:n_cv){
    beta[i] ~ normal(0, sigma_alpha[i]);
    for(j in s0[i]:s2[i]){
      cast[i, j] ~ poisson(intercept + (beta_0+beta[i])*marrige_status[i, j]);
    }
  }
}

# NA を回避するための index
idx <- apply(!is.na(m1), 1, which)
s0 <- sapply(idx, head, 1)
s1 <- which(colnames(m1) == "0")
s2 <- sapply(idx, tail, 1)

m1[is.na(m1)] <- 0
m2[is.na(m2)] <- 0

alpha <- 0.05
library(rstan)
# 結婚による出演数低下パラメータ1つのモデル
stanmodel01 <- stan_model("marrige01.stan")
standata <- list(n_cv=nrow(prof), time=ncol(m1),
                 cast=m1, marrige_status=m2,
                 s0=s0, s2=s2
                 )

fit <- sampling(stanmodel01, data=standata, chains=3, warmup=500, iter=1500, seed=1234)
ex <- extract(fit)
sapply(head(ex, 2), quantile, c(alpha/2, 0.5, 1-alpha/2))

# 各声優ごとの項をつける
stanmodel02 <- stan_model("marrige02.stan")
fit <- sampling(stanmodel02, data=standata, chains=3, warmup=500, iter=1500, seed=1234)
ex <- extract(fit)
sapply(head(ex, 2), quantile, c(alpha/2, 0.5, 1-alpha/2))
b02 <- apply(ex$beta, 2, quantile, c(alpha/2, 0.5, 1-alpha/2))
colnames(b02) <- rownames(m1)

# trend もつける
stanmodel03 <- stan_model("marrige03.stan")
standata <- list(n_cv=nrow(prof), time=ncol(m1),
                 cast=m1, marrige_status=m2,
                 s0=s0, s1=s1, s2=s2
                 )
fit <- sampling(stanmodel03, data=standata, chains=3, warmup=500, iter=1500, seed=12345)
ex <- extract(fit)
apply(cbind(ex$intercept, ex$trend), 2, quantile, c(alpha/2, 0.5, 1-alpha/2))
quantile(ex$beta_0, c(alpha/2, 0.5, 1-alpha/2))
b03 <- apply(ex$beta, 2, quantile, c(alpha/2, 0.5, 1-alpha/2))
colnames(b03) <- rownames(m1)

b <- lapply(list(b02, b03), t)
# beta の plot
idx <- which(apply(b[[1]]<0, 1, all) | apply(b[[1]]>0, 1, all)|
             apply(b[[2]]<0, 1, all) | apply(b[[2]]>0, 1, all)) # どちらかのモデルで有意

idx <- which((apply(b[[1]]>0, 1, all)&apply(b[[2]]>0, 1, all))|
(apply(b[[1]]<0, 1, all)&apply(b[[2]]<0, 1, all))) # どちらでも有意
idx <- c(idx, 24) # 24 は生天目
# idx <- c(idx, 15) # 24 は斉藤千和

o <- order(b[[1]][idx,2], decreasing=TRUE)
bo <- mapply(function(x) x[idx,][o,], b, SIMPLIFY=FALSE) # 並び替え

# plot
xl <- c(-3, 7)
yl <- c(length(o)+1, 1)

par(mar=c(5, 5.5, 2, 9))
plot(1, type="n", xlab="", ylab="", xlim=xl, ylim=yl, yaxt="n", bty="n", xaxt="n")
d <- -8:8
axis(1, d, d)
abline(h=1:(length(o)+1)-0.5, lty=3, col=grey(0.5))
abline(v=0, lty=3, col=grey(0.5))
text(par()$usr[1]-3, seq(nrow(bo[[1]])), rownames(bo[[1]]), xpd=TRUE, pos=4)
legend(par()$usr[2], par()$usr[3], legend=paste("Model", 2:3), lty=c(1,3), bg="white", lwd=3, xpd=TRUE, xjust=-0.1)
title("アニメの出演本数（半年あたり）")
#title("主役の出演本数（半年あたり）")
xy <- c(3, 0.2)
y1 <- 1.7
p1 <- par()$usr[3]
for(i in seq(2)){
  arrows(0+c(1,-1)[i]*xy[2], p1+y1, c(1,-1)[i]*xy[1], xpd=TRUE, length=0.12, lwd=2)
  text(c(1,-1)[i]*mean(xy), p1+y1, c("増える", "減る")[i], xpd=TRUE, pos=1)
}
text(0, p1+y1, "アニメ出演本数が", xpd=TRUE, pos=3)
text(0, p1+y1, expression(alpha), xpd=TRUE, cex=1.5, pos=1)
lag <- c(-1, 1)*0.19
cols <- c(1, 1)
for(j in seq(bo)){
for(i in seq(nrow(bo[[j]]))){
  segments(bo[[j]][i,1], i+lag[j], bo[[j]][i,3], lty=c(1,3)[j], col=cols[j], lwd=2)
  points(bo[[j]][i,2], i+lag[j], pch=c(16, 16)[j], col=cols[j])
  tmp <- as.character(round(bo[[j]][i,], 2))
  for(k in seq(tmp)){
    if(length(grep("-", tmp[k])) < 1){
      tmp01 <- strsplit(tmp[k], "\\.")[[1]]
      tmp01[1] <- paste(" ", tmp01[1], sep="")
      if(nchar(tmp01[2])<2) tmp01[2] <- paste(tmp01[2], "0", sep="")
      tmp[k] <- paste(tmp01, collapse=".")
    } else {
      tmp01 <- strsplit(tmp[k], "\\.")[[1]]
      if(nchar(tmp01[2])<2) tmp01[2] <- paste(tmp01[2], "0", sep="")
      tmp[k] <- paste(tmp01, collapse=".")
    }
  }
  txt <- paste(tmp[2], paste("[", tmp[1], ",", sep=""), paste(tmp[3], "]", sep=""), sep=" ")
  text(par()$usr[2], i+lag[j]*1.2, txt, xpd=TRUE, pos=4)
}
}

　
html 内に声優の名前でwiki からとってきたデータがごっそりあるとする。
このデータはwget とかでとってくる。

import os
import re
from operator import add # flatten
from operator import xor
wd = "/html/"
htmls = os.listdir(wd)

outwd = "/test/"
toctext = '<span class="toctext">'
toc = re.compile('<span class="toctext">.+?</span></a>')
# 目次読んで
# 行をマーキングして
#　3回目で処理する
#w0 = open("/output.txt", "w")
for i in range(len(htmls)):
  g = open(wd + htmls[i], "rU")
  line = [d.rstrip() for d in g]
  mokuji_idx = map(lambda x: "目次" in x, line).index(True)
  # 目次を処理する
  mokuji = []
  for j in line:
    sen0 = toc.findall(j)
    if len(sen0) > 0:
      sen = sen0[0]
      for e in ['<span class="toctext">', '</span></a>']:
        sen = re.sub(e, "", sen)
      
      mokuji += [sen]
  re_name = re.compile('<title>.+? - Wikipedia</title>')
  if "テレビアニメ" in mokuji:
    anime_idx = mokuji.index("テレビアニメ")
    anime = re.compile('<h3><span class="mw-headline" id=.+テレビアニメ')
    nextmokuji = re.compile('<h3><span class="mw-headline" id=.+'+mokuji[anime_idx+1])
    re_year = re.compile("\d{4}年</b></p>")
    li = ["<li>", "<\li>"]
    b = re.compile('<b>.+?</b>')
    flag = [False, False]
    for j, tmp in enumerate(line):
      if ' - Wikipedia</title>' in tmp:
        name = re.sub(' - Wikipedia</title>', "", tmp)[7:]
      
      sen0 = anime.findall(tmp)
      sen1 = nextmokuji.findall(tmp)
      if len(sen0) > 0:
        flag[0] = True
      if len(sen1) > 0:
        flag[1] = True
      
      if xor(flag[0], flag[1]):
        tmpyear = re_year.findall(tmp)
        if len(tmpyear) > 0:
          year = tmpyear[0][:4]
        
        if li[0] in tmp:
          cast = map(lambda x: len(re.findall(x, tmp)), ["<li>.+?</li>", "<b>.+?</b>"])
          w0.write("\t".join([name, year]+map(str, cast))+"\n")

w0.close()