新型肺炎COVID-19の無症状感染者の割合をrstanで推定しようとしたが断念した

読んだ。
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. - PubMed - NCBI
COI：なし

ダイヤモンド・プリンセス号のPCR検査と陽性数および症状のある・なしのデータから、無症状でPCR陽性となる患者の割合を推定しようという試み。

The asymptomatic proportion was defined as the proportion of asymptomatically infected individuals among the total number of infected individuals.

とあるように、全感染者数に対して無症状な人の割合をasymptomatic proportion とし、論文では17.9% (95% credible interval (CrI): 15.5–20.2%) だった、と言っている。

データはダイヤモンド・プリンセス号の各日の報告者数で、tableになっている。ただし、2/14日まではNAで、無症状だった人は累計で35人いたので、感度分析では2/5から2/13までこの35人をランダムに振った、と言っている。
適当に補完したsymptomatic/asymptomatic な患者はこんな感じである。supplemental の図っぽくしている。
f:id:MikuHatsune:20200413214835p:plain

The reported asymptomatic cases consists of both true asymptomatic infections and cases who had not yet developed symptoms at the time of data collection but became symptomatic later, i.e. the data are right censored. Each datum consists of an interval of time during which the individual may have been infected and a binary variable indicating whether they were symptomatic as at 18 February.

と言っているように、無症状の患者というのは、(1)本当に無症状のまま経過する人と、(2)PCR検査が陽性となった時点で無症状（だが、いつかは症状が出たかもしれない）、つまりみぎ打ち切りデータというのと、2パターンある。
各患者は、感染しうる期間と、2/18時点で症状があったかなかったか、というデータを持っている。しかし、「感染しうる期間」は以降で $[a_i, b_i]$ で定義されているがデータを見てもそんなのない(?)し、2/18時点で症状があったかなかったか、というのもない(?)。そもそもtable は2/20まである(?)。

モデルとしては、ある患者 $i$ についてある感染しうる期間 $[a_i, b_i]$ に、 $X_i$ のタイミングで感染したとする。症状が出る打ち切り時刻は $c$ だが、index がついてないので全患者に共通なのだろうか。いや多分違うと思う（ $c_i$ では?）。
感染から症状が出るまでの時間 $D_i=c-X_i$ は、適当な累積確率分布 $F_D$ に従うと仮定し、これは平均6.4日、標準偏差2.3日のワイブル分布としている。

The asymptomatic proportion, $p$ , is the probability an individual will never develop symptoms.

というように、 $p$ を推定したい。

ある患者が $c$ に無症状である確率 $g(x, p)$ は
${\displaystyle \begin{eqnarray} g(x, p)=\left\{ \begin{array}{ll} p + (1-p)(1-F_D(c-x)) & asymptomatic\\ F_D(c-x) & symptomatic \end{array} \right. \end{eqnarray} }$
とする。 $g(x, p)$ は全患者で独立なので掛けあわせてよい、つまりは対数では足しあわせてよい。

ということでやってみようと思ったが、感染しうる期間 $[a_i, b_i]$ がないし打ち切り時刻 $c$ はPCR陽性のときの症状がある/なしの患者数とその日付、ということにしてなんとかデータを作ってみる。

結果では

The posterior median estimate of the true proportion of asymptomatic individuals among the reported asymptomatic cases is 0.35 (95% credible interval (CrI): 0.30–0.39), with the estimated total number of the true asymptomatic cases at 113.3 (95%CrI: 98.2–128.3) and the estimated asymptomatic proportion (among all infected cases) at 17.9% (95%CrI: 15.5–20.2%).

と言っていて、asymptomatic と言っていた人のうち本当に今後も無症状のままなtrue asymptomatic は0.35で、全（PCR陽性）患者に占めるasymtomatic は17.9%と言っている。

モデルとして怪しい（絶対間違っている）今回のコードでの $p$ の推定結果は、0.65 だった。
f:id:MikuHatsune:20200413204659p:plain
$p$ はtrue asymptomatic (?) だったのか?よくわからない。いや上のモデル式でもthe probability an individual will never develop symptoms と言っているしこれが推定したい $p$ ?
この $p$ を用いて、generated quantities ブロック内で、PCR陽性時には無症状だった人についてbinomial で今後も無症状か有症状になるかサンプリングした結果は、0.32 だった。
これが自分のなかではtrue asymptomatic と思っているのだが、全然違う。
f:id:MikuHatsune:20200413205004p:plain

このネタであんまり粘っても正解が得られそうにないのでいいところで諦めた。

txt <- "
Date	Number of passengers and crew members on board	Number of disembarked passengers and crew members (cumulative)	Number of tests	Number of tests (cumulative)	Number of individuals testing positive	Number of individuals testing positive (cumulative)	Number of symptomatic cases	Number of asymptomatic cases	Number of asymptomatic cases (cumulative)
2020-02-05	3711	NA	31	31	10	10	NA	NA	NA
2020-02-06	NA	NA	71	102	10	20	NA	NA	NA
2020-02-07	NA	NA	171	273	41	61	NA	NA	NA
2020-02-08	NA	NA	6	279	3	64	NA	NA	NA
2020-02-09	NA	NA	57	336	6	70	NA	NA	NA
2020-02-10	NA	NA	103	439	65	135	NA	NA	NA
2020-02-11	NA	NA	NA	NA	NA	NA	NA	NA	NA
2020-02-12	NA	NA	53	492	39	174	NA	NA	NA
2020-02-13	NA	NA	221	713	44	218	NA	NA	NA
2020-02-14	3451	260	NA	NA	NA	NA	NA	NA	NA
2020-02-15	NA	NA	217	930	67	285	29	38	73
2020-02-16	NA	NA	289	1219	70	355	32	38	111
2020-02-17	3183	528	504	1723	99	454	29	70	181
2020-02-18	NA	NA	681	2404	88	542	23	65	246
2020-02-19	NA	NA	607	3011	79	621	11	68	314
2020-02-20	NA	NA	52	3063	13	634	7	6	320
"

library(rstan)
rstan_options(auto_write=TRUE)
options(mc.cores=parallel::detectCores())
dat <- read.delim(text=txt, stringsAsFactors=FALSE, check.names=TRUE)
N0 <- dat$Number.of.individuals.testing.positive..cumulative.
n0 <- dat$Number.of.individuals.testing.positive
n1 <- dat$Number.of.symptomatic.cases
n2 <- dat$Number.of.asymptomatic.cases

s <- mapply(function(z) sample(1:n0[z], 1), c(8, 11))
nn0 <- replace(n0, c(7:8, 10:11), c(s[1], n0[8]-s[1], s[2], n0[11]-s[2])) # 各日の検査陽性を補完
N1 <- cumsum(nn0)

na.ind <- 10
nn2 <- replace(n2, 1:na.ind, c(rmultinom(1, size=35, prob=nn0[1:na.ind]))) # asymptomatic の各日人数
nn1 <- replace(n1, 1:na.ind, nn0[1:na.ind] - nn2[1:na.ind]) # symptomatic の各日人数
nn1[na.ind + 1] <- nn1[na.ind + 1] - nn1[na.ind]
nn2[na.ind + 1] <- nn2[na.ind + 1] - nn2[na.ind]
nn1;nn2;nn0
nn12 <- c(nn1, nn2)
sym <- unlist(mapply(rep, 0:1, each=sapply(list(nn1, nn2), sum)))


mat1 <- t(do.call(cbind, mapply(function(i) replicate(nn1[i], replace(rep(0, nrow(dat)), i, 1)), seq(nrow(dat)))))
mat2 <- t(do.call(cbind, mapply(function(i) replicate(nn2[i], replace(rep(0, nrow(dat)), i, 2)), seq(nrow(dat)))))
mat2 <- matrix(unlist(mat2), nc=nrow(dat), byrow=FALSE)
mat <- rbind(mat1, mat2)
colnames(mat) <- dat$Date
Xi <- apply(mat>1, 1, which) # 陽性になった日

Nsym <- apply(mat>0, 2, tapply, sym, sum)
rownames(Nsym) <- c("symptomatic", "asymptomatic")
#Nsym <- Nsym[2:1,]

cols <- c("white", "red", "green")
mat0 <- t(mat)
par(mar=c(4, 3, 3, 2))
image(1:nrow(mat0), 1:ncol(mat0), mat0, col=cols, xaxt="n", yaxt="n", xlab="", ylab="")
axis(1, at=1:nrow(mat0), labels=gsub("-", "/", gsub("2020-", "", rownames(mat0))), las=2)
pa <- par()$usr
text(1:nrow(mat0), cumsum(Nsym[1,]), Nsym[1,], pos=3, col=cols[2])
text(pa[1]-0.5, tapply(seq(sym), sym, mean), rownames(Nsym), srt=90, pos=3, xpd=TRUE, col=cols[2:3], cex=2)
text(1:nrow(mat0), c(Nsym[2,1], head(cumsum(Nsym[2,]), -1))+sum(Nsym[1,]), Nsym[2,], pos=1, col=cols[3], font=2)
text(1:nrow(mat0), rep(pa[4], ncol(Nsym)), colSums(Nsym), xpd=TRUE, pos=3)
text(pa[1]-0.8, pa[4], "Total", xpd=TRUE, pos=3, cex=1)
abline(v=1:nrow(mat0)+0.5, lty=3)

code <- "
data{
  int<lower=1> N;
  int<lower=0> Day;
  real<lower=0> par[2];
  int<lower=0, upper=1> sympt[N];
  int<lower=0> All[Day];
  int<lower=0> Asympt[Day];
  real<lower=0> Up;

}
parameters{
  real<lower=0, upper=Up> infect_interval[N]; // 感染日までの期間
  real<lower=0, upper=1> p;
}
model{
  for(i in 1:N){
    if(sympt[i] == 0){
      target += weibull_lcdf(infect_interval[i]| par[1], par[2]);
    } else {
      target += log(p + (1-p)*(1-weibull_cdf(infect_interval[i], par[1], par[2])));
    }
  }
  Asympt ~ binomial(All, p);
}
generated quantities{
  vector[N] r;
  for(i in 1:N){
    if(sympt[i] == 0){
      r[i] = 0;
    } else {
      r[i] = bernoulli_rng(p);
    }
  }
}
"

pars <- c(3.02, 7.25)
m0 <- stan_model(model_code=code)
standata <- list(N=nrow(mat), Day=ncol(mat), par=pars, sympt=sym, All=nn0, Asympt=nn2, Up=100)
fit <- sampling(m0, standata, iter=1000, warmup=500, chain=4, seed=2223)
ex <- extract(fit, pars=head(fit@model_pars, -1))
hist(ex$p, col="pink", main="", xlab="The asymptomatic proportion")
hist(rowMeans(ex$r), col="yellow", main="",
     xlab="The estimated asymptomatic proportion (among all infected cases)")