内挿 - 驚異のアニヲタ社会復帰の予備

MikuHatsune2012-12-10

AMIA Summits Transl Sci Proc. 2011; 2011: 16–20
内挿によってデータを取っていない区間についても推定して、回帰の精度をあげるお話。
平滑化ということで取り上げられている。
シグモイドみたいな、パラメトリックに関数をおいて、観測データをぎちぎちに当てはめると、当てはまるときにはいいけど、当てはまらないときにはダメ。
Isotonic regressionのような、ノンパラメトリックに推定すると、柔軟ではあるが過大評価したり過小評価したりする。
ということで、それとなくノンパラメトリックだけどパラメトリックな感じの、中途半端に内挿すると、精度が上がるんじゃね的な。
　
というわけでRで内挿法をいろいろ。
スプライン(spline)
既にやっている人がいるのでパクってみる。

par(mfrow=c(1, 3))
require(graphics)

attach(cars)
plot(speed, dist, main = "data(cars)  &  smoothing splines")
cars.spl <- smooth.spline(speed, dist)
(cars.spl)
## This example has duplicate points, so avoid cv = TRUE

lines(cars.spl, col = "blue")
lines(smooth.spline(speed, dist, df = 10), lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg = 'bisque')
detach()


## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")

## consistency check:
stopifnot(all.equal(cars$dist,
                    fitted(cars.spl) + residuals(cars.spl)))

##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx  <- seq(1, length(y18), len = 201)
(s2  <- smooth.spline(y18)) # GCV
(s02  <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

## The following shows the problematic behavior of 'spar' searching:
(s2  <- smooth.spline(y18, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)

　
Piecewise cubic hermite interpolation(PCHIP)区分的 3 次元エルミート内挿多項式
論文では、単調増加するように内挿したいらしく、この関数を使っている。

library(matlab)
library(signal)
xf = linspace(0,11,500)
yf = sin(2*pi*xf/5)
xp = c(0:10); yp = sin(2*pi*xp/5)
pch  = pchip(xp, yp, xf)
plot(xp, yp, xlim=c(0,11))
lines(xf, pch, col = "orange")

　
ロジスティック回帰モデルの適合度はHosmer-Lemeshow testで計算している。
RではeRm、ResourceSelection、MKmiscでできるらしい。

library(eRm)
#Goodness-of-fit for a Rasch model
res <- RM(raschdat1)
pres <- person.parameter(res)
gof.res <- gofIRT(pres)
gof.res
summary(gof.res)

gof.res

Goodness-of-Fit Results:
Collapsed Deviance = 770.574 (df = 780, p-value = 0.588)
Pearson R2: 0.275
Area Under ROC: 0.803

summary(gof.res)

Goodness-of-Fit Tests
                      value         df p-value
Collapsed Deviance  770.574        780   0.588
Hosmer-Lemeshow       6.942          8   0.543
Rost Deviance      2564.654 1073741794   1.000
Casewise Deviance  3221.328       2945   0.000

library(ResourceSelection)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
m <- glm(y ~ x, family=binomial)
hoslem.test(m$y, fitted(m))

hoslem.test(m$y, fitted(m))

	Hosmer and Lemeshow goodness of fit (GOF) test

data:  m$y, fitted(m) 
X-squared = 4.5227, df = 8, p-value = 0.8072