判別分析の判別線がおかしい - 驚異のアニヲタ社会復帰の予備

という相談を受けた。
解析したいデータは2変数なので、例として
判別分析モデルの応用
がついてきたが、3変数の場合はirisを用いて
データ解析・マイニングとR言語
もついてきて、こんな感じでやりたい、と言われた。

0/1のtype は以下の線形モデルで表現されて
$Type = \alpha + \beta_xX+\beta_yY$
判別線は
$Y=-(\alpha + \beta_xX)/\beta_y$
となる。

しかし、リンクの通りに $\alpha$ を計算すると、判別線より上か下かを決めるスコア（正負で上下が決まる）と、判別線との関係がなにかおかしい、というのが問題のようである。
図で言うと、青の判別線の下にある8個の☓がおかしい点である。
f:id:MikuHatsune:20210212164704p:plain

結論から言うと

# 以上の手順は1回で済ますこともできる
apply(disc.model$means %*% disc.model$scaling, 2, mean)

の部分が平均を取っており、リンクの例題なら2カテゴリーが1:1 で作成されたので偶然結果が一致するが、今回のデータは0/1 が偏っているので、その偏りに応じた重み付けが必要になる。
これはiris を使った2番目のリンクの例題でも、1:1:1 になって平均を取って偶然一致するので、わかりにくい。

alpha <- c(disc.model$prior %*% disc.model$means %*% disc.model$scaling)

とすればデータの事前確率に応じた重み付けが取れる（行列演算なので返り値がベクトルになるようにcをかませている）。

緑の判別線ならうまくいったようである。
f:id:MikuHatsune:20210212165254p:plain

txt <- "
type X Y
0 0 107
0 0 109
0 0 97
0 0 75
0 0 101
0 0 92
0 0 88
0 0 99
0 0 89
0 22 96
0 37 110
0 0 90
0 0 101
0 92 68
0 51 99
0 26 105
0 0 92
0 0 81
0 0 95
0 0 96
0 18 88
0 0 89
0 0 94
0 0 92
0 0 90
0 0 109
0 0 88
0 0 84
0 0 108
0 0 92
1 0 75
1 47 61
1 29 73
1 96 76
1 46 85
1 103 72
1 68 60
1 0 78
1 0 52
1 0 84
1 0 64
1 64 85
1 72 89
1 50 65
1 50 75
1 116 87
1 79 80
1 130 77
1 48 82
1 40 86
1 140 71
1 34 77
1 109 95
1 208 67
1 85 81
1 0 62
1 91 64
1 106 95
1 27 73
1 36 61
1 28 77
1 53 94
1 113 94
1 59 77
1 38 63
1 0 75
1 78 60
1 321 68
1 0 55
1 81 54
1 77 87
1 49 57
1 131 69
1 100 54
1 149 51
1 59 65
1 83 43
1 39 56
1 176 47
1 122 60
1 172 78
1 123 46
1 164 54
1 42 64
1 86 60
1 31 75
1 27 73
1 64 57
1 124 72
1 59 62
1 0 84
1 18 66
1 103 92
1 0 76
1 22 44
1 125 54
1 197 71
"

dat <- read.table(text=txt, header=TRUE)

library(MASS)
disc.model <- lda(type ~ Y + X, data=dat)
p <- predict(disc.model)
pp <- p$posterior
p2 <- function(x) ifelse(x < 1, 1/x, x)

TF <- p$class == dat$type

group.mean <- disc.model$mean
# 線形係数のデータ行列をcoefに代入
coef <- disc.model$scaling
# 行列のかけ算は%*%で行える
# その結果をaに代入
a <- group.mean %*% coef
# 行列aの列平均を計算する
# 引数に2ではなく1を指定すれば行に対して計算される
# 仮にmeanをsumにすると、平均ではなく合計が計算される
apply(a, 2, mean)
# 以上の手順は1回で済ますこともできる
# web のやつ
alpha1 <- apply(disc.model$means %*% disc.model$scaling, 2, mean)
# データの事前確率を反映させた正しいもの
alpha2 <- c(disc.model$prior %*% disc.model$means %*% disc.model$scaling)

# 判別線
f2 <- function(x, alpha) -(-alpha + disc.model$scaling["X", "LD1"]*x)/disc.model$scaling["Y", "LD1"]
x <- seq(0, 500, length=1000)

plot(Y ~ X, data=dat, col=dat$type+1, pch=ifelse(p$x < 0, 4, 16))
legend("topright", legend=sprintf("type %d", 0:1), pch=16, col=1:2)
lines(x, f2(x, alpha1), lty=3, col=4, lwd=3)


plot(Y ~ X, data=dat, col=dat$type+1, pch=ifelse(p$x < 0, 4, 16))
lines(x, f2(x, alpha1), lty=3, col=4, lwd=3)
lines(x, f2(x, alpha2), lty=3, col=3, lwd=3)
legend("topright", legend=sprintf("type %d", 0:1), pch=16, col=1:2)
legend("bottomright", legend=c("正", "負"), pch=c(4, 16), col=1:2)


plot(Y ~ X, data=dat, col=dat$type+1, pch=c(4, 16)[TF+1], cex=log10(p2(pp[,1]/pp[,2]))+1)
legend("topright", legend=sprintf("type %d", 0:1), pch=16, col=1:2)
lines(x, f2(x, alpha2), lty=3, col=4, lwd=3)