Binary classification

For binary classification, let's try the mdrr data set.

Svetnik et al (2003) describe these data: "Bakken and Jurs studied a set of compounds originally discussed by Klopman et al., who were interested in multidrug resistance reversal (MDRR) agents. The original response variable is a ratio measuring the ability of a compound to reverse a leukemia cell’s resistance to adriamycin. However, the problem was treated as a classification problem, and compounds with the ratio >4.2 were considered active, and those with the ratio <= 2.0 were considered inactive. Compounds with the ratio between these two cutoffs were called moderate and removed from the data for twoclass classification, leaving a set of 528 compounds (298 actives and 230 inactives). (Various other arrangements of these data were examined by Bakken and Jurs, but we will focus on this particular one.) We did not have access to the original descriptors, but we generated a set of 342 descriptors of three different types that should be similar to the original descriptors, using the DRAGON software."
The data and R code are in the Supplemental Data file for the article.

The data set provides
mdrrDescr: the descriptors
mdrrClass: the categorical outcome ("Active" or "Inactive")
so we can simply use mdrrDescr as x and mdrrClass as y.
Set family to binomial and fit the model. The data matrix is fairly large, so the computation takes a while.

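library(glmnet)
# mdrr ships with the caret package; this loads mdrrDescr and mdrrClass
data(mdrr, package = "caret")
# glmnet expects a numeric matrix for x, so convert the descriptor data frame
mdrr.glmnet <- glmnet(as.matrix(mdrrDescr), mdrrClass, family="binomial")
# coefficient paths along the regularization path
plot(mdrr.glmnet)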
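
As a reminder of what family="binomial" fits, the objective below is the penalized logistic regression criterion as given in the glmnet documentation; it is not stated in the text above, so treat it as background. Here N is the number of compounds, x_i the descriptor vector of compound i, y_i its 0/1 coded class, and \alpha the elastic net mixing parameter (left at its default \alpha = 1, the lasso, in the call above).

\[
\min_{\beta_0,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] + \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Big]
\]

With a two-level factor as y, glmnet should take the last level as the target class, so here y_i = 1 would correspond to "Inactive". Each curve in the plot traces one coefficient of \beta as \lambda changes.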


As with regression, find the optimal \lambda by cross-validation and then classify with it.

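# cross-validation folds are assigned at random, so lambda.min varies a little between runs
mdrr.cv.glmnet <- cv.glmnet(as.matrix(mdrrDescr), mdrrClass, family="binomial")
# rows: true class (1 = Active, 2 = Inactive); columns: predicted class at lambda.min
table(as.numeric(mdrrClass), predict(mdrr.glmnet, as.matrix(mdrrDescr), s=mdrr.cv.glmnet$lambda.min, type="class"))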
    Active Inactive
  1    268       30
  2     45      185
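
To put numbers on the table above, here is a minimal sketch in base R that re-enters the four counts by hand and derives a few summary rates. The names conf, accuracy, sensitivity and specificity are introduced only for this illustration, and "Active" is treated as the positive class, which is an assumption on my part.

```r
# Confusion table copied from the output above (rows: truth, columns: prediction).
# matrix() fills column by column, hence the order 268, 45, 30, 185.
conf <- matrix(c(268, 45, 30, 185), nrow = 2,
               dimnames = list(truth     = c("Active", "Inactive"),
                               predicted = c("Active", "Inactive")))

# Share of compounds on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf)) / sum(conf)

# Treating "Active" as the positive class (assumption for this sketch)
sensitivity <- conf["Active", "Active"] / sum(conf["Active", ])
specificity <- conf["Inactive", "Inactive"] / sum(conf["Inactive", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
# accuracy about 0.858, sensitivity about 0.899, specificity about 0.804
```

These rates are computed on the same 528 compounds that were used to fit the model, so they are likely to be optimistic compared with what a held-out set would give.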