I couldn't get this working in R, so I'm doing it in Python.
The OS is Ubuntu 12.04.
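First, download the latest version of the scripts from the FRaC site and extract the archive. This should create a folder named frac (I think); all the work below happens inside it.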
Next, install LIBSVM, a package for SVM computation.
On Ubuntu, run this from a terminal:
sudo apt-get install python-libsvm*
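Download Weka, a software package that implements a wide range of predictive models, and extract it wherever you like.
The important file is weka.jar. Make a note of the path to the directory that contains it.
FRaC apparently borrows the decision trees from Weka.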
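Before running anything, it may help to confirm that the pieces are in place. The sketch below is not part of FRaC. It assumes FRaC calls the `svm-train` command (the log output further down suggests so) and that a `java` command is needed to run weka.jar (my guess, since Weka is a Java program). The function name `check_prerequisites` is made up for illustration, and the sketch needs Python 3.3 or later for `shutil.which`.

```python
import os
import shutil


def check_prerequisites(weka_jar):
    """Return a dict saying which assumed prerequisites were found.

    The list of prerequisites is an assumption, not taken from FRaC's docs.
    """
    jar = os.path.expanduser(weka_jar)  # allow paths such as ~/Desktop/...
    return {
        "svm-train": shutil.which("svm-train") is not None,
        "java": shutil.which("java") is not None,
        "weka.jar": os.path.isfile(jar),
    }


if __name__ == "__main__":
    found = check_prerequisites("~/Desktop/weka-3-6-9/weka.jar")
    for name, ok in found.items():
        print(name, "OK" if ok else "MISSING")
```

Anything reported as MISSING is worth fixing first, since the corresponding model type would presumably fail partway through a run.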
After that, run the script named detect in the frac folder.
Its arguments and options are listed in the help output further below.
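By default it seems to assume that all of the training data come from the distribution of normal (non-anomalous) examples, i.e. semi-supervised learning. If you pass the same file as both the training data and the test data, it will probably perform outlier detection as unsupervised learning... or so I hope.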
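The "same file for both" idea can be sketched as a small wrapper. This is only an illustration: the flags (`-X`, `-Q`, `-T`, `-R`, `-N`, `-S`, `-w`, `-o`) come from the help output of detect, but the helper names `build_detect_command` and `run_detect` are hypothetical, and whether the result really behaves like unsupervised outlier detection is untested.

```python
import os
import subprocess


def build_detect_command(data_file, weka_jar, output_file=None):
    """Build a detect command that uses one file as training AND test set."""
    cmd = [
        "python", "detect",
        "-X", data_file,   # training set
        "-Q", data_file,   # test set: deliberately the same file
        "-T", "-R", "-N", "-S",
        "-w", os.path.expanduser(weka_jar),  # no shell here, so expand ~ ourselves
    ]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


def run_detect(cmd, frac_dir):
    """Run the command from inside the frac folder; returns the exit code."""
    return subprocess.call(cmd, cwd=frac_dir)


if __name__ == "__main__":
    cmd = build_detect_command("test.data/acute/testset",
                               "~/Desktop/weka-3-6-9/weka.jar")
    print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting problems, which is why the `~` in the weka.jar path is expanded explicitly.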
Let's try it on the dataset in the acute folder of the sample data.
# the current directory is frac
python detect -X test.data/acute/trainset -Q test.data/acute/testset -T -R -N -S -w ~/Desktop/weka-3-6-9/weka.jar
# Training set examples: 45
# Test set examples: 75
# Infer feature types from test.data/acute/trainset and test.data/acute/testset
# 6 feature types (column, type, values):
# 1 continuous 35.5,41.5
# 2 nominal no,yes
# 3 nominal no,yes
# 4 nominal no,yes
# 5 nominal no,yes
# 6 nominal no,yes
# @0.01 seconds: Feature #1, Decision Stump ...
# @0.02 seconds: Feature #1, svm-train -s 3 -t 0 ...
# @0.24 seconds: Feature #1, svm-train -s 3 -t 2 ...
# @0.49 seconds: Feature #1, weka.classifiers.trees.REPTree ...
# @4.10 seconds: Feature #2, Decision Stump ...
# @4.11 seconds: Feature #2, svm-train -s 0 -t 0 -b 1 ...
# @4.33 seconds: Feature #2, svm-train -s 0 -t 2 -b 1 ...
# @4.58 seconds: Feature #2, weka.classifiers.trees.J48 -R ...
# @8.24 seconds: Feature #3, Decision Stump ...
# @8.24 seconds: Feature #3, svm-train -s 0 -t 0 -b 1 ...
# @8.47 seconds: Feature #3, svm-train -s 0 -t 2 -b 1 ...
# @8.70 seconds: Feature #3, weka.classifiers.trees.J48 -R ...
# @12.29 seconds: Feature #4, Decision Stump ...
# @12.30 seconds: Feature #4, svm-train -s 0 -t 0 -b 1 ...
# @12.53 seconds: Feature #4, svm-train -s 0 -t 2 -b 1 ...
# @12.75 seconds: Feature #4, weka.classifiers.trees.J48 -R ...
# @16.82 seconds: Feature #5, Decision Stump ...
# @16.82 seconds: Feature #5, svm-train -s 0 -t 0 -b 1 ...
# @17.05 seconds: Feature #5, svm-train -s 0 -t 2 -b 1 ...
# @17.30 seconds: Feature #5, weka.classifiers.trees.J48 -R ...
# @20.94 seconds: Feature #6, Decision Stump ...
# @20.95 seconds: Feature #6, svm-train -s 0 -t 0 -b 1 ...
# @21.16 seconds: Feature #6, svm-train -s 0 -t 2 -b 1 ...
# @21.37 seconds: Feature #6, weka.classifiers.trees.J48 -R ...
# @25.24 seconds: Write normalized surprisal ...
# (the anomaly scores for the test data follow from here…)
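The last log line mentions normalized surprisal, and the help text of detect describes the anomaly score as a sum of surprisal over features. As a rough picture of that idea only: the surprisal of an observed value is the negative log of the probability a model assigned to it, so values the models found unlikely add a lot to the score. The toy below is my own illustration with made-up probabilities. It uses base-2 logs (an assumption) and does not reproduce FRaC's normalization or its actual code.

```python
import math


def surprisal(p):
    """Surprisal in bits of an event that a model gave probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def anomaly_score(probabilities):
    """Sum of surprisal over features (plain sum, no normalization)."""
    return sum(surprisal(p) for p in probabilities)


if __name__ == "__main__":
    # Hypothetical per-feature probabilities of the observed values.
    print(anomaly_score([0.5, 0.5]))    # 2.0
    print(anomaly_score([0.5, 0.125]))  # 4.0
```

The second example scores higher because one of its feature values was predicted with low probability, which is the intuition behind ranking test examples by this kind of score.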
--version     show program's version number and exit
-h, --help    show this help message and exit
-X Filename   Training set file. Format is tabular; each line is an example, each column is a feature
-Q Filename   Test set file (same format as training set file above)
-d String     Field delimiter. E.g., if your input file(s) are lines of comma-separated values, set this to ','. Default field separator is any white space
-m Filename   Optional meta-data file. If not supplied, column types are inferred automatically from the provided values. The format for the meta-data file is as follows: Each line has three parts: column number, column type, possible values. Column numbers start at 1. Valid column types are 'nominal', 'continuous' and 'ignore'. For nominal features, possible values should be a comma-separated list. For continuous features, this list should have only a minimum and a maximum allowed value. This option is useful if you need to specify that certain features are not to be considered, or that an enumerated nominal feature uses integers as values
-T            Use pruned regression/decision tree models. Uses WEKA (http://www.cs.waikato.ac.nz/ml/weka).
-R            Use RBF-kernel SVM models. Uses LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm).
-N            Use linear-kernel SVM models. Uses LIBSVM.
-S            Use decision stump models. These simply predict the mean of the feature distribution without regard to the other features (mostly, this is just here because it doesn't require additional software)
-w Path       Path to weka.jar (for supervised learners implemented in WEKA). Default is "./weka.jar" See: http://www.cs.waikato.ac.nz/ml/weka
-f Integer    Learn predictor models C_i for these features only (thus output anomaly scores will be sums of surprisal for these features only). This option may be invoked multiple times. If this option is not invoked, learn a predictor for all features
-o Filename   Write anomaly detection scores to this location
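The meta-data file for -m is described only in words, so here is a sketch of how one might generate it. The three parts per line and the type names come from the help text. Separating the parts with a single space is an assumption (it mirrors lines such as "1 continuous 35.5,41.5" in the log), and so is writing an 'ignore' line with no values. The function name `format_metadata` is hypothetical.

```python
VALID_TYPES = {"nominal", "continuous", "ignore"}


def format_metadata(columns):
    """Format (type, values) pairs, given in column order, as meta-data text.

    Assumed layout per line: "<column number> <type> <comma-separated values>".
    """
    lines = []
    for number, (col_type, values) in enumerate(columns, start=1):  # columns start at 1
        if col_type not in VALID_TYPES:
            raise ValueError("unknown column type: %r" % (col_type,))
        if col_type == "continuous" and len(values) != 2:
            raise ValueError("continuous columns need exactly [min, max]")
        parts = [str(number), col_type]
        if values:
            parts.append(",".join(str(v) for v in values))
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The acute sample data: one continuous column, then five yes/no columns.
    acute = [("continuous", [35.5, 41.5])] + [("nominal", ["no", "yes"])] * 5
    print(format_metadata(acute), end="")
```

Writing the returned text to a file and passing that file with -m would be the next step. This seems most useful for the case the help text mentions, where a nominal feature uses integers as values and automatic inference would presumably get the type wrong.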