I Was So "High-Minded" That I Had Nothing to Do on My Day Off but Tinker with word2vec

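Apparently there is a method called word2vec that uses something neural-network-like to learn a vector for each word from a corpus, after which you can do vector arithmetic on words.
I heard about someone subtracting "breasts" from Kaga of KanColle and thought "whoa!!". People have already tried it on Twitter, on Eijiro, on Magic: The Gathering, on Wikipedia, and more, so you may well think "how many times has this been rehashed...", but I am going to try it anyway.
I installed word2vec by following this guide. This time I will poke at it from the terminal rather than from Python.
demo-word.sh uses a dataset called text8, a corpus of about 100 MB that looks like this: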

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as

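In other words, it is just an endless run of words separated by single half-width spaces. So for English or Japanese alike, all you need is one file with every word separated by a half-width space. Convert uppercase to lowercase, and strip commas, periods, digits, and special characters.
English is already split on spaces, so no word segmentation is needed, but inflected forms should be reduced to their base form; lemmatizing with WordNet, as described here, will probably improve accuracy.
For Japanese, preprocess with the familiar MeCab, outputting the base form of each word instead of its inflected surface form.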

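# Python script that lowercases the text and strips commas, periods, digits, and special characters
# (rewritten for Python 3; assumes the input file is UTF-8)
infile = "text.txt"
# "\u2019" stands in for the original byte string '\x81f', which looks like the
# Shift_JIS encoding of a right single quotation mark (this is an assumption).
rep = ['[', ']', '#', '&', ',', '.', ';', ':', '(', ')', '%', '<', '>', '!', '?', '\u2019'] + [str(d) for d in range(10)]

res = []
with open(infile, encoding="utf-8") as f:
    for line in f:
        tmp = line.rstrip().lower()
        for ch in rep:
            tmp = tmp.replace(ch, "")
        # The leading space keeps the last word of one line from fusing with the next.
        res.append(" " + tmp)

# Write everything out as one long space-separated line, like text8.
with open("text1.txt", "w", encoding="utf-8") as w0:
    w0.write("".join(res))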

　
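Running demo-word.sh takes care of everything from downloading the sample data to running word2vec, but the parts that matter are the training command and the resulting vectors.bin. The following is all you need to run: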

time ./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
./distance vectors.bin     # nearest words to a query word (by cosine similarity)
./word-analogy vectors.bin # analogy queries (vector arithmetic on three words)
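
For anyone who would rather inspect vectors.bin from Python after all, here is a minimal sketch of what ./distance does: load the binary file, then rank words by cosine similarity to the query. It assumes the file was written with -binary 1 on a little-endian machine (a header line with the vocabulary size and dimension, then each word followed by a space and raw float32 values). The function names are made up for illustration and are not part of word2vec. Note that the column the tool labels "Cosine distance" is really a similarity: larger means closer.

```python
import math
import struct


def load_word2vec_binary(path):
    """Read vectors saved with `-binary 1` into a {word: [float, ...]} dict."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dimension>\n" in ASCII.
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            # Each entry: the word, one space, then `dim` float32 values.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that ends the previous entry
                    chars.extend(ch)
            # "<" = little-endian; an assumption about the machine that trained the model.
            values = struct.unpack("<%df" % dim, f.read(4 * dim))
            vectors[chars.decode("utf-8", errors="replace")] = list(values)
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, topn=5):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:topn]
```

With a trained model this would be called as nearest(load_word2vec_binary("vectors.bin"), "antibiotics"). It scans the whole vocabulary in pure Python, so it is slow on large models but fine for poking around.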

The distance tool behaves as described on the site. The analogy tool, on the other hand, gives:

Enter three words (EXIT to break): paris france berlin

Word: paris  Position in vocabulary: 1055

Word: france  Position in vocabulary: 303

Word: berlin  Position in vocabulary: 1360

                                              Word              Distance
------------------------------------------------------------------------
                                           germany		0.633804
                                           hungary		0.486985
                                            russia		0.472652
                                           austria		0.469492
                                              ussr		0.465501

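Presumably it takes france, subtracts paris, and adds berlin, so that the "capital" components cancel out and a country name is left over. In other words: "paris is to france as berlin is to ?"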
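
The arithmetic behind that query can be sketched with toy vectors. For the input "A B C", word-analogy looks for the words closest to vec(B) - vec(A) + vec(C), skipping the three input words. The 2-D vectors below are invented purely for illustration (one axis for "which country", one for "capital-ness"); real vectors come from vectors.bin and have 200 dimensions with the command above. The function names are hypothetical as well.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "which country", axis 1 ~ "capital-ness".
toy = {
    "paris":   [1.0, 1.0],
    "france":  [1.0, 0.0],
    "berlin":  [-1.0, 1.0],
    "germany": [-1.0, 0.0],
    "russia":  [-0.5, 0.2],
}


def normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def analogy(vectors, a, b, c, topn=3):
    """Answer "a is to b as c is to ?" by ranking words near b - a + c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the input words themselves are never returned
        sim = sum(t * x for t, x in zip(target, normalize(vec)))
        scored.append((sim, word))
    return sorted(scored, reverse=True)[:topn]


# Ranks "germany" ahead of "russia" for these toy vectors.
print(analogy(toy, "paris", "france", "berlin"))
```

The word order matters: the first word is the one that gets subtracted, so "paris france berlin" and "france paris berlin" are different queries.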
　
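Now for the "high-minded" analysis.
I crawled some medical texts and ended up with about 20 MB.
For now, I just threw it in to see what happens.
For antibiotics, the topic that high-minded types love to study, antibiotic classes such as macrolide and β-lactam came up.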

Enter word or sentence (EXIT to break): antibiotics

Word: antibiotics  Position in vocabulary: 564

                                              Word       Cosine distance
------------------------------------------------------------------------
                                         macrolide		0.686868
                                         β-lactam		0.671718
                                          rifampin		0.659790
                                          regimens		0.659161
                                    broad-spectrum		0.649693
                                        antibiotic		0.646567
                                        antifungal		0.638228
                                   fluoroquinolone		0.632737
                                       rifapentine		0.628934
                                           regimen		0.628189
                                   aminoglycosides		0.623643
                                  fluoroquinolones		0.619080
                                      prophylactic		0.618527
                                       combination		0.618506
                                    cephalosporins		0.613100
                                     metronidazole		0.612240
                                        quinolones		0.609158
                                   antituberculous		0.608638
                                     cephalosporin		0.608548
                                            agents		0.607759
                                     antimicrobial		0.601876
                                  third-generation		0.598869
                                       single-dose		0.598591
                                        vancomycin		0.594600
                                    aminoglycoside		0.591262
                                         isoniazid		0.589512
                                          low-dose		0.587311
                                   corticosteroids		0.585872

Cephalosporins are classified by development into "nth generation", so I tried that to see what would happen, and, well, going through the results is a pain!!

Enter word or sentence (EXIT to break): cephalosporin

Word: cephalosporin  Position in vocabulary: 5052

                                              Word       Cosine distance
------------------------------------------------------------------------
                              ampicillin/sulbactam		0.899973
                                        cefotaxime		0.878304
                                         quinolone		0.875494
                                      tetracycline		0.871787
                                        ampicillin		0.870064
                                          imipenem		0.868366
                                         aztreonam		0.866771
                                   fluoroquinolone		0.864932
                                    aminoglycoside		0.859710
                                       clindamycin		0.858559
                                   chloramphenicol		0.851192
                                        tobramycin		0.851054
                                  third-generation		0.848996
                                    nitrofurantoin		0.848938
                                     metronidazole		0.848060
                                         macrolide		0.847879
                                      erythromycin		0.847772
                                         ofloxacin		0.843216
                                        cephalexin		0.843158
                           ticarcillin/clavulanate		0.839059
                                         cefazolin		0.835968
                                     ciprofloxacin		0.835081
                                        vancomycin		0.833714
                                         sulbactam		0.833306
                                      levofloxacin		0.831119
                                       ceftazidime		0.830860
                                         cefotetan		0.830272
                                        gentamicin		0.828700
                                      piperacillin		0.827680
                                           tmp-smx		0.827597
                               penicillin-allergic		0.827292
                                       amoxicillin		0.826194
                                       minocycline		0.824555
                                  antipneumococcal		0.824483
                                         meropenem		0.819185
                                       ceftriaxone		0.818070
                                           bismuth		0.816361
                                        quinolones		0.815513
                                         oxacillin		0.815391
                           amoxicillin/clavulanate		0.814614

　
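For cancer, I tried "cancer" hoping to fish up malignant or benign, but mostly body parts came up, which left me thinking "wouldn't plain co-occurrence analysis do just as well?"... Still, hormone-dependent showed up, which seems to reflect recent molecular targeted therapy, so I thought that was kind of nice.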

Enter word or sentence (EXIT to break): cancer

Word: cancer  Position in vocabulary: 92

                                              Word       Cosine distance
------------------------------------------------------------------------
                                           cancers		0.772832
                                            breast		0.741152
                                          prostate		0.736099
                                        colorectal		0.732450
                                       endometrial		0.700628
                                          melanoma		0.679323
                                        osteogenic		0.645691
                                           ovarian		0.639607
                                            cervix		0.635429
                                         carcinoma		0.631158
                                 hormone-dependent		0.623069
                                         non-small		0.604022
                                        small-cell		0.597804
                                            polyps		0.583271
                                             hnpcc		0.581199
                                     neuroblastoma		0.569610
                                        carcinomas		0.563273
                                      malignancies		0.554716

　
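I never quite know what "bioinformatics" refers to, so I asked what it means in this corpus, and the answer seems to be microarrays after all. There does not seem to be much written about NGS yet, so that is probably why. HapMap was fished up too, so it apparently includes working with databases.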

Enter word or sentence (EXIT to break): bioinformatics

Word: bioinformatics  Position in vocabulary: 18193

                                              Word       Cosine distance
------------------------------------------------------------------------
                                         plausible		0.799823
                                         catalogue		0.774522
                                       microarrays		0.774336
                                   nonconventional		0.768620
                                   double-antibody		0.766882
                                           hobbies		0.766779
                                               hgp		0.766022
                                          outdoors		0.764794
                                        stereotype		0.759451
                                            hapmap		0.756572

　
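Finally, I wanted to do some analogy-style estimation with vector arithmetic, but could not think of good material. Combination chemotherapy is the current trend, so I figured: leukemia is treated with chemotherapy, and if I take away the molecular targeted drug rituximab, maybe the classical DNA-damaging anticancer drugs will come out. Note that for this input word-analogy computes chemotherapy - leukemia + rituximab, i.e. "leukemia is to chemotherapy as rituximab is to ?", rather than a plain subtraction of rituximab. The results included adjuvant and neoadjuvant, which refer to postoperative and preoperative chemotherapy, but anticancer drugs such as cyclophosphamide and gemcitabine also appeared, so I will call it good enough. Rituximab is used in combination regimens, so monotherapy showing up also felt kind of right.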

Enter three words (EXIT to break): leukemia chemotherapy rituximab

Word: leukemia  Position in vocabulary: 777

Word: chemotherapy  Position in vocabulary: 422

Word: rituximab  Position in vocabulary: 4732

                                              Word              Distance
------------------------------------------------------------------------
                                          adjuvant		0.615775
                                  cyclophosphamide		0.609339
                                    platinum-based		0.607038
                                         high-dose		0.604657
                                       neoadjuvant		0.596819
                                       gemcitabine		0.596638
                                          regimens		0.593977
                                       monotherapy		0.592749
                                        infliximab		0.584584
                                       efficacious		0.578444
                               lopinavir/ritonavir		0.574542
                                        adalimumab		0.568594
                                       thalidomide		0.567074
                                        three-drug		0.565481
                                         multidrug		0.562847
                                        cytarabine		0.560205
                                      single-agent		0.559553
                                      methotrexate		0.559375
                                       bevacizumab		0.554945
                                          low-dose		0.550815
                                       combination		0.548845
                                         tamoxifen		0.547920
                                       second-line		0.547072
                                       doxorubicin		0.543647
                                        lamivudine		0.542324
                                             alone		0.534888
                                            ifn-α		0.534870
                                         antiviral		0.534470
                                      mitoxantrone		0.534140
                                       trastuzumab		0.527470
                                      posaconazole		0.526527
                                            taxane		0.526007
                                        comparable		0.523686
                                      voriconazole		0.523049
                             treatment-experienced		0.521246
                                       leflunomide		0.521215
                                      combinations		0.519301
                                             m-vac		0.518551
                                          anti-tnf		0.517589
                                         cisplatin		0.514165

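I also played around with oncogenes and tumor suppressor genes, and with hereditary diseases whose causative genes are known, but nothing good came out, so this is where things stand.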
　
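Automated diagnosis is, I think, about inferring a disease from symptoms, so I tried "a woman with abdominal pain", but I cannot tell whether the result is any good. (For this input the tool computes abdomen - pain + woman.)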

Enter three words (EXIT to break): pain abdomen woman

Word: pain  Position in vocabulary: 86

Word: abdomen  Position in vocabulary: 1939

Word: woman  Position in vocabulary: 3529

                                              Word              Distance
------------------------------------------------------------------------
                                               dre		0.516829
                                 contrast-enhanced		0.482033
                                               spn		0.477665
                                     child-bearing		0.463668
                                      childbearing		0.462467
                                     demonstrating		0.457256
                                          flexible		0.456756
                                         -year-old		0.456158
                                          pregnant		0.452602
                                           session		0.441014
                                               tof		0.440933
                                              mmse		0.439102
                                       nonpregnant		0.438862