Extracting feature vectors from text with scikit-learn - のんびりしているエンジニアの日記

Hello everyone.
How are you doing? I'm doing well.

Today I'll use scikit-learn to convert sentences into feature vectors with very little effort.

When would you use this?

When you want a machine learning model to learn from sentences such as "This is a pen" (①) and "That is a pen" (②),
you generally have to convert the text into vectors first.
Let's do that with scikit-learn.

Converting text based on word counts

Running it

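First, here is the idea behind converting text based on word counts.
We split the text on spaces to get the words, count how many times each word occurs, and line the counts up.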

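The vector for one sentence (or document, or paragraph) has one entry per distinct word. With K distinct words, it is K-dimensional.
Some of you may wonder what happens to words that did not appear at training time.
They are basically ignored when transforming, but there is a technique called feature hashing
that apparently gets around this. (I don't know the details myself.)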
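Here is a minimal sketch of both behaviours. The sentences, the word "robot" and n_features=16 are values I picked purely for illustration. CountVectorizer silently drops a word it did not see during fit, while HashingVectorizer needs no learned vocabulary, because it hashes each token straight to a column index.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train = ['This is a pen.', 'That is a bot.']

# Learn the vocabulary from the training sentences only
vectorizer = CountVectorizer()
vectorizer.fit(train)

# 'robot' was never seen during fit, so it contributes nothing to the vector
unseen = vectorizer.transform(['This is a robot.']).toarray()
print(sorted(vectorizer.vocabulary_))  # ['bot', 'is', 'pen', 'that', 'this']
print(unseen)                          # [[0 1 0 0 1]]

# Feature hashing: no fit step, every token is hashed to one of 16 columns
# (n_features=16 is an arbitrary illustrative size; real use needs far more)
hasher = HashingVectorizer(n_features=16)
hashed = hasher.transform(['This is a robot.'])
print(hashed.shape)                    # (1, 16)
```

The price of hashing is that different words can collide in the same column, and you can no longer look up which word a column stands for.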

	This	That	is	a	pen
①	1	0	1	1	1
②	0	1	1	1	1
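The table above can be reproduced by hand in a few lines of plain Python. This is only a sketch of the idea, not what scikit-learn does internally, and the columns come out in order of first appearance rather than in the order drawn in the table.

```python
from collections import Counter

sentences = ["This is a pen", "That is a pen"]

# Build the vocabulary: distinct words in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One K-dimensional count vector per sentence (K = len(vocabulary))
vectors = []
for sentence in sentences:
    counts = Counter(sentence.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['This', 'is', 'a', 'pen', 'That']
print(vectors)     # [[1, 1, 1, 1, 0], [0, 1, 1, 1, 1]]
```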

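#coding:utf-8
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = [
    'This is a pen.',
    'That is a bot.',
    'These are red document and blue document.',
]

# Learn the vocabulary and build the document-term count matrix in one step
X = vectorizer.fit_transform(corpus)
# On scikit-learn 1.0 and later, use get_feature_names_out() instead
print(vectorizer.get_feature_names())
print(X.toarray())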

Result

[u'and', u'are', u'blue', u'bot', u'document', u'is', u'pen', u'red', u'that', u'these', u'this']
[[0 0 0 0 0 1 1 0 0 0 1]
 [0 0 0 1 0 1 0 0 1 0 0]
 [1 1 1 0 2 0 0 1 0 1 0]]

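The fit_transform method does the learning and the conversion in a single call. get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) shows
what was learned. Note that the words are lowercased and the one-letter word "a" does not appear: by default CountVectorizer only keeps tokens of two or more characters.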
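The output above has no column for "a", because CountVectorizer's default token_pattern only matches tokens of two or more word characters. If you do want one-letter words, one option is to pass a looser pattern. The sketch below assumes that is what you want, and the regular expression is my own choice, not something from the original run.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is a pen.', 'That is a pen.']

# Assumed pattern: accept tokens of one or more word characters
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

# vocabulary_ maps each word to its column; columns are in sorted word order
print(sorted(vectorizer.vocabulary_))  # ['a', 'is', 'pen', 'that', 'this']
print(X.toarray())
```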

Bigram, Trigram

Creating them

Next, you can also build bi-grams and tri-grams.
A bi-gram counts pairs of two adjacent words.

	is pen	That is	This is
①	1	0	1
②	1	1	0
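The pairs in the table above can be produced with a small helper. The ngrams function is a hypothetical name of my own, and the word lists already leave out "a", the same way CountVectorizer's default tokenizer would.

```python
def ngrams(words, n):
    """Return every run of n adjacent words, joined by a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# "a" is left out, as CountVectorizer does by default
first = ngrams(["This", "is", "pen"], 2)
second = ngrams(["That", "is", "pen"], 2)

print(first)   # ['This is', 'is pen']
print(second)  # ['That is', 'is pen']
```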

CountVectorizer has an ngram_range parameter, and changing it controls which n-grams are counted. For example, with (1, 2)
both single words and bi-grams are used.

vectorizer = CountVectorizer(ngram_range=(1, 2))

Result

[u'and', u'and blue', u'are', u'are red', u'blue', u'blue document', u'bot', u'document', u'document and', u'is', u'is bot', u'is pen', u'pen', u'red', u'red document', u'that', u'that is', u'these', u'these are', u'this', u'this is']
[[0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1]
 [0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0]
 [1 1 1 1 1 1 0 2 1 0 0 0 0 1 1 0 0 1 1 0 0]]
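If you want only bi-grams or only tri-grams, without the single words, setting both ends of ngram_range to the same value should do it. Here is a small sketch on the same corpus. The helper name ngram_vocabulary is my own, and it reads the learned terms from the vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is a pen.',
    'That is a bot.',
    'These are red document and blue document.',
]

def ngram_vocabulary(n):
    """Learn only n-grams of exactly n words and return them sorted."""
    vectorizer = CountVectorizer(ngram_range=(n, n))
    vectorizer.fit(corpus)
    return sorted(vectorizer.vocabulary_)

bigrams = ngram_vocabulary(2)
trigrams = ngram_vocabulary(3)

print(bigrams)   # 10 terms, from 'and blue' to 'this is'
print(trigrams)  # 7 terms, including 'this is pen' and 'that is bot'
```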

References

4.2. Feature extraction — scikit-learn 0.16.1 documentation