機械学習　クラスタリングアルゴリズム、K平均法（Kmeans）

皆さんこんにちは
お元気ですか？私は元気です。

6/11:ソースコード訂正

さて、今回は機械学習のアルゴリズムのうちクラスタリングアルゴリズムで有名である
K平均法について取り扱って見ようと思います。

クラスタリング アルゴリズムって？

データがあり、それを一定の手法で分けることをクラスタリングアルゴリズムといいます。
今回はK平均法と呼ばれるアルゴリズムを使ってみます。

K平均法とはなんぞ？

由来

クラスタの平均を使って、K個に分割することからK平均法と名付けられたそう（Wikipediaより）

特徴

他のクラスタリングアルゴリズムと比較すると、計算量が少ない
計算の都合上、片方が伸びてると違うクラスタリングに含みやすいといった欠点がある（ユーグリッド距離）

アルゴリズム

１，各データをランダムにクラスタリングに割り当てる。
分割したいクラスタリングの数を最大値として、ベクトルを生成

attribute = np.random.randint(0,self.clusterNumber,len(vectors))

２．割り振ったデータを元に重心を計算
ベクトルと属性を渡して、属性ごとに重心の計算を行っています。

def executeCalcCentroid(self,vectors,attribute):
	for k in xrange(self.clusterNumber):
		sum = np.zeros(len(vectors[0]))
		count = 0
		for i in xrange(len(attribute)):
			if attribute[i] == k:
				sum += vectors[i]
				count += 1

		self.centroid[k] = sum / count

３．更新作業
重心からの距離を再計算し、最も近いところにクラスタの属性を割り当てる

while True:
	print time,score
	next_attribute = np.zeros(len(vectors))
	for i in xrange(len(vectors)):
		minIndex = 0
		minDistance = 1000000
		for k in xrange(self.clusterNumber):
			distance = np.power(vectors[i] - self.centroid[k],2).sum()
			if distance < minDistance:
				minIndex = k
				minDistance = distance
		next_attribute[i] = minIndex

	next_score = self.executeCalcCentroid(vectors,next_attribute)

４，割り当てが変化しなくなるまで、ないし打ち切り条件が達成されないまで３の更新作業を実行する
今回は合計距離が一定以上縮まなければ打ち切っています。

next_score = self.executeCalcCentroid(vectors,next_attribute)
attribute = next_attribute
if score - next_score < 0.3:
	print "end",next_score
	break
score = next_score
time+= 1

実験

今回はfaithful.txtと呼ばれる
分類の有名なテキストファイルを活用させていただきました。(https://github.com/stober/gmm/blob/master/data/faithful.txt)

可視化したら以下のような感じになります。

f:id:tereka:20140608174922p:plain

参考文献

http://ja.wikipedia.org/wiki/K%E5%B9%B3%E5%9D%87%E6%B3%95
http://aidiary.hatenablog.com/entry/20100515/1273888686

ソースコード全文

#coding:utf-8
import numpy as np
from pylab import *
import sys

class Kmeans(object):
	"""docstring for Kmeans"""
	def __init__(self, cluster):
		super(Kmeans, self).__init__()
		#クラスターの数を設定する
		self.clusterNumber = cluster

	def executeCalcCentroid(self,vectors,attribute):
		for k in xrange(self.clusterNumber):
			sum = np.zeros(len(vectors[0]))
			count = 0
			for i in xrange(len(attribute)):
				if int(attribute[i]) == k:
					#print attribute[i],k
					sum += vectors[i]
					count += 1

			self.centroid[k] = sum / count

	def drawGraph(self,vectors,attribute):
		#描写モード
		for i in xrange(len(vectors)):
			c = ''
			if 1 == attribute[i]:
				c = 'r'
			else:
				c = 'y'
			scatter(vectors[i,0], vectors[i,1], c=c, marker='o')

		show()

	def executeCluster(self,vectors):
		#ランダムでベクトルを生成
		attribute = np.random.randint(0,self.clusterNumber,len(vectors))
		self.centroid = np.zeros((self.clusterNumber,len(vectors[0])))

		self.executeCalcCentroid(vectors,attribute)

		score = sys.maxint
		time = 0
		while True:
			now_score = 0
			next_attribute = np.zeros(len(vectors))
			for i in xrange(len(vectors)):
				minIndex = 0
				minDistance = sys.maxint
				for k in xrange(self.clusterNumber):
					distance = np.power(vectors[i] - self.centroid[k],2.0).sum()
					if distance < minDistance:
						minIndex = k
						minDistance = distance
				next_attribute[i] = minIndex
				now_score += minDistance

			self.executeCalcCentroid(vectors,next_attribute)
			attribute = next_attribute

			print time,now_score
			if time != 0 and score - now_score < 0.3:
				break
			score = now_score
			time+= 1

	def predictAttribute(self,vector):
		minIndex = 0
		minDistance = 1000000
		for k in xrange(self.clusterNumber):
			distance = np.power(vector - self.centroid[k],2).sum()
			if distance < minDistance:
				minIndex = k
				minDistance = distance
		
		return minIndex

のんびりしているエンジニアの日記

ソフトウェアなどのエンジニア的な何かを書きます。