How can I cluster the time series in a dataframe using KNN / K-means?

Question

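Suppose I have a dataframe with 1000 rows, where each row represents a time series.

I have then written a DTW algorithm that computes the distance between two rows.

I don't know what to do next to complete an unsupervised classification task on the dataframe.

How do I label every row of the dataframe?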

hhh · Answer

Definitions

KNN algorithm = K-nearest-neighbours classification algorithm

K-means = centroid-based clustering algorithm

DTW = Dynamic Time Warping, a similarity-measurement algorithm for time series

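Below I explain step by step how to build two time series and how to compute Dynamic Time Warping (DTW) between them. You can build an unsupervised K-means clustering with scikit-learn without specifying the number of centroids, in which case scikit-learn falls back to its default of n_clusters=8. The "auto" setting in older scikit-learn versions belongs to the algorithm parameter (which K-means variant to run); it does not choose the number of clusters for you.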

Building the time series and computing DTW

We have two time series and compute the DTW between them as follows.

    import pandas as pd
    import numpy as np
    import random
    from dtw import dtw
    from matplotlib.pyplot import plot
    from matplotlib.pyplot import imshow
    from matplotlib.pyplot import cm
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import MultiLabelBinarizer

    # About classification, read the tutorial
    # http://scikit-learn.org/stable/tutorial/basic/tutorial.html

    def createTs(myStart, myLength):
        # Hourly index ('H'; recent pandas versions spell this 'h')
        index = pd.date_range(myStart, periods=myLength, freq='H')
        values = [random.random() for _ in range(myLength)]
        series = pd.Series(values, index=index)
        return series

    # Time series of length 30, start from 1/1/2000 & 1/2/2000 so overlap
    myStart = '1/1/2000'
    myLength = 30
    timeS1 = createTs(myStart, myLength)
    myStart = '1/2/2000'
    timeS2 = createTs(myStart, myLength)

    # This could be your dataframe but unnecessary here
    # myDF = pd.DataFrame({'data1': timeS1.values, 'data2': timeS2.values})

    x = [xxx * 100 for xxx in sorted(timeS1.values)]
    y = [xx for xx in timeS2.values]

    choice = "dtw"

    if choice == "timeseries":
        print(timeS1)
        print(timeS2)
    if choice == "drawingPlots":
        plot(x)
        plot(y)
    if choice == "dtw":
        # DTW with the 1st order norm; for scalar samples this is |a - b|.
        # Assumes the dtw package whose dtw() returns
        # (distance, cost matrix, accumulated cost matrix, path).
        dist, cost, acc, path = dtw(x, y, dist=lambda a, b: abs(a - b))
        imshow(acc.T, origin='lower', cmap=cm.gray, interpolation='nearest')
        plot(path[0], path[1], 'w')

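Classifying time series with KNN

It isn't clear from the question what should be labelled and with which labels, so please provide the following details:

What in the dataframe needs to be labelled? The path computed by the DTW algorithm?
What type of labelling? Binary? Multiclass?

After that, you can decide on a classification algorithm, which may be the so-called KNN algorithm. It works with two separate datasets: a training set and a test set. With the training set you teach the algorithm to label the time series, while the test set lets you measure how well the model performs, using a model selection tool such as AUC.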
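As a minimal sketch of the train/test idea, assuming you already have labels for part of the data: the example below runs a 1-nearest-neighbour classifier from scikit-learn with a plain DTW function as the distance. The toy series, the labels, and the helper name dtw_distance are illustrative assumptions, not something taken from the question.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


# Hypothetical labelled training set: class 0 = flat series, class 1 = ramps.
X_train = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.1, 0.0],
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 0.0, 1.0, 2.0, 3.0, 4.0],
])
y_train = np.array([0, 0, 1, 1])

# Hypothetical unlabelled test set: one flat-ish series, one ramp-like series.
X_test = np.array([
    [0.0, 0.1, 0.0, 0.0, 0.1, 0.0],
    [0.0, 1.0, 1.0, 2.0, 3.0, 5.0],
])

# DTW does not satisfy the triangle inequality, so use brute-force
# neighbour search rather than a tree-based index.
knn = KNeighborsClassifier(n_neighbors=1, metric=dtw_distance, algorithm="brute")
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
print(predicted)
```

With real data you would hold out part of the labelled series as the test set and compare predicted against the held-out labels. Note that this pure-Python DTW is slow, so 1000 long series may need a faster implementation.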

Until more details about the question are provided, I have left a small puzzle open:

    # PUZZLE
    # from tutorial (http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
    newX = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
    newY = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
    newY = MultiLabelBinarizer().fit_transform(newY)
    # Continue to the article.

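Scikit-learn's comparison article on classifiers is listed as the second item under Further reading below.

Clustering with K-means (not the same as KNN)

K-means is a clustering algorithm, and it is unsupervised. It can be used as follows: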

    # Unsupervised: no labels are passed. With no n_clusters argument,
    # scikit-learn falls back to its default of 8 clusters.
    myClusters = KMeans()
    # myClusters.fit(YourDataHere)

This is a very different algorithm from KNN: no labels are needed here. Further material on this topic is given in the first item under Further reading below.
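As a minimal sketch of labelling every row, assuming each row of the dataframe holds one series and all series have the same length: fit_predict returns one cluster id per row. The toy data, n_clusters=2, and the column name cluster are assumptions for illustration. Keep in mind that scikit-learn's K-means measures Euclidean distance between rows and does not use DTW.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 20 rows, each a series of length 30.
# The first 10 rows hover around 0, the last 10 around 5.
low = rng.normal(loc=0.0, scale=0.3, size=(10, 30))
high = rng.normal(loc=5.0, scale=0.3, size=(10, 30))
df = pd.DataFrame(np.vstack([low, high]))

# n_clusters is chosen by you; K-means does not infer it from the data.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(df.to_numpy())

# Attach one cluster id to every row without changing the series columns.
labelled = df.assign(cluster=labels)
print(labelled["cluster"].value_counts())
```

The cluster ids themselves are arbitrary (which group gets 0 and which gets 1 carries no meaning); only the grouping matters.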

Further reading

Does K-means incorporate the K-nearest-neighbour algorithm?
Comparison of classifiers in scikit-learn

Dogan Askan · Answer

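You can make use of DTW. In fact, I had the same problem in one of my projects and wrote my own class for it in Python.

Here is the logic.

1. Create all combinations of k series out of the n series, where k is the number of clusters and n is the number of series. The number of combinations returned should be n! / k! / (n-k)!. Each combination is a candidate set of centers.
2. For each series, compute the distance to each center in each candidate set and assign the series to the closest center.
3. For each candidate set, compute the total distance within its individual clusters.
4. Choose the candidate set with the minimum total.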
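The four steps above can be sketched on a precomputed distance matrix as follows. This is a small illustration of the idea, not the class itself; the function name best_centers and the toy matrix are assumptions. The last line shows why an exhaustive search only suits small inputs: the number of candidate center sets grows combinatorially.

```python
from itertools import combinations
from math import comb

import numpy as np


def best_centers(dist, k):
    """Exhaustively pick the k series (as centers) minimising total distance."""
    n = dist.shape[0]
    best_cost, best_combo = np.inf, None
    for combo in combinations(range(n), k):          # step 1: candidate center sets
        nearest = dist[list(combo), :].min(axis=0)   # step 2: distance to closest center
        cost = nearest.sum()                         # step 3: total distance for this set
        if cost < best_cost:                         # step 4: keep the minimum
            best_cost, best_combo = cost, combo
    # Label each series with the position of its closest chosen center.
    labels = dist[list(best_combo), :].argmin(axis=0)
    return best_combo, labels


# Hypothetical symmetric distance matrix: series 0-1 are close, 2-3 are close.
dist = np.array([
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 8.0, 9.0],
    [9.0, 8.0, 0.0, 1.0],
    [8.0, 9.0, 1.0, 0.0],
])
centers, labels = best_centers(dist, k=2)
print(centers, labels)

# Number of candidate center sets for 1000 series and 3 clusters.
print(comb(1000, 3))
```

For 1000 series and k = 3 that is 166,167,000 candidate sets, on top of the 1000 x 1000 DTW distance matrix, so this approach is practical mainly for a modest number of series.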

And the code:

    import numpy as np
    import pandas as pd
    from itertools import combinations


    def dtw_distance(x, y, d=lambda x, y: abs(x - y), scaled=False, fill=True):
        """Finds the distance of two arrays by dynamic time warping method
        source: https://en.wikipedia.org/wiki/Dynamic_time_warping

        Dependencies:
            import numpy as np
        Args:
            x, y: arrays
            d: distance function, default is absolute difference
            scaled: boolean, should arrays be scaled before calculation
            fill: boolean, should NA values be filled with 0
        returns:
            distance as float, 0.0 means series are exactly same, upper limit is infinite
        """
        if fill:
            x = np.nan_to_num(x)
            y = np.nan_to_num(y)
        if scaled:
            x = array_scaler(x)
            y = array_scaler(y)
        n = len(x) + 1
        m = len(y) + 1
        DTW = np.zeros((n, m))
        DTW[:, 0] = float('Inf')
        DTW[0, :] = float('Inf')
        DTW[0, 0] = 0

        for i in range(1, n):
            for j in range(1, m):
                cost = d(x[i-1], y[j-1])
                DTW[i, j] = cost + min(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1])

        return DTW[n-1, m-1]


    def array_scaler(x):
        """Scales array to 0-1

        Dependencies:
            import numpy as np
        Args:
            x: mutable iterable array of float
        returns:
            scaled x
        """
        arr_min = min(x)
        x = np.array(x) - float(arr_min)
        arr_max = max(x)
        x = x / float(arr_max)
        return x


    class TrendCluster():
        def __init__(self):
            self.clusters = None
            self.centers = None
            self.scale = None

        def fit(self, series, n=2, scale=True):
            '''
            Work-flow
            1 - make series combination with size n, initial clusters
            2 - assign closest series to each cluster
            3 - calculate total distance for each combinations
            4 - choose the minimum

            Args:
                series: dict, keys can be anything, values are time series as list, assumes no nulls
                n: int, cluster size
                scale: bool, if scale needed
            '''
            assert isinstance(series, dict) and isinstance(n, int) and isinstance(scale, bool), 'wrong argument type'
            assert n < len(series.keys()), 'n is too big'
            assert len(set([len(s) for s in series.values()])) == 1, 'series length not same'

            self.scale = scale

            # Every combination of n keys is a candidate set of centers
            combs = combinations(series.keys(), n)
            combs = [[c, -1] for c in combs]

            # Pairwise DTW distances between all series (list() is needed on Python 3)
            series_keys = pd.Series(list(series.keys()))
            dtw_matrix = pd.DataFrame(series_keys.apply(lambda x: series_keys.apply(lambda y: dtw_distance(series[x], series[y], scaled=scale))))
            dtw_matrix.columns, dtw_matrix.index = series_keys, series_keys

            # Total distance of every series to its closest center, per candidate set
            for c in combs:
                # list(...) so that .loc selects several rows instead of treating the tuple as one label
                c[1] = dtw_matrix.loc[list(c[0]), :].min(axis=0).sum()

            combs.sort(key=lambda x: x[1])
            self.centers = {c: series[c] for c in combs[0][0]}
            self.clusters = {c: [] for c in self.centers.keys()}

            # Assign each series to its closest center
            for k, _ in series.items():
                tmp = [[c, dtw_matrix.loc[k, c]] for c in self.centers.keys()]
                tmp.sort(key=lambda x: x[1])
                cluster = tmp[0][0]
                self.clusters[cluster].append(k)

            return None

        def assign(self, serie, save=False):
            '''
            Assigns the serie to appropriate cluster

            Args:
                serie, dict: 1 element dict
                save, bool: if new serie is stored to clusters

            Return:
                str, assigned cluster key
            '''
            assert isinstance(serie, dict) and isinstance(save, bool), 'wrong argument type'
            assert len(serie) == 1, 'serie\'s length is not exactly 1'

            # dict views are not indexable on Python 3, hence list(...)
            tmp = [[c, dtw_distance(list(serie.values())[0], self.centers[c], scaled=self.scale)] for c in self.centers.keys()]
            tmp.sort(key=lambda x: x[1])
            cluster = tmp[0][0]
            if save:
                self.clusters[cluster].append(list(serie.keys())[0])

            return cluster

If you want to see it in action, you can refer to my repository on time series clustering.