Suggestions for speeding up random forests

Question

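I am doing some work with the randomForest package, and while it works well, it can be time-consuming. Does anyone have suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know that R is not multi-threaded/multi-processor, but I was curious whether any of the parallel packages (rmpi, snow, snowfall, etc.) work with randomForest. Thanks.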

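Edit:

I'm using RF for some classification work (0s and 1s). The data has about 8-12 variable columns, and the training set is a sample of 10k rows, so it is a decent size but not crazy. I'm running 500 trees and an mtry of 2, 3, or 4.

Edit 2: Here is some output: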

    > head(t22)
      Id Fail     CCUse Age S-TFail         DR MonInc #OpenLines L-TFail RE M-TFail Dep
    1  1    1 0.7661266  45       2 0.80298213   9120         13       0  6       0   2
    2  2    0 0.9571510  40       0 0.12187620   2600          4       0  0       0   1
    3  3    0 0.6581801  38       1 0.08511338   3042          2       1  0       0   0
    4  4    0 0.2338098  30       0 0.03604968   3300          5       0  0       0   0
    5  5    0 0.9072394  49       1 0.02492570  63588          7       0  1       0   0
    6  6    0 0.2131787  74       0 0.37560697   3500          3       0  1       0   1
    > ptm <- proc.time()
    >
    > RF<- randomForest(t22[,-c(1,2,7,12)],t22$Fail
    + ,sampsize=c(10000),do.trace=F,importance=TRUE,ntree=500,,forest=TRUE)
    Warning message:
    In randomForest.default(t22[, -c(1, 2, 7, 12)], t22$Fail, sampsize = c(10000), :
      The response has five or fewer unique values. Are you sure you want to do regression?
    > proc.time() - ptm
       user  system elapsed
     437.30    0.86  450.97
    >

rcs · Accepted Answer

The manual of the foreach package has a section on parallel random forests (Using the foreach package, Section 5.1):

    > library("foreach")
    > library("doSNOW")
    > registerDoSNOW(makeCluster(4, type="SOCK"))
    > x <- matrix(runif(500), 100)
    > y <- gl(2, 50)
    > rf <- foreach(ntree = rep(250, 4), .combine = combine, .packages = "randomForest") %dopar%
    +   randomForest(x, y, ntree = ntree)
    > rf

    Call:
     randomForest(x = x, y = y, ntree = ntree)
                   Type of random forest: classification
                         Number of trees: 1000

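If we want to create a random forest model with 1000 trees and our computer has four cores, we can split the problem into four pieces by running the randomForest function four times with the ntree argument set to 250. Of course, we have to combine the resulting randomForest objects, but the randomForest package comes with a function called combine.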
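As a minimal sketch of the split-and-combine idea described above, the following uses R's bundled parallel package in place of foreach/doSNOW. That substitution, the worker count of 2 (chosen to match the dual-core machine in the question), the seed, and the toy data are assumptions for illustration rather than part of the answer.

```r
library(parallel)
library(randomForest)

set.seed(1)
x <- matrix(runif(500), 100)   # toy predictors, same shape as in the answer
y <- gl(2, 50)                 # two-level factor, so this is classification

n_workers <- 2                 # assumption: dual-core machine
trees_per_worker <- 250        # each worker grows its own sub-forest

cl <- makeCluster(n_workers)   # PSOCK cluster, which also works on Windows
clusterSetRNGStream(cl, 1)     # independent, reproducible RNG stream per worker

# Grow one sub-forest per worker; x and y are passed explicitly to the workers.
forests <- parLapply(
  cl,
  rep(trees_per_worker, n_workers),
  function(ntree, x, y) randomForest::randomForest(x, y, ntree = ntree),
  x = x, y = y
)
stopCluster(cl)

# Merge the sub-forests into a single randomForest object.
rf <- do.call(randomForest::combine, forests)
rf$ntree
```

The combined object can be passed to predict() like any other forest. According to the randomForest documentation, combine() sets the confusion and err.rate components to NULL, so an out-of-bag error estimate would have to be obtained separately if it is needed.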

Brent · Answer

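There are two "out of the box" options that address this problem. First, the caret package contains a method, "parRF", that handles this elegantly. I've commonly used this with 16 cores to great effect. The randomShrubbery package also takes advantage of multiple cores for RF on Revolution R.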
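A rough sketch of how the "parRF" method might be called is shown below. It assumes the caret, doParallel, randomForest, e1071, and import packages are installed; the two-worker cluster, the toy data, and the mtry grid of 2 to 4 (taken from the question) are illustrative choices, not something the answer specifies.

```r
library(caret)
library(doParallel)

# Register a foreach backend; parRF splits the trees across its workers.
cl <- makePSOCKcluster(2)      # assumption: two cores available
registerDoParallel(cl)

set.seed(1)
x <- data.frame(matrix(runif(1000), 200))       # 200 rows, 5 toy predictors
y <- factor(rep(c("no", "yes"), each = 100))    # two-class response

fit <- train(
  x, y,
  method = "parRF",
  ntree = 500,                                  # total trees, divided among workers
  tuneGrid = data.frame(mtry = 2:4),            # candidate mtry values to compare
  # Run the resampling loop sequentially so that only the forest itself is parallel.
  trControl = trainControl(method = "cv", number = 5, allowParallel = FALSE)
)
stopCluster(cl)

fit$bestTune
```

Setting allowParallel = FALSE is a design choice made here to avoid nesting two levels of parallelism; whether that is faster than parallelizing the resampling loop instead depends on the data and the number of cores.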

eagle34 · Answer

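Is there any particular reason why you're not using Python (namely, the scikit-learn and multiprocessing modules) to implement this? Using joblib, I've trained random forests on data sets of similar size in a fraction of the time it takes in R. Even without multiprocessing, random forests are significantly faster in Python. Here is a quick example of training an RF classifier and cross-validating it in Python. You can also easily extract feature importances and visualize the trees.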

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 matthews_corrcoef, precision_score, recall_score)
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # StratifiedKFold now lives in sklearn.model_selection.
    from sklearn.model_selection import StratifiedKFold

    # Assuming that you have read in data (a NumPy array) with headers;
    # the first column corresponds to the response variable.
    # np.float was removed from NumPy, so the built-in float is used instead.
    y = data[1:, 0].astype(float)
    X = data[1:, 1:].astype(float)

    cm = np.array([[0, 0], [0, 0]])
    precision = np.array([])
    accuracy = np.array([])
    sensitivity = np.array([])
    f1 = np.array([])
    matthews = np.array([])

    rf = RandomForestClassifier(n_estimators=100, max_features=5, n_jobs=2)

    # Divide the dataset into 5 "folds", where classes are equally balanced in each fold
    cv = StratifiedKFold(n_splits=5)
    for i, (train, test) in enumerate(cv.split(X, y)):
        classes = rf.fit(X[train], y[train]).predict(X[test])
        precision = np.append(precision, precision_score(y[test], classes))
        accuracy = np.append(accuracy, accuracy_score(y[test], classes))
        sensitivity = np.append(sensitivity, recall_score(y[test], classes))
        f1 = np.append(f1, f1_score(y[test], classes))
        matthews = np.append(matthews, matthews_corrcoef(y[test], classes))
        cm = np.add(cm, confusion_matrix(y[test], classes))

    print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))
    print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
    print("Sensitivity: %0.2f (+/- %0.2f)" % (sensitivity.mean(), sensitivity.std() * 2))
    print("F1: %0.2f (+/- %0.2f)" % (f1.mean(), f1.std() * 2))
    print("Matthews: %0.2f (+/- %0.2f)" % (matthews.mean(), matthews.std() * 2))
    print(cm)

Manolete · Answer

Why don't you use an already parallelized and optimized implementation of random forest? Have a look at SPRINT, which uses MPI. http://www.r-sprint.org/