Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

Question

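I am trying to join two numpy arrays. One holds a set of columns/features produced by running TF-IDF on a single column of text. The other holds one column/feature, which is an integer. So I read in a column of train and test data, run TF-IDF on it, and then want to add another integer column.

Unfortunately, when I try to run hstack to add this single column to my other numpy array, I get the error in the title.

Here is my code: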

    # imports implied by the aliases used below
    import numpy as np
    import pandas as p
    from sklearn import linear_model as lm
    from sklearn.feature_extraction.text import TfidfVectorizer

    # reading in test/train data for TF-IDF
    traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
    testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

    # reading in labels for training
    y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

    # reading in single integer column to join
    AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
    AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
    AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

    # tf-idf object
    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                          analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                          use_idf=1, smooth_idf=1, sublinear_tf=1)
    # classifier
    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1,
                               fit_intercept=True, intercept_scaling=1.0,
                               class_weight=None, random_state=None)

    X_all = traindata + testdata  # adding test and train data to put into tf-idf
    lentrain = len(traindata)     # find length of train data
    tfv.fit(X_all)                # fit tf-idf on all our text
    X_all = tfv.transform(X_all)  # transform it
    X = X_all[:lentrain]          # reduce to size of training set
    AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain]  # reduce to size of training set
    X_test = X_all[lentrain:]     # remaining rows are the test set

    # printing debug info, output below:
    print "X.shape => " + str(X.shape)
    print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
    print "X_all.shape => " + str(X_all.shape)

    # line we get error on
    X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message:

    X.shape => (7395, 238377)
    AllAlexaAndGoogleInfo.shape => (7395, 1)
    X_all.shape => (10566, 238377)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-12-2b310887b5e4> in <module>()
         31 print "X_all.shape => " + str(X_all.shape)
         32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
    ---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
         34 sc = preprocessing.StandardScaler().fit(X)
         35 X = sc.transform(X)

    C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
        271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
        272     if arrs[0].ndim == 1:
    --> 273         return _nx.concatenate(arrs, 0)
        274     else:
        275         return _nx.concatenate(arrs, 1)

    ValueError: all the input arrays must have same number of dimensions

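What is causing my problem here? How can I fix this? As far as I can see, I should be able to join these columns. What have I misunderstood?

Thank you.

Edit:

Using the method in the answer below gives the following error: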

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-16-640ef6dd335d> in <module>()
    ---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
         37 sc = preprocessing.StandardScaler().fit(X)
         38 X = sc.transform(X)

    C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
        294             arr = array(arr,copy=False,subok=True,ndmin=2).T
        295         arrays.append(arr)
    --> 296     return _nx.concatenate(arrays,1)
        297
        298 def dstack(tup):

    ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried to print the dtype of X and this worked fine:

    X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so:

    print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)

produces:

    'DataFrame' object has no attribute 'dtype'

YS-L · Accepted Answer

As X is a sparse array, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is rather misleading here.

This minimal example illustrates the situation:

    import numpy as np
    from scipy import sparse

    X = sparse.rand(10, 10000)
    xt = np.random.random((10, 1))
    print('X shape:', X.shape)
    print('xt shape:', xt.shape)
    print('Stacked shape:', np.hstack((X, xt)).shape)  # raises the ValueError
    # print('Stacked shape:', sparse.hstack((X, xt)).shape)  # This works

Based on the following output

    X shape: (10, 10000)
    xt shape: (10, 1)

one might expect the hstack on the following line to work, but in fact it throws this error:

    ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
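As a minimal sketch of that advice in Python 3: the matrix below is a hypothetical stand-in for the TF-IDF output, and the dense column stands in for the integer feature. The names, sizes and the choice of CSR format are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: a sparse "TF-IDF" matrix and one dense extra column.
X = sparse.random(10, 10000, density=0.01, format="csr", random_state=0)
xt = np.arange(10, dtype=np.float64).reshape(-1, 1)

# sparse.hstack accepts sparse and dense 2-D blocks as long as the row counts match.
# format="csr" is a choice made here so the result supports row slicing.
stacked = sparse.hstack((X, xt), format="csr")
print("Stacked shape:", stacked.shape)
```

The result stays sparse, so the large TF-IDF block is never expanded into a dense array.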

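In fact, I answered this as a comment on another question of yours, and you mentioned that a different error message pops up:

    TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype because it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values and check its dtype. Based on the error message, it has a dtype of object, which means it might contain non-numerical elements such as strings.

This is a minimal example that reproduces this situation: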

    X = sparse.rand(100, 10000)
    xt = np.random.random((100, 1))
    xt = xt.astype('object')  # Comment this to fix the error
    print('X:', X.shape, X.dtype)
    print('xt:', xt.shape, xt.dtype)
    print('Stacked shape:', sparse.hstack((X, xt)).shape)

Error message:

    TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, before doing the stacking, check for any non-numerical values in AllAlexaAndGoogleInfo and repair them.
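One possible way to do that check and repair is sketched below in Python 3. The sample values, the use of pandas.to_numeric with errors="coerce", and filling with 0 are all assumptions for illustration; the right repair depends on what the bad entries in the real data mean.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical data: one rank is a string, so the column ends up with dtype object.
df = pd.DataFrame({"alexarank": [120, "unknown", 4500, 37]})
print("dtype before:", df["alexarank"].dtype)

# Anything that cannot be parsed as a number becomes NaN.
rank = pd.to_numeric(df["alexarank"], errors="coerce")
print("non-numeric entries:", int(rank.isna().sum()))

# Replace the NaNs (0 is an arbitrary placeholder) and shape the result as one column.
rank = rank.fillna(0).to_numpy(dtype=np.float64).reshape(-1, 1)

# Stand-in for the sparse TF-IDF matrix with the same number of rows.
X = sparse.random(4, 50, density=0.1, format="csr", random_state=0)
stacked = sparse.hstack((X, rank), format="csr")
print("Stacked shape:", stacked.shape)
```

Once the column is float64, the dtype conversion that failed with object no longer comes into play.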

Drewness · Answer

Use .column_stack. Like so:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
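The difference the docs describe can be seen with small dense arrays. This is only an illustrative sketch with made-up values; like np.hstack, np.column_stack is a plain numpy function and does not handle the sparse X from the question.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# 1-D inputs become the columns of a 2-D result.
cols = np.column_stack((a, b))
print(cols.shape)

# hstack joins the same 1-D inputs end to end instead.
flat = np.hstack((a, b))
print(flat.shape)

# A 2-D input is kept as is and the 1-D input is appended as a new column.
m = np.ones((3, 2))
wide = np.column_stack((m, b))
print(wide.shape)
```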

hpaulj · Answer

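Try:

    X = np.hstack((X, AllAlexaAndGoogleInfo.values))

I don't have the Pandas module running, so I can't test it. But the DataFrame documentation describes values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of the DataFrame.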
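A small Python 3 sketch of what .values gives you, using a made-up DataFrame and a dense array in place of X (both are assumptions for illustration). It only covers the dense case; a sparse X would still need scipy.sparse.hstack as described in the accepted answer.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column DataFrame, similar in shape to AllAlexaAndGoogleInfo.
df = pd.DataFrame({"alexarank": [120, 4500, 37]})

# .values returns the underlying numpy array, which does have shape and dtype.
col = df.values
print(type(col).__name__, col.shape, col.dtype)

# With a dense 2-D array on the left, np.hstack can append the column.
dense = np.zeros((3, 2))
out = np.hstack((dense, col))
print(out.shape)
```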