Pandas: apply in parallel with axis=0

Question

I want to apply a function to all pandas columns in parallel. For example, I want to run this in parallel:

    import pandas as pd

    def my_sum(x, a):
        return x + a

    df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0]})
    df.apply(lambda x: my_sum(x, 2), axis=0)

I know there is the swifter package, but it does not support axis=0 for apply:

NotImplementedError: Swifter cannot perform axis=0 applies on large datasets. Dask currently does not have an axis=0 apply implemented. More details at https://github.com/jmcarpenter2/swifter/issues/1

Dask does not support this for axis=0 either (according to the swifter documentation).

I have googled several sources but could not find a simple solution.

I can't believe this is so complicated in pandas.

Chicodelarosa · Answer

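In my opinion, this case should be approached by focusing on how the data is split across the available resources. Dask offers map_partitions, which applies a Python function to each DataFrame partition. Of course, the number of rows per partition that your workstation can handle depends on the available hardware resources. Here is an example based on the information you provided in the question: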

    # imports
    import dask
    from dask import dataframe as dd
    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    # range for values to be randomly generated
    range_ = {
        "min": 0,
        "max": 100
    }

    # rows and columns for the fake dataframe
    df_shape = (
        int(1e8),  # rows
        2          # columns
    )

    # some fake data
    data_in = pd.DataFrame(
        np.random.randint(range_["min"], range_["max"], size = df_shape),
        columns = ["legs", "wings"]
    )

    # function to apply adding some value a to the partition
    def my_sum(x, a):
        return x + a

    """
    applies my_sum to each column (axis = 0) of every row-wise partition

    number of partitions = cpu_count

    the scheduler can be:
    "threads": Uses a ThreadPool in the local process
    "processes": Uses a ProcessPool to spread work between processes
    "single-threaded": Uses a for-loop in the current thread
    """
    data_out = dd.from_pandas(data_in, npartitions = mp.cpu_count()).map_partitions(
        lambda df: df.apply(
            my_sum, axis = 0, a = 2
        )
    ).compute(scheduler = "threads")

    # inspection
    print(data_in.head(5))
    print(data_out.head(5))

This implementation was tested on a randomly generated DataFrame with 100,000,000 rows and 2 columns.

Workstation specs
CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Total memory: 16698340 kB
OS: Ubuntu 18.04.4 LTS
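
To make the partitioning idea concrete without a Dask dependency, here is a minimal sketch that does by hand roughly what the answer delegates to map_partitions: split the rows into chunks, run the ordinary axis=0 apply on each chunk in a thread pool, and concatenate the pieces. The helper name `apply_by_row_chunks` and its `n_chunks` parameter are made up for illustration, and `concurrent.futures` is a stand-in here, not how Dask schedules work internally. It should only reproduce the unpartitioned result for functions like my_sum that treat each row independently; a function that reduces over a whole column would need an extra combine step.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def my_sum(x, a):
    return x + a


def apply_by_row_chunks(df, func, n_chunks=4, **kwargs):
    """Hypothetical helper: run an axis=0 apply on row chunks in parallel."""
    # Split the row positions into roughly equal contiguous blocks.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each task is a plain pandas column-wise apply on one chunk.
        parts = list(
            pool.map(lambda chunk: chunk.apply(func, axis=0, **kwargs), chunks)
        )
    # pool.map preserves input order, so the rows come back in order.
    return pd.concat(parts)


# Small fake data set (much smaller than the 1e8 rows used above).
rng = np.random.default_rng(0)
small_df = pd.DataFrame(
    rng.integers(0, 100, size=(1000, 2)), columns=["legs", "wings"]
)

chunked = apply_by_row_chunks(small_df, my_sum, n_chunks=4, a=2)
expected = small_df.apply(my_sum, axis=0, a=2)
print(chunked.equals(expected))
```

Whether threads give any speed-up depends on how much of the applied function releases the GIL, so it is worth timing this against the plain apply on your own data.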

Someone · Answer

You can use the Dask delayed interface to set up a custom workflow.

    import pandas as pd
    import dask
    import distributed

    # start local cluster, by default one worker per core
    client = distributed.Client()

    @dask.delayed
    def my_sum(x, a):
        return x + a

    df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0]})

    # Here, we mimic the apply command. However, we do not
    # actually run any computation. Instead, that line of code
    # results in a list of delayed objects, which contain the
    # information what computation should be performed eventually
    delayeds = [my_sum(df[column], 2) for column in df.columns]

    # send the list of delayed objects to the cluster, which will
    # start computing the result in parallel.
    # It returns future objects, pointing to the computation while
    # it is still running
    futures = client.compute(delayeds)

    # get all the results, as soon as they are ready. This returns
    # a list of pandas Series objects, each is one column of the
    # output dataframe
    computed_columns = client.gather(futures)

    # create dataframe out of individual columns
    computed_df = pd.concat(computed_columns, axis = 1)

Alternatively, you can use Dask's multiprocessing backend.

    import pandas as pd
    import dask

    @dask.delayed
    def my_sum(x, a):
        return x + a

    df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0]})

    # same as above
    delayeds = [my_sum(df[column], 2) for column in df.columns]

    # run the computation using the dask's multiprocessing backend;
    # unpacking with * makes dask.compute return one Series per column
    # instead of a one-element tuple wrapping the whole list
    computed_columns = dask.compute(*delayeds, scheduler = 'processes')

    # create dataframe out of individual columns
    computed_df = pd.concat(computed_columns, axis = 1)
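
For comparison, the same one-task-per-column pattern can be sketched with only the standard library's `concurrent.futures` instead of Dask. This is a swapped-in technique, not part of the answer above: the helper name `parallel_apply_columns` and its `max_workers` parameter are hypothetical, and a thread pool is used rather than processes to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def my_sum(x, a):
    return x + a


def parallel_apply_columns(df, func, *args, max_workers=None):
    """Hypothetical helper: run func on every column concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per column; each task receives a pandas Series.
        futures = {name: pool.submit(func, df[name], *args) for name in df.columns}
        # Collect the results, keeping the original column order.
        results = {name: future.result() for name, future in futures.items()}
    # The dict keys become the column names of the reassembled DataFrame.
    return pd.concat(results, axis=1)


df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0]})
computed_df = parallel_apply_columns(df, my_sum, 2)
print(computed_df)
```

With only two columns there are only two tasks, so any gain is limited by the number of columns; this kind of decomposition pays off mainly for wide frames with an expensive per-column function.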