PySpark: show a histogram of a DataFrame column

Question

With a pandas DataFrame, I use the following code to plot a histogram of a column:

my_df.hist(column = 'field_1')

Is there something that achieves the same goal for a PySpark DataFrame? (I am in a Jupyter notebook.) Thanks!

Shivam Gaur · Answer

Unfortunately, I don't think there is a clean plot() or hist() function in the PySpark DataFrame API, but I hope things will eventually move in that direction.

For the time being, you can compute the histogram in Spark and plot the computed histogram as a bar chart. Example:

import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Creating a pandas DataFrame from the sample data
df_pandas = pd.read_csv(file_name)

# `sc` is assumed to be an existing SparkContext (as in a PySpark shell or notebook session)
sql_context = sparksql.SQLContext(sc)

# Creating a Spark DataFrame from the pandas DataFrame
df_spark = sql_context.createDataFrame(df_pandas)
df_spark.show(5)

The data looks like this:

Out[]:
+-----+---+----+----+
|admit|gre| gpa|rank|
+-----+---+----+----+
|    0|380|3.61|   3|
|    1|660|3.67|   3|
|    1|800| 4.0|   1|
|    1|640|3.19|   4|
|    0|520|2.93|   4|
+-----+---+----+----+
only showing top 5 rows

# This is what we want
df_pandas.hist('gre');

Histogram when plotted using df_pandas.hist()

# Doing the heavy lifting in Spark. We can leverage the `histogram` function from the RDD API
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the computed histogram into a pandas DataFrame for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');

Histogram computed using RDD.histogram()
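To make the shape of that result concrete, here is a minimal pure-Python sketch of equal-width binning that returns an (edges, counts) pair shaped like the one RDD.histogram(n) gives back. It is an illustration only, not Spark's implementation: the name equal_width_histogram is hypothetical, and the sample values are just the five gre values from the table above.

```python
def equal_width_histogram(values, num_bins):
    """Return (edges, counts): num_bins + 1 bin edges and num_bins counts."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: every value falls into a single bin
        return [lo, hi], [len(values)]
    step = (hi - lo) / num_bins
    edges = [lo + i * step for i in range(num_bins)] + [hi]
    counts = [0] * num_bins
    for v in values:
        # Bins are half-open [left, right); the last bin also includes hi
        idx = min(int((v - lo) / step), num_bins - 1)
        counts[idx] += 1
    return edges, counts


# The five gre values shown in the sample output
edges, counts = equal_width_histogram([380, 660, 800, 640, 520], 3)

# zip stops at the shorter sequence, so each count is paired with the
# left edge of its bin and the final right edge is dropped
rows = list(zip(edges, counts))
```

This is why the bar chart built with zip(*gre_histogram) is labeled with the left edge of each bin: there is always one more edge than there are counts.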

Chris van den Berg · Answer

You can use the pyspark_dist_explore package to leverage the matplotlib hist function for Spark DataFrames:

from pyspark_dist_explore import hist
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
hist(ax, data_frame, bins = 20, color=['red'])

This library uses the RDD histogram function to compute the bin values.

Andrew · Answer

The RDD histogram method returns the bin ranges and the bin counts. Here is a function that takes this histogram data and plots it as a histogram.

import numpy as np
import matplotlib.pyplot as mplt
import matplotlib.ticker as mtick

def plotHistogramData(data):
    # data is the (bin edges, bin counts) pair returned by RDD.histogram()
    binSides, binCounts = data

    N = len(binCounts)
    ind = np.arange(N)
    width = 1

    fig, ax = mplt.subplots()
    # Center each bar between two consecutive bin-edge ticks
    rects1 = ax.bar(ind+0.5, binCounts, width, color='b')

    ax.set_ylabel('Frequencies')
    ax.set_title('Histogram')
    ax.set_xticks(np.arange(N+1))
    # Format the bin edges directly; an x-axis formatter set afterwards
    # would replace these labels with the tick positions
    ax.set_xticklabels(['%.2e' % side for side in binSides])
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

    mplt.show()

(This code assumes the bins have equal width.)
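If the bin edges were passed in explicitly rather than generated from a bin count, that assumption may not hold. A small check like the following could be run before plotting; the helper name has_equal_width_bins and the tolerance are hypothetical choices, not part of the answer above.

```python
import math


def has_equal_width_bins(bin_sides, rel_tol=1e-9):
    """Return True if consecutive bin edges are (almost) equally spaced."""
    widths = [right - left for left, right in zip(bin_sides, bin_sides[1:])]
    # Compare every width to the first one, allowing for float rounding
    return all(math.isclose(w, widths[0], rel_tol=rel_tol) for w in widths)


evenly_spaced = has_equal_width_bins([0, 10, 20, 30])
unevenly_spaced = has_equal_width_bins([0, 10, 50, 100])
```

When the check fails, bars of width 1 drawn at consecutive integer positions would misrepresent the actual bin ranges, so the bar positions and widths would need to come from the edges themselves.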

Elior Malul · Answer

Another solution, which needs no extra imports and should also be efficient. First, use a window partition:

import pyspark.sql.functions as F
import pyspark.sql as SQL
win = SQL.Window.partitionBy('column_of_values')

Then all you need is to use the count aggregation partitioned by the window:

df.select(F.count('column_of_values').over(win).alias('histogram'))

The aggregation operators run on each partition of the cluster and do not require an extra round trip to the host.
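To see what the windowed count produces, here is a rough pure-Python emulation; it is a sketch of the semantics as I understand them, not Spark code. The function name count_over_partition and the sample values are made up for illustration, and it ignores nulls and Spark's row ordering.

```python
from collections import Counter


def count_over_partition(values):
    """Emulate count(col) OVER (PARTITION BY col): one output per input row."""
    group_sizes = Counter(values)
    # Every row receives the size of the group that shares its value
    return [group_sizes[v] for v in values]


ranks = [3, 3, 1, 4, 4]
per_row_counts = count_over_partition(ranks)

# One (value, frequency) pair per distinct value, which is closer to
# what a histogram plot needs
frequencies = sorted(Counter(ranks).items())
```

Note that this gives a frequency per distinct value, repeated on every row, rather than counts per numeric bin. For a continuous column the values would probably need to be bucketed first, and the result reduced to one row per value before plotting.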

conner.xyz · Answer

This is simple and works well.

df.groupby( '<group-index>' ).count().select( 'count' ).rdd.flatMap( lambda x: x ).histogram(20)