Python & Pandas - Group by day and count for each day

Question

I am new to pandas. For now, I can't figure out how to arrange my time series. Please have a look:

    date & time of connection
    19/06/2017 12:39
    19/06/2017 12:40
    19/06/2017 13:11
    20/06/2017 12:02
    20/06/2017 12:04
    21/06/2017 09:32
    21/06/2017 18:23
    21/06/2017 18:51
    21/06/2017 19:08
    21/06/2017 19:50
    22/06/2017 13:22
    22/06/2017 13:41
    22/06/2017 18:01
    23/06/2017 16:18
    23/06/2017 17:00
    23/06/2017 19:25
    23/06/2017 20:58
    23/06/2017 21:03
    23/06/2017 21:05

This is a sample from a dataset of 130k rows. I tried: df.groupby('date & time of connection')['date & time of connection'].apply(list)

I don't think that is enough.

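I think I should:

- Create a dictionary with an index running from dd/mm/yyyy to dd/mm/yyyy
- Convert "date & time of connection" from datetime to date
- Group "date & time of connection" by date and count
- Put the numbers I count into the dictionary?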

What do you think of my logic? Do you know any tutorials? Thank you very much.

jezrael · Accepted Answer

Convert to dates with dt.floor, then use value_counts or groupby with size:

    df = (pd.to_datetime(df['date & time of connection'])
           .dt.floor('d')
           .value_counts()
           .rename_axis('date')
           .reset_index(name='count'))
    print (df)

            date  count
    0 2017-06-23      6
    1 2017-06-21      5
    2 2017-06-19      3
    3 2017-06-22      3
    4 2017-06-20      2
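The timestamps in the question are written day first (dd/mm/yyyy), so it may be safer to give the parser an explicit format than to let it guess. The following is a minimal sketch of the same dt.floor plus value_counts idea on a few rows copied from the question; the `format` string and the extra `sort_index()` step are additions for illustration, not part of the answer above.

```python
import pandas as pd

# A few rows copied from the question (day/month/year order).
raw = [
    "19/06/2017 12:39", "19/06/2017 12:40", "19/06/2017 13:11",
    "20/06/2017 12:02", "20/06/2017 12:04",
]
df = pd.DataFrame({"date & time of connection": raw})

# Spelling out the format avoids any day/month guessing by the parser.
stamps = pd.to_datetime(df["date & time of connection"], format="%d/%m/%Y %H:%M")

counts = (
    stamps.dt.floor("D")        # truncate each timestamp to midnight
    .value_counts()             # number of rows per day
    .sort_index()               # chronological order (added for readability)
    .rename_axis("date")
    .reset_index(name="count")
)
print(counts)
```

Without `sort_index()` the rows come back ordered by count, as in the output shown in the answer.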

Or:

    s = pd.to_datetime(df['date & time of connection'])
    df = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
    print (df)

      date & time of connection  count
    0                2017-06-19      3
    1                2017-06-20      2
    2                2017-06-21      5
    3                2017-06-22      3
    4                2017-06-23      6
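The question also asked about putting the counts into a dictionary. One possible way to get there from the groupby result is sketched below; the names `per_day` and `counts_by_day` are made up for this example, and the explicit `format` is an assumption based on the sample data.

```python
import pandas as pd

raw = ["21/06/2017 09:32", "21/06/2017 18:23", "22/06/2017 13:22"]
s = pd.to_datetime(
    pd.Series(raw, name="date & time of connection"),
    format="%d/%m/%Y %H:%M",
)

# Series of counts indexed by day.
per_day = s.groupby(s.dt.floor("D")).size()

# Convert to a plain dict keyed by datetime.date objects.
counts_by_day = {ts.date(): int(n) for ts, n in per_day.items()}
print(counts_by_day)
```

Only days that actually occur in the data become keys here; days without any connection are simply absent.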

Timings:

    np.random.seed(1542)

    N = 220000
    a = np.unique(np.random.randint(N, size=int(N/2)))
    df = pd.DataFrame(pd.date_range('2000-01-01', freq='37T', periods=N)).drop(a)
    df.columns = ['date & time of connection']
    df['date & time of connection'] = df['date & time of connection'].dt.strftime('%d/%m/%Y %H:%M:%S')
    print (df.head())

    In [193]: %%timeit
         ...: df['date & time of connection']=pd.to_datetime(df['date & time of connection'])
         ...: df1 = df.groupby(by=df['date & time of connection'].dt.date).count()
         ...:
    539 ms ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [194]: %%timeit
         ...: df1 = (pd.to_datetime(df['date & time of connection'])
         ...:        .dt.floor('d')
         ...:        .value_counts()
         ...:        .rename_axis('date')
         ...:        .reset_index(name='count'))
         ...:
    12.4 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [195]: %%timeit
         ...: s = pd.to_datetime(df['date & time of connection'])
         ...: df2 = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
         ...:
    17.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Allen · Answer

Make sure the column is in datetime format.

    df['date & time of connection']=pd.to_datetime(df['date & time of connection'])

Then group the data by date and count.

    df.groupby(by=df['date & time of connection'].dt.date).count()

    Out[10]:
                               date & time of connection
    date & time of connection
    2017-06-19                                         3
    2017-06-20                                         2
    2017-06-21                                         5
    2017-06-22                                         3
    2017-06-23                                         6
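A note on how this relates to the dt.floor variant in jezrael's answer: `.dt.date` returns Python `datetime.date` objects in an object-dtype Series, while `.dt.floor` keeps a datetime64 column. That difference may be part of why the timings above differ, though that is an inference and not something either answer states. A small sketch on sample rows from the question, with an assumed explicit `format`:

```python
import pandas as pd

s = pd.to_datetime(
    pd.Series(["19/06/2017 12:39", "19/06/2017 13:11", "20/06/2017 12:02"]),
    format="%d/%m/%Y %H:%M",
)

as_dates = s.dt.date          # Python datetime.date objects, object dtype
as_floor = s.dt.floor("D")    # still a datetime64 column

print(as_dates.dtype)
print(as_floor.dtype)

# Either key gives the same counts per day.
by_date = s.groupby(as_dates).size()
by_floor = s.groupby(as_floor).size()
print(by_date.tolist() == by_floor.tolist())
```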

Jaan Olev · Answer

I found a simple way to do this with resample.

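    # Set the date column as the index (it must already be datetime dtype).
    # drop=False keeps it as a regular column too, so it can still be counted.
    df = df.set_index('your_date_column', drop=False)

    # Count the rows in each calendar day.
    df_counts = df.your_date_column.resample('D').count()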

Your column name is long and contains spaces, though, which makes me a little uneasy. I would use dashes instead of spaces.
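For reference, here is a self-contained sketch of the resample approach. The column name `connected_at` is hypothetical (a shorter stand-in for the column in the question), and the explicit `format` is an assumption based on the sample data. One property worth noting: resample produces a row for every calendar day in the range, so days with no connections show up with a count of 0, which is close to the "index from dd/mm/yyyy to dd/mm/yyyy" idea in the question.

```python
import pandas as pd

# `connected_at` is a hypothetical, shorter column name.
raw = ["19/06/2017 12:39", "19/06/2017 12:40", "21/06/2017 09:32"]
df = pd.DataFrame({"connected_at": pd.to_datetime(raw, format="%d/%m/%Y %H:%M")})

# Keep the column (drop=False) so there is something to count after indexing.
indexed = df.set_index("connected_at", drop=False)

# One row per calendar day between the first and last timestamp.
daily = indexed["connected_at"].resample("D").count()
print(daily)
```

In this sample, 20/06/2017 has no rows, so it appears in `daily` with a count of 0 instead of being skipped.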