Pyspark: Difference between two dates (Cast TimestampType, Datediff)

Question

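I have a table containing incidents and a specific timestamp. I am struggling to calculate the number of days elapsed using the Pyspark 2.0 API. I managed to do the same thing when the timestamp followed another format (yyyy-mm-dd).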

 +-------------------+------------------------+----------+--------------+
 | first_booking_date|first_booking_date_clean|     today|customer_since|
 +-------------------+------------------------+----------+--------------+
 |02-06-2011 20:52:04|              02-06-2011|02-06-2011|          null|
 |03-06-2004 18:15:10|              03-06-2004|02-06-2011|          null|

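I tried the following (nothing worked):

- extracting the date with string manipulation and using datediff
- casting to timestamp and then extracting dd:MM:yy (-> result null)
- I prefer to use pyspark commands over any additional transformation with sql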

Help is highly appreciated. Best, and thanks a lot!!!

EDIT: Here is an example which did not work:

import datetime
today = datetime.date(2011,2,1)
today = "02-06-2011"
first_bookings = first_bookings.withColumn("today",F.lit(today))
first_bookings = first_bookings.withColumn("first_booking_date_clean",F.substring(first_bookings.first_booking_date, 0, 10))
first_bookings = first_bookings.withColumn("customer_since",F.datediff(first_bookings.today,first_bookings.first_booking_date_clean))

Zephro · Accepted Answer

This answer is basically a copy of https://stackoverflow.com/a/36985244/4219202. In your case, timeFmt would be "dd-MM-yyyy" for the columns first_booking_date_clean and today.
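As a quick sanity check outside Spark, the day counts for the two sample rows in the question can be worked out with plain Python. This is only a sketch: the helper name `days_between` is made up for illustration, and `%d-%m-%Y` is the Python `strptime` counterpart of the Java-style pattern "dd-MM-yyyy" (in Java-style patterns, `MM` is the month while `mm` means minutes).

```python
from datetime import datetime

# Python strptime equivalent of the Java-style pattern "dd-MM-yyyy"
DATE_FMT = "%d-%m-%Y"


def days_between(start: str, end: str) -> int:
    """Whole days from start to end, both given as dd-MM-yyyy strings.

    Hypothetical helper, used only to cross-check the Spark result.
    """
    start_date = datetime.strptime(start, DATE_FMT).date()
    end_date = datetime.strptime(end, DATE_FMT).date()
    return (end_date - start_date).days


# (first_booking_date_clean, today) pairs taken from the table in the question
sample = [("02-06-2011", "02-06-2011"), ("03-06-2004", "02-06-2011")]
expected = [days_between(first, today) for first, today in sample]
print(expected)
```

These are the values one would hope to see in customer_since once the strings are parsed with the right format.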

Since Spark 1.5 you can use unix_timestamp:

from pyspark.sql import functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)