How to filter out null values from a Spark DataFrame

Question

I created a DataFrame in Spark with the following schema:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)

The data looks like this:

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

I want to filter out the rows that have null values in the "friend_id" field.

scala> val aaa = test.filter("friend_id is null")
scala> aaa.count

I got res52: Long = 0, which is obviously not right. What is the right way to get it?

One more question: I want to replace the values in the friend_id field. I want to replace null with 0, and any other value with 1. The code I could come up with is:

val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)

This code doesn't work either. Can anyone tell me how to fix it? Thanks

Sachin Tyagi · Accepted Answer

Let's say you have this data setup (so that the results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF

And this is how the data looks:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+

Now, to select only the employees whose id is null, you would do:

df.filter("id is null").show

which correctly shows:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+

For the second part of your question, you can replace the null ids with 0 and all other values with 1 like this:

df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+

Adriana Lazar · Answer

Or like df.filter($"friend_id".isNotNull)

Michael Kopaniov · Answer

df.where(df.col("friend_id").isNull)

chAlexey · Answer

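A good solution for me was to drop the rows that contain any null value:

Dataset<Row> filtered = df.filter((FilterFunction<Row>) row -> !row.anyNull());

If you are interested in the opposite case (keeping only the rows that contain a null), use row.anyNull() without the negation. The cast to org.apache.spark.api.java.function.FilterFunction tells the Java compiler which filter overload the lambda is meant for. (Spark 2.1.0 using the Java API)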
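For readers on the Scala API, dropping rows that contain nulls can also be expressed through DataFrameNaFunctions (df.na). The following is only a minimal sketch under assumptions: the local SparkSession, the sample rows, and the column names id and label are made up for illustration. Note that na.drop() also removes rows containing NaN in numeric columns, so it is not strictly identical to the anyNull check.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session just for this sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("drop-null-rows").getOrCreate()
import spark.implicits._

// Hypothetical sample data: Option.empty becomes null in the DataFrame.
val sample = Seq(
  (Option(1L), Option("a")),
  (Option.empty[Long], Option("b")),
  (Option(3L), Option.empty[String])
).toDF("id", "label")

// Drop rows that have a null (or NaN) in any column.
val noNulls = sample.na.drop()

// Drop rows only when the listed columns are null.
val idNotNull = sample.na.drop(Seq("id"))

noNulls.show()
idNotNull.show()
```

Passing a column list keeps rows whose nulls sit in columns you do not care about.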

Ayush Vatsyayan · Answer

There are two ways to do it: creating the filter condition 1) manually, or 2) dynamically.

Sample DataFrame:

val df = spark.createDataFrame(Seq(
  (0, "a1", "b1", "c1", "d1"),
  (1, "a2", "b2", "c2", "d2"),
  (2, "a3", "b3", null, "d3"),
  (3, "a4", null, "c4", "d4"),
  (4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
|  3|  a4|null|  c4|  d4|
|  4|null|  b5|  c5|  d5|
+---+----+----+----+----+

1) Creating the filter condition manually, i.e. using the DataFrame where or filter function

df.filter(col("col1").isNotNull && col("col2").isNotNull).show

or

df.where("col1 is not null and col2 is not null").show

Result:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
+---+----+----+----+----+

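2) Creating the filter condition dynamically: this is useful when we don't want any column to contain a null value and there are a large number of columns, which is mostly the case.

Writing the filter condition by hand in these cases wastes a lot of time. The code below includes all columns dynamically by using map and reduce over the DataFrame columns: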

val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)

What filterCond looks like:

filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))

Filtering:

val filteredDf = df.filter(filterCond)

Result:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
+---+----+----+----+----+

Andrushenko Alexander · Answer

Here is a solution for Spark in Java. To select the data rows that contain nulls, given a Dataset named data, you do:

Dataset<Row> containingNulls = data.where(data.col("COLUMN_NAME").isNull());

To keep only the data without nulls, you do:

Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull());

Often DataFrames contain columns of type String where, instead of nulls, there are empty strings like "". To filter out such data as well, we do:

Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")));
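The same null-and-empty-string filter can be sketched in Scala, the language used elsewhere in this thread. This is an illustration under assumptions: the local SparkSession, the sample rows, and the column name name are hypothetical. The =!= operator is the Scala counterpart of notEqual (available since Spark 2.0).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session just for this sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("non-empty-filter").getOrCreate()
import spark.implicits._

// Hypothetical sample: one real value, one null, one empty string.
val data = Seq(
  (1, Option("alice")),
  (2, Option.empty[String]),
  (3, Option(""))
).toDF("id", "name")

// Keep rows where "name" is neither null nor the empty string.
val withoutNullsAndEmpty = data.where(col("name").isNotNull && col("name") =!= "")

withoutNullsAndEmpty.show()
```

Keeping the isNotNull check explicit makes the intent readable, even though a comparison against a null value would not pass the filter on its own.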

mputha · Answer

For the first question: you are filtering out nulls, and hence the count is zero.

For the second question, the replacement: use it like below:

val options = Map("path" -> "...\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)

scala> dfNull.show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        a|
| 516353916|1076364848|      0|       3|         1| 3597645735|        b|
| 528218683|1151525474|      0|       1|         0| 3433080956|        c|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|        d|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "1").otherwise("0"))
  .drop($"friend_id")
  .withColumnRenamed("friend_idTmp", "friend_id")
  .show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|        1|
|  78065188| 498404626|      0|       0|         0| 2904922087|        1|
| 282487230|2520855981|      0|      28|         0| 3749735525|        1|
| 335269852|1641491432|      0|       2|         0| 1490350911|        1|
| 437050836|1238456614|      0|       2|         0|  991277599|        1|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        0|
| 516353916|1076364848|      0|       3|         1| 3597645735|        0|
| 528218683|1151525474|      0|       1|         0| 3433080956|        0|
| 531967718|3632072502|      0|       1|         0| 3863085861|        1|
| 627948360|2823119321|      0|       0|         0| 4092665803|        1|
| 811791433|3513954032|      0|       2|         0|  415464198|        1|
| 830686203|  99027353|      0|       0|         0| 3549822604|        1|
|1008893291|1115453150|      0|       2|         0| 2245155244|        1|
|1239364869|2824096896|      0|       2|         1| 2579294650|        0|
|1287950172|1076364848|      0|       0|         0| 3597645735|        1|
|1345896548|2658555390|      0|       1|         0| 2025118823|        1|
|1354205322|2564682277|      0|       3|         0| 2563033185|        1|
|1408344828|1255629030|      0|      -1|         1|  804901063|        1|
|1452633375|1334001859|      0|       4|         0| 1488588320|        1|
|1625052108|3297535757|      0|       3|         0| 1972598895|        1|
+----------+----------+-------+--------+----------+-----------+---------+

Robin Wang · Answer

Following the hint from Michael Kopaniov, the below works:

df.where(df("id").isNotNull).show

Steven Li · Answer

I used the following code to solve my question, and it works. But as we all know, I went a country mile to get there. So, is there a shortcut for that? Thanks

def filter_null(field: Any): Int = field match {
  case null => 0
  case _    => 1
}

val test = train_event_join.join(
  user_friends_pair,
  train_event_join("user_id") === user_friends_pair("user_id") &&
    train_event_join("event_owner") === user_friends_pair("friend_id"),
  "left"
).select(
  train_event_join("user_id"),
  train_event_join("event_id"),
  train_event_join("invited"),
  train_event_join("day_diff"),
  train_event_join("interested"),
  train_event_join("event_owner"),
  user_friends_pair("friend_id")
).rdd.map { line =>
  (
    line(0).toString.toLong,
    line(1).toString.toLong,
    line(2).toString.toLong,
    line(3).toString.toLong,
    line(4).toString.toLong,
    line(5).toString.toLong,
    filter_null(line(6))
  )
}.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")
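One possible shortcut is to stay in the DataFrame API and derive the flag with when and isNotNull, which avoids the round trip through the RDD and the manual toString.toLong conversions. The following is a sketch under assumptions, not a drop-in replacement: the small joined DataFrame is a hypothetical stand-in for the result of the left join above, with only three of its columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session just for this sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-flag").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the left-join result: friend_id is null when no match was found.
val joined = Seq(
  (4236494L, 937597069L, Option.empty[Long]),
  (447244169L, 1579858878L, Option(1579858878L))
).toDF("user_id", "event_owner", "friend_id")

// 1 when the join found a friend row, 0 when friend_id is null.
val flagged = joined
  .withColumn("creator_is_friend", when(col("friend_id").isNotNull, 1).otherwise(0))
  .drop("friend_id")

flagged.show()
```

Because the other columns are never converted by hand, a null in a nullable column such as day_diff passes through unchanged instead of failing in toString.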