Find 3 closest dates in Spark dataframe per some criteria using pyspark
I have the following Spark dataframes:-
df1
id dia_date
1  2/12/17
1  4/25/16
2  12/8/17
2  6/12/11
df2
id obs_date  obs_value
1  2/16/17   4
1  2/20/17   2
1  2/9/17    4
1  12/12/18  5
1  4/18/16   1
1  4/18/16   6
1  4/30/16   7
1  5/25/16   9
2  12/12/17  10
2  12/6/17   11
2  12/14/17  4
2  6/11/11   5
2  6/11/11   6
The dataframe I would like to end up with is as follows:-
1) Find the three closest dates by comparing each date in df1 against the dates in df2.
2) If there are fewer than three closest dates, insert null for the missing ones.
3) The closest dates should be matched within the same id only. For example, for a dia_date row with id '1', we must only look at the obs_date values in df2 whose id is '1'.
An example of the resulting dataframe:-
id dia_date obs_date1 obs_val1 obs_date2 obs_val2 obs_date3 obs_val3
1  2/12/17  2/9/17    4        2/16/17   4        2/20/17   2
1  4/25/16  4/18/16   1        4/18/16   6        4/30/16   7
2  12/8/17  12/6/17   11       12/12/17  10       12/14/17  4
2  6/12/11  6/11/11   5        6/11/11   6        null      null
I want to do this in pyspark. I have already tried a few approaches, but I am finding it really difficult as I am just getting started with pyspark.
Here is an answer in Scala, as the problem is not really specific to pyspark; you can convert it.
I could not reproduce your exact final output, but an alternative is possible.
// Assuming we could also optimize this further, but not doing so here.
// Assuming distinct values to compare against; if not, some further logic is required.
// A bug found in ranking, it looks like - or is it me??? Worked around this and dropped that logic.
// Pivoting does not help here, and grouping by in SQL to get specific column names is not
// elegant, so the more Scala-like approach is used.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
import java.time._

// Convert a yyyy-MM-dd string to its epoch-day number so date distances can be compared as integers.
def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))

// Our input.
val df0 = Seq(
  ("1", "2018-09-05"),
  ("1", "2018-09-14"),
  ("2", "2018-12-23"),
  ("5", "2015-12-20"),
  ("6", "2018-12-23")
).toDF("id", "dia_dt")

val df1 = Seq(
  ("1", "2018-09-06", 5),
  ("1", "2018-09-07", 6),
  ("6", "2023-09-07", 7),
  ("2", "2018-12-23", 4),
  ("2", "2018-12-24", 5),
  ("2", "2018-10-23", 5),
  ("1", "2017-09-06", 5),
  ("1", "2017-09-07", 6),
  ("5", "2015-12-20", 5),
  ("5", "2015-12-21", 6),
  ("5", "2015-12-19", 5),
  ("5", "2015-12-18", 7),
  ("5", "2015-12-22", 5),
  ("5", "2015-12-23", 6),
  ("5", "2015-12-17", 6),
  ("5", "2015-12-26", 60)
).toDF("id", "obs_dt", "obs_val")

// Absolute distance in days between the diagnosis date and the observation date.
val myExpression = "abs(dia_epoch - obs_epoch)"

// Hard to know how to restrict further at this point.
val df2 = df1.withColumn("obs_epoch", toEpochDayUdf($"obs_dt"))
val df3 = df2.join(df0, Seq("id"), "inner")
  .withColumn("dia_epoch", toEpochDayUdf($"dia_dt"))
  .withColumn("abs_diff", expr(myExpression))

// Rank observations per (id, dia_epoch) by how close they are in days.
@transient val w1 = Window.partitionBy("id", "dia_epoch").orderBy(asc("abs_diff"))
val df4 = df3.select($"*", rank().over(w1).alias("rank")) // This is required.

// Final results as a collect_list. Distinct column names are not so easy because pivot
// cannot be used here - may be a limitation of knowledge on my side.
df4.orderBy("id", "dia_dt")
  .filter($"rank" <= 3)
  .groupBy($"id", $"dia_dt")
  .agg(collect_list(struct($"obs_dt", $"obs_val")).as("observations"))
  .show(false)
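As a side note not in the original answer: assuming the to_date and datediff built-ins are available (Spark 2.x and later), the epoch-day UDF could likely be dropped and the day distance computed with native functions. A minimal sketch, reusing the df0/df1 inputs above:

// Sketch only: replace the UDF-based epoch-day columns with built-in date functions.
// to_date parses the yyyy-MM-dd strings; datediff returns the signed difference in days.
val df3Alt = df1
  .join(df0, Seq("id"), "inner")
  .withColumn("abs_diff", abs(datediff(to_date($"obs_dt"), to_date($"dia_dt"))))

The ranking and aggregation steps would then stay the same as above.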
This returns:
+---+----------+---------------------------------------------------+
|id |dia_dt    |observations                                       |
+---+----------+---------------------------------------------------+
|1  |2018-09-05|[[2017-09-07, 6], [2018-09-06, 5], [2018-09-07, 6]]|
|1  |2018-09-14|[[2017-09-07, 6], [2018-09-06, 5], [2018-09-07, 6]]|
|2  |2018-12-23|[[2018-10-23, 5], [2018-12-23, 4], [2018-12-24, 5]]|
|5  |2015-12-20|[[2015-12-19, 5], [2015-12-20, 5], [2015-12-21, 6]]|
|6  |2018-12-23|[[2023-09-07, 7]]                                  |
+---+----------+---------------------------------------------------+
You can take it further from here; the heavy lifting is done.
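For instance, here is a hedged sketch (my addition, output not verified) of one way to flatten the collected observations into the fixed obs_date1/obs_val1 ... obs_val3 columns asked for in the question. Putting abs_diff first in the struct lets sort_array order each list closest-first, and getItem returns null when fewer than three observations exist, which also covers requirement 2):

// Sketch: widen the per-(id, dia_dt) observation lists into fixed columns.
val wide = df4
  .filter($"rank" <= 3)
  .groupBy($"id", $"dia_dt")
  // abs_diff is the first struct field, so sort_array puts the closest observation first.
  .agg(sort_array(collect_list(struct($"abs_diff", $"obs_dt", $"obs_val"))).as("obs"))
  .select(
    $"id", $"dia_dt",
    $"obs".getItem(0).getField("obs_dt").as("obs_date1"),
    $"obs".getItem(0).getField("obs_val").as("obs_val1"),
    $"obs".getItem(1).getField("obs_dt").as("obs_date2"),
    $"obs".getItem(1).getField("obs_val").as("obs_val2"),
    $"obs".getItem(2).getField("obs_dt").as("obs_date3"),
    $"obs".getItem(2).getField("obs_val").as("obs_val3")
  )
wide.show(false)

Note that rank keeps ties, so a group can still contain more than three rows; whether to keep or break those ties is up to you.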