Apply changes consecutively in a Spark DataFrame
I have a dataframe holding an initial state, init. I have a second dataframe with the same schema in which each row carries an update to one field of init and null in the other fields. How can I apply the changes consecutively to rebuild each record? To make it clearer, here is an example:
# initial state
listOfTuples = [("1", "Status_0", "2019", "value_col_4", 0)]
init = spark.createDataFrame(listOfTuples, ["id", "status", "year", "col_4", "ord"])

>>> init.show()
+---+--------+----+-----------+---+
| id|  status|year|      col_4|ord|
+---+--------+----+-----------+---+
|  1|Status_0|2019|value_col_4|  0|
+---+--------+----+-----------+---+

# dataframe with changes
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField('id', StringType(), True),
                     StructField('status', StringType(), True),
                     StructField('year', StringType(), True),
                     StructField('col_4', StringType(), True),
                     StructField('ord', IntegerType(), True)])

listOfTuples = [("1", "Status_A", None, None, 1),
                ("1", "Status_B", None, None, 2),
                ("1", None, None, "new_val", 3),
                ("1", "Status_C", None, None, 4)]

changes = spark.createDataFrame(listOfTuples, schema)

>>> changes.show()
+---+--------+----+-------+---+
| id|  status|year|  col_4|ord|
+---+--------+----+-------+---+
|  1|Status_A|null|   null|  1|
|  1|Status_B|null|   null|  2|
|  1|    null|null|new_val|  3|
|  1|Status_C|null|   null|  4|
+---+--------+----+-------+---+
I want the changes applied consecutively to the final dataframe, in the order given by the ord column, with the values in the init dataframe as the baseline. So I want my final dataframe to look like this:
>>> final.show()
+---+--------+----+-----------+
| id|  status|year|      col_4|
+---+--------+----+-----------+
|  1|Status_0|2019|value_col_4|
|  1|Status_A|2019|value_col_4|
|  1|Status_B|2019|value_col_4|
|  1|Status_B|2019|    new_val|
|  1|Status_C|2019|    new_val|
+---+--------+----+-----------+
I was thinking of unioning the two dataframes, ordering by the ord column, and then somehow propagating the changes downwards. Does anyone know how to do this?
This is Scala code, but I hope it helps. At the end you can drop or rename the columns.
The solution is to first union the two dataframes, and then, over a window partitioned by id and ordered by ord, take the last non-null value of each column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
import org.apache.spark.sql.functions._

scala> initial.show
+---+--------+----+-----------+---+
| id|  status|year|      col_4|ord|
+---+--------+----+-----------+---+
|  1|Status_0|2019|value_col_4|  0|
+---+--------+----+-----------+---+

scala> changes.show
+---+--------+----+-------+---+
| id|  status|year|  col_4|ord|
+---+--------+----+-------+---+
|  1|Status_A|null|   null|  1|
|  1|Status_B|null|   null|  2|
|  1|    null|null|new_val|  3|
|  1|Status_C|null|   null|  4|
+---+--------+----+-------+---+

scala> val inter = initial.union(changes)
inter: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, status: string ... 3 more fields]

scala> inter.show
+---+--------+----+-----------+---+
| id|  status|year|      col_4|ord|
+---+--------+----+-----------+---+
|  1|Status_0|2019|value_col_4|  0|
|  1|Status_A|null|       null|  1|
|  1|Status_B|null|       null|  2|
|  1|    null|null|    new_val|  3|
|  1|Status_C|null|       null|  4|
+---+--------+----+-----------+---+

scala> val overColumns = Window.partitionBy("id").orderBy("ord").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@70f4b378

scala> val output = inter.withColumn("newstatus", last("status", true).over(overColumns)).withColumn("newyear", last("year", true).over(overColumns)).withColumn("newcol_4", last("col_4", true).over(overColumns))
output: org.apache.spark.sql.DataFrame = [id: string, status: string ... 6 more fields]

scala> output.show(false)
+---+--------+----+-----------+---+---------+-------+-----------+
|id |status  |year|col_4      |ord|newstatus|newyear|newcol_4   |
+---+--------+----+-----------+---+---------+-------+-----------+
|1  |Status_0|2019|value_col_4|0  |Status_0 |2019   |value_col_4|
|1  |Status_A|null|null       |1  |Status_A |2019   |value_col_4|
|1  |Status_B|null|null       |2  |Status_B |2019   |value_col_4|
|1  |null    |null|new_val    |3  |Status_B |2019   |new_val    |
|1  |Status_C|null|null       |4  |Status_C |2019   |new_val    |
+---+--------+----+-----------+---+---------+-------+-----------+
Using @C.S.Reddy Gadipally's code in Python:
import pyspark.sql.functions as func
from pyspark.sql.window import Window

# union the baseline with the changes, then forward-fill each column per id in ord order
df = init.union(changes)
w = Window.partitionBy(df['id']).orderBy(df['ord'])

for c in df.columns[1:]:
    df = df.withColumn(c, func.last(c, True).over(w))
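After the loop every column has been forward-filled, so all that remains to reproduce the final dataframe shown in the question is to drop the helper ord column. A minimal sketch, assuming the df produced by the loop above (the name final simply mirrors the example):

# Sort by the change order and drop the helper column; each remaining row
# is the record as it looked after that change was applied.
final = df.orderBy("ord").drop("ord")
final.show()
# +---+--------+----+-----------+
# | id|  status|year|      col_4|
# +---+--------+----+-----------+
# |  1|Status_0|2019|value_col_4|
# |  1|Status_A|2019|value_col_4|
# |  1|Status_B|2019|value_col_4|
# |  1|Status_B|2019|    new_val|
# |  1|Status_C|2019|    new_val|
# +---+--------+----+-----------+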