How to group by and merge rows of a Spark DataFrame group
Suppose I have a table like this:
| A  | B  | C        | D  | E   | F  |
|----|----|----------|----|-----|----|
| x1 | 5  | 20200115 | 15 | 4.5 | 1  |
| x1 | 10 | 20200825 | 15 | 5.6 | 19 |
| x2 | 10 | 20200115 | 15 | 4.1 | 1  |
| x2 | 10 | 20200430 | 15 | 9.1 | 1  |
I am looking for an expected output like this:
| A  | B  | C        | D  | E   | F  |
|----|----|----------|----|-----|----|
| x1 | 15 | 20200825 | 15 | 5.6 | 19 |
| x2 | 10 | 20200115 | 15 | 4.1 | 1  |
| x2 | 10 | 20200430 | 15 | 9.1 | 1  |
Basically, if the sum of column B for a group in column A equals the value of column D, I want to merge the group into a single record: keep the row with the latest C, with B replaced by the group's sum.
Since the condition does not hold for group x2 (the sum of column B is 20, which is greater than D = 15), I want to retain both records in the target.
Assumption: in my data, column D will be the same within a given group (15 in this example).
I have looked at many grouping and windowing (partitioning) examples, but this seems different to me, so I have not been able to narrow it down.
Could I pipe the grouped data to a UDF and do something with it there?
PS: I am building this in PySpark, so it would be great if your example could be in PySpark too.
Try this (Scala):
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

df.show(false)
df.printSchema()
/**
  * +---+---+--------+---+---+---+
  * |A  |B  |C       |D  |E  |F  |
  * +---+---+--------+---+---+---+
  * |x1 |5  |20200115|15 |4.5|1  |
  * |x1 |10 |20200825|15 |5.6|19 |
  * |x2 |10 |20200115|15 |4.1|1  |
  * |x2 |10 |20200430|15 |9.1|1  |
  * +---+---+--------+---+---+---+
  *
  * root
  *  |-- A: string (nullable = true)
  *  |-- B: integer (nullable = true)
  *  |-- C: integer (nullable = true)
  *  |-- D: integer (nullable = true)
  *  |-- E: double (nullable = true)
  *  |-- F: integer (nullable = true)
  */

val w = Window.partitionBy("A")
df.withColumn("sum", sum("B").over(w))         // sum of B per group
  .withColumn("latestC", max("C").over(w))     // latest C per group
  // Retain every row of a group unless sum(B) == D; in that case
  // retain only the row carrying the latest C.
  .withColumn("retain",
    when($"sum" === $"D",
      when($"latestC" === $"C", true).otherwise(false)
    ).otherwise(true)
  )
  .where($"retain" === true)
  // For the surviving row of a merged group, replace B with the group sum.
  .withColumn("B",
    when($"sum" === $"D",
      when($"latestC" === $"C", $"sum").otherwise($"B")
    ).otherwise($"B"))
  .show(false)
/**
  * +---+---+--------+---+---+---+---+--------+------+
  * |A  |B  |C       |D  |E  |F  |sum|latestC |retain|
  * +---+---+--------+---+---+---+---+--------+------+
  * |x1 |15 |20200825|15 |5.6|19 |15 |20200825|true  |
  * |x2 |10 |20200115|15 |4.1|1  |20 |20200430|true  |
  * |x2 |10 |20200430|15 |9.1|1  |20 |20200430|true  |
  * +---+---+--------+---+---+---+---+--------+------+
  */
```
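Since the question asks for PySpark, a direct translation of the same window-based approach might look like the sketch below (assuming the same `df` as above; because `C` holds yyyymmdd values, its max per group is the latest date):

```python
from pyspark.sql import functions as F, Window as W

w = W.partitionBy("A")
result = (
    df.withColumn("sum", F.sum("B").over(w))       # sum of B per group
    .withColumn("latestC", F.max("C").over(w))     # latest C per group
    # Keep every row of groups where sum(B) != D; for matching groups,
    # keep only the row holding the latest C.
    .where((F.col("sum") != F.col("D")) | (F.col("latestC") == F.col("C")))
    # For the surviving row of a merged group, replace B with the group sum.
    .withColumn(
        "B", F.when(F.col("sum") == F.col("D"), F.col("sum")).otherwise(F.col("B"))
    )
    .drop("sum", "latestC")
)
result.show()
```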
In PySpark, I would do it like this:
```python
from pyspark.sql import functions as F, Window as W

b = ["A", "B", "C", "D", "E", "F"]
a = [
    ("x1", 5, "20200115", 15, 4.5, 1),
    ("x1", 10, "20200825", 15, 5.6, 19),
    ("x2", 10, "20200115", 15, 4.1, 1),
    ("x2", 10, "20200430", 15, 9.1, 1),
]

df = spark.createDataFrame(a, b)

# Sum of B per group in A.
df = df.withColumn("B_sum", F.sum("B").over(W.partitionBy("A")))

# Split: groups that satisfy the merge condition vs. groups passed through as-is.
process_df = df.where("D >= B_sum")
no_process_df = df.where("D < B_sum").drop("B_sum")

# For the groups to merge, keep only the row with the latest C
# and carry the group sum in B.
process_df = (
    process_df.withColumn(
        "rng", F.row_number().over(W.partitionBy("A").orderBy(F.col("C").desc()))
    )
    .where("rng = 1")
    .select("A", F.col("B_sum").alias("B"), "C", "D", "E", "F")
)

final_output = process_df.unionByName(no_process_df)
final_output.show()
# +---+---+--------+---+---+---+
# |  A|  B|       C|  D|  E|  F|
# +---+---+--------+---+---+---+
# | x1| 15|20200825| 15|5.6| 19|
# | x2| 10|20200115| 15|4.1|  1|
# | x2| 10|20200430| 15|9.1|  1|
# +---+---+--------+---+---+---+
```
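As for the asker's question about piping grouped data through a UDF: yes, since Spark 3.0 you can run each group through a grouped-map pandas UDF with `applyInPandas`. A minimal sketch, assuming the `df` defined above (the `merge_group` helper is a hypothetical name, and the yyyymmdd strings in `C` sort chronologically):

```python
import pandas as pd

# The output schema must be declared up front for applyInPandas.
schema = "A string, B long, C string, D long, E double, F long"

def merge_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # If sum(B) equals D for this group, collapse the group to the row
    # with the latest C and carry the group sum in B; otherwise return
    # the group unchanged.
    if pdf["B"].sum() == pdf["D"].iloc[0]:
        out = pdf.sort_values("C", ascending=False).head(1).copy()
        out["B"] = pdf["B"].sum()
        return out
    return pdf

final_output = df.groupBy("A").applyInPandas(merge_group, schema=schema)
final_output.show()
```

This is usually slower than the pure window approach above because each group is serialized to pandas, but it keeps arbitrary per-group logic in plain Python.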