
How to group by and merge these rows of spark dataframe's group

Suppose I have a table like this:

A  | B  |    C     | D  |  E  | F
x1 | 5  | 20200115 | 15 | 4.5 | 1
x1 | 10 | 20200825 | 15 | 5.6 | 19
x2 | 10 | 20200115 | 15 | 4.1 | 1
x2 | 10 | 20200430 | 15 | 9.1 | 1

I am looking to merge these rows on column A and produce a dataframe like this:

A  | B  |    C     | D  |  E  | F
x1 | 15 | 20200825 | 15 | 5.6 | 19
x2 | 10 | 20200115 | 15 | 4.1 | 1
x2 | 10 | 20200430 | 15 | 9.1 | 1

Basically, if the sum of column B for a group of column A equals the value of column D, then:

  • the new value of column B will be the sum of column B
  • columns C, E and F will be taken from the row with the latest value of column C (which is a date in YYYYMMDD format)
  • since for the x2 group the condition above does not hold (the sum of column B is 20, which is greater than column D's 15), I want to keep both records in the target

Assumption: in my data, column D will be the same for a given group (15 in this example).

I have looked at plenty of group-by and windowing (partitioning) examples, but this seems different to me, so I have not been able to narrow it down.

Can I pipe the grouped data into a UDF and handle it there (see the rough sketch below for what I mean)?

PS: I'm building this in pyspark, so it would be great if your example could be in pyspark too.
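To make the UDF idea concrete, here is roughly what I have in mind using a grouped-map pandas UDF (applyInPandas); this is only a sketch of the idea, not tested code:

import pandas as pd

def merge_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # D is constant within a group; if the group's B values sum to D,
    # collapse the group to the row with the latest C and set B to that sum.
    if pdf["B"].sum() == pdf["D"].iloc[0]:
        latest = pdf.sort_values("C").iloc[[-1]].copy()
        latest["B"] = pdf["B"].sum()
        return latest
    # otherwise keep every row of the group unchanged
    return pdf

# df is the input dataframe shown above
result = df.groupBy("A").applyInPandas(merge_group, schema=df.schema)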


Try this -

Use sum and max together with window functions.

// assumes an active SparkSession named `spark` and the input DataFrame `df` shown below
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

df.show(false)
df.printSchema()
/**
  * +---+---+--------+---+---+---+
  * |A  |B  |C       |D  |E  |F  |
  * +---+---+--------+---+---+---+
  * |x1 |5  |20200115|15 |4.5|1  |
  * |x1 |10 |20200825|15 |5.6|19 |
  * |x2 |10 |20200115|15 |4.1|1  |
  * |x2 |10 |20200430|15 |9.1|1  |
  * +---+---+--------+---+---+---+
  *
  * root
  * |-- A: string (nullable = true)
  * |-- B: integer (nullable = true)
  * |-- C: integer (nullable = true)
  * |-- D: integer (nullable = true)
  * |-- E: double (nullable = true)
  * |-- F: integer (nullable = true)
  */

val w = Window.partitionBy("A")
df.withColumn("sum", sum("B").over(w))        // group total of B
  .withColumn("latestC", max("C").over(w))    // latest date in the group
  // keep the row unless its group merges (sum == D) and this is not the latest row
  .withColumn("retain",
    when($"sum" === $"D", when($"latestC" === $"C", true).otherwise(false))
      .otherwise(true))
  .where($"retain" === true)
  // on the surviving row of a merged group, replace B with the group total
  .withColumn("B", when($"sum" === $"D", when($"latestC" === $"C", $"sum").otherwise($"B"))
    .otherwise($"B"))
  .show(false)

/**
  * +---+---+--------+---+---+---+---+--------+------+
  * |A  |B  |C       |D  |E  |F  |sum|latestC |retain|
  * +---+---+--------+---+---+---+---+--------+------+
  * |x1 |15 |20200825|15 |5.6|19 |15 |20200825|true  |
  * |x2 |10 |20200115|15 |4.1|1  |20 |20200430|true  |
  * |x2 |10 |20200430|15 |9.1|1  |20 |20200430|true  |
  * +---+---+--------+---+---+---+---+--------+------+
  */
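Since the question asks for pyspark, the same window-based logic can be rendered roughly as follows (my own untested translation of the snippet above, assuming a pyspark DataFrame df with the same columns):

from pyspark.sql import functions as F, Window as W

w = W.partitionBy("A")

result = (
    df.withColumn("sum", F.sum("B").over(w))
      .withColumn("latestC", F.max("C").over(w))
      # keep the row unless its group merges (sum == D) and it is not the latest row
      .where((F.col("sum") != F.col("D")) | (F.col("latestC") == F.col("C")))
      # on the surviving row of a merged group, replace B with the group total
      .withColumn(
          "B",
          F.when(
              (F.col("sum") == F.col("D")) & (F.col("latestC") == F.col("C")),
              F.col("sum"),
          ).otherwise(F.col("B")),
      )
      .drop("sum", "latestC")
)
result.show()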

In pyspark, I would do it like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W

spark = SparkSession.builder.getOrCreate()

# column names and sample data from the question
b = ["A", "B", "C", "D", "E", "F"]
a = [
    ("x1", 5, "20200115", 15, 4.5, 1),
    ("x1", 10, "20200825", 15, 5.6, 19),
    ("x2", 10, "20200115", 15, 4.1, 1),
    ("x2", 10, "20200430", 15, 9.1, 1),
]

df = spark.createDataFrame(a, b)

# total of B per group in A
df = df.withColumn("B_sum", F.sum("B").over(W.partitionBy("A")))

# groups to merge vs. groups to keep as-is
process_df = df.where("D >= B_sum")
no_process_df = df.where("D < B_sum").drop("B_sum")

# for the merged groups, keep only the row with the latest C and replace B with the group sum
process_df = (
    process_df.withColumn(
        "rng", F.row_number().over(W.partitionBy("A").orderBy(F.col("C").desc()))
    )
    .where("rng=1")
    .select("A", F.col("B_sum").alias("B"), "C", "D", "E", "F")
)

final_output = process_df.unionByName(no_process_df)
final_output.show()

+---+---+--------+---+---+---+
|  A|  B|       C|  D|  E|  F|
+---+---+--------+---+---+---+
| x1| 15|20200825| 15|5.6| 19|
| x2| 10|20200115| 15|4.1|  1|
| x2| 10|20200430| 15|9.1|  1|
+---+---+--------+---+---+---+
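Note that Spark does not guarantee row order after a union; if you want the output sorted as shown above for display, add an explicit sort:

final_output.orderBy("A", "C").show()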