Aggregate data from multiple rows to one and then nest the data
I am fairly new to Scala and Spark programming.
I have a use case where I need to group the data on certain columns and get a count of another column (using pivot), and then finally I need to create a nested dataframe out of the flat one.
A major challenge I am facing is that I also need to retain some other columns (not the ones I am grouping or pivoting on).
I have not been able to find an efficient way to do it.
Input:
ID  ID2  ID3  country  items_purchased  quantity
1   1    1    UK       apple            1
1   1    1    USA      mango            1
1   2    3    China    banana           3
2   1    1    UK       mango            1
Now say, I want to pivot on 'country' and group by ('ID', 'ID2', 'ID3'), but I also want to maintain the other columns as lists.
For example,
Output 1:
ID  ID2  ID3  UK  USA  China  items_purchased  quantity
1   1    1    1   1    0      [apple, mango]   [1, 1]
1   2    3    0   0    1      [banana]         [3]
2   1    1    1   0    0      [mango]          [1]
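(For concreteness, the pivot-and-count part alone could be sketched like this, assuming the input is loaded as a DataFrame `df`; the list columns would need a separate aggregation:)

val pivoted = df
  .groupBy("ID", "ID2", "ID3")
  .pivot("country")
  .count()
  .na.fill(0) // pivot leaves nulls where a group has no rows for a country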
Once I achieve this, I want to nest it into a nested structure, such that my schema looks like:
{
  "ID": 1,
  "ID2": 1,
  "ID3": 1,
  "countries": {
    "UK": 1,
    "USA": 1,
    "China": 0
  },
  "items_purchased": ["apple", "mango"],
  "quantity": [1, 1]
}
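(One hedged sketch of getting that nesting, assuming a hypothetical `output1` DataFrame shaped like Output 1 above: struct() folds the pivoted country columns into a single nested field.)

import org.apache.spark.sql.functions.{col, struct}

// `output1` is assumed; struct() produces a nested column whose fields keep the source column names
val nested = output1
  .withColumn("countries", struct(col("UK"), col("USA"), col("China")))
  .drop("UK", "USA", "China")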
I believe I can use a case class and then map every row of the dataframe onto it. However, I am not sure whether that is efficient, and I would love to know if there is a more optimized way to achieve this.
What I have in mind is something along these lines:
dataframe.map(row => myCaseClass(row.getAs[Long]("ID"),
  row.getAs[Long]("ID2"),
  row.getAs[Long]("ID3"),
  CountriesCaseClass(
    row.getAs[String]("UK")
  )
))
and so on...
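(For that mapping to compile, case classes along these lines would be needed; this is only a sketch, and the field types are assumptions based on the sample data and the desired schema:)

// hypothetical case classes matching the pivoted schema above
case class CountriesCaseClass(UK: Long, USA: Long, China: Long)
case class myCaseClass(ID: Long, ID2: Long, ID3: Long,
                       countries: CountriesCaseClass,
                       items_purchased: Seq[String],
                       quantity: Seq[Long])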
I think this should work for your case. The partition number is calculated based on the formula shown in the code comments below:
import org.apache.spark.sql.functions.{collect_list, count, col, lit, map}

val data = Seq(
  (1, 1, 1, "UK", "apple", 1),
  (1, 1, 1, "USA", "mango", 1),
  (1, 2, 3, "China", "banana", 3),
  (2, 1, 1, "UK", "mango", 1))

// e.g: partitions_num = 100GB / 500MB = 200, adjust it according to the size of your data
val partitions_num = 250

val df = data.toDF("ID", "ID2", "ID3", "country", "items_purchased", "quantity")
  .repartition(partitions_num, $"ID", $"ID2", $"ID3") // the partitioning should remain the same for all the operations
  .persist()

// get the countries; we will need them to fill the null values with 0 after pivoting, for the mapping and for the drop
val countries = df.select("country").distinct.collect.map{_.getString(0)}

// creates a sequence of key/value pairs which is the input for the map function
val countryMapping = countries.flatMap{ c => Seq(lit(c), col(c)) }

val pivotCountriesDF = df.select("ID", "ID2", "ID3", "country")
  .groupBy("ID", "ID2", "ID3")
  .pivot($"country")
  .count()
  .na.fill(0, countries)
  .withColumn("countries", map(countryMapping:_*)) // i.e. map("UK", col("UK"), "China", col("China")) -> {"UK":0, "China":1}
  .drop(countries:_*)

// pivotCountriesDF.rdd.getNumPartitions == 250; Spark will retain the partition number since we didn't change the partition key
// +---+---+---+-------------------------------+
// |ID |ID2|ID3|countries                      |
// +---+---+---+-------------------------------+
// |1  |2  |3  |[China -> 1, USA -> 0, UK -> 0]|
// |1  |1  |1  |[China -> 0, USA -> 1, UK -> 1]|
// |2  |1  |1  |[China -> 0, USA -> 0, UK -> 1]|
// +---+---+---+-------------------------------+

val listDF = df.select("ID", "ID2", "ID3", "items_purchased", "quantity")
  .groupBy("ID", "ID2", "ID3")
  .agg(
    collect_list("items_purchased").as("items_purchased"),
    collect_list("quantity").as("quantity"))

// +---+---+---+---------------+--------+
// |ID |ID2|ID3|items_purchased|quantity|
// +---+---+---+---------------+--------+
// |1  |2  |3  |[banana]       |[3]     |
// |1  |1  |1  |[apple, mango] |[1, 1]  |
// |2  |1  |1  |[mango]        |[1]     |
// +---+---+---+---------------+--------+

// listDF.rdd.getNumPartitions == 250. To validate this, try changing the partition key with .groupBy("ID", "ID2");
// it will fall back to 200, the default value of the spark.sql.shuffle.partitions setting.

val joinedDF = pivotCountriesDF.join(listDF, Seq("ID", "ID2", "ID3"))

// joinedDF.rdd.getNumPartitions == 250; the same partitions will be used for the join as well.
// +---+---+---+-------------------------------+---------------+--------+
// |ID |ID2|ID3|countries                      |items_purchased|quantity|
// +---+---+---+-------------------------------+---------------+--------+
// |1  |2  |3  |[China -> 1, USA -> 0, UK -> 0]|[banana]       |[3]     |
// |1  |1  |1  |[China -> 0, USA -> 1, UK -> 1]|[apple, mango] |[1, 1]  |
// |2  |1  |1  |[China -> 0, USA -> 0, UK -> 1]|[mango]        |[1]     |
// +---+---+---+-------------------------------+---------------+--------+

joinedDF.toJSON.show(false)

// +---------------------------------------------------------------------------------------------------------------------+
// |value                                                                                                                 |
// +---------------------------------------------------------------------------------------------------------------------+
// |{"ID":1,"ID2":2,"ID3":3,"countries":{"China":1,"USA":0,"UK":0},"items_purchased":["banana"],"quantity":[3]}           |
// |{"ID":1,"ID2":1,"ID3":1,"countries":{"China":0,"USA":1,"UK":1},"items_purchased":["apple","mango"],"quantity":[1,1]}  |
// |{"ID":2,"ID2":1,"ID3":1,"countries":{"China":0,"USA":0,"UK":1},"items_purchased":["mango"],"quantity":[1]}            |
// +---------------------------------------------------------------------------------------------------------------------+
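If you want to persist the nested records rather than just inspect them, the same DataFrame can be written out directly; a minimal sketch (the output path here is hypothetical):

// writes one JSON record per line, matching the schema shown above
joinedDF.write.mode("overwrite").json("/tmp/nested_output")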
Good luck, and let me know if you need any clarification.
I can't see any problem with it; it's a good solution. Anyway, I would create a Dataset:
val ds: Dataset[myCaseClass] = dataframe.map(row => myCaseClass(row.getAs[Long]("ID"), ...
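Alternatively, with case classes like those sketched in the question, the conversion can be done without a hand-written map; a sketch, assuming the DataFrame column names and types line up with the case-class fields:

import org.apache.spark.sql.Dataset
import spark.implicits._ // assuming `spark` is the active SparkSession; brings the Encoder[myCaseClass] into scope

// .as[] checks names and types at analysis time instead of relying on getAs calls
val ds: Dataset[myCaseClass] = dataframe.as[myCaseClass]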
EDIT
You asked for something like this:
input
  // the flag columns have to be added before the groupBy; withColumn is not available on grouped data
  .withColumn("UK", col("country").contains("UK"))
  .withColumn("China", col("country").contains("China"))
  .withColumn("USA", col("country").contains("USA"))
  .groupBy("ID", "ID2", "ID3")
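A runnable version of that idea could look like this; a sketch, where `input` is assumed to be the original flat DataFrame, and the flags are aggregated with sum so they behave like the pivot counts:

import org.apache.spark.sql.functions.{col, sum, when}

// one 0/1 flag column per country, then summed per group
val flagged = input
  .withColumn("UK", when(col("country") === "UK", 1).otherwise(0))
  .withColumn("USA", when(col("country") === "USA", 1).otherwise(0))
  .withColumn("China", when(col("country") === "China", 1).otherwise(0))
  .groupBy("ID", "ID2", "ID3")
  .agg(sum("UK").as("UK"), sum("USA").as("USA"), sum("China").as("China"))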