
Aggregate data from multiple rows to one and then nest the data

I am fairly new to Scala and Spark programming.

I have a use case where I need to group the data by certain columns and get a count on one column (using pivot), and then finally I need to create a nested dataframe out of my flat dataframe.

A major challenge I am facing is that I also need to retain some other columns (apart from the ones I am grouping and pivoting on).

I have not been able to find an efficient way to do this.

Input:

ID  ID2  ID3  country  items_purchased  quantity
1   1    1    UK       apple            1
1   1    1    USA      mango            1
1   2    3    China    banana           3
2   1    1    UK       mango            1

Now, say I want to pivot on 'country' and group by ('ID', 'ID2', 'ID3'), but I also want to maintain the other columns as lists.

For example,

Output 1:

ID  ID2  ID3  UK  USA  China  items_purchased  quantity
1   1    1    1   1    0      [apple, mango]   [1, 1]
1   2    3    0   0    1      [banana]         [3]
2   1    1    1   0    0      [mango]          [1]

Once I have done that, I want to nest it into a nested structure, so that my schema looks like:

{
  "ID"  : 1,
  "ID2" : 1,
  "ID3" : 1,
  "countries" : {
    "UK" : 1,
    "USA" : 1,
    "China" : 0
  },
  "items_purchased" : ["apple", "mango"],
  "quantity" : [1, 1]
}

I believe I can use case classes and then map each row of the dataframe to them. However, I am not sure whether that is efficient, and I would love to know if there is a more optimized way to achieve this.

What I have in mind is something along these lines:

dataframe.map(row => myCaseClass(row.getAs[Long]("ID"),
  row.getAs[Long]("ID2"),
  row.getAs[Long]("ID3"),
  CountriesCaseClass(
    row.getAs[String]("UK")
  )
)

and so on...
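For that sketch to compile, the two case classes need definitions. Below is a rough guess at their shape; the field types are my assumptions rather than anything stated above (and if the pivoted country columns hold counts, the matching accessor would be row.getAs[Long]("UK") rather than getAs[String]):

// hypothetical definitions for the sketch above; names come from the sketch, types are assumed
case class CountriesCaseClass(
  UK: Long,
  USA: Long,
  China: Long
)

case class myCaseClass(
  ID: Long,
  ID2: Long,
  ID3: Long,
  countries: CountriesCaseClass,
  items_purchased: Seq[String],
  quantity: Seq[Long]
)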


I think this should work for your case. The number of partitions is calculated from the formula partitions_num = data_size / 500MB.

import org.apache.spark.sql.functions.{collect_list, count, col, lit, map}
import spark.implicits._ // for .toDF and the $-syntax; assumes an existing SparkSession named spark (as in spark-shell)

val data = Seq(
  (1, 1, 1, "UK",    "apple",  1),
  (1, 1, 1, "USA",   "mango",  1),
  (1, 2, 3, "China", "banana", 3),
  (2, 1, 1, "UK",    "mango",  1))

// e.g.: partitions_num = 100GB / 500MB = 200, adjust it according to the size of your data
val partitions_num = 250
val df = data.toDF("ID","ID2","ID3","country","items_purchased","quantity")
             .repartition(partitions_num, $"ID", $"ID2", $"ID3") // the partitioning should remain the same for all the operations
             .persist()

// get the countries; we will need them to fill the nulls with 0 after pivoting, for the mapping and for the drop
val countries = df.select("country").distinct.collect.map{_.getString(0)}

// creates a sequence of key/value pairs which is the input for the map function
val countryMapping = countries.flatMap{c => Seq(lit(c), col(c))}
val pivotCountriesDF = df.select("ID","ID2","ID3","country")
                          .groupBy("ID","ID2","ID3")
                          .pivot($"country")
                          .count()
                          .na.fill(0, countries)
                          .withColumn("countries", map(countryMapping:_*)) // i.e. map("UK", col("UK"), "China", col("China")) -> {"UK":0,"China":1}
                          .drop(countries:_*)

// pivotCountriesDF.rdd.getNumPartitions == 250, Spark will retain the partition number since we didn't change the partition key

// +---+---+---+-------------------------------+
// |ID |ID2|ID3|countries                      |
// +---+---+---+-------------------------------+
// |1  |2  |3  |[China -> 1, USA -> 0, UK -> 0]|
// |1  |1  |1  |[China -> 0, USA -> 1, UK -> 1]|
// |2  |1  |1  |[China -> 0, USA -> 0, UK -> 1]|
// +---+---+---+-------------------------------+

val listDF = df.select("ID","ID2","ID3","items_purchased","quantity")
                .groupBy("ID","ID2","ID3")
                .agg(
                  collect_list("items_purchased").as("items_purchased"),
                  collect_list("quantity").as("quantity"))

// +---+---+---+---------------+--------+
// |ID |ID2|ID3|items_purchased|quantity|
// +---+---+---+---------------+--------+
// |1  |2  |3  |[banana]       |[3]     |
// |1  |1  |1  |[apple, mango] |[1, 1]  |
// |2  |1  |1  |[mango]        |[1]     |
// +---+---+---+---------------+--------+


// listDF.rdd.getNumPartitions == 250. To validate this, try changing the partition key to .groupBy("ID","ID2"); it will fall back to the default of 200 from the spark.sql.shuffle.partitions setting

val joinedDF = pivotCountriesDF.join(listDF, Seq("ID","ID2","ID3"))

// joinedDF.rdd.getNumPartitions == 250, the same partitions will be used for the join as well.

// +---+---+---+-------------------------------+---------------+--------+
// |ID |ID2|ID3|countries                      |items_purchased|quantity|
// +---+---+---+-------------------------------+---------------+--------+
// |1  |2  |3  |[China -> 1, USA -> 0, UK -> 0]|[banana]       |[3]     |
// |1  |1  |1  |[China -> 0, USA -> 1, UK -> 1]|[apple, mango] |[1, 1]  |
// |2  |1  |1  |[China -> 0, USA -> 0, UK -> 1]|[mango]        |[1]     |
// +---+---+---+-------------------------------+---------------+--------+

joinedDF.toJSON.show(false)

// +--------------------------------------------------------------------------------------------------------------------+
// |value                                                                                                               |
// +--------------------------------------------------------------------------------------------------------------------+
// |{"ID":1,"ID2":2,"ID3":3,"countries":{"China":1,"USA":0,"UK":0},"items_purchased":["banana"],"quantity":[3]}         |
// |{"ID":1,"ID2":1,"ID3":1,"countries":{"China":0,"USA":1,"UK":1},"items_purchased":["apple","mango"],"quantity":[1,1]}|
// |{"ID":2,"ID2":1,"ID3":1,"countries":{"China":0,"USA":0,"UK":1},"items_purchased":["mango"],"quantity":[1]}          |
// +--------------------------------------------------------------------------------------------------------------------+
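
If the goal is to persist these nested records rather than just display them, something along these lines should do it (the output path below is only a placeholder):

// write each row out as one nested JSON document, matching the structure shown above
joinedDF.write.mode("overwrite").json("/path/to/output")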

Good luck, and let me know if you need any clarification.


I can't see any problem with it; this is a nice solution. Anyway, I would create a Dataset for your final dataframe. It is easier to work with.

val ds: Dataset[myCaseClass] = dataframe.map(row => myCaseClass(
  row.getAs[Long]("ID"),
  ...
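
If the columns of the final dataframe already line up with a case class field for field, the encoder-based as[...] conversion saves writing the mapping by hand. A rough sketch, assuming the joinedDF from the previous answer and a hypothetical Purchases case class (countries stays a plain Map[String, Long], since that is how the column comes out of the map(...) step):

import org.apache.spark.sql.Dataset

// hypothetical target type; fields mirror joinedDF's schema from the previous answer
case class Purchases(
  ID: Int,
  ID2: Int,
  ID3: Int,
  countries: Map[String, Long],   // pivot counts come back as Longs
  items_purchased: Seq[String],
  quantity: Seq[Int]
)

// requires spark.implicits._ in scope for the encoder
val purchasesDS: Dataset[Purchases] = joinedDF.as[Purchases]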

Edit:
You asked for something like this.

import org.apache.spark.sql.functions.{col, sum}

input
  .withColumn("UK", col("country").contains("UK").cast("int"))      // 1 when the row's country is UK, else 0
  .withColumn("USA", col("country").contains("USA").cast("int"))
  .withColumn("China", col("country").contains("China").cast("int"))
  .groupBy("ID", "ID2", "ID3")
  .agg(sum("UK").as("UK"), sum("USA").as("USA"), sum("China").as("China"))