Scala: Joining multiple Spark DataFrames based on conditions

Depending on the SC code, I need to join SRCTable with either RefTable-1 or RefTable-2.

Conditions:
If SC is "D", join SRCTable with RefTable-1 on KEY = KEY1 to get the Value.
Otherwise, if SC is "U", join SRCTable with RefTable-2 on KEY = KEY2.


Here is code you can test; it uses only the join function on DataFrames:

// toDF needs the session implicits: automatic in spark-shell, otherwise import them from your session
val SRCTable = Seq((66,"D","a"), (67,"U","b"), (70,"D","y"), (71,"U","q")).toDF("KEY","SC","FK")
val RefTable1 = Seq((66,"xyz1"),(67,"abc1"),(68,"fgr1"),(69,"yte1"),(70,"erx1"),(71,"ter1")).toDF("KEY1","Value")
val RefTable2 = Seq((66,"a","xyz2"), (67,"c","abc2"), (67,"b","fgr2"), (69,"g","yte2"), (70,"y","erx2"), (71,"q","ter2")).toDF("KEY2","KEY3","Value")

// SC = "D": look up Value in RefTable1 on KEY = KEY1
val join1 = SRCTable.where(SRCTable.col("SC").equalTo("D")).join(RefTable1, SRCTable.col("KEY") === RefTable1.col("KEY1")).select("KEY","SC","FK","Value")
// SC = "U": look up Value in RefTable2 on KEY = KEY2; FK = KEY3 picks a single row where KEY2 repeats (e.g. 67)
val join2 = SRCTable.where(SRCTable.col("SC").equalTo("U")).join(RefTable2, SRCTable.col("KEY") === RefTable2.col("KEY2") && SRCTable.col("FK") === RefTable2.col("KEY3") ).select("KEY","SC","FK","Value")

// Stack both results; unionAll keeps duplicates (on Spark 2.0+, union behaves the same)
join1.unionAll(join2).show
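
As an alternative to filtering twice and stacking the results, the same lookup could probably be done in one pass over SRCTable. The sketch below is not the method used above: it swaps in two left joins plus coalesce. It assumes Spark 2.0 or later (a SparkSession); the names r1, r2, Value1, Value2 and result are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

// Assumption: Spark 2.0+; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("conditional-join-sketch").getOrCreate()
import spark.implicits._

val SRCTable = Seq((66,"D","a"), (67,"U","b"), (70,"D","y"), (71,"U","q")).toDF("KEY","SC","FK")
val RefTable1 = Seq((66,"xyz1"),(67,"abc1"),(68,"fgr1"),(69,"yte1"),(70,"erx1"),(71,"ter1")).toDF("KEY1","Value")
val RefTable2 = Seq((66,"a","xyz2"), (67,"c","abc2"), (67,"b","fgr2"), (69,"g","yte2"), (70,"y","erx2"), (71,"q","ter2")).toDF("KEY2","KEY3","Value")

// Rename the two Value columns so they stay distinguishable after both joins (hypothetical names).
val r1 = RefTable1.withColumnRenamed("Value", "Value1")
val r2 = RefTable2.withColumnRenamed("Value", "Value2")

val result = SRCTable
  // The SC test is part of each join condition, so a row can match at most one reference table.
  .join(r1, $"SC" === "D" && $"KEY" === $"KEY1", "left")
  .join(r2, $"SC" === "U" && $"KEY" === $"KEY2" && $"FK" === $"KEY3", "left")
  // Take whichever lookup matched.
  .select($"KEY", $"SC", $"FK", coalesce($"Value1", $"Value2").as("Value"))
  // Drop unmatched rows to mimic the inner joins of the two-step version.
  .where($"Value".isNotNull)

result.show()
```

One difference to keep in mind: because the final filter tests Value for null, a source row whose matching reference row has a null Value would be dropped here, whereas the inner-join version would keep it. Removing the filter keeps unmatched source rows with a null Value instead.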

If you run into performance problems, I suggest looking at how to partition your data well, and if one of your DataFrames is small, also take a look at Broadcast.
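
To illustrate the broadcast suggestion, here is a minimal sketch that marks the small reference table with the broadcast hint from org.apache.spark.sql.functions. It assumes Spark 2.0 or later (a SparkSession) and that RefTable1 is the small side; the name joined is made up for illustration. For tables as tiny as these, Spark would likely broadcast automatically anyway, so the hint mainly matters when the size estimate is above the auto-broadcast threshold.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: Spark 2.0+; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-join-sketch").getOrCreate()
import spark.implicits._

val SRCTable = Seq((66,"D","a"), (67,"U","b"), (70,"D","y"), (71,"U","q")).toDF("KEY","SC","FK")
val RefTable1 = Seq((66,"xyz1"),(67,"abc1"),(68,"fgr1"),(69,"yte1"),(70,"erx1"),(71,"ter1")).toDF("KEY1","Value")

// broadcast(...) is a hint asking Spark to ship the small table to every executor,
// so the larger side does not have to be shuffled for the join.
val joined = SRCTable.where($"SC" === "D")
  .join(broadcast(RefTable1), $"KEY" === $"KEY1")
  .select("KEY", "SC", "FK", "Value")

joined.explain() // inspect the physical plan to see which join strategy was chosen
joined.show()
```

The same hint could be applied to RefTable2 in the "U" branch if that table is also small.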