Joining multiple Spark DataFrames based on conditions
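Based on the "SC" code, I need to join SRCTable with either RefTable-1 or RefTable-2.
Conditions:
If SC is "D", SRCTable is joined with RefTable-1 on KEY = KEY1 to get the Value.
Otherwise, if SC is "U", SRCTable is joined with RefTable-2 on KEY = KEY2.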
Here is code you can test; it relies only on the DataFrame join function:
    // Run in spark-shell, or import your session's implicits first so that toDF is available
    val SRCTable = Seq((66,"D","a"), (67,"U","b"), (70,"D","y"), (71,"U","q")).toDF("KEY","SC","FK")
    val RefTable1 = Seq((66,"xyz1"),(67,"abc1"),(68,"fgr1"),(69,"yte1"),(70,"erx1"),(71,"ter1")).toDF("KEY1","Value")
    val RefTable2 = Seq((66,"a","xyz2"), (67,"c","abc2"), (67,"b","fgr2"), (69,"g","yte2"), (70,"y","erx2"), (71,"q","ter2")).toDF("KEY2","KEY3","Value")

    // SC = "D": look up Value in RefTable1 on KEY = KEY1
    val join1 = SRCTable.where(SRCTable.col("SC").equalTo("D"))
      .join(RefTable1, SRCTable.col("KEY") === RefTable1.col("KEY1"))
      .select("KEY","SC","FK","Value")

    // SC = "U": look up Value in RefTable2 on KEY = KEY2 and FK = KEY3
    val join2 = SRCTable.where(SRCTable.col("SC").equalTo("U"))
      .join(RefTable2, SRCTable.col("KEY") === RefTable2.col("KEY2") && SRCTable.col("FK") === RefTable2.col("KEY3"))
      .select("KEY","SC","FK","Value")

    // Combine both branches into one result
    join1.unionAll(join2).show()
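If you run into performance problems, I suggest looking at how to partition your data well, and, if one of your DataFrames is small, also taking a look at the Broadcast object.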
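To make the Broadcast suggestion more concrete, here is a minimal sketch. It assumes Spark 2.x or later, a local SparkSession, and that RefTable1 is small enough to fit in memory on every executor. The `broadcast` function comes from `org.apache.spark.sql.functions`; the object name `BroadcastJoinSketch` and the helper `joinWithBroadcast` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Hypothetical helper: joins the SC = "D" rows of src against a small reference table.
  // broadcast(ref) hints that ref should be copied to every executor instead of shuffled.
  def joinWithBroadcast(src: DataFrame, ref: DataFrame): DataFrame =
    src.where(src.col("SC") === "D")
      .join(broadcast(ref), src.col("KEY") === ref.col("KEY1"))
      .select("KEY", "SC", "FK", "Value")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption); in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val SRCTable = Seq((66,"D","a"), (67,"U","b"), (70,"D","y"), (71,"U","q")).toDF("KEY","SC","FK")
    val RefTable1 = Seq((66,"xyz1"),(67,"abc1"),(68,"fgr1"),(69,"yte1"),(70,"erx1"),(71,"ter1")).toDF("KEY1","Value")

    val join1 = joinWithBroadcast(SRCTable, RefTable1)
    join1.explain() // inspect the physical plan to see which join strategy Spark picked
    join1.show()

    spark.stop()
  }
}
```

Whether the hint actually helps depends on the real table sizes, so it is worth comparing the plan and the run time with and without `broadcast` on your own data.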