How to find the nearest value of two DataFrames in Spark
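For each element of a DataFrame, I want to find the closest value from a second DataFrame.
I have two DataFrames. The first one (DF1) contains 14,000,000 elements.
I took a sample DataFrame (DF2) with 30,000 elements.
Now, for each element in DF1, I want to find the closest value among all elements of DF2.
For example: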
DF1:
Timestamp, Value
2014-01-01 00:00:01, 3.0
2014-01-01 00:00:05, 12.0
2014-01-01 00:00:09, 8.0
2014-01-01 00:00:10, 45.0
2014-01-01 00:00:15, 3.0
2014-01-01 00:00:21, 4.0
2014-01-01 00:00:32, 19.0
DF2:
Timestamp, Value
2014-01-01 00:00:01, 3.0
2014-01-01 00:00:10, 45.0
2014-01-01 00:00:09, 8.0
The result should look like this:
Timestamp, Value, ClosestValue
2014-01-01 00:00:01, 3.0, 3.0
2014-01-01 00:00:05, 12.0, 8.0
2014-01-01 00:00:09, 8.0, 8.0
2014-01-01 00:00:10, 45.0, 45.0
2014-01-01 00:00:15, 3.0, 3.0
2014-01-01 00:00:21, 4.0, 3.0
2014-01-01 00:00:32, 19.0, 8.0
...
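Given the size of your second DataFrame, you can broadcast its values and look up the closest one with a binary search: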
Step 1 - Create a broadcast variable
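A minimal sketch of this step, assuming the `Value` column of the second DataFrame is a non-null double column. The helper name `broadcastSortedValues` is hypothetical. It collects the values to the driver, sorts them (binary search needs sorted input), and broadcasts the resulting array to the executors:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of the original answer.
// Assumption: `df` has a column named "Value" of DoubleType with no nulls.
def broadcastSortedValues(spark: SparkSession, df: DataFrame): Broadcast[Array[Double]] = {
  // Collect the sample values to the driver and sort them in ascending order.
  val sortedValues: Array[Double] =
    df.select("Value").collect().map(_.getDouble(0)).sorted
  // Ship one read-only copy of the array to each executor.
  spark.sparkContext.broadcast(sortedValues)
}
```

With 30,000 doubles the array takes roughly 240 KB, which is small enough to collect and broadcast.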
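Step 2 - Implement binary search
// Binary search for the value closest to `element`.
// `values` must be sorted in ascending order.
def findClosest(element: Double, values: Array[Double]): Double = {
  var left = 0
  var right = values.length - 1
  var closest = Double.NaN      // returned as-is when `values` is empty
  var min = Double.MaxValue     // smallest distance seen so far

  while (left <= right) {
    val mid = left + (right - left) / 2   // avoids Int overflow on very large arrays
    val current = values(mid)

    if (current == element) {
      // Exact match: record it and end the loop.
      closest = element
      left = right + 1
    } else {
      // Narrow the search to the half that can contain a closer value.
      if (current < element) {
        left = mid + 1
      } else {
        right = mid - 1
      }

      // Remember the nearest value visited so far.
      val distance = (current - element).abs
      if (distance < min) {
        min = distance
        closest = current
      }
    }
  }
  closest
}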
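A quick way to sanity-check `findClosest` is to compare it against a plain linear scan on the sample data from the question. The sketch below uses a hypothetical reference function `closestByScan` (not part of the answer) and sorts the DF2 values first, since binary search only works on sorted input:

```scala
// Hypothetical reference implementation: scan every value and keep the nearest one.
// It is O(n) per lookup, so it is only meant for checking results on small data.
def closestByScan(element: Double, values: Array[Double]): Double =
  values.minBy(v => math.abs(v - element))

// Values taken from the DF2 and DF1 examples in the question.
val sortedValues = Array(3.0, 45.0, 8.0).sorted
val df1Values = Array(3.0, 12.0, 8.0, 45.0, 3.0, 4.0, 19.0)

// Closest DF2 value for each DF1 value: 3.0, 8.0, 8.0, 45.0, 3.0, 3.0, 8.0
val expected = df1Values.map(closestByScan(_, sortedValues))

// With findClosest in scope, the two should agree on this data:
// assert(df1Values.map(findClosest(_, sortedValues)).sameElements(expected))
```

When two values are equally far from the element, the two functions may pick different neighbours, so it is simplest to compare them on data without ties.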
Step 3 - Create the UDF
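A minimal sketch of this step, under the assumption that the UDF simply wraps `findClosest` and reads the sorted array from the broadcast variable. The helper name `makeFindClosestUdf` and its parameter names are hypothetical:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical helper, not part of the original answer.
// `sortedValues` stands for the broadcast variable from step 1,
// `closest` for a lookup function such as findClosest from step 2.
def makeFindClosestUdf(
    sortedValues: Broadcast[Array[Double]],
    closest: (Double, Array[Double]) => Double): UserDefinedFunction =
  // `.value` is read on the executor, so the array is not serialized per task.
  udf((element: Double) => closest(element, sortedValues.value))

// Possible usage, assuming a broadcast variable named `broadcasted` exists:
// val findClosestUdf = makeFindClosestUdf(broadcasted, findClosest)
```

Passing the broadcast handle rather than the raw array keeps the closure small, since each executor fetches the array only once.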
Step 4 - Use the UDF
df1.withColumn("ClosestValue", findClosestUdf(df1("Value")))