Efficiently calculate row totals of a wide Spark DF
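I have a wide Spark data frame with a few thousand columns and about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions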
https://github.com/tidyverse/rlang/issues/116
    library(sparklyr)
    library(DBI)
    library(dplyr)
    library(rlang)

    sc1 <- spark_connect(master = "local")
    wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
    wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")

    col_eqn = paste0(colnames(wide_df), collapse = "+")

    # build up the SQL query and send to spark with DBI
    query = paste0("SELECT (", col_eqn, ") as total FROM wide_sdf")
    dbGetQuery(sc1, query)

    # Equivalent approach using dplyr instead
    col_eqn2 = quo(!! parse_expr(col_eqn))
    wide_sdf %>%
      transmute("total" := !!col_eqn2) %>%
      collect() %>%
      as.data.frame()
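This causes problems when the number of columns is increased. In Spark SQL the sum seems to be calculated one element at a time, i.e. (((V1 + V2) + V3) + V4) + ...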
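To make the nesting concrete, here is a small sketch in base R (no Spark needed). It uses R's own parser as a stand-in, on the assumption that Spark SQL groups a chain of + the same way, left to right. `expr_depth` is a hypothetical helper written only for this illustration.

```r
# Depth of a parsed call tree: a bare symbol or constant has depth 0,
# and each enclosing call adds one level.
expr_depth <- function(e) {
  if (!is.call(e)) return(0L)
  1L + max(vapply(as.list(e)[-1], expr_depth, integer(1)))
}

cols <- paste0("V", 1:200)

# Same string as col_eqn above: "V1+V2+...+V200"
left_nested <- parse(text = paste0(cols, collapse = "+"))[[1]]

# One level of nesting per additional column
depth <- expr_depth(left_nested)
print(depth)
```

With 200 columns the tree is 199 levels deep, so the depth grows linearly with the width of the table, which is consistent with the problem showing up only for very wide data.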
Does anyone have an alternative, more efficient approach? Any help would be much appreciated.
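You're out of luck here. One way or another you are going to hit some recursion limit (even if you work around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available: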
- Use spark_apply (at the cost of a round trip to and from R):

      wide_sdf %>% spark_apply(function(df) {
        data.frame(total = rowSums(df))
      })

- Convert to long format and aggregate (at the cost of explode and shuffle):

      key_expr <- "monotonically_increasing_id() AS key"

      value_expr <- paste(
        "explode(array(", paste(colnames(wide_sdf), collapse = ","), ")) AS value"
      )

      wide_sdf %>%
        spark_dataframe() %>%
        # Add id and explode. We need a separate invoke so id is applied
        # before "lateral view"
        sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
        sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
        sdf_register() %>%
        # Aggregate by id
        group_by(key) %>%
        summarize(total = sum(value)) %>%
        arrange(key)
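The wide-to-long idea from the second option can be checked locally on a small data frame. The sketch below is plain base R, not Spark code: it mimics the explode step with `unlist` and the aggregation with `aggregate`, then compares the result with `rowSums`. The names `long_df` and `totals` are made up for this illustration.

```r
set.seed(1)
wide_df <- as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))

# "Explode": one (key, value) pair per cell. unlist() walks the data
# frame column by column, so the row ids repeat once per column.
long_df <- data.frame(
  key   = rep(seq_len(nrow(wide_df)), times = ncol(wide_df)),
  value = unlist(wide_df, use.names = FALSE)
)

# Aggregate by key, the local analogue of group_by(key) + sum(value)
totals <- aggregate(value ~ key, data = long_df, FUN = sum)
names(totals)[2] <- "total"

# The long-format totals should match the direct row sums
same <- isTRUE(all.equal(totals$total, unname(rowSums(wide_df))))
print(same)
```

On Spark, monotonically_increasing_id() produces keys that are unique and increasing but not consecutive, so the key there identifies a row rather than giving its position.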
For something more efficient you should consider writing a Scala extension and applying the sum directly to each row:
    package com.example.sparklyr.rowsum

    import org.apache.spark.sql.{DataFrame, Encoders}

    object RowSum {
      def apply(df: DataFrame, cols: Seq[String]) = df.map {
        row => cols.map(c => row.getAs[Double](c)).sum
      }(Encoders.scalaDouble)
    }
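and

    invoke_static(
      sc1, "com.example.sparklyr.rowsum.RowSum", "apply",
      wide_sdf %>% spark_dataframe(),
      # RowSum.apply also expects the column names. Passing them as a
      # list (as in the selectExpr calls above) is an assumption here.
      as.list(colnames(wide_sdf))
    ) %>% sdf_register()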