Efficiently calculate row totals of a wide Spark DF

I have a wide Spark data frame with a couple of thousand columns and about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions
https://github.com/tidyverse/rlang/issues/116

library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)

sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")

col_eqn = paste0(colnames(wide_df), collapse = "+")

# build up the SQL query and send to spark with DBI
query = paste0("SELECT (",
               col_eqn,
               ") as total FROM wide_sdf")

dbGetQuery(sc1, query)

# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))

wide_sdf %>%
    transmute("total" := !!col_eqn2) %>%
        collect() %>%
            as.data.frame()

Problems arise when the number of columns is increased. In Spark SQL the sum seems to be computed one element at a time, i.e. ((((V1 + V2) + V3) + V4) ...).

Does anyone have an alternative, more efficient approach? Any help would be much appreciated.


You're out of luck here. One way or another you are going to hit some recursion limits (even if you go around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:

  • Use spark_apply (at the cost of conversion to and from R):

    wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
  • Convert to long format and aggregate (at the cost of explode and a shuffle); a plain-SQL sketch of the same idea is shown after this list:

    key_expr <-"monotonically_increasing_id() AS key"

    value_expr <- paste(
    "explode(array(", paste(colnames(wide_sdf), collapse=","),")) AS value"
    )

    wide_sdf %>%
      spark_dataframe() %>%
      # Add id and explode. We need a separate invoke so id is applied
      # before"lateral view"
      sparklyr::invoke("selectExpr", list(key_expr,"*")) %>%
      sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
      sdf_register() %>%
      # Aggregate by id
      group_by(key) %>%
      summarize(total = sum(value)) %>%
      arrange(key)
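
For reference, here is a minimal sketch of the same explode-and-aggregate idea expressed as a single Spark SQL query sent through DBI, assuming the wide_sdf table registered in the question; the inner subquery attaches the id before the explode, mirroring the two separate selectExpr invocations above:

# Sketch only: the long-format aggregation written as one SQL string in R.
# The nested subquery adds the row id before the explode is applied.
long_query <- paste0(
  "SELECT key, SUM(value) AS total FROM (",
  "SELECT key, explode(array(", paste(colnames(wide_df), collapse = ", "), ")) AS value ",
  "FROM (SELECT monotonically_increasing_id() AS key, * FROM wide_sdf) ids",
  ") vals GROUP BY key ORDER BY key"
)
dbGetQuery(sc1, long_query)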

To get something more efficient you should consider writing a Scala extension and applying the sum directly on a Row object, without exploding:

package com.example.sparklyr.rowsum

import org.apache.spark.sql.{DataFrame, Encoders}

object RowSum {
  def apply(df: DataFrame, cols: Seq[String]) = df.map {
    row => cols.map(c => row.getAs[Double](c)).sum
  }(Encoders.scalaDouble)
}

invoke_static(
  sc1, "com.example.sparklyr.rowsum.RowSum", "apply",
  wide_sdf %>% spark_dataframe(), colnames(wide_sdf)
) %>% sdf_register()
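
For the invoke_static call to resolve, the compiled class has to be on Spark's classpath. A minimal sketch of one way to do that with sparklyr, assuming the object above has been packaged into a jar (the path below is a placeholder), is to list it in sparklyr.jars.default before connecting:

# Sketch only: make a custom jar available to sparklyr at connection time.
# "path/to/rowsum.jar" is a placeholder for wherever the packaged extension lives.
config <- spark_config()
config$sparklyr.jars.default <- "path/to/rowsum.jar"
sc1 <- spark_connect(master = "local", config = config)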