Scala: How to find Spark RDD/DataFrame size?

I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File#length returns the file size in bytes
  // (note: java.io.File cannot actually read HDFS URIs; this only works for local paths)
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no length method

But if I process it this way, I do not get the file size. How do I find the RDD size?


If you just want to count the number of rows in the rdd, do the following:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the size in bytes, you can use SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
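Note that SizeEstimator measures the object graph on the JVM where it runs, so calling it on the RDD handle reports the size of the driver-side object, not of the distributed data. One workaround, as a minimal sketch (collecting a sample and scaling up is an assumption of this sketch, not part of the original answer; the 1% fraction is illustrative):

import org.apache.spark.util.SizeEstimator

// Collect a small sample to the driver, estimate its in-memory size,
// and scale by the sampling fraction to approximate the whole RDD.
val sample = distFile.sample(withReplacement = false, fraction = 0.01).collect()
val approxTotal = SizeEstimator.estimate(sample) * 100
println(s"Approximate in-memory size: $approxTotal bytes")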


Yes, finally I found the solution.
Include these libraries:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_+_) //add the sizes together
}
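For example, a hypothetical usage on a toy RDD:

// Two 5-byte ASCII lines give a total of 10 bytes.
val lines = sc.parallelize(Seq("hello", "world"))
println(calcRDDSize(lines)) // 10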

Function to find the DataFrame size:
(this function just converts the DataFrame to an RDD internally)

import spark.implicits._ // needed for toDF() (assumes a SparkSession named "spark")

val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

// Note: this measures each Row rendered as a string, which can differ
// from the actual serialized size of the data.
val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
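As an alternative, Catalyst keeps its own size estimate for every logical plan. A minimal sketch, assuming Spark 2.3 or later (older versions expose this as statistics or stats(conf) instead of stats):

// Catalyst's estimated size of the DataFrame in bytes (a BigInt).
val estimatedBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Estimated size: $estimatedBytes bytes")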


Below is one way apart from SizeEstimator, which I use frequently.

Suppose you want to know from code whether an RDD is cached, and more precisely, how many of its partitions are cached in memory and how many on disk; to get the storage level; and to know the current actual caching status and the memory consumption.

SparkContext has the developer-api method getRDDStorageInfo().
Occasionally you can use this.

Return information about what RDDs are cached, if they are in mem or
on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
      StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
      TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
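The same information can be read programmatically from the returned RDDInfo objects; a minimal sketch:

// Each RDDInfo exposes cached-partition counts and memory/disk usage.
for (info <- sc.getRDDStorageInfo) {
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"mem=${info.memSize} B, disk=${info.diskSize} B, level=${info.storageLevel}")
}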

It seems that the Spark UI also uses the same information from this code.

  • See the source issue SPARK-17019, which describes...

Description
With SPARK-13992, Spark supports persisting data into
off-heap memory, but the usage of off-heap is not exposed currently,
it is not so convenient for user to monitor and profile, so here
propose to expose off-heap memory as well as on-heap memory usage in
various places:

  • Spark UI's executor page will display both on-heap and off-heap memory usage.
  • REST request returns both on-heap and off-heap memory.
  • Also, these two memory usages can be obtained programmatically from a SparkListener (a minimal sketch follows below).
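As a sketch of that last point, a SparkListener can read the on-heap and off-heap storage memory reported when each executor's block manager registers. This is a minimal sketch, assuming a Spark version where SPARK-17019 is merged; the listener class name is hypothetical:

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

// Hypothetical listener: logs on-heap and off-heap storage memory
// as each executor's block manager registers with the driver.
class MemoryListener extends SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    println(s"${event.blockManagerId.host}: " +
      s"onHeap=${event.maxOnHeapMem}, offHeap=${event.maxOffHeapMem}")
  }
}

sc.addSparkListener(new MemoryListener)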