Scala: How to find Spark RDD/DataFrame size?

I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File#length returns the file size in bytes
  // (note: java.io.File cannot actually read HDFS URIs; this only works for local paths)
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no length method

But if I process it this way, I do not get the file size. How do I find the RDD size?


If you just want to count the number of rows in the rdd, do the following:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the size in bytes, you can use SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
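Note that SizeEstimator measures the object graph on the JVM where it runs, so calling it on the RDD handle reports the size of the driver-side object, not of the distributed data. One workaround, as a minimal sketch (collecting a sample and scaling up is an assumption of this sketch, not part of the original answer; the 1% fraction is illustrative):

import org.apache.spark.util.SizeEstimator

// Collect a small sample to the driver, estimate its in-memory size,
// and scale by the sampling fraction to approximate the whole RDD.
val sample = distFile.sample(withReplacement = false, fraction = 0.01).collect()
val approxTotal = SizeEstimator.estimate(sample) * 100
println(s"Approximate in-memory size: $approxTotal bytes")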


Yes, finally I found the solution.
Include these libraries:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_+_) //add the sizes together
}
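For example, a hypothetical usage on a toy RDD:

// Two 5-byte ASCII lines give a total of 10 bytes.
val lines = sc.parallelize(Seq("hello", "world"))
println(calcRDDSize(lines)) // 10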

Function to find the DataFrame size:
(this function just converts the DataFrame to an RDD internally)

import spark.implicits._ // needed for toDF() (assumes a SparkSession named "spark")

val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

// Note: this measures each Row rendered as a string, which can differ
// from the actual serialized size of the data.
val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
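As an alternative, Catalyst keeps its own size estimate for every logical plan. A minimal sketch, assuming Spark 2.3 or later (older versions expose this as statistics or stats(conf) instead of stats):

// Catalyst's estimated size of the DataFrame in bytes (a BigInt).
val estimatedBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Estimated size: $estimatedBytes bytes")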


Below is one way apart from SizeEstimator, which I use frequently.

Suppose you want to know from code whether an RDD is cached, and more precisely, how many of its partitions are cached in memory and how many on disk; to get the storage level; and to know the current actual caching status and the memory consumption.

SparkContext has the developer-api method getRDDStorageInfo().
Occasionally you can use this.

Return information about what RDDs are cached, if they are in mem or
on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None" (3)
      StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1;
      TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
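The same information can be read programmatically from the returned RDDInfo objects; a minimal sketch:

// Each RDDInfo exposes cached-partition counts and memory/disk usage.
for (info <- sc.getRDDStorageInfo) {
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"mem=${info.memSize} B, disk=${info.diskSize} B, level=${info.storageLevel}")
}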

It seems that the Spark UI also uses the same information from this code.

  • See the source issue SPARK-17019, which describes...

Description
With SPARK-13992, Spark supports persisting data into
off-heap memory, but the usage of off-heap is not exposed currently,
it is not so convenient for user to monitor and profile, so here
propose to expose off-heap memory as well as on-heap memory usage in
various places:

  • Spark UI's executor page will display both on-heap and off-heap memory usage.
  • REST request returns both on-heap and off-heap memory.
  • Also, these two memory usages can be obtained programmatically from a SparkListener (a minimal sketch follows below).
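As a sketch of that last point, a SparkListener can read the on-heap and off-heap storage memory reported when each executor's block manager registers. This is a minimal sketch, assuming a Spark version where SPARK-17019 is merged; the listener class name is hypothetical:

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

// Hypothetical listener: logs on-heap and off-heap storage memory
// as each executor's block manager registers with the driver.
class MemoryListener extends SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    println(s"${event.blockManagerId.host}: " +
      s"onHeap=${event.maxOnHeapMem}, offHeap=${event.maxOffHeapMem}")
  }
}

sc.addSparkListener(new MemoryListener)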