How to use a mutable map in Scala on Apache Spark? Key not found error

I am using Spark 1.3.0.
My map appears to contain the keys, but when I try to access them I get None or a "key not found" error.

import scala.collection.mutable.HashMap
val labeldata = sc.textFile("/home/data/trainLabels2.csv")
val labels: Array[Array[String]] = labeldata.map(line => line.split(",")).collect()
var fn2label: HashMap[String,Int] = new HashMap()
labels.foreach{ x => fn2label += (x(0) -> x(1).toInt)}

My map then looks like this:

scala> fn2label
res45: scala.collection.mutable.HashMap[String,Int] = Map("k2VDmKNaUlXtnMhsuCic" -> 1,"AGzOvc4dUfw1B8nDmY2X" -> 1,"BqRPMt4QY1sHzvF6JK7j" -> 3,.....

It even has the keys:

scala> fn2label.keys
res46: Iterable[String] = Set("k2VDmKNaUlXtnMhsuCic","AGzOvc4dUfw1B8nDmY2X","BqRPMt4QY1sHzvF6JK7j",

But I cannot access them:

scala> fn2label.get("k2VDmKNaUlXtnMhsuCic")
res48: Option[Int] = None

scala> fn2label("k2VDmKNaUlXtnMhsuCic")
java.util.NoSuchElementException: key not found: k2VDmKNaUlXtnMhsuCic

What I have tried includes broadcasting the map, broadcasting both the labels and the map, using Map instead of HashMap, and parallelizing it, as in https://stackoverflow.com/a/24734410/1290485:

val mapRdd = sc.parallelize(fn2label.toSeq)
mapRdd.lookup("k2VDmKNaUlXtnMhsuCic")
res50: Seq[Int] = WrappedArray()

What am I missing?


Your data simply contains extra quotation marks: the actual keys include literal `"` characters, so a lookup only succeeds if the quotes are part of the key:

scala> val fn2label = scala.collection.mutable.HashMap("\"k2VDmKNaUlXtnMhsuCic\"" -> 1, "\"AGzOvc4dUfw1B8nDmY2X\"" -> 1, "\"BqRPMt4QY1sHzvF6JK7j\"" -> 3)
fn2label: scala.collection.mutable.HashMap[String,Int] = Map("BqRPMt4QY1sHzvF6JK7j" -> 3,"AGzOvc4dUfw1B8nDmY2X" -> 1,"k2VDmKNaUlXtnMhsuCic" -> 1)

scala> fn2label.get("\"k2VDmKNaUlXtnMhsuCic\"")
res4: Option[Int] = Some(1)

scala> fn2label.keys
res5: Iterable[String] = Set("BqRPMt4QY1sHzvF6JK7j","AGzOvc4dUfw1B8nDmY2X","k2VDmKNaUlXtnMhsuCic")
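A more robust fix is to strip the surrounding quotes while building the map, so lookups can use the bare key. This is a sketch assuming the CSV has a quoted first column and an unquoted integer second column (as the question's `x(1).toInt` suggests); for anything less uniform, a real CSV parser would be safer than `split(",")`:

```scala
import scala.collection.mutable.HashMap

// Hypothetical raw lines mimicking the question's data: quoted key, unquoted label.
val lines = Seq(
  "\"k2VDmKNaUlXtnMhsuCic\",1",
  "\"AGzOvc4dUfw1B8nDmY2X\",1",
  "\"BqRPMt4QY1sHzvF6JK7j\",3"
)

val fn2label: HashMap[String, Int] = new HashMap()
lines.foreach { line =>
  // Remove one leading and one trailing double quote from each field, if present.
  val fields = line.split(",").map(_.stripPrefix("\"").stripSuffix("\""))
  fn2label += (fields(0) -> fields(1).toInt)
}

println(fn2label.get("k2VDmKNaUlXtnMhsuCic")) // Some(1)
```

On Spark this would replace the `labels.foreach` step in the question, e.g. by mapping each split array through `stripPrefix`/`stripSuffix` before adding it to the map.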