Convert a json string to array of key-value pairs in Spark scala
I have a JSON string loaded into a Spark DataFrame. The JSON string can contain between 0 and 3 key-value pairs.

When more than one kv pair is sent, as here:
```json
{"id":1, "productData":{
  "product":{
    "product_name":"xyz",
    "product_facets":{"entry":[{"key":"test","value":"success"}, {"key":"test2","value":"fail"}]}
  }}}
```
I can use the explode function:
```scala
sourceDF.filter($"someKey".contains("some_string"))
  .select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
```
However, when only one key-value pair is sent, the source JSON string is not formatted as an array with square brackets:
```json
{"id":1, "productData":{
  "product":{
    "product_name":"xyz",
    "product_facets":{"entry":{"key":"test","value":"success"}}
  }}}
```
The schema for the product tag is as follows:
How can I change entry into an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into individual columns, and I want to use groupBy on the exploded kv pairs. I tried:
```scala
val schema = StructType(
  Seq(
    StructField("entry", ArrayType(
      StructType(
        Seq(
          StructField("key", StringType),
          StructField("value", StringType)
        )
      )
    ))
  )
)

sourceDF.filter($"someKey".contains("some_string"))
  .select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
```
But the above does create a new column kvPairsFromJson that looks like "[]", and calling explode on it does nothing. Any pointers on what is happening, or on a better way to do this?
I think one approach could be:

1. Create a udf that takes the entry value as a JSON string and returns an array of (key, value) pairs.
2. In the udf, check whether the string starts with "[": if it does not, parse the single object and wrap it in a one-element array; otherwise parse it as a list.
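The core of step 2 is just a normalization rule: a single JSON object gets wrapped in brackets so everything downstream sees an array. A minimal sketch of that rule in plain Scala, with no Spark or Jackson dependency (`normalizeToArray` is a hypothetical helper name, not part of the code below):

```scala
// Hypothetical helper illustrating the normalization rule from steps 1-2:
// if the JSON text is not already an array, wrap it in square brackets.
def normalizeToArray(json: String): String = {
  val trimmed = json.trim
  if (trimmed.startsWith("[")) trimmed // already a JSON array, pass through
  else s"[$trimmed]"                   // single object: wrap it in brackets
}

// A single object gets wrapped; an array passes through unchanged.
normalizeToArray("""{"key":"test","value":"success"}""")
// → [{"key":"test","value":"success"}]
normalizeToArray("""[{"key":"a","value":"b"}]""")
// → [{"key":"a","value":"b"}]
```

The udf below does the same check with `startsWith("[")` but parses the JSON with Jackson instead of rewriting the string.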
The following code illustrates this approach:
```scala
// One row where entry is an array, another where it is a single object
val ds = Seq(
  """{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""",
  """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}"""
).toDS

val df = spark.read.json(ds)

// Schema used by the udf to generate the output column
import org.apache.spark.sql.types._

val outputSchema = ArrayType(StructType(Seq(
  StructField("key", StringType, false),
  StructField("value", StringType, false)
)))

// Converts a non-array entry value to an array
val toArray = udf((json: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule

  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)

  if (!json.startsWith("[")) {
    // Single object: parse it and wrap the (key, value) pair in a one-element list
    val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
    List((jsonMap("key"), jsonMap("value")))
  } else {
    // Already an array: parse every element
    jsonMapper.readValue(json, classOf[List[Map[String, String]]])
      .map(f => (f("key"), f("value")))
  }
}, outputSchema)

val arrayResult = df.select(
  col("id").as("id"),
  toArray(col("productData.product.product_facets.entry")).as("entry"))

val arrayExploded = df.select(
  col("id").as("id"),
  explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))

val explodedToCols = df.select(
  col("id").as("id"),
  explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
  .select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
```
Results:
```
scala> arrayResult.printSchema
root
 |-- id: long (nullable = true)
 |-- entry: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: string (nullable = false)

scala> arrayExploded.printSchema
root
 |-- id: long (nullable = true)
 |-- entry: struct (nullable = true)
 |    |-- key: string (nullable = false)
 |    |-- value: string (nullable = false)

scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry                           |
+---+--------------------------------+
|1  |[[test, success], [test2, fail]]|
|2  |[[test, success]]               |
+---+--------------------------------+

scala> arrayExploded.show(false)
+---+---------------+
|id |entry          |
+---+---------------+
|1  |[test, success]|
|1  |[test2, fail]  |
|2  |[test, success]|
+---+---------------+
```
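From explodedToCols, the pivot into individual key columns that the question asks for could look like the sketch below. This is an assumption about the final step, not part of the original answer; it also assumes each (id, key) pair has a single value, so `first` is an arbitrary choice of aggregate under that assumption:

```scala
import org.apache.spark.sql.functions.first

// Sketch: turn each distinct key into its own column, one row per id.
val pivoted = explodedToCols
  .groupBy("id")
  .pivot("key")
  .agg(first("value"))

pivoted.show(false)
```

For id 1 this would produce columns test and test2 holding "success" and "fail"; for id 2 the test2 column would be null, since that row has no such key.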