Applying a schema to a Spark Dataset of a Java object
There is a similar question here: How to add a schema to a Dataset in Spark?
However, the problem I am facing is that I already have a predefined schema.
Sample code:
```java
Dataset<Row> rowDataset = spark.getSpark().sqlContext()
        .createDataFrame(rowRDD, schema).toDF();

Dataset<MyObj> objResult = rowDataset.map((MapFunction<Row, MyObj>) row -> new MyObj(
        row.getInt(row.fieldIndex("field1")),
        row.isNullAt(row.fieldIndex("field2")) ? "" : row.getString(row.fieldIndex("field2")),
        row.isNullAt(row.fieldIndex("field3")) ? "" : row.getString(row.fieldIndex("field3")),
        row.isNullAt(row.fieldIndex("field4")) ? "" : row.getString(row.fieldIndex("field4"))
), Encoders.javaSerialization(MyObj.class));
```
If I print the schema of the row dataset, I get the schema as expected:
```
rowDataset.printSchema();

root
 |-- field1: integer (nullable = false)
 |-- field2: string (nullable = false)
 |-- field3: string (nullable = false)
 |-- field4: string (nullable = false)
```
But if I print the object dataset, the actual schema is lost:
```
objResult.printSchema();

root
 |-- value: binary (nullable = true)
```
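The single `binary` column is a direct consequence of `Encoders.javaSerialization`: it runs the object through standard Java serialization, which produces one opaque byte array with no recoverable per-field structure. A small, self-contained sketch of that effect (the `MyObj` stand-in here is hypothetical, with illustrative field names only):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class JavaSerializationDemo {

    // Hypothetical stand-in for MyObj; field names are illustrative only.
    public static class MyObj implements Serializable {
        public int field1 = 42;
        public String field2 = "abc";
    }

    // Java-serialize an object into a byte array, which is essentially what
    // Encoders.javaSerialization stores per record.
    public static byte[] serialize(Serializable obj) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new MyObj());
        // The result is one opaque blob: field names and types cannot be read
        // back without deserializing, so Spark can only expose it to the
        // optimizer as a single binary column.
        System.out.println("MyObj serialized to " + bytes.length + " bytes");
    }
}
```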
The question is: how can I apply this schema to the object dataset?
Below is a code snippet I tried where Spark behaves as expected, so it seems the root cause of the problem is not the map function itself.
```java
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> ds = session.read().text("<some path>");
Encoder<Employee> employeeEncode = Encoders.bean(Employee.class);
ds.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row value) throws Exception {
        return new Employee(value.getString(0).split(","));
    }
}, employeeEncode).printSchema();
```
Output:
```
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
```
The Employee bean:
```java
public class Employee {

    public String name;
    public Integer age;

    public Employee() {
    }

    public Employee(String[] args) {
        this.name = args[0];
        this.age = Integer.parseInt(args[1]);
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }
}
```
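The working `Employee` snippet actually points at the difference between the two cases: it uses `Encoders.bean`, which derives a per-field schema from the class's getters and setters, whereas the first snippet uses `Encoders.javaSerialization`, which collapses the object into one binary value. A minimal sketch of the first snippet with a bean encoder instead, assuming `MyObj` follows JavaBean conventions (a public no-arg constructor plus getters and setters for `field1`..`field4`) so that `Encoders.bean` can reflect over it:

```java
// Sketch only: assumes MyObj is a JavaBean (no-arg constructor, getters/setters).
Encoder<MyObj> myObjEncoder = Encoders.bean(MyObj.class);

Dataset<MyObj> objResult = rowDataset.map((MapFunction<Row, MyObj>) row -> new MyObj(
        row.getInt(row.fieldIndex("field1")),
        row.isNullAt(row.fieldIndex("field2")) ? "" : row.getString(row.fieldIndex("field2")),
        row.isNullAt(row.fieldIndex("field3")) ? "" : row.getString(row.fieldIndex("field3")),
        row.isNullAt(row.fieldIndex("field4")) ? "" : row.getString(row.fieldIndex("field4"))
), myObjEncoder);

// With a bean encoder, printSchema() should list field1..field4
// instead of a single binary "value" column.
objResult.printSchema();
```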