Today let's look at a noteworthy technical point in streaming SQL: different SQL queries produce different types of output.
Consider two queries: a GroupBy with a window, and a GroupBy without one. These two queries produce different types of output:
- With a windowed GroupBy, results can simply be appended as they are produced, because time keeps advancing (ignoring here the case where late data requires correcting already-emitted results);
- With a non-windowed GroupBy, whenever new input arrives for an existing group, the previously emitted result for that group must be corrected (this also depends on how long the group's state is retained);
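The difference can be sketched without any Flink code. Below is a minimal simulation in plain Scala (the event data and field names are made up for illustration): a windowed count is final once its window closes, so each result row is emitted exactly once, while a non-windowed count must re-emit a row for a key every time that key receives input.

```scala
// Minimal simulation of the two output modes (no Flink involved).
object GroupByOutputs {
  // Hypothetical input: (user, minuteOfEventTime) click events.
  val events = Seq(("alice", 0), ("bob", 0), ("alice", 1), ("alice", 1))

  // Windowed GroupBy (think: GROUP BY user, TUMBLE(rowtime, INTERVAL '1' MINUTE)).
  // Once a window closes its result is final, so output is append-only.
  def windowedCounts: Seq[((String, Int), Int)] =
    events.groupBy(identity).map { case (k, v) => (k, v.size) }.toSeq.sorted

  // Non-windowed GroupBy (GROUP BY user). Every incoming event updates the
  // running count of its key, so each emitted row supersedes the previous
  // row for the same user — downstream must correct earlier output.
  def runningCounts: Seq[(String, Int)] = {
    var counts = Map.empty[String, Int]
    events.map { case (user, _) =>
      val c = counts.getOrElse(user, 0) + 1
      counts += user -> c
      (user, c) // supersedes the previously emitted row for this user
    }
  }
}
```

Here `runningCounts` emits three successive rows for `alice` — each one is a correction of the one before it, which is exactly the behavior the sink has to support.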
There is also more than one way to apply a correction:
- For results with a primary key: update the original result in place, or delete it and write the new one;
- For results without a primary key: delete the original result and write the new one;
Streaming SQL therefore needs different types of TableSink to support different queries. Taking Flink as the example, it has three streaming TableSinks:

- `AppendStreamTableSink`: the output type described above that only ever appends;
- `RetractStreamTableSink`: the output type that can correct results without a primary key, or correct keyed results by deleting the old row and writing the new one;
- `UpsertStreamTableSink`: the output type that can correct keyed results by updating them in place.
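How a downstream consumer interprets the latter two change streams can be sketched in plain Scala. The `Row` type and the two apply functions below are simplified stand-ins, not Flink's actual interfaces: a retract stream tags every row with a Boolean flag, where `false` retracts a previously emitted row by value, and an upsert stream interprets the same flag per key.

```scala
// Simplified stand-ins for how retract and upsert change streams are
// materialized downstream (not Flink's API).
object ChangeStreams {
  type Row = (String, Int) // (key, value)

  // Retract stream: (true, row) adds a row, (false, row) removes that exact
  // row — no primary key is needed, deletion matches on the full row.
  def applyRetractions(changes: Seq[(Boolean, Row)]): Seq[Row] =
    changes.foldLeft(Vector.empty[Row]) {
      case (acc, (true, row))  => acc :+ row
      case (acc, (false, row)) => acc.diff(Seq(row))
    }

  // Upsert stream: (true, row) inserts or updates by key, (false, row)
  // deletes by key — corrections happen in place via the primary key.
  def applyUpserts(changes: Seq[(Boolean, Row)]): Map[String, Int] =
    changes.foldLeft(Map.empty[String, Int]) {
      case (acc, (true, (k, v)))  => acc + (k -> v)
      case (acc, (false, (k, _))) => acc - k
    }
}
```

Both functions consume the same `(flag, row)` shape; the difference is purely in how a `false` record is resolved — by full-row match versus by key.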
So which queries need which output type? Again, let's look at Flink's implementation.
AppendStreamTableSink
```scala
case appendSink: AppendStreamTableSink[_] =>
  // optimize plan
  val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
  // verify table is an insert-only (append-only) table
  if (!UpdatingPlanChecker.isAppendOnly(optimizedPlan)) {
    throw new TableException(
      "AppendStreamTableSink requires that Table has only insert changes.")
  }

  val outputType = sink.getOutputType
  val resultType = getResultType(table.getRelNode, optimizedPlan)
  // translate the Table into a DataStream and provide the type that the TableSink expects.
  val result: DataStream[T] =
    translate(
      optimizedPlan,
      resultType,
      streamQueryConfig,
      withChangeFlag = false)(outputType)
  // Give the DataStream to the TableSink to emit it.
  appendSink.asInstanceOf[AppendStreamTableSink[T]].emitDataStream(result)
```
Only an append-only query can use this sink; the check is `UpdatingPlanChecker.isAppendOnly`:
```scala
/** Validates that the plan produces only append changes. */
def isAppendOnly(plan: RelNode): Boolean = {
  val appendOnlyValidator = new AppendOnlyValidator
  appendOnlyValidator.go(plan)
  appendOnlyValidator.isAppendOnly
}
```
```scala
private class AppendOnlyValidator extends RelVisitor {

  var isAppendOnly = true

  override def visit(node: RelNode, ordinal: Int, parent: RelNode): Unit = {
    node match {
      case s: DataStreamRel if s.producesUpdates || s.producesRetractions =>
        isAppendOnly = false
      case _ =>
        super.visit(node, ordinal, parent)
    }
  }
}
```
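The validator's logic can be restated on a toy plan tree (the node types below are hypothetical stand-ins for Calcite's `RelNode` hierarchy, not real Flink classes): walk the tree and conclude "not append-only" as soon as any node produces updates or retractions.

```scala
object AppendOnlyCheck {
  // Toy plan nodes standing in for the optimized RelNode tree (hypothetical).
  sealed trait PlanNode {
    def inputs: Seq[PlanNode] = Nil
    def producesUpdates: Boolean = false
    def producesRetractions: Boolean = false
  }
  case class Scan(table: String) extends PlanNode
  case class GroupAggregate(input: PlanNode) extends PlanNode {
    override def inputs = Seq(input)
    override def producesUpdates = true // unbounded GroupBy updates its results
  }
  case class WindowAggregate(input: PlanNode) extends PlanNode {
    override def inputs = Seq(input) // windowed results are final: append-only
  }

  // Same shape as AppendOnlyValidator: recurse until an updating node is found.
  def isAppendOnly(node: PlanNode): Boolean =
    !node.producesUpdates && !node.producesRetractions &&
      node.inputs.forall(isAppendOnly)
}
```

A plan containing a `WindowAggregate` over a `Scan` passes the check; one containing a `GroupAggregate` does not.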
So a plan is append-only only if none of its `DataStreamRel` nodes produces updates or retractions. Among the join operators, only outer joins produce retractions:
```scala
// outer join will generate retractions
override def producesRetractions: Boolean = joinType != JoinRelType.INNER
```
That is, the result of an outer join may need to be corrected later. This is tied to how Flink implements streaming joins, which we won't expand on here.
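Why an outer join may need retractions can still be sketched without Flink (the row shapes below are made up): a streaming LEFT JOIN has to emit a null-padded result when a left row has no match yet, and retract that row once a matching right row finally arrives.

```scala
object OuterJoinRetraction {
  // (add?, (key, rightValue)): a change-flagged result row of a LEFT JOIN.
  type Change = (Boolean, (String, Option[Int]))

  // Left rows arrive first; right rows may match them later.
  def leftJoin(leftKeys: Seq[String], rights: Seq[(String, Int)]): Seq[Change] = {
    // With no match seen yet, each left row is emitted null-padded.
    val padded = leftKeys.map(k => (true, (k, Option.empty[Int])))
    // When a matching right row arrives, the padded row is retracted
    // and replaced by the real joined row.
    val fixes = rights.flatMap { case (k, v) =>
      if (leftKeys.contains(k))
        Seq((false, (k, Option.empty[Int])), (true, (k, Some(v))))
      else Nil
    }
    padded ++ fixes
  }
}
```

An inner join never has to emit the padded row in the first place, which is why `producesRetractions` is false for `JoinRelType.INNER`.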
RetractStreamTableSink
```scala
case retractSink: RetractStreamTableSink[_] =>
  // retraction sink can always be used
  val outputType = sink.getOutputType
  // translate the Table into a DataStream and provide the type that the TableSink expects.
  val result: DataStream[T] =
    translate(
      table,
      streamQueryConfig,
      updatesAsRetraction = true,
      withChangeFlag = true)(outputType)
  // Give the DataStream to the TableSink to emit it.
  retractSink.asInstanceOf[RetractStreamTableSink[Any]]
    .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])
```
As the comment says, a retraction sink can always be used: there is no append-only or key check, and every row is emitted together with a Boolean change flag (`JTuple2[JBool, Any]`).
UpsertStreamTableSink
```scala
case upsertSink: UpsertStreamTableSink[_] =>
  // optimize plan
  val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
  // check for append only table
  val isAppendOnlyTable = UpdatingPlanChecker.isAppendOnly(optimizedPlan)
  upsertSink.setIsAppendOnly(isAppendOnlyTable)

  // extract unique key fields
  val tableKeys: Option[Array[String]] =
    UpdatingPlanChecker.getUniqueKeyFields(optimizedPlan)

  // check that we have keys if the table has changes (is not append-only)
  tableKeys match {
    case Some(keys) => upsertSink.setKeyFields(keys)
    case None if isAppendOnlyTable => upsertSink.setKeyFields(null)
    case None if !isAppendOnlyTable => throw new TableException(
      "UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
  }

  val outputType = sink.getOutputType
  val resultType = getResultType(table.getRelNode, optimizedPlan)
  // translate the Table into a DataStream and provide the type that the TableSink expects.
  val result: DataStream[T] =
    translate(
      optimizedPlan,
      resultType,
      streamQueryConfig,
      withChangeFlag = true)(outputType)
  // Give the DataStream to the TableSink to emit it.
  upsertSink.asInstanceOf[UpsertStreamTableSink[Any]]
    .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])
```
As you can see, if the query is not append-only, the plan must expose unique key fields, otherwise `UpsertStreamTableSink` cannot be used. The unique keys are extracted from the plan; for a non-windowed GroupBy, for example, the grouping fields serve as the key:
```scala
case a: DataStreamGroupAggregate =>
  // get grouping keys
  val groupKeys = a.getRowType.getFieldNames.take(a.getGroupings.length)
  Some(groupKeys.map(e => (e, e)))
```
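The intuition behind that snippet: a grouped aggregate emits at most one row per grouping-key value, so for `GROUP BY user` the `user` field uniquely identifies each result row and can serve as the upsert key. A minimal sketch of the same derivation, with made-up field names:

```scala
object UpsertKeys {
  // A grouped aggregate's result is keyed by its grouping fields, which
  // sit at the front of the output row type (made-up field names below).
  def uniqueKeyFields(allFields: Seq[String], groupingCount: Int): Option[Seq[String]] =
    if (groupingCount > 0) Some(allFields.take(groupingCount)) else None
}
```

A global aggregate with no grouping fields yields `None` — and per the match above, such a non-append-only table without keys is rejected with a `TableException`.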
We won't go into more detail here; interested readers can read the code themselves :)
References
- Dynamic Tables