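A Brief Look at the Different Types of TableSink in Streaming SQL

Today let's look at a point worth noting in streaming SQL: different SQL queries produce different types of output.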

Consider two kinds of query: a GROUP BY with a window, and a GROUP BY without a window. The two produce different types of output:

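GROUP BY with a window: results only ever need to be appended, because time keeps advancing (ignoring here the case where late data forces a correction to results already emitted);
GROUP BY without a window: whenever a new row arrives for an existing group, the result already emitted for that group has to be corrected (this also depends on how long the per-group state is retained).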
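
The difference can be sketched with a small simulation in plain Scala. This is not Flink code: `Event`, `windowedCounts`, and `runningCounts` are hypothetical names, and the input is assumed to arrive in timestamp order with non-negative timestamps and no late data.

```scala
object GroupByOutputs {
  // Hypothetical input record: a grouping key plus an event timestamp.
  final case class Event(key: String, ts: Long)

  // GROUP BY with a tumbling window: one row per (window, key).
  // A closed window is never touched again, so the output is append-only.
  def windowedCounts(events: Seq[Event], windowSize: Long): Seq[(Long, String, Int)] =
    events
      .groupBy(e => (e.ts / windowSize * windowSize, e.key))
      .map { case ((windowStart, key), es) => (windowStart, key, es.size) }
      .toSeq
      .sorted

  // GROUP BY without a window: every input row emits a new count for its key,
  // which supersedes the count emitted earlier for the same key.
  def runningCounts(events: Seq[Event]): Seq[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    events.map { e =>
      counts(e.key) += 1
      (e.key, counts(e.key))
    }
  }
}
```

In the second function, key `a` appears in the output several times with a growing count; a downstream system has to treat each later row as a correction of the earlier one rather than as a new result.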

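There are also several ways to make the correction:

Result data with a primary key: either update the original result in place, or delete the original result and write the new one;
Result data without a primary key: delete the original result and write the new one.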
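
A minimal sketch of the two correction styles, using ordinary Scala collections as a stand-in for an external store. The names `upsert` and `retractAndInsert` are hypothetical and only illustrate the idea.

```scala
object ResultCorrection {
  // Keyed result: the key identifies the row, so the new value overwrites the old one.
  def upsert[K, V](result: Map[K, V], key: K, value: V): Map[K, V] =
    result + (key -> value)

  // Unkeyed result: rows form a multiset, so the old row must be named explicitly.
  // List#diff removes exactly one occurrence of oldRow; the new row is then appended.
  def retractAndInsert[R](result: List[R], oldRow: R, newRow: R): List[R] =
    result.diff(List(oldRow)) :+ newRow
}
```

Without a key, the writer has to carry the full old row so the store knows which copy to delete, which is why the second function takes both rows.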

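Streaming SQL therefore needs different types of TableSink to support different queries. Taking Flink as an example, it has three streaming TableSinks:

AppendStreamTableSink: the output type described above that only ever needs to append;
RetractStreamTableSink: the output type described above that can correct results without a primary key, or correct results with a primary key by deleting the original result and writing the new one;
UpsertStreamTableSink: the output type described above that can correct results with a primary key by updating them in place.

So which queries need which output type? Let's again look at Flink's implementation, StreamTableEnvironment#writeToSink.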

AppendStreamTableSink

    case appendSink: AppendStreamTableSink[_] =>
      // optimize plan
      val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
      // verify table is an insert-only (append-only) table
      if (!UpdatingPlanChecker.isAppendOnly(optimizedPlan)) {
        throw new TableException(
          "AppendStreamTableSink requires that Table has only insert changes.")
      }
      val outputType = sink.getOutputType
      val resultType = getResultType(table.getRelNode, optimizedPlan)
      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          optimizedPlan,
          resultType,
          streamQueryConfig,
          withChangeFlag = false)(outputType)
      // Give the DataStream to the TableSink to emit it.
      appendSink.asInstanceOf[AppendStreamTableSink[T]].emitDataStream(result)

Only a query for which UpdatingPlanChecker#isAppendOnly returns true can use AppendStreamTableSink:

    /** Validates that the plan produces only append changes. */
    def isAppendOnly(plan: RelNode): Boolean = {
      val appendOnlyValidator = new AppendOnlyValidator
      appendOnlyValidator.go(plan)

      appendOnlyValidator.isAppendOnly
    }

    private class AppendOnlyValidator extends RelVisitor {

      var isAppendOnly = true

      override def visit(node: RelNode, ordinal: Int, parent: RelNode): Unit = {
        node match {
          case s: DataStreamRel if s.producesUpdates || s.producesRetractions =>
            isAppendOnly = false
          case _ =>
            super.visit(node, ordinal, parent)
        }
      }
    }

Only DataStreamGroupAggregate#producesUpdates returns true. This is the case described above: the result of a GROUP BY without a window has to be updated, so it is not append-only.

And only DataStreamJoin#producesRetractions can return true:

    // outer join will generate retractions
    override def producesRetractions: Boolean = joinType != JoinRelType.INNER

In other words, the result of an outer join may need to be corrected. This follows from how Flink implements streaming joins, which I won't go into here.
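
The append-only check boils down to a tree walk: the plan is append-only only if no node in it produces updates or retractions. Below is a minimal sketch on a toy plan tree. `PlanNode` is a hypothetical stand-in for Flink's RelNode/DataStreamRel, not the real class.

```scala
object AppendOnlyCheck {
  // Hypothetical plan node carrying the two flags the validator looks at.
  final case class PlanNode(
      name: String,
      producesUpdates: Boolean = false,
      producesRetractions: Boolean = false,
      inputs: List[PlanNode] = Nil)

  // A flagged node makes the whole plan non-append-only; otherwise recurse into the inputs.
  def isAppendOnly(node: PlanNode): Boolean =
    !node.producesUpdates && !node.producesRetractions && node.inputs.forall(isAppendOnly)
}
```

For example, a projection on top of a non-windowed aggregate is rejected even though the projection itself carries no flag, because the walk reaches the aggregate underneath.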

RetractStreamTableSink

    case retractSink: RetractStreamTableSink[_] =>
      // retraction sink can always be used
      val outputType = sink.getOutputType
      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          table,
          streamQueryConfig,
          updatesAsRetraction = true,
          withChangeFlag = true)(outputType)
      // Give the DataStream to the TableSink to emit it.
      retractSink.asInstanceOf[RetractStreamTableSink[Any]]
        .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

As you can see, RetractStreamTableSink has no restrictions; any SQL query can use it.

UpsertStreamTableSink

    case upsertSink: UpsertStreamTableSink[_] =>
      // optimize plan
      val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
      // check for append only table
      val isAppendOnlyTable = UpdatingPlanChecker.isAppendOnly(optimizedPlan)
      upsertSink.setIsAppendOnly(isAppendOnlyTable)
      // extract unique key fields
      val tableKeys: Option[Array[String]] = UpdatingPlanChecker.getUniqueKeyFields(optimizedPlan)
      // check that we have keys if the table has changes (is not append-only)
      tableKeys match {
        case Some(keys) => upsertSink.setKeyFields(keys)
        case None if isAppendOnlyTable => upsertSink.setKeyFields(null)
        case None if !isAppendOnlyTable => throw new TableException(
          "UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
      }
      val outputType = sink.getOutputType
      val resultType = getResultType(table.getRelNode, optimizedPlan)
      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          optimizedPlan,
          resultType,
          streamQueryConfig,
          withChangeFlag = true)(outputType)
      // Give the DataStream to the TableSink to emit it.
      upsertSink.asInstanceOf[UpsertStreamTableSink[Any]]
        .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

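As you can see, a query that is not append-only needs a primary key before it can use UpsertStreamTableSink. The key is obtained through UpdatingPlanChecker#getUniqueKeyFields. For example, for the GROUP BY without a window discussed above, the grouping keys serve as the primary key: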

    case a: DataStreamGroupAggregate =>
      // get grouping keys
      val groupKeys = a.getRowType.getFieldNames.take(a.getGroupings.length)
      Some(groupKeys.map(e => (e, e)))
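
Putting the three branches of writeToSink side by side, the compatibility rules can be summarized in a few lines. This is only a sketch of the decision logic: `SinkKind` and `accepts` are hypothetical names and not part of Flink's API.

```scala
object SinkCompatibility {
  sealed trait SinkKind
  case object Append extends SinkKind
  case object Retract extends SinkKind
  case object Upsert extends SinkKind

  // isAppendOnly: the plan produces neither updates nor retractions.
  // uniqueKeys: the key fields derived from the plan, if any.
  def accepts(sink: SinkKind, isAppendOnly: Boolean, uniqueKeys: Option[Seq[String]]): Boolean =
    sink match {
      case Append  => isAppendOnly                         // insert-only queries only
      case Retract => true                                 // can always be used
      case Upsert  => isAppendOnly || uniqueKeys.isDefined // an updating query needs a key
    }
}
```

Under these rules, a GROUP BY without a window (updating, keyed by its grouping fields) is accepted by the retract and upsert sinks but rejected by the append sink.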

I won't go into further detail here; interested readers can read the code themselves :)

References

Dynamic Tables