A Brief Look at the Different Types of TableSink in Streaming SQL

 2020-05-19 

Today we look at a point worth noting in streaming SQL: different SQL queries produce different types of output.

Consider two SQL queries: one is a GROUP BY with a window, the other a GROUP BY without a window.

These two queries produce different types of output:

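  • GROUP BY with a window: results only ever need to be appended, because time keeps advancing (ignoring the case where late data requires correcting results that were already emitted);
  • GROUP BY without a window: whenever new input arrives for an existing group, the result already emitted for that group must be corrected (this also depends on how long the per-group state is retained);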
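
The difference between the two cases can be illustrated with a small simulation in plain Scala (no Flink dependency). This is only a sketch: the `Click` type, the `clicks` table and its columns, and the one-minute window size are assumptions made for illustration, and watermarks and late data are ignored.

```scala
object GroupByOutputs {
  // Hypothetical input event; field names are illustrative only.
  final case class Click(userId: String, ts: Long)

  // Windowed GROUP BY, roughly (hypothetical table and columns):
  //   SELECT userId, TUMBLE_END(rowtime, INTERVAL '1' MINUTE), COUNT(*)
  //   FROM clicks GROUP BY userId, TUMBLE(rowtime, INTERVAL '1' MINUTE)
  // Each (windowEnd, userId) result is emitted once, so the output is append-only.
  def windowedCounts(clicks: Seq[Click], windowMillis: Long): Seq[(Long, String, Long)] =
    clicks
      .groupBy(c => (c.ts / windowMillis * windowMillis + windowMillis, c.userId))
      .map { case ((windowEnd, user), cs) => (windowEnd, user, cs.size.toLong) }
      .toSeq
      .sorted

  // Non-windowed GROUP BY, roughly:
  //   SELECT userId, COUNT(*) FROM clicks GROUP BY userId
  // Every input row changes the running count of its group, so each emission
  // for an already-seen key revises a result that was emitted earlier.
  def runningCounts(clicks: Seq[Click]): Seq[(String, Long)] = {
    val counts = scala.collection.mutable.Map.empty[String, Long]
    clicks.map { c =>
      val n = counts.getOrElse(c.userId, 0L) + 1
      counts(c.userId) = n
      (c.userId, n)
    }
  }
}
```

In the windowed case each key appears once per window in the output; in the non-windowed case the same key appears again with a new count every time a row arrives for it.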

There are also several ways to make such a correction:

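  • for results with a primary key, either update the original result in place, or delete the original result and write the new one;
  • for results without a primary key, delete the original result and write the new one;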

Streaming SQL therefore needs different types of TableSink to support different queries. Taking Flink as an example, it has three streaming TableSinks:

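  • AppendStreamTableSink, for the output type described above that only ever needs to append;
  • RetractStreamTableSink, for the output type described above that can correct results without a primary key, or correct results with a primary key by deleting the original result and writing the new one;
  • UpsertStreamTableSink, for the output type described above that can correct results with a primary key by updating them in place;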
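
To make the difference between the last two sink types concrete, the sketch below encodes the same sequence of result updates in both styles. It is plain Scala rather than Flink code, and the `Row` shape and function names are assumptions. The Boolean flag follows the convention of the `JTuple2[JBool, Any]` records these sinks receive: for a retract stream, `true` adds a row and `false` retracts one; for an upsert stream, `true` upserts by key (deletes, flagged `false`, are not modeled here).

```scala
object ChangeEncodings {
  // Hypothetical row shape: (key, count).
  type Row = (String, Long)

  // Retract style: an update becomes (false, oldRow) followed by (true, newRow).
  def asRetractions(updates: Seq[Row]): Seq[(Boolean, Row)] = {
    val last = scala.collection.mutable.Map.empty[String, Row]
    updates.flatMap { row =>
      // Retract the previous result for this key, if there was one.
      val retract = last.get(row._1).map(old => (false, old)).toSeq
      last(row._1) = row
      retract :+ ((true, row))
    }
  }

  // Upsert style: an update is a single (true, newRow); the sink overwrites by key.
  def asUpserts(updates: Seq[Row]): Seq[(Boolean, Row)] =
    updates.map(row => (true, row))
}
```

For the updates `("a", 1)` then `("a", 2)`, the retract encoding produces three messages (add, retract, add) while the upsert encoding produces two, which is why the upsert style needs a key to know which row to overwrite.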

So which queries need which output type? Again, let's look at Flink's implementation, StreamTableEnvironment#writeToSink.

AppendStreamTableSink

      case appendSink: AppendStreamTableSink[_] =>
        // optimize plan
        val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
        // verify table is an insert-only (append-only) table
        if (!UpdatingPlanChecker.isAppendOnly(optimizedPlan)) {
          throw new TableException(
            "AppendStreamTableSink requires that Table has only insert changes.")
        }
        val outputType = sink.getOutputType
        val resultType = getResultType(table.getRelNode, optimizedPlan)
        // translate the Table into a DataStream and provide the type that the TableSink expects.
        val result: DataStream[T] =
          translate(
            optimizedPlan,
            resultType,
            streamQueryConfig,
            withChangeFlag = false)(outputType)
        // Give the DataStream to the TableSink to emit it.
        appendSink.asInstanceOf[AppendStreamTableSink[T]].emitDataStream(result)

Only queries for which UpdatingPlanChecker#isAppendOnly returns true can use AppendStreamTableSink.

  /** Validates that the plan produces only append changes. */
  def isAppendOnly(plan: RelNode): Boolean = {
    val appendOnlyValidator = new AppendOnlyValidator
    appendOnlyValidator.go(plan)

    appendOnlyValidator.isAppendOnly
  }

  private class AppendOnlyValidator extends RelVisitor {

    var isAppendOnly = true

    override def visit(node: RelNode, ordinal: Int, parent: RelNode): Unit = {
      node match {
        case s: DataStreamRel if s.producesUpdates || s.producesRetractions =>
          isAppendOnly = false
        case _ =>
          super.visit(node, ordinal, parent)
      }
    }
  }

Only DataStreamGroupAggregate#producesUpdates is true. In other words, as described above, the results of a GROUP BY without a window need to be updated and are not append-only.

And only DataStreamJoin#producesRetractions can possibly be true:

  // outer join will generate retractions
  override def producesRetractions: Boolean = joinType != JoinRelType.INNER

That is, the results of an outer join may need to be corrected. This follows from how Flink implements streaming joins, which is not covered here.
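
The check described above can be reproduced on a toy plan tree. The sketch below is not Flink's RelNode hierarchy: the `Node` types are hypothetical stand-ins, and only the rule itself (a plan is append-only when no node in it produces updates or retractions) is taken from the code shown above.

```scala
object AppendOnlyCheck {
  // Hypothetical, simplified plan nodes; not Flink's RelNode classes.
  sealed trait Node {
    def inputs: Seq[Node]
    def producesUpdates: Boolean = false
    def producesRetractions: Boolean = false
  }
  final case class Scan(name: String) extends Node {
    val inputs: Seq[Node] = Nil
  }
  final case class WindowAggregate(input: Node) extends Node {
    val inputs: Seq[Node] = Seq(input)
  }
  final case class GroupAggregate(input: Node) extends Node {
    val inputs: Seq[Node] = Seq(input)
    // A non-windowed aggregate revises results it has already emitted.
    override def producesUpdates: Boolean = true
  }
  final case class Join(left: Node, right: Node, inner: Boolean) extends Node {
    val inputs: Seq[Node] = Seq(left, right)
    // Mirrors `joinType != JoinRelType.INNER`.
    override def producesRetractions: Boolean = !inner
  }

  // A plan is append-only if neither the node nor any of its inputs
  // produces updates or retractions.
  def isAppendOnly(plan: Node): Boolean =
    !(plan.producesUpdates || plan.producesRetractions) &&
      plan.inputs.forall(isAppendOnly)
}
```

With these definitions, `isAppendOnly(WindowAggregate(Scan("clicks")))` is true, while wrapping the same scan in `GroupAggregate` or in a non-inner `Join` makes it false.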

RetractStreamTableSink

      case retractSink: RetractStreamTableSink[_] =>
        // retraction sink can always be used
        val outputType = sink.getOutputType
        // translate the Table into a DataStream and provide the type that the TableSink expects.
        val result: DataStream[T] =
          translate(
            table,
            streamQueryConfig,
            updatesAsRetraction = true,
            withChangeFlag = true)(outputType)
        // Give the DataStream to the TableSink to emit it.
        retractSink.asInstanceOf[RetractStreamTableSink[Any]]
          .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

As the code shows, RetractStreamTableSink has no restrictions: any query can use it.
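
One way to see why no key is required is to look at how a consumer could apply retract messages. The sketch below is a plain Scala illustration, not part of Flink; `materialize` is a hypothetical helper. It treats the result as a multiset of rows: a `true` message adds one copy of the row and a `false` message removes one copy, so rows are matched by their full content rather than by a key.

```scala
object RetractMaterializer {
  // Applies (flag, row) messages to a multiset of rows, kept as row -> count.
  def materialize[T](messages: Seq[(Boolean, T)]): Map[T, Int] =
    messages.foldLeft(Map.empty[T, Int]) {
      // true: add one copy of the row
      case (acc, (true, row)) =>
        acc.updated(row, acc.getOrElse(row, 0) + 1)
      // false: remove one copy of the row, dropping the entry when none remain
      case (acc, (false, row)) =>
        acc.get(row) match {
          case Some(n) if n > 1 => acc.updated(row, n - 1)
          case _                => acc - row
        }
    }
}
```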

UpsertStreamTableSink

      case upsertSink: UpsertStreamTableSink[_] =>
        // optimize plan
        val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)
        // check for append only table
        val isAppendOnlyTable = UpdatingPlanChecker.isAppendOnly(optimizedPlan)
        upsertSink.setIsAppendOnly(isAppendOnlyTable)
        // extract unique key fields
        val tableKeys: Option[Array[String]] = UpdatingPlanChecker.getUniqueKeyFields(optimizedPlan)
        // check that we have keys if the table has changes (is not append-only)
        tableKeys match {
          case Some(keys) => upsertSink.setKeyFields(keys)
          case None if isAppendOnlyTable => upsertSink.setKeyFields(null)
          case None if !isAppendOnlyTable => throw new TableException(
            "UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
        }
        val outputType = sink.getOutputType
        val resultType = getResultType(table.getRelNode, optimizedPlan)
        // translate the Table into a DataStream and provide the type that the TableSink expects.
        val result: DataStream[T] =
          translate(
            optimizedPlan,
            resultType,
            streamQueryConfig,
            withChangeFlag = true)(outputType)
        // Give the DataStream to the TableSink to emit it.
        upsertSink.asInstanceOf[UpsertStreamTableSink[Any]]
          .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

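As the code shows, a query that is not append-only needs a primary key in order to use UpsertStreamTableSink, and the key is obtained through UpdatingPlanChecker#getUniqueKeyFields. For example, for the GROUP BY without a window described above, the grouping key is the primary key: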

        case a: DataStreamGroupAggregate =>
          // get grouping keys
          val groupKeys = a.getRowType.getFieldNames.take(a.getGroupings.length)
          Some(groupKeys.map(e => (e, e)))

The details are not covered here; interested readers can read the code themselves :)
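
Taken together, the three branches of writeToSink amount to a small decision rule. The following plain Scala sketch condenses it; `SinkKind` and `validate` are hypothetical names, and the two inputs stand in for the results of UpdatingPlanChecker#isAppendOnly and UpdatingPlanChecker#getUniqueKeyFields. The error messages are copied from the code excerpts above.

```scala
object SinkCheck {
  // Hypothetical tags for the three sink types discussed above.
  sealed trait SinkKind
  case object Append extends SinkKind
  case object Retract extends SinkKind
  case object Upsert extends SinkKind

  // Returns Right(()) if the query may be written to the sink, Left(reason) otherwise.
  def validate(
      sink: SinkKind,
      isAppendOnly: Boolean,
      uniqueKeys: Option[Seq[String]]): Either[String, Unit] =
    sink match {
      // An append sink accepts only insert-only queries.
      case Append if !isAppendOnly =>
        Left("AppendStreamTableSink requires that Table has only insert changes.")
      // An upsert sink needs a key whenever the query produces updates.
      case Upsert if !isAppendOnly && uniqueKeys.isEmpty =>
        Left("UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
      // A retract sink can always be used; so can the remaining combinations.
      case _ => Right(())
    }
}
```

For instance, a non-windowed GROUP BY (not append-only, keyed by its grouping fields) is rejected by `Append` but accepted by both `Retract` and `Upsert`.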

References

  • Dynamic Tables