Incremental data loading and querying in PySpark without restarting the Spark job

Hi everyone, I want to do incremental data querying.

  df = spark.read.csv('csvFile', header=True)  # 1000 rows
  df.persist()  # assume this takes 5 min
  df.createOrReplaceTempView('data_table')  # registerTempTable is deprecated
  result = spark.sql('select * from data_table where column1 > 10')  # 100 rows

  df_incremental = spark.read.csv('incremental.csv', header=True)  # 200 rows
  df_combined = df.union(df_incremental)  # union replaces the deprecated unionAll
  df_combined.persist()  # takes more than 5 min; I want to avoid this because other queries might be running at this time (one workaround is sketched after the list below)
  df_combined.createOrReplaceTempView('data_table')
  result = spark.sql('select * from data_table where column1 > 10')  # 105 rows
  • Read csv/mysql table data into a Spark dataframe.

  • Keep that dataframe in memory only (reason: I need the performance).
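
One way to avoid re-persisting the full union (a sketch, not from the original post) is to cache only the increment; Spark serves each branch of a union from its own cache, so only the 200 new rows have to be materialized:

  # Sketch: cache only the new rows instead of the whole union.
  df_incremental = spark.read.csv('incremental.csv', header=True)
  df_incremental.persist()
  df_incremental.count()  # force the small cache to materialize now

  # The union scans both existing caches; no full re-read of the base data.
  df_combined = df.union(df_incremental)
  df_combined.createOrReplaceTempView('data_table')
  result = spark.sql('select * from data_table where column1 > 10')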


    Trying streaming would be faster, since the session is already running and it is triggered every time you drop something into the folder:

    df_incremental = spark \
        .readStream \
        .option("sep", ",") \
        .schema(input_schema) \
        .csv(input_path)

    df_incremental.where("column1 > 10") \
        .writeStream \
        .queryName("data_table") \
        .format("memory") \
        .start()

    spark.sql("SELECT * FROM data_table).show()