Incremental data loading and querying in PySpark without restarting the Spark job

Hi everyone, I want to do incremental data querying.

  df = spark.read.csv('csvFile', header=True)  # 1000 rows
  df.persist()  # assume this takes 5 min
  df.createOrReplaceTempView('data_table')  # registerTempTable is deprecated
  result = spark.sql('select * from data_table where column1 > 10')  # 100 rows

  df_incremental = spark.read.csv('incremental.csv', header=True)  # 200 rows
  df_combined = df.union(df_incremental)  # union replaces the deprecated unionAll
  df_combined.persist()  # takes more than 5 min; I want to avoid this because other queries might be running at this time (one workaround is sketched after the list below)
  df_combined.createOrReplaceTempView('data_table')
  result = spark.sql('select * from data_table where column1 > 10')  # 105 rows
  • Read csv/mysql table data into a Spark dataframe.

  • Keep that dataframe in memory only (reason: I need the performance).
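
One way to avoid re-persisting the full union (a sketch, not from the original post) is to cache only the increment; Spark serves each branch of a union from its own cache, so only the 200 new rows have to be materialized:

  # Sketch: cache only the new rows instead of the whole union.
  df_incremental = spark.read.csv('incremental.csv', header=True)
  df_incremental.persist()
  df_incremental.count()  # force the small cache to materialize now

  # The union scans both existing caches; no full re-read of the base data.
  df_combined = df.union(df_incremental)
  df_combined.createOrReplaceTempView('data_table')
  result = spark.sql('select * from data_table where column1 > 10')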


    Trying streaming would be faster, since the session is already running and it is triggered every time you drop something into the folder:

    df_incremental = spark \
        .readStream \
        .option("sep", ",") \
        .schema(input_schema) \
        .csv(input_path)

    df_incremental.where("column1 > 10") \
        .writeStream \
        .queryName("data_table") \
        .format("memory") \
        .start()

    spark.sql("SELECT * FROM data_table).show()