BigQuery Storage API: Is it possible to stream / save AVRO files directly to Google Cloud Storage?

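I want to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, the BigQuery Storage API (beta) should be the preferred approach, because of the export size quotas associated with other methods (e.g., ExtractBytesPerDay).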

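The table is date-partitioned, with each partition taking up about 300 GB. I have a Python AI Notebook running on GCP that processes the partitions (in parallel) using this script, adapted from the documentation.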


from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
)  # I am using my private table instead of this one.

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

# Placeholder: the project that is billed for the read session.
project_id = "your-project-id"
parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)

# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)

Is it possible to save data from the stream directly to Google Cloud Storage?

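I tried saving the tables as AVRO files to my AI instance with fastavro and then uploading them to GCS with Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I tried Blob.upload_from_file, but couldn't figure it out.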

I cannot decode the whole stream into memory and use Blob.upload_from_string, because I don't have over 300 GB of RAM.

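I spent the last two days going through the GCP documentation but couldn't find anything, so if this is at all possible I would really appreciate your help, preferably with a code snippet. (If another file format is easier to work with, I am all for it.)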

Thank you!

Is it possible to save data from the stream directly to Google Cloud Storage?

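The BigQuery Storage API cannot write directly to GCS on its own; you need to pair the API with code that parses the data, writes it to local storage, and then uploads it to GCS. This could be code you write by hand, or code from a framework of some kind.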
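As a rough illustration of the "code you write by hand" option, the sketch below decodes one read stream and re-encodes it into an Avro container file written through a file-like GCS object, so the partition never has to fit in RAM or on local disk. It is a sketch under assumptions, not a tested export pipeline: the function names `avro_object_name` and `export_stream_to_gcs` are made up for this example, `bucket` is assumed to be a `google.cloud.storage.Bucket`, and it assumes a google-cloud-storage release that provides `Blob.open()` with the `ignore_flush` option. Decoding and re-encoding every row in Python is CPU-bound, so this may still be slow for a single stream.

```python
import json


def avro_object_name(prefix: str, partition: str, stream_index: int) -> str:
    """Build a deterministic object name, one Avro file per read stream."""
    return f"{prefix.rstrip('/')}/{partition}/stream-{stream_index:05d}.avro"


def export_stream_to_gcs(read_client, session, stream_name, bucket, object_name):
    """Decode one read stream and write it to GCS as an Avro container file.

    Rows are pulled lazily, so memory use is bounded by fastavro's block
    buffer and the upload chunk size rather than by the partition size.
    """
    # Third-party import kept local so the naming helper works without it.
    import fastavro

    # The read session carries the table's Avro schema as a JSON string.
    schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

    # rows() yields one Python dict per row as blocks arrive from the stream.
    rows = read_client.read_rows(stream_name).rows(session)

    blob = bucket.blob(object_name)
    # Blob.open("wb") returns a writable file object backed by a chunked
    # upload. ignore_flush=True is an assumption: fastavro calls flush() on
    # its sink, which the GCS writer may otherwise reject mid-upload.
    with blob.open("wb", ignore_flush=True) as sink:
        fastavro.writer(sink, schema, rows, codec="deflate")
```

A call might look like `export_stream_to_gcs(client, session, session.streams[0].name, storage.Client().bucket("my-bucket"), avro_object_name("exports", "20200101", 0))`, where the bucket name and prefix are placeholders.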

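The code snippet you shared appears to process each partition in a single-threaded fashion, which limits throughput to that of a single read stream. The Storage API is designed to achieve high throughput through parallelism, so it works well with parallel processing frameworks such as Google Cloud Dataflow or Apache Spark. If you want to use Dataflow, you can start from a Google-provided template. For Spark, you can use the code snippet David has already shared.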
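If a full framework is more than you need, the same idea can be approximated inside one notebook by requesting several streams per session (a `max_stream_count` greater than 1) and handing each stream to its own worker. The helper below is a minimal sketch of that fan-out; `run_per_stream` and the `worker` callback are hypothetical names, and the worker is whatever function reads one stream and uploads its output. Because row decoding is CPU-bound, threads may be limited by the GIL, so a process pool or one of the frameworks above is likely to scale better.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_per_stream(stream_names, worker, max_workers=8):
    """Call worker(index, stream_name) for each stream concurrently.

    Returns a dict mapping each stream name to its worker's result.
    An exception raised by any worker is re-raised here.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(worker, index, name): name
            for index, name in enumerate(stream_names)
        }
        for future in as_completed(futures):
            # future.result() propagates the worker's exception, if any.
            results[futures[future]] = future.result()
    return results
```

With a real session, `stream_names` would be `[s.name for s in session.streams]`, and the index passed to the worker can be used to give each stream its own output object.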