TensorFlow Data API - prefetch

I am trying out the new TF Data API, and I am not sure how prefetch works. In the code below:

def dataset_input_fn(...):
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(lambda x: parser(...))
    dataset = dataset.map(lambda x, y: image_augmentation(...),
                          num_parallel_calls=num_threads)

    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()

does it matter where between the lines above I put dataset = dataset.prefetch(batch_size)? Or should it come after every operation that would otherwise use output_buffer_size, as when the dataset comes from tf.contrib.data?


In a discussion on GitHub I found this comment from mrry:

Note that in TF 1.4 there will be a Dataset.prefetch() method that
makes it easier to add prefetching at any point in the pipeline, not
just after a map(). (You can try it by downloading the current nightly
build.)

For example, Dataset.prefetch() will start a background thread to
populate a ordered buffer that acts like a tf.FIFOQueue, so that
downstream pipeline stages need not block. However, the prefetch()
implementation is much simpler, because it doesn't need to support as
many different concurrent operations as a tf.FIFOQueue.

So this means that prefetch can be placed after any command, and it buffers the output of the previous command. So far I have noticed the biggest performance gains by putting it only at the very end of the pipeline.
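To make this concrete, here is a minimal sketch (TF 1.x) of the pipeline with prefetch as the final transformation. parse_fn and the default hyperparameter values are illustrative stand-ins, not part of the original question:

import tensorflow as tf

# Hypothetical stand-in for the question's parser(): one float feature per record.
def parse_fn(record):
    features = tf.parse_single_example(
        record, {"x": tf.FixedLenFeature([], tf.float32)})
    return features["x"]

def dataset_input_fn(filenames, batch_size=32, num_epochs=1,
                     num_threads=4, buffer_size=10000):
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(parse_fn, num_parallel_calls=num_threads)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    # Prefetch last: a background thread keeps a buffer of ready batches,
    # so the training step is not blocked waiting on the input pipeline.
    dataset = dataset.prefetch(buffer_size=1)
    return dataset.make_one_shot_iterator().get_next()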

There is also a discussion, Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle, where mrry explains more about prefetching and buffering.
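As a quick illustration of how buffer_size means different things across these transformations (a sketch; the numbers are arbitrary):

import tensorflow as tf

dataset = tf.data.Dataset.range(100000)
# In shuffle, buffer_size is the size of the in-memory pool that random
# elements are drawn from (a larger pool gives better shuffling).
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
# In prefetch, buffer_size counts elements of the dataset at this point
# in the pipeline, i.e. whole batches here, kept ready ahead of the consumer.
dataset = dataset.prefetch(buffer_size=2)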

Update 2018/10/01:

As of version 1.7.0, the Dataset API (in contrib) has a prefetch_to_device option. Note that this transformation has to be the last one in the pipeline, and contrib will be gone when TF 2.0 arrives. To have prefetching work on multiple GPUs, use MultiDeviceIterator (for an example, see #13610) multi_device_iterator_ops.py.

https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/prefetch_to_device
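A minimal usage sketch (TF 1.7+, contrib API; the device string is an assumption for this example):

import tensorflow as tf

dataset = tf.data.Dataset.range(10).batch(2)
# prefetch_to_device must be the last transformation in the pipeline;
# it stages upcoming elements in GPU memory ahead of the consumer.
dataset = dataset.apply(tf.contrib.data.prefetch_to_device("/gpu:0"))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()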