Basic tf.data API usage (tf.data.Dataset.from_tensor_slices(), repeat, batch, interleave)

This post introduces the basic usage of the tf.data API.

1. tf.data.Dataset.from_tensor_slices:

@staticmethod
from_tensor_slices(
    tensors
)

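Creates a dataset whose elements are slices of the given tensors.

Note that if the tensors contain NumPy arrays and eager execution is not enabled, the values are embedded in the graph as one or more tf.constant operations. For large datasets (> 1 GB) this wastes memory and can hit the byte limit for graph serialization. If the tensors contain one or more large NumPy arrays, consider the alternative described in the TensorFlow guide.

Parameters:

tensors: the dataset elements; every component must have the same size along dimension 0. (For an (x, y) pair, the data and the labels must contain the same number of elements.)

Returns:

Dataset: a dataset.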
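To make "slices along dimension 0" concrete, here is a minimal sketch that contrasts from_tensor_slices with tf.data.Dataset.from_tensors. The array names and values are made up for illustration, and the snippet assumes TensorFlow 2.x with eager execution.

```python
import numpy as np
import tensorflow as tf

# Illustrative data: 3 samples, so both arrays have size 3 along dimension 0.
features = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
labels = np.array([0, 1, 0])                   # shape (3,)

# from_tensor_slices cuts every component along dimension 0,
# giving 3 elements, each a (feature_row, label) pair.
sliced = tf.data.Dataset.from_tensor_slices((features, labels))
sliced_elements = [(f.numpy().tolist(), int(l.numpy())) for f, l in sliced]
print(sliced_elements)

# from_tensors keeps the whole tuple as one single element.
whole = tf.data.Dataset.from_tensors((features, labels))
whole_count = len(list(whole))
print(whole_count)
```

The number of elements in the sliced dataset equals the shared size of dimension 0, which is why every component has to agree on it.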

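Example:

The rest of this section focuses on how to use from_tensor_slices. tf.data.Dataset.from_tensor_slices builds a dataset from data held in memory; the argument can be a list, a NumPy array, a tuple, or a dictionary.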

(1) Initializing from a NumPy array (as the input argument)

import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
print(dataset)
np.arange(10)

<TensorSliceDataset shapes: (), types: tf.int32>

Out[12]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Iterate over the dataset
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)

(2) Initializing from a tuple (as the input argument)

# Tuple
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset3 = tf.data.Dataset.from_tensor_slices((x, y))
print(dataset3)

for item_x, item_y in dataset3:
    print(item_x.numpy(), item_y.numpy())

<TensorSliceDataset shapes: ((2,), ()), types: (tf.int32, tf.string)>
[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'

(3) Initializing from a dictionary (as the input argument)

# Dictionary
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset4 = tf.data.Dataset.from_tensor_slices({"feature": x,
                                               "label": y})
for item in dataset4:
    print(item["feature"].numpy(), item["label"].numpy())

[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'
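A plain Python list is also accepted as input, alongside the NumPy array, tuple, and dictionary cases above. The short sketch below assumes TensorFlow 2.x with eager execution, and the list contents are made up for illustration.

```python
import tensorflow as tf

# A nested Python list is converted to a tensor first,
# then sliced along dimension 0 like any other input.
list_dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])

list_rows = [item.numpy().tolist() for item in list_dataset]
print(list_rows)
```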

2. Using the repeat and batch methods

repeat:

repeat(
    count=None
)

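Repeats the dataset count times.

Parameters:

count: (optional) a tf.int64 scalar tf.Tensor giving the number of times the dataset should be repeated. By default (count is None or -1) the dataset is repeated indefinitely.

Returns:

Dataset: a dataset.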
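The difference between a finite count and the default can be sketched as follows. This assumes TensorFlow 2.x; Dataset.take is not covered in this post and is used here only to cut the endless dataset off after a few elements.

```python
import tensorflow as tf

base = tf.data.Dataset.range(3)  # 0, 1, 2

# A finite count replays the dataset that many times.
twice = [int(x) for x in base.repeat(2).as_numpy_iterator()]
print(twice)

# With no count the dataset never ends, so take(7) limits the loop.
endless_head = [int(x) for x in base.repeat().take(7).as_numpy_iterator()]
print(endless_head)
```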

batch:

batch(
    batch_size, drop_remainder=False
)

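Parameters:

batch_size: a tf.int64 scalar tf.Tensor giving the number of consecutive elements of this dataset to combine into a single batch.

drop_remainder: (optional) a tf.bool scalar tf.Tensor indicating whether the last batch should be dropped when it has fewer than batch_size elements; by default the smaller batch is kept.

Returns:

Dataset: a dataset.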
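A small sketch of what drop_remainder changes, assuming TensorFlow 2.x. The dataset of ten integers and the batch size of 4 are arbitrary choices for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)  # 0 .. 9

# Default: the final, smaller batch is kept.
kept = [b.tolist() for b in ds.batch(4).as_numpy_iterator()]
print(kept)

# drop_remainder=True discards the final batch because it has only 2 elements.
dropped = [b.tolist() for b in ds.batch(4, drop_remainder=True).as_numpy_iterator()]
print(dropped)
```

Dropping the remainder is useful when later stages need every batch to have exactly the same shape.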

Example:

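# repeat(3) yields 30 elements; with no argument the dataset repeats forever
# and the loop below would never finish.
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)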

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
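In the output above, batches such as [7 8 9 0 1 2 3] straddle the boundary between two passes over the data, because repeat is applied before batch. The following sketch (TensorFlow 2.x assumed, with a repeat count of 2 chosen only to keep the output short) compares the two orderings.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)  # 0 .. 9

# repeat before batch: batches may mix the end of one pass with the start of the next.
repeat_then_batch = [b.tolist() for b in ds.repeat(2).batch(7).as_numpy_iterator()]
print(repeat_then_batch)

# batch before repeat: every pass is batched on its own, so passes never mix.
batch_then_repeat = [b.tolist() for b in ds.batch(7).repeat(2).as_numpy_iterator()]
print(batch_then_repeat)
```

Which order is preferable depends on whether clean per-pass boundaries matter for the training loop.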

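3. interleave

Idea: apply a transformation to every element of an existing dataset. Each element produces a new result (itself a dataset), and interleave merges these results into a single new dataset.
Typical use case: a dataset of file names -> a dataset of the records stored in those files.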

interleave(
    map_func, cycle_length=AUTOTUNE, block_length=1, num_parallel_calls=None
)

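Maps map_func across this dataset and interleaves the results. For example, Dataset.interleave() can be used to process many input files concurrently: the code below preprocesses 4 files at a time and interleaves blocks of 16 records from each file.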

# Preprocess 4 files concurrently, and interleave blocks of 16 records from
# each file.
filenames = ["/var/data/file1.txt", "/var/data/file2.txt", ...]
dataset = (Dataset.from_tensor_slices(filenames)
           .interleave(lambda x:
               TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
               cycle_length=4, block_length=16))
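The snippet above leaves the file list and parse_fn unspecified, so it cannot be run as written. Below is a self-contained sketch of the same file-name-to-records pattern. It assumes TensorFlow 2.x; the two temporary files and their contents are invented for the demonstration, and no parsing step is applied.

```python
import os
import tempfile

import tensorflow as tf

# Create two small text files (hypothetical data, three lines each).
tmp_dir = tempfile.mkdtemp()
filenames = []
for name in ("a", "b"):
    path = os.path.join(tmp_dir, name + ".txt")
    with open(path, "w") as f:
        f.write("\n".join(f"{name}{i}" for i in range(3)) + "\n")
    filenames.append(path)

# Dataset of file names -> dataset of lines, alternating one line per file.
dataset = tf.data.Dataset.from_tensor_slices(filenames).interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=2,   # read from both files in the same cycle
    block_length=1,   # take one line from a file before moving to the next
)

lines = [line.numpy().decode() for line in dataset]
print(lines)
```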

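The cycle_length and block_length arguments control the order in which elements are produced. cycle_length controls the number of input elements that are processed concurrently. If cycle_length is set to 1, the transformation handles one input element at a time and produces the same result as tf.data.Dataset.flat_map. In general, the transformation applies map_func to cycle_length input elements, opens iterators on the returned Dataset objects, and cycles through them, producing block_length consecutive elements from each iterator and consuming the next input element each time it reaches the end of an iterator.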
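The claim that cycle_length=1 matches flat_map can be checked with a short sketch. It assumes TensorFlow 2.x; the input range and the helper name expand are arbitrary choices for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(1, 4)  # 1, 2, 3


def expand(x):
    # Hypothetical map_func: turn each input element into a dataset of two copies.
    return tf.data.Dataset.from_tensors(x).repeat(2)


via_interleave = [int(v) for v in ds.interleave(expand, cycle_length=1).as_numpy_iterator()]
via_flat_map = [int(v) for v in ds.flat_map(expand).as_numpy_iterator()]

print(via_interleave)
print(via_flat_map)
```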

Case:

a = Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]

# NOTE: New lines indicate "block" boundaries.
a.interleave(lambda x: Dataset.from_tensors(x).repeat(6),
             cycle_length=2, block_length=4)  # ==> [1, 1, 1, 1,
                                              #      2, 2, 2, 2,
                                              #      1, 1,
                                              #      2, 2,
                                              #      3, 3, 3, 3,
                                              #      4, 4, 4, 4,
                                              #      3, 3,
                                              #      4, 4,
                                              #      5, 5, 5, 5,
                                              #      5, 5]
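The ordering in this case can be reproduced with a small pure-Python model of the cycling rule. This is only an illustrative sketch of the sequential ordering and not TensorFlow's implementation; the function name interleave_order is made up, and plain lists stand in for datasets.

```python
def interleave_order(inputs, map_func, cycle_length, block_length):
    """Model of the element order produced by a sequential interleave."""
    source = iter(inputs)
    slots = [None] * cycle_length  # one open iterator per position in the cycle
    open_count = 0
    source_done = False
    pos = 0      # current position in the cycle
    taken = 0    # elements taken from the current slot in this block
    out = []
    while open_count > 0 or not source_done:
        if slots[pos] is not None:
            try:
                out.append(next(slots[pos]))
            except StopIteration:
                # Iterator exhausted: free the slot and move to the next position.
                slots[pos] = None
                open_count -= 1
                pos, taken = (pos + 1) % cycle_length, 0
                continue
            taken += 1
            if taken == block_length:
                pos, taken = (pos + 1) % cycle_length, 0
        elif not source_done:
            # Empty slot: open an iterator on the next input element, if any.
            try:
                slots[pos] = iter(map_func(next(source)))
                open_count += 1
            except StopIteration:
                source_done = True
        else:
            pos, taken = (pos + 1) % cycle_length, 0
    return out


order = interleave_order(range(1, 6), lambda x: [x] * 6,
                         cycle_length=2, block_length=4)
print(order)
```

Each input element contributes blocks of at most block_length values, and a freed slot is refilled with the next input element only when the cycle returns to it.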

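Note: as long as map_func is a pure function, the order of the elements produced by this transformation is deterministic. If map_func contains any stateful operations, the order in which that state is accessed is undefined.

Parameters:

map_func: a function that maps a dataset element to a dataset.
cycle_length: (optional) the number of input elements that are processed concurrently. If not specified, the value is derived from the number of available CPU cores. If num_parallel_calls is set to tf.data.experimental.AUTOTUNE, cycle_length specifies the maximum degree of parallelism.
block_length: (optional) the number of consecutive elements to produce from each input element before cycling to another input element.
num_parallel_calls: (optional) if specified, the implementation creates a thread pool that fetches inputs from the cycle elements asynchronously and in parallel. By default, inputs are fetched from the cycle elements synchronously, with no parallelism. If the value tf.data.experimental.AUTOTUNE is used, the number of parallel calls is set dynamically based on the available CPU.

Returns:

Dataset: a dataset.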
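As a rough sketch of the num_parallel_calls parameter, the snippet below lets tf.data choose the degree of parallelism. It assumes TensorFlow 2.x, where tf.data.experimental.AUTOTUNE is available; the input range and repeat count are arbitrary. Only the set of produced values is inspected here, not their order.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(4)  # 0, 1, 2, 3

parallel = ds.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(3),
    cycle_length=2,
    # Let tf.data decide how many parallel calls to use.
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

values = sorted(int(v) for v in parallel.as_numpy_iterator())
print(values)
```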

Example:

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset = dataset.repeat(3).batch(7)

dataset2 = dataset.interleave(
    lambda v: tf.data.Dataset.from_tensor_slices(v),  # 1. map_func: the transformation applied to each element
    cycle_length=5,  # 2. cycle_length: degree of concurrency, i.e. how many elements of dataset are processed at once
    block_length=5,  # 3. block_length: how many elements to take at a time from each map_func result, which mixes the results evenly
)
for item in dataset2:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
