注意:这里的batch指的是mini-batch
两种实现序列(文本、日志)批处理的方法
- 固定长度的序列(uniform length sequences in batches)
所有batch内序列的长度一样。比如seqs = [[1,2,3,3,4,5,6,7], [1,2,3], [2,4,1,2,3], [1,2,4,1]]
batch_size = 2
那么最大序列长度取8,如果不足8用0填充到该长度
1 2 | batch1 = [[1, 2, 3, 3, 4, 5, 6, 7], [1, 2, 3, 0, 0, 0, 0, 0]], batch2 = [[2, 4, 1, 2, 3, 0, 0, 0], [1, 2, 4, 1, 0, 0, 0, 0]] |
- 变长的序列(variable length sequences in batches)
每个batch的序列长度一致,不同batch之间的序列的长度可能不同。比如上面的例子,如果是变长的,那么先对序列长度排序,再按照每个batch内序列最大长度padding
1 2 | batch1 = [[1, 2, 3, 0], [1, 2, 4, 1]]。 # len = 4 batch2 = [[2, 4, 1, 2, 3, 0, 0, 0], [1, 2, 3, 3, 4, 5, 6, 7]] #len = 8 |
为什么要采用变长的序列呢?
如果训练数据中有非常短的序列,那么用一个统一的长度padding,会造成数据过于稀疏。有可能影响训练时间,以及模型的预测效果。
pytorch中如何实现变长的序列?
答:dynamical padding(动态填充)
根据前面的思路,实现动态填充主要两步
- 先根据序列长度排序
- 在每个batch里,选择序列最大长度,或者四分之三分为点的长度(防止极端情况,最大值非常的大)作为该batch的固定长度。
collate_fn 参数
collate_fn是DataLoader的一个属性,用来处理批次数据,官网介绍
1 2 3 4 | DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None) |
代码实现
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 | from torch.utils.data import Dataset, DataLoader import torch import numpy as np class MyDataset(Dataset): def __init__(self, seq, label): self.seq = seq self.label = label def __len__(self): return len(self.label) def __getitem__(self, index): return self.seq[index], self.label[index] def collate_fn(batch): """ args: batch: [[input_vector, label_vector] for seq in batch] return: [[output_vector]] * batch_size, [[label]]*batch_szie """ percentile = 100 dynamical_pad = True max_len = 50 pad_index = 0 lens = [len(dat[0]) for dat in batch] # find the max len in each batch if dynamical_pad: # dynamical padding seq_len = min(int(np.percentile(lens, percentile)), max_len) # or seq_len = max(lens) else: # fixed length padding seq_len = max_len print("collate_fn seq_len", seq_len) output = [] out_label = [] for dat in batch: seq = dat[0][:seq_len] label = dat[1][:seq_len] padding = [pad_index for _ in range(seq_len - len(seq))] seq.extend(padding) label.extend(padding) output.append(seq) out_label.append(label) output = torch.tensor(output, dtype=torch.long) out_label = torch.tensor(out_label, dtype=torch.long) return output, out_label batch_size = 2 seqs = np.array([[1,0,3,3,4,5,6,0], [1,0,3], [2,4,0,2,3], [1,2,0,1]]) label = np.array([[0,2,0,0,0,0,0,7], [0,2,0], [0,0,1,0,0], [0,0,4,0]]) lens = np.array(list(map(len, seqs))) len_index = np.argsort(-1 * lens) seqs = seqs[len_index] label = label[len_index] mydataset = MyDataset(seqs, label) dl = DataLoader(mydataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn) for batch in dl: print(batch) |
代码参考[1]
如果你用到了rnn-based的模型,那么可以了解一下pack_padded_sequence和 pad_packed_sequence这两个函数,作用是不将padding的值传入到模型训练。
pack_padded_sequence将填充过的模型压紧,然后将数据传入到模型训练。模型的结果用pad_packed_sequence恢复原来的维度
简单实现
1 2 3 4 5 6 7 8 | #pack_padded_sequence so that padded items in the sequence won't be shown to the LSTM X = torch.nn.utils.rnn.pack_padded_sequence(x, X_lengths, batch_first=True) # now run through LSTM X, self.hidden = self.lstm(X, self.hidden) # undo the packing operation X, _ = torch.nn.utils.rnn.pad_packed_sequence(X, batch_first=True) |
中文参考见[2]
英文参考见[3],里面还提到用了padding怎么计算loss,推荐一看。pytorch自带的一些loss 函数比如NLLLoss,可以指定忽略padding 值。
参考资料
[1]https://www.kaggle.com/evilpsycho42/pytorch-batch-dynamic-padding-sort-pack
[2]https://blog.csdn.net/u011550545/article/details/89529977?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
[3]https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e