Does Cutmix also work for tabular data?

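Introduction

Supervised learning usually needs a sufficient amount of labeled data to reach high accuracy. Manual annotation, however, takes a great deal of time and effort. One way to address this is data augmentation, which artificially increases the amount of training data.

Most data augmentation methods, however, target images, and few can be applied to tabular data. This article therefore introduces augmentation methods that can be applied to tabular data, runs experiments with them, and checks how well they perform.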

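Mixup

mixup: Beyond Empirical Risk Minimization

Mixup is a method that was proposed in 2017 and accepted at ICLR. It generates a new input by blending two inputs.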
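Written out, the blending step looks as follows. The notation is borrowed from the mixup paper rather than from this post: (x_i, y_i) and (x_j, y_j) are two training samples, and alpha is the hyperparameter of the Beta distribution.

```latex
\begin{aligned}
\lambda   &\sim \mathrm{Beta}(\alpha, \alpha), \\
\tilde{x} &= \lambda x_i + (1 - \lambda)\, x_j, \\
\tilde{y} &= \lambda y_i + (1 - \lambda)\, y_j .
\end{aligned}
```

With a small alpha such as 0.2, most draws of lambda fall close to 0 or 1, so a mixed sample usually stays close to one of its two sources.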

from sklearn.utils import check_random_state


def mixup(x, y=None, alpha=0.2, p=1.0, random_state=None):
    """Blend each row of x (and y) with a randomly chosen partner row."""
    random_state = check_random_state(random_state)
    n, _ = x.shape

    # Apply the augmentation with probability p.
    if random_state.rand() < p:
        # Mixing weight drawn from Beta(alpha, alpha).
        l = random_state.beta(alpha, alpha)
        # Random permutation that assigns a partner to every sample.
        shuffle = random_state.choice(n, n, replace=False)

        x = l * x + (1.0 - l) * x[shuffle]

        if y is not None:
            y = l * y + (1.0 - l) * y[shuffle]

    return x, y

Mixup has been reported to improve performance not only on images but also on audio and tabular data.
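To make the blending concrete for tabular data, here is a minimal sketch on a toy table. The helper name `mixup_rows`, the toy arrays, and the fixed seed are assumptions made for illustration only; they are not taken from the experiments in this article.

```python
import numpy as np


def mixup_rows(x, y, alpha=0.2, seed=0):
    """Blend every row (and its label) with a randomly chosen partner row."""
    rng = np.random.RandomState(seed)
    lam = rng.beta(alpha, alpha)        # mixing weight in [0, 1]
    partner = rng.permutation(len(x))   # partner row index for each row
    x_mixed = lam * x + (1.0 - lam) * x[partner]
    y_mixed = lam * y + (1.0 - lam) * y[partner]
    return x_mixed, y_mixed, lam, partner


# Toy table (hypothetical): 4 rows, 3 numeric features, one-hot labels.
x = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

x_mixed, y_mixed, lam, partner = mixup_rows(x, y)
print("lambda:", lam)
print(x_mixed)
print(y_mixed)  # soft labels; each row still sums to 1
```

Because the labels are blended with the same weight as the features, the targets become soft labels, so the loss function has to accept non-integer targets.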

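Cutmix

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Cutmix is a method that was proposed in 2019 and accepted at ICCV. It generates a new input by replacing part of one input with the corresponding part of another.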
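In formula form, the replacement can be written with a binary mask. The symbols follow the CutMix paper and do not appear in this post: M is 1 where the original image x_A is kept and 0 inside the pasted box, x_B is the partner image, and H and W are the image height and width.

```latex
\begin{aligned}
\lambda   &\sim \mathrm{Beta}(\alpha, \alpha), \\
\tilde{x} &= M \odot x_A + (1 - M) \odot x_B, \\
\tilde{y} &= \lambda y_A + (1 - \lambda)\, y_B, \\
r_h &= H \sqrt{1 - \lambda}, \qquad r_w = W \sqrt{1 - \lambda} .
\end{aligned}
```

The box of size r_h by r_w covers a fraction of about 1 - lambda of the image, which is why the partner's label is weighted by 1 - lambda.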

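import numpy as np
from sklearn.utils import check_random_state


def cutmix(x, y=None, alpha=1.0, p=1.0, random_state=None):
    """Paste a random box from a partner image into each image of x (N, H, W, C)."""
    random_state = check_random_state(random_state)
    n, h, w, _ = x.shape

    # Apply the augmentation with probability p.
    if random_state.rand() < p:
        l = random_state.beta(alpha, alpha)
        # Box whose area is roughly (1 - l) of the image.
        r_h = int(h * np.sqrt(1.0 - l))
        r_w = int(w * np.sqrt(1.0 - l))
        # "+ 1" keeps randint valid when the box spans the whole axis.
        x1 = random_state.randint(h - r_h + 1)
        y1 = random_state.randint(w - r_w + 1)
        x2 = x1 + r_h
        y2 = y1 + r_w
        shuffle = random_state.choice(n, n, replace=False)

        # Work on a copy so the caller's array is not modified in place.
        x = x.copy()
        x[:, x1:x2, y1:y2] = x[shuffle, x1:x2, y1:y2]

        if y is not None:
            y = l * y + (1.0 - l) * y[shuffle]

    return x, y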

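The paper reports results for Cutmix only on images. What happens if we apply it to tabular data?

In tabular data, the order of the features (age, nationality, and so on) carries no meaning. So instead of cutting out a contiguous region, I randomly choose which features to replace with those of another input.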

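import numpy as np
from sklearn.utils import check_random_state


def cutmix_for_tabular(x, y=None, alpha=1.0, p=1.0, random_state=None):
    """Replace a random subset of features in each row with a partner row's values."""
    random_state = check_random_state(random_state)
    n, d = x.shape

    # Apply the augmentation with probability p.
    if random_state.rand() < p:
        l = random_state.beta(alpha, alpha)
        # Each feature is replaced independently with probability 1 - l.
        mask = random_state.choice([False, True], size=d, p=[l, 1.0 - l])
        mask = np.where(mask)[0]
        shuffle = random_state.choice(n, n, replace=False)

        # Work on a copy so the caller's array is not modified in place.
        x = x.copy()
        # Select partner rows first, then the masked columns; x[shuffle, mask]
        # would pair the two index arrays element-wise instead.
        x[:, mask] = x[shuffle][:, mask]

        if y is not None:
            y = l * y + (1.0 - l) * y[shuffle]

    return x, y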
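As a quick sanity check of the feature-swapping idea, the sketch below applies it to a toy table and verifies where each value came from. The helper name `swap_random_features`, the fixed `lam`, and the toy array are hypothetical and chosen only for illustration. One detail matters here: NumPy pairs two index arrays element-wise, so the partner rows are selected first and the columns second.

```python
import numpy as np


def swap_random_features(x, lam, seed=0):
    """Replace a random subset of columns in each row with a partner row's values."""
    rng = np.random.RandomState(seed)
    n, d = x.shape
    # Each feature is taken from the partner with probability 1 - lam.
    swap = rng.rand(d) > lam
    partner = rng.permutation(n)
    x_new = x.copy()                      # leave the caller's array untouched
    x_new[:, swap] = x[partner][:, swap]  # select rows first, then columns
    return x_new, swap, partner


# Toy table (hypothetical): 4 rows, 5 features.
x = np.arange(20, dtype=float).reshape(4, 5)
x_new, swap, partner = swap_random_features(x, lam=0.5)

# Unswapped columns keep the original values; swapped ones come from the partner.
assert np.array_equal(x_new[:, ~swap], x[:, ~swap])
assert np.array_equal(x_new[:, swap], x[partner][:, swap])
print("swapped columns:", np.where(swap)[0])
```

Every value in the augmented row is a real value observed for that feature, which is one way this differs from Mixup, where numeric features are interpolated.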

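Experiments

This time, we experiment with the following dataset. It is a multi-label classification problem in which the mechanism of action of a compound is predicted from gene expression patterns.

Mechanisms of Action (MoA) Prediction | Kaggle

For details of the experiments, see the following code.

MoA baseline | Kaggle
moa-mixup | Kaggle
moa-cutmix | Kaggle

The log loss values are as follows: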

Method      Local     Public    Private
Baseline    0.01604   0.01906   0.01666
Mixup       0.01605   0.01905   0.01668
Cutmix      0.01604   0.01901   0.01663

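Cutmix did improve both the public score (0.01906 to 0.01901) and the private score (0.01666 to 0.01663) relative to the baseline, while Mixup stayed roughly on par with the baseline.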
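To put the size of these differences in perspective, the following sketch computes the relative change of each method against the baseline. The numbers are copied from the table above; the dictionary layout and the helper name `relative_change` are assumptions made for illustration.

```python
# Log loss values copied from the table above (lower is better).
scores = {
    "Baseline": {"local": 0.01604, "public": 0.01906, "private": 0.01666},
    "Mixup": {"local": 0.01605, "public": 0.01905, "private": 0.01668},
    "Cutmix": {"local": 0.01604, "public": 0.01901, "private": 0.01663},
}


def relative_change(method, split):
    """Relative change versus the baseline in percent (negative means better)."""
    base = scores["Baseline"][split]
    return 100.0 * (scores[method][split] - base) / base


for method in ("Mixup", "Cutmix"):
    for split in ("local", "public", "private"):
        print(f"{method:7s} {split:8s} {relative_change(method, split):+.2f}%")
```

The changes are well under one percent of the baseline log loss, which is a small margin in absolute terms even though it can matter for leaderboard rank.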

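Conclusion

Cutmix is also an effective method for tabular data.

Finally, we have published our solution that used Cutmix and placed 35th in the competition above, so take a look if you are interested.

Mechanisms of Action (MoA) Prediction | Kaggle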