sklearn train_test_split on pandas stratify by multiple columns

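I am a relatively new user of sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas DataFrame that I would like to split into a training set and a test set. I would like to stratify the data by at least 2, but ideally 4, columns in my DataFrame.

sklearn gave no warnings when I tried this, but I later found repeated rows in my final data set. I created a sample test to show this behavior: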

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
a = np.array([i for i in range(1000000)])
b = [i%10 for i in a]
c = [i%5 for i in a]
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

If I stratify by either column on its own, this seems to work as expected:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000

But when I try to stratify by both columns, I get repeated values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 640000

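The reason you're getting duplicates is that train_test_split() ends up defining the strata as the set of unique values in whatever you pass to the stratify argument. Because the strata are defined from two columns, one row of data can represent more than one stratum, so the sampling may pick the same row twice because it thinks it is sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code: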

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
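
To see why this matters for a two-column stratify argument, here is a small sketch of what np.unique() does with a 2-D array when no axis is given. The toy array y is made up for illustration and only stands in for df[['b', 'c']]; it is not taken from the sklearn source.

```python
import numpy as np

# Toy stand-in for df[['b', 'c']].values: one row per sample, two label columns.
y = np.array([["bar", "y"],
              ["foo", "y"],
              ["bar", "z"],
              ["foo", "z"]])

# Without an axis argument, np.unique flattens the array first, so the
# "classes" are individual cell values rather than (b, c) pairs.
classes, y_indices = np.unique(y, return_inverse=True)

print(classes)         # ['bar' 'foo' 'y' 'z']
print(y.shape[0])      # 4 rows
print(y_indices.size)  # 8 class indices, i.e. two per row
```

Each row contributes two class indices, one for its b value and one for its c value, which is how a single row can be treated as a member of two different strata.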

Here's a simplified sample case, a variation on the example you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

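The stratifying function thinks there are four classes to split on: foo, bar, y, and z. But these classes are essentially nested, meaning y and z both show up in b == foo and in b == bar, so we get duplicates when the splitter tries to sample from each class.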

train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=df[['b', 'c']])
print(len(train.a.values)) # 16
print(len(set(train.a.values))) # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`,
# I'm just illustrating the idea here.

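There's a larger design question here: do you want nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and it is not what train_test_split is set up to do.

You might find this discussion of nested stratified sampling useful.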
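
If what you want is the same test fraction drawn from every (b, c) combination, one way to sketch it without train_test_split is to sample inside each group with pandas. This is an illustration under assumptions, not what sklearn does internally: the toy frame below is made up, and GroupBy.sample needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Made-up data: 600 rows, 2 values of b x 3 values of c, 100 rows per combination.
a = np.arange(600)
df = pd.DataFrame({"a": a, "b": a % 2, "c": (a // 2) % 3})

# Draw 20% of the rows from every (b, c) combination for the test set.
test = df.groupby(["b", "c"]).sample(frac=0.2, random_state=0)

# Everything that was not drawn becomes the training set, so no row can
# end up in both sets or appear twice.
train = df.drop(test.index)

print(len(train), len(test))
print(test.groupby(["b", "c"]).size())
```

Because the training set is built by dropping the sampled index labels, this relies on the DataFrame having a unique index.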

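If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is the concatenation of the values in the other columns, and stratify on that new column.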

df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])
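
A quick way to convince yourself that the combined column does what you want is to check the split for repeated rows and compare the share of each combination before and after. The sketch below uses a small made-up frame rather than the data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 600 rows, 2 values of b x 3 values of c, 100 rows per combination.
a = np.arange(600)
df = pd.DataFrame({"a": a, "b": a % 2, "c": (a // 2) % 3})

# One label per row that encodes the (b, c) combination.
df["bc"] = df["b"].astype(str) + df["c"].astype(str)

train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=df["bc"])

# No row should appear twice in the training set, or in both sets.
print(train["a"].is_unique)
print(set(train["a"]).isdisjoint(set(test["a"])))

# Share of each combination in the full data and in the training set.
shares = pd.concat({"full": df["bc"].value_counts(normalize=True),
                    "train": train["bc"].value_counts(normalize=True)}, axis=1)
print(shares)
```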

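If you're worried about collisions, where pairs such as 11 and 3, and 1 and 13, both produce the concatenated value 113, you can add some arbitrary string in the middle: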

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
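
The collision is easy to reproduce on a tiny made-up frame. The sketch below also shows a possible alternative that is not part of the answer above: letting pandas number the (b, c) groups with groupby(...).ngroup(), which avoids string concatenation altogether.

```python
import pandas as pd

# Made-up values chosen so that plain concatenation collides.
df = pd.DataFrame({"b": [11, 1, 11], "c": [3, 13, 3]})

plain = df["b"].astype(str) + df["c"].astype(str)             # "113" for every row
separated = df["b"].astype(str) + "_" + df["c"].astype(str)   # "11_3" and "1_13"

# Alternative (an assumption, not from the answer): one integer id per
# distinct (b, c) combination.
group_id = df.groupby(["b", "c"]).ngroup()

print(plain.nunique())      # 1: two different combinations collapsed into one
print(separated.nunique())  # 2
print(group_id.nunique())   # 2
```

Either the separated string column or the integer group id can then be passed to stratify. Note that a separator only helps if it cannot occur inside the values themselves.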