Python: How to classify observations based on their covariates in a dataframe and numpy?

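I have a dataset with n observations and, say, 2 variables X1 and X2. I am trying to classify each observation based on a set of conditions on its (X1, X2) values. For example, the dataset looks like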

df:
Index X1 X2
1 0.2 0.8
2 0.6 0.2
3 0.2 0.1
4 0.9 0.3

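The groups are defined as

Group 1: x1 < 0.5 & x2 >= 0.5
Group 2: x1 >= 0.5 & x2 >= 0.5
Group 3: x1 < 0.5 & x2 < 0.5
Group 4: x1 >= 0.5 & x2 < 0.5

I would like to generate the following dataframe.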

expected result:
Index X1 X2 Group
1 0.2 0.8 1
2 0.6 0.2 4
3 0.2 0.1 3
4 0.9 0.3 4

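Also, would it be better/faster to use numpy arrays for this type of problem?

In answer to your last question, I definitely think pandas is a good tool for this. It can be done in numpy, but pandas is arguably more intuitive when working with dataframes, and it is fast enough for most applications. pandas and numpy also play well together. For example, in your case you can use numpy.select to build your pandas column: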

import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Index': [1, 2, 3, 4],
                   'X1': [0.2, 0.6, 0.2, 0.9],
                   'X2': [0.8, 0.2, 0.1, 0.3]})

# Lay out your conditions
conditions = [(df.X1 < 0.5) & (df.X2 >= 0.5),
              (df.X1 >= 0.5) & (df.X2 >= 0.5),
              (df.X1 < 0.5) & (df.X2 < 0.5),
              (df.X1 >= 0.5) & (df.X2 < 0.5)]

# Name the resulting groups (in the same order as the conditions)
choicelist = [1, 2, 3, 4]

# The default is set to -1 here, but change it as you see fit:
# a row that meets none of the conditions is classified as -1
df['group'] = np.select(conditions, choicelist, default=-1)

>>> df
   Index   X1   X2  group
0      1  0.2  0.8      1
1      2  0.6  0.2      4
2      3  0.2  0.1      3
3      4  0.9  0.3      4
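
On the question of working with numpy arrays directly: because the four groups are just the combinations of two threshold tests, the label can also be computed with boolean arithmetic on plain arrays. The following is a minimal sketch rather than a benchmarked recommendation; the helper name `classify` and its `threshold` parameter are hypothetical and not part of numpy or pandas.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"X1": [0.2, 0.6, 0.2, 0.9],
                   "X2": [0.8, 0.2, 0.1, 0.3]})


def classify(x1, x2, threshold=0.5):
    """Return group labels 1-4 for arrays x1 and x2 (hypothetical helper)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    # Booleans are converted to 0/1:
    #   x1 >= threshold adds 1 (groups 2 and 4)
    #   x2 <  threshold adds 2 (groups 3 and 4)
    return 1 + (x1 >= threshold).astype(int) + 2 * (x2 < threshold).astype(int)


df["group"] = classify(df["X1"].to_numpy(), df["X2"].to_numpy())
print(df["group"].tolist())  # [1, 4, 3, 4]
```

Unlike np.select with a default, this version has no fallback label: a NaN fails both comparisons and therefore lands in group 1, so it is probably only suitable when the inputs are known to be complete.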

Something like this:

# ge(0.5) tests >= 0.5, which matches the group definitions in the question
df[['X1','X2']].ge(0.5).astype(str).sum(1).map({'FalseTrue':1,'TrueFalse':4,'FalseFalse':3,'TrueTrue':2})
Out[56]:
0    1
1    4
2    3
3    4
dtype: int64
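
The same idea, mapping each pair of threshold tests to a group, can be written without concatenating strings by indexing a small lookup table with the two boolean results. This is only a sketch of one possible variant; the names `lookup`, `i`, `j` and `groups` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"X1": [0.2, 0.6, 0.2, 0.9],
                   "X2": [0.8, 0.2, 0.1, 0.3]})

# lookup[i, j] is the group for i = (X1 >= 0.5) and j = (X2 >= 0.5)
lookup = np.array([[3, 1],   # X1 <  0.5: X2 < 0.5 -> 3, X2 >= 0.5 -> 1
                   [4, 2]])  # X1 >= 0.5: X2 < 0.5 -> 4, X2 >= 0.5 -> 2

# Convert the boolean tests to 0/1 integer indices
i = (df["X1"] >= 0.5).to_numpy().astype(int)
j = (df["X2"] >= 0.5).to_numpy().astype(int)

# Index the table row by row and keep the dataframe's index
groups = pd.Series(lookup[i, j], index=df.index)
print(groups.tolist())  # [1, 4, 3, 4]
```

Changing the group numbering then only requires editing the table, not the comparison logic.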