Python: pandas sample based on category for each row

Suppose I have a pandas DataFrame:

   rid category
0    0       c2
1    1       c3
2    2       c2
3    3       c3
4    4       c2
5    5       c2
6    6       c1
7    7       c3
8    8       c1
9    9       c3

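I want to add two columns, pid and nid, such that for each row pid contains a random id (other than rid) that belongs to the same category as rid, and nid contains a random id that belongs to a different category than rid.

An example DataFrame would be: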

   rid category  pid  nid
0    0       c2    2    1
1    1       c3    7    4
2    2       c2    0    1
3    3       c3    1    5
4    4       c2    5    7
5    5       c2    4    6
6    6       c1    8    5
7    7       c3    9    8
8    8       c1    6    2
9    9       c3    1    2

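Note that pid must not be the same as rid. Right now I am brute-forcing it by iterating over the rows and sampling each time, which seems very inefficient.

Is there a better way?

EDIT 1: For simplicity, let us assume that each category is represented at least twice, so that at least one id can always be found that is not rid but has the same category.

EDIT 2: To simplify further, let us assume that in a large DataFrame the probability of ending up with the same id as rid is zero. If that were the case, I believe the solution would be easier. I would prefer not to make this assumption, though.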
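
For reference, the row-by-row approach mentioned above might be sketched as follows. This is only an assumption about what such a loop looks like, not the asker's actual code; `sample_ids_rowwise` and `rng` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    "rid": range(10),
    "category": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def sample_ids_rowwise(frame):
    """Row-by-row baseline (hypothetical helper): two filters and two draws per row."""
    pids, nids = [], []
    for rid, cat in zip(frame["rid"], frame["category"]):
        # Same category, excluding the row's own rid.
        same = frame.loc[(frame["category"] == cat) & (frame["rid"] != rid), "rid"]
        # Any other category.
        other = frame.loc[frame["category"] != cat, "rid"]
        pids.append(rng.choice(same.to_numpy()))
        nids.append(rng.choice(other.to_numpy()))
    return frame.assign(pid=pids, nid=nids)


baseline = sample_ids_rowwise(df)
```

Every row re-filters the whole frame, so the cost grows roughly quadratically with the number of rows, which is the inefficiency the question is about.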

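For the pid column, use Sattolo's algorithm. For nid, take all rid values that are not in the current group (a set difference between the whole column and the group's values) and sample from them with numpy.random.choice: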

from random import randrange

import numpy as np

# https://stackoverflow.com/questions/7279895
def sattoloCycle(items):
    # Sattolo's algorithm: a random cyclic permutation, so no item keeps its position
    items = list(items)
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return items

def outsideGroupRand(x):
    # all rid values outside the current group, sampled without replacement;
    # replace=False needs at least len(x) ids outside the group
    return np.random.choice(list(set(df['rid']).difference(x)),
                            size=len(x),
                            replace=False)

df['pid1'] = df.groupby('category')['rid'].transform(sattoloCycle)
df['nid1'] = df.groupby('category')['rid'].transform(outsideGroupRand)
print(df)

   rid category  pid  nid  pid1  nid1
0    0       c2    2    1     4     6
1    1       c3    7    4     7     4
2    2       c2    0    1     5     3
3    3       c3    1    5     1     0
4    4       c2    5    7     2     9
5    5       c2    4    6     0     8
6    6       c1    8    5     8     3
7    7       c3    9    8     9     5
8    8       c1    6    2     6     5
9    9       c3    1    2     3     6
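
Why Sattolo's algorithm guarantees pid != rid can be checked empirically. The sketch below is an illustration, not part of the original answer; `sattolo_cycle` is simply a renamed copy of the function above, and it assumes the rid values within a group are unique.

```python
from random import randrange


def sattolo_cycle(items):
    """Return a random cyclic permutation of items (Sattolo's algorithm)."""
    items = list(items)
    i = len(items)
    while i > 1:
        i -= 1
        j = randrange(i)  # 0 <= j <= i-1, never i itself
        items[j], items[i] = items[i], items[j]
    return items


group = [0, 2, 4, 5]  # the rid values of category c2 in the example
for _ in range(1000):
    shuffled = sattolo_cycle(group)
    # Same elements as before, but no element stays at its original position.
    assert sorted(shuffled) == group
    assert all(a != b for a, b in zip(group, shuffled))
```

One consequence worth noting: because the result is a permutation, each rid in a group is used as a pid exactly once, so the pid values are not independent draws. If independent sampling matters, one of the per-row approaches is a better fit.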

Start by defining a function that computes pid:

def getPid(elem, grp):
    return grp[grp != elem].sample().values[0]

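Parameters:

elem - the current rid from the group,
grp - the whole group of rid values.

The idea is to:

select the "other" elements from the current group (for some category),
call sample,
return the only value from the Series returned by sample.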
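
A small usage sketch of getPid, using the rid values of category c1 from the example data. The Series `c1` is built by hand here purely for illustration.

```python
import pandas as pd


def getPid(elem, grp):
    # Drop the current rid, then draw one of the remaining values.
    return grp[grp != elem].sample().values[0]


# Category c1 in the example contains only the rids 6 and 8.
c1 = pd.Series([6, 8], index=[6, 8], name="rid")
print(getPid(6, c1))  # 8 is the only other member, so this prints 8
```

With a group of size one, the filtered Series would be empty and sample would raise an error, which is why the question assumes every category appears at least twice.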

Then define a second function that generates both new ids:

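def getIds(grp):
    # pid: drawn separately for each rid, excluding that rid
    pids = grp.rid.apply(getPid, grp=grp.rid)
    rowNo = grp.rid.size
    currGrp = grp.name
    # nid: a single sample call over the rid values of all other categories
    nids = df.query('category != @currGrp').rid\
        .sample(rowNo, replace=True)
    return pd.DataFrame({'pid': pids, 'nid': nids.values}, index=grp.index)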

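Note:

All nid values for the current group can be computed with a single call to sample, drawing from the Series of rid values in the "other categories".

The pid values, however, must be computed separately, by applying getPid to each element (rid) of the current group.

The reason is that a different element has to be removed from the current group each time before sample is called.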
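
The per-element apply is needed when sample is used directly. One possible way around it, offered here as a sketch rather than as part of the answer above, is to pick for each position i the element at position (i + k) % n, with k drawn uniformly from 1..n-1. The helper name `pid_by_offset` and the generator `rng` are hypothetical, and the sketch assumes every group has at least two members and unique rid values.

```python
import numpy as np
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    "rid": range(10),
    "category": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
})

rng = np.random.default_rng()


def pid_by_offset(rids):
    """For each position i, pick the element at (i + k) % n with 1 <= k <= n-1."""
    values = rids.to_numpy()
    n = len(values)
    # high is exclusive, so every offset is in 1..n-1 and never lands back on i.
    offsets = rng.integers(1, n, size=n)
    return values[(np.arange(n) + offsets) % n]


df["pid"] = df.groupby("category")["rid"].transform(pid_by_offset)
```

Each row still gets an independent, uniform draw from the other members of its group, but the work per group is a couple of array operations instead of one sample call per row.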

To get the result, run a single instruction:

# group_keys=False keeps the original row index, so the new columns align with df
pd.concat([df, df.groupby('category', group_keys=False).apply(getIds)], axis=1)

import pandas as pd
import numpy as np

## generate dummy data
raw = {
    "rid": range(10),
    "cat": np.random.choice("c1,c2,c3".split(","), 10)
}

df = pd.DataFrame(raw)

def get_random_ids(x):
    pids, nids = [], []

    sh = x.copy()
    for _ in x:
        ## do circular shift, then choose a random value except cur_value
        cur_value = sh.iloc[0]
        sh = sh.shift(-1)          # shift introduces NaN, so pid ends up as float
        sh.iloc[-1] = cur_value    # move the current value to the end
        pids.append(np.random.choice(sh[:-1]))

    ## randomly choose from the rid values of the other categories
    nids = np.random.choice(df[df["cat"] != x.name]["rid"], len(x))

    return pd.DataFrame({"pid": pids, "nid": nids}, index=x.index)

# group_keys=False keeps the original row index so that join aligns with df
new_ids = df.groupby("cat", group_keys=False)["rid"].apply(get_random_ids)
df.join(new_ids).sort_values("cat")

Output

   rid cat  pid  nid
5    5  c1  8.0    9
8    8  c1  5.0    6
0    0  c2  6.0    1
2    2  c2  0.0    8
3    3  c2  0.0    9
6    6  c2  2.0    4
7    7  c2  3.0    1
1    1  c3  9.0    5
4    4  c3  9.0    0
9    9  c3  4.0    2
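
Since the results are random, it can help to verify the constraints programmatically rather than by eye. The following is a minimal sketch; `check_ids` is a hypothetical helper, and it assumes rid values are unique so that each id maps to exactly one category.

```python
import pandas as pd


def check_ids(frame, cat_col="category"):
    """Return True if pid and nid satisfy the constraints from the question."""
    cat_of = dict(zip(frame["rid"], frame[cat_col]))
    pid_cat = frame["pid"].map(cat_of)  # category of each sampled pid
    nid_cat = frame["nid"].map(cat_of)  # category of each sampled nid
    return bool(
        (frame["pid"] != frame["rid"]).all()      # pid is never the row's own rid
        and (pid_cat == frame[cat_col]).all()     # pid comes from the same category
        and nid_cat.notna().all()                 # nid is a known rid
        and (nid_cat != frame[cat_col]).all()     # nid comes from another category
    )


# The example result given in the question.
example = pd.DataFrame({
    "rid": range(10),
    "category": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
    "pid": [2, 7, 0, 1, 5, 4, 8, 9, 6, 1],
    "nid": [1, 4, 1, 5, 7, 6, 5, 8, 2, 2],
})
print(check_ids(example))  # True
```

For the last answer, which names the column cat instead of category, the call would be check_ids(result, cat_col="cat").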