Numpy shuffle multidimensional array by row only, keep column order unchanged
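How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?
I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?
Example: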
```python
import numpy as np

X = np.random.random((6, 2))
print(X)
Y = ...  # ??? shuffle by row only, not columns ???
print(Y)
```
What I expect is, given the original matrix:
```
[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]
```
an output in which the rows are shuffled but the columns are not, for example:
```
[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]
```
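This is what np.random.shuffle() is for. It shuffles an array in place along its first axis, so whole rows are moved and the contents of each row are left alone: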
```python
>>> import numpy as np
>>> X = np.random.random((6, 2))
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778],
       [ 0.44323485,  0.78779887]])
>>> np.random.shuffle(X)
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.44323485,  0.78779887],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778]])
```
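To check that an in-place shuffle really keeps each row together, here is a minimal sketch. It assumes the newer Generator API (np.random.default_rng) rather than the legacy np.random.shuffle used in the answer; the seed and the variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only to make the run repeatable
X = np.arange(12).reshape(6, 2)  # rows are [0, 1], [2, 3], ..., [10, 11]
before = X.copy()

rng.shuffle(X)  # in place, along axis 0: whole rows move, columns stay put

# Compare the rows as sets of tuples: same rows, possibly in a different order.
rows_before = sorted(map(tuple, before))
rows_after = sorted(map(tuple, X))
print(rows_before == rows_after)
```

Because the shuffle happens in place, no second array of the same size is requested by the caller, which is the property the question asks about.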
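You can also use np.take with a random permutation of the row indices along axis=0, writing the result back into X through the out argument: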
```python
np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)
```
Sample run -
```
In [23]: X
Out[23]:
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]:
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
```
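A small sketch of the np.take approach that keeps the permutation in a variable, so the result can be checked against the original. The names original and perm are introduced here for illustration. One caveat worth keeping in mind: NumPy's documentation for np.take notes that out is always buffered when mode='raise' (the default), so this call may still allocate a temporary internally.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(6, 2)
original = X.copy()

perm = np.random.permutation(X.shape[0])  # random ordering of the row indices
np.take(X, perm, axis=0, out=X)           # row i of the result is original[perm[i]]

print(np.array_equal(X, original[perm]))
```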
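Extra performance boost
Here is a trick to speed up np.random.permutation(X.shape[0]): draw random numbers and take their argsort instead -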
```python
np.random.rand(X.shape[0]).argsort()
```
Speedup results -
```
In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 μs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 μs per loop
```
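The reason the argsort trick works is that the sort order of n random floats is itself an ordering of the indices 0..n-1. A minimal check, with n and idx as illustrative names (the timings quoted above may differ on other NumPy versions and machines):

```python
import numpy as np

n = 10
idx = np.random.rand(n).argsort()  # positions that would sort n random floats

# Sorting the result gives back 0, 1, ..., n-1, so idx is a valid permutation.
print(np.array_equal(np.sort(idx), np.arange(n)))
```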
Thus, the shuffling solution could be modified to -
```python
np.take(X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X)
```
Runtime tests -
These tests cover the two approaches listed in this post and the np.random.shuffle approach:
```
In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop
```
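So, judging by these timings, the np.take based approaches take roughly twice as long as np.random.shuffle here, which makes np.random.shuffle look like the way to go when speed is what matters.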
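After a bit of experimenting, I found that the most memory- and time-efficient way to shuffle the data of a 2D array (row-wise) is to shuffle the indices and then fetch the data through the shuffled indices: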
```python
rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
```
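One detail of the index-shuffling approach that is easy to overlook: indexing with an integer array (data[perm]) returns a new array rather than a view, so the shuffled result briefly coexists with the original. The sketch below, using illustrative names and a small array, shows this and also shows that keeping perm around lets the shuffle be undone.

```python
import numpy as np

data = np.random.randint(5, size=(6, 4))
perm = np.arange(data.shape[0])
np.random.shuffle(perm)            # shuffle the small index array, not the data

shuffled = data[perm]              # integer-array indexing makes a copy
print(np.shares_memory(data, shuffled))

# Undo the shuffle: row i of shuffled came from row perm[i] of data.
restored = np.empty_like(shuffled)
restored[perm] = shuffled
print(np.array_equal(restored, data))
```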
More details: here I use memory_profiler to measure memory usage and Python's built-in "time" module to record the run time, comparing all the previous answers.
```python
import time

import numpy as np
from memory_profiler import profile


@profile
def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))


if __name__ == '__main__':
    main()
```
Timing results
```
Time for direct shuffle: 0.03345608711242676     # 33.4msec
Time for shuffling index: 0.019818782806396484   # 19.8msec
Time taken by np.take, 0.06726956367492676       # 67.2msec
```
Memory profiling results
```
Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
```
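You can use np.vectorize() with a signature to apply np.random.permutation to each row of a two-dimensional array A. Note that this shuffles the elements within each row independently; it does not reorder whole rows, so it solves a different problem from the one in the question: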
```python
shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')
A_shuffled = shuffle(A)
```
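To see what the vectorized permutation does to a small array, here is a minimal sketch. The array A and the name shuffle_within_rows are assumptions for illustration. With the '(n)->(n)' signature, np.random.permutation is called once per row, so every row stays where it is while its own elements are reordered.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # row i holds 4*i, ..., 4*i + 3, already sorted

shuffle_within_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')
B = shuffle_within_rows(A)

# Sorting each row of B recovers A: no element left its row, and no row moved.
print(np.array_equal(np.sort(B, axis=1), A))
```

If the goal is to reorder whole rows while keeping the column order, np.random.shuffle(A) or A[np.random.permutation(len(A))] is the matching tool.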
I tried many solutions, and in the end I went with this simple one:
```python
import numpy as np
from sklearn.utils import shuffle

x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(shuffle(x, random_state=0))
```
Output:
```
[[5 6]
 [3 4]
 [1 2]]
```
If you have a 3D array, loop over the first axis (axis=0) and apply this function to each 2D slice, like:
```python
# array_3d stands for your 3D NumPy array (a name cannot start with a digit)
np.array([shuffle(item) for item in array_3d])
```
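I have a question about this (or maybe it is the answer).
Suppose we have a numpy array X with shape (1000, 60, 11, 1).
Suppose also that X is an array of images of size 60x11 with number of channels = 1 (60x11x1).
If I want to shuffle the order of all these images, I would shuffle the indexes of X to do that: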
```python
def shuffling(X):
    indx = np.arange(len(X))  # create an array with indexes for X data
    np.random.shuffle(indx)
    X = X[indx]
    return X
```
Would that work? As far as I know, len(X) returns the largest dimension.
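On the len(X) point: for a NumPy array, len(X) is the size of the first axis (X.shape[0]), not the largest dimension. For an array of shape (1000, 60, 11, 1) the two happen to coincide, so indexing with shuffled indices of length len(X) reorders the images. The sketch below uses ten small hypothetical "images" tagged with their own index, so it is possible to confirm that each image stays intact; all names are illustrative.

```python
import numpy as np

small = np.zeros((3, 50))
print(len(small), small.shape)  # len() reports the first axis (3), not the largest (50)

# Ten hypothetical images of shape (60, 11, 1); image i is filled with the value i.
X = np.arange(10.0).reshape(10, 1, 1, 1) * np.ones((10, 60, 11, 1))

indx = np.random.permutation(len(X))  # shuffled indices along the first axis
X_shuffled = X[indx]

# Each shuffled image is still constant, so no pixels were mixed between images.
tags = X_shuffled[:, 0, 0, 0]
intact = bool((X_shuffled == tags.reshape(10, 1, 1, 1)).all())
print(intact)
```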