How to implement the Softmax function in Python (numpy)

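From Udacity's deep learning class, the softmax of y_i is simply the exponential of y_i divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e ^ y_i / sum_j(e ^ y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j ranges over the columns of the input vector Y.

I tried the following: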

import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

which returns:

[ 0.8360188 0.11314284 0.05083836]

However, the suggested solution was:

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

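This produces the same output as the first implementation, even though the first implementation explicitly takes the difference between each column and the max before dividing by the sum.

Can someone show mathematically why? Is one correct and the other wrong?

Are the implementations similar in terms of code and time complexity? Which one is more efficient?

Related discussion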

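They are both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a ^ (b - c) = (a ^ b) / (a ^ c), we have

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

which is what the other answer says. You could replace max(x) with any variable and it would cancel out.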
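As a quick numerical check of the cancellation argument, here is a minimal sketch. The helper name softmax_shifted and the shift values are made up for illustration and are not part of the answer; it shifts by zero, by max(x), and by an arbitrary constant:

```python
import numpy as np

def softmax_shifted(x, c):
    """Softmax of a 1-D array after subtracting an arbitrary constant c."""
    e_x = np.exp(x - c)
    return e_x / e_x.sum()

x = np.array([3.0, 1.0, 0.2])

# Three different shifts: none, the maximum, and an arbitrary value.
plain = softmax_shifted(x, 0.0)
by_max = softmax_shifted(x, np.max(x))
by_other = softmax_shifted(x, -7.5)

# All three agree up to floating-point rounding.
print(np.allclose(plain, by_max), np.allclose(plain, by_other))
```

The shift changes only the size of the intermediate exponentials, not the final values, which is why max(x) is a convenient choice.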

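Related discussion

(Well... there is much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array in the example provided with the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument: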

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference

As I said, for a 1-D score array the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
print(softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True, True, True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[ 4.89907947e-04 1.33170787e-03 3.61995731e-03 7.27087861e-02]
#  [ 1.33170787e-03 9.84006416e-03 2.67480676e-02 7.27087861e-02]
#  [ 3.61995731e-03 5.37249300e-01 1.97642972e-01 7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057 0.00242826 0.01587624 0.33333333]
#  [ 0.24472847 0.01794253 0.11731043 0.33333333]
#  [ 0.66524096 0.97962921 0.86681333 0.33333333]]

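The results are different: the second one is indeed identical to the result expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually about an implementation detail: the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want to sum over the rows, i.e. down each column, hence axis=0. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case...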
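To make the axis behaviour concrete, here is a small sketch that reuses the values of the 2-D score array above. The variable names are only illustrative:

```python
import numpy as np

a = np.array([[1, 2, 3, 6],
              [2, 4, 5, 6],
              [3, 8, 7, 6]])

total = a.sum()              # axis=None: one scalar over all 12 elements
per_column = a.sum(axis=0)   # collapses the rows: one sum per column, shape (4,)
per_row = a.sum(axis=1)      # collapses the columns: one sum per row, shape (3,)

print(total)        # 53
print(per_column)   # column sums: 6, 14, 15, 18
print(per_row)      # row sums: 12, 17, 24
```

With axis=0 every column gets its own denominator, which is why each column of the corrected softmax output sums to 1.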

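The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function. The justification is numerical stability, as also pointed out by some of the other answers.

Related discussion

So, this is really a comment on desertnaut's answer, but I can't comment on it yet because of my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first takes a 1-dimensional input and then a 2-dimensional input. Let me show you.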

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer):
def desertnaut_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div

Let's take desertnaut's example:

x1 = np.array([[1, 2, 3, 6]])  # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])

desertnaut_softmax(x1)
array([[ 1., 1., 1., 1.]])

softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])

You can see that desertnaut's version fails in this situation. (It would not if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)

Now let's use 3 samples, since that is the reason we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.

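x2 = np.array([[1, 2, 3, 6],   # sample 1
               [2, 4, 5, 6],   # sample 2
               [1, 2, 3, 6]])  # sample 1 again(!)

This input is a batch of 3 samples, but samples one and three are identical. We now expect 3 rows of softmax activations, where the first row should be the same as the third and also the same as the activation of x1!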

your_softmax(x2)
array([[ 0.00183535, 0.00498899, 0.01356148, 0.27238963],
       [ 0.00498899, 0.03686393, 0.10020655, 0.27238963],
       [ 0.00183535, 0.00498899, 0.01356148, 0.27238963]])

desertnaut_softmax(x2)
array([[ 0.21194156, 0.10650698, 0.10650698, 0.33333333],
       [ 0.57611688, 0.78698604, 0.78698604, 0.33333333],
       [ 0.21194156, 0.10650698, 0.10650698, 0.33333333]])

softmax(x2)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047],
       [ 0.01203764, 0.08894682, 0.24178252, 0.65723302],
       [ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])

I hope you can see that this is only the case with my solution.

softmax(x1) == softmax(x2)[0]
array([[ True, True, True, True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True, True, True, True]], dtype=bool)

Additionally, here are the results of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np

# TensorFlow 1.x graph-mode API (placeholders and sessions)
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037045],
       [ 0.01203764, 0.08894681, 0.24178252, 0.657233 ],
       [ 0.00626879, 0.01704033, 0.04632042, 0.93037045]], dtype=float32)

Related discussion

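I would say that while both are mathematically correct, the first one is better implementation-wise. When computing softmax, the intermediate values may become very large, and dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick, which is essentially what you are doing.

Related discussion

sklearn also offers an implementation of softmax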

from sklearn.utils.extmath import softmax
import numpy as np

x = np.array([[ 0.50839931, 0.49767588, 0.51260159]])
softmax(x)

# output
array([[ 0.3340521 , 0.33048906, 0.33545884]])

Related discussion

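From a mathematical point of view, both sides are equal.

You can easily prove this. Let m = max(x). Your function softmax then returns a vector whose i-th coordinate is equal to

e ^ (x_i - m) / sum_j(e ^ (x_j - m)) = (e ^ x_i / e ^ m) / (sum_j(e ^ x_j) / e ^ m) = e ^ x_i / sum_j(e ^ x_j)

Notice that this works for any m, because e^m != 0 for all (even complex) numbers.

From a computational complexity point of view they are also equivalent, and both run in O(n) time, where n is the size of the vector.
From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for moderately large x (above roughly 709 in float64). Subtracting the maximum value gets rid of this overflow. To experience what I am talking about in practice, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities, the other will overflow and return nan.
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). To fix it, you need to use sum(axis=0).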
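The overflow experiment suggested above can be sketched as follows. The function names softmax_naive and softmax_stable are made up for this illustration; they mirror the two forms discussed in the question:

```python
import numpy as np

def softmax_naive(x):
    """Exponentiate directly; overflows for large inputs."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    """Subtract the maximum first so the largest exponent is 0."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1000.0, 5.0])

# np.exp(1000.0) overflows to inf, and inf / inf is nan.
# errstate only silences the RuntimeWarnings; the result is unchanged.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)

stable = softmax_stable(x)

print(naive)   # first entry is nan, second is 0.0
print(stable)  # first entry is 1.0; the second, exp(-995), underflows to 0.0
```

The stable version never exponentiates anything larger than 0, so the worst that can happen is a harmless underflow to zero.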

Related discussion

Here you can find out why they used - max.

From there:

"When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function that applies softmax over any axis:

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

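Subtracting the max, as other users have described, is good practice. I have written a detailed post about it.

A more concise version is:

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

Related discussion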

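To offer an alternative solution, consider the cases where your arguments are so large in magnitude that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to stay in log space as long as possible, exponentiating only at the end, where you can trust the result to be well-behaved.

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))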
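For batched input, one possible extension is sketched below. The axis parameter and the softmax_max_shift reference helper are additions for illustration and are not part of the answer; the sketch relies on the axis and keepdims arguments of scipy.special.logsumexp:

```python
import numpy as np
from scipy.special import logsumexp

def softmax_logspace(x, axis=-1):
    """Softmax via logsumexp; `axis` is an added, hypothetical parameter."""
    return np.exp(x - logsumexp(x, axis=axis, keepdims=True))

def softmax_max_shift(x, axis=-1):
    """Reference: the subtract-the-max form discussed in this thread."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Each row is normalized on its own; the values are large in magnitude on purpose.
x = np.array([[1000.0, 1001.0, 1002.0],
              [-1000.0, -1001.0, -1002.0]])

a = softmax_logspace(x)
b = softmax_max_shift(x)
print(np.allclose(a, b), a.sum(axis=-1))
```

Both forms should agree up to rounding. Note that without an axis argument, logsumexp reduces over every element, so the one-line version above normalizes the whole array at once.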

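Related discussion

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x))  # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)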

Results:

logits = np.asarray([
    [-0.0052024, -0.00770216, 0.01360943, -0.008921],  # 1
    [-0.0052024, -0.00770216, 0.01360943, -0.008921]   # 2
])

print(softmax(logits))

#[[0.2492037 0.24858153 0.25393605 0.24827873]
# [0.2492037 0.24858153 0.25393605 0.24827873]]

Ref: Tensorflow softmax
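One caveat worth checking: np.max(x) without an axis is the maximum of the whole batch. That is mathematically fine, but if the rows live on very different scales the smaller rows can underflow to 0/0. The sketch below uses hypothetical helper names and illustrative values to compare a global shift with a per-row shift:

```python
import numpy as np

def softmax_global_max(x, axis=-1):
    """Shift by the maximum of the whole array."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def softmax_axis_max(x, axis=-1):
    """Shift each slice along `axis` by its own maximum."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Two rows on very different scales (values chosen only for illustration).
batch = np.array([[0.0, 1.0],
                  [1000.0, 1001.0]])

with np.errstate(invalid="ignore"):   # silence the 0/0 warning
    g = softmax_global_max(batch)
p = softmax_axis_max(batch)

print(g)  # first row is nan: both exponentials underflowed to 0
print(p)  # both rows are the same valid distribution
```

For logits of similar magnitude, as in the example above, the two variants give the same answer.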

Related discussion

I would suggest this:

def softmax(z):
    z_norm = np.exp(z - np.max(z, axis=0, keepdims=True))
    return np.divide(z_norm, np.sum(z_norm, axis=0, keepdims=True))

It will work for stochastic as well as batch inputs.
For more detail see:
https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d
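Since this version normalizes along axis=0, a batch has to be laid out with one sample per column. A short sketch of both cases, where the function name softmax_axis0 and the test values are only illustrative:

```python
import numpy as np

def softmax_axis0(z):
    """Same computation as the answer above: normalize along axis 0."""
    z_norm = np.exp(z - np.max(z, axis=0, keepdims=True))
    return z_norm / np.sum(z_norm, axis=0, keepdims=True)

single = np.array([1.0, 2.0, 3.0, 6.0])             # one sample, 4 classes
batch = np.stack([single, single + 10.0], axis=1)   # shape (4, 2): one sample per column

print(softmax_axis0(single).sum())        # 1.0 up to rounding
print(softmax_axis0(batch).sum(axis=0))   # one total per column, each 1.0 up to rounding
```

If your samples are stored as rows instead, as in the TensorFlow examples in this thread, the same idea applies with axis=1 or axis=-1.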

For numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

def softmax(x):
    if len(x.shape) > 1:
        tmp = np.max(x, axis = 1)
        x = x - tmp.reshape((x.shape[0], 1))  # new array, so the caller's x is not modified
        x = np.exp(x)
        tmp = np.sum(x, axis = 1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        tmp = np.max(x)
        x = x - tmp  # new array, so the caller's x is not modified
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    return x

This has already been answered in detail in the answers above: max is subtracted to avoid overflow. I am adding one more implementation in Python 3 here.

import numpy as np

def softmax(x):
    mx = np.amax(x, axis=1, keepdims=True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

Based on all the responses and the CS231n notes, allow me to summarise:

def softmax(x, axis):
    x = x - np.max(x, axis=axis, keepdims=True)  # new array; the caller's x is left untouched
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

Usage:

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8],
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])

Everybody seems to have posted their solution, so I'll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get exactly the same results as the one imported from sklearn:

from sklearn.utils.extmath import softmax

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

# TensorFlow 1.x session API
sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

Related discussion

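The purpose of the softmax function is to preserve the ratio of the vectors, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistic)). This is because it preserves more information about the rate of change at the end-points and is therefore more applicable to neural nets with 1-of-N output encoding (i.e. if we squashed the end-points, it would be harder to differentiate the 1-of-N output class, because we could not tell which one is the "biggest" or "smallest" once they are squished). It also makes the total output sum to 1: the clear winner will be closer to 1, while other numbers that are close to each other will sum to 1 / p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you compute the e^y exponents you may get a very high value that clips the float at its maximum, leading to a tie, which is not the case in this example. This becomes a big problem if you subtract the max value to make a negative number: then you have a negative exponent that rapidly shrinks the values and alters the ratio, which is what occurred in the poster's question and yielded the incorrect answer.

The answer supplied by Udacity is horribly inefficient. The first thing we need to do is calculate e ^ y_j for all vector components, keep those values, then sum them up and divide. Where Udacity messed up is that they calculate e ^ y_j twice!!! Here is the correct answer:

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)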

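The goal was to achieve similar results using Numpy and Tensorflow. The only change from the original answer is the axis parameter of the np.sum API.

Initial approach: axis=0. This, however, does not give the intended results when the input has N dimensions.

Modified approach: axis=len(e_x.shape)-1. Always sum over the last dimension. This gives results similar to tensorflow's softmax function.

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts per slice
    return e_x / e_x.sum(axis=len(e_x.shape) - 1, keepdims=True)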
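When summing over the last axis, the shape of the sum matters for the division that follows. The sketch below, with illustrative values and variable names, shows what np.sum returns with and without keepdims and why only one of them broadcasts against a batch:

```python
import numpy as np

e_x = np.exp(np.array([[1.0, 2.0, 3.0],
                       [1.0, 2.0, 5.0]]))   # shape (2, 3)

last = e_x.ndim - 1                          # same axis as len(e_x.shape) - 1

dropped = e_x.sum(axis=last)                 # shape (2,)
kept = e_x.sum(axis=last, keepdims=True)     # shape (2, 1)

probs = e_x / kept                           # (2, 3) / (2, 1) broadcasts row by row

try:
    e_x / dropped                            # (2, 3) / (2,) cannot broadcast
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(dropped.shape, kept.shape, probs.sum(axis=last), broadcast_failed)
```

For a 1-D input both forms happen to work, which is probably why the difference is easy to miss.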

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822 0.3930805 ]
  [0.62397 0.6378774 ]
  [0.88049906 0.299172 ]]]

Softmax using tensorflow:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232 0.5034768 ]
  [0.6413727 0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy):

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

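I would like to add a little more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code from the other post, you will find it does not give the right answer when the array is 2-D or of higher dimension.

Here are some suggestions:

To get the max, take it along the x-axis; you will get a 1-D array.

Reshape your max array so that it lines up (broadcasts) with the original shape.

Apply np.exp to get the exponential values.

Do np.sum along the axis.

Get the final result.

Follow these steps with vectorized operations and you will get the correct answer. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if anything is unclear.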
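For readers who want to check their own attempt, the steps listed above might be sketched roughly as follows. This is one reading of those steps, not the poster's code; the function name softmax_steps and the axis parameter are assumptions made for the example:

```python
import numpy as np

def softmax_steps(x, axis=-1):
    """Follow the listed steps one at a time (names are illustrative)."""
    x = np.asarray(x, dtype=float)

    # 1. Max along the chosen axis: one dimension fewer than x.
    x_max = np.max(x, axis=axis)

    # 2. Reshape the max so it broadcasts against x.
    x_max = np.expand_dims(x_max, axis)

    # 3. Exponentiate the shifted values.
    e_x = np.exp(x - x_max)

    # 4. Sum along the same axis, again restoring the dropped dimension.
    e_sum = np.expand_dims(np.sum(e_x, axis=axis), axis)

    # 5. Divide to get the final result.
    return e_x / e_sum

x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [1, 2, 3, 6]])
out = softmax_steps(x2, axis=1)
print(out.sum(axis=1))   # each row sums to 1 (up to rounding)
```

Passing keepdims=True to np.max and np.sum would fold steps 1 and 2 (and the reshaping in step 4) into a single call.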

Related discussion