Python 2.7: sklearn CountVectorizer

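I have a question about using vocabulary_.get; the code is below.
In a machine learning exercise, I used CountVectorizer to count how many times particular words occur, as shown here.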


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
s1 = 'KJ YOU WILL BE FINE'
s2 = 'ABHI IS MY BESTIE'
s3 = 'sam is my bestie'
frnd_list = [s1, s2, s3]
# Learn the vocabulary, then build the document-term matrix
vectorizer.fit(frnd_list)
bag_of_words = vectorizer.transform(frnd_list)
print('Bag_of_words is :')
print(bag_of_words)
# Look up the feature number assigned to a word
print("'bestie' has feature number:")
print(vectorizer.vocabulary_.get('bestie'))
print("'BESTIE' has feature number:")
print(vectorizer.vocabulary_.get('BESTIE'))

Output:


Bag_of_words is :
(0, 1) 1
(0, 3) 1
(0, 5) 1
(0, 8) 1
(0, 9) 1
(1, 0) 1
(1, 2) 1
(1, 4) 1
(1, 6) 1
(2, 2) 1
(2, 4) 1
(2, 6) 1
(2, 7) 1

'bestie' has feature number:
2
'BESTIE' has feature number:
None

So my question is: why does 'bestie' return the correct feature number, 2, while 'BESTIE' returns None? Does vocabulary_.get not work with uppercase words?

CountVectorizer takes a parameter lowercase, which defaults to True, as the documentation states:

lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.

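Because of this, the text is lowercased before the vocabulary is built, so the vocabulary contains 'bestie' but not 'BESTIE'. If you want lowercase and uppercase words to be treated as different tokens, set it to False.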
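
A minimal sketch of the difference, reusing the sentences from the question. The variable names default_vec and cased_vec are only illustrative and do not come from the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Default (lowercase=True): every token is lowercased before the
# vocabulary is built, so only 'bestie' becomes a key.
default_vec = CountVectorizer().fit(docs)
default_upper = default_vec.vocabulary_.get('BESTIE')  # missing key -> None
default_lower = default_vec.vocabulary_.get('bestie')  # an integer index

# lowercase=False: case is preserved, so 'BESTIE' and 'bestie'
# are stored as two separate features.
cased_vec = CountVectorizer(lowercase=False).fit(docs)
cased_upper = cased_vec.vocabulary_.get('BESTIE')
cased_lower = cased_vec.vocabulary_.get('bestie')

print('default: BESTIE=%s bestie=%s' % (default_upper, default_lower))
print('cased:   BESTIE=%s bestie=%s' % (cased_upper, cased_lower))
```

Note that with lowercase=False the vocabulary grows, because the same word in different cases is counted as different features.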

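CountVectorizer takes a parameter lowercase, whose value is True by default.

If you want to distinguish between uppercase and lowercase letters, set lowercase=False.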

For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
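
If you would rather keep the default lowercasing, one option is to normalize the query word the same way the vectorizer does before looking it up. The sketch below assumes the default settings and uses build_preprocessor(), which returns the vectorizer's own preprocessing function. The names preprocess, query and index are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Default settings: the vocabulary holds lowercase tokens only
vectorizer = CountVectorizer().fit(docs)

# The preprocessor applies the same normalization (here, lowercasing)
# that was applied to the documents during fit
preprocess = vectorizer.build_preprocessor()

query = 'BESTIE'
index = vectorizer.vocabulary_.get(preprocess(query))
print(index)
```

For a default vectorizer this is equivalent to calling query.lower(), but going through the preprocessor keeps the lookup consistent if options such as strip_accents are changed later.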