sklearn CountVectorizer
I have a question about using vocabulary_.get; the code is below.
As shown, I used CountVectorizer in a machine learning exercise to count how many times specific words occur.
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    s1 = 'KJ YOU WILL BE FINE'
    s2 = 'ABHI IS MY BESTIE'
    s3 = 'sam is my bestie'
    frnd_list = [s1, s2, s3]
    bag_of_words = vectorizer.fit(frnd_list)
    bag_of_words = vectorizer.transform(frnd_list)
    print(bag_of_words)
    # To get the feature word number from word
    # for eg:
    print(vectorizer.vocabulary_.get('bestie'))
    print(vectorizer.vocabulary_.get('BESTIE'))
Output:
    Bag_of_words is :
    (0, 1) 1
    (0, 3) 1
    (0, 5) 1
    (0, 8) 1
    (0, 9) 1
    (1, 0) 1
    (1, 2) 1
    (1, 4) 1
    (1, 6) 1
    (2, 2) 1
    (2, 4) 1
    (2, 6) 1
    (2, 7) 1
    'bestie' has feature number: 2
    'BESTIE' has feature number: None
So my question is: why does 'bestie' show the correct feature number, 2, while 'BESTIE' shows None? Does vocabulary_.get not work with uppercase words?
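From the CountVectorizer documentation:

    lowercase : boolean, True by default
        Convert all characters to lowercase before tokenizing.

Because of this, 'BESTIE' is stored in the vocabulary as 'bestie', so vocabulary_.get('BESTIE') returns None. If you want lowercase and uppercase to be treated differently, change it to False.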
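To make the default behaviour concrete, here is a minimal sketch that reuses the sentences from the question. The variable names (docs, all_lower, upper_lookup, lower_lookup) are only illustrative. It checks that every key in vocabulary_ is lowercase and that lowercasing the query word before the lookup finds the entry.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three sentences as in the question.
docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

vectorizer = CountVectorizer()  # lowercase=True is the default
vectorizer.fit(docs)

# With the default setting, every key in vocabulary_ is already lowercase.
all_lower = all(term == term.lower() for term in vectorizer.vocabulary_)

# The uppercase spelling is not a key, so .get() falls back to None.
upper_lookup = vectorizer.vocabulary_.get('BESTIE')

# Lowercasing the query first matches the stored key.
lower_lookup = vectorizer.vocabulary_.get('BESTIE'.lower())

print(all_lower, upper_lookup, lower_lookup)
```

If the vectorizer should stay case-insensitive, lowercasing the query like this is probably the simplest fix on the lookup side.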
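CountVectorizer takes a parameter named lowercase, which is True by default.
If you want uppercase and lowercase letters to be treated as different, set lowercase=False.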
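As a rough illustration of that setting, the sketch below fits a second vectorizer with lowercase=False on the question's sentences. The names cs_vectorizer, upper_index, and lower_index are only illustrative. With case preserved, 'BESTIE' and 'bestie' become two separate vocabulary entries, each with its own feature number.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three sentences as in the question.
docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Keep the original casing of every token.
cs_vectorizer = CountVectorizer(lowercase=False)
cs_vectorizer.fit(docs)

# Both spellings are now distinct keys in the vocabulary.
upper_index = cs_vectorizer.vocabulary_.get('BESTIE')
lower_index = cs_vectorizer.vocabulary_.get('bestie')

print(upper_index, lower_index)
print(len(cs_vectorizer.vocabulary_))
```

Note the trade-off: the vocabulary grows, because words such as 'IS' and 'is' or 'MY' and 'my' are now counted as different features.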
For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html