Machine learning: 'Pipeline' object has no attribute 'get_feature_names' in scikit-learn

I am clustering some documents with the mini_batch_kmeans and kmeans algorithms. I simply followed the tutorial on the scikit-learn website, linked below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

The tutorial uses several vectorization methods, one of which is HashingVectorizer. For the HashingVectorizer, it builds a pipeline with TfidfTransformer().

# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())

Once I do this, the vectorizer I get back has no get_feature_names() method. But since I am using it for clustering, I need get_feature_names() to get the "terms":

terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

How do I fix this error?

My whole code is shown below:

X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames, contents=_contents, is_dimension_reduced=False)

The count vectorizer piped into TF-IDF:

def count_tfidf_vectorizer(self,contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect,TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow :", X_train_vecs.shape)
    return X_train_vecs, vectorizer

The mini_batch_kmeans class is as follows:

class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is :",cluster_labels)
        data = {'filename':filenames, 'contents':contents, 'cluster_label':cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True,ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")

        if is_dimension_reduced:
            if svd != None:
                original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            for file in frame.ix[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()

Related discussion

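Pipeline has no get_feature_names() method because implementing it for a pipeline is not trivial: one has to consider every pipeline step to obtain the feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc.; there are many related tickets and several attempts to fix it.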

If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans), you can get the feature names from the TfidfVectorizer.
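A minimal sketch of that simple case, assuming a toy `docs` list and small cluster settings that are made up for illustration. The `getattr` fallback is also an assumption about your installed version: newer scikit-learn releases expose `get_feature_names_out()` while older ones only have `get_feature_names()`.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on monday",
    "the market rallied on friday",
]

pipe = make_pipeline(
    TfidfVectorizer(),
    MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0),
)
pipe.fit(docs)

# make_pipeline names each step after its lowercased class name.
tfidf = pipe.named_steps["tfidfvectorizer"]
km = pipe.named_steps["minibatchkmeans"]

# Use whichever feature-name method this scikit-learn version provides.
get_names = getattr(tfidf, "get_feature_names_out", None) or tfidf.get_feature_names
terms = list(get_names())

# Column indices of each centroid, sorted from largest to smallest weight.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(km.n_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :3]]
    print("Cluster %d: %s" % (i, ", ".join(top_terms)))
```

The names are asked of the vectorizer step, not of the pipeline object, which is why this works even where the pipeline itself offers no such method.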

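If you want to use HashingVectorizer it is more complicated, because HashingVectorizer by design does not provide feature names. HashingVectorizer does not store a vocabulary and uses hashing instead; this means it can be applied in an online setting and needs no RAM for a vocabulary, but the trade-off is that you do not get feature names.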

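It is still possible to get feature names from a HashingVectorizer, though. To do so, you apply it to a sample of documents, store which hashes correspond to which words, and in this way learn what these hashes mean, i.e. what the feature names are. Collisions are possible, so you cannot be 100% sure a feature name is correct, but this approach usually works well. It is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You would have to do something like this with InvertableHashingVectorizer: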

from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec) # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()

You can then use hashing_feat_names as the feature names, because TfidfTransformer does not change the size of the input vectors; it only rescales the same features.
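The bookkeeping idea described above (record which hash bucket each word of a sample lands in) can also be sketched by hand, without eli5. This is only an illustration of the principle, not how eli5 implements it; the `docs` list, the `bucket_terms` name and the `n_features` value are assumptions made up for the example.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy sample of documents, purely illustrative.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on monday",
]

hasher = HashingVectorizer(n_features=2 ** 10, norm=None)
analyze = hasher.build_analyzer()  # same tokenization the hasher applies

# Map each hashed column to the set of tokens seen landing in it.
# A set with more than one token means a collision in that bucket.
bucket_terms = defaultdict(set)
for doc in docs:
    for token in analyze(doc):
        # Hash the single token on its own to find its column index.
        column = int(hasher.transform([token]).indices[0])
        bucket_terms[column].add(token)

# TfidfTransformer only rescales columns, so the same column indices
# remain valid for the output of the full pipeline.
pipeline = make_pipeline(hasher, TfidfTransformer())
X = pipeline.fit_transform(docs)
print(X.shape)
```

Tokens never seen in the sample stay unnamed, which is the price of not storing a vocabulary.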

From the make_pipeline documentation:

This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
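The naming rule quoted above can be checked directly. A small sketch, using the CountVectorizer and TfidfTransformer steps from the question as the example pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), TfidfTransformer())

# Step names are derived automatically from the lowercased class names.
step_names = list(pipe.named_steps)
print(step_names)  # ['countvectorizer', 'tfidftransformer']
```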

So, to access the feature names after fitting to the data, you can:

# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

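hasher = HashingVectorizer(n_features=10,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)

# NOTE: TfidfVectorizer expects raw documents, so it appears here only to
# illustrate the step-name lookup; fitting it after a HashingVectorizer fails.
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)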
# ...
# fit to the data
# ...

# look up the step by the lowercased class name of the instance
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()

# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik

Hope this helps, good luck!

Edit: after seeing the updated question with your example, @Vivek Kumar is right: the code terms = vectorizer.get_feature_names() will not work on the pipeline, only when the vectorizer is:

vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                             min_df=2, stop_words='english',
                             use_idf=opts.use_idf)
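For the CountVectorizer plus TfidfTransformer pipeline in the question, one possible workaround is to combine the two points made above: look the CountVectorizer step up in named_steps and ask it for the names, since TfidfTransformer leaves the columns unchanged. A sketch under those assumptions, with a made-up `contents` list and a `getattr` fallback to cover both older and newer scikit-learn method names:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the asker's `contents`.
contents = [
    "clustering documents with kmeans",
    "hashing vectorizer has no vocabulary",
    "tfidf rescales term counts",
]

vectorizer = make_pipeline(CountVectorizer(), TfidfTransformer())
X_train_vecs = vectorizer.fit_transform(contents)

# The vocabulary lives in the CountVectorizer step, not in the pipeline.
count_vect = vectorizer.named_steps["countvectorizer"]
get_names = getattr(count_vect, "get_feature_names_out", None) or count_vect.get_feature_names
terms = list(get_names())

# One name per column of the TF-IDF matrix.
print(len(terms), X_train_vecs.shape[1])
```

With this, `terms` could be passed to the clustering helper in place of the call to `vectorizer.get_feature_names()`.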