'Pipeline' object has no attribute 'get_feature_names' in scikit-learn
I'm clustering some documents using the MiniBatchKMeans and KMeans algorithms. I simply followed the tutorial on the scikit-learn website, linked below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They use several vectorization methods, one of which is HashingVectorizer. For the HashingVectorizer, they build a pipeline with TfidfTransformer():
    # Perform an IDF normalization on the output of HashingVectorizer
    hasher = HashingVectorizer(n_features=opts.n_features,
                               stop_words='english', non_negative=True,
                               norm=None, binary=False)
    vectorizer = make_pipeline(hasher, TfidfTransformer())
Once I do that, the vectorizer I get back has no get_feature_names() method. But since I'm using it for clustering, I need get_feature_names() to get the "terms":
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()
How do I solve this error?
My whole code is shown below:
    X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
    mini_kmeans_batch = MiniBatchKmeansTechnique()
    # MiniBatchKmeans without the LSA dimensionality reduction
    mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                            vectorizer=vectorizer, filenames=_filenames,
                                            contents=_contents, is_dimension_reduced=False)
The count vectorizer piped into tf-idf:
    def count_tfidf_vectorizer(self, contents):
        count_vect = CountVectorizer()
        vectorizer = make_pipeline(count_vect, TfidfTransformer())
        X_train_vecs = vectorizer.fit_transform(contents)
        print("The count of bow :", X_train_vecs.shape)
        return X_train_vecs, vectorizer
The mini_batch_kmeans class is as follows:
    class MiniBatchKmeansTechnique():
        def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                                  filenames, contents, svd=None, is_dimension_reduced=True):
            km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                                 init_size=1000, batch_size=1000, verbose=True, random_state=42)
            print("Clustering sparse data with %s" % km)
            t0 = time()
            km.fit(X_train_vecs)
            print("done in %0.3fs" % (time() - t0))
            print()
            cluster_labels = km.labels_.tolist()
            print("List of the cluster names is :", cluster_labels)
            data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
            frame = pd.DataFrame(data=data, index=[cluster_labels],
                                 columns=['filename', 'contents', 'cluster_label'])
            print(frame['cluster_label'].value_counts(sort=True, ascending=False))
            print()
            grouped = frame['cluster_label'].groupby(frame['cluster_label'])
            print(grouped.mean())
            print()
            print("Top Terms Per Cluster :")

            if is_dimension_reduced:
                if svd != None:
                    original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                    order_centroids = original_space_centroids.argsort()[:, ::-1]
            else:
                order_centroids = km.cluster_centers_.argsort()[:, ::-1]

            terms = vectorizer.get_feature_names()
            for i in range(number_cluster):
                print("Cluster %d:" % i, end=' ')
                for ind in order_centroids[i, :10]:
                    print(' %s' % terms[ind], end=',')
                print()
                print("Cluster %d filenames:" % i, end='')
                for file in frame.ix[i]['filename'].values.tolist():
                    print(' %s,' % file, end='')
                print()
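Pipeline doesn't have a get_feature_names() method because implementing it for a Pipeline is not straightforward: one needs to consider all the pipeline steps to get the feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc.; there are a lot of related tickets and several attempts to fix it.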
If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans), you can get the feature names from the TfidfVectorizer.
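As a minimal sketch of that simple case, the vectorizer can be kept outside the clustering step so that it remains directly reachable. The corpus, the cluster count and the feature_names helper below are illustrative assumptions, not part of the question; the helper only papers over the rename of get_feature_names() to get_feature_names_out() in newer scikit-learn releases.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell on the market",
    "the market rallied as stocks rose",
]


def feature_names(vec):
    """Return the fitted vocabulary as a list, whichever method exists."""
    if hasattr(vec, "get_feature_names_out"):
        return list(vec.get_feature_names_out())
    return list(vec.get_feature_names())


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=42)
km.fit(X)

# Column j of X (and of the centroids) corresponds to terms[j].
terms = feature_names(vectorizer)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for i in range(km.n_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :3]]
    print("Cluster %d: %s" % (i, ", ".join(top_terms)))
```

Because the vectorizer is a plain object rather than a Pipeline, no step lookup is needed to reach its vocabulary.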
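If you want to use HashingVectorizer, it is more complicated, because HashingVectorizer doesn't provide feature names by design. HashingVectorizer doesn't store a vocabulary and uses hashes instead. This means it can be applied in an online setting and needs no memory to hold a vocabulary, but the trade-off is exactly that you don't get feature names.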
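It is still possible to get feature names from HashingVectorizer, though. To do this, you apply it to a sample of documents, store which hashes correspond to which words, and in this way learn what those hashes mean, i.e. what the feature names are. There may be collisions, so it is impossible to be 100% sure the feature names are correct, but this approach usually works well. It is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer: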
    from eli5.sklearn import InvertableHashingVectorizer
    ivec = InvertableHashingVectorizer(vec)  # vec is a HashingVectorizer instance
    # content_sample is a sample from contents; you can use the
    # whole contents array, or just e.g. every 10th element
    ivec.fit(content_sample)
    hashing_feat_names = ivec.get_feature_names()
Then you can use hashing_feat_names as your feature names.
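To make the idea of inverting a hashing vectorizer concrete, here is a toy sketch in plain Python. It is not eli5's implementation: zlib.crc32 stands in for the hash function scikit-learn actually uses, whitespace splitting stands in for real tokenization, and the names bucket, invert_hashing and the FEATURE[i] placeholder are hypothetical.

```python
import zlib
from collections import defaultdict

N_FEATURES = 16  # deliberately tiny so that collisions are likely


def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index with a deterministic hash."""
    return zlib.crc32(token.encode("utf-8")) % n_features


def invert_hashing(sample_docs, n_features=N_FEATURES):
    """Record which tokens from a document sample land in each column."""
    seen = defaultdict(set)
    for doc in sample_docs:
        for token in doc.lower().split():
            seen[bucket(token, n_features)].add(token)
    # Join colliding tokens so the ambiguity stays visible in the name;
    # columns never hit by the sample get a placeholder name.
    return [
        "|".join(sorted(seen[i])) if i in seen else "FEATURE[%d]" % i
        for i in range(n_features)
    ]


sample = ["the cat sat on the mat", "the dog chased the cat"]
feature_names = invert_hashing(sample)
for index, name in enumerate(feature_names):
    print(index, name)
```

A column whose name contains several tokens is a collision: its weight cannot be attributed to a single word, which is why the recovered names are only approximately right.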
From the make_pipeline documentation:
    This is a shorthand for the Pipeline constructor; it does not require,
    and does not permit, naming the estimators. Instead, their names will
    be set to the lowercase of their types automatically.
So, in order to access the feature names, after you have fit to the data, you can:
    # Perform an IDF normalization on the output of HashingVectorizer
    from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    hasher = HashingVectorizer(n_features=10,
                               stop_words='english', non_negative=True,
                               norm=None, binary=False)
    tfidf = TfidfVectorizer()
    vectorizer = make_pipeline(hasher, tfidf)

    # ...
    # fit to the data
    # ...

    # use the instance's class name to lower
    terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()

    # or to be more precise, as used in `_name_estimators`:
    # terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
    # btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
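For the pipeline in the question (CountVectorizer followed by TfidfTransformer), the same named_steps lookup can be pointed at the CountVectorizer step, which is the one that holds the vocabulary. The sketch below assumes a made-up three-document corpus and checks for get_feature_names_out(), the name newer scikit-learn releases use in place of get_feature_names().

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran"]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
X = pipeline.fit_transform(docs)

# make_pipeline names each step after its lowercased class name.
count_vect = pipeline.named_steps["countvectorizer"]

# Use whichever accessor the installed scikit-learn version provides.
if hasattr(count_vect, "get_feature_names_out"):
    terms = list(count_vect.get_feature_names_out())
else:
    terms = list(count_vect.get_feature_names())

print(terms)
```

TfidfTransformer only rescales the count matrix, so column j of X still corresponds to terms[j].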
Hope this helps, good luck!
Edit: After seeing your updated question with the example you follow, @Vivek Kumar is correct, this code
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)