SPARK, ML, Tuning, CrossValidator: access the metrics
To build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:
```scala
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(10)

val cvModel = cv.fit(trainingSet)
```
The pipeline chains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF, and finally NaiveBayes.

Is it possible to access the metrics computed for the best model?

Ideally, I would like to access the metrics of all models, to see how changing the parameters changes the quality of the classification. But for the moment, the best model is good enough.

FYI, I am using Spark 1.6.0.
Here is how I do it:
```scala
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec,
    featureVectorAssembler, categoryIndexerModel, classifier, categoryReverseIndexer))

...

val paramGrid = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10, 100))
  .addGrid(idf.minDocFreq, Array(1, 10))
  .addGrid(word2Vec.vectorSize, Array(200, 300))
  .addGrid(classifier.maxDepth, Array(3, 5))
  .build()

paramGrid.size // 16 entries

...

// Print the average metrics per ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)

...

val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

// Explain params for each stage
val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
```
```python
cvModel.avgMetrics
```
Runs in pyspark 2.2.0.
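To see which parameter combination each of those averages belongs to, the metrics can be paired with the param grid entries and sorted, just like the Scala `zip` above. A minimal sketch using plain Python stand-ins (the hypothetical `avg_metrics` values play the role of `cvModel.avgMetrics`, and `param_grid` stands in for the list of param maps built by `ParamGridBuilder`):

```python
# Hypothetical average metrics; in real pyspark code these come from cvModel.avgMetrics.
avg_metrics = [0.72, 0.81, 0.78, 0.85]

# Stand-ins for the ParamMap entries the grid builder produces, in the same order.
param_grid = ["numFeatures=10", "numFeatures=100", "minDocFreq=1", "minDocFreq=10"]

# Pair each parameter combination with its cross-validated average metric
# and sort descending, so the best-performing combination comes first.
ranked = sorted(zip(param_grid, avg_metrics), key=lambda pm: pm[1], reverse=True)

for params, metric in ranked:
    print(f"{metric:.2f}  {params}")
```

The ordering of `avgMetrics` matches the ordering of the param maps passed to `setEstimatorParamMaps`, which is what makes this pairing valid.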