Feature importance 'gain' in XGBoost

I would like to understand how feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:

"Gain" is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

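In scikit-learn, feature importance is calculated from the decrease in Gini impurity / information gain at each node after splitting on a variable, i.e. the weighted impurity of the node minus the weighted impurity of the left child node minus the weighted impurity of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting)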
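
To make the scikit-learn side concrete, here is a minimal sketch of that weighted impurity decrease for a single split. The function names and the toy labels are made up for illustration; this is not scikit-learn's own code, only the arithmetic described above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Node impurity minus child impurities, each weighted by its sample count.

    n_total is the number of samples in the whole training set, so the
    result is comparable across nodes at different depths.
    """
    return (
        len(parent) * gini(parent)
        - len(left) * gini(left)
        - len(right) * gini(right)
    ) / n_total


# Toy split: a perfectly mixed node separated into two pure children.
parent = [0, 0, 1, 1]
left, right = [0, 0], [1, 1]
decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)
```

Summing this quantity over every node that splits on a given feature (and normalizing) is the idea behind impurity-based importances.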

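I am wondering whether xgboost also uses this approach, based on information gain or on accuracy as stated in the quote above. I tried digging into the xgboost code and found this method (irrelevant parts already cut out):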

def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)

    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue

            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')

            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])

            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]

            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g

    return gmap
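
To see what this parsing does without training a model, here is a small standalone sketch that applies the same string splitting to a hand-written dump. The sample text below only imitates the layout of a text dump with stats; the feature names and numbers are invented, and `parse_split_line` is a hypothetical helper, not part of xgboost.

```python
def parse_split_line(line, importance_type="gain"):
    """Return (feature, value) for a split line, or None for a leaf line."""
    arr = line.split("[")
    if len(arr) == 1:
        # no opening bracket means this is a leaf node
        return None
    inside, after = arr[1].split("]")[:2]
    # value of e.g. "gain=" up to the next comma
    value = float(after.split(importance_type + "=")[1].split(",")[0])
    # feature name is whatever precedes the "<" of the split condition
    feature = inside.split("<")[0]
    return feature, value


# Invented dump text, shaped like one tree dumped with stats.
sample_dump = (
    "0:[f2<2.45] yes=1,no=2,missing=1,gain=37.5,cover=50\n"
    "\t1:leaf=0.43,cover=16\n"
    "\t2:[f3<1.75] yes=3,no=4,missing=3,gain=12.25,cover=34\n"
    "\t\t3:[f2<4.95] yes=5,no=6,missing=5,gain=2.5,cover=20\n"
    "\t\t4:leaf=-0.2,cover=14"
)

totals = {}
for line in sample_dump.split("\n"):
    parsed = parse_split_line(line)
    if parsed is not None:
        feature, value = parsed
        totals[feature] = totals.get(feature, 0.0) + value

print(totals)
```

So the method only reads numbers that were already stored in each tree; the question of how those numbers are produced is about the training step, not about this function.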

So 'gain' is extracted from the dump of each booster, but how is it actually measured?


Good question. The gain is calculated using the following formula:

[image: gain formula]
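
Assuming the formula meant here is the split gain from the XGBoost documentation's "Introduction to Boosted Trees" tutorial, it reads Gain = 1/2 * [ G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - (G_L + G_R)^2 / (H_L + H_R + lambda) ] - gamma. Here G_L and H_L (G_R and H_R) are the sums of the first and second derivatives of the loss over the instances that fall into the left (right) child, lambda is the L2 regularization on leaf weights, and gamma is the penalty for adding a leaf. A minimal Python sketch of that formula follows; the function name, default values, and toy data are assumptions for illustration, not xgboost code.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Split gain as written in the tutorial formula above (assumed form)."""

    def score(g, h):
        # structure score of a single leaf holding gradient sum g, hessian sum h
        return g * g / (h + lam)

    return 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    ) - gamma


# Toy example with squared error: gradient = prediction - label, hessian = 1.
# All current predictions are assumed to be 0.0.
labels_left = [1.0, 1.0]
labels_right = [-1.0, -1.0]
g_left = sum(0.0 - y for y in labels_left)
g_right = sum(0.0 - y for y in labels_right)
h_left = float(len(labels_left))
h_right = float(len(labels_right))

gain = split_gain(g_left, h_left, g_right, h_right)
print(gain)
```

Under this reading, gain measures the reduction in the regularized training loss achieved by a split, rather than a change in classification accuracy or Gini impurity.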