I tried Auto-sklearn 2.0

Introduction

The Auto-sklearn 2.0 paper was posted to arXiv on July 8, 2020 (https://arxiv.org/abs/2007.04074).
I knew Auto-sklearn existed but had never used it, so I took this opportunity to try it.

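A quick look at the paper

The authors are from the Freiburg-Hannover group (https://www.automl.org/).
According to the paper, the updates in Auto-sklearn 2.0 are improvements to the model selection strategy, portfolio construction, and automated policy selection; please refer to the paper for details.
(My understanding is that three parts of the overall pipeline were improved, namely model selection, hyperparameter tuning, and preprocessing, but I only skimmed the paper, so apologies if I have this wrong.)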

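What I want to do

I want to run Auto-sklearn 1.0 and Auto-sklearn 2.0 on the same data and compare their prediction accuracy.
Staying within Auto-sklearn alone would not be very interesting, so I also want to predict the same data with Auto-Gluon and see how its accuracy compares.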

Environment

This time I use Google Colaboratory (without a GPU).

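What I did

Installing Auto-sklearn

https://automl.github.io/auto-sklearn/master/installation.html
Following the instructions on that page, I ran the commands below.
As of July 22, 2020, the installation completed without errors.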

!sudo apt-get install build-essential swig
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn

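https://automl.github.io/auto-sklearn/master/
The import method for Auto-sklearn 2.0 is described on that page, so I imported it as follows.
Looking at the GitHub repository (https://github.com/automl/auto-sklearn), I could not find a Regressor for Auto-sklearn 2.0. I look forward to future updates.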

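import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
from sklearn import preprocessing

import autosklearn.classification  # classic Auto-sklearn 1.0 classifier, used later
import autosklearn.metrics  # provides log_loss, passed as the metric later
from autosklearn.experimental.askl2 import AutoSklearn2Classifier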

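Loading the data

Auto-sklearn 2.0 only has a classifier, so I followed the official example (https://automl.github.io/auto-sklearn/master/examples/example_feature_types.html).
The dataset used there is for the task of determining whether a person's income exceeds a certain amount.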

X, y = sklearn.datasets.fetch_openml(data_id=179, return_X_y=True)

# y needs to be encoded as integers, since fetch_openml does not return numeric labels
y = preprocessing.LabelEncoder().fit_transform(y)

X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

# Create feature type list from openml.org indicator and run autosklearn
data = sklearn.datasets.fetch_openml(data_id=179, as_frame=True)
feat_type = ['Categorical' if x.name == 'category' else 'Numerical' for x in data['data'].dtypes]
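
To make the last line above concrete, here is a minimal sketch of how the feature type list is derived from pandas dtypes. The toy DataFrame and its column names are made up for illustration; only the list comprehension mirrors the code above.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": pd.Categorical(["Private", "State-gov", "Private"]),
})

# Same rule as above: pandas "category" dtype -> 'Categorical', anything else -> 'Numerical'
feat_type = [
    "Categorical" if dtype.name == "category" else "Numerical"
    for dtype in df.dtypes
]
print(feat_type)
```

The list has one entry per column, in column order, which is the form passed as feat_type to fit.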

Running Auto-sklearn 2.0

Since this is just a trial, I set the time limit to 5 minutes and ran it as follows.

%%time

cls = AutoSklearn2Classifier(
    time_left_for_this_task=300,
    seed=1,
    metric=autosklearn.metrics.log_loss
)
cls.fit(X_train, y_train, feat_type=feat_type)
#CPU times: user 16.8 s, sys: 557 ms, total: 17.3 s
#Wall time: 4min 56s

However, when run as-is, the following warning is printed many times.

/usr/local/lib/python3.6/dist-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
  FutureWarning)

I don't think this is a problem, but if it bothers you, you can suppress it with the warnings module.

import warnings
warnings.simplefilter('ignore')
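
Ignoring everything also hides warnings you might want to see. As a narrower alternative, the sketch below suppresses only FutureWarning using the standard library; the noisy() function is a hypothetical stand-in for a library call, not part of Auto-sklearn.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits two kinds of warnings."""
    warnings.warn("get_params will change in a future version", FutureWarning)
    warnings.warn("something else worth seeing", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start by recording every warning
    warnings.filterwarnings("ignore", category=FutureWarning)  # then hide only FutureWarning
    noisy()

# Names of the warning categories that were not suppressed
remaining = [w.category.__name__ for w in caught]
print(remaining)
```

In a notebook, the single line warnings.filterwarnings('ignore', category=FutureWarning) before fitting should have the same effect on this particular message.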

Evaluating the predictions by Accuracy and AUC gives the following results.

predictions = cls.predict(X_test)
print("Accuracy score ", sklearn.metrics.accuracy_score(y_test, predictions))
print("AUC ", sklearn.metrics.roc_auc_score(y_test, predictions))
#Accuracy score 0.8585701416755385
#AUC 0.7749035627902573
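
Note that the AUC above is computed from hard class labels. AUC is normally computed from continuous scores such as predicted probabilities, and thresholding first can lower it. The pure-Python sketch below illustrates the difference on made-up numbers; roc_auc is a small helper written for this illustration, not a library function.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.9]
hard = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would return at a 0.5 threshold

auc_from_scores = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, hard)
print(auc_from_scores, auc_from_labels)
```

With scikit-learn this would correspond to passing cls.predict_proba(X_test)[:, 1] to roc_auc_score instead of cls.predict(X_test), assuming the fitted classifier exposes predict_proba as scikit-learn classifiers generally do. The numbers reported in this article were all computed from hard labels, so they remain comparable with each other.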

Running Auto-sklearn 1.0

To use the classic Auto-sklearn, import the library with import autosklearn.classification and run it in the same way.

%%time

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    per_run_time_limit=30,
    seed=1,
    metric=autosklearn.metrics.log_loss
)
cls.fit(X_train, y_train, feat_type=feat_type)
#CPU times: user 5.77 s, sys: 1.08 s, total: 6.86 s
#Wall time: 4min 55s

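No cross-validation was performed when building the models this time, so the comparison should be read with some care, but judging by the numbers alone, Auto-sklearn 2.0 gives the better result.

Running Auto-Gluon

For more information about Auto-Gluon, see another article (https://qiita.com/dyamaguc/items/dded739f35e59a6491c8).
Auto-Gluon does not accept a numpy ndarray as input data, so I added a step that converts the data to a pandas DataFrame.
Apart from that, it is basically the same.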

import pandas as pd

X, y = sklearn.datasets.fetch_openml(data_id=179, return_X_y=True)

# y needs to be encoded as integers, since fetch_openml does not return numeric labels
y = preprocessing.LabelEncoder().fit_transform(y)

X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

# Reset the index so features and labels line up row by row in the concat below
X_train_ = pd.DataFrame(X_train).reset_index(drop=True)
y_train_ = pd.DataFrame(y_train, columns=['class'])
train_data = pd.concat([X_train_, y_train_], axis=1)
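
One thing to watch with a column-wise pd.concat like the one above: pandas aligns rows by index label, not by position. If the features arrive as a DataFrame that still carries a shuffled index (as can happen after train_test_split on a DataFrame), the labels will not line up. The sketch below shows the effect on made-up data; the values and index labels are hypothetical.

```python
import pandas as pd

# Hypothetical features that kept a shuffled index, plus labels with no index
X_part = pd.DataFrame({"age": [39, 50, 38]}, index=[12, 10, 11])
y_part = [0, 1, 0]
y_frame = pd.DataFrame(y_part, columns=["class"])

# Naive concat aligns on index labels, producing extra rows filled with NaN
naive = pd.concat([X_part, y_frame], axis=1)

# Resetting the index first makes the concat positional
aligned = pd.concat([X_part.reset_index(drop=True), y_frame], axis=1)

print(naive.shape, aligned.shape)
```

When the features are a plain ndarray, pd.DataFrame assigns a fresh 0..n-1 index and the problem does not arise.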

I ran it with auto stacking enabled.

%%time

# Import assumed from the AutoGluon 0.0.x API that task.fit below belongs to
from autogluon import TabularPrediction as task

long_time = 5 * 60  # seconds; for quick demonstration only, you should set this to the longest time you are willing to wait
output_dir = 'agModels-predictClass-autostack'  # folder where trained models are stored
predictor = task.fit(
    train_data=train_data,
    label='class',
    auto_stack=True,
    output_directory=output_dir,
    eval_metric='log_loss',
    time_limits=long_time,
)
#CPU times: user 9min 21s, sys: 11.3 s, total: 9min 32s
#Wall time: 5min 15s

import sklearn.metrics
X_test_ = pd.DataFrame( X_test )
predictions = predictor.predict(X_test_)
print("Accuracy score ", sklearn.metrics.accuracy_score(y_test, predictions))
print("AUC ", sklearn.metrics.roc_auc_score(y_test, predictions))
#Accuracy score 0.8609450495454918
#AUC 0.780458426444995

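With this dataset and these settings, Auto-Gluon gave the best result.

Summary

This time I tried out Auto-sklearn 2.0. AutoML tools like this are easy to use and give reasonable results, so I find them convenient.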