1.5.1 Boosting Algorithms

Intelligent Systems and Technology Series · AI安全之对抗样本入门 (Introduction to Adversarial Examples in AI Security), by 兜哥

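    The boosting family of algorithms works as follows. A classifier is first trained on the training set with initial sample weights. The sample weights are then updated according to that classifier's performance, so that the samples it misclassified receive more attention in subsequent rounds of training. This is repeated until the number of classifiers reaches a preset value, and finally all the classifiers are combined through an ensemble strategy into a new classifier.
    The best-known algorithms in the boosting family are AdaBoost and the gradient boosting decision tree (GBDT). We use AdaBoost and GBDT as examples to show how to use them in scikit-learn.
    Taking AdaBoost first, the dataset is randomly generated, the classifier is AdaBoostClassifier, and the number of classifiers is set to 100: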
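    The reweighting loop described above can be sketched in plain Python. This is a minimal illustration of the classic discrete AdaBoost update, not scikit-learn's implementation: the one-threshold "stump" weak learner, the helper names (train_stump, adaboost_fit, adaboost_predict), and the ten-point toy dataset are all assumptions chosen to keep the example small.

```python
import math

def stump_predict(x, threshold, polarity):
    # A decision stump on 1-D data: one side of the threshold gets +1, the other -1
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    # Pick the (threshold, polarity) pair with the lowest weighted error
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_estimators):
    n = len(xs)
    weights = [1.0 / n] * n                      # initial weights are uniform
    ensemble = []
    for _ in range(n_estimators):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this classifier
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction == -1) gain weight, the rest lose it
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize to sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    # Ensemble strategy: sign of the alpha-weighted vote
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D labels that no single threshold can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(xs, ys, n_estimators=3)
preds = [adaboost_predict(model, x) for x in xs]
train_accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
print(train_accuracy)  # 1.0
```

    Each individual stump misclassifies at least three of the ten points, yet the weighted vote of three stumps fits all of them, which is the effect the weight updates are meant to achieve.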
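    from sklearn import datasets
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    # Randomly generated binary classification data: 1000 samples, 100 features
    x, y = datasets.make_classification(n_samples=1000,
        n_features=100, n_redundant=0, random_state=1)
    # Hold out 20% of the samples as the test set
    train_x, test_x, train_y, test_y = train_test_split(x, y,
        test_size=0.2, random_state=66)
    # Boosted ensemble of 100 weak classifiers
    clf = AdaBoostClassifier(n_estimators=100)
    clf.fit(train_x, train_y)
    pred_y = clf.predict(test_x)
    # report() is the metrics helper also called in the GBDT example; it is not defined in this section
    report(test_y, pred_y)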
    The resulting performance metrics are: accuracy 80.5%, F1 81.52%, precision 81.13%, recall 81.90%, and AUC 0.80:
    accuracy_score:
    0.805
    f1_score:
    0.815165876777
    recall_score:
    0.819047619048
    precision_score:
    0.811320754717
    confusion_matrix:
    [[75 20]
     [19 86]]
    auc:
    0.804260651629
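    The figures above can be checked by hand from the confusion matrix. The short calculation below assumes scikit-learn's layout (rows are true labels, columns are predicted labels). The reported AUC matches the mean of the true-positive and true-negative rates, which suggests it was computed from the hard predictions rather than from scores; that is an inference from the numbers, not something the text states.

```python
# Confusion matrix of the AdaBoost run, assumed layout [[TN, FP], [FN, TP]]
tn, fp = 75, 20
fn, tp = 19, 86

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted positive, how many are positive
recall = tp / (tp + fn)      # of the positive samples, how many were found
f1 = 2 * precision * recall / (precision + recall)
# With 0/1 predictions the ROC curve has a single operating point, so its
# area reduces to the mean of the true-positive and true-negative rates.
auc = (recall + tn / (tn + fp)) / 2

for name, value in [("accuracy", accuracy), ("f1", f1), ("recall", recall),
                    ("precision", precision), ("auc", auc)]:
    print(name, round(value, 12))
```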
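    The corresponding ROC curve is shown in Figure 1-35. All of the overall metrics are better than those of the KNN model shown earlier.
    Figure 1-35 ROC curve of AdaBoost
    Taking GBDT next, the dataset is again randomly generated, the classifier is GradientBoostingClassifier, and the number of classifiers is again set to 100: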
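    from sklearn import datasets
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    # Same randomly generated data and the same 80/20 split as in the AdaBoost example
    x, y = datasets.make_classification(n_samples=1000,
        n_features=100, n_redundant=0, random_state=1)
    train_x, test_x, train_y, test_y = train_test_split(x, y,
        test_size=0.2, random_state=66)
    # Gradient-boosted ensemble of 100 decision trees
    clf = GradientBoostingClassifier(n_estimators=100)
    clf.fit(train_x, train_y)
    pred_y = clf.predict(test_x)
    # report() is the book's metrics helper; it is not defined in this section
    report(test_y, pred_y)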
    The resulting performance metrics are: accuracy 84%, F1 84.76%, precision 84.76%, recall 84.76%, and AUC 0.84:
    accuracy_score:
    0.84
    f1_score:
    0.847619047619
    recall_score:
    0.847619047619
    precision_score:
    0.847619047619
    confusion_matrix:
    [[79 16]
     [16 89]]
    auc:
    0.839598997494
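    The corresponding ROC curve is shown in Figure 1-36. The overall metrics are better than those of the earlier KNN model and slightly better than AdaBoost's. However, boosting algorithms all have many tunable parameters that affect performance to some degree, so the comparison in this chapter is only a rough one and not very rigorous.
    Figure 1-36 ROC curve of GBDT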
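    One way to make such a comparison less dependent on a single train/test split is k-fold cross-validation. The sketch below is an illustration rather than part of the book's experiment: it reuses the same generated data, but cv=5, the accuracy scoring, and random_state=0 on the classifiers are assumptions, and no parameter tuning is attempted.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic data as in the examples above
x, y = datasets.make_classification(n_samples=1000, n_features=100,
                                    n_redundant=0, random_state=1)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Accuracy on each of 5 folds; every sample is used for testing exactly once
    scores = cross_val_score(model, x, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

    Comparing the per-fold means together with their spread gives a better sense of whether a gap of a few percentage points between two models is meaningful.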