Python lightgbm: A Detailed Guide to lightgbm's Introduction, Installation, and Usage
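Introduction to lightgbm
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, and it has the following advantages:
- Optimization of speed and memory usage
- Reduced cost of computing the split gain
- Further speedup through histogram subtraction
- Reduced memory usage
- Reduced communication cost for parallel learning
- Sparse optimization
- Optimization of accuracy
- Leaf-wise (best-first) tree growth strategy
- Optimal split for categorical feature values
- Optimization of network communication
- Optimization of parallel learning
- Feature parallel
- Data parallel
- Voting parallel
- GPU support for handling large-scale data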
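The "histogram subtraction" item in the list above can be illustrated with a small sketch. This is not LightGBM's implementation: the helper `build_histogram` and the toy data are assumptions made up for illustration. The idea is that once the parent's per-bin gradient histogram is known and one child's histogram has been built, the other child's histogram follows by subtraction, without a second pass over its rows.

```python
def build_histogram(bin_ids, gradients, num_bins):
    """Sum gradients per feature bin (hypothetical helper, not a LightGBM API)."""
    hist = [0.0] * num_bins
    for b, g in zip(bin_ids, gradients):
        hist[b] += g
    return hist

# Toy data: pre-binned feature values and one gradient per row.
bin_ids = [0, 1, 1, 2, 0, 2, 1, 0]
gradients = [0.5, -1.0, 0.25, 2.0, -0.5, 1.0, 0.75, 0.25]
num_bins = 3

# Suppose a split sends rows 0..2 to the left child and rows 3..7 to the right child.
left_rows = [0, 1, 2]
right_rows = [3, 4, 5, 6, 7]

parent_hist = build_histogram(bin_ids, gradients, num_bins)
left_hist = build_histogram([bin_ids[i] for i in left_rows],
                            [gradients[i] for i in left_rows], num_bins)

# The sibling histogram comes from subtraction instead of scanning its rows.
right_hist = [p - l for p, l in zip(parent_hist, left_hist)]

# Direct computation, used only to check the subtraction result.
right_direct = build_histogram([bin_ids[i] for i in right_rows],
                               [gradients[i] for i in right_rows], num_bins)
print(right_hist, right_direct)
```

In practice the histogram would be built for the smaller child and subtracted from the parent, so the larger child is never scanned.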
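1. Efficiency
To compare efficiency, we run only the training process, without any test or metric output, and we do not count IO time. The table below compares the training times: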
Data | xgboost | xgboost_hist | LightGBM |
---|---|---|---|
Higgs | 3794.34 s | 551.898 s | 238.505513 s |
Yahoo LTR | 674.322 s | 265.302 s | 150.18644 s |
MS LTR | 1251.27 s | 385.201 s | 215.320316 s |
Expo | 1607.35 s | 588.253 s | 138.504179 s |
Allstate | 2867.22 s | 1355.71 s | 348.084475 s |
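On every dataset, LightGBM is faster than both xgboost and xgboost_hist.
2. Accuracy
To compare accuracy, we use the accuracy on the test portion of each dataset for a fair comparison.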
Data | Metric | xgboost | xgboost_hist | LightGBM |
---|---|---|---|---|
Higgs | AUC | 0.839593 | 0.845605 | 0.845154 |
Yahoo LTR | NDCG<sub>1</sub> | 0.719748 | 0.720223 | 0.732466 |
Yahoo LTR | NDCG<sub>3</sub> | 0.717813 | 0.721519 | 0.738048 |
Yahoo LTR | NDCG<sub>5</sub> | 0.737849 | 0.739904 | 0.756548 |
Yahoo LTR | NDCG<sub>10</sub> | 0.78089 | 0.783013 | 0.796818 |
MS LTR | NDCG<sub>1</sub> | 0.483956 | 0.488649 | 0.524255 |
MS LTR | NDCG<sub>3</sub> | 0.467951 | 0.473184 | 0.505327 |
MS LTR | NDCG<sub>5</sub> | 0.472476 | 0.477438 | 0.510007 |
MS LTR | NDCG<sub>10</sub> | 0.492429 | 0.496967 | 0.527371 |
Expo | AUC | 0.756713 | 0.777777 | 0.777543 |
Allstate |
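
The ranking rows in the table above report NDCG at cutoffs 1, 3, 5 and 10. As a rough sketch of what that metric measures, the code below computes NDCG@k with the common gain `2^rel - 1` and a `log2` position discount. Whether this matches the exact settings behind the table is an assumption, and the relevance labels are toy values.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, in the given order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the documents for one query, in the order the model ranked them.
ranked = [3, 2, 3, 0, 1, 2]
scores = {k: ndcg_at_k(ranked, k) for k in (1, 3, 5, 10)}
for k, value in scores.items():
    print(f"NDCG@{k} = {value:.4f}")
```

A perfectly ordered list scores 1.0 at every cutoff, and the dataset-level figure is the average over queries.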
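3. Memory consumption
We monitor RES while running the training task, and set two_round=true in LightGBM to reduce peak memory usage (this increases data loading time but does not affect training speed or accuracy).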
Data | xgboost | xgboost_hist | LightGBM |
---|---|---|---|
Higgs | 4.853GB | 3.784GB | 0.868GB |
Yahoo LTR | 1.907GB | 1.468GB | 0.831GB |
MS LTR | 5.469GB | 3.654GB | 0.886GB |
Expo | 1.553GB | 1.393GB | 0.543GB |
Allstate | 6.237GB | 4.990GB |
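4. Summary
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It can be used for ranking, classification, regression, and many other machine learning tasks.
GBDT is a popular machine learning algorithm, but when the feature dimension is high or the data volume is large, its efficiency and scalability fall short. LightGBM introduces GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) to address this. At the same accuracy, LightGBM is reported to be about 20 times faster than conventional GBDT.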
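To make the GOSS idea above more concrete, here is a minimal sketch of the sampling step only. It is not LightGBM's code: the function name `goss_sample`, the parameter names `top_rate` and `other_rate`, and the random gradients are illustrative assumptions. Instances with the largest absolute gradients are always kept, a random subset of the remaining instances is drawn, and the drawn instances are up-weighted by `(1 - top_rate) / other_rate` so that the gradient statistics stay approximately unbiased.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the instances kept by a GOSS-style sample."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Randomly draw from the small-gradient instances.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient instances to compensate for the ones dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights

data_rng = random.Random(42)
grads = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
idx, w = goss_sample(grads)
print(len(idx), "of", len(grads), "instances kept")
```

With the default rates, 30 of 100 instances are used to evaluate split gains, which is where the speedup comes from.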
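XGBoost is very popular in competitions. It is an excellent boosting framework, but in practice its training is slow and its memory footprint is relatively large. In January 2017, Microsoft open-sourced a new boosting tool on GitHub: LightGBM. Without reducing accuracy, it is about 10 times faster and uses about 3 times less memory. Because it is based on decision tree algorithms, it splits leaf nodes with a best-first, leaf-wise strategy, whereas other boosting algorithms generally grow trees depth-wise or level-wise rather than leaf-wise. As a result, when growing to the same number of leaves, the leaf-wise algorithm reduces more loss than the level-wise algorithm, which can yield higher accuracy than other existing boosting algorithms. Its speed is also striking, which is the reason for the "Light" in its name.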
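The claim that leaf-wise growth reduces more loss than level-wise growth for the same number of leaves can be checked on a toy tree. In the sketch below the per-node gains in `GAINS` are invented numbers and both growth functions are simplified illustrations, not how LightGBM or any other library is implemented.

```python
import heapq

# Hypothetical loss reduction (gain) obtained by splitting each node of a small
# binary tree. Node i has children 2*i and 2*i + 1; nodes missing from the dict
# cannot be split.
GAINS = {1: 10.0, 2: 8.0, 3: 1.0, 4: 6.0, 5: 0.5, 6: 0.4, 7: 0.3}

def leaf_wise(num_leaves):
    """Always split the current leaf with the largest gain (best-first)."""
    heap = [(-GAINS[1], 1)]  # max-heap via negated gains
    leaves, total = 1, 0.0
    while heap and leaves < num_leaves:
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        leaves += 1  # one leaf becomes two
        for child in (2 * node, 2 * node + 1):
            if child in GAINS:
                heapq.heappush(heap, (-GAINS[child], child))
    return total

def level_wise(num_leaves):
    """Split every leaf of the current depth before going deeper."""
    level, leaves, total = [1], 1, 0.0
    while level and leaves < num_leaves:
        next_level = []
        for node in level:
            if leaves >= num_leaves:
                break
            if node not in GAINS:
                continue
            total += GAINS[node]
            leaves += 1
            next_level += [2 * node, 2 * node + 1]
        level = next_level
    return total

print("leaf-wise :", leaf_wise(4))
print("level-wise:", level_wise(4))
```

With a budget of 4 leaves, the leaf-wise variant follows the high-gain branch (nodes 1, 2, 4), while the level-wise variant spends one split on the low-gain node 3. Leaf-wise trees can become deep, which is why a depth limit such as `max_depth` exists alongside `num_leaves`.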
LightGBM Chinese documentation: http://lightgbm.apachecn.org/#/
lightgbm github:https://github.com/Microsoft/LightGBM
lightgbm pypi:https://pypi.org/project/lightgbm/
Installing lightgbm
pip install lightgbm
Using lightgbm
1. class lightgbm.Dataset
class lightgbm.Dataset(data, label=None, max_bin=None, reference=None, weight=None, group=None, init_score=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)
Parameters:
- data (string, numpy array or scipy.sparse) – Data source of Dataset. If string, it represents the path to txt file.
- label (list, numpy 1-D array or None, optional (default=None)) – Label of the data.
- max_bin (int or None, optional (default=None)) – Max number of discrete bins for features. If None, default value from parameters of CLI-version will be used.
- reference (Dataset or None, optional (default=None)) – If this is Dataset for validation, training data should be used as reference.
- weight (list, numpy 1-D array or None, optional (default=None)) – Weight for each instance.
- group (list, numpy 1-D array or None, optional (default=None)) – Group/query size for Dataset.
- init_score (list, numpy 1-D array or None, optional (default=None)) – Init score for Dataset.
- silent (bool, optional (default=False)) – Whether to print messages during construction.
- feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is pandas DataFrame, data columns names are used.
- categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
- params (dict or None, optional (default=None)) – Other parameters.
- free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset.
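A minimal usage sketch for the parameters listed above, assuming `lightgbm` and `numpy` are installed. The function name `train_with_dataset`, the synthetic data, and the feature names `f0`..`f4` are made up for illustration; `lgb.Dataset`, `lgb.train` and `Booster.predict` are the library calls being demonstrated.

```python
def train_with_dataset(n_rows=200, n_features=5, seed=0):
    """Build training/validation Datasets and train a small regression model."""
    # Imports are kept inside the function so the snippet can be loaded
    # even where lightgbm or numpy is not installed.
    import numpy as np
    import lightgbm as lgb

    # Synthetic regression data (illustrative only).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n_rows)

    split = int(0.8 * n_rows)
    names = [f"f{i}" for i in range(n_features)]

    train_set = lgb.Dataset(X[:split], label=y[:split], feature_name=names)
    # Per the `reference` description above, a validation Dataset should use
    # the training Dataset as its reference.
    valid_set = lgb.Dataset(X[split:], label=y[split:], reference=train_set)

    params = {"objective": "regression", "verbose": -1}
    booster = lgb.train(params, train_set, num_boost_round=20,
                        valid_sets=[valid_set])
    return booster.predict(X[split:])
```

Calling `train_with_dataset()` returns one prediction per validation row (40 with the default sizes).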
2. The LGBMRegressor class
https://lightgbm.readthedocs.io/en/latest/Python-API.html?highlight=LGBMRegressor
class lightgbm.LGBMModel(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent=True, importance_type='split', **kwargs)
- boosting_type (string, optional (default='gbdt')) – 'gbdt', traditional Gradient Boosting Decision Tree. 'dart', Dropouts meet Multiple Additive Regression Trees. 'goss', Gradient-based One-Side Sampling. 'rf', Random Forest.
- num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
- max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, -1 means no limit.
- learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use the callbacks parameter of the fit method to shrink/adapt learning rate in training using the reset_parameter callback. Note that this will ignore the learning_rate argument in training.
- n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
- subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
- objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, 'lambdarank' for LGBMRanker.
- class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
- min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
- min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
- subsample (float, optional (default=1.)) – Subsample ratio of the training instance.
- subsample_freq (int, optional (default=0)) – Frequency of subsample, <=0 means disabled.
- colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
- reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
- reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
- random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.
- n_jobs (int, optional (default=-1)) – Number of parallel threads.
- silent (bool, optional (default=True)) – Whether to print messages while running boosting.
- importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If 'split', result contains numbers of times the feature is used in a model. If 'gain', result contains total gains of splits which use the feature.
- bagging_fraction ( )
- feature_fraction ( )
- min_data_in_leaf ( )
- min_sum_hessian_in_leaf ( )
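The 'balanced' formula quoted in the class_weight description, n_samples / (n_classes * np.bincount(y)), can be worked through by hand. The sketch below reproduces it with the standard library only; the function name `balanced_class_weights` and the labels are illustrative, and it is meant as a check of the arithmetic rather than the library's own code.

```python
from collections import Counter

def balanced_class_weights(y):
    """Per-class weight n_samples / (n_classes * count(class))."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Imbalanced toy labels: six of class 0, three of class 1, one of class 2.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(y)
for label in sorted(weights):
    print(label, round(weights[label], 4))
```

Rare classes receive larger weights, and the weights summed over all samples equal the number of samples, so the overall scale of the loss is unchanged.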
class LGBMRegressor, found at lightgbm.sklearn:
class LGBMRegressor(LGBMModel, _LGBMRegressorBase):
    """LightGBM regressor."""

    def fit(self, X, y,
            sample_weight=None, init_score=None,
            eval_set=None, eval_names=None, eval_sample_weight=None,
            eval_init_score=None, eval_metric=None, early_stopping_rounds=None,
            verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None):
        """Docstring is inherited from the LGBMModel."""
        # All arguments are forwarded unchanged to LGBMModel.fit.
        super(LGBMRegressor, self).fit(X, y, sample_weight=sample_weight, init_score=init_score,
                                       eval_set=eval_set, eval_names=eval_names, eval_sample_weight=eval_sample_weight,
                                       eval_init_score=eval_init_score, eval_metric=eval_metric,
                                       early_stopping_rounds=early_stopping_rounds, verbose=verbose,
                                       feature_name=feature_name, categorical_feature=categorical_feature, callbacks=callbacks)
        return self

    # Reuse the base class docstring, cutting out the eval_class_weight entry,
    # which does not apply to regression.
    _base_doc = LGBMModel.fit.__doc__
    fit.__doc__ = (_base_doc[:_base_doc.find('eval_class_weight :')]
                   + _base_doc[_base_doc.find('eval_init_score :'):])
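
A short usage sketch for `LGBMRegressor`, assuming `lightgbm` and `numpy` are installed. The function name `fit_regressor` and the synthetic data are made up for illustration. The `fit` signature quoted above accepts `early_stopping_rounds` and `verbose`, but newer LightGBM releases dropped these arguments in favor of callbacks, so the sketch uses the `lgb.early_stopping` callback; check this against the installed version.

```python
def fit_regressor(n_rows=300, seed=0):
    """Fit an LGBMRegressor with a validation set and early stopping."""
    # Imports are kept inside the function so the snippet can be loaded
    # even where lightgbm or numpy is not installed.
    import numpy as np
    import lightgbm as lgb

    # Synthetic data: a linear signal in two of four features plus noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 4))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_rows)

    split = int(0.8 * n_rows)
    X_train, y_train = X[:split], y[:split]
    X_valid, y_valid = X[split:], y[split:]

    model = lgb.LGBMRegressor(num_leaves=15, learning_rate=0.1,
                              n_estimators=100, random_state=seed, verbose=-1)
    model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric="l2",
        # Stop when the validation metric has not improved for 10 rounds.
        callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)],
    )
    return model, model.predict(X_valid)
```

`fit_regressor()` returns the fitted model and its predictions for the 60 validation rows. Parameters such as `num_leaves` and `learning_rate` correspond to the constructor arguments documented above.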