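This article is an introductory tutorial on data mining competitions. Using the Vehicle Loan Default Prediction Challenge as an example, it shows how to build a quick baseline with the LightGBM tree model. It covers data loading and memory optimization, EDA, feature filtering, training with 5-fold cross-validation, and writing out the predictions, and it shares ideas for going further, so that beginners can form a systematic view of competitions and get started.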

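This project is an introductory tutorial for competitions. It demonstrates how to build a competition baseline quickly with a tree model and shares ideas for improving on that baseline. The aim is to help beginners form a systematic view of competitions, get started more easily, and achieve good results.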
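LightGBM is a fast, parallel gradient-boosted tree framework in the same family as XGBoost (it is a separate implementation, not built on top of XGBoost). It integrates several ensemble-learning ideas and improves on the way XGBoost splits nodes by using histogram-based split finding and leaf-wise tree growth, which lowers memory use and speeds up training.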
LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/
Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html
Usage guide: 你应该知道的LightGBM各种操作! (All the LightGBM operations you should know)
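Why use tree models: a tree model is a powerful tool for generating rules. From data with features and labels it derives decision rules and presents them as a tree structure, which can solve both classification and regression problems.
Tasks on tabular data are largely home turf for decision-tree models, and boosted tree models such as XGBoost and LightGBM have become standard equipment in data mining competitions.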
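To make "deriving decision rules from labelled data" concrete, here is a minimal sketch of the smallest possible tree: a single split chosen by Gini impurity. The toy credit scores and labels are hypothetical, and real libraries such as LightGBM use far more elaborate split finding; this only illustrates the idea.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: a credit score and whether the loan defaulted (1) or not (0).
credit_score = [300, 420, 510, 640, 700, 780]
defaulted = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(credit_score, defaulted)
print(f"rule: credit_score <= {threshold} -> predict default (impurity {impurity:.2f})")
```

A full decision tree repeats this search recursively on each side of the split, and a boosted model adds many such trees, each one fitted to the errors of the trees before it.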
# Install LightGBM
# Default (CPU) build
!pip install lightgbm
# GPU build, trains faster
# !pip install lightgbm --install-option=--gpu
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.16.4)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
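This tutorial uses an iFLYTEK competition, the Vehicle Loan Default Prediction Challenge, as its example and builds the baseline for it with a tree model.
Competition page: http://challenge.xfyun.cn/topic/info?type=car-loan
Task: train a model on the training set and predict the value of the loan_default field in the test set, that is, whether the borrower will fall behind on payments. 1 means the customer is overdue and 0 means the customer is not.
Requirements: no special hardware is needed, and the CPU environment is enough to run this project. Tree models generally become memory-hungry only when the data has many features or high dimensionality.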
# Unzip the competition dataset
%cd /home/aistudio/data/data101719/
!unzip data.zip
/home/aistudio/data/data101719
Archive:  data.zip
  inflating: sample_submit.csv
  inflating: test.csv
  inflating: train.csv
# Import dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score
from tqdm import tqdm
import gc
import time
import lightgbm as lgb
import warnings

warnings.filterwarnings('ignore')
# Memory optimization helper, to avoid running out of memory
def reduce_mem(df, cols):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in tqdm(cols):
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type that can hold the column's range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats; note that float16 keeps only about
                # 3 significant digits, so values are rounded
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    # Memory before, memory after, and the percentage saved
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df

# Read the competition data
train = pd.read_csv('./train.csv')  # training set
test = pd.read_csv('./test.csv')    # test set
# Optimize the memory usage of both datasets
train = reduce_mem(train, [f for f in train.columns])
test = reduce_mem(test, [f for f in test.columns])

100%|██████████| 53/53 [00:01<00:00, 42.04it/s]
100%|██████████| 52/52 [00:00<00:00, 559.02it/s]
60.65 Mb, 18.02 Mb (70.28 %)
11.90 Mb, 3.55 Mb (70.19 %)
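As an alternative to a hand-written helper, pandas can pick the narrowest dtype itself through the `downcast` argument of `pd.to_numeric`. The sketch below assumes a small hypothetical frame; unlike the helper above it never goes below float32, which avoids the rounding that float16 introduces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 64-bit columns, as pd.read_csv would produce.
df = pd.DataFrame({
    "small_int": np.array([0, 1, 100], dtype=np.int64),
    "large_int": np.array([0, 70000, 150000], dtype=np.int64),
    "ratio": np.array([0.25, 0.5, 0.75], dtype=np.float64),
})

before = df.memory_usage(deep=True).sum()
for col in df.columns:
    # Integers shrink to the smallest fitting int type, floats to float32 at most.
    kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
    df[col] = pd.to_numeric(df[col], downcast=kind)
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"{before} bytes -> {after} bytes")

# Why float16 deserves care: it cannot represent every integer above 2048.
rounded = float(np.float16(2049))
print(rounded)
```

For tree models the rounding rarely matters, because splits depend only on the ordering of values, but it can merge values that were distinct before the cast.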
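# Set up the submission file in the format the competition requires: 'customer_id', 'loan_default'
# 'loan_default' is the label to predict for the test set: 1 means the customer is overdue, 0 means not overdue.
sample_submit = pd.DataFrame(columns=['customer_id', 'loan_default'])
sample_submit['customer_id'] = test['customer_id']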
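Global data analysis: the overall state of the data, including data types, size, and quality
Univariate analysis: exploratory analysis of each variable, including categorical, continuous, and text variables
Cross-feature analysis: crossing features with the label and features with one another
Train/test distribution analysis: a mismatch between the training and test distributions is a major cause of gaps between offline (local) and online (leaderboard) scores
Reference: 初学者竞赛学习手册 (A Beginner's Competition Study Handbook)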
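The train/test distribution check in the list above can be sketched as follows. `compare_distributions` is a hypothetical helper, not part of the original notebook, and the toy frames stand in for the competition data; it compares column means and measures how many test rows carry a value that never appears in training.

```python
import pandas as pd


def compare_distributions(train_df, test_df, cols):
    """Summarise, per column, how far the test set drifts from the training set."""
    rows = []
    for col in cols:
        train_values = set(train_df[col].unique())
        unseen = ~test_df[col].isin(train_values)
        rows.append({
            "column": col,
            "train_mean": train_df[col].mean(),
            "test_mean": test_df[col].mean(),
            # Share of test rows whose value never occurs in the training set.
            "unseen_share": unseen.mean(),
        })
    return pd.DataFrame(rows)


# Hypothetical toy data standing in for train.csv and test.csv.
toy_train = pd.DataFrame({"branch_id": [1, 1, 2, 3], "age": [25, 30, 35, 40]})
toy_test = pd.DataFrame({"branch_id": [1, 2, 4, 5], "age": [26, 31, 36, 41]})

report = compare_distributions(toy_train, toy_test, ["branch_id", "age"])
print(report)
```

A high `unseen_share` on an ID-like column such as `branch_id` suggests the model may lean on values it will not meet again, which is one way offline and online scores drift apart. For continuous columns the mean comparison is the more informative figure.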
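# Overview of the data size. The dataset has many fields, so making good use of the features is one of the main challenges of this competition.
train.info()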
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   customer_id                    150000 non-null  int32
 1   main_account_loan_no           150000 non-null  int16
 2   main_account_active_loan_no    150000 non-null  int16
 3   main_account_overdue_no        150000 non-null  int8
 4   main_account_outstanding_loan  150000 non-null  int32
 5   main_account_sanction_loan     150000 non-null  int32
 6   main_account_disbursed_loan    150000 non-null  int32
 7   sub_account_loan_no            150000 non-null  int8
 8   sub_account_active_loan_no     150000 non-null  int8
 9   sub_account_overdue_no         150000 non-null  int8
 10  sub_account_outstanding_loan   150000 non-null  int32
 11  sub_account_sanction_loan      150000 non-null  int32
 12  sub_account_disbursed_loan     150000 non-null  int32
 13  disbursed_amount               150000 non-null  int32
 14  asset_cost                     150000 non-null  int32
 15  branch_id                      150000 non-null  int8
 16  supplier_id                    150000 non-null  int16
 17  manufacturer_id                150000 non-null  int8
 18  area_id                        150000 non-null  int8
 19  employee_code_id               150000 non-null  int16
 20  mobileno_flag                  150000 non-null  int8
 21  idcard_flag                    150000 non-null  int8
 22  Driving_flag                   150000 non-null  int8
 23  passport_flag                  150000 non-null  int8
 24  credit_score                   150000 non-null  int16
 25  main_account_monthly_payment   150000 non-null  int32
 26  sub_account_monthly_payment    150000 non-null  int32
 27  last_six_month_new_loan_no     150000 non-null  int8
 28  last_six_month_defaulted_no    150000 non-null  int8
 29  average_age                    150000 non-null  int8
 30  credit_history                 150000 non-null  int8
 31  enquirie_no                    150000 non-null  int8
 32  loan_to_asset_ratio            150000 non-null  float16
 33  total_account_loan_no          150000 non-null  int16
 34  sub_account_inactive_loan_no   150000 non-null  int16
 35  total_inactive_loan_no         150000 non-null  int8
 36  main_account_inactive_loan_no  150000 non-null  int16
 37  total_overdue_no               150000 non-null  int8
 38  total_outstanding_loan         150000 non-null  int32
 39  total_sanction_loan            150000 non-null  int32
 40  total_disbursed_loan           150000 non-null  int32
 41  total_monthly_payment          150000 non-null  int32
 42  outstanding_disburse_ratio     150000 non-null  float64
 43  main_account_tenure            150000 non-null  int32
 44  sub_account_tenure             150000 non-null  int32
 45  disburse_to_sactioned_ratio    150000 non-null  float32
 46  active_to_inactive_act_ratio   150000 non-null  float16
 47  year_of_birth                  150000 non-null  int16
 48  disbursed_date                 150000 non-null  int16
 49  Credit_level                   150000 non-null  int8
 50  employment_type                150000 non-null  int8
 51  age                            150000 non-null  int8
 52  loan_default                   150000 non-null  int8
dtypes: float16(2), float32(1), float64(1), int16(10), int32(17), int8(22)
memory usage: 18.0 MB
# Count the distinct values in each field; fields with nunique equal to 1 are dropped outright.
train.nunique()
customer_id                      150000
main_account_loan_no                104
main_account_active_loan_no          35
main_account_overdue_no              19
main_account_outstanding_loan     48609
main_account_sanction_loan        30564
main_account_disbursed_loan       32862
sub_account_loan_no                  36
sub_account_active_loan_no           21
sub_account_overdue_no                8
sub_account_outstanding_loan       2108
sub_account_sanction_loan          1519
sub_account_disbursed_loan         1725
disbursed_amount                  19235
asset_cost                        38902
branch_id                            82
supplier_id                        2888
manufacturer_id                      10
area_id                              22
employee_code_id                   3241
mobileno_flag                         1
idcard_flag                           1
Driving_flag                          2
passport_flag                         2
credit_score                        570
main_account_monthly_payment      21499
sub_account_monthly_payment        1304
last_six_month_new_loan_no           24
last_six_month_defaulted_no          14
average_age                         100
credit_history                      100
enquirie_no                          23
loan_to_asset_ratio                1994
total_account_loan_no               103
sub_account_inactive_loan_no         90
total_inactive_loan_no               27
main_account_inactive_loan_no        91
total_overdue_no                     19
total_outstanding_loan            49406
total_sanction_loan               31216
total_disbursed_loan              33557
total_monthly_payment             21843
outstanding_disburse_ratio         4391
main_account_tenure               12816
sub_account_tenure                 1230
disburse_to_sactioned_ratio         375
active_to_inactive_act_ratio        211
year_of_birth                        48
disbursed_date                        1
Credit_level                         14
employment_type                       3
age                                  48
loan_default                          2
dtype: int64
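Rather than reading the constant fields off the listing by eye, they can be collected programmatically. This is a minimal sketch on a hypothetical toy frame; `constant_columns` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd


def constant_columns(df):
    """Return the columns that hold a single distinct value (NaN is ignored)."""
    counts = df.nunique()
    return counts[counts <= 1].index.tolist()


# Hypothetical toy frame reusing three column names from the dataset.
toy = pd.DataFrame({
    "mobileno_flag": [1, 1, 1],
    "Driving_flag": [0, 1, 0],
    "disbursed_date": [2019, 2019, 2019],
})

print(constant_columns(toy))
```

Applied to the real training set, `constant_columns(train)` would be expected to return the fields whose count is 1 in the listing above.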
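1. Feature interaction: combining features with one another and deriving new features from them
2. Feature encoding: one-hot encoding, label encoding, and so on
3. Feature selection: pruning useless features by analysing feature importance and correlation
Feature engineering is largely a way of helping the model learn. Where the model learns poorly or finds something hard to learn, manually selecting features and constructing combined features makes those things easier to learn and leads to better results.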
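The first two items can be sketched with plain pandas. The frame below is hypothetical: the column names are borrowed from the dataset, but the string categories are made up for illustration (in the competition data `employment_type` is already stored as integers).

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only.
toy = pd.DataFrame({
    "disbursed_amount": [50000, 42000, 61000],
    "asset_cost": [65000, 70000, 64000],
    "employment_type": ["salaried", "self_employed", "salaried"],
})

# Feature interaction: derive a ratio from two existing columns.
toy["amount_to_cost"] = toy["disbursed_amount"] / toy["asset_cost"]

# Label encoding: map each category to an integer code.
toy["employment_type_code"] = toy["employment_type"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["employment_type"], prefix="employment_type")
toy = pd.concat([toy, one_hot], axis=1)

print(toy)
```

Tree models usually cope well with label-encoded categories, so one-hot encoding is often reserved for linear models or for categories with very few levels.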
# Filter out useless features: the ID, the label, and the constant fields (nunique == 1)
all_cols = [f for f in train.columns if f not in ['customer_id', 'loan_default', 'mobileno_flag', 'idcard_flag', 'disbursed_date']]
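The filter above is hand-picked. Correlation-based selection, mentioned in the list earlier, can be sketched as follows; `drop_correlated` and its 0.95 threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Hypothetical toy rows: age and year_of_birth carry the same information.
toy = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "year_of_birth": [1994, 1989, 1984, 1979],
    "credit_score": [600, 720, 580, 650],
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(reduced.columns.tolist())
```

In the competition data `age` and `year_of_birth` both have 48 distinct values, so they are a plausible candidate pair for this kind of check, though that would need to be confirmed on the real columns.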
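This section mainly demonstrates how to build a competition baseline quickly with a tree model. Feature engineering and model tuning need targeted work based on the requirements of the specific competition.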
# Training features
x_train = train[all_cols]
# Training labels
y_train = train['loan_default']
# Test set to predict on
x_test = test[all_cols]
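The training function that follows relies on K-fold cross-validation with out-of-fold (OOF) predictions. As a minimal sketch of that mechanism in isolation, the example below uses synthetic data and a scikit-learn `LogisticRegression` as a stand-in for LightGBM; both are assumptions made purely to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(2021)
X = rng.normal(size=(500, 4))       # synthetic training features
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_test = rng.normal(size=(100, 4))  # synthetic test features

kf = KFold(n_splits=5, shuffle=True, random_state=2021)
oof = np.zeros(len(X))              # one out-of-fold prediction per training row
test_pred = np.zeros(len(X_test))   # test predictions averaged over the folds

for train_idx, valid_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Each training row is predicted exactly once, by a model that never saw it.
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    # Every fold's model votes on the test set with equal weight.
    test_pred += model.predict_proba(X_test)[:, 1] / kf.n_splits

print("OOF AUC:", roc_auc_score(y, oof))
```

Because every OOF prediction comes from a model that did not train on that row, the OOF score is an honest offline estimate, and the OOF array can later be reused for threshold tuning or stacking.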
# Define the training and prediction function
def train_predict(clf, train_x, train_y, test_x, clf_name='lgb'):
    # 5-fold cross-validation
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    oof_train = np.zeros(train_x.shape[0])  # out-of-fold predictions for the training set
    oof_test = np.zeros(test_x.shape[0])    # test predictions averaged over the folds

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # kf.split yields positional indices, so select rows with .iloc
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]

        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)

        # Tree model parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2021,
            'n_jobs': -1,
            'verbose': -1,
        }

        # Early stopping and the logging interval should be tuned case by case.
        # verbose_eval / early_stopping_rounds belong to the LightGBM 3.x API used here;
        # LightGBM 4.0 and later expect the callbacks lgb.early_stopping(200) and
        # lgb.log_evaluation(500) instead.
        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                          verbose_eval=500, early_stopping_rounds=200)

        # Predict on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predict on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        oof_train[valid_index] = val_pred
        oof_test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))

        # Print the validation scores so far
        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))

    # After training, print the feature importance.
    # Note: `model` is the last fold's model, so these are its split counts
    # divided by the number of folds, not an average over all folds.
    print(pd.DataFrame({
        'column': list(train_x.columns),
        'importance': model.feature_importance() / folds,
    }).sort_values(by='importance', ascending=False))

    return oof_train, oof_test

# Train the model and predict
lgb_train, lgb_test = train_predict(lgb, x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500] training's auc: 0.757221 valid_1's auc: 0.665608
Early stopping, best iteration is:
[648] training's auc: 0.774819 valid_1's auc: 0.666395
[0.6663954692558639]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[500] training's auc: 0.756217 valid_1's auc: 0.6646
Early stopping, best iteration is:
[774] training's auc: 0.786664 valid_1's auc: 0.665809
[0.6663954692558639, 0.6658088579217993]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[500] training's auc: 0.757318 valid_1's auc: 0.664588
[1000] training's auc: 0.809107 valid_1's auc: 0.665196
Early stopping, best iteration is:
[840] training's auc: 0.794933 valid_1's auc: 0.665534
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[500] training's auc: 0.758371 valid_1's auc: 0.650627
[1000] training's auc: 0.809869 valid_1's auc: 0.652059
Early stopping, best iteration is:
[996] training's auc: 0.809559 valid_1's auc: 0.652149
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[500] training's auc: 0.757135 valid_1's auc: 0.662366
Early stopping, best iteration is:
[692] training's auc: 0.779432 valid_1's auc: 0.662648
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_list: [0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_mean: 0.6625071824914504
lgb_score_std: 0.005338481209206612
column importance
18 employee_code_id 1421.0
15 supplier_id 1374.6
14 branch_id 1341.0
29 loan_to_asset_ratio 1307.0
12 disbursed_amount 1150.2
13 asset_cost 1089.6
44 year_of_birth 995.4
21 credit_score 781.6
17 area_id 760.6
39 outstanding_disburse_ratio 635.6
27 credit_history 565.6
40 main_account_tenure 560.8
26 average_age 560.8
22 main_account_monthly_payment 445.8
16 manufacturer_id 434.6
38 total_monthly_payment 371.6
3 main_account_outstanding_loan 339.4
43 active_to_inactive_act_ratio 304.6
35 total_outstanding_loan 264.8
36 total_sanction_loan 233.0
46 employment_type 228.6
4 main_account_sanction_loan 213.2
37 total_disbursed_loan 205.6
28 enquirie_no 188.8
5 main_account_disbursed_loan 182.2
31 sub_account_inactive_loan_no 155.4
0 main_account_loan_no 155.4
25 last_six_month_defaulted_no 155.2
30 total_account_loan_no 152.6
42 disburse_to_sactioned_ratio 141.6
33 main_account_inactive_loan_no 134.4
1 main_account_active_loan_no 126.4
2 main_account_overdue_no 126.4
24 last_six_month_new_loan_no 122.8
47 age 117.6
34 total_overdue_no 87.4
45 Credit_level 53.4
19 Driving_flag 27.0
23 sub_account_monthly_payment 12.4
41 sub_account_tenure 12.4
6 sub_account_loan_no 10.8
9 sub_account_outstanding_loan 8.0
20 passport_flag 7.0
32 total_inactive_loan_no 5.6
10 sub_account_sanction_loan 5.6
11 sub_account_disbursed_loan 3.0
8 sub_account_overdue_no 0.2
7 sub_account_active_loan_no 0.2
# Save the predictions
sample_submit['loan_default'] = lgb_test
# The competition requires 0 or 1 as output, so the predicted probabilities must be converted.
# Here values above 0.25 become 1 and values at or below 0.25 become 0.
sample_submit['loan_default'] = sample_submit['loan_default'].apply(lambda x: 1 if x > 0.25 else 0).values
# Write the result file
sample_submit.to_csv('result.csv', index=False)
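The 0.25 cut-off above is fixed by hand. One way to choose it from data is to scan thresholds on the out-of-fold predictions and keep the one with the best F1 score. The sketch below assumes F1 is the metric of interest (the competition page should be checked for the actual metric) and uses hypothetical toy labels and probabilities; `f1_at` and `best_threshold` are illustrative helpers.

```python
import numpy as np


def f1_at(y_true, prob, threshold):
    """F1 score of the positive class when predicting 1 for prob > threshold."""
    pred = prob > threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, prob, grid=None):
    """Scan a grid of thresholds and return (threshold, f1) for the best one."""
    if grid is None:
        grid = np.arange(0.05, 0.95, 0.01)
    scores = [f1_at(y_true, prob, t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Hypothetical labels and out-of-fold probabilities.
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_toy = np.array([0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.60])

threshold, score = best_threshold(y_toy, p_toy)
print(f"best threshold {threshold:.2f}, F1 {score:.3f}")
```

With the notebook's variables, the equivalent call would be `best_threshold(y_train.values, lgb_train)`, since `lgb_train` holds the out-of-fold predictions returned by `train_predict`. Tuning on OOF predictions rather than on training-set predictions keeps the chosen threshold from being overly optimistic.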