[Financial Risk Control Series] [3] Loan Default Identification

P粉084495128
Published: 2025-07-22 11:59:21
Original work
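This article works through Kaggle's Home Credit Default Risk competition, using the 7 tables of data (the customer application table and others) to build a model that predicts customers' repayment ability. After data cleaning and feature engineering that merges information from several tables into derived features, a LightGBM model is trained; the final online score is 0.78277. The result offers a reference for assessing loans to people with little or no credit history.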


Home Credit Default Risk

This competition comes from Kaggle and is used here for learning and discussion only.


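Because their credit history is thin or nonexistent, many people are classed as low-credit borrowers and struggle to obtain loans. To make sure this group can still get loans, Home Credit uses alternative data (including telecom and transaction information) to predict customers' repayment ability.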

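Home Credit provides 7 tables with 218 fields in total. The training set has about 310,000 samples (about 8% of them overdue) and the test set has about 50,000 samples.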
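With only about 8% positives, plain accuracy is a poor yardstick, which is presumably why the competition leaderboard is scored by ROC AUC. The sketch below uses synthetic labels (not the competition data; the sample size and random seed are arbitrary assumptions) to show that a "nobody defaults" predictor looks accurate while having no ranking ability.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with roughly 8% positives, mirroring the overdue rate quoted above
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.08).astype(int)

# A trivial "model" that gives every applicant the same score (nobody defaults)
constant_scores = np.zeros(len(y_true))
constant_labels = (constant_scores > 0.5).astype(int)

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)

print("accuracy:", round(acc, 3))  # high, because most labels are 0
print("ROC AUC :", auc)            # 0.5, i.e. no ability to rank applicants
```

A score such as the 0.78277 reported in this article is therefore best read against the 0.5 baseline of a constant predictor, not against 0.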


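Tables

application_train/test: customer application table

Contains

  • the target variable (whether the customer defaulted, a 0/1 variable)
  • loan application details (loan type, total credit amount, annuity)
  • basic customer information (gender, age, family, education, occupation, industry, housing situation)
  • customer financial information (annual income, home/car ownership)
  • documents provided with the application, and so on.

bureau/bureau_balance: customers' credit history reported to the credit bureau by other financial institutions (monthly data)

Contains the customer's credit-bureau

  • credit records,
  • default amounts,
  • default dates, and so on.

The data is recorded as a time series, one row per record.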
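Because the bureau data is stored one row per month per credit, it has to be rolled up twice before it can join the application table: first to one row per bureau credit, then to one row per applicant. The sketch below shows that idea on hand-made toy rows; the values and the output column names (`N_MONTHS`, `LATE_RATE`, `N_CREDITS`, `MEAN_LATE_RATE`) are illustrative assumptions, not features from the notebook.

```python
import pandas as pd

# Toy monthly status rows: one row per (credit, month)
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0],
    "STATUS": ["0", "1", "C", "0", "0"],
})
# Toy link table: which credit belongs to which applicant
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1], "SK_ID_BUREAU": [10, 11]})

# Step 1: monthly rows -> one row per credit
per_credit = (
    bureau_balance
    .assign(IS_LATE=bureau_balance["STATUS"].isin(list("12345")))
    .groupby("SK_ID_BUREAU")
    .agg(N_MONTHS=("MONTHS_BALANCE", "size"), LATE_RATE=("IS_LATE", "mean"))
    .reset_index()
)

# Step 2: credits -> one row per applicant
per_client = (
    bureau.merge(per_credit, on="SK_ID_BUREAU", how="left")
    .groupby("SK_ID_CURR")
    .agg(N_CREDITS=("SK_ID_BUREAU", "nunique"), MEAN_LATE_RATE=("LATE_RATE", "mean"))
    .reset_index()
)
print(per_client)
```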

POS_CASH_balance: history of the customer's POS (point of sales) and cash loans in the Home Credit database (monthly data)

Contains the customer's

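  • installments already paid
  • installments not yet paid

credit_card_balance: snapshot history of the customer's credit cards in the Home Credit database (monthly data)

Contains the customer's

  • number of transactions
  • transaction amounts

and so on.

previous_application: the customer's previous application records

Contains all of the customer's historical applications (application details, application outcome, and so on).

installments_payments: repayment history for the customer's previous credits at Home Credit

Contains the customer's repayment details

  • payment date
  • whether the payment was overdue
  • payment amount
  • whether any amount is still owed, and so on

References: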

[1] https://zhuanlan.zhihu.com/p/43541825

[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/

[3] https://zhuanlan.zhihu.com/p/40790434

[4] https://www.kaggle.com/tahmidnafi/cse499

[5] https://blog.csdn.net/zhangchen2449/article/details/83338978

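Main fields

Field Description
SK_ID_CURR ID of this application
TARGET Repayment risk of this application: 1 = higher risk; 0 = lower risk
NAME_CONTRACT_TYPE Loan type: cash or revolving (apply once, draw down repeatedly)
CODE_GENDER Applicant's gender
FLAG_OWN_CAR Whether the applicant owns a car
FLAG_OWN_REALTY Whether the applicant owns real estate
CNT_CHILDREN Number of children the applicant has
AMT_INCOME_TOTAL Applicant's income
AMT_CREDIT Credit amount of this application
AMT_ANNUITY Loan annuity
AMT_GOODS_PRICE For consumer loans, the actual price of the goods
NAME_TYPE_SUITE Who accompanied the applicant when applying
NAME_INCOME_TYPE Applicant's income type
NAME_EDUCATION_TYPE Applicant's education level
NAME_FAMILY_STATUS Applicant's marital status
NAME_HOUSING_TYPE Applicant's housing situation (renting, owns a home, living with parents, etc.)
REGION_POPULATION_RELATIVE Population density of the applicant's region of residence, normalized
DAYS_BIRTH Applicant's date of birth (days relative to the application date, negative)
DAYS_EMPLOYED How long the applicant has held the current job (days relative to the application date, negative)
DAYS_REGISTRATION When the applicant last changed their registration (days relative to the application date, negative)
DAYS_ID_PUBLISH When the applicant last changed the identity document used for the application (days relative to the application date, negative)
FLAG_MOBIL Whether the applicant provided a personal phone (1 = yes, 0 = no)
FLAG_EMP_PHONE Whether the applicant provided a home phone (1 = yes, 0 = no)
FLAG_WORK_PHONE Whether the applicant provided a work phone (1 = yes, 0 = no)
FLAG_CONT_MOBILE Whether the applicant's personal phone was reachable (1 = yes, 0 = no)
FLAG_EMAIL Whether the applicant provided an email address (1 = yes, 0 = no)
OCCUPATION_TYPE Applicant's occupation
REGION_RATING_CLIENT Home Credit's rating of the region where the applicant lives (1, 2, 3)
REGION_RATING_CLIENT_W_CITY Home Credit's rating of the region where the applicant lives, taking the city into account (1, 2, 3)
WEEKDAY_APPR_PROCESS_START Day of the week on which the application was started
HOUR_APPR_PROCESS_START Hour of the day at which the application was started
REG_REGION_NOT_LIVE_REGION Whether the permanent address and contact address match (1 = different, 0 = same, at region level)
REG_REGION_NOT_WORK_REGION Whether the permanent address and work address match (1 = different, 0 = same, at region level)
LIVE_REGION_NOT_WORK_REGION Whether the contact address and work address match (1 = different, 0 = same, at region level)
REG_CITY_NOT_LIVE_CITY Whether the permanent address and contact address match (1 = different, 0 = same, at city level)
REG_CITY_NOT_WORK_CITY Whether the permanent address and work address match (1 = different, 0 = same, at city level)
LIVE_CITY_NOT_WORK_CITY Whether the contact address and work address match (1 = different, 0 = same, at city level)
ORGANIZATION_TYPE Type of organization the applicant works for
EXT_SOURCE_1 Normalized score from external data source 1
EXT_SOURCE_2 Normalized score from external data source 2
EXT_SOURCE_3 Normalized score from external data source 3
APARTMENTS_AVG <----> EMERGENCYSTATE_MODE Normalized scores for various aspects of the applicant's housing
OBS_30_CNT_SOCIAL_CIRCLE <----> DEF_60_CNT_SOCIAL_CIRCLE The meaning of these fields was not clear to the author
DAYS_LAST_PHONE_CHANGE When the applicant last changed their phone number (days relative to the application date, negative)
FLAG_DOCUMENT_2 <----> FLAG_DOCUMENT_21 Whether the applicant additionally provided document 2, 3, 4, ..., 21
AMT_REQ_CREDIT_BUREAU_HOUR Number of credit bureau enquiries about the applicant within one hour before the application
AMT_REQ_CREDIT_BUREAU_DAY Number of credit bureau enquiries about the applicant within one day before the application
AMT_REQ_CREDIT_BUREAU_WEEK Number of credit bureau enquiries about the applicant within one week before the application
AMT_REQ_CREDIT_BUREAU_MON Number of credit bureau enquiries about the applicant within one month before the application
AMT_REQ_CREDIT_BUREAU_QRT Number of credit bureau enquiries about the applicant within one quarter before the application
AMT_REQ_CREDIT_BUREAU_YEAR Number of credit bureau enquiries about the applicant within one year before the application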
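The `DAYS_*` fields count days backwards from the application date, so they are negative, and the notebook later treats 365243 in `DAYS_EMPLOYED` as a placeholder and not a real duration. A small sketch of how such columns can be turned into readable values; the two rows and the derived column names (`AGE_YEARS`, `YEARS_EMPLOYED`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical applicants; the second carries the 365243 placeholder
app = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, 365243],
})

# Negative day offsets become positive ages in years
app["AGE_YEARS"] = -app["DAYS_BIRTH"] / 365

# Treat the placeholder as missing before deriving anything from it
app["DAYS_EMPLOYED"] = app["DAYS_EMPLOYED"].replace(365243, np.nan)
app["YEARS_EMPLOYED"] = -app["DAYS_EMPLOYED"] / 365

print(app)
```

Leaving the placeholder in place would make the second applicant look as if they had been employed for a negative thousand years, which would distort every ratio built on that column.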
In [20]
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
       
In [22]
# Install dependencies
!pip install xgboost
!pip install lightgbm
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: xgboost in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.3.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.6.3)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.20.3)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.24.2)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.6.3)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.36.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
       
In [23]
import os
import gc
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
import xgboost as xgb
from xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)
from lightgbm import LGBMClassifier
   
In [24]
DATA_DIRECTORY = "./data"
df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df = pd.concat([df_train, df_test])
del df_train, df_test; gc.collect()

39
               
In [25]
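# Drop the extreme income outlier and the rows with unknown gender
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
# 365243 and 0 act as placeholders here; assign the result back so the change does not depend on inplace semantics
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(365243, np.nan)
df['DAYS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan)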
   
In [26]
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 65: return 4
    elif age_years < 99: return 5
    else: return 0
   
In [27]
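# Document flags: how many documents were provided, and how uneven the flags are
docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
# Bucket age into the groups defined by get_age_group
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(get_age_group)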
   
In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
# np.warnings is not available in recent NumPy releases; use the warnings module directly
warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    # getattr looks up np.min, np.max, ... by name without resorting to eval
    df[feature_name] = getattr(np, function_name)(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
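The EXT_SOURCE columns contain many missing values, and the row-wise statistics and the product react to them differently. A toy sketch (two invented rows, not competition data) of that difference:

```python
import numpy as np
import pandas as pd

cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
ext = pd.DataFrame({
    "EXT_SOURCE_1": [0.2, np.nan],
    "EXT_SOURCE_2": [0.4, 0.6],
    "EXT_SOURCE_3": [0.6, np.nan],
})

# Row-wise mean skips NaN, so a row with a single score still gets a value
ext["EXT_SOURCES_MEAN"] = ext[cols].mean(axis=1)

# The product propagates NaN: one missing score makes the whole feature missing
ext["EXT_SOURCES_PROD"] = ext["EXT_SOURCE_1"] * ext["EXT_SOURCE_2"] * ext["EXT_SOURCE_3"]

print(ext)
```

Tree models such as LightGBM accept the resulting NaN values directly, so the two kinds of feature can complement each other.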
   
In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']    
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']
   
In [30]
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [31]
def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [32]
def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [33]
def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
   
In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')
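The group statistics attach to every applicant the mean, median or spread of their peer group. The toy sketch below reproduces the groupby-then-merge pattern on three invented rows and compares it with `groupby(...).transform`, a shorter pandas idiom that gives the same values when the grouping keys have no missing entries. The column names `ORG` and `INCOME` are placeholders, not the notebook's.

```python
import pandas as pd

toy = pd.DataFrame({
    "ORG": ["a", "a", "b"],
    "INCOME": [100.0, 300.0, 50.0],
})

# Pattern used by the helper functions: aggregate, rename, merge back
gp = (toy.groupby("ORG")["INCOME"].mean().reset_index()
         .rename(columns={"INCOME": "GROUP_INCOME_MEAN"}))
merged = toy.merge(gp, on="ORG", how="left")

# Alternative idiom: transform returns a value aligned with every original row
via_transform = toy.groupby("ORG")["INCOME"].transform("mean")

print(merged)
print(via_transform.tolist())
```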
   
In [35]
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])
    return df, categorical_columns
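`pd.factorize` assigns integer codes in order of first appearance and, by default, encodes missing values as -1. A short sketch on an invented occupation column:

```python
import pandas as pd

occupation = pd.Series(["Laborers", None, "Managers", "Laborers"])
codes, uniques = pd.factorize(occupation)

print(codes.tolist())    # integer code per row; the missing entry becomes -1
print(list(uniques))     # distinct non-missing values in order of first appearance
```

The codes carry no ordering information, which tree-based models such as LightGBM tolerate better than linear models do.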
   
In [36]
def drop_application_columns(df):
    drop_list = [
        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',
        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',
        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_AVG',
        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',
        'FLOORSMIN_AVG', 'LANDAREA_AVG', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',
        'COMMONAREA_MEDI', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'BASEMENTAREA_AVG',
        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
        'LIVINGAPARTMENTS_AVG', 'ELEVATORS_AVG', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',
        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',
        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_AVG', 'HOUSETYPE_MODE',
        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]
    for doc_num in [2,4,5,6,7,9,10,11,12,13,14,15,16,17,19,20,21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)
    return df
   
In [37]
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)
   
In [38]
# After label encoding no object columns remain, so get_dummies should leave df unchanged here
df = pd.get_dummies(df)
   
In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))
   
In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']
   
In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]
    return df, categorical_columns
   
In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()
   
In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by= aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on= aggregate_by)
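The `group` helper relies on the fact that aggregating with a dict of lists yields two-level column labels, which it then flattens into names such as `BUREAU_AMT_CREDIT_SUM_MAX`. The sketch below repeats the helper on three invented rows so the naming scheme is visible; only the toy data is an assumption.

```python
import pandas as pd

def group(df_to_agg, prefix, aggregations, aggregate_by="SK_ID_CURR"):
    # Aggregate, then flatten the (column, function) pairs into single names
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(["{}{}_{}".format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()

toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})
out = group(toy, "BUREAU_", {"AMT_CREDIT_SUM": ["max", "mean"]})
print(out)
```

The aggregation dictionaries later in the notebook are then easier to read: each key is a source column and each list entry becomes one suffixed output column.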
   
In [44]
def get_bureau_balance(path, num_rows= None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category= False)
    # Share of months spent in each status category, per bureau credit
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()
    # Min, Max, Count and mean duration of payments (months)
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')
    del bb; gc.collect()
    return bb_processed
   
In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category= False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
# Combine the overdue statuses 1 to 5 into a single share
bureau['STATUS_12345'] = 0
for i in range(1,6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]
   
In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',
    'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
# Average each feature over all credits with the same history length (in months)
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')
del agg_length; gc.collect()

39
               
In [47]
BUREAU_AGG = {
    'SK_ID_BUREAU': ['nunique'],
    'DAYS_CREDIT': ['min', 'max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],
    'AMT_ANNUITY': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean', 'sum'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_0': ['mean'],
    'STATUS_1': ['mean'],
    'STATUS_12345': ['mean'],
    'STATUS_C': ['mean'],
    'STATUS_X': ['mean'],
    'CREDIT_ACTIVE_Active': ['mean'],
    'CREDIT_ACTIVE_Closed': ['mean'],
    'CREDIT_ACTIVE_Sold': ['mean'],
    'CREDIT_TYPE_Consumer credit': ['mean'],
    'CREDIT_TYPE_Credit card': ['mean'],
    'CREDIT_TYPE_Car loan': ['mean'],
    'CREDIT_TYPE_Mortgage': ['mean'],
    'CREDIT_TYPE_Microloan': ['mean'],
    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'LL_DEBT_CREDIT_DIFF': ['mean'],
    'LL_STATUS_12345': ['mean'],
}

BUREAU_ACTIVE_AGG = {
    'DAYS_CREDIT': ['max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['min', 'mean'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'CREDIT_TO_ANNUITY_RATIO': ['mean'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}

BUREAU_CLOSED_AGG = {
    'DAYS_CREDIT': ['max', 'var'],
    'DAYS_CREDIT_ENDDATE': ['max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],
    'DAYS_CREDIT_UPDATE': ['max'],
    'ENDDATE_DIF': ['mean'],
    'STATUS_12345': ['mean'],
}

BUREAU_LOAN_TYPE_AGG = {
    'DAYS_CREDIT': ['mean', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
    'AMT_CREDIT_SUM': ['mean', 'max'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'DAYS_CREDIT_ENDDATE': ['max'],
}

BUREAU_TIME_AGG = {
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'STATUS_0': ['mean'],
    'STATUS_12345': ['mean'],
}
   
In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active,agg_bureau,'BUREAU_ACTIVE_',BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed,agg_bureau,'BUREAU_CLOSED_',BUREAU_CLOSED_AGG)
del active, closed; gc.collect()
# Separate aggregates per credit type
for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)
    del type_df; gc.collect()
# Aggregates over credits opened in the last 6 and 12 months (30 days per month)
for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30*time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)
    del time_frame_df; gc.collect()
   
In [49]
# Maximum overdue amount on the most recently opened bureau credit
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
# columns= is needed: without it rename targets index labels and the column keeps its old name
gr.rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
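One pandas detail is worth keeping in mind when renaming an aggregated column: `DataFrame.rename` with a bare dict acts on the index labels by default, so a column name passed that way is silently left alone. A toy sketch (single invented row) of the difference:

```python
import pandas as pd

gr = pd.DataFrame({"SK_ID_CURR": [1], "AMT_CREDIT_MAX_OVERDUE": [0.0]})
mapping = {"AMT_CREDIT_MAX_OVERDUE": "BUREAU_LAST_LOAN_MAX_OVERDUE"}

# Bare dict: applied to the row index, so the column keeps its old name
no_effect = gr.rename(mapping)

# columns= (or axis=1): the column is actually renamed
renamed = gr.rename(columns=mapping)

print(list(no_effect.columns))
print(list(renamed.columns))
```

No error is raised in the first case, so the only symptom is a feature that shows up later under an unexpected name.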
   
In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')
del agg_bureau, bureau
gc.collect()

39
               
In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))
   
In [52]
PREVIOUS_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
    'RATE_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['max'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],
    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}

PREVIOUS_ACTIVE_AGG = {
    'SK_ID_PREV': ['nunique'],
    'SIMPLE_INTERESTS': ['mean'],
    'AMT_ANNUITY': ['max', 'sum'],
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['sum'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'AMT_PAYMENT': ['sum'],
    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],
    'REMAINING_DEBT': ['max', 'mean', 'sum'],
    'REPAYMENT_RATIO': ['mean'],
}

PREVIOUS_LATE_PAYMENTS_AGG = {
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_LOAN_TYPE_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],
    'APPLICATION_CREDIT_DIFF': ['min', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['max'],
    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],
    'CNT_PAYMENT': ['mean'],
}

PREVIOUS_TIME_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['mean', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_APPROVED_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_CREDIT': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max'],
    'AMT_GOODS_PRICE': ['max'],
    'HOUR_APPR_PROCESS_START': ['min', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['mean'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['max'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],
    'DAYS_FIRST_DUE': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE': ['max', 'mean'],
    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],
    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}

PREVIOUS_REFUSED_AGG = {
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['min', 'max'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
   
In [53]
ohe_columns = [
    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',
    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category= False)
   
In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT']/prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment/prev['AMT_CREDIT'] - 1)/prev['CNT_PAYMENT']
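The `SIMPLE_INTERESTS` feature can be read as a rough per-installment interest rate: the total repaid (annuity times number of payments) divided by the credit amount, minus one, spread over the number of payments. A worked example with invented figures (a 100,000 credit repaid in 12 installments of 10,000):

```python
# Hypothetical previous application, not taken from the dataset
amt_credit = 100_000.0
amt_annuity = 10_000.0
cnt_payment = 12

# Same arithmetic as the notebook's SIMPLE_INTERESTS feature
total_payment = amt_annuity * cnt_payment           # 120,000 repaid in total
total_interest_rate = total_payment / amt_credit - 1  # 20% over the whole term
simple_interest = total_interest_rate / cnt_payment   # about 1.67% per installment

print(round(simple_interest, 4))
```

It ignores compounding and the declining balance, so it is a comparative signal between applications and not a true annual percentage rate.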
   
In [55]
# .copy() makes approved an independent frame, so adding columns to it later does not trigger SettingWithCopyWarning
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1].copy()
# DAYS_LAST_DUE == 365243 marks credits that are still active
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace= True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on= 'SK_ID_PREV', how= 'left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM']/\
                                            active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']
del active_pay, active_pay_agg, active_df; gc.collect()

0
               
In [56]
# 365243 is a placeholder for "no date"; replace it with NaN and assign the result back
for col in ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
            'DAYS_LAST_DUE', 'DAYS_TERMINATION']:
    prev[col] = prev[col].replace(365243, np.nan)
   
In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']
   
In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}
   
In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')
del active_agg_df; gc.collect()

0
               
In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)
del approved, refused; gc.collect()

0
               
In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)
    del type_df; gc.collect()
   
In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()
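The late-payment flag compares the day a payment was actually entered with the day the installment was due; both are day offsets, so a larger (less negative) entry day means the payment came after the due date. The sketch below checks the `apply`-based version against an equivalent vectorized comparison on three invented rows; on the full installments table the vectorized form is usually much faster.

```python
import pandas as pd

pay = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2],
    "DAYS_INSTALMENT": [-30, -60, -10],      # when each installment was due
    "DAYS_ENTRY_PAYMENT": [-28, -65, -10],   # when it was actually paid
})

# Version in the style of the notebook: row-by-row lambda
late_apply = (pay["DAYS_ENTRY_PAYMENT"] - pay["DAYS_INSTALMENT"]).apply(
    lambda x: 1 if x > 0 else 0)

# Vectorized equivalent: paid later than due
late_vec = (pay["DAYS_ENTRY_PAYMENT"] > pay["DAYS_INSTALMENT"]).astype(int)

# Previous credits with at least one late installment
dpd_id = pay.loc[late_vec > 0, "SK_ID_PREV"].unique()

print(late_vec.tolist(), dpd_id.tolist())
```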
   
In [63]
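# Note: the merged result is stored in agg_dpd and deleted right away, so the
# PREV_LATE_ features never reach agg_prev; assign to agg_prev instead if they are wanted
agg_dpd = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,
                                    'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)
del agg_dpd, dpd_id; gc.collect()

0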
               
In [64]
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30*time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)
    del time_frame_df; gc.collect()
del prev; gc.collect()

0
               
In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')
   
In [66]
# Rows with a TARGET value are the training set; rows without one are the test set
train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]
del df
gc.collect()

98
               
In [67]
labels = train['TARGET']
test_labels = test['TARGET']
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])
   
In [68]
feature = list(train.columns)

# Ratio features can produce infinities (division by zero); treat them as missing
train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_labels
   
In [69]
imputer = SimpleImputer(strategy = 'median')
# Fit on the training data only; a second fit on test would overwrite the training medians
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
   
In [70]
scaler = MinMaxScaler(feature_range = (0, 1))
# Same rule for the scaler: learn min and max from train, then apply to both sets
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
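Preprocessing steps such as imputation and scaling are normally fitted on the training data alone and then applied unchanged to the test data. One way to make that hard to get wrong is a scikit-learn pipeline; the sketch below uses tiny invented arrays, and `make_pipeline` is a suggestion here, not something the notebook itself uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: one missing value in train, one in test
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[5.0, np.nan]])

prep = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())

# fit_transform learns medians and min/max from train only
X_train_t = prep.fit_transform(X_train)
# transform reuses those training statistics for the test rows
X_test_t = prep.transform(X_test)

print(X_train_t)
print(X_test_t)  # values outside [0, 1] are possible for unseen ranges
```

The test value 5.0 maps to 2.0 because it lies outside the training range; that is expected and harmless for tree models, which only depend on the ordering of feature values.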
   
In [71]
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)

LGBMClassifier()

In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]
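The notebook trains once on all labelled rows and relies on the Kaggle leaderboard for evaluation. A local estimate can be obtained beforehand with stratified K-fold cross-validation, which keeps the share of defaults roughly equal in every fold. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` and a `LogisticRegression` stand-in so that it runs without LightGBM; in the notebook's setting the model would be `LGBMClassifier` and the data `train` and `labels`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in for the real feature matrix (about 8% positives)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.92], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)  # stand-in for LGBMClassifier
    model.fit(X[train_idx], y[train_idx])
    valid_pred = model.predict_proba(X[valid_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[valid_idx], valid_pred))

print("fold AUCs:", np.round(fold_aucs, 3))
print("mean AUC :", round(float(np.mean(fold_aucs)), 3))
```

The spread between folds gives a feel for how much of a leaderboard difference is noise.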
   
In [74]
# .copy() avoids pandas' SettingWithCopyWarning when the TARGET column is added
submit = test_df[['SK_ID_CURR']].copy()
submit['TARGET'] = lgbm_pred
   
In [75]
submit.to_csv('lgbm.csv', index = False)
   

Summary

The submission result is shown below (submitting requires a VPN or proxy to reach Kaggle):

Dataset Home Credit Default Risk
Online score 0.78277

Source: php中文网