Overview

In the last Data Science project we worked through the U.S. Adult Income dataset and reached around 86% model accuracy, which is in line with the top-level accuracy in the Kaggle competition for this dataset. We followed these steps:

  1. Understand the business problem.
  2. EDA (Exploratory Data Analysis): look through and investigate the overall dataset, visualize it with matplotlib, and find any missing values and outliers.
  3. Data cleaning: impute the missing values and treat the outliers.
  4. Baseline model: a dummy classifier gave us 75% accuracy as the baseline, meaning that a model below 75% accuracy does nothing better than always predicting the majority class, while a model above this value has some real skill at classifying the labels (see the sketch after this list).
  5. Model evaluation and fine-tuning: we evaluated Support Vector Machine, RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier and a Neural Network; the best-performing model was GradientBoostingClassifier, which provided 86% accuracy.
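As a quick illustration of step 4, the accuracy of a most-frequent dummy classifier is simply the share of the majority class in the target. Here is a minimal sketch, assuming the same datasets/adult.csv file that we load later in this post:

import pandas as pd

# About 75% of the rows are '<=50K', so a DummyClassifier(strategy='most_frequent')
# that always predicts that class already scores roughly 75% accuracy.
df = pd.read_csv('datasets/adult.csv', na_values="?")
print(df['income'].value_counts(normalize=True))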

Today we are going to use another lightweight, powerful, and fast algorithm: LightGBM (installable via pip install lightgbm), open-sourced in 2017 and now maintained by Microsoft.

As the EDA was already covered in the last blog post, we will skip it and move directly into today's topic.

Get the imports done and read the dataset


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb

import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('datasets/adult.csv', na_values="?")  # the raw file marks missing values with '?'
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        46043 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       46033 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   47985 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB

Data Cleaning Part: Scale the numerical columns and label-encode the binary categorical columns


  • The target column income needs to be encoded as 1 and 0
  • The same goes for the gender column
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler


# Scale every numerical column into the [0, 1] range
scaler = MinMaxScaler()

num_col = df.select_dtypes(exclude=['object', 'datetime']).columns
df[num_col] = scaler.fit_transform(df[num_col])

# Encode the binary categorical columns (gender, income) as 0/1
le = LabelEncoder()

df['gender'] = le.fit_transform(df['gender'])
df['income'] = le.fit_transform(df['income'])
df
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
0 0.109589 Private 0.145129 11th 0.400000 Never-married Machine-op-inspct Own-child Black 1 0.000000 0.0 0.397959 United-States 0
1 0.287671 Private 0.052451 HS-grad 0.533333 Married-civ-spouse Farming-fishing Husband White 1 0.000000 0.0 0.500000 United-States 0
2 0.150685 Local-gov 0.219649 Assoc-acdm 0.733333 Married-civ-spouse Protective-serv Husband White 1 0.000000 0.0 0.397959 United-States 1
3 0.369863 Private 0.100153 Some-college 0.600000 Married-civ-spouse Machine-op-inspct Husband Black 1 0.076881 0.0 0.397959 United-States 1
4 0.013699 NaN 0.061708 Some-college 0.600000 Never-married NaN Own-child White 0 0.000000 0.0 0.295918 United-States 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 0.136986 Private 0.165763 Assoc-acdm 0.733333 Married-civ-spouse Tech-support Wife White 0 0.000000 0.0 0.377551 United-States 0
48838 0.315068 Private 0.096129 HS-grad 0.533333 Married-civ-spouse Machine-op-inspct Husband White 1 0.000000 0.0 0.397959 United-States 1
48839 0.561644 Private 0.094462 HS-grad 0.533333 Widowed Adm-clerical Unmarried White 0 0.000000 0.0 0.397959 United-States 0
48840 0.068493 Private 0.128004 HS-grad 0.533333 Never-married Adm-clerical Own-child White 1 0.000000 0.0 0.193878 United-States 0
48841 0.479452 Self-emp-inc 0.186482 HS-grad 0.533333 Married-civ-spouse Exec-managerial Wife White 0 0.150242 0.0 0.397959 United-States 1

48842 rows × 15 columns
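A quick sanity check on the label encoding: LabelEncoder assigns integer codes in sorted order of the class labels, so '<=50K' maps to 0 and '>50K' maps to 1. A minimal sketch, reusing the le object that was last fitted on the income column:

# le.classes_ is sorted, so the mapping should come out as {'<=50K': 0, '>50K': 1}
print(dict(zip(le.classes_, le.transform(le.classes_))))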

Data Cleaning Part: Impute the missing values


The missing values all fall in the categorical features

df.isnull().sum()
age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

Impute the missing values with the most frequent value (the mode) of each column

# df.mode().iloc[0] holds the most frequent value of each column
df = df.fillna(df.mode().iloc[0])
df.isnull().sum()
age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

Data Cleaning Part: Convert the object dtype into the category dtype


LightGBM can handle categorical features natively, but to use this we first need to convert the object columns to the category dtype so that LightGBM recognizes them as categorical.

# Convert every object (string) column to pandas' category dtype
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = df[column].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   age              48842 non-null  float64 
 1   workclass        48842 non-null  category
 2   fnlwgt           48842 non-null  float64 
 3   education        48842 non-null  category
 4   educational-num  48842 non-null  float64 
 5   marital-status   48842 non-null  category
 6   occupation       48842 non-null  category
 7   relationship     48842 non-null  category
 8   race             48842 non-null  category
 9   gender           48842 non-null  int64   
 10  capital-gain     48842 non-null  float64 
 11  capital-loss     48842 non-null  float64 
 12  hours-per-week   48842 non-null  float64 
 13  native-country   48842 non-null  category
 14  income           48842 non-null  int64   
dtypes: category(7), float64(6), int64(2)
memory usage: 3.3 MB

Modeling Part


X = df.drop('income', axis=1)
y = df['income']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
clf = lgb.LGBMClassifier(objective='binary', silent=False, colsample_bytree=0.9, subsample=0.9, learning_rate=0.05)
fit_params = {
    'early_stopping_rounds': 10,    # stop if the validation score does not improve for 10 rounds
    'eval_metric': 'accuracy',      # note: LightGBM has no built-in 'accuracy' metric; the log below reports binary_logloss
    'eval_set': [(X_test, y_test)],
    'eval_names': ['valid'],
    'verbose': 100,                 # print the validation score every 100 boosting rounds
    'feature_name': 'auto',         # default: take the feature names from the DataFrame columns
    'categorical_feature': 'auto'   # default: treat the pandas category columns as categorical
}
clf.fit(X_train, y_train, **fit_params)
Training until validation scores don't improve for 10 rounds
[100]	valid's binary_logloss: 0.2779
Did not meet early stopping. Best iteration is:
[100]	valid's binary_logloss: 0.2779
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.9,
               importance_type='split', learning_rate=0.05, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective='binary',
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=False,
               subsample=0.9, subsample_for_bin=200000, subsample_freq=0)
print(f"The Model Accuracy: {(clf.score(X_test, y_test)*100):.2f}%")
The Model Accuracy: 87.53%

Accuracy Improvement


Compared to the last blog post, where the best-performing model, GradientBoostingClassifier, achieved around 86% accuracy, LightGBM reaches 87.53% here, roughly 1.5 percentage points higher, without even one-hot encoding the categorical features.
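The comparison above rests on a single train/test split, which can flatter or penalize a model by chance. As an extra check (not part of the original workflow), here is a minimal stratified cross-validation sketch reusing the X and y defined above:

from sklearn.model_selection import cross_val_score

# 5-fold stratified cross-validation with the same LightGBM settings;
# the mean accuracy indicates whether the ~87.5% figure holds across different splits.
cv_clf = lgb.LGBMClassifier(objective='binary', colsample_bytree=0.9, subsample=0.9, learning_rate=0.05)
scores = cross_val_score(cv_clf, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2%} +/- {scores.std():.2%}")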

%matplotlib inline
feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
feat_imp.nlargest(30).plot(kind='barh', figsize=(8,10))
<matplotlib.axes._subplots.AxesSubplot at 0x127c9ac10>
[Bar chart: top 30 LightGBM feature importances]

Feature Importance


LightGBM has a built-in feature importance measure, and the plot shows clearly that age and capital-gain are the most important features driving the income target.
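By default feature_importances_ reports the 'split' importance type, which simply counts how many times each feature is used in a split. A gain-based view, which weights each split by how much it reduces the loss, can rank the features somewhat differently; here is a minimal sketch reusing the fitted clf from above:

# Gain-based importance: total loss reduction contributed by each feature's splits
gain_imp = pd.Series(
    clf.booster_.feature_importance(importance_type='gain'),
    index=clf.booster_.feature_name()
)
gain_imp.nlargest(15).plot(kind='barh', figsize=(8, 6))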