Data Science Project Improvement: Using LightGBM for Higher Accuracy Without One-Hot Encoding
Using LightGBM's native categorical feature support for the Adult dataset
- Overview
- Get the imports done and read the dataset
- Data Cleaning part: scale the numerical columns and label-encode the binary categorical columns
- Modeling Part
Overview
In the last Data Science Project we worked on the U.S. Adult Income dataset and reached around 86% model accuracy, which would place near the top of the Kaggle leaderboard for this dataset. In that project we went through the following steps:
- Understand the business problem.
- EDA (Exploratory Data Analysis): look through and investigate the overall dataset, visualize it with matplotlib, and find any missing values and outliers.
- Data cleaning: impute the missing values and treat the outliers.
- Baseline model: a DummyClassifier gave us 75% accuracy as the baseline, meaning that a model below 75% accuracy does nothing better than always predicting the majority class, while anything above it shows some real skill at classifying the labels (see the sketch after this list).
- Model evaluation and fine-tuning: we evaluated Support Vector Machine, RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, and a Neural Network; the best performing model was GradientBoostingClassifier, which reached 86% accuracy.
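For reference, here is a minimal sketch of that baseline (my own reconstruction, not code from the previous post), assuming the X_train/X_test/y_train/y_test split created in the modeling part below: scikit-learn's DummyClassifier with the most_frequent strategy simply predicts the majority class for every row.
from sklearn.dummy import DummyClassifier

# Majority-class baseline; assumes the train/test split defined later in this post
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test)*100:.2f}%")  # roughly 75% on this dataset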
Today we are going to use another lightweight, powerful, and fast algorithm: LightGBM, open-sourced in 2017 and now maintained by Microsoft. Its native support for categorical features means we can skip one-hot encoding entirely.
Since the EDA was already covered in the last blog post, we will skip it and move directly into today's topic.
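If LightGBM is not installed in your environment yet (setup is not covered in the original post), it is available from PyPI:
pip install lightgbm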
Get the imports done and read the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
# "?" marks missing values in the raw CSV, so read them in as NaN
df = pd.read_csv('datasets/adult.csv', na_values="?")
df.info()
Data Cleaning part: scale the numerical columns and label-encode the binary categorical columns
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# Min-max scale every numerical (non-object, non-datetime) column into the [0, 1] range
scaler = MinMaxScaler()
num_col = df.select_dtypes(exclude=['object', 'datetime']).columns
df[num_col] = scaler.fit_transform(df[num_col])

# Label-encode the two binary categorical columns: gender and the income target
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['income'] = le.fit_transform(df['income'])
df
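As a quick aside (my own illustration, not part of the original notebook), this is all MinMaxScaler does per column: x_scaled = (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1. A tiny standalone demo with made-up ages:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ages just to show the formula: 17 -> 0.0, 90 -> 1.0, 38 -> (38-17)/(90-17) ~= 0.2877
demo = np.array([[17.0], [38.0], [90.0]])
print(MinMaxScaler().fit_transform(demo).ravel())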
df.isnull().sum()
Impute the missing values with the most frequent value of each column
# df.mode() returns the most frequent value(s) per column; .iloc[0] takes one modal value
# per column, which fillna then uses to fill that column's NaNs
df = df.fillna(df.mode().iloc[0])
df.isnull().sum()
# Cast every remaining object column to the pandas 'category' dtype.
# LightGBM detects 'category' columns automatically and handles them natively,
# so no one-hot encoding is needed.
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = df[column].astype('category')
df.info()
Modeling Part
X = df.drop('income', axis=1)
y = df['income']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
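As a quick sanity check (my own addition, not in the original post), the stratified split should preserve the roughly 75/25 class balance mentioned in the overview in both halves:
# Both splits should show approximately the same proportion of each income class
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))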
clf = lgb.LGBMClassifier(objective='binary', colsample_bytree=0.9, subsample=0.9, learning_rate=0.05)
fit_params = {
    # LightGBM has no 'accuracy' metric; 'binary_error' is its complement (1 - accuracy)
    'eval_metric': 'binary_error',
    'eval_set': [(X_test, y_test)],
    'eval_names': ['valid'],
    # recent LightGBM versions handle early stopping and logging through callbacks
    'callbacks': [lgb.early_stopping(10), lgb.log_evaluation(100)],
    'feature_name': 'auto',        # actually this is the default
    'categorical_feature': 'auto'  # actually this is the default: every 'category' dtype column
}
clf.fit(X_train, y_train, **fit_params)
print(f"The Model Accuracy: {(clf.score(X_test, y_test)*100):.2f}%")
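To back up the "no one-hot encoding needed" claim, here is a comparison sketch (my own addition, not from the original post): the same classifier trained on a one-hot encoded copy of the features built with pd.get_dummies, so the two accuracies can be compared directly.
# One-hot encode every categorical column instead of relying on native support
X_ohe = pd.get_dummies(df.drop('income', axis=1))
X_tr, X_te, y_tr, y_te = train_test_split(X_ohe, y, test_size=0.25, random_state=42, stratify=y)

clf_ohe = lgb.LGBMClassifier(objective='binary', colsample_bytree=0.9, subsample=0.9, learning_rate=0.05)
clf_ohe.fit(X_tr, y_tr)
print(f"One-hot encoded accuracy: {clf_ohe.score(X_te, y_te)*100:.2f}%")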
%matplotlib inline
feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
feat_imp.nlargest(30).plot(kind='barh', figsize=(8,10))
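The plot above uses LightGBM's default importance_type='split', i.e. how many times each feature is used in the trees. As an optional extra (my own addition, not in the original post), gain-based importance, the total loss reduction contributed by each feature, often gives a more informative ranking:
# Gain-based importance taken from the underlying Booster
gain_imp = pd.Series(clf.booster_.feature_importance(importance_type='gain'), index=X.columns)
gain_imp.nlargest(30).plot(kind='barh', figsize=(8, 10))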