End-to-End Data Science Project with Adult Income Dataset
From understanding the problem to delivering the solution.
- Overview
- Typical Data Science workflow
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Baseline Model Result
- Model Evaluation and Fine-Tuning
- Final: Train the best model, save it, and deliver it to the customer.
- Further Reading
Overview
Recently I got a freelance data science job from a LinkedIn member. It was a typical binary classification task using basic applicant information from an online commercial startup. The business owner consulted me about building a simple machine learning model for an automated account-opening check, and gave me a small dataset (around 60,000 instances). I would like to share some of my experience from this project. Of course I can't share the exact dataset or its details, but the U.S. Adult Income Dataset is similar enough to serve as a stand-in, so I picked it as the sample for this write-up.
About U.S. Adult Income Dataset
Abstract: Based on each observation's attributes, predict whether an instance, i.e. a person, has an income exceeding US$50,000 a year. Also known as the "Census Income" dataset, it was originally donated by Ronny Kohavi and Barry Becker in 1996. It is a classification task with categorical and numerical features. Some instances have missing values, which are denoted by "?".
There are 14 feature columns in total, plus a last column, income, which is the classification label (>50k or <=50k). The features are as follows:
- age: numerical
- workclass: categorical
- fnlwgt: numerical
- education: categorical
- education-num: numerical
- marital-status: categorical
- occupation: categorical
- relationship: categorical
- race: categorical
- sex: categorical
- capital-gain: numerical
- capital-loss: numerical
- hours-per-week: numerical
- native-country: categorical
One more thing to note is that the dataset is imbalanced. The label has two class values, >50k and <=50k:
- >50k: minority class, around 25%
- <=50k: majority class, around 75%
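To make the effect of this imbalance concrete, here is a minimal sketch using hypothetical toy labels (not the real dataset) with the same rough 75/25 split. It shows that a "model" which always answers with the majority class already scores about 75% accuracy while finding none of the >50k cases.

```python
from collections import Counter

# Hypothetical toy labels mirroring the roughly 75% / 25% split described above.
labels = ["<=50k"] * 75 + [">50k"] * 25

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that always predicts the majority class.
predictions = [majority_class] * len(labels)

# Accuracy: share of predictions that match the true label.
baseline_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of true >50k cases that were predicted as >50k.
true_positives = sum(p == ">50k" and t == ">50k" for p, t in zip(predictions, labels))
minority_recall = true_positives / counts[">50k"]

print(f"Majority class: {majority_class}")
print(f"Always-majority accuracy: {baseline_accuracy:.0%}")
print(f"Recall on >50k: {minority_recall:.0%}")
```

This is why accuracy alone needs a reference point on this dataset: a score near 75% says very little by itself.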
This dataset is openly accessible from either the UCI (University of California, Irvine) Machine Learning Repository or the Kaggle Adult Dataset page.
Typical Data Science workflow
Before we get our hands dirty, let's understand the workflow so that we can follow its steps. Whenever we receive a data science job, no matter which sector it comes from (finance, consumer, computer networking, manufacturing), we need a proper understanding of the problem we are going to solve. An article on Towards Data Science has a very good explanation.
Most projects follow these steps:
- Understand the Business Problem:
  - Here, the problem is to predict whether a person's income exceeds US$50,000.00 based on the features. In a commercial setting, it could be predicting whether a person will order a particular product based on his or her browsing behavior.
- EDA (Exploratory Data Analysis):
  - Not every data science dataset is perfectly clean like the ones a professor hands out in class. A real-world dataset from a customer is often very "dirty": it contains lots of outliers, missing values, or intentionally wrong entries. We need to identify these before proceeding to the next steps.
- Data Cleaning:
  - Following the previous step, once we have identified the outliers and missing values, we "clean" them using statistical or other methods.
- Feature Engineering:
  - This is one of the most time-consuming and mentally demanding steps. It means working out how each feature correlates with the label, then selecting or extracting the most relevant features to feed into the model. It is a very important step that helps us fight the Curse of Dimensionality and the under-fitting/over-fitting problem.
- Baseline Model Result:
  - Some data scientists skip this part. Most of the time they simply feed the "cleaned" dataset into a model and get, say, 75% accuracy. But on the Adult dataset, any dummy classifier will reach about 75% accuracy, because the data is imbalanced with a 75% majority class. To understand how well a model is really working, we must have a baseline to compare against, so this is the step where we establish it.
- Model Evaluation and Fine-Tuning:
  - How good is the model, and can we improve it? Here we fine-tune the hyperparameters to bring the model closer to production level.
- Iteration:
  - Present the work to the customer, put the model into production, get feedback, improve.
Exploratory Data Analysis (EDA)
Let's get into the Adult Dataset.
Go to the dataset's Kaggle page to download it, or use scikit-learn's fetch_openml to fetch it.
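Whichever source is used, the "?" markers need to become proper missing values when the file is read. The sketch below uses a hypothetical three-row excerpt (with a header row added for illustration) in the raw UCI layout, where each comma is followed by a space. In that layout it is safest to pass skipinitialspace=True so that the "?" marker is matched without the leading space.

```python
import io

import pandas as pd

# Hypothetical excerpt in the raw UCI layout: a space follows every comma.
raw = io.StringIO(
    "age, workclass, hours-per-week, income\n"
    "39, State-gov, 40, <=50K\n"
    "54, ?, 60, >50K\n"
    "18, ?, 30, <=50K\n"
)

# skipinitialspace strips the space after each comma; na_values turns "?" into NaN.
sample_df = pd.read_csv(raw, na_values="?", skipinitialspace=True)

print(sample_df)
print(sample_df.isnull().sum())
```

If the CSV has no spaces after the commas, na_values='?' alone is enough.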
import numpy as np
import pandas as pd
from collections import Counter
# load dataset
adult_df = pd.read_csv('/storage/adult.csv', na_values='?')
# overview of the dataset
print(adult_df.info())
print("\n")
print(adult_df.head())
print("Checking dataframe missing values:\n")
for column in adult_df.columns:
    if adult_df[column].isnull().sum() != 0:
        missingValue = adult_df[column].isnull().sum()
        percentage = missingValue / len(adult_df[column]) * 100
        dtype = adult_df[column].dtype
        print(f"The column: '{column}' with Data Type: '{dtype}' has missing value: {missingValue}, percentage: {percentage:.2f}%")
# memory cleaning
del missingValue
del percentage
del dtype
Well, that's not too much. How to handle missing values is a balancing act: either drop them, if doing so has little impact on model performance, or impute them with some strategy, such as the most frequent value for categorical features, or the mean, median, etc. for numerical features.
In this case, as the missing values all fall into categorical features, we will use the pandas DataFrame mode() method to fill them.
label = adult_df.values[:, -1]
counter = Counter(label)
for key, value in counter.items():
    percentage = value / len(label) * 100
    print(f"Class: {key}, Count = {value}, Percentage = {percentage:.1f}%.")
numerical_subset = adult_df.select_dtypes(include=['int64', 'float64'])
print(numerical_subset)
import matplotlib.pyplot as plt
numerical_subset.hist(bins=20, figsize=(20, 15))
plt.show()
adult_df = adult_df.fillna(adult_df.mode().iloc[0])
print(adult_df.info())
The Non-Null Count column shows there are no more missing values in the dataset; they were filled with the most frequent value of each column.
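As a small illustration of what mode().iloc[0] is doing, here is a sketch on a hypothetical toy DataFrame (the column values are made up). mode() returns the most frequent value or values of each column, and row 0 holds the first one, which fillna() then uses column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one categorical column with gaps, one complete numerical column.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov", np.nan],
    "age": [25, 38, 28, 44, 18],
})

# mode() gives the most frequent value(s) per column; row 0 is the first mode of each column.
most_frequent = toy.mode().iloc[0]

# fillna() with a Series fills each column with the value stored under its own name.
filled = toy.fillna(most_frequent)

print(most_frequent)
print(filled)
```

When several values tie for most frequent, mode() lists them in sorted order, so iloc[0] picks the first of the tied values.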
print(numerical_subset.describe())
Well, in the capital-gain column the maximum value is 99999, which seems a little bit weird. According to the dataset description, capital-gain is additional income from capital markets, such as stocks and securities. A value of 99999 suggests that somebody entered it wrongly, or that it stands for no additional income (None). Here we can try to replace it with the mean value, and we will also replace the 99 hours in the hours-per-week column.
print(f"There's {adult_df[adult_df['capital-gain'] == 99999].shape[0]} outlier in the capital-gain column")
print(f"There's {adult_df[adult_df['hours-per-week'] == 99].shape[0]} outlier in the hours-per-week column")
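# assign the result back: replace(..., inplace=True) on a selected column may not update adult_df in recent pandas versions
adult_df['capital-gain'] = adult_df['capital-gain'].replace(99999, adult_df['capital-gain'].mean())
adult_df['hours-per-week'] = adult_df['hours-per-week'].replace(99, adult_df['hours-per-week'].mean())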
print(adult_df.describe())
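One detail worth knowing about the replacement above: the mean is computed over the whole column, so the 99999 entries themselves pull it upward. A possible variation, sketched here on a hypothetical toy column rather than the real data, computes the mean from the remaining rows only. Whether that is preferable depends on what 99999 really stands for.

```python
import pandas as pd

# Hypothetical toy column: 99999 is the suspicious placeholder value.
gain = pd.Series([0, 0, 5000, 99999, 3000, 99999], name="capital-gain")

is_sentinel = gain == 99999

# Mean over all rows, placeholders included.
mean_with_sentinel = gain.mean()

# Mean over the remaining rows only.
mean_without_sentinel = gain[~is_sentinel].mean()

# mask() replaces the values where the condition is True.
cleaned = gain.mask(is_sentinel, mean_without_sentinel)

print(mean_with_sentinel)
print(mean_without_sentinel)
print(cleaned.tolist())
```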
After the data exploration and cleaning, we save the cleaned DataFrame to the adult_cleaned.csv file.
adult_df.to_csv('/storage/adult_cleaned.csv', index=False)
Baseline Model Result
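We will evaluate candidate models using repeated stratified k-fold cross-validation.
The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about one tenth of the rows. Assuming the full 48,842-row version of the dataset, which stays whole here because the missing values were imputed rather than dropped, that is about 48,842/10, or about 4,884 examples per fold (the 45,222/10, or about 4,522 examples, figure applies when rows with missing values are removed).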
Stratified means that each fold will contain the same mixture of examples by class, that is about 75% to 25% for the majority and minority classes respectively. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.
This means a single model will be fit and evaluated 10 * 3 or 30 times, and the mean and standard deviation of these runs will be reported.
This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
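To see the "10 * 3 = 30 evaluations" and the preserved class mixture in action, here is a minimal sketch on hypothetical toy labels (300 majority and 100 minority examples, not the real dataset). It only iterates over the splits and records the minority share of each test fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical toy labels: 300 majority (0) and 100 minority (1) examples.
y_toy = np.array([0] * 300 + [1] * 100)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for splitting

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

n_evaluations = 0
minority_share_per_fold = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    n_evaluations += 1
    # Share of minority-class examples in this test fold.
    minority_share_per_fold.append(y_toy[test_idx].mean())

print(n_evaluations)
print(min(minority_share_per_fold), max(minority_share_per_fold))
```

Every test fold keeps the 75/25 mixture of the full label vector, which is what keeps the per-fold accuracy scores comparable with each other.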
We will predict a class label for each example and measure model performance using classification accuracy.
The evaluate_model() function below takes the loaded dataset and a defined model, evaluates it using repeated stratified k-fold cross-validation, and returns a list of accuracy scores that can later be summarized.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
def load_dataset(filename):
    df = pd.read_csv(filename)
    # the last column is the label, everything before it is a feature
    X, y = df.iloc[:, :-1], df.iloc[:, -1]
    # column names of the categorical and numerical features
    cate_index = X.select_dtypes(include=['object']).columns
    num_index = X.select_dtypes(include=['int64', 'float64']).columns
    # encode the two label strings as 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y, cate_index, num_index

def evaluate_model(X, y, model):
    # 10 folds repeated 3 times gives 30 accuracy scores
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
from sklearn.dummy import DummyClassifier
X, y, cate_index, num_index = load_dataset('/storage/adult_cleaned.csv')
model = DummyClassifier(strategy='most_frequent')
scores = evaluate_model(X, y, model)
print(scores)
print(f"The Dummy Classifier mean accuracy: {(np.mean(scores)*100):.2f}%, with Standard Deviation: {np.std(scores):.2f}")
print(f"The type of dataset: {type(X)}.")
print(f"The shape of the dataset: Row: {X.shape[0]}, with {X.shape[1]} features")
print(f"The type of the target label: {type(y)}")
print(f"The shape of the target label is: {y.shape[0]} dimensional vector.")
Now we have a test harness and a performance baseline. In this case, we can see that the baseline algorithm achieves an accuracy of about 76.07%. This score provides a lower limit on model skill: any model that achieves an average accuracy above about 76.07% has skill, whereas models that score below this value do not have skill on this dataset. Now we can begin to evaluate some models on this dataset.
Model Evaluation and Fine-Tuning
Evaluate Machine Learning Algorithms
Let’s start by evaluating a mixture of machine learning models on the dataset.
It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.
We will evaluate the following machine learning models on the adult dataset:
- Decision Tree (CART)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Neural Network (multilayer perceptron, MLPClassifier)
We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 100.
We will define each model in turn and add them to a list so that we can evaluate them sequentially. The generate_models() function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
def load_dataset(filename):
    df = pd.read_csv(filename)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]
    cate_index = X.select_dtypes(include=['object']).columns
    num_index = X.select_dtypes(include=['int64', 'float64']).columns
    y = LabelEncoder().fit_transform(y)
    return X, y, cate_index, num_index
X, y, cate_index, num_index = load_dataset('/storage/adult_cleaned.csv')
print(type(X))
print(X.shape)
print(type(y))
print(y.shape)
def generate_models():
    models, names = [], []
    names.append('CART')
    models.append(DecisionTreeClassifier())
    names.append('SVM')
    models.append(SVC(gamma='scale'))
    names.append('BAG')
    models.append(BaggingClassifier(n_estimators=100))
    names.append('RF')
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('GBM')
    models.append(GradientBoostingClassifier(n_estimators=100))
    names.append('Neural Network')
    models.append(MLPClassifier(early_stopping=True))
    return models, names
models, names = generate_models()
As the X array is still a pandas DataFrame with categorical values, we need to encode the categorical values as numerical values. OneHotEncoder combined with scikit-learn's ColumnTransformer is quite handy for this.
steps = [('Categorical', OneHotEncoder(handle_unknown='ignore'), cate_index), ('Numerical', MinMaxScaler(), num_index)]
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(steps, verbose=True)
X = transformer.fit_transform(X)
print(type(X))
print(X.shape)
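The code above fits the transformer on the full dataset before any splitting, which means the scaler and encoder also see rows that later end up in validation folds and in the test set. A common variation, sketched below on a hypothetical toy DataFrame (the column values and the DecisionTreeClassifier choice are only for illustration), is to bundle the ColumnTransformer and the model in a Pipeline so that each cross-validation split fits the preprocessing on its own training part only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy frame standing in for the raw (unencoded) Adult features.
X_raw = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"] * 10,
    "age": list(range(20, 60)),
})
y_toy = [0, 1, 0, 1] * 10

preprocess = ColumnTransformer([
    ("Categorical", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("Numerical", MinMaxScaler(), ["age"]),
])

# Preprocessing and model travel together, so every split refits both on its training rows.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=42)),
])

scores = cross_val_score(pipeline, X_raw, y_toy, scoring="accuracy", cv=5)
print(len(scores), scores.mean())
```

With MinMaxScaler and one-hot encoding the difference in scores is usually small, but the pipeline form also makes it easier to hand over one object that accepts raw rows.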
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
import warnings
warnings.filterwarnings('ignore')
# evaluate_model() is the helper defined in the Baseline Model Result section
for i in range(len(models)):
    print(f"""
********************************
Now evaluating {names[i]} model
********************************\n""")
    scores = evaluate_model(X_train, y_train, models[i])
    print(f"The {names[i]} model average accuracy is: {(np.mean(scores)*100):.2f}%, with Standard Deviation: {(np.std(scores)*100):.2f}.")
In this case, we can see that all of the chosen algorithms are skillful, achieving a classification accuracy above 76.07%. The ensemble decision tree algorithms perform the best, with stochastic gradient boosting perhaps coming out on top at a classification accuracy of about 86.3%.
This accuracy was obtained with the default hyperparameters. We can pick the two top-performing algorithms and use scikit-learn's GridSearchCV to fine-tune their hyperparameters and see whether they can perform better.
The two best-performing algorithms:
- BaggingClassifier(n_estimators=100)
- GradientBoostingClassifier(n_estimators=100)
We can try to fine-tune these two models.
from sklearn.model_selection import GridSearchCV
BAGgrid = {'n_estimators': [100, 200]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
BAGclf = BaggingClassifier()
BAGgrid_search = GridSearchCV(estimator=BAGclf, param_grid=BAGgrid, n_jobs=-1, cv=cv,
                              scoring='accuracy', error_score=0)
BAGgrid_result = BAGgrid_search.fit(X_train, y_train)
print(BAGgrid_result.best_score_)
print(BAGgrid_result.best_params_)
GBMgrid = {'n_estimators': [100, 200]}
GBMclf = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500, min_samples_leaf=50,
                                    max_depth=8, max_features='sqrt', subsample=0.8, random_state=42)
GBMgrid_search = GridSearchCV(estimator=GBMclf, param_grid=GBMgrid, n_jobs=-1, cv=cv,
                              scoring='accuracy', error_score=0)
GBMgrid_result = GBMgrid_search.fit(X_train, y_train)
print(GBMgrid_result.best_score_)
print(GBMgrid_result.best_params_)
Well, it seems that with n_estimators equal to 200, the GradientBoostingClassifier performance increases to 86.69%. We can then update the hyperparameters of GradientBoostingClassifier and train it on our training subset. Now we have the winner: the GradientBoostingClassifier algorithm.
GridSearchCV is quite computationally expensive, so I would suggest using a cloud notebook environment such as Google Colab, AWS, Google Cloud, Gradient, or Kaggle. All of them provide quite powerful CPUs and plenty of memory and, most importantly, free GPU time up to a certain limit (although the scikit-learn models used here run on the CPU).
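To get a feel for why the cost grows so quickly, the number of model fits can be counted before starting a search. The sketch below uses the grid and cross-validation settings from the search above, and then a hypothetical wider grid (its values are made up for illustration) to show the growth.

```python
from itertools import product

# Grid and CV settings taken from the search above.
param_grid = {"n_estimators": [100, 200]}
n_splits, n_repeats = 5, 2

# Every combination of grid values is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per split per repeat; with the default refit=True,
# GridSearchCV then fits the best candidate once more on all the training data.
cv_fits = len(candidates) * n_splits * n_repeats
total_fits = cv_fits + 1

print(candidates)
print(cv_fits, total_fits)

# Hypothetical wider grid, to show how quickly the cost grows.
wider_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}
wider_cv_fits = len(list(product(*wider_grid.values()))) * n_splits * n_repeats
print(wider_cv_fits)
```

Counting fits this way helps decide whether a grid is affordable on a laptop or better run in a cloud notebook.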
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, min_samples_split=500, min_samples_leaf=50,
                                   max_depth=8, max_features='sqrt', subsample=0.8, random_state=42)
model.fit(X_train, y_train)
TestScore = model.score(X_test, y_test)
print(f"The model test set accuracy is: {(TestScore*100):.1f}%.")
from sklearn.metrics import classification_report
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))
import joblib
joblib.dump(model, 'storage/final_model.sav')
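On the receiving side, the saved file is loaded back with joblib.load. The sketch below runs a full save-and-load round trip on hypothetical toy data in a temporary directory (the data, the small n_estimators value and the temporary path are only for illustration) and checks that the restored model predicts the same labels.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy data standing in for the transformed training set.
rng = np.random.default_rng(42)
X_toy = rng.random((200, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_model = GradientBoostingClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "final_model.sav")
    joblib.dump(toy_model, model_path)   # save the fitted model
    restored = joblib.load(model_path)   # load it back, as the customer would

# The restored model should behave exactly like the original one.
same_predictions = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Because the final model was trained on the output of the fitted transformer, that transformer needs to be saved and delivered as well, so new raw rows can be encoded and scaled the same way before prediction.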