Deep Dive | Decision Trees
By Arzan Irani on Aug 16, 2019
Decision trees are a supervised learning algorithm used for classification and regression. This post will dive deep into the workings of a decision tree. Along the way, I will walk through a detailed example and showcase some necessary skills required to tune and improve any machine learning algorithm. The post will also exhibit some exploratory data analysis and visualization practices that can help better understand the data.
Table of Contents
- Decision Trees
- Example Model
- Importing Libraries
- Importing the data
- Feature Dictionary
- Data Wrangling and Exploratory Data Analysis
- Base Model
- Feature Engineering
- Conclusion
Decision Trees¶
Decision trees are a supervised learning algorithm used for classification and regression.
How do they work?¶
As the name suggests, a decision tree makes a series of decisions based on the features in the dataset by following an if-this-then-that logic.
Imagine a hypothetical situation in which there are 60 students in a classroom (30 male, 30 female) and all the girls love playing volleyball, while the boys do not.
In this hypothetical scenario, we could say that 'if the gender is Female, then the person likes volleyball'. Hence the value of the Gender variable would act as a good predictor for whether a person likes volleyball or not. This process of partitioning the data is called splitting.
A decision tree that would split the data set on the Gender variable would hence be able to make good predictions about whether a random student likes volleyball or not.
import pandas as pd
data = pd.read_csv('./data/volleyball_example.csv')
data.head(5)
How do we quantitatively evaluate how good a split is?¶
Option 1) FOR CLASSIFICATION - Entropy or Information gain¶
The decision making process of the algorithm is driven by what is called entropy, and the related measure of information gain. Entropy is defined as a measure of disorder. Let's say the entropy of our original dataset was 1. After splitting the data based on gender, we have created two smaller datasets (in this case, one of 30 male and another of 30 female students). These smaller datasets have less disorder in them because each contains only one gender, compared to the two in the original dataset.
By comparing the entropy of the original dataset with the weighted average of the entropies of the two smaller datasets (each weighted by its share of the data points), we can measure how much disorder was lost as a result of our splitting decision, or in other words, how much order or information was gained.

$Information\ gain = Entropy(Original\ Data) - \left(\tfrac{30}{60}\,Entropy(Females) + \tfrac{30}{60}\,Entropy(Males)\right)$
Each decision or splitting step in a decision tree is referred to as a node. At a node, the decision tree considers every feature in the dataset and tries to split the dataset based on a certain threshold of that feature. **The feature that gets selected is the one that provides the most information gain after the split.**
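To make this concrete, here is a minimal sketch (not from the original notebook) that computes entropy and the weighted information gain for the volleyball example using NumPy:

import numpy as np

def entropy(labels):
    # Shannon entropy of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent minus the weighted entropy of the two children
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# The hypothetical classroom: 1 = likes volleyball, 0 = does not
girls = np.ones(30)
boys = np.zeros(30)
students = np.concatenate([girls, boys])
print(entropy(students))                        # 1.0 -- maximum disorder
print(information_gain(students, girls, boys))  # 1.0 -- a perfect split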
Option 2) FOR CLASSIFICATION - Gini Impurity¶
Alternatively, a decision tree can use what is called the gini impurity. The gini impurity is a quantitative measure that calculates the probability of incorrectly classifying a randomly selected data point based on the distribution of the classes.
Going back to our hypothetical volleyball example, let's consider the following: 30 of the students prefer volleyball, while 30 of the students do not. Hence the probability that a randomly selected student likes volleyball is 0.50.
| | Selected Student | Prediction | Probability |
|---|---|---|---|
| Case 1 | Likes volleyball | Does not like volleyball | $0.5 \times 0.5 = 0.25$ |
| Case 2 | Likes volleyball | Likes volleyball | $0.5 \times 0.5 = 0.25$ |
| Case 3 | Does not like volleyball | Likes volleyball | $0.5 \times 0.5 = 0.25$ |
| Case 4 | Does not like volleyball | Does not like volleyball | $0.5 \times 0.5 = 0.25$ |
As can be seen from the above table, the probability of us incorrectly classifying a randomly selected student is the sum of Case 1 and Case 3 i.e.
$0.25 + 0.25 = 0.5$
Mathematically, the Gini impurity can be expressed as:

$Gini\ Impurity = \sum_{i=1}^{C} p(i)\,(1-p(i))$
where,
- $C$ - the total number of classes
- $i$ - the iterator over the classes
- $p(i)$ - the probability of a randomly chosen point being of class $i$
- $1 - p(i)$ - the probability of it not being of class $i$
The decision tree makes a split based on the feature and the threshold that result in the lowest weighted Gini impurity across the resulting daughter nodes.
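As a quick sanity check on the table above, here is a small sketch (again, not from the original notebook) of the Gini impurity calculation:

import numpy as np

def gini_impurity(labels):
    # Probability of misclassifying a randomly selected point from `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

# Before the split: 30 students like volleyball, 30 do not
print(gini_impurity(np.array([1] * 30 + [0] * 30)))  # 0.5, matching the table
# After splitting on Gender, each group is pure
print(gini_impurity(np.ones(30)))                    # 0.0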
Option 3) FOR REGRESSION - Mean squared error¶
As can be expected, decision tree regressors require an alternative measure to quantify the goodness of a split. One method is to monitor the mean squared error and select the feature and threshold that reduce the error the most.
Note: Decision tree regressors do not predict on a continuous range. Instead, the prediction is the mean of the response values in the leaf node in which you land.
Alternatively, we could also use the Friedman MSE or the mean absolute error (MAE) to help quantitatively decide the best split.
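The sketch below (not part of the original notebook) illustrates the leaf-mean behaviour on a toy problem. Note that the exact criterion spellings depend on the scikit-learn version ('mse'/'friedman_mse'/'mae' in older releases, 'squared_error'/'friedman_mse'/'absolute_error' in newer ones), so the default criterion is used here:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A toy 1-D regression problem, purely for illustration
rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=100)

reg = DecisionTreeRegressor(max_depth=2, random_state=42)
reg.fit(X_toy, y_toy)

# With max_depth=2 the tree has at most 4 leaves, so predictions take at most
# 4 distinct values -- each is the mean of the training responses in that leaf
print(np.unique(reg.predict(X_toy)))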
Hyperparameters¶
Now that we understand the mechanism that decision tree algorithms use to identify the best split, let us look at the hyperparameters that can be tuned to control the way the algorithm learns from the data. Tuning hyperparameters is an important step in building any machine learning model.
A full list of hyperparameters can be found in the sklearn documentation:
Decision Tree Regressors: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Decision Tree Classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
We'll focus on a few hyperparameters that, in my experience, are the most important.
criterion¶
The criterion is the quantitative measure that can be used by the algorithm to make the decision of the best split.
For decision tree classifiers, the options are Gini impurity or entropy. There doesn't seem to be a significant advantage in picking one over the other, so either should work in practice.
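As a rough empirical check of this claim, here is a sketch on synthetic data (not from the original notebook):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare the two classification criteria on a synthetic dataset
X_demo, y_demo = make_classification(n_samples=500, random_state=42)
for criterion in ['gini', 'entropy']:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    print(criterion, round(scores.mean(), 3))  # the two scores are typically close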
Decision tree regressors, on the other hand, can use mean squared error, Friedman mean squared error, or mean absolute error. The difference between mean squared error and mean absolute error is that the former minimizes the loss using the mean of each node, while the latter does so using the median. The Friedman mean squared error is a weighted variant that accounts for the number of samples in the left and right split of a node.
max_depth¶
Each split of a node in a decision tree adds a layer of depth. The first node in a decision tree is referred to as the root node; a tree consisting of a single split of the root node is also known as a decision stump. Splitting the root node results in two daughter nodes, each with a smaller subset of the data, and in turn an increase in the depth of the tree. The max_depth hyperparameter allows you to dictate the maximum depth that a decision tree may attain before the algorithm must stop.
How to use this?
The greater the depth of the tree, the more complex the algorithm's decision path. By allowing a larger depth, the algorithm can fit the training data more precisely. As with all hyperparameters, there isn't a golden rule that fits all situations. However, as a general rule of thumb: the deeper the tree, the more likely it is to over-fit the training data; conversely, the shallower the tree, the more likely it is to under-fit.
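A brief sketch (on hypothetical synthetic data, not from the original notebook) of how max_depth trades training accuracy against generalization:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# max_depth=None lets the tree grow until every leaf is pure
for depth in [1, 3, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # training accuracy rises with depth; test accuracy eventually lags behind
    print(depth, round(clf.score(X_tr, y_tr), 3), round(clf.score(X_te, y_te), 3))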
min_samples_split¶
As we've seen by now, traversing down a tree we encounter nodes splitting into daughter nodes, each daughter node containing a smaller subset of the data to be further split based on a certain feature and threshold. It isn't necessary that sister nodes (daughter nodes originating from the same parent node) contain the same number of data points; in fact, more often than not they will not. The min_samples_split hyperparameter allows you to govern which nodes can be further split and which cannot, based on the number of data points left in them.
How to use this?
Just as with max_depth, the min_samples_split hyperparameter helps govern how strongly the algorithm fits the training data. A very small value for min_samples_split allows the algorithm to dissect the dataset until only a handful of data points remain in each node - provided the maximum allowable depth of the tree (max_depth) hasn't been reached. The smaller min_samples_split is, the more complex the learning path of your algorithm and the higher its tendency to over-fit the training data. Conversely, the higher the value, the more likely it is to under-fit.
min_samples_leaf¶
Before we look at this hyperparameter, it is important to recall the difference between a leaf node and an internal node. Contrary to an internal node, a leaf node is one that does not have any daughter/children nodes; it marks the end of the splitting in that branch of the decision tree. min_samples_leaf allows us to control how many samples must be left in each daughter of an internal node in order for the split to be allowed. The purpose of min_samples_leaf can often be hard to differentiate from min_samples_split, so let's consider an example: say an internal node has 10 data points, and we've set min_samples_split to 7 and min_samples_leaf to 5. Since there are more than 7 data points, we pass the min_samples_split requirement. However, if a candidate split results in a daughter node with fewer than 5 data points, it violates the min_samples_leaf threshold and the split is not allowed.
How to use this?
The primary difference in effect between min_samples_split and min_samples_leaf is that the former governs whether a node can be split at all, while the latter governs whether the resulting daughter nodes are viable. By setting min_samples_leaf to a very small number, we allow the algorithm to learn the training data more precisely, which may result in over-fitting. Another way to understand this is that leaves with very few data points dissect the data into small groups that are more likely to be unique to the training data, and might not be reflective of the real world or unseen data.
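To see both hyperparameters in action on the 10-point node from the example above (a sketch, not from the original notebook):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_node = np.arange(10).reshape(-1, 1)  # 10 data points, one feature
y_node = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 10 >= min_samples_split=7, so a split may be attempted; min_samples_leaf=5
# only allows splits where BOTH daughters keep at least 5 points
clf = DecisionTreeClassifier(min_samples_split=7, min_samples_leaf=5).fit(X_node, y_node)
print(clf.get_n_leaves())  # 2 -- the 5/5 split is legal

# With min_samples_leaf=6, no split of 10 points can leave 6 on both sides
clf = DecisionTreeClassifier(min_samples_split=7, min_samples_leaf=6).fit(X_node, y_node)
print(clf.get_n_leaves())  # 1 -- the root remains a leaf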
What kind of data do you pass to Decision Trees?¶
Categorical vs Numerical¶
Decision trees can accept both categorical and numerical data. However, the language and packages you use govern the format in which you can pass in categorical variables. Since we are using scikit-learn in this example, I'll explain the procedure for the same.
Scikit-learn's decision tree algorithm requires categorical variables to be encoded into numbers. Hence, a variable like sex that could have values of 'male' and 'female' would need to be transformed to '1' and '0'.
Missing Values¶
The decision tree algorithm will throw an error message if you pass it any features with missing values. Hence, it is imperative to deal with missing values before feeding the data to the algorithm.
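A minimal preview of both preprocessing steps on a hypothetical two-column frame (the Titanic versions of these steps appear later in this post); note that scikit-learn's old Imputer class has since been replaced by SimpleImputer in sklearn.impute:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# A hypothetical frame: one categorical column, one numeric column with a gap
df = pd.DataFrame({'sex': ['male', 'female', 'female'],
                   'age': [22.0, None, 30.0]})

# Encode the categorical variable into integers
df['sex'] = LabelEncoder().fit_transform(df['sex'])
# Fill the missing value with the column mean (SimpleImputer's default strategy)
df['age'] = SimpleImputer().fit_transform(df[['age']]).ravel()
print(df)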
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer  # formerly sklearn.preprocessing.Imputer
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
%matplotlib inline
Importing the data¶
# Use pandas' built-in function to read in the learning and testing data
learning = pd.read_csv('./data/titanic/train.csv')
test = pd.read_csv('./data/titanic/test.csv')
# Make a list of both data sets to conveniently apply data wrangling to all the data
full_data = [learning, test]
learning.head(5)
Feature Dictionary¶
Independent Variables¶
PassengerId: A serial number for the passengers
Pclass: The class of ticket the passenger purchased (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: The name of each passenger
Sex: Whether the passenger was male or female
Age: The age of the passenger
SibSp: The number of passengers who were either siblings or a spouse of the passenger in question
Parch: The number of passengers who were either parents or children of the passenger in question
Ticket: The ticket number of the ticket purchased by the passenger
Fare: The price paid for the ticket
Cabin: The cabin number allocated to the passenger. If NaN, it implies the passenger did not have a cabin
Embarked: The port that the passenger embarked from
Response Variable¶
Survived: Whether the passenger survived the sinking of the Titanic
Data Wrangling and Exploratory Data Analysis¶
Data Summary¶
# Get the data types of each variable as a dataframe
learning.dtypes.to_frame(name = 'data_type')
# Get summary statistics of all numerical variables
learning.describe()
# Get descriptive statistics of non-numerical variables
learning.describe(exclude = ['int64', 'float64'])
Exploring Ticket Number variable¶
In spite of having no missing values, there appear to be only 681 unique ticket numbers. Let's explore why each passenger didn't have a unique ticket number.
learning['Ticket'].value_counts().to_frame(name = 'Count').head(5)
# Let's explore a few repeating ticket numbers
learning[learning['Ticket'] == "347082"]
learning[learning['Ticket'] == "CA. 2343"]
It appears that repeated ticket numbers signify families traveling together. We can deduce this from the common last names of the family.
Adjusting the Cabin Variable¶
Since the NaN values in the cabin variable signify that the passenger was without a cabin, we can adjust the column to a boolean variable. Furthermore, sklearn's decision tree algorithm requires categorical variables to be encoded into numbers.
# Change the Cabin variable to a boolean in both test and learning dataframes
for data in full_data:
    data['Cabin'] = data['Cabin'].fillna(0)
    data.loc[data['Cabin'] != 0, 'Cabin'] = 1
learning.head(5)
Feature Distributions¶
Let's visualize the variables to get a better understanding of their distribution.
Sex¶
learning['Sex'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Sex')
plt.xlabel('Sex')
plt.ylabel('Frequency')
plt.show()
Cabin¶
learning['Cabin'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passengers with and without Cabins')
plt.xlabel('Cabin')
plt.ylabel('Frequency')
plt.show()
Passenger Ticket Class¶
learning['Pclass'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Frequency')
plt.show()
Port of Embarkation¶
learning['Embarked'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of port of Embarkation')
plt.xlabel('Port of Embarkation')
plt.ylabel('Frequency')
plt.show()
Age¶
learning['Age'].hist(bins = 20, alpha = 0.8)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Number of parents and children on board¶
learning['Parch'].hist(bins = 6, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either parents or children of the passenger in question')
plt.xlabel('Number of Parents and/or children')
plt.ylabel('Frequency')
plt.show()
Sum of Siblings and Spouse on-board¶
learning['SibSp'].hist(bins = 8, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either siblings or a spouse of the passenger in question')
plt.xlabel('Number of Spouse and Siblings')
plt.ylabel('Frequency')
plt.show()
Fare¶
learning['Fare'].hist(bins = 10, alpha = 0.8)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()
Feature Disparity grouped by response¶
It is a good idea to see how a feature's distribution varies across the classes of our categorical response variable. Features that show a good amount of disparity could serve as strong predictors.
Age¶
import warnings
warnings.filterwarnings('ignore')
sns.distplot(learning[learning['Survived'] == 1][['Age']].dropna(),
kde = False, color = 'red', label = 'Survived').set_title('Distribution of Age')
sns.distplot(learning[learning['Survived'] == 0][['Age']].dropna(),
kde = False, color = 'grey', label = 'Did not Survive').set_title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Sum of Siblings and Spouse on-board¶
sns.distplot(learning[learning['Survived'] == 1][['SibSp']].dropna(), bins =5,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['SibSp']].dropna(), bins =5,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Siblings and Spouse')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of siblings and spouse they had')
plt.legend()
plt.show()
Number of parents and children on board¶
sns.distplot(learning[learning['Survived'] == 1][['Parch']].dropna(), bins =8,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Parch']].dropna(), bins =8,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Parents and Children')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of parents and children they had')
plt.legend()
plt.show()
Fare¶
sns.distplot(learning[learning['Survived'] == 1][['Fare']].dropna(), bins = 10,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Fare']].dropna(), bins = 10,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Fare paid')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the fare they paid')
plt.legend()
plt.show()
Port of Embarkation¶
#Get summary counts for each port of embarkation for passengers that survived
embarked_summary_survived = learning[learning['Survived'] == 1]['Embarked'].dropna().value_counts().to_frame(name='Count')
#Get summary counts for each port of embarkation for passengers that did not survive
embarked_summary_did_not_survive = learning[learning['Survived'] == 0]['Embarked'].dropna().value_counts().to_frame(name='Count')
# Concatenate the two data frames into one
embarked_summary = pd.concat([embarked_summary_survived, embarked_summary_did_not_survive])
# Add a column for whether the count represents passengers that survived or not
embarked_summary['Survived'] = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']
# Add a column that states port of embarkation
embarked_summary['Embarked'] = embarked_summary.index
# Reset the index to numbers
embarked_summary.reset_index(level = 0, inplace = True)
# Drop the additional index column
embarked_summary = embarked_summary.drop(['index'], axis = 1)
# Final Summary
embarked_summary
sns.catplot(x="Embarked", y="Count", hue="Survived", data=embarked_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive per port of Embarkation")
plt.show()
Cabin¶
# Get summary counts for passengers with and without cabins that survived
cabin_summary_survived = learning[learning['Survived'] == 1]['Cabin'].dropna().value_counts().to_frame(name='Count')
# Get summary counts for passengers with and without cabins that did not survive
cabin_summary_did_not_survive = learning[learning['Survived'] == 0]['Cabin'].dropna().value_counts().to_frame(name='Count')
# Concatenate the two data frames into one
cabin_summary = pd.concat([cabin_summary_survived, cabin_summary_did_not_survive])
# Add a column for whether the count represents passengers that survived or not
cabin_summary['Survived'] = ['Yes', 'Yes', 'No', 'No']
# Add a column that states whether the passenger had a cabin
cabin_summary['Cabin'] = cabin_summary.index
# Reset the index to numbers
cabin_summary.reset_index(level = 0, inplace = True)
# Drop the additional index column
cabin_summary = cabin_summary.drop(['index'], axis = 1)
# Final Summary
cabin_summary
sns.catplot(x="Cabin", y="Count", hue="Survived", data=cabin_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by based on whether that did or did not have a cabin")
plt.show()
Sex¶
# Get summary counts by sex for passengers that survived
sex_summary_survived = learning[learning['Survived'] == 1]['Sex'].dropna().value_counts().to_frame(name='Count')
# Get summary counts by sex for passengers that did not survive
sex_summary_did_not_survive = learning[learning['Survived'] == 0]['Sex'].dropna().value_counts().to_frame(name='Count')
# Concatenate the two data frames into one
sex_summary = pd.concat([sex_summary_survived, sex_summary_did_not_survive])
# Add a column for whether the count represents passengers that survived or not
sex_summary['Survived'] = ['Yes', 'Yes', 'No', 'No']
# Add a column that states the sex of the passenger
sex_summary['Sex'] = sex_summary.index
# Reset the index to numbers
sex_summary.reset_index(level = 0, inplace = True)
# Drop the additional index column
sex_summary = sex_summary.drop(['index'], axis = 1)
# Final Summary
sex_summary
sns.catplot(x="Sex", y="Count", hue="Survived", data=sex_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by sex")
plt.show()
Class of Passenger's Ticket¶
# Get summary counts by ticket class for passengers that survived
pclass_summary_survived = learning[learning['Survived'] == 1]['Pclass'].dropna().value_counts().to_frame(name='Count')
# Get summary counts by ticket class for passengers that did not survive
pclass_summary_did_not_survive = learning[learning['Survived'] == 0]['Pclass'].dropna().value_counts().to_frame(name='Count')
# Concatenate the two data frames into one
pclass_summary = pd.concat([pclass_summary_survived, pclass_summary_did_not_survive])
# Add a column for whether the count represents passengers that survived or not
pclass_summary['Survived'] = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']
# Add a column that states the class of the ticket
pclass_summary['Pclass'] = pclass_summary.index
# Reset the index to numbers
pclass_summary.reset_index(level = 0, inplace = True)
# Drop the additional index column
pclass_summary = pclass_summary.drop(['index'], axis = 1)
# Final Summary
pclass_summary
sns.catplot(x="Pclass", y="Count", hue="Survived", data=pclass_summary,
height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by class of ticket")
plt.show()
Data Insights:¶
- Missing Values: There seem to be missing values in the Age, Cabin and Embarked variables. We can tell this because the count is less than 891. These will need to be filled in.
- Data Anomaly: Repeated ticket numbers seem to imply that the passengers are part of the same family. This can be deduced from the common last names.
- Bad Predictors: Based on intuition, in their current form, it is unlikely that PassengerId, Name and Ticket number would be strong predictors of the survival of a passenger.
- Data Adjustment: The categorical variables 'Sex', 'Cabin' and 'Embarked' need to be encoded in order to be used with sklearn's decision tree algorithms
- Cabin makes more sense as a boolean variable that represents passengers who had cabins vs. those who did not
- Good Predictors: From the feature disparity plots, it appears that Sex, Cabin and Passenger Class are good predictors of a passenger's survival. In particular, the likelihood of survival was higher if the passenger was female, did not hold a class 3 ticket, or had a cabin.
Adjusting the Sex Variable¶
Like the Cabin variable, we need to encode the categorical variable Sex into a numerical one for it to be accepted by sklearn's decision tree algorithm.
le = LabelEncoder()
le.fit(['male','female'])
for data in full_data:
    data['Sex'] = le.transform(data['Sex'])
learning.head(5)
Adjusting the Embarked Variable¶
Again, we encode the variable to work with sklearn. However, there are missing values in the 'Embarked' column, and we are going to replace them with 'S', the most common port of embarkation. This might not be the true port of embarkation for those passengers, but it seems like the most sensible choice given that 'S' has the highest frequency among the three ports.
le_embark = LabelEncoder()
le_embark.fit(['S','C','Q'])
for data in full_data:
    data['Embarked'] = data['Embarked'].fillna('S')
    data['Embarked'] = le_embark.transform(data['Embarked'])
learning.head(5)
Imputing missing values in Age¶
There are several ways to deal with missing values in a dataset. You could drop the row/data point that has the missing value, or, if a feature has too many missing values, drop the entire column. While dropping rows and columns seems like an easy fix, there's an obvious downside: you lose viable data.
Mathematically grounded ways to impute the missing values with a sensible replacement are often considered better alternatives to dropping data points and features. In particular, single and multiple imputation are the two strategies we could consider.
For the sake of this project, we will use single imputation with the mean.
imputer = SimpleImputer()  # imputes with the column mean by default
learning['Age'] = imputer.fit_transform(learning[['Age']])
# Use the mean learned from the training data to fill the test data
test['Age'] = imputer.transform(test[['Age']])
Base Model¶
Let's create a base decision tree model to see what accuracy we get with the default set of features, to understand our starting point.
This base model could then be improved using several techniques:
- Hyperparameter tuning with cross validation
- Feature selection
- Feature engineering
- Regularization
Training the Base model¶
# Splitting the data into training and validations sets
X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
y = learning[['Survived']]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Training the Base model
base_decision_tree = tree.DecisionTreeClassifier()
base_decision_tree.fit(X_train, y_train)
# Training score
base_decision_tree.score(X_train, y_train)
# Validation Score
base_decision_tree.score(X_valid, y_valid)
Base model results¶
The base model has an accuracy score of 98.5% on the training set and a 78.77% accuracy on the validation set.
pd.DataFrame(data = base_decision_tree.feature_importances_,
index = list(X_train.columns),
columns = ['Importance']).sort_values(by = ['Importance'], axis = 0, ascending = False)
As postulated, we can see that Sex is one of the most important features in deciding whether a passenger survived. While Age and Fare were not intuitively strong predictors - based on the feature disparity visualizations - the algorithm considers them strong predictors.
Of the other variables in our consideration, passenger class and cabin seem to be moderately important to the algorithm.
It's important to note that these feature importances are produced using the current hyperparameters. Adjusting the hyperparameters may result in a different set of feature importances.
Visualizing the base model¶
with open("tree1.dot", 'w') as f:
f = tree.export_graphviz(base_decision_tree,
out_file=f,
impurity = True,
feature_names = list(X_train.columns),
class_names = ['Died', 'Survived'],
rounded = True,
filled= True )
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])
img = Image.open("./tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")
As we can see, this base model decision tree has a very large depth and could be considered to be over-fitting, given the large discrepancy between the training and validation accuracy. Let's try to rectify this and improve our model.
Feature Engineering¶
Considering the ubiquity of the Titanic dataset, there is extensive work done on it, and we are going to take inspiration from some of the work that's out there: Sina, Anisotropic, Diego_Milia and also Megan Risdal.
Family Size and IsAlone¶
learning.head()
for data in full_data:
    data['Family_size'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = 0
    data.loc[data['Family_size'] == 1, 'IsAlone'] = 1
test.head(5)
sns.distplot(learning[learning['Survived'] == 1][['Family_size']].dropna(), bins =10,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Family_size']].dropna(), bins =10,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Family Size')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the size of family on-board')
plt.legend()
plt.show()
Let's drop the smaller family sizes to gain some more resolution on the larger ones.
sns.distplot(learning[(learning['Survived'] == 1) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[(learning['Survived'] == 0) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Family Size')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the size of family on-board')
plt.legend()
plt.show()
The above visualization shows that the larger a passenger's family, the less likely they were to survive.
summary_isalone = pd.DataFrame(learning.groupby(['Survived','IsAlone']).size(), columns= ['Count'])
summary_isalone.reset_index(level=0, inplace=True)
summary_isalone.reset_index(level=0, inplace=True)
summary_isalone
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
my_pal = {1: "red", 0: "grey"}
sns.catplot(ax = ax, data= summary_isalone, x="IsAlone", y="Count", hue="Survived",
kind="bar", palette=my_pal, alpha = 0.5)
plt.close(2)
plt.show()
Title¶
As we can see from the Name column, there are titles such as Mr. and Mrs. embedded within the name. Let's extract this information to form a Title variable. Using a feature such as the title allows us to extract a little more information about each passenger. For instance, a Master or a Sir would have a higher chance of survival due to their elevated social status.
Title: The salutation in the name of the passenger
re_res = re.search(r'([A-Za-z]+)\.', learning['Name'][0])
re_res.group(1)
learning['Name'][0]
def extract_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
for data in full_data:
data['Title'] = data['Name'].apply(extract_title)
test.head()
summary_title = pd.DataFrame(learning.groupby(['Survived','Title']).size(), columns= ['Count'])
summary_title.reset_index(level=0, inplace=True)
summary_title.reset_index(level=0, inplace=True)
summary_title
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
my_pal = {1: "red", 0: "grey"}
sns.catplot(ax = ax, data= summary_title, x="Title", y="Count", hue="Survived",
kind="bar",palette=my_pal, alpha = 0.5)
plt.close(2)
plt.show()
It appears that the Miss and Mrs Titles have a higher likelihood of survival, while the Mr Title has a lower likelihood of survival.
Let's zoom into the graph a little more by dropping the most common titles.
summary_title_reduced = summary_title[(summary_title['Title'] != 'Master') &
(summary_title['Title'] != 'Miss') &
(summary_title['Title'] != 'Mr') &
(summary_title['Title'] != 'Mrs') ]
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
sns.catplot(ax = ax, data= summary_title_reduced, x="Title", y="Count", hue="Survived",
kind="bar", palette=my_pal, alpha = 0.5)
plt.close(2)
plt.show()
Interestingly, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes Title quite a valuable predictor.
Finally, let's label encode our Title variable.
le = LabelEncoder()
le.fit(np.unique(np.concatenate((test.Title.unique(), learning.Title.unique()))))
for data in full_data:
    data['Title'] = le.transform(data['Title'])
learning.head(5)
test.head()
Extended Feature Dictionary¶
Family_size: Given that we have information about the parents/children and siblings/spouse of each passenger, let's create a family size feature: the total number of family members on-board the Titanic for each passenger, including themselves
IsAlone: A categorical variable that is set to 1 if Family_size == 1, and 0 otherwise
Title: The salutation in the name of the passenger, such as Mr, Ms, Mrs, or Countess
Feature Engineering Data Insights¶
Let's summarize the insights we gained through our engineered features
Family_size: Passengers with larger family sizes seemed less likely to survive
IsAlone: Passengers who were alone on-board also seemed to have a higher mortality rate
Title: Lastly, the title of the passenger is a strong indicator of their survival. The titles of Miss and Mrs had a higher likelihood of survival, while the Mr title had a lower likelihood of survival. Additionally, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes Title quite a valuable predictor.
Drop Sex: Since most of the information we have about sex is implied by the Title of the passenger, we can drop Sex to avoid redundancy in our features.
learning = learning.drop('Sex', axis = 1)
test = test.drop('Sex', axis = 1)
test.head()
Hyperparameter Optimization¶
There are four strategies commonly used to optimize the hyperparameters of an algorithm, all of which rely on evaluating the model on data it was not trained on.
- Validation set: We verify our choices of hyperparameters by observing their impact on the accuracy of our model on a validation (unseen) set of data.
- k-fold Cross-Validation: The same concept as the validation set approach, except you now have "k" validation sets and "k" training sets. The idea is to repeat the training and validation steps "k" times, with each repetition seeing a different split of training and validation data. A minimal sketch of this loop appears right after this list.
- GridSearchCV: You provide a set of values for every hyperparameter you wish to tune, and every combination of those hyperparameters is tried in the model. GridSearchCV returns the combination of hyperparameters that provides the highest cross-validated accuracy.
- RandomizedSearchCV: Unlike GridSearchCV, RandomizedSearchCV randomly samples hyperparameter values and returns the best set found after a fixed number of tries.
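Before applying these strategies to our data, here is a minimal sketch (on synthetic data, not from the original notebook) of the loop that k-fold cross-validation performs under the hood; below we simply let cross_val_score handle this loop for us:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X_demo, y_demo = make_classification(n_samples=300, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, valid_idx in kf.split(X_demo):
    # Train on 4 folds, score on the held-out fold
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))
print(np.mean(fold_scores))  # average accuracy across the 5 validation folds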
Validation Set¶
X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
y = learning[['Survived']]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
test_depths = range(1,10)
test_depths
training_scores = []
validation_scores = []
for depth in test_depths:
updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
updated_decision_tree.fit(X_train, y_train)
training_scores.append(updated_decision_tree.score(X_train, y_train))
validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
print(validation_scores)
plt.plot(test_depths,training_scores, color = 'blue')
plt.plot(test_depths,validation_scores, color = 'green')
plt.title("Accuracy Scores vs max_depth")
plt.xlabel("Max Depth of Tree")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation'])
plt.show()
random_min_sample_splits = np.random.randint(low=2, high=60, size=30)
random_min_sample_splits.sort()
random_min_sample_splits = np.unique(random_min_sample_splits)
training_scores = []
validation_scores = []
for split_size in random_min_sample_splits:
updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size, class_weight='balanced',
random_state=42)
updated_decision_tree.fit(X_train, y_train)
training_scores.append(updated_decision_tree.score(X_train, y_train))
validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
plt.plot(random_min_sample_splits,training_scores, color = 'blue')
plt.plot(random_min_sample_splits,validation_scores, color = 'green')
plt.title("Accuracy Scores vs min_samples_split")
plt.xlabel("Min. Sample Split at node")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation'])
plt.show()
Cross Validation¶
training_scores = []
validation_scores = []
avg_cross_val_scores = []
for depth in test_depths:
updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
updated_decision_tree.fit(X_train, y_train)
training_scores.append(updated_decision_tree.score(X_train, y_train))
validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
# Computing additional Cross validation score.
avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
plt.plot(test_depths,training_scores, color = 'blue')
plt.plot(test_depths,validation_scores, color = 'green')
plt.plot(test_depths,avg_cross_val_scores, color = 'red')
plt.title("Accuracy Scores vs max_depth")
plt.xlabel("Max Depth of Tree")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
training_scores = []
validation_scores = []
avg_cross_val_scores = []
for split_size in random_min_sample_splits:
updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size,
class_weight='balanced', random_state=42)
updated_decision_tree.fit(X_train, y_train)
training_scores.append(updated_decision_tree.score(X_train, y_train))
validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
# Computing additional Cross validation score.
avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
plt.plot(random_min_sample_splits,training_scores, color = 'blue')
plt.plot(random_min_sample_splits,validation_scores, color = 'green')
plt.plot(random_min_sample_splits,avg_cross_val_scores, color = 'red')
plt.title("Accuracy Scores vs min_samples_split")
plt.xlabel("Min. Sample Split at node")
plt.ylabel("Accuracy Scores")
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
In general, the avg_cross_validation accuracy score serves as a stronger indicator of the model's accuracy on unseen data given that it has been tested on more than a single validation set. However, in both the cases above, we have tried to optimize a single hyperparameter at a time. Let's try to optimize a couple of them simultaneously.
keys = ['max_depth', 'min_samples_split', 'training', 'validation', 'avg_cross_validation']
test_dict = {key: [] for key in keys}
test_dict
for depth in test_depths:
for split_size in random_min_sample_splits:
updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, min_samples_split = split_size,
class_weight='balanced', random_state=42)
updated_decision_tree.fit(X_train, y_train)
test_dict['max_depth'].append(depth)
test_dict['min_samples_split'].append(split_size)
test_dict['training'].append(updated_decision_tree.score(X_train, y_train))
test_dict['validation'].append(updated_decision_tree.score(X_valid, y_valid))
        test_dict['avg_cross_validation'].append(np.mean(cross_val_score(updated_decision_tree, X, y, cv=10)))
scores_pd = pd.DataFrame(test_dict)
scores_pd.head()
g = sns.FacetGrid(scores_pd, col='max_depth', col_wrap=3)
g.map(sns.lineplot, "min_samples_split", "training", alpha=.7 , color = 'blue')
g.map(sns.lineplot, "min_samples_split", "validation", alpha=.7 , color = 'green')
g.map(sns.lineplot, "min_samples_split", "avg_cross_validaion", alpha=.7 , color = 'red')
plt.legend(labels = ['training','validation', 'avg_cross_validation'])
plt.show()
It seems like a max_depth of 5 with a small min_samples_split works best, both in terms of accuracy and in optimizing the bias-variance trade-off.
scores_pd[(scores_pd['max_depth'] == 5) & (scores_pd['min_samples_split'] == 6)]
In essence, this was a manual version of grid search cross validation. As explained above, with GridSearchCV you give the algorithm a set of values for each hyperparameter. The algorithm then takes the cartesian product of these sets, tries every combination of the hyperparameters, and returns the hyperparameters that yield the highest accuracy.
GridSearchCV¶
parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
gridCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf = GridSearchCV(gridCV_decision_tree, parameters, cv=10)
clf.fit(X, y)
clf.best_params_
clf.best_score_
As can be seen, given the same set of values for the hyperparameters, GridSearchCV selected a max_depth of 5 and a min_samples_split of 6, just as we found with our manual k-fold cross validation over the two hyperparameters.
RandomizedSearchCV¶
Like GridSearchCV, RandomizedSearchCV takes a dictionary of hyperparameter values to search over. However, unlike the exhaustive search of GridSearchCV, which checks every combination of hyperparameter values, RandomizedSearchCV randomly selects values to try. The number of parameter combinations tried is set using the n_iter parameter when initializing the RandomizedSearchCV object.
parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
randomCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
random_search = RandomizedSearchCV(randomCV_decision_tree, param_distributions=parameters,
                                   n_iter=20, cv=10, random_state = 42)
random_search.fit(X, y)
random_search.best_params_
random_search.best_score_
Final Model¶
g_sub = pd.read_csv('./data/titanic/gender_submission.csv')
test_w_response = pd.merge(test, g_sub, on='PassengerId', how='inner')
test_w_response.head()
test_x = test_w_response.drop(['PassengerId', 'Name', 'Survived', 'Ticket'], axis = 1)
test_y = test_w_response[['Survived']]
# Fill the single missing Fare with the mean fare
test_x.loc[test_x['Fare'].isna(), 'Fare'] = test_x['Fare'].mean()
final_model = tree.DecisionTreeClassifier(max_depth = 5, min_samples_split = 6,
class_weight='balanced', random_state=42)
final_model.fit(X,y)
test_predict_y = final_model.predict(test_x)
accuracy_score(test_y, test_predict_y)
with open("tree_final.dot", 'w') as f:
f = tree.export_graphviz(final_model,
out_file=f,
max_depth = 5,
impurity = True,
feature_names = list(X.columns),
class_names = ['Died', 'Survived'],
rounded = True,
filled= True )
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree_final.dot','-o','tree_final.png'])
img_final = Image.open("./tree_final.png")
draw = ImageDraw.Draw(img_final)
img_final.save('final-out.png')
PImage("final-out.png")