
Deep Dive | Decision Trees


Decision trees are a supervised learning algorithm used for classification and regression. This post will dive deep into the workings of a decision tree. Along the way, I will walk through a detailed example and showcase some necessary skills required to tune and improve any machine learning algorithm. The post will also exhibit some exploratory data analysis and visualization practices that can help you better understand the data.


Decision Trees

Decision trees are a supervised learning algorithm used for classification and regression.

How do they work?


As the name suggests, a decision tree makes a series of decisions based on the features in the dataset by following an if-this-then-that logic.

Imagine a hypothetical situation in which there are 60 students in a classroom (30 male, 30 female), and all the girls love playing volleyball while the boys do not.

In this hypothetical scenario, we could say that 'if the gender is Female, then the person would like volleyball'. Hence the value of the Gender variable would act as a good predictor for whether the person likes volleyball or not. This process of partitioning is called splitting.

A decision tree that would split the data set on the Gender variable would hence be able to make good predictions about whether a random student likes volleyball or not.

In [1]:
import pandas as pd
In [2]:
data = pd.read_csv('./data/volleyball_example.csv')
data.head(5)
Out[2]:
Gender Volleyball
0 Male 0
1 Female 1
2 Female 1
3 Female 1
4 Female 1
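As a quick, hedged illustration (not part of the original notebook), a depth-1 tree fitted on this toy data would separate the two groups with a single split on Gender. The encoding below is only a sketch and assumes the Gender and Volleyball columns shown above:

from sklearn.tree import DecisionTreeClassifier

# Encode Gender numerically (Female -> 0, Male -> 1) so sklearn can work with it
X = data[['Gender']].replace({'Female': 0, 'Male': 1})
y = data['Volleyball']

# A single split on Gender should be enough for this toy dataset
toy_tree = DecisionTreeClassifier(max_depth=1)
toy_tree.fit(X, y)
print(toy_tree.score(X, y))  # expected to be 1.0 if the data matches the scenario above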

How do we quantitatively evaluate how good a split is?


Option 1) FOR CLASSIFICATION - Entropy or Information gain


The decision-making process of the algorithm is driven by what is called entropy or information gain. Entropy is a measure of disorder; formally, for a set with class probabilities $p(i)$, the entropy is $-\sum_{i} p(i)\log_2 p(i)$. Let's say the entropy of our original dataset was 1. After splitting the data based on gender, we have created two smaller datasets (in this case one of 30 male and another of 30 female students). These smaller datasets have less disorder in them because each one now contains only one type of student (and, in this example, only one answer to the volleyball question) compared to the mix in the original dataset.

By measuring the difference between the entropy of the original dataset and the weighted sum of the entropies of the two smaller datasets, we can measure how much disorder was lost as a result of our splitting decision, or, in other words, how much order or information was gained.


$Information\ Gained = Entropy(Original\ Data) - \big(\tfrac{30}{60}\,Entropy(Females) + \tfrac{30}{60}\,Entropy(Males)\big)$


Each decision or splitting step in a decision tree is referred to as a node. At a node, the decision tree considers every feature in the dataset and tries to split the dataset based on a certain threshold of that feature. The feature (and threshold) that gets selected is the one that provides the most information gain after the split.
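As a rough, illustrative sketch (not from the original notebook), here is how the entropy and information gain of the volleyball split could be computed directly from the label counts; the helper entropy() and the toy label arrays are hypothetical:

import numpy as np

def entropy(labels):
    # H = -sum(p(i) * log2(p(i))) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0] * 30 + [1] * 30)   # the full classroom: 30 dislike volleyball, 30 like it
males = np.array([0] * 30)               # after the split: all 30 males dislike volleyball
females = np.array([1] * 30)             # all 30 females like volleyball

# Information gain = parent entropy minus the weighted entropy of the children
weighted_children = (len(males) / len(parent)) * entropy(males) \
                  + (len(females) / len(parent)) * entropy(females)
print(entropy(parent) - weighted_children)  # 1.0 for this perfectly separable example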

Option 2) FOR CLASSIFICATION - Gini Impurity


Alternatively, a decision tree can use what is called the Gini impurity. The Gini impurity is a quantitative measure of the probability of incorrectly classifying a randomly selected data point if it were labeled at random according to the distribution of the classes.

Going back to our hypothetical volleyball example, let's consider the following: 30 of the students prefer volleyball, while the other 30 do not. Hence the probability that a randomly selected student likes volleyball is 0.50.

Case     Selected student (actual)     Prediction (random label)     Probability
Case 1   Likes volleyball              Does not like volleyball      $0.5 * 0.5 = 0.25$
Case 2   Likes volleyball              Likes volleyball              $0.5 * 0.5 = 0.25$
Case 3   Does not like volleyball      Likes volleyball              $0.5 * 0.5 = 0.25$
Case 4   Does not like volleyball      Does not like volleyball      $0.5 * 0.5 = 0.25$

As can be seen from the above table, the probability of incorrectly classifying a randomly selected student is the sum of Case 1 and Case 3, i.e. $0.25 + 0.25 = 0.5$.

Mathematically, the Gini impurity can be expressed as:


$Gini\ Impurity = \sum_{i=1}^{C} p(i)*(1-p(i))$

where:
C is the total number of classes,
i is the index over the classes,
p(i) is the probability of a data point belonging to class i, and
1 - p(i) is the probability of it not belonging to class i.


The decision tree makes a split based on the feature and threshold that result in the lowest (weighted) Gini impurity of the resulting nodes.
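For illustration, here is a minimal sketch (again not from the original notebook) of the Gini impurity calculation for the toy classroom; gini_impurity(), mixed_node, and pure_node are hypothetical names:

import numpy as np

def gini_impurity(labels):
    # Gini = sum over classes of p(i) * (1 - p(i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

mixed_node = np.array([0] * 30 + [1] * 30)  # the original 30/30 classroom
pure_node = np.array([1] * 30)              # e.g. the female students after the split

print(gini_impurity(mixed_node))  # 0.5, matching the table above
print(gini_impurity(pure_node))   # 0.0 - a pure node has no impurity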

Option 3) FOR REGRESSION - Mean squared error


As can be expected, decision tree regressors require an alternative measure to quantify the goodness of a split. One method is to monitor the mean squared error and select the feature and threshold that reduce the error the most.

Note: Decision tree regressors do not predict over a truly continuous range. Instead, the prediction for a sample is the mean of the response values in the leaf node in which it lands.

Alternatively, we could also use Friedman's MSE or the mean absolute error (MAE) to quantitatively decide the best split.
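As a rough sketch of how a regression split might be scored under this criterion (illustrative only; the response values below are made up), the reduction in mean squared error can be computed from the mean of each side of a candidate split:

import numpy as np

def mse(y):
    # Mean squared error around the node's mean prediction
    return np.mean((y - np.mean(y)) ** 2)

y = np.array([1.0, 1.2, 0.9, 5.0, 5.5, 4.8])  # hypothetical response values at a node
left, right = y[:3], y[3:]                    # one candidate split of those values

weighted_child_mse = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
print(mse(y) - weighted_child_mse)  # the larger this reduction, the better the split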

Hyperparameters


Now that we understand the mechanism that decision tree algorithms use to identify the best split, let us look at the hyperparameters that can be tuned to control the way the algorithm learns from the data. Tuning hyperparameters is an important step for any machine learning algorithm.

A full list of hyperparameters can be found in the sklearn documentation:
Decision Tree Regressors: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Decision Tree Classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

We'll focus on a few hyperparameters that I consider important based on my experience.

criterion

The criterion is the quantitative measure used by the algorithm to decide on the best split.

For decision tree classifiers, the options are gini impurity or entropy. There doesn't seem to be a significant advantage in picking one over the other, so either should work in practice.

Decision tree regressors, on the other hand, can use the mean squared error, Friedman's mean squared error, or the mean absolute error. The difference between the mean squared error and the mean absolute error is that the former minimizes the loss using the mean of the node, while the latter does so using the median. Friedman's mean squared error is a weighted version that accounts for the number of samples in the left and right children of a node when computing the improvement.
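For reference, a minimal sketch of how the criterion would be passed to the sklearn estimators; the option names 'mse', 'friedman_mse' and 'mae' correspond to the older scikit-learn release this notebook appears to use, so check your installed version:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf_gini = DecisionTreeClassifier(criterion='gini')        # the default for classification
clf_entropy = DecisionTreeClassifier(criterion='entropy')  # information gain

reg_mse = DecisionTreeRegressor(criterion='mse')                    # minimizes MSE with the node mean
reg_friedman_mse = DecisionTreeRegressor(criterion='friedman_mse')  # Friedman's weighted improvement
reg_mae = DecisionTreeRegressor(criterion='mae')                    # minimizes MAE with the node median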

max_depth

Each split at a node adds a layer of depth to the tree. The first node in a decision tree is referred to as the root node (a tree consisting of just the root node and a single split is known as a decision stump). Splitting the root node results in two daughter nodes, each with a smaller subset of the data, and in turn an increase in the depth of the tree. The max_depth hyperparameter allows you to dictate the maximum depth that a decision tree may attain before the algorithm must stop splitting.

How to use this?
The greater the depth of the tree, the more complex the algorithm's learning path. By allowing a larger depth, the algorithm can fit the training data more precisely. As with all hyperparameters, there isn't a golden rule that fits all situations, but a general guideline is that the deeper the tree, the more likely it is to over-fit the training data, and conversely, the shallower the tree, the more likely it is to under-fit.

min_samples_split

As we've seen by now, traversing down a tree we encounter nodes splitting into daughter nodes, each containing a smaller subset of the data to be further split based on a certain feature and threshold. It isn't necessary that a pair of sister nodes (daughter nodes originating from the same parent node) contain the same number of data points; in fact, more often than not, they won't. The min_samples_split hyperparameter allows you to govern which nodes can be split further based on the number of data points left in them.

How to use this?
Just as with max_depth, the min_samples_split hyperparameter helps govern how strongly the algorithm fits the training data. A very small min_samples_split allows the algorithm to keep dissecting the dataset until only a handful of data points remain in each node, provided the maximum allowable depth of the tree (max_depth) hasn't been reached. The smaller min_samples_split is, the more complex the learning path of your algorithm and the higher its tendency to over-fit the training data; conversely, the higher the value, the more likely it is to under-fit.

min_samples_leaf

Before we look at this hyperparameter, it is important to recall the difference between a leaf node and an internal node. Contrary to a regular (internal) node, a leaf node is one that does not have any daughter/children nodes; it marks the end of the splitting of the data in that branch of the decision tree. The min_samples_leaf hyperparameter allows us to control how many samples must be left in each daughter of an internal node in order for the split to be allowed. Its purpose can often be hard to distinguish from min_samples_split, so let's consider an example: say an internal node has 10 data points, min_samples_split is set to 7, and min_samples_leaf is set to 5. Since there are more than 7 data points, the min_samples_split requirement is satisfied. However, if the split would leave a daughter node with fewer than 5 data points, it would violate the min_samples_leaf threshold and the split would not be allowed.

How to use this? The primary difference in effect between min_samples_split and min_samples_leaf is that the former governs whether a node can be split at all, while the latter governs whether the nodes created by the split are viable. Setting min_samples_leaf to a very small number allows the algorithm to learn the training data more precisely, which may result in over-fitting. Another way to see this is that leaves with a very small number of data points let the algorithm dissect the data into groups that are more likely to be unique to the training data and not reflective of the real world or unseen data.
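As a small illustrative sketch (the specific values are hypothetical, echoing the example above), the three hyperparameters can be combined when constructing the classifier:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical values: the tree may grow at most 4 levels deep, a node needs at least
# 7 samples to be considered for a split, and any split must leave at least 5 samples
# in each daughter node.
constrained_tree = DecisionTreeClassifier(max_depth=4,
                                          min_samples_split=7,
                                          min_samples_leaf=5)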

What kind of data do you pass to Decision Trees?

Categorical vs Numerical

Decision trees can accept categorical and numerical data. However, the programming language and packages you use will govern the format in which you can pass in categorical variables. Since we are using scikit-learn in this example, I'll explain the procedure for it.

scikit-learn's decision tree algorithm requires categorical variables to be encoded into numbers. Hence, a variable like Sex, which could have values of 'male' and 'female', would need to be transformed to 1 and 0.
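A minimal sketch of this kind of encoding, using the same LabelEncoder that is applied to the Titanic data later in the post:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['male', 'female', 'female', 'male'])
print(encoded)  # [1 0 0 1] - classes are sorted alphabetically, so 'female' -> 0 and 'male' -> 1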

Missing Values

The decision tree algorithm will throw an error if you pass it any features with missing values. Hence, it is imperative to deal with missing values before feeding the data to the algorithm.
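For illustration only (the DataFrame below is hypothetical), here are two of the simplest ways to handle a column with missing values before fitting; the post itself uses mean imputation further down:

import numpy as np
import pandas as pd

# A small hypothetical DataFrame with one missing Age value
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0], 'Fare': [7.25, 71.28, 8.05]})

# Option 1: drop rows with a missing Age (simple, but discards data)
df_dropped = df.dropna(subset=['Age'])

# Option 2: fill missing Ages with the column mean (single imputation)
df_filled = df.assign(Age=df['Age'].fillna(df['Age'].mean()))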

Example Model


Importing Libraries

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, Imputer

from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont

%matplotlib inline

Importing the data

In [4]:
# Use pandas' built-in function to read in the learning and testing data
learning = pd.read_csv('./data/titanic/train.csv')
test = pd.read_csv('./data/titanic/test.csv')

# Make a list of both data sets to conveniently apply data wrangling to all the data
full_data = [learning, test]


learning.head(5)
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Feature Dictionary

Independent Variables

PassengerId: A serial number for the passengers
Pclass: The class of ticket the passenger purchased (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: The name of each passenger
Sex: Whether the passenger was male or female
Age: The age of the passenger
SibSp: The number of passengers who were either siblings or a spouse of the passenger in question
Parch: The number of passengers who were either parents or children of the passenger in question
Ticket: The ticket number of the ticket purchased by the passenger
Fare: The price paid for the ticket
Cabin: The cabin number allocated to the passenger. If NaN, it implies the passenger did not have a cabin
Embarked: The port that the passenger embarked from

Response Variable

Survived: Whether the passenger survived the sinking of the Titanic

Data Wrangling and Exploratory Data Analysis

Data Summary

In [5]:
# Get the data types of each variable as a dataframe
learning.dtypes.to_frame(name = 'data_type')
Out[5]:
data_type
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
In [6]:
# Get summary statistics of all numerical variables
learning.describe()
Out[6]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [7]:
# Get descriptive statistics of non-numerical variables
learning.describe(exclude = ['int64', 'float64'])
Out[7]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Turpin, Mr. William John Robert male 347082 C23 C25 C27 S
freq 1 577 7 4 644

Exploring Ticket Number variable

In spite of having no missing values, there seem to be only 681 unique ticket numbers. Let's explore to understand why each passenger didn't have a unique ticket number.

In [8]:
learning['Ticket'].value_counts().to_frame(name = 'Count').head(5)
Out[8]:
Count
347082 7
CA. 2343 7
1601 7
347088 6
CA 2144 6
In [9]:
# Let's explore a  few repeating ticket numbers
learning[learning['Ticket'] == "347082"]
Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.275 NaN S
119 120 0 3 Andersson, Miss. Ellis Anna Maria female 2.0 4 2 347082 31.275 NaN S
541 542 0 3 Andersson, Miss. Ingeborg Constanzia female 9.0 4 2 347082 31.275 NaN S
542 543 0 3 Andersson, Miss. Sigrid Elisabeth female 11.0 4 2 347082 31.275 NaN S
610 611 0 3 Andersson, Mrs. Anders Johan (Alfrida Konstant... female 39.0 1 5 347082 31.275 NaN S
813 814 0 3 Andersson, Miss. Ebba Iris Alfrida female 6.0 4 2 347082 31.275 NaN S
850 851 0 3 Andersson, Master. Sigvard Harald Elias male 4.0 4 2 347082 31.275 NaN S
In [10]:
learning[learning['Ticket'] == "CA. 2343"]
Out[10]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S

It appears that repeated ticket numbers signify families traveling together. We can deduce this from the common last names of the family.

Adjusting the Cabin Variable

Since the NaN values in the cabin variable signify that the passenger was without a cabin, we can adjust the column to a boolean variable. Furthermore, sklearn's decision tree algorithm requires categorical variables to be encoded into numbers.

In [11]:
# Change the Cabin variable to a boolean in both the test and learning dataframes
for data in full_data:
    data['Cabin'] = data['Cabin'].fillna(0)
    # Use .loc to avoid pandas' SettingWithCopyWarning from chained indexing
    data.loc[data['Cabin'] != 0, 'Cabin'] = 1
In [12]:
learning.head(5)
Out[12]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 0 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 1 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 0 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 1 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 0 S

Feature Distributions

Let's visualize the variables to get a better understanding of their distributions.

Sex

In [13]:
learning['Sex'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Sex')
plt.xlabel('Sex')
plt.ylabel('Frequency')
plt.show()

Cabin

In [14]:
learning['Cabin'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passengers with and without Cabins')
plt.xlabel('Cabin')
plt.ylabel('Frequency')
plt.show()

Passenger Ticket Class

In [15]:
learning['Pclass'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Frequency')
plt.show()

Port of Embarkation

In [16]:
learning['Embarked'].value_counts().plot.bar(alpha = 0.8)
plt.title('Distribution of port of Embarkation')
plt.xlabel('Port of Embarkation')
plt.ylabel('Frequency')
plt.show()

Age

In [17]:
learning['Age'].hist(bins = 20, alpha = 0.8)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Number of parents and children on board

In [18]:
learning['Parch'].hist(bins = 6, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either parents or children of the passenger in question')
plt.xlabel('Number of Parents and/or children')
plt.ylabel('Frequency')
plt.show()

Sum of Siblings and Spouse on-board

In [19]:
learning['SibSp'].hist(bins = 8, alpha = 0.8)
plt.title('Distribution of the number of passengers who were either siblings or a spouse of the passenger in question')
plt.xlabel('Number of Spouse and Siblings')
plt.ylabel('Frequency')
plt.show()

Fare

In [20]:
learning['Fare'].hist(bins = 10, alpha = 0.8)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

Feature Disparity grouped by response

It is a good idea to see how a feature's distribution varies across the classes of our categorical response variable. Features that show a good amount of disparity could serve as strong predictors.

Age

In [21]:
import warnings
warnings.filterwarnings('ignore')
sns.distplot(learning[learning['Survived'] == 1][['Age']].dropna(),
             kde = False, color = 'red', label = 'Survived').set_title('Distribution of Age')
sns.distplot(learning[learning['Survived'] == 0][['Age']].dropna(),
             kde = False, color = 'grey', label = 'Did not Survive').set_title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Sum of Siblings and Spouse on-board

In [22]:
sns.distplot(learning[learning['Survived'] == 1][['SibSp']].dropna(), bins =5,
             kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['SibSp']].dropna(), bins =5,
             kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Siblings and Spouse')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of siblings and spouse they had')
plt.legend()
plt.show()

Number of parents and children on board

In [23]:
sns.distplot(learning[learning['Survived'] == 1][['Parch']].dropna(), bins =8,
             kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Parch']].dropna(), bins =8,
             kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Sum of number of Parents and Children')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the number of parents and children they had')
plt.legend()
plt.show()

Fare

In [24]:
sns.distplot(learning[learning['Survived'] == 1][['Fare']].dropna(), bins = 10,
             kde = False, color = 'red', label = 'Survived')
sns.distplot(learning[learning['Survived'] == 0][['Fare']].dropna(), bins = 10,
             kde = False, color = 'grey', label = 'Did not Survive')
plt.xlabel('Fare paid')
plt.ylabel('Frequency')
plt.title('Number of passengers that survived based on the fare they paid')
plt.legend()
plt.show()

Port of Embarkation

In [25]:
#Get summary counts for each port of embarkation for passengers that survived
embarked_summary_survived = learning[learning['Survived'] == 1]['Embarked'].dropna().value_counts().to_frame(name='Count')

#Get summary counts for each port of embarkation for passengers that did not survive
embarked_summary_did_not_survive = learning[learning['Survived'] == 0]['Embarked'].dropna().value_counts().to_frame(name='Count')

# Append the two data frames into one
embarked_summary = embarked_summary_survived.append(embarked_summary_did_not_survive)

# Add a column for whether the count represents passengers that survived or not
embarked_summary['Survived'] = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']

# Add a column that states port of embarkation
embarked_summary['Embarked'] = embarked_summary.index

# Reset the index to numbers
embarked_summary.reset_index(level = 0, inplace = True)

# Drop the additional index column
embarked_summary = embarked_summary.drop(['index'], axis = 1)

# Final Summary
embarked_summary
Out[25]:
Count Survived Embarked
0 217 Yes S
1 93 Yes C
2 30 Yes Q
3 427 No S
4 75 No C
5 47 No Q
In [26]:
sns.catplot(x="Embarked", y="Count", hue="Survived", data=embarked_summary,
                height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive per port of Embarkation")
plt.show()

Cabin

In [27]:
#Get summary counts of passengers with and without cabins that survived
cabin_summary_survived = learning[learning['Survived'] == 1]['Cabin'].dropna().value_counts().to_frame(name='Count')

#Get summary counts of passengers with and without cabins that did not survive
cabin_summary_did_not_survive = learning[learning['Survived'] == 0]['Cabin'].dropna().value_counts().to_frame(name='Count')

# Append the two data frames into one
cabin_summary = cabin_summary_survived.append(cabin_summary_did_not_survive)

# Add a column for whether the count represents passengers that survived or not
cabin_summary['Survived'] = ['Yes', 'Yes', 'No', 'No']

# Add a column that states whether the passenger had a cabin
cabin_summary['Cabin'] = cabin_summary.index

# Reset the index to numbers
cabin_summary.reset_index(level = 0, inplace = True)

# Drop the additional index column
cabin_summary = cabin_summary.drop(['index'], axis = 1)

# Final Summary
cabin_summary
Out[27]:
Count Survived Cabin
0 206 Yes 0
1 136 Yes 1
2 481 No 0
3 68 No 1
In [28]:
sns.catplot(x="Cabin", y="Count", hue="Survived", data=cabin_summary,
                height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by based on whether that did or did not have a cabin")
plt.show()

Sex

In [29]:
#Get summary counts by sex for passengers that survived
sex_summary_survived = learning[learning['Survived'] == 1]['Sex'].dropna().value_counts().to_frame(name='Count')

#Get summary counts by sex for passengers that did not survive
sex_summary_did_not_survive = learning[learning['Survived'] == 0]['Sex'].dropna().value_counts().to_frame(name='Count')

# Append the two data frames into one
sex_summary = sex_summary_survived.append(sex_summary_did_not_survive)

# Add a column for whether the count represents passengers that survived or not
sex_summary['Survived'] = ['Yes', 'Yes', 'No', 'No']

# Add a column that states the sex of the passengers
sex_summary['Sex'] = sex_summary.index

# Reset the index to numbers
sex_summary.reset_index(level = 0, inplace = True)

# Drop the additional index column
sex_summary = sex_summary.drop(['index'], axis = 1)

# Final Summary
sex_summary
Out[29]:
Count Survived Sex
0 233 Yes female
1 109 Yes male
2 468 No male
3 81 No female
In [30]:
sns.catplot(x="Sex", y="Count", hue="Survived", data=sex_summary,
                height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by sex")
plt.show()

Class of Passenger's Ticket

In [31]:
#Get summary counts by passenger class for passengers that survived
pclass_summary_survived = learning[learning['Survived'] == 1]['Pclass'].dropna().value_counts().to_frame(name='Count')

#Get summary counts by passenger class for passengers that did not survive
pclass_summary_did_not_survive = learning[learning['Survived'] == 0]['Pclass'].dropna().value_counts().to_frame(name='Count')

# Append the two data frames into one
pclass_summary = pclass_summary_survived.append(pclass_summary_did_not_survive)

# Add a column for whether the count represents passengers that survived or not
pclass_summary['Survived'] = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']

# Add a column that states the passenger class
pclass_summary['Pclass'] = pclass_summary.index

# Reset the index to numbers
pclass_summary.reset_index(level = 0, inplace = True)

# Drop the additional index column
pclass_summary = pclass_summary.drop(['index'], axis = 1)

# Final Summary
pclass_summary
Out[31]:
Count Survived Pclass
0 136 Yes 1
1 119 Yes 3
2 87 Yes 2
3 372 No 3
4 97 No 2
5 80 No 1
In [32]:
sns.catplot(x="Pclass", y="Count", hue="Survived", data=pclass_summary,
                height=6, kind="bar", palette = ['#FF999A','#CCCCCC'])
plt.title("Counts of passengers that did and did not survive by class of ticket")
plt.show()

Data Insights:

  • Missing Values: There seem to be missing values in the Age, Cabin and Embarked variables. We can tell this because the count is less than 891. These will need to be filled in.
  • Data Anomaly: Repeated ticket numbers seem to imply that the passengers are part of the same family. This can be deduced from the common last names.
  • Bad Predictors: Based on intuition, in their current form, it is unlikely that PassengerId, Name and Ticket number would be strong predictors of the survival of a passenger.
  • Data Adjustment: The categorical variables 'Sex', 'Cabin' and 'Embarked' need to be encoded in order to be used with sklearn's decision tree algorithms.
    • Cabin makes more sense as a boolean variable that represents passengers who had cabins vs. those who did not.
  • Good Predictors: From the feature disparity plots, it appears that Sex, Cabin and Passenger Class are good predictors of a passenger's survival. In particular, the likelihood of survival was higher if the passenger was female, did not hold a class 3 ticket, or had a cabin.

Adjusting the Sex Variable

Like the Cabin variable, we need to encode the categorical variable Sex into a numerical one for it to be accepted by sklearn's decision tree algorithm.

In [33]:
le = LabelEncoder()
le.fit(['male','female'])
for data in full_data:
    data['Sex'] = le.transform(data[['Sex']])
learning.head(5)
Out[33]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 0 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1 C
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 0 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 1 S
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 0 S

Adjusting the Embarked Variable

Again, we encode the variable to work with sklearn. However, there are missing values in the 'Embarked' column, and we are going to replace them with 'S', the most common port of embarkation. This might not be the true port of embarkation, but it seems like the most sensible thing to do considering that 'S' has the highest frequency amongst the three ports.

In [34]:
le_embark = LabelEncoder()
le_embark.fit(['S','C','Q'])
for data in full_data:
    data[['Embarked']] = data[['Embarked']].fillna('S')
    data['Embarked'] = le_embark.transform(data[['Embarked']])
learning.head(5)
Out[34]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 0 2
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1 0
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 0 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 1 2
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 0 2

Adding missing values to age

There are several ways to deal with missing values in a dataset. You could drop the row/data point that has the missing value, or, if a feature has too many missing values, drop the entire column. While dropping rows and columns seems like an easy fix, there's an obvious downside to these strategies in that you lose viable data.
Imputing the missing values with a sensible replacement is often considered a better alternative to dropping data points and features. In particular, single and multiple imputation are the two strategies we could consider.
For the sake of this project, we will use single imputation with the mean.

In [35]:
imputer = Imputer()
learning['Age'] = imputer.fit_transform(learning[['Age']])
test['Age'] = imputer.fit_transform(test[['Age']])

Base Model

Let's create a base decision tree model to see what accuracy we get with the default set of features, so we understand our starting point.
This base model can then be improved using several techniques:

  • Hyperparameter Tuning with cross validation
  • Feature Selection
  • Feature Engineering
  • Regularization

Training the Base model

In [36]:
# Splitting the data into training and validations sets
X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
y = learning[['Survived']]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
In [37]:
# Training the Base model
base_decision_tree = tree.DecisionTreeClassifier()
base_decision_tree.fit(X_train, y_train)

# Training score
base_decision_tree.score(X_train, y_train)
Out[37]:
0.9845505617977528
In [38]:
# Validation Score
base_decision_tree.score(X_valid, y_valid)
Out[38]:
0.7877094972067039

Base model results

The base model has an accuracy score of 98.5% on the training set and a 78.77% accuracy on the validation set.

In [39]:
pd.DataFrame(data = base_decision_tree.feature_importances_,
             index = list(X_train.columns),
             columns = ['Importance']).sort_values(by = ['Importance'], axis = 0, ascending = False)
Out[39]:
Importance
Sex 0.305733
Fare 0.232230
Age 0.221471
Pclass 0.101862
SibSp 0.058011
Parch 0.032988
Cabin 0.027605
Embarked 0.020100

As postulated, we can see that Sex is one of the most important features when deciding on whether a passenger survived or not. While Age and Fare were not intuitively strong predictors - based on the feature disparity visualizations - the algorithm believes they are strong predictors.

Of the other variables in our consideration, passenger class and cabin seem to be moderately important to the algorithm.

It's important to note that these feature importances are produced using the current hyperparameters. Adjusting the hyperparameters may result in a different set of feature importances.

Visualizing the base model

In [40]:
with open("tree1.dot", 'w') as f:
     f = tree.export_graphviz(base_decision_tree,
                              out_file=f,
                              impurity = True,
                              feature_names = list(X_train.columns),
                              class_names = ['Died', 'Survived'],
                              rounded = True,
                              filled= True )

#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])
Out[40]:
0
In [41]:
img = Image.open("./tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")
Out[41]:

As we can see, this base model decision tree has a very large depth and could be considered to be over-fitting, given the large discrepancy between the training and validation accuracy. Let's try to rectify this and improve our model.

Feature Engineering

Considering the ubiquity of the Titanic dataset, there is extensive work done on it, and we are going to take inspiration from some of the work that's out there: Sina, Anisotropic, Diego_Milia and also Megan Risdal.

Family Size and IsAlone

  • Family_size: Given that we have information about the parents/children and siblings/spouse of each passenger, let's create a family size feature, which constitutes the total number of family members on-board the Titanic for each passenger, including themselves
  • IsAlone: A categorical variable that is set to 1 if Family_size == 1, and 0 otherwise

    In [42]:
    learning.head()
    
    Out[42]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
    0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 0 2
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1 0
    2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 0 2
    3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 1 2
    4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 0 2
    In [43]:
    for data in full_data:
        data['Family_size'] = data['SibSp'] + data['Parch'] + 1
        data['IsAlone'] = 0
        data.loc[data['Family_size'] == 1, 'IsAlone'] = 1
    
    In [44]:
    test.head(5)
    
    Out[44]:
    PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone
    0 892 3 Kelly, Mr. James 1 34.5 0 0 330911 7.8292 0 1 1 1
    1 893 3 Wilkes, Mrs. James (Ellen Needs) 0 47.0 1 0 363272 7.0000 0 2 2 0
    2 894 2 Myles, Mr. Thomas Francis 1 62.0 0 0 240276 9.6875 0 1 1 1
    3 895 3 Wirz, Mr. Albert 1 27.0 0 0 315154 8.6625 0 2 1 1
    4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 0 22.0 1 1 3101298 12.2875 0 2 3 0
    In [45]:
    sns.distplot(learning[learning['Survived'] == 1][['Family_size']].dropna(), bins =10,
                 kde = False, color = 'red', label = 'Survived')
    sns.distplot(learning[learning['Survived'] == 0][['Family_size']].dropna(), bins =10,
                 kde = False, color = 'grey', label = 'Did not Survive')
    plt.xlabel('Family Size')
    plt.ylabel('Frequency')
    plt.title('Number of passengers that survived based on the size of family on-board')
    plt.legend()
    plt.show()
    

    Let's drop the smaller family sizes and gain some more resolution on the larger family sizes.

    In [46]:
    sns.distplot(learning[(learning['Survived'] == 1) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
                 kde = False, color = 'red', label = 'Survived')
    sns.distplot(learning[(learning['Survived'] == 0) & (learning['Family_size'] >=3)][['Family_size']].dropna(), bins =8,
                 kde = False, color = 'grey', label = 'Did not Survive')
    plt.xlabel('Family Size')
    plt.ylabel('Frequency')
    plt.title('Number of passengers that survived based on the size of family on-board')
    plt.legend()
    plt.show()
    

    The above visualization shows that the larger a passenger's family, the less likely they were to survive.

    In [47]:
    summary_isalone = pd.DataFrame(learning.groupby(['Survived','IsAlone']).size(), columns= ['Count'])
    summary_isalone.reset_index(level=0, inplace=True)
    summary_isalone.reset_index(level=0, inplace=True)
    summary_isalone
    
    Out[47]:
    IsAlone Survived Count
    0 0 0 175
    1 1 0 374
    2 0 1 179
    3 1 1 163
    In [48]:
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
    my_pal = {1: "red", 0: "grey"}
    
    sns.catplot(ax = ax, data= summary_isalone, x="IsAlone", y="Count", hue="Survived",
                kind="bar", palette=my_pal, alpha = 0.5)
    
    plt.close(2)
    plt.show()
    

    Title

    As we can see from the Name column, there are titles such as Mr. and Mrs. embedded within the name. Let's extract this information and form a Title variable. Using a feature such as the title allows us to extract a little more information about each passenger. For instance, a Master or a Sir might have a higher chance of survival due to their elevated social status.

    Title: The salutation in the name of the passenger

    In [49]:
    re_res = re.search('([A-Za-z]+)\.', learning['Name'][0])
    
    In [50]:
    re_res.group(1)
    
    Out[50]:
    'Mr'
    In [51]:
    learning['Name'][0]
    
    Out[51]:
    'Braund, Mr. Owen Harris'
    In [52]:
    def extract_title(name):
        title_search = re.search('([A-Za-z]+)\.', name)
        # If the title exists, extract and return it.
        if title_search:
            return title_search.group(1)
        return ""
    
    In [53]:
    for data in full_data:
        data['Title']  = data['Name'].apply(extract_title)
    
    In [54]:
    test.head()
    
    Out[54]:
    PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone Title
    0 892 3 Kelly, Mr. James 1 34.5 0 0 330911 7.8292 0 1 1 1 Mr
    1 893 3 Wilkes, Mrs. James (Ellen Needs) 0 47.0 1 0 363272 7.0000 0 2 2 0 Mrs
    2 894 2 Myles, Mr. Thomas Francis 1 62.0 0 0 240276 9.6875 0 1 1 1 Mr
    3 895 3 Wirz, Mr. Albert 1 27.0 0 0 315154 8.6625 0 2 1 1 Mr
    4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 0 22.0 1 1 3101298 12.2875 0 2 3 0 Mrs
    In [55]:
    summary_title = pd.DataFrame(learning.groupby(['Survived','Title']).size(), columns= ['Count'])
    summary_title.reset_index(level=0, inplace=True)
    summary_title.reset_index(level=0, inplace=True)
    summary_title
    
    Out[55]:
    Title Survived Count
    0 Capt 0 1
    1 Col 0 1
    2 Don 0 1
    3 Dr 0 4
    4 Jonkheer 0 1
    5 Major 0 1
    6 Master 0 17
    7 Miss 0 55
    8 Mr 0 436
    9 Mrs 0 26
    10 Rev 0 6
    11 Col 1 1
    12 Countess 1 1
    13 Dr 1 3
    14 Lady 1 1
    15 Major 1 1
    16 Master 1 23
    17 Miss 1 127
    18 Mlle 1 2
    19 Mme 1 1
    20 Mr 1 81
    21 Mrs 1 99
    22 Ms 1 1
    23 Sir 1 1
    In [56]:
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
    my_pal = {1: "red", 0: "grey"}
    
    sns.catplot(ax = ax, data= summary_title, x="Title", y="Count", hue="Survived",
                kind="bar",palette=my_pal, alpha = 0.5)
    plt.close(2)
    plt.show()
    

    It appears that the Miss and Mrs Titles have a higher likelihood of survival, while the Mr Title has a lower likelihood of survival.

    Let's zoom into the graph a little more by dropping the most common titles.

    In [57]:
    summary_title_reduced = summary_title[(summary_title['Title'] != 'Master') &
                  (summary_title['Title'] != 'Miss') &
                  (summary_title['Title'] != 'Mr') &
                  (summary_title['Title'] != 'Mrs') ]
    
    In [58]:
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,6))
    sns.catplot(ax = ax, data= summary_title_reduced, x="Title", y="Count", hue="Survived",
                kind="bar", palette=my_pal, alpha = 0.5)
    plt.close(2)
    plt.show()
    

    Interestingly, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes title quite a valuable predictor.

    Finally, let's label encode our title variable

    In [59]:
    le = LabelEncoder()
    le.fit(np.unique(np.concatenate((test.Title.unique(), learning.Title.unique()))))
    for data in full_data:
        data['Title'] = le.transform(data[['Title']])
    learning.head(5)
    
    Out[59]:
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone Title
    0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 0 2 2 0 13
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1 0 2 0 14
    2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 0 2 1 1 10
    3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 1 2 2 0 14
    4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 0 2 1 1 13
    In [60]:
    test.head()
    
    Out[60]:
    PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone Title
    0 892 3 Kelly, Mr. James 1 34.5 0 0 330911 7.8292 0 1 1 1 13
    1 893 3 Wilkes, Mrs. James (Ellen Needs) 0 47.0 1 0 363272 7.0000 0 2 2 0 14
    2 894 2 Myles, Mr. Thomas Francis 1 62.0 0 0 240276 9.6875 0 1 1 1 13
    3 895 3 Wirz, Mr. Albert 1 27.0 0 0 315154 8.6625 0 2 1 1 13
    4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 0 22.0 1 1 3101298 12.2875 0 2 3 0 14

    Extended Feature Dictionary

    Family_size: The total number of family members on-board the Titanic for each passenger, including themselves, derived from the parents/children and siblings/spouse counts

    IsAlone: A categorical variable that is set to 1 if Family_size == 1, and 0 otherwise

    Title: The salutation in the name of the passenger, such as Mr, Ms, Mrs, Countess

    Feature Engineering Data Insights

    Let's summarize the insights we gained through our engineered features

    Family_size: Passengers with larger family sizes seemed less likely to survive

    IsAlone: Passengers who were alone on-board also seemed to have a higher mortality rate

    Title: Lastly, the title of the passenger was a strong indicator of their survival. The titles of Miss and Mrs had a higher likelihood of survival, while the Mr title had a lower likelihood of survival. Additionally, none of the passengers with a title of 'Rev' survived, while all of the passengers with a title of 'Countess', 'Lady', 'Mlle', 'Mme', 'Ms', and 'Sir' survived. This makes title quite a valuable predictor.

    Drop Sex: Since most of the information we have about sex is implied by the Title of the passenger, we can drop Sex to avoid redundancy in our features.

    In [61]:
    learning = learning.drop('Sex', axis = 1)
    test = test.drop('Sex', axis = 1)
    test.head()
    
    Out[61]:
    PassengerId Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone Title
    0 892 3 Kelly, Mr. James 34.5 0 0 330911 7.8292 0 1 1 1 13
    1 893 3 Wilkes, Mrs. James (Ellen Needs) 47.0 1 0 363272 7.0000 0 2 2 0 14
    2 894 2 Myles, Mr. Thomas Francis 62.0 0 0 240276 9.6875 0 1 1 1 13
    3 895 3 Wirz, Mr. Albert 27.0 0 0 315154 8.6625 0 2 1 1 13
    4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 22.0 1 1 3101298 12.2875 0 2 3 0 14

    Hyperparameter Optimization

    There are four strategies that can be used to optimize the hyperparameters of an algorithm, all of which require you to have some form of validation data handy.

    1. Validation set: We verify our choices of hyperparameters by observing their impact on the accuracy of our model on a validation/unseen set of data.
    2. k-fold Cross-Validation: The same concept as the validation set approach, except you now have "k" validation sets and "k" training sets. The idea is to repeat the training and validation steps "k" times, with each repetition seeing a different split of training and validation data.
    3. GridSearchCV: You provide a set of values for every hyperparameter you wish to tune, and every combination of those hyperparameters is fed to the model. GridSearchCV returns the combination of hyperparameters that yields the highest accuracy.
    4. RandomizedSearchCV: Unlike GridSearchCV, RandomizedSearchCV randomly selects hyperparameter values and returns the best set of hyperparameters after a fixed number of tries.

    Validation Set

    In [62]:
    X = learning.drop(['Survived','Name', 'PassengerId', 'Ticket'], axis = 1)
    y = learning[['Survived']]
    
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)
    
    In [63]:
    test_depths = range(1,10)
    test_depths
    
    Out[63]:
    range(1, 10)
    In [64]:
    training_scores = []
    validation_scores = []
    
    for depth in test_depths:
        updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
        updated_decision_tree.fit(X_train, y_train)
        training_scores.append(updated_decision_tree.score(X_train, y_train))
        validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    
    
    print(validation_scores)
    
    [0.6368715083798883, 0.7821229050279329, 0.7988826815642458, 0.7877094972067039, 0.8156424581005587, 0.8156424581005587, 0.8324022346368715, 0.8547486033519553, 0.8324022346368715]
    
    In [65]:
    plt.plot(test_depths,training_scores, color = 'blue')
    plt.plot(test_depths,validation_scores, color = 'green')
    plt.title("Accuracy Scores vs max_depth")
    plt.xlabel("Max Depth of Tree")
    plt.ylabel("Accuracy Scores")
    plt.legend(labels = ['training','validation'])
    plt.show()
    
    In [66]:
    random_min_sample_splits = np.random.randint(low=2, high=60, size=30)
    random_min_sample_splits.sort()
    random_min_sample_splits = np.unique(random_min_sample_splits)
    
    In [67]:
    training_scores = []
    validation_scores = []
    
    for split_size in random_min_sample_splits:
        updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size, class_weight='balanced',
                                                            random_state=42)
        updated_decision_tree.fit(X_train, y_train)
        training_scores.append(updated_decision_tree.score(X_train, y_train))
        validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    
    In [68]:
    plt.plot(random_min_sample_splits,training_scores, color = 'blue')
    plt.plot(random_min_sample_splits,validation_scores, color = 'green')
    plt.title("Accuracy Scores vs min_samples_split")
    plt.xlabel("Min. Sample Split at node")
    plt.ylabel("Accuracy Scores")
    plt.legend(labels = ['training','validation'])
    plt.show()
    

    Cross Validation

    In [69]:
    training_scores = []
    validation_scores = []
    avg_cross_val_scores = []
    
    for depth in test_depths:
        updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, class_weight='balanced', random_state=42)
        updated_decision_tree.fit(X_train, y_train)
        training_scores.append(updated_decision_tree.score(X_train, y_train))
        validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    
        # Computing additional Cross validation score.
        avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X,  y, cv=10)))
    
    In [70]:
    plt.plot(test_depths,training_scores, color = 'blue')
    plt.plot(test_depths,validation_scores, color = 'green')
    plt.plot(test_depths,avg_cross_val_scores, color = 'red')
    plt.title("Accuracy Scores vs max_depth")
    plt.xlabel("Max Depth of Tree")
    plt.ylabel("Accuracy Scores")
    plt.legend(labels = ['training','validation', 'avg_cross_validation'])
    plt.show()
    
    In [71]:
    training_scores = []
    validation_scores = []
    avg_cross_val_scores = []
    
    for split_size in random_min_sample_splits:
        updated_decision_tree = tree.DecisionTreeClassifier(min_samples_split = split_size,
                                                            class_weight='balanced', random_state=42)
        updated_decision_tree.fit(X_train, y_train)
        training_scores.append(updated_decision_tree.score(X_train, y_train))
        validation_scores.append(updated_decision_tree.score(X_valid, y_valid))
    
        # Computing additional Cross validation score.
        avg_cross_val_scores.append(np.mean(cross_val_score(updated_decision_tree, X,  y, cv=10)))
    
    In [72]:
    plt.plot(random_min_sample_splits,training_scores, color = 'blue')
    plt.plot(random_min_sample_splits,validation_scores, color = 'green')
    plt.plot(random_min_sample_splits,avg_cross_val_scores, color = 'red')
    plt.title("Accuracy Scores vs min_samples_split")
    plt.xlabel("Min. Sample Split at node")
    plt.ylabel("Accuracy Scores")
    plt.legend(labels = ['training','validation', 'avg_cross_validation'])
    plt.show()
    

    In general, the avg_cross_validation accuracy score serves as a stronger indicator of the model's accuracy on unseen data given that it has been tested on more than a single validation set. However, in both the cases above, we have tried to optimize a single hyperparameter at a time. Let's try to optimize a couple of them simultaneously.

    In [73]:
    keys = ['max_depth', 'min_samples_split', 'training', 'validation', 'avg_cross_validaion']
    test_dict = {key: [] for key in keys}
    test_dict
    
    Out[73]:
    {'max_depth': [],
     'min_samples_split': [],
     'training': [],
     'validation': [],
     'avg_cross_validaion': []}
    In [74]:
    for depth in test_depths:
        for split_size in random_min_sample_splits:
            updated_decision_tree = tree.DecisionTreeClassifier(max_depth = depth, min_samples_split = split_size,
                                                                 class_weight='balanced', random_state=42)
            updated_decision_tree.fit(X_train, y_train)
            test_dict['max_depth'].append(depth)
            test_dict['min_samples_split'].append(split_size)
            test_dict['training'].append(updated_decision_tree.score(X_train, y_train))
            test_dict['validation'].append(updated_decision_tree.score(X_valid, y_valid))
            test_dict['avg_cross_validaion'].append(np.mean(cross_val_score(updated_decision_tree, X,  y, cv=10)))
    
    scores_pd = pd.DataFrame(test_dict)
    scores_pd.head()
    
    Out[74]:
    max_depth min_samples_split training validation avg_cross_validaion
    0 1 2 0.707865 0.636872 0.675671
    1 1 6 0.707865 0.636872 0.675671
    2 1 8 0.707865 0.636872 0.675671
    3 1 9 0.707865 0.636872 0.675671
    4 1 12 0.707865 0.636872 0.675671
    In [75]:
    g = sns.FacetGrid(scores_pd, col='max_depth', col_wrap=3)
    g.map(sns.lineplot, "min_samples_split", "training", alpha=.7 , color = 'blue')
    g.map(sns.lineplot, "min_samples_split", "validation", alpha=.7 , color = 'green')
    g.map(sns.lineplot, "min_samples_split", "avg_cross_validaion", alpha=.7 , color = 'red')
    plt.legend(labels = ['training','validation', 'avg_cross_validation'])
    plt.show()
    

    It seems like a max_depth of 5 with a small min_samples_split works best, both in terms of accuracy and in terms of the bias-variance trade-off.

    In [76]:
    scores_pd[(scores_pd['max_depth'] == 5) & (scores_pd['min_samples_split'] == 6)]
    
    Out[76]:
    max_depth min_samples_split training validation avg_cross_validaion
    89 5 6 0.835674 0.815642 0.823879

    In essence, this was the manual process of doing grid search cross validation. As explained above, with GridSearchCV you give the algorithm a set of values for each hyperparameter. The algorithm then takes the Cartesian product of these sets, tries every combination of the hyperparameters, and returns the combination that yields the highest accuracy.

    GridSearchCV

    In [77]:
    parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
                  'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
    
    gridCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
    clf = GridSearchCV(gridCV_decision_tree, parameters, cv=10)
    clf.fit(X, y)
    
    Out[77]:
    GridSearchCV(cv=10, error_score='raise',
           estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini',
                max_depth=None, max_features=None, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=42,
                splitter='best'),
           fit_params=None, iid=True, n_jobs=1,
           param_grid={'max_depth': (1, 2, 3, 4, 5, 6, 7, 8, 9), 'min_samples_split': [6, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 26, 27, 33, 38, 43, 44, 45, 48, 49, 50, 51, 56, 59]},
           pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
           scoring=None, verbose=0)
    In [78]:
    clf.best_params_
    
    Out[78]:
    {'max_depth': 5, 'min_samples_split': 6}
    In [79]:
    clf.best_score_
    
    Out[79]:
    0.8237934904601572

    As can be seen, given the same set of values for the hyperparameters, GridSearchCV selected a max_depth of 5 and a min_samples_split of 6, just as we found with our manual k-fold cross validation over the two hyperparameters.

    RandomizedSearchCV

    Like GridSearchCV, RandomizedSearchCV also takes a dictionary of hyperparameter values to search over. However, unlike the exhaustive search of GridSearchCV, which checks every combination of hyperparameter values, RandomizedSearchCV randomly selects combinations to try. The number of combinations tried can be set using the n_iter parameter when initializing the RandomizedSearchCV object.

    In [80]:
    parameters = {'max_depth':(1,2,3,4,5,6,7,8,9),
                  'min_samples_split':[6,10,11,13,14,15,17,18,19,20,21,26,27,33,38,43,44,45,48,49,50,51,56,59]}
    
    
    randomCV_decision_tree = tree.DecisionTreeClassifier(class_weight='balanced', random_state=42)
    random_search = RandomizedSearchCV(randomCV_decision_tree, param_distributions=parameters,
                                       n_iter=20, cv=10, iid=False, random_state = 42)
    
    random_search.fit(X, y)
    
    Out[80]:
    RandomizedSearchCV(cv=10, error_score='raise',
              estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini',
                max_depth=None, max_features=None, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=42,
                splitter='best'),
              fit_params=None, iid=False, n_iter=20, n_jobs=1,
              param_distributions={'max_depth': (1, 2, 3, 4, 5, 6, 7, 8, 9), 'min_samples_split': [6, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 26, 27, 33, 38, 43, 44, 45, 48, 49, 50, 51, 56, 59]},
              pre_dispatch='2*n_jobs', random_state=42, refit=True,
              return_train_score='warn', scoring=None, verbose=0)
    In [81]:
    random_search.best_params_
    
    Out[81]:
    {'min_samples_split': 49, 'max_depth': 3}
    In [82]:
    random_search.best_score_
    
    Out[82]:
    0.8080725797298831

    Final Model

    In [83]:
    g_sub = pd.read_csv('./data/Titanic/gender_submission.csv')
    test_w_response = pd.merge(test, g_sub, on='PassengerId', how='inner')
    test_w_response.head()
    
    Out[83]:
    PassengerId Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Family_size IsAlone Title Survived
    0 892 3 Kelly, Mr. James 34.5 0 0 330911 7.8292 0 1 1 1 13 0
    1 893 3 Wilkes, Mrs. James (Ellen Needs) 47.0 1 0 363272 7.0000 0 2 2 0 14 1
    2 894 2 Myles, Mr. Thomas Francis 62.0 0 0 240276 9.6875 0 1 1 1 13 0
    3 895 3 Wirz, Mr. Albert 27.0 0 0 315154 8.6625 0 2 1 1 13 0
    4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 22.0 1 1 3101298 12.2875 0 2 3 0 14 1
    In [84]:
    test_x = test_w_response.drop(['PassengerId', 'Name', 'Survived', 'Ticket'], axis = 1)
    test_y = test_w_response[['Survived']]
    
    In [85]:
    # Adjust the one missing Fare with the mean
    test_x.loc[test_x['Fare'].isna(), 'Fare'] = test_x['Fare'].mean()
    
    In [86]:
    final_model = tree.DecisionTreeClassifier(max_depth = 5, min_samples_split = 6,
                                              class_weight='balanced', random_state=42)
    final_model.fit(X,y)
    
    test_predict_y = final_model.predict(test_x)
    
    In [87]:
    accuracy_score(test_y, test_predict_y)
    
    Out[87]:
    0.8899521531100478
    In [88]:
    with open("tree_final.dot", 'w') as f:
         f = tree.export_graphviz(final_model,
                                  out_file=f,
                                  max_depth = 5,
                                  impurity = True,
                                  feature_names = list(X.columns),
                                  class_names = ['Died', 'Survived'],
                                  rounded = True,
                                  filled= True )
    
    #Convert .dot to .png to allow display in web notebook
    check_call(['dot','-Tpng','tree_final.dot','-o','tree_final.png'])
    
    Out[88]:
    0
    In [89]:
    img_final = Image.open("./tree_final.png")
    draw = ImageDraw.Draw(img_final)
    img_final.save('final-out.png')
    PImage("final-out.png")
    
    Out[89]:

    Conclusion
