Julia on Titanic
This is an introduction to data analysis and decision trees using Julia. In this tutorial we will explore how to tackle Kaggle's Titanic competition using Julia and machine learning. The tutorial is adapted from the Kaggle R tutorial on Machine Learning on DataCamp. In case you're new to Julia, you can read more about its awesomeness on julialang.org.
Again, the point of this tutorial is not to teach machine learning but to provide a starting point to get your hands dirty with Julia code.
The benchmark numbers on the Julia website look pretty impressive. So get ready to embrace Julia with a warm hug!
Let's get started. We will mostly be using three packages from the Julia ecosystem: DataFrames, Gadfly, and DecisionTree.
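None of these ship with Julia itself; if you don't have them installed, something like the following should do it on the Julia version used here (Pkg.add was the built-in package manager API at the time):
Pkg.add("DataFrames")
Pkg.add("Gadfly")
Pkg.add("DecisionTree")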
We start by loading the dataset for the Titanic competition from Kaggle. We will use readtable for that and inspect the first few rows with head() on the loaded DataFrame. We will use the already-split train and test sets from DataCamp:
- train set: http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv
- test set: http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv

I have them downloaded in the data directory.
using DataFrames
train = readtable("data/train.csv")
head(train)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S |
2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S |
4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NA | S |
6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q |
test = readtable("data/test.csv")
head(test)
 | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NA | Q |
2 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NA | S |
3 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NA | Q |
4 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NA | S |
5 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NA | S |
6 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 | NA | S |
size(train, 1)
891
Let's take a closer look at our datasets. describe() helps us summarize the entire dataset.
describe(train)
PassengerId
Min 1.0
1st Qu. 223.5
Median 446.0
Mean 446.0
3rd Qu. 668.5
Max 891.0
NAs 0
NA% 0.0%
Survived
Min 0.0
1st Qu. 0.0
Median 0.0
Mean 0.3838383838383838
3rd Qu. 1.0
Max 1.0
NAs 0
NA% 0.0%
Pclass
Min 1.0
1st Qu. 2.0
Median 3.0
Mean 2.308641975308642
3rd Qu. 3.0
Max 3.0
NAs 0
NA% 0.0%
Name
Length 891
Type UTF8String
NAs 0
NA% 0.0%
Unique 891
Sex
Length 891
Type UTF8String
NAs 0
NA% 0.0%
Unique 2
Age
Min 0.42
1st Qu. 20.125
Median 28.0
Mean 29.69911764705882
3rd Qu. 38.0
Max 80.0
NAs 177
NA% 19.87%
SibSp
Min 0.0
1st Qu. 0.0
Median 0.0
Mean 0.5230078563411896
3rd Qu. 1.0
Max 8.0
NAs 0
NA% 0.0%
Parch
Min 0.0
1st Qu. 0.0
Median 0.0
Mean 0.38159371492704824
3rd Qu. 0.0
Max 6.0
NAs 0
NA% 0.0%
Ticket
Length 891
Type UTF8String
NAs 0
NA% 0.0%
Unique 681
Fare
Min 0.0
1st Qu. 7.9104
Median 14.4542
Mean 32.20420796857464
3rd Qu. 31.0
Max 512.3292
NAs 0
NA% 0.0%
Cabin
Length 891
Type UTF8String
NAs 687
NA% 77.1%
Unique 148
Embarked
Length 891
Type UTF8String
NAs 2
NA% 0.22%
Unique 4
If you just want to check the datatypes of the columns, you can use eltypes().
eltypes(train)
12-element Array{Type{T<:Top},1}:
Int64
Int64
Int64
UTF8String
UTF8String
Float64
Int64
Int64
UTF8String
Float64
UTF8String
UTF8String
Data Analysis
Let's get our hands real dirty now. We want to check how many people survived the disaster. It's good to know what the distribution over the classification classes looks like.
counts() will give you a frequency table, but it does not tell you which value each count corresponds to. It is the equivalent of table() in R.
counts(train[:Survived])
2-element Array{Int64,1}:
549
342
countmap() to the rescue! countmap() gives you a dictionary of value => frequency.
countmap(train[:Survived])
Dict{Union(NAtype,Int64),Int64} with 2 entries:
0 => 549
1 => 342
If you want proportions like prop.table() in R, you can use proportions() or proportionmap().
proportions(train[:Survived])
2-element Array{Float64,1}:
0.616162
0.383838
proportionmap(train[:Survived])
Dict{Union(NAtype,Int64),Float64} with 2 entries:
0 => 0.6161616161616161
1 => 0.3838383838383838
counts() does not work for categorical variables, so you can use countmap() there.
countmap(train[:Sex])
Dict{Union(NAtype,UTF8String),Int64} with 2 entries:
"male" => 577
"female" => 314
Now that we know the split between the male and female population on the Titanic and the proportions that survived, let's check whether the sex of a person had an impact on the chances of surviving.
We will examine this with a stacked histogram of Sex vs Survived using the Gadfly package.
using Gadfly
plot(train, x="Sex", y="Survived", color="Survived", Geom.histogram(position=:stack), Scale.color_discrete_manual("red","green"))
It's pretty clear that more females survived than males.
So let's predict Survived=1 if Sex=female, else Survived=0. With that you have your first predictive model ready!
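If you want to see how far that rule alone gets you, here is a minimal sketch evaluated on the training set (gender_pred is just an illustrative name, and the exact DataArray behavior may vary with your DataFrames version):
# gender-only rule: predict survival for females, non-survival for males
gender_pred = [s == "female" ? 1 : 0 for s in train[:Sex]]
mean(gender_pred .== train[:Survived]) # roughly 0.787 on the training set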
But let's not stop there; we have a lot more dimensions to the data. Let's see if we can make use of those to enhance the model. We have Age as one of the columns, from which we can create another dimension in the data - a variable indicating whether the person was a child or not.
train[:Child] = 1
train[isna(train[:Age]), :Child] = 1 # this can be avoided; it's just an explicit way to demonstrate the rule
train[train[:Age] .< 18, :Child] = 1 # same applies here, as all Child fields are 1 by default except when Age >= 18
train[train[:Age] .>= 18, :Child] = 0
head(train)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Child |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S | 0 |
2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S | 0 |
4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 |
5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NA | S | 0 |
6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q | 1 |
Now that we have our Child indicator variable, let’s try to plot the survival rate of children on Titanic.
plot(train, x="Child", y="Survived", color="Survived", Geom.histogram(position=:stack), Scale.color_discrete_manual("red","green"))
We see that the number of children who survived is lower than the number who did not.
Let's try to combine the gender and child variables to see how that impacts the survival rate. We go back to our friendly bug Gadfly to construct a stacked histogram with subplots.
plot(train, xgroup="Child", x="Sex", y="Survived", color="Survived", Geom.subplot_grid(Geom.histogram(position=:stack)), Scale.color_discrete_manual("red","green"))
This kind of visual analysis is a powerful tool for initial data exploration before diving into deeper machine learning algorithms for prediction.
We clearly see that age and gender influence survival. But we cannot enumerate every combination to build an if-else model for prediction. Enter decision trees. Decision trees do exactly what we did with gender, but at a larger scale, where all dimensions (or features) are involved in forming the decision tree.
In our case the decision we need to make is Survived or Not.
Before doing that we need to handle the missing data in our feature variables. There are multiple ways to handle missing data and we won't go deeper into them. We will use a simple technique of filling each missing value with:
- the mean value, if the variable is numeric.
- the most frequent value, if the variable is categorical.
From describe() above we have seen that Embarked and Age have some missing values (NA).
countmap(train[:Embarked])
Dict{Union(NAtype,UTF8String),Int64} with 4 entries:
"Q" => 77
"S" => 644
"C" => 168
NA => 2
train[isna(train[:Embarked]), :Embarked] = "S"
"S"
meanAge = mean(train[!isna(train[:Age]), :Age])
29.69911764705882
train[isna(train[:Age]), :Age] = meanAge
29.69911764705882
Now that we have knocked the NAs out of our way, let's roll up our sleeves and grow a decision tree!
We will be using the DecisionTree Julia package.
Features form the predictors and labels are the response variables for the decision tree. We will start by building the arrays for features and labels.
using DecisionTree
train[:Age] = float64(train[:Age])
train[:Fare] = float64(train[:Fare])
features = array(train[:,[:Pclass, :Sex, :Age, :SibSp, :Parch, :Fare, :Embarked, :Child]])
labels=array(train[:Survived])
891-element Array{Int64,1}:
0
1
1
1
0
0
0
0
1
1
1
1
0
⋮
1
1
0
0
0
0
0
0
1
0
1
0
features
891x8 Array{Any,2}:
3 "male" 22.0 1 0 7.25 "S" 0
1 "female" 38.0 1 0 71.2833 "C" 0
3 "female" 26.0 0 0 7.925 "S" 0
1 "female" 35.0 1 0 53.1 "S" 0
3 "male" 35.0 0 0 8.05 "S" 0
3 "male" 29.6991 0 0 8.4583 "Q" 1
1 "male" 54.0 0 0 51.8625 "S" 0
3 "male" 2.0 3 1 21.075 "S" 1
3 "female" 27.0 0 2 11.1333 "S" 0
2 "female" 14.0 1 0 30.0708 "C" 1
3 "female" 4.0 1 1 16.7 "S" 1
1 "female" 58.0 0 0 26.55 "S" 0
3 "male" 20.0 0 0 8.05 "S" 0
⋮ ⋮
1 "female" 56.0 0 1 83.1583 "C" 0
2 "female" 25.0 0 1 26.0 "S" 0
3 "male" 33.0 0 0 7.8958 "S" 0
3 "female" 22.0 0 0 10.5167 "S" 0
2 "male" 28.0 0 0 10.5 "S" 0
3 "male" 25.0 0 0 7.05 "S" 0
3 "female" 39.0 0 5 29.125 "Q" 0
2 "male" 27.0 0 0 13.0 "S" 0
1 "female" 19.0 0 0 30.0 "S" 0
3 "female" 29.6991 1 2 23.45 "S" 1
1 "male" 26.0 0 0 30.0 "C" 0
3 "male" 32.0 0 0 7.75 "Q" 0
Stumps
A stump is the simplest decision tree, with only one split and two classification nodes (leaves). It tries to generate the single split with the most predictive power. This is done by splitting the dataset into two subsets and returning the split with the highest information gain. We will use the build_stump API from DecisionTree and feed it the labels and features generated above.
stump = build_stump(labels, features)
print_tree(stump)
Feature 2, Threshold male
L-> 1 : 233/314
R-> 0 : 468/577
The stump picks feature 2, Sex, to make the split, with the threshold being "male".
The interpretation: all male data points go to the right leaf and all female data points go to the left leaf.
Hence the 314 on L and 577 on R - the number of female and male data points (respectively) in the training set.
L-> 1 implies Survived=1 is the prediction on the left branch, and R-> 0 implies Survived=0 is the prediction on the right branch.
This creates a basic one-rule tree whose deciding factor is the person's gender.
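To double-check those leaf counts, we can tally survivors and totals per sex directly; a rough sketch using by() from DataFrames (exact syntax varies between DataFrames versions):
# survivors and totals per sex, to verify the 233/314 and 468/577 counts above
by(train, :Sex, df -> DataFrame(Survivors = sum(df[:Survived]), Total = size(df, 1)))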
Let’s verify the split by predicting on the training data.
predictions = apply_tree(stump, features)
confusion_matrix(labels, predictions)
2x2 Array{Int64,2}:
468 81
109 233
Classes: {0,1}
Matrix:
Accuracy: 0.7867564534231201
Kappa: 0.5421129503407983
Since the males who survived and the females who did not survive are classified wrongly, we get a misclassification count of 81 + 109 = 190. This gives us an accuracy of 1 - 190/891 ≈ 0.7868.
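We can redo that arithmetic directly from the confusion matrix above (cm is just a local name for this check):
cm = [468 81; 109 233] # the confusion matrix printed above
1 - (cm[1, 2] + cm[2, 1]) / sum(cm) # the off-diagonal entries are the misclassifications; ≈ 0.7868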
Decision Trees
Now that we have our 1-level decision stump, let's bring it all together and build our first multi-split decision tree.
We can use build_tree(), which recursively splits the dataset until we get pure leaves that contain samples of only one class.
model = build_tree(labels, features)
print_tree(model, 5)
length(model)
Feature 2, Threshold male
L-> Feature 1, Threshold 3
L-> Feature 6, Threshold 29.0
L-> Feature 6, Threshold 28.7125
L-> Feature 3, Threshold 24.0
L-> 1 : 15/15
R->
R-> 0 : 1/1
R-> Feature 3, Threshold 3.0
L-> 0 : 1/1
R-> Feature 5, Threshold 2
L-> 1 : 84/84
R->
R-> Feature 6, Threshold 23.45
L-> Feature 8, Threshold 1
L-> Feature 3, Threshold 37.0
L->
R->
R-> Feature 6, Threshold 15.5
L->
R->
R-> Feature 5, Threshold 1
L-> 1 : 1/1
R-> Feature 6, Threshold 31.3875
L-> 0 : 15/15
R->
R-> Feature 6, Threshold 26.2875
L-> Feature 3, Threshold 15.0
L-> Feature 4, Threshold 3
L-> Feature 3, Threshold 11.0
L-> 1 : 12/12
R->
R-> 0 : 1/1
R-> Feature 7, Threshold Q
L-> Feature 6, Threshold 15.2458
L->
R->
R-> Feature 6, Threshold 13.5
L->
R->
R-> Feature 4, Threshold 3
L-> Feature 3, Threshold 16.0
L-> 1 : 7/7
R-> Feature 6, Threshold 26.55
L-> 1 : 4/4
R->
R-> Feature 3, Threshold 4.0
L-> Feature 3, Threshold 3.0
L-> 0 : 4/4
R-> 1 : 1/1
R-> 0 : 18/18
200
We get a large tree, so we use the depth parameter of print_tree() to pretty-print it to a depth of 5.
We can count the number of leaves of our tree using length() on the tree model. The length of our tree is 200.
Although we have a nice-looking decision tree ready, we have a couple of problems:
- our classifier is complex because of the tree size.
- our model overfits the training data in an attempt to improve accuracy on each bucket. Notice the many leaves above that match only a single sample (1/1).
Solution - Tree Pruning
From wikipedia/Pruning_decision_tree
“Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.”
# prune tree: merge leaves having >= 70% combined purity (default: 100%)
model = prune_tree(model, 0.7)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
length(model)
Feature 2, Threshold male
L-> Feature 1, Threshold 3
L-> Feature 6, Threshold 29.0
L-> Feature 6, Threshold 28.7125
L-> Feature 3, Threshold 24.0
L-> 1 : 15/15
R->
R-> 0 : 1/1
R-> 1 : 98/100
R-> Feature 6, Threshold 23.45
L-> Feature 8, Threshold 1
L-> Feature 3, Threshold 37.0
L->
R-> 0 : 6/7
R-> Feature 6, Threshold 15.5
L->
R-> 1 : 14/16
R-> 0 : 24/27
R-> Feature 6, Threshold 26.2875
L-> Feature 3, Threshold 15.0
L-> Feature 4, Threshold 3
L-> Feature 3, Threshold 11.0
L-> 1 : 12/12
R->
R-> 0 : 1/1
R-> Feature 7, Threshold Q
L-> Feature 6, Threshold 15.2458
L->
R-> 1 : 4/6
R-> Feature 6, Threshold 13.5
L->
R-> 0 : 62/64
R-> Feature 4, Threshold 3
L-> Feature 3, Threshold 16.0
L-> 1 : 7/7
R-> Feature 6, Threshold 26.55
L-> 1 : 4/4
R->
R-> 0 : 22/23
125
Pruning is a bottom-up process that repeatedly merges leaves into a single leaf.
The merge criterion is based on the pruning parameter (0.7 in the example above), which is a purity threshold: if the combined purity of two adjacent leaves is greater than the threshold, the two leaves are merged into one.
The length of our pruned tree is reduced from 200 to 125.
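For example, pruning the same tree with a lower purity threshold should merge even more leaves (a sketch; the resulting leaf count is not shown in the original run and depends on the tree that was grown):
# prune more aggressively: merge leaves having >= 50% combined purity
smaller_model = prune_tree(model, 0.5)
length(smaller_model)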
We can further reduce the number of leaves by lowering the threshold. But how do you know what a good threshold is?
There are multiple techniques to determine that; one of them is a grid search, where you try several thresholds and plot the resulting accuracies against the threshold values to compare them.
One way to measure that accuracy is n-fold cross validation.
tl;dr - Cross validation splits the dataset into train and test sets for measuring accuracy.
N-fold CV does this by randomly partitioning the training set into N equal subsets; each subset is used once as a test set while the others are used for training. This gives us N prediction performance numbers.
# run 5-fold cross validation for the pruned tree at each purity threshold
purities = linspace(0.1, 1.0, 5)
accuracies = zeros(length(purities))
for i in 1:length(purities)
accuracies[i] = mean(nfoldCV_tree(labels, features, purities[i], 5))
end
plot(x=purities, y=accuracies, Geom.point, Geom.line, Guide.xlabel("Threshold"), Guide.ylabel("Accuracy"), Guide.title("Pruning Comparison"))
(Per-fold confusion matrices omitted; the five runs below correspond, in order, to the purity thresholds linspace(0.1, 1.0, 5) = 0.1, 0.325, 0.55, 0.775, 1.0.)

Purity | Fold accuracies | Mean accuracy
---|---|---
0.1 | 0.6236, 0.6629, 0.6011, 0.6236, 0.5674 | 0.6157
0.325 | 0.6180, 0.6236, 0.6404, 0.6011, 0.6011 | 0.6169
0.55 | 0.8090, 0.7584, 0.7921, 0.7865, 0.7921 | 0.7876
0.775 | 0.7360, 0.7640, 0.7865, 0.8539, 0.7360 | 0.7753
1.0 | 0.7753, 0.7697, 0.8371, 0.7753, 0.7079 | 0.7730
It seems we get the highest accuracy at a threshold just above 0.5.
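Based on that, a reasonable final model is the tree pruned at roughly that threshold; a sketch (final_model is just an illustrative name, and 0.55 is the grid point from linspace(0.1, 1.0, 5) that scored best above):
# rebuild the tree and prune at the best-performing threshold from the grid search
final_model = prune_tree(build_tree(labels, features), 0.55)
length(final_model)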
Conclusion
We saw how easy it is to get your hands dirty in Julia waters. Decision trees are pretty effective; they can be slow to train on large datasets, but once trained the predictions are fast - just a walk down the tree.