Import packages first. I will only use a decision tree in this post, so import tree from sklearn.
import pandas as pd
import numpy as np
from sklearn import tree
(I worked on this in PyCharm, which did not display all of the columns. The following line makes the console show every column.)
pd.set_option('display.max_columns', None)
Read the train/test CSV files.
train_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_path)
test_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_path)
Take a look at the train and test datasets.
The train set has a 'Survived' column, which is not included in the test set. Also, the train set is missing 177 Age values, and the test set is missing 86 Age values and 1 Fare value. So, I should fill in the missing data first (data cleaning).
print('-'*15 + 'Train Set' + '-'*15 + '\n', train.describe())
print('-'*15 + 'Test Set' + '-'*15 + '\n', test.describe())
---------------Train Set---------------
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
---------------Test Set---------------
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
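describe() only hints at the gaps through the count row. As a quick check (my addition, not part of the original flow), isnull().sum() counts the missing values per column directly:
# Count NaN values per column; the Age/Fare gaps match the describe() output above
print(train.isnull().sum())
print(test.isnull().sum())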
Data Cleaning.
For now, I replace the missing data with the median value of each column.
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
Map strings to integers for convenience when manipulating and searching the data.
# The former way I changed the data; the mapping dict below is clearer.
#train["Sex"][train["Sex"] == "male"] = 0
#train["Sex"][train["Sex"] == "female"] = 1
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)
train['Embarked'] = train['Embarked'].fillna("S")  # "S" is the most common port
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
train["Embarked"] = train["Embarked"].map(embarked_mapping)
Extract the 'Survived' values from the train set as the target, and select 'Pclass', 'Sex', 'Age', and 'Fare' as features.
This means I assume ticket class ('Pclass'), sex, age, and fare are related to survival. Then create a decision tree classifier from tree and fit it on the features and target.
target = train['Survived'].values
train_features = train[['Pclass', 'Sex', 'Age', 'Fare']].values
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(train_features, target)
feature_importances_ shows how much each feature contributes to predicting the target. Here, 'Sex' and 'Fare' seem more important than ticket class and age.
score returns the mean accuracy on the given features and target.
print(decision_tree.feature_importances_)
print(decision_tree.score(train_features, target))
[0.12968841 0.31274009 0.23646917 0.32110233]
0.9775533108866442
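The raw array is hard to read on its own; a minimal sketch (my addition) pairs each importance with its feature name:
# Print each feature next to its importance, in the same order as train_features
for name, importance in zip(['Pclass', 'Sex', 'Age', 'Fare'], decision_tree.feature_importances_):
    print(name, importance)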
Apply the selected features to the test data and predict. Here, the prediction is 'Survived' (1) or not (0).
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
prediction = decision_tree.predict(test_features)
print(prediction)
[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0
1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0
0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1
0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
0 1 1 1 1 0 0 1 0 0 0]
Then make a CSV file for submitting on Kaggle and check the result.
# Report (Make csv file)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
decision_tree_solution = pd.DataFrame(prediction, PassengerId, columns = ["Survived"])
# Write a solution to a csv file with the name my_solution.csv
decision_tree_solution.to_csv("decision_tree_solution.csv", index_label = ["PassengerId"])
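Kaggle expects one prediction per test passenger, so a quick sanity check (my addition) before uploading:
# The submission should have 418 rows (one per test passenger) and one column
print(decision_tree_solution.shape)  # (418, 1)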
Here is my first result. The accuracy is about 0.722.
So, there are limitations:
- Data Cleaning: I replaced missing data with the column medians, which is not ideal. There are several ways to improve this.
- Decision Tree: decision trees suffer from over-fitting, and sklearn provides parameters to control this (see the sketch after this list). An over-fitted model fits the training set almost perfectly but loses accuracy on the test set.
- Model: DataCamp also teaches Random Forest in the same course. I can also look at the diverse models other people use on this problem and which algorithms they apply.
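For the over-fitting point, here is a minimal sketch of how the tree could be constrained; the specific parameter values are illustrative assumptions, not tuned choices:
# Limit tree depth and require more samples per split to reduce over-fitting
pruned_tree = tree.DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=1)
pruned_tree = pruned_tree.fit(train_features, target)
# Training accuracy should drop below the 0.98 above, but the tree should generalize better
print(pruned_tree.score(train_features, target))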