Import packages first. I will only use a decision tree in this post, so import tree from sklearn.
import pandas as pd
import numpy as np
from sklearn import tree
(I worked on this in PyCharm, which did not display all of the columns. The following line makes the console show every column.)
pd.set_option('display.max_columns', None)
Read the train/test CSV files.
train_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_path)
test_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_path)
Take a look at the train and test datasets.
The train set has a 'Survived' column, which is not included in the test set. Also, the train set is missing 177 Age values, and the test set is missing 86 Age values and 1 Fare value. So, I should fill in the missing data first (data cleaning).
print('-'*15 + 'Train Set' + '-'*15 + '\n', train.describe())
print('-'*15 + 'Test Set' + '-'*15 + '\n', test.describe())
---------------Train Set---------------
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
---------------Test Set---------------
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
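describe() only hints at the gaps through the count row. As a quick check (my addition, not part of the original flow), isnull().sum() counts the missing values per column directly:
# Count NaN values per column; the Age/Fare gaps match the describe() output above
print(train.isnull().sum())
print(test.isnull().sum())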
Data Cleaning.
For now, I replace the missing data with the median value of each column.
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
Map strings to integers for convenience when manipulating and searching the data.
# The former way I changed the data; the mapping dict below is clearer.
#train["Sex"][train["Sex"] == "male"] = 0
#train["Sex"][train["Sex"] == "female"] = 1
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)
train['Embarked'] = train['Embarked'].fillna("S")  # "S" is the most common port
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
train["Embarked"] = train["Embarked"].map(embarked_mapping)
Extract the 'Survived' values from the train set as the target, and select 'Pclass', 'Sex', 'Age', and 'Fare' as features.
This means I assume ticket class ('Pclass'), sex, age, and fare are related to survival. Then create a decision tree classifier from tree and fit it on the features and target.
target = train['Survived'].values
train_features = train[['Pclass', 'Sex', 'Age', 'Fare']].values
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(train_features, target)
feature_importances_ shows how much each feature contributes to predicting the target. Here, 'Sex' and 'Fare' seem more important than ticket class and age.
score returns the mean accuracy on the given features and target.
print(decision_tree.feature_importances_)
print(decision_tree.score(train_features, target))
[0.12968841 0.31274009 0.23646917 0.32110233]
0.9775533108866442
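The raw array is hard to read on its own; a minimal sketch (my addition) pairs each importance with its feature name:
# Print each feature next to its importance, in the same order as train_features
for name, importance in zip(['Pclass', 'Sex', 'Age', 'Fare'], decision_tree.feature_importances_):
    print(name, importance)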
Apply the selected features to the test data and predict. Here, the prediction is 'Survived' (1) or not (0).
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
prediction = decision_tree.predict(test_features)
print(prediction)
[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0
1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0
0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1
0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
0 1 1 1 1 0 0 1 0 0 0]
Then make a CSV file for submitting on Kaggle and check the result.
# Report (Make csv file)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
decision_tree_solution = pd.DataFrame(prediction, PassengerId, columns = ["Survived"])
# Write a solution to a csv file with the name my_solution.csv
decision_tree_solution.to_csv("decision_tree_solution.csv", index_label = ["PassengerId"])
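Kaggle expects one prediction per test passenger, so a quick sanity check (my addition) before uploading:
# The submission should have 418 rows (one per test passenger) and one column
print(decision_tree_solution.shape)  # (418, 1)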
Here is my first result. The accuracy is about 0.722.
So, there are limitations:
- Data Cleaning: I replaced missing data with the column medians, which is not ideal. There are several ways to improve this.
- Decision Tree: decision trees suffer from over-fitting, and sklearn provides parameters to control this (see the sketch after this list). An over-fitted model fits the training set almost perfectly but loses accuracy on the test set.
- Model: DataCamp also teaches Random Forest in the same course. I can also look at the diverse models other people use on this problem and which algorithms they apply.
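For the over-fitting point, here is a minimal sketch of how the tree could be constrained; the specific parameter values are illustrative assumptions, not tuned choices:
# Limit tree depth and require more samples per split to reduce over-fitting
pruned_tree = tree.DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=1)
pruned_tree = pruned_tree.fit(train_features, target)
# Training accuracy should drop below the 0.98 above, but the tree should generalize better
print(pruned_tree.score(train_features, target))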