
Airbnb New Users Booking

1. Introduction

While in quarantine due to COVID-19, I wanted to develop and sharpen my data analysis and data science skills. XGBoost, a framework that is popular on Kaggle and in the data science field but that I had no experience with, was an interesting subject to me, so I decided to jump in.

The Airbnb recruiting competition held on Kaggle about four years ago looked suitable, since many participants have shared their work using XGBoost and the goal of the contest is interesting: Airbnb asked participants to predict new users’ first booking destination from the users’ demographics, summary statistics, and their web session records.

At first I simply focused on using and learning the XGBoost framework, but my attention soon extended to analyzing the users’ data and applying funnel analysis (AARRR) to it.

So below I will cover exploratory data analysis, including AARRR, and a prediction model.

2. Data Analysis

Airbnb provides two types of datasets: users’ information and their web session data. The information data includes the date the account was created, the date of the first booking, the name of the destination country if the user has a booking history, demographic info, signup info, and marketing channel info. This dataset is split into train and test sets, and I have to predict users’ first destination country in the test set.

The session data includes user ids, actions, action types and details, and seconds elapsed. Airbnb does not document it in detail, so I had to assume the meaning of unclear fields; for example, I treated secs elapsed as the time elapsed since the user’s previous session action.

I concentrated on building time-series graphs because every row in both datasets holds time information, which provides meaningful insight into users’ actions while they use the service.

1. Pre-processing

While the tracking system of the service records robust information for most columns, demographic data like age and gender is often missing since it requires the users’ participation. The age column is tricky here.

RAW Age graph

Fig. Age Raw Data Distribution

It contains users who are supposedly more than 1,000 years old, which does not make sense. However, most of the unreasonable values are clustered in the late 1900s. I assumed that the system at some point asked users to type the year they were born, or simply did not validate the input, so for those values I subtracted the entered year from 2015, the year the dataset was released. The new graph then shows a reasonable distribution. (This graph only displays ages 18-100.)

Corrected Age graph
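A minimal pandas sketch of this correction, assuming the Kaggle users file and its age column:

import pandas as pd

users = pd.read_csv("train_users_2.csv")

# Ages that look like birth years are assumed to be "year of birth" entries,
# so convert them to an age as of 2015.
birth_year_mask = users["age"].between(1900, 1999)
users.loc[birth_year_mask, "age"] = 2015 - users.loc[birth_year_mask, "age"]

# Keep a plausible range (18-100) for the plot above.
plot_ages = users.loc[users["age"].between(18, 100), "age"]
print(plot_ages.describe())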

In the case of the session dataset, some rows do not include user ids. Since those sessions cannot be connected to the users’ information, I dropped them.

I also split the date information into year, month, and day columns for convenience in data wrangling.
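A sketch of these two steps, assuming the Kaggle file names (train_users_2.csv, sessions.csv) and their column names:

import pandas as pd

users = pd.read_csv("train_users_2.csv", parse_dates=["date_account_created"])
sessions = pd.read_csv("sessions.csv")

# Drop session rows that cannot be linked back to a user.
sessions = sessions.dropna(subset=["user_id"])

# Split the account-created date into year / month / day for easier wrangling.
users["created_year"] = users["date_account_created"].dt.year
users["created_month"] = users["date_account_created"].dt.month
users["created_day"] = users["date_account_created"].dt.day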

2. EDA

3. AARRR

1. Acquisition

Since the dataset only includes signed-up users, it is not possible to trace visitors who accessed the service and left without registering. However, it does provide the routes into the service, such as the affiliate channel, affiliate provider, and the first marketing channel the user interacted with before signing up. The first device used is also provided.

Marketing Channel

From the marketing channel perspective, direct connection is the top channel. This trend started in the middle of 2011, and the gap between direct and the other channels has widened steadily every year. The second most popular channel was SEM-Branded until 2014/08. I can assume that most users had already heard about Airbnb before visiting the service.

First Device Year

Device by month

Let’s look at the type of the first device. The number of desktop users increased rapidly between 2011 and 2013, while mobile users grew gradually. From 2013, however, first visits through mobile devices exploded. Breaking this figure down by month, the number of iPhone users eventually overtook desktop users.

Mobile & Desktop

A graph grouped into mobile and desktop reveals more dynamic changes from early 2014, and later even more visits came through mobile devices.

Web browser

The increase in mobile application users likely helped drive the growth in mobile users. The graph above displays changes across the 8 most widely used web browsers. Airbnb does not describe the “Unknown” browser in its documentation, so I guessed it marks users who entered the service through the mobile applications. It seems reasonable to speculate this because the two graphs show a similar increasing trend.

2. Activation

Activation

Since every user in the dataset has already activated an account, it is not meaningful to discuss activation here. Instead, I plotted the number of accounts created by year and month. Airbnb has grown every year, and growth exploded in 2014.

Activation Month

This graph shows again that the number of created accounts increased every year, and that the number of new users peaks in the summer. This could be an effect of summer vacation.

3. Retention

Session

Sessions Expanded

To analyze users’ retention, I manipulated the session dataset. Originally it only contains the columns (id, action, action_type, action_detail, device_type, secs_elapsed). It traces the user’s activity while in the service, but it does not provide timestamps for the actions. So I accumulated the elapsed seconds per id, converted them to days, and connected that to the account creation day, so the date of each activity can be estimated approximately.
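A sketch of that estimation in pandas, assuming the sessions frame has user_id and secs_elapsed columns and the users frame has id and date_account_created (column names follow the Kaggle files but are assumptions here):

import pandas as pd

# Cumulative elapsed seconds per user, in session order, converted to days since signup.
sessions["cum_secs"] = sessions.groupby("user_id")["secs_elapsed"].cumsum()
sessions["days_since_signup"] = sessions["cum_secs"] // 86400

# Attach the signup date and estimate an approximate date for each activity.
signup = users.set_index("id")["date_account_created"]
sessions["est_date"] = sessions["user_id"].map(signup) + pd.to_timedelta(
    sessions["days_since_signup"], unit="D")

# Retention per user: days between signup and the last estimated activity.
retention = sessions.groupby("user_id")["days_since_signup"].max()
print(retention.quantile([0.2, 0.8]))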

Retention Graph

Retention was calculated as the difference between the day the user created the account and the date he or she last accessed the service. According to the graph, the top 20% of users have a retention longer than 27 days, while the bottom 20% did not come back after 2.5 days.

        Top 20%       Bottom 20%
count   27809         26342
mean    177.942285    8.721547
std     160.384814    7.953145
min     3             0
25%     80            3
50%     134           6
75%     222           12
max     2715          134

The number of top 20% users is 27,809, and the bottom 20% is 26,342. As can easily be predicted, the difference in the number of sessions between the top and bottom groups is significant: the top 20% of users tend to view more pages and try more actions.

Top 20%                                Bottom 20%
Action                  Ratio          Action              Ratio
show                    0.2336425      show                0.1643121
index                   0.0993014      header_userpic      0.0822204
search_results          0.0848848      active              0.0682795
personalize             0.0844791      index               0.0612957
ajax_refresh_subtotal   0.0591886      create              0.0596574
search                  0.047732       dashboard           0.0533352
similar_listings        0.0455265      personalize         0.0493038
update                  0.0406442      search              0.0481939
social_connections      0.0336528      update              0.037676
reviews                 0.0299896      search_results      0.0304125

The table above shows the 10 most common actions performed by the top and bottom 20% of users by retention days. The second most common action on the top side, ‘index’, includes viewing various results such as search results, listings, and wish lists. The top 20% of users also perform many actions that are far from an immediate reservation, such as personalize (updating wish lists) and social_connections. On the other hand, one-time actions such as updating the profile picture or phone number and creating the account are the most common actions among the bottom 20%.

Retention Booked / Unbooked

There was also a difference between customers who had booked through the service and those who had not: the return rate of customers with a booking tended to be slightly higher. (This graph reflects only users from the training dataset.)

MAU

From the session data, I was able to trace the number of daily users after manipulating the seconds elapsed column as above. Because I only have overlapping demographic and session data between January and September 2014, the graph drops after October 2014.

DAU

The Daily Active Users data is then aggregated into the Monthly Active Users graph.
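A sketch of that aggregation, reusing the estimated activity date (est_date) derived in the retention step:

# Daily active users: unique users per estimated activity date.
dau = sessions.groupby(sessions["est_date"].dt.date)["user_id"].nunique()

# Monthly active users: unique users per calendar month.
mau = sessions.groupby(sessions["est_date"].dt.to_period("M"))["user_id"].nunique()
print(mau.tail())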

4. Referral

Airbnb does not provide specific information related to referrals. However, I found action names that seem related to referrals, so I extracted the actions whose names contain ‘referr’. (‘ajax_get_referrals_amt’ seems to be the action for checking the number of referrals a user has made.)

  • ajax_get_referrals_amt
  • ajax_referral_banner_experiment_type
  • ajax_referral_banner_type
  • referrer_status
  • signup_weibo_referral
  • weibo_signup_referral_finish
Referral Description  
count 9076
sum 25019
mean 2.75
std 2.58
min 1.00
25% 2.00
50% 2.00
75% 3.00
max 56.00

The result reveals that 9,076 users performed 25,019 actions related to referrals. The total number of users in the dataset is 275,547, so the conversion rate from acquisition to referral in this dataset is around 3.2%.
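A sketch of that extraction in pandas, assuming the sessions frame from the retention step with user_id and action columns:

# Sessions whose action name mentions 'referr' (covers 'referral', 'referrer', ...).
referral_sessions = sessions[
    sessions["action"].str.contains("referr", case=False, na=False)
]

referral_per_user = referral_sessions.groupby("user_id").size()
print(referral_per_user.describe(), referral_per_user.sum())

# Share of all users who triggered at least one referral-related action.
total_users = 275547                       # total users quoted above
print(f"{len(referral_per_user) / total_users:.1%}")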

5. Revenue

To estimate revenue, I only use the training dataset because the test set does not include booking information. According to the training data, 88,908 out of 213,451 users have booked accommodation through Airbnb, so nearly 42% of users have used the service.

This conversion rate seems abnormally high, and I assume Airbnb provided filtered data for the competition, since its purpose is to predict users’ first destination country. In any case, I extracted as much revenue and conversion information as I could from the dataset.

First Destination

This graph presents the ratio of users’ first destinations. ‘NDF’ here means ‘No Destination Found’, i.e. users who had not made a reservation yet. Let’s split this by gender.

Destination By Gender

According to the graph above, users who did not register their gender are less likely to have booked accommodation through Airbnb. I can assume two things.

  • Users who provide more personal information are more likely to make a reservation
  • Users tend to provide their information for hosts after making a reservation

Either way, I can guess that leading users to provide their personal information would keep them in the service and increase the possibility that they use it in the future.

Join to First Booking Dates

This is a graph of the difference between the joining date and the first reservation date for users who have made a reservation. As you can see, most users made their first reservation within a few days after joining Airbnb. This may affect referral and acquisition strategy.

affiliate_channel   Not booked   Booked
seo                 0.543461     0.456539
direct              0.568727     0.431273
sem-brand           0.574045     0.425955
other               0.598259     0.401741
sem-non-brand       0.620569     0.379431
api                 0.658994     0.341006
remarketing         0.664234     0.335766
content             0.858663     0.141337

Conversion Rates By Affiliate Channels

This table contains conversion rates by affiliate channel. As in the plot of new acquisitions above, the three most active acquisition channels (Direct, SEO, SEM-Brand) also have the top conversion rates here.

3. Modeling

4. Result

Implementing Knowledge Graph on Medical Abstracts

Abstract

Extracting relevant data and information from large datasets and research papers can be a complex and time-consuming process. Moreover, narrowing down contextual information can be challenging because of the disconnect between various datasets and research findings. In this project, I have explored the use of knowledge graphs in a medical context. The enriched knowledge graph I built is capable of finding relevant relationships among anatomical components such as genes, proteins, and diseases, as well as drugs and drug types. Applying queries can help narrow a vast number of abstracts down to only the relevant ones in a short amount of time. Medical researchers can use these outputs for further research and detailed analysis.

Introduction

The data and research outputs created by the biomedical field are largely fragmented and stored in databases that typically do not connect to each other. To make sense of the data and to generate insights, we have to transform and organize it into knowledge by connecting information. This is possible through knowledge graphs: contextualized data with a logical structure imposed through entities and relationships. Using knowledge graphs helps retrieve data faster through connected graphs, provides contextual relationships to better understand anatomical components and drug interactions, and makes research more efficient and economical through the interpretation and exploration of insights. The output of this project will help the pharmaceutical company attain these goals.

Scope

This project focused on designing knowledge graphs to be used for research and faster retrieval of information and insights. Techniques of natural language processing and concept enrichment have been used to create information-rich knowledge graphs. To understand and explore their use in research, this project connects and creates insights from research abstracts available through the PubMed database. The objective is to find relationships among protein targets, genes, disease mechanisms, and drugs in these abstracts by using various processes in Neo4j.

Methodology

Methodology

This project utilized Neo4j as the graph database and GraphAware as the framework. A dataset containing 1,146 abstracts from the PubMed database was used to find relationships among the word tags. The ontology document contains the focus entities and their codes; this project concentrated mostly on the five groups shown in the table below. The entities document contains phrases that are present in the abstracts and their related entities from the ontology document. In order to connect these data together, we indexed the data, creating indexes on Abstract id (data corpus), Code (Ontology), and Value (TERMite Response).

Annotation

Annotation

Annotation is the first step in building the knowledge graph and is the process of breaking down the abstracts by tokenizing the words using the GraphAware NLP framework. Each abstract is separated into sentences, and the sentences are divided into tags (words). During the process all the stop words are removed, and the remaining words are assigned their parts of speech. In this way, we create a hierarchy of three nodes (Annotated Text, Sentence, and Tag) and two relationships (Contain_sentence and Has_tag). Annotating the abstracts is essential for further graph-based analysis. Even this simplest form of graph helps us point out the relevant keywords and sentences in a vast list of abstracts.

Occurs With

Occurs With

After the tags are extracted from the abstracts, it is helpful to look at keywords that frequently occur together. Often it is possible to infer a relationship between two keywords from the number of times they occur together. In this process, we create a function that runs through the keywords and counts the co-occurrences of each pair of keywords. For instance, in the sentence “Among genes related to milk fat synthesis and lipid droplet formation, only LPIN1 and dgat1 were upregulated by ad-nsrepb1”, the terms LPIN1, dgat1, upregulate, and ad-nsrepb1 occur together. The process then tries to find these co-occurrences in other abstracts.

This process counts the frequency of occurrences and builds the occurs_with relationship. It helps in building the graph and to understand the phrases. The frequencies of these occurrences can also be used to figure out the relevant relationships among keywords.

Keyword Extraction

Keyword Extraction

In this process, the words or phrases that best describe an abstract are identified and selected using Term Frequency – Inverse Document Frequency (TF-IDF). So, for any particular abstract, the highest-weighted terms can be used as a simple summary of its context. The tags that are the most relevant in the abstracts are stored in the node keyword, and the relationship describes is created to link the annotated text and its keywords.

This process can play an important role in simplifying the summaries of abstracts. It saves researchers time by narrowing the abstracts down to only the relevant keywords. Further queries can be created according to researchers’ needs.

Concept Enrichment

Concept Enrichment

To build a knowledge graph, we also need to extract the hidden structure of our textual data and organize it so that a machine can process it. Sometimes the data we have is inadequate for further processing, or it is not quite the right data for the task. The enrichment process therefore allows us to extend the knowledge by introducing external sources, which lets us gain more insight at the end of the process. We used the external knowledge bases ConceptNet 5 and Microsoft Concept, which offer the ability to enrich entities.

For instance, the enrichment can discover that fibrosis is a degenerative change, but also a chronic complication. We can also create new connections between documents in the graph. For example, if we take a document mentioning fibrosis and a document mentioning nephropathy, nothing recognized at the entity level relates the two. However, by using external knowledge, we can create a new connection between fibrosis and nephropathy, with chronic complication as the link between the two. This automatically enriches our graph, shares connected data, and increases our knowledge about the documents.

TextRank Summarization

TextRank Summarization

A similar approach to keyword extraction can be employed to implement simple summarization of an abstract. A densely connected graph of sentences is created, with sentence-sentence relationships representing their similarity based on shared words (the number of shared words normalized by the sum of the logarithms of the sentence lengths). PageRank is then used as a centrality measure to rank the relative importance of sentences in the document.

For a given annotated text id we can get the list of sentences with their relevance ranking. In the output below for annotatedText id 247474, sentences 7 and 3 are the most important, with TextRank scores of .178 and .170 respectively.
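The ranking itself runs inside Neo4j, but the underlying idea is compact. A rough Python sketch of the sentence graph and PageRank ranking, using networkx (not part of the original stack) and made-up sentences:

import math
import itertools
import networkx as nx

def textrank_sentences(sentences):
    """Rank sentences by building a shared-word similarity graph and running PageRank."""
    tokenized = [set(s.lower().split()) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        shared = len(tokenized[i] & tokenized[j])
        if shared == 0:
            continue
        # Similarity: shared words normalised by the logs of the sentence lengths.
        denom = math.log(len(tokenized[i]) + 1) + math.log(len(tokenized[j]) + 1)
        graph.add_edge(i, j, weight=shared / denom)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = textrank_sentences([
    "Fibrosis is a degenerative change seen in chronic disease.",
    "LPIN1 and DGAT1 were upregulated in the treated samples.",
    "Chronic complications such as fibrosis and nephropathy are related.",
])
print(ranked)   # [(sentence index, score), ...] with the most central sentence first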

Cosine Similarity

Cosine Similarity

Once tags are extracted from all the abstracts, it is possible to compute similarities between them using content-based similarity. During this process, each annotated text is described using the TF-IDF encoding format. Text documents can be TF-IDF encoded as vectors in a multidimensional Euclidean space, where the dimensions correspond to the tags previously extracted from the documents. The coordinate of a given document in each dimension is calculated as the product of two sub-measures: term frequency and inverse document frequency.

Similarity between two or more abstracts can then be calculated using cosine similarity, which ranges from 0 to 1. Abstracts with a cosine similarity near 1 are the most similar in meaning.
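As a plain-Python illustration of the same idea (outside Neo4j/GraphAware), TF-IDF vectors and cosine similarity could be computed with scikit-learn; this is only a conceptual sketch on made-up abstracts:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Fibrosis is a degenerative change and a chronic complication.",
    "Nephropathy is a chronic complication of diabetes.",
    "LPIN1 and DGAT1 regulate milk fat synthesis.",
]

tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(abstracts)

# Pairwise cosine similarity; values near 1 indicate similar abstracts.
similarity = cosine_similarity(vectors)
print(similarity.round(2))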

Result and Insights

With all of the methodology above, we tried to draw results from two perspectives. First, we wanted to exploit the methods fully, which means finding a way to weave the processes together and draw insights. Second, we considered the processes from the view of a researcher at the company and thought about how to take advantage of a graph database. A graph database defines relationships between nodes and supports searching based on those relationships, which we see as its advantage over relational databases and plain programming languages.

Neo4j provides graph visualization that helps users explore the database, but in most cases it displays complicated visuals that are not efficient for delivering insights and consume a lot of computing resources. To avoid this issue, we wrote two queries that receive keywords or abstract ids as input and return text as a result. In effect, the graph database can be used like Wikipedia: users can surf interesting abstracts by keyword, or jump to similar abstracts, through our queries. We discuss how this works below.

Keyword to Abstracts

Keyword to Abstracts

Keyword Searching Concept

Query

This query receives a keyword the user is interested in and displays, for each abstract that contains the keyword, its id and its most relevant sentence. The query only searches keywords extracted in the Keyword Extraction process above; it does not scan the full text of the abstracts. The most relevant sentences are chosen by the TextRank Summarization process based on the frequency of shared terms in each abstract. For example, searching the keyword ’NASH’ returns 8 sentences and ids. We expect this to save users time searching and reading articles.

Abstract to Abstracts

Abstract to Abstracts

Result Table

The other query pulls information from abstracts that are similar to an abstract the user is interested in, so it receives an abstract id as its input argument. The similarity value is calculated by the Cosine Similarity process, so each abstract has a similarity value with every other abstract in the database; for this reason, we limit the number of abstracts presented. We devised this query because a researcher will want to surf similar abstracts to gather more information, and the query shortens the time needed to find them.

Result Visual

Retrieving Spotify daily Top 50 charts data by using Spotify API and Python

Introduction

Spotify is a digital music service that gives users access to millions of songs and has a huge client base all around the world. Driven by my interest in music and in the data, I chose to explore Spotify’s data further. My goal is to build a “Top 50” music dataset covering different categories of songs, including 62 charts for different countries and areas as well as a global chart. The dataset is built from information about songs and artists and from the daily charts published on Spotify. I would like the data to show the popularity of certain songs, artists, and categories.

You can check out python code for this work on this Jupyter Notebook.

Potential Users

Based on my dataset, there will be some potential users and applications for multiple purposes:

Potential Users

  1. Record companies - The dataset could help record companies decide which singers have great potential and are worth investing in. For their current singers, it can also help decide what type of songs to make in order to open a target foreign market.
  2. Event organizers - The dataset can be used as a reference to know which songs their singers should choose on a world tour.
  3. Analysts - The dataset can be provided to analysts for data analysis.
  4. Listeners - The dataset allows listeners to quickly learn about the most popular types of music and singers in other countries.

Spotify provides the data I use: information about songs, artists, and albums, their available markets and popularity scores, the number of followers of artists, and so on. I use Python and the API to access the data, and the raw data I obtain is in JSON format. I clean the data to extract exactly the information I want and preprocess it to build an easily readable dataset.

How to retrieve data from Spotify

Spotipy

Spotify allows users to access its API and also provides neat documentation. Although Spotify only offers a Web API officially, it archives a list of 3rd-party libraries for integrating with the Spotify Web API from several programming languages and platforms. I am going to connect to Spotify through the Spotipy library.

Spotify API Documentation

Spotipy Documentation

What kind of information

Features

Here are the features I collect for each song.

24 Features

  • continent
  • country rank
  • song
  • artist
  • album
  • release date
  • song id
  • artist id
  • album id
  • genre
  • acousticness
  • danceability
  • energy
  • instrumentalness
  • liveness
  • loudness
  • speechiness
  • valence
  • followers
  • artist popularity
  • song popularity
  • album popularity
  • created date

You can check out more explanation on the features from readme.

The category of a song in my dataset is determined by the category of its artist. However, many artists release songs in different styles, and no single style label can represent all of them, so there is a limitation when I want to categorize songs by their exact style rather than by the artist type. Spotify provides analyses of each song such as acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, and tempo. These data can provide more information for analysts and record companies, and I can use this information to help infer the artist type.

Programming

  • Collecting Name of Countries Serviced

Where is Spotify Available

Top 50 charts are updated daily by Spotify, and my goal is to scrape as many of the official charts as possible. I assume Spotify will add charts for other countries in the future, so I think it is worth building the code in a way that covers that case. My approach is to grab the list of countries Spotify services, combine each name with the ‘Top 50’ keyword, and search for playlists that contain the combined phrase. I also have to filter one more time by the chart creator, the official Spotify account ID. This work is in the code.
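A rough sketch of that search-and-filter step with Spotipy (the country list is truncated here, and the official owner id is assumed to be "spotify"):

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client credentials flow; expects SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET env vars.
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

# Stand-in for the full list of countries Spotify services.
countries = ["Global", "United States", "Japan", "Germany"]

top50_playlists = {}
for country in countries:
    results = sp.search(q=f"Top 50 {country}", type="playlist", limit=10)
    for item in results["playlists"]["items"]:
        # Keep only charts owned by the official Spotify account.
        if item["owner"]["id"] == "spotify" and item["name"].startswith("Top 50"):
            top50_playlists[item["name"]] = item["id"]

print(top50_playlists)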

  • Add a temporary function on Spotipy to gather data

Temporarily Add

Although Spotipy supports many of Spotify’s official API endpoints, it did not yet offer a function for retrieving a playlist’s track information (as of Dec 2019). This information is essential for this work, so I had to find a solution. I analyzed the Spotipy code first and decided to write a custom function and attach it to the Spotipy object temporarily via the ‘types’ module. This way I can still exploit the advantages of using Spotipy.
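A sketch of that approach using the standard library’s types.MethodType; it assumes Spotipy’s internal _get helper (which the library’s own endpoint wrappers use) and the Web API’s playlists/{id}/tracks endpoint:

import types
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

def playlist_tracks(self, playlist_id, fields=None, limit=100, offset=0):
    """Call the playlist-tracks endpoint directly through Spotipy's request helper."""
    return self._get(f"playlists/{playlist_id}/tracks",
                     fields=fields, limit=limit, offset=offset)

# Bind the helper to this Spotify instance only, without patching the library itself.
sp.playlist_tracks = types.MethodType(playlist_tracks, sp)

# tracks = sp.playlist_tracks("37i9dQZEVXbMDoHDwVN2tF")   # example playlist id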

  • Generator to gather IDs easily and reduce memory usage

Generator

The code retrieves every piece of data based on IDs (of a chart, a singer, a song, or an album). The Spotify API returns several items per request, and only the endpoint of the request needs to change. However, there is a limit on the number of items that can be gathered at once, and it depends on the type of item (mostly 50, but only 20 for albums). So I made a generator that returns batches of IDs based on the type and its limit. I expect this to save memory.
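A minimal sketch of such a generator; the per-type limits reflect the constraints described above:

def id_chunks(ids, kind):
    """Yield batches of IDs sized to the API limit for the given object type."""
    limit = 20 if kind == "albums" else 50    # most endpoints take 50 IDs, albums only 20
    for start in range(0, len(ids), limit):
        yield ids[start:start + limit]

# Example usage: request album details in batches of 20.
# for batch in id_chunks(album_ids, "albums"):
#     albums = sp.albums(batch)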

Result

Result Table

The final table looks like this. You can check out the whole result in the Jupyter notebook code.

Future Works and Limitations

  1. The rank in the charts obtained from the website is based on daily play counts in each country, which is not exposed in the API. I assume the rank follows the ascending order in which tracks are returned by the API. (I have not seen an exception yet.)

  2. I could not figure out the time at which Spotify updates the charts, and the code cannot check whether the charts have been updated. (Future work)

  3. The code only creates a single daily charts CSV file. Supporting aggregation, such as reading an existing sheet and stacking new data, or creating a new sheet every day, is a candidate improvement for this work.

Text Classification on Social Media Text

This summer, I had a chance to work with one of the biggest pharmaceutical companies to evaluate the performance of machine learning models for detecting drug abuse in tweets. Primarily, the company focused on assessing BERT against other machine learning models. It was my first time working on Natural Language Processing (NLP) and an excellent opportunity to come across new concepts.

I will share the explanation and result of the project below, and you can check out the Python code from my github page.

Before moving into the project, I would like to share what I have learned and the points where I should improve on in the future.

  • NLP concepts (Lemmatization/Stemming, TF-IDF)
  • Reviewed the tweets before working on the methodology and found that several medications have two names, so I devised a term replacement to unify the names. Once again, it reminded me of the value of looking at the data first.
  • The reason to choose the F-measure as the main performance metric: in a real business setting, it is important not only to get the right answers right but also to make fewer mistakes. In that regard, the F-measure, which balances the two, seemed suitable. I thought about the metrics deeply and became familiar with them through the project.
  • Insufficient background on neural network models and on BERT. I studied and experienced a bit of deep learning models, Keras, and BERT this time. It has been fun, and I will put more effort into it.
  • Overfitting in the classification models. Although I cross-validated the models and ran optimization, the models still seem overfitted, especially the Decision Tree model. I will look for room for improvement on this issue.

Summary

Social media messaging is a significant resource to collect and understand human behavior. As this data is generated in abundant volumes on a minute basis, evaluating an automated technique to identify these behaviors would be very beneficial. The goal of this project is to evaluate the performance of Google’s Bidirectional Encoder Representation of Transformers (BERT) in the classification of Social Media Texts. BERT was compared to four different classifiers: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naïve Bayes, and Decision Tree (DT); and two neural network models – Recurrent Neural Networks (RNN) and Convolutional Neural Networks(CNN). Twitter posts associated with three commonly abused medications (Adderall, oxycodone, and quetiapine) were collected to classify users into those who are likely to abuse drugs and those who would not. BERT performed the best. Hyper-parameter tuning of BERT was performed to achieve the best results.

Introduction

Introduction

The US is the world’s biggest abuser of prescribed drugs. According to the New York Times, Americans account for 80% of the world’s oxycodone consumption, and it kills more people than car accidents every year (Ricks, 2019). The Centers for Disease Control and Prevention classifies it as an epidemic, and the World Health Organization reported that it threatens the achievements of modern medicine (Hoffman, J., 2019). The aim of this project is to correctly identify users who are likely to abuse drugs from their Twitter posts. A total sample of 3,017 tweets was collected, with 447 (15%) categorized as “abuse” and 2,570 (85%) categorized as “not abuse”. The data was split into training and testing sets in an 80:20 ratio.

Data Overview

Pre-processing

Pre-processing

No missing data was observed in the collected tweets. Prior to dividing the dataset into training and testing sets, the text was pre-processed. Since the tweets were mostly written in informal language and contained Twitter reserved words, we applied several cleaning steps. First, we used the Python twitter-preprocessor library to erase Twitter reserved words, URLs, and emojis. Second, we removed ‘stop words’, the words that appear most frequently in English but do not carry important meaning. We also wanted to treat words like “Seroquel” and “seroquel” in the same way, so to prevent the model from recognizing them as two different things, we lowercased the text. Based on the same idea, we used stemming to reduce the number of terms by trimming grammatical inflections from the ends of words and bringing them to a common base form. Next, upon reading the tweets, we found that ‘Seroquel’ and ‘Oxycodone’ are also called ‘Quetiapine’ and ‘Oxycontin’, respectively; these are the same medications under generic and brand names, so we replaced the generic names with the brand names. Similarly, we replaced “red bull” with “redbull” to maintain consistency and removed characters such as “&amp” and “&lt” as they did not add meaning to the text.
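A condensed sketch of these cleaning steps; the exact libraries used in the project may differ, and NLTK’s stop-word list and Porter stemmer stand in here:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Unify names that refer to the same medication, plus a few ad-hoc fixes.
replacements = {"quetiapine": "seroquel", "oxycontin": "oxycodone",
                "red bull": "redbull", "&amp": " ", "&lt": " "}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", " ", text)          # URLs, mentions, hashtag marks
    for old, new in replacements.items():
        text = text.replace(old, new)
    tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]+", text)
              if t not in stop_words]
    return " ".join(tokens)

print(clean_tweet("Ran out of Seroquel &amp; red bull... http://t.co/x"))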

Resampling

Resampling

Our dataset is imbalanced, with 85% classified as “not abuse” and the remaining 15% classified as “abuse”. To keep the learning models from being biased towards the majority class, we tried three resampling methods: Naïve Random Oversampling and KNN-based Synthetic Minority Oversampling Technique (SMOTE) to increase the “abuse” class to match the “not abuse” class, and Naïve Random Undersampling to decrease the “not abuse” class to match the under-represented “abuse” class. We applied all three resampling methods to the datasets with minimum word frequencies of 20, 30, and 50, and then ran the four classification models below on each scenario to compare performance and determine the resampling method and the appropriate minimum frequency (Appendix: Table 1). SMOTE was the chosen method of resampling.
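A sketch of the three resampling strategies with imbalanced-learn, on a toy 85:15 dataset standing in for the real tf-idf matrix:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy stand-in for the tf-idf features: an 85/15 imbalanced binary dataset.
X, y = make_classification(n_samples=3017, weights=[0.85], random_state=42)

samplers = {
    "random_over": RandomOverSampler(random_state=42),
    "smote": SMOTE(random_state=42),             # KNN-based synthetic oversampling
    "random_under": RandomUnderSampler(random_state=42),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, np.bincount(y_res))              # class counts after resampling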

Feature Selection

Feature Selection

To analyze our results after stemming, we converted the text data into a document-term representation, then tokenized the text and counted the frequency of the terms. However, mere frequency count does not equate to feature importance, so we applied an alternative weighting scheme, Term Frequency-Inverse Document Frequency (tf-idf), to better weight the important terms. We then created three datasets in which only the terms that appeared a minimum of 20, 30, and 50 times respectively were included. With a minimum term appearance of 20, 226 features were retained; with a minimum of 30, 137 features were retained; and with a minimum of 50, 67 features were retained.
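A sketch of the tf-idf weighting with a minimum-appearance threshold, using scikit-learn’s min_df (a document-frequency cut-off, which is a close stand-in for the minimum term count described above) and a tiny toy corpus with scaled-down thresholds:

from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in for the cleaned tweet corpus.
cleaned_tweets = [
    "need adderall and seroquel again",
    "adderall season is here need coffee",
    "coffee instead of adderall today",
]

# The project used thresholds of 20, 30 and 50; scaled down here for the toy corpus.
for min_count in (1, 2, 3):
    vectorizer = TfidfVectorizer(min_df=min_count)   # keep terms seen in >= min_count documents
    matrix = vectorizer.fit_transform(cleaned_tweets)
    print(min_count, matrix.shape[1], "features retained")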

Classification Models

The data was split into training and testing sets in an 80:20 ratio. On the training set, four supervised classification models were fitted to determine the best classification technique: KNN, Naïve Bayes, SVM, and Decision Tree. We focused on several performance metrics to determine the best classification model, mainly the F1-measure, the recall, and the model accuracy on the testing set. Recall is the ratio of correctly predicted positive observations to all observations in the actual class. The F1-measure, on the other hand, takes both false positives and false negatives into account, which is especially helpful because our class distribution is uneven. Finally, model accuracy is the ratio of correctly predicted observations to total observations. For the F1-measure and recall, we focused on the values for the “abuse” class, since we wanted the model to correctly identify the tweets likely to be classified as “abuse” rather than “not abuse.”

Models

  • KNN

KNN is a non-parametric technique that stores observations and classifies new cases based on their similarity to the stored observations. Euclidean distance was the chosen similarity metric, with k=10; cross-validation on an independent dataset was performed to determine the optimal k-value. The model achieved a training accuracy of 0.81 (F1-measure: 0.84 and recall: 0.97) and a testing accuracy of 0.56 (F1-measure: .31 and recall: .62). There is a clear issue of overfitting here.

  • Naive Bayes

Naive Bayes computes the conditional probability of every category and selects the outcome with the highest probability, assuming independence of the predictor variables. The model achieved a training accuracy of 0.69 (F1-measure: 0.74 and recall: 0.90) and a testing accuracy of 0.51 (F1-measure: .33 and recall: .75). Overfitting exists here as well.

  • SVM

A “linear” kernel was used to find the optimal hyperplane with the largest margin of separation between the two classes. The model achieved a training accuracy of 0.74 (F1-measure: 0.75 and recall: 0.79) and a testing accuracy of 0.67 (F1-measure: .40 and recall: .69). Overfitting exists here as well.

  • Decision Tree

Decision Tree does not require the assumption of linearity and is apt for this case. The model achieved a training accuracy of 0.96 (F1-measure: 0.96 and recall: 0.97) and a testing accuracy of 0.74 (F1-measure: .23 and recall: .24). Overfitting exists here as well.

Evaluation

Training

Testing

Upon comparing the four classification models (Table 1 below, Appendix: Graph 1), the SVM model with a minimum word frequency of 30, resampled with the SMOTE method, has the highest F1-measure of 0.40, recall of 0.69, and test accuracy of 0.67. Out of 98 “abuse” tweets, SVM correctly predicts 47 (48%) as “abuse”, and out of 506 “not abuse” tweets, 287 (57%) are correctly predicted.

Classification Models ROC Curve

Validation and Optimization

We performed 10-fold cross-validation on the SMOTE and original datasets to combat overfitting and to validate the model. A large difference in the mean F1-measure was observed between the original un-resampled dataset (mean: .357, standard deviation: .346) and the SMOTE-resampled dataset (mean: .721, standard deviation: .233), confirming that the resampled dataset performed much better. We further performed a grid search to tune the parameters of the SVM model, trying four values of C (1, 10, 100, 1000) for the ‘linear’ kernel with the stratified k-fold cross-validation option. The model is optimized at C=1000, with an F1-measure of .37, recall of .68, and accuracy of 0.66, confirming our optimized model performance.
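A sketch of this grid search with scikit-learn, using a toy imbalanced dataset resampled with SMOTE as a stand-in for the real tf-idf features:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy stand-in for the SMOTE-resampled tf-idf features.
X, y = make_classification(n_samples=600, weights=[0.85], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [1, 10, 100, 1000]},
    scoring="f1",                       # optimise the F1-measure
    cv=StratifiedKFold(n_splits=10),    # stratified 10-fold cross-validation
)
search.fit(X_res, y_res)
print(search.best_params_, round(search.best_score_, 3))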

Neural Networks

  • Tokenization

In our study, we tokenize the text of drug tweets by performing white space tokenization, removing punctuation, and converting to lowercase. Before tokenization, we set the size of our vocabulary to 5,000.

  • Padding

Keras requires that all samples be of the same size. For this reason, we need to pad some observations and (possibly) truncate others. If padding occurs, 0s are added to either the beginning or the end of the sequence to match the maximum length specified; in our study, we chose to add 0s to the end. We set the maximum number of tokens to the average plus 2 standard deviations, saved as an integer, which gives 26 tokens.
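A minimal sketch of this tokenization and padding with the Keras preprocessing utilities, on a stand-in corpus (the real input would be the cleaned tweets):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tweets = ["ran out of seroquel again", "need adderall for finals week"]  # stand-in corpus

tokenizer = Tokenizer(num_words=5000)        # vocabulary capped at 5,000 words
tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)

# Pad (or truncate) every tweet to 26 tokens, adding zeros at the end.
padded = pad_sequences(sequences, maxlen=26, padding="post", truncating="post")
print(padded.shape)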

  • Class-Weight

Our dataset is unbalanced. To address this problem, we used the parameter class_weight='balanced' to account for the imbalance. This parameter gives different weights to the labels, balancing the “abuse” and “not abuse” classes and helping to address the bias in our data.

Models

  • RNN (Recurrent Neural Networks)

RNNs are used for processing sequences of data and are a popular choice for predictive text input. In our RNN model we created a Sequential model and added the configured layers: (1) an Embedding layer, (2) an LSTM layer, where the units argument is the number of neurons and we chose 16 first, (3) a total of 3 Dropout layers, and finally (4) the Dense output layer. To compile the model, we specified the loss function (binary cross-entropy), the optimizer (Adam), and accuracy as the metric to assess the model at each epoch and overall. We tried various combinations of parameter tuning by changing the number of layers, the embedding size, the number of epochs, and the learning rate, for both cased and uncased datasets. The best result on the testing dataset generated an F1-measure of 0.40, recall of 0.60, and accuracy of 0.81.
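A sketch of such a model in Keras, using the vocabulary size and sequence length from the steps above; the embedding size and the placement of the dropout layers (the original used three) are my assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

vocab_size = 5000                            # from the tokenization step

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32),
    Dropout(0.2),
    LSTM(16),                                # 16 units, as described above
    Dropout(0.2),
    Dense(1, activation="sigmoid"),          # binary abuse / not-abuse output
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# history = model.fit(padded, labels, epochs=5, batch_size=32,
#                     class_weight={0: 1.0, 1: 5.7})    # roughly the 85:15 imbalance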

  • CNN (Convolutional Neural Networks)

A CNN, also known as a convnet, has been gaining traction in NLP classification. Like the RNN, our CNN model starts with a Sequential model, to which we added five configured layers: (1) an Embedding layer, (2) a Convolution layer, (3) a Pooling layer, in this case GlobalMaxPooling, used to compress the results by discarding weaker features, which helps the generalizability of the model, and finally, as in the RNN, (4) a Dense layer (with 100 units) and (5) the Dense output layer. We then compiled the model with the same loss function, optimizer, and metrics, fit it on the training sample, and generated predictions on new data. An F1-measure of 0.186 with a recall of 0.371 was observed on the testing set.

Evaluation

NN ROC Curve

After trying multiple parameter settings for the CNN, our best result, with a batch size of 32 and 2 epochs, has an F1-measure of .391 for the abuse class, with a recall of .454 and accuracy of 0.773. A ROC curve was generated to compare the RNN and CNN results (Appendix A: Graph 3). RNN outperformed CNN, with an AUC of .649 compared to .603.

BERT

  • Introduction

BERT is a bidirectional model: its Transformer encoder reads the entire sequence of words at once, so the model learns from a word’s surroundings on both the left and the right. The main innovation lies in its pre-training methods, Masked LM (mask out k% of the input words, usually k = 15, and then predict the masked words) and Next Sentence Prediction, which capture word-level and sentence-level representations respectively.

  • Pre-trained model

In our study, we used two pre-trained models, a cased model and an uncased model, to fit our data into BERT. The uncased model processes the data in lowercase format, while the cased model leaves the text in its original form, a combination of upper- and lower-case words.

  • Hyper parameter tuning

We tried a total of 144 combinations of parameters: batch sizes of [8, 16, 32, 64, 128, 256], epochs of [3, 4, 5], learning rates of [2e-5, 3e-5, 4e-5, 5e-5], and text case of [‘Uncased’, ‘Cased’]. Our best result, with a batch size of 8, 4 epochs, a learning rate of 2e-5, and uncased data, generated an F1-measure of 0.485, recall of 0.576, accuracy of 0.797, and AUC of 0.708.

Result

Result_Comparison

Google’s BERT model outperforms the others (Appendix: Graph 4) in identifying users likely to abuse drugs based on Twitter posts, with an F1-measure of 0.485, recall of 0.576, and accuracy of 0.797. Our study indicates that social media (Twitter) can be an important resource for detecting the prevalence of prescription medication abuse, and that BERT can help identify potentially abuse-indicating user posts.

PCA (Principal Component Analysis)

Analysis on Grocery price

1. Explore Data

data <- read.csv("/Users/keonhoseo/Documents/Q2/STAT 630/Week 3/food.csv")
data

rownames(data)=data[,1]
data=data[,-1]

unscaled_data <- data

data=scale(data)
data=as.data.frame(data)

#### Step 1 , Explore the data
dim(data)
var1=var(data$Bread)
var2=var(data$Hamburger)
var3=var(data$Butter)
var4=var(data$Apples)
var5=var(data$Tomato)

This dataset is a table of grocery prices in U.S. cities. It has information on 24 U.S. cities and 5 kinds of groceries, so I can treat it as 5-dimensional data, which makes it difficult to identify comparable cities directly. The purpose of conducting PCA here is to explain the cities with a few variables so that we can compare the cities easily.

2. Define the problem in terms of PCs

sigma <- var(data)
sigma

vars=diag(sigma)
percentvars=vars/sum(vars)

The data was standardized above, and here I compute the covariance matrix of the standardized data (equivalently, the correlation matrix), so that every grocery contributes on a comparable scale rather than letting high-priced items dominate the variance.

3. Compute all Eigenvalues/Eigenvectors

eigenvalues=eigen(sigma)$values
eigenvectors=eigen(sigma)$vectors

eigenvalues
eigenvectors

y=as.matrix(data)%*%eigenvectors
sum(vars)
sum(eigenvalues)

Decompose the dataset with eigenvectors and eigenvalues. The eigenvectors are directions of the variables, so they indicate the relations between the original variables and the principal components. The eigenvalues reveal how much of the original variance each paired eigenvector explains.

Here, every grocery has a negative loading on PC1, and only hamburger and tomato have positive loadings on PC2.

4. Check variance estimates of the pcs and all other properties

percentvars_pc = eigenvalues / sum(eigenvalues)
percentvars_pc

This shows each eigenvalue’s proportion of the total. The first two principal components hold about 67.37% of the total, which means the two components explain 67.37% of the variance in the data.

ts.plot(cbind(percentvars,percentvars_pc),col=c("blue","red"),xlab="ith vector",ylab="percent variance")

This draws a scree plot. The red line shows the eigenvalue proportions, and the slope flattens after the first 2 components, so we can choose the first two components as the variables for PCA.

5. Check correlation between components

y1=as.matrix(data)%*%(eigenvectors[,1])
y2=as.matrix(data)%*%(eigenvectors[,2])
y3=as.matrix(data)%*%(eigenvectors[,3])
rho_pc <- cor(y)
rho_pc

These are the correlations between the principal components. The off-diagonal values are close to 0, so we do not have to worry about multicollinearity among the components.

6. Regression

Let’s compare explanatory power for variances of origin variables and Principal components through regression analysis.

set.seed(1002)
dv=rowSums(unscaled_data)+rnorm(24,mean=0,sd=10)
summary(lm(dv~as.matrix(data)))
summary(lm(dv~as.matrix(y)))

If we put the complete original dataset and the complete set of principal components into regression analysis, the R-squared value and residual standard error are the same in both cases. This means the components explain all the variance in the original data: PCA only finds new axes through linear combinations, without manipulating the original data.

#cor(dv,data)
summary(lm(dv~y1+y2))
summary(lm(dv~data$Hamburger+data$Tomato))

Let’s put Hamburger and Tomato into the regression as inputs and compare with the principal-component regression result. The Hamburger and Tomato combination gets an R-squared of 0.7652, while the first two principal components score 0.9259. So the first two PCA components explain more variance than any two of the original variables.

7. Draw a plot

plot(y1,y2)
text(y1,y2, labels = rownames(data), pos = 3)

The observations are scattered by their principal component values. The X-axis (PC1) explains about 48.8% of the variance, and the Y-axis (PC2) accounts for 18.6%.

Since every grocery variable has a negative loading on PC1, being on the left of the X-axis means expensive in general; conversely, grocery prices in the cities on the right of the X-axis should be cheaper. So living in Anchorage and Honolulu should cost a lot, which makes sense.

On PC2 (Y-axis), the most important variable is apples, which have a strong negative loading on PC2. So we can expect apple prices in the cities higher up the Y-axis to be cheaper than in cities at the bottom. PC2 also has positive loadings on Hamburger and Tomato, which implies that if hamburger and tomato are expensive, the city should sit higher on the Y-axis.

Through the PCA plot, it is easy to compare grocery prices in U.S. cities and guess at living costs in those cities.

8. Unstandardized

data <- read.csv("/Users/keonhoseo/Documents/Q2/STAT 630/Week 3/food.csv")
rownames(data)=data[,1]
data=data[,-1]
rho=cor(data)
## Step 2 - Define the problem in terms of principal components
sigma=var(data)
vars=diag(sigma)
percentvars=vars/sum(vars)

## Step 3 - Compute all the eigenvalues and eigenvectors in R

eigenvalues=eigen(sigma)$values
eigenvectors=eigen(sigma)$vectors

# define principal componenets
y1=as.matrix(data)%*%(eigenvectors[,1])
y2=as.matrix(data)%*%(eigenvectors[,2])

y=as.matrix(data)%*%eigenvectors

set.seed(1002)
dv=rowSums(data)+rnorm(24,mean=0,sd=10)
summary(lm(dv~y1+y2))
summary(lm(dv~data$Hamburger+data$Tomato))

The PCA regression results above are a little different from the ones we saw in 6. Regression, while the regression on the original data is the same. This is because I put unstandardized data into the PCA this time. PCA works by finding directions of maximum variance, so if the data is not scaled, the results are distorted.

plot(y1,y2)
text(y1,y2, labels = rownames(data), pos = 4)

The plot is broadly similar to the standardized one, but slightly different.

Sentiment Analysis on Music Streaming Services Reviews

As a group project, I practiced sentiment analysis on music streaming service reviews. The purpose of the project is to find insights in the reviews, and I focused on uncovering relationships between events around the services and user reactions, whether software updates, offline events, or membership fee increases. To check this, I wanted to see review trends by word, period, and sentiment. The reviews I handled are from Pandora, Spotify, and Amazon Music, about 39,290 reviews in total.

  1. Manual Scoring
  2. Resampling
  3. Extract Data
  4. Findings
  5. Limitations

Manual Scoring

To practice sentiment analysis, I had to decide which lexicon (sentiment dictionary) to use. R provides lexicons such as ‘Bing’, ‘NRC’, and ‘AFINN’; I chose ‘Bing’ for the project. This decision is based on comparisons between manual judgments and lexicon scores: the manually judged datasets are about 1% of the total reviews, selected randomly. Here is the result of the comparison.

Spotify Imgur

Pandora Imgur

Amazon music Imgur

‘Bing’ showed the highest accuracy of the compared lexicons for every service. For some reason the results on negative reviews were weak, and we could not find a solution this time. (I assume this is related to the length of reviews and sarcasm.)

Another interesting point is the neutral reviews: their proportion is around 14% of every service’s reviews. I pondered how to resolve this, because 14% of the data is not small. I decided to deal with it probabilistically, because the sentiment trend of keywords is the only thing I want to know; I am not interested in each individual review.

Resampling

I have the 1% of manually scored reviews. Now, let’s treat this as a population and practice resampling. When we have enough samples, we can draw further samples from it and infer a distribution. So I extract 50 reviews at random from the manually scored samples many times and deduce the proportions of positive and negative within the neutral reviews.

Here is an R code for resampling and a result of Amazon music.

library(tidytext)
library(readr)
library(dplyr)
library(anytime)
library(padr)     # pad() used in frequency_month() below
library(ggplot2)  # ggplot() used in word_graph() below

# Load manually scored review files here. ("../service_scored.csv")
service <- read_csv("https://raw.githubusercontent.com/seonnyseo/Streaming_Sentiment_Analysis/master/Data/amazon_scored.csv")
  
# I decided to consider ambiguous reviews as negative
service$score <- ifelse(service$score == 0, -1, 1)
  
# Split reviews by each word 
tidy_service <- service %>% unnest_tokens(word, review)
  
bing_sentiments <- tidy_service %>% inner_join(get_sentiments("bing"), by = "word")
bing_sentiments$score <- ifelse(bing_sentiments$sentiment == "negative", -1, 1)
bing_aggregate <- bing_sentiments %>% 
  select(review_id, score) %>% 
  group_by(review_id) %>% 
  summarise(bing_score = sum(score))
  
score_compare_service <- merge(x = service, y = bing_aggregate, all.x = TRUE, by = 'review_id')
score_compare_service[is.na(score_compare_service)] <- 0
  

# score_compare_service(review_id, date, review, score, bing_score)
# Pick 50 each time 
resampling <- function(service){
  
  neutral_average <- 0
  postive_average <- 0
  negative_average <- 0
  
  for(i in c(2000:3000)){
    set.seed(i)
    random_data <- service[sample(nrow(service), 50),]
    
    neutral_count <- sum(random_data$bing_score == 0)
    positive_neutral <- sum(random_data$bing_score == 0 & random_data$score == 1)
    negative_neutral <- sum(random_data$bing_score == 0 & random_data$score == -1)
    
    neutral_average <- neutral_average + neutral_count/50
    postive_average <- postive_average + positive_neutral/neutral_count
    negative_average <- negative_average + negative_neutral/neutral_count
  }
  cat(sprintf("Neutral : %.3f  Positive : %.3f  Negative : %.3f\n", 
              neutral_average/1000, postive_average/1000, negative_average/1000))
}
  
# Run resampling
resampling(score_compare_service)
## Neutral : 0.140  Positive : 0.710  Negative : 0.291

This is the result of resampling, presented as Neutral / (Positive | Neutral) / (Negative | Neutral):

Pandora   0.14   0.54   0.46
Spotify   0.13   0.59   0.41
Amazon    0.14   0.70   0.30

Extract Data

My purpose is to see a change in trend of sentiment by keywords. Thus, expected outcome is a graph that shows time period in x axis and quantity of reviews by sentiment in y axis. R code is composed of three functions to implement a graph.

  1. Pre-Process
  2. Extract reviews
  3. Create a graph

Pre-Process

pre_process <- function(service)
{
  # PreProcessing
  service$review_id <- 1:nrow(service)
  service$date <- anydate(service$date)
  service <- service[c(3,1,2)]
  
  # Sentiment Score
  tidy_service <- service %>% unnest_tokens(word, review)
  
  # Edit Dictionary, I only add 4 words with sentiment at this time. This can be expanded later. 
  bing_edit <- rbind(get_sentiments("bing"), c("commercial", "negative"))
  bing_edit <- rbind(bing_edit, c("commercials", "negative"))
  bing_edit <- rbind(bing_edit, c("ad", "negative"))
  bing_edit <- rbind(bing_edit, c("ads", "negative"))
  bing_edit <- rbind(bing_edit, c("wish", "negative"))
  
  bing_sentiments <- tidy_service %>% inner_join(bing_edit, by = "word")
  
  bing_sentiments$score <- ifelse(bing_sentiments$sentiment == "negative", -1, 1)
  bing_aggregate <- bing_sentiments %>% select(review_id, score) %>% group_by(review_id) %>% summarise(bing_score = sum(score))
  
  service <- merge(x = service, y = bing_aggregate, all.x = TRUE, by = 'review_id')
  service[is.na(service)] <- 0
  service$bing_judgement <- ifelse(service$bing_score > 0, "positive", 
                                   ifelse(service$bing_score < 0, "negative", "neutral" ))
  
  return(service)
}

The pre-process function is called only once for each service dataset. In this step, the function cleans the data and judges the sentiment of reviews based on the Bing lexicon. I also edited the Bing dictionary by adding some words that are not included, such as ‘commercials’, ‘ad’, and ‘wish’. Those words connote negative emotion in many reviews, so I wanted to treat them as negative words.

So the dataset acquires a sentiment score column after passing through this function.

Extract reviews

word_data

word_data <-function(service, start, end, word, sentiment){

  word <- tolower(word)
  
  # Filter Data between start date & end date
  extracted <- service[service$date >= start & service$date <= end,]
  # Filter Date that only contains word
  extracted <- extracted[grepl(word, tolower(extracted$review)),]
  
 
  set.seed(101)

  # Neutral / (Positive|Neutral) / (Negative/Neutral)
  # Pandora	0.14	0.54	0.46
  # Spotify	0.13	0.59	0.41
  # Amazon	0.14	0.70	0.30
  
  # Pick the positive weight for this service (from the resampling results above)
  positive_weight <- switch(service$service[1],
                            "pandora" = 0.54,
                            "spotify" = 0.59,
                            0.70)
    
  neutral_reviews <- extracted[extracted$bing_judgement == "neutral",] %>% select(review_id)
  positive_neutral <- neutral_reviews[sample(nrow(neutral_reviews), nrow(neutral_reviews) * positive_weight),]
  negative_neutral <- neutral_reviews[!(neutral_reviews$review_id %in% positive_neutral),]
    
  extracted$bing_judgement <- ifelse(extracted$review_id %in% positive_neutral, "positive",
                                      ifelse(extracted$review_id %in% negative_neutral, "negative",
                                            extracted$bing_judgement))
  
  
 
  extracted <- extracted[extracted$bing_judgement == sentiment,]
  
  return(extracted)
}

This function extracts data by service, period, word, and sentiment from the pre-processed raw dataset. As mentioned above, the extracted dataset includes around 14% neutral reviews, so the function handles neutral reviews with the probabilities estimated from the resampling results: it randomly splits neutral reviews into positive or negative following those weights.

frequency_month

frequency_month <- function(service, start, end, word, sentiment){
  
  extracted <- word_data(service, start, end, word, sentiment)

  # Make year-month column
  extracted$year_month <- anydate(format(as.Date(extracted$date), "%Y-%m"))

  frequency_df <- extracted %>% group_by(year_month) %>% summarise(frequency = n())
  frequency_df <- frequency_df %>% pad(interval = 'month', start_val = anydate(start), end_val = anydate(end))
  frequency_df[is.na(frequency_df)] <- 0
  return (frequency_df)
}

I only want the monthly frequency of reviews. This function counts how many reviews containing the keyword were posted in each month of the given period.

word_graph

word_graph <- function(service, word, start, end){

  positive = frequency_month(service, start, end, word, "positive")
  negative = frequency_month(service, start, end, word, "negative")
  
  positive$sentiment <- 'positive'
  negative$sentiment <- 'negative'
  
  frequency_df <- positive %>% full_join(negative)
  
  ret <- ggplot(frequency_df, aes(x = year_month)) +
        geom_line(aes(y = frequency, col = sentiment)) +
        theme(axis.title.x=element_blank())
  
  return(ret)
}

The user calls this function to see the results: given a service, keyword, and date range, it plots the monthly counts of positive and negative reviews.

Kaggle Titanic Competition - Decision Tree

Import the packages first. I will only use a decision tree in this post, so I import tree from sklearn.

import pandas as pd       
import numpy as np        
from sklearn import tree  

(I worked on this in PyCharm, which did not display all columns by default. This line lets me check all columns in the console.)

pd.set_option('display.max_columns', None)

Read train/test csv files.

train_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_path)

test_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_path)

Look at the train and test datasets.

The train set has a ‘Survived’ column, which is not included in the test set. Also, the train set is missing 177 Age values, and the test set is missing 86 Age values and 1 Fare value. So I should fill in the missing data first (data cleaning).

print('-'*15 + 'Train Set' + '-'*15 + '\n', train.describe())
print('-'*15 + 'Test Set' + '-'*15 + '\n', test.describe())
---------------Train Set---------------
        PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
---------------Test Set---------------
        PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
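
As a quick check of the missing values described above, pandas can count them per column directly. This is a small verification step I added, not part of the original guide; it assumes the train and test DataFrames loaded earlier. Cabin and Embarked also have gaps, but only Embarked is touched during the cleaning below.

print(train.isnull().sum())   # the train set should show 177 missing Age values
print(test.isnull().sum())    # the test set should show 86 missing Age values and 1 missing Fare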

Data Cleaning.

For now, I replace the missing data with the median value of each column.

train['Age'] = train['Age'].fillna(train['Age'].median())  
test['Age'] = test['Age'].fillna(test['Age'].median())     
test['Fare'] = test['Fare'].fillna(test['Fare'].median())  

Map the string values to integers for convenience when manipulating and searching the data.

#Former way that how I changed data. Mapping is clear.
#train["Sex"][train["Sex"] == "male"] = 0
#train["Sex"][train["Sex"] == "female"] = 1

sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

train["Embarked"] = train["Embarked"].fillna("S")  # fillna on a Series takes the fill value directly
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
train["Embarked"] = train["Embarked"].map(embarked_mapping)

Extract the ‘Survived’ values from the train set as the target. Also select ‘Pclass’, ‘Sex’, ‘Age’, and ‘Fare’ as features.

This means I assume ticket class (‘Pclass’), sex, age, and fare are related to survival. Then create a decision tree classifier from tree and fit it on the features and target.

target = train['Survived'].values
train_features = train[['Pclass', 'Sex', 'Age', 'Fare']].values

decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(train_features, target)

feature_importances_ shows how much each feature contributes to the prediction. Here, ‘Sex’ and ‘Fare’ appear more important than ticket class and age.

score returns the mean accuracy on the given features and target.

print(decision_tree.feature_importances_)
print(decision_tree.score(train_features, target))
[0.12968841 0.31274009 0.23646917 0.32110233]
0.9775533108866442

Apply the selected features to the test data and predict. Here, the prediction is whether a passenger survived (1) or not (0).

test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
prediction = decision_tree.predict(test_features)
print(prediction)
[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0
 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]

Then make a CSV file to submit to Kaggle and check the result.

# Report (Make csv file) 

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
decision_tree_solution = pd.DataFrame(prediction, PassengerId, columns = ["Survived"])

# Write a solution to a csv file with the name my_solution.csv
decision_tree_solution.to_csv("decision_tree_solution.csv", index_label = ["PassengerId"])

Here is my first result. The accuracy is about 0.722. Note the gap between the training accuracy above (about 0.978) and this score; that gap already hints at over-fitting.

Fig. Kaggle submission result

So, there are limitations.

- Data Cleaning - I replaced missing data with the median value of each column, which is not ideal. There are several ways to improve this; see the imputation sketch after this list.

- Decision Tree - Decision trees are prone to over-fitting, and sklearn provides parameters to control this; see the pruning sketch after this list. With over-fitting, the model fits the training set almost perfectly but loses accuracy on the test set.

- Model - DataCamp also teaches Random Forest in the same course. I can also study the diverse models other people use to solve this problem and which algorithms they apply.
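
For the data-cleaning point, one possible improvement is to fill Age with the median of each Pclass/Sex group rather than a single global median. This is a sketch I added, not part of the original walkthrough; it assumes the train and test DataFrames from above and is meant to run in place of the global fillna(median) step:

# Fill Age with the median age of each Pclass/Sex group instead of one global median.
# Run this instead of the global train['Age'].fillna(train['Age'].median()) step above.
for df in (train, test):
    df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(lambda s: s.fillna(s.median()))
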
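
For the over-fitting point, sklearn's DecisionTreeClassifier exposes parameters such as max_depth and min_samples_split that limit how far the tree grows, and cross-validation gives a more honest accuracy estimate than scoring on the training set. A minimal sketch under those assumptions (the parameter values are illustrative, not tuned):

from sklearn import tree
from sklearn.model_selection import cross_val_score

# A constrained tree cannot memorize the training set the way the default, fully grown tree does.
pruned_tree = tree.DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=1)

# 5-fold cross-validation on the training data; compare this with the 0.978
# training accuracy reported above to see the size of the over-fitting gap.
print(cross_val_score(pruned_tree, train_features, target, cv=5).mean())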

20181223

I started working on the Titanic Survival Prediction dataset in Kaggle on Dec. 20. I wasn't sure how to start the project, and then I found a DataCamp course that helped me run an elementary analysis. I followed the DataCamp guide, which includes predicting with a Decision Tree and a Random Forest, and completed it the same day. Working in DataCamp was simple since I just read explanations and filled in blanks on code lines.

After the coursework, I read some articles about the Titanic project on Kaggle, which helped me understand the importance of feature engineering. One of them is Titanic Survival Predictions (Beginner); so far I have skimmed it and understood what the author did in her work.

Today, I tried to build the code on my local computer myself and found my weaknesses. From this work I ran into problems that did not come up in DataCamp, and most of them were related to my knowledge of the pandas and NumPy packages, which means I spent a lot of time just manipulating data.

So I decided to learn pandas and NumPy first (it took a while to get used to R before, too) and then read more work on Kaggle. I don't expect to find a new methodology that produces a better result on the project, but I can learn something more from the study.