Skip to main content

Titanic: A case study for predictive analysis on R (Part 1)

Kaggle.com is a popular community of data scientists, which holds various competitions of data science. The article performs predictive analysis on a benchmark case study -- Titanic, picked from Kaggle.com -- in-depth.

The case study is a classification problem, where the objective is to determine which class does an instance of data belong to. This can also be called prediction problem, because we are predicting class of a record based on its attributes.

Note: This tutorial requires some basic R programming background. If you haven't yet gotten yourself acquainted with R, maybe this is the right time. Codeacademy's tutorial is my personal recommendation. We will be using RStudio here, the most used IDE for 'R' language. It is free and open-source, you can download it here.

Dataset:
RMS Titanic was a British cruise that sank on its course in the North Atlantic Ocean on its maiden voyage. 1502 people, out of 2224 on board lost their lives in this disaster. Due to lack of lifeboats, the death toll was so high. When the data was gathered about the passengers that survived or killed, it was observed that some people, like women, children and those belonging to upper-class survived more than the others. Our objective is to depict the attributes of the people who survived with as much accuracy as possible.
We have a set of 2 data files: train, which contains records of 891 passengers with label Survived = 0 (did not survive) or 1 (survived); the other one is test, which has 418 records without the information of survival. We want to predict this and post to Kaggle.com to check how accurately we were able to predict this label.

Please go on and download the two files (train.csv and test.csv) from here. You can alternatively download them from this link as well. The first file, train.csv contains records in which the value of target variable Survived is given; we will use this to generate a classification model, to be used for prediction. The other file, test.csv contains data about different passengers, but the information that the passenger survived or not is not provided. We will apply our learnt model on this data to see which passengers are predicted to live or perish.

That'll be all about the case study, let's start the real thing.


Reading data
First, set the working directory to wherever you have placed the downloaded files.

> setwd("D:/Datasets/Titanic")

read.csv() funtion can be used to read a CSV file in table format and save in a data frame 
> dataset <- read.csv("train.csv")

Check what variables are available.
> colnames(dataset)



PassengerID is a unique identifier of patients; Survived is the target variable; Pclass represents in which class the Passenger traveled; Name, Sex and Age are the demographics; SibSp is the count of siblings and spouse on board; Parch is the count of parents and children on board; Ticket is the ticket number; Fare is the amount paid as fair; Cabin is the cabin numbers reserved for the passenger; Embarked tells which port did the passenger get on board from.

Let's now have a sneak peek of how the data looks like using head() function :


We can clearly notice a few things here. First, some data is available in ready-to-use form, like Sex; some fields like Age and Cabin are missing values; variables like Name can be fine-tuned to make them meaningful, or derive some information from them.

All of this is Pre-processing task, in which we clean-up the data, fill in missing values, select and deselect fields and derive new variables before running analysis algorithms.

Base package of R contains a very handy function, summary() that tells some primary statistics about the data.


There are 314 females and 477 males in the training data; most passengers on board were 28 years old, while the age is not available for 177 passengers; maximum fare paid is 512.33 pounds. You can get get an overall picture of the data just by using this command. Have a keen look.

Submitting on Kaggle:
In order to evaluate our model, we will have to submit our results on Kaggle.com by predicting value of Survived variable for each passenger in the test.csv file.

We know that during disasters, women and children are given preference over males and adults. We first see what our training data tells us about this.

Run table() function on variables Survived and Sex to create a distribution table over these two variables:


As observed, the probability of female-survived is 0.26, while probability of male-died is 0.525. Which means, we can tell with some confidence that females survive and males do not.

Let's test this hypothesis on test data. Read test data in another data frame and introduce a new column, Survived with 0 as default value.



Next, we set Suvived = 1 for all females. The command below says: "set variable Survived of testset to 1 wherever value of variable Sex is female".

> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1

Now, we will export PassengerID and Survived variables into a new file.

> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="all_femailes_survive.csv", row.names=FALSE)

Let's submit this new file to Kaggle.com's leaderboard in Titanic competition.

Go to "Make a submission" page in Titanic Disaster challenge and Accept the terms and conditions. The next page will ask whether you want to take part as a team or as an individual. You can choose individual (you can later add people to make a team).
On the submission page, browse for the newly created file "all_families_survive.csv" and submit. 




The results take some time to compute on Kaggle...... there you go!



This just means we have achieved 76.55% accuracy. This, only using Gender attribute.

In our next attempts, we will try to improve our accuracy by remaining data.

Have fun till then...

Comments

Popular posts from this blog

Playing in Amazon's Clouds - Introduction to Elastic Computing Cloud - Part 1

A really brief Intro.. Researcher, Trying to execute an extremely computationally resource hungry experiment? App developer, unsure of how much data you'll be collecting from the users? Student, tasked to build your FYP (final year project) on distributed computing environment? Just an ordinary techie trying to catch up with the world? If you're any of these, you cannot escape the fact that Cloud computing is storming in and you have to engage yourself actively in it. Adopt it, or perish. I'm a newbie (better say wannabe) in this massive web of computing, and here just to share some experiences I'm having - successes and failures. First of all, Cloud computing is nothing new, it has been there for over 3 decades and was referred with names like Grid computing  and Distributed computing . It was business people that came up with a catchy name to attract business. The idea behind distributed computing is simple. We create a network of computers t...

How to detach from Facebook... properly

Yesterday, I deactivated my Facebook account after using it for 10 years. Of course there had to be a very solid reason; there was, indeed... their privacy policy . If you go through this page, you might consider pulling off as well. Anyways, that's not what this blog post is about. What I learned from yesterday is that the so-called "deactivate" option on Facebook is nothing more than logging out. You can log in again without any additional step and resume from where you last left. Since I really wanted to remove myself from Facebook as much as I can, I investigated ways to actually delete a Facebook account. There's a plethora of blogs on the internet, which will tell you how you can simply remove Facebook account. But almost all of them will either tell you to use "deactivate" and "request delete" options. The problem with that is that Facebook still has a last reusable copy of your data. If you really want to be as safe from its s...

Yet another Blog on Query Optimization for MySQL Server

If you have been into MIS development for some time, then you may have realized that buying latest, multi-thousand-dollar Machine, stuffed with a top notch processor and an army of memory chips is not sufficient to your needs when it comes to processing large data, especially when your DBMS is MySQL Server. In this article, I have tried to input  the tips and techniques to-be-followed - some in general and some specific to MySQL Server; but I would, as every blogger, repeat the same common phrase that " in the end   it all depends on your scenario ". The results you are going to see will mostly be in milliseconds so before thinking "is it worth the effort if the result is in a few milliseconds?", do know that these results are derived using a very very simple database with not more than 100000 records in a table.  With complex databases and records in millions, the effort will pay you back. Coming straight to topic, here are some points you should not ign...