
Titanic: A case study for predictive analysis on R (Part 1)

Kaggle is a popular community of data scientists that hosts various data science competitions. This article performs an in-depth predictive analysis on a benchmark case study, Titanic, picked from Kaggle.

The case study is a classification problem, where the objective is to determine which class an instance of data belongs to. It can also be called a prediction problem, because we are predicting the class of a record based on its attributes.

Note: This tutorial requires some basic R programming background. If you haven't yet acquainted yourself with R, maybe this is the right time. Codecademy's tutorial is my personal recommendation. We will be using RStudio here, the most widely used IDE for the R language. It is free and open-source, and you can download it from the RStudio website.

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on its maiden voyage. 1502 of the 2224 people on board lost their lives in this disaster. The death toll was so high largely because there were not enough lifeboats. When data was gathered about the passengers who survived or died, it was observed that some groups, such as women, children and upper-class passengers, survived more often than others. Our objective is to identify the attributes of the people who survived, with as much accuracy as possible.
We have a set of 2 data files: train, which contains records of 891 passengers with the label Survived = 0 (did not survive) or 1 (survived), and test, which has 418 records without the survival information. We want to predict this label and post the result to Kaggle to check how accurately we were able to predict it.

Please go ahead and download the two files (train.csv and test.csv) from here. You can alternatively download them from this link as well. The first file, train.csv, contains records in which the value of the target variable Survived is given; we will use it to build a classification model for prediction. The other file, test.csv, contains data about different passengers, but without the information on whether each passenger survived. We will apply our learnt model to this data to see which passengers are predicted to live or perish.

That is all about the case study; let's start the real thing.

Reading data
First, set the working directory to wherever you have placed the downloaded files.

> setwd("D:/Datasets/Titanic")

The read.csv() function can be used to read a CSV file in table format and save it in a data frame:
> dataset <- read.csv("train.csv")

Check what variables are available.
> colnames(dataset)
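
If you don't have the files at hand yet, here is a minimal sketch of the same two steps on a stand-in file. The three rows are made up by me for illustration (only the column names mirror train.csv), and toy_path and toy are hypothetical names, not part of the tutorial's code.

```r
# Toy stand-in for train.csv: made-up rows, same column names as the Kaggle file.
csv_text <- c(
  "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
  '1,0,3,"Doe, Mr. John",male,22,1,0,T1,7.25,,S',
  '2,1,1,"Roe, Mrs. Jane",female,38,1,0,T2,71.28,C85,C',
  '3,1,3,"Poe, Miss. Ann",female,,0,0,T3,7.92,,S'
)

# Write the text to a temporary file so read.csv() has something to read.
toy_path <- tempfile(fileext = ".csv")
writeLines(csv_text, toy_path)

# Same call as in the tutorial, just pointed at the toy file.
toy <- read.csv(toy_path)

colnames(toy)   # the 12 variable names
nrow(toy)       # number of records read
toy$Age         # the blank Age field in row 3 is read as NA
```

Note how the quoted Name field may contain a comma without breaking the columns, and how an empty numeric field becomes NA.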

PassengerId is a unique identifier of passengers; Survived is the target variable; Pclass represents the class in which the passenger traveled; Name, Sex and Age are the demographics; SibSp is the count of siblings and spouses on board; Parch is the count of parents and children on board; Ticket is the ticket number; Fare is the amount paid as fare; Cabin is the cabin number(s) reserved for the passenger; Embarked tells at which port the passenger boarded.

Let's now take a sneak peek at the data with head(dataset), which prints the first six rows.

We can clearly notice a few things here. First, some data is available in ready-to-use form, like Sex; some fields, like Age and Cabin, have missing values; and variables like Name can be fine-tuned to make them meaningful, or used to derive new information.

All of this is part of the pre-processing task, in which we clean up the data, fill in missing values, select and deselect fields, and derive new variables before running analysis algorithms.

The base package of R contains a very handy function, summary(), that reports some basic statistics about the data; run summary(dataset) to see them.

There are 314 females and 577 males in the training data; the median age of the passengers is 28 years, while the age is not available for 177 passengers; the maximum fare paid is 512.33 pounds. You can get an overall picture of the data just by using this command. Have a keen look.
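
To see where each of these figures comes from, here is a small sketch that computes them one by one. The five-row data frame is made up for illustration (toy and the variable names on the left are my own), so the numbers differ from the real training data.

```r
# Made-up rows; only the column names match the Titanic data.
toy <- data.frame(
  Sex  = c("male", "female", "female", "male", "male"),
  Age  = c(22, 38, NA, 35, NA),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

sex_counts  <- table(toy$Sex)                 # how many females and males
median_age  <- median(toy$Age, na.rm = TRUE)  # median of the known ages
missing_age <- sum(is.na(toy$Age))            # records with no age
max_fare    <- max(toy$Fare)                  # highest fare paid

# summary() reports the same kind of statistics for every column at once.
summary(toy)
```

The na.rm = TRUE argument matters: without it, median() returns NA as soon as a single age is missing.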

Submitting on Kaggle:
In order to evaluate our model, we will have to submit our results on Kaggle by predicting the value of the Survived variable for each passenger in the test.csv file.

We know that during disasters, women and children are given preference over men and adults. Let's first see what our training data tells us about this.

Run the table() function on the variables Survived and Sex to create a contingency table over these two variables (wrapping the call in prop.table() converts the counts to proportions).

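As observed, the proportion of passengers who were female and survived is 0.26, while the proportion who were male and died is 0.525. Put differently, about 74% of the women survived (233 of 314) and about 81% of the men died (468 of 577). This means we can say with some confidence that females survive and males do not.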
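
The two kinds of proportions are easy to mix up, so here is a short sketch of both. I am typing in the four counts by hand rather than reading train.csv; they are the counts I believe table() reports on the training data, so treat them as an assumption. They do agree with the 0.26 and 0.525 quoted above.

```r
# Survived x Sex counts, entered by hand (assumed to match train.csv).
# matrix() fills column by column: female first, then male.
counts <- as.table(matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# Joint proportions: every cell divided by the total of 891 passengers.
joint <- prop.table(counts)

# Proportions within each sex: every column sums to 1.
by_sex <- prop.table(counts, margin = 2)

round(joint, 3)    # female-and-survived is about 0.26, male-and-died about 0.525
round(by_sex, 3)   # about 0.74 of females survived, about 0.81 of males died
```

The joint table answers "what share of all passengers fall in this cell?", while the per-sex table answers "given the passenger's sex, how likely was survival?", which is the question our rule actually relies on.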

Let's test this hypothesis on the test data. Read the test data into another data frame and introduce a new column, Survived, with 0 as the default value.

Next, we set Survived = 1 for all females. The last command below says: "set variable Survived of testset to 1 wherever the value of variable Sex is female".

> testset <- read.csv("test.csv")
> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1
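
Here is the same rule on a made-up four-row data frame, in case you want to watch the logical indexing work without loading test.csv. The name toy_test and its rows are my own illustration; the ifelse() line is an alternative way of writing the rule, not what the tutorial uses.

```r
# Made-up stand-in for the test set.
toy_test <- data.frame(
  PassengerId = c(892, 893, 894, 895),
  Sex         = c("male", "female", "male", "female")
)

# Step 1: everybody is predicted to die.
toy_test$Survived <- 0

# Step 2: the comparison yields TRUE/FALSE per row,
# and only the TRUE rows are overwritten with 1.
toy_test$Survived[toy_test$Sex == "female"] <- 1

# The same rule in one step, for comparison.
one_step <- ifelse(toy_test$Sex == "female", 1, 0)

toy_test
```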

Now, we will export the PassengerId and Survived variables into a new file.

> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="all_females_survive.csv", row.names=FALSE)
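
Before uploading, it doesn't hurt to read the file back and check that it has the shape you expect. Below is a sketch of such a check on a made-up three-row submission written to a temporary folder; out_path and check are hypothetical names of mine.

```r
# Made-up predictions standing in for the real submission.
submit <- data.frame(
  PassengerId = c(892, 893, 894),
  Survived    = c(0, 1, 0)
)

# Write to a temporary location so nothing in the working directory is touched.
out_path <- file.path(tempdir(), "toy_submission.csv")
write.csv(submit, file = out_path, row.names = FALSE)

# Read it back and inspect it.
check <- read.csv(out_path)
colnames(check)                 # exactly two columns: PassengerId and Survived
nrow(check)                     # one row per passenger
all(check$Survived %in% c(0, 1))
```

row.names=FALSE is what keeps the file to two columns; without it, write.csv() adds an extra unnamed column of row numbers.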

Let's submit this new file to Kaggle's leaderboard in the Titanic competition.

Go to the "Make a submission" page in the Titanic Disaster challenge and accept the terms and conditions. The next page will ask whether you want to take part as a team or as an individual. You can choose individual (you can add people later to make a team).
On the submission page, browse for the newly created file "all_females_survive.csv" and submit.

The results take some time to compute on Kaggle...... there you go!

This means we have achieved 76.55% accuracy, using only the Sex attribute.
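
Accuracy here is simply the share of passengers whose predicted label matches the true one. Kaggle computes it against test labels we cannot see, but the calculation itself can be sketched on made-up data where the truth is known. The accuracy() helper and the eight rows below are my own illustration.

```r
# Share of predictions that equal the true labels.
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}

# Made-up passengers with known outcomes.
sex    <- c("male", "female", "female", "male", "male", "female", "male", "female")
actual <- c(0, 1, 1, 0, 1, 0, 0, 1)

# Our rule: females survive, males do not.
predicted <- ifelse(sex == "female", 1, 0)

# Six of the eight toy predictions match, so this gives 0.75.
accuracy(predicted, actual)
```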

In our next attempts, we will try to improve our accuracy by using the remaining data.

Have fun till then...

