Skip to main content

Titanic: A case study for predictive analysis on R (Final)

Our previous attempt to accurately predict whether a passenger is likely to survive, a competition from We used some statistics and machine learning models to classify the passengers.

In our final part, we will push our limits using advanced machine learning models, including Random Forests, Neural Networks, Support Vector Machines and other algorithms, and see how long we can torture our data before it confesses.

Let's resume from where we left. We are applying an implementation of Random forest method of classification. Shortly, this model grows many decision trees and then uses a voting system to decide which trees to pick. This way, the common issue with decision trees, over fitting is mitigated (learn more here).

> library(randomForest)
> formula <- as.factor(Survived) ~ Sex + Pclass + FareGroup + SibSp + Parch + Embarked + HasCabin + AgePredicted + AgeGroup 
> set.seed(seed)
> rf_fit <- randomForest(formula, data=dataset[dataset$Dataset == 'train',], importance=TRUE, ntree=100, mtry=1)
> varImpPlot(rf_fit)
> testset$PredSurvived <- predict(rf_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="rforest.csv", row.names=FALSE)

The results were not as promising as expected. We did not make any improvements using this algorithm. This indicates that the decision tree model is not over-fitting.

This is the point where we rethink our data. We noticed that missing Age is an important factor; some records are missing Fare and Embarked; we also extracted Title from names; we derived from seemingly useless variable, Cabin, a boolean variable HasCabin.

Now let's have a look at Ticket.

> unique(dataset$Ticket)

Notice anything? We see some strings like PC, CA, SOTON, PARIS, etc. Now without actually knowing what these represent, how about clipping off the digits and extract only this part? Here's how we'll do so (you'll need to install stringr package if it's missing):

> library(stringr)
> dataset$TicketPart <- NULL
> dataset$TicketPart <- str_replace_all(dataset$Ticket, "[^[:alpha:]]", "")

> dataset$TicketPart <- as.factor(dataset$TicketPart)
> plot(table(dataset$TicketPart[dataset$TicketPart != '']))
The plot reveals that some parts appear frequently. These might hint at where the passenger is coming from.

Next, we can use SibSp and Parch to determine the size of family on board. The thought behind this is that if more members of a family are on board, they'll have high support, and chances of survival.

> dataset$FamilySize <- dataset$SibSp + dataset$Parch + 1
# +1 for the passenger himself

Torturing data even more, we'll explore Name variable even more. We notice that apart from Title, we can also extract Surname, since names are in format [Surname], [Title] [Given Names]

> dataset$Surname <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})
> dataset$Surname <- as.factor(sub(' ', '', dataset$Surname))

> dataset$Surname <- factor(dataset$Surname)

We are only interested in frequent names; we can reduce levels where family size is less than 3:

> dataset$FamilyID <- paste(as.character(dataset$FamilySize), dataset$Surname, sep="")
> dataset$FamilyID[dataset$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(dataset$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> dataset$FamilyID[dataset$FamilyID %in% famIDs$Var1] <- 'Small'
> dataset$FamilyID <- factor(dataset$FamilyID)
> plot(table(dataset$FamilyID[dataset$FamilyID != 'Small']))
As visible, we have several records with same FamilyID and size of family. Therefore, we can conclude that Surname successfully mapped passengers of same family.

There can be many ways to continue torturing this data, but we will now limit ourselves to these variables only.

Now we'll apply SVM and other models (wildly) and see what combination of variables worked for us.

> formula <- as.factor(Survived) ~ Sex + AgeGroup + Pclass + FareGroup + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart
> svm_fit <- svm(formula, data=dataset[dataset$Dataset == 'train',])
> testset$PredSurvived <- predict(svm_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="svm.csv", row.names=FALSE)

I continued my experiments on different models, each of which cannot be described here, but for ease, I have created this R script and partitioned the code into functions.

What should be noted is that later in my experiments, I used Age and Fare instead of their discretized variables for accuracy (yes, the execution time increases as a result).

Here are the results from some of the models:

Random Forest:


Neural Networks:


For our experiments so far, CForest proved to be the top performer. But please don't stop here; apply your own ideas of twisting and squashing data to gain more accuracy.

For now, I guess this series should serve as a good starter on Predictive analytics. You can find a variety of different problems on to participate in and polish your analytics spectacles.

Please feel free to comment and maybe share your score...


Popular posts from this blog

A faster, Non-recursive Algorithm to compute all Combinations of a String

Imagine you're me, and you studied Permutations and Combinations in your high school maths and after so many years, you happen to know that to solve a certain problem, you need to apply Combinations.

You do your revision and confidently open your favourite IDE to code; after typing some usual lines, you pause and think, then you do the next best thing - search on Internet. You find out a nice recursive solution, which does the job well. Like the following:

import java.util.ArrayList;
import java.util.Date;

public class Combination {
   public ArrayList<ArrayList<String>> compute (ArrayList<String> restOfVals) {
      if (restOfVals.size () < 2) {
         ArrayList<ArrayList<String>> c = new ArrayList<ArrayList<String>> ();
         c.add (restOfVals);
         return c;
      else {
         ArrayList<ArrayList<String>> newList = new ArrayList<ArrayList<String>> ();
         for (String o : restOfVals) {

Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 1

Okay. You just set up Hadoop on a single node on a VM and now wondering what comes next. Of course, you’ll run something on it, and what could be better than your own piece of code? But before we move to that, let’s first try to run an existing program to make sure things are well set on our Hadoop cluster.
Power up your Ubuntu with Hadoop on it and on Terminal (Ctrl+Alt+T) run the following command: $
Provide the password whenever asked and when all the jobs have started, execute the following command to make sure all the jobs are running: $ jps
Note: The “jps” utility is available only in Oracle JDK, not Open JDK. See, there are reasons it was recommended in the first place.
You should be able to see the following services: NameNode SecondaryNameNode DataNode JobTracker TaskTracker Jps

We'll take a minute to very briefly define these services first.
NameNode: a component of HDFS (Hadoop File System) that manages all the file system metadata, links, trees, directory structure, etc…

Titanic: A case study for predictive analysis on R (Part 1) is a popular community of data scientists, which holds various competitions of data science. The article performs predictive analysis on a benchmark case study -- Titanic, picked from -- in-depth.

The case study is a classification problem, where the objective is to determine which class does an instance of data belong to. This can also be called prediction problem, because we are predicting class of a record based on its attributes.

Note: This tutorial requires some basic R programming background. If you haven't yet gotten yourself acquainted with R, maybe this is the right time. Codeacademy's tutorial is my personal recommendation. We will be using RStudio here, the most used IDE for 'R' language. It is free and open-source, you can download it here.

RMS Titanic was a British cruise that sank on its course in the North Atlantic Ocean on its maiden voyage. 1502 people, out of 2224 on board lost their lives in this disaster. Due to lack of li…