
Titanic: A case study for predictive analysis on R (Final)

In our previous attempts to predict whether a passenger is likely to survive the Titanic disaster, we used some statistics and machine learning models to classify the passengers.

In our final part, we will push our limits using advanced machine learning models, including Random Forests, Neural Networks, Support Vector Machines and other algorithms, and see how long we can torture our data before it confesses.

Let's resume from where we left off. We are applying an implementation of the Random Forest classification method. In short, this model grows many decision trees on random subsets of the data and then uses majority voting across their predictions. This mitigates over-fitting, a common issue with single decision trees (learn more here).

> library(randomForest)
> formula <- as.factor(Survived) ~ Sex + Pclass + FareGroup + SibSp + Parch + Embarked + HasCabin + AgePredicted + AgeGroup 
> set.seed(seed)
> rf_fit <- randomForest(formula, data=dataset[dataset$Dataset == 'train',], importance=TRUE, ntree=100, mtry=1)
> varImpPlot(rf_fit)
> testset$PredSurvived <- predict(rf_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)

> write.csv(submit, file="rforest.csv", row.names=FALSE)

The results were not as promising as expected; we made no improvement with this algorithm. This suggests that our decision tree model was not over-fitting in the first place.

This is the point where we rethink our data. We noticed that missing Age is an important factor; some records are missing Fare and Embarked; we also extracted Title from the names; and from the seemingly useless Cabin variable, we derived a boolean variable, HasCabin.

Now let's have a look at Ticket.

> unique(dataset$Ticket)

Notice anything? We see some strings like PC, CA, SOTON, PARIS, etc. Without actually knowing what these represent, how about clipping off the digits and extracting only the alphabetic part? Here's how we'll do so (you'll need to install the stringr package if it's missing):

> library(stringr)
> dataset$TicketPart <- str_replace_all(dataset$Ticket, "[^[:alpha:]]", "")

> dataset$TicketPart <- as.factor(dataset$TicketPart)
> plot(table(dataset$TicketPart[dataset$TicketPart != '']))
The plot reveals that some prefixes appear frequently. These might hint at where the passenger came from.

Next, we can use SibSp and Parch to determine the size of each family on board. The reasoning is that if more members of a family are on board, they can support each other, improving their chances of survival.

> dataset$FamilySize <- dataset$SibSp + dataset$Parch + 1
# +1 for the passenger himself

Torturing the data even more, we'll explore the Name variable further. We notice that apart from Title, we can also extract Surname, since names are in the format [Surname], [Title] [Given Names]:

> dataset$Surname <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})
> dataset$Surname <- factor(sub(' ', '', dataset$Surname))

We are only interested in frequent surnames, so we can collapse the levels where family size is less than 3:

> dataset$FamilyID <- paste(as.character(dataset$FamilySize), dataset$Surname, sep="")
> dataset$FamilyID[dataset$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(dataset$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> dataset$FamilyID[dataset$FamilyID %in% famIDs$Var1] <- 'Small'
> dataset$FamilyID <- factor(dataset$FamilyID)
> plot(table(dataset$FamilyID[dataset$FamilyID != 'Small']))
As visible, we have several records sharing the same FamilyID and family size. Therefore, we can conclude that Surname successfully grouped passengers of the same family.

There can be many ways to continue torturing this data, but we will now limit ourselves to these variables only.

Now we'll apply SVM and other models (wildly) and see which combination of variables works best for us.

> library(e1071)
> formula <- as.factor(Survived) ~ Sex + AgeGroup + Pclass + FareGroup + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart
> svm_fit <- svm(formula, data=dataset[dataset$Dataset == 'train',])
> testset$PredSurvived <- predict(svm_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)
> write.csv(submit, file="svm.csv", row.names=FALSE)
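
SVM results depend heavily on the kernel parameters. As a sketch (the parameter grids below are illustrative, not the ones used in my actual experiments), e1071's tune.svm can search over gamma and cost and keep the best-performing model:

> tuned <- tune.svm(formula, data=dataset[dataset$Dataset == 'train',], gamma=10^(-3:-1), cost=10^(0:2))
> summary(tuned)
> svm_fit <- tuned$best.model

The grid search cross-validates every gamma/cost combination, so expect it to take noticeably longer than a single svm() fit.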

I continued my experiments on different models, not all of which can be described here; for ease, I have created this R script and partitioned the code into functions.

What should be noted is that later in my experiments, I used the continuous Age and Fare variables instead of their discretized versions for better accuracy (yes, the execution time increases as a result).
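
Concretely, that change amounts to swapping the discretized variables out of the formula. As a sketch (the exact variable combination used in the script may differ):

> formula <- as.factor(Survived) ~ Sex + Age + Pclass + Fare + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin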

Here are the results from some of the models:

Random Forest: (score screenshot)

Neural Networks: (score screenshot)
For our experiments so far, CForest proved to be the top performer. But please don't stop here; apply your own ideas of twisting and squashing data to gain more accuracy.
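
For reference, here is a minimal cforest sketch using the party package (the ntree and mtry values are illustrative, not tuned):

> library(party)
> set.seed(seed)
> cf_fit <- cforest(formula, data=dataset[dataset$Dataset == 'train',], controls=cforest_unbiased(ntree=500, mtry=3))
> testset$PredSurvived <- predict(cf_fit, dataset[dataset$Dataset == 'test',], OOB=TRUE, type="response")

Unlike randomForest, cforest builds conditional inference trees, which handle factors with many levels (like FamilyID) more gracefully.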

For now, I guess this series should serve as a good starter on predictive analytics. You can find a variety of different problems to participate in and polish your analytics spectacles.

Please feel free to comment and maybe share your score...

