
Titanic: A case study for predictive analysis on R (Part 4)

Working with the Titanic data set from Kaggle's competition, we predicted passenger survival with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of simply imputing the mean or median. Let's start with Age.

Looking at the available data, we can hypothesize that Age correlates with attributes like Title, Sex, Fare and HasCabin. Also note that we previously created the variable AgePredicted; we will use it here to identify which records were filled earlier.
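Before fitting anything, a quick look at the average observed Age per Title lends some support to this hypothesis. This is a small sketch, not from the original post, assuming the Title and AgePredicted variables created in the earlier parts:

> aggregate(Age ~ Title, data=dataset[dataset$AgePredicted == 0,], FUN=mean)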

> library(rpart)
> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)

> dataset$Age[dataset$AgePredicted == 1] <- PredAge

The table(PredAge) call gave us the following:

  2  23  25 
  8 154 101 

This means the values 2, 23 and 25 were predicted as Age for 8, 154 and 101 records, respectively.
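As a quick sanity check (not part of the original flow), we can compare the imputed ages against the observed ones, using the AgePredicted flag from the earlier part:

> summary(dataset$Age[dataset$AgePredicted == 0])   # originally observed ages
> summary(dataset$Age[dataset$AgePredicted == 1])   # ages filled in by the tree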

Furthermore, instead of setting fixed ranges for AgeGroup by judgement, we will use k-means clustering to derive the age groups. The commands below create 7 clusters over the Age variable; the second line assigns each record in the dataset a numeric cluster ID.

> k <- kmeans(dataset$Age, 7)
> dataset$AgeGroup <- k$cluster

Let's have a peek at the centers of these clusters as well as their distribution:
> k$centers
       [,1]
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379
7  4.117021

> table(k$cluster)

  1   2   3   4   5   6   7 
127 139  59 338 198 354  94 
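One thing worth noting: k-means assigns arbitrary cluster IDs, so AgeGroup 1 is not necessarily the youngest group (here, cluster 7 holds the youngest passengers). If an ordinal grouping is preferred, the clusters can be relabelled by their centers; a minimal sketch, not part of the original post:

> ord <- order(k$centers)                    # cluster IDs sorted from youngest to oldest center
> dataset$AgeGroup <- match(k$cluster, ord)  # relabel so 1 = youngest group, 7 = oldest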

Let's see if we have any improvement in our results:

> formula <- Survived ~ Sex + AgeGroup + Pclass
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="rpart_learn_age.csv", row.names=FALSE)
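Before submitting, we can also get a rough local estimate by scoring the model on the training rows. This is only a sketch; in-sample accuracy is optimistic compared to the leaderboard score:

> trainset <- dataset[dataset$Dataset == 'train',]
> mean(predict(rpart_fit, trainset, type="class") == trainset$Survived)   # in-sample accuracy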

Hurrah! We climbed to 310th position with 80.383% accuracy. (Note that ranks improve over time as the competition slows down.)

We'll end our data pre-processing here. Next, we will try some more classification models, like random forests and support vector machines, and see if we can do any better than this.

