
Titanic: A case study for predictive analysis on R (Final)

In our previous parts, we attempted to accurately predict whether a passenger of the Titanic is likely to survive, a competition hosted on Kaggle.com. We used some statistics and machine learning models to classify the passengers.

In our final part, we will push our limits using advanced machine learning models, including Random Forests, Neural Networks, Support Vector Machines and other algorithms, and see how long we can torture our data before it confesses.

Let's resume from where we left off, applying an implementation of the Random Forest classification method. In short, this model grows many decision trees on random subsets of the data and then lets them vote on each prediction. This way, over-fitting, the common issue with a single decision tree, is mitigated (learn more here).

> library(randomForest)
> formula <- as.factor(Survived) ~ Sex + Pclass + FareGroup + SibSp + Parch + Embarked + HasCabin + AgePredicted + AgeGroup 
> set.seed(seed)
> rf_fit <- randomForest(formula, data=dataset[dataset$Dataset == 'train',], importance=TRUE, ntree=100, mtry=1)
> varImpPlot(rf_fit)
> testset$PredSurvived <- predict(rf_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)

> write.csv(submit, file="rforest.csv", row.names=FALSE)
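Incidentally, randomForest gives us a quick sanity check for free: printing the fitted model shows the out-of-bag (OOB) error estimate, a built-in approximation of test error:

> print(rf_fit)        # OOB error estimate and confusion matrix
> importance(rf_fit)   # the numbers behind varImpPlot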

The results were not as promising as expected; we did not make any improvement using this algorithm, which suggests that our earlier decision tree model was not over-fitting to begin with.

This is the point where we rethink our data. We noticed that missing Age is an important factor; some records are missing Fare and Embarked; we also extracted Title from the names; and we derived a boolean variable, HasCabin, from the seemingly useless Cabin variable.
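For readers joining at this part, the Fare and Embarked fixes amounted to simple imputations along these lines (a minimal sketch; the exact code lives in the previous parts and may differ):

> # Fill the missing Fare value with the overall median fare
> dataset$Fare[is.na(dataset$Fare)] <- median(dataset$Fare, na.rm=TRUE)
> # Fill the blank Embarked values with the most frequent port
> dataset$Embarked[dataset$Embarked == ''] <- names(which.max(table(dataset$Embarked[dataset$Embarked != ''])))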

Now let's have a look at Ticket.

> unique(dataset$Ticket)

Notice anything? We see some strings like PC, CA, SOTON, PARIS, etc. Without actually knowing what these represent, how about clipping off the digits and extracting only this part? Here's how we'll do so (you'll need to install the stringr package if it's missing):

> library(stringr)
> dataset$TicketPart <- str_replace_all(dataset$Ticket, "[^[:alpha:]]", "")

> dataset$TicketPart <- as.factor(dataset$TicketPart)
> plot(table(dataset$TicketPart[dataset$TicketPart != '']))
The plot reveals that some parts appear frequently. These might hint at where the passengers came from.
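To see the exact counts behind the plot, we can tabulate the prefixes directly:

> head(sort(table(dataset$TicketPart), decreasing=TRUE))  # '' marks tickets with no alphabetic part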

Next, we can use SibSp and Parch to determine the size of a passenger's family on board. The thought behind this is that if more members of a family are on board, they can support one another, improving their chances of survival.

> dataset$FamilySize <- dataset$SibSp + dataset$Parch + 1
# +1 for the passenger himself
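If a categorical view works better for a particular model, FamilySize can also be binned; here's a sketch with a hypothetical FamilySizeGroup variable (the breaks are illustrative, and it is not used in the models below):

> dataset$FamilySizeGroup <- cut(dataset$FamilySize, breaks=c(0, 1, 4, 11),
+                                labels=c('Single', 'Small', 'Large'))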

Torturing the data even more, we'll explore the Name variable further. We notice that apart from Title, we can also extract Surname, since names are in the format [Surname], [Title] [Given Names]:

> dataset$Surname <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})
> dataset$Surname <- factor(sub(' ', '', dataset$Surname))

We are only interested in frequent surnames; we can collapse families of size 2 or smaller, along with any FamilyID occurring no more than twice, into a single 'Small' level:

> dataset$FamilyID <- paste(as.character(dataset$FamilySize), dataset$Surname, sep="")
> dataset$FamilyID[dataset$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(dataset$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> dataset$FamilyID[dataset$FamilyID %in% famIDs$Var1] <- 'Small'
> dataset$FamilyID <- factor(dataset$FamilyID)
> plot(table(dataset$FamilyID[dataset$FamilyID != 'Small']))
As visible, we have several records sharing the same FamilyID, consistent with the family sizes. Therefore, we can conclude that Surname successfully grouped passengers of the same family.

There can be many ways to continue torturing this data, but we will now limit ourselves to these variables only.

Now we'll apply SVM and other models (wildly) and see which combination of variables works for us.

> formula <- as.factor(Survived) ~ Sex + AgeGroup + Pclass + FareGroup + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart
> library(e1071)
> svm_fit <- svm(formula, data=dataset[dataset$Dataset == 'train',])
> testset$PredSurvived <- predict(svm_fit, testset)
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)
> write.csv(submit, file="svm.csv", row.names=FALSE)

I continued my experiments on different models, not all of which can be described here, but for convenience, I have created this R script and partitioned the code into functions.

It should be noted that later in my experiments, I used Age and Fare instead of their discretized counterparts (AgeGroup and FareGroup) for better accuracy (yes, the execution time increases as a result).
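In other words, the formula changes along these lines (assuming Age holds the predicted ages from the earlier parts):

> formula <- as.factor(Survived) ~ Sex + Age + Pclass + Fare + SibSp + Parch +
+            Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart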

Here are the results from some of the models:

[Score screenshots from the original post: Random Forest, SVM, Neural Networks, CForest]
For our experiments so far, CForest proved to be the top performer. But please don't stop here; apply your own ideas of twisting and squashing data to gain more accuracy.
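Since the cforest code never appeared above, here is a minimal sketch of how it can be fitted with the party package; ntree and mtry are illustrative, not the exact values from my script:

> library(party)
> set.seed(seed)
> cf_fit <- cforest(formula, data=dataset[dataset$Dataset == 'train',],
+                   controls=cforest_unbiased(ntree=500, mtry=3))
> testset$PredSurvived <- predict(cf_fit, dataset[dataset$Dataset == 'test',],
+                                 OOB=TRUE, type="response")

One practical reason cforest copes well here: unlike randomForest, it accepts factors with many levels, so a rich variable like FamilyID can be used directly.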

For now, I guess this series should serve as a good starter on predictive analytics. You can find a variety of problems on Kaggle.com to participate in and polish your analytics spectacles.

Please feel free to comment and maybe share your score...
