Tuesday, April 12, 2016

Titanic: A case study for predictive analysis on R (Final)

In our previous attempts, we tried to predict whether a Titanic passenger was likely to survive, a competition hosted on Kaggle.com. We used some statistics and machine learning models to classify the passengers.

In our final part, we will push our limits using advanced machine learning models, including Random Forests, Neural Networks, Support Vector Machines and other algorithms, and see how long we can torture our data before it confesses.

Let's resume from where we left off. We will apply an implementation of the Random Forest method of classification. In short, this model grows many decision trees, each on a random sample of the data, and then lets the trees vote on the predicted class. This way, overfitting, the common issue with single decision trees, is mitigated (learn more here).
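To make the voting step concrete, here is a minimal sketch in base R. The vote matrix is made up for illustration (it is not output from any model in this post), and a real forest also randomizes the rows and variables each tree gets to see.

```r
# Hypothetical votes from 5 trees (columns) for 4 passengers (rows); 1 = survived
tree_votes <- matrix(c(1, 0, 1, 1, 0,
                       0, 0, 1, 0, 0,
                       1, 1, 1, 0, 1,
                       0, 1, 0, 0, 1),
                     nrow = 4, byrow = TRUE)

# Majority vote: predict survival when more than half of the trees say so
forest_pred <- as.integer(rowMeans(tree_votes) > 0.5)
print(forest_pred)  # 1 0 1 0
```

A single odd tree (like the third column, which votes 1 for almost everyone) gets outvoted, which is the intuition behind the reduced overfitting.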

> library(randomForest)
> formula <- as.factor(Survived) ~ Sex + Pclass + FareGroup + SibSp + Parch + Embarked + HasCabin + AgePredicted + AgeGroup 
> set.seed(seed)
> rf_fit <- randomForest(formula, data=dataset[dataset$Dataset == 'train',], importance=TRUE, ntree=100, mtry=1)
> varImpPlot(rf_fit)
> testset$PredSurvived <- predict(rf_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)

> write.csv(submit, file="rforest.csv", row.names=FALSE)

The results were not as promising as expected: we did not make any improvement using this algorithm. This suggests that our decision tree model was not overfitting much to begin with.

This is the point where we rethink our data. So far, we noticed that missing Age is an important factor and that some records are missing Fare and Embarked; we extracted Title from the names; and from the seemingly useless variable Cabin, we derived a boolean variable, HasCabin.

Now let's have a look at Ticket.

> unique(dataset$Ticket)

Notice anything? We see some strings like PC, CA, SOTON, PARIS, etc. Without actually knowing what these represent, how about clipping off the digits and punctuation and extracting only this part? Here's how we'll do so (you'll need to install the stringr package if it's missing):

> library(stringr)
> dataset$TicketPart <- NULL
> dataset$TicketPart <- str_replace_all(dataset$Ticket, "[^[:alpha:]]", "")

> dataset$TicketPart <- as.factor(dataset$TicketPart)
> plot(table(dataset$TicketPart[dataset$TicketPart != '']))

The plot reveals that some parts appear frequently. These might hint at where the passenger is coming from.
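To see what the regular expression actually keeps, here is a small sketch on a few sample ticket strings. It uses base R's gsub(), which for this pattern should behave the same as str_replace_all(); the sample values are picked for illustration only.

```r
# A few sample ticket strings (illustrative values)
tickets <- c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803")

# Drop every character that is not a letter
ticket_part <- gsub("[^[:alpha:]]", "", tickets)
print(ticket_part)  # "A" "PC" "STONO" ""
```

Note that purely numeric tickets end up as an empty string, which is why the plot above filters out ''.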

Next, we can use SibSp and Parch to determine the size of the family on board. The thought behind this is that the more members of a family are on board, the more support they have, and perhaps the better their chances of survival.

> dataset$FamilySize <- dataset$SibSp + dataset$Parch + 1  # +1 for the passenger themselves

Torturing the data further, we'll explore the Name variable some more. Apart from Title, we can also extract Surname, since names are in the format [Surname], [Title]. [Given Names]:

> dataset$Surname <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})
> dataset$Surname <- sub(' ', '', dataset$Surname)
> dataset$Surname <- factor(dataset$Surname)
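As a quick check of the splitting logic, here is a minimal sketch on two sample names in the same format. The variable names (sample_names, parts) are made up for this illustration.

```r
# Two sample names in the "[Surname], [Title]. [Given Names]" format
sample_names <- c("Braund, Mr. Owen Harris",
                  "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)")

# Split on comma or period; piece 1 is the surname, piece 2 the title
parts   <- strsplit(sample_names, split = "[,.]")
surname <- sapply(parts, function(p) p[1])
title   <- trimws(sapply(parts, function(p) p[2]))

# sub() removes only the first space, as in the code above
surname <- sub(" ", "", surname)
print(surname)  # "Braund" "VanderPlanke"
print(title)    # "Mr" "Mrs"
```

Since sub() replaces only the first match, a surname with more than one space would keep its remaining spaces; gsub() would remove them all.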

We are only interested in larger families, so we collapse every family of fewer than 3 members, and every FamilyID that occurs fewer than 3 times, into a single 'Small' level:

> dataset$FamilyID <- paste(as.character(dataset$FamilySize), dataset$Surname, sep="")
> dataset$FamilyID[dataset$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(dataset$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> dataset$FamilyID[dataset$FamilyID %in% famIDs$Var1] <- 'Small'
> dataset$FamilyID <- factor(dataset$FamilyID)
> plot(table(dataset$FamilyID[dataset$FamilyID != 'Small']))

As the plot shows, several records share the same FamilyID, which combines family size and surname. This suggests that Surname maps passengers of the same family together reasonably well.
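The collapsing logic is easier to follow on a toy example. The sketch below repeats the same steps on a made-up data frame (the surnames and sizes are invented for illustration, not taken from the competition data).

```r
# Toy data: one real family of 7, one stray "family of 3", and small groups
toy <- data.frame(
  Surname    = c("Andersson", "Andersson", "Andersson",
                 "Smith", "Smith", "Smith", "Jones"),
  FamilySize = c(7, 7, 7, 3, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Combine size and surname, then collapse families of 2 or fewer
toy$FamilyID <- paste0(toy$FamilySize, toy$Surname)
toy$FamilyID[toy$FamilySize <= 2] <- "Small"

# Also collapse IDs that occur at most twice (size and record count disagree)
freq <- table(toy$FamilyID)
rare <- names(freq)[freq <= 2]
toy$FamilyID[toy$FamilyID %in% rare] <- "Small"

print(toy$FamilyID)
# "7Andersson" "7Andersson" "7Andersson" "Small" "Small" "Small" "Small"
```

The lone "3Smith" record claims a family of 3 but has no matching relatives in the data, so the second pass folds it into 'Small' as well.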

There are many more ways to keep torturing this data, but we will limit ourselves to these variables for now.

Now we'll apply SVM and other models (wildly) and see which combination of variables works for us.

> formula <- as.factor(Survived) ~ Sex + AgeGroup + Pclass + FareGroup + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart
> library(e1071)  # assumption: svm() here comes from the e1071 package
> svm_fit <- svm(formula, data=dataset[dataset$Dataset == 'train',])
> testset$PredSurvived <- predict(svm_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)
> write.csv(submit, file="svm.csv", row.names=FALSE)

I continued my experiments with different models, not all of which can be described here; for convenience, I have put the code into this R script, partitioned into functions.

Note that later in my experiments, I used Age and Fare directly instead of their discretized versions (AgeGroup and FareGroup) for better accuracy (yes, the execution time increases as a result).

Here are the results from some of the models:

Random Forest: [image]

Neural Networks: [image]

In our experiments so far, CForest (a random forest built from conditional inference trees) proved to be the top performer. But please don't stop here; apply your own ideas of twisting and squashing the data to gain more accuracy.

For now, I guess this series should serve as a good starter on predictive analytics. You can find a variety of problems on Kaggle.com to participate in and polish your analytics spectacles.

Please feel free to comment and maybe share your score...
