Friday, January 16, 2015

Titanic: A case study for predictive analysis on R (Part 3)

In our previous attempt, we applied some machine learning techniques to our data and predicted the target variable using the AgeGroup, Sex, Pclass and Embarked attributes. Now, we will explore the remaining attributes and see how much more information we can extract.

This time, instead of keeping the test set apart, we will merge it into the training data set. This will let us collect the complete range of values for each attribute, in case some values appear only in the test set:

> dataset$Dataset <- 'train'
> testset$Dataset <- 'test'
> testset$Survived <- 0
> dataset <- rbind(dataset, testset[,c(1,13,2:12)])

This may look like a strange way to merge two data sets, so here's some explanation. The first two lines add a Dataset column marking whether a record comes from the training set or the test set. The third line adds a Survived column to testset, so that both data frames have identical columns. The last line merges the two using the rbind (row bind) function. In the parameters, we specify the column order of testset so that it matches dataset: when we added the Survived column to the test set, it went to the end, while in the training set it is 2nd in order.
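For reference, a slightly safer variant of that last line is to reorder testset's columns by name instead of by hard-coded positions, so the bind doesn't silently break if columns move around. A sketch, assuming both frames now carry the same column names:

> dataset <- rbind(dataset, testset[, names(dataset)])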

Let's resume our exploration; we start with Name. At first glance, the name appears useless, as it is unique for each passenger. But looking closely, we notice a few things: each name contains a family name and a title. The family name can be used to establish relationships among passengers, while titles like Master, Capt, Sir and Lady tell us about people's age, occupation and the social class they belong to.

We will partition the names into sub-parts and extract Title and FamilyName. The following command uses the sapply function, which applies a function to each element of the vector passed as an argument. We pass the Name attribute and apply the strsplit function, which splits a string on a given pattern, in our case either , (comma) or . (period), and we keep the second piece, the title. The next line strips the leading white space.

> dataset$Title <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][2]})

> dataset$Title <- sub(' ', '', dataset$Title)
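To see why that last step is needed, here is what the split produces for the first name in the training file:

> strsplit("Braund, Mr. Owen Harris", split='[,.]')[[1]]
[1] "Braund"       " Mr"          " Owen Harris"

The second element carries a leading space, which sub(' ', '', ...) removes. The family name we promised to extract is simply the first piece; we won't use it further in this part, but it is a one-liner along the same lines:

> dataset$FamilyName <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})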

Let's have a look at the titles we have in the data.

> unique(dataset$Title)
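 [1] "Mr"           "Mrs"          "Miss"         "Master"       "Don"
 [6] "Rev"          "Dr"           "Mme"          "Ms"           "Major"
[11] "Lady"         "Sir"          "Mlle"         "Col"          "Capt"
[16] "the Countess" "Jonkheer"     "Dona"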



These are 18 distinct titles, some of which we will merge to keep our model simple.
Mlle is short for Mademoiselle in French and Mme for Madame. Dona is the Spanish equivalent of Lady, and the Countess is a title of nobility. Jonkheer is actually a Dutch honorific for (male) nobility, but rather than keep a category of one, we will fold it in with the rest. All of these can be merged into one title, "Lady".


> dataset$Title[dataset$Title %in% c('Mlle', 'Mme', 'Ms', 'Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

This cuts the number of distinct titles down to 12.

Submitting predictions from a new model with Title added did not improve our score on Kaggle, but it didn't decrease it either. So, we will keep this information intact.


We can further notice that some titles, like Master (used for boys) and Miss, give us a clue about the age of the passenger. We can make use of these when filling in missing values of Age.

Let's search for all passengers with a missing age and the title "Master", and fill their age with the mean age of "Master" passengers whose age is available. But first, we will reset the Age variable by re-reading both data sets into temporary variables and combining them, to recover the original values of Age (including the NAs we filled with the overall median in Part 2):

> temp.train <- read.csv("train.csv")
> temp.test <- read.csv("test.csv")
> temp.test$Survived <- 0
> tempset <- rbind(temp.train, temp.test[,c(1,12,2:11)])
> dataset$Age <- tempset$Age
> bad <- is.na(dataset$Age)
> dataset$Age[bad & dataset$Title == 'Master'] <- mean(dataset$Age[dataset$Title == 'Master'], na.rm=TRUE)

Fill the missing age for some other titles as well:

> dataset$Age[bad & dataset$Title == 'Miss'] <- mean(dataset$Age[dataset$Title == 'Miss'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mr'] <- mean(dataset$Age[dataset$Title == 'Mr'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mrs'] <- mean(dataset$Age[dataset$Title == 'Mrs'], na.rm=TRUE)
> dataset$AgePredicted <- 0
> dataset$AgePredicted[bad] <- 1
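As an aside, the four title-specific fills above can be collapsed into a single pass with base R's ave(), which computes a per-group statistic aligned with the original vector. A sketch, with the caveat that it fills missing ages for every title, not just the four we picked:

> title.mean <- ave(dataset$Age, dataset$Title, FUN=function(a) mean(a, na.rm=TRUE))
> dataset$Age[bad] <- title.mean[bad]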

Fill out the remaining ones and round off before discretizing:

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> dataset$Age <- round(dataset$Age)

Discretize again as before:

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'
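For the record, the same binning can be written in one line with base R's cut(); a sketch mirroring the thresholds above:

> dataset$AgeGroup <- as.character(cut(dataset$Age, breaks=c(0, 2, 13, 20, 40, Inf),
+     labels=c('infant', 'child', 'teenager', 'young', 'old'), right=FALSE))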

Time to see how we have improved in terms of accuracy. But we will not include Title yet in our formula:

> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_relative_ages.csv", row.names=FALSE)



No improvement?! Well, how about adding some more variables? We already have Title, and another can easily be derived. Cabin holds the cabin numbers a passenger booked; we can flag whether or not a passenger booked a cabin at all and introduce HasCabin as a new variable.

> dataset$HasCabin <- 1
> dataset$HasCabin[dataset$Cabin == ''] <- 0
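A quick sanity check on the new flag:

> table(dataset$HasCabin)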

We will also make use of the SibSp and Parch variables:

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + Title + SibSp + Parch
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_family_attr.csv", row.names=FALSE)

Let's see how we do now with such a load of new information.


This is heartbreaking, I admit. But don't give up just yet. We have more variables to explore, only this time, we'll be a little aggressive ;-)

Have a look at Fare. It is a continuous floating-point value with only 1 missing record. We will do three things with this one:
1. Round off
2. Fill in missing value
3. Discretize

The first two steps are quick:

> dataset$Fare <- round(dataset$Fare)
> bad <- is.na(dataset$Fare)
> dataset$Fare[bad] <- median(dataset$Fare, na.rm=TRUE)

For the third step, we won't just discretize on judgement this time, but rather use a smart algorithm: k-means, which partitions the data into k clusters by minimizing each point's distance to its cluster center. (kmeans() ships with base R's stats package, so no extra library is needed.) This article will give you a thorough description of what we are talking about here.

First, have a quick glance at how the data is distributed; this is to find out the ideal number of clusters we want to make.

> plot(table(dataset$Fare), ylim=c(0,75))


Do you notice some gaps? There's a very thin gap just above 0, and a huge one between 512 and the next value below it. Observing closely, we can visually make out at least 7 groups in the data, so we will ask the k-means algorithm for 7 clusters of the Fare variable.
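If you'd rather not pick the number of clusters by eye, a quick alternative is the classic "elbow" check: run k-means for several values of k and watch where the total within-cluster sum of squares stops dropping sharply. A sketch:

> set.seed(42)
> wss <- sapply(1:10, function(k) kmeans(dataset$Fare, centers=k, iter.max=1000)$tot.withinss)
> plot(1:10, wss, type='b', xlab='number of clusters k', ylab='total within-cluster SS')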

> set.seed(42)  # k-means starts from random centers; fixing the seed makes the run repeatable
> kmeans(x=dataset$Fare, centers=7, iter.max=1000)$centers
        [,1]
1 133.391304
2  28.755556
3   0.600000
4  68.291925
5 237.882353
6   9.943759
7 512.000000

What you see above are the center points (centroids) that the algorithm has computed for these clusters. We will create a new variable FareGroup and assign every record the cluster number suggested by k-means:

> set.seed(42)  # same seed as before, so the clustering matches the centers shown above
> k <- kmeans(x=dataset$Fare, centers=7, iter.max=1000)
> dataset$FareGroup <- k$cluster
> table(dataset$FareGroup)
  1   2   3   4   5   6   7 
 38  19 517  46 161 270 258 

k$cluster holds a cluster number for each record, which we assign to FareGroup. The table shows how the records are distributed among the clusters. There's one extra tweak here: I noticed that none of the passengers who paid less than 6.0 GBP survived. We can create an additional cluster for such passengers as well.
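You can check that observation on the training rows before hard-coding the rule; if it holds, the table below should contain only zeros:

> table(dataset$Survived[dataset$Dataset == 'train' & dataset$Fare < 6])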

> dataset$FareGroup[dataset$Fare < 6] <- 0

Now, we have another refined variable to add to our rpart model. Let's give it another shot, but we'll go without Title, to make sure that it's FareGroup that moves the accuracy (if anything does):

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + SibSp + Parch + FareGroup
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_faregroup.csv", row.names=FALSE)


It WORKED! We are now at 628th position. But there's a long journey ahead. Next, we will push our limits on data preparation by learning models for the missing values, and also see how we can fine-tune the recursive partitioning algorithm to fit our data better.

Saturday, January 10, 2015

Titanic: A case study for predictive analysis on R (Part 2)

Previously, we successfully classified passengers as "will survive" or "will not survive" with 76.5% accuracy using Gender only. We will now extend our experiment further and include other attributes in order to improve our accuracy rate.

Let's resume...

Pre-processing

Real data is never in ideal form; it always needs some pre-processing. We will fill in missing values, extract extra information from the available variables and convert continuous-valued fields into discrete-valued ones.

First, let's have a look at how the Age variable is distributed. We will use the parameter useNA="always" in the table function so that missing values are counted as well.

> table(dataset$Age, useNA="always")


We see 177 missing values (NAs). As a straightforward solution, we will fill them with the median value of Age. The commands below store TRUE/FALSE values in a vector bad by checking whether the value of Age is missing for each record, then assign the median age where it is missing, and plot the distribution of Age afterwards.

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> plot(table(dataset$Age))


We now have Age with no missing values. In order to simplify our decision rules, we need to change this continuous attribute into a discrete-valued one.

Let's discretize this field into infant, child, teenager, young and old.

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'

> table(dataset$Survived, dataset$AgeGroup)/nrow(dataset)


The above distribution tells us that the survival rate of infants is very high; children have about a fifty-fifty chance of survival; teenagers and young adults have lower chances; and finally, the old have the lowest survival ratio.

Let's rebuild our model. This time, if a passenger is male, we make the decision on the basis of the AgeGroup variable.

> testset$AgeGroup <- 'old'
> testset$AgeGroup[testset$Age < 2] <- 'infant'
> testset$AgeGroup[testset$Age >= 2 & testset$Age < 13] <- 'child'
> testset$AgeGroup[testset$Age >= 13 & testset$Age < 20] <- 'teenager'
> testset$AgeGroup[testset$Age >= 20 & testset$Age < 40] <- 'young'
> testset$AgeGroup[testset$Age >= 40] <- 'old'
> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1
> male <- testset$Sex == 'male'
> testset$Survived[male & (testset$AgeGroup == 'infant' | testset$AgeGroup == 'child')] <- 1
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="femailes_and_children_survive.csv", row.names=FALSE)


Let's check what Kaggle says about our new model.



As seen, our score bumped up to 77%. This may not strike you as a very impressive score, but this is not where we are going to stop.

At this point, you must be wondering how complex our code will become once we start adding more and more attributes. Thumbs up if you wondered that. This is where we make use of smart machine learning algorithms for classification.


Classification Algorithms

We can think of predictive analysis as a case of classification, where we classify entities based on some prior information about the domain. Machine learning is a branch of computing that, among other things, deals with supervised classification. Supervised in the sense that we first show the algorithm labelled examples of the data, on the basis of which it makes decisions about new records.

A decision tree is among the simplest examples of classification, where we construct a tree of rules based on the probability of occurrence of various events. For example, we earlier made a general rule:


If Sex is female then Survive, otherwise if AgeGroup is infant or child then Survive, otherwise Not Survive.



There is a variety of decision tree algorithms that can be used in R out of the box to create such decision trees automatically. Let's try one...

Recursive partitioning is an implementation of decision trees in the rpart package in R, used for learning classification (and regression) trees.

You'll first have to install two packages, rpart to fit the model and rpart.plot to draw its tree:

> install.packages(c("rpart","rpart.plot"))
> library(rpart)
> library(rpart.plot)
> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup, data=dataset, method="class")

> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval])

Here, the rpart() function analyzes the data and returns a model fitted to it over the given formula. The formula states that the target variable is Survived and that the decision tree is to be constructed using the Sex and AgeGroup attributes.

The prp() function plots the learnt tree. This is how it looks:

The decision tree shows a 35% chance of survival when Sex is not male; when Sex is male and AgeGroup is old, teenager or young, the chance of survival is only 4%.

We will now predict the Survived values using this model. We pass rpart_fit (the learnt tree), the testset, and the type of prediction we want, i.e. "class", to the predict() function.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_sex_agegroup.csv", row.names=FALSE)

The results from Kaggle did not show any improvement over the previous submission, but we now have a way to add more attributes without having to code complex rules by hand.


Let us now add more attributes to the formula in rpart and see what develops.


> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset, method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
As observed, the decision tree becomes quite complex after adding only two more attributes to the formula. Imagine writing code for this :-)

Let's proceed to submitting the results on Kaggle.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_more_attr.csv", row.names=FALSE)



This bumped our score to 78.947% and also brought us to 706th place in the competition.

Let's now attempt to add all of the attributes. But we first need to clean them, right?! Yes, and that is what we will do in the next part.

Wednesday, January 7, 2015

Titanic: A case study for predictive analysis on R (Part 1)

Kaggle.com is a popular community of data scientists, which holds various data science competitions. This article performs an in-depth predictive analysis on a benchmark case study picked from Kaggle.com: the Titanic.

The case study is a classification problem, where the objective is to determine which class an instance of data belongs to. It can also be called a prediction problem, because we are predicting the class of a record based on its attributes.

Note: This tutorial requires some basic R programming background. If you haven't yet gotten yourself acquainted with R, maybe this is the right time; Codecademy's tutorial is my personal recommendation. We will be using RStudio here, the most widely used IDE for the R language. It is free and open source; you can download it here.

Dataset:
RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on its maiden voyage. 1502 people out of the 2224 on board lost their lives in the disaster; the death toll was so high largely due to a shortage of lifeboats. When data was gathered about the passengers who survived or perished, it was observed that some groups, like women, children and the upper class, survived more often than others. Our objective is to predict, from a passenger's attributes, whether that passenger survived, with as much accuracy as possible.
We have a set of 2 data files: train, which contains records of 891 passengers with the label Survived = 0 (did not survive) or 1 (survived); the other is test, which has 418 records without the survival information. We want to predict this label and post our predictions to Kaggle.com to check how accurate they are.

Please go on and download the two files (train.csv and test.csv) from here. You can alternatively download them from this link as well. The first file, train.csv, contains records in which the value of the target variable Survived is given; we will use it to build a classification model for prediction. The other file, test.csv, contains data about different passengers, but whether each passenger survived is not provided. We will apply our learnt model to this data to predict which passengers live or perish.

That'll be all about the case study, let's start the real thing.


Reading data
First, set the working directory to wherever you have placed the downloaded files.

> setwd("D:/Datasets/Titanic")

The read.csv() function can be used to read a CSV file in table format and save it in a data frame:
> dataset <- read.csv("train.csv")

Check what variables are available.
> colnames(dataset)
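 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"
[11] "Cabin"       "Embarked"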



PassengerId is a unique identifier for each passenger; Survived is the target variable; Pclass represents the class the passenger traveled in; Name, Sex and Age are the demographics; SibSp is the count of siblings and spouses on board; Parch is the count of parents and children on board; Ticket is the ticket number; Fare is the amount paid as fare; Cabin lists the cabin numbers reserved for the passenger; Embarked tells which port the passenger boarded from.

Let's now have a sneak peek at the data using the head() function:
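> head(dataset)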


We can clearly notice a few things here. First, some data is available in ready-to-use form, like Sex; some fields, like Age and Cabin, have missing values; and variables like Name can be fine-tuned to make them meaningful, or to derive extra information from them.

All of this is the pre-processing task, in which we clean up the data, fill in missing values, select and deselect fields, and derive new variables before running analysis algorithms.

The base package of R contains a very handy function, summary(), that reports some primary statistics about the data:
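> summary(dataset)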


There are 314 females and 477 males in the training data; the median passenger age is 28, while the age is not available for 177 passengers; the maximum fare paid is 512.33 pounds. You can get an overall picture of the data just by using this command. Have a keen look.

Submitting on Kaggle:
In order to evaluate our model, we will have to submit our results on Kaggle.com by predicting value of Survived variable for each passenger in the test.csv file.

We know that during disasters, women and children are given preference over men and adults. Let's first see what our training data tells us about this.

Run the table() function on the variables Survived and Sex to create a distribution table over these two variables:
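A command along these lines reproduces the proportions quoted below (the same pattern we use again later in the series):

> table(dataset$Survived, dataset$Sex)/nrow(dataset)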


As observed, the probability of being a female who survived is 0.26, while the probability of being a male who died is 0.525. This means we can say with some confidence that females survive and males do not.

Let's test this hypothesis on the test data. Read the test data into another data frame and introduce a new column, Survived, with 0 as the default value.
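> testset <- read.csv("test.csv")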



Next, we set Survived = 1 for all females. The command below says: "set the variable Survived of testset to 1 wherever the value of the variable Sex is female".

> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1

Now, we will export the PassengerId and Survived variables into a new file.

> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="all_femailes_survive.csv", row.names=FALSE)

Let's submit this new file to Kaggle.com's leaderboard in Titanic competition.

Go to "Make a submission" page in Titanic Disaster challenge and Accept the terms and conditions. The next page will ask whether you want to take part as a team or as an individual. You can choose individual (you can later add people to make a team).
On the submission page, browse for the newly created file "all_families_survive.csv" and submit. 




The results take some time to compute on Kaggle... and there you go!



This just means we have achieved 76.55% accuracy, using only the Sex attribute.

In our next attempts, we will try to improve our accuracy using the remaining data.

Have fun till then...