Friday, February 20, 2015

Titanic: A case study for predictive analysis on R (Part 4)

Working with the Titanic data set from Kaggle.com's competition, we predicted passenger survival with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of simply plugging in the mean or median. Let's start with Age.

Looking at the available data, we can hypothesize that Age correlates with attributes like Title, Sex, Fare and HasCabin. Also note that we previously created the variable AgePredicted; we will use it here to identify which records were filled earlier.

> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)

> dataset$Age[dataset$AgePredicted == 1] <- PredAge

The table(PredAge) gave us the following:
PredAge
  2  23  25 
  8 154 101 

This means the values 2, 23 and 25 were predicted for the Age variable for 8, 154 and 101 records respectively.
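
As a side note, since Age is continuous, a regression tree may be a more natural fit than a classification tree: with method="class", rpart treats each distinct age as a separate category, whereas method="anova" predicts a numeric age directly. A minimal sketch, assuming the same age_train and age_test split as above:

> # Hedged alternative: regression tree on Age
> rp_fit_reg <- rpart(Age ~ Title + Sex + Fare + HasCabin, data=age_train, method="anova")
> PredAgeReg <- predict(rp_fit_reg, newdata=age_test)   # numeric age predictions
> summary(PredAgeReg)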

Furthermore, instead of providing fixed ranges for AgeGroup by judgement, we will use k-means clustering to derive the age groups. The first command below creates 7 clusters of the Age variable; the second line assigns each record in dataset a numeric cluster ID.

> k <- kmeans(dataset$Age, 7)

> dataset$AgeGroup <- k$cluster

Let's have a peek at the centers of these clusters as well as their distribution:
> k$centers
       [,1]
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379
7  4.117021

> table(k$cluster)
  1   2   3   4   5   6   7 
127 139  59 338 198 354  94 
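
One caveat: the cluster IDs returned by kmeans() are arbitrary, they are not ordered from youngest to oldest, and rpart will read the raw IDs as a numeric scale. If that bothers you, a small hedged tweak is to renumber the clusters by their centers and store them as a factor; the variable ord below is just illustrative:

> ord <- order(k$centers)                      # cluster IDs sorted by center, youngest first
> dataset$AgeGroup <- factor(match(k$cluster, ord))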

Let's see if we have any improvement in our results:

> formula <- Survived ~ Sex + AgeGroup + Pclass
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="rpart_learn_age.csv", row.names=FALSE)


Hurrah! We climbed to 310th position with 80.383% accuracy. (Note that ranks improve over time as the competition slows down.)

We'll end our data pre-processing here. Next, we will try some more classification models like Random forests and Support vector machines and see if we can do any better than this.

Friday, January 16, 2015

Titanic: A case study for predictive analysis on R (Part 3)

In our previous attempt, we applied some machine learning techniques to our data and predicted the values of the target variable using the AgeGroup, Sex, Pclass and Embarked attributes. Now, we will explore further attributes and see how much information we can extract.

This time, instead of keeping the test set apart, we will merge it into the training data set. This will let us collect the complete range of values for each attribute, in case some values never appear in the training set:

> dataset$Dataset <- 'train'
> testset$Dataset <- 'test'
> testset$Survived <- 0
> dataset <- rbind(dataset, testset[,c(1,13,2:12)])

This may look like a strange way to merge two data sets, so here is some explanation. The first two lines add a Dataset column to mark whether a record comes from the training set or the test set. The third line adds a Survived column to testset, so that both data frames have identical columns. The last line merges the two using the rbind (row bind) function. In its arguments, we reorder the columns of testset to match dataset, because the Survived column we just added to the test set went at the end, while in the training set it is 2nd in order.
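
As an aside, selecting the testset columns by name instead of by position is a little less fragile should the column order ever change; a minimal equivalent sketch for the last line:

> # Hedged alternative: reorder testset columns by name rather than by index
> dataset <- rbind(dataset, testset[, names(dataset)])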

Let's resume our exploration; we start with Name. At first glance, the name appears to be useless, as it is unique for each passenger. But looking closely, we notice a few things: each name contains a Family name and a Title; FamilyName can be used to create relationships among passengers, while titles like Master, Capt, Sir, Lady tell us people's age, job and the social class they belong to.

We will partition the names into sub-parts and extract the Title (the FamilyName could be pulled out the same way). The following command uses the sapply function, which applies a function to each element of the vector passed as an argument. We pass the Name attribute and apply the strsplit function, which splits a string on the given characters, in our case either , (comma) or . (period). The next line strips the leading white space left in front of the title.

> dataset$Title <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][2]})

> dataset$Title <- sub(' ', '', dataset$Title)
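
If you prefer regular expressions, the same extraction can be done in a single line; a hedged equivalent sketch (no leading space is left behind, so the sub() step is not needed):

> # Strip everything up to ", " and everything from "." onwards, leaving only the title
> dataset$Title <- gsub('(.*, )|(\\..*)', '', dataset$Name)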

Let's have a look at the titles we have in the data.

> unique(dataset$Title)



These are 18 distinct titles, some of which we will merge to keep our model simple.
Mlle is short for Mademoiselle in French and Mme for Madame. Similarly, Dona is Spanish for Lady, and Jonkheer and the Countess are, again, titles of nobility. All of these can be merged into the single value "Lady".


> dataset$Title[dataset$Title %in% c('Mlle', 'Mme', 'Ms', 'Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

This cuts our distinct values to 13.
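
A quick frequency table is a handy sanity check after merging; a small sketch:

> # How many passengers carry each (merged) title
> sort(table(dataset$Title), decreasing=TRUE)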

Submitting predictions from a new model with Title added did not improve our score on Kaggle, but it didn't decrease it either. So, we will keep this information intact.


We also notice that some titles, like Master, Miss and Ms, give us a clue about the age of the passenger. We can make use of these when filling in the missing Age values.

Let's search for all passengers with a missing age and the title "Master", and fill their ages with the mean age of passengers titled Master whose age is available. But first, we will reset the Age variable by re-reading both data sets into temporary variables and combining them to recover the original Age values:

> temp.train <- read.csv("train.csv")
> temp.test <- read.csv("test.csv")
> temp.test$Survived <- 0
> tempset <- rbind(temp.train, temp.test[,c(1,12,2:11)])
> dataset$Age <- tempset$Age
> bad <- is.na(dataset$Age)
> dataset$Age[bad & dataset$Title == 'Master'] <- mean(dataset$Age[dataset$Title == 'Master'], na.rm=TRUE)

Fill the missing age for some other titles as well:

> dataset$Age[bad & dataset$Title == 'Miss'] <- mean(dataset$Age[dataset$Title == 'Miss'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mr'] <- mean(dataset$Age[dataset$Title == 'Mr'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mrs'] <- mean(dataset$Age[dataset$Title == 'Mrs'], na.rm=TRUE)
> dataset$AgePredicted <- 0
> dataset$AgePredicted[bad] <- 1
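
Repeating that line for every title gets tedious. The same idea can be written as a loop over all titles; a hedged sketch, equivalent to the per-title lines above, which uses the original missing-value mask bad so that imputed ages never feed back into the means:

> # For each title, fill originally-missing ages with the mean of the originally-known ones
> for (t in unique(dataset$Title)) {
+     known <- dataset$Title == t & !bad
+     if (any(known)) dataset$Age[bad & dataset$Title == t] <- mean(dataset$Age[known])
+ }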

Fill out the remaining ones and round off before discretizing:

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> dataset$Age <- round(dataset$Age)

Discretize again as before:

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'
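
The same discretization can be written more compactly with cut(), which maps numeric ages onto labelled intervals in one call; a hedged equivalent sketch:

> # below 2 = infant, [2,13) = child, [13,20) = teenager, [20,40) = young, 40+ = old
> dataset$AgeGroup <- as.character(cut(dataset$Age,
+                        breaks=c(-Inf, 2, 13, 20, 40, Inf), right=FALSE,
+                        labels=c("infant", "child", "teenager", "young", "old")))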

Time to see how much we have improved in terms of accuracy. But we will not include Title in our formula yet:

> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_relative_ages.csv", row.names=FALSE)



No improvement?! Well, how about adding some more variables? We already have Title, and another can easily be added. Cabin holds the cabin numbers a passenger has booked, so we can record whether or not a passenger has a cabin at all and introduce HasCabin as a new variable.

> dataset$HasCabin <- 1
> dataset$HasCabin[dataset$Cabin == ''] <- 0
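
The same flag can also be derived in one line; a hedged equivalent:

> # 1 if a cabin number is recorded, 0 otherwise
> dataset$HasCabin <- as.integer(dataset$Cabin != '')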

We will make use of the SibSp and Parch variables too:

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + Title + SibSp + Parch
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_family_attr.csv", row.names=FALSE)

Let's see how we do now with such a load of new information.


This is heartbreaking, I admit. But don't give up just yet. We have more variables to explore, only this time, we'll be a little aggressive ;-)

Have a look at Fare. It is a continuous, floating-point value with only 1 missing record. We will do three things with this one:
1. Round off
2. Fill in missing value
3. Discretize

The first two steps are quick:

> dataset$Fare <- round(dataset$Fare)
> bad <- is.na(dataset$Fare)
> dataset$Fare[bad] <- median(dataset$Fare, na.rm=TRUE)

For the third step, we won't just discretize on judgement this time, but rather use k-means clustering, an algorithm that creates clusters or groups of data by minimizing the distances of points to their cluster centers (kmeans() ships with base R's stats package, so the "caret" library is not strictly required). This article will give you a thorough description of what we are talking about here.

First, have a quick glance at how the data is distributed; this is to find out the ideal number of clusters we want to make.

> plot(table(dataset$Fare), ylim=c(0,75))


Do you notice some gaps? There's a thin gap between 0 and the next value, and a huge gap between 512 and the value before it. Observing closely, we can visually make out at least 7 groups in the data, so we will ask k-means to create 7 clusters of the Fare variable.

> library(caret)    # optional here; kmeans() itself comes from base R's stats package
> kmeans(x=dataset$Fare, centers=7, iter.max=1000)$centers
        [,1]
1 133.391304
2  28.755556
3   0.600000
4  68.291925
5 237.882353
6   9.943759
7 512.000000

What you see above are the center points (centroids) that the algorithm has computed for these clusters. We will create a new variable FareGroup and assign every record the cluster number suggested by k-means:

> k <- kmeans(x=dataset$Fare, centers=7, iter.max=1000)
> dataset$FareGroup <- k$cluster
> table(dataset$FareGroup)
  1   2   3   4   5   6   7 
 38  19 517  46 161 270 258 
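
One caveat: kmeans() starts from randomly chosen centers, so the run above may not reproduce exactly the centers we inspected a moment ago, and the cluster numbering can change between runs. Fixing a seed and reusing a single fit keeps things reproducible; a minimal sketch (the seed value is arbitrary):

> set.seed(415)                                  # any fixed seed will do
> k <- kmeans(x=dataset$Fare, centers=7, iter.max=1000)
> k$centers                                      # centroids from this fit...
> dataset$FareGroup <- k$cluster                 # ...and the matching assignments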

k$cluster holds a cluster number for each record, which we assign to FareGroup. The table shows how the records are distributed among the clusters. There's one extra tweak here: I noticed that none of the passengers who paid a fare below 6.0 GBP survived, so we can create an additional cluster for such passengers as well.

> dataset$FareGroup[dataset$Fare < 6] <- 0

Now we have another refined variable to add to our rpart model. Let's give it another shot, but we'll go without Title, to make sure that it is FareGroup that increases the accuracy (if it does):

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + SibSp + Parch + FareGroup
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_faregroup.csv", row.names=FALSE)


It WORKED! We are now at 628th position. But there's a long journey ahead. Next, we will push our limits on data preparation by learning the missing values, and also see how we can fine-tune the recursive partitioning algorithm to fit our data better.

Saturday, January 10, 2015

Titanic: A case study for predictive analysis on R (Part 2)

Previously, we successfully classified passengers as "will survive" or "will not survive" with 76.5% accuracy using Gender only. We will now extend our experiment further and include other attributes in order to improve our accuracy rate.

Let's resume..

Pre-processing

Real data is never in ideal form. It always needs some pre-processing. We will fill in missing values, extract some extra information from the available variables, and convert continuous-valued fields into discrete-valued ones.

First, let's have a look at how the Age variable is distributed. We will use the parameter useNA="always" in the table function:

> table(dataset$Age, useNA="always")


We see 177 missing values (NAs). We will fill them with the median Age as a straightforward solution. The commands below store TRUE/FALSE values in a vector bad by checking, for each record, whether the Age value is missing, then put the median age where it is missing, and finally plot the distribution of Age.

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> plot(table(dataset$Age))


We now have Age with no missing values. In order to simplify our decision rules, we need to change this continuous attribute into a discrete-valued one.

Let's discretize this field into infant, child, teenager, young and old.

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'

> table(dataset$Survived, dataset$AgeGroup)/nrow(dataset)


The above distribution tells us that the survival rate of infants is very high; children have roughly an even chance of survival; teenagers and young adults have a lower one; and finally, the old have the lowest survival rate.
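
Incidentally, the proportions above are relative to the whole data set; to read off the survival rate within each age group directly, prop.table() with margin=2 normalizes each column. A small sketch:

> # Column-wise proportions: survival rate within each AgeGroup
> prop.table(table(dataset$Survived, dataset$AgeGroup), margin=2)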

Let's rebuild our model. This time, if a passenger is male, we make the decision on the basis of the AgeGroup variable.

> testset$AgeGroup <- 'old'
> testset$AgeGroup[testset$Age < 2] <- 'infant'
> testset$AgeGroup[testset$Age >= 2 & testset$Age < 13] <- 'child'
> testset$AgeGroup[testset$Age >= 13 & testset$Age < 20] <- 'teenager'
> testset$AgeGroup[testset$Age >= 20 & testset$Age < 40] <- 'young'
> testset$AgeGroup[testset$Age >= 40] <- 'old'
> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1
> male <- testset$Sex == 'male'
> testset$Survived[male & (testset$AgeGroup == 'infant' | testset$AgeGroup == 'child')] <- 1
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="femailes_and_children_survive.csv", row.names=FALSE)


Let's check what Kaggle says about our new model.



As seen, our score bumped to 77%. This may not come to you as a very impressive score, but this is not where we are going to stop.

At this point, you must be wondering how complex our code will become once we start adding more and more attributes. Thumbs up if you wondered this. This is where we make use of machine learning algorithms for classification.


Classification Algorithms

We can think of predictive analysis as a case of classification, where we classify entities based on some prior information about the domain. Machine learning, a branch of computer science, gives us algorithms for supervised classification. "Supervised" means that we first show the algorithm some labelled information about the data, on the basis of which it then makes decisions about new records.

A decision tree is among the simplest classification models: we construct a tree of rules based on the probabilities of various events. For example, we earlier made a general rule:


If Sex is female then Survive, otherwise if AgeGroup is infant or child then Survive, otherwise Not Survive.



There is a variety of decision tree algorithms available out of the box in R to create such trees automatically. Let's try one...

Recursive partitioning is an implementation of decision trees in the rpart package in R, used to build classification (and regression) trees for prediction.

You'll first have to install two packages: rpart itself, and rpart.plot to draw its tree:

> install.packages(c("rpart","rpart.plot"))
> library(rpart)
> library(rpart.plot)
> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup, data=dataset, method="class")

> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval])

Here, the rpart() function analyzes the data and returns a model fitted to it over the given formula. The formula states that the target variable is Survived and that the decision tree is to be constructed using the Sex and AgeGroup attributes.

The prp() function plots the learnt tree. This is how it looks:

The decision tree shows that there is a 35% chance of survival when Sex is not male; when Sex is male and AgeGroup is old, teenager or young, the chance of survival is only 4%.

We will now predict the Survived values using this model. We pass rpart_fit (the learnt tree), the testset, and the type of prediction we want, i.e. "class", to the predict() function.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_sex_agegroup.csv", row.names=FALSE)

The results from Kaggle did not show any improvement from previous results, but we now have a way to add more attributes without having to code complex rules.


Let us now add more attributes to the formula in rpart and see what develops.


> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset, method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)

As observed, the decision tree becomes quite complex after adding only two more attributes to the formula. Imagine writing code for this :-)

Let's proceed to submitting the results on Kaggle.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_more_attr.csv", row.names=FALSE)



This bumped our score to 78.947% and also brought us to 706th place in the competition.

Let's now attempt to add all of the attributes. But we first need to clean them, right?! Yes, and in the next part, we will do just that.