Skip to main content

Titanic: A case study for predictive analysis on R (Part 4)

Working with titanic data set picked from Kaggle.com's competition, we predicted the passenger survivals with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of setting trying mean or median. Let's start with Age.

Looking at the available data, we can hypothetically correlate Age with attributes like Title, Sex, Fare and HasCabin. Also note that we previous created variable AgePredicted; we will use it here to identify which records were filled previously.

> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)

> dataset$Age[dataset$AgePredicted == 1] <- PredAge

The table(PredAge) gave us the following:
PredAge
  2  23  25 

  8 154 101 

Meaning that values 2, 23 and 25 were predicted for age variable for 8, 154 and 101 records respectively.

Furthermore, instead of providing fixed ranges for AgeGroups by judgement, we will use k-means clustering to derive age groups. The commands below will create 7 clusters of Age variable, the second like assigns each record in dataset a numeric cluster ID.

> k <- kmeans(dataset$Age, 7)

> dataset$AgeGroup <- k$cluster

Let's have a peek at the centers of these clusters as well as their distribution:
> k$centers
       [,1]
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379

7  4.117021

> table(k$cluster)
  1   2   3   4   5   6   7 

127 139  59 338 198 354  94 

Let's see if we have any improvement in our results:

> formula <- Survived ~ Sex + AgeGroup + Pclass
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="rpart_learn_age.csv", row.names=FALSE)


Hurrah! We hiked to 310th position with 80.383% accuracy. (Note that the ranks get improved with time as competition slows)

We'll end our data pre-processing here. Next, we will try some more classification models like Random forests and Support vector machines and see if we can do any better than this.

Comments

Popular posts from this blog

A faster, Non-recursive Algorithm to compute all Combinations of a String

Imagine you're me, and you studied Permutations and Combinations in your high school maths and after so many years, you happen to know that to solve a certain problem, you need to apply Combinations. You do your revision and confidently open your favourite IDE to code; after typing some usual lines, you pause and think, then you do the next best thing - search on Internet. You find out a nice recursive solution, which does the job well. Like the following: import java.util.ArrayList; import java.util.Date; public class Combination {    public ArrayList<ArrayList<String>> compute (ArrayList<String> restOfVals) {       if (restOfVals.size () < 2) {          ArrayList<ArrayList<String>> c = new ArrayList<ArrayList<String>> ();          c.add (restOfVals);          return c;       }       else {          ArrayList<ArrayList<String>> newList = new ArrayList<ArrayList<String>> ();          for (String

How to detach from Facebook... properly

Yesterday, I deactivated my Facebook account after using it for 10 years. Of course there had to be a very solid reason; there was, indeed... their privacy policy . If you go through this page, you might consider pulling off as well. Anyways, that's not what this blog post is about. What I learned from yesterday is that the so-called "deactivate" option on Facebook is nothing more than logging out. You can log in again without any additional step and resume from where you last left. Since I really wanted to remove myself from Facebook as much as I can, I investigated ways to actually delete a Facebook account. There's a plethora of blogs on the internet, which will tell you how you can simply remove Facebook account. But almost all of them will either tell you to use "deactivate" and "request delete" options. The problem with that is that Facebook still has a last reusable copy of your data. If you really want to be as safe from its s

A step-by-step guide to query data on Hadoop using Hive

Hadoop empowers us to solve problems that require intense processing and storage on commodity hardware harnessing the power of distributed computing, while ensuring reliability. When it comes to applicability beyond experimental purposes, the industry welcomes Hadoop with warm heart, as it can query their databases in realistic time regardless of the volume of data. In this post, we will try to run some experiments to see how this can be done. Before you start, make sure you have set up a Hadoop cluster . We will use Hive , a data warehouse to query large data sets and a adequate-sized sample data set, along with an imaginary database of a travelling agency on MySQL; the DB  consisting of details about their clients, including Flight bookings, details of bookings and hotel reservations. Their data model is as below: The number of records in the database tables are as: - booking: 2.1M - booking_detail: 2.1M - booking_hotel: 1.48M - city: 2.2K We will write a query that