Working with the Titanic data set from Kaggle.com's competition, we predicted passenger survival with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of simply imputing the mean or median. Let's start with Age.
Looking at the available data, we can hypothetically correlate Age with attributes like Title, Sex, Fare and HasCabin. Also note that we previously created the variable AgePredicted; we will use it here to identify which records were filled in earlier.
> library(rpart)
> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)
> dataset$Age[dataset$AgePredicted == 1] <- PredAge
The table(PredAge) gave us the following:
PredAge
2 23 25
8 154 101
This means that the values 2, 23 and 25 were predicted as Age for 8, 154 and 101 records respectively.
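As a quick sanity check (assuming the imputation above filled every flagged record), we can confirm that no missing ages remain and glance at the resulting distribution:
> sum(is.na(dataset$Age))
> summary(dataset$Age)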
Furthermore, instead of defining fixed ranges for AgeGroup by judgement, we will use k-means clustering to derive the age groups. The first command below creates 7 clusters over the Age variable; the second line assigns each record in dataset a numeric cluster ID.
> k <- kmeans(dataset$Age, 7)
> dataset$AgeGroup <- k$cluster
Let's have a peek at the centers of these clusters as well as their distribution:
> k$centers
[,1]
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379
7 4.117021
> table(k$cluster)
1 2 3 4 5 6 7
127 139 59 338 198 354 94
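Note that kmeans assigns cluster IDs in an arbitrary order, so AgeGroup 1 is not necessarily the youngest group. If an ordinal grouping is preferred, a small optional sketch like the one below (reusing the k object above) relabels the clusters by ascending center age; the results that follow were obtained with the raw cluster IDs, so treat this as a refinement to experiment with:
> ord <- order(k$centers)
> dataset$AgeGroup <- match(k$cluster, ord)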
Let's see if we have any improvement in our results:
> formula <- Survived ~ Sex + AgeGroup + Pclass
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_learn_age.csv", row.names=FALSE)
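Before uploading the file, a rough in-sample sanity check (optimistic by nature, since the tree is evaluated on the same rows it was trained on) can be done by comparing the fitted predictions against the known labels; this assumes Survived is coded 0/1 in dataset:
> train_rows <- dataset$Dataset == 'train'
> train_pred <- predict(rpart_fit, dataset[train_rows,], type="class")
> table(Predicted=train_pred, Actual=dataset$Survived[train_rows])
> mean(train_pred == dataset$Survived[train_rows])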
Hurrah! We climbed to 310th position with 80.383% accuracy. (Note that ranks tend to improve over time as the competition slows down.)
We'll end our data pre-processing here. Next, we will try some more classification models, such as random forests and support vector machines, and see if we can do any better than this.