Tuesday, November 17, 2015

Do's and don'ts for Team Leaders - 1

7 do's for Team Leaders

Say you've just been promoted to a team lead position and have no clue what to do, or received no training from your previous leader in this regard.
Well, here's a little something for you. This is by no means a thorough guide, but rather a kick-starter.
The 7 points below are an extract of what I've experienced as a team leader of smart and versatile software engineers.

1. Be friendly and supportive:

Simplest and most general rule that fits everywhere. Hard to achieve though.
Friendliness requires openness and transparency. Try to make work fun without compromising on your company's mandate.

Dine with your team, talk about stuff other than work. Share jokes and inspirational quotes/videos. Make your presence comfortable. A good leader knows his players' interests, current goals, future plans, daily activities, pet names, favorite pizza toppings... you get the point.

Being supportive means you should always be available, set examples by doing and be the easiest person to communicate with. When you need something, ask about their availability and go to their desk; when they need your help, make them a priority and, again, go to their desk.

2. Believe in them:

Believe in your teammates more than they believe in themselves. Often, those who lack experience also struggle to estimate their own capabilities. One of my former seniors had me do things I could not believe I was capable of at the time, by breaking the job into unit tasks.

Here, by believing, I do not mean just pretending or saying that you believe in them. Have faith, and be courageous to assign them critical tasks if they are capable. Make your team mates realize the importance of such tasks and why you think they have more than half the chance to handle them. This practice builds confidence. A critical task may involve some risk and failing might cost, but the damage of not doing so may and will cost much more.

But be careful not to be unrealistic; a lioness will encourage her cubs to hunt a deer, but won't let them go near a Gaur. I once made the mistake of assigning a task to one of my best players without proper training. The result was a deadline unmet, a deliverable missed and an unhappy client.

3. Set vision centered goals:

Have a clear vision in your mind about why you are leading your team. What do you want them to be? For example, you can have a vision of training people to have your skills, plus their own skills, minus your weaknesses. Once you have this, set timely goals according to their skills, areas of improvement and interests.

A bright player on my team is very dedicated and hard working. She needed improvement in stress handling, so I set specific goals that put her under just as much stress as she could handle; while keeping this constant, I occasionally assigned her additional tasks. The assumption was that this experiment would raise her stress threshold. After a few months, she showed significant improvement.

One of the most gifted players on my team is never troubled with work and always delivers quality stuff at supersonic pace. Now the downside is that her efforts are limited to making software; the rest of the teams have not benefited from her experience. So, her goal for the year is to identify a common problem related to software engineering processes in the organization and fix it.

You cannot exercise this well if you are not friendly and supportive. Also, if you are not sincere with them or put yourself before them, you will hesitate to set goals that might ask more from you than from them. If such is the case, you're on the verge of being a bad example.

4. Be honest:

Honesty is the first chapter in the book of wisdom.
[Thomas Jefferson]

You can be manipulative in the name of diplomacy, hide the facts and call it strategy and still be successful - temporarily. Eventually, your strategy will fall flat on its face when your teammates follow in your footsteps (what goes around comes around).

Fabricating project deadlines, making promises about appraisals and promotions you know won't happen, and making up stories to deny leave are the most common forms of dishonesty in the industry.

My manager clearly gives me project goals and deadlines at once and leaves Yes/No up to me. The only occasions he heard "No" from me were when he too admitted that the job was impossible. I follow the same practice and without keeping any buffer, my team almost always gets things ready in due time.

4. Have communication protocols:

Miscommunication is the root of all business mishaps. You may define "efficient" as someone who executes tasks quickly, while someone else might be talking about someone who gets things done the most optimal way, when using the same term.

Now, I'm not suggesting that you write a dictionary and cram it into your memory. Communication gets stronger with time; all you need to do is be consistent in how you use particular words. My manager does not often use "unacceptable", but when he does, it's red alert - always. To point out a mistake casually, he uses "messed up".

Build your protocols just like that. A few examples are:
- "hang on": I'm coming to you personally
- "wait a minute": I'll be right back
- "what do you think?": you told me a problem I'm sure you can solve yourself
- "will get back to you": I'll follow up when I have some progress on this matter
- "by the end of the week": by Friday, before 5:00pm (closing time)
- "it is important": it must be done to meet the goals
- "it is urgent": do it now or it can never be done

Do not use vague words. "It should be ready": is it "it is ready" or "it is not ready"? Using ambiguous terms to stall people is unprofessional, unethical and disrespectful (humour is another thing). Leave that to your politicians.

Following this principle will eventually build a strong communication bridge, which will relieve you from most of your worries.

5. Appreciate and criticize:

Successful people have control over their emotions. You are human, so you will get aggressive sometimes - inevitably. But remember Vince Lombardi's rule of thumb: "Praise in public; criticize in private". Appreciation should be loud and casual, and should not be a prologue to a lecture on how to improve further. Also, be responsive, not reactive (a leaf out of The 7 Habits of Highly Effective People by Stephen Covey).

Criticism should be constructive: point out mistakes and explain ways to correct them. If you're angry, cool down. Give yourself a moment. See if the person is already doing their best. If such is the case, it is not fair to criticize; you should either manage resources better or provide more training. Try to ignore minor issues. If you are fault tolerant, your team will not hesitate to report to you when a mishap occurs. I once deleted the database of a live server (repeat) a live server. My manager (now CEO) is a cool captain, so I informed him immediately. He sat with me the whole of the next day (an off day), and we recovered about 99% of the lost data. Had he been an intolerant dictator, I would have tried to cover up and the damage would've been much worse.

It is a good habit to give your team more credit for good performance than is due; if they're under-performing or have messed something up, own their mistake. Take some of the blame off them. I've seen a manager do exactly the opposite with her team. Half of her team left her; the rest spent their time making up cover stories, until she was let go by the company the same year.

6. Welcome mistakes:

You are not what you are without making some mistakes. Making a mistake should not cause embarrassment to anyone in your environment. Rather, encourage your team by sharing your own mistakes, how you corrected them and what lessons, good or bad, you learned. Mistakes will happen, but they won't be repeated if people start sharing them with others. I shared the story of my blunder on the live database with the whole department, and now backing up databases before any kind of change is a common practice throughout the team.

7. Performance feedback:

Performance feedback is an effective way to improve yourself. Establish a mechanism to provide and receive feedback to and from your team members. They should see you as a person who welcomes criticism and responds positively to it. In my company, some seniors encourage their teams to appraise them too. This works well if you have a history of not abusing your authority in difficult times. On the other hand, some people are reluctant due to lack of confidence in you; it is not always easy to tell your senior about his flaws. You'll have to find your own way of knowing how your players think you can do a better job at leading them.

One of my team members wouldn't give any feedback on my performance. Later, I learnt the hard way that my strict response to her mistakes had damaged her performance rather than improving it. When settling this, I made her realize that the lack of feedback from her had allowed the issue to escalate.

I personally like the feedback process in academia, where pupils are asked by the administration to fill in a questionnaire to evaluate the performance of their instructors each semester.

These are a few lessons I've learnt in my not-so-long tenure as a team leader. I've tried to keep them general, but no guidelines are applicable as is to all circumstances and people. For instance, a military officer will definitely agree with #4 and reject #6. Next, we'll have a look at some "don'ts", in sha Allah.

Agree/disagree, please leave comments if you think it needs improvement or something is terribly wrong...

Saturday, May 16, 2015

Playing in Amazon's Clouds - Introduction to Elastic Computing Cloud - Part 2

Connecting to Cloud

Previously, we looked at how to configure an EC2 instance on AWS. If you're not sure what this sentence was about, click here.

In this post, we'll look at some ways to connect to your EC2 instance and try out an example. I'm assuming you already know how to get to the EC2 console page from AWS home.

From here, you should go to the Running Instances link to check your instances. You should see something like this:

Right now, we only have one instance of t2.micro configuration, running on public IP address defined under Public IP. We will first create an alarm to make sure we do not hit our cap when experimenting.

Click the Alarm icon under Alarm Status. You should see a pop up to configure an alarm. We are interested in making sure that the CPU usage is under certain limits. Let's create an alarm.

We want to generate an email alert whenever our instance is consuming over 90% of processing power for 1 hour or more.

We just created a new warning notification to my email address and also set an action to Stop the instance whenever our set limits are hit.

Note that on setting the email address, you will receive an email for confirmation. If your email address is not confirmed within three days, you'll stop receiving any alerts.

Now, let's come to our real objective. We will try to connect using PuTTY, an SSH client for Windows. If you're on Windows, download and install the latest version of it. Follow the links for Linux and OSX.

When you've installed PuTTY, launch the PuTTY application. You should be able to see a configuration window:

In the host name, put: ec2-user@your_public_dns. For example:

Next step is to provide a private key. Remember we generated a pem file in our earlier walk through? Well! PuTTY uses a ppk file, or PuTTY Private Key. We can easily generate this file from a utility called PuTTYGen; this tool installs with PuTTY. Leave the current window open as is and launch puttygen.exe from your PuTTY installation directory.

Go to conversions menu and import the pem key pair file created earlier. Mine is vision360-keypair.pem. You should be able to see the public key and the RSA fingerprint.

Click "Save private key" button and save the key as <your-key-name>.ppk. If PuTTY displays a warning for missing passphrase, ignore and proceed. I saved my file as vision360-keypair.ppk. Close this window and go back to PuTTY.

Go to Connection > SSH > Auth. Provide the path to private key we generated:

Click Open to connect to your EC2 machine. On first attempt, PuTTY should display a warning that the Server's host key was not previously registered and that you may not be connecting to the desired computer. Ignore this warning by pressing Yes; we know what we're doing :-)

That's it. You're in. The first thing you may want to do is to update the OS of your VM. Try the following command (this is similar to Ubuntu's apt-get):

$ sudo yum update

You'll note that the updates are downloaded and applied like a bullet train. This is because Amazon uses state of the art infrastructure and most optimal settings for the OS of its VMs.

Check your system's resources using some common commands:
To check disk space
$ df -h
To check memory
$ free -m

That's it for now, we'll get to some practical use of the Elastic Computing in future posts. The plan is to do some real-life data analysis on this.

Please comment if you find any mistake. Thanks...

Friday, April 24, 2015

Playing in Amazon's Clouds - Introduction to Elastic Computing Cloud - Part 1

A really brief Intro..

Researcher, trying to execute an extremely resource-hungry computational experiment?
App developer, unsure of how much data you'll be collecting from the users?
Student, tasked to build your FYP (final year project) on distributed computing environment?
Just an ordinary techie trying to catch up with the world?

If you're any of these, you cannot escape the fact that Cloud computing is storming in and you have to engage yourself actively in it. Adopt it, or perish.

I'm a newbie (better say wannabe) in this massive web of computing, and here just to share some experiences I'm having - successes and failures.

First of all, cloud computing is nothing new; it has been around for over 3 decades and was referred to by names like Grid computing and Distributed computing. It was business people who came up with a catchy name to attract business.

The idea behind distributed computing is simple. We create a network of computers to handle complex tasks that a single computer cannot. Cloud computing takes this concept a few steps ahead: it creates redundancy of data and processing units, such that if one machine goes down another takes over your job, and it provides a range of services, from applications to data centers, billed on the basis of usage (unlike fixed rent). Here is a good read on what cloud computing offers.

We will now have a look at how to practically set up our own cloud applications and make use of the power of cloud to solve our everyday problems. For this purpose, we'll play with Amazon Web Services, a.k.a. AWS, which is thus far the most successful cloud services provider.

In this post, we will attempt to set up an instance of virtual machine on AWS.

Let's dive...

The first step is to sign up. For this purpose, all you need is to provide your basic info and a working credit card. Don't worry, Amazon won't be charging more than 1 USD unless you start subscribing to services.

Once you're through this step, you should be able to log into your AWS account and see this dashboard:

Yes, there are a ton of options. That's why Amazon is so successful. But we'll just taste a bite of this dessert here. The first thing to note is that there are 11 data centers where Amazon hosts its services; we are connected to Oregon.

The good news is that Amazon allows you 12 months of free (but limited) usage to try out different services before you actually use them at large. Please go through the details before proceeding.

Let's talk about the first option - EC2, or Elastic Compute Cloud. This is a service that can resize itself to fit your computing needs. You can create, run, and terminate virtual server instances, supporting a variety of technologies.

To access this service, go to the Top-left corner of the AWS services and click EC2.

Of course, there are no instances running. We will be creating a new instance. But first, we need an Identity key. To create one, go back to the AWS console (home) and follow the steps:

Step 1: Go to the highlighted option Identity & Access Management

You should see the dashboard with nothing running.

Step 2: We'll start by creating a Group and assigning it administrative rights.

Note: If you like, you can customize your aliases with something other than the number given, like I renamed it to https://owaisahussain.signin.aws.amazon.com/console

Step 3: Now go to groups and create one. I named my group vision360. You should also grant AdministratorAccess to ensure that the group you created has full rights. You can create more groups like this as well.

On next window, review the changes and create group.

Since we have a group, the next obvious thing is to add a user to associate with this group.

Step 4: Go ahead and click the Users link on navigation pane:

Click Create New Users and add user accounts you want to create. For now, we'll just go with mine. When you create a user, IAM creates an access key ID and secret access key. This information is sensitive and should be kept secret. You may also download this key pair for future use.

Next, select the newly created user and add it to the previously created vision360 group from User Actions.

From the same User Actions, go to Manage Password and assign a new password of your choice (should be strong).

Now, just like you created an identity for yourself, the virtual machines on the cloud need an identity as well. We do this by creating key pairs, which are used to secure the login credentials for your instances (virtual machines). Follow the steps below to create a key pair for your machine:

Step 1: Go back to the EC2 Console and in the navigation panel, click Key Pairs.

Caution! Different regions have different key pairs. Therefore, if you are going to work on a region closer to you, now is the time to change from top-right. I changed mine to Singapore.

Step 2: Create a key pair and give it a suitable name and click Create.

Note that as you create the key pair, a .pem file is downloaded automatically to your computer. Keep this file safe, as it contains your private key, generated using the RSA algorithm.

We will use this key to connect to Linux using secure shell. On Windows, this is done using PuTTY. In fact, just go on, download and install this lightweight tool.

We are now set to launch our EC2 instance. Here are the steps to do so:

Step 1: Go on and click Launch Instance button on EC2 console.

Caution! MUST Check Free tier only option to limit your selection. Or don't complain of bucks missing from your credit card :-)

Step 2: There are 22 types available and just like you, I have no clue which one to choose. But let's think about our selection. We may need Linux due to its flexibility and the preferred choice would probably be Amazon Linux due to its variety in terms of support. Let's hit Select.

Note: Hovering over the selected tier reads: Micro instances are eligible for the AWS free usage tier. For the first 12 months following your AWS sign-up date, you get up to 750 hours of micro instances each month. When your free usage tier expires or if your usage exceeds the free tier restrictions, you pay standard, pay-as-you-go service rates.

Step 3: Another lengthy table of options to choose from. Here, you define your specific requirements. Although we're going with the free tier (t2.micro), for real world applications you should carefully choose your instance according to your computational needs. AWS has options for general purpose, memory optimized, storage optimized, CPU optimized and graphics optimized instances. For now, select t2.micro and click Review and Launch.

Note that we skipped some configuration options, like Storage, Tag and Security Group. These are for you to explore, but let's just have a few sentences on these:
  • Configure Instance: offers various networking options, some incurring charges.
  • Add Storage: is self-explanatory. The free tier is on limited magnetic disk drive, but you can add more storage to fit your needs
  • Tag Instance: simply labels your instance, for example: "DataScienceProject-1". You may need this when creating several instances
  • Configure Security Group: can create new rules about how you would restrict access to your instance. For example, you can limit the SSH (secure shell) access to your own IP address only

Step 4: Finally, click Launch button, which would bring a popup and ask if you want to proceed with a key or without any? We shall select the key we created earlier. You can also create new Key pair here. Check "I acknowledge..." option and click Launch Instances.

And that was all. You just created and launched your first EC2 instance on AWS. Next, we'll have a look at what to do with this free resource.

I'm a newcomer and may have made plenty of mistakes. Please leave comments for corrections wherever you find them.


Friday, February 20, 2015

Titanic: A case study for predictive analysis on R (Part 4)

Working with the Titanic data set picked from Kaggle.com's competition, we predicted passenger survival with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of simply filling them with the mean or median. Let's start with Age.

Looking at the available data, we can hypothetically correlate Age with attributes like Title, Sex, Fare and HasCabin. Also note that we previously created the variable AgePredicted; we will use it here to identify which records were filled previously.

> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)

> dataset$Age[dataset$AgePredicted == 1] <- PredAge

The table(PredAge) gave us the following:
  2  23  25 
  8 154 101 

Meaning that values 2, 23 and 25 were predicted for age variable for 8, 154 and 101 records respectively.
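Since Age is a numeric target, a regression tree is a natural alternative to the classification tree used above. The following is only a minimal sketch on synthetic data (the toy data frame and its two titles are made up for illustration, not taken from the Kaggle files); it uses rpart with method="anova", so each leaf predicts the mean age of the training records that fall into it.

```r
library(rpart)

set.seed(1)
# Synthetic passengers (assumption): "Master" rows are children, "Mr" rows are adults
toy <- data.frame(
  Title = factor(rep(c("Master", "Mr"), each = 50)),
  Age   = c(runif(50, 1, 12), runif(50, 20, 60))
)

# method = "anova" grows a regression tree, so predictions are ages
fit <- rpart(Age ~ Title, data = toy, method = "anova")

# Predict the age for one new passenger of each title
new_rows <- data.frame(Title = factor(c("Master", "Mr"), levels = levels(toy$Title)))
pred <- predict(fit, newdata = new_rows)
print(pred)
```

Each prediction here is simply the mean age of the matching title group; with more predictors (Sex, Fare, HasCabin) the tree would split further.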

Furthermore, instead of providing fixed ranges for AgeGroups by judgement, we will use k-means clustering to derive age groups. The commands below create 7 clusters of the Age variable; the second line assigns each record in dataset a numeric cluster ID.

> k <- kmeans(dataset$Age, 7)

> dataset$AgeGroup <- k$cluster

Let's have a peek at the centers of these clusters as well as their distribution:
> k$centers
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379
7  4.117021

> table(k$cluster)
  1   2   3   4   5   6   7 
127 139  59 338 198 354  94 

Let's see if we have any improvement in our results:

> formula <- Survived ~ Sex + AgeGroup + Pclass
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="rpart_learn_age.csv", row.names=FALSE)

Hurrah! We hiked to 310th position with 80.383% accuracy. (Note that the ranks get improved with time as competition slows)

We'll end our data pre-processing here. Next, we will try some more classification models like Random forests and Support vector machines and see if we can do any better than this.

Friday, January 16, 2015

Titanic: A case study for predictive analysis on R (Part 3)

In our previous attempt, we applied some machine learning techniques to our data and predicted the values for target variable using AgeGroup, Sex, Pclass and Embarked attributes. Now, we will further explore other attributes and see how much information we can extract.

This time, instead of keeping the test set apart, we will merge it into the training data set. This will enable us to collect the complete range of values for each attribute, in case some values are missing from the training set:

> dataset$Dataset <- 'train'
> testset$Dataset <- 'test'
> testset$Survived <- 0
> dataset <- rbind(dataset, testset[,c(1,13,2:12)])

This may look like a strange way to merge two data sets, so here is some explanation. The first two lines add a column Dataset to identify whether a record is from the training set or the test set. The third line adds a column Survived to testset, so that both dataset and testset have identical columns. The last line merges both data sets using the rbind (row bind) function. In the parameters, we defined the column sequence of testset so that it is identical to that of dataset, because when we added the Survived column to the test set, it went to the end, while in the training set it is 2nd in order.
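As a small sketch of the same merge, the column order can also be taken from the training set's column names instead of hard-coded positions. The two toy data frames below are stand-ins made up for illustration; only the column names mirror the Kaggle files.

```r
# Toy stand-ins for the Kaggle train and test sets (made-up rows)
train <- data.frame(PassengerId = 1:2, Survived = c(0, 1), Sex = c("male", "female"))
test  <- data.frame(PassengerId = 3:4, Sex = c("female", "male"))

train$Dataset <- "train"
test$Dataset  <- "test"
test$Survived <- 0   # placeholder so both frames share the same columns

# Reorder test's columns to match train's by name, then stack the rows
combined <- rbind(train, test[, names(train)])
print(combined)
```

Selecting by name keeps working even if extra columns are added to both sets later, whereas positions such as c(1,13,2:12) have to be recounted.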

Let's resume our exploration; we start with Name. At first glance, the name appears to be useless, as it is unique for each passenger. But looking closely, we notice a few things: each name contains a Family name and a Title; FamilyName can be used to create relationships among passengers, while titles like Master, Capt, Sir, Lady tell us people's age, job and the social class they belong to.

We will partition the names into sub parts and extract Title and FamilyName. The following command uses sapply function, which basically applies a function to each element in the vector passed as argument. We pass the Name attribute and apply strsplit function that splits a string based on given criteria, which in our case is either , (comma) or . (period). The next line chops off extra white space.

> dataset$Title <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][2]})

> dataset$Title <- sub(' ', '', dataset$Title)
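To see what the split produces, here is a minimal sketch on three sample names written in the same "Family, Title. Given names" layout. The vector names_vec and the variables family and title are illustrative names, not part of the walkthrough's dataset; trimws is used as an alternative to sub for removing the surrounding white space.

```r
# Sample names in the "Family, Title. Given names" layout
names_vec <- c("Braund, Mr. Owen Harris",
               "Heikkinen, Miss. Laina",
               "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Split on comma or period; piece 1 is the family name, piece 2 the title
parts  <- strsplit(names_vec, split = "[,.]")
family <- sapply(parts, function(p) p[1])
title  <- trimws(sapply(parts, function(p) p[2]))

print(data.frame(family, title))
```

The same family vector could serve as the FamilyName attribute mentioned above.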

Let's have a look at the titles we have in the data.

> unique(dataset$Title)

These are 18 distinct titles, out of which, we will merge some to keep our model simple.
Mlle is for Mademoiselle in French and Mme is for Madame. Similarly, Dona is Spanish for Lady. Jonkheer and the Countess are, again, titles of nobility. To keep things simple, all of these can be merged into one name "Lady".

> dataset$Title[dataset$Title %in% c('Mlle', 'Mme', 'Ms', 'Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

This cuts our distinct values to 13.

Submitting our predictions from the new model with Titles added did not improve our score on Kaggle, but it didn't decrease it either. So, we will keep this information intact.

We can further notice some Titles like Master, Miss and Ms, which can give us a clue about the age of the passenger. We can make use of these when filling in missing values of age.

Let's search for all passengers with missing age and Title "Master" and fill them with mean age of passengers with Master titles whose age is available. But first, we will reset the Age variable by reading both data sets in a temporary variable and combining to get original values of Age:

> temp.train <- read.csv("train.csv")
> temp.test <- read.csv("test.csv")
> temp.test$Survived <- 0
> tempset <- rbind(temp.train, temp.test[,c(1,12,2:11)])
> dataset$Age <- tempset$Age
> bad <- is.na(dataset$Age)
> dataset$Age[bad & dataset$Title == 'Master'] <- mean(dataset$Age[dataset$Title == 'Master'], na.rm=TRUE)

Fill the missing age for some other titles as well:

> dataset$Age[bad & dataset$Title == 'Miss'] <- mean(dataset$Age[dataset$Title == 'Miss'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mr'] <- mean(dataset$Age[dataset$Title == 'Mr'], na.rm=TRUE)
> dataset$Age[bad & dataset$Title == 'Mrs'] <- mean(dataset$Age[dataset$Title == 'Mrs'], na.rm=TRUE)
> dataset$AgePredicted <- 0
> dataset$AgePredicted[bad] <- 1
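The per-title lines above all follow one pattern, which can be sketched in a single step with ave(). This is only an illustration on a made-up toy data frame; it fills every missing age with the mean age of the same title and flags the filled rows, as AgePredicted does above.

```r
# Made-up example: two titles, one missing age in each
toy <- data.frame(Title = c("Master", "Master", "Master", "Mr", "Mr", "Mr"),
                  Age   = c(4, NA, 8, 30, 40, NA))

was_missing <- is.na(toy$Age)

# ave() returns, for every row, the mean Age of that row's Title group
group_mean <- ave(toy$Age, toy$Title, FUN = function(a) mean(a, na.rm = TRUE))

toy$Age[was_missing] <- group_mean[was_missing]
toy$AgePredicted <- as.integer(was_missing)
print(toy)
```

Unlike the explicit lines, this version covers every title at once, so rare titles with at least one known age are filled as well.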

Fill out the remaining ones and round off before discretizing:

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> dataset$Age <- round(dataset$Age)

Discretize again as before:

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'
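The same binning can be sketched with cut(), which assigns all groups in one call. The age vector below is made up to hit each boundary; the breaks and labels mirror the ranges used above, with right = FALSE so that each interval includes its lower bound and excludes its upper bound.

```r
# Made-up ages chosen to sit on each boundary
age <- c(0, 1, 2, 12, 13, 19, 20, 39, 40, 80)

age_group <- cut(age,
                 breaks = c(-Inf, 2, 13, 20, 40, Inf),
                 labels = c("infant", "child", "teenager", "young", "old"),
                 right  = FALSE)   # intervals are [lower, upper)

print(data.frame(age, age_group))
```

cut() returns a factor, which rpart treats as a categorical variable.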

Time to see how we have improved in terms of accuracy. But we will not include Title yet in our formula:

> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_relative_ages.csv", row.names=FALSE)

No improvement?! Well, how about adding some more variables? We already have Title; another can easily be added. Cabin shows the cabin numbers that a passenger booked, so we can find out whether or not a passenger booked a cabin and introduce HasCabin as a new variable.

> dataset$HasCabin <- 1
> dataset$HasCabin[dataset$Cabin == ''] <- 0

We will also make use of the SibSp and Parch variables:

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + Title + SibSp + Parch
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_family_attr.csv", row.names=FALSE)

Let's see how we do now with such a load of new information.

This is heartbreaking, I admit. But don't give up just yet. We have more variables to explore, only this time, we'll be a little aggressive ;-)

Have a look at Fare. It is a continuous floating-point value with only 1 missing record. We will do three things with this one:
1. Round off
2. Fill in missing value
3. Discretize

The first two steps are quick:

> dataset$Fare <- round(dataset$Fare)
> bad <- is.na(dataset$Fare)
> dataset$Fare[bad] <- median(dataset$Fare, na.rm=TRUE)

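For the third step, we won't just discretize on judgement this time, but rather use a smart algorithm, k-means, which creates clusters or groups of data such that each value lies close to the center of its own cluster while the clusters stay well apart. The kmeans function ships with R's built-in stats package, so it is available even without loading the "caret" library. This article will give you a thorough description of what we are talking about here.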

First, have a quick glance at how the data is distributed; this is to find out the ideal number of clusters we want to make.

> plot(table(dataset$Fare), ylim=c(0,75))

Do you notice some gaps? There's a very thin gap between 0 and next value and huge gap between 512 and a step before. Observing closely, we can visually see that there are at least 7 groups in the data. We can call k-means algorithm to create 7 clusters of the Fare variable.

> library(caret)
> kmeans(x=dataset$Fare, centers=7, iter.max=1000)$centers
1 133.391304
2  28.755556
3   0.600000
4  68.291925
5 237.882353
6   9.943759
7 512.000000

What you see above is the virtual center points that the algorithm has created for these clusters. We will create a new variable FareGroup and assign every record a cluster number suggested by k-means:

> k <- kmeans(x=dataset$Fare, centers=7, iter.max=1000)
> dataset$FareGroup <- k$cluster
> table(dataset$FareGroup)
  1   2   3   4   5   6   7 
 38  19 517  46 161 270 258 

The k$cluster has a cluster number for each index - assigned to the FareGroup. The table shows how the records are distributed among the clusters. There's one extra tweak here, I noticed that none of the passengers who paid the amount below 6.0 GBP survived. We can create an additional cluster for such passengers as well.

> dataset$FareGroup[dataset$Fare < 6] <- 0

Now we have another refined variable to add to our rpart model. Let's give it another shot, but we'll leave out Title to make sure that it is FareGroup that increases the accuracy (if it does):

> formula <- Survived ~ Sex + AgeGroup + Pclass + Embarked + HasCabin + SibSp + Parch + FareGroup
> rpart_fit <- rpart(formula, data=dataset[dataset$Dataset == 'train',], method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)
> testset$Survived <- predict(rpart_fit, dataset[dataset$Dataset == 'test',], type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_faregroup.csv", row.names=FALSE)
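Submitting to Kaggle is one way to see whether a new variable helps. Another is to hold back part of the training data and compare accuracies locally. The sketch below is not part of the original workflow: it uses synthetic passengers with a made-up survival rule, and the names toy, train, valid and holdout_accuracy are hypothetical.

```r
library(rpart)

# Synthetic passengers; the survival rule below is invented for illustration
set.seed(11)
n <- 400
toy <- data.frame(
  Sex       = factor(sample(c("male", "female"), n, replace = TRUE)),
  FareGroup = factor(sample(0:3, n, replace = TRUE))
)
# Made-up rule: women survive, and so do men in the top fare group
toy$Survived <- factor(as.integer(toy$Sex == "female" | toy$FareGroup == "3"))

# Keep 300 rows for training and hold back the remaining 100 for checking
idx   <- sample(n, 300)
train <- toy[idx, ]
valid <- toy[-idx, ]

# Fit a tree on the training rows and return its accuracy on the held-back rows
holdout_accuracy <- function(f) {
  fit <- rpart(f, data = train, method = "class")
  mean(predict(fit, valid, type = "class") == valid$Survived)
}

acc_base <- holdout_accuracy(Survived ~ Sex)
acc_fare <- holdout_accuracy(Survived ~ Sex + FareGroup)
print(c(base = acc_base, with_fare = acc_fare))
```

On real data the gap will be far smaller than in this toy setup, but the comparison works the same way.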

It WORKED! We are now at 628th position. But there's a long journey ahead. Next, we will push our limits on data preparation by learning the missing values, and also see how we can fine-tune the recursive partitioning algorithm to fit our data better.

Saturday, January 10, 2015

Titanic: A case study for predictive analysis on R (Part 2)

Previously, we successfully classified passengers as "will survive" or "will not survive" with 76.5% accuracy using Gender only. We will now extend our experiment further and include other attributes in order to improve our accuracy rate.

Let's resume..


Real data is never in ideal form. It always needs some pre-processing. We will fill missing values, extract some extra information from the available variables and convert continuous-valued fields into discrete-valued ones.

First, let's have a look at how the Age variable is distributed. We will pass the parameter useNA="always" to the table function so that missing values are counted as well:

> table(dataset$Age, useNA="always")

We see 177 missing values (NA's). We will fill them with the median value of Age as a straightforward solution. The commands below store TRUE/FALSE values in a vector bad by checking whether the value of Age is missing for each record, then store the median age wherever it is missing, and finally plot the bar graph of Age.

> bad <- is.na(dataset$Age)
> dataset$Age[bad] <- median(dataset$Age, na.rm=TRUE)
> plot(table(dataset$Age))
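To see the same steps on something small, here is a sketch with a made-up age vector (the values are invented, not taken from the Titanic data):

```r
# Made-up ages with two missing values
age <- c(22, 38, NA, 35, NA, 54, 2)

bad  <- is.na(age)                # TRUE where the age is missing
fill <- median(age, na.rm = TRUE) # median of the known ages: 35 for these values

age[bad] <- fill                  # overwrite only the missing entries
print(age)
```

The na.rm=TRUE part matters: without it, median() returns NA as soon as a single value is missing.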

We now have Age with no missing values. In order to simplify our decision rules, we need to change this continuous attribute into a discrete-valued attribute.

Let's discretize this field into infant, child, teenager, young and old.

> dataset$AgeGroup <- 'old'
> dataset$AgeGroup[dataset$Age < 2] <- 'infant'
> dataset$AgeGroup[dataset$Age >= 2 & dataset$Age < 13] <- 'child'
> dataset$AgeGroup[dataset$Age >= 13 & dataset$Age < 20] <- 'teenager'
> dataset$AgeGroup[dataset$Age >= 20 & dataset$Age < 40] <- 'young'
> dataset$AgeGroup[dataset$Age >= 40] <- 'old'
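The same binning can also be written in a single call with cut() from base R. This is only an alternative sketch using the same age boundaries; the ages below are made up for illustration.

```r
# Made-up ages that touch every boundary
age <- c(0.5, 1, 5, 12, 13, 19, 20, 39, 40, 80)

# right = FALSE makes each interval include its lower bound and exclude its upper bound,
# which matches the ">= lower & < upper" conditions
age_group <- cut(age,
                 breaks = c(-Inf, 2, 13, 20, 40, Inf),
                 labels = c("infant", "child", "teenager", "young", "old"),
                 right  = FALSE)

print(data.frame(age, age_group))
```

Note that cut() returns a factor, whereas the step-by-step assignments produce a character vector.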

> prop.table(table(dataset$Survived, dataset$AgeGroup), margin=2)

The above distribution tells us that the survival rate of infants is very high; children have about half the chance of survival; teenagers and young people have a lower rate; and finally, the old have the lowest survival rate.

Let's rebuild our model. This time, if a passenger is male, we make the decision on the basis of the age variable.
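When reading such a table, it helps to tell apart shares of all passengers from shares within each age group; only the latter is a survival rate. A minimal sketch on made-up records (both vectors are invented for illustration):

```r
# Made-up records: 1 = survived, 0 = did not survive
survived <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
group    <- c("infant", "infant", "child", "child", "young",
              "young", "young", "old", "old", "old")

tab <- table(survived, group)

# Share of ALL records in each cell (the whole table sums to 1)
joint <- tab / length(survived)

# Share WITHIN each age group (each column sums to 1): the survival rate per group
within <- prop.table(tab, margin = 2)

print(joint)
print(within)
```

In the second table the row labelled 1 is the survival rate of each group, regardless of how many passengers the group contains.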

> testset$AgeGroup <- 'old'
> testset$AgeGroup[testset$Age < 2] <- 'infant'
> testset$AgeGroup[testset$Age >= 2 & testset$Age < 13] <- 'child'
> testset$AgeGroup[testset$Age >= 13 & testset$Age < 20] <- 'teenager'
> testset$AgeGroup[testset$Age >= 20 & testset$Age < 40] <- 'young'
> testset$AgeGroup[testset$Age >= 40] <- 'old'
> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1
> male <- testset$Sex == 'male'
> testset$Survived[male & (testset$AgeGroup == 'infant' | testset$AgeGroup == 'child')] <- 1
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)

> write.csv(submit, file="females_and_children_survive.csv", row.names=FALSE)

Let's check what Kaggle says about our new model.

As seen, our score bumped to 77%. This may not come to you as a very impressive score, but this is not where we are going to stop.

At this point, you must be wondering how complex our code becomes once we start adding more and more attributes. Thumbs up if you wondered this. This is where we make use of smart machine learning algorithms for classification.

Classification Algorithms

We can think of predictive analysis as a case of classification, where we classify entities based on some prior information about the domain. Supervised classification is one of the core tasks of machine learning. It is supervised in the sense that we first show the algorithm examples whose outcome is already known, and the algorithm bases its further decisions on them.

A decision tree is among the simplest examples of classification: we construct a tree of rules based on the probability of occurrence of various events. For example, we earlier made a general rule:

If Sex is female then Survive, otherwise if AgeGroup is infant or child then Survive, otherwise Not Survive.

There is a variety of decision tree algorithms that can be used in R out of the box to create such decision trees automatically. Let's try one...

Recursive partitioning is an implementation of decision trees in the rpart package in R, used for building classification and regression trees.

You'll first have to install two packages: rpart itself and rpart.plot to plot its tree:

> install.packages(c("rpart","rpart.plot"))
> library(rpart)
> library(rpart.plot)
> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup, data=dataset, method="class")

> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval])

Here, the rpart() function analyzes the data and returns a model fitted to the data over the given formula. The formula states that the target variable is Survived and the decision tree is to be constructed using the Sex and AgeGroup attributes.

The prp() function plots the learnt tree. This is what it looks like:

Reading the plot: the node where Sex is not male is labelled 35%, and the node where Sex is male and AgeGroup is among old, teenager and young is labelled 4%. With extra=100, prp() prints the percentage of observations that fall into each node next to the predicted class, so these labels are best read as node sizes rather than as chances of survival.

We will now predict the Survived values using this model. We pass rpart_fit (the learnt tree), testset and the type of prediction we want, i.e. class, to the predict() function.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_sex_agegroup.csv", row.names=FALSE)
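To try the fit-then-predict cycle without the Kaggle files, here is a small self-contained sketch. The passengers are synthetic and follow a made-up rule (women and young boys survive), so the data frames toy and new_passengers are assumptions for illustration only.

```r
library(rpart)

# Synthetic passengers; the survival rule is invented for illustration
set.seed(7)
n <- 200
toy <- data.frame(
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  AgeGroup = factor(sample(c("infant", "child", "teenager", "young", "old"),
                           n, replace = TRUE))
)
toy$Survived <- factor(as.integer(toy$Sex == "female" |
                                  toy$AgeGroup %in% c("infant", "child")))

# Learn a classification tree from the labelled examples
fit <- rpart(Survived ~ Sex + AgeGroup, data = toy, method = "class")
print(fit)  # text version of the learnt rules

# Unseen passengers must use the same factor levels as the training data
new_passengers <- data.frame(
  Sex      = factor(c("female", "male", "male"), levels = levels(toy$Sex)),
  AgeGroup = factor(c("old", "child", "old"), levels = levels(toy$AgeGroup))
)

# type = "class" returns the predicted label instead of class probabilities
pred <- predict(fit, new_passengers, type = "class")
print(pred)
```

Printing the fitted object is a quick way to read the rules when no plot is at hand.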

The results from Kaggle did not show any improvement from previous results, but we now have a way to add more attributes without having to code complex rules.

Let us now add more attributes to the formula in rpart and see what develops.

> rpart_fit <- rpart(formula=Survived ~ Sex + AgeGroup + Pclass + Embarked, data=dataset, method="class")
> prp(rpart_fit, type=1, extra=100, box.col=c("pink", "palegreen3")[rpart_fit$frame$yval], cex=0.6)

As observed, the decision tree is quite complex after adding only two more attributes to the formula. Imagine writing code for this :-)

Proceed to submitting results on Kaggle.

> testset$Survived <- predict(rpart_fit, testset, type="class")
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="rpart_more_attr.csv", row.names=FALSE)

This bumped our score to 78.947% and also brings us to 706th rank in the competition.

Let's now attempt to add all of the attributes. But we first need to clean them, right?! Yes, in the next part, we will do so.