Tuesday, April 12, 2016

Titanic: A case study for predictive analysis on R (Final)

In our previous attempts, we tried to predict whether a passenger is likely to survive, as part of a competition on Kaggle.com. We used some statistics and machine learning models to classify the passengers.

In our final part, we will push our limits using advanced machine learning models, including Random Forests, Neural Networks, Support Vector Machines and other algorithms, and see how long we can torture our data before it confesses.

Let's resume from where we left off. We are applying an implementation of the Random Forest method of classification. In short, this model grows many decision trees, each on a random sample of the data, and then lets the trees vote on the class of each record. This way, overfitting, the common issue with decision trees, is mitigated (learn more here).
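To make the voting idea concrete, here is a minimal base-R sketch. The vote matrix is made up purely for illustration (it is not output from our model), and a real random forest does this counting internally.

```r
# Hypothetical votes: rows are passengers, columns are trees (1 = survived, 0 = died)
votes <- matrix(c(1, 0, 1, 1, 0,
                  0, 0, 1, 0, 0,
                  1, 1, 1, 0, 1),
                nrow = 3, byrow = TRUE)

# Share of trees voting "survived" for each passenger
vote_share <- rowMeans(votes)

# Majority vote: predict survival when more than half of the trees say so
majority <- as.integer(vote_share > 0.5)
print(majority)  # 1 0 1
```

A single tree can be badly wrong on a passenger, but it gets outvoted as long as most of the other trees disagree with it.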

> library(randomForest)
> formula <- as.factor(Survived) ~ Sex + Pclass + FareGroup + SibSp + Parch + Embarked + HasCabin + AgePredicted + AgeGroup 
> set.seed(seed)
> rf_fit <- randomForest(formula, data=dataset[dataset$Dataset == 'train',], importance=TRUE, ntree=100, mtry=1)
> varImpPlot(rf_fit)
> testset$PredSurvived <- predict(rf_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)

> write.csv(submit, file="rforest.csv", row.names=FALSE)

The results were not as promising as expected: we did not see any improvement with this algorithm. This suggests that our earlier decision tree model was not overfitting by much.
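One way to check for overfitting without spending a Kaggle submission is to look at the out-of-bag (OOB) error that randomForest records while training. The sketch below uses the built-in iris data as a stand-in for our training set, and the seed is arbitrary; treat it as an illustration rather than the exact procedure used in this series.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# iris stands in for the Titanic training data here
fit <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

# err.rate has one row per tree; the last row is the error of the full forest
oob_error <- fit$err.rate[nrow(fit$err.rate), "OOB"]
print(oob_error)

# Confusion matrix built from the out-of-bag predictions
print(fit$confusion)
```

Each tree is tested on the records it did not see during training, so the OOB error behaves like a built-in validation score.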

This is the point where we rethink our data. So far, we noticed that missing Age is an important factor; some records are missing Fare and Embarked; we extracted Title from names; and we derived a boolean variable, HasCabin, from the seemingly useless variable Cabin.

Now let's have a look at Ticket.

> unique(dataset$Ticket)

Notice anything? We see some strings like PC, CA, SOTON, PARIS, etc. Now, without actually knowing what these represent, how about clipping off the digits and extracting only this part? Here's how we'll do so (you'll need to install the stringr package if it's missing):

> library(stringr)
> # Drop everything that is not a letter, keeping only the alphabetic part of the ticket
> dataset$TicketPart <- str_replace_all(dataset$Ticket, "[^[:alpha:]]", "")
> dataset$TicketPart <- as.factor(dataset$TicketPart)
> plot(table(dataset$TicketPart[dataset$TicketPart != '']))

The plot reveals that some parts appear frequently. These might hint at where the passenger is coming from.
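If you'd rather not install stringr, the same clipping can be sketched with base R's gsub. The ticket strings below are a small illustrative sample in the style of the Ticket column, not the full data.

```r
# A few ticket strings in the style of the Ticket column (illustrative sample)
tickets <- c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803")

# Remove every character that is not a letter
ticket_part <- gsub("[^[:alpha:]]", "", tickets)
print(ticket_part)  # "A" "PC" "STONO" ""

# Purely numeric tickets end up as empty strings; count only the non-empty prefixes
print(table(ticket_part[ticket_part != ""]))
```

Note that punctuation and digits inside the prefix are dropped too, so "STON/O2." becomes "STONO".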

Next, we can use SibSp and Parch to determine the size of the family on board. The thought behind this is that if more members of a family are on board, they'll have more support, and possibly better chances of survival.

> dataset$FamilySize <- dataset$SibSp + dataset$Parch + 1
# +1 for the passenger himself

Torturing the data further, we'll explore the Name variable some more. We notice that apart from Title, we can also extract Surname, since names are in the format [Surname], [Title] [Given Names].

> dataset$Surname <- sapply(dataset$Name, FUN=function(x) {strsplit(as.character(x), split='[,.]')[[1]][1]})
> dataset$Surname <- factor(sub(' ', '', dataset$Surname))

We are only interested in frequent names, so we collapse into a single 'Small' level every family of size less than 3, as well as every FamilyID that appears no more than twice:

> dataset$FamilyID <- paste(as.character(dataset$FamilySize), dataset$Surname, sep="")
> dataset$FamilyID[dataset$FamilySize <= 2] <- 'Small'
> famIDs <- data.frame(table(dataset$FamilyID))
> famIDs <- famIDs[famIDs$Freq <= 2,]
> dataset$FamilyID[dataset$FamilyID %in% famIDs$Var1] <- 'Small'
> dataset$FamilyID <- factor(dataset$FamilyID)
> plot(table(dataset$FamilyID[dataset$FamilyID != 'Small']))

As the plot shows, we have several records with the same FamilyID and family size. This suggests that Surname successfully groups passengers of the same family.
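The FamilySize and FamilyID steps are easier to follow on a handful of rows. The passengers below are invented for illustration (they are not from the Titanic data), but the logic mirrors the code above.

```r
# Invented passengers: three Smiths travelling together, one solo traveller,
# and one Lee whose relatives are not present in these rows
toy <- data.frame(
  Name  = c("Smith, Mr. John", "Smith, Mrs. Jane", "Smith, Master. Tom",
            "Doe, Miss. Ann", "Lee, Mr. Kim"),
  SibSp = c(1, 1, 0, 0, 2),
  Parch = c(1, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)

# Family size = siblings/spouses + parents/children + the passenger
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# Surname is everything before the first comma or period
toy$Surname <- sapply(toy$Name, function(x) strsplit(x, split = "[,.]")[[1]][1],
                      USE.NAMES = FALSE)

# Combine size and surname, then collapse small families
toy$FamilyID <- paste0(toy$FamilySize, toy$Surname)
toy$FamilyID[toy$FamilySize <= 2] <- "Small"

# Also collapse IDs that appear no more than twice
counts <- table(toy$FamilyID)
rare <- names(counts)[as.vector(counts) <= 2]
toy$FamilyID[toy$FamilyID %in% rare] <- "Small"

print(toy[, c("Name", "FamilySize", "FamilyID")])
```

Only the Smiths keep their own ID ("3Smith"); the solo traveller and the lone Lee both fall into "Small", the former because of family size and the latter because the ID appears only once.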

There can be many ways to continue torturing this data, but we will now limit ourselves to these variables only.

Now we'll apply SVM and other models (wildly) and see what combination of variables worked for us.

> library(e1071)  # provides svm()
> formula <- as.factor(Survived) ~ Sex + AgeGroup + Pclass + FareGroup + SibSp + Parch + Embarked + Title + FamilySize + FamilyID + HasCabin + TicketPart
> svm_fit <- svm(formula, data=dataset[dataset$Dataset == 'train',])
> # Predict on the test rows of dataset, which carry the newly derived variables
> testset$PredSurvived <- predict(svm_fit, dataset[dataset$Dataset == 'test',])
> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$PredSurvived)
> write.csv(submit, file="svm.csv", row.names=FALSE)

I continued my experiments with different models, not all of which can be described here; for convenience, I have created this R script and partitioned the code into functions.

What should be noted is that later in my experiments, I used Age and Fare instead of their discretized variables for accuracy (yes, the execution time increases as a result).

Here are the results from some of the models:

Random Forest: [image not available]

Neural Networks: [image not available]

For our experiments so far, CForest proved to be the top performer. But please don't stop here; apply your own ideas of twisting and squashing data to gain more accuracy.

For now, I guess this series should serve as a good starter on Predictive analytics. You can find a variety of different problems on Kaggle.com to participate in and polish your analytics spectacles.

Please feel free to comment and maybe share your score...

Tuesday, November 17, 2015

Do's and don'ts for Team Leaders - 1

7 do's for Team Leaders

Say you've just been promoted to a team lead position and have no clue what to do, or received no training from your previous leader in this regard.
Well, here's a little something for you. This is by no means a thorough guide, but rather a kick-starter.
These 7 points below are an extract of what I've experienced as a Team leader of smart and versatile software engineers.

1. Be friendly and supportive:

Simplest and most general rule that fits everywhere. Hard to achieve though.
Friendliness requires openness and transparency. Try to make work fun without compromising on your company's mandate.

Dine with your team, talk about stuff other than work. Share jokes and inspirational quotes/videos. Make your presence comfortable. A good leader knows his players' interests, current goals, future plans, daily activities, pet names, favorite pizza toppings... you get the point.

Being supportive means you should always be available, set examples by doing and be the easiest person to communicate with. When you need something, ask about their availability and go to their desk; when they need your help, make them a priority and again, go to their desk.

2. Believe in them:

Believe in your team mates, more than they believe in themselves. Often, those who lack experience also lack the ability to estimate their own capabilities. One of my former seniors had me do things that I could not believe I was able to do, by breaking the job into unit tasks.

Here, by believing, I do not mean just pretending or saying that you believe in them. Have faith, and be courageous to assign them critical tasks if they are capable. Make your team mates realize the importance of such tasks and why you think they have more than half the chance to handle them. This practice builds confidence. A critical task may involve some risk and failing might cost, but the damage of not doing so may and will cost much more.

But be careful not to be unrealistic; a lioness will encourage her cubs to hunt a deer, but won't let them go near a Gaur. I once made the mistake of assigning one of my best players a task without proper training. The result was a deadline unmet, a deliverable missed and an unhappy client.

3. Set vision centered goals:

Have a clear vision in your mind about why you are leading your team. What do you want them to be? For example, you can have a vision of training people to have your skills, plus their own skills, minus your weaknesses. Once you have this, set timely goals according to their skills, areas of improvement and interests.

A bright player on my team is very dedicated and hard working. She needs improvement in stress handling, so I set specific goals that put her under just as much stress as she can handle; while keeping this constant, I occasionally assign her additional tasks. The assumption was that this experiment would raise her stress threshold. After a few months, she showed significant improvement.

One of the most gifted players in my team is never troubled with work and always delivers quality stuff at supersonic pace. Now the downside is that her efforts are limited to making software; the rest of the teams have not benefited from her experience. So, her goal for the year is to identify a common problem related to software engineering processes in the organization and fix it.

You cannot exercise this well if you are not friendly and supportive. Also, if you are not sincere with them or put yourself before them, you will hesitate to set goals that might ask more from you than from them. If such is the case, you're on the verge of being a bad example.

4. Be honest:

Honesty is the first chapter in the book of wisdom.
[Thomas Jefferson]

You can be manipulative in the name of diplomacy, hide the facts and call it strategy and still be successful - temporarily. Eventually, your strategy will fall flat on its face when your mates follow in your footsteps (what goes around comes around).

Fabricating project deadlines, making promises about appraisals and promotions you know won't happen, and making up stories to turn down leave requests are the most common practices of dishonesty in the industry.

My manager clearly gives me project goals and deadlines at once and leaves Yes/No up to me. The only occasions he heard "No" from me were when he too admitted that the job was impossible. I follow the same practice and without keeping any buffer, my team almost always gets things ready in due time.

4. Have communication protocols:

Miscommunication is the root of all business mishaps. You may define "efficient" as someone who executes tasks quickly, while someone else using the same term might be talking about someone who gets things done in the most optimal way.

Now, I'm not suggesting that you write a dictionary and cram it into your memory. Communication gets stronger with time; all you need to do is be consistent in what you mean when you say something in particular. My manager does not often use "unacceptable", but when he does, it's red alert - always. To point out a mistake casually, he uses "messed up".

Build your protocols just like that. A few examples are:
- "hang on": I'm coming to you personally
- "wait a minute": I'll be right back
- "what do you think?": you told me a problem I'm sure you can solve yourself
- "will get back to you": I'll follow up when I have some progress on this matter
- "by the end of the week": by Friday, before 5:00pm (closing time)
- "it is important": it must be done to meet the goals
- "it is urgent": do it now or it can never be done

Do not use vague words. "It should be ready": is it "it is ready" or "it is not ready"? Using ambiguous terms to stall people is unprofessional, unethical and disrespectful (humour is another thing). Leave that to your politicians.

Following this principle will eventually build a strong communication bridge, which will relieve you from most of your worries.

5. Appreciate and criticize:

Successful people have control over their emotions. You are human, so you will get aggressive sometimes - inevitably. But remember Vince Lombardi's rule of thumb: "Praise in public; criticize in private". Appreciation should be loud and casual, and should not be a prologue to a lecture on how to improve more. Also, be responsive, not reactive (a leaf out of The 7 Habits of Highly Effective People by Stephen Covey).

Criticism should be constructive: point out mistakes and explain ways to correct them. If you're angry, cool down. Give yourself a moment. See if the person is already doing his best. If such is the case, it is not fair of you to criticize; you should either manage resources better or provide more training. Try to ignore minor issues. If you are fault tolerant, your team will not hesitate to report to you if a mishap occurs. I once deleted the database of a live server (repeat: a live server). My manager (now CEO) is a cool captain, so I informed him immediately. The whole next day (an off day), he sat with me and we recovered about 99% of the lost data. Had he been an intolerant dictator, I would have tried to cover up and the damage would've been much worse.

It is a good habit to give your team more credit than is due for good performance; if they're under-performing or have messed something up, own their mistake. Take some blame off them. I've seen a manager do exactly the opposite to her team. Half of her team left her; the rest spent their time making up cover stories, until she was relieved from the company the same year.

6. Welcome mistakes:

You are not what you are without making some mistakes. Making a mistake should not cause embarrassment to anyone in your environment. Rather, encourage your team by sharing your own mistakes, how you corrected them and what lessons, good or bad, you learned. Mistakes will happen, but won't repeat if people start sharing them with others. I shared the story of my blunder on the live database with the whole department, and now backing up databases before any kind of change is a common practice throughout the team.

7. Performance feedback:

Performance feedback is an effective way to improve yourself. Establish a mechanism to provide and receive feedback to and from your team members. They should see you as a person who welcomes criticism and responds positively to it. In my company, some seniors encourage their teams to do their appraisal too. This works well if you have a history of not abusing your authority in difficult times. On the other hand, some people are reluctant due to lack of confidence in you; it is not always easy to tell your senior about his flaws. You'll have to find your own way of knowing how your players think you can do a better job at leading them.

One of my team members wouldn't give any feedback on my performance. Later, I learnt the hard way that my strict response to her mistakes had damaged her performance rather than improving it. When settling this, I made her realize that the lack of feedback from her resulted in escalation of the issue.

I personally like the feedback process in academia, where pupils are asked by the administration to fill in a questionnaire to evaluate the performance of their instructors each semester.

These are a few lessons I've learnt in my not-so-long tenure as a team leader. I've tried to keep them general, but no guidelines are applicable as is to all circumstances and people. For example, a military officer will definitely agree with #4 and reject #6. Next, we'll have a look at some "don'ts", in sha Allah.

Agree/disagree, please leave comments if you think it needs improvement or something is terribly wrong...

Saturday, May 16, 2015

Playing in Amazon's Clouds - Introduction to Elastic Compute Cloud - Part 2

Connecting to Cloud

Previously, we looked at how to configure an EC2 instance on AWS. If you're not sure what this sentence was about, click here.

In this post, we'll look at some ways to connect to your EC2 instance and try out an example. I'm assuming you already know how to get to the EC2 console page from AWS home.

From here, you should go to the Running Instances link to check your instances. You should see something like this:

Right now, we only have one instance of the t2.micro configuration, reachable at the public IP address shown under Public IP. We will first create an alarm to make sure we do not hit our cap when experimenting.

Click the Alarm icon under Alarm Status. You should see a pop up to configure an alarm. We are interested in making sure that the CPU usage is under certain limits. Let's create an alarm.

We want to generate an email alert whenever our instance is consuming over 90% of processing power for 1 hour or more.

We just created a new warning notification to my email address and also set an action to Stop the instance whenever our set limits are hit.

Note that on setting the email address, you will receive an email for confirmation. If your email address is not confirmed within three days, you'll stop receiving any alerts.

Now, let's come to our real objective. We will try to connect using an SSH client for Windows, PuTTY. If you're on Windows, download and install the latest version of it. Follow the links for Linux and OSX.

When you've installed PuTTY, launch the PuTTY application. You should be able to see a configuration window:

In the host name, put: ec2-user@your_public_dns. For example:

The next step is to provide a private key. Remember we generated a pem file in our earlier walk through? Well! PuTTY uses a ppk file, or PuTTY Private Key. We can easily generate this file with a utility called PuTTYgen; this tool installs with PuTTY. Leave the current window open as is and launch puttygen.exe from your PuTTY installation directory.

Go to the Conversions menu and import the pem key pair file created earlier. Mine is vision360-keypair.pem. You should be able to see the public key and the RSA fingerprint.

Click "Save private key" button and save the key as <your-key-name>.ppk. If PuTTY displays a warning for missing passphrase, ignore and proceed. I saved my file as vision360-keypair.ppk. Close this window and go back to PuTTY.

Go to Connection > SSH > Auth. Provide the path to the private key we generated:

Click Open to connect to your EC2 machine. On first attempt, PuTTY should display a warning that the Server's host key was not previously registered and that you may not be connecting to the desired computer. Ignore this warning by pressing Yes; we know what we're doing :-)

That's it. You're in. The first thing you may want to do is to update the OS of your VM. Try the following command (this is similar to Ubuntu's apt-get):

$ sudo yum update

You'll note that the updates are downloaded and applied like a bullet train. This is most likely because the instance sits on Amazon's fast network, close to the package repositories it updates from.

Check your system's resources using some common commands:
To check disk space
$ df -h
To check memory
$ free -m

That's it for now. We'll get to some practical use of the Elastic Compute Cloud in future posts. The plan is to do some real-life data analysis on this.

Please comment if you find any mistake. Thanks...