
Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 2

Previously, we saw how to execute the built-in Word Count example on Hadoop. In this part, we will build the same application in Eclipse from the Word Count source code and run it.

First, you need to install Eclipse on your Hadoop-ready Virtual Machine (assuming that the JDK was already installed when you set up Hadoop). This can be done from the Ubuntu Software Center, but my recommendation is to download it and extract it to your Home directory. Any version of Eclipse should work; I ran these experiments on version 4.3 (Kepler).

After installation, launch Eclipse. The first thing to do is to make the Oracle JDK your default Java Runtime:
- Go to Window > Preferences > Java > Installed JREs
- If the default JRE does not point to Oracle JRE, then edit and set the directory to /usr/lib/jvm/java-7-oracle/
- Press OK to finish
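
To confirm which runtime Eclipse actually uses for your programs, a small check like the following can help. This is only a sketch: the class name RuntimeCheck is my own, and it simply prints standard JVM system properties. On the setup described above, the reported home directory would be expected to lie under /usr/lib/jvm/java-7-oracle/.

```java
public class RuntimeCheck {

    /** Builds a one-line summary of the JVM that is running this class. */
    static String describeRuntime() {
        return System.getProperty("java.vendor") + " "
                + System.getProperty("java.version")
                + " (" + System.getProperty("java.home") + ")";
    }

    public static void main(String[] args) {
        // Run this from Eclipse (Run > Run) to see the JRE the IDE launches with.
        System.out.println(describeRuntime());
    }
}
```

If the printed path points somewhere else, revisit the Installed JREs setting above.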

Now we will create a Java Application Project:
- Go to New > Java Project

- Name the project Combinatorics, since we will be doing some counting problems in this project

- No need to change anything else. Press Finish
- A Java project named Combinatorics should appear in your Package Explorer window on the Left

We will need some external libraries in order to build Hadoop's code. Download these libraries:

I have put these libraries in a zipped file here as well.

After you have collected all the libraries:
- Right click on the project > New > Folder
- Name the folder lib and finish

- Copy all jar files into the newly created folder (you can do so in Nautilus as well as in Eclipse)
- Right click on lib folder and click Refresh

- The jars you have added should now appear here
- Go to Project > Properties > Java Build Path > Libraries > Add JARs > Combinatorics > lib. Select all jar files

- Go to Project and check Build Automatically
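
If you want to verify that the jars really ended up on the build path, a quick classpath probe is one option. The sketch below is an assumption-laden helper of my own (ClasspathCheck and isOnClasspath are hypothetical names); it only asks the class loader whether a few of the Hadoop classes used later in this post can be found.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only want to locate the class.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Classes that the WordCount source in this post depends on.
        String[] required = {
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.mapreduce.Job",
            "org.apache.hadoop.io.Text"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Any line reporting MISSING suggests that a jar was not added to the Java Build Path.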

Next, we need to create a source file in the src folder. Right click on the src folder > New > Class. Name it WordCount and Finish.

Replace the contents of the newly created class with the following code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
   // Mapper: emits (word, 1) for every whitespace-separated token in a line
   public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable (1);
      private Text word = new Text ();

      public void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {
         String line = value.toString ();
         StringTokenizer tokenizer = new StringTokenizer (line);
         while (tokenizer.hasMoreTokens ()) {
            word.set (tokenizer.nextToken ());
            context.write (word, one);
         }
      }
   }

   // Reducer: sums all the counts emitted for the same word
   public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce (Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
            sum += val.get ();
         }
         context.write (key, new IntWritable (sum));
      }
   }

   public static void main (String[] args) throws Exception {
      Configuration conf = new Configuration ();
      Job job = new Job (conf, "wordcount");
      // Types of the (key, value) pairs written by the job
      job.setOutputKeyClass (Text.class);
      job.setOutputValueClass (IntWritable.class);
      job.setMapperClass (Map.class);
      job.setReducerClass (Reduce.class);
      job.setInputFormatClass (TextInputFormat.class);
      job.setOutputFormatClass (TextOutputFormat.class);
      // args[0] is the input path, args[1] is the output path
      FileInputFormat.addInputPath (job, new Path (args[0]));
      FileOutputFormat.setOutputPath (job, new Path (args[1]));
      job.waitForCompletion (true);
   }
}

This is the simplest form of a MapReduce program. We will have an in-depth look at the code later; first, we need to run this.

- Go to Run > Run to execute the program
- The program should at first end with an ArrayIndexOutOfBoundsException, because it reads the input and output paths from its arguments and none have been supplied yet
- Go to Run > Run Configurations > Arguments and add the following arguments:
/home/hadoop/Documents/books/ /home/hadoop/Documents/books/output (assuming that you followed part 1 and the text files are still in this path)

- Before you press Run, are all the Hadoop services running? You have to start them. Remember! Here is the command:
- Now press Run
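
The ArrayIndexOutOfBoundsException in the steps above is not very informative. One way to make the failure friendlier is to check the arguments before using them. The following is a minimal sketch, not part of the original WordCount source; the class ArgsCheck and the method validate are hypothetical names of my own, and the same check could be placed at the top of WordCount's main method.

```java
public class ArgsCheck {

    /** Returns null when the arguments look usable, otherwise a message describing the problem. */
    static String validate(String[] args) {
        if (args.length != 2) {
            return "Usage: WordCount <input path> <output path>";
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            // Fail with a readable message instead of an ArrayIndexOutOfBoundsException.
            System.err.println(problem);
            System.exit(2);
        }
        System.out.println("Input:  " + args[0]);
        System.out.println("Output: " + args[1]);
    }
}
```

Also keep in mind that Hadoop refuses to write into an output directory that already exists, so a second run needs a fresh output path or the old output directory removed first.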

The Console window should show the same progress log that you previously saw in the Terminal. Your output should be in the /home/hadoop/Documents/books/output directory.
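
To get a feel for what the output contains, here is a plain Java sketch that mimics the job on a couple of in-memory lines, without Hadoop. It is only an illustration under my own assumptions: WordCountSketch and countWords are hypothetical names, the tokenizing step mirrors the map method above, and the summing step mirrors the reduce method. The real job distributes this work, but the resulting word and count pairs have the same shape (a word, a tab, and its count on each line).

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

    /** Counts whitespace-separated tokens across all lines, sorted by word. */
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            // "Map" step: split the line into tokens, as the Mapper does.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                // "Reduce" step: add 1 to the running total for this word.
                Integer soFar = counts.get(word);
                counts.put(word, soFar == null ? 1 : soFar + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(Arrays.asList("to be or not to be", "be quick"));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample lines, this prints be 3, not 1, or 1, quick 1 and to 2, one pair per line.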

Next, we will try to understand the code and maybe change it to try something else.

Please feel free to comment with corrections, criticism, requests for help, etc.

