
Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 2

Previously, we saw how to execute the built-in Word Count example on Hadoop. In this part, we will build the same application in Eclipse from the Word Count source code and run it.

First, you need to install Eclipse on your Hadoop-ready virtual machine (assuming that the JDK was already installed when you set up Hadoop). You can install it from the Ubuntu Software Center, but I recommend downloading it and extracting it to your Home directory. Any version of Eclipse should work; I ran these experiments on version 4.3 (Kepler).

After installation, launch Eclipse. The first thing to do is to make the Oracle JDK your default Java runtime:
- Go to Window > Preferences > Java > Installed JREs
- If the default JRE does not point to the Oracle JRE, edit it and set the directory to /usr/lib/jvm/java-7-oracle/
- Press OK to finish




Now we will create a Java Application Project:
- Go to New > Java Project



- Name the project Combinatorics, since we will be doing some counting problems in this project



- No need to change anything else. Press Finish
- A Java project named Combinatorics should appear in the Package Explorer window on the left

We will need some external libraries in order to build Hadoop's code. Download these libraries:

I have put these libraries in a zipped file here as well.


After you have collected all the libraries:
- Right click on the project > New > Folder
- Name the folder lib and finish



- Copy all the jar files into the newly created folder (you can do so in Nautilus as well as in Eclipse)
- Right click on lib folder and click Refresh



- The jars you have added should now appear here
- Go to Project > Properties > Java Build Path > Add JARs > Combinatorics > lib and select all the jar files



- Go to Project and check Build Automatically
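If you want to confirm that the jars really ended up on the build path, a throwaway class like the one below can help. It is only a sketch: ClasspathCheck and isOnClasspath are made-up names for this post, and the three Hadoop class names listed are simply the ones the Word Count code relies on. Create it in the src folder and run it as a Java application; any class reported as MISSING means the jar that contains it is not on the build path.

```java
// Hypothetical helper for this tutorial; not part of Hadoop or Eclipse.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    // The class is looked up but not initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Classes used by the Word Count program in this post.
        String[] required = {
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.mapreduce.Job",
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
        };
        for (String name : required) {
            System.out.println(name + " : " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```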

Next, we need to create a source file in the src folder. Right click on the src folder > New > Class. Name it WordCount and click Finish.



Replace the contents of the newly created class file with the following code:

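import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
// The wildcard above does not cover sub-packages, so these are imported explicitly
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;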

public class WordCount {
   // Mapper: receives one line of text at a time and emits (word, 1) for every token
   public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable (1);
      private Text word = new Text ();

      @Override
      public void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {
         String line = value.toString ();
         StringTokenizer tokenizer = new StringTokenizer (line);
         while (tokenizer.hasMoreTokens ()) {
            word.set (tokenizer.nextToken ());
            context.write (word, one);
         }
      }
   }

   // Reducer: receives all the values emitted for one word and writes their sum
   public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce (Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
            sum += val.get ();
         }
         context.write (key, new IntWritable (sum));
      }
   }

   // Driver: args[0] is the input path, args[1] is the output path
   public static void main (String[] args) throws Exception {
      Configuration conf = new Configuration ();
      Job job = new Job (conf, "wordcount");
      job.setOutputKeyClass (Text.class);
      job.setOutputValueClass (IntWritable.class);
      job.setMapperClass (Map.class);
      job.setReducerClass (Reduce.class);
      job.setInputFormatClass (TextInputFormat.class);
      job.setOutputFormatClass (TextOutputFormat.class);
      FileInputFormat.addInputPath (job, new Path (args[0]));
      FileOutputFormat.setOutputPath (job, new Path (args[1]));
      // Submit the job and block until it finishes, printing progress as it runs
      job.waitForCompletion (true);
   }
}

This is the simplest form of a MapReduce program. We will take an in-depth look at the code later; first, we need to run it.

- Go to Run > Run to execute the program
- The first run should end with an ArrayIndexOutOfBoundsException, because the program has not been given its input and output paths yet
- Go to Run > Run Configurations > Arguments and add the following program arguments:
/home/hadoop/Documents/books/ /home/hadoop/Documents/books/output (assuming that you followed part 1 and the text files are still in this path)



- Before you press Run, make sure all the Hadoop services are running. If they are not, start them with this command:
$ start-all.sh
- Now press Run

In the Eclipse console, you should see the same progress log that you previously saw in the terminal. Your output should be in the /home/hadoop/Documents/books/output directory.
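The ArrayIndexOutOfBoundsException on the first run comes from main reading args[0] and args[1] without checking that they were passed. Hadoop also refuses to start a job whose output directory already exists, so a second run with the same arguments will fail until you delete or rename the output folder. The sketch below shows one way both problems could be caught up front. JobArgsCheck and validate are hypothetical names, not part of Hadoop, and the check assumes that both paths are on the local file system, as they are in this post.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for this tutorial; not part of Hadoop or of the WordCount class.
class JobArgsCheck {

    // Returns null when the arguments look usable, otherwise a message describing the problem.
    static String validate(String[] args) {
        if (args == null || args.length != 2) {
            return "Usage: WordCount <input path> <output path>";
        }
        // Assumption: both paths live on the local file system.
        if (!Files.exists(Paths.get(args[0]))) {
            return "Input path not found: " + args[0];
        }
        if (Files.exists(Paths.get(args[1]))) {
            return "Output path already exists: " + args[1];
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(2);
        }
        System.out.println("Arguments look fine: " + args[0] + " -> " + args[1]);
    }
}
```

A call to something like validate(args) at the top of WordCount's main would turn the stack trace into a readable message.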



Next, we will try to understand the code and maybe change it to try something else.
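As a small head start, here is a rough sketch in plain Java, with no Hadoop involved, of what the Map and Reduce classes above do. All the names in it (WordCountSketch, map, shuffle, reduce, wordCount) are made up for illustration. The shuffle method stands in for the grouping by key that the framework performs between the two phases; on a real cluster that work is spread across machines, which this sketch does not attempt to show.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names belong to Hadoop.
class WordCountSketch {

    // "Map" step: emit a (word, 1) pair for every token in one line of text.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: collect all values emitted for the same key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps over all input lines and returns word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```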

Please feel free to comment with corrections, criticism, requests for help, etc.
