
Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 1

Okay. You just set up Hadoop on a single node on a VM and are now wondering what comes next. Of course, you’ll run something on it, and what could be better than your own piece of code? But before we get to that, let’s first run an existing program to make sure everything is set up correctly on our Hadoop cluster.

Power up your Ubuntu machine with Hadoop on it, open a Terminal (Ctrl+Alt+T) and run the following command:

Provide the password whenever asked and, once all the services have started, execute the following command to make sure they are all running:
$ jps

Note: The “jps” utility ships with the JDK (Oracle JDK as well as OpenJDK), not with a JRE-only installation. If the command is not found, you probably have only a JRE installed.

You should be able to see the following services: NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.

We'll take a minute to very briefly define these services first.

NameNode: a component of HDFS (Hadoop Distributed File System) that manages all the file system metadata: links, trees, directory structure, etc. You can track the status of NameNode at http://localhost:50070 in the browser of your machine (localhost can be some other address if you are not using a single-node deployment).

SecondaryNameNode: no, this is not a backup or replica of NameNode. Its primary responsibility is to periodically merge the edit logs created by NameNode into the file system image (a checkpoint), since those logs can otherwise grow huge.

DataNode: this one stores the actual data. A multi-node cluster will usually have several DataNodes. You make changes to the data on DataNodes via NameNode.

JobTracker: this service deals with MapReduce jobs. It accepts the jobs that clients submit to Hadoop for processing. It talks to NameNode to locate the relevant data, then picks the most appropriate nodes in the cluster to assign tasks to. Additionally, it reassigns a task when a node fails to complete it. The status of JobTracker can be viewed at http://localhost:50030

TaskTracker: a service running on a node in the cluster that performs the Map, Reduce and Shuffle operations assigned to it by JobTracker. It keeps sending heartbeat messages to JobTracker to report its status.
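
To make the JobTracker/TaskTracker relationship a bit more concrete, here is a toy model in plain Java. It is only a sketch of the idea described above (heartbeats, and reassigning tasks from a silent node to a live one). The class `ToyJobTracker`, its methods and the timeout value are all hypothetical names made up for illustration; none of this is Hadoop's actual API or scheduling logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: names and the timeout are hypothetical, not part of Hadoop.
class ToyJobTracker {
    private final long timeoutMillis;
    // tracker name -> time of its last heartbeat
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();
    // task name -> tracker currently responsible for it
    private final Map<String, String> taskOwner = new TreeMap<>();

    ToyJobTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // A TaskTracker reports that it is still alive.
    void heartbeat(String tracker, long now) {
        lastHeartbeat.put(tracker, now);
    }

    void assign(String task, String tracker) {
        taskOwner.put(task, tracker);
    }

    String ownerOf(String task) {
        return taskOwner.get(task);
    }

    // A tracker counts as alive if it sent a heartbeat within the timeout.
    boolean isAlive(String tracker, long now) {
        Long last = lastHeartbeat.get(tracker);
        return last != null && now - last <= timeoutMillis;
    }

    // Moves every task owned by a silent tracker to the first live tracker
    // and returns the names of the tasks that were moved.
    List<String> reassignFailed(long now) {
        String live = null;
        for (String tracker : lastHeartbeat.keySet()) {
            if (isAlive(tracker, now)) {
                live = tracker;
                break;
            }
        }
        List<String> moved = new ArrayList<>();
        if (live == null) {
            return moved; // nobody left to take over
        }
        for (Map.Entry<String, String> entry : taskOwner.entrySet()) {
            if (!isAlive(entry.getValue(), now)) {
                entry.setValue(live);
                moved.add(entry.getKey());
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        ToyJobTracker jobTracker = new ToyJobTracker(1000);
        jobTracker.heartbeat("tt1", 0);
        jobTracker.heartbeat("tt2", 0);
        jobTracker.assign("map-0", "tt1");
        jobTracker.heartbeat("tt2", 1500); // tt1 stays silent
        System.out.println("Reassigned: " + jobTracker.reassignFailed(2000));
        System.out.println("map-0 now runs on " + jobTracker.ownerOf("map-0"));
    }
}
```

The real JobTracker also considers where the data lives when it picks a node; this sketch skips that and simply takes the first live tracker.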

Enough theory, let’s get back to real stuff.

We will begin with the Word Count example provided with Hadoop and see how it goes. This utility does nothing fancy; it counts the number of occurrences of each word in a bunch of text files. Here are the steps to do so:
  1. Fetch some plain text files (novels recommended), create a directory “books” in your Documents and copy these files into it. I’ll be using some Sherlock Holmes novels I downloaded, but you can use any text files.

  2. Copy these files into HDFS using the dfs utility:
    $ hadoop dfs -copyFromLocal $HOME/Documents/books /HDFS/books

  3. Confirm that these files have been copied using the ls command:
    $ hadoop dfs -ls /HDFS/books
  4. Finally, execute the example jar file given in Hadoop examples:
    $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /HDFS/books /HDFS/books/output

  5. The MapReduce job “wordcount” in the hadoop-examples jar will execute, pick up the text files from /HDFS/books, count the occurrences of each unique word and write the output to /HDFS/books/output. You can also trace the job’s status in your web browser on the JobTracker page (http://localhost:50030).
  6. In order to collect the output file, run the following command:
    $ hadoop dfs -getmerge /HDFS/books/output $HOME/Documents/books/
The output file should now be in your Documents/books directory in readable form.
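
If you are curious what the job computed, the idea can be sketched in a few lines of plain Java. This is not the source of the bundled example and it does not use the Hadoop API; the class `WordCountSketch` and its methods are hypothetical names for illustration. It assumes words are split on whitespace only, so punctuation and letter case are left untouched.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the word-count idea; it does NOT use Hadoop.
class WordCountSketch {

    // "Map" step: split one line on whitespace and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" and "Reduce" steps: group the pairs by word and sum their counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the game is afoot", "the game is on");
        // Print one "word<TAB>count" line per distinct word.
        for (Map.Entry<String, Integer> entry : count(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

On a real cluster the map step runs in parallel on the nodes holding the data and the framework does the grouping between map and reduce; here everything happens in one process just to show the data flow.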


  1. this works !!! thanks

  2. hadoop dfs -copyFromLocal $HOME/Documents/books /HDFS/books: here you have used /HDFS/books. How is that possible without first making the directory in HDFS? I am getting an error: copyFromLocal: `/HDFS/gutenberg': No such file or directory

    1. Then create the directory on hdfs: hadoop dfs -mkdir /HDFS/dir_name and try copyFromLocal again :)


