Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 1

Okay. You just set up Hadoop on a single node on a VM and are now wondering what comes next. Of course, you’ll run something on it, and what could be better than your own piece of code? But before we get to that, let’s first run an existing program to make sure everything is set up correctly on our Hadoop cluster.

Power up your Ubuntu machine with Hadoop installed, open a Terminal (Ctrl+Alt+T) and run the following command:
$ start-all.sh

Provide the password whenever asked. Once all the daemons have started, execute the following command to make sure they are running:
$ jps

Note: The “jps” utility ships with the JDK, not the JRE. Oracle JDK includes it; with OpenJDK on Ubuntu you need the full JDK package (for example openjdk-7-jdk), not just the JRE.

You should be able to see the following services:
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
Jps
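
The same check can be scripted instead of reading the list by eye. This is a minimal sketch: missing_daemons is a hypothetical helper name (not part of Hadoop), and it assumes the five Hadoop 1.x daemon names listed above. It reads jps-style output ("PID Name" per line) on standard input and prints every expected daemon that is not running.

```shell
# missing_daemons is a hypothetical helper, not a Hadoop command.
# Reads "PID Name" lines (the format jps prints) on stdin and prints
# each expected Hadoop 1.x daemon that does not appear in the input.
missing_daemons() {
    local expected="NameNode SecondaryNameNode DataNode JobTracker TaskTracker"
    local running name
    # Keep only the second column (the process name).
    running=$(awk '{print $2}')
    for name in $expected; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qxF "$name"; then
            printf '%s\n' "$name"
        fi
    done
}
```

Run it as: jps | missing_daemons. No output means all five daemons were found.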



We'll take a minute to very briefly define these services first.

NameNode: a component of HDFS (Hadoop Distributed File System) that manages all the file system metadata: the directory tree, file names, block locations, etc. You can track the status of NameNode at http://localhost:50070 in your browser (replace localhost with the machine’s address if the NameNode runs on a different host).

SecondaryNameNode: no, this is not a backup or replica of NameNode. Its primary responsibility is checkpointing: it periodically merges the edit log written by NameNode into the file system image, since the edit log can otherwise become huge.

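DataNode: this one stores the actual data blocks. In a multi-node cluster, you will have many DataNodes. Clients do not pick a DataNode themselves; they ask NameNode which DataNodes hold (or should receive) a file’s blocks, and then read or write the data there.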

JobTracker: this service relates to MapReduce jobs. It accepts the jobs that clients submit to Hadoop for processing. It talks to NameNode to locate the relevant data, then picks the most appropriate nodes in the cluster to assign tasks to. Additionally, it reassigns a task when a node fails to complete it. The status of JobTracker can be viewed at http://localhost:50030

TaskTracker: a service running on each worker node that executes the Map, Reduce and Shuffle operations assigned to it by JobTracker. It keeps sending heartbeat messages to JobTracker to report its status.
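
If you prefer the terminal to the browser, the two status pages mentioned above can be probed with curl. This is a minimal sketch, assuming curl is installed and the default Hadoop 1.x ports are in use; check_ui is a hypothetical helper name, not something Hadoop provides.

```shell
# check_ui is a hypothetical helper, not a Hadoop command.
# Assumes the default Hadoop 1.x web UI ports and that curl is installed.
check_ui() {
    local host="${1:-localhost}"
    local entry name port
    for entry in NameNode:50070 JobTracker:50030; do
        name="${entry%%:*}"
        port="${entry##*:}"
        # -s: quiet, -f: treat HTTP errors as failure, --max-time: do not hang
        if curl -sf -o /dev/null --max-time 3 "http://${host}:${port}/"; then
            echo "${name} web UI reachable on port ${port}"
        else
            echo "${name} web UI NOT reachable on port ${port}"
        fi
    done
}
```

Call check_ui with no argument to probe localhost, or pass a host name if the daemons run elsewhere. It only tells you whether the page answers; the page itself has the details.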

Enough theory, let’s get back to real stuff.

We will begin with the Word Count example that ships with Hadoop and see how it goes. This utility does nothing fancy; it counts the number of occurrences of each word in a bunch of text files. Here are the steps:
  1. Fetch some plain text files (novels recommended), create a directory “books” in your Documents and copy the files into it. I’ll be using some Sherlock Holmes novels I downloaded from http://www.readsherlock.com, but you can use any text files.


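  2. Copy these files into HDFS using the dfs utility:
    $ hadoop dfs -copyFromLocal $HOME/Documents/books /HDFS/books
    If this fails with “No such file or directory”, create the parent directory first (hadoop dfs -mkdir /HDFS) and try again.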

  3. Confirm that the files have been copied using the ls command:
    $ hadoop dfs -ls /HDFS/books
  4. Execute the example jar file that ships with Hadoop (adjust the version in the jar name to match your installation):
    $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /HDFS/books /HDFS/books/output


  5. The MapReduce job “wordcount” in the hadoop-examples jar will execute, pick up the text files from /HDFS/books, count the occurrences of each unique word and write the output to /HDFS/books/output. The output directory must not exist before the job starts, or the job will refuse to run. You can follow the job’s status in your web browser:
    http://localhost:50070 (NameNode)
    http://localhost:50060 (TaskTracker)
    http://localhost:50030 (JobTracker)
  6. To collect the output, run the following command, which merges the output files and copies the result to the local file system:
    $ hadoop dfs -getmerge /HDFS/books/output $HOME/Documents/books/
The merged output file should now be in your Documents/books directory in readable form.
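
To sanity-check the result, or just to see what the job does, the same count can be reproduced locally with standard Unix tools. This is a rough sketch, not how Hadoop implements it: local_wordcount is a hypothetical helper name, and it assumes the example splits words on whitespace only and treats upper and lower case as different words, which is how the stock wordcount example behaves as far as I know. Each stage of the pipeline loosely mirrors a MapReduce phase.

```shell
# local_wordcount is a hypothetical helper, not part of Hadoop.
# Counts whitespace-separated, case-sensitive words in the given files
# (or stdin) and prints "word<TAB>count" lines sorted by word.
local_wordcount() {
    cat "$@" |
        tr -s '[:space:]' '\n' |   # "map": emit one word per line
        awk 'NF' |                 # drop empty lines left by leading whitespace
        LC_ALL=C sort |            # "shuffle/sort": group identical words
        uniq -c |                  # "reduce": count each group
        awk '{ printf "%s\t%s\n", $2, $1 }'   # reorder to word<TAB>count
}
```

For example, running local_wordcount on the text files in Documents/books (leave out the merged output file itself) should give counts you can compare against the Hadoop result; punctuation stays attached to words in both.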


Comments

  1. this works !!! thanks

  2. hadoop dfs -copyFromLocal $HOME/Documents/books /HDFS/books: here you have used /HDFS/books. How is that possible without first making the directory in HDFS? I am getting an error: copyFromLocal: `/HDFS/gutenberg': No such file or directory

    Replies
    1. Then create the directory on hdfs: hadoop dfs -mkdir /HDFS/dir_name and try copyFromLocal again :)

      ~SSL
