Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 3

In the previous experiment, we ran the source code of the word count MapReduce application from Eclipse. This time, we are going to write our own piece of code.

Remember the permutations and combinations you studied in college? We will write a fresh application that computes the combinations of every string in a file. Only a few changes to the existing code are needed.

First, you need to create a text file with some words separated by spaces:
- Create a new text file named words.txt in /home/hadoop/Documents/combinations/
- Enter some text like:
Astronomy: star sun earth moon milkyway asteroid pulsar nebula mars venus jupiter neptune saturn blackhole galaxy cygnus cosmic comet solar eclipse globular panorama apollo discovery seti aurora dwarf halebopp plasmasphere supernova cluster europa juno kepler helios indigo ganymede neutrinos callisto messier nashville sagittarius corona circinus hydra whirlpool rosette tucanae
Android: cupcake donut eclair froyo gingerbread honeycomb icecreamsandwich jellybean kitkat lemonade
Ubuntu: warty warthog hoary hedgehog breezy badger dapper drake edgy eft feisty fawn gutsy gibbon hardy heron intrepid ibex jaunty jackalope karmic koala lucid lynx maverick meerkat natty narwhal oneiric ocelot raring ringtail

- Save and exit
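If you would rather create the file from code, here is a minimal Java sketch; the path and sample words are just illustrations, so adjust them to your setup:

```java
import java.io.FileWriter;
import java.io.IOException;

public class CreateWordsFile {
    public static void main(String[] args) throws IOException {
        // Path assumed from the steps above; change it to suit your setup
        String path = "/home/hadoop/Documents/combinations/words.txt";
        // A few space-separated sample words; any words will do
        String words = "star sun moon comet nebula pulsar";
        try (FileWriter writer = new FileWriter(path)) {
            writer.write(words);
        }
    }
}
```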

Now open Eclipse. In your existing Combinatorics project, add a new class:
- Right-click on src > New > Class
- Name it Combinations
- Replace the generated code with the following:

import java.io.IOException;
import java.util.Arrays;
import java.util.Date;
import java.util.SortedSet;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * MapReduce application to discover combinations of all words in a text file.
 *
 * @author hadoop
 */
public class Combinations {

    // Extend Mapper; our input and output will all be text
    public static class MyMap extends Mapper<LongWritable, Text, Text, Text> {
        // Reusable Text object for the current term
        private Text term = new Text();

        // Mapping function: map each word to a key-value pair
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the input line to a string
            String str = value.toString();
            // Tokenizer to break the line into words separated by spaces
            StringTokenizer tokenizer = new StringTokenizer(str, " ");
            // Write a key-value pair for each term to the context (job)
            while (tokenizer.hasMoreTokens()) {
                term.set(tokenizer.nextToken());
                // Skip terms longer than 20 characters: the number of
                // combinations (2^n) grows too quickly beyond that
                if (term.getLength() > 20)
                    continue;
                // Initially, pass the term as both key and value
                context.write(term, term);
            }
        }
    }

    // Extend Reducer; this runs after the map phase and writes the
    // output to a text file
    public static class MyReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sorted collection to store the set of combinations
            SortedSet<String> list = new TreeSet<String>();
            // Iterate over the values for this key
            for (Text text : values) {
                // Find all combinations of the string
                String str = text.toString();
                int length = str.length();
                // A string of length n has 2^n - 1 non-empty combinations.
                // The loop enumerates bitmasks 0 .. 2^n - 2; the full string
                // itself (mask 2^n - 1) is added separately afterwards
                int total = ((Double) Math.pow(2, length)).intValue() - 1;
                for (int i = 0; i < total; i++) {
                    String tmp = "";
                    // Reverse the binary form of i so that bit j selects charAt(j)
                    char[] charArray = new StringBuilder(
                            Integer.toBinaryString(i)).reverse().toString()
                            .toCharArray();
                    for (int j = 0; j < charArray.length; j++) {
                        if (charArray[j] == '1') {
                            tmp += str.charAt(j);
                        }
                    }
                    list.add(tmp);
                }
                list.add(str);
            }
            // Write the term as key and its combinations as value
            context.write(key, new Text(Arrays.toString(list.toArray())));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Initiate a MapReduce job named "combinations"
        Job job = new Job(conf, "combinations");
        // Keys in the output will be text
        job.setOutputKeyClass(Text.class);
        // Values in the output will be text
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Fetch the input path from the arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Create a timestamped directory inside the input directory for the output
        FileOutputFormat.setOutputPath(job,
                new Path(args[0] + String.valueOf(new Date().getTime())));
        job.waitForCompletion(true);
    }
}

Execute the code:
- Right-click the project > Run As > Run Configurations
- Right-click Java Application (in the left pane) > New
- At the top, name it Combinations
- In the Arguments tab, enter /home/hadoop/Documents/combinations/ as the program argument
- Apply and Run

The MapReduce application should run, and you should find the output in a timestamped folder inside the /home/hadoop/Documents/combinations/ directory (look for the part-r-00000 file).

This algorithm is explained in detail here. Read the comments as well; the rest of the code is self-explanatory.
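To see the bitmask trick in isolation, here is a minimal plain-Java sketch of the same enumeration the reducer performs, stripped of the Hadoop scaffolding (the class and method names are mine, and unlike the listing above it starts the loop at 1, so the empty selection is skipped):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class CombinationsDemo {
    // Return every combination (subsequence) of str, sorted;
    // bitmask i selects charAt(j) wherever bit j of i is set
    static SortedSet<String> combinations(String str) {
        SortedSet<String> result = new TreeSet<String>();
        int n = str.length();
        // Masks 1 .. 2^n - 1 cover all non-empty selections
        for (int i = 1; i < (1 << n); i++) {
            StringBuilder tmp = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if ((i & (1 << j)) != 0) {
                    tmp.append(str.charAt(j));
                }
            }
            result.add(tmp.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // For "sun": 2^3 - 1 = 7 non-empty combinations
        System.out.println(combinations("sun"));
        // prints [n, s, sn, su, sun, u, un]
    }
}
```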

With this experiment, we wrap up our series on single-node Hadoop clusters. I encourage you to run a few more experiments and make yourself comfortable before you try something Big.
