Tuesday, May 13, 2014

A step-by-step guide to query data on Hadoop using Hive

Hadoop lets us solve problems that demand intense processing and storage by harnessing the power of distributed computing on commodity hardware, while ensuring reliability. Beyond experimental use, the industry has welcomed Hadoop warmly because it can query databases in realistic time regardless of the volume of data. In this post, we will run some experiments to see how this can be done.

Before you start, make sure you have set up a Hadoop cluster. We will use Hive, a data warehouse for querying large data sets, together with an adequately sized sample data set: an imaginary travel agency database on MySQL. The database holds details about the agency's clients, including flight bookings, booking details and hotel reservations. Their data model is as below:



The number of records in the database tables is as follows:
- booking: 2.1M
- booking_detail: 2.1M
- booking_hotel: 1.48M
- city: 2.2K

We will write a query that retrieves the total number of bookings with a hotel reservation, paid in USD or GBP, broken down by booking type for each city.


select c.city_title, d.booking_type, count(*) total from booking b 
inner join booking_detail d on d.booking_id = b.booking_id 
inner join booking_hotel h on h.booking_id = b.booking_id 
inner join city c on c.city_code = d.city_code where b.currency in ('GBP','USD') 
group by c.city_code, d.booking_type

This query takes 150 seconds to execute on MySQL. We will try to reduce the query execution time by importing the data set into Hive and executing the same query on our Hadoop cluster. Here are the steps:


Importing Data

The first step is to export the database tables to CSV files. Open MySQL Workbench on your master node (assuming you have MySQL Server and Workbench installed and have loaded the data set into it) and execute the following queries:

select * from city into outfile '/tmp/city.csv' fields terminated by ',' lines terminated by '\n';

select * from booking into outfile '/tmp/booking.csv' fields terminated by ',' lines terminated by '\n';

select * from booking_detail into outfile '/tmp/booking_detail.csv' fields terminated by ',' lines terminated by '\n';

select * from booking_hotel into outfile '/tmp/booking_hotel.csv' fields terminated by ',' lines terminated by '\n';


This will export the tables as CSV files into the /tmp directory.
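Before moving on, it doesn't hurt to sanity-check the exports from a terminal; a quick look at the row counts and the first few lines (assuming the paths above):

$ wc -l /tmp/booking.csv /tmp/booking_detail.csv /tmp/booking_hotel.csv /tmp/city.csv
$ head -n 3 /tmp/city.csv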




Installing Hive

We now install Hive on our Hadoop-ready machine. Download hive-0.11.0 from Apache Hive's downloads page and extract it into the /home/hadoop/Downloads directory.

Start Hadoop services and disable safemode:

$ start-all.sh
$ hadoop dfsadmin -safemode leave
$ jps



Copy the Hive binaries and libraries into the /usr/local directory (or wherever you wish to install):

$ sudo cp -R /home/hadoop/Downloads/hive-0.11.0/ /usr/local/

Create a temporary directory in HDFS and a warehouse directory for Hive:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /hadoop/hive/warehouse

Assign rights:

$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /hadoop/hive/warehouse

Export environment variable for Hive:

$ export HIVE=/usr/local/hive-0.11.0
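If you would rather not re-export this variable in every new terminal, you can optionally persist it in the hadoop user's profile (assuming the install path above):

$ echo 'export HIVE=/usr/local/hive-0.11.0' >> ~/.bashrc
$ source ~/.bashrc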

Run hive:


$ $HIVE/bin/hive

You should now be at the Hive prompt.


Preparing Hive Database

We will now create tables in the Hive database. Note that some Hive data types differ from traditional DBMS data types. In addition, we specify that the fields are terminated by a comma, as well as the format in which the data will be stored; this is because Hive uses separate data structures to process different formats of data. Now create tables matching the schema by executing the following queries:

hive>create table booking (booking_id int, currency string, departure_date string, requested_date string) row format delimited fields terminated by ',' stored as textfile;

hive>create table booking_detail (booking_id int, details_id int, booking_type string, item_code string, item_name string, passenger_count int, city_code string, duration int, price float, breakfast string, description string) row format delimited fields terminated by ',' stored as textfile;

hive>create table booking_hotel (booking_id int, details_id int, room_type string, room_cat string) row format delimited fields terminated by ',' stored as textfile;

hive>create table city (city_code string, city_title string, country string) row format delimited fields terminated by ',' stored as textfile;


The next step is to load the data from the CSV files into the Hive tables. Execute the following commands to do so:

hive>load data local inpath '/tmp/city.csv' overwrite into table city;

hive>load data local inpath '/tmp/booking.csv' overwrite into table booking;

hive>load data local inpath '/tmp/booking_detail.csv' overwrite into table booking_detail;

hive>load data local inpath '/tmp/booking_hotel.csv' overwrite into table booking_hotel;

Make sure that all tables are in place.

hive>show tables;

Now execute a query to count the number of records in booking table:

hive>select count(*) from booking;

This will execute a MapReduce job for the given query; the query took 24.5 seconds, which may surprise you. This is because, on the first run, Hive initializes a number of things and creates some files in the temp directory. If we execute the same query again, it takes 14.3 seconds. Still, that is a long time for a simple count query, and it needs some explanation.

First of all, MySQL and other RDBMSs are optimized to run queries fast using techniques such as indexing. Hive, on the other hand, runs queries on the Hadoop framework, a distributed computing environment, so most of the time is spent distributing the data, interpreting the query and creating the MapReduce tasks.


Second, a simple count query is not the type of query we should run on a distributed environment, as it can easily be handled by the RDBMS itself. The real performance gain shows when we run a complex query that processes a large volume of data.

Now run the same query, which MySQL executed in 150 seconds:

hive>select c.city_title, d.booking_type, count(*) total from booking b inner join booking_detail d on d.booking_id = b.booking_id inner join booking_hotel h on h.booking_id = b.booking_id inner join city c on c.city_code = d.city_code where b.currency in ('GBP','USD') group by c.city_code, d.booking_type;



The query takes 132.84 seconds on a single node in Hive. This is because the MapReduce job breaks the original query into pieces, leveraging all CPU cores to process the data in parallel.
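If you are curious how Hive decomposes it, you can prefix the same query with explain to see the plan of MapReduce stages it intends to run (the output is verbose, but the stage graph at the top shows the individual jobs):

hive>explain select c.city_title, d.booking_type, count(*) total from booking b inner join booking_detail d on d.booking_id = b.booking_id inner join booking_hotel h on h.booking_id = b.booking_id inner join city c on c.city_code = d.city_code where b.currency in ('GBP','USD') group by c.city_code, d.booking_type;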

We can try the same query with more mappers and reducers. To do so, first quit Hive:

hive>quit;

Shut down Hadoop's services

$ stop-all.sh

Edit the mapred-site.xml file:

$ sudo nano /usr/local/hadoop/conf/mapred-site.xml

Add two properties to the file:

<property>
<name>mapred.map.tasks</name>
<value>20</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>5</value>
</property>


Save the file and exit; start all Hadoop services and Hive again; run the same query.



This time, the query was processed in 121.3 seconds. As the number of mappers increases, the data is divided into more blocks, requiring less processing per block. But be cautious: too many mappers can be overkill, resulting in more disk reads and writes while the CPU sits idle. It takes some experimentation to determine the right number of mappers and reducers for a job.
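As a side note, you do not necessarily have to edit mapred-site.xml for a one-off experiment; Hive lets you override these values for the current session with the set command (keeping in mind that Hadoop treats the number of map tasks largely as a hint):

hive>set mapred.map.tasks=20;
hive>set mapred.reduce.tasks=5;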

The experiment showed how Hive can outperform MySQL on the same machine for slow queries. Repeating the experiment on a multi-node cluster will reduce the time further.

So.. what are you waiting for? Go for it...

PS: You may find the screenshots of this post here.

PPS: I would like to thank Nabeel Imam (imam.nabeel@gmail.com) for his helping hand in setting up and carrying out the experiments...

Monday, May 12, 2014

Finally, a way to speed up Android emulator

The Android emulator's speed is killer. Literally. Android developers know this; irrespective of the platform you are using (Windows, Linux or Mac) and your hardware specification, the emulator provided with the Android SDK crawls like a snail.

Thankfully, the developers at Intel® have come up with a hardware-assisted virtualization engine that uses hardware virtualization to boost the performance of the Android emulator. Here is how you can configure it:

1. First, you need to enable Hardware Virtualization (VTx) technology from the BIOS settings on your computer. Since different vendors have different settings, you may have to search on the web on how to do so.

2. Next, you need to download and install the Intel Hardware Accelerated Execution Manager (HAXM) from Intel. During installation, you will be asked how much memory to reserve for HAXM; keep the default (1024 MB).

3. Now go to your Android SDK directory and launch "SDK Manager.exe".

4. Under Extras, check "Intel x86 Emulator Accelerator (HAXM installer)" and also select "Intel x86 Atom System Image" for your SDK version.



5. Launch "AVD Manager.exe" from your Android SDK directory and create a new Virtual Device (AVD). Select Intel Atom (x86) for the CPU. You may just copy the settings from the image below.



That's all. Launch your AVD and feel the speed. :-)

Note: Unfortunately, this may not work if your computer does not support hardware virtualization, or if you are on an AMD platform. Here is a list from Intel of the CPUs that support hardware virtualization (I hope yours is listed).
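If you happen to be on Linux, a quick, unofficial check for VT-x support is to look for the vmx CPU flag; a non-zero count suggests the CPU supports it (it still has to be enabled in the BIOS):

$ grep -c vmx /proc/cpuinfo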

Feel free to comment for suggestions and corrections...

Sunday, April 27, 2014

Hadoop Bullet: a simple script to deploy Hadoop on a fresh machine in an automated fashion

Installing Hadoop is a hassle; it involves a variety of steps, some proficiency with Linux commands and edits to various files. If you have tried a manual installation, you know what I'm talking about.

So, here is a simple Linux shell script. Save the following script as bulletinstall.sh and on your Ubuntu-ready machine, run it using:

$ sudo sh bulletinstall.sh

This script has been tested on Ubuntu 14.04 LTS; if you experience any issues, feel free to drop a comment. Here is the script:


#!/bin/bash

# This document is free to share and/or modify, and comes with ABSOLUTELY NO WARRANTIES. I will not be responsible for any damage or corruption caused to your Computer. Do know your stuff before you run this and backup your important files before trying out.

# Author: owaishussain@outlook.com

# LINUX SCRIPT TO INSTALL HADOOP 1.2.1 ON A MACHINE

# If you already have this file, then put it in /tmp directory and comment out "wget"
HADOOP_URL=http://www.us.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

echo "************************************************************"
cd /tmp
echo "Downloading Apache Hadoop from $HADOOP_URL (you may change the version to any other, but this one has been tested)"
wget "$HADOOP_URL"
echo "STEP 1/10 COMPLETE..."
echo "************************************************************"

# Hadoop will be deployed on "/usr/local" directory, if you want to change it, then modify the path under this section
echo "************************************************************"
echo "Extracting files..."
tar -xzf hadoop-1.2.1.tar.gz
echo "Copying to \"/usr/local/\""
cp -R hadoop-1.2.1 /usr/local/
echo "STEP 2/10 COMPLETE..."
echo "************************************************************"

# Oracle JDK7 will be downloaded from a 3rd-party repository (ppa:webupd8team/java; credit to them). In case it is unavailable, you may use an alternative in the first command under this section
echo "************************************************************"
echo "Installing Oracle JDK7. Please accept the Oracle license when asked"
add-apt-repository ppa:webupd8team/java
apt-get update
# If you are skeptical about JDK7 (and I won't blame you for that), you may switch the comments in the two lines below to install JDK6 instead
apt-get install oracle-java7-installer
#apt-get install sun-java6-jdk
echo "STEP 3/10 COMPLETE..."
echo "************************************************************"

# This script assumes that you will run Hadoop as a new user "hadoop". If you wish to choose a different user, you're most welcome to; just comment out the whole section below, but be sure to replace the username everywhere in the script with the one you want
echo "************************************************************"
echo "Creating user and group named\" hadoop\""
adduser hadoop
adduser hadoop sudo
echo "Creating home directory for user"
mkdir -p /home/hadoop/tmp
echo "Assigning rights on home directory"
chown -R hadoop:hadoop /home/hadoop/tmp
chmod 755 /home/hadoop/tmp
echo "Changing ownership of Hadoop's installation directory"
chown -R hadoop:hadoop /usr/local/hadoop-1.2.1
echo "STEP 4/10 COMPLETE..."
echo "************************************************************"

# This is to test and initialize the newly created user
echo "************************************************************"
echo "Logging in. ***** PLEASE PROVIDE PASSWORD AND RUN 'exit' (without quotes) TO LOG OUT *****"
su - hadoop
echo "STEP 5/10 COMPLETE..."
echo "************************************************************"

echo "************************************************************"
echo "Setting environment variables"
cd /usr/local/hadoop-1.2.1/conf
echo "export JAVA_HOME=/usr/lib/jvm/java-7-oracle" >> hadoop-env.sh
echo "STEP 6/10 COMPLETE..."
echo "************************************************************"

echo "************************************************************"
echo "Configuring Hadoop properties in *site.xml"
mv core-site.xml core-site.xml.bck
touch core-site.xml
echo "<?xml version=\"1.0\"?>
<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
</configuration>" > core-site.xml
mv mapred-site.xml mapred-site.xml.bck
touch mapred-site.xml
echo "<?xml version=\"1.0\"?>
<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>
<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
</configuration>" > mapred-site.xml
mv hdfs-site.xml hdfs-site.xml.bck
touch hdfs-site.xml
echo "<?xml version=\"1.0\"?>
<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>
<configuration>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>" > hdfs-site.xml
chown hadoop:hadoop /usr/local/hadoop-1.2.1/conf/*-site.xml
echo "STEP 7/10 COMPLETE..."
echo "************************************************************"

echo "************************************************************"
echo "export JAVA_HOME=/usr/lib/jvm/java-7-oracle" >> /home/hadoop/.bashrc
echo "export HADOOP_HOME=/usr/local/hadoop-1.2.1" >> /home/hadoop/.bashrc
echo "export PATH=$PATH:/usr/local/hadoop-1.2.1/bin" >> /home/hadoop/.bashrc
echo "STEP 8/10 COMPLETE..."
echo "************************************************************"

echo "************************************************************"
echo "Installing OpenSSH Server"
apt-get install openssh-server
# Generate the key as the hadoop user so that it ends up in /home/hadoop/.ssh
su - hadoop -c 'mkdir -p /home/hadoop/.ssh && ssh-keygen -t rsa -P "" -f /home/hadoop/.ssh/id_rsa'
su - hadoop -c 'cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys && chmod 600 /home/hadoop/.ssh/authorized_keys'
echo "STEP 9/10 COMPLETE..."
echo "************************************************************"

echo "************************************************************"
echo "Removing temporary files"
# If you want to preserve the downloaded Hadoop application, you may comment out the first command under this section
rm /tmp/hadoop-1.2.1.tar.gz
rm -R /tmp/hadoop-1.2.1
echo "STEP 10/10 COMPLETE..."
echo "************************************************************"
echo "Congratulations! Your Hadoop setup is complete. Please log into hadoop user and start hadoop services."

echo "************************************************************"

Wednesday, March 19, 2014

Step-by-step guide to set up Multi-node Hadoop Cluster on Ubuntu Virtual Machines

If you've landed on this page, I know your feelings. Wanna know how it feels when it's done? Ride on, it's like Roller Coaster...

You have successfully configured a single-node cluster in 7 easy steps. Good! But you are yet to taste the real essence of Hadoop. Recall that the primary purpose of Hadoop is to distribute a very lengthy task to more than one machine. This is exactly what we are going to do; the only difference is that we will be doing so on virtual machines.

Step 1: Networking

We have several things to do first with the existing VM, beginning with disabling IPv6. This is a recommendation because Hadoop currently does not support IPv6 according to their official Wiki. In order to do so, you will have to modify a file named /etc/sysctl.conf:
- Launch your Virtual Machine from Virtualbox
- Open your terminal and run:
$ sudo nano /etc/sysctl.conf

- Add the following lines at the end of the file:
# Disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1



- Now you'll have to reboot the machine. Sorry :(
- Run the following command to check if IPv6 is really disabled:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

You should get 1 as output, meaning that IPv6 is disabled.
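As a side note, future changes to /etc/sysctl.conf can be applied without a reboot by reloading the file, if you prefer:

$ sudo sysctl -p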



Next, we change the Network Settings of Virtual Machine.
- Go to Virtual Machine's Menu > Devices > Network Settings..
- Change "Attached to" from NAT to Bridged Adapter and choose your network adapter below. Mine is as shown in the image.



Now we change our Ubuntu's IP addressing from Dynamic to Static. This is because we will be using fixed IPs:
- On your Ubuntu's desktop, open Networks window from Top-Right menu by Clicking Edit Connections
- From the same menu, also open Connection Information. This is handy to enter correct settings
- Edit the only connection (if you have multiple, you may want to remove the ones not in use)
- Go to IPv4 Settings and change the Method to Manual
- Add a new Address by looking at the Connection Information. You should change the last number to something like 10, 50, or 100
- Gateway and Netmask should remain similar to that of Connection Information
- Save the Settings and Close; the Network should refresh itself

We are not done yet. We now need to define what our Master and Slave machines will be called (defining aliases) in the Network:
- On terminal run:
$ sudo nano /etc/hosts

- Replace the first line, 127.0.0.1 localhost, with your machine's IP address and the alias master. Your first line will look like:
192.168.0.50 master
- Add another line for the slave:
192.168.0.51 slave
- Also, comment out all the lines below, which are related to IPv6



Recall that in our single-node setup, we configured some XML files and defined the machine's name as localhost. Well, we'll have to change localhost to master now:

- Edit the core-site.xml
$ sudo nano /usr/local/hadoop/conf/core-site.xml

- Change localhost to master



- Save and exit
- Now edit the mapred-site.xml in similar way
$ sudo nano /usr/local/hadoop/conf/mapred-site.xml

- Change localhost to master
- Save and exit

Finally, shut down the machine. We have to clone it.

Step 2: Cloning Virtual Machine

Thanks to VirtualBox's built-in features, we won't have to create another machine from scratch:
- Right click on Hadoop VM > Clone
- Name the new machine Hadoop Clone
- Next, choose Full clone
- Next, choose Current machine state and Clone



You may want to check this if you encounter problems.

After completion, launch both virtual machines. But please watch your memory resources: if you have a total of 4 GB, reduce each VM's memory to 1 GB in its settings.

Log into both machines and, following the steps above, change the IP address of the newly created clone from 192.168.0.50 to 192.168.0.51 (or whatever you need to set). Wait until the network refreshes.

This clone is your original machine's slave. Now we need to define the hostname for both machines. Edit the file /etc/hostname. Run:

$ sudo nano /etc/hostname
- Change localhost to master on your master machine and to slave on your slave machine


- Save and Exit
- Restart both machines to apply changes

Step 3: Connectivity

We need to ensure that SSH access is available from master to slave. On the master, we already have SSH configured, but we will need to regenerate the SSH key on the slave machine because it is a clone of the master, and each machine should have a unique key.

- In both the master and slave machines, clear existing files from the /home/hadoop/.ssh directory
$ sudo rm /home/hadoop/.ssh/*

- Now generate a new public key for SSH. For details, check step 3 here
$ ssh-keygen -t rsa -P ""


- Copy the key to authorized_keys list
$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys

- Try SSH to master from the master itself, and to slave from the slave:
On master
$ ssh master
On slave
$ ssh slave

If you are asked for a password because of an "Agent admitted failure to sign using the key" error on either machine, run the following command to fix it:
$ ssh-add

We will have to copy the master's public key to slave machine as well. On master, run:
$ ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@slave

You will be asked for the slave's password to continue (this is the same as the master's if you haven't changed it explicitly).



We should now be able to connect to slave from master. Why not try it out? Run:
$ ssh slave

You should see a successful connection if nothing went wrong.



- Exit the SSH connection to slave:
$ exit


Step 4: Configuring Hadoop Resources

We will run all services on the master node and only the TaskTracker and DataNode on the slave machines. On large-scale deployments you might run a Secondary NameNode elsewhere, and you might not run a TaskTracker and DataNode on the master at all (when the master is dedicated to coordinating tasks rather than executing them). We will now define which machine will be Hadoop's actual master and which ones will be slaves (workers).

- In your master machine, edit the /usr/local/hadoop/conf/masters file:
$ sudo nano /usr/local/hadoop/conf/masters

- Write master



- Save and exit
- In same machine, edit the /usr/local/hadoop/conf/slaves file:
$ sudo nano /usr/local/hadoop/conf/slaves

- Write master in first line and slave in second



- Save and exit

Since we have 2 machines now, we can enable replication in hdfs-site.xml
$ sudo nano /usr/local/hadoop/conf/hdfs-site.xml

- Change the value of dfs.replication to 2 (the snippet after this list shows the property)



- Save and exit
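In case the screenshot is missing, the property being changed in hdfs-site.xml looks like this (your file may contain other properties as well):

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>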


Step 5: Invoking Demons

Format the HDFS file system first:
$ hadoop namenode -format



- Start Namenode, Secondary Namenode and JobTracker on master and Datanode and TaskTracker on both master and slave
$ start-all.sh



- Run Jps utility on master to check running services:
$ jps



- Similarly, run the Jps utility on slave to check running services:
$ jps



Important! If you see the Datanode service not running on any of the machines, retry this whole step after rebooting the machine that is missing it. If the problem still persists, you'll have to look deeper. Open the Datanode log of that machine:
$ sudo nano /usr/local/hadoop/logs/hadoop-hadoop-datanode-slave.log (or master.log)

If you find a java.io.IOException saying Incompatible namespaceIDs, you'll have to manually correct the datanode's namespace ID so that it matches the namenode's. The location of the dfs directory is defined as a property in hdfs-site.xml, but in our case the datanode's ID lives in /home/hadoop/tmp/dfs/data/current/VERSION; replace the namespaceID there with the one the namenode reports (you can find it in /home/hadoop/tmp/dfs/name/current/VERSION).
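A minimal sketch of the fix, assuming the default directories under /home/hadoop/tmp used in this series: read the namenode's ID, then paste it over the datanode's.

$ grep namespaceID /home/hadoop/tmp/dfs/name/current/VERSION
$ sudo nano /home/hadoop/tmp/dfs/data/current/VERSION

Update the namespaceID line in the second file with the value from the first, then save and exit.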




- Now stop and start all services on master
$ stop-all.sh
$ start-all.sh

Step 6: Executing MapReduce Application

We will try the same word count example. On the master:
- Copy some more text files into /home/hadoop/Documents/books (assuming you have already run this example previously)
- Copy the files to HDFS. Run:
$ hadoop dfs -copyFromLocal /home/hadoop/Documents/books /HDFS/books

Important! If you see a SafeModeException keeping the files from being copied, turn off Hadoop's safe mode by running:
$ hadoop dfsadmin -safemode leave

You may need this because Hadoop initially starts in safe mode, in which certain operations are not allowed. It normally transitions into normal mode on its own, but sometimes this transition does not happen.


- Run the wordcount example
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /HDFS/books /HDFS/books/output

- Collect the output:
$ hadoop dfs -getmerge /HDFS/books/output $HOME/Documents/books/

Step 7: Tracking the Progress

You can track the status of the NameNode in your browser at http://master:50070.
The MapReduce tasks running on the different nodes can be seen at http://master:50030.

At this point, if you're not feeling like praying, or give a hug to the person sitting next to you, or dancing on the table, then you have no emotion.. :-|

End Note:
If you have arrived here seamlessly, or with some glitches you were able to fix yourself, then you can certainly deploy a multi-node cluster with more than 2 nodes in the real world with little effort. I would still recommend extending the existing setup to 3 nodes, perhaps with virtual machines spread across different physical machines. You may also run your own applications via Eclipse on larger data sets to enjoy the real Hadoop experience.
Happy Hadooping...

* All the experiments were performed on my Lenovo Ideapad U510 laptop with moderate specifications.
* If you find some images not being displayed (I have absolutely no idea why this happens), I have put them here for download.

Wednesday, March 12, 2014

Executing MapReduce Applications on Hadoop (Single-node Cluster) - Part 3

In our previous experiment, we ran the source code of the Word Count MapReduce application in Eclipse. This time, we are going to write our own piece of code.

Remember the permutations and combinations you studied in college? We will write a fresh approach to compute the combinations of all strings in a file. You'll have to make very few changes to the existing code.

First, you need to create a text file with some words separated by spaces:
- Create a new text file named words.txt in /home/hadoop/Documents/combinations/
- Enter some text like:
Astronomy star sun earth moon milkyway asteroid pulsar nebula mars venus jupiter neptune saturn blackhole galaxy cygnus cosmic comet solar eclipse globular panorama apollo discovery seti aurora dwarf halebopp plasmasphere supernova cluster europa juno keplar helios indego genamede neutrinos callisto messier nashville sagittarius corona circinus hydra whirlpool rosette tucanae Android cupcake donut eclair froyo gingerbread honeycomb icecreamsandwich jellybean kitkat lemonade Ubuntu warty warthog haory hedgehog breezy badger dapper drake edgy eft feisty fawn gutsy gibbon herdy heron intrepid ibex jaunty jackalope karmic koala lucid lynx meverick meerkat natty narwhal oneiric ocelot raring ringtail

- Save and exit

Now open your Eclipse. In your existing Combinatorics Project, add a new Class:
- Right click on src > New > Class
- Name it Combinations
- Replace the code with the following:

import java.io.IOException;
import java.util.Arrays;
import java.util.Date;
import java.util.SortedSet;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * MapReduce application to discover combinations of all words in a text file
 *
 * @author hadoop
 */
public class Combinations {

    // Inherit Mapper class; our input and output will be all text
    public static class MyMap extends Mapper<LongWritable, Text, Text, Text> {
        // Reusable Text object to hold each term
        private Text term = new Text();

        // Mapping function to map all words to key-value pairs
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the input line to a string
            String str = value.toString();
            // Tokenizer to break the text into words, separated by space
            StringTokenizer tokenizer = new StringTokenizer(str, " ");
            // Write a key-value pair for each term to the context (job)
            while (tokenizer.hasMoreTokens()) {
                term.set(tokenizer.nextToken());
                // Skip terms longer than 20 characters, because computing
                // combinations of longer strings quickly becomes infeasible
                if (term.getLength() > 20)
                    continue;
                // Initially, pass the term as both key and value
                context.write(term, term);
            }
        }
    }

    // Inherit Reducer class; this will be executed on multiple nodes and
    // write its output to a text file
    public static class MyReduce extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sorted collection to store the set of combinations
            SortedSet<String> list = new TreeSet<String>();
            // Iterate over each value for the key
            for (Text text : values) {
                // Find all combinations of the string
                String str = text.toString();
                int length = str.length();
                // The number of non-empty combinations is 2^n - 1
                int total = ((Double) Math.pow(2, length)).intValue() - 1;
                for (int i = 0; i < total; i++) {
                    String tmp = "";
                    char[] charArray = new StringBuilder(
                            Integer.toBinaryString(i)).reverse().toString()
                            .toCharArray();
                    for (int j = 0; j < charArray.length; j++) {
                        if (charArray[j] == '1') {
                            tmp += str.charAt(j);
                        }
                    }
                    list.add(tmp);
                }
                list.add(str);
            }
            // Write the term as key and its combinations as value
            context.write(key, new Text(Arrays.toString(list.toArray())));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Initiate a MapReduce job named "combinations"
        Job job = new Job(conf, "combinations");
        // Our keys in the output will be text
        job.setOutputKeyClass(Text.class);
        // Our values in the output will be text
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Fetch the input path from the arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Create a time-stamped directory inside the input directory for output
        FileOutputFormat.setOutputPath(job,
                new Path(args[0] + String.valueOf(new Date().getTime())));
        job.waitForCompletion(true);
    }
}

Execute the code:
- Right click > Run As > Run Configurations
- Right click on Java Application (from left pane) > New
- On top, name it Combinations
- In Arguments tab, write /home/hadoop/Documents/combinations/
- Apply and Run

The MapReduce application should run, and you should be able to collect the output in a time-stamped folder inside the /home/hadoop/Documents/combinations/ directory.
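If you want to peek at the result from a terminal, the reducer writes its output into part-r-* files inside that time-stamped folder; a quick way to print them (assuming the input directory above):

$ cat /home/hadoop/Documents/combinations/*/part-r-*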

This algorithm is explained in detail here. You should also read the comments; the rest of the code is self-explanatory.

With this experiment, we'll wrap up our experiments on Single-node Hadoop clusters. I will encourage you to do some more experiments and make yourself comfortable before you try out something Big.