Skip to main content

A step-by-step guide to query data on Hadoop using Hive

Hadoop empowers us to solve problems that require intense processing and storage on commodity hardware harnessing the power of distributed computing, while ensuring reliability. When it comes to applicability beyond experimental purposes, the industry welcomes Hadoop with warm heart, as it can query their databases in realistic time regardless of the volume of data. In this post, we will try to run some experiments to see how this can be done.

Before you start, make sure you have set up a Hadoop cluster. We will use Hive, a data warehouse to query large data sets and a adequate-sized sample data set, along with an imaginary database of a travelling agency on MySQL; the DB consisting of details about their clients, including Flight bookings, details of bookings and hotel reservations. Their data model is as below:



The number of records in the database tables are as:
- booking: 2.1M
- booking_detail: 2.1M
- booking_hotel: 1.48M
- city: 2.2K

We will write a query that retrieves total country-wise bookings with a hotel reservation, paid in USD or GBP, distributed by booking type for each country.


select c.city_title, d.booking_type, count(*) total from booking b 
inner join booking_detail d on d.booking_id = b.booking_id 
inner join booking_hotel h on h.booking_id = b.booking_id 
inner join city c on c.city_code = d.city_code where b.currency in ('GBP','USD') 
group by c.city_code, d.booking_type

This query takes 150 seconds on MySQL to execute. We will try to reduce the query execution time by importing the dataset on Hive and executing the same query on our Hadoop cluster. Here are the steps:


Importing Data

First step is to export the tables in the database CSV files. Open MySQL Workbench in your master (assuming you have MySQL Server and Workbench installed and have exported the data set in it) and execute the following queries:

select * from city into outfile '/tmp/city.csv' fields terminated by ',' lines terminated by '\n';

select * from booking into outfile '/tmp/booking.csv' fields terminated by ',' lines terminated by '\n';

select * from booking_detail into outfile '/tmp/booking_detail.csv' fields terminated by ',' lines terminated by '\n';

select * from booking_hotel into outfile '/tmp/booking_hotel.csv' fields terminated by ',' lines terminated by '\n';


This will export the tables in /tmp directory as CSV files.




Installing Hive

We now install Hive on our Hadoop-ready machine. Download hive-0.11.0 from Apache Hive’s downloads
Extract into /home/hadoop/Downloads directory

Start Hadoop services and disable safemode:

$ start-all.sh
$ hadoop dfsadmin -safemode leave
$ jps



Copy Hive binaries and libraries into /usr/local director (or wherever you wish to install):

$ sudo cp -R /home/hadoop/Downloads/hive-0.11.0/ /usr/local/

Create a temporary directory in HDFS and a warehouse directory for Hive:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /hadoop/hive/warehouse

Assign rights:

$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /hadoop/hive/warehouse

Export environment variable for Hive:

$ export HIVE=/usr/local/hive-0.11.0

Run hive:


$ $HIVE/bin/hive

You should now be in Hive prompt.


Preparing Hive Database

We will now create tables in Hive DB. Note that there are certain Hive data types that are different from traditional DBMS data types. In addition, we have specified that the fields are terminated by coma, as well as the format, in which the data will be saved. This is because Hive has separate data structures to process different formats of data. Now create the tables identical to the schema by executing following queries:

hive>create table booking (booking_id int, currency string, departure_date string, requested_date string) row format delimited fields terminated by ',' stored as textfile;

hive>create table booking_detail (booking_id int, deetails_id int, booking_type string, item_code string, item_name string, passenger_count int, city_code string, duration int, price float, breakfast string, description string) row format delimited fields terminated by ',' stored as textfile;

hive>create table booking_hotel (booking_id int, details_id int, room_type string, room_cat string) row format delimited fields terminated by ',' stored as textfile;

hive>create table city (city_code string, city_title string, country string) row format delimited fields terminated by ',' stored as textfile;


Next step is to load the data from the CSV files into Hive tables. Execute the following commands to do so:

hive>load data local inpath '/tmp/city.csv' overwrite into table city;

hive>load data local inpath '/tmp/booking.csv' overwrite into table booking;

hive>load data local inpath '/tmp/booking_detail.csv' overwrite into table booking_detail;

hive>load data local inpath '/tmp/booking_hotel.csv' overwrite into table booking_hotel;

Make sure that all tables are in place.

hive>show tables;

Now execute a query to count the number of records in booking table:

hive>select count(*) from booking;

This will execute a MapReduce job against the given query; the query took 24.5 seconds, which may seem to be surprising to you. This is because the first time, Hive initiates a lot of things, creates some files in temp directory, etc. We will execute the same query again; this time, it took 14.3 seconds. Still, this is very high time for a simple count query. This needs some explanation.

First of all, MySQL and other RDBMSs are optimized to run queries faster using techniques like Indexing, etc. On the other hand, Hive is to run queries on Hadoop framework, which is a distributed computing environment. Therefore, the time it takes to distribute the data, interpret the query, create MapReduce tasks respectively is where the time is consumed most.


Second, a simple count query is not the type of query we should run on distributed environment, as it can easily be handled by RDBMS itself. The real performance impact will be shown when we run a complex query that processes data in large volume.

Now run the same query, which MySQL executed in 150 seconds:

hive>select c.city_title, d.booking_type, count(*) total from booking b inner join booking_detail d on d.booking_id = b.booking_id inner join booking_hotel h on h.booking_id = b.booking_id inner join city c on c.city_code = d.city_code where b.currency in ('GBP','USD') group by c.city_code, d.booking_type;



The query takes 132.84 seconds on single node on Hive. This is because the MapReduce job breaks the original query into pieces, thus leveraging all cores of the CPU to process data in parallel.

We can try the same by increasing number of mappers and reducers. To do so, first quit Hive:

hive>quit;

Shut down Hadoop's services

$ stop-all.sh

Edit the mapred-site.xml file in 

$ sudo nano /usr/local/hadoop/conf/mapred-site.xml

Add two properties to the file:

<property>
<name>mapred.map.tasks</name>
<value>20</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>5</value>
</property>


Save the file and exit; start all Hadoop services and Hive again; run the same query.



This time, the query was processed in 121.3 seconds. This is because as the number of mappers increase, the data will be divided into more blocks, thus requiring less processing per block. But be cautious here, increasing mappers too much can overkill, resulting in more disk head reads and writes, while CPU sitting idle. It requires some experimentation to determine the correct number of mappers and reducers required for a job.

The experiment showed how Hive can outperform MySQL on the same machine on slow-queries. Repeating the same experiment on multi-node will definitely reduce the time. 

So.. what are you waiting for? Go for it...

PS: You may find the screenshots of this post here.

PPS: I would like to thank Nabeel Imam (imam.nabeel@gmail.com) for his helping hand in setting up and carrying out the experiments...

Comments

  1. There are lots of information about hadoop have spread around the web, but this is a unique one according to me. The strategy you have updated here will make me to get to the next level in big data. Thanks for sharing this.

    Big Data Training Chennai
    Big Data Training
    Big Data Course in Chennai

    ReplyDelete

Post a Comment

Popular posts from this blog

Playing in Amazon's Clouds - Introduction to Elastic Computing Cloud - Part 1

A really brief Intro.. Researcher, Trying to execute an extremely computationally resource hungry experiment? App developer, unsure of how much data you'll be collecting from the users? Student, tasked to build your FYP (final year project) on distributed computing environment? Just an ordinary techie trying to catch up with the world? If you're any of these, you cannot escape the fact that Cloud computing is storming in and you have to engage yourself actively in it. Adopt it, or perish. I'm a newbie (better say wannabe) in this massive web of computing, and here just to share some experiences I'm having - successes and failures. First of all, Cloud computing is nothing new, it has been there for over 3 decades and was referred with names like Grid computing  and Distributed computing . It was business people that came up with a catchy name to attract business. The idea behind distributed computing is simple. We create a network of computers t...

How to detach from Facebook... properly

Yesterday, I deactivated my Facebook account after using it for 10 years. Of course there had to be a very solid reason; there was, indeed... their privacy policy . If you go through this page, you might consider pulling off as well. Anyways, that's not what this blog post is about. What I learned from yesterday is that the so-called "deactivate" option on Facebook is nothing more than logging out. You can log in again without any additional step and resume from where you last left. Since I really wanted to remove myself from Facebook as much as I can, I investigated ways to actually delete a Facebook account. There's a plethora of blogs on the internet, which will tell you how you can simply remove Facebook account. But almost all of them will either tell you to use "deactivate" and "request delete" options. The problem with that is that Facebook still has a last reusable copy of your data. If you really want to be as safe from its s...

Yet another Blog on Query Optimization for MySQL Server

If you have been into MIS development for some time, then you may have realized that buying latest, multi-thousand-dollar Machine, stuffed with a top notch processor and an army of memory chips is not sufficient to your needs when it comes to processing large data, especially when your DBMS is MySQL Server. In this article, I have tried to input  the tips and techniques to-be-followed - some in general and some specific to MySQL Server; but I would, as every blogger, repeat the same common phrase that " in the end   it all depends on your scenario ". The results you are going to see will mostly be in milliseconds so before thinking "is it worth the effort if the result is in a few milliseconds?", do know that these results are derived using a very very simple database with not more than 100000 records in a table.  With complex databases and records in millions, the effort will pay you back. Coming straight to topic, here are some points you should not ign...