Monday, June 11, 2012

Yet another Blog on Query Optimization for MySQL Server


If you have been into MIS development for some time, then you may have realized that buying latest, multi-thousand-dollar Machine, stuffed with a top notch processor and an army of memory chips is not sufficient to your needs when it comes to processing large data, especially when your DBMS is MySQL Server.

In this article, I have tried to input the tips and techniques to-be-followed - some in general and some specific to MySQL Server; but I would, as every blogger, repeat the same common phrase that "in the end it all depends on your scenario".


The results you are going to see will mostly be in milliseconds so before thinking "is it worth the effort if the result is in a few milliseconds?", do know that these results are derived using a very very simple database with not more than 100000 records in a table. With complex databases and records in millions, the effort will pay you back.


Coming straight to topic, here are some points you should not ignore when designing tables and writing queries for retrieving data.

Choosing storage Engine


MySQL is unique in a way that it offers different Engines for storage. The two mostly used are InnoDB and MyISAM. When designing tables, choosing appropriate storage engine serves you as lifesaver afterwards.

The most significant differences between InnoDB and MyISAM are that InnoDB supports transactions, foreign keys, row locks, data caches and clustered indexes; it serves best for Insert, Update and Delete operations, more controlled locking, and maintaining referential integrity.

On the other hand, MyISAM, in comparison supports full-text indexing and 4x storage capacity i.e. 256TB (to date), while it: lacks foreign key support; is limited to table locks; misses data caching.

Shortly, InnoDB should be selected for a table when DML operations are more frequent than Select, not because of speed but reliability and durability. While MyISAM is best for data warehouse. More precisely, for tables used for search and reporting purpose.

Indexing and Primary Key


Indexing is a method to sort a number of records into one or more columns. The Index itself works as representative of records in a table; it can be created from one or more columns, and functions as a pointer to the records it's related to. When you define primary key for a table, a primary key index is created automatically by DBMS.

Although creating indexes boosts up data read performance, remember that each index you create requires some additional disk space and the write performance decreases as the number of indexes grow.

Below is an experiment on read/write performance in indexed and non-indexed tables.

create table person
(
    id int(11),
    name char(50),
    nic varchar(15)
)   engine = MYISAM;

create table contact
(
    id int(11),
    contact varchar(20)
)   engine = MYISAM;

Important: Execute the command below to turn off query caching to evaluate actual query execution time.
set session query_cache_type = off;
insert into person values (... Random 10000 records ...);
Result: 2821ms

insert into contact 
select id, [Random number]) as contact from person;
Result: 12ms

select p.name, c.contact from person as p inner join contact as c using (id) limit 10000;
Result: 5.241sec

Now creating primary keys on both tables.

alter table person change column id id int(11) not null, add primary key (id);
alter table contact change column id id int(11) not null, add primary key (id);

Trying the same queries after truncating both tables first.

insert into person values (... Random 10000 records ...);
Result: 2862ms

insert into contact 
select id, [Random number]) as contact from person;
Result: 34ms

select p.name, c.contact from person as p inner join contact as c using (id) limit 10000;
Result: 0.078sec

You may notice little in insert, but dramatic change in select query after indexing by creating primary key.

Talking about Primary keys, there is a misconception that INT data type works faster than char or equivalent. Here is an experiment proving it false.

alter table person change column id id char(11) not null;
alter table contact change column id id char(11) not null;

select p.name, c.contact from person as p inner join contact as c using (id) limit 10000;
Result: 0.078sec (same result)

It doesn't really matter what data type you choose, but the length of data itself. The length should be chosen very carefully, tying best to keep it low. You don't want to keep your Primary Key column(s) BIGINT unless you are willing to register every person, animal, fish and tree on the planet to your system.

Writing Select Statement


Not something to mention, the primary purpose of a Database isn't to pushing data in, in fact the data never read shouldn't have been there in the first place; you retrieve the data in chunks and bulks, of various combinations, in various dimensions. You do so most of the time using Select query. So, in abstract, the efficiency of your data retrieval  - secondarily, after design - depends upon how beautifully you write your Select queries.
  1. Avoid Select * ...
    I know I sound annoying but this is true. The more the number of columns, the more the DB Engine processes the bytes. This serves most in cases where you have full-text columns like Address, Password_Hint, Essay_on_National_Hero, etc. Restricting your columns to only needed relieves your processor, especially when "Fetching" data. Have a look at the results:

    Table schema (different from above example):
    create table contact
    (
      PID varchar(12) not null,
      AddressHouse varchar(50),
      AddressStreet varchar(50),
      AddressState varchar(50),
      City varchar(50),
      Country varchar(50),
      Region varchar(50),
      Phone varchar(20),
      Mobile varchar(20),
      MobileOwner varchar(50),
      Email varchar(100),
      LocationLat float,
      LocationLon float,
      primary key (PID)
    ) engine=MYISAM;

    create table person
    (
      PID varchar(12) not null,
      Salutation varchar(12),
      FirstName varchar(50),
      MiddleName varchar(50),
      LastName varchar(50),
      SurName varchar(50),
      GuardianName varchar(100),
      Gender char(1),
      DOB dateL,
      MaritalStatus varchar(12),
      NIC varchar(20),
      Religion varchar(12),
      Alive bit(1),
      Picture tinyblob,
      primary key  (PID)
    ) engine=MYISAM;

    Selecting...

    select person.*, contact.* from person inner join contact using (id) limit 10000;
    Result: Duration 0.094 sec. Fetched in 0.062 sec

    select person.id, contact.contact from person inner join contact using (id) limit 10000;
    Result: Duration 0.094 sec. Fetched in 0.016 sec

  2. Choosing Base tables
    Base tables are actually the tables from which Views are derived but here, what I mean is the first tables you define right after the term "from" in the Select statement, like select blah_blah from base_table. Other tables will be called Join tables.
    For example, if you want to retrieve a list of those names from person table for which the contact number in contact table ends on 9. What you would typically do would be something like:

    select p.name, c.contact from person as p inner join contact as c using (id) where contact like '%9';
    Result: 0.031sec

    Because Where clause is executed first, if you choose contact as your base table and person as join table, then you will have unnecessary records filtered out before the two tables are joined. Interchanging the positions of both the tables:

    select p.name, c.contact from contact as c inner join person as p using (id) where c.contact like '%9';
    Result: 0.015sec

  3. Hierarchy of Joins
    Following the above principle, correct placement of each join limits the data for next join. In addition, use the column(s) of table right before join table when joining instead of using Base table's columns throughout the list of joins. Here is an example:

    Adding some more tables:
    create table mobile_sp
    (
      sp_id int(11),
      sp_name varchar(20) not null,
      prefix char(6),
      primary key (sp_id)
    ) engine=MYISAM;


    create table person_mobile_sp
    (
      id int(11) not null,
      sp_id int(11) not null,
      primary key (id,sp_id)
    ) engine=MYISAM;


    Bad practice:

    select a.id, a.name, b.contact, d.sp_name from person as a
    inner join contact as b on b.id = a.id
    inner join person_mobile_sp as c on c.id = a.id
    inner join mobile_sp as d on d.sp_id = c.sp_id;
    Result: 0.140sec
    Good practice:

    select a.id, a.name, b.contact, d.sp_name from person as a
    inner join contact as b on b.id = a.id
    inner join person_mobile_sp as c on c.id = b.id
    inner join mobile_sp as d on d.sp_id = c.sp_id;
    Result: 0.125sec
    Avoid misusing outer joins, use straight joins and preferably inner joins wherever possible.

    To demonstrate this, first wipe off some records from
    person_job table:

    delete from person_job where ...;


    Bad Practice:
    select p.name, j.designation from person as p left outer join person_job as j using (id) where j.designation is not null limit 100000;
    Result: 0.141sec

    Good Practice:
    select p.name, j.designation from person as p inner join person_job as j using (id)  limit 100000;
    Result: 0.062sec
  4. Writing Where clause
    Last but not least, Where clause is where you should be focusing most when optimizing queries. There are various tips on how to organize where clause like:
    • Order:
      ... where gender = 'M and age < 25 should be ordered as
      ... where age < 25 and gender = 'M'
      Because there are more chances of limiting number of records in age condition than gender.
      Note: Some clever DBMS's like Oracle do such optimization automatically. However, I am unsure that MySQL is in the list.
    • Between:
      Using
      age between 13 and 19 is slightly better than age >= 13 and age <= 19
    • IN vs EXISTSWhere exists (... sub-query ...) is more efficient than Where column_name in (... sub-query ...) when you must use sub-query.
    • Wild characters:
      Learn to use ? and _ wild cards instead of functions like
      substring(column, 1, 5) = 'BLAAH'.
    • Maths:
      Use and avoid calculations wisely.
      where weight in (60, 70, 80, 90, 100) should be transformed into where weight mod 10 = 1. Similarly, where (duration * 24 * 60) > 7200 should be simplified as where duration > 5
  5. Grouping
    Group by clause is used typically with an aggregate function like max(...), sum(...), etc. In a situation where you are applying one or more aggregate functions to your main query to create a summary of records, Group by clause saves your server from extra burden and you from headache. The example below shows how to avoid a sub-query by grouping:
    select j.designation, (select min(annual_income) from person_job where designation = j.designation) as average from person_job as j;
    Disastrous Result: 45.692sec

    select j.designation, min(j.annual_income) as min_income from person_job as j group by j.designation;
    Result: 0.015sec

    Now can imagine the same with max, count and sum functions added???

  6. Use existing Function
    The assumption that "the DBMS function is doing the same thing in the back-end that I am in my function" is fallacious; you don't know the back-end so don't predict. It is very essential for a query writer to learn using built-in aggregate functions for calculations than inventing.
    Have an example:
    select j.designation, sum(j.annual_income) / count(id) as average from person_job as j group by j.designation;
    Result: 0.031sec

    select j.designation, avg(j.annual_income) as average from person_job as j group by j.designation
    Result: 0.015sec

    And these get worse with dates, like: writing
    select substring(dob, 1, 4) as year instead of simply saying select year(dob) as year.

  7. Difference between WHERE and HAVING
    The only difference (which is a big one) between Where and Having is that Having clause is executed after data has been fetched, i.e. after Select query is executed. It works like filters and instead of column names, it uses aliases (if no alias is defined, column name is used).
    Example: We are interested in finding out for each vendor, the designations with average annual income greater than 500000:

    select m.sp_id, j.designation, avg(j.annual_income) as average from person_mobile_sp as m
    where avg(j.annual_income) > 500000
    inner join person_job as j using (id)
    group by m.sp_id, j.designation
    having average > 500000;

    Using Having instead of Where should give you false results but not vice-versa. So do not use Having for any other purpose.
  8. Choosing between Join and Sub-queryRule of thumb, when you want more than 1 column from a table, don't use a sub-query. However, sometimes you need to create a join with sub-query itself when a table is unavailable, but that again is a join. Look at the structure below:
    select a.col1, b.col1, c.col1 from table_a as a 
    inner join table_b as b on b.id = a.id
    inner join (select id, col1, sum(col2) as total from table_c group by id, col1) as c on c.id = b.id

    Sub-queries can be used in situations where grouping is not desired and you want to apply an aggregate function on a separate table. For example, you want to write a query for each person with his id, name, designation, annual income, highest annual income for his designation, mobile service provider and total subscribers of his service provider:
    select p.id, p.name, j.designation, j.annual_income,
    (select max(annual_income) from person_job where designation = j.designation) as max_income,
    sp.sp_name, (select count(id) from person_mobile_sp where sp_id = m.sp_id) as total_subscribers from person as p 
    inner join person_job as j on j.id = p.id
    inner join person_mobile_sp as m on m.id = p.id
    inner join mobile_sp as sp on sp.sp_id = m.sp_id;

    Result: Duration 14.758sec. Fetched in: 31.356sec

    Obviously, this is undesirable time for you so you optimize the query. Instead of writing sub-queries for aggregate functions, you can create a join with separate queries. Doing so, the tables will be created only once:

    select p.id, p.name, j.designation, j.annual_income, a.max_income, sp.sp_name, b.total_subscribers from person as p 
    inner join person_job as j on j.id = p.id 
    inner join person_mobile_sp as m on m.id = p.id 
    inner join mobile_sp as sp on sp.sp_id = m.sp_id 
    inner join (select designation, max(annual_income) as max_income from person_job group by designation) as a on a.designation = j.designation 
    inner join (select sp_id, count(id) as total_subscribers from person_mobile_sp group by sp_id) as b on b.sp_id = m.sp_id;
    Result: 1.794sec

Views and Temporary Tables


Views are literally nothing other than Select statements stored for reuse. In MySQL, Views are handled using two Algorithms - Merge and Temptable, the third Undefined is a default option which leaves the job of choosing algorithm to DBMS. Merge algorithm cannot be used if the statement contains an aggregate function, sub-query, grouping or union.
I prefer using Temporary tables instead of views because first, I, almost all the time need a separate data container when I have hard time with sub-queries, groups and aggregate functions (which Mege algorithm doesn't support) and secondly, I often need to define additional indices for better performance, something that cannot be done with views.

For more on MySQL views, check this old-yet-useful blog on performance.
Creating temporary tables is one of the best techniques in query optimization, not specific to MySQL but in general. And if you choose to create these in Memory, then the performance will be magical. The example below illustrates how useful Temporary tables can be:

Our objective is to create a report for each person containing his id, contact no, name, mobile service provider, total subscriber of provider, job designation, annual income, minimum and maximum income for his designation and difference of his annual income from average annual income for his designation.

Without using temporary tables or joins with queries, your server will weep like a starving baby:

select p.name, c.contact, sp.sp_name, (select count(id) from person_mobile_sp where sp_id = m.sp_id) as total_subscribers, j.designation, j.annual_income, (select min(annual_income) from person_job where designation = j.designation) as min_income, (select max(annual_income) from person_job where designation = j.designation) as max_income from person as p

inner join contact as c on p.id = c.id 
inner join person_job as j on p.id = j.id 
inner join person_mobile_sp as m on p.id = m.id 
inner join mobile_sp as sp on m.sp_id = sp.sp_id;
Result: Duration 17.410sec. Fetched in 71.012sec


* Note that I have excluded calculation of average because - as I said - my laptop almost started to weep

Now using Temporary table
drop table if exists tmp_mobile_sp_count;
create table tmp_mobile_sp_count
select sp_id, count(id) as total_subscribers from person_mobile_sp
group by sp_id;
Result: 0.203sec

drop table if exists tmp_designation_income;
create table tmp_designation_income
select designation, min(annual_income) as min_income, max(annual_income) as max_income, avg(annual_income) as average_income from person_job
group by designation;
Result: 0.468sec

select p.name, c.contact, sp.sp_name, tm.total_subscribers, j.designation, j.annual_income, td.min_income, td.max_income, (j.annual_income - td.average_income) as difference_from_average from person as p
inner join contact as c on p.id = c.id 
inner join person_job as j on p.id = j.id 
inner join person_mobile_sp as m on p.id = m.id 
inner join mobile_sp as sp on m.sp_id = sp.sp_id
inner join tmp_mobile_sp_count as tm on sp.sp_id = tm.sp_id
inner join tmp_designation_income as td on j.designation = td.designation limit 100000; 
Result: Duration 0.016sec. Fetched in 0.296sec (for 9090 rows)
Total Duration: 0.203 + 0.468 + 0.016 + 0.296 = 0.983sec

But this is not end of the story, you can avoid using temporary tables if you are okay to trade simplicity with performance, by using the same method of creating joins from queries, which will be even faster. Look at the results below:



select p.name, c.contact, sp.sp_name, a.total_subscribers, j.designation, j.annual_income, b.min_income, b.max_income, (j.annual_income - b.average_income) as difference_from_average from person as p
inner join contact as c on p.id = c.id 
inner join person_job as j on p.id = j.id 
inner join person_mobile_sp as m on p.id = m.id 
inner join mobile_sp as sp on m.sp_id = sp.sp_id 
inner join (select sp_id, count(id) as total_subscribers from person_mobile_sp group by sp_id) as a on m.sp_id = a.sp_id 
inner join (select designation, min(annual_income) as min_income, max(annual_income) as max_income, avg(annual_income) as average_income from person_job group by designation) as b on j.designation = b.designation;
Result: Duration 0.31sec. Fetched in 0.31sec

So again when to use Temporary tables? The answer is simple: Not when the data inside the tables is not meant to be reused and not for jobs that can be done by Union. In our example, the tables we created may be used in other queries as well.


An example of how Temporary tables can be avoided by using Union:

Bad Practice:
create table tmp
select a.col1, a.col2 from table_a as a;
insert into tmp
select b.col1, b.col2 from table_b as b 
where not exists (select col1 from table_a where col1 = b.col1);
insert into tmp
select c.col1, c.col2 from table_c as c 
where not exists (select col1 from table_a where col1 = c.col1)
and not exists (select col1 from table_b where col1 = c.col1);

Good Practice:
select a.col1, a.col2 from table_a as a 
union

select b.col1, b.col2 from table_b as b 
union

select c.col1, c.col2 from table_c as c;

Keep in mind that Union gives you unique values only, for repeated values use Union all.

Because use of temporary tables requires more than one statements to execute, a good way to organize these statements is creating a Stored Procedure and calling it from your code.

Problem with Update


In MySQL Server, performance of Update statement much slower than insert, so the idea of creating a temporary table with empty column and then updating this column using a bulk update query. Thinking that a sub-query in Select will be slower and divide the operation in parts is good in general, but disappointing in MySQL when one of the parts is Update statement.


Lets take an example. You are willing to display for each person his name, his mobile service provider and total number of subscribers for his provider. Using the divide and conquer fall approach like above, you will have an unpleasant surprise:


drop table if exists tmp_person_subscriber;

create table tmp_person_subscriber
select p.name, sp.sp_id, sp.sp_name, 0 as total_subscribers from person as p
inner join contact as c on p.id = c.id 
inner join person_mobile_sp as m on p.id = m.id 
inner join mobile_sp as sp on m.sp_id = sp.sp_id;
Result: 0.811sec


update tmp_person_subscriber as t
set total_subscribers = (select count(id) from person_mobile_sp where sp_id = t.sp_id);
Result: 22.215sec




Using direct approach:



select p.name, sp.sp_id, sp.sp_name, (select count(id) from person_mobile_sp where sp_id = m.sp_id) as total_subscribers from person as p
inner join contact as c on p.id = c.id 
inner join person_mobile_sp as m on p.id = m.id 
inner join mobile_sp as sp on m.sp_id = sp.sp_id;
Result: Duration 1.358. Fetched in 0.655

Splitting the Time between Server and Client


In almost 9 out of 10 reports (my perception), you will be using a client-side application to retrieve queries instead of MySQL console or other Querying tools. You can save query time by shifting some load to client-end. Like if you are designing reports in a reporting tool like iReports, a Crystal Reports alternative, you can completely avoid:
  • in-query calculations like date_diff(curdate(), date_of_birth), amount + (amount * 0.12), etc. by creating separate fields for these calculated values.
  • grand aggregates like "sum of counts", "sum of sums", or "average of sums".
  • Order by clause by sorting in the reporting tool

There is more to the list, depending upon the tool you are using. But if your tool's name is like "[some super] Java IDE", "C++ Monster", "PHP Guru" (these are  fictitious so don't google them) or other programming interface, then I strictly advise to stick to query because besides the smartness of your Programming IDE, the way the data is handled through data structures of programming language is different than T-SQL, which are designed built to handle data.

When nothing works.. Warehouse does


Data Warehousing is (not exaggerating) the final solution to most of your complex reporting problems when you have a separate repository, containing all your historical data (NOT Live) and free hand to create as many tables - formally called Dimension tables and Fact tables in Data Warehousing language - in any way you like.


Data Warehousing itself is a complete branch of Data reporting, which helps in massive analysis for decision making, including data mining.


You can experiment this by creating a replica of your Transaction Database and deriving different tables for reporting purpose from it. This new warehouse can be fed new data after a specific interval of time (daily, weekly, or bi-weekly, etc.). However, the biggest setback for this is that you cannot use it for Live reporting like in Dashboards.


Hope the blog helped you. The tips and hints are not cent percent guaranteed to work for you, but as I explained, it all depends on your application's architecture and preferences. Secondly, these are all from my own limited experience of Programming and query design.



Note: All queries were tested on MySQL Workbench on my laptop equipped with Intel Core i3 M380@2.53GHz CPU and 4GB DDR3@1066MHz RAM, running MySQL Server 5.5 on Windows 7 Home Premium.


Please leave a comment if you like. Corrections are very much appreciated...

-
Owais