Saturday, July 23, 2011

SQL SERVER – Difference Between Union vs. Union All – Optimal Performance Comparison


UNION

The UNION command is used to select related information from two tables, much like the JOIN command. However, when using the UNION command all selected columns need to be of the same data type. With UNION, only distinct values are selected.

UNION ALL
The UNION ALL command is equal to the UNION command, except that UNION ALL selects all values.

The difference between Union and Union all is that Union all will not eliminate duplicate rows, instead it just pulls all rows from all tables fitting your query specifics and combines them into a table.

A UNION statement effectively does a SELECT DISTINCT on the results set. If you know that all the records returned are unique from your union, use UNION ALL instead, it gives faster results.

Run following script in SQL Server Management Studio to see the result between UNION ALL and UNION. Download complete script from here.

/* Declare First Table */
DECLARE @Table1 TABLE (ColDetail VARCHAR(10))
INSERT INTO @Table1
SELECT 'First'
UNION ALL
SELECT 'Second'
UNION ALL
SELECT 'Third'
UNION ALL
SELECT 'Fourth'
UNION ALL
SELECT 'Fifth'
/* Declare Second Table */
DECLARE @Table2 TABLE (ColDetail VARCHAR(10))
INSERT INTO @Table2
SELECT 'First'
UNION ALL
SELECT 'Third'
UNION ALL
SELECT 'Fifth'
/* Check the data using SELECT */
SELECT *
FROM @Table1
SELECT *
FROM @Table2
/* UNION ALL */
SELECT *
FROM @Table1
UNION ALL
SELECT *
FROM @Table2
/* UNION */
SELECT *
FROM @Table1
UNION
SELECT
*
FROM @Table2
GO

In our example we have two tables: @Table1 and @Table2.

Now let us run UNION ALL and UNION together and see the resultset as well as Execution Plan compared to complete set of query. You can always turn on actual execution plan using CTRL+M.

We can see from the resultset of UNION ALL that it returns everything from both the table but from UNION it is very clear that only DISTINCT rows from both the table is only retrieved.

Additionally, when comparing the execution plan of UNION ALL and UNION it is also quite clear that UNION ALL is way less expensive than UNION as it does not have DISTINCT SORT operation.

Let me know what do you think about this article. If you have any suggestion for improvement please let me know and I will update articles according to that.


------------------------------------------------


I should be using UNION ALL instead of simple UNION in my benchmarks, and he was right. Numbers would be different but it should not change general point of having optimization of moving LIMIT inside of union clause being cool thing.

But So is UNION ALL indeed faster than UNION DISTINCT (the UNION is shortcut for UNION DISTINCT) ?

Indeed it is. I did not have the same data as I used for the other test but I created similar test case – table with separate indexes on "a" and "b" columns with cardinality of 100, having about 40.000.000 of rows

select * from test.abc where i=5 union  select * from test.abc where j=5 

This original query was taking about 22 seconds.

As I modified it:

select * from test.abc where i=5 union all select * from test.abc where j=5 and i!=5 

The query time dropped to about 6 seconds which is 3.5 times faster – quite considerable improvement.

As you can notice I added "i!=5″ clause – this is what allows us to ensure we do not have duplicate rows in result set matching both conditions and so result will be same as query with "i=5 or j=5″ where clause.

I also tried this original query (which uses index merge method in MySQL 5.0):

select * from test.abc where i=5 or j=5 

Such query takes 4 seconds so if you do not need to trick with order by and limit using index merge is faster than UNION as it indeed should be.

So why UNION ALL is faster than UNION DISTINCT ?

The first informed guess would be – because UNION ALL does not need to use temporary table to store result set, however this is not correct – both UNION ALL and UNION distinct use temporary table for result generation. Perhaps one more thing for Optimizer Team to look into.

Interesting enough the fact UNION and UNION ALL require temporary table can only be seen in SHOW STATUS – EXPLAIN does not want to tell you this shameful fact:

mysql> explain (select * from test.abc where i=5) union all (select * from test.abc where j=5 and i!=5) \G *************************** 1. row ***************************            id: 1   select_type: PRIMARY         table: abc          type: ref possible_keys: i           key: i       key_len: 5           ref: const          rows: 348570         Extra: Using where *************************** 2. row ***************************            id: 2   select_type: UNION         table: abc          type: ref possible_keys: i,j           key: j       key_len: 5           ref: const          rows: 349169         Extra: Using where *************************** 3. row ***************************            id: NULL   select_type: UNION RESULT         table:           type: ALL possible_keys: NULL           key: NULL       key_len: NULL           ref: NULL          rows: NULL         Extra: 3 rows in set (0.00 sec) 

In fact EXPLAIN output is the same for UNION and UNION ALL (which is too bad as execution for them is obviously different).

The difference in execution speed comes from the fact UNION requires internal temporary table with index (to skip duplicate rows) while UNION ALL will create table without such index.

This also explains why difference becomes larger when on disk table is required (as in this case) – Hash indexes used by MEMORY table are very efficient and do not give so much overhead.



No comments:

Post a Comment