Hash Join Operator
Published: Nov 14, 2014
This is the last of the join operators and possibly the most powerful. It is probably the most common join operator and yet, for some reason, the one I most often hear people say they’d like to avoid. Personally I don’t agree with that at all… although it does have a few downsides (which I’ll cover later in this post), it’s far and away the best choice when joining large datasets together and should therefore be welcomed.

As I’ve said in all previous posts on join operators, this won’t be highly technical; it’s more of a guide and overview covering basic understanding, usage and pitfalls. If you wish to get an incredibly detailed description of the operator and its internals, then there’s a superb blog post (as with all other operators) from Craig Freedman which can be found here.

So what is a HASH join? Well the clue is in the name… it takes a set of input rows, hashes the keys, and uses those hash values as a basis for the join.

This is easiest explained in a (very) fake example…

Let’s presume we have 2 tables as follows:

Now, what we want to do is join these on myJoinCol = myOtherJoinCol. So how would a HASH join deal with this?

For the sake of this demo, we’ll assume that the first step (build phase) will use the larger table. In real life this wouldn’t happen as ideally we would want the smaller table to be the build table. Anyway… internally SQL Server will hash the join columns (equijoins only) of the build table to create a hash table. This is an in-memory structure.

In this example (I’ve used a checksum for my hash key) we would have the following:

Next we have the second (Probe) phase. SQL Server will run through each row in the Probe table, hash the join column and then scan the hash table for a match. If one is found then there is one extra check performed against the actual column values to make sure they really do match, because two different values can hash to the same result.

It’s that simple.
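If you want a feel for the hashing step itself, here’s a minimal sketch using CHECKSUM against a few made-up values. This is only an illustration: CHECKSUM is the stand-in I used for the hash key above, not the hash function SQL Server actually uses internally.

-- Illustration only: the build phase hashes each join key in a similar fashion
-- and stores the results in an in-memory hash table.
select v.myJoinCol,
       checksum(v.myJoinCol) as hashKey
from (values ('Apple'), ('Banana'), ('Cherry')) v(myJoinCol)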

So what are the downsides? Well, there are a couple of major ones really… memory and data flow.

Data Flow

In a Nested Loop join SQL Server picks a row and then loops through the join table to find a match. Once it has one it writes this row out and moves on to the next, knowing that it will not need that record again.

In a Merge join SQL Server requires ordered inputs, so as each record is joined and passed along it knows that record will no longer be needed and can therefore write it out to the client before moving on to the next.

As you can see, in both of these examples there is a constant flow of data through the join and therefore each subsequent step in the execution plan will receive data in a stream and can process it accordingly. This helps with the overall speed of the query.

This is the main reason that, when the “option (fast 1)” hint is used, SQL Server will invariably switch the join method to a Nested Loop or Merge instead of a Hash.

However, with a Hash join this is not the case. SQL Server needs to consume the entire build input and create the hash table before it can return a single row; only once the build phase is complete can rows start to flow out as the probe input is processed. This is why the Hash join is described as a “blocking” operator.
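For reference, here’s a sketch of how that hint would be applied to the demo query used later in this post (whether the optimizer actually changes the join type will depend on the indexes available to it):

-- Sketch: asks the optimizer to return the first row as quickly as possible,
-- which tends to steer it towards streaming joins rather than a blocking Hash.
select t.id, t.miscValues, t1.miscOtherValues
from temp t
join temp1 t1
on t.id = t1.id
option (fast 1, maxdop 1)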

Memory

As mentioned above, a Hash join is an in-memory operation as it creates the hash table in memory, and therefore it requires a Memory Grant in order to run. Neither of the other join types requires this, so on systems with very limited RAM a Hash join can be seen as bad, even though it may be the most effective join method for the query.

How SQL Server calculates this grant also requires your statistics to be up to date, as cardinality plays a key role. For example, if your statistics say a table holds 1000 rows, SQL Server may decide it requires 3MB of memory in order to create your hash table… however, if those statistics are wrong and the table actually holds only 10 rows, then SQL Server will allocate far too much memory, and this could (if all queries were out by such a factor) reduce the number of concurrent queries your machine would otherwise be capable of running.

Likewise, if the table actually holds 10000 rows then this becomes even more of a problem for SQL Server. This is because a memory grant is allocated at the time of execution and, once allocated, it cannot be changed. At all. Ever. This means that you now have (using the example above) a 3MB allocation when you actually require maybe 30MB.

Therefore the only way that SQL Server can cope with this is to write some of the Hash table out to disk. This manifests as a Worktable in tempDB and, as you can guess, will severely hinder the performance of your join. Also, not only does part of your hash table get written to disk, but any rows from your probe table which belong to the spilled portion of the hash table will also be written out to disk.

Once SQL has finished the join it will then have to read the data back from tempDB into memory to continue with the query execution. All this writing to and from disk is incredibly costly and therefore it’s vital that your statistics are always kept up to date.
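One way to keep an eye on this is to check when your statistics were last updated and how much the table has changed since. Here’s a sketch (assuming a version of SQL Server with the sys.dm_db_stats_properties DMF, i.e. 2008 R2 SP2 / 2012 SP1 onwards):

-- Sketch: when were the statistics on 'temp' last updated, how many rows did
-- they see, and how many modifications have happened since?
select s.name as statsName,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
from sys.stats s
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where s.object_id = object_id('temp')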

Here is an example (do not run this in production) which I’ve created to show the memory spill and the impact it can have on a query…

Here’s the setup I’m using…

use master
go

set nocount on

if object_id('temp') is not null drop table temp
if object_id('temp1') is not null drop table temp1

declare @counter int = 1

create table temp
(
   id int identity(1, 1),
   miscValues char(2)
)

begin transaction
   while @counter <= 1000000
   begin
      insert into temp
      select 'a'

      set @counter += 1
   end
commit transaction

set @counter = 1

create table temp1
(
   id int identity(1, 7),
   miscOtherValues varchar(2)
)

begin transaction
   while @counter <= 100000
   begin
      insert into temp1
      select 'b'

      set @counter += 1
   end
commit transaction


Once those tables have been created, let’s run a simple join...

select t.id, t.miscValues, t1.miscOtherValues
from temp t
join temp1 t1
on t.id = t1.id
option (maxdop 1)


This will use a Hash join (I’ve also hovered over the SELECT operator so that the Memory Grant value is visible)…
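If you’d rather not hover over the plan, a sketch against the sys.dm_exec_query_memory_grants DMV (run from a second session while the query executes) shows the same information:

-- Sketch: per running query, the workspace memory requested, granted and used.
select session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb
from sys.dm_exec_query_memory_grants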

This is the resulting data from SQL Profiler…
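(If you don’t have Profiler to hand, switching on the session statistics before running the query gives a rough equivalent in the Messages tab. A sketch:)

-- Sketch: reports logical reads, CPU time and elapsed time per statement.
set statistics io on
set statistics time on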

Now, let’s upset the Query Optimizer by rigging the statistics on the tables a little so that it believes it will actually be receiving a lot less data and therefore will allocate a smaller Memory Grant to force a spill…

update statistics temp with rowcount = 10000


(NEVER use this in production… ever… ever… just no)
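(If you do try this on a test box and want to put the row count metadata right again afterwards, a sketch along these lines should correct it; again, test and dev only:)

-- Sketch: recounts the rows in 'temp' and corrects the rigged metadata.
dbcc updateusage (0, 'temp') with count_rows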

Running it again, we now get this…

It’s obvious that even in this tiny example (the row size of my tables is so small) this has made a big difference to the reads, CPU, and duration. Imagine the chaos that would be caused on a decent sized dataset!

You can also see the effects on the execution plan, with the memory grant being MUCH smaller and (if you’re running SQL Server 2012 or later; earlier versions do NOT show this) the warning sign on the Hash operator:

Here you can see the warning provided by the Hash operator:
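If you’d rather catch these spills across the whole server instead of spotting them plan by plan, here’s a sketch of an Extended Events session (assuming the hash_warning event, which is the Extended Events equivalent of the Profiler “Hash Warning” event):

-- Sketch: captures hash spill warnings server-wide into a ring buffer.
create event session HashSpills on server
add event sqlserver.hash_warning
add target package0.ring_buffer
go
alter event session HashSpills on server state = start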

As you can see, the Hash join does have a downside or two, but for large datasets with accurate statistics you simply can’t beat it. To prove the point I attempted to run the same query using the same two tables (with correct statistics) but with a Nested Loop join… this ran for 1 hour 12 mins before I got bored and stopped it executing. It had only retrieved 27246 records out of the 100000 it needed to complete, and the stats for this partial run were through the roof…

select t.id, t.miscValues, t1.miscOtherValues
from temp t
join temp1 t1
on t.id = t1.id
option (maxdop 1, loop join)


Therefore please don’t follow the lead of the many devs I’ve spoken to who say “we don’t like to see Hash joins in execution plans”. As the loop join example above shows, they really aren’t evil; they just need to be used appropriately (large tables) and with care (accurate statistics), and they can definitely be your friend.