People are always mentioning statistics: "keep your statistics up to date" and so on. In my last post I gave a good example of why you really should keep them up to date, but there aren't many places explaining what statistics actually are, so I figured I'd give a quick overview.
This won't be too in depth; it's simply a glance at what stats SQL Server holds, why they're so important to SQL Server, and how it arrives at all those estimated figures we know and love from our estimated execution plans.
So, first things first, let’s have a look at a couple of queries…
For these I’m going to use the AdventureWorks2012 database so these are tests you can happily verify yourself on your own server.
So let’s locate some statistics… well the easiest way is to look in Management Studio. Simply find your table in Object Explorer, expand it, then look in the Statistics folder:
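If you'd rather use a query than click around Object Explorer, the catalog views will list them too. A quick sketch using sys.stats and sys.stats_columns (run against the AdventureWorks2012 database):

```sql
-- List the statistics objects on Person.Person and the columns they cover
-- (an alternative to browsing the Statistics folder in Object Explorer)
SELECT  s.name          AS stats_name,
        s.auto_created,                 -- created automatically by the optimizer?
        s.user_created,                 -- created via CREATE STATISTICS?
        c.name          AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('Person.Person')
ORDER BY s.stats_id, sc.stats_column_id;
```

Statistics created alongside an index (like the one we're about to look at) will show 0 for both auto_created and user_created.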
In our case we're going to focus on the LastName, FirstName, MiddleName stats. These are the statistics that SQL Server created for us when the corresponding index was created.
So, how do we see what’s inside? We use a DBCC command…
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName')
This will give you 3 separate tables of data to look through. We’ll start with the first:
As you can see, this is basically the summary view and tells us a few useful things. For example, the "Unfiltered Rows" value on the far right equates to the number of rows in the table as a whole, whereas the "Rows" column is how many rows the statistics object covers (the rows in the index)... therefore if this were a filtered index these would, of course, be different values.
Secondly you can see that these were last updated back in 2012. Not ideal, but as I’ve never altered the data in this table, that’s not a problem. You can also see they were updated by a FULL SCAN operation and not a SAMPLE. This is shown by the fact that the “Rows Sampled” column matches the “Rows”.
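As an aside, if you only want one of the three result sets rather than all of them at once, DBCC SHOW_STATISTICS takes an option to return each individually:

```sql
-- Just the summary (first) result set
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName')
    WITH STAT_HEADER;

-- The density vector (second result set)
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName')
    WITH DENSITY_VECTOR;

-- The histogram (third result set)
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName')
    WITH HISTOGRAM;
```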
The second table is most useful for giving the average data length of the columns in our index. These are used by SQL Server when estimating the row sizes in execution plans:
Also note that the last entry includes the BusinessEntityID... this is not actually in the index definition, but because it's the clustered index key (here the primary key) of the table it will ALWAYS be included in any non-clustered index as the lookup value back to the clustered index.
The third table is the most interesting as it lets us see how SQL Server produces row estimates within execution plans and lets us see how badly things could go wrong if statistics weren’t up to date.
Basically this is the histogram, in which SQL Server breaks our data down into anything up to 200 steps and profiles each step in a meaningful way.
Let’s show this with a quick query:
select count(*)
from Person.Person
where LastName > 'Murphy'
and LastName <= 'Nara'
This is asking for the number of people with a LastName between Murphy and Nara (Murphy being exclusive, Nara being inclusive).
Here’s the outcome…
If we were to look at the statistics for this we would see the following:
So what does this tell us?
Well, it tells us that there are exactly 95 people with the LastName of Nara, that there are 6 other records in the range between Murphy and Nara (correctly totalling 101), and that those 6 records span 3 distinct values.
If we look at the data we see exactly that:
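You can verify this yourself with a query along these lines, counting each distinct last name strictly inside the range:

```sql
-- Count the rows for each distinct last name strictly between 'Murphy' and 'Nara'
-- (the open range the histogram's RANGE_ROWS / DISTINCT_RANGE_ROWS describe)
SELECT LastName, COUNT(*) AS people
FROM Person.Person
WHERE LastName > 'Murphy'
  AND LastName < 'Nara'
GROUP BY LastName
ORDER BY LastName;
```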
There are 3 last names... Murray, Myer, and Naik... which make up the 6 entries between the 3 distinct names. This, therefore, gives us an average of 2 records per name (the histogram's AVG_RANGE_ROWS column: RANGE_ROWS / DISTINCT_RANGE_ROWS = 6 / 3 = 2).
So, now let’s look at an estimated execution plan for the following query:
select *
from Person.Person
where LastName = 'Murray'
And as you can see… the estimated number of rows is indeed 2.
Obviously we know that the actual number is 1, but estimated and actual values will often differ slightly... with accurate statistics, though, they should never deviate by enough to affect the ideal shape of an execution plan.
This is also where we can see the effects of not keeping your stats up to date though... the built-in rule is that statistics are automatically updated once 500 rows plus 20% of the table's rows have been modified. In this particular example the Person table has (as we already know) 19972 records, therefore we could update roughly 4494 records (500 + 3994) without triggering an auto update of statistics.
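If you want to see how close a table is to that threshold, one way (available from SQL Server 2008 R2 SP2 / 2012 SP1 onwards) is the sys.dm_db_stats_properties function, which exposes a modification counter per statistics object:

```sql
-- How many modifications have accumulated against each statistics object
-- on Person.Person since it was last updated
SELECT  s.name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        sp.modification_counter   -- compare this against the auto-update threshold
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Person.Person');
```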
Imagine if we updated 3994 records to have a LastName of Murray: we would then have the potential for a very poor execution plan, with SQL believing it's returning just 2 records when it's in fact returning nearly 4000.
Now imagine that same impact on tables with hundreds of millions of records: your statistics could become grossly off target unless you maintain them regularly.
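The simplest form of that maintenance is an explicit UPDATE STATISTICS, for example:

```sql
-- Refresh every statistics object on the table with a full scan
UPDATE STATISTICS Person.Person WITH FULLSCAN;

-- Or target just the one statistics object we've been looking at
UPDATE STATISTICS Person.Person IX_Person_LastName_FirstName_MiddleName WITH FULLSCAN;
```

A FULLSCAN reads every row, so on very large tables you may prefer a sampled update (or a scheduled maintenance job) instead; the right trade-off depends on your data and hardware.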