Okay, we all know that sampled statistics are not perfect; otherwise there would be no need for a full scan stats update. However, there's an issue I encountered recently that has really bothered me in this regard.
When we perform a sampled stats refresh, SQL Server samples a smaller and smaller percentage of the rows as our table grows (the sample rate is based on data size, not pure row count). We're all accustomed to living with that, but there's another flaw that really drives me mad…
During some of the work I've been doing recently we are constantly updating statistics in order to get good plans (which is pretty standard) because we're inserting a lot of ascending key data. In versions prior to SQL Server 2014 this can cause a lot of problems, with SQL Server estimating that we'll only obtain 1 record and therefore choosing Nested Loops when we really need Hash Joins.
This can be alleviated using Trace Flags (I’ll cover those separately) but it’s still not perfect if you ever do anything which causes SQL Server to think the keys may not actually be ascending.
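For reference, the relevant documented trace flags here are 2389 (brand statistics on ascending keys) and 2390 (apply the same behaviour to columns whose ascending branding is unknown). A minimal sketch of enabling them, instance-wide or per-query; the table and column names below are hypothetical stand-ins, and you should test the flags on your own workload before relying on them:

```sql
-- Instance-wide (affects all sessions); pre-2014 behaviour only
dbcc traceon (2389, 2390, -1)

-- Or scoped to a single query via the supported QUERYTRACEON hint
-- (myBigTable / myAscendingKey are hypothetical names)
select count(*)
from myBigTable
where myAscendingKey > 12345678
option (querytraceon 2389, querytraceon 2390)
```
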
So anyway… “You’re updating stats so what’s the problem?”, I hear you ask.
Well, these tables are upwards of 150GB in size in some cases and therefore we’re relying on sampled stats because a full scan simply isn’t feasible.
BUT, as I found out, SQL Server doesn't seem to think it's necessary to book-end the sample, i.e. it doesn't always sample the min and max values in the table.
To me this seems pretty fundamental as you would think it would simply take the first, the last, and then whatever sample it wanted.
When dealing with ascending keys, nearly all queries being run will want the most recent key values, and SQL Server simply doesn't pick them up. Therefore I was finding that even running an update stats wouldn't help my execution plans, and in the end I had to resort to filtered stats (which can be tricky if you don't have something nice like a date column to work with) alongside query hints.
These are all viable but fiddly, and wouldn't be needed if sampled stats didn't have what seems to me a massive flaw.
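As a sketch of the filtered stats workaround (all object names here are hypothetical, and the date predicate assumes you have a suitable column to filter on): a filtered statistic covering only the most recent rows is small enough to refresh frequently, so its histogram's top step stays close to the real maximum key.

```sql
-- Hypothetical names: myBigTable / myDateColumn / id are stand-ins.
-- The statistic only describes rows matching the predicate, so even
-- on a 150GB table the histogram over recent data stays accurate.
create statistics stat_recentRows
on myBigTable (id)
where myDateColumn >= '20140101'

-- Refresh often; only the filtered statistic is rebuilt
update statistics myBigTable stat_recentRows with fullscan
```

One caveat worth knowing: the optimizer can decline to use filtered statistics for parameterized queries, which is part of why they end up paired with query hints.
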
If you want to have a look yourself, then run the following code and have a look at the output:
if object_id('myTestStatsTable') is not null drop table myTestStatsTable

create table myTestStatsTable
(
	id int identity,
	myValue char(1000) default('a'),
	constraint pk_myTestStatsTable primary key clustered(id)
)

-- receiving tables shaped to match DBCC SHOW_STATISTICS output
-- (note: SQL Server 2016 SP1 CU4+ adds a persisted sample percent
-- column to STAT_HEADER; add it to @statHeader if the insert fails)
declare @statHeader table
(
	name sysname, updated datetime, noRows bigint, sampledRows bigint,
	steps int, density float, avgKeyLen float, stringIndex varchar(10),
	filterExpression varchar(200), unfilteredRows bigint
)
declare @statsHist table
(
	rangeHiKey int, rangeRows float, eqRows float,
	distinctRangeRows bigint, avgRangeRows float
)
declare @output table
(
	id int identity,
	noRows bigint, sampledRows bigint, highValue int
)
declare @rowLimit int = 1000000, @sql varchar(max)

while isnull((select max(id) from myTestStatsTable), 0) <= @rowLimit
begin
	-- insert rows in batches of 10000
	insert into myTestStatsTable (myValue)
	select top (10000) 'a'
	from sys.objects o
	cross apply sys.columns c
	cross apply sys.allocation_units a

	-- every 50000 rows, update stats and record what was sampled
	if (select max(id) from myTestStatsTable) % 50000 = 0
	begin
		update statistics myTestStatsTable

		select @sql = 'dbcc show_statistics (''myTestStatsTable'', ''pk_myTestStatsTable'') with stat_header'
		insert into @statHeader
		exec (@sql)

		select @sql = 'dbcc show_statistics (''myTestStatsTable'', ''pk_myTestStatsTable'') with histogram'
		insert into @statsHist
		exec (@sql)

		insert into @output (noRows, sampledRows, highValue)
		select s.noRows, s.sampledRows, h.highValue
		from @statHeader s
		cross apply
		(
			select max(rangeHiKey) highValue
			from @statsHist
		) h

		delete from @statHeader
		delete from @statsHist
	end
end

-- highValue is the top histogram step; compare it against noRows
select noRows, sampledRows,
	(sampledRows / (noRows * 1.)) * 100 as percentSampled,
	highValue
from @output
order by id
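To see the gap directly once the script finishes, you can compare the table's actual top key against the last step of the histogram (same object names as above):

```sql
-- The actual newest key in the table
select max(id) as actualMaxKey from myTestStatsTable

-- The last RANGE_HI_KEY in this output is the highest key the
-- statistic knows about; once the update switches from a full scan
-- to sampling, expect it to sit below actualMaxKey
dbcc show_statistics ('myTestStatsTable', 'pk_myTestStatsTable') with histogram
```
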