A company I was visiting were having problems with the "Agent History Clean up" job that Microsoft installs as standard on a Distributor server. They had disabled the job because, as the job history showed, it was no longer completing as expected but was running forever and in doing so was causing the distribution database log file to rapidly increase in size until it Windows reported the disk as full. Once full they were having to kill the job, restart SQL Server (very risky as the job was now in rollback but couldn't comlpete due to disk space), truncate the transaction log, and then fix the backup chain for the sake of their disaster recovery requirements. This was clearly a huge problem for them as, without this job running, the distribution database would never stop increasing in size.
I wanted to see this happening, so I requested they start the job.
The distribution database was 40GB in size which, based on their publications, looked too large already, but this was made much worse as immediately the transaction log began to increase at a startling rate and within just a few seconds it was already 1GB+ in size and still growing. Therefore we stopped the job and waited for it to complete a rollback.
Now I have to admit here that I'm not 100% sure what caused this but I do have my suspicions and therefore what follows will be my thought processes and the fix which I finally put in place to deal with this.
Firstly I looked at the table sizes within distribution to see if I could find the cause of the seemingly large database...
You can clearly see a problem here, the MSmerge_history table is huge.
I had a look inside the table and could see that there were records which were months and months old. Further investigation tracked these down to their publications and, crucially, they were all "continous" replications rather than having subscribers on a schedule.
Now, I know that MSmerge_history will keep log of every transaction that has happened since the last snapshot was generated so that replications can be rebuilt without generating new snapshots (within reason). But it also appeared that if a subscription was set to continuous then it was also not deleting any of the transactions in MSmerge_history and therefore these transactions were going back months and months.
I tried to do a couple of sensible lookups on the MSmerge_history table using timestamp and the results were atrocious (sometimes not returning for hours).
So we were now in a position where the Agent History Clean Up job was not deleting records from continuous subscriptions, neither was it seemingly able to hunt down other rows to delete due to the size of table.
To address this I first added an index to the time column of the MSmerge_history table (this was NOT fast, but was DEFINITELY worthwhile).
create nonclustered index ix_time on dbo.MSmerge_history([time])
Using this new index I could create a job which ran every 5 mins and slowly cleaned out the MSmerge_history table starting with the oldest date first and deleting just 1 day of data at a time.
Once the job had cleaned out the table so that only 3 days worth of data was left, the job frequency was reduced to once a day just to keep the table in check.
We chose 3 days retention as no replication should be down for longer than that and if it was the company was prepared to accept a re-initialisation. Therefore change this code to suit whatever you require as part of your company strategy.
Here's the code I placed inside my scheduled job:
set nocount on
declare @date datetime, @eDate datetime, @volume int
select @date = convert(datetime, left(min([time]), 11))
set @eDate = dateadd(dd, 1, @date)
if @eDate >= DATEADD(dd, -3, current_timestamp)
set @eDate = '2011-01-01'
while @date <= @eDate
delete from msMerge_history
where [time] < @date
set @date = dateadd(dd, 1, @date)
set nocount off
Once this had run and drastically reduced the size of the table (down to just a few thousand rows), I enabled the Agent History Clean Up job and it immediately began to run correctly and in a reasonable time. It has caused no trouble since.