Yesterday I had an issue in which someone had accidentally run a tidy up script on a transactionally replicated table without considering the consequences this would have on replication.
Effectively what they had done was delete 2.5 million records in one transaction which amounted to around 5GB of data changes heading into the replication network. Not good.
Luckily they were fast enough to tell me what they had done and although it didn’t save the log reader from working overtime for a good while, it wasn’t too hard to put in a fix which saved a good few hours of replication latency and issues with the distributor whilst the system processed it all across the network.
The fix was as simple as the following:
First things first, find the subscriber’s replication agent job and disable it. If you are running PUSH replication then this will be on the Publisher, but if you are running PULL replication then this job will be on the Subscriber.
As mentioned above… once found, disable this job so that it doesn’t start running and attempting to pull all these deletes from the distribution database.
In this case I was interested in a delete to a table called (invented for purposes of this blog) “myLargeTable” held inside “myDatabase”. Therefore I ran the following command:
where name = 'myLargeTable'
This tells me that article ID of the table I’m interested in. In this case we’ll say it’s 99.
Knowing that, in the distribution database, we’ll look for commands against that table:
sp_browsereplcmds @article_id = 99
This will give you an output in the following format:
Using this I hunted through the commands in search of the delete (it wasn’t hard… there were 2.5 million of them) and located the associated xact_seqno (in my case this was “0x00000016000000A00004”).
All you need to do then is to remove these records from the distribution database so that they are no longer passed to the subscribers:
delete from distribution..msrepl_commands
where xact_seqno = 0x00000016000000A00004
Once you have run this code you can re-enable your replication agent job and replication will carry on, completely oblivious to the rogue command.
The only thing you have to then do is run the delete manually on the subscription database, but this will be MUCH faster done on the server itself using t-SQL than waiting for replication to process it all.
Therefore you now have your databases in sync and without having slowed your replication to a crawl in the meantime.