Scenario
My client was experiencing severe issues with one of their servers. The trouble began after two core networking switches were rebooted at the same time; from then on, one of their Exchange 2013 servers would reboot daily. Yes. Daily. Here is their configuration:
3 x Exchange Server 2013 CU19
DAG with all three nodes
FSW was configured and available
Windows Server 2012 R2 OS
The client had determined that the Microsoft Exchange Replication service was constantly crashing. How often, I was not told, but often enough that the service did not appear stable. The server was also not mounting databases because it was perpetually out of communication with the other two servers.
Before connecting to the environment, I did some preliminary research to see if I could find anything that would help with troubleshooting or quickly resolve the issue.
Research
So how do you research a problem you have not seen, experienced, or even examined at all? Not easily. What you have to do is think logically about the system that is having the issue and the symptoms it is showing, then come up with some keywords you can type into your favorite search engine. That is exactly what I did. What I came up with were these options:
(1) DAG Configuration? – https://social.technet.microsoft.com/Forums/en-US/3e21c963-42e2-42ac-9481-82a9e498691a/microsoft-exchange-2013-replication-service-restarting-constantly-event-ids-7031-4999-amp-4401?forum=exchangesvradmin
Their DAG had its DAC mode set to 'None', which is the default. Not the answer to our problem, but we changed it anyway since they had a three-node DAG (see the sketch after this list).
(2) Possible corrupt items in a Crimson Event Log – clear out with 'Wevtutil.exe cl "Microsoft-Exchange-MailboxDatabaseFailureItems/Operational"'
https://support.microsoft.com/en-in/help/3003580/event-id-4999-and-4401-when-the-microsoft-exchange-replication-service
However, this issue had supposedly already been resolved …
(3) Some other issue? I found articles like this – https://social.technet.microsoft.com/Forums/en-US/7c0d30d1-27fe-4696-b743-e05e2fb70eb7/microsoft-exchange-replication-service-restarting-constantly-amp-microsoft-dag-management-service?forum=exchangesvravailabilityanddisasterrecovery
Completely rebuilding the server – doable, but not ideal because we had 1.5+ TB of data to resync. We would use it as a last resort, since adding/removing DAG nodes is easy.
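For reference, the DAC change from option (1) is a one-liner in the Exchange Management Shell. This is just a sketch, and the DAG name 'DAG1' is hypothetical:
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
Get-DatabaseAvailabilityGroup DAG1 | Format-List Name,DatacenterActivationMode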
Overall, it seemed that the Managed Availability feature in Exchange Server 2013 was probably behind the behavior. By causing an issue, I should really say it was trying to proactively resolve an issue it had detected: it kept restarting the Microsoft Exchange Replication service in an attempt to make it stable.
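A quick way to see what Managed Availability itself considers unhealthy is the Get-ServerHealth cmdlet. A minimal sketch, with 'Exchange2' standing in for the affected server in this environment:
Get-ServerHealth -Identity Exchange2 | Where-Object { $_.AlertValue -ne 'Healthy' } | Format-Table Name,HealthSetName,AlertValue -AutoSize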
Reviewing Event Logs
The next step, as my pre-examination research had not turned up any good resolutions, was to review the event logs. Specifically, I reviewed these logs:
- System
- Application
- Microsoft-Exchange-MailboxDatabaseFailureItems/Operational
- Microsoft-Exchange-ManagedAvailability/Monitoring
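If you would rather pull these from the shell than click through Event Viewer, Get-WinEvent can read the two crimson channels directly. A minimal sketch; adjust -MaxEvents to taste:
Get-WinEvent -LogName 'Microsoft-Exchange-MailboxDatabaseFailureItems/Operational' -MaxEvents 50 | Format-Table TimeCreated,Id,Message -Wrap
Get-WinEvent -LogName 'Microsoft-Exchange-ManagedAvailability/Monitoring' -MaxEvents 50 | Format-Table TimeCreated,Id,Message -Wrap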
From these I could see there were quite a few attempts to restart the Microsoft Exchange Replication service.
I also discovered the source of the reboots – it was Managed Availability.
So then I ran 'Test-ReplicationHealth' in PowerShell to see what errors I could glean from that. The results were not entirely helpful. We already knew that the service would not start; we needed to dig into what was actually causing the service to fail and never completely start.
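For what it is worth, you can trim that cmdlet's output down to just the failing checks. A sketch, run on the affected mailbox server:
Test-ReplicationHealth | Where-Object { $_.Result -ne 'Passed' } | Format-Table Check,Result,Error -AutoSize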
Resolution?
After A LOT of digging in the event logs, we discovered events that all referenced the same database copy. Wow. Could it be that simple? We had a database copy that simply would not start. No other databases were mentioned, and the replication service kept retrying this one database.
With PowerShell, we suspended that one copy. How do we do this?
List the database copies:
Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus
Get-MailboxDatabase DB09 | Get-MailboxDatabaseCopyStatus
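If the environment has many databases, it can also help to filter for copies that are not in a good state. Another sketch; 'Healthy' and 'Mounted' are the normal statuses:
Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus | Where-Object { $_.Status -notin 'Healthy','Mounted' } | Format-Table Name,Status,CopyQueueLength,ContentIndexState -AutoSize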
Suspend only the one copy?
To suspend just the one copy on one server, we needed to run this PowerShell:
Suspend-MailboxDatabaseCopy -Identity "<Database>\<Server>"
Suspend-MailboxDatabaseCopy -Identity "DB09\Exchange2" -SuspendComment 'Fix replication issue'
Once the copy is suspended, we can confirm it here:
Get-MailboxDatabase DB09 | Get-MailboxDatabaseCopyStatus
We can see that the one copy is now 'FailedAndSuspended'.
Once we performed these actions, we monitored the Microsoft Exchange Replication service. The service never stopped. Never restarted. It was stable for more than five minutes, which was a very long time for this service to stay up. We then logged in to the Exchange Admin Center and saw that all of the database copies on Exchange2 were now showing healthy, and the only database with a bad copy was DB09.
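The monitoring itself does not need to be fancy. A crude polling loop against the service (MSExchangeRepl is the actual service name behind Microsoft Exchange Replication) is enough; this is just a sketch:
# Poll the Replication service every 30 seconds and print a timestamped status
while ($true) {
    $svc = Get-Service -Name MSExchangeRepl
    '{0}  {1}' -f (Get-Date -Format s), $svc.Status
    Start-Sleep -Seconds 30
}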
Solution
The solution was to remove the old copy from Exchange2, physically delete all of its database and log files, and then re-add Exchange2 as a copy for DB09. Once that was performed, the copy reseeded successfully and the service was stable. The server no longer reboots daily. Resolved.
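For completeness, the remove/re-add sequence looks roughly like this. A sketch: the ActivationPreference value is an assumption, and deleting the orphaned files on Exchange2 is a manual step in between:
Remove-MailboxDatabaseCopy -Identity "DB09\Exchange2" -Confirm:$false
# Manually delete the leftover .edb and log files for DB09 on Exchange2 here
Add-MailboxDatabaseCopy -Identity DB09 -MailboxServer Exchange2 -ActivationPreference 2
Get-MailboxDatabaseCopyStatus -Identity "DB09\Exchange2"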
So, in our case, it was a corrupt database file that caused all of these headaches.