As we all know, accurate time synchronization in Active Directory domains is a must. I recently ran into a problem where one domain controller’s time was starting to drift. Once it was off by about 8 minutes, it started to wreak havoc in the environment, because some servers were synced to its time while others were synced to domain controllers that had the correct time. As it turns out, this particular DC was a VM and had the VM time synchronization service turned on. While turning this off should prevent it from happening again, I decided to go ahead and make an alert using Azure Log Analytics.
When I first started troubleshooting reports of people having authentication issues, I quickly checked Log Analytics to see if I had a domain controller offline. When I ran a query to get the last heartbeat on all servers, I found that a large portion of them had a last Time Generated of 8 minutes ago. So, it appears that the Time Generated field is set from the system’s own clock, not from the time in Log Analytics. I was able to confirm this by changing the time on a couple of test systems. So, I created the query below to show me all machines whose latest heartbeat is more than 5 minutes behind or more than 5 minutes ahead of the current time.
Heartbeat
| where OSType == 'Windows' and TimeGenerated > ago(1d)
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| where TimeGenerated < now(-5m) or TimeGenerated > now(5m)
| extend TimeAgoMinutes = toint((now() - TimeGenerated)/1m)
| project Computer, LastHeartbeat=TimeGenerated, TimeAgoMinutes
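For comparison, a plain last-heartbeat check like the one mentioned earlier might look something like the sketch below. This is an assumed minimal form, not necessarily the exact query that was run at the time; the column names LastHeartbeat and MinutesAgo are only illustrative.

```kusto
// Sketch only: latest heartbeat per computer over the last day, oldest first.
// LastHeartbeat and MinutesAgo are illustrative names, not from the original query.
Heartbeat
| where TimeGenerated > ago(1d)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| extend MinutesAgo = toint((now() - LastHeartbeat) / 1m)
| order by LastHeartbeat asc
```

If a group of healthy machines all show roughly the same MinutesAgo value, as happened here with 8 minutes, that points toward a shared clock problem rather than an outage.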
When I created the alert rule, I set the alert logic as follows:
I set the threshold to greater than 4 because there are around 50 servers in this environment. So, this will send me an alert if around 10% of the servers have their time off by more than 5 minutes. I set the Period to 15 minutes because if I left it at the default of 5 minutes, the query would never return a heartbeat more than 5 minutes old. I left the frequency at 5 minutes because I want to know as soon as possible.
For those interested, here is a breakdown of the query line by line.
First, I start by getting all Windows devices that have sent a heartbeat in the last 24 hours.
| where OSType == 'Windows' and TimeGenerated > ago(1d)
Next, I get the latest heartbeat time for each machine by summarizing on the max TimeGenerated by SourceComputerId, which is unique to each device.
| summarize arg_max(TimeGenerated, *) by SourceComputerId
Then I keep only the machines whose latest heartbeat is more than 5 minutes behind or more than 5 minutes ahead of the current time, and filter out everything else.
| where TimeGenerated < now(-5m) or TimeGenerated > now(5m)
Now I only have the machines that I want to report on, so I’ll make my data a little more user friendly by calculating the actual number of minutes the time is off, and present it as an integer in the results.
| extend TimeAgoMinutes = toint((now() - TimeGenerated)/1m)
Finally, I only project out the fields that I care about and want to see in the alert email. I also change the title of TimeGenerated to the more relevant LastHeartbeat.
| project Computer, LastHeartbeat=TimeGenerated, TimeAgoMinutes
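To try the steps above without waiting for a real clock to drift, the same pipeline can be run against a small in-memory table. The sketch below is only an illustration: the SampleHeartbeat table, the computer names, the IDs, and the OffsetMinutes column are all made up, and the OSType filter is left out because the sample data does not include that column.

```kusto
// Hypothetical sample data. OffsetMinutes is how far each heartbeat's
// timestamp sits from now(): negative = clock behind, positive = clock ahead.
// DC01 has two heartbeats so that arg_max has something to choose between.
let SampleHeartbeat = datatable(Computer:string, SourceComputerId:string, OffsetMinutes:int)
[
    "DC01",  "id-1", -8,
    "DC01",  "id-1", -9,
    "WEB01", "id-2", -1,
    "APP01", "id-3", 7
]
| extend TimeGenerated = datetime_add('minute', OffsetMinutes, now());
SampleHeartbeat
// Keep only the most recent heartbeat per machine
| summarize arg_max(TimeGenerated, *) by SourceComputerId
// Keep machines more than 5 minutes behind or ahead of the current time
| where TimeGenerated < now(-5m) or TimeGenerated > now(5m)
// Positive = heartbeat is in the past, negative = heartbeat is in the future
| extend TimeAgoMinutes = toint((now() - TimeGenerated)/1m)
| project Computer, LastHeartbeat=TimeGenerated, TimeAgoMinutes
```

With this sample, DC01 and APP01 should come back while WEB01 is filtered out. Note that a machine whose clock is ahead shows a negative TimeAgoMinutes, which is worth keeping in mind when reading the alert email.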