DHCP HA Deployment - DHCP Services Hangs on Active Server

  • Question

  • A month or so ago we deployed DHCP HA in hot standby mode.  It worked well until the past week or so; now we're finding that the DHCP service on the active server hangs and stops handing out new leases (renewals seem to be OK).  DHCP does not fail over, and we're not really seeing anything in the logs suggesting a problem other than that new leases (event 10) stop being handed out at some point.  Trying to restart DHCP results in a hang and we end up killing the svchost process.  After getting the service restarted everything works as expected.  We thought that maybe we just needed one last restart of DHCP on both ends after setting up the relationships, but we've restarted all DHCP services several times and it still happens.  It's not occurring continuously, but just about every day another server is affected.  So far none of the servers have had repeat occurrences.  We've resorted to deleting all F/O relationships.  We do have a couple of DHCP servers without F/O relationships that do not have the issue.

    All servers are 2016, with 50+ relationships on 2 failover servers: 31 on the first server, the rest on the 2nd server.  Most of the servers were in-place upgrades from 2008 R2 and/or 2012 R2 several weeks to several months prior to deploying DHCP, but a few were deployed with 2016 images.  All have ADDS/DNS, with RODC and RWDC mixed.  The only things that seem to be in common are 2016, that failover relationships exist (but not with the same partner) and that the DHCP/ADDS/DNS roles exist.  Our DHCP servers that do not have F/O relationships and have not been affected are not 2016 and do have ADDS/DNS.

    We have some SCOM DHCP alerts configured but they don't seem to trigger.  We usually don't know that the condition exists until a user calls and cannot get an IP address.

    Here are the properties that I've deployed when building the relationships via PowerShell:

    # DHCP failover config:
    $ServerName = 'server1'
    $Scopes = Get-DhcpServerv4Scope -ComputerName $ServerName

    # Create the relationship with the first scope. The trailing backticks continue the command across lines.
    Add-DhcpServerv4Failover -ComputerName $ServerName -Name '[server1.domain.com]---[dhcp-fo1.domain.com]' -PartnerServer dhcp-fo1.domain.com -ScopeId $Scopes[0].ScopeId -ServerRole Active `
        -ReservePercent 20 -MaxClientLeadTime 0:30:00 -AutoStateTransition $true `
        -StateSwitchInterval 0:15:00 -SharedSecret 'secret' -Force

    # Add the remaining scopes to the same relationship.
    $ScopeObjects = $Scopes | Select-Object -Skip 1
    $ScopeObjects | ForEach-Object {Add-DhcpServerv4FailoverScope -Name '[server1.domain.com]---[dhcp-fo1.domain.com]' -ComputerName $ServerName -ScopeId $_.ScopeId}

    We don't think the issue is solely related to the existence of a failover relationship, since we have divided our deployments into a smaller group with a F/O relationship and a larger group without one.  We continue to have problems regardless of whether a F/O relationship exists.

    The issue seems to surface around the end of the lease period - every 8 or so days we have a rash of servers with hung DHCP.

    We do not believe that having DC, DNS & DHCP stacked on the same server is the issue, because our largest DHCP server at our corporate office is set up in this manner and it has dozens of scopes on it & NEVER gives us a problem.

    We DO believe it may be an issue solely with 2016.  All of the servers where we've seen this issue are 2016.  We have 3 sites where we were unable to upgrade to 2016 due to iLO problems and the lack of remote control ability.  These have NEVER given us a problem since deploying DHCP on them.  Our corporate server that I mention above is also not 2016.

    Thursday, April 5, 2018 3:57 PM

All replies

  • Hi,

    This behavior happens when the DHCP server takes a backup of its own lease database. By default the DHCP server takes a backup every 60 minutes. I recommend changing the backup interval to every 4 hours (240 minutes) using the registry value below:

    Stop the DHCP service.

    Change the BackupInterval value under:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters

    Start the DHCP service.

    You can also do this using netsh:

    netsh dhcp server set databasebackupinterval 240
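
    The same setting can also be read and changed with the DhcpServer PowerShell module. This is a minimal sketch, assuming the module is installed and using 'server1' purely as a placeholder server name; it has not been run against the servers in this thread.

```powershell
# 'server1' is a placeholder; substitute a real DHCP server name.
# Show the current database settings, including BackupInterval and
# CleanupInterval (both reported in minutes).
Get-DhcpServerDatabase -ComputerName 'server1'

# Set the backup interval to 240 minutes, matching the netsh example.
Set-DhcpServerDatabase -ComputerName 'server1' -BackupInterval 240
```

    Running Get-DhcpServerDatabase again afterwards is a simple way to confirm the new value.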

    Kindly post your observations.

    Thanks

    Syed.


    Don't forget to mark as Answered if you found this post helpful.

    Thursday, April 5, 2018 4:12 PM
  • @Syed

    Thanks for the suggestion.  I will deploy on some of our servers & keep my fingers crossed.

    So are you able to explain why changing the backup interval stops the crashing?  Does just increasing the interval prevent whatever condition causes the crash?

    -Dave

    Friday, April 6, 2018 12:09 PM
  • Hi,

    By default, the backup interval and the cleanup interval are both 60 minutes in DHCP, so both jobs run at the same time, and if there are any discrepancies it will affect the DHCP service. Hence I recommend changing the backup interval in such a way that the two never meet at any point, e.g. cleanup interval: 60 min, backup interval: 72 min.

    Thanks

    Syed



    • Edited by Syed Abdul Friday, April 6, 2018 1:46 PM
    Friday, April 6, 2018 1:40 PM
  • So we deployed Syed's suggestion by changing the backup interval to 123 min & leaving the cleanup interval at 60 min.  We chose 123 min. vs 72 min. because with 72 min. there was a point at which the 2 times converged again & could reintroduce the issue.  123 min. is supposedly a better time because the 2 times never converge (at least my gee-whiz senior admin, who is VERY intelligent, suggested that time).

    Performance has been better, but we have had some DHCP hangs over the past several days after about 3 weeks of no issues.  The issues don't appear to be as bad as before, so I don't know whether we've reached total resolution or not.  Maybe 60 min. & 123 min. aren't the ideal intervals to use???

    Is there any way to determine definitively whether the intervals are causing the hangs?  ...or is there an optimal time interval we should be using?
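
    As a rough check on the interval arithmetic, the sketch below computes how often two fixed intervals line up. It assumes both timers start together when the DHCP service starts and then fire at fixed intervals, which nothing in this thread confirms; the function name Get-IntervalOverlap is made up for illustration.

```powershell
# Assumption: both timers start at service start and fire at fixed
# intervals. Get-IntervalOverlap is a hypothetical helper name.
function Get-IntervalOverlap {
    param(
        [Parameter(Mandatory)][int]$CleanupMinutes,
        [Parameter(Mandatory)][int]$BackupMinutes
    )
    # Euclid's algorithm for the greatest common divisor.
    $a = $CleanupMinutes
    $b = $BackupMinutes
    while ($b -ne 0) {
        $a, $b = $b, ($a % $b)
    }
    # Two periodic timers coincide every least-common-multiple minutes.
    $lcm = ($CleanupMinutes / $a) * $BackupMinutes
    [pscustomobject]@{
        CleanupMinutes = $CleanupMinutes
        BackupMinutes  = $BackupMinutes
        OverlapMinutes = $lcm
        OverlapHours   = [math]::Round($lcm / 60, 1)
    }
}

# Compare the backup intervals mentioned in this thread against a
# 60-minute cleanup interval.
60, 72, 123, 240, 250 | ForEach-Object {
    Get-IntervalOverlap -CleanupMinutes 60 -BackupMinutes $_
}
```

    Under that assumption, 60 and 72 minutes line up every 360 minutes (6 hours) and 60 and 123 minutes line up every 2460 minutes (41 hours). Any two whole-minute intervals eventually coincide, so a different value only makes the overlap less frequent rather than removing it.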

    Thursday, May 10, 2018 12:52 PM
  • Also, here's the PS script I used to push the backup interval setting.  Don't be impressed - it was mostly written by my senior admin, who is WAY better with PS than me.  Not entirely perfect but it does the job. 

    Clear-Host

    $LogPath = 'C:\ServerFiles\BackupInterval\DhcpBackupInterval.log'
    Start-Transcript -Path $LogPath -Force

    $DhcpServers = Get-DhcpServerInDC
    # Tracked locally: a $Global: variable assigned inside Invoke-Command only exists in the remote session.
    $CorruptServers = @()
    Write-Output $DhcpServers

    foreach ($Server in $DhcpServers) {
        # -Quiet makes Test-Connection return $true/$false instead of ping results or errors.
        if (Test-Connection -ComputerName $Server.DnsName -Quiet) {
            $Results = Invoke-Command -ComputerName $Server.DnsName -ScriptBlock {
                param($ServerName)

                Write-Output "`n`nThe server [$ServerName] is up and running.  Stopping DHCP Server service..."
                $Service = Get-Service -Name dhcpserver -Verbose
                Write-Output "The DHCP Server service on [$ServerName] is currently [$($Service.Status)]."
                Write-Output "Now stopping the DHCP Server service process..."

                try {
                    Stop-Service -Name dhcpserver -Force -Verbose -ErrorAction Stop
                }
                catch {
                    # The service is hung: find its host process and kill it.
                    Write-Output "Unable to stop the DHCP Server service.  Forcing termination of process."
                    $ProcessId = (Get-WmiObject -Class Win32_Service -Filter "Name='dhcpserver'").ProcessId
                    Write-Output "The DHCP Server service ProcessId is [$ProcessId]."
                    Stop-Process -Id $ProcessId -Verbose -Force
                    Start-Sleep -Seconds 15
                }

                $Service = Get-Service -Name dhcpserver -Verbose
                Write-Output "The DHCP Server service on [$ServerName] is now $($Service.Status)."
                Write-Output "Setting DHCP BackupInterval to 123 minutes on [$ServerName]."

                # 123 decimal = 0x7b; write it explicitly as a DWORD.
                Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters" -Name "BackupInterval" -Value 123 -Type DWord -ErrorAction Stop
                $BackupInterval = Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters" -Name "BackupInterval"
                Write-Output "The DHCP BackupInterval is now set to $($BackupInterval.BackupInterval) minutes on [$ServerName]"

                Start-Service -Name dhcpserver -Verbose
            } -ArgumentList $Server.DnsName

            Write-Output $Results

            # The remote block reports a forced stop in its output; record the server on this side.
            if ($Results -match '^Unable to stop the DHCP Server service') {
                $CorruptServers += $Server.DnsName
            }

            $Service = Get-Service -ComputerName $Server.DnsName -Name dhcpserver -Verbose
            Write-Output "The DHCP Server service on [$($Server.DnsName)] is [$($Service.Status)]"
            Write-Output "Moving on to the next DHCP server.`n`n"
        }
        else {
            Write-Output "The server [$($Server.DnsName)] is not responsive.  Not performing DHCP check... `n`n"
        }
    }

    Write-Output "The following servers had issues stopping the DHCP Server service..."
    Write-Output $CorruptServers

    Stop-Transcript -Verbose

    Thursday, May 10, 2018 12:58 PM
  • Hi,

    Try changing the intervals so that cleanup runs every 60 minutes and backup every 250 minutes, which leaves a much larger gap between the two.

    Monitor the timestamps of the files that these activities write in the DHCP folder, and check at what times the service hangs happen, to get to the root of the issue.
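
    One way to look at those file timestamps is sketched below. It assumes the database and backup are in the default folder under %SystemRoot%\System32\dhcp (Get-DhcpServerDatabase shows the configured paths if they were changed); the function name Get-RecentFileWrites is made up for illustration.

```powershell
# Get-RecentFileWrites is a hypothetical helper: it lists the most
# recently written files under a folder, newest first.
function Get-RecentFileWrites {
    param(
        [Parameter(Mandatory)][string]$Path,
        [int]$First = 20
    )
    Get-ChildItem -Path $Path -Recurse -File |
        Sort-Object -Property LastWriteTime -Descending |
        Select-Object -First $First -Property FullName, LastWriteTime, Length
}

# Example call, assuming the default DHCP folder is in use:
# Get-RecentFileWrites -Path (Join-Path $env:SystemRoot 'System32\dhcp')
```

    Comparing the LastWriteTime values with the time of the last event 10 in the DHCP audit log may show whether a hang starts close to a backup or cleanup run.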

    Thanks

    Syed


    Thursday, May 10, 2018 2:10 PM
  • We have tried various combinations with the intervals but inevitably the 2 intervals collide & DHCP hangs. I'm assuming the DHCP clean up is something new with Server 2016 because I don't see that happening on our 2012 R2 DHCP servers. If so, there is really a bug that needs to be dealt with by MS, don't you think?

    Can I/should I disable the clean up?

    -Dave

    Wednesday, May 30, 2018 1:54 PM