none
DPM scheduled jobs not running, no error RRS feed

  • Question

  • Hello

    DPM Version: DPM 2016 Version 5.0.409.0, running on Windows Server 2016 version 1607

    SQL Version: SQL 2014 Version 12.0.6024.0 running on Windows Server 2016 version 1607

    I use DPM to backup Exchange and SQL databases, file servers and Hyper-V VMs. Somehow they all stopped working a few days ago. Although I had configured email notifications on the DPM server, I didn't receive any, and into the DPM Administration console, everything shows up as OK, although I have no recent recovery point available.

    This support article was of no use unfortunately, despite depicting the exact same problem as I have: https://support.microsoft.com/en-us/help/4456295/how-to-troubleshoot-scheduled-backup-job-failures-in-dpm-2012-and-late

    •  Scheduled jobs do not run
    • No alert, warning or error on the DPM server
    • If I run a backup job manually from the console, it runs successfully
    • Running job manually from SQL Server Management Studio fails
    • SQL Server instance has some errors in the event log, the same ones as described in the above article

    Event ID 208

    SQL Server Scheduled Job 'b6812c9a-9b32-42e1-a9cb-2a4de7f943a4' (0x83CD06594C066D48A2429D57981FF4F0) - Status: Failed - Invoked on: 2019-06-15 14:30:00 - Message: The job failed.  The Job was invoked by Schedule 8096 (Schedule 1).  The last step to run was step 1 (Default JobStep).

    Event ID 976

    The DPM job failed because it could not contact the DPM engine.

    Problem Details:
    <JobTriggerFailed><__System><ID>9</ID><Seq>0</Seq><TimeCreated>6/15/2019 6:30:58 PM</TimeCreated><Source>TriggerJob.cs</Source><Line>76</Line><HasError>True</HasError></__System><Tags><JobSchedule /></Tags></JobTriggerFailed>

    The Job History (taken from SQL Management Studio) reads like this:

    Message
    Executed as user: <domain>\<SQL Agent runas account>. The step did not generate any output.  Process Exit Code 1064.  The step failed.

    Running the task from the command line with my account does not seem to do anything. No output, %errorlevel% value 1064.

    Both the SQL Server and the DPM server were rebooted multiple times. I always try to keep my servers as up to date as possible.

    I exhausted my list of solutions there, so if someone has any suggestion, I'm in desperate need of them. Running manual backups in the meantime but it's not an ideal solution and is time consuming.

    Thanks.

    Saturday, June 15, 2019 6:54 PM

Answers

  • Ok so, this is dumb. I managed to solve the issue and it was way too simple.

    What I did is I reinstalled the SQL Prep files on the SQL Server.

    I had already re-installed them, in fact is was one of the first things I did when I figured out the issue was SQL-related. The thing is, I didn't reboot the SQL server when I did it. I rebooted it afterwards, but not prior to re-installing. There's no prompt in the installer asking you to reboot so I didn't bother doing it, the thing is I had to do twice to make it work: after uninstalling and after re-installing.

    So here's what I did exactly, I hope it'll help if anyone would experience the same issue:

    1. Uninstall "Microsoft System Center 2016 DPM Support Files" from the SQL server using the control panel

    2. Reboot the SQL server (no prompt to reboot, do it anyway)

    3. Run the DPM setup

    4. Select "DPM Remote SQL Prep" in the DPM setup main menu and run the installation as usual

    5. Reboot the SQL server AGAIN (just to be safe)

    6. Magic, it works!!!

    7. Repeat all steps above on your other SQL cluster nodes if you have any

    Still thank you for your time Leon, I really appreciate. Hope you have a great day.
    • Edited by Quaracal Wednesday, July 3, 2019 8:17 PM
    • Marked as answer by Quaracal Wednesday, July 3, 2019 8:17 PM
    Wednesday, July 3, 2019 8:14 PM

All replies

  • Hello,

    Did you come by the following troubleshooting link?
    How to troubleshoot scheduled backup job failures in DPM 2012 and later versions

    Please also check the DPM logs for any more details/clues about the errors, you'll find the logs here:

    • %ProgramFiles%\Microsoft System Center 2016\DPM\DPM\Temp\MSDPMCurr.errlog

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com LinkedIn:

    Saturday, June 15, 2019 9:15 PM
  • Sorry about my previous post. I was a bit annoyed I had to work on this issue during my vacation. Here's an actual checklist of everything that I did to troubleshoot.

    Thank you for the link Leon, but it looks like it's the same article as I posted in my original post.

    I still went trough all the troubleshooting steps to make sure I did everything.

    1. Check the Application Event Log

    I have exactly the same events in my application logs as described in the article. So far so good.

    2. Run the job manually from SQL Server Management Studio

    The job fails to run with the exact same error as in the article when run from SSMS. The job history shows the following error message:

    Date        6/19/2019 10:19:30 AM
    Log        Job History (6b5f7854-1d35-4f00-adeb-f84e825808c2)

    Step ID        1
    Server        SQLCLUSTER
    Job Name        6b5f7854-1d35-4f00-adeb-f84e825808c2
    Step Name        Default JobStep
    Duration        00:00:01
    Sql Severity    0
    Sql Message ID    0
    Operator Emailed    
    Operator Net sent    
    Operator Paged    
    Retries Attempted    0

    Message
    Executed as user: <DOMAIN>\<SQL Agent user>. The step did not generate any output.  Process Exit Code 1064.  The step failed.

    Run the job manually from Command Prompt

    Nothing happens, with %errorlevel% returning 1064.

    C:\Windows\system32>"c:\Program Files\Microsoft Data Protection Manager\DPM2016\SQLPrep\TriggerJob.exe" 4af26344-bbd2-47f3-b5f5-99a9a88d00b9 6b5f7854-1d35-4f00-adeb-f84e825808c2 <DPM server FQDN>

    C:\Windows\system32>echo %errorlevel%
    1064

    I ran the SQL query attached in the article to find a few jobs to use for testing purposes.

    Check whether SQL jobs are disabled

    The jobs are marked as enabled.


    Check the logon account credentials

    The SQL instance is located on a different server than the DPM service ("Remote SQL Server" scenario), so I use a domain account to run the SQL Server Agent.

    1) Check logon account configured for following services of SQL instance for DPM:

    - SQL Server (InstanceName)
    - SQL Server Agent (InstanceName)

    On the SQL Server instance, I use, let's say, the account "sqladmin" for both SQL Server and SQL Agent.

    In both services.msc on the SQL Server AND in SQL Server Configuration Manager, both services show up as running as that user "sqladmin". (Both services are running)

    2) Check the values in the following registry key and verify whether the values are different. Update the values to reflect the user account being used for the SQL Agent Service.

    HKLM\Software\Microsoft\Microsoft Data Protection Manager\Setup

    SqlAgentAccountName

    SchedulerJobOwnerName

    On the DPM Server, both registry keys have the correct value, that is, my account "<DOMAIN>\sqladmin".

    3) Restart the SQL Agent and the SQL Server services after changing the account information in the registry.

    Nothing changed.


    4) On the DPM server, select the protection group, click Modify on the ribbon at the top, then complete and update the protection group without making any changes. This step is necessary in order to re-generate the jobs in SQL with the updated account information.

    I hoped it would have changed something but unfortunately the scheduled jobs still won't run even after re-configuring it. I also tried to delete a protection group and re-create it from scratch, but that didn't help.


    5) If you are using an account other than the Microsoft$DPM$Acct service account, update DCOM launch and access permissions to match what was granted to Microsoft$DPM$Acct .

    I did the DCOM trick with "Microsoft System Center Data Protection Manager 2016 Service", + I granted that sqladmin account full permissions on the specified folders in both the DPM server and the remote SQL Server.


    Check the triggerjob.exe path on the remote SQL server

    The path is: c:\Program Files\Microsoft Data Protection Manager\DPM2016\SQLPrep\TriggerJob.exe

    Check network and firewall settings

    The SQL Server can resolve the DPM server name successfully (using Nslookup, that is)


    I configured the DPM server firewall to allow all incoming TCP & UDP connections from the SQL server. Heck I even tried disabling the Windows firewall and nothing changed. They are both in the same subnet so no external, hardware firewall is involved in the process.


    I checked the log file you mentioned which I didn't know existed, thank you for that.

    I have a lot of this:

    0700    120C    06/19    14:45:25.808    16    JobManager.cs(709)            NORMAL    Jobmanager is idle.


    If I go back a couple days earlier, I have this:

    0D28    157C    06/15    18:11:08.079    11    VolumeInstrumentation.cs(92)            WARNING    VolumeInstrumentation.StorageEventWorker is exiting on unexpected error: System.Data.SqlClient.SqlException (0x80131904): Login failed for user '<DOMAIN>\NT-DPM-1$'. Reason: Server is in script upgrade mode. Only administrator can connect at this time.

    (Nt-DPM-1 is the name of my DPM server.)

    I also checked some other log files in the same directory and I found one of "MSDPMCurr.errlog" that goes from 5/30 to 6/12.

    5/30 is the day my last scheduled backup ran automatically. When the last job succeeds, the log is full of this:

    1888    19D0    05/30    13:30:52.866    16    JobManager.cs(709)            NORMAL    Jobmanager is idle.

    All the way to 6/3, when I noticed the backups weren't working, and I started them manually. Then, there are log messages about the jobs I ran manually, and then, from 6/4 all the way to 6/12 (when I ran other backups jobs manually), all I have is that error message about Jobmanager being idle.

    Google won't help me on that one... I will gladly accept any input from a DPM expert.

    Wednesday, June 19, 2019 3:21 PM
  • Has the SQL Server been patched, or has any patches or any other changes been done lately? Especially before 6/3?

    The "NORMAL  Jobmanager is idle."  is normal, nothing to be concerned about.

    For the following error:

    System.Data.SqlClient.SqlException (0x80131904): Login failed for user '<DOMAIN>\NT-DPM-1$'. Reason: Server is in script upgrade mode. Only administrator can connect at this time

    This may happen for example when a cumulative patch has been applied or when the instance has been restarted.

    Since everything is happening in the DPM database, I suspect the issue is there, have you checked the SQL server / SQL agent logs?

    Is there enough free disk space for the DPM database?

    Also check that none of the accounts have had their passwords expired.


    Blog: https://thesystemcenterblog.com LinkedIn:

    Wednesday, June 19, 2019 3:51 PM
  • Thank you for your assistance, I really appreciate.

    I installed the Service Pack 3 on SQL 2014 on June 15, a few hours before I decided to seek help on this forum. I thought it would help to fix the issue, but as it turns out, nothing changed on that matter.

    Now I do remember I installed updates on the SQL server a few weeks ago, maybe around May 30 or 31, so that would coincide with the first missed schedules. I don't remember exactly, but I guess they were just Windows Server security updates and maybe there was a SQL 2014 CU (which was running Service Pack 2 at that time). I didn't have any issue during the update process. Update to SP3 went smoothly too. This SQL instance also hosts databases for VMM and a few other critical applications and all of them are running as good as they were before the update. So... idk maybe the update broke something, but how do I repair it?

    The SQL account has the "password never expires" option enabled. All the databases are stored on a Cluster shared volume which has around 160GB of free space. As for the logs, I found some warnings and errors in the Application event logs that I copy-pasted in my original post. I also have a lot of these:

    Windows Error Reporting (Event ID 1001)

    Fault bucket 127705026330, type 5
    Event Name: DPMException
    Response: Not available
    Cab Id: 0

    Problem signature:
    P1: TriggerJob
    P2: 5.0.158.0
    P3: TriggerJob.exe
    P4: 5.0.158.0
    P5: System.IO.FileNotFoundException
    P6: System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal
    P7: C372538F
    P8:
    P9:
    P10:

    Attached files:

    These files may be available here:
    C:\ProgramData\Microsoft\Windows\WER\ReportArchive\Critical_TriggerJob_416db61cab769ab053b3d5b2e7df5723a74bbf3_00000000_247f0f63

    Analysis symbol:
    Rechecking for solution: 0
    Report Id: dcdfdb64-92b7-11e9-a2ca-00155d97d804
    Report Status: 0
    Hashed bucket: cb7d71c3ced80478b08f8b012f4265ea

    The attached file goes as follows:

    Version=1
    EventType=DPMException
    EventTime=132054390005299415
    ReportType=1
    Consent=1
    UploadTime=132054390024580864
    ReportIdentifier=dcdfdb64-92b7-11e9-a2ca-00155d97d804
    AppSessionGuid=00000498-0000-001d-dbe8-159fc426d501
    TargetAppId=W:000056568927c140b7305f9d9b5d5e222d9000000904!0000c6b44dd98648baf707edfd75131db64ed56a358f!TriggerJob.exe
    TargetAppVer=2016//07//28:14:44:55!4bf0!TriggerJob.exe
    BootId=4294967295
    Response.BucketId=cb7d71c3ced80478b08f8b012f4265ea
    Response.BucketTable=5
    Response.LegacyBucketId=127705026330
    Response.type=4
    Sig[0].Name=Application Name
    Sig[0].Value=TriggerJob
    Sig[1].Name=Application Version
    Sig[1].Value=5.0.158.0
    Sig[2].Name=Module Name
    Sig[2].Value=TriggerJob.exe
    Sig[3].Name=Module Version
    Sig[3].Value=5.0.158.0
    Sig[4].Name=Exception Name
    Sig[4].Value=System.IO.FileNotFoundException
    Sig[5].Name=Exception Point
    Sig[5].Value=System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal
    Sig[6].Name=Other
    Sig[6].Value=C372538F
    DynamicSig[1].Name=OS Version
    DynamicSig[1].Value=10.0.14393.2.0.0.272.7
    DynamicSig[2].Name=Locale ID
    DynamicSig[2].Value=1033
    State[0].Key=Transport.DoneStage1
    State[0].Value=1
    FriendlyEventName=DPMException
    ConsentKey=DPMException
    AppName=TriggerJob
    AppPath=C:\Program Files\Microsoft Data Protection Manager\DPM2016\SQLPrep\TriggerJob.exe
    ReportDescription=AppName: TriggerJob        AppVer: 5.0.158.0    ModName: TriggerJob.exe/nModVer: 5.0.158.0        StackHash: C372538F
    ApplicationIdentity=00000000000000000000000000000000
    MetadataHash=774799989

    Hope that helps.

    Wednesday, June 19, 2019 5:45 PM
  • When you patched the SQL server did you perform a planned failover for the DPM database?

    If yes, is the DPM database currently running on another cluster node than previously?


    Blog: https://thesystemcenterblog.com LinkedIn:

    Tuesday, June 25, 2019 9:52 PM
  • That's an interesting question, I'm not aware of a way to run one database on a specific cluster node. Right now the SQL Server role is running on node n.1 and node n.2 is standby. I'll try to switch them over to see if something happens.

    I did not perform a "planned failover", I'm not even sure what that is, I ran the update during daytime when there is no backup job, starting with the standby node. The failover typically takes a second or two before reconnecting.

    Wednesday, June 26, 2019 12:54 PM
  • The reason to why I’m asking is because I’ve seen similar behavior when the DPM database is on a remote SQL Server cluster, and one of the nodes might be not configured properly (this is easy to miss during installation).


    Blog: https://thesystemcenterblog.com LinkedIn:

    Wednesday, June 26, 2019 12:58 PM
  • Switching cluster nodes did not help.

    I'll probably re-install DPM from scratch with a local SQL Server instance if I don't find a solution. That seems like it would be faster in my case.

    Thursday, June 27, 2019 12:34 PM
  • Which account is the SQL Agent running on?


    Blog: https://thesystemcenterblog.com LinkedIn:

    Thursday, June 27, 2019 1:22 PM
  • It's running as the same user as the SQL Server service, which is a SQL administrator account. I read somewhere that these services should run as two different users, but my guess is that it was configured this way by the previous tech.
    Thursday, June 27, 2019 2:45 PM
  • As long as it has sa it's enough.

    This issue appears to require some deeper troubleshooting, what I would suggest is to either create a case directly to Microsoft or then perform a clean installation of DPM.


    Blog: https://thesystemcenterblog.com LinkedIn:

    Monday, July 1, 2019 8:32 AM
  • Ok so, this is dumb. I managed to solve the issue and it was way too simple.

    What I did is I reinstalled the SQL Prep files on the SQL Server.

    I had already re-installed them, in fact is was one of the first things I did when I figured out the issue was SQL-related. The thing is, I didn't reboot the SQL server when I did it. I rebooted it afterwards, but not prior to re-installing. There's no prompt in the installer asking you to reboot so I didn't bother doing it, the thing is I had to do twice to make it work: after uninstalling and after re-installing.

    So here's what I did exactly, I hope it'll help if anyone would experience the same issue:

    1. Uninstall "Microsoft System Center 2016 DPM Support Files" from the SQL server using the control panel

    2. Reboot the SQL server (no prompt to reboot, do it anyway)

    3. Run the DPM setup

    4. Select "DPM Remote SQL Prep" in the DPM setup main menu and run the installation as usual

    5. Reboot the SQL server AGAIN (just to be safe)

    6. Magic, it works!!!

    7. Repeat all steps above on your other SQL cluster nodes if you have any

    Still thank you for your time Leon, I really appreciate. Hope you have a great day.
    • Edited by Quaracal Wednesday, July 3, 2019 8:17 PM
    • Marked as answer by Quaracal Wednesday, July 3, 2019 8:17 PM
    Wednesday, July 3, 2019 8:14 PM
  • Glad to hear you solved the problem yourself, and thank you for sharing your solution with the community!

    Blog: https://thesystemcenterblog.com LinkedIn:

    Wednesday, July 3, 2019 8:22 PM