ATA Center Service failing to start

  • Question

  • Hi All,

    My ATA Center has stopped working after being functional for 3+ months.  The service tries to start and then stops, over and over.  The following error appears repeatedly in the log:

    2017-09-14 19:12:08.7642 1884 674 21a234d4-0cfb-49d3-9ca9-f7f5fd3c693e Error [SourceAccountToSourceComputerRule+<GetItemsetsAsync>d__8] System.NullReferenceException: Object reference not set to an instance of an object.
       at async Microsoft.Tri.Center.Detection.Detectors.SourceAccountToSourceComputerRule.GetItemsetsAsync(?)
       at async Microsoft.Tri.Center.Detection.Detectors.BehavioralRule.GetCachedItemsetsAsync(?)
       at async Microsoft.Tri.Center.Detection.Detectors.BehavioralRule.RunAsync(?)
       at async Microsoft.Tri.Center.Detection.Detectors.AbnormalBehaviorDetector.BuildDataAsync(?)
       at async Microsoft.Tri.Center.Detection.Detectors.AbnormalBehaviorDetector.DetectAsync(?)
       at async Microsoft.Tri.Center.Detection.Detectors.Detector`4.<OnInitializeAsync>b__44_1[](?)
       at async Microsoft.Tri.Infrastructure.Blocks.BatchBlockWrapper`1.<>c__DisplayClass13_1.<-ctor>b__1[](?)

    I followed the steps at https://docs.microsoft.com/en-us/advanced-threat-analytics/troubleshooting-service-startup, but they didn't help.  Any suggestions you could provide would be great.  Thanks so much!

    Thursday, September 14, 2017 7:52 PM

All replies

  • Which exact version of ATA are you running?
    Thursday, September 14, 2017 7:56 PM
  • This would be 1.8 Update 1.
    Thursday, September 14, 2017 8:09 PM
  • Additional info from Microsoft.Tri.Center-Errors.log:

    2017-09-14 20:27:58.2066 3572 5   00000000-0000-0000-0000-000000000000 Error [CenterConfigurationManager+<GetConfigurationAsync>d__7] System.NullReferenceException: Object reference not set to an instance of an object.
       at async Microsoft.Tri.Center.Service.CenterConfigurationManager.GetConfigurationAsync(?)
       at async Microsoft.Tri.Infrastructure.Framework.ConfigurationManager`2.UpdateConfigurationAsync[](?)
       at async Microsoft.Tri.Infrastructure.Framework.ConfigurationManager`2.OnInitializeAsync[](?)
       at async Microsoft.Tri.Center.Service.CenterConfigurationManager.OnInitializeAsync(?)
       at async Microsoft.Tri.Infrastructure.Framework.Module.InitializeAsync(?)
       at async Microsoft.Tri.Infrastructure.Framework.ModuleManager.OnInitializeAsync(?)
       at async Microsoft.Tri.Infrastructure.Framework.Module.InitializeAsync(?)
       at async Microsoft.Tri.Infrastructure.Framework.Service.OnStartAsync(?)
       at Microsoft.Tri.Infrastructure.Framework.Service.OnStart(String[] args)
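
    To see at a glance which components are throwing and when each error first appeared, the log can be summarized with a short script. This is only a rough sketch: the column layout is inferred from the two entries pasted in this thread, the meaning of the two numeric columns is a guess, and `summarize_errors` is a made-up helper name.

```python
import re
from collections import Counter

# Layout inferred from the entries in this thread:
# timestamp, two numeric columns (meaning assumed), a GUID, level, [source], message.
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<num1>\d+)\s+(?P<num2>\d+)\s+"
    r"(?P<guid>[0-9a-fA-F-]{36})\s+"
    r"(?P<level>\w+)\s+"
    r"\[(?P<source>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def summarize_errors(lines):
    """Count Error entries per (source, exception type) and record first-seen times.

    Stack-frame lines ("at ...") do not match the entry pattern and are skipped.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is None or match.group("level") != "Error":
            continue
        # The exception type is the text before the first colon in the message.
        exception_type = match.group("message").split(":", 1)[0]
        key = (match.group("source"), exception_type)
        counts[key] += 1
        first_seen.setdefault(key, match.group("ts"))
    return counts, first_seen
```

    Feeding it the lines of Microsoft.Tri.Center-Errors.log (for example via `open(path, encoding="utf-8", errors="replace")`) gives one row per failing component, which makes it easier to tell a detector error apart from a configuration-load error like the one above.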

    Thursday, September 14, 2017 8:29 PM
  • This new callstack is a whole new story...


    How long has this deployment been running?

    Was it a new one, installed directly as 1.8.1?

    If it was upgraded from earlier versions, what were the dates of the upgrades to 1.8.0 and 1.8.1?

    Do you have a side backup of the JSON file from a working configuration of the Center, as described here:
    https://docs.microsoft.com/en-us/advanced-threat-analytics/ata-configuration-file ?

    Thursday, September 14, 2017 8:48 PM
  • 1.  This deployment has been running since about mid June.

    2. No, this was installed as 1.7 update 2 and then upgraded to 1.8.0 and 1.8.1.  The upgrade to 1.8.1 occurred in early August and has been working great.  It should be noted some Windows updates came down today for this server.  I'm in the process of rolling them back.  

    3.  I can attempt to restore the earliest one without having to pull from backup (which will be a little harder).  I noted the ATA was not working properly and only pulling part of the data in when I logged in today.  Following a reboot (which finished applying the above mentioned Windows updates) was when the service began failing.  

    Thursday, September 14, 2017 8:55 PM
  • I restored back to the earliest backup from today (when I know ATA was at least partially working because it was pulling some of the data).  This did not help, the service is still failing.  Any other ideas? 
    Thursday, September 14, 2017 9:03 PM
  • Don't roll back the updates; they are unrelated. What triggered the failure is the reboot itself, which exposed a problem that had already happened earlier and simply gone unnoticed.

    Can you find the exact dates of the upgrade to 1.8.0 & 1.8.1?

    Don't try to restore from the existing files on the disk. They only go back 10 hours, and most likely all of them are bad by now. You can verify this by searching for the string

    "LicenseType" : "Evaluation"

    If it appears in the file, the file is no good (assuming that you did activate the product at some point and are not still running in eval mode, as mid-September is roughly where we are now).

    If you do find Evaluation, please check the EvaluationExpirationTime value near it and paste the timestamp here. It will help me figure out how far back you would need to go in your backups to get a good file.
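
    If there are many backup copies to go through, the check above can be scripted. A minimal sketch, assuming the backups are plain-text JSON files sitting in one folder; the files are searched as text rather than parsed, because their exact structure is not shown in this thread, and `check_config_text` / `scan_backups` are made-up helper names.

```python
import re
from pathlib import Path

# Matches "LicenseType" : "Evaluation" with any spacing around the colon.
EVAL_RE = re.compile(r'"LicenseType"\s*:\s*"Evaluation"')
# Captures the raw text after "EvaluationExpirationTime" up to the next comma,
# closing brace, or line end. The value's format is not assumed, so it is
# returned as-is.
EXPIRY_RE = re.compile(r'"EvaluationExpirationTime"\s*:\s*([^,\r\n}]+)')


def check_config_text(text):
    """Return (is_evaluation, raw_expiration_value_or_None) for one file's text."""
    is_evaluation = EVAL_RE.search(text) is not None
    match = EXPIRY_RE.search(text)
    expiry = match.group(1).strip() if match else None
    return is_evaluation, expiry


def scan_backups(folder, pattern="*.json"):
    """Check every matching file in folder; returns {file name: (is_eval, expiry)}."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8-sig", errors="replace")
        results[path.name] = check_config_text(text)
    return results
```

    Calling `scan_backups(...)` on the folder holding the backups and printing the result shows which copies still carry the Evaluation license and what expiration value sits next to it. A file that comes back as `(False, None)` is a candidate for a good backup, but it is worth opening it by hand before relying on it.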

    Sadly, restoring just this file won't be enough, you will need to go through the full procedure.


    (Please make sure you have all the needed backups before you start).


    Thursday, September 14, 2017 9:05 PM
  • See my other reply; you might have missed it before posting again.
    Thursday, September 14, 2017 9:12 PM
  • So yes, I did activate the product using our key.  The line in the file is there as shown below:


    So do I have to do disaster recovery at this point?

    Thursday, September 14, 2017 9:17 PM
  • So I think the problem started at 14/09/2017 6:46:05 UTC, +/- 1 hour.

    Do you know of anything special that happened to the server around that time?

    Do you have an earlier backup of this json file where the license is not evaluation?

    Thursday, September 14, 2017 9:21 PM
  • I do have backups prior to this.  I'm not aware of anything that happened to the server at this point.  
    Thursday, September 14, 2017 9:24 PM
  • Then yes, you should go for full Center recovery as described in the link I provided above.

    Note that if you made any changes (IP changes, certificate changes, etc.) since the latest working backup you have, it will not work.

    Also, if you added gateways since then, those will need to be reinstalled.

    How many gateways do you have deployed?

    Also, if you can, before wiping the current Center machine, stop and disable the ATA services and the mongo services, then copy the entire ATA folder aside. That would be great: I might be able to provide post-mortem steps later on, not sure yet.
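
    The stop, disable, and copy steps could be scripted roughly like this. It is only a sketch: the service names, the install folder, and the destination are assumptions for a default install and are not confirmed anywhere in this thread, so check them (for example in services.msc) first. It would need to run from an elevated prompt on the Center machine.

```python
import shutil
import subprocess

# ASSUMPTIONS (not from this thread): the service names, install folder and
# destination below are illustrative. Verify the real values before running.
SERVICES = ["ATACenter", "MongoDB"]
ATA_FOLDER = r"C:\Program Files\Microsoft Advanced Threat Analytics"
COPY_TARGET = r"D:\ata-copy"  # hypothetical destination with enough free space


def build_commands(services):
    """Return sc.exe commands that stop, then disable, each service in order."""
    commands = []
    for name in services:
        commands.append(["sc", "stop", name])
        # sc.exe expects "start=" and its value as separate tokens.
        commands.append(["sc", "config", name, "start=", "disabled"])
    return commands


def stop_and_disable(services=SERVICES, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    commands = build_commands(services)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)
    return commands


def copy_ata_folder(source=ATA_FOLDER, target=COPY_TARGET):
    """Copy the whole folder aside. Call this only after the services have
    fully stopped: 'sc stop' returns before the service finishes stopping."""
    return shutil.copytree(source, target)
```

    `stop_and_disable()` with the default `dry_run=True` only prints the commands, so they can be reviewed first. After running it with `dry_run=False`, confirm that both services show as stopped before calling `copy_ata_folder()`, since copying while mongo is still shutting down could leave an inconsistent copy.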

    Thursday, September 14, 2017 9:30 PM
  • I have about 50 ATA lightweight gateways deployed.  No changes have been made to the machine so I should be OK.

    One other option: I have a full snapshot of the VM from a few hours before the issue occurred.  Do you think it is feasible to revert to that snapshot and just accept a gap in the data? I don't have any issue with that, since the ATA Center has been down and not collecting data anyway.

    Thursday, September 14, 2017 9:48 PM
  • A snapshot will work if mongo was stopped when the snapshot was taken.

    If it wasn't, it's a big risk: I can't tell you anything about the state of the DB at the time of the snapshot.

    Can you email me at atashare at microsoft com?

    I think this issue is a bit complicated for troubleshooting via the forum. 

    I might be able to help more.

    Thursday, September 14, 2017 9:59 PM