Maintenance Mode completes and Unix Log alerts still generate

    Question

  • Hi guys,

    When we place a Unix server into maintenance mode (so work can be done on it), alerts are suppressed as expected while the server is in MM.

    The issue is that when maintenance mode finishes, the custom log monitoring rules scrape for events logged during the maintenance window and then generate alerts minutes after the server has exited maintenance mode.

    Is this how log monitoring works for Unix? How can I stop these alerts from coming through after MM has finished? Or is this a bug?

    We're using SCOM 2012 SP1 Update Rollup 5 (we were previously running Update Rollup 4, and the issue was present there too).

    Unix server details:

    HPUX 11.23

    SCOM agent version 1.4.1-292

    Note: I believe there's a newer version of the agent. If I update, will that resolve the issue?

    Thanks, Martin.

    Monday, May 12, 2014 3:12 AM

Answers

  • Martin,

    This is a known limitation, and it is not fixed in the latest agent version.  The UNIX/Linux agent's log file parsing keeps track of the last line read in each file, and on each subsequent invocation it parses from that line to the current end-of-file. When an agent enters maintenance mode, the workflows stop running.  When the agent exits maintenance mode, log file parsing resumes from the last line read (which was read before maintenance mode was entered) and continues to the current end-of-file.  Thus, matching log entries written during the maintenance window are picked up when the agent exits maintenance mode.
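    To make the behaviour described above concrete, here is a small toy model of a reader that remembers where it stopped. It is only an illustration, not the agent's actual code: the names LogTail, poll and demo are hypothetical, and it stores a byte offset in a plain text file as a stand-in for the agent's "last read line" bookkeeping.

```python
import os
import tempfile


class LogTail:
    """Toy model of a log reader that remembers where it stopped."""

    def __init__(self, log_path, state_path):
        self.log_path = log_path
        self.state_path = state_path  # hypothetical stand-in for saved state

    def _load_offset(self):
        try:
            with open(self.state_path) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            # No saved position: start at the current end of the log.
            return os.path.getsize(self.log_path)

    def poll(self, pattern):
        """Return lines containing `pattern` written since the last poll."""
        offset = self._load_offset()
        with open(self.log_path, "rb") as f:
            f.seek(offset)
            data = f.read()
            new_offset = f.tell()
        with open(self.state_path, "w") as f:
            f.write(str(new_offset))
        lines = data.decode("utf-8", errors="replace").splitlines()
        return [line for line in lines if pattern in line]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "app.log")
        state = os.path.join(tmp, "saved_position")
        with open(log, "w") as f:
            f.write("ERROR before monitoring\n")
        tail = LogTail(log, state)
        first = tail.poll("ERROR")  # no saved state, so it starts at EOF

        # "Maintenance window": nothing polls while this line is written.
        with open(log, "a") as f:
            f.write("ERROR during maintenance\n")
        replayed = tail.poll("ERROR")  # resumes from the saved position

        # Second window, but the saved position is removed before resuming.
        with open(log, "a") as f:
            f.write("ERROR during second window\n")
        os.remove(state)
        skipped = tail.poll("ERROR")  # starts at EOF again
        return first, replayed, skipped
```

    In this model, the second poll returns the line written during the first window, while the poll that follows removal of the saved position returns nothing.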

    At the present time, I can offer one workaround:

    The "last read line" that I referred to is tracked in a set of "manifest" files in /var/opt/microsoft/scx/lib/state/.  The files are named with the pattern: LogFileProvider__<Username>__<filename>ManagementGroup.  Note: log files parsed by an account other than root have a manifest file in a subdirectory named for the user, such as /var/opt/microsoft/scx/lib/<username>.

    If you delete these files, log file parsing will begin again at the END of the file and match only new lines from that point forward.  So, if you did a recursive delete of LogFileProvider_* in /var/opt/microsoft/scx/lib/ before ending maintenance mode, the parsing would start from the current end of each file.

    If you are using Orchestrator in your environment, you could create a Runbook to control maintenance mode and the deletion of these files in an automated fashion.
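    As a rough sketch, the recursive delete described above could look like the following if run on the monitored host. The directory and the LogFileProvider_* pattern come from the post; the function name remove_manifests and its dry_run parameter are hypothetical. It also assumes a Python 3 interpreter is available on the host, which may not hold on older UNIX systems, where an equivalent find/rm command would be needed instead.

```python
import fnmatch
import os

STATE_ROOT = "/var/opt/microsoft/scx/lib"  # directory named in the post


def remove_manifests(root=STATE_ROOT, pattern="LogFileProvider_*", dry_run=True):
    """Recursively find manifest files under `root` and optionally delete them.

    Returns the sorted list of matching paths. With dry_run=True nothing
    is deleted, so the matches can be reviewed first.
    """
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            matched.append(path)
            if not dry_run:
                os.remove(path)
    return sorted(matched)
```

    Calling remove_manifests() with the defaults only lists what would be removed; pass dry_run=False to actually delete the files, and do so before maintenance mode ends.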

    -Kris


    www.operatingquadrant.com

    Tuesday, May 13, 2014 6:34 PM
  • Please review Kris's answer.

    What we have done to work around this issue is create a PowerShell script that monitors for event ID 1215 and/or 1216.  When a matching event occurs, the script connects to the management server through the Operations Manager Shell and identifies the server in question.  If that server belongs to a specific UNIX resource pool, the script establishes an SSH connection to it and removes the set of manifest files.
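    The decision logic of such a script might be sketched as below. This is not the poster's PowerShell script; it is a Python illustration under several assumptions: the function name handle_event, the pool_hosts set and the injectable run parameter are hypothetical, key-based SSH login to the host is assumed to be set up already, and the lookup of the server on the management server is left out entirely. The remote command uses find with -exec rm rather than -delete, since -delete is not available in every UNIX find.

```python
import subprocess

MM_EVENT_IDS = {1215, 1216}  # event IDs named in the post

# Remote cleanup command; the path and file pattern come from the thread.
CLEANUP_CMD = (
    "find /var/opt/microsoft/scx/lib -type f "
    "-name 'LogFileProvider_*' -exec rm -f {} \\;"
)


def handle_event(event_id, host, pool_hosts, run=subprocess.run):
    """Run the manifest cleanup over SSH if the event and host qualify.

    Returns the argv that was executed, or None if nothing was done.
    `run` is injectable so the logic can be exercised without a real SSH call.
    """
    if event_id not in MM_EVENT_IDS or host not in pool_hosts:
        return None
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    argv = ["ssh", "-o", "BatchMode=yes", host, CLEANUP_CMD]
    run(argv, check=True)
    return argv
```

    Passing a stub for run makes it possible to check which hosts would be touched before pointing the script at real servers.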

    Good luck.

    Monday, August 04, 2014 6:00 AM

All replies

  • Hi,

    We have the same issue on SLES 11 servers on SCOM 2012 SP1 UR4 (agent 1.4.1-292). We're about to have our Premier Field Engineer look into this too. We need a statement on it. In our opinion, this is what maintenance mode is for, so there shouldn't be alarms for the time the server was in maintenance, even for log file monitoring.

    Tuesday, May 13, 2014 5:30 AM
  • Hi Kris,

    Is a fix for this currently in progress? Can we do something to push it? For us the impact is significant: we expect several requests for security log monitoring in the near future, and because of this defect we would have to develop more complex monitoring. The workaround of deleting the files via Orchestrator is not an option for us in the short term, since we currently have a separate mechanism for maintenance mode in place. Our environment currently covers ~2000 Linux servers.

    Regards

    Wednesday, May 14, 2014 9:20 AM
  • Holger and Martin,

    We are considering options about how we could improve this in a future update.  If I may ask, how are you managing your maintenance mode now?  Are you using PowerShell to manage MM?  Are you using timed MM windows or starting and stopping MM with the console or PowerShell?

    Thanks,

    Kris


    www.operatingquadrant.com

    Monday, May 19, 2014 9:26 PM
  • Hi Kris,

    Thanks for your detailed response. I suspected that this was by design.

    We currently instruct our engineers to use the console to start/stop MM.

    We have a 24/7 operations team that actions alerts and has the ability to place servers into MM.

    We use a PowerShell script plus Task Scheduler to place groups into maintenance mode.

    We have also configured a task in SCOM to automatically place a Citrix server into MM for Citrix server-initiated restarts (when they do their own restart at night), etc.

    I have had a request from our Windows and Unix team engineers for the ability to schedule MM for servers and groups, so I was looking at developing an in-house tool to provide this. We also use PRTG for network monitoring, and it offers this out of the box. Would love it if SCOM had this built in.

    Also, on the Unix agent side, the ability to perform a task upon exiting MM would be useful (your workaround could easily be integrated that way).

    Unfortunately, we don't use Orchestrator. Do you have any other suggestions for how I could implement this workaround?

    P.S. Feel free to contact me if you would like more feedback; happy to discuss.

    Thanks, Martin.

    Wednesday, May 21, 2014 3:38 AM
  • Please check my latest MP addition to TechNet Gallery:

    https://gallery.technet.microsoft.com/UNIXLinux-LogFile-Library-4133064b

    It allows you to create rules to monitor UNIX/Linux LogFiles with all known limitations lifted.

    Friday, March 03, 2017 6:24 PM
  • Three years later and the issue was not actually resolved?  

    Is SCOM getting any actual development\remediation work done these days or is it just an ignored backwater?  

    Wednesday, March 08, 2017 6:03 PM