linux memory monitoring

  • Question

  • I am trying to figure out why the SCOM memory monitor is so far off from the actual metric on the server.  I am running SCOM 1807 and have installed the latest agent on an Oracle Linux 6.1 server.  The dashboard I created shows that the server is using only about 10% of its memory;

    however, when I log on to the server and run the following command, it shows that the server is using almost 99%.  Why is there such a huge discrepancy?

    Thanks,

    Rene

     
    Thursday, February 14, 2019 8:00 PM

All replies

  • Hi Rene,

    can you check which command SCOM actually runs against the Linux systems? I currently have no access to verify it myself. It should be fairly easy to find this out and run the same...

    Also, I would check in the MP whether the monitor computes some kind of average based on many samples...

    Regards,


    (Please take a moment to "Vote as Helpful" and/or "Mark as Answer" where applicable. This helps the community, keeps the forums tidy, and recognizes useful contributions. Thanks!) Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov

    Thursday, February 21, 2019 2:26 PM
    Moderator
  • Hi Stoyan,

    Thanks for the response.  Not sure how to check for that command.  The MP guide has little information on the process of monitoring memory.  

    Rene

    Thursday, February 21, 2019 6:23 PM
  • Hi Rene,

    this seems to be the rule:

    \% Used Memory (Universal Linux)

    Let me find some time tomorrow to study the rule and work out exactly which workflow is firing. I think that, for now, you cannot compare this with the way you are querying the memory.

    Cheers,



    Thursday, February 21, 2019 7:03 PM
    Moderator
  • Thanks.  Much appreciated.

    Rene

    Thursday, February 21, 2019 7:18 PM
  • Hi Rene,

    I searched and read a bit (I am not that familiar with Linux), and it seems that SCOM is just querying (through the rule) a counter on the Linux system (this is my assumption), which is called "% Used Memory" or "Memory\\% Used Memory".

    You are querying with a grep command, which seems to get the data in a different way. I found a couple of discussions online pointing out that these are two different methods of getting (and, in the case of grep, also formatting) the data.

    I am also not sure (and could not find any info) whether the "% Used Memory" counter is really an actual Linux counter or just the MS interpretation of data obtained using some command (like "grep", for example).

    It seems that the rule obtains this over WSMAN:

    $Data/WsManData/*[local-name(.)='SCX_MemoryStatisticalInformation']/*[local-name(.)='PercentUsedMemory']$
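    To illustrate what that expression selects, here is a minimal Python sketch that matches elements by local name only, the way local-name(.) does. The XML fragment, its namespace URI and the value 10 are made up for the example; a real agent response will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment: the namespace URI and the value are
# invented for illustration and do not come from a real agent.
SAMPLE = """<p:SCX_MemoryStatisticalInformation xmlns:p="http://example.invalid/scx">
  <p:PercentUsedMemory>10</p:PercentUsedMemory>
</p:SCX_MemoryStatisticalInformation>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_names(root, parent, child):
    """Return the text of the first <child> directly under a <parent>,
    comparing local names only (namespaces are ignored)."""
    for elem in root.iter():
        if local_name(elem.tag) == parent:
            for sub in elem:
                if local_name(sub.tag) == child:
                    return sub.text
    return None


root = ET.fromstring(SAMPLE)
value = find_by_local_names(
    root, "SCX_MemoryStatisticalInformation", "PercentUsedMemory"
)
print(value)
```

    So the rule only reads a value the provider has already computed; the calculation itself happens on the agent side.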

    Does "PercentUsedMemory" ring a bell? Is this easy to find out on the Linux system?

    Regards,



    Friday, February 22, 2019 11:01 AM
    Moderator
  • The thing is that this is not a "Linux" counter per se: it only exists at the SCX_MemoryStatisticalInformation provider level, and that is part of the SCOM agent.

    So what we would need to know is how the provider collects that info... And despite having found some source code for the provider on GitHub, I can't say it's really clear to me.

    Friday, February 22, 2019 11:33 AM
  • Hi CyrAz,

    good thing we have you here, helping out :) I also searched for such a thing as a "performance counter" on Linux but found nothing, which is why I assumed this is more of a Windows way of interpreting the data from the data source.

    I saw that Azure Monitor uses the same counters to collect that performance info from Linux systems, but there, too, it is not clear how the data is collected.

    Thanks for the input. Where did you find the source code for the provider? Can you share the link?

    Thanks in advance!



    Friday, February 22, 2019 12:39 PM
    Moderator
  • Regardless of how the data is collected and interpreted, the true values from the Linux server (and I have seen the same issue on a number of Linux servers) and the values published on the dashboards and performance charts are far apart.  Is this something that the developers can look at?

    Rene

    Friday, February 22, 2019 2:29 PM
  • Hi Rene,

    the only thing I can think of is to log this in the SCOM User Voice here, because it is monitored by the SCOM Product Group:

    General Operations Manager Feedback

    Regards,



    Friday, February 22, 2019 2:45 PM
    Moderator
  • @Stoyan : here you go :

    header :  https://github.com/Microsoft/SCXcore/blob/master/source/code/providers/SCX_MemoryStatisticalInformation.h 

    actual code (maybe) : https://github.com/Microsoft/pal/blob/master/source/code/scxsystemlib/memory/memoryinstance.cpp

    @Rene : Well, the published values can't be way off "regardless of how they are collected" : in my opinion, the way they are collected is probably the very reason why they are way off.






    • Edited by CyrAz Friday, February 22, 2019 4:34 PM
    Friday, February 22, 2019 3:55 PM
  • I spent a bit more time reading the source code, and from what I understand, this could be how used memory is calculated :

    It reads /proc/meminfo, then uses the MemTotal and MemAvailable fields to calculate usedMemory :

    m_usedMemory = m_totalPhysicalMemory - m_availableMemory

    and then the PercentFreeMemory is probably calculated somewhere else...
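
    As a rough illustration of that calculation (not the provider's actual code), a minimal Python sketch could look like the one below. The function names are hypothetical, and the fallback for kernels that do not report MemAvailable (MemFree + Buffers + Cached) is only an assumption; the provider may handle that case differently.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def percent_used(fields):
    """Percent used, computed as (MemTotal - available) / MemTotal * 100.

    Assumption: if MemAvailable is missing (older kernels do not report it),
    approximate it as MemFree + Buffers + Cached.
    """
    total = fields["MemTotal"]
    available = fields.get("MemAvailable")
    if available is None:
        available = (
            fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
        )
    return 100.0 * (total - available) / total


# Tiny made-up sample, only to show the arithmetic.
sample = """MemTotal: 1000 kB
MemFree: 100 kB
MemAvailable: 400 kB
"""
print(round(percent_used(parse_meminfo(sample)), 2))  # 60.0
```

    On a Linux host the same functions can be fed the real file, for example percent_used(parse_meminfo(open("/proc/meminfo").read())).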

    Friday, February 22, 2019 5:16 PM
  • These are the values from /proc/meminfo on the server in question.

    MemTotal:       197940232 kB
    MemFree:         3094384 kB
    Buffers:          537428 kB
    Cached:         175360020 kB
    SwapCached:         1264 kB
    Active:         108163948 kB
    Inactive:       73964256 kB
    Active(anon):   38808024 kB
    Inactive(anon):  4589976 kB
    Active(file):   69355924 kB
    Inactive(file): 69374280 kB
    Unevictable:        6548 kB
    Mlocked:            6548 kB
    SwapTotal:      20971516 kB
    SwapFree:       20737720 kB
    Dirty:               240 kB
    Writeback:             0 kB
    AnonPages:       6237824 kB
    Mapped:         20884012 kB
    Shmem:          37163060 kB
    Slab:            3251676 kB
    SReclaimable:    2901232 kB
    SUnreclaim:       350444 kB
    KernelStack:       12224 kB
    PageTables:      7735796 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    119941632 kB
    Committed_AS:   52236904 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:      691828 kB
    VmallocChunk:   34359017756 kB
    HardwareCorrupted:     0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:       26260 kB
    DirectMap2M:     2043904 kB
    DirectMap1G:    201326592 kB

    So that would mean that 194845848 kB (MemTotal - MemFree) is being used,

    and therefore the % memory used is 194845848 / 197940232 * 100 = 98.44%
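
    For comparison, here is a small Python sketch that runs the same numbers two ways: once counting everything except MemFree as used, and once also treating Buffers and Cached as available. The second variant is only an assumption about what an "available memory" style calculation might do, not a confirmed description of the SCOM provider.

```python
# Values in kB, copied from the /proc/meminfo output above.
mem_total = 197940232
mem_free = 3094384
buffers = 537428
cached = 175360020


def pct(used, total):
    """Used memory as a percentage of total, rounded to two decimals."""
    return round(100.0 * used / total, 2)


# Variant 1: everything that is not MemFree counts as used.
used_strict = mem_total - mem_free

# Variant 2 (assumption): buffers and page cache are treated as available.
used_excluding_cache = mem_total - (mem_free + buffers + cached)

strict_pct = pct(used_strict, mem_total)
excl_pct = pct(used_excluding_cache, mem_total)

print(strict_pct)  # 98.44
print(excl_pct)    # 9.57
```

    The second figure is close to the roughly 10% shown on the dashboard, which would be consistent with the two numbers differing mainly in how buffers and cache are counted.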

    Rene

    Friday, February 22, 2019 5:57 PM