Mediation Server Scalability Issue

    Question

  • Hi, we have an issue with our Lync PSTN conference.

    Our Mediation server is hosted on a dedicated VM (vSphere 4, 4 vCPU, 8 GB RAM) running Lync 2010 Enterprise on Windows 2008 R2, patched to the latest level, with a SIP trunk into CUCM 6.1.  With ~55 PSTN users (calling into Lync, no Lync dial-out), CPU utilization reaches 90%.  The CPU spike seems to trigger AV MCU alerts, after which any new incoming PSTN or IM session is rejected.

    Virtual CPU and memory are not oversubscribed.

    Our load is well below the published capacity guidelines:

    According to this blog (http://jasonshave.blogspot.com/2011/05/calculating-number-of-mediation-servers.html), a standalone mediation server should host 950-1200 concurrent calls.

    PSTN resource utilization should be minimal according to the Microsoft guidelines (http://technet.microsoft.com/en-us/library/gg615029):

    Concurrent PSTN      CPU requirements   CPU as % of a       Memory         Network
    conferencing users   (megacycles)       Front End Server*   requirements   bandwidth
    ------------------   ----------------   -----------------   ------------   ---------
    50                   373                2.0%                0.47 GB        1.0 Mbps
    100                  560                3.0%                0.59 GB        2.1 Mbps
    150                  560                3.0%                0.71 GB        3.2 Mbps
    200                  933                5.0%                0.83 GB        4.4 Mbps
    250                  1,680              9.0%                1.01 GB        5.6 Mbps
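
    To put our ~55 callers against those figures, here is a rough sketch that linearly interpolates the megacycle column of the table above. The interpolation is my own assumption (the guideline only lists the five data points), and the function name is made up for illustration.

```python
# (concurrent PSTN conferencing users, CPU megacycles), copied from the
# TechNet table quoted above.
GUIDELINE = [(50, 373), (100, 560), (150, 560), (200, 933), (250, 1680)]


def estimate_megacycles(users: float) -> float:
    """Linearly interpolate the guideline CPU cost for a given user count.

    Assumption: cost varies linearly between published rows. The guideline
    itself does not state this.
    """
    if not GUIDELINE[0][0] <= users <= GUIDELINE[-1][0]:
        raise ValueError("user count is outside the published table range")
    for (u0, m0), (u1, m1) in zip(GUIDELINE, GUIDELINE[1:]):
        if u0 <= users <= u1:
            return m0 + (m1 - m0) * (users - u0) / (u1 - u0)
    raise AssertionError("unreachable: range was checked above")


print(round(estimate_megacycles(55), 1))  # prints 391.7
```

    By that estimate, 55 PSTN conferencing users should cost roughly 390 megacycles, which is nowhere near enough to explain 90% utilization on 4 vCPUs.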


    Has anyone encountered a similar problem, or does anyone have a solution?









    • Edited by Far Side Thursday, June 07, 2012 7:41 PM
    Thursday, June 07, 2012 7:24 PM

All replies

  • Hi, which network driver did you use in vSphere? Normally you should use the synthetic network driver.

    regards Holger Technical Specialist UC

    Thursday, June 07, 2012 9:56 PM
  • Hi Holger,

    I believe "synthetic network driver" is a Hyper-V term.  The equivalent in VMware is the paravirtualized VMXNET adapter.  We use the VMXNET 3 adapter.

    Please see the VMXNET 3 description here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805

    We also ran BPA multiple times.  The only thing BPA flagged is that Lync is running on a VM, which we knew...


    • Edited by Far Side Friday, June 08, 2012 1:38 AM
    Friday, June 08, 2012 1:26 AM
  • Hi Far Side,

    Yes, that is correct. I have seen some configurations where the E1000 driver is used for Lync, but VMXNET 3 performs better for real-time data.

    Have you checked your virus scanner? We use the same configuration as you and everything is OK, with no performance issues.


    regards Holger Technical Specialist UC

    Friday, June 08, 2012 7:01 AM
  • Yes, we disabled all anti-virus scanning as a diagnostic measure.  Microsoft publishes antivirus scan exclusions for Lync (http://technet.microsoft.com/en-us/library/gg195736.aspx).  Just to be on the safe side, we disabled all AV on the FE, Mediation, Monitoring, and database servers.  There was no substantial difference whether AV was on or off.

    I should also mention that we implemented QoS and engaged Microsoft, so far to no avail (still ongoing).

    We also looked into Media Bypass in an attempt to offload the transcoding effort from the Mediation server, but were told that media bypass only works when a Lync endpoint is involved, or when Enterprise Voice is enabled.

    Friday, June 08, 2012 12:17 PM
  • We also looked into Media Bypass in an attempt to offload transcoding effort on the mediation server, but was told that media bypass would only work when a Lync endpoint is involved, or Enterprise Voice is enabled.

    Are you collocating the Mediation service with the FE or not?

    EV media bypass works only if the Lync client and the PSTN gateway (CUCM) have matching media, i.e. RTP or Secure RTP. By default, the Lync client requires SRTP, but Lync cannot talk SRTP to CUCM. You can make encryption optional for the Lync client and enable media bypass system-wide and for the PSTN gateway. To do so, you'll have to disable RTCP timers and REFER support on the CUCM SIP trunk.

    Also keep in mind that the TechNet capacity planning numbers for the Mediation server assume a physical machine with 8 CPU cores and that 70% of the calls are from internal users. Reading between the lines, it could mean the numbers are accurate only if 70% of the calls are using media bypass.

    When you are experiencing high CPU usage, did you check which process is using all the CPU cycles?


    • Edited by Adminiuga Saturday, June 09, 2012 1:24 AM
    Saturday, June 09, 2012 1:23 AM
  • We have the Mediation service collocated with the FE, and we also tested with the Mediation service on its own dedicated server.  We run into similar issues whether the Mediation service is collocated or not.

    The published number of PSTN callers on a standalone Mediation server is 800-950.  Taking into account the VM overhead (10% per the Microsoft virtualization guideline) and only 4 virtual CPUs, the theoretical limit should be around 400 PSTN callers.  We run into issues with 60 PSTN callers.
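
    For reference, the back-of-the-envelope estimate above works out roughly like this. It is only a sketch: the 8-core reference machine is taken from the earlier reply about the TechNet numbers, linear scaling with core count is my assumption, and the function name is made up.

```python
def theoretical_pstn_limit(published_calls: float,
                           vm_overhead: float = 0.10,
                           vcpus: int = 4,
                           reference_cores: int = 8) -> float:
    """Scale a published call capacity down to a smaller virtual machine.

    Assumptions (not from Microsoft): capacity scales linearly with core
    count, and virtualization costs a flat fraction of capacity.
    """
    return published_calls * (1 - vm_overhead) * vcpus / reference_cores


low = theoretical_pstn_limit(800)
high = theoretical_pstn_limit(950)
print(f"theoretical limit: {low:.0f} to {high:.0f} PSTN callers")
```

    That puts the expected range at roughly 360 to 430 callers, so "around 400" is a fair midpoint, and 60 callers is about 15% of it.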

    Microsoft's guidelines seem very confident about performance for PSTN callers, but I am not sure what hidden assumptions are made about the PSTN user, i.e. media bypass or not.

    The Mediation service is the one taking 70% of the total CPU consumption.

    Sunday, June 10, 2012 9:50 PM
  • The Lync Server 2010 Stress and Performance Tool can be used to prepare, define and validate performance targets for the user scenarios offered by an on-premises Lync Server 2010 deployment. It includes multiple modules and can simulate simultaneous users on one or more Lync Servers. You could give it a try.


    Noya Lau

    TechNet Community Support

    Monday, June 11, 2012 7:59 AM
    Moderator
  • We looked at LSS.  It seems to require Domain Admin privileges to run, and it creates all of its test users in the production AD domain.  Even though the test users can be placed under a separate OU, our operations teams have reservations about granting so much liberty to a test tool in a production environment.

    Do you know if LSS can run under a domain account and pick up user credentials from an XML file?

    Monday, June 11, 2012 2:49 PM
  • Do NOT run the Stress and Performance Tool in production. It is not supported by MS, and it requires domain admin privileges because it runs amok creating test users etc. Getting back to your original question: you state that you are using only 4 vCPUs. Have you considered and tested what happens if you go to 8 vCPUs? Of course this is not officially supported by MS, because they have not tested 8 vCPUs (Hyper-V does not currently support more than 4 vCPUs), but it might provide some temporary relief for your problems. It would also be interesting to see whether this scales linearly, i.e. whether you could get to 110 users.

    Monday, June 11, 2012 3:13 PM
  • I couldn't agree with you more on NOT running LSS in production.

    We haven't increased vCPU from 4 to 8 in our test.  We were hoping to find the root cause of the issue.  There seems to be too much of a discrepancy between the official Microsoft number and our number.  We are hoping it's a configuration issue somewhere.

    Monday, June 11, 2012 5:57 PM