SFB 2015 EE pool and LS InterCluster Routing error 55011 (pool unreachable)

  • Question

  • This is a strange issue.  I have an EE pool that we're using for some product stress testing, and it's a standard set of 3 FE servers (2015fepool.loadtest.local).  Everything will be fine for a while, but every so often, performance will drop pretty sharply.  We'll see this manifest with timeouts on conference operations, but the other thing that's strange is that this message starts appearing in the event log around the same time:


    Log Name:      Lync Server
    Source:        LS InterCluster Routing
    Date:          1/11/2017 2:23:56 PM
    Event ID:      55011
    Task Category: (1063)
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      2015fe2.loadtest.local
    Description:
    A Communication Server pool is unreachable and has been marked as down.

    Multiple attempts to route to pool 2015fepool.loadtest.local have failed and this pool has been marked as down for audio calls.

    However, it's immediately followed by this:

    Log Name:      Lync Server
    Source:        LS InterCluster Routing
    Date:          1/11/2017 2:23:57 PM
    Event ID:      55012
    Task Category: (1063)
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      2015fe2.loadtest.local
    Description:
    The Pool 2015fepool.loadtest.local has started responding to requests. Any previous errors have been resolved.

    This is happening maybe once every 60-90 seconds, and seems to correspond to the timeout periods.  The server where I'm seeing this is a member of that pool, and all of the FEs were up at the time.  So far, I haven't found any pattern to these problems; it's a random time of day, usually at least 10 hours into running load.  This time things ran for almost 6 days before we started seeing it.  Everything is up to the latest CU, there were no CPU/memory spikes at this point, and the environment is otherwise stable.  What I don't really understand, though, is why we're seeing that message in the first place.  What conditions would cause the server/pool to identify itself as "down", and is there a way we can tweak this?

    Wednesday, January 11, 2017 7:56 PM

All replies

  • Hi Chris,

    Welcome to our forum.

    Does this error occur on other FE servers?
    Are there any other issues with SFB Server 2015 in your organization?

    For this issue, we suggest you update SFB to the latest CU to check if the issue persists:
    https://blogs.technet.microsoft.com/uclobby/2015/06/22/skype-for-business-2015-cumulative-update-list/

    If there are any questions or issues, please feel free to let me know.

    Best Regards,
    Jim Xu
    TechNet Community Support


    Please remember to mark the replies as answers if they helped.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Thursday, January 12, 2017 10:32 AM
  • There are two likely reasons: either the network connection to the servers is dropping, or the server is overloaded or performing poorly and not responding in time (sproc latency kind of issues).
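
    For example, something like this against the RTCLOCAL instance gives a quick read on the sproc-latency side; it is only a generic sketch (assuming the procedure plans are still cached), not an SFB-specific check:

        -- Rough, hypothetical check: list the slowest cached stored procedures
        -- on the instance. High average elapsed time on the rtc/rtcdyn
        -- procedures would point at sproc latency rather than the network.
        -- Elapsed times in this DMV are reported in microseconds.
        SELECT TOP (20)
            DB_NAME(ps.database_id)                           AS database_name,
            OBJECT_NAME(ps.object_id, ps.database_id)         AS procedure_name,
            ps.execution_count,
            ps.total_elapsed_time / ps.execution_count / 1000 AS avg_elapsed_ms,
            ps.max_elapsed_time / 1000                        AS max_elapsed_ms
        FROM sys.dm_exec_procedure_stats AS ps
        ORDER BY avg_elapsed_ms DESC;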

    Jayakumar K

    Thursday, January 12, 2017 1:46 PM
  • I do see the same errors on other FEs in the same pool.  Core components on all servers are up to date at 6.0.9319.272. No other SFB issues on this pool that I know of.

    I did have a similar theory about sproc latency being a problem, so I left a SQL Profiler trace running on RTCLOCAL capturing every operation that took over 500 ms.  No smoking guns there that I can see, but there are things like this:

    exec @result = sys.xp_userlock 0, @dbid, @DbPrincipal, @Resource, @mode, @owner, @LockTimeout (1,197 ms)

    exec @Status = rtcdyn.dbo.sp_getapplock @PublisherLockName, N'Exclusive', N'Transaction', 50000; (1,197 ms)

    These show up pretty constantly after it gets into this state.  Before this point (the previous 4 days' worth of traces), I'd get maybe one operation every hour or two that crossed this threshold, but after it, there were thousands.  There's nothing unusual that I can see directly before this error started happening, either.
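
    For reference, here is a minimal sketch of the kind of check to run while the latency is happening.  It's a hypothetical query (not from the trace above), but since sp_getapplock takes a SQL Server application lock, any session holding or queued on that rtcdyn publisher lock should show up like this:

        -- Hypothetical diagnostic query (not part of the captured trace):
        -- sessions holding or waiting on application locks in rtcdyn.
        SELECT
            l.resource_description,           -- application lock name
            l.request_mode,                   -- e.g. Exclusive
            l.request_status,                 -- GRANT = holder, WAIT = blocked
            l.request_session_id,
            r.wait_time AS wait_ms            -- how long a waiter has been queued
        FROM sys.dm_tran_locks AS l
        LEFT JOIN sys.dm_exec_requests AS r
            ON r.session_id = l.request_session_id
        WHERE l.resource_type = 'APPLICATION'
          AND DB_NAME(l.resource_database_id) = 'rtcdyn'
        ORDER BY l.request_status;

    One long-lived GRANT row with a pile of WAIT rows behind it would point at whichever session is holding the publisher lock, rather than at the network.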

    Thursday, January 12, 2017 1:59 PM
  • Hi Chris,

    For this issue, we suggest you re-enable the “Inter cluster routing” component to check if the issue persists; if it does, enable logging for the “Inter cluster routing” component and send us the log for troubleshooting.

    In addition, make sure port 3268 is not used by another application, because it is used by the FE servers to communicate with the Global Catalog (GC).

    If there are any questions or issues, please feel free to let me know.


    Best Regards,
    Jim Xu
    TechNet Community Support


    Please remember to mark the replies as answers if they helped.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    • Proposed as answer by Liinus Thursday, January 19, 2017 10:02 AM
    Wednesday, January 18, 2017 9:08 AM