DistributedCacheService Crashes

  • Question

  • Quick background: this is a SharePoint 2013 deployment where the Distributed Cache uses the AppFabric Caching Service. There are three Windows Server 2008 R2 virtual servers running on VMware. Two are dedicated web front end (WFE) servers; the third is a Search/App/Central Admin server. When Distributed Cache is enabled on any of the servers, the web site takes at least 6 seconds to load. If I disable Distributed Cache on all servers, it loads in milliseconds. This is a new environment with no data yet and only a couple of top-level sites.

    When Distributed Cache is enabled on any of the servers, it starts briefly and then crashes. The Application log records Event IDs 1000 and 1026, as shown here.

    Log Name:      Application
    Source:        Application Error
    Date:          10/1/2013 1:06:08 PM
    Event ID:      1000
    Task Category: (100)
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      [server name]
    Description:
    Faulting application name: DistributedCacheService.exe, version: 1.0.4632.0, time stamp: 0x4eafeccf
    Faulting module name: KERNELBASE.dll, version: 6.1.7601.18015, time stamp: 0x50b8479b
    Exception code: 0xe0434352
    Fault offset: 0x0000000000009e5d
    Faulting process id: 0xbd4
    Faulting application start time: 0x01cebed0e2fef851
    Faulting application path: c:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe
    Faulting module path: C:\Windows\system32\KERNELBASE.dll
    Report Id: 2518a099-2ac4-11e3-ae94-0050568c62e6
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Application Error" />
        <EventID Qualifiers="0">1000</EventID>
        <Level>2</Level>
        <Task>100</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2013-10-01T18:06:08.000000000Z" />
        <EventRecordID>103058</EventRecordID>
        <Channel>Application</Channel>
        <Computer>[server name]</Computer>
        <Security />
      </System>
      <EventData>
        <Data>DistributedCacheService.exe</Data>
        <Data>1.0.4632.0</Data>
        <Data>4eafeccf</Data>
        <Data>KERNELBASE.dll</Data>
        <Data>6.1.7601.18015</Data>
        <Data>50b8479b</Data>
        <Data>e0434352</Data>
        <Data>0000000000009e5d</Data>
        <Data>bd4</Data>
        <Data>01cebed0e2fef851</Data>
        <Data>c:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe</Data>
        <Data>C:\Windows\system32\KERNELBASE.dll</Data>
        <Data>2518a099-2ac4-11e3-ae94-0050568c62e6</Data>
      </EventData>
    </Event>
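The "Faulting application start time" field in the Event ID 1000 record looks like a Windows FILETIME written in hex (100-nanosecond ticks since 1601-01-01 UTC). That interpretation is an assumption, not something stated in the post. Under that assumption, a minimal sketch for decoding it (the function name `filetime_hex_to_utc` is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Assumption: the value is a FILETIME, i.e. a count of 100-nanosecond
# ticks since 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_hex_to_utc(value: str) -> datetime:
    """Convert a hex FILETIME string (with or without 0x) to a UTC datetime."""
    ticks = int(value, 16)
    # Ten ticks per microsecond; sub-microsecond precision is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Value copied from the Event ID 1000 record above.
start = filetime_hex_to_utc("0x01cebed0e2fef851")
print(start.isoformat())
```

For the value above this gives a start time of about 18:06:01 UTC on 2013-10-01, a few seconds before the 1026 and 1000 events (18:06:06 and 18:06:08 UTC), which is consistent with the service starting briefly and then crashing.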



    Log Name:      Application
    Source:        .NET Runtime
    Date:          10/1/2013 1:06:06 PM
    Event ID:      1026
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:     [server name]
    Description:
    Application: DistributedCacheService.exe
    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.ArgumentException
    Stack:
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(System.Threading.WaitCallback, System.Object)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
       at System.Threading.ThreadHelper.ThreadStart()

    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name=".NET Runtime" />
        <EventID Qualifiers="0">1026</EventID>
        <Level>2</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2013-10-01T18:06:06.000000000Z" />
        <EventRecordID>103057</EventRecordID>
        <Channel>Application</Channel>
        <Computer>[server name]</Computer>
        <Security />
      </System>
      <EventData>
        <Data>Application: DistributedCacheService.exe
    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.ArgumentException
    Stack:
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(System.Threading.WaitCallback, System.Object)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
       at System.Threading.ThreadHelper.ThreadStart()
    </Data>
      </EventData>
    </Event>





    If I drill down to the Microsoft-Windows-Application Server-System Services/Admin log, I see Event IDs 6 and 111.

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows Server AppFabric Caching
    Date:          10/1/2013 1:10:54 PM
    Event ID:      111
    Task Category: (1)
    Level:         Error
    Keywords:      
    User:          [Domain\User Account]
    Computer:      [Server Name]
    Description:
    AppFabric Caching service crashed with exception {System.ArgumentException: An entry with the same key already exists.
       at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
       at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
       at Microsoft.Fabric.Data.PartitionTable.UpdateEntry(LookupTableEntry newEntry)
       at Microsoft.Fabric.Data.PM.PMPartitionTable..ctor(PartitionManager pm, IList`1 partitions, Int64 savedVersion, Object lockObject)
       at Microsoft.Fabric.Data.PM.PartitionCache..ctor(PartitionManager pm, IPartitionManagerStore pmStore, LoadTable loadTable, IList`1 partitions, Int64 savedLookupVersion)
       at Microsoft.Fabric.Data.PM.PartitionManager.ProcessLoadPM(Object state)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(WaitCallback callback, Object state)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()}. Check debug log for more information
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows Server AppFabric Caching" Guid="{A77DCF21-545F-4191-B3D0-C396CF2683F2}" />
        <EventID>111</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>1</Task>
        <Opcode>111</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2013-10-01T18:10:54.239292100Z" />
        <EventRecordID>13600</EventRecordID>
        <Correlation />
        <Execution ProcessID="4000" ThreadID="4144" />
        <Channel>Microsoft-Windows-Application Server-System Services/Admin</Channel>
        <Computer>[Server Name]</Computer>
        <Security UserID="S-1-5-21-2193036465-2839809817-2807360763-7132" />
      </System>
      <EventData>
        <Data Name="Source">AppFabricCachingService.Crash</Data>
        <Data Name="Param">System.ArgumentException: An entry with the same key already exists.
       at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
       at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
       at Microsoft.Fabric.Data.PartitionTable.UpdateEntry(LookupTableEntry newEntry)
       at Microsoft.Fabric.Data.PM.PMPartitionTable..ctor(PartitionManager pm, IList`1 partitions, Int64 savedVersion, Object lockObject)
       at Microsoft.Fabric.Data.PM.PartitionCache..ctor(PartitionManager pm, IPartitionManagerStore pmStore, LoadTable loadTable, IList`1 partitions, Int64 savedLookupVersion)
       at Microsoft.Fabric.Data.PM.PartitionManager.ProcessLoadPM(Object state)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(WaitCallback callback, Object state)
       at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()</Data>
      </EventData>
    </Event>

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          10/1/2013 1:06:06 PM
    Event ID:      6
    Task Category: None
    Level:         Warning
    Keywords:      
    User:          [Domain\User Account]
    Computer:      [Server Name]
    Description:
    {e9d6998000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> Microsoft.Fabric.Federation.RoutingException: The target node explicitly aborted the operation
       --- End of inner exception stack trace ---
       at Microsoft.Fabric.Common.OperationContext.End()
       at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
       at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)}
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Fabric" Guid="{751C9DC0-4F51-44F6-920A-A620C7C2D13E}" />
        <EventID>6</EventID>
        <Version>0</Version>
        <Level>3</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2013-10-01T18:06:06.981761700Z" />
        <EventRecordID>13597</EventRecordID>
        <Correlation />
        <Execution ProcessID="3028" ThreadID="3492" />
        <Channel>Microsoft-Windows-Application Server-System Services/Admin</Channel>
        <Computer>[Server Name]</Computer>
        <Security UserID="S-1-5-21-2193036465-2839809817-2807360763-7132" />
      </System>
      <EventData>
        <Data Name="param1">e9d6998000000000000000000000000</Data>
        <Data Name="param2">Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---&gt; Microsoft.Fabric.Federation.RoutingException: The target node explicitly aborted the operation
       --- End of inner exception stack trace ---
       at Microsoft.Fabric.Common.OperationContext.End()
       at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
       at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)</Data>
      </EventData>
    </Event>
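To compare the event records above without reading each XML block by hand, a small sketch like the following pulls out the provider, event ID, timestamp, and the first line of the last Data element. The function name `summarize_event` and the trimmed `SAMPLE` (cut down from the Event ID 111 record) are illustrative and not part of the original post.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Event XML shown in the logs above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the Event ID 111 record, for illustration only.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows Server AppFabric Caching" />
    <EventID>111</EventID>
    <TimeCreated SystemTime="2013-10-01T18:10:54.239292100Z" />
  </System>
  <EventData>
    <Data Name="Source">AppFabricCachingService.Crash</Data>
    <Data Name="Param">System.ArgumentException: An entry with the same key already exists.
   at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text: str) -> dict:
    """Return provider, event ID, UTC timestamp, and first line of the last Data value."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    data = [(d.text or "").strip() for d in root.findall("e:EventData/e:Data", NS)]
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "time_utc": system.find("e:TimeCreated", NS).get("SystemTime"),
        "message": data[-1].splitlines()[0] if data and data[-1] else "",
    }


print(summarize_event(SAMPLE))
```

Running this over each exported record and sorting by `time_utc` gives a compact timeline of the warning (Event ID 6) and the crashes (Event IDs 1026, 1000, and 111).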



    Any ideas? I think the biggest clue is Event ID 111, "An entry with the same key already exists", but searches don't turn up anything about it.

    I have tried all the steps outlined in these sites...

    http://blogs.technet.com/b/uktechnet/archive/2013/05/07/guest-post-distributed-cache-service-in-sharepoint-2013.aspx

    http://technet.microsoft.com/en-us/library/jj219613.aspx#changesvcacct

    Tuesday, October 1, 2013 7:44 PM

All replies

  • Did you repair the DC or not? Did you check the status of the DC, and what was the output?

    Use-CacheCluster

    Get-CacheHost

    What about the service account under which the DC is running? Try enabling the DC on the WFEs only. (You mentioned WFEs; are they load balanced?)

    http://wscheema.com/blog/_layouts/15/start.aspx#/Lists/Posts/Post.aspx?ID=9

    thanks


    Thanks -WS SharePoint administrator, MCITP(SharePoint 2010, 2013) Blog: http://wscheema.com/blog

    Tuesday, October 1, 2013 9:24 PM
  • I guess I didn't directly mention some of these things.

    I did repair the DC following the steps in the links I provided in the first post. 

    PS C:\Windows\system32> get-cachehost

    HostName : CachePort      Service Name            Service Status Version Info
    --------------------      ------------            -------------- ------------
    [FQ Computername]:22233   AppFabricCachingService DOWN           3 [3,3][1,3]
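When the Get-CacheHost output is captured to text, for example to compare several servers, the data rows can be picked out with a short script. This is a sketch that assumes rows follow the `host:port  service  STATUS  version` layout shown above; the names `CacheHost`, `parse_cache_hosts`, and `not_up` are illustrative.

```python
import re
from typing import NamedTuple


class CacheHost(NamedTuple):
    host: str
    port: int
    service: str
    status: str
    version: str


# Data rows look like "<host>:<port>  <service>  <STATUS>  <version info>".
# The header and separator rows have no ":<digits>" and so do not match.
ROW = re.compile(
    r"^\s*(?P<host>.+?):(?P<port>\d+)\s+(?P<service>\S+)\s+"
    r"(?P<status>[A-Z]+)\s+(?P<version>.+?)\s*$"
)


def parse_cache_hosts(text: str) -> list[CacheHost]:
    """Extract one CacheHost per data row of captured Get-CacheHost output."""
    hosts = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hosts.append(
                CacheHost(m["host"], int(m["port"]), m["service"], m["status"], m["version"])
            )
    return hosts


def not_up(hosts: list[CacheHost]) -> list[CacheHost]:
    """Return the hosts whose service status is anything other than UP."""
    return [h for h in hosts if h.status != "UP"]


# Output as posted above, with the server name redacted by the author.
SAMPLE = """
HostName : CachePort              Service Name            Service Status Version Info
--------------------              ------------            -------------- ------

[FQ Computername]:22233 AppFabricCachingService DOWN           3 [3,3][1,3]
"""

for h in not_up(parse_cache_hosts(SAMPLE)):
    print(f"{h.host}:{h.port} is {h.status}")
```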

    As for the service account for AppFabric: I believe that by default it runs under the farm admin account. I should have checked before changing anything, but the service was already broken before I followed one of my first links, which walked me through configuring it to run under a service account set up as a managed account in SharePoint. I made sure this managed account is a local admin on the box, but that made no difference.

    The two WFEs are load balanced via a Cisco ACE4710 network load balancer. I did see a comment somewhere (it might have been in one of the two links I first posted, but I have looked at a lot of material) saying that if you use a hosts file, the FQDN of the machine should be defined in it. That applies here, so I added it, but there was no change.

    Windows Firewall is disabled on all servers and there is no locally installed 3rd party FW. 

    Initially DC was enabled in SharePoint on all 3 servers. It has since been shut down on all of them. Since Search, along with pretty much all the apps, is running on the first server, I don't plan on running DC there. For now I was trying to get it working on just one of the WFEs but have had no luck. Both show the same errors and behavior.

    I have also manually allocated 1GB of RAM to the DC to see if that would help.

    Update-SPDistributedCacheSize -CacheSizeInMB 1024

    In SharePoint, the DC shows as started when I start it on any given server. But without the AppFabric back end I don't see how it can do anything, and my long load times are likely the DC waiting until it hits a timeout.

    There really isn't anything in this environment as of now. The farm is nothing more than a couple of site collections and maybe 4 sites. The only content at this point is links between the sites and a couple of custom web parts.

    Thanks


    Wednesday, October 2, 2013 12:31 PM
  • Check this one: http://msdn.microsoft.com/en-us/library/ee790821(v=azure.10).aspx

    I think the CacheHost section will address your problem.

    thanks



    Wednesday, October 2, 2013 1:57 PM
  • Looks like we might have found a solution to this issue. 

    I performed the steps in this article, and now the AppFabric service starts and stays up.

    http://codebender.denniland.com/sharepoint-server-2013-issue-appfabric-distributed-cache-service-crashes/

    Need to test a bit more to make sure it is resolved but so far it seems to have addressed the issue we were having with AppFabric.

    • Marked as answer by sennister12 Thursday, October 3, 2013 1:31 PM
    Wednesday, October 2, 2013 8:06 PM
  • Hi,

    In my case it was a DNS problem. The server was initially installed with the name "S-SPAPP1", and the SharePoint 2013 farm was working fine.

    Later I changed the server name to "S-SPAPP-P01", but the Distributed Cache configuration still held the old name, so after the rename the service could not resolve it.

    The quick fix was to get the server names from the config using Get-CacheHost:

    HostName : CachePort        Service Name            Service Status Version Info
    --------------------        ------------            -------------- ------------
    S-SPAPP1.XXX.com:22233      AppFabricCachingService UNKNOWN        0 [0,0][0,0]
    S-SPWEB1.XXX.com:22233      AppFabricCachingService UNKNOWN        0 [0,0][0,0]

    I then created DNS/hosts file entries for both "S-SPAPP1" and "S-SPWEB1".

    After that I was able to start the "AppFabric Caching Service".

    Regards,

    Ramz

    Tuesday, May 6, 2014 7:10 AM
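The name-resolution check described in the last reply can be sketched as a small script that tries to resolve each host name reported by Get-CacheHost and lists the ones that fail. The function names and the injectable `resolve` parameter are illustrative; the host names passed in would be whatever your own Get-CacheHost output shows.

```python
import socket
from typing import Callable, Iterable


def default_resolver(name: str) -> list[str]:
    """Resolve a host name to its unique IP addresses using the system resolver."""
    infos = socket.getaddrinfo(name, None)
    return sorted({info[4][0] for info in infos})


def check_resolution(
    names: Iterable[str],
    resolve: Callable[[str], list[str]] = default_resolver,
) -> dict[str, list[str]]:
    """Map each name to its resolved addresses; an empty list means it did not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = []
    return results


def unresolved(results: dict[str, list[str]]) -> list[str]:
    """Return the names that did not resolve to any address."""
    return [name for name, addrs in results.items() if not addrs]
```

Calling `unresolved(check_resolution([...]))` with the names from the HostName column would list any cache host that needs a DNS record or hosts file entry before the service can find it.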