DistributedCacheService Crashes

Question
-
Quick background: this is a SharePoint 2013 deployment where the Distributed Cache uses the AppFabric Caching Service. There are three Windows Server 2008 R2 virtual servers running on VMware. Two are dedicated web front end (WFE) servers; the third is a Search/App/Central Admin server. When Distributed Cache is enabled on any of the servers, the web site takes at least 6 seconds to load. If I disable Distributed Cache on all servers, it loads in milliseconds. This is a new environment with no data in it yet and only a couple of top-level sites.
When Distributed Cache is enabled on any of the servers, it starts briefly and then crashes. The Application log shows Event IDs 1000 and 1026, as noted here.
Log Name: Application
Source: Application Error
Date: 10/1/2013 1:06:08 PM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: [server name]
Description:
Faulting application name: DistributedCacheService.exe, version: 1.0.4632.0, time stamp: 0x4eafeccf
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18015, time stamp: 0x50b8479b
Exception code: 0xe0434352
Fault offset: 0x0000000000009e5d
Faulting process id: 0xbd4
Faulting application start time: 0x01cebed0e2fef851
Faulting application path: c:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 2518a099-2ac4-11e3-ae94-0050568c62e6
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Application Error" />
<EventID Qualifiers="0">1000</EventID>
<Level>2</Level>
<Task>100</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2013-10-01T18:06:08.000000000Z" />
<EventRecordID>103058</EventRecordID>
<Channel>Application</Channel>
<Computer>[server name]</Computer>
<Security />
</System>
<EventData>
<Data>DistributedCacheService.exe</Data>
<Data>1.0.4632.0</Data>
<Data>4eafeccf</Data>
<Data>KERNELBASE.dll</Data>
<Data>6.1.7601.18015</Data>
<Data>50b8479b</Data>
<Data>e0434352</Data>
<Data>0000000000009e5d</Data>
<Data>bd4</Data>
<Data>01cebed0e2fef851</Data>
<Data>c:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe</Data>
<Data>C:\Windows\system32\KERNELBASE.dll</Data>
<Data>2518a099-2ac4-11e3-ae94-0050568c62e6</Data>
</EventData>
</Event>
Log Name: Application
Source: .NET Runtime
Date: 10/1/2013 1:06:06 PM
Event ID: 1026
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: [server name]
Description:
Application: DistributedCacheService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ArgumentException
Stack:
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(System.Threading.WaitCallback, System.Object)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name=".NET Runtime" />
<EventID Qualifiers="0">1026</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2013-10-01T18:06:06.000000000Z" />
<EventRecordID>103057</EventRecordID>
<Channel>Application</Channel>
<Computer>[server name]</Computer>
<Security />
</System>
<EventData>
<Data>Application: DistributedCacheService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ArgumentException
Stack:
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(System.Threading.WaitCallback, System.Object)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()
</Data>
</EventData>
</Event>
If I drill down to the Microsoft-Windows-Application Server-System Services/Admin log, I see Event IDs 6 and 111.
Log Name: Microsoft-Windows-Application Server-System Services/Admin
Source: Microsoft-Windows Server AppFabric Caching
Date: 10/1/2013 1:10:54 PM
Event ID: 111
Task Category: (1)
Level: Error
Keywords:
User: [Domain\User Account]
Computer: [Server Name]
Description:
AppFabric Caching service crashed with exception {System.ArgumentException: An entry with the same key already exists.
at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
at Microsoft.Fabric.Data.PartitionTable.UpdateEntry(LookupTableEntry newEntry)
at Microsoft.Fabric.Data.PM.PMPartitionTable..ctor(PartitionManager pm, IList`1 partitions, Int64 savedVersion, Object lockObject)
at Microsoft.Fabric.Data.PM.PartitionCache..ctor(PartitionManager pm, IPartitionManagerStore pmStore, LoadTable loadTable, IList`1 partitions, Int64 savedLookupVersion)
at Microsoft.Fabric.Data.PM.PartitionManager.ProcessLoadPM(Object state)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(WaitCallback callback, Object state)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()}. Check debug log for more information
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows Server AppFabric Caching" Guid="{A77DCF21-545F-4191-B3D0-C396CF2683F2}" />
<EventID>111</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>1</Task>
<Opcode>111</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2013-10-01T18:10:54.239292100Z" />
<EventRecordID>13600</EventRecordID>
<Correlation />
<Execution ProcessID="4000" ThreadID="4144" />
<Channel>Microsoft-Windows-Application Server-System Services/Admin</Channel>
<Computer>[Server Name]</Computer>
<Security UserID="S-1-5-21-2193036465-2839809817-2807360763-7132" />
</System>
<EventData>
<Data Name="Source">AppFabricCachingService.Crash</Data>
<Data Name="Param">System.ArgumentException: An entry with the same key already exists.
at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
at Microsoft.Fabric.Data.PartitionTable.UpdateEntry(LookupTableEntry newEntry)
at Microsoft.Fabric.Data.PM.PMPartitionTable..ctor(PartitionManager pm, IList`1 partitions, Int64 savedVersion, Object lockObject)
at Microsoft.Fabric.Data.PM.PartitionCache..ctor(PartitionManager pm, IPartitionManagerStore pmStore, LoadTable loadTable, IList`1 partitions, Int64 savedLookupVersion)
at Microsoft.Fabric.Data.PM.PartitionManager.ProcessLoadPM(Object state)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.Invoke(WaitCallback callback, Object state)
at Microsoft.Fabric.Common.IOCompletionPortWorkQueue.WorkerThreadStart()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()</Data>
</EventData>
</Event>
Log Name: Microsoft-Windows-Application Server-System Services/Admin
Source: Microsoft-Windows-Fabric
Date: 10/1/2013 1:06:06 PM
Event ID: 6
Task Category: None
Level: Warning
Keywords:
User: [Domain\User Account]
Computer: [Server Name]
Description:
{e9d6998000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> Microsoft.Fabric.Federation.RoutingException: The target node explicitly aborted the operation
--- End of inner exception stack trace ---
at Microsoft.Fabric.Common.OperationContext.End()
at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)}
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-Fabric" Guid="{751C9DC0-4F51-44F6-920A-A620C7C2D13E}" />
<EventID>6</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2013-10-01T18:06:06.981761700Z" />
<EventRecordID>13597</EventRecordID>
<Correlation />
<Execution ProcessID="3028" ThreadID="3492" />
<Channel>Microsoft-Windows-Application Server-System Services/Admin</Channel>
<Computer>[Server Name]</Computer>
<Security UserID="S-1-5-21-2193036465-2839809817-2807360763-7132" />
</System>
<EventData>
<Data Name="param1">e9d6998000000000000000000000000</Data>
<Data Name="param2">Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> Microsoft.Fabric.Federation.RoutingException: The target node explicitly aborted the operation
--- End of inner exception stack trace ---
at Microsoft.Fabric.Common.OperationContext.End()
at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)</Data>
</EventData>
</Event>
Any ideas? I think the biggest clue is the Event ID 111 message "An entry with the same key already exists", but searches don't seem to turn up anything about it. I have tried all the steps outlined in these sites...
http://technet.microsoft.com/en-us/library/jj219613.aspx#changesvcacct
Tuesday, October 1, 2013 7:44 PM
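For readers unfamiliar with the message in Event ID 111: the stack trace shows it being raised by SortedDictionary.Add, called from PartitionTable.UpdateEntry, which means the same key was added twice while the partition table was being built. The thread does not establish what produced the duplicate. The sketch below only reproduces the exception type in isolation; the key and value names are hypothetical and have nothing to do with AppFabric's real partition data.

```powershell
# Illustration only: a SortedDictionary rejects a second Add with the same key.
# 'partition-0', 'hostA' and 'hostB' are made-up values for the demonstration.
$table = New-Object 'System.Collections.Generic.SortedDictionary[string,string]'
$table.Add('partition-0', 'hostA')

try {
    $table.Add('partition-0', 'hostB')   # same key a second time
}
catch {
    # PowerShell wraps .NET method exceptions; GetBaseException() returns the original.
    $inner = $_.Exception.GetBaseException()
    $inner.GetType().FullName
    $inner.Message
}
```

Run in any PowerShell session, this should report a System.ArgumentException, the same exception type recorded in Event ID 1026 above.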
Answers
-
- Marked as answer by sennister12 Thursday, October 3, 2013 1:31 PM
Wednesday, October 2, 2013 8:06 PM
All replies
-
Did you repair the DC or not? Did you check the status of the DC, and what was the output?
Use-CacheCluster
Get-CacheHost
What about the service account under which the DC is running? Try enabling DC on the WFEs only (you mentioned WFEs; are they load balanced?).
http://wscheema.com/blog/_layouts/15/start.aspx#/Lists/Posts/Post.aspx?ID=9
Thanks -WS SharePoint administrator, MCITP(SharePoint 2010, 2013) Blog: http://wscheema.com/blog
Tuesday, October 1, 2013 9:24 PM -
I guess I didn't directly mention some of these things.
I did repair the DC following the steps in the links I provided in the first post.
PS C:\Windows\system32> get-cachehost
HostName : CachePort       Service Name            Service Status Version Info
--------------------       ------------            -------------- ------------
[FQ Computername]:22233    AppFabricCachingService DOWN           3 [3,3][1,3]
As for the service account for AppFabric: I want to say that by default it is set to run under the Farm Admin account. I should have looked at it before changing anything, but it was already broken before I followed one of my first links, which walked me through configuring it to run under a service account that is set up as a managed account in SharePoint. I made sure this managed account is a local admin on the box, but that made no difference.
The two WFEs are load balanced via a Cisco ACE4710 network load balancer. I did see a comment somewhere (it might have been in one of the links I first posted, but I have looked at a lot of material) saying that if you use a HOSTS file, the FQDN of the machine should be defined in it. That applies here, so I added it, but no change.
Windows Firewall is disabled on all servers and there is no locally installed third-party firewall.
Initially DC was enabled in SharePoint on all 3 servers. It has been shut down on all of them. Since Search, along with pretty much all the apps, is running on the first server, I don't plan on running DC there. For now I was trying to get it working on just one of the WFEs but have had no luck. Both show the same errors and behavior.
I have also manually allocated 1GB of RAM to the DC to see if that would help.
Update-SPDistributedCacheSize -CacheSizeInMB 1024
In SharePoint, the DC does show as started if I attempt to start it on any given server. But without the AppFabric back end, I don't see how it can do anything, and my long load times are likely the DC waiting until it hits a timeout.
There really isn't anything in this environment as of now. The farm is nothing more than a couple of site collections and maybe 4 sites. The only content at this point is links between the sites and a couple of custom web parts.
Thanks
Wednesday, October 2, 2013 12:31 PM -
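The mismatch described above (SharePoint reporting the Distributed Cache as started while AppFabric reports the host as DOWN) can be checked side by side. This is a minimal sketch, assuming it is run in an elevated SharePoint 2013 Management Shell on a cache host where the AppFabric cmdlets are available; the 'Distributed Cache' TypeName filter is an assumption about how the service instance is labeled in the farm.

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# What SharePoint believes: one service instance per server, with its status.
# Assumption: the instance is listed with TypeName 'Distributed Cache'.
Get-SPServiceInstance |
    Where-Object { $_.TypeName -eq 'Distributed Cache' } |
    Select-Object Server, Status

# What AppFabric reports: connect to the cache cluster, then list its hosts.
Use-CacheCluster
Get-CacheHost
```

If the first command shows Online while Get-CacheHost shows DOWN or UNKNOWN, the SharePoint status is only reflecting the provisioning state, not whether the underlying AppFabric Caching Service stayed running.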
Check this one: http://msdn.microsoft.com/en-us/library/ee790821(v=azure.10).aspx
I think the CacheHost section will address your problem.
Thanks
Thanks -WS SharePoint administrator, MCITP(SharePoint 2010, 2013) Blog: http://wscheema.com/blog
Wednesday, October 2, 2013 1:57 PM -
Hi,
In my case it was a DNS problem. Initially the server was installed with the name "S-SPAPP1", and the SharePoint 2013 farm was working fine.
Later I changed the name of the server to "S-SPAPP-P01", but the Distributed Cache configuration still held the old name, and because of the name change the service was unable to resolve it.
The quick fix was to get the server names from the configuration using Get-CacheHost:
HostName : CachePort       Service Name            Service Status Version Info
--------------------       ------------            -------------- ------------
S-SPAPP1.XXX.com:22233     AppFabricCachingService UNKNOWN        0 [0,0][0,0]
S-SPWEB1.XXX.com:22233     AppFabricCachingService UNKNOWN        0 [0,0][0,0]
I then created DNS/hosts file entries for both "S-SPAPP1" and "S-SPWEB1".
After that I was able to start the "AppFabric Caching Service".
Regards,
Ramz
Tuesday, May 6, 2014 7:10 AM
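The name-resolution check described in the reply above can be scripted rather than done by eye. This is a rough sketch, assuming it runs on a cache host with the AppFabric cmdlets loaded, and assuming the objects returned by Get-CacheHost expose the host name in a HostName property (which the column header in the output suggests). The DNS lookup uses the standard .NET System.Net.Dns class.

```powershell
# Connect to the cache cluster configuration first.
Use-CacheCluster

# For every host registered in the cluster, test whether its name resolves.
foreach ($cacheHost in Get-CacheHost) {
    # Assumption: the registered name is exposed as the HostName property.
    $name = $cacheHost.HostName
    try {
        $addresses = [System.Net.Dns]::GetHostAddresses($name)
        "{0} resolves to {1}" -f $name, ($addresses -join ', ')
    }
    catch {
        "{0} does NOT resolve: {1}" -f $name, $_.Exception.GetBaseException().Message
    }
}
```

Any host reported as not resolving is a candidate for the DNS or hosts file entry described above, or for being removed and re-registered under its current name.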