ADFS Service Congestion (EventId 230 and 222)

  • Question

  • Hi,

    This question has been asked before in this forum, but I was not able to find a clear answer, so I'd like to see if someone can shed some light on it or point us in the right direction.

    Currently we have 2 ADFS environments:

    • The Production farm (ADFS 3): 3 ADFS servers and 3 proxies, all running on Windows Server 2012 R2. About 50 RPs.
    • A new ADFS farm we are building in Azure: 2 ADFS servers and 2 proxies running on Server 2016. The idea is to replace the PROD farm with this new one in Azure. We already have a Domain Controller in Azure too.

    The thing is that, every now and then (more frequently than we would like), the service suffers from service congestion and, to make it worse, during our last 2 incidents both farms (prod and the one in Azure) suffered from congestion at the same time and it stopped at the same time (it lasted for about 2 hours). That's causing intermittent authentication errors.

    It's a bit surprising, because we built the one in Azure precisely to overcome latency on our on-prem farm, even though our networking team was not able to detect latency or congestion on our LAN. We are not convinced by that, due to the lack of robust metrics on their side.

    We have not modified the proxy config file to raise the congestion algorithm threshold. First we would like to see why this is happening: looking at the perfmon data collector sets we have (collecting OS and ADFS related metrics), we don't see any issue with the VMs' performance and no spikes in SSO requests either. There's just a spike in latency, but we are not able to see why it happens.

    Moreover, the platform in Azure should be running completely in the Azure cloud, so the interaction with our on-prem services should be really low. I mean, we even set up a Domain Controller in the cloud to avoid auth requests flowing into our on-prem network and, looking at the network traces, all auths seem to be forwarded to the proper DC. Actually, on our PROD farm we have about 50 federated apps, but on the one in Azure we have just one experimental RP. It's true that we are putting some pressure on that RP by generating auth requests from a script every minute, but that shouldn't be that hard for that ADFS farm.

    Any clue on what could be going on? I'm even starting to think that we have a hidden performance issue hitting our DCs, but looking at our perf metrics on those DCs (basic ones: memory, CPU, IOPS) I don't see anything. Maybe we need to collect deeper AD metrics.

    Any suggestion is welcome.

    Thanks.

    Monday, January 16, 2017 11:21 AM

All replies

  • Hi David,

    Can you mention something about the 50 RPs? Are there 50 active RPs, or is it just a couple of key applications? Are they web apps, or are there also web services RPs involved? The fact that this occurs on both the on-prem and cloud farms tends to indicate something awry from a user or RP standpoint rather than the environments... also, you're running on different OS versions, which further reinforces that notion. How many users are you servicing, btw?


    http://blog.auth360.net

    Tuesday, January 17, 2017 6:53 PM
  • Hi Milo,

    Thanks for your reply. Yes, we have about 50 RPs, all active, and some of them, like the ones for O365, ServiceNow and a few others, potentially provide service to all the users in the company (about 8000 users). I mean, among these RPs we have a few that are not intensively used (some sandbox and test RPs), but for the most part all of them are active. Mostly they are web apps.

    On the other hand, in the Azure farm we have just one RP, which we are using for testing on that farm.

    Initially we thought the problem could be in the ADFS farm. We collected metrics and didn't see any performance issue, but we went ahead anyway and added an extra Web Application Proxy and an extra ADFS server. The situation didn't improve.

    We are constantly monitoring both farms (the one on-prem and the one in Azure): we have some scripts performing WS-Trust SSO requests every minute, and one of the dev teams is executing passive SSO requests from a cloud-based monitoring solution every 5 minutes. Sometimes SSO fails for our scripts, sometimes it fails only for the cloud-based monitoring, and sometimes it fails for both. Normally it fails due to a timeout: the client requesting SSO waits for one minute and, if no token is received within that minute, it throws a timeout exception.
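
    Conceptually, the active check works like the sketch below (heavily simplified: the farm name and URL are placeholders, and the real script performs a full WS-Trust token request rather than a plain GET; the 60-second client timeout is the relevant part):

    # Simplified sketch of our per-minute SSO latency probe (not the real script).
    # It only times an HTTPS round trip against an assumed ADFS endpoint; the real
    # monitor builds and sends a complete WS-Trust token request.
    import time
    import requests  # assumes the requests package is installed

    FS_HOST = "https://fs.example.com"  # placeholder for the real farm name
    PROBE_URL = FS_HOST + "/federationmetadata/2007-06/federationmetadata.xml"
    TIMEOUT_SECONDS = 60  # the client gives up after one minute

    def probe_once():
        start = time.monotonic()
        try:
            response = requests.get(PROBE_URL, timeout=TIMEOUT_SECONDS)
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"OK  status={response.status_code} latency={elapsed_ms:.0f} ms")
        except requests.exceptions.Timeout:
            print(f"FAIL timed out after {TIMEOUT_SECONDS} s (no response received)")
        except requests.exceptions.RequestException as exc:
            print(f"FAIL {exc}")

    if __name__ == "__main__":
        while True:
            probe_once()
            time.sleep(60)  # one check per minute, like our monitoring scripts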

    I don't know... just yesterday the dev team detected another 8-minute outage. It only affected the cloud-based monitor; I don't see any alert on our on-prem monitoring system. I've just run out of ideas: I don't know what to look at. Another common point of failure could be our AD, I don't know, like overloaded DCs... but we are collecting quite a good amount of AD related metrics and we don't see anything alarming.

    btw... we have an open support case with Microsoft, but no luck so far: just some diagnostics collection and a few links on similar issues... that's all.

    Thanks!

    Wednesday, January 18, 2017 11:26 AM
  • Hmm... at one customer, an in-house developed (overzealous) WS-Trust script-based monitor turned out to be the source of performance problems. Not saying that's the case here, but I'd look for common ground between the two environments. Do you see the problem reported on both the cloud-based passive monitor and the active (WS-Trust) one?

    You also mentioned that you have experienced the performance issues simultaneously in Azure and on-premises?


    http://blog.auth360.net

    Wednesday, January 18, 2017 8:16 PM
  • Hiya,

    Just stating some of the obvious: I presume you are collecting performance data on the parameters below, just to make sure that there are no spikes or other abnormalities happening, or generally to find a pattern in your outages.

    https://technet.microsoft.com/en-us/library/ff627833(v=ws.11).aspx
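
    For instance, something along these lines could sample a few of those counters while the problem is happening (just a rough sketch; the AD FS counter names vary between versions, so first check what Perfmon actually shows on your servers):

    # Rough sketch: sample a few counters with typeperf and write them to CSV.
    # The "\AD FS\..." counter name is an assumption; confirm it in Perfmon first.
    import subprocess

    counters = [
        r"\AD FS\Token Requests/sec",            # assumed name; varies by ADFS version
        r"\Processor(_Total)\% Processor Time",
        r"\Memory\Available MBytes",
    ]

    # Sample every 15 seconds, 240 samples (one hour), overwrite the output file.
    subprocess.run(
        ["typeperf", *counters, "-si", "15", "-sc", "240",
         "-f", "CSV", "-o", "adfs_counters.csv", "-y"],
        check=True,
    )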


    Also, what are the errors in the ADFS logs? Users are experiencing a timeout; however, the ADFS logs might give some more detailed information on that.
    Thursday, January 19, 2017 9:29 AM
  • Hi Milo,

    The WS-Trust code just connects to the on-prem farm; I still need to modify it to trigger token requests against the Azure ADFS farm. Even so, the problem happens on both farms and, the last two times it happened, latency increased in both farms at the same time and, when latency got back to normal values (between 500 and 1000 ms), it recovered at the same time in both farms. That looks suspicious to me.

    Apart from that, the WS-Trust check was implemented after we detected the issue; the idea was to verify whether we were really having issues or not, as only one team was complaining about it. So the issue was detected before the monitoring existed.

    Today I'll have a look at the articles that the Microsoft guys sent me. They speak about the congestion algorithm but, to be honest, I have the impression that modifying the congestion algorithm is just kicking the can down the road. Actually, sometimes we don't even see the congestion algorithm kicking in, because even when the latency increases up to 10ms, the number of concurrent connections remains below the threshold. Even so, clients keep waiting for tokens to be received and, after 60 seconds, they time out the connection.

    Just to give you more context: I've removed most of the HW metrics and left just 3 ADFS counters. The charts below show, in blue, the request latency (you'll see an important spike), in yellow the rate of requests/sec (nothing abnormal), and in green the outstanding ADFS requests (some spikes, but I guess that's just because of the latency spike).

    The spike in latency lasted for about two and a half hours, but I don't see anything especially alarming in the metrics. At 9 pm we had more requests per second, but even then it was not even 1 request/sec.

    Maybe you can see something clarifying there.

    Thursday, January 19, 2017 12:04 PM
  • Yes, we are collecting those metrics. I'm not able to detect any spike that gives me a clear indication. Sometimes we have spikes, but within normal values; nothing outside normal day-to-day operations. In any case, I'll check it again.

    I posted some charts in my last message. Maybe you can see something alarming there.

    In the event log I see mostly Event ID 222 and sometimes Event ID 230, indicating congestion. Actually, in the ADFS farm in Azure, as we have just one experimental RP, those are all the entries we have.
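
    For what it's worth, this is roughly how I pull those entries out for review (just a sketch; "AD FS/Admin" is the channel name I'm assuming here, and the events may be logged on the WAPs rather than the back-end servers):

    # Quick sketch: dump the most recent congestion-related events (222 / 230).
    # "AD FS/Admin" is the assumed channel name; check Event Viewer if yours differs.
    import subprocess

    query = "*[System[(EventID=222 or EventID=230)]]"
    subprocess.run(
        [
            "wevtutil", "qe", "AD FS/Admin",
            "/q:" + query,
            "/f:text",   # human-readable output
            "/c:50",     # last 50 matching events
            "/rd:true",  # newest first
        ],
        check=True,
    )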

    Thanks.

    Thursday, January 19, 2017 12:09 PM
  • Hi,

    When I rethink this, having the problem on both ADFS farms, which are completely separate and targeting different DCs, indicates... well, yeah... something weird :)

    Just trying to pin this down: if your farms are indeed completely separate and you are experiencing latency on both, then it's something you are running on both farms that is unrelated to ADFS.

    1: Are you running any scanner software on your servers?

    2: Are you running any scanner/proxy software on your network?

    3: Have you experienced this with any other web service? Maybe you just didn't notice it?

    4: Which load balancer are you running in front of the servers, and what is your session configuration?

    Also, your requests/sec figure indicates to me that it's not RP related, unless you have an RP requesting a large amount of data that saturates something in some way.

    Another thing could be to install/run NetMon or Fiddler on the target ADFS server while you are seeing this congestion, to spot a possible cause.

    Thursday, January 19, 2017 2:23 PM
  • Hi David-JG,

    When the ADFS proxy experiences a high authentication load, it will start adding delay to authentications; as your authentication load builds up, this delay can be prolonged by the congestion algorithm itself. If you are not experiencing any capacity issues on your ADFS infrastructure, it is safe to increase the threshold.

    The purpose of the congestion algorithm is to prevent an outage on your internal ADFS infrastructure due to a denial-of-service attack or other high amounts of unnecessary load coming from the Internet. So either you are under a DoS attack (or have accidentally caused one yourself with your monitoring), or you just need to service more authentications than the default parameters allow.

    To determine if you are under a DoS attack look at the counters for failed authentications/inbound connections. If the numbers aren't high, look at successful authentications and track the client IP to see if it looks like legitimate traffic.

    If everything checks out, and your CPU/Memory on the ADFS servers isn't being hammered, increase the thresholds.
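
    If you do go down that route, the thresholds live in the proxy service configuration file on each Web Application Proxy. Here is a minimal sketch for checking what is currently set, assuming the default file location and the usual <congestionControl> element and attribute names (verify both against your own file before changing anything):

    # Minimal sketch: print the current congestion-control settings from a WAP.
    # File path and element/attribute names are the usual defaults - verify locally.
    import xml.etree.ElementTree as ET

    CONFIG_PATH = r"C:\Windows\ADFS\Config\microsoft.identityServer.proxyservice.exe.config"

    node = ET.parse(CONFIG_PATH).getroot().find(".//congestionControl")
    if node is None:
        print("No <congestionControl> element found - inspect the file manually.")
    else:
        for name, value in node.attrib.items():
            # e.g. latencyThresholdInMilliSec, minCongestionWindowSize, enabled
            print(f"{name} = {value}")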

    Good Luck!

    Shane

    Friday, January 20, 2017 3:38 AM
  • Hi Shane,

    The requests/sec counter does not indicate any sudden increase or anything. Wouldn't that be the case with a denial-of-service attack? Requests/sec is not only successful requests, it's all requests as far as I am aware. In other words, that metric would contradict what you are stating, no?

    Friday, January 20, 2017 7:35 AM
  • Hey Shane,

    I'd agree with you if we had spikes in authentication requests. We've suffered DoS attacks in the past and it's pretty easy to determine what's going on: we just have to identify an abnormal pattern.

    In my experience (we've had to analyze several DoS attacks here in my company), the tricky part is not identifying what's going on; the tricky part is being proactive and preventing it from happening. I mean, DoS attacks are aimed at having an impact, and that implies visibility. Last time we had hundreds of auth requests in about 10 minutes from a single computer in Israel; because of that, we decided to enable the extranet lockout feature. Even so, looking at our metrics I don't see any indication of something strange... as I said, during the latency spike we don't even get 1 request per second.

    We will most probably increase the latency threshold, because our MS support guys asked us to do so, but I have the impression that's not the most accurate way to approach it.

    Another thing the MS guys pointed out is that the extranet lockout feature relies on PDC availability. We are monitoring all our DCs, so maybe that's a point we need to review: check our PDC status when it happens.
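
    To start with, I'll probably add a crude reachability/latency check against the PDC emulator next to the rest of our monitoring, something like this sketch (the host name is a placeholder, and a proper check should also do an authenticated LDAP bind rather than just opening the port):

    # Crude sketch: time a TCP connection to the PDC emulator's LDAP port.
    # Host name is a placeholder; a real check would also perform an LDAP bind.
    import socket
    import time

    PDC_HOST = "pdc01.example.com"  # placeholder for our PDC emulator
    LDAP_PORT = 389

    start = time.monotonic()
    try:
        with socket.create_connection((PDC_HOST, LDAP_PORT), timeout=5):
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"PDC reachable on port {LDAP_PORT}, connect time {elapsed_ms:.0f} ms")
    except OSError as exc:
        print(f"PDC unreachable: {exc}")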

    Regards.

    Friday, January 20, 2017 10:26 AM
  • Hi Jesper,

    To your questions:

    1: No, no HIDS/HIPS and no AV (yes, we could discuss whether that's ideal or not; at least it is in terms of performance).

    2: Not affecting our server VLANs, at least as far as I'm aware. I'd need to check with our InfoSec guys.

    3: As to whether other [web] services are impacted: nobody, not just my team, has detected that, but as you said, maybe they just don't notice it. Maybe their web services are not as latency-sensitive.

    4: For the ones in Azure: standard Azure LBs (no affinity configured). For our on-prem farm we have NetScalers (affinity configured per SSL session). It's on my to-do list to work with the network team to remove affinity on the NetScalers for the ADFS service.

    Yes, I think I'm going to enable a NetMon trace again. Actually, I'll try to do it with Wireshark, which I'm more familiar with. Do you know if Fiddler can be configured to use a circular ring buffer?
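
    I'm not sure about Fiddler, but on the Wireshark side I'm planning to leave dumpcap running with a circular ring buffer until the next congestion event, roughly like this (interface number, capture filter, file sizes and output path are placeholders for our setup):

    # Sketch: run Wireshark's dumpcap with a circular ring buffer so the capture
    # can stay on until the next congestion event without filling the disk.
    import subprocess

    subprocess.run(
        [
            "dumpcap",
            "-i", "1",                 # capture interface (list them with "dumpcap -D")
            "-f", "tcp port 443",      # only the HTTPS traffic hitting ADFS/WAP
            "-b", "filesize:102400",   # rotate files at ~100 MB each
            "-b", "files:20",          # keep at most 20 files (circular buffer)
            "-w", r"C:\traces\adfs_ring.pcapng",
        ],
        check=True,
    )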

    Regards. 

    Friday, January 20, 2017 10:45 AM
  • Hi Shane,

    This is Patrick, nice to meet you.

    We have encountered a similar problem.

    I would like to ask: is it possible to disable the ADFS congestion setting? I am afraid it may treat a high load of authentications or connections as a DoS attack.

    We may get a large number of authentication connections at the same time.

    Thanks,
    Patrick

    Friday, June 22, 2018 8:24 AM