Friday, September 21, 2012 8:10 PM
I had a call from work this evening advising me that all inbound and outbound mail had stopped on my Exchange 2010 SP2 server. When I joined the call we could see the queue had jammed up at around 3,000 mails. From the logs I could confirm there was no disk space issue, memory looked normal, etc. But there was an event 15004 logged, which talks about back pressure and server resources.
What I realised was that at the time the issue started, I had run the following PowerShell command on my server:
Get-ChildItem D:\Logs\*.log | Export-ActiveSyncLog -OutputPath:"C:\Temp\Logs"
I was trying to get a log for analysis of all mobile device users. At the time, the command appeared to run fine and generated all the correct logs. In hindsight, though, it appears to have caused an issue with my mail transport, and all inbound/outbound mail stopped. I checked all Exchange services and they were all started, as I would expect. As soon as I rebooted the server, all was good in the world again.
Am I just being really paranoid, and is this a massive coincidence that a server which has run flawlessly for almost 6 months died at the same time I ran this export, almost to the exact minute? Has anybody seen an issue like this before?
Friday, September 21, 2012 11:37 PM
Could be related, as the export was on the C drive, so it may have generated some files. That event should say what type of back pressure you had. If it's related to disk space, then I would say it was the export reports.
Tuesday, September 25, 2012 10:38 AM
This was not a disk space issue. The error logged on my server states...
The following resources are under pressure:
Physical memory load = 93% [limit is 94% to start dehydrating messages.]
Submission Queue = 2000 [Medium] [Normal=1000 Medium=2000 High=4000]
So I can see that when the submission queue hit 2,000 mails, the server reached the "Medium" resource threshold and then disabled the various components. I cannot find anything that would tell me why the submission queue on this one server stopped processing mails. I can only correlate it with the fact that this all happened right after I ran the export command.
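For anyone checking the same thing, here is a minimal sketch of how the back pressure events and the queue state could be pulled from the Exchange Management Shell. It assumes the events are logged in the Application log under the MSExchangeTransport source, and the `-Newest 2000` cap is an arbitrary choice to keep the query quick.

```powershell
# Run from the Exchange Management Shell on the affected Hub Transport server.
# List recent back pressure events (event ID 15004) from the transport service.
# Assumption: the event source is MSExchangeTransport in the Application log.
Get-EventLog -LogName Application -Source MSExchangeTransport -Newest 2000 |
    Where-Object { $_.EventID -eq 15004 } |
    Select-Object TimeGenerated, EntryType, Message |
    Format-List

# Show the current state of every queue, largest first, including Submission.
Get-Queue |
    Sort-Object MessageCount -Descending |
    Format-Table Identity, DeliveryType, Status, MessageCount, LastError -AutoSize
```

The event message lists each resource under pressure with its thresholds, so it should show whether disk, memory, or queue depth tripped first.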
Tuesday, September 25, 2012 10:52 AM
It seems like the memory was close, but the submission queue was the actual issue.
If logging is enabled, you can check the protocol logs and message tracking to see which messages were sent.
The issue is clear: the submission queue was the cause.
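As a rough sketch of the message tracking check, something like the following could be used to see whether any one sender dominated the period. The start and end times are hypothetical placeholders, not taken from the actual incident, and should be replaced with the window in which the queue began to build.

```powershell
# Hypothetical incident window; replace with the real times.
$start = Get-Date "2012-09-21 16:00"
$end   = Get-Date "2012-09-21 20:30"

# Count received messages per sender in that window. One sender with a very
# large count would point to a bulk send from an application or a virus.
Get-MessageTrackingLog -Start $start -End $end -EventId RECEIVE -ResultSize Unlimited |
    Group-Object Sender |
    Sort-Object Count -Descending |
    Select-Object -First 20 Count, Name
```

If the counts are spread evenly across normal users, a bulk send is unlikely to be what filled the queue.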
Tuesday, September 25, 2012 10:56 AM
I think I may be getting closer on this one. I did a little more digging and can see from my logs that there are a lot more errors from earlier in the day:
The execution time of agent 'McAfeeTxRoutingAgent' exceeded 90000 milliseconds while handling event 'OnCategorizedMessage' for message with InternetMessageId: 'Not Available'. This is an unusual amount of time for an agent to process a single event. However, Transport will continue processing this message.
- Edited by LambyUK Tuesday, September 25, 2012 10:56 AM
Tuesday, September 25, 2012 11:00 AM
I don't believe the script has anything to do with the issue. It just happened at the same time you ran the script. The script only exports to the file system. It doesn't send any messages.
The AV seems to be more relevant. How many of those events do you see?
Tuesday, September 25, 2012 11:03 AM
I can see around 75 instances of it happening around the same time users began complaining of mails not sending. There have been instances of it happening on several other days too, but only ever a few at a time.
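A count like this can be produced with a sketch along these lines, which groups the agent timeout events by hour. It matches on the message text quoted earlier in the thread rather than on an event ID, and the seven-day lookback is an arbitrary assumption.

```powershell
# Find Application log entries where the McAfee routing agent exceeded its
# execution time, then count them per hour to see when they cluster.
# Assumption: the agent name appears in the message exactly as quoted above.
Get-EventLog -LogName Application -After (Get-Date).AddDays(-7) |
    Where-Object { $_.Message -like "*McAfeeTxRoutingAgent*exceeded*" } |
    Group-Object { $_.TimeGenerated.ToString("yyyy-MM-dd HH:00") } |
    Sort-Object Name |
    Format-Table Name, Count -AutoSize
```

Comparing the hourly counts with the time of the first 15004 event may help show whether the agent timeouts started before or after back pressure kicked in.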
Tuesday, September 25, 2012 11:09 AM
I would say that is your issue. I've seen this with other AV products as well as the one that you're using.
As the issue can't be reproduced, I would log a call with the vendor, as you suggested, to see what they say you can log.
If you can repro the issue, then simply disable or remove the AV and its agents from the transport layer.
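A minimal sketch of disabling the agent, assuming it is registered under the name shown in the event text (the identity reported by your own server may differ, so check it first):

```powershell
# Assumption: the agent identity matches the name in the logged event.
# Disable the AV routing agent so transport no longer calls it.
Disable-TransportAgent -Identity "McAfeeTxRoutingAgent" -Confirm:$false

# Agent changes only take effect once the transport service is restarted.
Restart-Service MSExchangeTransport
```

Enable-TransportAgent with the same identity, followed by another restart of the service, puts it back once testing is done. Mail is not scanned by that agent while it is disabled, so this is best kept to a short test window.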
Tuesday, September 25, 2012 11:12 AM
And this is where I am a little bit cautious. I have just found the McAfee article stating that this error can be logged when the server is under extreme load. So is this actually a red herring? Were these errors only logged because my server was already on its way down? Or were these errors actually the root cause of taking it down in the first place? It's a tricky one!
Many thanks for your continued advice.
Tuesday, September 25, 2012 11:31 AM
If you look at a simple Exchange install without any AV, it will function as expected. I've seen this for many years. I've also, for many years, seen Exchange experience the symptoms you have described when AV is in the equation.
Why would the server be overloaded with messages unless a bulk email was sent out, for example via an application or a virus?
That's why I said to use message tracking to see what messages were sent during that time.
Tuesday, September 25, 2012 11:52 AM
But this is my initial train of thought. Something crashed out the transport agent or the submission queue. The server stopped processing messages and the queue just continued to build and build. Looking at the message tracking logs, there was no mass or bulk mail sent.
Tuesday, September 25, 2012 11:56 AM
Then the issue would be AV more than anything else in this case.
The AV would have transport agents, which seem to have had a knock-on effect on the transport system.
If you run Get-TransportAgent you will see them there.