BPOS Outages

  • Question

  • Our users have been reporting major downtime with Exchange Online over the last couple of days. Support has been unavailable when I have called to find out what the issue is. There seem to have been intermittent outages on Tuesday, Wednesday and Thursday. Does anyone have any idea what is going on? I'm being asked to look into alternative email solutions because this is killing our business!

    Thursday, May 12, 2011 5:09 PM

All replies

  • We are experiencing the same thing. We are looking at alternatives as well.
    Thursday, May 12, 2011 5:17 PM
  • We had an all-day outage on Tuesday, a partial outage on Wednesday, and another outage today.
    Friday, May 13, 2011 1:47 AM
  • http://blogs.technet.com/b/msonline/archive/2011/05/13/update-on-bpos-standard-email-issues.aspx

     

    Here is a copy of the blog article.

    Update On BPOS-Standard Email Issues

    DaveT_MSFT

    12 May 2011 5:47 PM

    I lead the engineering organization responsible for BPOS. My team builds, operates and supports our BPOS service, and over the last few days, we have not satisfied our customers’ needs. On Tuesday and today we experienced three separate service issues that impacted customers served from our Americas data center. All of these issues have been resolved and the service is now running smoothly. These incidents were unique to BPOS and not related to Office 365 or any other Microsoft services.

    I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused.  We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a full post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted.   We will be proactively issuing a service credit to our impacted customers.

    I also want to provide more detail about the recent issues.

    On Tuesday at 9:30am PDT, the BPOS-S Exchange service experienced an issue with one of the hub components due to malformed email traffic on the service. Exchange has the built-in capability to handle such traffic, but encountered an obscure case where that capability did not work correctly. The result was a growing backlog of email. By 12:00am PDT, the malformed traffic was isolated and the mail queues cleared. The delays encountered by customers varied, on the order of 6-9 hours. A short-term mitigation was implemented and a fix was under development.

    At 9:10am PDT today, service monitoring again detected malformed email traffic on the service. The problem was resolved at 10:03am, but users experienced email delays of up to 45 minutes during this time. A second, related issue was detected via monitoring at 11:35am PDT, resulting in email stuck in some end users’ outboxes. The issue was remediated at 12:04pm PDT. During this time, more than 1.5 million messages had queued on the service awaiting delivery. The backlog was 90% clear by 4:12 PM, but because of its size, customers may have experienced delays of as long as 3 hours. We are implementing a comprehensive fix to both problems.

    As a result of Tuesday’s incident, we feel we could have communicated earlier and been more specific.  Effective today, we updated our communications procedures to be more extensive and timely.   We understand that it is critical for our customers to be as fully informed as possible during service impacting events.  We will continue to improve the timeliness and specificity of our communications.  The primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard.  For North America, that dashboard is at https://health.noam.microsoftonline.com/.

    In an unrelated incident, starting at 1:04am PDT, service monitoring detected a failure in the Domain Name Service (DNS) hosting the http://mail.microsoftonline.com domain. This failure prevented users from accessing Outlook Web Access hosted in the Americas, and partially impacted some functionality of Microsoft Outlook and Microsoft Exchange ActiveSync devices. The team diagnosed and fixed an underlying problem in the servers hosting DNS for the http://mail.microsoftonline.com domain, and restored service at 4:52am PDT. The team identified a number of improvements in our handling of problems associated with DNS, and will make a full post mortem of this incident available through Microsoft Support.

     As I’ve said before, all of us in the BPOS team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable.  I want to assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.

    As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or contact our customer support team. Our customer support is available 24 hours a day by telephone or via Service Requests submitted from the Microsoft Online Services Administration Center.

     

    Dave Thompson

    Corporate Vice-President, Microsoft Online Services

    Friday, May 13, 2011 2:19 AM
  • We have the same problem today. I can't ping microsoft.com either. All Microsoft sites are very slow, and OWA is not reachable.

    What is the problem, and what can I do?

    Tuesday, May 17, 2011 7:27 AM
  • Are you sure? HTTPS://mail.microsoftonline.com is working. By the way, I think the server does not respond to pings, so a failed ping does not mean the site is down.
    Tuesday, May 17, 2011 9:10 AM
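
Since ping may be ignored by the server, a TCP connection attempt on the HTTPS port is a more telling check than ping. Below is a minimal sketch of that idea, not an official diagnostic: the helper name `tcp_reachable`, the default port 443 and the 5-second timeout are assumptions chosen for illustration. It only shows that something is accepting connections on that port; it does not verify that OWA itself is healthy.

```python
import socket


def tcp_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. A True result means a listener
    accepted the connection; it says nothing about the application behind it.
    """
    try:
        # create_connection resolves the name and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections and timeouts.
        return False
```

For example, `tcp_reachable("mail.microsoftonline.com")` tries port 443 on the host mentioned above. A False result when the name cannot be resolved would also be consistent with a DNS problem like the one described in the blog post.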
  • We are investigating reports of intermittent mail flow issues affecting Exchange Online users served from the Americas data center. We are publishing information to customers via our normal communication channels. Check the Service Health Dashboard for the latest information. We apologize for any inconvenience this causes our customers.

    JRG_MSFT


    JRG (MSFT)
    Thursday, May 19, 2011 5:58 PM
  • Mail flow issue resolved. Here is the information from Service Health Dashboard.

    This is a short update to provide more detail on the issues that impacted mail queues today in the Americas. Beginning at 8:48am PDT on Thursday, May 19, monitoring indicated mail queues building and exceeding normal thresholds. A high-priority crisis bridge was established, and mail queues were found to be high across 30% of Exchange Online HUB servers. At 9:54am PDT, mail queues had naturally reduced to normal levels, except on a single HUB server, which still had a backlog of queued messages. The BPOS, Exchange and Forefront Engineering teams were engaged to troubleshoot the issue, and at 11:21 a software problem was identified that was preventing the Exchange Online HUB from draining quickly due to the large volume of items in the mail queue. This same issue was also the root cause of the initial delays that were detected at 8:48am. To address the issue, Exchange Online Operations immediately added new HUB capacity across the environment, which relieved pressure and allowed new mail for users to flow efficiently. The single HUB server drained its email; however, due to the large volume, it did not finish draining until 3:33pm PDT. In parallel, Microsoft is working to develop and deploy a fix for the underlying software problem to prevent it from occurring again. A full post mortem will be available from Microsoft Support within 7 business days. We apologize for the interruption to your service.


    - Srini
    Friday, May 20, 2011 4:10 PM