Explanation of Exchange Online outage?

    Question

  • Can I get an explanation of the outage that happened today, and of all the outages posted on the RSS feed?
    Thursday, January 28, 2010 9:10 PM

Answers

  • If an official RCA is required, you may request one by submitting a service request.  Please review the disclaimer included before posting any part of it.
    Tuesday, February 09, 2010 10:58 PM
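
For anyone who wants to pull the outage posts out of the notifications feed themselves, here is a minimal sketch. It assumes the feed is a standard RSS 2.0 document with `<item>`, `<title>` and `<pubDate>` elements, which has not been verified against the real feed. The sample XML and the `list_notices` helper are hypothetical and only illustrate the parsing step; downloading the feed is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in RSS 2.0 shape; the real feed's fields may differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Service notifications (sample)</title>
    <item>
      <title>Sample notice A</title>
      <pubDate>Thu, 28 Jan 2010 22:00:00 GMT</pubDate>
      <description>Placeholder text.</description>
    </item>
    <item>
      <title>Sample notice B</title>
      <pubDate>Fri, 29 Jan 2010 15:00:00 GMT</pubDate>
      <description>Placeholder text.</description>
    </item>
  </channel>
</rss>"""


def list_notices(feed_xml):
    """Return (pubDate, title) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    notices = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing
        title = item.findtext("title", default="").strip()
        published = item.findtext("pubDate", default="").strip()
        notices.append((published, title))
    return notices


if __name__ == "__main__":
    for published, title in list_notices(SAMPLE_FEED):
        print(published, "|", title)
```

Saving the output of each run and comparing it with the previous one would show which notices are new, without having to watch the feed by hand.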

All replies

  • Me too.  I've just signed up a few clients promising your advertised 99.9999% reliability.

    Thursday, January 28, 2010 9:18 PM
  • Same here. The admin portal says that the Exchange Online service is "healthy", but our users cannot log in via the Sign-In App or OWA. What's the issue and when is it going to be resolved?
    Thursday, January 28, 2010 9:21 PM
  • Me too!  What really gets me is that they are calling it Planned Maintenance, at 8PM GMT = 3PM EST!

    https://rss.microsoftonline.com/feeds.aspx?center=default&chan=notifications&lang=en-us
    Thursday, January 28, 2010 9:25 PM
  • Yes, not sure what's going on.  I assume this is some emergency maintenance, but in the middle of the day?  Also, there have been a lot of outages in January so far, so I'm not sure what's going on.

    I am sure in a few days, MS will post a link to the RSS feed.  We need more info than just "there was an outage."  This helps partners and customers evaluate whether hosting email service with MS fits within their SLAs.

    Thanks
    Matt

    Thursday, January 28, 2010 9:38 PM
  • We developed the Exchange Online Monitoring Application (Exmon) to help you keep a better eye on the Exchange Online Environment and alert you if there are problems.  It's still in Beta, but is available as a free download from our site:

    http://www.messageops.com/ExmonDownload.html

    We hope to release a new version in the next couple of weeks, which will have better alerting and reporting features.

    Chad

    Chad Mosman, MessageOps | www.MessageOps.com
    Thursday, January 28, 2010 9:49 PM
  • I like how they keep posting messages saying it's a planned outage while the postings go up 2 hours after the outage. I must be looking at the wrong definition of "Planned".

    Thursday, January 28, 2010 10:19 PM
  • It's great that there are 3rd party monitoring tools out there but MSFT should provide at least the basics as part of the package.

    Why doesn't the admin portal dashboard reflect this type of thing in real time? It shows the following for Exchange Online:

    All features of this service are available.

    Mailflow
    MAPI Connectivity
    Outlook Web Access (OWA)

    when clearly this is not the case.

    Also, announcements about planned maintenance and service availability should be posted directly to the admin portal rather than admins having to search the forums or RSS feeds.

    Thursday, January 28, 2010 10:22 PM
  • Hi all,

    As a user since day 1, I must say I am glad we have such a great team working on this service offering. I have watched this service grow, and I am proud to say that this was really the first extended issue in more than a year that has been visible to users, which is a good track record.

    I will say that if I had this in house, I am sure it would have taken longer to get everything back up. I have been on many other services, and when they go down they do not have it up and running in less than 3 hours; many times it was 3 days...

    The reasons, in my view, do not matter because they are just words. I trust that the team at Microsoft is going through many meetings at this point and over the next few days to make sure every effort is taken to prevent this from happening again in the future.

    I am with you all that RSS is not really the way it should go. Maybe they could use Vine technology so that the people who subscribe can get a text message to their desk, phone, etc. My feeling also is that I would rather they focus their efforts on getting it fixed rather than on updating an RSS feed that most people cannot find or even use.

    In my view, with all this, this is still the best game in town, and I want to say thanks to the team for getting us all back online. Was the outage acceptable? Not really, but the resolution was.

    I know this is not an answer, but really nothing else out here is either. I for one am glad to say that I use a system that uses all the best-of-breed tools; bringing them in to use locally would be unattainable in cost and manpower without this service. 2 hours out of a year is not a bad record, and let me also say that the upgrades normally happen without visibility.

    I have heard it said this way: "It is not oh god, it is thank god."

    Thursday, January 28, 2010 10:47 PM
  • I agree with the sentiment that this type of incident is rare and that Microsoft's service is superior in this regard to others in the market. I think the majority of criticism here is positive in nature, in that it speaks to the fact that there are improvements that can be made in getting information into the hands of end users and admins in this type of scenario. My experience has been that the support team is very open to listening to the community and acting on their feedback, and I'm sure this will be no different.

    What's comforting to note is that once service was restored, in our case at least, we had no reports of data loss or NDRs.
    Friday, January 29, 2010 4:30 AM
  • Still no update from MS on this in this forum, but I managed to get this:

    Dear Customer:

    Microsoft Online Services strives to provide exceptional service for all of our customers. On January 28, customers served from a North America data center may have experienced intermittent access to services included in the Business Productivity Online Suite. We apologize for any inconvenience this may have caused you and your employees.

    We are committed to communicating with our customers in an open and honest manner about service issues and the steps we’re taking to prevent recurrences.

    • What happened?  
      • Monitoring alerted us to a possible issue with networking.
      • Troubleshooting procedures ultimately pointed to a problem with network infrastructure, resulting in intermittent access for customers.
    • What actions have been taken to prevent a recurrence?
      • We have identified the root cause, and have taken steps to remediate the network issues.



     

    Friday, January 29, 2010 3:02 PM
  • No one said that this business is easy. There will be outages and I can accept that. However, one has greater expectations of Microsoft to be able to address and resolve issues in a timely fashion. Client notification of problems does not exist at this time, and we are left to deal with our business users and senior management without any information or support from the BPOS team. The RSS feed is of no value. This is not a new product, and MSFT with their technology team should be able to provide a reasonable time estimate on when we can expect the system to recover. Not what I expected when I signed up for BPOS.
    Friday, January 29, 2010 4:00 PM
  • Agreed, MS needs to provide paying customers better SLAs.  They should work on providing a dashboard with real-time or near-real-time stats in the admin portal, and a status update page, not an RSS feed that a lot of users do not know about.

    The service status shown yesterday clearly did not reflect real time or near real time: throughout the day, Exchange Online showed as online with no issues.

    While maintaining an infrastructure that has no downtime is possible, it takes considerable effort and capital.  With the proper change management, processes and resources, you can provide an infrastructure that will not have unscheduled outages.  Speaking as a data center manager for a global bank, this is possible.  I would expect MS to provide this.  But as in all computer systems, code is written by humans, humans run systems, and they make mistakes, just like all of us.  There are well-documented processes to help even in that area.

    Let's hope that a root cause analysis is being done and MS has a plan to shore up whatever process failed, so they can build redundancy around it.
    Friday, January 29, 2010 4:17 PM
  • Looks like the service is down again this morning.

    Crashes happen. We're in the beginning years of cloud computing and there are likely to be outages caused by faulty software, human error, hackers, and more.

    I'm not upset about the outages, I'm upset about the lack of communication. Yesterday some human being drawing a paycheck made the decision to post a notice calling this "planned maintenance." That was a lie, and it was the only communication with the outside world during the day.

    Last night's email message wasn't technically a lie but it was a condescending brushoff. There was “a possible issue with networking” and diligent work by top engineers identified that there was a “problem with network infrastructure,” so they promptly “took steps to remediate the network issues.”

    I honestly can’t think of any way to make that any more vague. I picture Jack Nicholson shouting at us, "You can't handle the truth!"

    Pretty disappointing. A little transparency into the problems and the response would go a long way to reassuring us that the service should be trusted in the long run.

    Bruce Berls
    http://www.brucebnews.com/2010/01/wobbles-with-microsoft-online-services/
    http://www.brucebnews.com/2010/01/microsoft-online-services-outage-followup/


    Valisystem
    Friday, January 29, 2010 6:13 PM
  • The MSOL Team blog has posted this article and some of the comments answer additional questions about the time and scope of impact:
    http://blogs.technet.com/msonline/archive/2010/01/29/response-to-north-america-connectivity-issues.aspx
    Jeff Schertz, PointBridge | MVP | MCITP: Enterprise Messaging | MCTS: OCS
    Friday, January 29, 2010 9:20 PM
  • The blog posting does not provide a clue of what really happened. In every services organization, clients participate in the Change Management process that impacts the client's environment. BPOS should post Change Management board approvals on a secure web site that clients can access and review. This would allow us to prepare our business users for "planned outages".

    Saturday, January 30, 2010 4:35 AM
  • Good point, eMailPMO.  I think this outage was not planned.  They should have a real-time status portal that is not attached to the BPOS domain and services.  Or they should start working on exposing logs and WMI information so clients can use SCOM to some degree.
    Sunday, January 31, 2010 1:51 PM
  • I haven't heard complaints from clients yet but from the RSS feed it appears the service is going up and down like a yo-yo this morning . . .
    Valisystem
    Monday, February 01, 2010 7:33 PM
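
To put the downtime and reliability figures quoted in the thread side by side, here is a rough sketch of the arithmetic. It assumes availability is measured over a full 365-day year; an actual SLA may be measured over a different window, such as a month. The 99.9999% target and the 2-hour and 3-hour downtime figures come from posts above; the 99.9% target is added only for comparison, and both function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget, in hours, implied by an availability target."""
    return period_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # Downtime figures quoted in the thread: 2 hours a year, and under 3 hours.
    for hours in (2, 3):
        print(f"{hours} h down per year -> {availability_percent(hours):.3f}% available")
    # 99.9999% is quoted in the thread; 99.9% is an assumed comparison point.
    for target in (99.9, 99.9999):
        budget = allowed_downtime_hours(target)
        print(f"{target}% target -> {budget * 3600:.1f} s of downtime per year")
```

Under these assumptions, two hours of downtime in a year works out to roughly 99.977% availability. That is inside a 99.9% budget (about 8.76 hours per year) but far outside a 99.9999% budget (about 31.5 seconds per year).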