Replication Link Failing or Degraded

  • Question

  • I have just upgraded my test environment from SCCM 2012 SP1 to SCCM 2012 R2.  I now have replication issues between my CAS and Primary Site server.  On the CAS it looks as though the link is active (although I do get an error when running the Replication Link Analyzer), but on the Primary it is definitely showing as Link failed.  I get the following message: Replication group on the Primary site are degraded.  Replication group Global -receiving pattern CFG_Global.  The SCCM client package on the Primary keeps trying to install, and although I can see it on the DP it's at version 1000.  I've looked in the replmgr.log and the rcmctrl.log but I can't see any errors.  On the SMS_Replication_Configuration_Monitor the link status goes between active, degraded and failed.  And as mentioned before, the Distribution Manager doesn't show any errors but keeps saying it has successfully processed the Configuration Manager Client Package.  Does anyone have any ideas what I should try?

    Thanks in advance

     
    Monday, December 16, 2013 3:59 PM

All replies

  • The link will go down for some time after upgrading to R2 so that's expected.
    There's a file in one of the inboxes that's responsible for upgrading the package over and over again, but I cannot remember the exact name. This should get deleted automatically (but fails in some cases).

    Torsten Meringer | http://www.mssccmfaq.de

    Monday, December 16, 2013 4:11 PM
  • Thanks Torsten.  In the Hman.box there is a CliUpG.ACU file that has been sitting there since the upgrade.  Should that be deleted?  I am meant to be installing SCCM 2012 R2 in live today but these issues I've seen have made me a bit nervous.  In your experience would you still go ahead with the SCCM 2012 R2 install?
    Wednesday, December 18, 2013 11:00 AM
  • Why have you decided to install a CAS at all? Are you going to manage more than 100k clients?
    Just move the file out of hman.box and examine hman.log. I haven't seen issues with R2 that would prevent me installing it so far.

    Torsten Meringer | http://www.mssccmfaq.de

    Wednesday, December 18, 2013 11:12 AM
  • I didn't want to install a CAS and tried everything possible to avoid one but our DR testing model means that it's pretty much impossible to not have one.  Otherwise we would have had to restore the database from backup twice a year for our DR tests and this was deemed as too high a risk.  It also meant that we couldn't start building workstations until the restore had completed.

    Error I am getting in the hman.log is

    Failed to update advertisement xxxx with a new mandatory schedule.  Error returned is 5.

    Failed to add a new mandatory schedule to the client upgrade advertisement xxx.  Will continue on next cycle.

    Where should I move the file to?  A particular location or just anywhere? 

    Wednesday, December 18, 2013 11:21 AM
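
The two hman.log messages quoted above can be picked out of a longer log with a small filter. This is only a minimal sketch: it assumes the log has been read into a string and that the interesting lines contain the text "Failed to" as quoted in the post. The function name `find_failures`, the sample text, and the "unrelated line" in it are made up for illustration, and a real hman.log line carries extra fields (timestamps, component names) that the thread does not show.

```python
import re

# Patterns based only on the two messages quoted in the post above.
# The full layout of a real hman.log line is an assumption not covered here.
FAILED = re.compile(r"Failed to .*")
ERROR_CODE = re.compile(r"Error returned is (\d+)")


def find_failures(log_text):
    """Return (message, error_code_or_None) for every 'Failed to' line."""
    results = []
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if not match:
            continue
        message = match.group(0).strip()
        code = ERROR_CODE.search(message)
        results.append((message, int(code.group(1)) if code else None))
    return results


# Hypothetical sample: the two quoted messages plus one unrelated line.
sample = """\
Failed to update advertisement xxxx with a new mandatory schedule.  Error returned is 5.
Some unrelated line
Failed to add a new mandatory schedule to the client upgrade advertisement xxx.  Will continue on next cycle.
"""

for message, code in find_failures(sample):
    print(code, message)
```

Grouping the hits by error code makes it easier to see whether the same failure repeats on every cycle or whether something new shows up after a change.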
  • I don't see much benefit having a CAS when it comes to restoring ...

    Just move the file to whatever location (so that you could move it back if it would have caused issues)


    Torsten Meringer | http://www.mssccmfaq.de

    Wednesday, December 18, 2013 11:49 AM
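
The "move it somewhere you can move it back from" advice could be scripted roughly as below. This is a sketch under assumptions, not a ConfigMgr tool: the helper names `park_file` and `restore_file` and the holding folder are hypothetical, and the demo runs against a temporary directory rather than a real hman.box, whose full path is not given in the thread.

```python
import shutil
import tempfile
from pathlib import Path


def park_file(inbox, file_name, holding):
    """Move inbox/file_name into the holding folder and return its new path.

    Nothing is deleted, so the file can be put back with restore_file().
    """
    source = Path(inbox) / file_name
    if not source.is_file():
        raise FileNotFoundError(source)
    holding = Path(holding)
    holding.mkdir(parents=True, exist_ok=True)
    target = holding / file_name
    if target.exists():
        # Refuse to overwrite an earlier parked copy.
        raise FileExistsError(target)
    return Path(shutil.move(str(source), str(target)))


def restore_file(parked, inbox):
    """Move a parked file back into the inbox it came from."""
    parked = Path(parked)
    return Path(shutil.move(str(parked), str(Path(inbox) / parked.name)))


# Demo in a throwaway directory; the folder layout here is invented.
with tempfile.TemporaryDirectory() as root:
    inbox = Path(root) / "hman.box"
    inbox.mkdir()
    (inbox / "CliUpG.ACU").write_text("placeholder")
    parked = park_file(inbox, "CliUpG.ACU", Path(root) / "parked")
    print(parked.name, (inbox / "CliUpG.ACU").exists())
```

Moving rather than deleting keeps the rollback path Torsten describes: if hman.log shows new problems afterwards, `restore_file` puts the file back where it was.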
  • I don't need the CAS for the restore; it's that I need a second site, and both sites need to be pretty much a replica of each other.  During the DR test the link to the datacenter which holds our Primary Site is cut.  This means I would either need to restore this site from a backup at our head office in order to deploy stuff etc., or have another Primary Site permanently housed at the head office.  We decided on the latter option because it takes time to restore the site and there is a risk the site restore may not work.  Also, once the test has finished, if we trash the restored server there may be a lot of network traffic, as the clients would all be reporting up to update their status on the site which was at the datacenter but unreachable for the weekend of the test (this was only a minor concern).  Hope this makes sense.  I'll move the file now and see what happens.

    Wednesday, December 18, 2013 12:13 PM
  •   This means I would either need to restore this site from a backup at our head office in order to deploy stuff etc or have another Primary Site permanently housed at the head office.  The reason why we decided on the latter option is because it takes time to restore the site, there is a risk the site restore may not work and also once the test has finished if we trash the restored server then there may be a lot of network traffic as the client would all be reporting up to update their status on the site which was at the data center but unreachable for the weekend of the test

    Why would a second primary site help then? You would have to reassign all clients "manually" from PR1 to PR2 (using a GPO, script, or the new feature in R2); this will cause full inventories etc ...

    Torsten Meringer | http://www.mssccmfaq.de

    Wednesday, December 18, 2013 12:30 PM
  • I concur with Torsten.

    A second primary site provides no DR capabilities. During your DR test (as described above) all of your clients would either be orphaned or have to be manually reassigned, causing all kinds of resynchronization traffic.

    It is a false statement that primary sites (under a single CAS) are identical. Configuration-wise this is true, but not client-wise. You will cause yourself far more pain in the long run and on a day-to-day basis by having a CAS and an additional primary site than any perceived issues you have twice a year. Site install and DB restoration is a painless and straightforward process that is *the* designed DR scenario/solution.


    Jason | http://blog.configmgrftw.com

    Wednesday, December 18, 2013 12:51 PM
  • No, we don't need to reassign; I know our DR model is quite confusing.  We have PR1 at the head office and all clients here will point to PR1.  Then we have another location which is just DR seats.  These machines are built on the day of the DR test and connect to our datacenter, so they would all be PR2.  So basically we need to carry out two tests.  The first is our datacenter being unavailable: clients at head office can still connect to PR1 as it's housed at head office.  The second is head office being unavailable while the datacenter is fine, so users would go to the new site and would get new workstations linked to PR2.  We will house the CAS at the datacenter.  So obviously when testing the datacenter being unavailable the Advertisements from the CAS will be locked, but I believe new advertisements can be created on the Primary site here.  When we are testing the DR seats we should be fine, as the Primary and the DR will be at the datacenter, which will be available.

    I have read lots of comments in the past about not having a CAS and that was my preference but I can't see how we could carry out the testing above with only one site.

    Wednesday, December 18, 2013 1:22 PM
  • Hi Jason.  Yes, I totally agree with you.  The problem I have is that while I am doing the DR testing for London we also have various offices globally which need to carry on as normal.  So I need to keep their Primary site up and running while the test is going on.  That means after the test we would need to trash the testing database and connect the London clients back to their original site, which I presume would work, but they would be out of sync with what is in the original database, as the information on them would be two days old.  Ultimately we decided to have a DR Primary site, which all DR offices globally will link to, and a Prod Primary site, which all Prod offices will access.  So I'm not expecting them to be identical, but I do want the same software packages, updates etc. available at both sites.
    Wednesday, December 18, 2013 1:39 PM
  • A few additional comments then:

    - Your DR test is artificial and is not in any way mimicking a real-world disaster thus making it N/A IMO. That's ultimately for you to decide though.

    - You're sacrificing day-to-day efficiency (as well as adding cost, overhead, and latency into *everything* you do with ConfigMgr) to account for a black swan event (which can be handled by a simple restore) and twice a year testing.

    - You're not really talking about DR, you're talking about site resiliency and HA. You've blurred the lines and thus created a no-win scenario for yourself -- or maybe the powers that be have.


    Jason | http://blog.configmgrftw.com

    • Proposed as answer by Garth Jones (MVP) Wednesday, February 25, 2015 5:46 PM
    • Marked as answer by Garth Jones (MVP) Sunday, January 31, 2016 1:02 AM
    Wednesday, December 18, 2013 2:35 PM
  • I agree with what you're saying, but this decision was taken out of our hands and we have to make it work as best we can.  On a more positive note, moving the CliUpG.ACU file sorted out the problem of the package being copied multiple times.  Just have to fix my Global Data replication now :(

    Wednesday, December 18, 2013 3:16 PM