Many-to-many relationship: alternative approach?

    Question

  • I’m writing to ask if anyone knows of an alternative approach to a many-to-many relationship in Analysis Services. I have a scenario where the only possible solution appears to be a many-to-many relationship. My predicament is that the intermediate fact table is too big (>28 million rows), and as a result query performance is severely affected. I would really like to use an alternative approach if possible.

    Many thanks,


    Mpumelelo

    Friday, June 28, 2013 8:26 AM

Answers

  • I think I have found a solution, even though it is not an elegant one (as I imagine). For that reason, I would be thankful if anyone with a better approach could help out please. In the meantime, here is what I have done:

    My ETL framework is broken down into batches, which are in fact projects within a solution in BIDS; simply put, they are:

    1. Extracts
    2. Dimensions
    3. Facts
    4. Cube

    Project Extracts runs and populates the base tables as they are supposed to be, and project Dimensions also runs and populates all the relevant dimension tables accordingly. What I have then done is to create a view on top of the Extract base tables which incorporates the “offending” many-to-many relationship objects. I work with healthcare data, and below is a practical example:

    1. Referrals is a dataset (business process) in its own right which deals with all patient referrals to different clinical care departments. This dataset eventually has its own star schema with various participating dimensions.
    2. DiagnosisProcedures is another dataset in its own right which deals with different diagnoses or procedures which patients get. This dataset also eventually has its own star schema with varied participating dimensions.

    Base tables of both of these datasets are populated accordingly at project level Extracts in my ETL framework. Their various dimensions are also populated accordingly at project level Dimensions. However, at cube level DiagnosisProcedures has a dimension (DimDiagnosisDetails) which is not directly related to Referrals and cannot even be implemented as a referenced dimension, due to the many-to-many cardinality which would arise should that course of action be taken. For that reason, I have implemented a many-to-many cube solution to address the need. As my original post mentions, there is an extremely high query performance cost owing to the size of the DiagnosisProcedures measure group.

    About the view – the view is created on top of the Extract base tables and incorporates the “offending” many-to-many relationship objects. My view looks like this:

    CREATE VIEW [dbo].[vw_REFERRALSDiagnosisProceduresDetails]
    AS
    SELECT re.*, -- [used * for brevity]
           dp.DP_id -- [column on which the lookup will happen for DimDiagnosisDetails when populating FactReferrals]
    FROM dbo.Referrals re
    LEFT JOIN
        (SELECT SorceID,
                DP_id
         FROM dbo.DiagnosisProcedures) dp ON dp.SorceID = re.ReferralsID
    GO

    It’s not possible to apply this join directly on the base table, as such an action can blow up the size of the table while having no immediate relevance to the business requirements at that level. Besides, applying the join in the above query at the Extracts project level can also disrupt some of the business logic for other objects which do not need that level of detail. The view is then what I run my query against to populate FactReferrals. By so doing, DimDiagnosisDetails gets directly linked with FactReferrals.
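    The behaviour of this view – fanning each referral out into one row per diagnosis detail, while keeping referrals with no match – can be sketched with an in-memory SQLite database. The table and column names follow the post; the sample data is invented for illustration:

    ```python
    import sqlite3

    # Illustrative re-creation (in SQLite) of the view described above.
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript("""
    CREATE TABLE Referrals (ReferralsID INTEGER, Department TEXT);
    CREATE TABLE DiagnosisProcedures (SorceID INTEGER, DP_id INTEGER);

    INSERT INTO Referrals VALUES (1, 'Cardiology'), (2, 'Oncology');
    -- Referral 1 has two diagnosis details; referral 2 has none.
    INSERT INTO DiagnosisProcedures VALUES (1, 101), (1, 102);

    CREATE VIEW vw_REFERRALSDiagnosisProceduresDetails AS
    SELECT re.*, dp.DP_id
    FROM Referrals re
    LEFT JOIN (SELECT SorceID, DP_id FROM DiagnosisProcedures) dp
           ON dp.SorceID = re.ReferralsID;
    """)

    rows = cur.execute(
        "SELECT ReferralsID, DP_id FROM vw_REFERRALSDiagnosisProceduresDetails "
        "ORDER BY ReferralsID, DP_id").fetchall()
    print(rows)  # -> [(1, 101), (1, 102), (2, None)]
    ```

    Note that the LEFT JOIN multiplies each referral by its number of diagnosis details, which is exactly how DimDiagnosisDetails becomes directly attachable to FactReferrals, while referrals without any diagnosis detail survive with a NULL DP_id.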

    I hope my long explanation makes sense. Like I said, if anyone has a better solution than this, please let me know and I will gratefully accept your help.

    Many thanks.

    • Edited by Mpumelelo S Friday, June 28, 2013 11:21 AM clarity
    • Marked as answer by Mpumelelo S Thursday, July 18, 2013 9:33 AM
    Friday, June 28, 2013 11:03 AM

All replies

  • Microsoft published a whitepaper on optimizing M2M relationships that would document an alternate approach better than I could. http://www.microsoft.com/en-us/download/details.aspx?id=137

    28M rows in an intermediate measure group is really bad. Consider resolving the M2M relationship between the fact and the diagnosis codes by creating another "Diagnosis Bridge" dimension that has a one-to-many relationship with your fact table and contains a single record for each unique combination of diagnosis codes on a given referral/claim. Rather than the referral being the glue between your diagnosis dimension and fact, the "Diagnosis Bridge" dimension is, which should hopefully have far fewer rows than the total number of referrals, and your intermediate measure group would shrink accordingly. (Where did the original matrix optimization article by Erik Veerman go? http://www.ssas-info.com/analysis-services-articles/62-design/380-ssas-many-to-many-optimization-technique-1-by-erik-veerman)
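    As a rough sketch of the bridge/matrix idea described above (with invented data and hypothetical names; the real implementation would live in the ETL and the cube design), one key is assigned per distinct combination of diagnosis codes, so the bridge grows with distinct combinations rather than with referrals:

    ```python
    from collections import defaultdict

    # (referral_id, diagnosis_code) pairs -- the raw M2M intermediate rows
    intermediate = [
        (1, "I10"), (1, "E11"),
        (2, "I10"), (2, "E11"),   # same combination as referral 1
        (3, "I10"),
    ]

    # 1. Gather the set of codes per referral.
    codes_by_referral = defaultdict(set)
    for ref, code in intermediate:
        codes_by_referral[ref].add(code)

    # 2. Assign one "matrix key" per distinct combination of codes.
    matrix_keys = {}
    referral_to_key = {}
    for ref, codes in codes_by_referral.items():
        combo = frozenset(codes)
        key = matrix_keys.setdefault(combo, len(matrix_keys) + 1)
        referral_to_key[ref] = key   # stored on the fact row instead

    # 3. The new, smaller bridge: (matrix_key, diagnosis_code) rows.
    bridge = sorted((k, c) for combo, k in matrix_keys.items() for c in combo)

    reduction = 1 - len(bridge) / len(intermediate)
    print(referral_to_key)  # -> {1: 1, 2: 1, 3: 2}: referrals 1 and 2 share a key
    print(f"bridge rows: {len(bridge)}, reduction: {reduction:.0%}")
    ```

    Here the bridge shrinks from 5 intermediate rows to 3 because referrals 1 and 2 carry the same code combination; on real data the Reduction% query in the whitepaper's Appendix B gives the equivalent figure.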

    HTH, Martin


    http://martinmason.wordpress.com

    Friday, June 28, 2013 12:41 PM
  • Thank you Martin for your contribution. I actually tried the suggestion you made (amongst many other things) before even posting this question, as I have one area with that arrangement in my architecture (re: Consider resolving the M2M relationship between fact and diagnosis codes by creating another "Diagnosis Bridge" dimension that has a one-to-many relationship with your fact table and contains a single record for each unique combination of diagnosis codes on a given referral/claim.). Unfortunately the unique combinations between Referrals and DiagnosisProcedures have a very high count (millions), which at the end of the day boils down to the same predicament.

    As for the whitepaper, thank you again for the link. I have that paper in my collection, and what it suggests is quite involved, to be honest; I was hoping for a lightweight solution if possible.

    Unfortunately I am not able to access the article by Erik Veerman whose link you included. I keep getting a “Server error…unreachable at this moment” error message.

    Kind regards,

    Mpumelelo


    Mpumelelo

    Friday, June 28, 2013 1:11 PM
  • Are you sure that the matrix approach wouldn't reduce the number of rows in the intermediate measure group more significantly? There are only a couple of thousand unique ICD9 codes, and I seriously doubt that every referral uses a combination unique to that particular case. That would be extremely strange, as there are usually only a couple of diagnostic codes that are commonly used. Are you positive you've applied the matrix optimization techniques correctly? If you use a query similar to Appendix B in the referenced whitepaper, what do you get back for the Reduction%?

    HTH, Martin


    http://martinmason.wordpress.com

    Friday, June 28, 2013 2:13 PM
  • Thank you Martin. The problem is that while there may be only a couple of thousand unique ICD9 codes, the dataset I am dealing with has each Referral further broken down into something which the business calls “DiagnosisProceduresChronologyOfCodedEvents”. That breakdown is unfortunately implemented in the DiagnosisProcedures dataset. What it does is that under each given ReferralID, a chronology of events is assigned, and it is against this chronology that diagnosis procedures are appended. Therefore keeping the column “DiagnosisProceduresChronologyOfCodedEvents” complicates things, and getting rid of it complicates things even further. With “DiagnosisProceduresChronologyOfCodedEvents” in place, the unique combinations of ReferralsID and DiagnosisProcedures amount to millions.
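    A small sketch (invented data, hypothetical codes) of why the chronology column defeats the matrix optimization: once the combination includes the event sequence, two referrals with the same codes in a different order no longer share a combination, so the count of distinct combinations balloons toward the count of referrals:

    ```python
    # (referral, chronology_of_coded_events, code) -- invented sample rows
    rows = [
        (1, 1, "I10"), (1, 2, "E11"),
        (2, 1, "E11"), (2, 2, "I10"),   # same codes, different order
    ]
    referrals = {r for r, _, _ in rows}

    # Distinct combinations when the chronology is part of the key.
    with_chronology = {frozenset((seq, code) for r2, seq, code in rows if r2 == r)
                       for r in referrals}
    # Distinct combinations of codes alone.
    codes_only = {frozenset(code for r2, _, code in rows if r2 == r)
                  for r in referrals}

    print(len(with_chronology), len(codes_only))  # -> 2 1
    ```

    With only two referrals the effect is already visible: two distinct combinations with the chronology versus one without it, which is the Reduction% collapsing as described above.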


    Mpumelelo

    • Edited by Mpumelelo S Friday, June 28, 2013 2:45 PM clarity
    Friday, June 28, 2013 2:43 PM
  • Then you might need to think again about the approach. What questions are being asked, and how are those questions being answered? Are you better off creating multiple data marts to answer different types of questions? Depending on how much time your users have, a single measure group with a M2M relationship created from an intermediate measure group of 28M records is probably not going to be the answer.

    HTH, Martin


    http://martinmason.wordpress.com

    Friday, June 28, 2013 3:31 PM