none
How to partitions a fact table with 250+ million rows RRS feed

  • Question

  • Hello,

    In a separate but related thread here, I have been advised to partition our main Fact table which has over 250+ million rows in in it. The SSAS database is shown in the image below. The Daily Balances fact table is the one with 250+ million rows.

    In my attempt to partition the Cube, I changed the measure group's Binding Type and added the following WHERE clause:

    (SELECT [DateKey] FROM [SPBUK_DW].[BI_Cube].[Dates] where FilterLatestDay = 'Y')

    In other words, I thought I would limit the processing to latest rows that have been inserted into the Data Warehouse as part of the nightly batch.

    Unfortunately, even though the Cube now processes in minutes (instead of 4 hours), when I try use the Cube no data comes back.

    I haven't touched SSAS for 6 years so I appreciate that I am very rusty. Moreover, partitioning is completely new to me. So, I'd appreciate any pointers/help.

    Questions:

    • Do I need to create multiple partitions in the Cube? One with all of the data and one the latest data?
    • Do I need to partition the Data Warehouse fact table in the Database Engine too?
    • Do I need to make changed to the Dimensions too?

    Thanks in advance.

    Wednesday, April 12, 2017 11:02 AM

All replies

  • 1. Yes, you have to have 2 different partitions in the cube in this case. One for Y and one for N

    2. No, but it can be helpful for performance on the database side.

    3. No, dimensions do not have to be changed


    Thomas LeBlanc twitter ( @TheSmilingDBA )

    Wednesday, April 12, 2017 12:59 PM
  • Hi Thomas,

    Thanks for your reply.

    To be clear, when you say Yes, you have to have 2 different partitions in the cube in this case. One for Y and one for N, do you mean:

    • Yes, I need to create multiple partitions
    • No, I don't need one for the latest data.

    I am not clear what partitions I need to create: at the moment we have just one partition and we do a PROCESS FULL every night.


    • Edited by amir tohidi Wednesday, April 12, 2017 1:19 PM
    Wednesday, April 12, 2017 1:19 PM
  • once partition bound by query where clause with FilterLatestDay = 'Y'

    (SELECT [DateKey] FROM [SPBUK_DW].[BI_Cube].[Dates] where FilterLatestDay = 'Y')

    The second partitioned bound by query where clause with FilterLatestDay = 'N'

    (SELECT [DateKey] FROM [SPBUK_DW].[BI_Cube].[Dates] where FilterLatestDay = 'N')


    Thomas LeBlanc twitter ( @TheSmilingDBA )

    Wednesday, April 12, 2017 5:30 PM
  • Hi Amir,

    Thanks for your question.

    You may want to refer to "Partition Design Best Practices".

    Following are some simple best practices that can help ensure an effective data storage strategy that will enable the OLAP server to scale efficiently to your data volumes.

    1)Avoid having partitions with more than 20 million rows
    Large partitions should generally be broken up into multiple smaller partitions for better performance. As a rule, no partition should have more than 20 million rows.

    2)Avoid having many small partitions in a measure group
    Small partitions should generally be combined into fewer, larger partitions for better performance. As a guideline, consider combining partitions when a measure group has more than five partitions with less than two million rows each.

    3)Do set the Slice property on partitions that are ROLAP or partitions that use proactive caching
    Use of the Slice property lets the server know which partitions it must look in for data to answer queries.  For example, if a partition contains a slice that states it contains only data for 2002, when a query requests sales in 2003, the server will know it does not have to look in that partition in order to answer that query.

    For MOLAP partitions, you do not have to specify the slice because the server will be able to figure out the relevant partitions after the data has been processed.  However, when you use proactive caching, it’s a good idea to specify the slice manually since the partition may be treated as ROLAP during the construction of the MOLAP cache.  (If a slice is manually specified on a MOLAP partition and that slice is not consistent with the data that is contained in the partition, the server will raise an error during processing of the data.) 

    For ROLAP partitions, a slice should be specified or the partition will always have to be queried. 
    To set the Slice property in Business Intelligence Development Studio, select the partition in the Partitions tab of the Cube Editor. In the Properties window, click the text box for the Slice property.  Click the ellipsis (...) button to open the Partition Slice dialog box. You can type an MDX expression to specify the member on which you want to slice.  To slice on multiple hierarchies, enter a tuple containing the members on which you want to slice.

    4)Consider partitioning a distinct count measure group along the dimension used most often to query the distinct count measure
    When partitioning a measure group that includes a distinct count measure, consider partitioning the measure group along the dimension that is most often used to query distinct count measure. This will provide improved query performance by reducing the frequency that all partitions must be accessed.

    For more information, please refer to SQL Server Best Practices Article .

    Best Regards
    Willson Yuan
    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Thursday, April 13, 2017 6:29 AM
    Moderator
  • Thank you Thomas. That's made it much clearer.

    Using the proposed filters, I would end up with two partitions:

    1. Partition 1 : Latest Day
    2. Partition 2 : Everything up to, but not including, the latest day

    But, come tomorrow the facts in Partition 1 would end up in Partition 2 and the latest facts would end up in Partition 1. This then means that I need to process Partition 2 again (it has 250M+ rows). Doesn't this create processing overload or can I process using Process Add?


    Edit: I tried Process Default in DEV and it seemed to work. Testing in PRE PROD now.
    • Edited by amir tohidi Thursday, April 13, 2017 11:53 AM Added Edit
    Thursday, April 13, 2017 9:58 AM
  • Thank you Willson. I had already seen and read that article. It is also mentioned in one my books.

    Having partitioned my Cube as per Thomas's recommendation, I am not trying to figure out how to process the cubes (we use SSIS) so that we don't end up doing a full process every night.

    Thursday, April 13, 2017 10:00 AM
  • Hi Amir,

    Thanks for your response.

    In your scenario, you can do a processing full once a month for your cube. 

    For daily processing:

    For measure groups, please create a new partition to contain the new data which might get updated, deleted or inserted, and process full for this partition, then process index over all partitions. For the other partitions, there is no need to process them, as we have processed them once a month. 

    If you do not have data which might get updated, deleted for your partition 2, you can do a processing Add for it.


    Best Regards
    Willson Yuan
    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    • Proposed as answer by alexander fun Saturday, April 29, 2017 7:03 AM
    Monday, April 17, 2017 7:03 AM
    Moderator
  • Then the idea of partitioning with a Y or N column does not work in your case.

    William's suggestion and the article look at Date columns for partitioning. This would work better than the N/Y idea.

    There would be more work say for partitioning by month or quarter or even year, but that would be doable in your case.

    Sorry to confuse you with my previous responses.


    Thomas LeBlanc twitter ( @TheSmilingDBA )

    Monday, April 17, 2017 1:27 PM
  • Update:

    Using partitions, I have managed to reduce the cube processing from hours to under 30 minutes :-)

    Does anyone see any issues with the below?

    1. Create a partition for each Year
    2. In the SSIS package that kicks off the cubes refresh for 2016 and earlier years use Process Default. For 2017 and beyond use Process Full

    There is the problem that unless I can find a way to change 2017's processing, come next year I will need to release a new version of the package. I am now investigating how I can control this aspect of the process...


    • Edited by amir tohidi Thursday, April 27, 2017 1:00 PM
    Thursday, April 27, 2017 12:54 PM
  • I have just raised this post in the SSIS forum.
    Thursday, April 27, 2017 1:01 PM
  • Hi Amir,

    Thanks for your response.

    If you are using SSIS for processing partitions then you can simply use "Script Task" in your SSIS package and then write AMO code for processing partitions.

    Check following sample code which I have build on adventureworks sample and it can be used for processing partitions with ID ending with current year as Date.Today.Year . you can modify sample AMO code as per your requirement.


    Dim objServer As Server
        Dim objDatabase As Database
        Dim objCube As Cube
        Dim objMeasureGroup As MeasureGroup
        Dim objPartition As Partition

        objServer = New Server

        objServer.Connect("localhost")
        objDatabase = objServer.Databases("Adventure Works DW 2014")

        If objDatabase.Cubes.Count > 0 Then
          objCube = objDatabase.Cubes("Adventure Works")

          For Each objMeasureGroup In objCube.MeasureGroups
            For Each objPartition In objMeasureGroup.Partitions
              If Cint(objPartition.ID.Trim.Substring(objPartition.ID.Trim.Length - 4, 4)) >= Date.Today.Year Then
                objPartition.Process(ProcessType.ProcessFull)
              else
                objPartition.Process(ProcessType.ProcessDefault)
              End If
            Next
          Next
        End If

        objServer.Disconnect()


    Best Regards
    Willson Yuan
    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    • Proposed as answer by alexander fun Saturday, April 29, 2017 7:04 AM
    Friday, April 28, 2017 1:17 AM
    Moderator
  • Thanks Willson. That code doesn't work with SSIS 2014, but is definitely the kind of solution I was looking for.

    Does anyone have the equivalent SSIS 2014 version of the above?

    Wednesday, May 3, 2017 2:24 PM