none
Percentile calculation

    Question

  • Hi,

    I need to calculate some value at some specific percentile using OLAP.

    Simplified cube could look like this:

    Customer dimension (Name, Age)
    Location dimension (Name).

    Now, I need to calculate age at 75% percentile of all customers for each location .

    Lets say, that our data looks like this:

    There are 4 costumers:
    Customer A - Age 10
    Customer B - Age 20
    Customer C - Age 30
    Customer D - Age 40

    Only one location:
    Location A

    All customers belong to this location, so calculating age at 75th percentile would return customer C with age = 30
    How could I return such result with MDX?

    Thanks
    Tuesday, July 21, 2009 2:23 PM

Answers

  • Since there is no MDX Percentile(), I'm using the Excel formula documented here: Percentile. To fit the children data, I'm computing the 85th percentile for Florida, where there there are 11 samples. So the 85th percentile should be midway between the 9th and 10th samples, which are 2 and 4, ie. = 3:

     With
    Set [PercentileSamples] as
    Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
    [Customer].[Total Children].MemberValue, BASC)
    Member [Measures].[PercentileValue] as 85
    Member [Measures].[PercentileInt] as
    Int((([PercentileSamples].Count - 1) * [Measures].[PercentileValue]) / 100)
    Member [Measures].[PercentileFrac] as
    (([PercentileSamples].Count - 1) * [Measures].[PercentileValue]) / 100
    - [Measures].[PercentileInt]
    Member [Measures].[PercentileLo] as
    [PercentileSamples].Item([Measures].[PercentileInt]).Item(1).MemberValue
    Member [Measures].[PercentileHi] as
    [PercentileSamples].Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
    Member [Measures].[PercentileChildren] as
    ([Measures].[PercentileLo] * (1 - [Measures].[PercentileFrac]))
    + ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
    select
    {[Measures].[Customer Count],
    [Measures].[Internet Transaction Count],
    [Measures].[PercentileInt],
    [Measures].[PercentileFrac],
    [Measures].[PercentileLo],
    [Measures].[PercentileHi],
    [Measures].[PercentileChildren]} on 0
    from [Adventure Works]
    where [Customer].[Customer Geography].[State-Province].&[FL]&[US]
    -----------------------------------------------------------------
    Customer Count Internet Transaction Count PercentileInt PercentileFrac PercentileLo PercentileHi PercentileChildren
    3 11 8 0.5 2 4 3
    - Deepak
    • Marked as answer by Dell1505 Friday, August 21, 2009 6:34 AM
    Thursday, August 20, 2009 4:14 AM

All replies

  • In your specific example, you could use Tail(BottomCount(CustomerSet, 3, Age)) to get the 3rd customer in ascending order of age. Here's an Adventure Works example, where the sales for the reseller with 75th percentile Annual Sales in a specific city is determined. There are 24 resellers in Toronto, and the 18th reseller has $1.5 million in sales:

    With
    Member [Measures].[ResellerCitySales] as
    Sum(Exists([Reseller].[Reseller].[Reseller].Members,
    {LinkMember([Geography].[Geography Key].CurrentMember,
    [Reseller].[Geography Key])}),
    [Reseller].[Annual Sales].MemberValue),
    FORMAT_STRING = "Currency"
    Member [Measures].[City75PercentileSales] as
    Sum(Tail(BottomCount(Exists([Reseller].[Reseller].[Reseller].Members,
    Generate(Existing [Geography].[Geography Key].[Geography Key].Members,
    Exists([Reseller].[Reseller].[Reseller].Members,
    LinkMember([Geography].[Geography Key].CurrentMember,
    [Reseller].[Geography Key])))) as [CityResellers],
    (3 * [CityResellers].Count) / 4,
    [Reseller].[Annual Sales].MemberValue)),
    [Reseller].[Annual Sales].MemberValue),
    FORMAT_STRING = "Currency"
    select
    {[Measures].[ResellerCitySales],
    [Measures].[City75PercentileSales]} on 0,
    {[Geography].[City].&[Toronto]&[ON]} on 1
    from [Adventure Works]
    ------------------------------------------------------
     ResellerCitySales City75PercentileSales
    Toronto $1,110,900,000.00 $1,500,000.00

    - Deepak
    Tuesday, July 21, 2009 5:46 PM
  • Thanks. However, I modified your query for a simpler:

    With
    Member [Measures].[City75PercentileSales] as
    Sum(Exists([Reseller].[Reseller].[Reseller].Members, {LinkMember([Geography].[Geography Key].CurrentMember,
    [Reseller].[Geography Key])}),
    [Reseller].[Annual Sales].MemberValue)
    select
    {
    [Measures].[City75PercentileSales]} on 0,
    {[Geography].[City].Members} on 1
    from [Adventure Works]
    All cities have the same Annual Sales amount and I think this is not correct?

    Also, is there a simple solution without linking members for calculating sums in this example?
    Wednesday, July 22, 2009 5:58 AM
  • A couple of points to make:

    - ".. However, I modified your query for a simpler .." - unfortunately, it may now be too simple for the sample query, because [Geography Key] is at a lower granularity than [City] and is hence not set by selecting a City. So the calculation doesn't take the selected City into account - try the version below.

    - ".. is there a simple solution without linking members .." - LinkMember() is used because Geography is a reference dimension, with Reseller as the intermediate dimension. This was the closest analogy in Adventure Works to your original scenario with separate Customer and Location dimensions. However, if the Location attribute is included in the Customer dimension (as in Adventure Works), then there would be no need to use LinkMember(). The key construct is Tail(BottomCount({PercentileSet}, (3 * PercentileSet.Count) / 4, OrderingValue)), which returns the member of PercentileSet at the 75th percentile.

     

    With
    Member [Measures].[City75PercentileSales] as
    Sum(Exists([Reseller].[Reseller].[Reseller].Members,
    Generate(Existing [Geography].[Geography Key].[Geography Key].Members,
    Exists([Reseller].[Reseller].[Reseller].Members,
    LinkMember([Geography].[Geography Key].CurrentMember,
    [Reseller].[Geography Key])))),
    [Reseller].[Annual Sales].MemberValue),
    FORMAT_STRING = "Currency"
    select
    {
    [Measures].[City75PercentileSales]} on 0,
    {[Geography].[City].Members} on 1
    from [Adventure Works]
    
    
    - Deepak
     





    Wednesday, July 22, 2009 4:27 PM
  • I think you're oversimplifying this. 

    If you have 4 customers of those ages, the 75th percentile is not 30, it's 32.5.  And the median (50th percentile) is 25.  Try it in excel.

    If you have excel installed on the analysis server, you can use the excel function:

    WITH MEMBER [Measures].[75th percentile] AS
        Excel.Percentile(SETTOARRAY(<<yoursetexpression>>, [Measures].[Age]), .75)

    If you don't have excel handy, then it's a mess.  You need to get a count of the members in your set n(=4 in this case), subtract 1.

    Multiply (n-1)*(percentile), e.g. (3)(.75) = 2.25

    New integer j = integer portion of above, or 2
    New decimal g = decimal portion of above, or 0.25

    And the percentile age is the (j+1th member of your set)*(1-g) + (j+2th member of your set) * (g), or

    (30 * .75) + (40 * .25) = 32.5

    Creating all the calculated measures here is a nuisance but it works, and it's usually pretty fast.
    Wednesday, July 22, 2009 10:05 PM
  • Thanks jwr for explaining calculation of percentiles!

    However, I still do not get it - lets say I want to calculate median age over location.

    I have a fact table (which is also customer dimension):
    ID, Name, Age, LocationID

    And Location dimension
    LocationID, LocationName

    What should be written in Median function? How to create calculated measure for Age, I mean what aggregate function should be used? And why should I pass any set to median function if I know that LocationName will be on ROWS?

    WITH MEMBER [Measures].[Med] as
    MEDIAN (??, ??)
    SELECT
         {[Measures].[Med]} ON COLUMNS,
         { [Location].[LocationName].Members} ON ROWS
    FROM [SAMPLE CUBE]

    Thanks for your patience!


    Monday, July 27, 2009 2:20 PM
  • Still need help for this. Anyone?
    Tuesday, August 18, 2009 1:26 PM
  • Any response to my last post? Also, I'm not sure whether you're looking at this point for an example of Median (there are some in this Forum that I can point you to), or of Percentile.


    - Deepak
    Tuesday, August 18, 2009 1:56 PM
  • -- additional info:

    I have looked at thread http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/thread/b5d4d7cc-1f7c-407b-bdf0-58084b0d1cc9/ and I need the median to be 40, but not 90.

    -- end of additional info

    Thanks Deepak for your response, but at this moment it is too difficult.

    I have managed to understand how median is being calculated. However, I could not pass correct "calculation" for MEDIAN function.

    If I understand correctly foreach location MEDIAN function gets set of ages, groups them and applies some function to those ages (like sum, count, etc). And after that, MEDIAN for those grouped and calculated values is being chosen.

    So it appears like this:
    WITH MEMBER [Measures].[Med] as
    MEDIAN ([Customer].[Age].Members, ???)
    SELECT
         {[Measures].[Med]} ON COLUMNS,
         { [Location].[LocationName].Members} ON ROWS
    FROM [SAMPLE CUBE]

    Which function to set for MEDIAN?

    If for example we have customers with ages (5, 10, 10, 15, 20) for one location, what calculation should I pass?
    If I pass count, I get things like this:
    5 - 1
    10 - 2
    15 - 1
    20 - 1

    Ordered list: 1, 1, 1, 2

    Median is (1+1)/2  = 1.

    I hope you undestood my problem.
    • Edited by Dell1505 Tuesday, August 18, 2009 3:24 PM
    Tuesday, August 18, 2009 3:03 PM
  • Your scenario calls for computing the median of customer age, which is a dimension attribute. It's easier to discuss the approach with a similar example from Adventure Works - suppose we want to compute the median number of customer children by US state. The median for Utah is 1.5. To confirm the calculation, the 2nd query shows that there are 6 fact rows for Utah - with children counts of (0, 0, 0, 3, 3, 3). So the median is the average of the 3rd and 4th values = (0 + 3) / 2 = 1.5

    With
    Member [Measures].[MedianChildren] as
    Median(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
    [Customer].[Total Children].MemberValue)
    select
    {[Measures].[Customer Count],
    [Measures].[Internet Transaction Count],
    [Measures].[MedianChildren]} on 0,
    Non Empty [Customer].[State-Province].[State-Province].Members
    having [Measures].[Customer Count] > 1 on 1
    from [Adventure Works]
    where [Customer].[Country].&[United States]
    --------------------------------------------------------------
     Customer Count Internet Transaction Count MedianChildren
    ...
    Utah	2	6	1.5
    
    select
    {[Measures].[Internet Sales Amount]} on 0,
    Non Empty [Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members on 1
    from [Adventure Works]
    where [Customer].[State-Province].&[UT]&[US]
    -----------------------------------------------------------
    		Internet Sales Amount
    SO44953   Line 1	3	$3,578.27
    SO54776   Line 1	0	$24.99
    SO54776   Line 2	0	$3.99
    SO54776   Line 3	0	$34.99
    SO58892   Line 1	3	$742.35
    SO58892   Line 2	3	$34.99





    - Deepak
    Tuesday, August 18, 2009 4:49 PM
  • Thanks Deepak! That helped a lot!

    However, could you help me a little more - now, instead of MEDIAN I need to calculate percentile (e.g. 95th).

    Could you provide an example from Adventure Works - percentile should be calculated with dimension attribute?
    Wednesday, August 19, 2009 7:17 AM
  • Since there is no MDX Percentile(), I'm using the Excel formula documented here: Percentile. To fit the children data, I'm computing the 85th percentile for Florida, where there there are 11 samples. So the 85th percentile should be midway between the 9th and 10th samples, which are 2 and 4, ie. = 3:

     With
    Set [PercentileSamples] as
    Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
    [Customer].[Total Children].MemberValue, BASC)
    Member [Measures].[PercentileValue] as 85
    Member [Measures].[PercentileInt] as
    Int((([PercentileSamples].Count - 1) * [Measures].[PercentileValue]) / 100)
    Member [Measures].[PercentileFrac] as
    (([PercentileSamples].Count - 1) * [Measures].[PercentileValue]) / 100
    - [Measures].[PercentileInt]
    Member [Measures].[PercentileLo] as
    [PercentileSamples].Item([Measures].[PercentileInt]).Item(1).MemberValue
    Member [Measures].[PercentileHi] as
    [PercentileSamples].Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
    Member [Measures].[PercentileChildren] as
    ([Measures].[PercentileLo] * (1 - [Measures].[PercentileFrac]))
    + ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
    select
    {[Measures].[Customer Count],
    [Measures].[Internet Transaction Count],
    [Measures].[PercentileInt],
    [Measures].[PercentileFrac],
    [Measures].[PercentileLo],
    [Measures].[PercentileHi],
    [Measures].[PercentileChildren]} on 0
    from [Adventure Works]
    where [Customer].[Customer Geography].[State-Province].&[FL]&[US]
    -----------------------------------------------------------------
    Customer Count Internet Transaction Count PercentileInt PercentileFrac PercentileLo PercentileHi PercentileChildren
    3 11 8 0.5 2 4 3
    - Deepak
    • Marked as answer by Dell1505 Friday, August 21, 2009 6:34 AM
    Thursday, August 20, 2009 4:14 AM
  • Thanks Deepak!

    There is one issue left - set that you defined is for Florida only - if I would like to calculate percentile for each state, how can I change set to be linked with state?

    We are using SSAS 2005.

    -- Edited:

    I do not know if this query is correct for all states. It takes about a minute to list results - how could we improve this?:

    With
    Member [Measures].[cnt] as
    Count(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]))
    Member [Measures].[PercentileValue] as 85
    Member [Measures].[PercentileInt] as
    Int((([Measures].[cnt] - 1) * [Measures].[PercentileValue]) / 100)
    Member [Measures].[PercentileFrac] as
    (Int([Measures].[cnt] - 1) * [Measures].[PercentileValue]) / 100
    - [Measures].[PercentileInt]
    Member [Measures].[PercentileLo] as
    (Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
    [Customer].[Total Children].MemberValue, BASC)).Item([Measures].[PercentileInt]).Item(1).MemberValue
    Member [Measures].[PercentileHi] as
    (Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
    * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
    [Customer].[Total Children].MemberValue, BASC)).Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
    Member [Measures].[PercentileChildren] as
    ([Measures].[PercentileLo] * (1 - [Measures].[PercentileFrac]))
    + ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
    select
    {
    [Measures].[Customer Count],
    [Measures].[Internet Transaction Count],
    [Measures].[PercentileInt],
    [Measures].[PercentileFrac],
    [Measures].[PercentileLo],
    [Measures].[PercentileHi],
    [Measures].[PercentileChildren],
    [Measures].[cnt]
    } on 0,
    [Customer].[Customer Geography].[State-Province].Members
    having [Measures].[Internet Sales Amount] > 0 on 1
    from [Adventure Works]

    Thursday, August 20, 2009 7:13 AM
  • ".. I do not know if this query is correct for all states. It takes about a minute to list results - how could we improve this? .." - the modified query looks correct. The Named Set I used only applied to a single state, so that had to be incorporated into calculated measures. Performance is another matter - are you asking just for this specific query, or for percentile in general - in the latter case, I would suggest starting a separate thread, as this one is too long already.
    - Deepak
    Thursday, August 20, 2009 2:58 PM
  • Hi,

    Simple question regarding your sample query - if I did not want to get the dimension member value at a particular percentile but instead wanted to get the measure value, how would I go about modifying the above query to get that?

    Thanks
    Thursday, August 20, 2009 4:35 PM
  • Hi contec,

    when you have an expression like this

        <multidimensional_set>.Item(x).Item(y).MemberValue

    you can get the measure value using

      (<multidimensional_set>.Item(x), Measures.YourMeasure)

    Btw, the optimized version of the query is here.

    Regards,


    Tomislav Piasevoli
    Business Intelligence Specialist
    www.softpro.hr
    Saturday, August 29, 2009 8:49 AM
  • With

    Member

     

    [Measures].[cnt] as

    Count

     

    (NonEmpty(

    [DimContract].[Contract Hierarchy].[Contract].

    members*[DimPremise].[Premise].[Premise Id].members*

    [DimDate].[Calendar].[Month Name].

    members,[Measures].[Quantity]))

    Member

     

    [Measures].[Percentile85] as 85

    Member

     

    [Measures].[PercentileInt85] as

    Int((([Measures].[cnt]) * [Measures].[Percentile85]) / 100)

    Member

     

    [Measures].[PercentileLo] as

    (

    Order(NonEmpty([DimContract].[Contract Hierarchy].[Contract].members*[DimPremise].[Premise].[Premise Id].members*

    [DimDate].[Calendar].[Month Name].

    members,[Measures].[Quantity]),

    [Measures].[Quantity].

    Value, BDESC)).Item([Measures].[PercentileInt85]).Item(0).MemberValue

    select

    {

    [Measures].[cnt],

    [Measures].[Percentile85],[Measures].[PercentileInt85],

    [Measures].[PercentileLo]

    }

    on 0

    from

     

    [Consumption Hub]

    cnt 85th percentile 85th percentile position Contract

    -- ---------------- ----------------------- ---------

    103 85 87 45000056

    Thursday, July 22, 2010 12:25 AM
  • If I'm only looking to generate the 25th and 75th percentiles, is there perhaps a more efficient way to utilize the MEDIAN function twice for each percentile to optimize performance?  In other words, since the MEDIAN represents the 50th percentile and the function is already available natively to SSAS, could you not get the 25th percentile by creating a set from 0 to the MEDIAN, and then taking MEDIAN of that resulting set? 

    I'm not sure that that would be faster, but it seems logical to me.  I'm not well-versed enough in MDX to figure this out, but I'm hoping that it may provide an easier way to get at the percentile performance issue from another angle.

     

     

    Wednesday, June 29, 2011 3:06 PM