Answered by:
Percentile calculation

Hi,
I need to calculate some value at some specific percentile using OLAP.
Simplified cube could look like this:
Customer dimension (Name, Age)
Location dimension (Name).
Now, I need to calculate age at 75% percentile of all customers for each location .
Lets say, that our data looks like this:
There are 4 costumers:
Customer A  Age 10
Customer B  Age 20
Customer C  Age 30
Customer D  Age 40
Only one location:
Location A
All customers belong to this location, so calculating age at 75th percentile would return customer C with age = 30
How could I return such result with MDX?
Thanks
Question
Answers

Since there is no MDX Percentile(), I'm using the Excel formula documented here: Percentile. To fit the children data, I'm computing the 85th percentile for Florida, where there there are 11 samples. So the 85th percentile should be midway between the 9th and 10th samples, which are 2 and 4, ie. = 3:
With
Set [PercentileSamples] as
Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
* [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
[Customer].[Total Children].MemberValue, BASC)
Member [Measures].[PercentileValue] as 85
Member [Measures].[PercentileInt] as
Int((([PercentileSamples].Count  1) * [Measures].[PercentileValue]) / 100)
Member [Measures].[PercentileFrac] as
(([PercentileSamples].Count  1) * [Measures].[PercentileValue]) / 100
 [Measures].[PercentileInt]
Member [Measures].[PercentileLo] as
[PercentileSamples].Item([Measures].[PercentileInt]).Item(1).MemberValue
Member [Measures].[PercentileHi] as
[PercentileSamples].Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
Member [Measures].[PercentileChildren] as
([Measures].[PercentileLo] * (1  [Measures].[PercentileFrac]))
+ ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
select
{[Measures].[Customer Count],
[Measures].[Internet Transaction Count],
[Measures].[PercentileInt],
[Measures].[PercentileFrac],
[Measures].[PercentileLo],
[Measures].[PercentileHi],
[Measures].[PercentileChildren]} on 0
from [Adventure Works]
where [Customer].[Customer Geography].[StateProvince].&[FL]&[US]

Customer Count Internet Transaction Count PercentileInt PercentileFrac PercentileLo PercentileHi PercentileChildren
3 11 8 0.5 2 4 3
 Deepak Marked as answer by Dell1505 Friday, August 21, 2009 6:34 AM
All replies

In your specific example, you could use Tail(BottomCount(CustomerSet, 3, Age)) to get the 3rd customer in ascending order of age. Here's an Adventure Works example, where the sales for the reseller with 75th percentile Annual Sales in a specific city is determined. There are 24 resellers in Toronto, and the 18th reseller has $1.5 million in sales:
With Member [Measures].[ResellerCitySales] as Sum(Exists([Reseller].[Reseller].[Reseller].Members, {LinkMember([Geography].[Geography Key].CurrentMember, [Reseller].[Geography Key])}), [Reseller].[Annual Sales].MemberValue), FORMAT_STRING = "Currency" Member [Measures].[City75PercentileSales] as Sum(Tail(BottomCount(Exists([Reseller].[Reseller].[Reseller].Members, Generate(Existing [Geography].[Geography Key].[Geography Key].Members, Exists([Reseller].[Reseller].[Reseller].Members, LinkMember([Geography].[Geography Key].CurrentMember, [Reseller].[Geography Key])))) as [CityResellers], (3 * [CityResellers].Count) / 4, [Reseller].[Annual Sales].MemberValue)), [Reseller].[Annual Sales].MemberValue), FORMAT_STRING = "Currency" select {[Measures].[ResellerCitySales], [Measures].[City75PercentileSales]} on 0, {[Geography].[City].&[Toronto]&[ON]} on 1 from [Adventure Works]  ResellerCitySales City75PercentileSales Toronto $1,110,900,000.00 $1,500,000.00
 Deepak 
Thanks. However, I modified your query for a simpler:
With Member [Measures].[City75PercentileSales] as Sum(Exists([Reseller].[Reseller].[Reseller].Members, {LinkMember([Geography].[Geography Key].CurrentMember, [Reseller].[Geography Key])}), [Reseller].[Annual Sales].MemberValue) select { [Measures].[City75PercentileSales]} on 0, {[Geography].[City].Members} on 1 from [Adventure Works]
All cities have the same Annual Sales amount and I think this is not correct?
Also, is there a simple solution without linking members for calculating sums in this example? 
A couple of points to make:
 ".. However, I modified your query for a simpler .."  unfortunately, it may now be too simple for the sample query, because [Geography Key] is at a lower granularity than [City] and is hence not set by selecting a City. So the calculation doesn't take the selected City into account  try the version below.
 ".. is there a simple solution without linking members .."  LinkMember() is used because Geography is a reference dimension, with Reseller as the intermediate dimension. This was the closest analogy in Adventure Works to your original scenario with separate Customer and Location dimensions. However, if the Location attribute is included in the Customer dimension (as in Adventure Works), then there would be no need to use LinkMember(). The key construct is Tail(BottomCount({PercentileSet}, (3 * PercentileSet.Count) / 4, OrderingValue)), which returns the member of PercentileSet at the 75th percentile.
With Member [Measures].[City75PercentileSales] as Sum(Exists([Reseller].[Reseller].[Reseller].Members, Generate(Existing [Geography].[Geography Key].[Geography Key].Members, Exists([Reseller].[Reseller].[Reseller].Members, LinkMember([Geography].[Geography Key].CurrentMember, [Reseller].[Geography Key])))), [Reseller].[Annual Sales].MemberValue), FORMAT_STRING = "Currency" select { [Measures].[City75PercentileSales]} on 0, {[Geography].[City].Members} on 1 from [Adventure Works]
 Deepak 
I think you're oversimplifying this.
If you have 4 customers of those ages, the 75th percentile is not 30, it's 32.5. And the median (50th percentile) is 25. Try it in excel.
If you have excel installed on the analysis server, you can use the excel function:
WITH MEMBER [Measures].[75th percentile] AS
Excel.Percentile(SETTOARRAY(<<yoursetexpression>>, [Measures].[Age]), .75)
If you don't have excel handy, then it's a mess. You need to get a count of the members in your set n(=4 in this case), subtract 1.
Multiply (n1)*(percentile), e.g. (3)(.75) = 2.25
New integer j = integer portion of above, or 2
New decimal g = decimal portion of above, or 0.25
And the percentile age is the (j+1th member of your set)*(1g) + (j+2th member of your set) * (g), or
(30 * .75) + (40 * .25) = 32.5
Creating all the calculated measures here is a nuisance but it works, and it's usually pretty fast. 
Thanks jwr for explaining calculation of percentiles!
However, I still do not get it  lets say I want to calculate median age over location.
I have a fact table (which is also customer dimension):
ID, Name, Age, LocationID
And Location dimension
LocationID, LocationName
What should be written in Median function? How to create calculated measure for Age, I mean what aggregate function should be used? And why should I pass any set to median function if I know that LocationName will be on ROWS?
WITH MEMBER [Measures].[Med] as
MEDIAN (??, ??)
SELECT
{[Measures].[Med]} ON COLUMNS,
{ [Location].[LocationName].Members} ON ROWS
FROM [SAMPLE CUBE]
Thanks for your patience!


Any response to my last post? Also, I'm not sure whether you're looking at this point for an example of Median (there are some in this Forum that I can point you to), or of Percentile.
 Deepak Edited by Deepak PuriModerator Tuesday, August 18, 2009 1:58 PM

 additional info:
I have looked at thread http://social.msdn.microsoft.com/Forums/enUS/sqlanalysisservices/thread/b5d4d7cc1f7c407bbdf058084b0d1cc9/ and I need the median to be 40, but not 90.
 end of additional info
Thanks Deepak for your response, but at this moment it is too difficult.
I have managed to understand how median is being calculated. However, I could not pass correct "calculation" for MEDIAN function.
If I understand correctly foreach location MEDIAN function gets set of ages, groups them and applies some function to those ages (like sum, count, etc). And after that, MEDIAN for those grouped and calculated values is being chosen.
So it appears like this:
WITH MEMBER [Measures].[Med] as
MEDIAN ([Customer].[Age].Members, ???)
SELECT
{[Measures].[Med]} ON COLUMNS,
{ [Location].[LocationName].Members} ON ROWS
FROM [SAMPLE CUBE]
Which function to set for MEDIAN?
If for example we have customers with ages (5, 10, 10, 15, 20) for one location, what calculation should I pass?
If I pass count, I get things like this:
5  1
10  2
15  1
20  1
Ordered list: 1, 1, 1, 2
Median is (1+1)/2 = 1.
I hope you undestood my problem. Edited by Dell1505 Tuesday, August 18, 2009 3:24 PM

Your scenario calls for computing the median of customer age, which is a dimension attribute. It's easier to discuss the approach with a similar example from Adventure Works  suppose we want to compute the median number of customer children by US state. The median for Utah is 1.5. To confirm the calculation, the 2nd query shows that there are 6 fact rows for Utah  with children counts of (0, 0, 0, 3, 3, 3). So the median is the average of the 3rd and 4th values = (0 + 3) / 2 = 1.5
With Member [Measures].[MedianChildren] as Median(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members * [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]), [Customer].[Total Children].MemberValue) select {[Measures].[Customer Count], [Measures].[Internet Transaction Count], [Measures].[MedianChildren]} on 0, Non Empty [Customer].[StateProvince].[StateProvince].Members having [Measures].[Customer Count] > 1 on 1 from [Adventure Works] where [Customer].[Country].&[United States]  Customer Count Internet Transaction Count MedianChildren ... Utah 2 6 1.5
select {[Measures].[Internet Sales Amount]} on 0, Non Empty [Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members * [Customer].[Total Children].[Total Children].Members on 1 from [Adventure Works] where [Customer].[StateProvince].&[UT]&[US]  Internet Sales Amount SO44953 Line 1 3 $3,578.27 SO54776 Line 1 0 $24.99 SO54776 Line 2 0 $3.99 SO54776 Line 3 0 $34.99 SO58892 Line 1 3 $742.35 SO58892 Line 2 3 $34.99
 Deepak Edited by Deepak PuriModerator Tuesday, August 18, 2009 5:00 PM

Thanks Deepak! That helped a lot!
However, could you help me a little more  now, instead of MEDIAN I need to calculate percentile (e.g. 95th).
Could you provide an example from Adventure Works  percentile should be calculated with dimension attribute? 
Since there is no MDX Percentile(), I'm using the Excel formula documented here: Percentile. To fit the children data, I'm computing the 85th percentile for Florida, where there there are 11 samples. So the 85th percentile should be midway between the 9th and 10th samples, which are 2 and 4, ie. = 3:
With
Set [PercentileSamples] as
Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
* [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
[Customer].[Total Children].MemberValue, BASC)
Member [Measures].[PercentileValue] as 85
Member [Measures].[PercentileInt] as
Int((([PercentileSamples].Count  1) * [Measures].[PercentileValue]) / 100)
Member [Measures].[PercentileFrac] as
(([PercentileSamples].Count  1) * [Measures].[PercentileValue]) / 100
 [Measures].[PercentileInt]
Member [Measures].[PercentileLo] as
[PercentileSamples].Item([Measures].[PercentileInt]).Item(1).MemberValue
Member [Measures].[PercentileHi] as
[PercentileSamples].Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
Member [Measures].[PercentileChildren] as
([Measures].[PercentileLo] * (1  [Measures].[PercentileFrac]))
+ ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
select
{[Measures].[Customer Count],
[Measures].[Internet Transaction Count],
[Measures].[PercentileInt],
[Measures].[PercentileFrac],
[Measures].[PercentileLo],
[Measures].[PercentileHi],
[Measures].[PercentileChildren]} on 0
from [Adventure Works]
where [Customer].[Customer Geography].[StateProvince].&[FL]&[US]

Customer Count Internet Transaction Count PercentileInt PercentileFrac PercentileLo PercentileHi PercentileChildren
3 11 8 0.5 2 4 3
 Deepak Marked as answer by Dell1505 Friday, August 21, 2009 6:34 AM

Thanks Deepak!
There is one issue left  set that you defined is for Florida only  if I would like to calculate percentile for each state, how can I change set to be linked with state?
We are using SSAS 2005.
 Edited:
I do not know if this query is correct for all states. It takes about a minute to list results  how could we improve this?:
With
Member [Measures].[cnt] as
Count(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
* [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]))
Member [Measures].[PercentileValue] as 85
Member [Measures].[PercentileInt] as
Int((([Measures].[cnt]  1) * [Measures].[PercentileValue]) / 100)
Member [Measures].[PercentileFrac] as
(Int([Measures].[cnt]  1) * [Measures].[PercentileValue]) / 100
 [Measures].[PercentileInt]
Member [Measures].[PercentileLo] as
(Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
* [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
[Customer].[Total Children].MemberValue, BASC)).Item([Measures].[PercentileInt]).Item(1).MemberValue
Member [Measures].[PercentileHi] as
(Order(NonEmpty([Internet Sales Order Details].[Internet Sales Orders].[Order Line].Members
* [Customer].[Total Children].[Total Children].Members, [Measures].[Internet Sales Amount]),
[Customer].[Total Children].MemberValue, BASC)).Item([Measures].[PercentileInt] + 1).Item(1).MemberValue
Member [Measures].[PercentileChildren] as
([Measures].[PercentileLo] * (1  [Measures].[PercentileFrac]))
+ ([Measures].[PercentileHi] * [Measures].[PercentileFrac])
select
{
[Measures].[Customer Count],
[Measures].[Internet Transaction Count],
[Measures].[PercentileInt],
[Measures].[PercentileFrac],
[Measures].[PercentileLo],
[Measures].[PercentileHi],
[Measures].[PercentileChildren],
[Measures].[cnt]
} on 0,
[Customer].[Customer Geography].[StateProvince].Members
having [Measures].[Internet Sales Amount] > 0 on 1
from [Adventure Works] 
".. I do not know if this query is correct for all states. It takes about a minute to list results  how could we improve this? .."  the modified query looks correct. The Named Set I used only applied to a single state, so that had to be incorporated into calculated measures. Performance is another matter  are you asking just for this specific query, or for percentile in general  in the latter case, I would suggest starting a separate thread, as this one is too long already.
 Deepak 

Hi contec,
when you have an expression like this
<multidimensional_set>.Item(x).Item(y).MemberValue
you can get the measure value using
(<multidimensional_set>.Item(x), Measures.YourMeasure)
Btw, the optimized version of the query is here.
Regards,
Tomislav Piasevoli
Business Intelligence Specialist
www.softpro.hr 
With
Member
[Measures].[cnt] as
Count
(NonEmpty(
[DimContract].[Contract Hierarchy].[Contract].
members*[DimPremise].[Premise].[Premise Id].members*
[DimDate].[Calendar].[Month Name].
members,[Measures].[Quantity]))
Member
[Measures].[Percentile85] as 85
Member
[Measures].[PercentileInt85] as
Int((([Measures].[cnt]) * [Measures].[Percentile85]) / 100)
Member
[Measures].[PercentileLo] as
(
Order(NonEmpty([DimContract].[Contract Hierarchy].[Contract].members*[DimPremise].[Premise].[Premise Id].members*
[DimDate].[Calendar].[Month Name].
members,[Measures].[Quantity]),
[Measures].[Quantity].
Value, BDESC)).Item([Measures].[PercentileInt85]).Item(0).MemberValue
select
{
[Measures].[cnt],
[Measures].[Percentile85],[Measures].[PercentileInt85],
[Measures].[PercentileLo]
}
on 0
from
[Consumption Hub]
cnt 85th percentile 85th percentile position Contract
   
103 85 87 45000056

If I'm only looking to generate the 25th and 75th percentiles, is there perhaps a more efficient way to utilize the MEDIAN function twice for each percentile to optimize performance? In other words, since the MEDIAN represents the 50th percentile and the function is already available natively to SSAS, could you not get the 25th percentile by creating a set from 0 to the MEDIAN, and then taking MEDIAN of that resulting set?
I'm not sure that that would be faster, but it seems logical to me. I'm not wellversed enough in MDX to figure this out, but I'm hoping that it may provide an easier way to get at the percentile performance issue from another angle.