none
Why does Mircosoft Power Query still use calculate a Median Function in a "unique" way RRS feed

  • Question

  • Why is the Median Function in Power Query still calculating incorrectly. 

    "When there are an even number of items, the Median function in Power Query returns the lower of the two middle numbers as the median."

    This is not the correct way to calculate a median and is different from how Excel (and my MATHS lecturers) correctly calculate a median.

     I note that this "bug" was acknowledged by Curt Hagenlocher                on         Wednesday, February 12, 2014 7:38 PM Why no action

    Wednesday, February 15, 2017 10:48 PM

Answers

  • Almost four years and counting. Obviously, fixing this function is not a priority. I've created a custom median function that works correctly (along with a whole bunch of other "missing" stat functions).

    List_Median function:

    (list as list) as number => 
    let
        ListCount = List.Count(list),
        SortedList = List.Sort(list),
        Median = if Number.IsOdd(ListCount) then 
                    SortedList{Number.IntegerDivide(ListCount,2)}
                 else
                   (SortedList{ListCount/2}+
                   SortedList{(ListCount/2)-1})/2
    in
        Median


    Wednesday, February 15, 2017 11:07 PM

All replies

  • Almost four years and counting. Obviously, fixing this function is not a priority. I've created a custom median function that works correctly (along with a whole bunch of other "missing" stat functions).

    List_Median function:

    (list as list) as number => 
    let
        ListCount = List.Count(list),
        SortedList = List.Sort(list),
        Median = if Number.IsOdd(ListCount) then 
                    SortedList{Number.IntegerDivide(ListCount,2)}
                 else
                   (SortedList{ListCount/2}+
                   SortedList{(ListCount/2)-1})/2
    in
        Median


    Wednesday, February 15, 2017 11:07 PM
  • Hi Colin can you explain how I incorporate this code as part of a Query.

    I assume I need to add a custom column called "Median" to my table which  is called Reviewer Scores

    However I get syntax errors when I try to do this

    Monday, April 3, 2017 4:32 AM
  • Paste the code into a blank query and give it a name. Then use the name in a custom column as you would with any built-in function (taking a list as a parameter i.e. tableName[ColumnName]).
    Monday, April 3, 2017 12:13 PM
  • Another option would be to exploit the weird potential of parameter comparisonCriteria, which allows to provide a sort order: with Order.Descending (1), List.Median will return the higher of the middle 2 values, in case of an equal number of elements in the list.

    Syntax: List.Median(list as list, optional comparisonCriteria as any) as any

    The function would then become simply:

    (list as list) as number => List.Average({List.Median(list,0),List.Median(list,1)})

    Wednesday, May 17, 2017 3:20 AM
  • > "Almost four years and counting. Obviously, fixing this function is not a priority."

    OK, so now I know why for serious statistical calculations people use SAS, SPSS, Statistica or free R. I's five minute work for the programmer to fix Median in PQ. Excel has bad reputation in statistical functions and Microsoft does nothing to change it.

    One more example: PQ's MODULO transform returns different results from function MOD in Excel.

    Check MOD(-3;2)

    Are there two different statistics in one Excel suit?


    Lech Jaszowski

    Saturday, August 19, 2017 11:13 AM
  • As you can see, I contributed to this topic earlier and I spent some time on it

    I have a theory why this wouldn't be so simple: the input for List.Median can be a structured value and the result must be 1 of those values.

    Quoted from my notes with an example in which the median is taken from a number field in a list of records:

    = List.Average({List.Median(SampleData[Records1], each [Number Field])[Number Field],
                    List.Median(SampleData[Records1], {each [Number Field],1})[Number Field]})

    Notice that no record can be returned with the correct median value, if the median value is calculated as the average of 2 middle values. (Would this be the background for the known issue?).

    The other remark about differences between Excel and Power Query: there are many more differences like:

    • default rounding for ties (Excel: away from zero, PQ: towards nearest even number),
    • iseven / isodd: try for a negative fractional number,
    • the default base for logaritmics: Excel 10, PQ e (2.71.....),
    • 2/29/1900 exists in Excel, not in reality nor in PQ
    • Etcetera
    • Etcetera

    They are just 2 different products.

    Saturday, August 19, 2017 1:04 PM
  • "Excel has bad reputation in statistical functions and Microsoft does nothing to change it."

    This statement hasn't been true since Excel 2010, when a massive rewrite of the stat functions was undertaken. Excel is now as good as any other stat package based on a function-to-function comparison (and not necessarily by any other standards).

    However, you are completely correct about the MOD function. M calculates MOD(-3,2) (i.e. Number.Mod(-3,2)) incorrectly. The result should be a positive number.


    Saturday, August 19, 2017 9:53 PM
  • "Notice that no record can be returned with the correct median value, if the median value is calculated as the average of 2 middle values."

    Marcel, could you elaborate? Using your formula with a even list of records yielded to correct median value (average of two middle values).

    "The other remark about differences between Excel and Power Query: there are many more differences like:

    • default rounding for ties (Excel: away from zero, PQ: towards nearest even number),
    • iseven / isodd: try for a negative fractional number,
    • the default base for logaritmics: Excel 10, PQ e (2.71.....),
    • 2/29/1900 exists in Excel, not in reality nor in PQ
    • Etcetera
    • Etcetera

    They are just 2 different products"

    I don't think that you make a strong argument here. The defaults and other stuff you list are peculiarities inherent in many products.

    The important issue is whether M's result of a function is compatible with other products that all calculate the function the same by default. All of the stat packages listed by Lech, plus Excel & VertiPaq, calculate MOD and MEDIAN (and probably some other functions) in the exact same manner. What is the advantage of being contrary?

    The danger is that many users of M will perform certain calculations without realizing that the results are wrong. Addressing these issues is not low priority for the development team - there is no priority at all.

    It is left to the community of users to identify the issues and provide solutions (like your median solution).


    Saturday, August 19, 2017 10:32 PM
  • Colin: "value" in Power Query can have a very broad definition, varying from a single value to (nested) structured values (lists, records and table) and functions, together with their associated types and metadata.

    In that perspective: List.Median returns the middle occurrence of a list of values, where values can be (almost) anything (I guess it will become difficult to get the median of a list of functions or a list of types or any other values that doesn't allow ordinal comparisons like less than etcetera).

    Below an example with a list of records as the Source, followed by 3 different List.Median's that each return one of the records.

    let
        Source = {[Id = 1, Name = "Marcel", SurName = "Beugelsdijk"],
                  [Id = 2, Name = "Colin",  SurName = "Banfield"],
                  [Id = 3, Name = "Lech",   SurName = "Jaszowski"]},
        MedianId =      List.Median(Source,each [Id]),
        MedianName =    List.Median(Source,each [Name]),
        MedianSurName = List.Median(Source,each [SurName])
    in
        MedianSurName
    I wonder what would be the "correct" median of "Banfield" and "Beugelsdijk"? ;-)

    Sunday, August 20, 2017 8:50 AM
  • Colin:

    Yes, I agree that the statement about Excel’s bad reputation is now not true. But older Excel users find it hard to forget these experiences (I've been teaching Excel since 1996). The more so because additionally the names of statistical functions were poorly translated to Polish. So only the most persistent folks were able to use them. I just love Excel and I do not want the same happen to Power Query. Only and so much.

    As for the MOD function, I think this time Power Query is right, and Excel is wrong. In Excel, the help for the MOD function is given a formula to which Excel does not apply. That formula is:

    MOD(n,d) = n – d * INT(n/d)

    Excel adds to it: “The result has the same sign as divisor.”

    I don’t know why. I would not complain if both programs would generate the same value.

    The danger is that many users of M will perform certain calculations without realizing that the results are wrong.”

    I'm afraid of that most. Again this bad reputation ...

    Marcel:

    Thank you for writing the correct MEDIAN function and explanation. An interesting example :)
    On Wikipedia I read about other possible median definitions:

    • A version for even n, instead of the arithmetic average one of the middle values is drawn(!). Such a median does not produce a result beyond the set of previous values. It is particularly applicable for binary bitmap processing (black&white). The classical median would then require putting gray in addition to the existing white and black colors.
    • It can also be assumed that if n is even, then the median can equally well be any number from the first to the second middle values. Each of these numbers fulfills the requirement to minimize the mean of the modulus of deviation.

    I wonder in what programming language is written the same Power Query? After all, not in M. Therefore, programmers, creators of PQ have other ways of counting median. I guess so but I'm not sure. By the way: in what language is Excel written?


    Lech Jaszowski

    Sunday, August 20, 2017 2:45 PM
  • Hi Marcel,

    I think that most people (if not all people) would expect the median calculation to return a numeric value. In fact, median doesn't make sense with any other type of value, since like standard deviation, variance, and mean, it is a measure of central tendency in a data set. 

    "I wonder what would be the "correct" median of "Banfield" and "Beugelsdijk"? ;-)"

    "Imke" :D

    Sunday, August 20, 2017 6:41 PM
  • It looks like the developers of Power Query / M were the exception.

    The long description of the function List.Median:

    Returns the median item of the list <code>list</code>. This function throws an exception if the list is empty.     If there is an even number of items, the function chooses the smaller of the two median items.

    So the description clearly mentions items, not values (or numbers).

    Otherwise I agree that one would expect the median of numeric values, with the average of the 2 middle values in case of an even number of values.

    I'm not defending Power Query; I just try to understand it.

    Sunday, August 20, 2017 7:28 PM
  • "As for the MOD function, I think this time Power Query is right, and Excel is wrong. In Excel, the help for the MOD function is given a formula to which Excel does not apply."

    No, Excel is correct. I've checked the explanation with several online sources as well as the calculation in R. In R, -3%%2 returns 1. 3%%-2 returns -1. Excel returns the same values as R. In PQ, the signs are reversed. 

    "I'm afraid of that most. Again this bad reputation ..."

    Like I said previously, Excel has corrected the issues raised by academics. If there is still lingering bitterness by folks, then it's unfortunate. The point is, they addressed the issues, and involved standard bodies and academic researchers in the process. What more can they do?

    "On Wikipedia I read about other possible median definitions"

    But that's purely academic. What actual products produce by default is what matters.

    "I wonder in what programming language is written the same Power Query?"

    I think that it's written in C#

    "By the way: in what language is Excel written?"

    Currently C++.

    Sunday, August 20, 2017 7:38 PM
  • Marcel,

    Well, it never occurred to me that anything other than numbers was implied in the List.Median description :)

    However, your observations point to the unevenness of the documentation in general.
    Sunday, August 20, 2017 7:49 PM
  • So, this is certainly an embarrassing bug -- all the more so because it's been open for almost four years now. As has been surmised, the biggest problem here is that List.Median works with arbitrary lists and not just lists of numbers. When you have a "variant" type (which we call "any") then you have to be able to do things like impose an arbitrary sort order across types in order to satisfy requests to sort by a particular column. Once you have such a thing (and if you're not aware of the standard definition), then the median just "falls out" of the behavior of the sort.

    When I first discovered this, I suggested that we retain the existing List.Median function without changing it and then add a new function for numbers -- say, Number.Median -- which would produce an error if its list argument contained anything other than a number. Unfortunately, we then let ourselves get distracted in two different direction. If we were going to create one or more new functions, shouldn't we consider

    1) Extending the concept of median to other types, especially the datetime type which seems like an obvious fit. Unfortunately, there are now a whole bunch of edge cases. What about dates? Should there be a median for dates, and if so how do you take the average of '1969-01-05' and '1969-01-06' (if those are the central two values) and still have the result be a date?

    2) Extending the concept of median to other percentiles. The median is just the 50th percentile, but there are other percentiles that people find interesting -- quartiles, quintiles, deciles, for instance, not to mention arbitrary percentiles.

    As you can imagine, this discussion turned into a bit of a rathole. Finally, after a very spirited email thread, the bug was assigned to an owner who was given the assignment to make a proposal. I think it's now on its third owner, and there hasn't a proposal yet.

    Wednesday, August 23, 2017 2:59 AM
  • With respect to the Modulo operation there isn't really any right or wrong when it comes to negative numbers. Different programming languages have made different decisions here. The one made for M predates me, so I can't say why it was chosen. Knowing the people that were involved, I'm certain they had specific reasons.
    • Edited by Curt Hagenlocher Wednesday, August 23, 2017 3:02 AM messed up hyperlink
    Wednesday, August 23, 2017 3:02 AM
  • Maybe I should consider applying with Microsoft...

    It can't be too difficult to identify if List.Median is invoked with a "simple" number list argument and - if so - return the correct median. For other values keep the current function behavior.

    To account for compatibility you might consider adding an optional argument like "UseCorrectAlgoritm" with default value false - LOL!

    Wednesday, August 23, 2017 9:27 AM
  • Maybe I should consider applying with Microsoft...

    It can't be too difficult to identify if List.Median is invoked with a "simple" number list argument and - if so - return the correct median. For other values keep the current function behavior.

    I was thinking the exact same thing.

    One has to wonder how every other tool on the planet (well at least the important products) can provide a median function that returns correct numeric results and M can't. Perhaps the function is too broadly defined. In 99.999% of cases, median is based on numeric values, as it is often compared with (or used in combination with) other statistical functions that measure central tendency. Edge cases are values that aren't numeric, but can still have a meaningful median value, like dates. In such cases, Marcel mentions a viable solution.

    As a side issue, although we are debating the median function, the statistical library in M is the most arbitrary that I have ever encountered in any product. It's as if someone started with a list of possibilities, randomly sorted the list, and finally picked the top N. How else can one explain the following:

    1) There is a standard deviation sample function but no standard deviation population function?
    2) There are no common functions like variance, rank, or percentile, but there are less common functions like mode(s) and covariance?
    3) As for covariance, there is population covariance, but no sample covariance (opposite of the standard deviation function)?

    Also, before using M, I've never, ever encountered a math library comparable to M's, but without the very common ceiling and floor functions.


    Wednesday, August 23, 2017 5:00 PM
  • Number.Ceiling = (x) => Number.Round(x, null, RoundingMode.Up);

    Number.Floor = (x) => Number.Round(x, null, RoundingMode.Down);

    Wednesday, August 23, 2017 8:31 PM
  • It can't be too difficult to identify if List.Median is invoked with a "simple" number list argument and - if so - return the correct median. For other values keep the current function behavior.

    Good point. Rather than wait for a more fully-featured function to be spec'd, I went ahead and fixed this this morning.
    Wednesday, August 23, 2017 8:33 PM
  • Well done. Thanks for your leadership on this issue!

    Edit: Could you look into the Number.Mod function, which returns the incorrect signs for either negative numerator or denominator?

    Wednesday, August 23, 2017 8:58 PM
  • Number.Ceiling = (x) => Number.Round(x, null, RoundingMode.Up);

    Number.Floor = (x) => Number.Round(x, null, RoundingMode.Down);

    Hi Curt, thank you. However, I was ranting over functions that should have been built-in because they are so very common in other products. Between 2013 (after Data Explorer was renamed) and 2014, I built a huge pile of custom functions (and still continue to this day, but to a far lesser extent than that original period). Among those original functions were ceiling and floor functions (I chose to implement the Excel CEILING.MATH and FLOOR.MATH functions because they are supersets of the other CEILING and FLOOR functions. Also implemented the MROUND function).

    In addition:

    1) I implemented all of the Excel stat functions, except for the probably distribution functions. 
    2) DAX's path functions, out of disbelief that a product dedicated to transforming data had no built-in mechanisms for flattening hierarchies.
    3) And much more...

    Amazing what you can do when you are annoyed! :) However, despite all this work, I would still prefer if these functions were built into M. For me, the effort is nothing more than a means to an end (and as it turned out, a way to learn M). In the past four years, about a dozen of the custom functions I've created (out of a library of over 100 general-purpose functions) have made it into the product (not based on any request by me) - I'm grateful for that.

    There are some things that are difficult or impossible to implement. Another breathtaking omission (and a first for a product of this type) is the lack of wildcards. All we have to work with are Text.Contains, Text.StartsWith, and Text.EndsWith. Did you know that there about 15 Excel functions that accept the * and ? wildcards? 

    There is some useful validations that can be done - fixed-length text isn't hard, and most codes are possible (e.g. credit card format validation). However, something like text contains ?tar*night eludes me, not to mention other useful regex type conditions like either/or.

    Wednesday, August 23, 2017 10:04 PM
  • Could you look into the Number.Mod function, which returns the incorrect signs for either negative numerator or denominator?

    As per my above response, there is no such thing as an incorrect sign for negative values used with the modulo operation. See the Wikipedia article at https://en.wikipedia.org/wiki/Modulo_operation.

    Wednesday, August 23, 2017 10:32 PM
  • Could you look into the Number.Mod function, which returns the incorrect signs for either negative numerator or denominator?

    As per my above response, there is no such thing as an incorrect sign for negative values used with the modulo operation. See the Wikipedia article at https://en.wikipedia.org/wiki/Modulo_operation.

    Compatibility with related products (Excel, DAX, R) should have higher priority than a Wikipedia article. Please also note that the 40-page Microsoft document titled "Function Improvements in Microsoft Excel 2010" lists the MOD function under "Functions with Higher Accuracy." In that document, Wikipedia is unsurprisingly not listed as a source.

    For these reasons, I'll create a custom mod function that can be used if there's a possibility that the denominator can be negative.

    Thursday, August 24, 2017 1:30 AM
  • In case anyone has not yet noticed, the Median function is now fixed. Thank you Curt!!!
    Sunday, October 29, 2017 1:40 PM