Question about "Clustering Model"

    Question

  • Hi,

    I've started learning Microsoft Analysis Services and don't understand why the "State" on the "Cluster Diagram" tab has negative values. In the dataset I can see only positive values.

    P.S. Could you recommend some books or other sources for better understanding Microsoft's data mining tools, preferably with simple examples? I've already gone through the tutorials in BOL, but I still can't understand some things.

    Thank you in advance, and sorry for the newbie questions.

    Wednesday, June 12, 2013 2:53 PM

Answers

  • I received the data from Mark.

    This happens because the column has a mean value close to 0 and a very large standard deviation.

    When the viewer needs to discretize the values of a numeric column into 5 bins, it calculates the bin ranges from the mean and standard deviation, assuming that the values are normally distributed. That is why it creates bins with negative boundaries.

    The correct viewer behavior would be to look at the actual minimum and maximum values in the column, in addition to the mean and standard deviation, when determining bin ranges. If anybody from Microsoft is looking at this thread, please open a bug to fix this.
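
    A minimal sketch of the effect described above. The viewer's exact formula is not given in this thread, so the code assumes, purely for illustration, that the 5 bins are spaced evenly over mean ± 2 standard deviations. The function names, the `spread` parameter, and the sample data are all hypothetical.

```python
import statistics

def normal_bin_edges(values, bins=5, spread=2.0):
    """Hypothetical rule: evenly spaced edges over mean +/- spread * stdev."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    low, high = mean - spread * std, mean + spread * std
    step = (high - low) / bins
    return [low + i * step for i in range(bins + 1)]

def clamped_bin_edges(values, bins=5, spread=2.0):
    """Same edges, but limited to the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [min(max(e, lo), hi) for e in normal_bin_edges(values, bins, spread)]

# Made-up, all-positive data: many small values plus a few extreme ones.
billed = [17, 20, 25, 30, 40, 55, 60, 80, 5000, 12000]

naive = normal_bin_edges(billed)
clamped = clamped_bin_edges(billed)

# The naive edges start below zero even though every value is positive.
print("naive edges:  ", [round(e, 1) for e in naive])
print("clamped edges:", [round(e, 1) for e in clamped])
```

    Clamping keeps every boundary inside the observed range, but it can collapse the lowest bins into one. An actual fix would more likely rebuild the bins from the observed range rather than clamp them.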

    You should consider preparing the data before modeling to get a more accurate model. Possible data preparation steps are:

    1. Replace extreme values with values closer to the usual range
    2. Remove rows with extreme values
    3. Create calculated columns based on the original columns and use them instead of the original columns
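
    The three preparation steps can be sketched as follows. This is only an illustration under assumptions: the 80th-percentile cutoff, the log transform chosen as the calculated column, the `percentile` helper, and the sample values are not from the original data.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list (p in 0..100)."""
    k = max(0, math.ceil(p * len(sorted_vals) / 100) - 1)
    return sorted_vals[k]

# Made-up, all-positive data with two extreme values.
billed = [17, 20, 25, 30, 40, 55, 60, 80, 5000, 12000]

# Hypothetical cutoff: treat anything above the 80th percentile as extreme.
cap = percentile(sorted(billed), 80)

# 1. Replace extreme values with a value closer to the usual range.
capped = [min(v, cap) for v in billed]

# 2. Remove rows with extreme values.
trimmed = [v for v in billed if v <= cap]

# 3. Create a calculated column (here a base-10 logarithm) to use in the
#    model instead of the original column.
logged = [math.log10(v) for v in billed]

print("cap:    ", cap)
print("capped: ", capped)
print("trimmed:", trimmed)
print("logged: ", [round(x, 2) for x in logged])
```

    Which step fits best depends on whether the extreme rows are errors or genuine but rare cases. Capping and the calculated column keep every row, while trimming discards data.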

    Tatyana Yakushev [PredixionSoftware.com]

    Download Predixion Insight 3.0 - World class predictive platform for big data

    Tuesday, June 18, 2013 4:42 PM

All replies

  • Can you share a screenshot?

    Where do you see the negative values?

    Do you see those states in the "State:" combobox after you select a "Shading Variable:"?

    The "State:" combobox for numeric columns shows ranges of values. E.g. "Low (1 - 5)" means that values are between 1 and 5.


    Tatyana Yakushev [PredixionSoftware.com]

    Wednesday, June 12, 2013 6:41 PM
  • Please find all the necessary information in the screenshot below:

    http://i.minus.com/iG1RriPawhOZR.png

    Thank you for the assistance!


    Thursday, June 13, 2013 7:56 AM
  • I see that the cluster names are in Russian.

    Can you try changing the Date and Number Format to English (US), rebooting the computer, and retraining the model? I want to check this because I have seen errors caused by European countries using , instead of . as the decimal separator.

    In your entire dataset, what is the minimum value in the "Billed for9 Month" column?

    Can you share your data with me?


    Tatyana Yakushev [PredixionSoftware.com]

    Thursday, June 13, 2013 8:42 PM
  • I've done it, but I can still see negative values in the combobox.

    The minimum value in the "Billed for9 Month" column is "17".

    Please find more details in the screenshots below:

    http://i.minus.com/iY6OCGfXAtndJ.png - clustering model

    http://i.minus.com/ib0u3KFge4S39o.png - dataset

    Thank you.



    Friday, June 14, 2013 3:57 PM
  • It looks like a bug in SQL Server to me. If you can find or create a dataset that causes the same problem when you create a clustering model, and email it to me, I might be able to investigate and provide a workaround. My email is tatyana  AT  predixionsoftware  DOT  com.


    Tatyana Yakushev [PredixionSoftware.com]

    Friday, June 14, 2013 4:47 PM