locked
clustering reports 90% missing values when data has no missing values RRS feed

  • Question

  • I have a lot of experience with databases and a fair amount of experience with SSAS but I am just doing my first data mining. Probably I am doing some noob thing. SQL Server 2005 (9.0.3233).

    I am adding a data mining structure to an existing SSAS database. In the DSV I created a named query (based on views in the DSV) that feeds me 1300 rows. If I take the SQL for the named query and run it in SSMS I see my data. Within the query I'm using inner joins to look up text column values. No missing values. Not one. FWIW the query produces its rows by grouping source rows, summing the one numeric value.

    I create from scratch my mining structure, based exclusively on this named query. I select the one continuous column as both input and predicted. I select the ID (a GUID cast to a string) as the key. The other three columns (all discrete, text) I set as input only. One model -- Microsoft Clustering. I deploy and process. For two of the three discrete text columns, for the entire population, in the cluster properties viewer it shows 90.5% missing values. The third column has a reasonable and accurate distribution of values. The cluster profiles for the clusters show the missing values in varying proportions.

    I have recreated, reprocessed, and refreshed to ensure I'm not looking at results left over from some previous query or version. I have not been reprocessing the entire database, just the mining structure and model. I have no missing values but the predominant value in my results is "missing". I'm stumped.
    Thursday, March 4, 2010 6:53 PM

Answers

  • This may happen, for example, for a text column which has around 1000+ distinct values. The clustering algorithm by default only handles 100 distinct states for each column. Is this the case?
    If this is correct:
    - If your column is numeric or date time, you should change the modeling from discrete to continuous
    - if your column is really discrete, you may want to reconsider whether it should be included in cluster analysis. You may keep the column in the structure, for drillthrough purposes, but not include it in the cluster detection process
    - if you really need to keep the column as is, then try changing the MAXIMUM_STATES algorithm parameter




    bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm
    • Marked as answer by Clark Wilson Friday, March 5, 2010 1:00 AM
    Thursday, March 4, 2010 11:14 PM