Making things easier

  • Question

  • Good afternoon,

     

    I have a shell plug-in here that compiles fine and shows up in BI Dev Studio when choosing the DM technique in the proper wizard.

     

    I also have a K-Means implementation that estimates the number of clusters using a semi-empirical statistical index (the PBM index).

     

    This implementation is written in C# and works fine. But it has to receive all the data in the database (all variables for each row) in order to do the proper vector calculations in CSR (Compressed Sparse Row) form.

     

    Besides, as you know, K-Means needs all the data at once because of the clusters mean (centroid) calculation.

     

    So, I have some questions:

     

    1) Where in the shell should I place the call to the K-Means implementation, passing as an argument an object that holds all the data?

     

    2) After this call, with the data clustered, what other objects must be modified in order to use the Microsoft Cluster Viewer?

     

    3) I will need to create a new column or a new table in the database to specify which data belongs to which cluster. Can I open an ADO connection from inside the plug-in, as I normally do in other programs, or is there another (easier/better) way to do so?

     

    Thanks a lot once more.

     

    Best regards,

     

    -Renan Souza

    Monday, January 14, 2008 4:16 PM

Answers

  • > Besides, as you know, K-Means needs all the data at once because of the clusters mean (centroid) calculation.

    As far as I remember, in K-means the centroids evolve as you add new data to the clusters. In that case you do not really need all the data at once; you only need the ability to iterate over all the data, which the processing code in the shell already provides (the case processor object has a method that is invoked for each row in the data set).

     

    1) Where to place the call...

    I would suggest the following:

    - your case processor object maintains the centroids

    - whenever ProcessCase is invoked (for each case) you adjust the centroids
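The two points above can be sketched as a sequential (MacQueen-style) K-means update, shown here in Python for brevity rather than in the plug-in's C#. This is only a minimal sketch under stated assumptions: `OnlineKMeans` and `process_case` are hypothetical names standing in for the case processor and its per-row callback, the number of clusters `k` is assumed to be fixed in advance, and the first `k` cases are used as seeds. A single pass like this will generally not give the same centroids as a batch K-means run.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Keeps only the centroids and updates them one case at a time."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def process_case(self, case: Sequence[float]) -> int:
        """Hypothetical per-row callback; returns the cluster index chosen."""
        # Assumption: the first k cases seed the centroids.
        if len(self.centroids) < self.k:
            self.centroids.append([float(v) for v in case])
            self.counts.append(1)
            return len(self.centroids) - 1

        # Nearest centroid by squared Euclidean distance.
        best = min(
            range(self.k),
            key=lambda j: sum((c - v) ** 2 for c, v in zip(self.centroids[j], case)),
        )

        # Running-mean update: move the centroid by 1/n toward the new case.
        self.counts[best] += 1
        rate = 1.0 / self.counts[best]
        self.centroids[best] = [
            c + rate * (v - c) for c, v in zip(self.centroids[best], case)
        ]
        return best


model = OnlineKMeans(k=2)
for row in [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]:
    model.process_case(row)
print(model.centroids)  # [[0.0, 1.0], [10.0, 11.0]]
```

Only `k` centroids and `k` counters are held in memory, regardless of how many cases are processed.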

    If you cannot or do not want to do this, then the easy (but not scalable) way is to create a data structure (a data table, for example) before calling StartCases, have ProcessCase add a row to that table for each case and, at the end, call your implementation of K-Means. Again, this is not very efficient, as it keeps all the data in memory at once (instead of keeping just the centroids).

    The StartCases method is blocking (on the main processing thread). ProcessCase is invoked on a separate thread, so by the time StartCases returns (in the main thread) all the cases have been processed. If you cache the cases in a data table, this is the point (right after the StartCases call) where your cache is full.
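The caching alternative might look roughly like the sketch below, again in Python rather than C#. Everything here is a hypothetical stand-in: `start_cases` only imitates the blocking behaviour described above with a plain single-threaded loop, `CachingCaseProcessor` plays the role of the case processor, and `cluster_all` represents the existing batch K-means routine.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

Result = TypeVar("Result")


class CachingCaseProcessor:
    """Buffers every case in memory (simple, but not scalable)."""

    def __init__(self) -> None:
        self.rows: List[List[float]] = []

    def process_case(self, case: Sequence[float]) -> None:
        # Called once per case: just append a copy of the row.
        self.rows.append(list(case))


def start_cases(source: Iterable[Sequence[float]],
                processor: CachingCaseProcessor) -> None:
    """Hypothetical stand-in for the blocking call: returns after the last case."""
    for case in source:
        processor.process_case(case)


def train(source: Iterable[Sequence[float]],
          cluster_all: Callable[[List[List[float]]], Result]) -> Result:
    processor = CachingCaseProcessor()   # create the cache before iterating
    start_cases(source, processor)       # blocks until every case is seen
    # The cache is full here, so the batch algorithm can run on all the data.
    return cluster_all(processor.rows)
```

For example, `train(rows, my_kmeans)` would hand the complete list of rows to a hypothetical `my_kmeans` function once iteration has finished.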

     

    2) Hmm, this is tricky. The Microsoft Cluster Viewer is based on the distribution of the clusters. You will need to expose the content of your clustering model in exactly the same way as the Microsoft_Clustering algorithm does: one top node, plus one node for each cluster, each with a NODE_DISTRIBUTION table containing the distribution for that cluster. The sample plug-in included with the C# package does exactly this; you should make sure the distribution code is adjusted to reflect your internal clusters.
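As a rough illustration of what "the distribution for that cluster" could mean for continuous attributes, the Python sketch below computes per-cluster support, probability, mean and variance from a set of rows and their cluster labels. The function name, the dictionary keys and the use of population variance are assumptions made for this example; they are not the plug-in API or the actual NODE_DISTRIBUTION schema, which should be taken from the sample plug-in.

```python
from typing import Dict, List, Sequence


def cluster_distributions(rows: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          attribute_names: Sequence[str]) -> Dict[int, List[dict]]:
    """Summarise each cluster as one entry per attribute."""
    total = len(rows)

    # Group the rows by their cluster label.
    members_by_cluster: Dict[int, List[Sequence[float]]] = {}
    for row, label in zip(rows, labels):
        members_by_cluster.setdefault(label, []).append(row)

    result: Dict[int, List[dict]] = {}
    for label, members in sorted(members_by_cluster.items()):
        support = len(members)
        entries = []
        for i, name in enumerate(attribute_names):
            values = [m[i] for m in members]
            mean = sum(values) / support
            # Assumption: population variance (divide by n, not n - 1).
            variance = sum((v - mean) ** 2 for v in values) / support
            entries.append({
                "attribute": name,
                "mean": mean,
                "variance": variance,
                "support": support,
                "probability": support / total,
            })
        result[label] = entries
    return result
```

Each key of the returned dictionary would correspond to one cluster node, and each entry in its list to one row of that node's distribution table.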

     

    3) You should not write to an external database. You could open an ADO connection (assuming that your permissions are set properly), but there is no way to manage the lifetime of the table, as the plug-in gets no notification when a model is deleted.

    The right way to do this is to implement the Drillthrough feature for your plug-in.

    This allows queries like the following, which returns all training cases belonging to one cluster:

    SELECT * FROM MyModel.CASES WHERE IsInNode('Cluster001')

    or this one, which returns all the cases together with the cluster each belongs to:

    SELECT T.*, Cluster() FROM MyModel NATURAL PREDICTION JOIN
    (SELECT * FROM MyModel.CASES) AS T

     

    Tuesday, January 15, 2008 12:54 AM