> Besides, as you know, K-Means needs all the data at once because of the clusters mean (centroid) calculation.
As far as I remember, in K-means the centroids evolve as you add new data to the clusters. In that case, you do not really need all the data at once; you only need the ability to iterate over all the data, which the processing code in the shell already provides (the case processor object has a method that is invoked for each row in the data set).
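To make this concrete, here is a minimal sketch of the sequential (MacQueen-style) centroid update, written independently of the plug-in API; the class and method names are mine, not part of the shell:

public class OnlineKMeans
{
    private readonly double[][] centroids;  // current centroid coordinates
    private readonly long[] counts;         // cases assigned to each centroid so far

    public OnlineKMeans(double[][] initialCentroids)
    {
        centroids = initialCentroids;
        counts = new long[initialCentroids.Length];
    }

    // Call this once per case (e.g., from your ProcessCase implementation).
    public void AddCase(double[] point)
    {
        int k = NearestCentroid(point);
        counts[k]++;
        // Incremental mean update: c += (x - c) / n
        for (int d = 0; d < point.Length; d++)
            centroids[k][d] += (point[d] - centroids[k][d]) / counts[k];
    }

    private int NearestCentroid(double[] point)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int k = 0; k < centroids.Length; k++)
        {
            double dist = 0;
            for (int d = 0; d < point.Length; d++)
            {
                double diff = point[d] - centroids[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        return best;
    }
}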
1) Where to place the call...
I would suggest the following:
- your case processor object maintains the centroids
- whenever ProcessCase is invoked (for each case) you adjust the centroids
If you cannot or do not want to do this, then the easy (but not scalable) way is to create a data structure (a data table, for example) before calling StartCases, have ProcessCase add a row to the table for each case, and then, at the end, call your implementation of K-means. Again, this is not very efficient, as it keeps all the data in memory at once (instead of keeping just the centroids).
The StartCases method is blocking (on the main processing thread). ProcessCase is invoked on a separate thread, so by the time StartCases returns (in the main thread), all the cases have been processed. If you cache the cases in a data table, this is the point where your cache is complete (right after the StartCases call), as in the sketch below.
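Here is a sketch of that caching fallback. ProcessCase and StartCases are the shell's names, but the exact signatures below are assumptions (adjust them to whatever your version of the shell declares), and RunKMeans stands for your own batch implementation:

using System.Data;

public class CachingCaseProcessor // implements the shell's case processor interface
{
    public DataTable Cases { get; } = new DataTable();

    // Invoked by the shell once per case, on the worker thread.
    public void ProcessCase(long caseId, double[] values)
    {
        // Create the columns lazily, on the first case.
        if (Cases.Columns.Count == 0)
            for (int i = 0; i < values.Length; i++)
                Cases.Columns.Add("Attr" + i, typeof(double));

        DataRow row = Cases.NewRow();
        for (int i = 0; i < values.Length; i++)
            row[i] = values[i];
        Cases.Rows.Add(row);
    }
}

// On the main processing thread:
//   var processor = new CachingCaseProcessor();
//   caseSet.StartCases(processor);   // blocks until every case was pushed to ProcessCase
//   RunKMeans(processor.Cases);      // the cache is complete at this point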
2) Hmm, this is tricky. The Microsoft Cluster Viewer is based on the distribution of the clusters. You will need to expose the content of your clustering model in exactly the same way as the Microsoft_Clustering algorithm does: one top node, and one node for each cluster, with a NODE_DISTRIBUTION table containing the distribution for that cluster. The sample plug-in included with the C# package does exactly this; just make sure the distribution code is adjusted to reflect your internal clusters.
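If it helps during development, you can inspect the content your algorithm exposes (which is exactly what the viewer reads) with a DMX content query; MyModel is just a placeholder name:

SELECT NODE_UNIQUE_NAME, NODE_CAPTION, NODE_DISTRIBUTION
FROM MyModel.CONTENT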
3) You should not write to an external database. You could open an ADO connection (assuming that your permissions are set properly), but there is no way to manage the lifetime of such a table, as the plug-in receives no notification when a model is deleted.
The right way to do this is to implement the Drillthrough feature for your plug-in.
This allows queries like:

SELECT * FROM MyModel.CASES WHERE IsInNode('Cluster001')
// returns all the training cases belonging to one cluster

or

SELECT T.*, Cluster() FROM MyModel NATURAL PREDICTION JOIN
(SELECT * FROM MyModel.CASES) AS T
// returns all the cases together with the cluster they belong to