# How to classify a document based on Microsoft's Naive Bayes algorithm?

### Question

I am quite unfamiliar with SSAS at the moment. I know the classical implementation of Naive Bayes, which I learned about from here. However, what I am looking for is a complete walkthrough of how to use this particular algorithm with SSAS.

For simplicity, let me assume we are supposed to classify a news write-up as containing positive criticism or negative criticism. In positive articles, words like good, awesome, super, recommended, love, like, etc. occur frequently. In negative articles, words like bad, poor, unsatisfactory, unsatisfied, pathetic, etc. are the most common. There are only two possible outcomes (positive or negative), so generalizing the patterns is fairly simple.

To start with, we have a few write-ups with their corresponding outcomes, which are mostly in accordance with the patterns generalized above. If we were to do this without the help of a data mining tool, we would do the following:

1. Take the first write-up (assume this one is a positive article).
2. Split the whole write-up into words.
3. Remove the stopwords, like the, this, that, etc. (These words provide grammatical structure to the write-up, but they occur so frequently that we get rid of them.) We now have a corpus of words.
4. Assign this corpus to the outcome positive. We note how many times the outcome positive appears, and also how often each individual word occurs with the outcome positive.
5. Take the next write-up (assume this one is negative).
6. Repeat steps 2-5, updating the corresponding frequencies each time.
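
The manual procedure in the steps above can be sketched in Python roughly as follows. This is only an illustration of the counting, not of what SSAS does internally. The stopword list, the function names `tokenize` and `train`, and the choice to count every occurrence of a word (rather than once per write-up) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "this", "that", "is", "a", "an", "and", "was", "it", "i"}


def tokenize(text):
    """Lowercase the text, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOPWORDS]


def train(documents):
    """Count outcomes and per-outcome word frequencies.

    documents: iterable of (text, outcome) pairs.
    Returns (outcome_counts, word_counts), where word_counts[outcome][word]
    is how often `word` was seen in write-ups labelled `outcome`.
    """
    outcome_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, outcome in documents:
        outcome_counts[outcome] += 1                 # how often the outcome appears
        word_counts[outcome].update(tokenize(text))  # word frequencies for that outcome
    return outcome_counts, word_counts


if __name__ == "__main__":
    docs = [
        ("This movie is good, I love it!", "positive"),
        ("The plot was bad and poor.", "negative"),
    ]
    outcomes, words = train(docs)
    print(outcomes)
    print(dict(words))
```

If each word should count at most once per write-up, replacing `tokenize(text)` with `set(tokenize(text))` in `train` would do that.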

So once we have looked into all documents, we can actually prepare test cases.

In accordance with the formula above, nc is the number of times good actually gives the outcome positive, p is a prior estimate (= 0.5, since there are only 2 outcomes), and n is the number of times the outcome positive appears in our corpus.
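
The formula itself is not reproduced in this post. Assuming it is the commonly used m-estimate of probability, (nc + m*p) / (n + m), a manual check could look like the sketch below. The parameter `m` (an equivalent sample size) and the function name `m_estimate` are assumptions; only nc, p and n come from the description above.

```python
def m_estimate(nc, n, p=0.5, m=1.0):
    """Smoothed estimate of P(word | outcome), assuming the m-estimate form.

    nc: number of times the word occurs with the outcome
    n:  number of times the outcome appears in the corpus
    p:  prior estimate (0.5 when there are only two outcomes)
    m:  equivalent sample size (assumed; not named in the post)
    """
    return (nc + m * p) / (n + m)


if __name__ == "__main__":
    # Hypothetical counts: "good" seen with 3 of 4 positive write-ups.
    print(m_estimate(nc=3, n=4, p=0.5, m=2))
    # A word never seen with the outcome still gets a non-zero estimate.
    print(m_estimate(nc=0, n=4, p=0.5, m=2))
```

With `m=0` the estimate reduces to the raw ratio nc / n, so a larger `m` pulls the result more strongly toward the prior `p`.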

How can I use SSAS to do this, and how would I go about verifying these kinds of test cases manually?


• Edited by Saturday, June 29, 2013 4:48 AM formula explanation
Saturday, June 29, 2013 4:42 AM