Tuesday, February 9, 2010

Targetting discrimination as opposed to similarity

Clustering mechanism with data that is expected to follow continuous distributions use an objective function to measure the optimum clustering of the documents. Such an objective function is usually a ratio of two quantities : inter-cluster distance to intra-cluster distance. Such a ratio measures the power of the clustering algorithm to bring together similar data samples and separate apart the dissimilar data samples. Fortunately for Gaussian models, this works out into a nice clean equation. No wonder people LOVE the Gaussian model.

When we are dealing with discrete data, the most likely distribution is the multinomial distribution. To add more control over the behavior of the model one could use dirichlet priors. The document clustering approaches in the discrete domain merely use the EM or the gibbs sampling approach to fit data optimally to the discrete model (especially latent mixture model). There is no underlying objective function that says, this discrete (latent) model should also be able to optimally sepearte out one topic from the other. Thus one could end up with 2 topics generating the exact same set of words.

This is definitely something that is lacking in the EM algorithms for fitting discrete data....
Maybe there has been work done on this already. But I am yet to discover it

No comments:

Post a Comment