Thursday, July 18, 2013

Matlab Tip: Changing the line width of all lines from the command line

Get the handles to all the line objects in the current figure using:
hline = findobj(gcf, 'type', 'line');
Then you can change a property for all the line objects at once:
set(hline,'LineWidth',3)
or just for some of them:
set(hline(1),'LineWidth',3)
set(hline(2:3),'LineStyle',':')
idx = [4 5];
set(hline(idx),'Marker','*')
This is not an original tip; here is the original post.

Thursday, April 11, 2013

Surviving the PhD program

As I am nearing the end of the longest phase of my academic life, I wanted to put together all the resources that have been useful in keeping me going during the frustrating times of a PhD program. I truly believe that keeping a positive emotional state and using our energies in the right direction is the key to getting through the program. I am grateful to have good support from friends and family, but the internet is also a great place to seek support in times of trouble. Here are some of the most useful links that I found.

  • Dissertation Writers Toolkit: A great resource for those who are procrastinating on writing. The webpage also contains tons of other helpful material, like a balanced-life chart, tools for staying organized, positive affirmations about writing, and so on.
  • Life is easier when you can laugh at yourself. Here are some daily affirmations for doctoral students. But I stayed away from PhD comics as much as I could.
  • The Thesis Whisperer is another useful blog that helped me realize I was not alone in some of my struggles; in fact, my struggles are perfectly normal for a PhD student.
  • Here are some productivity tricks for graduate students that I found useful. In fact, following one of the suggestions from this page, I purchased multiple chargers for my laptop so I could save time on getting started with my day.
  • The 3-month thesis is also a good resource for thesis writing.

Wednesday, April 3, 2013

Trailing Slash in rsync command

Just making a quick reference to the rsync manual on how to synchronize directories.

The command used to copy folders is as follows:
 
rsync -avz foo:src/bar /data/tmp
 
This command copies the directory bar into /data/tmp, meaning that at the end of this command you will have a /data/tmp/bar folder.
 
 
If you just want to sync the contents of the directory, use a trailing slash like this:
 
rsync -avz foo:src/bar/ /data/tmp 

Now only the contents of bar will be copied into the /data/tmp folder; you will not find a folder called bar in /data/tmp.

Wednesday, March 27, 2013

The caveats of using tar to create archives

So I recently created a bunch of tar files whose contents had a long path name when extracted. This long path name was the absolute path on the computer where the tar files were created. For example,

tar -cf archive.tar /home/shivani/testfile.txt
will create an archive that, when extracted, recreates home/shivani/testfile.txt under whatever folder you untar it in (GNU tar strips the leading slash from member names by default).

To ensure this does not happen, use the -C option, which cds into the directory /home/shivani before creating the tar, so that only the relative path is stored: tar -cf archive.tar -C /home/shivani testfile.txt
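
The same caveat applies when creating archives from a script. As a side note (not from the original workflow), here is a minimal Python sketch using the standard tarfile module, where the arcname argument stores a relative member name instead of the absolute path:

import tarfile

# Equivalent in spirit to: tar -cf archive.tar -C /home/shivani testfile.txt
# The file is stored under the relative name "testfile.txt" rather than
# its absolute path on the machine where the archive was created.
with tarfile.open("archive.tar", "w") as tar:
    tar.add("/home/shivani/testfile.txt", arcname="testfile.txt")

Extracting archive.tar now yields just testfile.txt in the current directory.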

This caveat was found on this webpage.

Tuesday, March 12, 2013

Review of Available Implementations of Latent Dirichlet Allocation

A part of my dissertation is based on the Latent Dirichlet Allocation (LDA) model, and I have extensively searched for and played with various LDA implementations.

Gibbs Sampling implementation of LDA in C++
The input format is basically a single text file with one line per document, where each line contains the actual terms. I found this input format to be bulky. The code itself is very clean and easy to understand. Gibbs sampling by itself is a little slow, and not much can be done about that.

Gibbs Sampling Implementation of LDA in Java
Written by the same folks who wrote the C++ implementation. Programming language aside, everything else is kept the same.

Variational Inference of LDA in Matlab and C
This is easy-to-use code whose input format is the same as that of the SVMlight software: a text file with one line per document, each line containing a sequence of tuples of the format feature_id:count, where the feature_id is the word id from a dictionary and count is the number of times that word appears in the document. Implementation-wise, I have one comment though: I was hoping that the algorithm would also include estimation of the \eta parameter, the smoothing parameter on \beta.
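
For reference, here is a minimal Python sketch (my own, not part of any of these packages) that writes a tokenized corpus into this kind of sparse format. Note that some variants, such as Blei's lda-c format, additionally prefix each line with the number of unique terms in the document.

from collections import Counter

def write_sparse_corpus(docs, vocab_path, corpus_path):
    # docs is a list of documents, each a list of string tokens.
    # Build a dictionary mapping each term to an integer id.
    vocab = {}
    for doc in docs:
        for term in doc:
            vocab.setdefault(term, len(vocab))
    # Write the dictionary, one term per line, in id order.
    with open(vocab_path, "w") as f:
        for term, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write(term + "\n")
    # Write one line per document with word_id:count pairs.
    with open(corpus_path, "w") as f:
        for doc in docs:
            counts = Counter(vocab[t] for t in doc)
            f.write(" ".join("%d:%d" % (w, c) for w, c in sorted(counts.items())) + "\n")

# Hypothetical usage:
# write_sparse_corpus([["topic", "model", "topic"], ["gibbs", "sampler"]],
#                     "vocab.txt", "corpus.dat")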

MALLET toolbox
This is written in Java and accepts plain text or SVMlight-format input. It implements Gibbs sampling for LDA and allows optimization of the hyperparameters \alpha and \eta after a number of burn-in iterations, which lets us exercise greater control over the impact of topics on the entire collection. Given that LDA itself is slow, and that hyperparameter estimation only starts after the burn-in iterations, I feel discouraged from using this tool.

Hybrid approach to LDA inference
This approach apparently uses both variational inference and Gibbs sampling to learn the parameters. The author also presents a vanilla LDA implementation.

I will keep updating this post as and when I find more useful stuff.

Thursday, December 27, 2012

Trying to revive the C3M algorithm

C3M is quite unlike the hierarchical or k-means families of clustering algorithms. C3M (the Cover-Coefficient-based Clustering Methodology) is a single-pass clustering algorithm built by researchers in IR (Information Retrieval), and hence it is primarily meant for documents. Because IR researchers built this algorithm, it takes advantage of the fact that term-document matrices are usually sparse and can be represented well by an inverted index. The main advantage of the algorithm is that it provides a way to estimate/compute the number of clusters given a document-term matrix.

Here are some of the claims made in their paper, "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases" by Fazli Can and Esen Ozkarahan:
a) Clusters are stable.
b) The algorithm is independent of the order of documents; hence, it always produces the same clustering.
c) The memory overhead is really low.
d) The algorithm distributes the documents evenly among clusters; in other words, it does not create fat clusters or singletons.

The algorithm starts off by examining the document-term matrix D of a collection, of size m by n. From this it computes C (m by m) and C' (n by n), where C[i,j] indicates the extent to which document i covers document j: the higher C[i,j], the higher the similarity between the two documents. Similarly, C' records how much one term covers another. From the C and C' matrices it is possible to compute each document's seed power; the higher the seed power of a document, the higher the chance that it is selected as the seed of a cluster. Once the cluster seeds are selected, each not-yet-assigned document is assigned to the closest seed by simply examining C[i,j], where i corresponds to the unassigned document and j ranges over the cluster seeds.

If m denotes the total number of documents, x_d the average number of terms per document, and t_g the average number of documents in which a term appears, then the complexity of this clustering approach is O(x_d * m * t_g).


Creating the C matrix

initialize C[m,m] <- 0
for each document i
    for each term t_i in document i
        compute p(t_i) = D[i,t_i] / sum_t D[i,t]
        get the inverted index for term t_i; let us call it inv(t_i)
        compute p(d_j | inv(t_i)) = D[j,t_i] / sum_{d in inv(t_i)} D[d,t_i] for each document j in the inverted list
        for each document j in inv(t_i)
            C[i,j] = C[i,j] + p(t_i) * p(d_j | inv(t_i))

Similarly, one can compute C' by swapping the roles of documents and terms.

Compute number of clusters n_c = \sum_i c[i,i]

Computing Cluster seed power
for each document i
     P(i) = c[i,i]*(1-c[i,i])* sum_j  { d[i,j] c'[j,j] (1-c'[j,j])}

Pick the n_c documents with the highest P(i) and set them as the cluster seeds {D_cluster_Seeds}

Assigning documents to cluster seeds
for each document i that is not a cluster seed
   assigned(i) = argmax_j { c[i,j] : j in {D_cluster_Seeds} }
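
To make the above concrete, here is a minimal NumPy sketch of the whole pipeline as I read it from the paper. The function and variable names are my own, and for clarity it works on a small dense document-term matrix instead of an inverted index (and assumes no empty documents or unused terms):

import numpy as np

def c3m_cluster(D):
    # D is an m x n document-term count matrix (dense, for illustration only).
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    row_sums = D.sum(axis=1)   # total term count of each document
    col_sums = D.sum(axis=0)   # total count of each term over the collection

    # C[i,j] = sum_t (D[i,t] / row_sums[i]) * (D[j,t] / col_sums[t])
    C = (D / row_sums[:, None]) @ (D / col_sums[None, :]).T
    # C': the analogous term-by-term matrix, with the roles of rows and columns swapped.
    Cp = (D / col_sums[None, :]).T @ (D / row_sums[:, None])

    # Number of clusters: the sum of the diagonal (decoupling) coefficients.
    n_c = int(round(np.trace(C)))

    # Seed power: P(i) = c[i,i]*(1-c[i,i]) * sum_j D[i,j]*c'[j,j]*(1-c'[j,j])
    delta = np.diag(C)
    delta_p = np.diag(Cp)
    seed_power = delta * (1 - delta) * (D @ (delta_p * (1 - delta_p)))

    # Take the n_c documents with the highest seed power as cluster seeds.
    seeds = np.argsort(seed_power)[::-1][:n_c]

    # Assign every other document to the seed that covers it the most.
    assignment = {int(s): int(s) for s in seeds}
    for i in range(m):
        if i not in assignment:
            assignment[i] = int(seeds[np.argmax(C[i, seeds])])
    return n_c, seeds, assignment

On a realistic collection one would, as the pseudocode above suggests, accumulate C through the inverted index rather than materializing dense matrices.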

Cluster Validity

This paper presents a unique way of gauging a cluster structure, as it measures validity in terms of queries. Given two cluster structures (one obtained by random assignment and one under test) and a set of queries with their corresponding sets of relevant documents, the cluster validity is computed as follows. Suppose a query has k relevant documents; then we compute the values $n_t$ and $n_{tr}$ corresponding to the test cluster structure and the random clustering, respectively. The authors use $n_t$ (or $n_{tr}$) to denote the number of target clusters, where a target cluster is one that contains at least one document relevant to the query. If the clustering is any good, then $n_t$ should be less than $n_{tr}$. The averages of $n_t$ and $n_{tr}$ are compared over all the queries and may additionally be subjected to significance testing.

Now let me explain how $n_t$ is computed. For a given query, the authors first find the size of the relevant set (from the ground truth); let us denote it by k. Given $n_c$ clusters, $P_i$ is computed for each cluster: $P_i$ is the probability that cluster $C_i$ will be selected (i.e., will contain at least one relevant document) if k documents are picked at random from the set of m documents. There is a direct formula in the paper that computes this. Computing $P_1 + P_2 + \dots + P_{n_c}$ yields $n_t$ for the test cluster structure, and the procedure is repeated for the cluster structure obtained by random clustering.
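
I am not reproducing the paper's exact formula here, but under this sampling model a natural closed form (my assumption, not a quote from the paper) is $P_i = 1 - \binom{m - m_i}{k} / \binom{m}{k}$, where $m_i$ is the size of cluster $C_i$. A small Python sketch under that assumption:

from math import comb

def expected_target_clusters(cluster_sizes, k):
    # Expected number of target clusters, n_t, if k relevant documents
    # were drawn at random from the m documents.
    # Assumes P_i = 1 - C(m - m_i, k) / C(m, k): the probability that
    # a cluster of size m_i receives at least one of the k documents.
    m = sum(cluster_sizes)
    total = comb(m, k)
    return sum(1 - comb(m - mi, k) / total for mi in cluster_sizes)

# Hypothetical example: 10 documents split into clusters of sizes 5, 3, 2,
# and a query with k = 2 relevant documents.
# print(expected_target_clusters([5, 3, 2], 2))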

Although it makes mathematical sense, I wonder what the actual number of target clusters would be for the queries, based on the actual locations of the relevant documents. I personally feel that counting this directly would be a better and more accurate representation of reality. Of course, we could still compare that count against a random cluster structure or against another clustering algorithm.

Querying

Querying with different term-weighting schemes is covered extensively in this paper, and I will not be able to cover all of it here.

I am going to implement this algorithm and update this post with some results.

Monday, December 10, 2012

Quick Reference: Extract metadata information of a document

To extract the ID of the first document in a corpus called myCorpus (using the meta function from R's tm package):


meta(myCorpus[[1]],tag="ID")


I then strip the leading and trailing whitespace and convert the result to a number using the command below:

as.numeric(gsub("\\s+$","",gsub("^\\s+","",meta(myCorpus[[1]],tag="ID"))))