Showing posts with label Personal Thoughts.

Thursday, April 11, 2013

Surviving the PhD program

As I am nearing the end of the longest phase of my academic life, I wanted to put together all the resources that are useful for keeping one going during the frustrating times of a PhD program. I truly believe that keeping a positive emotional state and directing our energies the right way is the key to getting through the program. I am grateful to have good support from friends and family, but the internet is also a great place to seek support in times of trouble. Here are some of the most useful links that I found.

  • Dissertation Writers Toolkit: A great resource for those who are procrastinating on writing. This webpage also contains tons of other helpful material, like a balanced-life chart, tools for staying organized, positive affirmations on writing, and so on.
  • Life is easier when you can laugh at yourself. Here are some daily affirmations for doctoral students. But I stayed away from PhD comics as much as I could.
  • The Thesis Whisperer is another useful blog that helped me realize I was not alone in some of my struggles. In fact, my struggles are perfectly normal for a PhD student.
  • Here are some productivity tricks for graduate students that I found useful. In fact, following one of the suggestions from this page, I purchased multiple chargers for my laptop so I could save time on getting started with my day.
  • The 3-month thesis is also a good resource for thesis writing.

Monday, November 5, 2012

In much need of inspiration

Nora Denzel's talk at Grace Hopper 2012 was very inspiring. Here are the key ideas from her talk.

I remember this every day during this last, trying year of my PhD.

Thanks, Nora. You are an inspiration and I owe my PhD to you.
 

Tuesday, June 28, 2011

Back to Square one

I get lost every time I come back to this problem. It's a maze and I cannot find a way out. Even when I do find a way out, I am unclear about what happened back there.

I am now making a step-by-step plan on how I can tackle this gigantic problem I am facing.

What am I trying to do?
My main goal is to do distributed topic modeling using R on Amazon EMR.

The steps I need to take now to solve this problem:

1) Install Hadoop, bring up a single-node Hadoop cluster, and run basic mapper/reducer scripts on it (a sketch of this follows below)
2) Run R on Hadoop using Hive, and try to do the same via R
3) Run distributed tm in R
4) Run Mahout on the single-node Hadoop cluster
5) Using Hive, try to convert data types between R and Mahout

Then do all of the above on the Amazon EMR cluster using its Ruby client.
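For step 1, here is a minimal sketch of what a Hadoop Streaming mapper written in R might look like (the word-count task and the file names mapper.R/reducer.R are placeholders for illustration, not the actual topic-modeling job):

    #!/usr/bin/env Rscript
    # mapper.R -- toy word-count mapper for Hadoop Streaming (illustrative only).
    # Hadoop feeds input lines on stdin; we emit "key<TAB>value" pairs on stdout.
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      words <- unlist(strsplit(tolower(line), "[^a-z]+"))
      words <- words[nchar(words) > 0]
      for (w in words) cat(w, "\t1\n", sep = "")
    }
    close(con)

    # Assuming Hadoop Streaming is set up, the job would be launched roughly as:
    #   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    #     -input /data/docs -output /data/wordcount \
    #     -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R

The reducer would read the sorted key/value pairs back from stdin and sum the counts per word.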

Loads of painful nights ahead, but hopefully rewarding ones too.

Friday, January 29, 2010

TIME corpus... blah!!!

The TIME corpus is the most useless and "wrong" corpus that there can ever be.

  • A little analysis of the corpus reveals that it is built around two main topics, and the rest of the topics are under-represented.
  • The number of words is very high compared to the number of documents, which is why the representation itself is very sparse.
  • Due to the high ratio of (number of words) to (number of documents), it is impossible to train a topic model on this corpus and expect it to represent the corpus well (a quick way to check this is sketched below).
  • The qrels (relevance judgements) are INCORRECT. A visual exploration comparing words reveals no similarity (at least on a term-by-term basis) between the queries and the documents marked relevant for those queries.
So if you want to use the TIME corpus, use it at your own RISK.
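If you do still want to try it, a quick sanity check like the one below (a sketch using the tm package; the corpus directory path is a placeholder) will show the document count, vocabulary size, and sparsity that make this corpus so awkward for topic models:

    # Sanity-check a corpus before fitting a topic model (sketch using 'tm').
    library(tm)
    corpus <- VCorpus(DirSource("path/to/time_corpus"))   # one text file per document
    dtm <- DocumentTermMatrix(corpus,
                              control = list(removePunctuation = TRUE,
                                             stopwords = TRUE,
                                             tolower = TRUE))
    n_docs   <- nrow(dtm)                                  # number of documents
    n_terms  <- ncol(dtm)                                  # vocabulary size
    sparsity <- 1 - length(dtm$v) / (n_docs * n_terms)     # fraction of zero entries
    cat("docs:", n_docs, " terms:", n_terms,
        " terms/doc:", round(n_terms / n_docs, 1),
        " sparsity:", round(sparsity, 3), "\n")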

Friday, January 8, 2010

The trouble with testing IRS

Testing an IR system (IRS) can get so freaking hard. The vocabulary sizes can run up to a few tens of thousands, which means you are dealing with a very, very high-dimensional space. The term weights or term probabilities are extremely small, so small that you are dealing with values in log scale only. The relevance judgements for some queries are not reliable: the queries dig up documents that might seem relevant to you (as a programmer/user), but the standardized judgements have not tagged those documents as relevant. This can get increasingly frustrating, because you might be led to believe that there is a bug in your code. Changes made to the query or the document in response to relevance feedback are hard to interpret and understand when you are dealing with tens of thousands of words and tens of thousands of documents.
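To make the log-scale point concrete, here is a toy illustration (the probabilities are made up): multiplying even a few thousand tiny term probabilities underflows to zero in double precision, so scores have to be accumulated as sums of logs.

    # Why IR scores live in log space: products of tiny probabilities underflow.
    set.seed(1)
    p <- runif(5000, min = 1e-6, max = 1e-4)   # made-up per-term probabilities
    prod(p)       # underflows to exactly 0 in double precision
    sum(log(p))   # the same quantity in log space remains perfectly usable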

To circumvent this problem, it is essential to first create your own teeny tiny corpora of a few tens of documents and a few tens of words. If you wish to build topic models over your corpora, make sure that your document-term matrix is tall and thin (not sparse). If you are using basic unigram and VSM (vector space model) kinds of approaches, a short and stout (sparse) matrix might do the magic.
The relevance judgements should be created by you through manual inspection. To start out, build a few topics with a few words that are very distinct even to a human :). Use these topics to generate documents using the LDA document generation model.
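Here is a minimal sketch of that generation step in R (the two topics, the eight-word vocabulary, and the hyperparameters are all made up for illustration): sample per-document topic proportions from a Dirichlet, pick a topic for every word position, then draw the word from that topic's distribution.

    # Generate a tiny synthetic corpus from the LDA generative model (sketch).
    set.seed(42)
    vocab <- c("dog", "cat", "bark", "meow",        # "pets" words
               "stock", "bank", "profit", "trade")  # "finance" words
    # Two hand-built topics: each row is a word distribution over vocab (rows sum to 1).
    beta <- rbind(pets    = c(0.4, 0.3, 0.2, 0.1, 0,   0,   0,   0),
                  finance = c(0,   0,   0,   0,   0.4, 0.3, 0.2, 0.1))
    n_docs  <- 20     # documents to generate
    doc_len <- 30     # words per document
    alpha   <- 0.5    # symmetric Dirichlet parameter for topic proportions

    rdirichlet <- function(a) {            # base-R Dirichlet sample via gammas
      g <- rgamma(length(a), shape = a)
      g / sum(g)
    }

    docs <- lapply(seq_len(n_docs), function(d) {
      theta <- rdirichlet(rep(alpha, nrow(beta)))                     # topic mix
      z <- sample(nrow(beta), doc_len, replace = TRUE, prob = theta)  # topic per word
      words <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))
      paste(words, collapse = " ")
    })
    docs[[1]]   # e.g. a document dominated by one of the two topics

The resulting documents are easy to inspect by eye, and the known topic assignments make it straightforward to check whether a trained model recovers them.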

This has proven useful in my experimentation and explorations, and I hope the reader of this blog will find it useful too.

Friday, December 4, 2009

The reality about academia

http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2009_11_13/caredit.a0900141

Monday, November 23, 2009

First Acceptance of failure

I feel really let down. I feel like a wastrel. I have delayed my efforts. I am full of self-doubt.
I spent almost 8-9 months on two papers (for which I was the second author) and on coding in C/C++ even for proofs of idea. The result was that I ended up being a programmer :(

I could have achieved so much more. I can at least confess that the past 6 months have been a super unproductive time. I feel like I was a short-sighted amateur researcher who could not have known better.

Hopefully the next 6 weeks, before New Year's, will at least be useful and productive in terms of experimentation.

There go my nights of sleep and my peace of mind.
I won't be able to rest till I see the light at the end of this tunnel.

Will I be able to meet the SIGIR deadline? God only knows.