## Friday, October 21, 2011

### Experience with Matlab's TMG

I have been playing with Matlab's TMG, and here are my two cents from my experience with the software.

Pros: Easy to install; easy to index a corpus and queries.
Cons: Proprietary code. Impossible to contribute to, modify, or fix bugs in the code. Not well maintained. No discussion forum where users and developers can communicate. The last citation was in 2009.

I was unable to run even some basic retrieval tasks because of a bug in the code: norm2 does not work for sparse arrays, so it needs to be replaced with normest. I could not make that change because the code is protected and proprietary.

## Thursday, July 14, 2011

### Counter-intuitive results

I have been researching how restricting the vocabulary of a dataset affects retrieval results, and I am seeing some really counter-intuitive results.

This is the format of the experiment:

I create a vocabulary using only some percentage of the documents rather than all of them. The percentages I used were 10, 20, 30, 40, 50, 60, 70, 80, and 90. For each percentage I then performed retrieval and calculated MAP (mean average precision).
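The post does not say how MAP was computed, so here is a minimal sketch of the standard definition, assuming each query has a ranked list of document ids and a set of relevant ids. The function names and the `runs`/`qrels` dictionaries are hypothetical, not taken from the experiment's code.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list against its relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(ranked, qrels.get(query, set()))
        for query, ranked in runs.items()
    )
    return total / len(runs)
```

With this in place, one MAP value per vocabulary percentage is enough to compare the restricted vocabularies against the original one.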

There were two key observations that I found interesting.

1) For software corpora, I found that even with 10% of the documents, 45% of the original vocabulary was built. This suggests that software vocabulary tends to have a more uniform distribution of terms/identifiers/variable names across source files.
2) With the Vector Space Model with tf-idf weighting, I found that with 10-30% of the documents used to create the vocabulary (or dictionary) I got better performance than with the original vocabulary. The only explanation I have for this result is that some words are more important for retrieval than others. I should mention that the original vocabulary is itself a pruned version of the raw vocabulary, obtained by removing sparse terms from the document-term matrix.
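The first observation can be checked with a few lines of code. The sketch below is only an illustration under assumptions: documents are already tokenized into lists of terms, the subset of documents is drawn at random, and the names `build_vocabulary` and `vocabulary_coverage` are hypothetical.

```python
import random
from typing import List, Sequence, Set


def build_vocabulary(docs: Sequence[List[str]]) -> Set[str]:
    """Collect the unique terms that occur in the given documents."""
    return {term for doc in docs for term in doc}


def vocabulary_coverage(
    docs: Sequence[List[str]], fraction: float, seed: int = 0
) -> float:
    """Share of the full vocabulary recovered from a random fraction of docs."""
    full = build_vocabulary(docs)
    if not full:
        return 0.0
    # Use at least one document and never more than the corpus holds.
    k = min(len(docs), max(1, round(fraction * len(docs))))
    sample = random.Random(seed).sample(list(docs), k)
    return len(build_vocabulary(sample)) / len(full)
```

Calling `vocabulary_coverage(docs, p / 100)` for each p in 10, 20, ..., 90 gives the coverage curve; a value near 0.45 at 10% would match the observation above.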

Shivani

## Tuesday, June 28, 2011

### Back to Square one

I am lost every time I come here. It's a maze and I cannot find a way out. Even if I do make my way out, I am unclear as to what happened back there.

I am now making a step by step plan on how I can tackle this gigantic problem I am facing.

What am I trying to do?
My main goal is to try and do distributed topic modeling using R on Amazon EMR.

The steps I need to take now to solve this problem:

1) Install Hadoop, run a single-node Hadoop cluster, and run basic mapper and reducer scripts on it
2) Run R on Hadoop using Hive and try to do the same via R
3) Run distributed tm on R
4) Run Mahout on single-node Hadoop
5) Using Hive, try to convert data types between R and Mahout

Then do all of the above on the Amazon EMR cluster using its Ruby client.

## Tuesday, May 10, 2011

### The recovery from crash entry

I am penning down personal notes on how to recover from a crash.

Set up automatic rsync with crontab so the backup updates every day.
Back up the scripts that you used to fix issues with your computer. In my case:

Sound issue fixing
R
java
Python
cheese webcam
texmaker
lyx
skype
gtalk video voice chat
svn for code
pdf annotator
mendeleydesktop
eclipse IDE
perl, python, plugins for Eclipse IDE

Here is the link to the page on how to partition

## Tuesday, May 3, 2011

### password-less ssh into multiple machines from a single machine

I have been trying to get password-less ssh login into two or more machines (servers). The tutorials that are available online are great, but they do not cover one corner case.

The default private and public keys are named id_rsa and id_rsa.pub, so every time you attempt a password-less login, ssh uses the id_rsa private key and the server matches it against its ~/.ssh/authorized_keys. However, what happens when you want to log in to multiple servers? There are two options:

1) Put the same public key id_rsa.pub on all servers where you want password-less login
2) Create a separate private-public key pair per server and use ssh-agent to add the private keys

For each public-private key pair (new_rsa and new_rsa.pub), do the following:

localuser@localmachine$ scp ~/.ssh/new_rsa.pub username@server:~/.ssh/new_rsa.pub # copy to .ssh folder
localuser@localmachine$ ssh username@server # login to the server
username@server$ cat ~/.ssh/new_rsa.pub >> ~/.ssh/authorized_keys # append to existing authorized keys
localuser@localmachine$ ssh-agent bash
localuser@localmachine$ ssh-add ~/.ssh/new_rsa

## Tuesday, April 26, 2011

### R tips and tricks

1) Sys.setenv() and Sys.getenv() can help you set environment variables for packages like rJava from within R.

syntax:

print(Sys.setenv(R_TEST="testit", "A+C"=123)) # A+C could also be used
Sys.getenv("R_TEST")
Sys.unsetenv("R_TEST") # may warn and not succeed

2) R has no bounds checking for lists or arrays, so one has to take care of it manually. For example:

arr[length(arr)+1]

simply yields NA. This becomes an issue when running for loops:

for (i in (2:length(arr)))

will not work out well if length(arr)<2, because 2:length(arr) then counts down (for example 2, 1) instead of producing an empty sequence.

## Friday, April 22, 2011

### Using gensim... the basics

Thanks to the extremely active community of gensim, I have made my way through some basic commands in Python and gensim.

I have a directory of text documents that I want indexed and a topic model built on. Each file in the directory is a document containing plain text. Let's assume that the text has already been pruned for stopwords, special characters, etc.

I need to write a custom override of the get_texts() function of TextCorpus, and this is how I achieve it:

    import gensim

    def split_line(text):
        # Split on whitespace; the text is assumed to be cleaned already.
        return text.split()

    class MyCorpus(gensim.corpora.TextCorpus):
        def get_texts(self):
            # self.input is the list of file names given to the constructor.
            for filename in self.input:
                with open(filename) as f:
                    yield split_line(f.read())

If b is a list of files, then

    myCorpus = MyCorpus(b)

will create the corpus, and myCorpus.dictionary has all the unique words:

    myCorpus.dictionary.token2id.items()   # gives the word-id pairs
    myCorpus.dictionary.token2id.keys()    # gives the unique words
    myCorpus.dictionary.token2id.values()  # gives the corresponding ids

One can save it in Matrix Market format using the following command:

    gensim.corpora.MmCorpus.serialize('mycorpus.mm', myCorpus)

In order to add new documents, just extend the list b to include the file names and redo all of the above.
Internally, the implementation picks up from where it left off.

I still need to work on indexing and on LSI-based and LDA-based modeling of the corpus using the above framework, and I am hoping to add more posts as I learn about them.

## Wednesday, April 13, 2011

### Hadoop streaming blues

I am having trouble using Hadoop streaming to solve a simple nearest-neighbor problem.

The input data is in the following format:

key '\t' value

The key is the imageid for which the nearest neighbor will be computed. The value is a 100-dimensional vector of floating-point values separated by spaces or tabs.

The mapper reads in the query (a 100-dimensional vector) and each line of the input, and outputs a (key2, value2) pair, where key2 is a floating-point value indicating the distance and value2 is the imageid.

The number of reducers is set to 1, and the reducer is set to be the identity reducer.

I tried to use the following command:

bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable \
-files /home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1 \
-input datain/comparedata \
-output dataout5 \
-mapper file1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-verbose

The output stream is below. The failure is in the mapper itself, more specifically in the TEXTOUTPUTREADER. I am not sure how to fix this. The logs are attached below:

11/04/13 13:22:15 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[]
STREAM: shipped: false /usr/local/hadoop/file1
STREAM: cmd=file1
STREAM: cmd=null
STREAM: shipped: false /usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
STREAM: Found runtime classes in: /usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/] [] /tmp/streamjob2923554781371902680.jar tmpDir=null
JarBuilder.addNamedStream META-INF/MANIFEST.MF
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/UTF8ByteArrayUtils.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBuilder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$StreamConsumer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MRErrorThread.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamKeyValUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeCombiner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeReducer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamInputFormat.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRunner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MROutputThread.class
STREAM: ==== JobConf properties:
STREAM: dfs.block.access.key.update.interval=600
STREAM: dfs.block.access.token.enable=false
STREAM: dfs.blockreport.initialDelay=0
STREAM: dfs.blockreport.intervalMsec=21600000
STREAM: dfs.blocksize=67108864
STREAM: dfs.bytes-per-checksum=512
STREAM: dfs.client-write-packet-size=65536
STREAM: dfs.client.block.write.retries=3
STREAM: dfs.client.https.keystore.resource=ssl-client.xml
STREAM: dfs.client.https.need-auth=false
STREAM: dfs.datanode.balance.bandwidthPerSec=1048576
STREAM: dfs.datanode.data.dir=file://${hadoop.tmp.dir}/dfs/data
STREAM: dfs.datanode.data.dir.perm=755
STREAM: dfs.datanode.directoryscan.interval=21600
STREAM: dfs.datanode.directoryscan.threads=1
STREAM: dfs.datanode.dns.interface=default
STREAM: dfs.datanode.dns.nameserver=default
STREAM: dfs.datanode.du.reserved=0
STREAM: dfs.datanode.failed.volumes.tolerated=0
STREAM: dfs.datanode.handler.count=3
STREAM: dfs.datanode.http.address=0.0.0.0:50075
STREAM: dfs.datanode.https.address=0.0.0.0:50475
STREAM: dfs.datanode.ipc.address=0.0.0.0:50020
STREAM: dfs.default.chunk.view.size=32768
STREAM: dfs.heartbeat.interval=3
STREAM: dfs.https.enable=false
STREAM: dfs.https.server.keystore.resource=ssl-server.xml
STREAM: dfs.namenode.accesstime.precision=3600000
STREAM: dfs.namenode.backup.address=0.0.0.0:50100
STREAM: dfs.namenode.backup.http-address=0.0.0.0:50105
STREAM: dfs.namenode.checkpoint.dir=file://${hadoop.tmp.dir}/dfs/namesecondary
STREAM: dfs.namenode.checkpoint.edits.dir=${dfs.namenode.checkpoint.dir}
STREAM: dfs.namenode.checkpoint.period=3600
STREAM: dfs.namenode.checkpoint.size=67108864
STREAM: dfs.namenode.decommission.interval=30
STREAM: dfs.namenode.decommission.nodes.per.interval=5
STREAM: dfs.namenode.delegation.key.update-interval=86400
STREAM: dfs.namenode.delegation.token.max-lifetime=604800
STREAM: dfs.namenode.delegation.token.renew-interval=86400
STREAM: dfs.namenode.edits.dir=${dfs.namenode.name.dir}
STREAM: dfs.namenode.handler.count=10
STREAM: dfs.namenode.logging.level=info
STREAM: dfs.namenode.max.objects=0
STREAM: dfs.namenode.name.dir=file://${hadoop.tmp.dir}/dfs/name
STREAM: dfs.namenode.replication.considerLoad=true
STREAM: dfs.namenode.replication.interval=3
STREAM: dfs.namenode.replication.min=1
STREAM: dfs.namenode.safemode.extension=30000
STREAM: dfs.namenode.safemode.threshold-pct=0.999f
STREAM: dfs.namenode.secondary.http-address=0.0.0.0:50090
STREAM: dfs.permissions.enabled=true
STREAM: dfs.permissions.superusergroup=supergroup
STREAM: dfs.replication=1
STREAM: dfs.replication.max=512
STREAM: dfs.stream-buffer-size=4096
STREAM: dfs.web.ugi=webuser,webgroup
STREAM: file.blocksize=67108864
STREAM: file.bytes-per-checksum=512
STREAM: file.client-write-packet-size=65536
STREAM: file.replication=1
STREAM: file.stream-buffer-size=4096
STREAM: fs.AbstractFileSystem.file.impl=org.apache.hadoop.fs.local.LocalFs
STREAM: fs.AbstractFileSystem.hdfs.impl=org.apache.hadoop.fs.Hdfs
STREAM: fs.automatic.close=true
STREAM: fs.checkpoint.dir=${hadoop.tmp.dir}/dfs/namesecondary
STREAM: fs.checkpoint.edits.dir=${fs.checkpoint.dir}
STREAM: fs.checkpoint.period=3600
STREAM: fs.checkpoint.size=67108864
STREAM: fs.defaultFS=hdfs://localhost:54310
STREAM: fs.df.interval=60000
STREAM: fs.file.impl=org.apache.hadoop.fs.LocalFileSystem
STREAM: fs.ftp.impl=org.apache.hadoop.fs.ftp.FTPFileSystem
STREAM: fs.har.impl=org.apache.hadoop.fs.HarFileSystem
STREAM: fs.har.impl.disable.cache=true
STREAM: fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
STREAM: fs.hftp.impl=org.apache.hadoop.hdfs.HftpFileSystem
STREAM: fs.hsftp.impl=org.apache.hadoop.hdfs.HsftpFileSystem
STREAM: fs.kfs.impl=org.apache.hadoop.fs.kfs.KosmosFileSystem
STREAM: fs.ramfs.impl=org.apache.hadoop.fs.InMemoryFileSystem
STREAM: fs.s3.block.size=67108864
STREAM: fs.s3.buffer.dir=${hadoop.tmp.dir}/s3
STREAM: fs.s3.maxRetries=4
STREAM: fs.s3.sleepTimeSeconds=10
STREAM: fs.s3n.block.size=67108864
STREAM: fs.trash.interval=0
STREAM: ftp.blocksize=67108864
STREAM: ftp.bytes-per-checksum=512
STREAM: ftp.client-write-packet-size=65536
STREAM: ftp.replication=3
STREAM: ftp.stream-buffer-size=4096
STREAM: hadoop.tmp.dir=/usr/local/hadoop-${user.name}
STREAM: hadoop.util.hash.type=murmur
STREAM: io.bytes.per.checksum=512
STREAM: io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
STREAM: io.file.buffer.size=4096
STREAM: io.map.index.skip=0
STREAM: io.mapfile.bloom.error.rate=0.005
STREAM: io.mapfile.bloom.size=1048576
STREAM: io.native.lib.available=true
STREAM: io.seqfile.compress.blocksize=1000000
STREAM: io.seqfile.lazydecompress=true
STREAM: io.seqfile.local.dir=${hadoop.tmp.dir}/io/local
STREAM: io.seqfile.sorter.recordlimit=1000000
STREAM: io.skip.checksum.errors=false
STREAM: ipc.client.connect.max.retries=10
STREAM: ipc.client.connection.maxidletime=10000
STREAM: ipc.client.idlethreshold=4000
STREAM: ipc.client.kill.max=10
STREAM: ipc.client.tcpnodelay=false
STREAM: ipc.server.listen.queue.size=128
STREAM: ipc.server.tcpnodelay=false
STREAM: kfs.blocksize=67108864
STREAM: kfs.bytes-per-checksum=512
STREAM: kfs.client-write-packet-size=65536
STREAM: kfs.replication=3
STREAM: kfs.stream-buffer-size=4096
STREAM: mapred.child.java.opts=-Xmx200m
STREAM: mapreduce.client.completion.pollinterval=5000
STREAM: mapreduce.client.genericoptionsparser.used=true
STREAM: mapreduce.client.output.filter=FAILED
STREAM: mapreduce.client.progressmonitor.pollinterval=1000
STREAM: mapreduce.client.submit.file.replication=10
STREAM: mapreduce.cluster.local.dir=${hadoop.tmp.dir}/mapred/local
STREAM: mapreduce.cluster.temp.dir=${hadoop.tmp.dir}/mapred/temp
STREAM: mapreduce.input.fileinputformat.split.minsize=0
STREAM: mapreduce.job.committer.setup.cleanup.needed=true
STREAM: mapreduce.job.complete.cancel.delegation.tokens=true
STREAM: mapreduce.job.jar=/tmp/streamjob2923554781371902680.jar
STREAM: mapreduce.job.maps=2
STREAM: mapreduce.job.queuename=default
STREAM: mapreduce.job.reduce.slowstart.completedmaps=0.05
STREAM: mapreduce.job.reduces=1
STREAM: mapreduce.job.speculative.slownodethreshold=1.0
STREAM: mapreduce.job.speculative.speculativecap=0.1
STREAM: mapreduce.job.split.metainfo.maxsize=10000000
STREAM: mapreduce.job.userlog.retain.hours=24
STREAM: mapreduce.jobtracker.expire.trackers.interval=600000
STREAM: mapreduce.jobtracker.handler.count=10
STREAM: mapreduce.jobtracker.heartbeats.in.second=100
STREAM: mapreduce.jobtracker.jobhistory.block.size=3145728
STREAM: mapreduce.jobtracker.jobhistory.lru.cache.size=5
STREAM: mapreduce.jobtracker.persist.jobstatus.active=true
STREAM: mapreduce.jobtracker.persist.jobstatus.dir=/jobtracker/jobsInfo
STREAM: mapreduce.jobtracker.persist.jobstatus.hours=1
STREAM: mapreduce.jobtracker.restart.recover=false
STREAM: mapreduce.jobtracker.retiredjobs.cache.size=1000
STREAM: mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging
STREAM: mapreduce.jobtracker.system.dir=${hadoop.tmp.dir}/mapred/system
STREAM: mapreduce.map.log.level=INFO
STREAM: mapreduce.map.maxattempts=4
STREAM: mapreduce.map.output.compress=false
STREAM: mapreduce.map.skip.maxrecords=0
STREAM: mapreduce.map.skip.proc.count.autoincr=true
STREAM: mapreduce.map.sort.spill.percent=0.80
STREAM: mapreduce.map.speculative=true
STREAM: mapreduce.output.fileoutputformat.compress=false
STREAM: mapreduce.output.fileoutputformat.compression.type=RECORD
STREAM: mapreduce.reduce.input.buffer.percent=0.0
STREAM: mapreduce.reduce.log.level=INFO
STREAM: mapreduce.reduce.markreset.buffer.percent=0.0
STREAM: mapreduce.reduce.maxattempts=4
STREAM: mapreduce.reduce.merge.inmem.threshold=1000
STREAM: mapreduce.reduce.shuffle.connect.timeout=180000
STREAM: mapreduce.reduce.shuffle.input.buffer.percent=0.70
STREAM: mapreduce.reduce.shuffle.merge.percent=0.66
STREAM: mapreduce.reduce.shuffle.parallelcopies=5
STREAM: mapreduce.reduce.skip.maxgroups=0
STREAM: mapreduce.reduce.skip.proc.count.autoincr=true
STREAM: mapreduce.reduce.speculative=true
STREAM: net.topology.script.number.args=100
STREAM: s3.blocksize=67108864
STREAM: s3.bytes-per-checksum=512
STREAM: s3.client-write-packet-size=65536
STREAM: s3.replication=3
STREAM: s3.stream-buffer-size=4096
STREAM: s3native.blocksize=67108864
STREAM: s3native.bytes-per-checksum=512
STREAM: s3native.client-write-packet-size=65536
STREAM: s3native.replication=3
STREAM: s3native.stream-buffer-size=4096
STREAM: stream.map.streamprocessor=file1
STREAM: stream.numinputspecs=1
STREAM: tmpfiles=file:/home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1
STREAM: webinterface.private.actions=false
STREAM: ====
STREAM: submitting to jobconf: localhost:54311
11/04/13 13:22:17 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/13 13:22:17 INFO mapreduce.JobSubmitter: number of splits:2
11/04/13 13:22:17 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/04/13 13:22:17 INFO streaming.StreamJob: Running job: job_201104131251_0002
11/04/13 13:22:17 INFO streaming.StreamJob: To kill this job, run:
11/04/13 13:22:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201104131251_0002
11/04/13 13:22:18 INFO streaming.StreamJob:  map 0%  reduce 0%
11/04/13 13:23:19 INFO streaming.StreamJob:  map 100%  reduce 100%
11/04/13 13:23:19 INFO streaming.StreamJob: To kill this job, run:
11/04/13 13:23:19 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201104131251_0002
11/04/13 13:23:19 ERROR streaming.StreamJob: Job not Successful!
11/04/13 13:23:19 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

I looked at the output of the mapper, and it fails with:

java.lang.NullPointerException
at java.lang.String.<init>(String.java:523)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)

## Friday, April 8, 2011

### Hadoop troubleshooting tips: Hadoop hangs before launching a job

Sometimes a Hadoop job hangs right after printing the following:

11/04/08 13:52:59 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000

After 6 hours of diagnosis, I realized there are two possible problems:

1) The namenode needs formatting. You do this by going to your hadoop-hadoop/ directory, deleting everything in there, and running bin/hdfs namenode -format
2) Something else has failed: examine the logs and look for exceptions in the datanode and tasktracker.

## Monday, February 21, 2011

### R on amazon EMR

Seems like I found an opening. This tutorial seems to be very relevant and useful to me. If I intend to work on the cluster from R, I need to tell R where the cluster is sitting. Maybe this is the way:

http://jeffreybreen.wordpress.com/

## Thursday, February 17, 2011

### ssh tunnelling blues

I have been wrestling with ssh tunneling issues, and for the first time I have been able to set up an ssh session with the remote server.

The nomenclature is explained in this beautiful tutorial. The real trick I got from this tutorial and another tutorial.

Run the following:

myPC$ ssh -t userid@gateway ssh remoteserver

Still gotta figure out how to do sftp through ssh tunneling

## Wednesday, February 16, 2011

A native Hadoop job is one big program that defines its mapper and reducer together (compiled as one unit), with all the configuration details set inside that same program.
With Hadoop streaming, you can run a mapper of any kind and a reducer of any kind, and specify the other details (configuration etc.) externally.
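To make "a mapper of any kind" concrete: a streaming mapper is just an executable that reads input lines on stdin and writes tab-separated key/value lines on stdout. Below is a minimal Python sketch of such a mapper for the nearest-neighbor setup from the "Hadoop streaming blues" post (imageid, a tab, then a vector). It is not the IdentityMapper.R used there. The function names, the Euclidean distance, and passing the query vector as command-line arguments are all assumptions made for illustration.

```python
import math
import sys
from typing import Iterable, Iterator, List


def parse_vector(text: str) -> List[float]:
    """Parse floats separated by spaces or tabs."""
    return [float(x) for x in text.split()]


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_lines(lines: Iterable[str], query: List[float]) -> Iterator[str]:
    """Turn 'imageid<TAB>vector' lines into 'distance<TAB>imageid' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The first tab separates the key (imageid) from the value (vector).
        image_id, _, vector_text = line.partition("\t")
        vector = parse_vector(vector_text)
        if len(vector) != len(query):
            continue  # skip malformed records instead of crashing the task
        yield f"{euclidean(query, vector):.6f}\t{image_id}"


# Assumption: the query vector arrives as command-line arguments.
# With no arguments nothing runs, so the module can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    query_vector = parse_vector(" ".join(sys.argv[1:]))
    for output_line in map_lines(sys.stdin, query_vector):
        print(output_line)
```

One caveat with this design: streaming treats keys as text unless configured otherwise, so the distance keys will not necessarily sort numerically on their way to the reducer.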

## Tuesday, February 15, 2011

### algorithmic and algorithm

The algorithm environment wraps algorithmic and can be styled as boxed, ruled, or plain. However, the style is given as a package option when the algorithm package is loaded, so it applies globally to every algorithm in the document.

\usepackage[boxed]{algorithm}

and then whenever you create an algorithm:

\begin{algorithm}
\caption{XOXO}
\label{XOXO}
\begin{algorithmic}[1]
.
.
.

\end{algorithmic}
\end{algorithm}

## Monday, February 14, 2011

### Finally figured the svn puzzle

Today is my lucky day. SVN has finally shown me some love. This is what I always wanted to do:

1) Have a data folder "research" that needs backing up and can be checked out from either the lab computer or the home computer
2) Create two users on the server machine (where you run svnadmin commands)
3) From either the lab computer or the home machine, use svn co svn+ssh://username@server/pathtorepos to check out files
4) The passwd file in /pathtorepos/conf/passwd is not of any use... this file really got me twisted. No matter what the contents of this file are, there needs to be a login on the machine for each user that wants to check out any data
5) Last but not least, once the "research" folder is imported, one needs to check it back out onto the machine where modifications need to be made.

## Sunday, February 13, 2011

### Change the float page fraction

\renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text
\renewcommand{\textfraction}{0.07} % allow minimal text w. figs
% Parameters for FLOAT pages (not text pages):
\renewcommand{\floatpagefraction}{0.7} % require fuller float pages
% N.B.: floatpagefraction MUST be less than topfraction !!
\renewcommand{\dblfloatpagefraction}{0.7} % require fuller float pages

## Friday, February 11, 2011

### svn version control tips and pitfalls

Here is a step by step procedure to do the following

The data to be imported is in /media/data

There are two users on this machine: lab_user and home_user. These users modify or update the repository from two different locations, labpc and homepc.

All computers run Ubuntu OS

2) Prefer web access

I will update the post later on web access. For now I will post instructions on setting it up.

On labPC

1) Run
$ svnadmin create /svnrepos
$ svn import /media/data file:///svnrepos -m "initial import"

2) Change /svnrepos/conf/svnserve.conf to look like this:
[general]
anon-access = none
auth-access = write
password-db = passwd

3) Modify the password file /svnrepos/conf/passwd to:
User1 = passw1
User2 = pass2

4) On homepc, change /etc/hosts and add the following line:
lab_ipaddress mysvn.server.purdue.edu
The name mysvn.server.purdue.edu could be changed to anything.

5) On homepc, add the following lines to .ssh/config:
Host mysvn.server.purdue.edu
User mylogin
Port some_number

6) Finally, try
$ svn list svn+ssh://mylogin@mysvn.server.purdue.edu/svnrepos

Common pitfalls

1) The lines in svnserve.conf should have no leading spaces
2) The port number specified in step 5 should not be a port that is commonly used...

More to come later

### Latex table of figures

\begin{figure}
\centering
\begin{tabular}{cc}
\begin{minipage}[c]{0.5\linewidth}
\epsfig{file=myimage.eps,width=\linewidth}
\caption{Image 1}
\end{minipage} &
\begin{minipage}[c]{0.5\linewidth}
\epsfig{file=edgeimage2.eps,width=\linewidth}
\caption{Image 2}
\end{minipage} \\
\begin{minipage}[c]{0.5\linewidth}
\epsfig{file=out3.eps,width=\linewidth}
\caption{Image 3}
\end{minipage} &
\begin{minipage}[c]{0.5\linewidth}
\epsfig{file=out.eps,width=\linewidth}
\caption{Image 4}
\end{minipage} \\
\end{tabular}
\end{figure}