I have been playing with Matlab's TMG, and here are my two cents from experience with the software.
Pros: Easy to install; easy to index a corpus and queries.
Cons: Proprietary code. It is impossible to contribute, modify, or fix bugs in the code. It is not maintained well and has no discussion forum where users and developers can communicate. The last citation was in 2009.
I was unable to run even some basic retrieval tasks due to a bug in the code: norm2 does not work for sparse arrays and needs to be replaced with normest, but I was unable to make the change because the code is protected and proprietary.
Friday, October 21, 2011
Thursday, July 14, 2011
Counter-intuitive results
I have been researching how restricting the vocabulary of a dataset affects retrieval results, and I am seeing some really counter-intuitive results.
This is the format of the experiment:
I create a vocabulary using only a percentage of the documents, not all of them. The percentages I used were (10, 20, 30, 40, 50, 60, 70, 80, 90), and I then performed retrieval and calculated MAP.
There were two key observations that I found interesting.
1) For software corpora, I found that even with 10% of the documents, 45% of the original vocabulary was covered. This means that software vocabulary tends to have a more uniform distribution of terms/identifiers/variable names across source files (a quick way to check this is sketched after these observations).
2) With the Vector Space Model with tf-idf weighting, I found that using 10-30% of the documents to create the vocabulary (or dictionary) gave better performance than the original vocabulary. The only explanation I have for this result is that some words are more important for retrieval than others. I should mention that the original vocabulary is itself a pruned version of the raw vocabulary, obtained by removing sparse terms from the document-term matrix.
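For observation 1, a coverage check like the following is what I have in mind (a minimal sketch in plain Python; the directory name and the whitespace tokenization are my own simplifying assumptions, not the actual preprocessing pipeline):

import os, random

# Sketch: how much of the full vocabulary is recovered when the
# dictionary is built from only p% of the documents?
doc_dir = 'corpus'  # hypothetical directory of plain, pre-pruned text files
files = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]

def vocab(file_list):
    words = set()
    for path in file_list:
        with open(path) as f:
            words.update(f.read().split())  # whitespace tokenization
    return words

full = vocab(files)
for p in (10, 20, 30, 40, 50, 60, 70, 80, 90):
    sample = random.sample(files, max(1, len(files) * p // 100))
    coverage = len(vocab(sample)) / float(len(full))
    print('%d%% of documents -> %.1f%% of the vocabulary' % (p, coverage * 100))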
Shivani
Tuesday, June 28, 2011
Back to Square one
I am lost every time I come here. It's a maze, and I cannot find a way out. Even if I do make a way out, I am unclear as to what happened back there.
I am now making a step-by-step plan for how I can tackle this gigantic problem I am facing.
What am I trying to do?
My main goal is to try and do distributed topic modeling using R on Amazon EMR.
The steps I need to take now to solve this problem:
1) Install Hadoop, run a single-node Hadoop cluster, and run basic mapper/reducer scripts on it (see the sketch after this list)
2) Run R on Hadoop using Hive and try to do the same via R
3) Run distributed tm in R
4) Run Mahout on a single-node Hadoop cluster
5) Using Hive, try to convert data types between R and Mahout
Do all of the above on the Amazon EMR cluster using its Ruby client.
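To make step 1 concrete, the "basic mapper reducer scripts" I have in mind are just the classic streaming word count (a sketch; the file names mapper.py and reducer.py are my own choice):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word)

#!/usr/bin/env python
# reducer.py: streaming sorts mapper output by key, so all counts for a
# word arrive together and can be summed in a single pass
import sys
current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip('\n').partition('\t')
    if word == current:
        count += int(n)
    else:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = word, int(n)
if current is not None:
    print('%s\t%d' % (current, count))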
Loads of painful nights ahead, but hopefully rewarding too
Tuesday, May 10, 2011
The recovery from crash entry
I am penning down personal notes on how to recover from a crash.
Set up automatic rsync with a cron job and update every day.
Back up the scripts that you used to fix issues with your computer. In my case:
Sound issue fixing
R
java
Python
cheese webcam
texmaker
lyx
skype
gtalk video voice chat
auto ssh-passwordless login into most commonly used servers
svn for code
pdf annotator
mendeleydesktop
eclipse IDE
perl, python, plugins for Eclipse IDE
Here is the link to the page on how to partition
Tuesday, May 3, 2011
password-less ssh into multiple machines from a single machine
I have been trying to get password-less ssh login into two or more machines (servers). The tutorials available online are great, but they do not cover one corner case.
The default private and public keys are named id_rsa and id_rsa.pub, so every time you attempt a password-less login, ssh looks for the id_rsa private key and matches it against the server's ~/.ssh/authorized_keys. However, what happens when you want to log in to multiple servers? You can either:
1) put the same public key id_rsa.pub on all servers where you want password-less login, or
2) create a separate private-public key pair per server and use ssh-agent to add the private keys.
For each public-private key pair (new_rsa and new_rsa.pub), do the following:
localuser@localmachine$ scp ~/.ssh/new_rsa.pub username@server:~/.ssh/new_rsa.pub # copy to the .ssh folder
localuser@localmachine$ ssh username@server # log in to the server
username@server$ cat ~/.ssh/new_rsa.pub >> ~/.ssh/authorized_keys # append to the existing authorized keys
localuser@localmachine$ ssh-agent bash
localuser@localmachine$ ssh-add ~/.ssh/new_rsa
Tuesday, April 26, 2011
R tips and tricks
1) Sys.setenv() and Sys.getenv() can help you set environment variables for packages like rJava from within R.
Syntax:
print(Sys.setenv(R_TEST = "testit", "A+C" = 123)) # `A+C` could also be used
Sys.getenv("R_TEST")
Sys.unsetenv("R_TEST") # may warn and not succeed
2) R has no bounds checking for lists or arrays, so one has to take care of it manually. For example, arr[length(arr)+1] simply yields NA. This becomes an issue when running for loops:
for (i in (2:length(arr))) will not work out well if length(arr) < 2.
Friday, April 22, 2011
Using gensim... the basics
Thanks to the extremely active community of gensim, I have made my way through some basic commands in Python and gensim.
I have a directory of text documents that I want to index and build a topic model on. Each file in the directory is a document containing plain text. Let's assume that the text is already pruned for stopwords, special characters, etc.
I need to write a custom override of the get_texts() function of TextCorpus, and this is how I achieve it:
import gensim

def split_line(text):
    # tokenize a document by splitting on whitespace
    return text.split()

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # self.input is the list of file names passed to the constructor
        for filename in self.input:
            with open(filename) as f:
                yield split_line(f.read())
If b is a list of file names, then myCorpus = MyCorpus(b) will create the corpus, and:
myCorpus.dictionary has all the unique words
myCorpus.dictionary.token2id.items() gives the word-id pairs
myCorpus.dictionary.token2id.keys() gives the unique words
myCorpus.dictionary.token2id.values() gives the corresponding ids
One can save it in Matrix Market format using the following command
`gensim.corpora.MmCorpus.serialize('mycorpus.mm', myCorpus)`
In order to add new documents, just extend the list b to include the new file names and redo all of the above; internally, the implementation picks up from where it left off.
I still need to work on indexing and LSI- and LDA-based modeling of the corpus using the above framework, and I hope to add more posts as I learn about them.
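For my own reference, here is a rough sketch of what that next step might look like with gensim's API, reusing the myCorpus and split_line defined above; I have not run this end to end yet, and the number of topics and the query.txt file are placeholders:

import gensim

# load the serialized corpus back and reuse the dictionary built above
mm = gensim.corpora.MmCorpus('mycorpus.mm')
dictionary = myCorpus.dictionary

# tf-idf weighting, then LSI and LDA models on top of the corpus
tfidf = gensim.models.TfidfModel(mm)
lsi = gensim.models.LsiModel(tfidf[mm], id2word=dictionary, num_topics=100)  # num_topics is a placeholder
lda = gensim.models.LdaModel(mm, id2word=dictionary, num_topics=100)

# index the LSI space and run a similarity query against it
index = gensim.similarities.MatrixSimilarity(lsi[tfidf[mm]])
query_bow = dictionary.doc2bow(split_line(open('query.txt').read()))  # hypothetical query file
sims = index[lsi[tfidf[query_bow]]]  # cosine similarities against all documents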
Wednesday, April 13, 2011
Hadoop streaming blues
I am having trouble using Hadoop streaming to solve a simple nearest-neighbor problem.
The input data is in the following format: key '\t' value, where the key is the imageid for which the nearest neighbor will be computed, and the value is a 100-dimensional vector of floating point values separated by spaces or tabs.
The mapper reads in the query (the query is a 100-dimensional vector) and each line of the input, and outputs a key2 '\t' value2 pair, where key2 is a floating point value indicating the distance and value2 is the imageid.
The number of reducers is set to 1, and the reducer is set to be the identity reducer.
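For clarity, the mapper logic is roughly the following (sketched here in Python for illustration; my actual mapper is the R script IdentityMapper.R shipped to the job as file1, and hard-coding the query vector and using Euclidean distance are simplifications):

#!/usr/bin/env python
# Streaming mapper sketch: emit "distance<TAB>imageid" for every input record
import sys
import math

query = [0.0] * 100  # assumption: the 100-dimensional query vector is hard-coded here

for line in sys.stdin:
    parts = line.rstrip('\n').split('\t', 1)
    if len(parts) != 2:
        continue  # skip malformed records
    imageid, vec_str = parts
    vec = [float(x) for x in vec_str.split()]
    dist = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))
    # key2 = distance, value2 = imageid; streaming sorts by key before the single reducer
    print('%f\t%s' % (dist, imageid))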
I tried to use the following command
bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable -files /home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1 -input datain/comparedata -output dataout5 -mapper file1 -reducer org.apache.hadoop.mapred.lib.IdentityReducer -verbose
The output stream is below. The failure is in the mapper itself, more specifically in the TextOutputReader. I am not sure how to fix this. The logs are attached below:
11/04/13 13:22:15 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[]
STREAM: shipped: false /usr/local/hadoop/file1
STREAM: cmd=file1
STREAM: cmd=null
STREAM: shipped: false /usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
STREAM: Found runtime classes in: /usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/] [] /tmp/streamjob2923554781371902680.jar tmpDir=null
JarBuilder.addNamedStream META-INF/MANIFEST.MF
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/UTF8ByteArrayUtils.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBuilder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$StreamConsumer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MRErrorThread.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamKeyValUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeCombiner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeReducer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamInputFormat.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRunner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MROutputThread.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/HadoopStreaming.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/DumpTypedBytes.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/AutoInputFormat.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapper.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamBaseRecordReader.class
STREAM: ==== JobConf properties:
STREAM: dfs.block.access.key.update.interval=600
STREAM: dfs.block.access.token.enable=false
STREAM: dfs.block.access.token.lifetime=600
STREAM: dfs.blockreport.initialDelay=0
STREAM: dfs.blockreport.intervalMsec=21600000
STREAM: dfs.blocksize=67108864
STREAM: dfs.bytes-per-checksum=512
STREAM: dfs.client-write-packet-size=65536
STREAM: dfs.client.block.write.retries=3
STREAM: dfs.client.https.keystore.resource=ssl-client.xml
STREAM: dfs.client.https.need-auth=false
STREAM: dfs.datanode.address=0.0.0.0:50010
STREAM: dfs.datanode.balance.bandwidthPerSec=1048576
STREAM: dfs.datanode.data.dir=file://${hadoop.tmp.dir}/dfs/data
STREAM: dfs.datanode.data.dir.perm=755
STREAM: dfs.datanode.directoryscan.interval=21600
STREAM: dfs.datanode.directoryscan.threads=1
STREAM: dfs.datanode.dns.interface=default
STREAM: dfs.datanode.dns.nameserver=default
STREAM: dfs.datanode.du.reserved=0
STREAM: dfs.datanode.failed.volumes.tolerated=0
STREAM: dfs.datanode.handler.count=3
STREAM: dfs.datanode.http.address=0.0.0.0:50075
STREAM: dfs.datanode.https.address=0.0.0.0:50475
STREAM: dfs.datanode.ipc.address=0.0.0.0:50020
STREAM: dfs.default.chunk.view.size=32768
STREAM: dfs.heartbeat.interval=3
STREAM: dfs.https.enable=false
STREAM: dfs.https.server.keystore.resource=ssl-server.xml
STREAM: dfs.namenode.accesstime.precision=3600000
STREAM: dfs.namenode.backup.address=0.0.0.0:50100
STREAM: dfs.namenode.backup.http-address=0.0.0.0:50105
STREAM: dfs.namenode.checkpoint.dir=file://${hadoop.tmp.dir}/dfs/namesecondary
STREAM: dfs.namenode.checkpoint.edits.dir=${dfs.namenode.checkpoint.dir}
STREAM: dfs.namenode.checkpoint.period=3600
STREAM: dfs.namenode.checkpoint.size=67108864
STREAM: dfs.namenode.decommission.interval=30
STREAM: dfs.namenode.decommission.nodes.per.interval=5
STREAM: dfs.namenode.delegation.key.update-interval=86400
STREAM: dfs.namenode.delegation.token.max-lifetime=604800
STREAM: dfs.namenode.delegation.token.renew-interval=86400
STREAM: dfs.namenode.edits.dir=${dfs.namenode.name.dir}
STREAM: dfs.namenode.handler.count=10
STREAM: dfs.namenode.http-address=0.0.0.0:50070
STREAM: dfs.namenode.https-address=0.0.0.0:50470
STREAM: dfs.namenode.logging.level=info
STREAM: dfs.namenode.max.objects=0
STREAM: dfs.namenode.name.dir=file://${hadoop.tmp.dir}/dfs/name
STREAM: dfs.namenode.replication.considerLoad=true
STREAM: dfs.namenode.replication.interval=3
STREAM: dfs.namenode.replication.min=1
STREAM: dfs.namenode.safemode.extension=30000
STREAM: dfs.namenode.safemode.threshold-pct=0.999f
STREAM: dfs.namenode.secondary.http-address=0.0.0.0:50090
STREAM: dfs.permissions.enabled=true
STREAM: dfs.permissions.superusergroup=supergroup
STREAM: dfs.replication=1
STREAM: dfs.replication.max=512
STREAM: dfs.stream-buffer-size=4096
STREAM: dfs.web.ugi=webuser,webgroup
STREAM: file.blocksize=67108864
STREAM: file.bytes-per-checksum=512
STREAM: file.client-write-packet-size=65536
STREAM: file.replication=1
STREAM: file.stream-buffer-size=4096
STREAM: fs.AbstractFileSystem.file.impl=org.apache.hadoop.fs.local.LocalFs
STREAM: fs.AbstractFileSystem.hdfs.impl=org.apache.hadoop.fs.Hdfs
STREAM: fs.automatic.close=true
STREAM: fs.checkpoint.dir=${hadoop.tmp.dir}/dfs/namesecondary
STREAM: fs.checkpoint.edits.dir=${fs.checkpoint.dir}
STREAM: fs.checkpoint.period=3600
STREAM: fs.checkpoint.size=67108864
STREAM: fs.defaultFS=hdfs://localhost:54310
STREAM: fs.df.interval=60000
STREAM: fs.file.impl=org.apache.hadoop.fs.LocalFileSystem
STREAM: fs.ftp.impl=org.apache.hadoop.fs.ftp.FTPFileSystem
STREAM: fs.har.impl=org.apache.hadoop.fs.HarFileSystem
STREAM: fs.har.impl.disable.cache=true
STREAM: fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
STREAM: fs.hftp.impl=org.apache.hadoop.hdfs.HftpFileSystem
STREAM: fs.hsftp.impl=org.apache.hadoop.hdfs.HsftpFileSystem
STREAM: fs.kfs.impl=org.apache.hadoop.fs.kfs.KosmosFileSystem
STREAM: fs.ramfs.impl=org.apache.hadoop.fs.InMemoryFileSystem
STREAM: fs.s3.block.size=67108864
STREAM: fs.s3.buffer.dir=${hadoop.tmp.dir}/s3
STREAM: fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
STREAM: fs.s3.maxRetries=4
STREAM: fs.s3.sleepTimeSeconds=10
STREAM: fs.s3n.block.size=67108864
STREAM: fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
STREAM: fs.trash.interval=0
STREAM: ftp.blocksize=67108864
STREAM: ftp.bytes-per-checksum=512
STREAM: ftp.client-write-packet-size=65536
STREAM: ftp.replication=3
STREAM: ftp.stream-buffer-size=4096
STREAM: hadoop.common.configuration.version=0.21.0
STREAM: hadoop.hdfs.configuration.version=1
STREAM: hadoop.logfile.count=10
STREAM: hadoop.logfile.size=10000000
STREAM: hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.StandardSocketFactory
STREAM: hadoop.security.authentication=simple
STREAM: hadoop.security.authorization=false
STREAM: hadoop.tmp.dir=/usr/local/hadoop-${user.name}
STREAM: hadoop.util.hash.type=murmur
STREAM: io.bytes.per.checksum=512
STREAM: io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
STREAM: io.file.buffer.size=4096
STREAM: io.map.index.skip=0
STREAM: io.mapfile.bloom.error.rate=0.005
STREAM: io.mapfile.bloom.size=1048576
STREAM: io.native.lib.available=true
STREAM: io.seqfile.compress.blocksize=1000000
STREAM: io.seqfile.lazydecompress=true
STREAM: io.seqfile.local.dir=${hadoop.tmp.dir}/io/local
STREAM: io.seqfile.sorter.recordlimit=1000000
STREAM: io.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization
STREAM: io.skip.checksum.errors=false
STREAM: ipc.client.connect.max.retries=10
STREAM: ipc.client.connection.maxidletime=10000
STREAM: ipc.client.idlethreshold=4000
STREAM: ipc.client.kill.max=10
STREAM: ipc.client.tcpnodelay=false
STREAM: ipc.server.listen.queue.size=128
STREAM: ipc.server.tcpnodelay=false
STREAM: kfs.blocksize=67108864
STREAM: kfs.bytes-per-checksum=512
STREAM: kfs.client-write-packet-size=65536
STREAM: kfs.replication=3
STREAM: kfs.stream-buffer-size=4096
STREAM: map.sort.class=org.apache.hadoop.util.QuickSort
STREAM: mapred.child.java.opts=-Xmx200m
STREAM: mapred.input.format.class=org.apache.hadoop.mapred.TextInputFormat
STREAM: mapred.map.runner.class=org.apache.hadoop.streaming.PipeMapRunner
STREAM: mapred.mapper.class=org.apache.hadoop.streaming.PipeMapper
STREAM: mapred.output.format.class=org.apache.hadoop.mapred.TextOutputFormat
STREAM: mapred.reducer.class=org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: mapreduce.client.completion.pollinterval=5000
STREAM: mapreduce.client.genericoptionsparser.used=true
STREAM: mapreduce.client.output.filter=FAILED
STREAM: mapreduce.client.progressmonitor.pollinterval=1000
STREAM: mapreduce.client.submit.file.replication=10
STREAM: mapreduce.cluster.local.dir=${hadoop.tmp.dir}/mapred/local
STREAM: mapreduce.cluster.temp.dir=${hadoop.tmp.dir}/mapred/temp
STREAM: mapreduce.input.fileinputformat.inputdir=hdfs://localhost:54310/user/hadoop/datain/comparedata
STREAM: mapreduce.input.fileinputformat.split.minsize=0
STREAM: mapreduce.job.cache.symlink.create=yes
STREAM: mapreduce.job.committer.setup.cleanup.needed=true
STREAM: mapreduce.job.complete.cancel.delegation.tokens=true
STREAM: mapreduce.job.end-notification.retry.attempts=0
STREAM: mapreduce.job.end-notification.retry.interval=30000
STREAM: mapreduce.job.jar=/tmp/streamjob2923554781371902680.jar
STREAM: mapreduce.job.jvm.numtasks=1
STREAM: mapreduce.job.maps=2
STREAM: mapreduce.job.maxtaskfailures.per.tracker=4
STREAM: mapreduce.job.output.key.class=org.apache.hadoop.io.Text
STREAM: mapreduce.job.output.value.class=org.apache.hadoop.io.Text
STREAM: mapreduce.job.queuename=default
STREAM: mapreduce.job.reduce.slowstart.completedmaps=0.05
STREAM: mapreduce.job.reduces=1
STREAM: mapreduce.job.speculative.slownodethreshold=1.0
STREAM: mapreduce.job.speculative.slowtaskthreshold=1.0
STREAM: mapreduce.job.speculative.speculativecap=0.1
STREAM: mapreduce.job.split.metainfo.maxsize=10000000
STREAM: mapreduce.job.userlog.retain.hours=24
STREAM: mapreduce.job.working.dir=hdfs://localhost:54310/user/hadoop
STREAM: mapreduce.jobtracker.address=localhost:54311
STREAM: mapreduce.jobtracker.expire.trackers.interval=600000
STREAM: mapreduce.jobtracker.handler.count=10
STREAM: mapreduce.jobtracker.heartbeats.in.second=100
STREAM: mapreduce.jobtracker.http.address=0.0.0.0:50030
STREAM: mapreduce.jobtracker.instrumentation=org.apache.hadoop.mapred.JobTrackerMetricsInst
STREAM: mapreduce.jobtracker.jobhistory.block.size=3145728
STREAM: mapreduce.jobtracker.jobhistory.lru.cache.size=5
STREAM: mapreduce.jobtracker.maxtasks.perjob=-1
STREAM: mapreduce.jobtracker.persist.jobstatus.active=true
STREAM: mapreduce.jobtracker.persist.jobstatus.dir=/jobtracker/jobsInfo
STREAM: mapreduce.jobtracker.persist.jobstatus.hours=1
STREAM: mapreduce.jobtracker.restart.recover=false
STREAM: mapreduce.jobtracker.retiredjobs.cache.size=1000
STREAM: mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging
STREAM: mapreduce.jobtracker.system.dir=${hadoop.tmp.dir}/mapred/system
STREAM: mapreduce.jobtracker.taskcache.levels=2