Saturday, November 03, 2012

Pig HBase integration on MapR

This procedure describes how to access HBase tables from Pig on MapR Hadoop clusters.

On the client node where Pig is installed, add the following to /opt/mapr/conf/env.sh:

export PIG_CLASSPATH=$PIG_CLASSPATH:/location-to-hbase-jar

If you are launching Pig on a node where hbase-regionserver
or hbase-master is installed, simply add the location of the
hbase-0.92.1.jar to the PIG_CLASSPATH variable above, e.g.

export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar"

If you don't have HBase installed, the HBase jar can be copied
directly from any node where HBase is installed to some location on
the Pig client node. Include that location in the definition above, e.g.

export PIG_CLASSPATH=$PIG_CLASSPATH:/opt/mapr/lib/hbase-0.92.1.jar

Then identify your ZooKeeper nodes:

maprcli node listzookeepers

and add this variable to /opt/mapr/conf/env.sh accordingly, using your own ZooKeeper hosts and client port:

export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63"

Launch the Pig job and you should be able to access HBase.
NB: Refer to HBase tables by their table name only when accessing them. Do not use
an hbase:// prefix. See the sample script below.
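
For example, once env.sh is updated and the sample script below is saved as hbase_pig.pig on the client node, the job can be launched from the shell; a minimal sketch:

# run the Pig script against the cluster (assumes the PIG_CLASSPATH and PIG_OPTS set above are in effect)
pig hbase_pig.pig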

Sample env.sh

[root@nmk-centos-60-3 ~]# cat /opt/mapr/conf/env.sh
#!/bin/bash
# Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved
# Please set all environment variable you want to be used during MapR cluster
# runtime here.
# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS

export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63"
export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/conf:/usr/java/default/lib/tools.jar:/opt/mapr/hbase/hbase-0.92.1:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$PIG_CLASSPATH"
export CLASSPATH="$CLASSPATH:$HADOOP_CLASSPATH"
#export JAVA_HOME=
#export MAPR_SUBNETS=
#export MAPR_HOME=
#export MAPR_ULIMIT_U=
#export MAPR_ULIMIT_N=
#export MAPR_SYSCTL_SOMAXCONN=
#export PIG_CLASSPATH=:$PIG_CLASSPATH
[root@nmk-centos-60-3 ~]#

Sample hbase insertion script

[root@nmk-centos-60-3 nabeel]# cat hbase_pig.pig
raw_data = LOAD '/user/mapr/input2.csv' USING PigStorage(',') AS (
listing_id: chararray,
fname: chararray,
lname: chararray );

STORE raw_data INTO 'sample_names' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');
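
To check that the rows actually landed in HBase, one quick option is to scan the table from the hbase shell, for example:

# scan the sample_names table to verify the inserted rows
echo "scan 'sample_names'" | hbase shell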


Thursday, October 04, 2012

Awk display / print file from column N onwards


Any output can be piped through this to display each line from column N onwards (here using ':' as the field separator):
awk -F: -v nr=N '{ for (x=nr; x<=NF; x++) {printf $x " "};print " ";}'
An example usage: find the first disks re-initialized by MapR-FS this month.
find . -name mfs.log -exec grep -H ^2012\-10\-* {} \; | grep spinit.cc:1002 | awk -F: -v nr=2 '{ for (x=nr; x<=NF; x++) \
{printf $x " "};print " ";}' | awk -v nr=2 '{ for (x=nr; x<=NF; x++) {printf $x " "};print " ";}' | sort -n | head -n 5

UPDATE:

Just realized cut can do this in a much simpler way:

cut -d : -f N-

where N is the column from which you want to start displaying each line.
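
For example, to print each line of /etc/passwd from the third colon-separated field onwards:

# print everything from field 3 onwards, fields delimited by ':'
cut -d : -f 3- /etc/passwd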

Thursday, September 13, 2012

Finding the culprit in df disk utilization different from du

Sometimes df and du show different outputs due to the open-file-descriptor issue (space held by deleted files that are still open). That's explained in detail all over Google, so I'm not touching it here. Here's a quick command to find the culprit that makes the difference. It shows the top 5 open files that are roughly a gigabyte or larger (10-digit sizes); adjust the repetition count in the digit regex to change that threshold. Keep in mind that temp files occasionally have names containing long runs of digits, so those will also show up in the grep results, but the sort is done on the file-size column.


lsof -n -P | grep -E '[[:digit:]]{10}' | sort -n -r -k 7 | head -n 5
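
Since the df/du gap usually comes from deleted files that are still held open, a variant that filters on lsof's "(deleted)" marker can narrow things down further; a rough sketch:

# top 5 largest deleted-but-still-open files (size is column 7 of lsof output)
lsof -n -P | grep '(deleted)' | sort -n -r -k 7 | head -n 5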

Monday, July 02, 2012

"java.io.IOException: Could not create FileClient"

"java.io.IOException: Could not create FileClient"

If you see this error in the JobTracker logs for a failed job, you are most likely looking at an irrelevant error. To be clearer: your job failed before this point and was subsequently killed by the JobTracker. Processes that linger in memory after the job is killed keep attempting to access the resources assigned to them until they time out and are eventually killed. Those resources, e.g. the file or its parent directory, were removed when the JobTracker killed the job, so the stray processes cannot create any new files.


Saturday, June 23, 2012

HBase: fixing "Region //regionname// on HDFS but not listed in .META or deployed on any server"

If you see this error in HBase

Region //regionname// on HDFS but not listed in .META or deployed on any server

then use the add_regions.rb script to fix it.

Normally HBase errors can be fixed with hbase hbck -fix,
but that approach mostly works in cases where the region is listed in .META. but not assigned, and similar situations.

In such cases, the script add_regions.rb comes to our rescue. It can be used as follows:

hbase org.jruby.Main add_regions.rb //full-path-region-on-HDFS/.regioninfo

This will add the region to the .META. table. Next, assign the region to a region server. To do this, launch the hbase shell and issue the command

assign 'full-region-name'

on the hbase shell.

The full region name follows the standard format tablename,startrowkey,timestamp.md5hash.

The dot at the end is important and should be included while passing to the assign command.
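
For illustration, with a hypothetical table mytable and a made-up region directory and hash, the two steps might look like this (substitute your own HDFS path and region name from the hbck output):

# add the region's .regioninfo from HDFS to .META. (path is hypothetical)
hbase org.jruby.Main add_regions.rb /hbase/mytable/c3b8f2a1d94e7b6a5f0c1d2e3f4a5b6c/.regioninfo
# then assign it to a region server from the hbase shell (region name is made up)
echo "assign 'mytable,,1340000000000.c3b8f2a1d94e7b6a5f0c1d2e3f4a5b6c.'" | hbase shell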

Now run hbase hbck again and look at status 'simple' in the hbase shell. Your region count should have increased by the number of regions you added, and the errors about regions not listed in .META. should be gone.

Tuesday, June 12, 2012

Python copy absolute value of array elements into another array


In the array below, some values are positive, while others are negative.

>>> x = [-538,-181,-145,552,-847,6,141,-58,-122,314,-816,245,594,-613,-287,1232,-1479,-326,-197,715,4,-677,95,308,-1224,953,-81,-189,341,-654,242,-948,1088,-533,-328,123,552,-855,49,-443,-37,-57,199,56,-459,-47,-167,13,-521,476,-161,440,-540,180,43,-57,-236,-29,-830,265,-2,-379,-9,198,12,79,-257,113]
>>>


I want the sum of the array after converting its elements to their absolute values.

>>> values = [abs(v) for v in x]
>>> values
[538, 181, 145, 552, 847, 6, 141, 58, 122, 314, 816, 245, 594, 613, 287, 1232, 1479, 326, 197, 715, 4, 677, 95, 308, 1224, 953, 81, 189, 341, 654, 242, 948, 1088, 533, 328, 123, 552, 855, 49, 443, 37, 57, 199, 56, 459, 47, 167, 13, 521, 476, 161, 440, 540, 180, 43, 57, 236, 29, 830, 265, 2, 379, 9, 198, 12, 79, 257, 113]
>>> print sum(values)
24957


Thursday, June 07, 2012

Extract each archive into its own directory


Assuming the archives are all bzip2-compressed tarballs with a .tbz2 extension:

for i in *.tbz2; do a=$(basename "$i" .tbz2); mkdir "$a"; cd "$a"; tar -xvf "../$i"; cd ..; done

Friday, February 17, 2012

Identify user from python mapper in Hadoop

If you are running a MapReduce job using Hadoop Streaming with Python, and want to know which user the job runs as, or other parameters in the OS environment on the TaskTracker node, use

import os, sys

if 'user_name' in os.environ:
    user = os.environ['user_name']
    sys.stdout.write('Job runs as username: ' + user + '\n')

To check which variables are available, use this:

envdict = os.environ
keylist = envdict.keys()
sys.stdout.write('Variables available: ' + str(keylist) + '\n')


Friday, February 10, 2012

Moving Hbase heap dump locations

Occasionally HBase generates a heap dump on an OutOfMemoryError and places it in its bin folder (the default location). This can cause quite a few issues if the system partitioning is not prepared for multi-gigabyte files in the HBase bin folder. To move the heap dumps to another folder:

In hbase-<version>/conf/hbase-env.sh, add -XX:HeapDumpPath=/path/to/dump to the line

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError $HBASE_GC"
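
For example, with a hypothetical dump directory of /data/hbase-dumps, the line would become:

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/hbase-dumps $HBASE_GC"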

Restart HBase.

Now you can sleep well without fears of heap dumps filling up the hbase binary partitions :)

Thursday, January 19, 2012

Compare list of files to identify files and directories

Say xyz is a file with a list of files and directories. (The files and directories are all on the working system itself)
You need to print only the directories amongst them.

for i in $(cat xyz); do if [ -d "$i" ]; then echo "$i"; fi; done

will print out only directories among them.
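
If the list in xyz can contain paths with spaces, a while-read loop is a safer sketch of the same idea:

# read each line of xyz and print it only if it is a directory
while read -r i; do [ -d "$i" ] && echo "$i"; done < xyz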