Friday, February 14, 2014

Populating HBase tables with sample data


We often need to populate HBase tables with sample data to reproduce issues. Here is a simple procedure for this (from cloudavenue.com):

1) For creating a table 'testtable' with a column family 'colfam1'

create 'testtable', 'colfam1'

2) For confirming that the table was created

list 'testtable'

3) For inserting test data into the 'testtable' table.

put 'testtable', 'myrow-1', 'colfam1:q1', 'value-1'
put 'testtable', 'myrow-2', 'colfam1:q2', 'value-2'
put 'testtable', 'myrow-2', 'colfam1:q3', 'value-3'

The HBase shell is (J)Ruby's IRB with some HBase-related commands added, so anything that can be done in IRB can also be done in the HBase shell. The command below inserts 1,000 rows into the 'testtable' table.

for i in '0'..'9' do for j in '0'..'9' do \
for k in '0'..'9' do put 'testtable', "row-#{i}#{j}#{k}", \
"colfam1:#{j}#{k}", "#{j}#{k}" end end end

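To preview what the nested loop writes before touching a real table, the same iteration can be run in plain Ruby. This is only a sketch: sample_cells is a hypothetical helper name, not an HBase shell command, and it builds the (row key, column, value) triples instead of calling put.

```ruby
# Build the (row key, column, value) triples that the nested loop above
# would pass to `put`. sample_cells is a hypothetical helper, not part of
# the HBase shell.
def sample_cells(family = 'colfam1')
  cells = []
  ('0'..'9').each do |i|
    ('0'..'9').each do |j|
      ('0'..'9').each do |k|
        cells << ["row-#{i}#{j}#{k}", "#{family}:#{j}#{k}", "#{j}#{k}"]
      end
    end
  end
  cells
end

cells = sample_cells
puts cells.length                     # 1000 cells
puts cells.map(&:first).uniq.length   # 1000 distinct row keys
puts cells.first.inspect              # ["row-000", "colfam1:00", "00"]
puts cells.last.inspect               # ["row-999", "colfam1:99", "99"]
```

Inside the HBase shell, the same list could then be written with something like cells.each { |row, col, val| put 'testtable', row, col, val }.
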
4) For getting data from the 'testtable' table

get 'testtable', 'myrow-1'
scan 'testtable'

5) For deleting data from the 'testtable' table.

delete 'testtable', 'myrow-2', 'colfam1:q2'

6) For deleting the table.

disable 'testtable'
drop 'testtable'


To test a sample CSV import, use this bash one-liner to generate the CSV data:

for i in `seq 1 19`; do for j in `seq 1 9`; do for k in `seq 1 9`; do echo "row"$i",col"$j",value"$i"-"$j"-"$k; done; done; done

Increase the upper bounds of the loop variables to match the amount of data you want to load.

This should produce output of the form:

row1,col1,value1-1-1
row1,col1,value1-1-2
row1,col1,value1-1-3
row1,col1,value1-1-4
row1,col1,value1-1-5
row1,col1,value1-1-6
row1,col1,value1-1-7
row1,col1,value1-1-8
row1,col1,value1-1-9
row1,col2,value1-2-1
row1,col2,value1-2-2
row1,col2,value1-2-3
row1,col2,value1-2-4
row1,col2,value1-2-5
row1,col2,value1-2-6
row1,col2,value1-2-7
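
Many lines in this output share the same first field. If that field is later used as the HBase row key, lines with the same key land in the same row, so the table will hold far fewer rows than the CSV has lines. The Ruby sketch below generates similar data with a distinct key on every line. The name csv_lines and its parameters are hypothetical, chosen only for illustration.

```ruby
# Generate CSV lines shaped like the bash one-liner's output, but with a
# distinct first field (row key) on every line.
# csv_lines and its parameter names are hypothetical.
def csv_lines(rows, cols, vals)
  lines = []
  (1..rows).each do |i|
    (1..cols).each do |j|
      (1..vals).each do |k|
        lines << "row#{i}-#{j}-#{k},col#{j},value#{i}-#{j}-#{k}"
      end
    end
  end
  lines
end

lines = csv_lines(19, 9, 9)
puts lines.length   # 19 * 9 * 9 = 1539 lines
puts lines.first    # row1-1-1,col1,value1-1-1
```

To save the result, something like File.write('sample.csv', lines.join("\n") + "\n") should do, where sample.csv is a placeholder file name.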

Create a sample table:

create 'testtable', 'colfam1'

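Then load the CSV with importtsv. Exactly one column must be mapped to HBASE_ROW_KEY, the separator must be set to a comma (the default is a tab), and the table name and input directory go at the end. Here <input-dir> is a placeholder for the directory that holds the CSV file:

hadoop jar /opt/mapr/hbase/hbase-0.94.5/hbase-0.94.5-mapr.jar importtsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,colfam1:col,colfam1:val testtable <input-dir>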

Update:

To generate random data and print it on the screen from a Ruby shell:

irb(main):014:0> require 'securerandom'
=> true
irb(main):015:0> for i in '1'..'10' do puts SecureRandom.hex
irb(main):016:1> end
8917ccbb7f0bea0d54d0e98e12b416cf
9cd1865fd43482174b3088c6749075de
1d009056e9fcc0b2ddf4352eb824a97d
1abeb9bb4b0993ad732335818fdc8835
d41cf0ca16be930d0aa3925651a10ec4
732dc0d79e7b7d82e4b5ac21d8b00f5c
519fc21d6d0a76a467dd2f2d14741090
27fb689fd3d9b8f4b17b17535681214b
6454ff61e5ef116688ca172ba13aa80c
83ecb50f1e9ab42d1e320119e24a9a9c
=> "1".."10"
irb(main):017:0>

The same approach can be used in the HBase shell to insert rows with random keys:

hbase(main):001:0> require 'securerandom'; for i in '0'..'9' do for j in '0'..'9' do \
for k in '0'..'9' do put 'testtable', SecureRandom.hex , \
"colfam1:#{j}#{k}", "#{j}#{k}" end end end

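Called without an argument, SecureRandom.hex returns 32 hexadecimal characters (16 random bytes), so the row keys written by the loop above cannot be predicted for a later get. One way around that, sketched below, is to generate the keys first and keep them. The name random_row_keys is a hypothetical helper, not part of the HBase shell.

```ruby
require 'securerandom'

# Generate `count` random row keys up front so they can be kept and reused
# for later reads. random_row_keys is a hypothetical helper name.
def random_row_keys(count)
  Array.new(count) { SecureRandom.hex }   # 16 random bytes -> 32 hex chars
end

keys = random_row_keys(1000)
puts keys.length         # 1000
puts keys.first.length   # 32
puts keys.uniq.length    # duplicates are extremely unlikely
```

In the HBase shell, the saved keys could then be written with something like keys.each { |key| put 'testtable', key, 'colfam1:q1', 'value-1' } and read back later with get 'testtable', keys.first.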

Can't open /tmp/mapr-hadoop/mapred/local/taskTracker/root/jobcache//jobToken for output - File exists

If you hit this error on any task attempt:

Can't open /tmp/mapr-hadoop/mapred/local/taskTracker/root/jobcache/job_201402140511_0001/jobToken for output - File exists

For example:

14/02/14 06:07:09 INFO mapred.JobClient: Task Id : attempt_201402140511_0001_r_000001_0, Status : FAILED on node nmk-centos-60-3
Error initializing attempt_201402140511_0001_r_000001_0 java.io.IOException: Job initialization failed (255). with output: Reading task controller config from /opt/mapr/hadoop/hadoop-0.20.2/conf/taskcontroller.cfg
number of groups = 8
main : command provided 0
main : user is root
number of groups = 7
Can't open /tmp/mapr-hadoop/mapred/local/taskTracker/root/jobcache/job_201402140511_0001/jobToken for output - File exists
failed to copy credential file

at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:195)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1564)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1540)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1425)
at org.apache.hadoop.mapred.TaskTracker$6.run(TaskTracker.java:3802)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:322)
at org.apache.hadoop.util.Shell.run(Shell.java:249)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)
at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:188)

Chances are that you have one of these two issues:
1) Different UID/GID mappings for the same user across the nodes in the cluster
2) Incorrect ownership or permissions on the /tmp/mapr-hadoop folder on the TaskTracker node where the job failed.
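
Both conditions can be checked by hand on each node. The Ruby sketch below prints the numeric UID and GID of an account and the owner and mode of /tmp/mapr-hadoop. The helper names account_ids and path_owner are hypothetical, and the user root is taken from the log above; substitute the user who submitted the job.

```ruby
require 'etc'
require 'socket'

# Look up the numeric UID and GID of a local account.
# Returns [uid, gid], or nil when the account does not exist on this node.
# account_ids is a hypothetical helper name.
def account_ids(user)
  pw = Etc.getpwnam(user)
  [pw.uid, pw.gid]
rescue ArgumentError
  nil
end

# Describe the owner and permission bits of a path, or return nil when the
# path does not exist. path_owner is a hypothetical helper name.
def path_owner(path)
  st = File.stat(path)
  format('uid=%d gid=%d mode=%o', st.uid, st.gid, st.mode & 0o7777)
rescue Errno::ENOENT
  nil
end

user = 'root'   # the job in the log above ran as root
puts "#{Socket.gethostname} #{user} #{account_ids(user).inspect}"
puts "/tmp/mapr-hadoop #{path_owner('/tmp/mapr-hadoop').inspect}"
```

Run it on every node and compare the output. The UID and GID pair should be identical everywhere, and the owner of /tmp/mapr-hadoop on the failing node should not differ from what the other nodes report.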

The simple fix is:

1) Stop the TaskTracker on the affected node.
2) Remove /tmp/mapr-hadoop/ on that node.
3) Start the TaskTracker again.
4) Rerun the job.