Wednesday, April 03, 2013

Running a Pig job inside a Java wrapper on MapR Hadoop


Here's how to run Pig code from a Java wrapper on MapR Hadoop.

[root@nmk-centos-60-1 ~]# cat idmapreduce.java
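import java.io.IOException;
import org.apache.pig.PigServer;

public class idmapreduce {
    public static void main(String[] args) {
        try {
            // "mapreduce" runs the query on the cluster rather than in local mode
            PigServer pigServer = new PigServer("mapreduce");
            runIdQuery(pigServer, "/test/Mapr_rpm_Files");
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile)
            throws IOException {
        // Load the input, splitting each line on '/'
        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage('/');");
        // Keep only the first field and name it id
        pigServer.registerQuery("B = foreach A generate $0 as id;");
        // Storing B is what triggers the MapReduce job
        pigServer.store("B", "/test/idout");
    }
}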
[root@nmk-centos-60-1 ~]#
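
The two statements passed to registerQuery are plain Pig Latin strings, so it can help to build and print them before handing them to PigServer. The sketch below does that with the JDK only. The class name IdQueryStatements and its build method are hypothetical helpers, not part of Pig, and rejecting single quotes is a simplifying assumption of this sketch rather than Pig's own quoting rule.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Pig): builds the Pig Latin statements
// that the wrapper registers, so they can be inspected or logged first.
class IdQueryStatements {

    // Returns the load statement followed by the projection statement.
    static List<String> build(String inputFile, char delimiter) {
        // Assumption of this sketch: refuse single quotes instead of escaping them.
        if (inputFile.indexOf('\'') >= 0 || delimiter == '\'') {
            throw new IllegalArgumentException("single quotes are not supported in this sketch");
        }
        String load = "A = load '" + inputFile + "' using PigStorage('" + delimiter + "');";
        String project = "B = foreach A generate $0 as id;";
        return Arrays.asList(load, project);
    }

    public static void main(String[] args) {
        // Print the statements for the input path used in this post.
        for (String statement : build("/test/Mapr_rpm_Files", '/')) {
            System.out.println(statement);
        }
    }
}
```

Each returned string could then be passed to pigServer.registerQuery in order, followed by the store call.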


Then compile it against the Pig jar:

[root@nmk-centos-60-1 ~]# javac -cp /opt/mapr/pig/pig-0.10/pig-0.10.0.jar idmapreduce.java

The compiled class file is now in /root (my current working directory):

[root@nmk-centos-60-1 ~]# ls idmapreduce.*
idmapreduce.class  idmapreduce.java

To run the Java wrapper, the classpath needs the location of the Pig jar,

[root@nmk-centos-60-1 pig-0.10]# ls
autocomplete*  KEYS*                          pig.jar*
bin/           lib/                           pigperf.jar*
build.xml*     lib-src/                       pig-withouthadoop.jar*
CHANGES.txt*   license/                       readme.md*
conf/          LICENSE.txt*                   README.txt*
conf.new/      mapr-build.properties*         RELEASE_NOTES.txt*
contrib/       NOTICE.txt*                    shims/
doap_Pig.rdf*  pig-0.10.0-core.jar*           src/
ivy/           pig-0.10.0.jar*                test/
ivy.xml*       pig-0.10.0-withouthadoop.jar*  tutorial/


and the directory that contains the compiled wrapper class,

[root@nmk-centos-60-1 pig-0.10]# cd -
/root

and everything that `hadoop classpath` expands to,

and also the location of the native I/O library, passed as a -D option.
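
The pieces listed above can also be assembled programmatically, for example by a small launcher. This is a minimal sketch, assuming Java 8 or later for String.join. PigWrapperCommand and its build method are hypothetical names, and the Hadoop classpath is passed in as a string rather than obtained by actually running `hadoop classpath`.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher helper: assembles the java command line from the
// Pig jar, the wrapper class directory, the Hadoop classpath and the native library path.
class PigWrapperCommand {

    static List<String> build(String pigJar, String classDir, String hadoopClasspath,
                              String nativeLibDir, String mainClass) {
        // Classpath order follows the post: Pig jar, wrapper directory, Hadoop classpath.
        String classpath = String.join(File.pathSeparator, pigJar, classDir, hadoopClasspath);
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(classpath);
        // The native I/O library location goes in as a -D system property.
        command.add("-Djava.library.path=" + nativeLibDir);
        command.add(mainClass);
        return command;
    }

    public static void main(String[] args) {
        // Assumption: the caller supplies the output of `hadoop classpath` as the first
        // argument; otherwise a shell-style placeholder is printed in its place.
        String hadoopClasspath = args.length > 0 ? args[0] : "$(hadoop classpath)";
        List<String> command = build(
                "/opt/mapr/pig/pig-0.10/pig.jar",
                ".",
                hadoopClasspath,
                "/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64",
                "idmapreduce");
        System.out.println(String.join(" ", command));
    }
}
```

The returned list is in the form that java.lang.ProcessBuilder accepts, so a launcher could start the wrapper directly instead of printing the command.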

So the final command would be:

[root@nmk-centos-60-1 ~]# java -cp /opt/mapr/pig/pig-0.10/pig.jar:.:`hadoop classpath` \
    -Djava.library.path=/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64 idmapreduce

13/04/03 09:33:48 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: maprfs:///
13/04/03 09:33:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/03 09:33:48 INFO security.JniBasedUnixGroupsMapping: Using JniBasedUnixGroupsMapping for Group resolution
13/04/03 09:33:48 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: maprfs:///
13/04/03 09:33:49 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
13/04/03 09:33:49 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
13/04/03 09:33:49 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
13/04/03 09:33:49 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
13/04/03 09:33:49 INFO pigstats.ScriptState: Pig script settings are added to the job
13/04/03 09:33:49 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
13/04/03 09:33:49 INFO mapReduceLayer.JobControlCompiler: creating jar file Job6414903787816249153.jar
13/04/03 09:33:56 INFO mapReduceLayer.JobControlCompiler: jar file Job6414903787816249153.jar created
13/04/03 09:33:56 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
13/04/03 09:33:56 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.6-1366786, built on 07/29/2012 06:22 GMT
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:host.name=nmk-centos-60-1
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_25
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_25/jre
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/mapr/pig/pig-0.10/pig.jar:.:/opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/java/default/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/amazon-s3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/asm-3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aws-java-sdk-1.3.26.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-configuration-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-lang-2.6.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-math-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-3.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/emr-metrics-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/gson-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/guava-13.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/httpclient-4.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/httpcore-4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-core-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-json-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jersey-server-1.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.20.2-2.1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-jni-0.20.2-2.1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/protobuf-java-2.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.6.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-api-2.1.jar
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:java.compiler=
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-71.el6.x86_64
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.name=root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.home=/root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Client environment:user.dir=/root
13/04/03 09:33:56 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=nmkc1:5181,nmkc2:5181,nmkc3:5181 sessionTimeout=30000 watcher=com.mapr.fs.JobTrackerWatcher@40bb2bc3
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Opening socket connection to server nmkc3/10.10.80.93:5181
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Socket connection established to nmkc3/10.10.80.93:5181, initiating session
13/04/03 09:33:56 INFO zookeeper.ClientCnxn: Session establishment complete on server nmkc3/10.10.80.93:5181, sessionid = 0x23db71ab8a301d9, negotiated timeout = 30000
13/04/03 09:33:56 INFO fs.JobTrackerWatcher: Current running JobTracker is: nmk-centos-60-1/10.10.80.91:9001
13/04/03 09:33:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/03 09:33:57 INFO mapReduceLayer.MapReduceLauncher: 0% complete
13/04/03 09:33:57 INFO input.FileInputFormat: Total input paths to process : 1
13/04/03 09:33:57 INFO util.MapRedUtil: Total input paths to process : 1
13/04/03 09:33:57 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/03 09:33:57 INFO util.MapRedUtil: Total input paths (combined) to process : 1
13/04/03 09:33:58 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_201304010834_0003
13/04/03 09:33:58 INFO mapReduceLayer.MapReduceLauncher: More information at: http://maprfs:50030/jobdetails.jsp?jobid=job_201304010834_0003
13/04/03 09:33:58 INFO fs.JobTrackerWatcher: Current running JobTracker is: nmk-centos-60-1/10.10.80.91:9001
13/04/03 09:34:17 INFO mapReduceLayer.MapReduceLauncher: 50% complete
13/04/03 09:34:18 INFO mapReduceLayer.MapReduceLauncher: 100% complete
13/04/03 09:34:18 INFO pigstats.SimplePigStats: Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.10.0  root    2013-04-03 09:33:49     2013-04-03 09:34:18     UNKNOWN

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
job_201304010834_0003   1       0       3       3       3       0       0       0       A,B     MAP_ONLY        /test/idout,

Input(s):
Successfully read 4826 records (5229 bytes) from: "/test/Mapr_rpm_Files"

Output(s):
Successfully stored 4826 records in: "/test/idout"

Counters:
Total records written : 4826
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201304010834_0003


13/04/03 09:34:18 INFO mapReduceLayer.MapReduceLauncher: Success!

[root@nmk-centos-60-1 ~]#
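
If the wrapper is started from another program that captures its console output, the stored-record count can be picked out of the summary shown above. The sketch below is one possible way to do that with a regular expression. PigStatsLine and storedRecords are hypothetical names, and the pattern is based only on the "Successfully stored ..." line in this run's output, so it may need adjusting for other Pig versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the record count from a line such as
//   Successfully stored 4826 records in: "/test/idout"
class PigStatsLine {

    private static final Pattern STORED =
            Pattern.compile("Successfully stored (\\d+) records in: \"([^\"]+)\"");

    // Returns the number of stored records, or -1 if the line is not a "stored" summary line.
    static long storedRecords(String line) {
        Matcher matcher = STORED.matcher(line);
        return matcher.find() ? Long.parseLong(matcher.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String line = "Successfully stored 4826 records in: \"/test/idout\"";
        System.out.println(storedRecords(line)); // prints 4826
    }
}
```

Comparing this value with the "Successfully read" count is a cheap sanity check for a map-only job like this one, where every input record should produce one output record.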

To see the output:

[root@nmk-centos-60-1 ~]# hadoop fs -ls /test*
Found 2 items
-rwxr-xr-x   3 root root     342071 2013-04-03 08:54 /test/Mapr_rpm_Files
drwxr-xr-x   - root root          2 2013-04-03 09:34 /test/idout
[root@nmk-centos-60-1 ~]#