Friday, 3 June 2016

How to write Spark jobs in Java for Spark Job Server

In the previous post, we learnt how to set up Spark Job Server and run Spark jobs on it. So far, we have used Scala programs. Now we'll see how to write Spark jobs in Java to run on the job server.

In Scala, a job must implement the SparkJob trait, so a job skeleton looks like this:

object SampleJob  extends SparkJob {
    override def runJob(sc:SparkContext, jobConfig: Config): Any = ???
    override def validate(sc:SparkContext, config: Config): SparkJobValidation = ???
}

Here is what these methods do:
  • The runJob method contains the implementation of the job. The SparkContext is managed by the job server and is provided to the job through this method. This relieves the developer of the boilerplate configuration management that comes with creating a Spark job, and allows the job server to manage and reuse contexts.
  • The validate method allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning spark.jobserver.SparkJobValid lets the job execute; otherwise, returning spark.jobserver.SparkJobInvalid(reason) prevents the job from running and conveys the reason for the failure. In that case, the call immediately returns an HTTP/1.1 400 Bad Request status code.
    validate helps prevent running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.

In Java, the job needs to extend the JavaSparkJob class instead. It has the following methods, which can be overridden in the program (a minimal skeleton follows the list):
  • runJob(jsc: JavaSparkContext, jobConfig: Config)
  • validate(sc: SparkContext, config: Config)
  • invalidate(jsc: JavaSparkContext, config: Config)

The JavaSparkJob class is available in the job-server-api package. Build the job-server-api source code and add the resulting jar to your project, then add Spark and the other required dependencies to your pom.xml.
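
As a sketch, the dependency section of such a pom.xml could look like the following; the artifact variant and version numbers are assumptions, so match them to the Spark and Scala versions of your cluster:

<dependencies>
    <!-- Spark core; "provided" because the job server supplies Spark at runtime -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- Typesafe Config, the Config type passed to runJob and validate -->
    <dependency>
        <groupId>com.typesafe</groupId>
        <artifactId>config</artifactId>
        <version>1.2.1</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

The job-server-api jar built above can be installed into your local Maven repository (for example with mvn install:install-file) and referenced the same way.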

Let’s start with the basic WordCount example:

WordCount.java:

package spark.jobserver;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

import com.typesafe.config.Config;

public class WordCount extends JavaSparkJob implements Serializable {

    private static final long serialVersionUID = 1L;
    private static final Pattern SPACE = Pattern.compile(" ");

    @Override
    public Object runJob(JavaSparkContext jsc, Config config) {
        // Read the input file; the path comes from the job configuration.
        JavaRDD<String> lines = jsc.textFile(config.getString("input.filename"), 1);

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(SPACE.split(s));
            }
        });

        // Pair each word with a count of 1, then sum the counts per word.
        JavaPairRDD<String, Integer> counts = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                });

        // Collect the results to the driver; the returned value becomes the job result.
        List<Tuple2<String, Integer>> output = counts.collect();
        return output;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        // Reject the job up front if the input file name is missing or empty,
        // so the caller gets a 400 Bad Request instead of a failed job.
        if (config.hasPath("input.filename")
                && !config.getString("input.filename").isEmpty()) {
            return SparkJobValid$.MODULE$;
        }
        return new SparkJobInvalid("Input parameter is missing. Please mention the filename");
    }
}

The next step is to compile the code, build the jar, and upload it to the job server.
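
For instance, assuming the job server runs at localhost:8090 (the default port) and the job jar is named wordcount.jar (both illustrative, as is the app name "wordcount"), the upload and run steps could look like this:

# Upload the jar under the app name "wordcount"
curl --data-binary @wordcount.jar localhost:8090/jars/wordcount

# Run the job; input.filename is passed to the job through the POST body
curl -d "input.filename = /tmp/sample.txt" \
     "localhost:8090/jobs?appName=wordcount&classPath=spark.jobserver.WordCount"

If input.filename is omitted, validate rejects the job and the call returns 400 Bad Request, as described above.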

So your Spark job is ready to run on the Job Server!
