HowTo:spark


Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. This page is a how-to guide for using Apache Spark on the SW cluster.

Running a Spark job requires several steps:

  • To load the Spark libraries and scripts into your path, run the command: use spark
  • To set up the scratch disks appropriately, run: source /opt/global/spark/setup.sh
  • Run your application with spark-submit. Make sure to set proper values for the executor and driver memory, or you will likely experience memory-related errors. The option --master local[#] sets up and uses a local standalone Spark cluster with # workers for your job.

Note that HDFS is not installed on the SW cluster. To read and write on the cluster filesystem instead, use file:///path/to/file paths rather than hdfs://host:8020/path/to/file.
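
For example, a PySpark application can read from the cluster filesystem directly with a file:/// URI. The following is a minimal, purely illustrative sketch (the application name is a placeholder and /path/to/file stands in for a real file):

# Minimal illustrative sketch: reading from the cluster filesystem with a file:/// URI.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-uri-example").getOrCreate()

# file:/// points Spark at the shared filesystem rather than at HDFS.
lines = spark.read.text("file:///path/to/file")
print(lines.count())

spark.stop()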

Template job

The following template script (found at /opt/global/spark/spark-job.sh) is a good starting point for a typical Spark job. You must edit it to reflect the number of cores and the amount of memory you will actually use, or your job may be scheduled with sub-optimal resources. Note that shm.pe can be substituted for glinux.pe to get access to larger machines (in terms of memory and CPUs), but at the cost of not having local scratch disks, which can dramatically slow down jobs that do a lot of I/O (Spark also caches to disk).

#!/bin/bash
#$ -S /bin/bash
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -cwd
#$ -V
# Submit to the abaqus.q queue
#$ -q abaqus.q
#$ -l qname=abaqus.q

# Request memory equal to the total executor plus driver memory used below (50G + 5G).
#$ -l mf=55G

# Number of worker cores you want to use
#$ -pe glinux.pe 4

# Setup your environment to use Spark
use spark
source /opt/global/spark/setup.sh

# If using pyspark, you will likely want one of the installed Anaconda Python
# distributions; load anaconda3 for Python 3 or anaconda2 for Python 2:
use anaconda3
# use anaconda2

# Edit this line to change how your job is run:
spark-submit --master local[$NSLOTS] --executor-memory 50G --driver-memory 5G example-spark-application.py
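
The spark-submit line above expects a Spark application. The following is a minimal, purely illustrative PySpark script (not the actual contents of example-spark-application.py, which is just a placeholder name) that can be used to test the template:

from pyspark.sql import SparkSession

# Illustrative stand-in for the application passed to spark-submit above.
spark = SparkSession.builder.appName("example-spark-application").getOrCreate()

# A trivial parallel computation: sum the integers 1 to 1000.
rdd = spark.sparkContext.parallelize(range(1, 1001))
print("sum =", rdd.sum())

spark.stop()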


Using Python 3 instead of Python 2

You can use Python 3 instead of Python 2 by setting the environment variable PYSPARK_PYTHON to python3 before starting pyspark.

hpc####@swlogin1$ use spark
hpc####@swlogin1$ use anaconda3
hpc####@swlogin1$ export PYSPARK_PYTHON=python3
hpc####@swlogin1$ pyspark
Python 3.4.5 |Anaconda 2.3.0 (64-bit)| (default, Jul  2 2016, 17:47:47) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/01/13 15:56:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Python version 3.4.5 (default, Jul  2 2016 17:47:47)
SparkSession available as 'spark'.
>>> 
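
From here the shell behaves like any other pyspark session. As a quick, purely illustrative sanity check that the driver is running Python 3 and that Spark is working:

>>> import sys
>>> sys.version_info.major
3
>>> spark.range(100).count()
100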

Troubleshooting

If you see out-of-memory errors, increase --executor-memory and --driver-memory accordingly, and raise the -l mf request in your job script so that it still equals the total executor plus driver memory.