HowTo:spark
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. This page is a how-to guide for using Apache Spark on the SW cluster.
Running a Spark job requires several steps:
- To load the Spark libraries and scripts into your path, run the command:
  use spark
- To set up the scratch disks appropriately, run:
  source /opt/global/spark/setup.sh
- Run your application with spark-submit. Make sure to set proper values for the executor and driver memory, or you will likely run into memory-related errors. The option --master local[#] sets up and uses a local standalone Spark cluster with # workers for your job.
Note that HDFS is not installed on the SW cluster. To use the cluster filesystem instead, use file:///path/to/file instead of hdfs://host:8020/path/to/file.
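For illustration, here is a minimal PySpark sketch that reads a text file from the cluster filesystem using the file:// scheme. The path and application name are placeholders, not files that actually exist on the cluster.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; spark-submit and the pyspark shell provide one.
spark = SparkSession.builder.appName("FileSchemeExample").getOrCreate()

# Hypothetical input on the cluster filesystem -- note file:// rather than hdfs://,
# since HDFS is not installed on the SW cluster.
lines = spark.read.text("file:///path/to/input.txt")
print("Line count:", lines.count())

spark.stop()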
Template job
The following template script (found at /opt/global/spark/spark-job.sh) is a good starting point for a typical Spark job. You must edit it to reflect the number of cores and the amount of memory you will actually use, or your job may be scheduled with sub-optimal resources. Note that shm.pe can be substituted for glinux.pe to get access to larger machines (in terms of memory and CPUs), but this comes at the cost of not having local scratch disks, which may dramatically slow down jobs that perform a lot of I/O (Spark also caches to disk).
#!/bin/bash
#$ -S /bin/bash
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -cwd
#$ -V
#$ -q abaqus.q
#$ -l qname=abaqus.q
# This line should be equal to the total executor and driver memory used.
#$ -l mf=55G
# Number of worker cores you want to use
#$ -pe glinux.pe 4

# Setup your environment to use Spark
use spark
source /opt/global/spark/setup.sh

# If using pyspark, you will likely want to use one of the Anaconda python distributions installed:
use anaconda3
use anaconda2

# Edit this line to change how your job is run:
spark-submit --master local[$NSLOTS] --executor-memory 50G --driver-memory 5G example-spark-application.py
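The example-spark-application.py referenced in the template stands in for your own code. As a hypothetical stand-in, the following self-contained PySpark script estimates pi and needs no input files, so it could be submitted as-is with the spark-submit line from the template:

from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
sc = spark.sparkContext

# Throw random darts at the unit square and count how many land inside the
# quarter circle; the work is spread over the local[$NSLOTS] worker threads.
n = 1000000

def inside(_):
    x, y = random(), random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(n)).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()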
Using Python 3 instead of Python 2
You can use Python 3 instead of Python 2 by setting the environment variable PYSPARK_PYTHON.
hpc####@swlogin1$ use spark
hpc####@swlogin1$ use anaconda3
hpc####@swlogin1$ export PYSPARK_PYTHON=python3
hpc####@swlogin1$ pyspark
Python 3.4.5 |Anaconda 2.3.0 (64-bit)| (default, Jul  2 2016, 17:47:47)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/01/13 15:56:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Python version 3.4.5 (default, Jul  2 2016 17:47:47)
SparkSession available as 'spark'.
>>>
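To confirm that the worker processes (and not just the interactive shell) picked up Python 3, you can run a quick sanity check inside the pyspark session shown above; this is optional and not part of the cluster setup:

import sys

# Python version used by the driver (the shell itself)
print(sys.version)

# Python version used by the worker processes, which are launched with the
# interpreter named in PYSPARK_PYTHON
print(sc.parallelize([0]).map(lambda _: sys.version).collect())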
Troubleshooting
If you see out-of-memory errors, increase --executor-memory and --driver-memory accordingly. When running through the template job script, also raise the -l mf request so that it still matches the total executor plus driver memory.
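If requesting more memory is not an option, another tactic (general Spark practice, not specific to this cluster) is to let cached data spill to local disk rather than holding it all in memory. A minimal sketch, assuming a hypothetical large DataFrame:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheToDisk").getOrCreate()

# Hypothetical large dataset; any DataFrame or RDD works the same way.
df = spark.range(100 * 1000 * 1000)

# Cached partitions that do not fit in memory are written to local disk
# instead of triggering out-of-memory failures.
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())

spark.stop()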