HowTo:spark
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. This page is a how-to guide to using Apache Spark on the SW cluster.
Running a Spark job requires several steps (a complete example is sketched after this list):

- To load the Spark libraries and scripts into your path, run the command <code>use spark</code>.
- To set up the scratch disks appropriately, run <code>source /opt/gaussian/setup-spark.sh</code>.
- Run your application in Spark using <code>spark-submit</code>. Make sure to set proper values for executor and driver memory, or you will likely experience memory-related errors! <code>--master local[#]</code> sets up and uses a local standalone Spark cluster with # workers for your job (using <code>$NSLOTS</code> for "#" automatically sets the number of workers to the number of cores requested by your job).
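Putting these steps together, a minimal interactive run might look like the sketch below. The application name <code>my_app.py</code> and the memory and core values are placeholders for illustration, not values prescribed by the cluster documentation.

<pre>
# Load Spark and set up the local scratch disks (as described above).
use spark
source /opt/gaussian/setup-spark.sh

# Run a (hypothetical) PySpark application with 4 local workers.
# Adjust --executor-memory and --driver-memory to suit your application.
spark-submit --master local[4] --executor-memory 4G --driver-memory 2G my_app.py
</pre>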
Note that HDFS is not installed on the SW cluster. To use the cluster filesystem instead, use <code>file:///path/to/file</code> instead of <code>hdfs://host:8020/path/to/file</code> in your scripts.
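As an illustration (a sketch with hypothetical names), an input location on the cluster filesystem can be passed to your application as a <code>file://</code> URI:

<pre>
# Hypothetical sketch: reference data on the cluster filesystem with a
# file:// URI instead of an HDFS URI such as hdfs://host:8020/path/to/file.
spark-submit --master local[4] my_app.py file:///path/to/input.csv
</pre>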
Template job
The following template script (found at /opt/global/spark/spark-job.sh) is a good starting point for a typical Spark job. You must edit it to reflect the number of cores and the amount of memory you will use (the <code>-pe glinux.pe #</code> and <code>-l mf=#G</code> lines), or your job may be scheduled with sub-optimal resources.
This particular job will start a Spark application on a node with 55GB of memory, 4 CPUs, and local scratch disks for improved I/O performance. STDOUT and STDERR will be written to <code>nameOfJob.oJob#</code> and <code>nameOfJob.eJob#</code>, respectively. The "anaconda2" Python distribution will be used for <code>pyspark</code>.
<pre>
#!/bin/bash
#$ -S /bin/bash
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -cwd
#$ -V
#$ -q abaqus.q
#$ -l qname=abaqus.q

# This line should be equal to the total executor and driver memory used.
#$ -l mf=55G

# Number of worker cores you want to use
#$ -pe glinux.pe 4

# Setup your environment to use Spark
source /opt/gaussian/setup-spark.sh

# Edit this line to change how your job is run:
spark-submit --master local[$NSLOTS] --executor-memory 50G --driver-memory 5G example-spark-application.py
</pre>
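To use the template, copy it, adjust the resource lines, and submit it from the directory containing your application. The sketch below assumes the standard Grid Engine <code>qsub</code> and <code>qstat</code> commands implied by the <code>#$</code> directives above; the job script name shown is only an example.

<pre>
# Copy the template next to your Spark application and edit the
# "-pe glinux.pe #", "-l mf=#G", and spark-submit lines to match your job.
cp /opt/global/spark/spark-job.sh my-spark-job.sh

# Submit the job and check its status; because of "-cwd", STDOUT and STDERR
# appear in the submission directory as nameOfJob.oJob# and nameOfJob.eJob#.
qsub my-spark-job.sh
qstat
</pre>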
Using Python 3 instead of Python 2
You can use Python 3 instead of Python 2 by setting the environment variable <code>PYSPARK_PYTHON</code>.
<pre>
hpc####@swlogin1$ use spark
hpc####@swlogin1$ use anaconda3
hpc####@swlogin1$ export PYSPARK_PYTHON=python3
hpc####@swlogin1$ pyspark
Python 3.4.5 |Anaconda 2.3.0 (64-bit)| (default, Jul  2 2016, 17:47:47)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/01/13 15:56:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.5 (default, Jul  2 2016 17:47:47)
SparkSession available as 'spark'.
>>>
</pre>
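The same approach can be used in a batch job. The following is a hypothetical sketch that assumes the template job script above, which passes your submission environment to the job via <code>#$ -V</code>.

<pre>
# Hypothetical sketch: select Python 3 on the login node, then submit the
# template job. Because the script uses "#$ -V", the PYSPARK_PYTHON setting
# and the anaconda3 environment are inherited by the job.
use spark
use anaconda3
export PYSPARK_PYTHON=python3
qsub spark-job.sh
</pre>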
Troubleshooting
If you see out-of-memory errors, increase <code>--executor-memory</code> and <code>--driver-memory</code> accordingly.
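For example (a hypothetical adjustment of the template job above), the relevant lines of the job script might become:

<pre>
# Hypothetical adjustment: raise both memory settings, and keep the job's
# memory request ("-l mf=#G") at least as large as their sum.
#$ -l mf=70G
spark-submit --master local[$NSLOTS] --executor-memory 60G --driver-memory 10G example-spark-application.py
</pre>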