Setting up Apache Spark for Python on OS X
The good news first: I managed to install Spark for Python in less than 30 minutes with basic googling.[1][2] If you do not want to Google it yourself, you may follow the steps below:
- Verify your Java version
java -version
My version was the following, and it worked fine with Spark 1.4.0 and OS X 10.10.4 (Yosemite):
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
- Install Homebrew if you do not have it
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
- Install Scala
brew install scala
- Set environment variables for Scala. You should put these settings in `.bashrc` for permanent usage
export JAVA_HOME=$(/usr/libexec/java_home)
export SCALA_HOME=/usr/local/bin/scala
export PATH=$PATH:$SCALA_HOME/bin
- Download Spark from https://spark.apache.org/downloads.html
cd YOUR_FOLDER_WHERE_YOU_WANT_TO_INSTALL_SPARK # I install it into ~/Library
tar -xvzf spark*.tgz
cd spark-*
- Build Apache Spark. This may take some time.
sbt/sbt clean assembly
- PySpark (Spark for Python) depends on the `py4j` package, so install it before launching Spark scripts that use `SparkContext` from the standard Python interpreter
pip install py4j # sudo may be required
- Set up your environment. You should put these settings in `.bashrc` for permanent usage
export SPARK_HOME=$HOME/Library/spark-1.4.0
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PATH=$PATH:$SPARK_HOME/bin
export IPYTHON=1
- Note that `IPYTHON=1` is optional; it is used only when launching the interactive shell with `pyspark`.
- Do not forget to `source ~/.bashrc` to activate the settings
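If something does not work later, it helps to check the settings from the steps above in one go. Below is a minimal sketch of such a check, assuming Python 3. The helper names `parse_java_version` and `check_environment` are my own inventions for illustration and are not part of Spark or py4j.

```python
import importlib.util
import os
import re
import subprocess


def parse_java_version(banner):
    """Pull the quoted version string out of the `java -version` banner."""
    match = re.search(r'version "([^"]+)"', banner)
    return match.group(1) if match else None


def check_environment(env):
    """Return a list of problems found in an environment mapping (hypothetical helper)."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    # The steps above add $SPARK_HOME/python to PYTHONPATH and $SPARK_HOME/bin to PATH.
    python_dir = os.path.join(spark_home, "python")
    if python_dir not in env.get("PYTHONPATH", "").split(os.pathsep):
        problems.append("PYTHONPATH does not contain " + python_dir)
    bin_dir = os.path.join(spark_home, "bin")
    if bin_dir not in env.get("PATH", "").split(os.pathsep):
        problems.append("PATH does not contain " + bin_dir)
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print("WARNING: " + problem)
    if importlib.util.find_spec("py4j") is None:
        print("WARNING: py4j is not importable")
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(["java", "-version"],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                universal_newlines=True)
        print("Java version:", parse_java_version(result.stderr))
    except OSError:
        print("WARNING: java executable not found")
```

Save it under any name you like and run it with `python` after `source ~/.bashrc`. If it prints no warnings, the variables match what the steps above set.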
Test time
So let's download the official word count example and run it using the standard Python interpreter. Note that we are redirecting stderr, where Spark writes tons of logging, to /dev/null so one can see the results.
wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/python/wordcount.py
echo "TESTING TIME" > test.txt
python wordcount.py test.txt 2> /dev/null
The expected output is:
TESTING: 1
TIME: 1
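To see why this is the expected output, here is a plain-Python analog of the counting logic. It does not use Spark at all. It only assumes, as I understand the example, that each line is split on single spaces and that every word contributes a count of one. The function name `word_count` is mine, not something from the Spark example.

```python
from collections import Counter


def word_count(lines):
    """Count words across lines, splitting each line on single spaces."""
    counts = Counter()
    for line in lines:
        for word in line.split(" "):
            counts[word] += 1
    return counts


if __name__ == "__main__":
    # Same content as test.txt, without the trailing newline.
    for word, count in sorted(word_count(["TESTING TIME"]).items()):
        print("%s: %i" % (word, count))
```

Spark does the same work distributed over partitions, so the order of the printed lines is not guaranteed there. The sketch sorts the words only to make its own output stable.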
Alternatives
You may want to play with `pyspark` interactively or use it as an interpreter. See the docs.
References
- http://genomegeek.blogspot.cz/2014/11/how-to-install-apache-spark-on-mac-os-x.html
- http://stackoverflow.com/questions/25205264/how-do-i-install-pyspark-for-use-in-standalone-scripts