Apache Spark is an open-source cluster computing framework developed at the AMPLab at UC Berkeley. Spark's in-memory computation model provides serious performance gains on large datasets. And it's easy to take to the cloud!
I started taking the EdX course on Apache Spark last week. You can go over to https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1T2016/ to see what the course is about. While the course provider (UC Berkeley) offers free access to Databricks CE (Databricks is the startup from the team that built Spark), I wanted to set Spark up on my local machine to get some hours under my belt! If you just want to try out the commands and don't want to bother with all the technical issues, Databricks CE is great for learning!
Installing Apache Spark on Mac
Installing Spark on a Mac is super easy if you have Homebrew installed (Homebrew alone has saved me hours of time).
Step 1: Get Homebrew installed on your Mac. The Mac already comes with its own Python (headache!) and other tools. One of them is the Ruby binary, which can run Homebrew's installation script.
Run this on a terminal to get your brew up:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Step 2: Install your Apache Spark from the brew repo
brew install apache-spark
Note: The only snag you're likely to hit during the brew installation of Spark is a missing Oracle Java SDK. But the installer gives you a fairly detailed warning on how to set it up; make sure you follow it through! (Ref: https://java.com/en/download/ )
Installing Apache Spark on Ubuntu
Step 1: Get your repos set up and up to date!
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
Step 2: Verify your Java installation using
java -version
If this reports any version below 7 update 32, get the latest installation done:
sudo apt-get install oracle-java7-installer
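Since the cutoff here is "7 update 32", you may want to script the check rather than eyeball it. A small sketch of parsing a `java -version`-style string (the version string below is a made-up example, not output from a real machine):

```shell
# Parse the major and update numbers out of a `java -version`-style string.
# The string here is a hypothetical example for illustration.
ver='java version "1.7.0_80"'
major=$(echo "$ver" | sed -E 's/.*"1\.([0-9]+)\.[0-9]+_([0-9]+)".*/\1/')
update=$(echo "$ver" | sed -E 's/.*"1\.([0-9]+)\.[0-9]+_([0-9]+)".*/\2/')
if [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$update" -ge 32 ]; }; then
  echo "Java is recent enough"
else
  echo "Upgrade Java"
fi
```

In practice you'd feed it the real output with `ver=$(java -version 2>&1 | head -1)` (note that java prints its version to stderr).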
Step 3: Get the Scala installation. Run the following steps in a temp folder if you wish; we won't need the compressed file after the extraction is done.
wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
sudo mkdir /usr/local/src/scala
sudo tar xvf scala-2.11.8.tgz -C /usr/local/src/scala/
Step 4: Initialize the environment variables for Scala
Open up the .bashrc file in your home directory and add the following lines at the end (mind the Scala version):
export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
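If you rerun the setup later, you probably don't want duplicate export lines piling up in .bashrc. Here's a minimal sketch of an idempotent append; it uses a temp file so it's safe to try as-is, but in practice you'd point it at ~/.bashrc:

```shell
# Append the Scala exports to a shell rc file only if they are not already there.
# A temp file stands in for ~/.bashrc so this sketch is safe to run as-is.
rcfile=$(mktemp)
add_scala_env() {
  grep -q 'SCALA_HOME' "$1" || cat >> "$1" <<'EOF'
export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
EOF
}
add_scala_env "$rcfile"
add_scala_env "$rcfile"   # second call changes nothing
wc -l < "$rcfile"         # still just the two export lines
```

The `grep -q` guard is what makes repeated runs harmless.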
Fire up another bash terminal and check your Scala version:
scala -version
It should show 2.11.8, the latest as of this writing.
Step 5: Get Git set up, as we need it during the Spark build.
sudo apt-get install git
Step 6: Get the Spark source code so we can build it.
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
tar xvf spark-1.1.0.tgz
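The wget/tar pattern above is just download-then-extract. Here is the extract half mirrored locally with a tiny stand-in tarball (the file names are placeholders, not the real Spark archive), which is handy for checking your tar flags without re-downloading anything:

```shell
# Build a tiny stand-in tarball and extract it, mirroring the tar step above.
workdir=$(mktemp -d)
mkdir "$workdir/spark-1.1.0"
echo "placeholder" > "$workdir/spark-1.1.0/README.md"
tar -czf "$workdir/spark-1.1.0.tgz" -C "$workdir" spark-1.1.0
mkdir "$workdir/extracted"
tar xvf "$workdir/spark-1.1.0.tgz" -C "$workdir/extracted"
ls "$workdir/extracted/spark-1.1.0"
```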
Move into that directory after extraction:
cd spark-1.1.0
And let's build the Spark binaries. Spark 1.1.0 ships with an sbt wrapper, so the build is:
sbt/sbt assembly
Running Spark
You can run Spark (Mac / Ubuntu …) with the spark-shell command, which starts a Scala session,
or with the pyspark command, which uses the Python installation on the system to start a Python session.
On the Mac, the binaries get installed at the following path, so you might want to add that path to the $PATH variable, or use the full path for now to run PySpark!
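If you do extend $PATH, here's a small sketch of a duplicate-safe prepend. The /opt/spark/bin directory below is a hypothetical stand-in; substitute whatever path your install actually reported:

```shell
# Prepend a directory to PATH only if it is not already present.
# /opt/spark/bin is a hypothetical stand-in for the real install path.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) : ;;            # already on PATH, do nothing
    *) PATH="$1:$PATH" ;;
  esac
}
PATH="/usr/bin:/bin"
prepend_path "/opt/spark/bin"
prepend_path "/opt/spark/bin"   # second call is a no-op
echo "$PATH"
```

Drop the function into .bashrc and the path only ever gets added once, no matter how many shells you open.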
More on how to work with PySpark later! Enjoy your Spark!