Getting Apache Spark Installed on OSX and Ubuntu

Apache Spark is an open-source cluster computing framework developed at the AMPLab at UC Berkeley. Spark keeps data in memory, which provides serious performance gains on large datasets. And it makes it easy to take your work to the cloud!

I started taking the edX course on Apache Spark last week.  You can go over to https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1T2016/ to see what the course is about.  While the course provider (UC Berkeley) offers free access to Databricks CE (this is the startup from the folks who built Spark), I wanted to set it up on my local machine to get some hours under it!  If you really just want to try out the commands and don't want to bother with all the technical issues, Databricks CE is great for learning!

Installing Apache Spark on Mac

Installing Spark on a Mac is super easy if you have Homebrew installed (this alone has saved hours of my time).

Step 1: Get Homebrew installed on your Mac.  A Mac already comes with its own Python (headache!) and other tools.  One of them is the Ruby binary, which can run Homebrew's installation script.

Run this in a terminal to get your brew up:

 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
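
Once that finishes, it doesn't hurt to confirm that brew itself is healthy before moving on:

brew --version

brew doctor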

Step 2: Install Apache Spark from the brew repo

brew install apache-spark
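
If you want to see exactly which version brew pulled in (and where it put things), you can ask it afterwards:

brew info apache-spark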


Note: The only snag you're likely to hit during the brew installation of Spark is a missing Oracle Java SDK.  But the installer gives you a fairly detailed warning on how to set it up, so make sure you follow that through! (Ref: https://java.com/en/download/)
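
If you'd rather sort the Java bit out from the terminal as well, one route is Homebrew Cask; the tap and cask names below are assumptions about how Cask packages Java, so fall back to the link above if they don't resolve:

brew tap caskroom/cask

brew cask install java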


Installing Apache Spark on Ubuntu

Step 1: Get your repos added and brought up to date!

sudo apt-add-repository ppa:webupd8team/java

sudo apt-get update 

Step 2: Verify your Java installation using

java -version

If this reports any version below Java 7 update 32, get the latest installation done:

sudo apt-get install oracle-java7-installer
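
The same PPA also carries a small companion package that makes this JDK the default and sets JAVA_HOME for you; the package name below is an assumption based on how the PPA names things, so check that apt can find it before relying on it:

sudo apt-get install oracle-java7-set-default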

Step 3: Get Scala installed. Run the following steps in a temp folder if you wish to do so; we won't need the compressed file after the extraction is done.

wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz 

sudo mkdir /usr/local/src/scala

sudo tar xvf scala-2.11.8.tgz -C /usr/local/src/scala/

Step 4: Initialize the environment variables for Scala

Open up the .bashrc file in your home directory

vi ~/.bashrc

Add the following lines at the end of the .bashrc file. Mind the Scala version:

export SCALA_HOME=/usr/local/src/scala/scala-2.11.8 

export PATH=$SCALA_HOME/bin:$PATH

Fire up another bash terminal and check your Scala version:

scala -version

It should show 2.11.8, which is the latest as of now.
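
If you'd rather not open a new terminal, reloading the file in the current shell does the same job:

source ~/.bashrc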

Step 5: Get Git set up, as we need it during the build of Spark

sudo apt-get install git

Step 6: Get the Spark source code, which we can then build

 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz

tar xvf spark-1.1.0.tgz

Move into that directory after extraction:

cd spark-1.1.0

And, let’s build the Spark binaries:

sbt/sbt assembly
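
The assembly takes a while. Once it's done, a quick smoke test from the same directory, using the example runner that ships with the Spark source, tells you the build is usable:

./bin/run-example SparkPi 10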


Running Spark

You can run Spark (Mac / Ubuntu) by running the  spark-shell  command, which starts a Scala session.

Or, by running the  pyspark  command, which uses the Python installation on the system to give you a Python session.

On the Mac, the binaries get installed at the following path, so you might want to add this path to your $PATH variable, or use the full path for now to run PySpark!

/usr/local/opt/apache-spark/libexec
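
For example, something along these lines in your ~/.bash_profile would do it (assuming the launcher scripts sit in the bin folder under that path, as they do in a typical brew layout):

export SPARK_HOME=/usr/local/opt/apache-spark/libexec

export PATH=$SPARK_HOME/bin:$PATH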

More on how to deal with PySpark later! Enjoy your Spark!
