Apache Spark is an open source parallel processing framework for running
large-scale data analytics applications across clustered computers. It can
handle both batch and real-time analytics and data processing workloads. Apache
Spark achieves high performance for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query optimizer, and a physical execution
engine.
Prerequisites:
Java 8
Hadoop 2.6
Scala (the Spark package used here comes prebuilt for Hadoop 2.6 and bundles Scala)
Downloading Apache Spark
Download Spark 1.6.1 into the /usr/local directory using the commands below.
$ cd /usr/local
$ sudo wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
Extract the archive and rename the directory to spark.
$ sudo tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo mv spark-1.6.1-bin-hadoop2.6 spark
Set Environment Variables
Next, set the environment variables for Spark. Edit the ~/.bashrc file and add the lines below.
$ nano ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
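For the new variables to take effect in the current shell session, reload the file:
$ source ~/.bashrc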
Give the hdfs user ownership of the Spark directory and set its permissions.
$ sudo chown -R hdfs:hdfs /usr/local/spark
$ sudo chmod -R 755 /usr/local/spark
Configure Hive Integration
To let Spark SQL use the Hive metastore, copy hive-site.xml into Spark's configuration directory.
$ sudo cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/
Edit the copied file and make sure it points at the Hive metastore service.
$ sudo nano /usr/local/spark/conf/hive-site.xml
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
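The value above assumes the Hive metastore service is listening on its default port 9083 on the same machine. If it is not already running, it can be started from the Hive installation (adjust the path to your setup):
$ /usr/local/hive/bin/hive --service metastore &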
Start the Spark Services
Start the Spark master and worker daemons using the following commands.
$ cd /usr/local/spark/sbin
$ ./start-all.sh
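To confirm the daemons are up, jps should list a Master and a Worker process; the standalone master also serves a web UI, by default at http://localhost:8080.
$ jps
# expect, among other JVMs, a "Master" and a "Worker" entry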
Launch the Spark shell to verify the installation.
$ cd /usr/local/spark/bin
$ ./spark-shell
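As a quick sanity check, a trivial job can be run at the scala> prompt; sc is the SparkContext that spark-shell creates automatically:
scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0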
Launch the Spark SQL shell to run queries against the Hive metastore.
$ cd /usr/local/spark/bin
$ ./spark-sql
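If the metastore connection is configured correctly, the databases defined in Hive should be visible from the prompt:
spark-sql> SHOW DATABASES;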