Apache Spark is an open-source parallel processing framework for running
large-scale data analytics applications across clustered computers. It can
handle both batch and real-time analytics and data processing workloads. Spark
achieves high performance for both batch and streaming data using a
state-of-the-art DAG scheduler, a query optimizer, and a physical execution
engine.
Prerequisites:
Java 8
Hadoop 2.6
Scala (Spark comes prebuilt with Hadoop and Scala)
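You can confirm the Java and Hadoop versions before proceeding, assuming both are already installed and on your PATH:
$ java -version
$ hadoop version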
Downloading Apache Spark
Download Spark 1.6.1 into the /usr/local directory using the commands below.
$ cd /usr/local
$ sudo wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
$ sudo tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo mv spark-1.6.1-bin-hadoop2.6 spark
Set Environment Variables
First we need to set the environment variables for Spark. Edit the ~/.bashrc file.
$ nano ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
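Reload the file so the new variables take effect in the current shell:
$ source ~/.bashrc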
Change the ownership and permissions of the Spark directory.
$ sudo chown -R hdfs:hdfs /usr/local/spark
$ sudo chmod -R 755 /usr/local/spark
Copy the Hive configuration file into the Spark configuration directory so Spark can connect to the Hive metastore.
$ sudo cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/
Edit the copied hive-site.xml and add the metastore URI property.
$ sudo nano /usr/local/spark/conf/hive-site.xml
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
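Spark will read table metadata through this Thrift endpoint, so the Hive metastore service should be running; assuming Hive is installed at /usr/local/hive as above, it can be started in the background with:
$ hive --service metastore &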
Start the Spark Services
Start the Spark master and worker services using the following commands.
$ cd /usr/local/spark/sbin
$ ./start-all.sh
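You can check that the processes came up with jps (available with the JDK); a standalone setup should show a Master and at least one Worker:
$ jps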
Launch the Spark shell to verify the installation.
$ cd /usr/local/spark/bin
$ ./spark-shell
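As a quick sanity check, you can run a small job inside the shell; sc is the SparkContext that spark-shell creates for you:
scala> val nums = sc.parallelize(1 to 100)  // distribute the numbers 1..100 as an RDD
scala> nums.sum()                           // should return 5050.0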
Launch the Spark SQL shell to run queries against the Hive metastore.
$ cd /usr/local/spark/bin
$ ./spark-sql
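For example, listing the Hive tables confirms the metastore connection, assuming your Hive warehouse already contains some tables:
spark-sql> show tables;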