Thursday, 15 November 2018

Spark Installation & Configuration

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It handles both batch and real-time analytics and data processing workloads, achieving high performance through a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.


Prerequisites:

Java 8
Hadoop 2.6
Scala (the prebuilt Spark package downloaded below bundles Scala, so no separate Scala installation is needed)
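Before moving on, you can quickly confirm the first two prerequisites (assuming java and hadoop are already on your PATH):
$ java -version
$ hadoop version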

Downloading Apache Spark

Download Spark 1.6.1 into the /usr/local directory using the commands below.
$ cd /usr/local
$ sudo wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
Extract the Spark tar file and rename the extracted directory to spark.
$ sudo tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo mv spark-1.6.1-bin-hadoop2.6 spark
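If the extraction worked, the new directory should contain the usual prebuilt Spark layout, including bin, conf, sbin, and lib:
$ ls /usr/local/spark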

Set Environment Variables

First we need to set the environment variables for Spark. Edit the ~/.bashrc file.
$ nano ~/.bashrc
Append the following lines at the end of the file and save it.
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
Reload the file so the changes take effect in the current shell.
$ source ~/.bashrc
Change the ownership and permissions of the /usr/local/spark directory (this guide runs Spark as the hdfs user).
$ sudo chown -R hdfs:hdfs /usr/local/spark
$ sudo chmod -R 755 /usr/local/spark
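At this point a quick sanity check should show the variable and the new ownership (the exact ls output will vary by system):
$ echo $SPARK_HOME
$ ls -ld /usr/local/spark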
For spark-sql, copy the hive-site.xml file to the /usr/local/spark/conf directory so Spark can reuse the existing Hive metastore configuration.
$ sudo cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/

Edit hive-site.xml and add the following property inside the file's <configuration> element.
$ sudo nano /usr/local/spark/conf/hive-site.xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
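This property points Spark SQL at the Hive metastore service listening on port 9083, so the metastore must be running before spark-sql is started. If it is not already running on your system, it can be launched in the background with:
$ hive --service metastore &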

Start the Spark Services

Start the Spark master and worker services using the following commands.
$ cd /usr/local/spark/sbin
$ ./start-all.sh
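If the services came up cleanly, jps should now list the standalone Master and Worker processes, and the Spark master web UI should be reachable at http://localhost:8080.
$ jps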
Get the spark-shell prompt using the following commands.
$ cd /usr/local/spark/bin
$ ./spark-shell
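As a quick smoke test, run a small parallel job from the prompt; the Spark 1.6 shell pre-creates a SparkContext bound to the name sc:
scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0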
Get the spark-sql prompt using the following commands.
$ cd /usr/local/spark/bin
$ ./spark-sql
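A simple query is enough to verify the metastore connection; if the hive-site.xml setup above is correct, this lists the databases Hive already knows about:
spark-sql> SHOW DATABASES;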