Thursday, December 29, 2016

Apache Spark in a Docker container

I was participating in an IndustryHack last month and, as a proof of concept, wanted to use machine learning and predictive analysis on sensor data.

Basically, the need of the hour was to have Apache Spark up and running with minimal configuration, run predictive analysis on limited data, and see how different events in the data correlate. The sensor data was provided by the company that was hosting the hackathon.

I thought it would be great if we could start Apache Spark in a Docker container and play with it during the hackathon, as we didn't need a full-fledged cluster for initial learning and linear-regression-based predictions on the data.

The Dockerfile that we used for starting a container with Apache Spark is as follows:

 # Ubuntu Dockerfile
 #
 # https://github.com/dockerfile/ubuntu
 #
 FROM ubuntu

 # File Author / Maintainer
 MAINTAINER Ganesh Vasudevan

 RUN apt update && \
     apt install -y wget vim less ssh default-jre && \
     sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
     cd /home && \
     wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz && \
     tar xvzf spark-2.0.2-bin-hadoop2.7.tgz

 EXPOSE 22

 # Define default command.
 ENTRYPOINT service ssh restart && bash

This pulls the latest Ubuntu base image, installs the packages, enables root login over SSH, and then downloads Spark 2.0.2 and unpacks it under /home/.

To get the container up and running:

1. Copy the file and execute docker build -t
2. Once the image is built successfully, execute docker run -i -t
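
The two steps above can be sketched as a pair of small shell helpers. The original commands do not show the image name or build directory, so the name `spark-sandbox` and the default build context `.` are assumptions made for illustration only; substitute your own.

```shell
#!/bin/sh
# Hypothetical image name; the original post does not specify one.
IMAGE_NAME="${IMAGE_NAME:-spark-sandbox}"

# Build the image from the Dockerfile in the given directory
# (defaults to the current directory).
build_image() {
    docker build -t "$IMAGE_NAME" "${1:-.}"
}

# Start an interactive container with a terminal attached.
run_interactive() {
    docker run -i -t "$IMAGE_NAME"
}
```

With the Dockerfile saved in the current directory, `build_image && run_interactive` builds the image and lands you in a bash prompt inside the container.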

This will start an interactive session. If you need SSH access to the Ubuntu instance, set a new password (use passwd) and then ssh in from your terminal. Once ssh is working, you can stop the container and start it again without the -i flag.

You can test Apache Spark by running some of the examples that ship with Spark.

cd /home/spark-2.0.2-bin-hadoop2.7
./bin/run-example SparkPi

This will execute the Pi calculation and provide the result. The examples are located in
examples/src/main/scala/org/apache/spark/examples/
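
As a small convenience, the run above can be wrapped so that only the result line is shown. This is a sketch that assumes the install path from the Dockerfile and that Spark's log output goes to stderr (its default console logging); the helper names are made up for illustration.

```shell
#!/bin/sh
# Install path as unpacked by the Dockerfile; override if yours differs.
SPARK_HOME="${SPARK_HOME:-/home/spark-2.0.2-bin-hadoop2.7}"

# Run SparkPi with the given number of partitions (default 10) and keep
# only the result line; stderr, where Spark logs by default, is dropped.
run_spark_pi() {
    "$SPARK_HOME/bin/run-example" SparkPi "${1:-10}" 2>/dev/null | grep "Pi is roughly"
}

# List the Scala example sources bundled with the distribution.
list_examples() {
    ls "$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/"
}
```

Calling `run_spark_pi 20` spreads the calculation over 20 partitions, and `list_examples` shows what else is available to try with `run-example`.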

With the above setup we were able to do initial studies and data analysis within a few hours.
