Added chapter about Big Data processing with Hadoop and Docker

Fabiane Bizinella Nardon 2017-03-18 21:19:02 -03:00
Parent aabbff4656
Commit 037b97f87a
5 changed files: 235 additions and 1 deletion

.gitignore (vendored)

@@ -1 +1,2 @@
.vscode/
.DS_Store

developer-tools/java/chapters/ch11-bigdata.adoc (new file)

@@ -0,0 +1,232 @@
:imagesdir: images
= Example: Big Data Processing with Docker and Hadoop
*PURPOSE*: This chapter explains how to use Docker to create a Hadoop cluster and run a Big Data application in Java. It highlights several concepts such as service scaling, dynamic port allocation, container links, integration tests, and debugging.
== Download images and application
Clone the project at `https://github.com/fabianenardon/hadoop-docker-demo`
Inspect the `docker/docker-compose.yml` file. It defines a MongoDB service and the services needed to run a Hadoop cluster. It also defines a service for our application. See how the services are linked together.
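To give you an idea of its shape, here is a condensed sketch; the service names and the Hadoop image follow the commands used later in this chapter, but the real file defines more services and options, so treat it as illustrative only:
[source, yaml]
----
# Condensed, illustrative sketch; the repository's file is the authoritative version.
mongo:
  image: mongo
namenode:
  image: tail/hadoop:2.7.2
  command: hdfs namenode
  hostname: namenode
yarn:
  image: tail/hadoop:2.7.2
  command: yarn resourcemanager
  hostname: yarn
docker-hadoop-example:
  image: docker-hadoop-example
  links: # the application resolves the other services by these names
    - namenode
    - yarn
    - mongo
----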
== Build the application
[source, text]
----
cd sample
mvn clean install -Papp-docker-image
----
In the command above, `-Papp-docker-image` activates the `app-docker-image` profile defined in the application's `pom.xml`. This profile builds a dockerized version of the application, producing two images:
. `docker-hadoop-example`: the Docker image used to run the application
. `docker-hadoop-example-tests`: the Docker image used to run integration tests
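The profile looks roughly like the sketch below. It assumes the fabric8 `docker-maven-plugin`; the plugin choice, image configuration, and Dockerfile locations here are illustrative, so consult the project's actual `pom.xml` for the real definition:
[source, xml]
----
<!-- Illustrative sketch; the project's real profile may differ. -->
<profile>
    <id>app-docker-image</id>
    <build>
        <plugins>
            <plugin>
                <groupId>io.fabric8</groupId>
                <artifactId>docker-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>build</goal> <!-- build both images while packaging -->
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <images>
                        <image>
                            <name>docker-hadoop-example</name>
                            <build>
                                <dockerFileDir>${project.basedir}/src/main/docker</dockerFileDir>
                            </build>
                        </image>
                        <image>
                            <name>docker-hadoop-example-tests</name>
                            <build>
                                <dockerFileDir>${project.basedir}/src/test/docker</dockerFileDir>
                            </build>
                        </image>
                    </images>
                </configuration>
            </plugin>
        </plugins>
    </build>
</profile>
----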
== Start all the services
Go to the `sample/docker` folder and start the services:
[source, text]
----
docker-compose up -d
----
Open `http://localhost:8088/cluster` to see if your cluster is running. You should see 1 active node when everything is up.
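If you prefer the command line, the YARN ResourceManager exposes the same information through its REST API; the returned JSON includes an `activeNodes` field:
[source, text]
----
curl http://localhost:8088/ws/v1/cluster/metrics
----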
== Running the application
This application reads a text file from HDFS and counts how many words it contains. The result is saved to MongoDB.
First, create a folder on HDFS to hold the file to be processed:
[source, text]
----
docker-compose exec yarn hdfs dfs -mkdir /files/
----
Put the file we are going to process on HDFS:
[source, text]
----
docker-compose run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
----
The `text_for_word_count.txt` file was added to the application image by Maven when we built it, so we can use it for testing.
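To confirm the upload worked, you can list the directory (an optional check, following the same pattern as the commands above):
[source, text]
----
docker-compose exec yarn hdfs dfs -ls /files/
----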
Run the application:
[source, text]
----
docker-compose run docker-hadoop-example hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar hdfs://namenode:9000 /files mongo yarn:8050
----
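The four arguments are, respectively, the NameNode URI, the HDFS input directory, the MongoDB host, and the YARN ResourceManager address, all matching the service names defined in `docker-compose.yml`. Internally, this is a conventional MapReduce job; here is a minimal sketch of what the mapper and reducer might look like (class names are illustrative, and the step that writes the total to MongoDB is omitted):
[source, java]
----
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Emits ("words", n), where n is the number of tokens in each input line.
    public static class WordsMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text WORDS = new Text("words");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int n = new StringTokenizer(value.toString()).countTokens();
            context.write(WORDS, new IntWritable(n));
        }
    }

    // Sums the per-line counts into the total that gets stored in MongoDB.
    public static class TotalReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }
}
----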
If everything ran successfully, you should be able to see the results in MongoDB.
Connect to the Mongo container:
[source, text]
----
docker-compose exec mongo mongo
----
When connected, type:
[source, text]
----
use mongo_hadoop
db.word_count.find();
----
You should see the results of running the application, something like this:
[source, text]
----
> db.word_count.find();
{ "_id" : "Counts on Sat Mar 18 18:16:20 UTC 2017", "words" : 256 }
----
== Scaling the Hadoop cluster
If you want, you can scale your cluster by adding more Hadoop nodes:
[source, text]
----
docker-compose scale nodemanager=2
----
This means that you want to have 2 nodes in your Hadoop cluster. Go to `http://localhost:8088/cluster` and refresh until you see 2 active nodes.
The trick to scaling the nodes is to use dynamically allocated ports and let Docker assign a different host port to each new nodemanager. You can see this approach in this snippet of the `docker-compose.yml` file:
[source, yaml]
----
nodemanager:
  image: tail/hadoop:2.7.2
  command: yarn nodemanager
  ports:
    - "8042" # local port dynamically assigned; allows the node to be scaled up and down
  links:
    - namenode
    - datanode
    - yarn
  hostname: nodemanager
----
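To find out which host port Docker assigned to a given node's 8042, you can ask Compose; the `--index` flag selects one of the scaled containers:
[source, text]
----
docker-compose port --index=2 nodemanager 8042
----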
== Stopping the services
Stop all the services:
[source, text]
----
docker-compose down
----
Note that since our `docker-compose.yml` file defines volume mappings for HDFS and MongoDB, your data will still be there the next time you start the services.
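A volume mapping of this kind looks like the following sketch (the host path here is an assumption for illustration, not the repository's actual value):
[source, yaml]
----
mongo:
  image: mongo
  volumes:
    - ./data/mongo:/data/db # MongoDB data survives `docker-compose down`
----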
== Debugging your code
Debugging distributed Hadoop applications can be cumbersome. However, you can configure your environment to use the Docker Hadoop cluster and debug your code easily from an IDE.
First, make sure your services are up:
[source, text]
----
docker-compose up -d
----
Then, add this to your `/etc/hosts` file:
[source, text]
----
127.0.0.1 datanode
127.0.0.1 yarn
127.0.0.1 namenode
127.0.0.1 secondarynamenode
127.0.0.1 nodemanager
----
This configuration allows you to access the Docker Hadoop cluster from your IDE: the Hadoop services advertise themselves by their container hostnames, so your local machine must be able to resolve those names.
Then, open your project in NetBeans (or any other IDE) and run the application:
image::docker-bigdata-01.png[]
Note that you will be connecting to the Docker services at localhost.
You can also set a breakpoint and debug your application:
image::docker-bigdata-02.png[]
== Integration tests
When running integration tests, you want to test your application in an environment as close to production as possible, so you can exercise the interactions between the various components: services, databases, network communication, etc. Fortunately, Docker can help you a lot with integration tests.
There are several strategies for running integration tests, but in this application we are going to use the following:
. Start the services with a `docker-compose.yml` file created for testing purposes. This file doesn't map any volumes, so no state will be saved when the test is over. It also doesn't publish any ports on the host machine, so we can run simultaneous tests.
. Run the application, using the services started with the `docker-compose.yml` test file.
. Run Maven integration tests to check that the application execution produced the expected results. This is done by checking what was saved in the MongoDB database (see the test sketch after this list).
. Stop the services. No state will be stored, so next time you run the integration tests, you will have a clean environment.
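For step 3, a failsafe integration test might look like the sketch below (class and test names are illustrative; it assumes the MongoDB Java driver and the `mongo` hostname used throughout this chapter):
[source, java]
----
import static org.junit.Assert.assertNotNull;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.junit.Test;

// Failsafe picks up classes named *IT during the integration-test phase.
public class WordCountIT {

    @Test
    public void wordCountWasSavedToMongo() {
        MongoClient client = new MongoClient("mongo");
        try {
            MongoCollection<Document> counts =
                    client.getDatabase("mongo_hadoop").getCollection("word_count");
            // The job writes one document per run; assert that it exists
            // and carries the "words" total.
            Document result = counts.find().first();
            assertNotNull("expected a word_count document", result);
            assertNotNull(result.get("words"));
        } finally {
            client.close();
        }
    }
}
----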
Here is how to execute this strategy, step by step:
Start the services with the test configuration:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml up -d
----
Make sure all services are started, then create the folder we need on HDFS for the test:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml exec yarn hdfs dfs -mkdir /files/
----
Put the test file on HDFS:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
----
Run the application:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar hdfs://namenode:9000 /files mongo yarn:8050
----
Run our integration tests:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example-tests mvn -f /maven/code/pom.xml -Dmaven.repo.local=/m2/repository -Pintegration-test verify
----
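The `integration-test` profile wires the maven-failsafe-plugin into the build; here is a minimal sketch of such a profile (the project's real configuration may differ):
[source, xml]
----
<profile>
    <id>integration-test</id>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-failsafe-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <!-- run *IT classes, then fail the build on test failures -->
                            <goal>integration-test</goal>
                            <goal>verify</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</profile>
----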
Stop all the services:
[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml down
----
If you want to remote debug tests, run the tests this way instead:
[source, text]
----
docker run -v ~/.m2:/m2 -p 5005:5005 --link mongo:mongo --net resources_default docker-hadoop-example-tests mvn -f /maven/code/pom.xml -Dmaven.repo.local=/m2/repository -Pintegration-test verify -Dmaven.failsafe.debug
----
Running with this configuration, the application will wait until an IDE connects for remote debugging on port 5005.
See more about integration tests in the link:ch09-cicd.adoc[CI/CD using Docker] chapter.

developer-tools/java/chapters/images/docker-bigdata-01.png (new binary file, 136 KiB; not shown)

developer-tools/java/chapters/images/docker-bigdata-02.png (new binary file, 111 KiB; not shown)

developer-tools/java/readme.adoc

@@ -16,8 +16,9 @@ This tutorial offers Java developers an intro-level and self-paced hands-on work
** link:chapters/ch08-aws.adoc[Docker for AWS]
** link:chapters/ch08-azure.adoc[Docker for Azure] (coming)
** link:chapters/ch08-cloud.adoc[Docker Cloud]
* link:chapters/ch09-cicd.adoc[CI/CD using Docker] (coming)
* link:chapters/ch09-cicd.adoc[CI/CD using Docker]
* link:chapters/ch10-monitoring.adoc[Monitoring Java Container]
* link:chapters/ch11-bigdata.adoc[Example: Big Data Processing with Docker and Hadoop]
* link:chapters/appa-common-commands.adoc[Common Docker Commands]
* link:chapters/appb-troubleshooting.adoc[Troubleshooting]
* link:chapters/appc-references.adoc[References]