Saturday, September 25, 2021

Working with Multiple Containers using Docker Compose

Introduction

Recently for some project I have to use Grafana and Postgres. I was having the option to install them and use. But this seemed to me a tedious process. I thought of using the existing docker containers with docker-compose tool. The containers were up and available for use in few minutes. The containers were able to communicate among themselves and with the host laptop network.

Types of Mount in Docker

There are multiple types of mount available in docker:
  • Bind Mounts
  • Volumes

Bind Mount

Bind mounts have limited functionality compared to volumes. When you use a bind mount, a file or directory on the host machine is mounted into a container. The file or directory is referenced by its full path on the host machine. The file or directory does not need to exist on the Docker host already. It is created on demand if it does not yet exist. Bind mounts are very performant, but they rely on the host machine’s filesystem having a specific directory structure available. If you are developing new Docker applications, consider using named volumes instead. You can’t use Docker CLI commands to directly manage bind mounts. It is also called as `Host-Mounted Volumes`. These are Below is the syntax for it. 

/host/path:/container/path

Below is the example of bind mount for docker-compose

version: '3.9'


services:

  grafana:

    image: grafana/grafana

    container_name: grafana

    ports:

      - 3000:3000

    links:

      - postgres

    volumes:

      - /Users/aagarwal/dev/grafana/grafana_data:/var/lib/grafana:rw

    

  postgres:

    image: postgres:9.6.6

    container_name: postgres

    environment:

      POSTGRES_USER: postgres     # define credentials

      POSTGRES_PASSWORD: postgres # define credentials

      POSTGRES_DB: postgres       # define database

    ports:

      - 5432:5432                 # Postgres port

    volumes:

      - /Users/aagarwal/dev/grafana/postgres_data:/var/lib/postgresql/data


networks:   grafana_network: driver: bridge

There are 2 volumes used in the above code. The host directories are made as the persistent store for the containers. Here we are using full absolute path for the host directories. These are bind mounts.

docker run -d -it --name devtest -v "$(pwd)"/target:/app nginx:latest

Use docker inspect devtest to verify that the bind mount was created correctly. Look for the Mounts section:

"Mounts": [
    {
        "Type": "bind",
        "Source": "/tmp/source/target",
        "Destination": "/app",
        "Mode": "",
        "RW": true,
        "Propagation": "rprivate"
    }
],

This shows that the mount is a bind mount, it shows the correct source and destination, it shows that the mount is read-write, and that the propagation is set to rprivate.

Stop the container:

$ docker container stop devtest

$ docker container rm devtest


Ideally we should use bind mounts for configs etc. For data persistence, we should use volumes.


Volumes

Created and managed by Docker. You can create a volume explicitly using the  docker volume create command, or Docker can create a volume during container or service creation.

When you create a volume, it is stored within a directory on the Docker host. When you mount the volume into a container, this directory is what is mounted into the container. This is similar to the way that bind mounts work, except that volumes are managed by Docker and are isolated from the core functionality of the host machine.

A given volume can be mounted into multiple containers simultaneously. When no running container is using a volume, the volume is still available to Docker and is not removed automatically. You can remove unused volumes using docker volume prune.

When you mount a volume, it may be named or anonymous. Anonymous volumes are not given an explicit name when they are first mounted into a container, so Docker gives them a random name that is guaranteed to be unique within a given Docker host. Besides the name, named and anonymous volumes behave in the same ways.

Volumes also support the use of volume drivers, which allow you to store your data on remote hosts or cloud providers, among other possibilities.

We have used volumes as we want the data to available for later container runs. Docker manages the volumes in /var/lib/docker/volumes/ using the host filesystem. The path /var/lib/docker/volumes is the logical path on the host filesystems.

Create and Manage Volumes

Unlike a bind mount, you can create and manage volumes outside the scope of any container.

Create a volume:

$ docker volume create my-vol

List volumes:

$ docker volume ls

local               my-vol

Inspect a volume:

$ docker volume inspect my-vol
[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/my-vol/_data",
        "Name": "my-vol",
        "Options": {},
        "Scope": "local"
    }
]

Remove a volume:

$ docker volume rm my-vol

If you start a container with a volume that does not yet exist, Docker creates the volume for you. The following example mounts the volume myvol2 into /app/  in the container.

 docker run -d \
  --name devtest \
  -v myvol2:/app \
  nginx:latest

Use docker inspect devtest to verify that the volume was created and mounted correctly. Look for the Mounts section:

"Mounts": [
    {
        "Type": "volume",
        "Name": "myvol2",
        "Source": "/var/lib/docker/volumes/myvol2/_data",
        "Destination": "/app",
        "Driver": "local",
        "Mode": "",
        "RW": true,
        "Propagation": ""
    }
],

This shows that the mount is a volume, it shows the correct source and destination, and that the mount is read-write.

Use a volume with docker-compose

A single docker compose service with a volume looks like this:

version: "3.9"
services:
  frontend:
    image: node:lts
    volumes:
      - myapp:/home/node/app
volumes:
  myapp:

On the first invocation of docker-compose up the volume will be created. The same volume will be reused on following invocations.

A volume may be created directly outside of compose with docker volume create and then referenced inside docker-compose.yml as follows:

version: "3.9"
services:
  frontend:
    image: node:lts
    volumes:
      - myapp:/home/node/app
volumes:
  myapp:
    external: true

Network

When we install docker on the host machine. It creates 3 types of network:

  • bridge
  • host
  • none

$ docker network ls
NETWORK ID     NAME                  DRIVER    SCOPE
ee768c83bcda   bridge                bridge    local
f855c01f90a8   host                  host      local
4d762bc12676   none                  null      local
$

The default   network is listed, along with host and none. The latter two are not fully-fledged networks, but are used to start a container connected directly to the Docker daemon host’s networking stack, or to start a container with no network devices. This tutorial will connect two containers to the bridge network.

We can use default-bridge network or user-defined-bridge network.

Use the default bridge network

We can start one or more docker container by using below command.

$ docker run -dit --name alpine1 alpine ash

$ docker run -dit --name alpine2 alpine ash

This will launch 2 containers and connect them to default-bridge network. The containers will be able ping each other using IP address but will note be able using the name.

Inspect the bridge network to see what containers are connected to it.
$ docker network inspect bridge

[
    {
        "Name": "bridge",
        "Id": "17e324f459648a9baaea32b248d3884da102dde19396c25b30ec800068ce6b10",
        "Created": "2017-06-22T20:27:43.826654485Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Containers": {
            "602dbf1edc81813304b6cf0a647e65333dc6fe6ee6ed572dc0f686a3307c6a2c": {
                "Name": "alpine2",
                "EndpointID": "03b6aafb7ca4d7e531e292901b43719c0e34cc7eef565b38a6bf84acf50f38cd",
                "MacAddress": "02:42:ac:11:00:03",
                "IPv4Address": "172.17.0.3/16",
                "IPv6Address": ""
            },
            "da33b7aa74b0bf3bda3ebd502d404320ca112a268aafe05b4851d1e3312ed168": {
                "Name": "alpine1",
                "EndpointID": "46c044a645d6afc42ddd7857d19e9dcfb89ad790afb5c239a35ac0af5e8a5bc5",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

Near the top, information about the bridge network is listed, including the IP address of the gateway between the Docker host and the bridge network (172.17.0.1). Under the Containers key, each connected container is listed, along with information about its IP address (172.17.0.2 for alpine1 and 172.17.0.3 for alpine2).


The containers are running in the background. Use the docker attach command to connect to alpine1.

$ docker attach alpine1

/ #

The prompt changes to # to indicate that you are the root user within the container. Use the ip addr show command to show the network interfaces for alpine1 as they look from within the container:

# ip addr show

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
27: eth0@if28: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link
       valid_lft forever preferred_lft forever

The first interface is the loopback device. Ignore it for now. Notice that the second interface has the IP address 172.17.0.2, which is the same address shown for alpine1 in the previous step.

From within alpine1, make sure you can connect to the internet by pinging google.com. The -c 2 flag limits the command to two ping attempts.

# ping -c 2 google.com

PING google.com (172.217.3.174): 56 data bytes
64 bytes from 172.217.3.174: seq=0 ttl=41 time=9.841 ms
64 bytes from 172.217.3.174: seq=1 ttl=41 time=9.897 ms

--- google.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 9.841/9.869/9.897 ms

Now try to ping the second container. First, ping it by its IP address, 172.17.0.3:

# ping -c 2 172.17.0.3

PING 172.17.0.3 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.086 ms
64 bytes from 172.17.0.3: seq=1 ttl=64 time=0.094 ms

--- 172.17.0.3 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.086/0.090/0.094 ms

This succeeds. Next, try pinging the alpine2 container by container name. This will fail.

# ping -c 2 alpine2

ping: bad address 'alpine2'
Detach from alpine1 without stopping it by using the detach sequence, CTRL + p CTRL + q (hold down CTRL and type p followed by q). If you wish, attach to alpine2 and repeat steps 4, 5, and 6 there, substituting alpine1 for alpine2.


Stop and remove both containers.

$ docker container stop alpine1 alpine2
$ docker container rm alpine1 alpine2

Remember, the default bridge network is not recommended for production. 

Use user-defined bridge networks

In this example, we again start two alpine containers, but attach them to a user-defined network called alpine-net which we have already created. These containers are not connected to the default bridge network at all. We then start a third alpine container which is connected to the bridge network but not connected to alpine-net, and a fourth alpine container which is connected to both networks.

  1. Create the alpine-net network. You do not need the --driver bridge flag since it’s the default, but this example shows how to specify it.

    $ docker network create --driver bridge alpine-net
    
  2. List Docker’s networks:

    $ docker network ls
    
    NETWORK ID          NAME                DRIVER              SCOPE
    e9261a8c9a19        alpine-net          bridge              local
    17e324f45964        bridge              bridge              local
    6ed54d316334        host                host                local
    7092879f2cc8        none                null                local
    

    Inspect the alpine-net network. This shows you its IP address and the fact that no containers are connected to it:

    $ docker network inspect alpine-net
    
    [
        {
            "Name": "alpine-net",
            "Id": "e9261a8c9a19eabf2bf1488bf5f208b99b1608f330cff585c273d39481c9b0ec",
            "Created": "2017-09-25T21:38:12.620046142Z",
            "Scope": "local",
            "Driver": "bridge",
            "EnableIPv6": false,
            "IPAM": {
                "Driver": "default",
                "Options": {},
                "Config": [
                    {
                        "Subnet": "172.18.0.0/16",
                        "Gateway": "172.18.0.1"
                    }
                ]
            },
            "Internal": false,
            "Attachable": false,
            "Containers": {},
            "Options": {},
            "Labels": {}
        }
    ]
    

    Notice that this network’s gateway is 172.18.0.1, as opposed to the default bridge network, whose gateway is 172.17.0.1. The exact IP address may be different on your system.

  3. Create your four containers. Notice the --network flags. You can only connect to one network during the docker run command, so you need to use docker network connect afterward to connect alpine4 to the bridge network as well.

    $ docker run -dit --name alpine1 --network alpine-net alpine ash
    
    $ docker run -dit --name alpine2 --network alpine-net alpine ash
    
    $ docker run -dit --name alpine3 alpine ash
    
    $ docker run -dit --name alpine4 --network alpine-net alpine ash
    
    $ docker network connect bridge alpine4
    

    Verify that all containers are running:

    $ docker container ls
    
    CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
    156849ccd902        alpine              "ash"               41 seconds ago       Up 41 seconds                           alpine4
    fa1340b8d83e        alpine              "ash"               51 seconds ago       Up 51 seconds                           alpine3
    a535d969081e        alpine              "ash"               About a minute ago   Up About a minute                       alpine2
    0a02c449a6e9        alpine              "ash"               About a minute ago   Up About a minute                       alpine1
    
  4. Inspect the bridge network and the alpine-net network again:

    $ docker network inspect bridge
    
    [
        {
            "Name": "bridge",
            "Id": "17e324f459648a9baaea32b248d3884da102dde19396c25b30ec800068ce6b10",
            "Created": "2017-06-22T20:27:43.826654485Z",
            "Scope": "local",
            "Driver": "bridge",
            "EnableIPv6": false,
            "IPAM": {
                "Driver": "default",
                "Options": null,
                "Config": [
                    {
                        "Subnet": "172.17.0.0/16",
                        "Gateway": "172.17.0.1"
                    }
                ]
            },
            "Internal": false,
            "Attachable": false,
            "Containers": {
                "156849ccd902b812b7d17f05d2d81532ccebe5bf788c9a79de63e12bb92fc621": {
                    "Name": "alpine4",
                    "EndpointID": "7277c5183f0da5148b33d05f329371fce7befc5282d2619cfb23690b2adf467d",
                    "MacAddress": "02:42:ac:11:00:03",
                    "IPv4Address": "172.17.0.3/16",
                    "IPv6Address": ""
                },
                "fa1340b8d83eef5497166951184ad3691eb48678a3664608ec448a687b047c53": {
                    "Name": "alpine3",
                    "EndpointID": "5ae767367dcbebc712c02d49556285e888819d4da6b69d88cd1b0d52a83af95f",
                    "MacAddress": "02:42:ac:11:00:02",
                    "IPv4Address": "172.17.0.2/16",
                    "IPv6Address": ""
                }
            },
            "Options": {
                "com.docker.network.bridge.default_bridge": "true",
                "com.docker.network.bridge.enable_icc": "true",
                "com.docker.network.bridge.enable_ip_masquerade": "true",
                "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
                "com.docker.network.bridge.name": "docker0",
                "com.docker.network.driver.mtu": "1500"
            },
            "Labels": {}
        }
    ]
    

    Containers alpine3 and alpine4 are connected to the bridge network.

    $ docker network inspect alpine-net
    
    [
        {
            "Name": "alpine-net",
            "Id": "e9261a8c9a19eabf2bf1488bf5f208b99b1608f330cff585c273d39481c9b0ec",
            "Created": "2017-09-25T21:38:12.620046142Z",
            "Scope": "local",
            "Driver": "bridge",
            "EnableIPv6": false,
            "IPAM": {
                "Driver": "default",
                "Options": {},
                "Config": [
                    {
                        "Subnet": "172.18.0.0/16",
                        "Gateway": "172.18.0.1"
                    }
                ]
            },
            "Internal": false,
            "Attachable": false,
            "Containers": {
                "0a02c449a6e9a15113c51ab2681d72749548fb9f78fae4493e3b2e4e74199c4a": {
                    "Name": "alpine1",
                    "EndpointID": "c83621678eff9628f4e2d52baf82c49f974c36c05cba152db4c131e8e7a64673",
                    "MacAddress": "02:42:ac:12:00:02",
                    "IPv4Address": "172.18.0.2/16",
                    "IPv6Address": ""
                },
                "156849ccd902b812b7d17f05d2d81532ccebe5bf788c9a79de63e12bb92fc621": {
                    "Name": "alpine4",
                    "EndpointID": "058bc6a5e9272b532ef9a6ea6d7f3db4c37527ae2625d1cd1421580fd0731954",
                    "MacAddress": "02:42:ac:12:00:04",
                    "IPv4Address": "172.18.0.4/16",
                    "IPv6Address": ""
                },
                "a535d969081e003a149be8917631215616d9401edcb4d35d53f00e75ea1db653": {
                    "Name": "alpine2",
                    "EndpointID": "198f3141ccf2e7dba67bce358d7b71a07c5488e3867d8b7ad55a4c695ebb8740",
                    "MacAddress": "02:42:ac:12:00:03",
                    "IPv4Address": "172.18.0.3/16",
                    "IPv6Address": ""
                }
            },
            "Options": {},
            "Labels": {}
        }
    ]
    

    Containers alpine1alpine2, and alpine4 are connected to the alpine-net network.

  5. On user-defined networks like alpine-net, containers can not only communicate by IP address, but can also resolve a container name to an IP address. This capability is called automatic service discovery. Let’s connect to alpine1 and test this out. alpine1 should be able to resolve alpine2 and alpine4 (and alpine1, itself) to IP addresses.

    $ docker container attach alpine1
    
    # ping -c 2 alpine2
    
    PING alpine2 (172.18.0.3): 56 data bytes
    64 bytes from 172.18.0.3: seq=0 ttl=64 time=0.085 ms
    64 bytes from 172.18.0.3: seq=1 ttl=64 time=0.090 ms
    
    --- alpine2 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.085/0.087/0.090 ms
    
    # ping -c 2 alpine4
    
    PING alpine4 (172.18.0.4): 56 data bytes
    64 bytes from 172.18.0.4: seq=0 ttl=64 time=0.076 ms
    64 bytes from 172.18.0.4: seq=1 ttl=64 time=0.091 ms
    
    --- alpine4 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.076/0.083/0.091 ms
    
    # ping -c 2 alpine1
    
    PING alpine1 (172.18.0.2): 56 data bytes
    64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.026 ms
    64 bytes from 172.18.0.2: seq=1 ttl=64 time=0.054 ms
    
    --- alpine1 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.026/0.040/0.054 ms
    
  6. From alpine1, you should not be able to connect to alpine3 at all, since it is not on the alpine-net network.

    # ping -c 2 alpine3
    
    ping: bad address 'alpine3'
    

    Not only that, but you can’t connect to alpine3 from alpine1 by its IP address either. Look back at the docker network inspect output for the bridge network and find alpine3’s IP address: 172.17.0.2 Try to ping it.

    # ping -c 2 172.17.0.2
    
    PING 172.17.0.2 (172.17.0.2): 56 data bytes
    
    --- 172.17.0.2 ping statistics ---
    2 packets transmitted, 0 packets received, 100% packet loss
    

    Detach from alpine1 using detach sequence, CTRL + p CTRL + q (hold down CTRL and type p followed by q).

  7. Remember that alpine4 is connected to both the default bridge network and alpine-net. It should be able to reach all of the other containers. However, you will need to address alpine3 by its IP address. Attach to it and run the tests.

    $ docker container attach alpine4
    
    # ping -c 2 alpine1
    
    PING alpine1 (172.18.0.2): 56 data bytes
    64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.074 ms
    64 bytes from 172.18.0.2: seq=1 ttl=64 time=0.082 ms
    
    --- alpine1 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.074/0.078/0.082 ms
    
    # ping -c 2 alpine2
    
    PING alpine2 (172.18.0.3): 56 data bytes
    64 bytes from 172.18.0.3: seq=0 ttl=64 time=0.075 ms
    64 bytes from 172.18.0.3: seq=1 ttl=64 time=0.080 ms
    
    --- alpine2 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.075/0.077/0.080 ms
    
    # ping -c 2 alpine3
    ping: bad address 'alpine3'
    
    # ping -c 2 172.17.0.2
    
    PING 172.17.0.2 (172.17.0.2): 56 data bytes
    64 bytes from 172.17.0.2: seq=0 ttl=64 time=0.089 ms
    64 bytes from 172.17.0.2: seq=1 ttl=64 time=0.075 ms
    
    --- 172.17.0.2 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.075/0.082/0.089 ms
    
    # ping -c 2 alpine4
    
    PING alpine4 (172.18.0.4): 56 data bytes
    64 bytes from 172.18.0.4: seq=0 ttl=64 time=0.033 ms
    64 bytes from 172.18.0.4: seq=1 ttl=64 time=0.064 ms
    
    --- alpine4 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.033/0.048/0.064 ms
    
  8. As a final test, make sure your containers can all connect to the internet by pinging google.com. You are already attached to alpine4 so start by trying from there. Next, detach from alpine4 and connect to alpine3 (which is only attached to the bridge network) and try again. Finally, connect to alpine1 (which is only connected to the alpine-net network) and try again.

    # ping -c 2 google.com
    
    PING google.com (172.217.3.174): 56 data bytes
    64 bytes from 172.217.3.174: seq=0 ttl=41 time=9.778 ms
    64 bytes from 172.217.3.174: seq=1 ttl=41 time=9.634 ms
    
    --- google.com ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 9.634/9.706/9.778 ms
    
    CTRL+p CTRL+q
    
    $ docker container attach alpine3
    
    # ping -c 2 google.com
    
    PING google.com (172.217.3.174): 56 data bytes
    64 bytes from 172.217.3.174: seq=0 ttl=41 time=9.706 ms
    64 bytes from 172.217.3.174: seq=1 ttl=41 time=9.851 ms
    
    --- google.com ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 9.706/9.778/9.851 ms
    
    CTRL+p CTRL+q
    
    $ docker container attach alpine1
    
    # ping -c 2 google.com
    
    PING google.com (172.217.3.174): 56 data bytes
    64 bytes from 172.217.3.174: seq=0 ttl=41 time=9.606 ms
    64 bytes from 172.217.3.174: seq=1 ttl=41 time=9.603 ms
    
    --- google.com ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 9.603/9.604/9.606 ms
    
    CTRL+p CTRL+q
    
  9. Stop and remove all containers and the alpine-net network.

    $ docker container stop alpine1 alpine2 alpine3 alpine4
    
    $ docker container rm alpine1 alpine2 alpine3 alpine4
    
    $ docker network rm alpine-net
    


Grafana and Postgres

The docker-compose.yml is below:

version: '3.9'

services:
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
    links:
      - postgres
    volumes:
      - grafana_data:/var/lib/grafana:rw
    networks:
      - gf_net
    
  postgres:
    image: postgres:9.6.6
    container_name: postgres
    environment:
      POSTGRES_USER: postgres     # define credentials
      POSTGRES_PASSWORD: postgres # define credentials
      POSTGRES_DB: postgres       # define database
    ports:
      - 5432:5432                 # Postgres port
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - gf_net

volumes:
  grafana_data:
  postgres_data:

networks:
  gf_net:
    driver: bridge    

The command to start the docker container for Grafana and Postgres is

$ docker-compose up -d

The network, Grafana and Postgres container will be started and available for us. In this case, grafana_data and postgres_data will be the name inside the docker-compose.yml file; for the 
real volume name, it will be prepended with project name prefix.


$ docker volume ls 
local     grafana_grafana_data
local     grafana_postgres_data

This created 2 grafana_grafana_data and grafana_postgres_data volumes. The volumes are not externally created but created by docker for us.

$ docker volume inspect grafana_grafana_data
[
    {
        "CreatedAt": "2021-09-26T02:32:16Z",
        "Driver": "local",
        "Labels": {
            "com.docker.compose.project": "grafana",
            "com.docker.compose.version": "1.29.2",
            "com.docker.compose.volume": "grafana_data"
        },
        "Mountpoint": "/var/lib/docker/volumes/grafana_grafana_data/_data",
        "Name": "grafana_grafana_data",
        "Options": null,
        "Scope": "local"
    }
]

The other volume.
$ docker volume inspect grafana_postgres_data
[
    {
        "CreatedAt": "2021-09-26T02:27:19Z",
        "Driver": "local",
        "Labels": {
            "com.docker.compose.project": "grafana",
            "com.docker.compose.version": "1.29.2",
            "com.docker.compose.volume": "postgres_data"
        },
        "Mountpoint": "/var/lib/docker/volumes/grafana_postgres_data/_data",
        "Name": "grafana_postgres_data",
        "Options": null,
        "Scope": "local"
    }
]

The details of the network.

$ docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
ee768c83bcda   bridge           bridge    local
7a112508d3bb   grafana_gf_net   bridge    local
f855c01f90a8   host             host      local
4d762bc12676   none             null      local

Below are the details of the network.

$ docker network inspect grafana_gf_net
[
    {
        "Name": "grafana_gf_net",
        "Id": "7a112508d3bb08a69e95be809ce55335d42afb55edf305c3c458f5e53eadd796",
        "Created": "2021-09-26T02:27:13.3604873Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.22.0.0/16",
                    "Gateway": "172.22.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2b55c5da9e850a1d47c169fa1e4b68cba516234aceac32a56d8b0dc2a993fc08": {
                "Name": "grafana",
                "EndpointID": "826cbfbaa52eeac91bd372ea490d8b81f594ad07b4d61bb8f0956cda0e4f2949",
                "MacAddress": "02:42:ac:16:00:03",
                "IPv4Address": "172.22.0.3/16",
                "IPv6Address": ""
            },
            "944f260241be8ead1777315971936db41ef1fe4fdd41d33c3396eea84b03a28f": {
                "Name": "postgres",
                "EndpointID": "d9fe759283bbf415c91285cf2b845cd6bb88216090dc120efa80d706be6b2bd6",
                "MacAddress": "02:42:ac:16:00:02",
                "IPv4Address": "172.22.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "gf_net",
            "com.docker.compose.project": "grafana",
            "com.docker.compose.version": "1.29.2"
        }
    }
]

The below are the docker containers.

$ docker ps
CONTAINER ID   IMAGE             COMMAND                  CREATED              STATUS              PORTS                                       NAMES
2b55c5da9e85   grafana/grafana   "/run.sh"                About a minute ago   Up About a minute   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp   grafana
944f260241be   postgres:9.6.6    "docker-entrypoint.s…"   About a minute ago   Up About a minute   0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   postgres

The grafana dashboard can be access on http://localhost:3000/ in your browser.





The datasource can be added using the postgres name.



The connection is successful. 

Happy Coding.





Tuesday, August 7, 2018

Pass config to Spark Hadoop

In one of my project, we needed to migrate the Hadoop Java code to Spark. The spark code was submitted via boto3 on EMR. The configs like start_date, end_date was required by InputFormat. We followed the below steps for reading the files with CustomInputFormat.

public class MyProcessor {

  private static final Logger logger = LoggerFactory.getLogger(MyProcessor.class);

  public static String isNull(String param) {
    return (param != null && !param.trim().isEmpty()) ?  param.trim() : null;
  }

  public static void main(String[] args) {

    logger.info("MyProcessor");

    SparkSession spark = SparkSession
        .builder()
        .appName("MyProcessor")
        .getOrCreate();

    SparkContext sc = spark.sparkContext();

    logger.info("input.format.start_date : "+sc.hadoopConfiguration().get("input.format.start_date"));
    logger.info("input.format.end_date : "+sc.hadoopConfiguration().get("input.format.end_date"));

    JavaSparkContext jsc = new JavaSparkContext(sc);

    JavaPairRDD<LongWritable, CustomWritable> rdd = jsc.newAPIHadoopRDD(sc.hadoopConfiguration(), CustomInputFormat.class,
        LongWritable.class, CustomWritable.class);

    JavaRDD rowsRdd = rdd.map(x -> RowFactory.create(isNull(x._2.getA()), isNull(x._2.getB()))).repartition(partitions);

    StructType schema = new StructType(new StructField[]{
        new StructField("a_str", DataTypes.StringType, true, Metadata.empty()),
        new StructField("b_str", DataTypes.StringType, true, Metadata.empty()),
    });

    Dataset<Row> df = spark.createDataFrame(rowsRdd, schema);
    df.cache();

    List<DQResult> results = new ArrayList();

    long recCount = df.count();
    logger.info("recCount : " + recCount);
    spark.close();
  }
}

Create a Uber jar using shade plugin or assembly plugin.

Pass the config by prepending the config spark.hadoop. The below command

spark-submit --class MyProcessor --conf spark.hadoop.input.format.start_date=1532131200000 --conf spark.hadoop.input.format.end_date=1532131200000 --master yarn --deploy-mode cluster /home/hadoop/jars/myjar.jar

Happy Coding !

Include config files in shade plugin

We were working on some project where we have to include `config` folder as the `src/main/resources` in maven project.

Add below lines to pom.xml

<build>
    <resources>
        <resource>
            <directory>${project.basedir}/conf</directory>
        </resource>
    </resources>
    <testResources>
        <testResource>
            <directory>${project.basedir}/conf</directory>
        </testResource>
        <testResource>
            <directory>${project.basedir}/src/test/resources</directory>
        </testResource>
    </testResources>
</build>

Thursday, July 26, 2018

Run Jupyter Notebook on Spark

We were looking solution for providing pyspark notebook for analyst. The below steps provide a virtual environment and local spark.

mkdir project-folder
cd project-folder
mkvirtualenv notebook
pip install jupyter

Check if browser opens the notebook using below command:

jupyter notebook

Quit the terminal by Cntrl + c, y.

For enabling the spark in notebook, Add below to .bashrc or .bash_profile

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS=notebook

I have already downloaded the spark tar and untar it in the /Users/ashokagarwal/devtools/.

Now open the terminal and run below command:

/Users/ashokagarwal/devtools/spark/bin/pyspark

This will open a browser. Choose new -> python 2.



spark.sparkContext.parallelize(range(10)).count()

df = spark.sql('''select 'spark' as hello ''')
df.show()

Spark Scala Uber Jar using Maven

We were working on some project. The code was already in java and build tool was maven. I was looking around for creating Uber Jar which can work with spark-submit easily.

spark-submit --class demo.HW --master yarn --deploy-mode cluster /home/hadoop/jars/spark-grep-jar-with-dependencies.jar

OR

spark-submit --class demo.HW --master yarn --deploy-mode cluster /home/hadoop/jars/spark-grep.jar

The above jars are created using the below pom.xml

<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.org</groupId>
    <artifactId>spark-grep</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <jdk.version>1.8</jdk.version>
        <spark.version>2.2.0</spark.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>${project.artifactId}</finalName>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <profiles>
        <profile>
            <id>shade</id>
            <activation>
                <activeByDefault>false</activeByDefault>
            </activation>
            <build>
                <plugins>
                    <plugin>
                        <groupId>org.scala-tools</groupId>
                        <artifactId>maven-scala-plugin</artifactId>
                        <version>2.15.2</version>
                        <executions>
                            <execution>
                                <goals>
                                    <goal>compile</goal>
                                </goals>
                            </execution>
                        </executions>
                    </plugin>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-shade-plugin</artifactId>
                        <version>2.3</version>
                        <executions>
                            <execution>
                                <phase>package</phase>
                                <goals>
                                    <goal>shade</goal>
                                </goals>
                            </execution>
                        </executions>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </plugin>
                </plugins>
            </build>
        </profile>
        <profile>
            <id>assembly</id>
            <activation>
                <activeByDefault>false</activeByDefault>
            </activation>
            <build>
                <plugins>
                    <plugin>
                        <groupId>org.scala-tools</groupId>
                        <artifactId>maven-scala-plugin</artifactId>
                        <version>2.15.2</version>
                        <executions>
                            <execution>
                                <goals>
                                    <goal>compile</goal>
                                </goals>
                            </execution>
                        </executions>
                    </plugin>
                    <plugin>
                        <artifactId>maven-assembly-plugin</artifactId>
                        <configuration>
                            <descriptorRefs>
                                <descriptorRef>jar-with-dependencies</descriptorRef>
                            </descriptorRefs>
                        </configuration>
                        <executions>
                            <execution>
                                <id>make-assembly</id>
                                <phase>package</phase>
                                <goals>
                                    <goal>single</goal>
                                </goals>
                            </execution>
                        </executions>
                    </plugin>
                </plugins>
            </build>
        </profile>
    </profiles>
</project>

There are two profiles using maven-assembly-plugin and maven-shade-plugin.

mvn clean package -Passembly

This will generate a jar file with suffix: jar-with-dependencies.jar. Like - spark-grep-jar-with-dependencies.jar

mvn clean package -Pshade

This will create jar file like spark-grep.jar.

Copy the jar file to the cluster and run using spark-submit.

Happy Coding
  

Tuesday, June 26, 2018

Scheduling Pipeline using Celery

Overview


Recently I came across situation where the Hadoop MR job was to be launched on the EMR cluster. There many options of launching the job on EMR:

  • AWS Web Console: The job can be launched from EMR AWS console by choosing hadoop version, instance types, log file path on s3 etc.
  • AWS command line: AWS provides command line tool for interacting with EMR, S3 etc. ```aws emr create-cluster --applications Name=Hadoop Name=Spark --tags 'owner=Ashok' 'env=Testing' ...```
  • Using Boto and Celery: Celery task calls the boto api boto3.client('emr', region_name=region_name) response = client.run_job_flow( Name=name, LogUri=log_uri, ReleaseLabel=release_label, Instances={
  • ...........

Celery Task

The celery task will run after scheduled time - daily, hourly. The task will execute the command/call to boto api.

mkvirtualenv celery_env

pip install celery

brew install redis

pip install celery[redis]


Create a directory with below file structure

             celery_task/
                |___ myceleryconfig.py
                |___ mycelery.py

The contents of the myceleryconfig.py 

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'every-minute': {
        'task': 'mycelery.add',
        'schedule': crontab(minute='*/1'),
        'args': (1,2),
    },
}
BROKER_URL='redis://localhost'

Create

The mycelery.py will be as below:

from celery import Celery

app = Celery('tasks')
app.config_from_object('celeryconfig')

@app.taskdef add(x, y):
    return x + y

Execution

The program can be executed as:

celery -A mycelery worker --loglevel=info --beat 

The output will be like below:



The task will execute after every 1 minute.

Happy Coding