Nomad

Nomad is a simple and flexible scheduler and workload orchestrator to deploy and manage containers and non-containerized applications across on-prem and clouds at scale.

Technology overview

Below is a diagram of how Nomad fits with other technologies.

graph LR
    nomad["Nomad"]
    docker["Docker"]
    consul["Consul"]
    linux["Linux"]

    docker -- Task driver for --> nomad
    consul -- Provides service discovery and service mesh for --> nomad
    linux -- Runs --> nomad

Basic Overview

Key Terms:

  • agent: Nomad process running in server or client mode.
  • client: Also known as a node. Runs the tasks assigned to it, registers itself with the servers, and watches for work to be assigned.
  • server: Manages all jobs and clients, monitors tasks and controls which tasks get placed on which client nodes. The servers replicate data between each other to ensure high availability.
  • dev_agent: An agent configuration that provides useful defaults for running a single node cluster of Nomad.

Key Operations:

  • task: the smallest unit of work, executed by task drivers.
  • group: a series of tasks that run on the same client.
  • job: core unit of control, defines the application and its configurations. Can contain one or more task groups, each with one or more tasks.
  • jobspec: describes the job, tasks and resources required to run the job.
  • allocation: mapping between a task group in a job and a client. When a job is run, Nomad will choose a client capable of running it.

An application is defined in a jobspec as groups of tasks. Once the jobspec is submitted to Nomad, a job is created, along with allocations for each group defined in that jobspec.
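
To make the job, group, and task hierarchy concrete, here is a minimal jobspec sketch. The job, group, and task names, the image, and the resource figures are all hypothetical placeholders, and the datacenter name assumes the dev agent's default of dc1.

```hcl
# Minimal jobspec sketch: one job, one group, one task.
# All names and values below are illustrative assumptions.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  # A group is placed as a unit: all of its tasks run on the same client.
  group "web" {
    count = 1

    # A task is the smallest unit of work, executed by a task driver.
    task "server" {
      driver = "docker"

      config {
        image = "nginx:latest"
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }
    }
  }
}
```

Submitting a file like this should create one job with a single allocation for the web group.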

Overview

graph TD
    developer["Developer"]
    job["Job"]
    task-group["Task Group"]
    task["Task"]
    driver["Driver"]
    client["Client"]
    allocation["Allocation"]
    evaluation["Evaluation"]
    deployment["Deployment"]
    server["Server"]
    region["Region"]
    datacenter["Datacenter"]

    developer -- writes --> job
    job -- consists of one or more --> task-group
    job -- submitted to --> server
    job --> deployment
    evaluation -- changes --> allocation
    task-group -- a set of --> task
    driver -- executes --> task
    allocation -- schedules task group on --> client
    allocation -- schedules --> task-group

    server -- creates --> allocation
    server -- runs --> evaluation
    server -- manages --> client

    region -- contains one or more --> datacenter
    datacenter -- group of --> client

CLI Commands

Run a Job

nomad job run job.nomad

Open Web UI

nomad ui

Directories

  • Config File (Fedora): /etc/nomad.d/nomad.hcl
  • Data Directory (Fedora): /opt/nomad/data
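
As a rough illustration of how the two paths relate, the config file might point at the data directory like this. This is a sketch of a single-node setup, not the packaged default; the datacenter name, bind address, and the choice to run server and client in one agent are assumptions.

```hcl
# /etc/nomad.d/nomad.hcl (sketch; values are assumptions)
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

# Run a single server in this agent.
server {
  enabled          = true
  bootstrap_expect = 1
}

# Also run a client in the same agent, so jobs can be placed on this node.
client {
  enabled = true
}
```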

Tips

Docker images failing to pull due to timeouts
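
One option that may help, assuming the Docker task driver is in use, is raising the driver's image_pull_timeout in the task config (it defaults to 5 minutes). The task name, image, and 15 minute value below are illustrative only.

```hcl
task "prometheus" {
  driver = "docker"

  config {
    image = "prom/prometheus:latest"

    # Allow slow registries or large images more time before the pull is cancelled.
    image_pull_timeout = "15m"
  }
}
```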

Permission denied after mounting host volume into Docker container

Set the task's user to root so the container process can access the mounted path.

task "prometheus" {
  driver = "docker"
  user = "root"

  config {
    image = "prom/prometheus:latest"
  }
}
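
For context, a fuller sketch of the same task with the host volume actually mounted might look like the following. The group name, volume name, and destination path are hypothetical, and the sketch assumes a host_volume named prometheus-data has already been declared in the client's agent configuration.

```hcl
group "monitoring" {
  # Request a host volume that the client config is assumed to expose.
  volume "prometheus-data" {
    type      = "host"
    source    = "prometheus-data"
    read_only = false
  }

  task "prometheus" {
    driver = "docker"
    user   = "root" # avoids permission errors on the mounted directory

    # Mount the group-level volume into the container.
    volume_mount {
      volume      = "prometheus-data"
      destination = "/prometheus"
    }

    config {
      image = "prom/prometheus:latest"
    }
  }
}
```

Running as root is the quick fix; matching the host directory's ownership to the container's default user is an alternative that avoids root.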

Debugging failed allocations

Override the entrypoint with a command that never exits, so the container stays up instead of crashing and can be exec’d into.

task "grafana" {
  driver = "docker"
  config {
    image = "grafana/grafana-oss:latest"
    ports = ["grafana-ui"]

    # Keep the container alive with an idle loop instead of starting Grafana.
    entrypoint = ["/bin/sh", "-c", "while true; do sleep 500; done"]
  }
}

Debugging Environment Variables

Exec into the container and run printenv or env.

Extending the Garbage Collection threshold for a Job

By default, old jobs are garbage collected after 4 hours. Once that time passes, all data related to the job is removed, including logs.

It can be useful to increase this, for example to see logs of failed Jobs in the UI.
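
A sketch of how this could be configured, assuming the relevant setting is job_gc_threshold in the server block of the agent config (its default is 4h). The 24h value is only an example.

```hcl
# In the server's agent config, e.g. /etc/nomad.d/nomad.hcl
server {
  enabled = true

  # Keep terminal jobs for 24 hours before they become eligible for garbage collection.
  job_gc_threshold = "24h"
}
```

The agent most likely needs a restart for the new threshold to take effect.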


Last update: April 30, 2023
Created: April 30, 2023