igor: A Node Reservation System

Introduction

igor is a tool for cluster management.

Users make reservations with igor, requesting either a number of nodes or a specific set of nodes. They also specify a kernel and initial ramdisk which their nodes should boot. At reservation creation time, igor will determine which nodes will be used and at what time; the user is therefore guaranteed to know when his or her reservation will execute. igor allows only the reservation's owner to delete the reservation.

igor will display the status of the cluster by "drawing" the layout of the nodes and racks and highlighting each reservation in a different color; see the "Using igor" section for more information.

igor assumes you have a cluster with uniformly-named compute nodes (e.g. "kn1", "kn2", and so on) and a head node. We assume that the cluster is already configured for netbooting, with the head node handing out DHCP addresses and serving pxelinux over tftp. Alternatively, igor can interface with Cobbler, allowing Cobbler to manage the configuration of DHCP and netbooting.

Using igor

Generally, to use igor you will check what nodes are reserved, submit a reservation, and possibly delete the reservation if you finish early.

igor includes built-in help; you can run "igor help" to get started.

Checking status

To see what reservations already exist:

$ igor show

This will show a diagram of the nodes in your cluster; nodes that are already reserved will be highlighted in a specific color. A key at the bottom matches colors to reservations. An example is shown below.

This cluster consists of 5 nodes in a single rack, as illustrated in the diagram. All 5 are currently running; if any were powered off, they would be highlighted in red. At the time of the screenshot, there were three total reservations in the schedule: 'res1', 'res2', and 'bigres'. Looking at the node diagram and the reservation table, we can see that 'res1' and 'res2' are currently running, occupying nodes kn[1-2] and kn[3-4], respectively. They will run from 16:56 through 17:56. When those reservations complete, 'bigres', which reserves all 5 nodes, will begin.

Making a reservation

To make a reservation, use the "igor sub" command. "Sub" comes from "submit", as used in HPC batch schedulers. You'll need to specify the reservation name, a path to a kernel, a path to an initrd, and the number of nodes you wish to reserve. For example, if I need only 2 nodes, I can make a reservation called "testing" like this:

$ igor sub -r testing -k /path/to/kernel -i /path/to/initrd -n 2

This reservation will default to 1 hour duration; if I wanted to reserve it for longer, I'd use the -t flag. igor will go through the schedule of existing reservations and find the first period in which 2 nodes are available for a full hour. It will then create a reservation in that slot, printing out when the reservation begins and ends, and which nodes it will use.

Reservations with a Cobbler profile

If igor is configured to use Cobbler, reservations can also be made by specifying an existing Cobbler profile rather than a new kernel+ramdisk pair. For instance, if there is a profile in Cobbler called 'Ubuntu16', you could make a reservation like this:

$ igor sub -r testing -profile Ubuntu16 -n 2 -t 120

This will reserve two nodes for two hours, using the Ubuntu16 profile to boot.

Reservations at a specific start time

You may not always want a reservation to start as soon as possible; for example, it may be useful to reserve 4 nodes for a demonstration on September 25 at 1 p.m., to ensure that the nodes are available when you need them. This is possible using the -a ("at") flag and specifying a time in the format "2017-Sept-25-13:00":

$ igor sub -r demo -k /path/to/kernel -i /path/to/initrd -n 4 -a "2017-Sept-25-13:00"

igor will reserve the first available chunk of 4 nodes on or after 13:00 on September 25. Note that if there are not 4 nodes available at 13:00, the reservation will be made as soon as possible after 13:00. Use the -s flag (see the section "Checking for available run slots" below) to check if a time slot is available before making the reservation.

Reserving specific nodes

Rather than using the -n flag to specify a number of nodes, it is possible to request specific nodes using the -w flag. This is in place because some experiments require special configurations of the nodes or on the switch; it is not recommended in general use, as it tends to fragment the cluster and allows experiment designers to lazily hard-code node names into their scripts.

$ igor sub -r testing -k /path/to/kernel -i /path/to/initrd -w kn[1-3]

Checking for available run slots

If there are many reservations pending, it can be difficult to figure out when your reservation could run. The -s flag performs "speculation": igor will look through the schedule and find the next 10 available reservations for the number of nodes requested.

The image above shows that there are many reservations of varying sizes already in the system, running at various times between 09:45 and 15:45. There are spaces available in that schedule for a reservation of just 1 node, but it is difficult to see where they are from the list.

By running igor sub with the -s flag, we can see that there is one slot available at 12:45, two slots at 13:45, two slots at 14:45, and then five slots available at 15:45 (when all existing reservations have completed). Running the same igor sub command without the -s flag would cause igor to create a reservation in that first slot at 12:45.

Using the -a flag described in a previous section, you can tell igor to make a reservation in one of the other time slots shown:

$ igor sub -r testing -k /scratch/test.kernel -i /scratch/test.initrd -n 1 -a 2017-Sept-20-14:45

You can also use the -a flag in combination with -s to check for available slots further in the future:

Setting kernel command line arguments

When booting from a kernel+ramdisk pair (rather than an existing Cobbler profile), it is possible to specify command line arguments for the Linux kernel using the -c flag:

$ igor sub -r testing -k /scratch/test.kernel -i /scratch/test.initrd -n 4 -c "console=tty0"

Deleting a reservation

Although reservations will be automatically deleted when they run out of time, it is polite to delete your reservation immediately when you're finished with it. To get rid of our "testing" reservation from above, simply run:

$ igor del testing

Power control

When properly configured, igor can turn nodes on and off. You can only affect nodes in a reservation that is 1) currently active and 2) owned by you.

To turn off all the nodes in a reservation named 'foo':

$ igor power -r foo off

To turn on nodes kn2, kn3, kn4, and kn7:

$ igor power -n kn[2-4,7] on

Building/Installing

igor is included as part of the minimega distribution. All the compilation and installation tasks should be done on your head node. Before installing and configuring igor, you must set up netbooting, DHCP, and hostnames for the cluster nodes. This can be done with or without Cobbler; igor can use either. The two sections below describe both options.

Pre-requisites & dependencies: without Cobbler

You will need a DHCP server, a TFTP server, and PXELINUX. On Debian, we use the dnsmasq (which can serve both DHCP and TFTP) and pxelinux packages. You can use other servers for DHCP and TFTP, but this document provides information on setting up dnsmasq.

Populate /etc/hosts

You will want a mapping of desired IP to hostname for your nodes. For example:

172.16.0.254    head
172.16.0.1    kn1
172.16.0.2    kn2
172.16.0.3    kn3

Configure dnsmasq

The dnsmasq configuration files have a dizzying array of options, but there are only a few you'll need to enable for igor. First, open /etc/dnsmasq.conf and add something like this to the end of the file:

dhcp-range=172.16.0.0,static
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/tftpboot
log-dhcp

This specifies that we will be handing out addresses in the 172.16.0.0/16 subnet, and that devices trying to netboot should be handed the file pxelinux.0 from the /tftpboot directory.

Next, create a file called /etc/dnsmasq.d/cluster.conf and fill it with entries for the nodes of your cluster, giving each node's MAC address, IP, and hostname.

dhcp-host=90:e2:ba:3a:e3:62,172.16.0.1,kn1
dhcp-host=90:e2:ba:3a:e5:aa,172.16.0.2,kn2
dhcp-host=90:e2:ba:38:ad:d8,172.16.0.3,kn3
dhcp-host=90:e2:ba:3a:e3:d6,172.16.0.4,kn4
dhcp-host=90:e2:ba:38:ad:8c,172.16.0.5,kn5

NOTE: If your cluster is large, it may be a very slow process to gather each individual node's MAC address. One way to speed it up is to use dnsmasq's logging feature. Before creating cluster.conf, start dnsmasq with the configuration shown above. Turn on the first node and watch /var/log/daemon.log on the head node--you should eventually see a log entry showing a DHCP request from a certain MAC. You can then turn off the first node and move on to the second. Using powerbot, we wrote a shell script that would automatically turn on every node in order to gather the MAC addresses for our very 520-node cluster.

Prepare /tftpboot

Create the directory /tftpboot and populate it with the PXELINUX files:

$ mkdir /tftpboot
$ cp /usr/lib/PXELINUX/pxelinux.0 /tftpboot
$ mkdir /tftpboot/pxelinux.cfg

pxelinux.0 may be in a different location depending on your Linux distribution.

Pre-requisites and dependencies: with Cobbler

If you already have Cobbler set up, there's very little that needs to be done before using igor. You'll want to create a Cobbler profile to use as igor's 'default' profile; when a reservation ends, igor will set the reservation's nodes back to the default profile. This profile will later be entered into the igor.conf configuration file. This could be a useful way to keep your nodes occupied when not in use, if you configure your default profile to automatically perform some form of computation when booted.

You also need to configure Cobbler such that the user 'igor' can run the cobbler command-line tool.

When you configure igor you will still need to set the tftproot variable, as igor needs a directory to use for its state files. To avoid conflicts with Cobbler, we recommend creating /var/lib/igor and using that.

Get igor

Follow the instructions in the installation article to get minimega downloaded or built from source. The igor binary will be in the bin/ subdirectory.

A note about calling igor

The igor binary we build expects to run as the 'igor' user so it can access the reservation and scheduling configurations. Previously, we accomplished this by setting the SUID bit on the igor binary. However, igor is now integrated with Cobbler and uses a command-line Python program to execute Cobbler commands. Linux disables the SUID bit when an interpreted program is called. For that reason, we now install a sudo rule which allows any user to run /usr/local/bin/igor-bin as user 'igor' without entering a password (see the file sudo-rule in the igor source directory) and install a shell script (igor.sh) into /usr/local/bin/igor to perform the sudo automatically. To a user, it should appear the same as ever.

Installation

Having set up DHCP and TFTP (or Cobbler) as above, we can configure igor.

We have provided a shell script to ease the installation of igor. It takes one argument, the path where igor will store its state files. If you are using Cobbler, this may be /var/lib/igor; if you set up your own TFTP server manually, it may be /tftpboot or wherever else you installed PXELINUX. It will show you the commands it intends to run before running them, in case you wish to change anything. (Remember to run this on the head node, not any other node)

$ cd minimega/src/igor
$ sudo ./setup.sh /tftpboot/        # or /var/lib/igor, for Cobbler

To configure igor, edit /etc/igor.conf, a JSON config file created by setup.sh. The defaults in igor.conf are cautious; you should only need to modify the "general" options described in the next subsection. Here's what one of ours looks like:

{
    "tftproot" : "/tftpboot/",
    "prefix" : "kn",
    "start" : 1,
    "end" : 5,
    "padlen" : 0,
    "rackwidth" : 1,
    "rackheight" : 5,
    "logfile": "/var/log/igor/log",
    "timelimit":    4320,
    "nodelimit":    3,
    "usecobbler" : false,
    "cobblerdefaultprofile": "",
    "poweroncommand" : "powerbot on %s",
    "poweroffcommand" : "powerbot off %s",
    "autoreboot": false,
    "network": "",
    "vlan_min": 1000,
    "vlan_max": 2000,
    "networkuser": "admin",
    "networkpassword": "admin",
    "network_url": "arista:80/command-api",
    "node_map":
    {
        "kn1": "Et1/1",
        "kn2": "Et1/2",
        "kn3": "Et1/3",
        "kn4": "Et1/4",
        "kn5": "Et1/5"
    }
}

N.B.: It is extremely important that the last entry is not followed by a comma; this is a quirk of json.

General configuration options

The "tftproot" setting should be whatever directory contains the "pxelinux.cfg" directory, or in the case of a Cobbler configuration, whatever directory you have decided igor should use (again, we recommend /var/lib/igor). The next options describe your cluster naming scheme. Our cluster nodes are named kn1 through kn520, so our "prefix" is "kn", "start" is 1, and "end" is 520. Note that the numbers are not in quotes. The "padlen" option is for use on clusters that 'pad' numbers; if your first node is named kn001 instead of kn1, "padlen" should be set to 3.

"rackheight" and "rackwidth" define the physical dimensions of your cluster hardware, for use with igor show. We have a cluster composed of 13 shelves, each containing 5 shelves of 8 PCs each. When igor show runs, part of the information it gives is a diagram of "racks"; one "rack" from our cluster is shown below:

---------------------------------
|281|282|283|284|285|286|287|288|
|289|290|291|292|293|294|295|296|
|297|298|299|300|301|302|303|304|
|305|306|307|308|309|310|311|312|
|313|314|315|316|317|318|319|320|
---------------------------------

If you are running a cluster of 4x 1U servers, and they are all in a single rack, you would set rackheight = 4, and rackwidth = 1, to see something like this:

---
|1|
|2|
|3|
|4|
---

If the physical layout of your cluster is strange, or if you'd just prefer a big grid, you can set rackheight = sqrt(# nodes) and rackwidth = sqrt(# nodes). This will just show one big grid of all your nodes.

The "logfile" setting configures a location to write out logs of created and deleted reservations, as well as information about power on/off events and other debugging info.

The "timelimit" option configures the maximum number of minutes a reservation can last. If a longer reservation is needed, root is allowed to make reservations of unlimited length. Set the option to 0 to allow any user to make unlimited reservations.

The "nodelimit" option configures the maximum number of nodes a reservation can contain. root can make reservations of arbitrary size. Set the option to 0 to disable the limit.

Cobbler-specific options

The "usecobbler" option specifies whether or not igor should attempt to use Cobbler. If it is set to true, make sure the igor user has permission to run the cobbler command-line tool.

The "cobblerdefaultprofile" variable specifies the name of a profile within Cobbler. When an igor reservation completes, igor will set the nodes in that reservation back to the default profile.

Power options

igor can automatically reboot nodes in a reservation when the reservation starts. Setting the "autoreboot" option enables this.

The "poweroncommand" and "poweroffcommand" options specify which command should be executed to power on and power off nodes. It should be executable by the igor user and use '%s' to represent the name of a node; for example, the configuration shown above sets "poweroncommand" to "powerbot on %s". This means igor will run powerbot on kn1 to turn on node kn1.

Network segmentation options

In certain clusters, igor can partition the experiment network such that nodes in a reservation can talk to each other but not to nodes in other reservations. This is accomplished using the QinQ (802.1ad) specification. For this to work, you need an Arista core switch with the management api http-commands option enabled.

It is essential that your cluster has separate experiment and boot/management networks for this. The experiment network will be partitioned, meaning the head node will be unable to communicate with the reservation nodes. The cluster nodes must therefore do their netbooting over a different network.

To enable network segementation, set the "network" option to "arista". If support for other switch types is added, that value will change, but for the time being only Arista is supported.

The "vlan_min" and "vlan_max" fields specify a range of VLANs to be used as the 'outer' VLANs in QinQ segmentation. It is a good idea to make this value larger than the maximum number of reservations you can expect to exist at any one time; using the defaults of 1000 through 2000 is probably a safe default.

The "networkuser" and "networkpassword" fields specify a user on the Arista switch who is allowed to make configuration changes.

The "network_url" field specifies the path to the switch's HTTP API. In the example, the switch's name is "arista" and the API is on port 80, so the URL looks like "arista:80/command-api".

The "node_map" field is critical to properly partitioning the network. Each entry maps a node's name to the name of a port on the core switch. In the example shown here, the node kn1 is on the first port of the first line-card in the switch. If a reservation is created containing kn1 and kn2, igor will connect to the switch and configure ports Et1/1 and Et1/2 to be in the same QinQ domain.

Other options

Although it is typically not needed, the "dnsserver" configuration option allows you to specify which DNS server igor should use when resolving node names. This can be necessary when multiple DNS servers are specified in the head node's resolv.conf, but only one contains records for the cluster nodes. If you can ping a node manually, but igor shows it as down, setting this option may solve the problem.

Authors

The minimega authors

19 September 2017