
Automation of Systems Experiments with Wickie

2024-09-02

TL;DR: Over the last few years I built a tool to automate my experiments. In this post I outline the need for such a piece of software and discuss its design.

After identifying a relevant problem or use case, systems research consists of protocol design, implementation, and evaluation. Evaluation generally takes the longest, not only because it might require refining the protocol design or updating its implementation, but also because setting up a number of physical machines to run experiments on is very time-consuming.

Surprisingly, there exists no standard way of setting up and running experiments — at least none that I know of. When talking to other researchers, they usually mention custom scripts they built over the years but never made public, or project-specific scripts that are hard to adapt to other use cases. Around midway through my PhD studies I decided that I wanted a reusable framework for running all of my experiments.

Pictured: Past me setting up experiments without a suitable tool. (Source)

This framework must support the following three features. First, it needs to be able to perform the initial setup of machines, such as creating users or formatting disks, and then install all required pieces of software to run the experiments (either using a package manager, a binary distribution, or from source). Further, it must be able to run individual experiments and collect their output somewhere, so that one can easily inspect logs. Finally, it needs an end-to-end mechanism for running an entire series of experimental measurements and collecting their data, with minimal user interaction. For example, I might want to evaluate how the throughput of a database correlates with the number of shards it executes on. While I technically could run an individual experiment for each possible number of shards manually, it is much more convenient and efficient to automate this process, so that the entire evaluation can complete without any human interaction.

While a solution that fits all use cases is impossible, I[1] built wke (pronounced "vicky") and used it to evaluate a variety of systems over the last few years. In the past, I called this tool "distributed make" as, like conventional make, it allows running pre-defined actions. Unlike conventional make, however, wke does not execute on the local machine but on a cluster of remote machines. Because there are tools named distributed make or dmake already, I changed the name to "wke". wke stands for "Wickie", the German name of a fictional Viking boy from a popular animated series. More importantly, it is a short command and, to my knowledge, no existing command line tool uses this combination of letters.

wke enables defining targets that execute code written in bash or Python on individual machines or sets of machines. Such targets can be used to prepare experiments, such as compiling the project we want to evaluate, and to run the experimental workload. When running a target, wke will record all its output in dedicated log files.

wke is agnostic to any particular cluster or cluster provider. While I mainly use it with Cloudlab, it works with Azure and AWS as well. Generally, all you need to supply is a file that contains the addresses of the machines in your test cluster. But there are also tools, like wke-cloudlab, that automate the generation of such files.

wke relies on Paramiko to establish SSH connections through which scripts are run on remote machines, and uses rsync to efficiently copy data to and from machines. In order to avoid transferring a script with rsync every time it changes, scripts are piped through SSH. This has some downsides, such as requiring each script to be contained within a single file, but it allows for fast iteration.
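
To illustrate the piping mechanism, here is a minimal sketch (not wke's actual implementation) that sends a script's contents over a Paramiko connection and runs it with "bash -s", so no file transfer is needed. The hostname and username in the comment are placeholders.

import paramiko

def pipe_script(host, username, script):
    ''' Run a bash script on a remote machine by piping it over SSH '''
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username)

    # "bash -s" reads the script from stdin, so nothing is copied to disk first
    stdin, stdout, _stderr = client.exec_command("bash -s")
    stdin.write(script.encode())
    stdin.channel.shutdown_write()

    output = stdout.read().decode()
    client.close()
    return output

# e.g., pipe_script("128.105.144.25", "cskama", "hostname\n")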

Setting up your project

I now want to outline how to use wke by walking through some example scripts (which can be found here). Note that wke uses TOML as a format for all its configuration files. TOML is pretty intuitive in my opinion, but you might need to familiarize yourself with it before continuing to read this post.

wke projects have a cluster.toml file at their top-level, which contains the addresses of all machines. In most cases you want to automatically generate a cluster file, but you can also create it manually. A simple cluster with client and server machines can look like this.

[cluster]
username="cskama"
workdir="/users/cskama"

[machines]
node0 = { external_addr="128.105.144.25", internal_addr="10.1.1.1" }
node1 = { external_addr="128.105.144.27", internal_addr="10.1.1.2" }

The file contains some global configuration in the cluster section and a list of all machines in the machines section. One required cluster setting is the username that wke will connect as by default. In addition, you can set a dedicated working directory, which is beneficial if machines use non-standard home directories, or the home directory is limited in space. There are also optional cluster settings not shown here, such as which SSH port to use.

You can either define a single address per machine, or list separate "external" and "internal" addresses. The latter is useful as most cloud providers require you to set up a dedicated internal network for communication among the nodes in the cluster, separate from the public network that you use to connect to the machines. Such internal networks are also used by network testbeds, such as those of Emulab and Cloudlab. A machine's external address is then only used to connect to it via SSH from wherever wke is run, e.g., your laptop.

In addition to the cluster configuration, a project also contains one or multiple configurations, each residing in a subfolder of the project. The purpose of a configuration is to group artifact-specific scripts. For example, you might want to compare MongoDB to MariaDB, in which case you would create a separate configuration for each. Configurations can also inherit from another configuration, so that you can put functionality shared between two configurations into a third parent configuration.

Each configuration has a config.toml file, like the one listed below, that lists all targets and metadata for this configuration. Targets can have arguments (or "parameters") that will be passed to the script in the pre-defined order.

[targets]
install-mongodb = []

[targets.run-mongod]
about = "Runs the MongoDB server process"
arguments = []

[targets.prepare-data]
about = "Sets up the data needed to run experiments"
arguments = [{ name="server-address"}, { name="num-entries", default=1000 }]

[targets.client-ops]
about = "Issues client operations"
arguments = [
    { name="server-address" },
    { name="num-entries", default=1000 },
    { name="client-multiply", default=8},
    { name="num-operations", default=10000},
    { name="write-chance", default=50 }
]

With this configuration, we can install MongoDB on node0 like so:

wke mongodb node0 install-mongodb

The wke command always needs the configuration, a machine selector (e.g., a machine's name), and the target name. You can additionally override parameters, such as the server-address in prepare-data, using the -D flag, among other flags. So if you ran the server on node0 and used node1 as a client machine, you would run the command below. Note that there is a way to automate this, which we will discuss later.

wke mongodb node1 prepare-data -Dserver-address=10.1.1.1

As mentioned before, targets are simple Python or bash scripts. For example, the install-mongodb target adds the official MongoDB repository and installs its packages by running the commands outlined in the official installation instructions.

#! /bin/bash

curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | \
   sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor --yes

echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list

sudo apt-get update
sudo apt-get install mongodb-org -y

The benchmark suite

Calling wke directly from the command line is an okay way to run experiments, but it does not scale and can be cumbersome. For example, you have to manually set parameters as shown before. The goal of wke has always been to automate the evaluation process as much as possible in order to reduce the human labor required to create and recreate experimental results. wke is written in Python and exposes all of its functionality as a library that provides many utilities for creating project-specific benchmark suites.

To use this, you would set up a benchmark script, which we will look at now. Benchmark scripts require some boilerplate code but are fairly straightforward to write once you get the hang of it. You start with a main function, as shown below, which does two things. First, it defines the parameters we can set via the command line or experiment files (more on those later). Second, it passes a measurement function, which will set up and run our measurement, to wke.

from wke.benchmark import benchmark_main

def main():
    parameters = {
        "num-entries": {
            "default": 100000,
            "about": "How many entries does the database have initially",
        },
        "write-chance": {
            "default": 0,
            "about": "Likelihood that a client operation is a write",
        },
        "num-operations": {
            "default": 1000,
            "about": "How many requests to issue (per client)",
        },
        "client-multiply": {
            "default": 32,
            "about": "How many clients per client machine",
        }
    }

    benchmark_main(parameters, run_measurement)

if __name__ == "__main__":
    main()

Parameters depend on what exactly you would like to evaluate. For example, you might want to see how the performance of MongoDB changes when the number of clients increases (client-multiply) or the share of writes changes (write-chance). Similarly, we might want to change the initial number of entries (num-entries) before running the measurement, or run experiments of different lengths by increasing the number of operations issued per client (num-operations). Changing parameters allows you to do all of that without having to modify the benchmark script itself.

The rest of the benchmark script contains the run_measurement function, which sets up the experiment and runs a single measurement, as shown next. The function will potentially be invoked many times, and the benchmark logic will feed it the required parameter configuration for each specific run. In addition, the function takes as its arguments a flag stating whether to collect statistics such as the CPU usage of nodes in the cluster (a feature I will not discuss here due to lack of space), the result_printer, which generates the output CSV files for us, and a flag indicating the verbosity of the console output.

from wke import Cluster, Configuration, MeasurementFailedError
from wke.measurement import MeasurementSession

def run_measurement(params, collect_statistics, result_printer, verbose=False) -> bool:
    def get_opts(arg_names, params):
        ''' Get a subset of the parameters needed for a specific target '''
        return dict((k, params[k]) for k in arg_names)

    cluster = Cluster(path='cluster.toml')
    config = Configuration("mongodb")
    session = MeasurementSession(cluster, config, verbose=verbose,
           collect_statistics=collect_statistics)

    # Use one machine as the server and one as the client
    nodes = cluster.create_slice()
    servers = nodes.create_subslice(1)
    clients = nodes.create_subslice(1)

    # add the server address to parameters, so it can be passed to targets
    params["server-address"] = nodes.get_internal_addrs()[0]
   
    # Start the MongoDB daemon in the background 
    session.run_background(servers, "run-mongod")

    success = session.run(clients, 'prepare-data',
            options=get_opts(['server-address', 'num-entries'], params))

    if not success:
        print("Experiment setup failed")
        return False

    try:
        # Specify which parameters to pass to the run target
        opt_names = [
            'server-address', 'num-entries', 'client-multiply',
            'write-chance', 'num-operations'
        ]

        # Figure out how many operations there will be in total
        # so that the calculated throughput is correct
        total_num_ops = len(clients)*params['client-multiply']*params['num-operations']

        result = session.measure(clients, "client-ops",
            num_operations=total_num_ops,
            options=get_opts(opt_names, params))

    except MeasurementFailedError as err:
        print(str(err))
        return False

    # Any additional data we would like to add to the resulting CSV
    extra_data = {
        "throughput": round(result.throughput, 3),
    }

    # Write output to CSV file
    result_printer.print(session.uid, params, extra_data)
    return True

You can roughly split this function into two parts: preparation of the experiment (everything before session.run) and collection of results (the rest). wke provides multiple utility functions to make writing such functions easy. First, it allows you to conveniently interact with a cluster using the slice abstraction. A slice refers to a set of machines within the cluster and contains a cursor that advances whenever you create a subslice. For example, if you create a subslice of size four, the cursor also advances by four. As a result, the first subslice would contain machines one to four, while the next one would start at machine five. Subslices are thus disjoint subsets of the original slice. A common pattern when using wke is to fetch a slice containing the entire cluster using cluster.create_slice, and then invoke create_subslice on that slice.
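
As a rough sketch of this pattern (assuming a cluster.toml that lists eight machines; the comments simply restate the cursor behavior described above):

from wke import Cluster

# Assumes a cluster.toml listing eight machines
cluster = Cluster(path='cluster.toml')
nodes = cluster.create_slice()

servers = nodes.create_subslice(4)  # machines one to four; the cursor advances to five
clients = nodes.create_subslice(4)  # machines five to eight; disjoint from servers

# The first internal address belongs to the first server machine
server_addr = nodes.get_internal_addrs()[0]
print(len(servers), len(clients), server_addr)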

Another core utility for writing such benchmark scripts is the MeasurementSession. It keeps track of running background tasks, such as the MongoDB server. For example, when we spawn a MongoDB server in the background (using session.run_background(servers, "run-mongod")), the background task will be terminated when the session ends. The session can also be used to run specific foreground tasks, such as setting up the data needed for the measurement. Its main purpose, however, is running the actual measurement. session.measure takes a set of machines, a target name, and some other information, such as the total number of operations that will be run. It then returns a result containing information such as the throughput and the elapsed time. We can then feed the result to the result_printer by giving it the parameters and a dictionary containing any other metrics we want to collect, such as the throughput in this case.

Considering that this one function contains the entire experiment setup, execution, and teardown, it is actually relatively short...

After saving the script as benchmark.py, we would invoke something like the following to execute a measurement run for a write-only workload.

./benchmark.py cmdline -Cwrite-chance=100

The first argument indicates that we should take the configuration directly from the command line, and arguments starting with -C set the value of a specific parameter. In this case, we set write-chance to 100%.

Experiment Files

But there is more! The part where wke really shines, in my opinion, is its declarative way to define experiments. It allows using the same benchmark script for multiple different experiments and provides a straightforward way to keep track of them. Experiments can also automatically generate basic plots, albeit those plots are intended for quick results rather than the "pretty plots" you put in your paper.

Below you can see how we can define an experiment for our benchmark script that tests different workload mixes at different database sizes. In particular, it defines multiple values for the num-entries and write-chance parameters. For the former, it defines an exponential range from 10^0 to 10^6; for the latter, a list of five distinct values.

The benchmark script will automatically run a measurement for each combination of the defined parameter values. For example, for 1000 entries, it will run a separate set of measurements for each of the defined values of write-chance. You can additionally set the number of iterations per parameter combination; here, it will run three measurements per combination.

[experiment]
about="Runs different workloads with a varying number of entries in the database"
num_iterations=3

[parameters]
num-entries = { base=10, start=0, end=6 }
write-chance = [0,25,50,75,100]
num-operations = 10000

[[plots]]
type = "line"
x-axis = "num-entries"
y-axis = "throughput"
group-by = "write-chance"

[[plots]]
label = "read-only"
type = "line"
x-axis = "num-entries"
y-axis = "throughput"
filter-by = { key="write-chance", value=0 } 

Further, the experiment file also defines plots that are generated while data is collected. To write such a definition, we need to specify the type of plot (here it is line) and the parameters for the x- and y-axes. For the first plot, we also set group-by="write-chance", which will generate a separate line for each value of write-chance. The second plot definition creates only one line, for the read-only workload. Note that we need to label the second plot to give it a distinct name; each experiment can only contain one unlabeled (or "main") plot. In more complex experiments, additional plots are helpful for understanding the numbers behind the main plot.

To invoke this experiment, saved as num-entries.toml, we call the benchmark script with slightly different arguments, as shown below. The first argument is now toml, to indicate that a TOML file should be parsed, and the second argument is its path.

./benchmark.py toml experiments/num-entries.toml

Whenever a measurement run has concluded, wke will automatically update the experiment's CSV file and (re-)generate the defined plots. This means you quickly get feedback without having to wait for the entire experiment to conclude, and you can look at plots that automatically refresh instead of watching console output or changes to a CSV file. I found this especially helpful when writing benchmarks, as you can quickly iterate on and refine the benchmark scripts and experiment files.

There is much more to experiment definition, but this should give you a good idea of its expressiveness.

Limitations and Shortcomings

There are three things I want to improve about the tool in the future. I welcome help in tackling these issues.

Target Templates

wke requires targets to be simple scripts, which can result in code duplication. In practice, this has not been a huge issue, as all targets for a particular wke project together amount to no more than a few hundred lines of code. However, I would like to support templates for common tasks, such as cloning a GitHub repository. If you have a recommendation for a suitable template engine or syntax, I would love to hear it.
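
To make the idea concrete, here is a purely hypothetical sketch of what such a template could look like, using Python's built-in string.Template as a stand-in for a real template engine; wke does not support anything like this yet, and the repository, branch, and directory values are just example placeholders.

from string import Template

# Hypothetical: a reusable template for a "clone a GitHub repository" target
CLONE_REPO = Template("""#! /bin/bash
set -e
git clone --depth 1 --branch ${branch} ${repo_url} ${checkout_dir}
""")

# Instantiating the template would yield a regular bash target script
script = CLONE_REPO.substitute(
    repo_url="https://github.com/mongodb/mongo.git",
    branch="v7.0",
    checkout_dir="mongo",
)
print(script)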

Improved Experiment Setup

A core goal of wke has always been to execute on the bare metal of a remote server. As a result, targets need to build or install the required binaries directly on the target machines. Usually, one only needs to recompile when code changes and not for every experiment run, but this approach is still quite wasteful as compilation is done on every machine in the cluster.

In the future I would like to rely on images that can be built once and then deployed on all machines. I am still trying to figure out a mechanism for this that does not turn wke into a full-fledged package manager and also does not break any of the existing workflows.

asyncio Support

wke periodically polls each SSH connection in a dedicated thread, because no more efficient mechanism was supported by Paramiko at the time of wke's creation. This implementation is generally fine as, in most cases, you only connect to a handful of machines, but the approach is not very elegant and wastes some CPU cycles. One option would be to rewrite the tool in a "systems" language, such as Rust. However, the Python ecosystem contains much better libraries for data analysis and visualization, while similar libraries for Rust are still in their infancy.

Fortunately, Python recently received support for async I/O operations through the asyncio library, and there exists AsyncSSH, an SSH library with async support. For now, I plan to wait a little longer for AsyncSSH to mature before making this big transition. I also anticipate that such a change will cause quite a few issues, as it took a while to fix all the performance issues and bugs I encountered with Paramiko[2].
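
As a rough illustration of the direction (not something wke does today), AsyncSSH lets you await commands on many machines concurrently instead of polling each connection in its own thread. The addresses and username below are placeholders.

import asyncio
import asyncssh

async def run_on_machine(addr, username, command):
    ''' Run a single command on one machine and return its output '''
    async with asyncssh.connect(addr, username=username) as conn:
        result = await conn.run(command, check=True)
        return result.stdout

async def main():
    # Issue the same command to several machines concurrently
    addrs = ["128.105.144.25", "128.105.144.27"]
    outputs = await asyncio.gather(*(
        run_on_machine(addr, "cskama", "hostname") for addr in addrs
    ))
    print(outputs)

asyncio.run(main())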

Conclusion

wke is a tool that has served me well and is fairly mature for measuring distributed systems. I hope that making it public will enable other people to save some time, especially at the start of their careers as researchers. I made the code available under a permissive license (MIT).

Note that wke is only intended to set up and run experiments and has not been vetted for security. You should not use this tool with real user data or on a production cluster.

Ideally, wke will make it easier to perform experimental evaluations and make their results easier to reproduce. I hope that in the future our community can rely on a well-tested and widely-used tool instead of custom shell scripts. If wke is not that tool, I hope another will eventually take its place as the de facto standard mechanism for running systems experiments.


  1. Lequn Chen helped create an earlier version of this tool with which Wickie still shares some code.

  2. For example, Paramiko enables Nagle's algorithm by default, which added significant latency when issuing commands via SSH.