From: sterni <sternenseemann@systemli•org>
To: depot@tvl.su
Subject: Custom Continuous Benchmarking for depot
Date: Mon, 10 Feb 2025 16:46:43 +0100
Message-ID: <ca85ee9e-558b-4123-8eef-b94c48048ddf@systemli.org>
Hello everyone,
# Overview
Below I sketch out an idea for a continuous benchmarking service built
on top of our existing CI/CD infrastructure, i.e. Buildkite. This would
be suitable to replace our existing use of aspen's [windtunnel.ci].
# Status
I'm looking for feedback on the general idea (i.e. this document)
before implementing a prototype (to prove that it is actually feasible).
# Motivation
- [windtunnel.ci] does not work at the moment.
  - The domain has expired.
  - I do not know whether the backing service is still running.
- aspen probably does not have time to work on it / maintain it in
  the near future (which is understandable).
- We have benchmarks in depot which are only executed manually at the
  moment (not exactly many, though: just //tvix and
  //third_party/lisp/mime4cl. There is, of course, no telling how
  this may change if we have a proper continuous benchmarking service
  built into our CI).
- Using a separate service, be it windtunnel.ci or a self-hosted
  [bencher] instance, is annoying because it needs to be configured
  separately from our existing CI infrastructure and can't leverage
  readTree for automatically discovering benchmarks.
# Proposal
I've recently stumbled over gipeda ("GIt PErformance DAshboard"), a
static site generator which used to back perf.haskell.org/ghc (which no
longer exists). It was not hard to [revive] the code. The nice
thing about it is that it's not concerned with running benchmarks at
all. It just takes a directory of runs (matched to git revisions) and
generates a [dashboard] which visualizes changes in performance and
integrates with a preexisting git viewer.
I think we can set up relatively simple continuous benchmarking for
depot using gipeda:
1. Add a simple helper which allows easily creating an extraStep that
executes a benchmark, e.g.:

    meta.ci.extraSteps = {
      hyperfine-bench = depot.ops.benchmarking.mkBench {
        run = run-hyperfine-bench;
        converter = depot.ops.benchmarking.hyperfine-json-to-csv;
      };
      cargo-bench = depot.ops.benchmarking.mkBench {
        run = run-cargo-bench;
        converter = depot.ops.benchmarking.criterion-to-csv;
      };
    };

The arguments (tentatively) have the following meaning:
run
: should run the benchmark and dump the result to stdout in some
machine-readable format.
converter
: is a program that reads the output of `run` and converts it to
the CSV format gipeda uses. This is a separate attribute to
allow reusing such programs. Derivations are used so that
new converters can also be defined in an ad hoc way
(i.e. inline).
The converted benchmark result would be uploaded as a Buildkite
[artifact].
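
To make the shape of this helper more concrete, here is a minimal,
untested sketch of what mkBench and one of the converters could look
like. The extraStep schema (in particular whether pointing `command`
at an executable is enough), the argument handling and the fixed
result file name are all assumptions, not existing depot code:

    # ops/benchmarking/default.nix (hypothetical)
    { pkgs, ... }:

    {
      # Wrap a benchmark runner and a converter into an extraStep-shaped
      # attrset whose command uploads the CSV result as an artifact.
      mkBench = { run, converter }: {
        command = pkgs.writeShellScript "run-benchmark" ''
          set -euo pipefail
          mkdir -p benchmarks
          # run the benchmark and convert its output to gipeda's CSV
          ${run} | ${converter} > benchmarks/result.csv
          # make the result available to a later collection step
          buildkite-agent artifact upload "benchmarks/*"
        '';
      };

      # Example converter: read hyperfine's JSON export on stdin and
      # emit "benchmark-name,seconds" lines.
      hyperfine-json-to-csv = pkgs.writeShellScript "hyperfine-json-to-csv" ''
        ${pkgs.jq}/bin/jq -r '.results[] | "\(.command),\(.mean)"'
      '';
    }

Naming the per-step result file is glossed over here; that is part of
the labeling question discussed under Open Questions below.
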
If we want usable results, we'll have to figure out how to
constrain the execution of these `extraSteps`:
- They always need to be executed on the same machine, so times
are comparable between runs.
- Benchmark execution may not be parallelized at all.
- The executing machine should otherwise be idle, i.e. no other
pipeline runs, no other Nix builds etc.
I'm not sure whether it is even possible to express this in
Buildkite's step configuration in a way that lets these extraSteps be
part of the normal depot pipeline (which would be pretty cool, though)
so that they are executed for refs/heads/canon. We'd essentially need
some kind of super-low-priority, but exclusive, step; a rough sketch
of step attributes that might get us part of the way follows below.
This may be easier to achieve if we either have a dedicated machine
for benchmarking or a separate pipeline (see Alternatives section)
which also doesn't run as often.
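
For illustration, the following is a rough sketch of Buildkite step
attributes that could approximate these constraints: a dedicated agent
queue pins execution to one machine, and a concurrency group of size
one serializes benchmark steps across builds. The attribute names are
existing Buildkite step options, but whether extraSteps can set them,
and the queue name, are assumptions; this also does not guarantee that
the machine is otherwise idle:

    # hypothetical extra attributes for a generated benchmark step
    {
      # only run on the dedicated benchmarking agent(s)
      agents.queue = "benchmark";
      # never run two benchmark steps at the same time, across all builds
      concurrency = 1;
      concurrency_group = "depot/benchmarks";
      # only benchmark canon, not CLs
      branches = "canon";
    }
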
2. Have a step in the pipeline that collects all benchmark results
and merges them into a single CSV file named after the git revision
of the run for gipeda. The step would either upload this merged
collection as an artifact or directly move it into gipeda's data
directory.
Buildkite allows the use of globs when downloading
artifacts, so we could probably just use e.g. `benchmarks/*`.
This way the merge step would not need to have a full list
of all benchmarks available to it.
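
A sketch of what such a collection step could run; the script name is
made up, but `buildkite-agent artifact download` does support globs
and `BUILDKITE_COMMIT` contains the revision being built:

    # hypothetical command of the collection step
    pkgs.writeShellScript "collect-benchmarks" ''
      set -euo pipefail
      # fetch every CSV the individual benchmark steps uploaded
      buildkite-agent artifact download "benchmarks/*" .
      # merge them into a single per-revision CSV for gipeda
      cat benchmarks/*.csv > "$BUILDKITE_COMMIT.csv"
      # upload the merged file (alternatively: move it straight into
      # gipeda's data directory on the machine serving the dashboard)
      buildkite-agent artifact upload "$BUILDKITE_COMMIT.csv"
    ''

Note that this assumes the per-step CSVs have distinct names, which
again runs into the labeling question from the Open Questions section.
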
3. After such a pipeline run, trigger a rebuild of the dashboard
(or just run gipeda regularly on a timer).
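
For the timer variant, a NixOS-style sketch could look as follows; the
package attribute, the data directory and the assumption that invoking
gipeda in its working directory regenerates the static site are all
unverified:

    { pkgs, ... }:

    {
      systemd.services.gipeda = {
        description = "regenerate the benchmark dashboard";
        serviceConfig = {
          Type = "oneshot";
          # assumed layout: gipeda settings, logs and site live here
          WorkingDirectory = "/var/lib/gipeda";
          ExecStart = "${pkgs.haskellPackages.gipeda}/bin/gipeda";
        };
      };

      # regenerate the dashboard once an hour
      systemd.timers.gipeda = {
        wantedBy = [ "timers.target" ];
        timerConfig.OnCalendar = "hourly";
      };
    }
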
# Alternatives
- Use a separate pipeline for benchmark execution
  - Benchmarks would be identified by e.g. a `meta.ci.benchmarks`
    attribute
  - Benchmark execution could be implemented in a single step or
    possibly be controlled by pipeline parallelism.
  - If artifacts are used, disambiguating them from other artifacts
    would be trivial.
  - We could use a separate [cluster] if we had dedicated agents for
    benchmark execution.
- Revive [windtunnel.ci]
- Set up [bencher] (which, at first glance, looks relatively
  complicated)
# Open Questions
## How to Identify Benchmarks
I think this points to some gaps in our Bazel-inspired target syntax,
which we should eventually work on filling (see also b/438). Basically,
gipeda uses strings to identify specific benchmarks. In some places,
globbing can be used (which is very simple, i.e. it just expands a `*`
at the end of the string to anything). The question, then, is how to
express a specific benchmark as a single string. We have the following
components:
1. The readTree target (which, due to `extraSteps`, may not be a
subtarget)
2. The extraStep name, i.e. the benchmarking script.
3. The named benchmark results the script returns (of which there may
be multiple).
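
Purely for illustration (not a proposal for the actual syntax), a
composed identifier could look something like

    //third_party/lisp/mime4cl:hyperfine-bench/parse-sample-message

i.e. readTree target, extraStep name and result name joined by
separators, with the made-up result name standing in for component 3.
Something along these lines would also keep gipeda's trailing-`*`
globbing useful, e.g. for selecting all benchmarks of one target.
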
We would also need to figure out which part of the code is in charge
of labeling. Probably the benchmark result merge operation should add
the first two components to the raw results, which only have the third
component. However, if we use globbed artifact downloads, the merge
step may not have the necessary information for this.
## Gipeda Frontend
The gipeda frontend has quite a few JS dependencies, probably pinned
to ancient versions. I haven't tried to get this to work or to
modernize it yet. It's probably feasible; in the worst case we'd have
to redo the graph rendering or pin the dependencies indefinitely.
## CSV as the Canonical Format
With the current proposal, CSV would become the canonical output
format, which is a relatively simple key-value map from benchmark name
to numerical result. The source output from e.g. criterion would be a
lot richer.
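
Concretely, the per-revision input would boil down to a handful of
name/value pairs (benchmark names made up; whether gipeda wants exactly
this two-column layout would need to be checked against its
documentation), roughly:

    # <revision>.csv
    tvix/eval-nixpkgs-hello,1.234
    mime4cl/parse-sample-mbox,0.042
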
I don't think this is a huge concern since we would be in a position to
change this later if necessary.
[revive]: https://github.com/nomeata/gipeda/pull/65
[bencher]: https://bencher.dev
[windtunnel.ci]: https://web.archive.org/web/20240926214808/https://windtunnel.ci/
[dashboard]: https://perf.haskell.org/gipeda/
[artifact]: https://buildkite.com/docs/pipelines/configure/artifacts
[cluster]: https://buildkite.com/docs/pipelines/clusters