Reproducing the Trace Analysis Figures from our OSDI '26 Paper
Artifact guide for the trace-analysis figures in our OSDI '26 paper — where to find the SwissAI serving trace, and how to regenerate the plots from the OpenTela repository.
Alongside our OSDI '26 paper, we are releasing the production serving trace collected from the SwissAI inference platform, together with the analysis code used to produce the trace-characterization figures and the scripts for running the system performance benchmarks. This guide explains how to access the trace dataset and reproduce the figures from the paper.
Quick start — running OpenTela
OpenTela is the decentralized platform that powered the SwissAI inference cluster and collected the trace. Although running the system at the scale of SwissAI's production cluster is out of reach for most users, we have designed OpenTela to be easy to get started with on a single machine or a small cluster. Here is how to get a multi-LLM serving cluster up and running in a few minutes. For full details, see the installation, spin-up, and routing guides.
1. Install the binary
Download the latest release for your architecture:
```bash
# x86_64
wget https://github.com/eth-easl/OpenTela/releases/latest/download/otela-amd64 -O otela && chmod +x otela
# arm64
wget https://github.com/eth-easl/OpenTela/releases/latest/download/otela-arm64 -O otela && chmod +x otela
```

You can also build from source if you prefer.
2. Start a head node
Pick a machine with a public IP (it does not need a GPU) and run:
```bash
./otela start --mode standalone --public-addr {YOUR_IP_ADDR} --seed 0
```

Note the Peer ID printed in the logs (e.g. QmafRyc9ef1KKKMfG973aApDKCEEjnhf89dZDckgUeSMbB). You will need it for the worker node.
3. Start a worker node
On a GPU machine, install your favourite serving engine (e.g. vLLM) and run:
```bash
./otela start \
  --bootstrap.addr /ip4/{YOUR_IP_ADDR}/tcp/43905/p2p/{YOUR_HEAD_NODE_PEER_ID} \
  --subprocess "vllm serve Qwen/Qwen3-8B --max_model_len 16384 --port 8080" \
  --service.name llm \
  --service.port 8080 \
  --seed 1
```

OpenTela will launch vLLM as a subprocess and register the model with the head node. You can add more workers with different models or hardware the same way. You can also use Docker for containerized deployments.
4. Send a request
Requests go through the head node, which routes them to the right worker automatically:
```python
import openai

client = openai.OpenAI(
    base_url="http://{YOUR_HEAD_NODE_IP}:8092/v1/service/llm/v1",
    api_key="test-token",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
print(response)
```

That's it — you have a working multi-LLM serving cluster. Read on to learn how to reproduce the trace-analysis figures from the paper.
The trace
The dataset is published on the Hugging Face Hub:
- Dataset: [`eth-easl/swissai-serving-trace`](https://huggingface.co/datasets/eth-easl/swissai-serving-trace)
It contains the per-request log used in the paper (`trace.jsonl`) and the per-model KV-cache bucket reuse logs that drive the reuse-lifecycle figures (`qwen3-32b-bucket-reuse.jsonl`, `qwen380b_thinking_bucket-reuse.jsonl`, `qwen380b_instruct_bucket-reuse.jsonl`, `llama3-70b_bucket-reuse.jsonl`, `apertus-70b-bucket-reuse.jsonl`, and `qwen3-32b-buckets.jsonl`). The notebooks pull only the files they need via `huggingface_hub.snapshot_download`, so you do not need to clone the full dataset by hand.
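The notebooks handle the download themselves, but if you want to fetch a subset of the files manually, a minimal sketch with `snapshot_download` looks like this (the `allow_patterns` below are only an illustration; the notebooks select the files they actually need):

```python
from huggingface_hub import snapshot_download

# Fetch only the per-request log and one bucket-reuse log from the dataset.
# The allow_patterns shown here are illustrative, not the notebooks' exact choice.
local_dir = snapshot_download(
    repo_id="eth-easl/swissai-serving-trace",
    repo_type="dataset",
    allow_patterns=["trace.jsonl", "qwen3-32b-bucket-reuse.jsonl"],
)
print(f"Files cached under: {local_dir}")
```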
Reproducing the figures
The analysis code lives in `contrib/trace_analysis/` of the OpenTela repository. Two notebooks are provided:

- `plot_all.ipynb` — request-level latency distributions, per-model token usage, and the model-burstiness / lifecycle figures.
- `plot_bucket.ipynb` — KV-cache bucket reuse over the model lifecycle for Qwen3-32B, Qwen3-80B (Instruct and Thinking), Llama-3-70B, and Apertus-70B.

Shared plotting style lives in `plot_utils.py`, and the rendered PDFs land in `contrib/trace_analysis/figures/`.
Steps
- Clone the repository and `cd` into the analysis directory:

  ```bash
  git clone https://github.com/eth-easl/OpenTela.git
  cd OpenTela/contrib/trace_analysis
  ```

- Install the Python dependencies. The notebooks rely on `pandas`, `numpy`, `scipy`, `matplotlib`, `seaborn`, and `huggingface_hub`:

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install pandas numpy scipy matplotlib seaborn huggingface_hub jupyter
  ```

- (Optional) If the dataset becomes gated, log in to Hugging Face once so `snapshot_download` can authenticate:

  ```bash
  huggingface-cli login
  ```

- Run the notebooks. Either open them in Jupyter, or execute non-interactively:

  ```bash
  jupyter nbconvert --to notebook --execute plot_all.ipynb --output plot_all.executed.ipynb
  jupyter nbconvert --to notebook --execute plot_bucket.ipynb --output plot_bucket.executed.ipynb
  ```

  The first run downloads the relevant `.jsonl` files from the Hugging Face dataset into the local `huggingface_hub` cache; subsequent runs reuse the cache.

- Inspect the generated figures under `contrib/trace_analysis/figures/`. You should obtain the same PDFs that appear in the paper, including `latency_distribution_e2e.pdf`, `model_token_usage_top20_single.pdf`, `model_burstiness_lifecycle.pdf`, and the `*_output_share_*.pdf` / `reuse_lifecycle_comparison.pdf` plots from the bucket notebook.
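If you prefer a quick sanity check over opening each PDF by hand, a short script along these lines can confirm that the expected files were produced (the filenames are the ones listed in the last step; the relative `figures` path assumes you run it from `contrib/trace_analysis/`):

```python
from pathlib import Path

# Figures are written by the notebooks into contrib/trace_analysis/figures/.
figures = Path("figures")
expected = [
    "latency_distribution_e2e.pdf",
    "model_token_usage_top20_single.pdf",
    "model_burstiness_lifecycle.pdf",
    "reuse_lifecycle_comparison.pdf",
]
for name in expected:
    status = "ok" if (figures / name).exists() else "MISSING"
    print(f"{status:8s} {name}")
```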
Notes
- The notebooks filter the trace to completed requests (`status == 'DEFAULT'` with valid `created_at` / `finished_at`) and clip the long tail (`latency_s <= 1000`) before computing distributions; see the sketch after these notes.
- A small helper, `trim_output.py`, is included for keeping notebook outputs reasonably sized when committing changes back to the repository; it is not required to reproduce the figures.
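As a rough illustration of that filtering (this is not the notebooks' exact code; the column names come from the note above, and the timestamp parsing and latency derivation are assumptions):

```python
import pandas as pd

# Load the per-request log released with the dataset.
df = pd.read_json("trace.jsonl", lines=True)

# Keep completed requests with valid timestamps, as described in the note above.
df = df[df["status"] == "DEFAULT"]
df = df.dropna(subset=["created_at", "finished_at"])

# Assumed: end-to-end latency is derived from the two timestamps.
created = pd.to_datetime(df["created_at"])
finished = pd.to_datetime(df["finished_at"])
df["latency_s"] = (finished - created).dt.total_seconds()

# Clip the long tail before computing distributions.
df = df[df["latency_s"] <= 1000]
print(df["latency_s"].describe())
```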
Running System Performance Analysis
We also provide scripts for analyzing the resilience and performance of the OpenTela system under different conditions. They live in the `contrib/benchmark/` directory and are designed to run workloads on our shared HPC cluster, but can be adapted to run on any OpenTela deployment.
- To run the experiment on resilience to node failures, use the following command:

  ```bash
  python simulator/real/run_recovery.py --config meta/experiments/1_recovery/recovery.yaml --output .local/output/70b_recovery.jsonl
  ```

- To run the experiment on performance under load, use the following command:

  ```bash
  bash contrib/benchmark/experiments/scripts/bench_all.sh
  ```

  or, to run a specific workload:

  ```bash
  cd contrib/benchmark/experiments/scripts
  # Spin up the cluster with the desired configuration
  python cli/spinup.py --config meta/experiments/2_placement/70b_gh200.yaml
  # Run the workloads
  python cli/run_workloads.py --config meta/experiments/2_placement/70b_gh200.yaml --output-file .local/output/70b_gh200.jsonl
  # Cancel all jobs after the experiment is done
  python cli/cancel_all.py
  ```

Notes
- The `spinup.py` script uses a Jinja2 template to generate the OpenTela configuration for the cluster (a generic rendering sketch appears after these notes). You may need to modify the `spinup_workload.py` script to set the correct paths for the OCF binary and environment file based on your setup. We only tested it with the specific setup on our HPC cluster, so adjustments may be necessary for other environments.
- The workloads are designed to run for a fixed duration (e.g., 30 minutes) and generate load on the cluster. You can adjust the workload configurations in the `meta/experiments/` directory to test different scenarios or models.
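As a generic illustration of that template-rendering step, here is a minimal Jinja2 sketch; the template text and variable names are invented for this example and do not correspond to the actual templates shipped in the repository.

```python
from jinja2 import Template

# Hypothetical template snippet; the real templates in the repository differ.
template = Template(
    "bootstrap_addr: {{ bootstrap_addr }}\n"
    "service_name: {{ service_name }}\n"
    "service_port: {{ service_port }}\n"
)

# Render a per-node configuration from experiment parameters.
config = template.render(
    bootstrap_addr="/ip4/10.0.0.1/tcp/43905/p2p/<HEAD_NODE_PEER_ID>",
    service_name="llm",
    service_port=8080,
)
print(config)
```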