Reproducing the Trace Analysis Figures from our OSDI '26 Paper

Artifact guide for the trace-analysis figures in our OSDI '26 paper — where to find the SwissAI serving trace, and how to regenerate the plots from the OpenTela repository.

Alongside our OSDI '26 paper, we are releasing the production serving trace collected from the SwissAI inference platform, together with the analysis code used to produce the trace-characterization figures and the scripts for running the system performance benchmarks. This guide explains how to access the trace dataset and reproduce the figures from the paper.

Quick start — running OpenTela

OpenTela is the decentralized platform that powered the SwissAI inference cluster and collected the trace. Running the system at the scale of SwissAI's production cluster is out of reach for most users, but OpenTela is designed to be easy to get started with on a single machine or a small cluster. Here is how to get a multi-LLM serving cluster up and running in a few minutes. For full details, see the installation, spin-up, and routing guides.

1. Install the binary

Download the latest release for your architecture:

# x86_64
wget https://github.com/eth-easl/OpenTela/releases/latest/download/otela-amd64 -O otela && chmod +x otela

# arm64
wget https://github.com/eth-easl/OpenTela/releases/latest/download/otela-arm64 -O otela && chmod +x otela

You can also build from source if you prefer.

2. Start a head node

Pick a machine with a public IP (it does not need a GPU) and run:

./otela start --mode standalone --public-addr {YOUR_IP_ADDR} --seed 0

Note the Peer ID printed in the logs (e.g. QmafRyc9ef1KKKMfG973aApDKCEEjnhf89dZDckgUeSMbB). You will need it for the worker node.

3. Start a worker node

On a GPU machine, install your favourite serving engine (e.g. vLLM) and run:

./otela start \
  --bootstrap.addr /ip4/{YOUR_IP_ADDR}/tcp/43905/p2p/{YOUR_HEAD_NODE_PEER_ID} \
  --subprocess "vllm serve Qwen/Qwen3-8B --max_model_len 16384 --port 8080" \
  --service.name llm \
  --service.port 8080 \
  --seed 1

OpenTela will launch vLLM as a subprocess and register the model with the head node. You can add more workers with different models or hardware the same way. You can also use Docker for containerized deployments.

4. Send a request

Requests go through the head node, which routes them to the right worker automatically:

import openai

client = openai.OpenAI(
    base_url="http://{YOUR_HEAD_NODE_IP}:8092/v1/service/llm/v1",
    api_key="test-token"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello, world!"}]
)
print(response)
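
The service/llm segment of the base URL corresponds to the --service.name registered by the worker in step 3, and the trailing /v1 is the usual OpenAI-compatible prefix exposed by the underlying vLLM server.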

That's it — you have a working multi-LLM serving cluster. Read on to learn how to reproduce the trace-analysis figures from the paper.

The trace

The dataset is published on the Hugging Face Hub.

It contains the per-request log used in the paper (trace.jsonl) and the per-model KV-cache bucket reuse logs that drive the reuse-lifecycle figures (qwen3-32b-bucket-reuse.jsonl, qwen380b_thinking_bucket-reuse.jsonl, qwen380b_instruct_bucket-reuse.jsonl, llama3-70b_bucket-reuse.jsonl, apertus-70b-bucket-reuse.jsonl, and qwen3-32b-buckets.jsonl). The notebooks pull only the files they need via huggingface_hub.snapshot_download, so you do not need to clone the full dataset by hand.
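
If you would rather explore the trace outside the notebooks, a minimal sketch along the following lines should work; the dataset id below is a placeholder, so substitute the actual repository id from the Hugging Face page:

# Minimal sketch: fetch only trace.jsonl and load it with pandas.
# "ORG/swissai-serving-trace" is a placeholder -- use the dataset id
# shown on the Hugging Face dataset page.
import pandas as pd
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/swissai-serving-trace",   # placeholder dataset id
    repo_type="dataset",
    allow_patterns=["trace.jsonl"],        # download only the per-request log
)

trace = pd.read_json(f"{local_dir}/trace.jsonl", lines=True)
print(len(trace), "requests, columns:", list(trace.columns))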

Reproducing the figures

The analysis code lives in contrib/trace_analysis/ of the OpenTela repository. Two notebooks are provided:

  • plot_all.ipynb — request-level latency distributions, per-model token usage, and the model-burstiness / lifecycle figures.
  • plot_bucket.ipynb — KV-cache bucket reuse over the model lifecycle for Qwen3-32B, Qwen3-80B (Instruct and Thinking), Llama-3-70B, and Apertus-70B.

Shared plotting style lives in plot_utils.py, and the rendered PDFs land in contrib/trace_analysis/figures/.

Steps

  1. Clone the repository and cd into the analysis directory:

    git clone https://github.com/eth-easl/OpenTela.git
    cd OpenTela/contrib/trace_analysis
  2. Install the Python dependencies. The notebooks rely on pandas, numpy, scipy, matplotlib, seaborn, and huggingface_hub:

    python -m venv .venv && source .venv/bin/activate
    pip install pandas numpy scipy matplotlib seaborn huggingface_hub jupyter
  3. (Optional) If the dataset becomes gated, log in to Hugging Face once so snapshot_download can authenticate:

    huggingface-cli login
  4. Run the notebooks. Either open them in Jupyter, or execute non-interactively:

    jupyter nbconvert --to notebook --execute plot_all.ipynb    --output plot_all.executed.ipynb
    jupyter nbconvert --to notebook --execute plot_bucket.ipynb --output plot_bucket.executed.ipynb

    The first run downloads the relevant .jsonl files from the Hugging Face dataset into the local huggingface_hub cache; subsequent runs reuse the cache.

  5. Inspect the generated figures under contrib/trace_analysis/figures/. You should obtain the same PDFs that appear in the paper, including latency_distribution_e2e.pdf, model_token_usage_top20_single.pdf, model_burstiness_lifecycle.pdf, and the *_output_share_*.pdf / reuse_lifecycle_comparison.pdf plots from the bucket notebook.
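
    A quick way to confirm the outputs is to list the rendered PDFs from Python (run from contrib/trace_analysis/):

    from pathlib import Path

    # Print the names of the regenerated figure PDFs.
    for pdf in sorted(Path("figures").glob("*.pdf")):
        print(pdf.name)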

Notes

  • The notebooks filter the trace to completed requests (status == 'DEFAULT' with valid created_at / finished_at) and clip the long tail (latency_s <= 1000) before computing distributions; a pandas sketch of this filter follows the list.
  • A small helper, trim_output.py, is included for keeping notebook outputs reasonably sized when committing changes back to the repository; it is not required to reproduce the figures.
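
For reference, the request filter described above corresponds roughly to the following pandas expression (a sketch assuming the column names listed in the note; the actual notebook code may differ in detail):

import pandas as pd

# Rough equivalent of the notebooks' request filter: keep completed requests
# with valid timestamps and clip the latency long tail.
trace = pd.read_json("trace.jsonl", lines=True)
completed = trace[
    (trace["status"] == "DEFAULT")
    & trace["created_at"].notna()
    & trace["finished_at"].notna()
    & (trace["latency_s"] <= 1000)
]
print(f"kept {len(completed)} of {len(trace)} requests")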

Running the system performance analysis

We also provide scripts for analyzing the resilience and performance of the OpenTela system under different conditions; they live in the contrib/benchmark/ directory. The scripts are designed to run workloads on the shared HPC cluster, but they can be adapted to run on any OpenTela deployment.

  1. To run the experiment on resilience to node failures, use the following command:
python simulator/real/run_recovery.py --config meta/experiments/1_recovery/recovery.yaml --output .local/output/70b_recovery.jsonl
  2. To run the experiment on performance under load, use the following command:
bash contrib/benchmark/experiments/scripts/bench_all.sh

or to run a specific workload:

cd contrib/benchmark/experiments/scripts
# Spin up the cluster with the desired configuration
python cli/spinup.py --config meta/experiments/2_placement/70b_gh200.yaml
# Run the workloads
python cli/run_workloads.py --config meta/experiments/2_placement/70b_gh200.yaml --output-file .local/output/70b_gh200.jsonl
# Cancel all jobs after the experiment is done
python cli/cancel_all.py

Notes

  • The spinup.py script uses a Jinja2 template to generate the OpenTela configuration for the cluster. You may need to modify the spinup_workload.py script to set the correct paths for the OCF binary and environment file for your setup. We have only tested these scripts with the specific setup of our HPC cluster, so adjustments may be necessary for other environments.
  • The workloads are designed to run for a fixed duration (e.g., 30 minutes) and generate load on the cluster. You can adjust the workload configurations in the meta/experiments/ directory to test different scenarios or models.
