AdaptFM

Efficient Qwen Competition

The goal of the competition is to minimize inference latency for Qwen3.5-4B, as measured on NVIDIA A10G devices.

💬 Join the AdaptFM Slack for announcements, Q&A, and updates: adaptfm.slack.com

The Challenge

How fast can you make Qwen3.5-4B run on a single NVIDIA A10G GPU without breaking it?

We provide a base Docker image that serves the unoptimized Qwen3.5-4B model on an AWS g5.xlarge instance. Your job: make inference as fast as possible while keeping model quality above minimum thresholds on standard benchmarks.

Quantize it. Prune it. Rewrite the kernels. Swap the inference engine. Anything goes, as long as the model still works.

What You’re Optimizing

Primary metric: End-to-end inference latency (lower is better)
Hardware: Single NVIDIA A10G (24 GB VRAM), AWS g5.xlarge
Batch size: 1
Prompt lengths: 64, 2,048, and 8,192 tokens

The baseline (unoptimized bf16 on vLLM) averages 4,831 ms across the three prompt lengths. A simple AWQ 4-bit quantization already cuts this to 2,329 ms, roughly a 2x speedup (4,831 / 2,329 ≈ 2.07). Can you do better?

Quality Gates

Your optimized model must still pass these benchmarks:

Benchmark	Baseline	Minimum Threshold
MMLU-Pro	0.5844	≥ 0.5260 (90%)
IFEval	0.4448	≥ 0.4226 (95%)
GPQA-Diamond	0.2273	≥ 0.2046 (90%)

Fail any gate and your submission is marked invalid.
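
The thresholds in the table are the baseline scores multiplied by the retention ratio shown in parentheses. A minimal sketch of a local pre-submission check follows. It assumes thresholds are rounded to four decimals as in the table and that a score exactly at the threshold passes. The `failed_gates` helper and the example scores are hypothetical, not part of the official evaluation harness.

```python
# Baseline scores and retention ratios copied from the Quality Gates table.
GATES = {
    "MMLU-Pro": (0.5844, 0.90),
    "IFEval": (0.4448, 0.95),
    "GPQA-Diamond": (0.2273, 0.90),
}


def threshold(baseline: float, ratio: float) -> float:
    """Minimum passing score: baseline times ratio, rounded to 4 decimals (assumed)."""
    return round(baseline * ratio, 4)


def failed_gates(scores: dict[str, float]) -> list[str]:
    """Return the benchmarks whose score falls below its threshold.

    A benchmark missing from `scores` is treated as failed.
    """
    return [
        name
        for name, (baseline, ratio) in GATES.items()
        if scores.get(name, 0.0) < threshold(baseline, ratio)
    ]


if __name__ == "__main__":
    # Hypothetical scores for an optimized model (illustrative only).
    example = {"MMLU-Pro": 0.5512, "IFEval": 0.4301, "GPQA-Diamond": 0.1970}
    print(failed_gates(example))  # prints ['GPQA-Diamond']
```

An empty list means all three gates pass. Any non-empty list would make the submission invalid under the rule above.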

What’s Allowed

Quantization (AWQ, GPTQ, GGUF, etc. at any bit-width)
Pruning (structured, unstructured, semi-structured)
Knowledge distillation (student must init from Qwen3.5-4B)
Architecture modifications (layer removal, head pruning)
Custom CUDA/Triton kernels (source code required)
Custom inference engines (TensorRT-LLM, llama.cpp, anything)
KV-cache optimization, operator fusion, speculative decoding
torch.compile, Flash Attention variants
Memory offloading to system RAM

What’s NOT Allowed

Hard-coded or cached benchmark answers
Benchmark detection/routing logic
Obfuscated code
Training on evaluation benchmark data
Proprietary/non-distributable software

Getting Started

1. Pull the Base Image

docker pull adaptfm/adaptfm-base:latest

Stack: CUDA 12.4 · vLLM 0.19.0 · lm-eval 0.4.11 · PyTorch 2.10 · Transformers 4.57

2. Download Model Weights

docker run --rm --gpus all \
  -v hf_cache:/root/.cache/huggingface \
  --entrypoint python3 adaptfm/adaptfm-base:latest \
  -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-4B')"

3. Test Locally

docker run -d --gpus all -p 8080:8080 --name test-run \
  -v hf_cache:/root/.cache/huggingface --shm-size=4g \
  adaptfm/adaptfm-base:latest

# Wait ~4 min for model to load, then test
curl http://localhost:8080/ping

curl -s -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is 2+2?", "max_tokens": 32}' | python3 -m json.tool

docker rm -f test-run
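
Instead of waiting a fixed few minutes, the same check can be scripted. The sketch below polls `/ping` until it returns `200`, then times one `/invocations` call using only the Python standard library. The function names, timeouts, and polling interval are illustrative choices, and wall-clock time measured from the client is only a rough proxy for the official latency measurement.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080"  # matches the -p 8080:8080 mapping used above


def build_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_until_ready(timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll /ping until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/ping", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet, or not ready
        time.sleep(interval_s)
    return False


def timed_invocation(prompt: str, max_tokens: int = 32) -> tuple[float, dict]:
    """Send one request and return (elapsed milliseconds, parsed JSON response)."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt, max_tokens), timeout=300) as resp:
        payload = json.loads(resp.read())
    return (time.perf_counter() - start) * 1000.0, payload


if __name__ == "__main__":
    if not wait_until_ready():
        raise SystemExit("server did not become ready in time")
    elapsed_ms, output = timed_invocation("What is 2+2?")
    print(f"{elapsed_ms:.0f} ms", output)
```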

4. Build Your Submission

📦 Base Image: adaptfm/adaptfm-base

FROM adaptfm/adaptfm-base:latest

# Add your optimizations here
COPY my_serve.py /opt/program/my_serve.py

# Must serve /ping + /invocations + /v1/completions on port 8080
ENTRYPOINT ["python3", "/opt/program/my_serve.py"]

API Contract

Your container must serve on port 8080:

Endpoint	Method	Purpose
`/ping`	GET	Return `200` when model is loaded and ready
`/invocations`	POST	Accept inference request, return output
`/v1/completions`	POST	Same as `/invocations` (needed for quality benchmarks)

Request format:

{"prompt": "...", "max_tokens": 128, "temperature": 0.0}

Response format (OpenAI-compatible):

{
  "choices": [{"text": "generated text here"}]
}
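
To make the contract concrete, here is a minimal sketch of what a `my_serve.py` could look like using only the Python standard library. It is not the base image's server. The `generate` function is a hypothetical placeholder for your optimized model call, `MODEL_READY` is an assumed flag you would set once weights are loaded, and the defaults for `max_tokens` and `temperature` are taken from the request example above rather than from any official specification.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MODEL_READY = True  # assumption: flip to True only after the model has loaded


def generate(prompt: str, max_tokens: int, temperature: float) -> str:
    """Hypothetical stand-in for the real inference call."""
    return "stub output"


def handle_completion(body: bytes) -> dict:
    """Parse a request body and build the OpenAI-style response."""
    req = json.loads(body)
    text = generate(
        req["prompt"],
        int(req.get("max_tokens", 128)),
        float(req.get("temperature", 0.0)),
    )
    return {"choices": [{"text": text}]}


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, payload: dict) -> None:
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        if self.path == "/ping":
            # 200 only once the model is ready; 503 while still loading
            self._send_json(200 if MODEL_READY else 503, {"ready": MODEL_READY})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        # Both routes share one handler, as the contract requires
        if self.path not in ("/invocations", "/v1/completions"):
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = handle_completion(self.rfile.read(length))
        except (ValueError, KeyError) as exc:  # malformed JSON or missing prompt
            self._send_json(400, {"error": str(exc)})
            return
        self._send_json(200, payload)


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

The quality benchmarks may send additional OpenAI-style fields to `/v1/completions`. This sketch ignores any field it does not recognize, which may or may not be sufficient for the evaluation harness.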

Evaluation

Rankings are by average speedup over the unoptimized baseline:

Category	Prompt tokens	Output tokens	Baseline latency
Short	64	128	2,579 ms
Medium	2,048	256	5,422 ms
Long	8,192	256	6,492 ms
Average	—	—	4,831 ms

You can submit multiple times, up to once per team per day. Only your best valid submission (highest average speedup with all quality gates passed) appears on the leaderboard.
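
For local tracking, speedup can be estimated from your own measurements against the baseline latencies in the table. The text above does not say whether "average speedup" means the mean of the three per-category speedups or the ratio of average latencies, so the sketch below computes both. Treat both function names and the sample measurements as hypothetical.

```python
# Baseline latencies in milliseconds, copied from the Evaluation table.
BASELINE_MS = {"short": 2579.0, "medium": 5422.0, "long": 6492.0}


def mean_of_speedups(measured_ms: dict[str, float]) -> float:
    """Average the per-category ratios baseline / measured."""
    ratios = [BASELINE_MS[k] / measured_ms[k] for k in BASELINE_MS]
    return sum(ratios) / len(ratios)


def speedup_of_means(measured_ms: dict[str, float]) -> float:
    """Ratio of total baseline latency to total measured latency."""
    return sum(BASELINE_MS.values()) / sum(measured_ms[k] for k in BASELINE_MS)


if __name__ == "__main__":
    # Hypothetical measurements for an optimized container (illustrative only).
    measured = {"short": 1000.0, "medium": 2500.0, "long": 4000.0}
    print(f"mean of per-category speedups: {mean_of_speedups(measured):.2f}x")
    print(f"speedup of average latency:    {speedup_of_means(measured):.2f}x")
```

The two definitions agree only when every category speeds up by the same factor. The mean of ratios gives more weight to large gains on the short prompts.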

Timeline

Date	Milestone
April 20	Competition launches: rules and base Docker image published
May 8	Submission portal opens: registration and submissions begin
May 18	Registration deadline (teams of 1–4)
June 8	Submissions close (23:59 AoE)
June 12	Final leaderboard announced
July 10–11	Top teams present at ICML 2026, Seoul

Prizes

Place	Prize	Presentation
🥇 1st	$3,000	Oral + poster
🥈 2nd	$2,000	Oral + poster
🥉 3rd	$1,000	Poster

Top 10 teams must open-source their code and model weights under a permissive license (BSD, MIT, Apache, etc.).

Who Can Participate

Open to individuals and teams worldwide (1–4 members per team)
Amazon employees may not submit
Academic and industry participants welcome
No ICML registration required to submit (only to present)
Max 1 submission per team per day

Organizers

Organized as part of the AdaptFM Workshop at ICML 2026, Seoul, South Korea. Prizes sponsored by Amazon. Model provided by the Qwen team (Alibaba Cloud).

Contact Details

✉️ Email: adaptfmworkshop@gmail.com
🗨 Slack: AdaptFM Slack, join the #efficient-qwen-competition channel
🌐 Leaderboard & submission portal: Opens May 8