Efficient Qwen Competition
The goal of the competition is to minimize inference latency for Qwen3.5-4B, as measured on NVIDIA A10G devices.
The Challenge
How fast can you make Qwen3.5-4B run on a single NVIDIA A10G GPU without breaking it?
We provide a base Docker image with the unoptimized Qwen3.5-4B model serving on an AWS g5.xlarge instance. Your job: make inference as fast as possible while keeping model quality above minimum thresholds on standard benchmarks.
Quantize it. Prune it. Rewrite the kernels. Swap the inference engine. Anything goes, as long as the model still works.
What You’re Optimizing
- Primary metric: End-to-end inference latency (lower is better)
- Hardware: Single NVIDIA A10G (24GB VRAM), AWS g5.xlarge
- Batch size: 1
- Prompt lengths: 64, 2,048, and 8,192 tokens
The baseline (unoptimized bf16 on vLLM) averages 4,831 ms. A simple AWQ 4-bit quantization already cuts this to 2,329 ms, a roughly 2x speedup. Can you do better?
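For reference, here is one way to produce such a 4-bit checkpoint with the AutoAWQ library. This is a minimal sketch, not the official baseline recipe; the quantization config values and Qwen3.5-4B support are assumptions to verify against your AutoAWQ version.

# Minimal AWQ quantization sketch using AutoAWQ (config values are illustrative).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3.5-4B"
quant_path = "qwen3.5-4b-awq"  # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)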
Quality Gates
Your optimized model must still pass these benchmarks:
| Benchmark | Baseline | Minimum Threshold |
|---|---|---|
| MMLU-Pro | 0.5844 | ≥ 0.5260 (90%) |
| IFEval | 0.4448 | ≥ 0.4226 (95%) |
| GPQA-Diamond | 0.2273 | ≥ 0.2046 (90%) |
Fail any gate and your submission is marked invalid.
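You can check the gates locally before submitting. Below is a sketch using lm-eval's Python API against a running container; the exact task names and local-completions arguments are assumptions to check against lm-eval 0.4.11.

# Sketch: score the served model with lm-eval's local-completions backend.
# Task names and model_args are assumptions; verify against your lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://localhost:8080/v1/completions,"
        "model=qwen3.5-4b,"          # arbitrary label for the endpoint
        "tokenizer=Qwen/Qwen3.5-4B"  # tokenizer used to build requests
    ),
    tasks=["mmlu_pro", "ifeval", "gpqa_diamond_zeroshot"],
)
for task, metrics in results["results"].items():
    print(task, metrics)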
What’s Allowed
- Quantization (AWQ, GPTQ, GGUF, etc. at any bit-width)
- Pruning (structured, unstructured, semi-structured)
- Knowledge distillation (student must be initialized from Qwen3.5-4B)
- Architecture modifications (layer removal, head pruning)
- Custom CUDA/Triton kernels (source code required)
- Custom inference engines (TensorRT-LLM, llama.cpp, anything)
- KV-cache optimization, operator fusion, speculative decoding
- torch.compile, Flash Attention variants
- Memory offloading to system RAM
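To illustrate, here is a minimal sketch combining two of these levers, AWQ weights plus a context sized for the longest prompt category, using vLLM's offline API. The checkpoint path is the hypothetical output of an AWQ run like the one sketched earlier, and the engine arguments are illustrative assumptions.

# Sketch: serve an AWQ-quantized checkpoint with vLLM (offline API).
from vllm import LLM, SamplingParams

llm = LLM(
    model="qwen3.5-4b-awq",        # hypothetical local AWQ checkpoint
    quantization="awq",            # weights are AWQ 4-bit
    max_model_len=8448,            # 8,192-token prompts + 256 output tokens
    gpu_memory_utilization=0.90,   # leave headroom on the 24GB A10G
)

params = SamplingParams(max_tokens=32, temperature=0.0)
print(llm.generate(["What is 2+2?"], params)[0].outputs[0].text)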
What’s NOT Allowed
- Hard-coded or cached benchmark answers
- Benchmark detection/routing logic
- Obfuscated code
- Training on evaluation benchmark data
- Proprietary/non-distributable software
Getting Started
1. Pull the Base Image
docker pull adaptfm/adaptfm-base:latest
Stack: CUDA 12.4 · vLLM 0.19.0 · lm-eval 0.4.11 · PyTorch 2.10 · Transformers 4.57
2. Download Model Weights
docker run --rm --gpus all \
-v hf_cache:/root/.cache/huggingface \
--entrypoint python3 adaptfm/adaptfm-base \
-c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-4B')"
3. Test Locally
docker run -d --gpus all -p 8080:8080 --name test-run \
-v hf_cache:/root/.cache/huggingface --shm-size=4g \
adaptfm/adaptfm-base:latest
# Wait ~4 min for model to load, then test
curl http://localhost:8080/ping
curl -s -X POST http://localhost:8080/invocations \
-H 'Content-Type: application/json' \
-d '{"prompt": "What is 2+2?", "max_tokens": 32}' | python3 -m json.tool
docker rm -f test-run
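The same smoke test in Python, with a rough end-to-end latency measurement. The requests package is an assumption; it is not listed in the base image stack.

# Sketch: hit /invocations and time the round trip.
import time
import requests

payload = {"prompt": "What is 2+2?", "max_tokens": 32, "temperature": 0.0}
start = time.perf_counter()
resp = requests.post("http://localhost:8080/invocations", json=payload, timeout=120)
elapsed_ms = (time.perf_counter() - start) * 1000
resp.raise_for_status()
print(f"{elapsed_ms:.0f} ms:", resp.json()["choices"][0]["text"])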
4. Build Your Submission
- 📦 Base Image: adaptfm/adaptfm-base
FROM adaptfm/adaptfm-base:latest
# Add your optimizations here
COPY my_serve.py /opt/program/my_serve.py
# Must serve /ping + /invocations + /v1/completions on port 8080
ENTRYPOINT ["python3", "/opt/program/my_serve.py"]
API Contract
Your container must serve on port 8080:
| Endpoint | Method | Purpose |
|---|---|---|
| /ping | GET | Return 200 when model is loaded and ready |
| /invocations | POST | Accept inference request, return output |
| /v1/completions | POST | Same as /invocations (needed for quality benchmarks) |
Request format:
{"prompt": "...", "max_tokens": 128, "temperature": 0.0}
Response format (OpenAI-compatible):
{
"choices": [{"text": "generated text here"}]
}
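Putting the contract together, here is a minimal my_serve.py sketch built on FastAPI and vLLM. Both frameworks are assumptions (any stack that serves these three routes on port 8080 is fine), and the engine arguments are illustrative.

# my_serve.py — minimal sketch of the required API (FastAPI/vLLM are assumptions).
from fastapi import FastAPI, Request
from vllm import LLM, SamplingParams
import uvicorn

app = FastAPI()
# Loading at import time means the server (and /ping) only comes up once the
# model is ready, which satisfies the readiness contract.
llm = LLM(model="Qwen/Qwen3.5-4B", dtype="bfloat16")  # swap in your optimized model

def generate(body: dict) -> dict:
    params = SamplingParams(
        max_tokens=body.get("max_tokens", 128),
        temperature=body.get("temperature", 0.0),
    )
    output = llm.generate([body["prompt"]], params)[0]
    return {"choices": [{"text": output.outputs[0].text}]}

@app.get("/ping")
def ping():
    return {"status": "ok"}  # 200 = model loaded and ready

@app.post("/invocations")
async def invocations(request: Request):
    return generate(await request.json())

@app.post("/v1/completions")
async def completions(request: Request):  # same behavior as /invocations
    return generate(await request.json())

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)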
Evaluation
Rankings are by average speedup over the unoptimized baseline:
| Category | Prompt tokens | Output tokens | Baseline latency |
|---|---|---|---|
| Short | 64 | 128 | 2,579 ms |
| Medium | 2,048 | 256 | 5,422 ms |
| Long | 8,192 | 256 | 6,492 ms |
| Average | — | — | 4,831 ms |
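As a worked example, assuming average speedup is the ratio of average latencies (consistent with the 2x AWQ figure above), a submission measuring 1,100 / 2,400 / 2,900 ms across the three categories would score:

# Worked speedup example (the submission latencies are made up).
baseline_avg_ms = (2579 + 5422 + 6492) / 3    # 4,831 ms, matching the table
submission_avg_ms = (1100 + 2400 + 2900) / 3  # ~2,133 ms, illustrative
print(f"avg speedup: {baseline_avg_ms / submission_avg_ms:.2f}x")  # -> 2.26x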
You can submit multiple times. Only your best valid submission (highest avg speedup with all quality gates passed) appears on the leaderboard.
Timeline
| Date | Milestone |
|---|---|
| April 20 | Competition launches — rules + base Docker published |
| May 8 | Submission portal opens — registration + submissions begin |
| May 18 | Registration deadline (teams of 1–4) |
| June 8 | Submissions close (23:59 AoE) |
| June 12 | Final leaderboard announced |
| July 10–11 | Top teams present at ICML 2026, Seoul |
Prizes
| Place | Prize | Presentation |
|---|---|---|
| 🥇 1st | $3,000 | Oral + poster |
| 🥈 2nd | $2,000 | Oral + poster |
| 🥉 3rd | $1,000 | Poster |
Top 10 teams must open-source their code and model weights under a permissive license (BSD, MIT, Apache, etc.).
Who Can Participate
- Open to individuals and teams worldwide (1–4 members per team)
- Amazon employees may not submit
- Academic and industry participants welcome
- No ICML registration required to submit (only to present)
- Max 1 submission per team per day
Organizers
Organized as part of the AdaptFM Workshop at ICML 2026, Seoul, South Korea. Prizes sponsored by Amazon. Model provided by the Qwen team (Alibaba Cloud).
Contact Details
- ✉️ Email: adaptfmworkshop@gmail.com
- 🗨 Slack: join the #efficient-qwen-competition channel on the AdaptFM Slack
- 🌐 Leaderboard & submission portal: Opens May 8