AdaptFM

Efficient Qwen Competition

The goal of the competition is to minimize inference latency for Qwen3.5-4B, as measured on NVIDIA A10G devices.

💬 Join the AdaptFM Slack for announcements, Q&A, and updates: adaptfm.slack.com

The Challenge

How fast can you make Qwen3.5-4B run on a single NVIDIA A10G GPU without breaking it?

We provide a base Docker image that serves the unoptimized Qwen3.5-4B model on an AWS g5.xlarge instance. Your job: make inference as fast as possible while keeping model quality above the minimum thresholds on standard benchmarks.

Quantize it. Prune it. Rewrite the kernels. Swap the inference engine. Anything goes, as long as the model still works.

What You’re Optimizing

The baseline (unoptimized bf16 on vLLM) averages 4,831 ms across the three latency categories listed under Evaluation below. A simple AWQ 4-bit quantization already cuts this to 2,329 ms, roughly a 2x speedup. Can you do better?
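
If you want to start from something like that reference optimization, here is a minimal quantization sketch. It assumes the AutoAWQ library; the organizers' 2,329 ms figure may come from a different toolchain, so treat the settings as a starting point only.

# Hedged sketch using the AutoAWQ library (an assumed library choice).
# The quantized folder can then be loaded by vLLM with quantization="awq".
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3.5-4B"
quant_path = "qwen3.5-4b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrates on AutoAWQ's default calibration set and writes 4-bit weights.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)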

Quality Gates

Your optimized model must still pass these benchmarks:

Benchmark      Baseline   Minimum threshold
MMLU-Pro       0.5844     ≥ 0.5260 (90% of baseline)
IFEval         0.4448     ≥ 0.4226 (95% of baseline)
GPQA-Diamond   0.2273     ≥ 0.2046 (90% of baseline)

Fail any gate and your submission is marked invalid.
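
You can sanity-check the gates before submitting by pointing lm-eval (0.4.11 ships in the base image) at your running container. The sketch below uses its Python API; the task names and model_args keys are assumptions, so verify them against the lm-eval documentation and the organizers' harness.

# Hedged sketch: run the quality-gate benchmarks against a locally served container.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=Qwen/Qwen3.5-4B,"
        "base_url=http://localhost:8080/v1/completions,"
        "tokenizer_backend=huggingface"
    ),
    tasks=["mmlu_pro", "ifeval", "gpqa_diamond_zeroshot"],  # assumed task names
)

for task, metrics in results["results"].items():
    print(task, metrics)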

What’s Allowed

What’s NOT Allowed


Getting Started

1. Pull the Base Image

docker pull adaptfm/adaptfm-base:latest

Stack: CUDA 12.4 · vLLM 0.19.0 · lm-eval 0.4.11 · PyTorch 2.10 · Transformers 4.57

2. Download Model Weights

docker run --rm --gpus all \
  -v hf_cache:/root/.cache/huggingface \
  --entrypoint python3 adaptfm/adaptfm-base:latest \
  -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-4B')"

3. Test Locally

docker run -d --gpus all -p 8080:8080 --name test-run \
  -v hf_cache:/root/.cache/huggingface --shm-size=4g \
  adaptfm/adaptfm-base:latest

# Wait ~4 min for model to load, then test
curl http://localhost:8080/ping

curl -s -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is 2+2?", "max_tokens": 32}' | python3 -m json.tool

docker rm -f test-run

4. Build Your Submission

FROM adaptfm/adaptfm-base:latest

# Add your optimizations here
COPY my_serve.py /opt/program/my_serve.py

# Must serve /ping + /invocations + /v1/completions on port 8080
ENTRYPOINT ["python3", "/opt/program/my_serve.py"]

API Contract

Your container must serve on port 8080:

Endpoint          Method   Purpose
/ping             GET      Return 200 when the model is loaded and ready
/invocations      POST     Accept an inference request and return the output
/v1/completions   POST     Same as /invocations (needed for the quality benchmarks)

Request format:

{"prompt": "...", "max_tokens": 128, "temperature": 0.0}

Response format (OpenAI-compatible):

{
  "choices": [{"text": "generated text here"}]
}
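
For orientation, here is a minimal sketch of the my_serve.py referenced in the Dockerfile above. It assumes FastAPI, uvicorn, and vLLM's offline LLM engine are available in the base image; the model name and engine settings are placeholders for your own optimized setup, not the baseline's configuration.

# my_serve.py -- minimal sketch of the required contract (assumptions noted above).
from fastapi import FastAPI
from vllm import LLM, SamplingParams
import uvicorn

app = FastAPI()

# Blocks until weights are loaded, so /ping only answers once the model is ready.
llm = LLM(model="Qwen/Qwen3.5-4B", dtype="bfloat16")

@app.get("/ping")
def ping():
    return {"status": "ok"}

def generate(body: dict):
    params = SamplingParams(
        max_tokens=body.get("max_tokens", 128),
        temperature=body.get("temperature", 0.0),
    )
    outputs = llm.generate([body["prompt"]], params)
    # Response shape follows the OpenAI-compatible format shown above.
    return {"choices": [{"text": outputs[0].outputs[0].text}]}

# Both POST endpoints share one handler, per the contract.
app.post("/invocations")(generate)
app.post("/v1/completions")(generate)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)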

Evaluation

Rankings are by average speedup over the unoptimized baseline:

Category   Prompt tokens   Output tokens   Baseline latency
Short      64              128             2,579 ms
Medium     2,048           256             5,422 ms
Long       8,192           256             6,492 ms
Average                                    4,831 ms
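
For a rough sense of the metric, assuming the leaderboard score is the mean of per-category speedups (the organizers define the exact aggregation), the calculation looks like this:

# Assumption: the leaderboard averages the per-category speedups.
baseline_ms = {"short": 2579, "medium": 5422, "long": 6492}
yours_ms = {"short": 1200, "medium": 2600, "long": 3100}  # illustrative numbers only

speedups = {k: baseline_ms[k] / yours_ms[k] for k in baseline_ms}
avg_speedup = sum(speedups.values()) / len(speedups)
print(speedups, f"average speedup: {avg_speedup:.2f}x")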

You can submit multiple times. Only your best valid submission (highest avg speedup with all quality gates passed) appears on the leaderboard.


Timeline

Date         Milestone
April 20     Competition launches: rules + base Docker image published
May 8        Submission portal opens: registration + submissions begin
May 18       Registration deadline (teams of 1–4)
June 8       Submissions close (23:59 AoE)
June 12      Final leaderboard announced
July 10–11   Top teams present at ICML 2026, Seoul

Prizes

Place    Prize    Presentation
🥇 1st   $3,000   Oral + poster
🥈 2nd   $2,000   Oral + poster
🥉 3rd   $1,000   Poster

Top 10 teams must open-source their code and model weights under a permissive license (BSD, MIT, Apache, etc.).

Who Can Participate

Organizers

Organized as part of the AdaptFM Workshop at ICML 2026, Seoul, South Korea. Prizes sponsored by Amazon. Model provided by the Qwen team (Alibaba Cloud).

Contact Details