AdaptFM

Efficient Qwen: Minimizing Inference Latency for Qwen3.5-4B on A10G

πŸ’¬ Join the AdaptFM Slack for announcements, Q&A, and updates: adaptfm.slack.com

The Challenge

How fast can you make Qwen3.5-4B run on a single NVIDIA A10G GPU without breaking it?

We provide a base Docker image with the unoptimised Qwen3.5-4B model serving on an AWS g5.xlarge instance. Your job: make inference as fast as possible while keeping model quality above minimum thresholds on standard benchmarks.

Quantize it. Prune it. Rewrite the kernels. Swap the inference engine. Anything goes, as long as the model still works.


Evaluation & Ranking

Rankings are by average speedup over the unoptimized baseline:

Category Prompt tokens Output tokens Baseline latency
Short 64 128 2,582 ms
Medium 2,048 256 5,441 ms
Long 8,192 256 6,576 ms
Average β€” β€” 4,866 ms

Quality Gates

Your optimized model must still pass these benchmarks (fail any and your submission is invalid):

Benchmark Baseline Minimum Threshold
MMLU-Pro 0.690 β‰₯ 0.621 (90%)
IFEval 0.857 β‰₯ 0.814 (95%)
GPQA-Diamond 0.700 β‰₯ 0.630 (90%)

Evaluation mode: Quality benchmarks use /v1/chat/completions with the Qwen3.5-4B chat template. GPQA-Diamond uses thinking enabled; MMLU-Pro and IFEval use thinking disabled.

You can submit multiple times. Only your best valid submission appears on the leaderboard.

πŸ† Live leaderboard: https://d1krc5fcnf73gi.cloudfront.net

What’s Allowed

What’s NOT Allowed


Getting Started

1. Pull the Base Image

docker pull adaptfm/adaptfm-base:latest

2. Download Model Weights

pip install huggingface_hub
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-4B', local_dir='./qwen-weights')"

3. Test Locally

docker run -d --gpus all -p 8080:8080 --name test-run \
  -v hf_cache:/root/.cache/huggingface --shm-size=4g \
  adaptfm/adaptfm-base:latest

# Wait ~3 min for model to load, then test
curl http://localhost:8080/ping

# Test with thinking disabled (MMLU-Pro/IFEval mode)
curl -s -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":32}' | python3 -m json.tool

# Test with thinking enabled (GPQA mode)
curl -s -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":512,"thinking":true}' | python3 -m json.tool

docker rm -f test-run

4. Build Your Submission

⚠️ Important: Your submission image must include model weights baked in at /opt/ml/model/. The evaluation environment has no internet access.

FROM adaptfm/adaptfm-base:latest

# Copy model weights into the image (REQUIRED)
COPY qwen-weights/ /opt/ml/model/

# Add your optimizations / custom serve script
COPY my_serve.py /opt/program/my_serve.py

# Disable network access for model loading
ENV TRANSFORMERS_OFFLINE=1
ENV HF_DATASETS_OFFLINE=1
ENV HF_HUB_OFFLINE=1

# Must serve /ping + /invocations + /v1/completions + /v1/chat/completions on port 8080
ENTRYPOINT ["python3", "/opt/program/my_serve.py"]
docker build -t my-submission:latest .

API Contract

⚠️ API update (May 14): The /v1/chat/completions endpoint is now required. Submissions that only expose /ping, /invocations, and /v1/completions will fail quality evaluation. Please update your container accordingly.

Your container must serve on port 8080:

Endpoint Method Purpose
/ping GET Return 200 when model is loaded and ready
/invocations POST Accept inference request, return output
/v1/completions POST Raw text completions (used for latency benchmark)
/v1/chat/completions POST Chat completions with template (used for quality benchmarks)

Request format (/invocations and /v1/completions):

{"prompt": "...", "max_tokens": 128, "temperature": 0.0}

Request format (/v1/chat/completions):

{"model": "Qwen/Qwen3.5-4B", "messages": [{"role": "user", "content": "..."}], "max_tokens": 128, "temperature": 0.0}

Response format (OpenAI-compatible):

{
  "choices": [{"text": "generated text here"}]
}

Thinking Mode

Qwen3.5-4B generates <think>...</think> tokens by default. For the competition:

To disable thinking, use a chat template that outputs an empty think block. We provide qwen_no_think.jinja:

# vLLM example
--chat-template /path/to/qwen_no_think.jinja

Local Evaluation

Prerequisites

Download Eval Scripts

Run

# Start your container
docker run -d --gpus all -p 8080:8080 --name test-submission my-submission:latest
watch -n5 'curl -s http://localhost:8080/ping'  # wait for 200

# Quality only (~20 min with 10% sample)
HF_HOME=/path/to/hf_cache QUALITY_LIMIT=0.1 NUM_CONCURRENT=8 \
  python3 run_quality_local.py 2>&1 | tee /tmp/quality.log

# Full eval β€” latency + quality (~60 min with 10% sample)
HF_HOME=/path/to/hf_cache EVAL_MODE=full QUALITY_LIMIT=0.1 NUM_CONCURRENT=8 \
  python3 run_eval_local.py 2>&1 | tee /tmp/eval.log

Eval Harness Configuration

Task Mode Concurrency Max tokens
MMLU-Pro (5-shot) Chat completions, thinking OFF 8 512
IFEval (0-shot) Chat completions, thinking OFF 8 512
GPQA-Diamond (0-shot) Chat completions, thinking ON + streaming 8 12288

Latency: 5 warmup + 50 measurement runs Γ— 3 categories.


Submission Guide

API Details

Item Value
API Base URL https://79x0as8g44.execute-api.us-east-1.amazonaws.com/prod
API Key qoXdZQNYbX1s4wnhmcBJG2APABlqjNVSao8CdM3j
Max image size 20 GB
Tarball filename Must be image.tar.gz
Upload URL expiry 2 hours

Step 1 β€” Register Your Team

curl -s -X POST \
  https://79x0as8g44.execute-api.us-east-1.amazonaws.com/prod/register \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: qoXdZQNYbX1s4wnhmcBJG2APABlqjNVSao8CdM3j' \
  -d '{"team_id": "AFM-xxxxxxxx"}' | python3 -m json.tool

Replace AFM-xxxxxxxx with the team ID you received upon registration.

Step 2 β€” Save Image as Tarball

⚠️ The tarball must be named image.tar.gz. Any other filename will cause the submission to fail.

docker save my-submission:latest | gzip > image.tar.gz
du -sh image.tar.gz  # must be under 20 GB

Step 3 β€” Get Upload URL

FILE_SIZE=$(stat -c%s image.tar.gz 2>/dev/null || stat -f%z image.tar.gz)

curl -s -X POST \
  https://79x0as8g44.execute-api.us-east-1.amazonaws.com/prod/upload-url \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: qoXdZQNYbX1s4wnhmcBJG2APABlqjNVSao8CdM3j' \
  -d "{\"team_id\": \"AFM-xxxxxxxx\", \"file_size_bytes\": $FILE_SIZE}" \
  | tee /tmp/upload_resp.json | python3 -m json.tool

The response will contain upload_type:

Step 4A β€” Upload (Single, ≀ 5 GB)

UPLOAD_URL=$(python3 -c "import json; print(json.load(open('/tmp/upload_resp.json'))['upload_url'])")

curl -X PUT "$UPLOAD_URL" \
  --upload-file image.tar.gz \
  --progress-bar \
  -w "\nHTTP Status: %{http_code}\n"

Expected: HTTP Status: 200 β€” then go to Step 5.

Step 4B β€” Upload (Multipart, > 5 GB)

Supports resume β€” if interrupted, re-run and it skips already-uploaded parts.

python3 - << 'EOF'
import json, os, subprocess, sys

resp    = json.load(open('/tmp/upload_resp.json'))
team_id = resp['s3_key'].split('/')[1]

etags_file = '/tmp/upload_etags.json'
if os.path.exists(etags_file):
    etags = json.load(open(etags_file))
    done  = {e['part_number'] for e in etags}
    print(f"Resuming β€” {len(done)} parts already done: {sorted(done)}")
else:
    etags = []
    done  = set()

print(f"Uploading {resp['num_parts']} parts ({resp['part_size']//1024//1024} MB each)...")

with open('image.tar.gz', 'rb') as f:
    for part in resp['part_urls']:
        n = part['part_number']
        if n in done:
            f.seek(resp['part_size'], 1)
            continue
        chunk = f.read(resp['part_size'])
        if not chunk:
            break
        tmp = f'/tmp/part_{n}.bin'
        with open(tmp, 'wb') as pf:
            pf.write(chunk)
        result = subprocess.run(
            ['curl', '-s', '-X', 'PUT', part['upload_url'],
             '--upload-file', tmp, '-D', '-', '-o', '/dev/null'],
            capture_output=True, text=True
        )
        os.remove(tmp)
        etag = ''
        for line in result.stdout.splitlines():
            if line.lower().startswith('etag:'):
                etag = line.split(':', 1)[1].strip().strip('\r').strip('"')
                break
        if not etag:
            print(f'  ERROR: No ETag for part {n}. Re-run to resume.')
            sys.exit(1)
        etags.append({'part_number': n, 'etag': etag})
        done.add(n)
        with open(etags_file, 'w') as ef:
            json.dump(etags, ef)
        print(f'  Part {n}/{resp["num_parts"]} done')

print(f'\nAll parts uploaded. Completing...')
body = json.dumps({
    'team_id':   team_id,
    's3_key':    resp['s3_key'],
    'upload_id': resp['upload_id'],
    'parts':     etags,
})
result = subprocess.run(
    ['curl', '-s', '-X', 'POST',
     'https://79x0as8g44.execute-api.us-east-1.amazonaws.com/prod/complete-upload',
     '-H', 'Content-Type: application/json',
     '-H', 'x-api-key: qoXdZQNYbX1s4wnhmcBJG2APABlqjNVSao8CdM3j',
     '-d', body],
    capture_output=True, text=True
)
print(json.dumps(json.loads(result.stdout), indent=2))
os.remove(etags_file)
EOF

Step 5 β€” Submit for Evaluation

S3_KEY=$(python3 -c "import json; print(json.load(open('/tmp/upload_resp.json'))['s3_key'])")

curl -s -X POST \
  https://79x0as8g44.execute-api.us-east-1.amazonaws.com/prod/submit \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: qoXdZQNYbX1s4wnhmcBJG2APABlqjNVSao8CdM3j' \
  -d "{\"team_id\": \"AFM-xxxxxxxx\", \"s3_key\": \"$S3_KEY\"}" \
  | python3 -m json.tool

Save your submission_id from the response β€” share it with organizers if you face any issues.

Resubmission

To resubmit, repeat Steps 2–5 with your updated image. Evaluation takes ~90–100 minutes.


Timeline

πŸš€ Registration and submissions are now open! See Contact Details below for the portal link.
Date Milestone
April 20 Competition launches β€” rules + base Docker published
May 8 Registration opens
May 14 Submission portal opens - Submissions begin
May 30 Registration deadline (teams of 1–4)
June 15 June 17 Submissions close (23:59 AoE)
June 19 Final leaderboard announced
July 11 Top teams present at ICML 2026, Seoul

Prizes

Place Prize Presentation
πŸ₯‡ 1st $3,000 Oral + poster
πŸ₯ˆ 2nd $2,000 Oral + poster
πŸ₯‰ 3rd $1,000 Poster

Top 10 teams must open-source their code and model weights under a permissive license (BSD, MIT, Apache, etc.).


Who Can Participate


Contact