To answer that question fairly, I ran the same eval suite twice on the same RTX 5090, same llama.cpp build, same Q5_K_M quant: once on Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash (the distill), and once on Qwen/Qwen3.5-9B (the official post-trained model that the distill was built from). Same parameter count, same architecture, same chat template family, same prompts. Two clear wins for the distill emerge: (1) it thinks tighter on long-horizon agentic tasks — the base hits the 8K thinking cap on 3 of 5 prompts, the distill finishes all 5 cleanly. (2) On open-ended creative front-end prompts, the distill ships output that punches above its parameter count — the base produces credible work on templated dashboards but struggles visibly on the designer portfolio and the mobile app marketing prompt: animation issues, overlay bugs, visually flatter design language. Tool-calling and throughput tie. Disk and VRAM are the same. There is no tax I could find.
Reasoning distills usually carry a tax: you transplant a frontier model's chain-of-thought style into a smaller base, and you can buy reasoning polish at the cost of throughput, instruction-following, or quality at the seams. The right way to look for that tax is to compare against the same base the distill came from, at the same parameter count, on the same hardware. Comparing a 9B distill against a 27B abliterated finetune is the kind of comparison that gets you applause but not insight. This one is meant to surface signal.
Both runs used the same Hermes-style harness: agentic prompts at an 8K cap with thinking on, design prompts at a 32K cap, and real OpenAI-style tool definitions ([...]) for the tool-calling section. Both at Q5_K_M, both with FP16 KV cache, both at 40K context. The only thing that differs is the model file.
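For the curious, here's roughly the shape of one harness call against llama.cpp's OpenAI-compatible server. This is a minimal sketch, not the actual harness: the port, model alias, and helper name are illustrative assumptions.

```python
# Sketch of one eval run against a local llama-server endpoint.
# Assumed server launch (illustrative, not the exact command used for these runs):
#   llama-server -m model.Q5_K_M.gguf -c 40960 --flash-attn on --jinja
from openai import OpenAI

# llama-server's default port; any api_key string works for a local server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def run_prompt(prompt: str, max_tokens: int, tools: list | None = None):
    """One run: 8K cap for agentic prompts, 32K for design, thinking always on."""
    kwargs = {"tools": tools} if tools else {}   # OpenAI-style tool defs for the tool section
    return client.chat.completions.create(
        model="local",   # llama-server serves a single model; the alias is arbitrary
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        **kwargs,
    )
```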
5 prompts, 8K cap each. An 8K cap hit means the model spent its entire token budget on thinking and never reached a final answer.
| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Notes |
|---|---|---|---|---|---|
| multi_step_planning | 8000 | 54.9 | 2899 | 20.3 | Base looped on the same plan; distill committed. |
| self_critique | 8000 | 55.0 | 1969 | 13.8 | Base regenerated the rewrite repeatedly. |
| structured_extraction | 8000 | 55.0 | 4353 | 30.5 | Base second-guessed JSON shapes; distill emitted it. |
| code_debug | 6386 | 43.7 | 3170 | 22.1 | Both found the bugs; distill in half the tokens. |
| tool_use_json | 756 | 5.3 | 1415 | 10.0 | The one prompt where the base was tighter. |
| Totals | 31,142 | 213.9 s | 13,806 | 96.7 s | Distill: 2.2× faster wall time, < half the tokens, zero cap hits. |
Three of the five agentic prompts hit the 8K thinking cap on the base model. None did on the distill. That's not a quality preference — that's the base model running out of budget before producing a final answer at all, in 60% of the agentic suite. The DeepSeek-V4 distill teaches a more decisive thinking style: less "wait, let me reconsider" padding, faster commit to a working answer.
One honest counter-example: tool_use_json went the other direction — base 756 tokens, distill 1,415. So the distill isn't categorically tighter; it's tighter on the open-ended tasks where the base spirals. On a prompt with a clear single-shape answer, both produce concise output.
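For the record, "cap hit" is a mechanical check, not a judgment call. A minimal sketch of how the flag falls out of the response, assuming the raw `<think>` tags survive into the message content and the server reports truncation via `finish_reason`:

```python
def hit_thinking_cap(resp) -> bool:
    """True if generation stopped at max_tokens while still inside an open <think> block."""
    choice = resp.choices[0]
    text = choice.message.content or ""
    truncated = choice.finish_reason == "length"            # cut off at the token cap
    still_thinking = "<think>" in text and "</think>" not in text
    return truncated and still_thinking
```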
5 prompts, 32K cap. Both models produced valid self-contained HTML/CSS/JS. The numbers in the table look close — but the actual eye review of side-by-side outputs tells a sharper story than the token counts suggest. Open the A/B buttons on the design cards from the overview page and judge for yourself; this section captures what I saw.
| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Eye-review verdict |
|---|---|---|---|---|---|
| saas_landing | 9,849 | 67.9 | 15,347 | 109.1 | Distill richer; base credible. |
| analytics_dashboard | 13,187 | 91.4 | 13,032 | 92.6 | Close — both clean templated dashboards. |
| designer_portfolio | 5,930 | 40.7 | 6,213 | 43.7 | Distill clearly wins. Base flatter, less polish. |
| pricing_page | 9,503 | 65.4 | 8,367 | 58.9 | Close — pricing page is a templated form. |
| mobile_app_marketing | 32,000 | 228.3 | 10,161 | 71.7 | Distill clearly wins. Base hit cap; output has animation + overlay bugs. |
| Totals | 70,469 | 493.8 s | 53,120 | 376.0 s | Distill: 24% fewer tokens, 24% less wall time, better creative output. |
The two prompts where the base holds its own — analytics_dashboard and pricing_page — are well-templated forms. Dashboards have a known shape: sidebar, KPI cards, chart, table. Pricing pages have three tiers and a toggle. The base 9B has clearly seen enough of those that it produces clean output without needing creative judgment.
The two prompts where the base falls short — designer_portfolio and mobile_app_marketing — ask the model to make taste-driven decisions. What does "kinetic typography" look like rendered? How do you draw a CSS-only iPhone with screen content inside it that doesn't look like a kid's homework? On both, the base produces something visually flat with overlay glitches and unfinished animation states. The distill produces output that, side-by-side, looks like it came out of a much larger model. One-shot, same prompt, same hardware.
The mobile app marketing case is the most striking. The base burned all 32K tokens (228 seconds), and the output it landed on still has overlapping z-index issues and broken hover transitions. The distill rendered a cohesive Stillwater landing page with a clean CSS phone mock in 10K tokens / 72 seconds: roughly a third of the budget, with the polish a much bigger model would produce.
It's the same pattern as agentic: when the task has a clear shape, both models do well. When the task is open-ended and requires committing to an aesthetic decision, the base dithers and the distill commits. The DeepSeek-V4 distill is teaching the model to have taste, not just to follow templates.
Both models scored 5 PASS, 1 PARTIAL on the same six prompts. Both went PARTIAL on the same prompt (multi_tool_sequence) for the same reason: the Hermes one-tool-per-turn protocol expects tool results back before continuing. Tool-calling is unchanged by the distill.
| Slug | Base | Distill |
|---|---|---|
| single_tool_simple | PASS | PASS |
| tool_selection | PASS | PASS |
| multi_tool_sequence | PARTIAL | PARTIAL |
| no_tool_needed | PASS | PASS |
| complex_args | PASS | PASS |
| structured_email | PASS | PASS |
This is a useful negative result: the distill doesn't break tool calling. A common failure mode of reasoning distills is degraded instruction-following and structured output; that didn't happen here. The model still emits correctly shaped `<tool_call>` blocks when asked.
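To make "right shapes" concrete, the PASS/PARTIAL verdicts boil down to a check like the one below. The `get_weather` tool is a stand-in for illustration, not one of the six prompts' actual tools; with `--jinja`, llama.cpp parses Hermes `<tool_call>` blocks into OpenAI-style `tool_calls` on the response.

```python
import json

# Representative OpenAI-style tool definition (illustrative, not from the suite).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def check_tool_call(resp, expected_name: str) -> bool:
    """PASS if the model emitted a call with the expected name and parseable JSON arguments."""
    calls = resp.choices[0].message.tool_calls or []
    if not calls:
        return False
    call = calls[0]
    try:
        args = json.loads(call.function.arguments)
    except json.JSONDecodeError:
        return False
    return call.function.name == expected_name and isinstance(args, dict)
```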
Same engine, same architecture, same quant — speed comes out the same. Both averaged 142–146 tok/s flat across all 16 main runs, standard deviation under 1 tok/s on each. The distill costs nothing in throughput.
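As a back-of-the-envelope check, the agentic totals line up with that: 31,142 tokens / 213.9 s ≈ 146 tok/s for the base, 13,806 / 96.7 s ≈ 143 tok/s for the distill.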
Aggregating reasoning trace size across the agentic suite tells a clean story:
Total `<think>` output across the 5 prompts: the base writes 2.6× more thinking for the same problems. Skim a base agentic .thinking.txt next to a distill one (open agentic_*.txt from the index) and you can see the difference: the base often re-states the problem two or three times before working on it, double-checks intermediate steps that don't need it, and runs simulations of edge cases that won't fire. The distill goes more directly: state, plan, execute, verify, done.
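The 2.6× figure compares the size of the `<think>` spans, not the full responses. A minimal sketch of that extraction, assuming the raw tags survive into the saved transcripts:

```python
import re

def thinking_span(raw: str) -> str:
    """Return the <think> block, or everything after <think> if it never closed (a cap hit)."""
    m = re.search(r"<think>(.*?)(?:</think>|$)", raw, flags=re.DOTALL)
    return m.group(1) if m else ""
```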
| Setting | Value |
|---|---|
| Distill | Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash · Q5_K_M (locally quantized, 6.1 GB) |
| Base | Qwen/Qwen3.5-9B · Q5_K_M from bartowski (6.4 GB) |
| Runtime | llama.cpp cuda-12.8 (b8708), --flash-attn on, --jinja, embedded chat templates |
| Context | 40,960 tokens, FP16 KV cache |
| Hardware | RTX 5090 (32 GB), Intel Core Ultra 7 265K, 125 GB RAM |
| Thinking | Always on for both models |
The DeepSeek-V4-Flash distill is a real upgrade over plain Qwen 3.5 9B in two specific places: long-horizon thinking (agentic / planning / self-critique / structured extraction) and creative one-shot front-end design. Tool-calling, throughput, disk, VRAM are all the same. Templated UI work is a wash. There's no tax I could find on the things the base is already good at.
If you were running base Qwen 3.5 9B as a local default, this distill is a free upgrade in the literal sense — same size, same speed, more reliable on the prompts where the base spirals, and visibly better one-shot output when the prompt asks for taste.