Monday, May 04, 2026

Going Psycho on Open WebUI: Benchmarking Qwen MoE vs. Dense Models

Sometimes, the best technical deep-dives start with a completely unrelated rabbit hole.

Lately, I’ve been digging through the Open WebUI documentation, laying the groundwork to start implementing custom tools for my local LLM environment. Around the same time, I was listening to Donald Spoto's biography of Alfred Hitchcock. That sparked an idea, which led me to an Academia.edu paper by Dineshika Muthushani titled Analysis on Psycho by Alfred Hitchcock Q & A.

I had Gemini ingest Muthushani's paper and convert it into a highly constrained, complex prompt. The goal? To use the 69-page, 32,000-word original script of Psycho to stress-test my local inference stack.

For those of you pushing the limits of your home lab configurations (looking at you, Andrew), here is a breakdown of what happens when you throw a massive movie script at local models, and how Open WebUI handles the heavy lifting.

The Gauntlet: The Prompt

The prompt Gemini generated was not a simple summarization task. It cast the LLM in the role of an expert in Clinical Psychology and Film Studies, demanding a six-part evaluation of Norman Bates.

The models were required to diagnose his disorder, cross-reference his behavior against modern clinical standards, evaluate the movie's psychiatric intervention, and apply two specific General Psychology modules (like Human Development or Memory) to the narrative. Furthermore, it strictly enforced a citation rule: the model could only use [1] inline citations when referencing the provided script, and had to suppress all raw XML tags in the output.
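
For flavor, here is a condensed sketch of the shape of that prompt. The section names and wording below are my paraphrase, not Gemini's verbatim output:

    ROLE: You are an expert in Clinical Psychology and Film Studies.
    SOURCE: Use ONLY the attached script of Psycho (1960).
    TASK: A six-part clinical evaluation of Norman Bates, covering:
      - a diagnosis of his disorder
      - his behavior cross-referenced against modern clinical standards
      - an evaluation of the film's psychiatric intervention
      - two General Psychology modules (e.g., Human Development, Memory)
        applied to the narrative
    CITATIONS: Use [1] inline citations ONLY when referencing the provided script.
    FORMATTING: Suppress all raw XML tags in the output.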

The Contenders & Inference Realities

I ran this test across two distinct architectures hosted locally via Podman and llama-server:

  • Qwen3.5-122B-A10B-UD-Q4_K_XL: A massive Mixture of Experts (MoE) model running at 4-bit quantization.

  • Qwen3.6-27B-Q8_0: A smaller, dense architecture running at a pristine 8-bit quantization.

The physical realities of memory bandwidth became immediately apparent. The 122B MoE model absolutely flew, generating text at over 22 tokens per second (t/s). Because it only activates roughly 10 billion parameters per token, it was highly efficient.

The dense 27B model, however, had to push all 27 billion parameters through the memory bus for every single word. It settled into a rock-solid, but noticeably slower, ~7.5 t/s. However, the real story wasn't the generation speed—it was how the frontend handled the 32,000-word file.
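
To see why the gap is this wide, a crude roofline estimate helps: when decoding is memory-bound, generation speed is capped at roughly memory bandwidth divided by the bytes of weights read per token. Here is a minimal sketch, with the bandwidth figure and bytes-per-parameter values assumed for illustration rather than measured on my hardware:

    # Crude roofline: decode-speed ceiling ~= bandwidth / bytes of weights per token.
    # The bandwidth and bytes-per-parameter figures are assumptions for illustration.
    BANDWIDTH_GBPS = 250.0  # hypothetical effective memory bandwidth (GB/s)

    def decode_ceiling(active_params_billions: float, bytes_per_param: float) -> float:
        """Upper bound on tokens/sec for a memory-bound model."""
        bytes_per_token = active_params_billions * 1e9 * bytes_per_param
        return BANDWIDTH_GBPS * 1e9 / bytes_per_token

    # Dense 27B at Q8_0 (~1.0 byte/param): every weight is read for every token.
    print(f"27B dense @ Q8_0: {decode_ceiling(27, 1.0):.1f} t/s ceiling")   # ~9.3
    # 122B MoE at Q4 (~0.57 bytes/param), but only ~10B params active per token.
    print(f"122B MoE  @ Q4:   {decode_ceiling(10, 0.57):.1f} t/s ceiling")  # ~43.9

Real-world numbers land below these ceilings once KV-cache reads, attention compute, and framework overhead are counted, but the relative ordering tracks the active bytes per token, which is exactly what the 22 vs. ~7.5 t/s results show.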

Evaluating Open WebUI Parameters

When dealing with a 45,000-token text file, the connection between your UI and your backend is everything. I tested three distinct document parsing settings in Open WebUI to see how they impacted the models:

1. Standard RAG (Chunking and Retrieval) By default, Open WebUI sliced the Psycho script into 1,000-character chunks (roughly 370 tokens each). Using hybrid search, it evaluated the prompt and sent only the top 7 chunks (about 2,600 tokens) to the model. It was fast, but the models lost the broader narrative context of the film, operating only on the severed pieces of dialogue the vector database deemed relevant.

2. Full Context Mode This setting forces the RAG pipeline to inject the entire document into the context window. This is where you hit the Time-to-First-Token (TTFT) wall. Dumping 45,000 tokens into the model caused a significant processing stall: it took over two minutes for the system to build the Key-Value (KV) cache in VRAM. Once that cache was built, however, subsequent queries against the script were virtually instantaneous, returning in a mere 190 milliseconds.

3. Bypass Embedding and Retrieval The nuclear option. This disables the vector database entirely and dumps the raw text directly into the prompt. The model had to ingest over 54,000 tokens (the script + system prompts) from scratch. This resulted in a grueling 6-minute TTFT delay as the transformer's attention mechanism scaled quadratically to map the text. But the result? The dense 27B model had flawless, uninterrupted access to the entire document, resulting in a phenomenal, deeply nuanced clinical evaluation.
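
To put rough numbers on both the stall and the cache it produces, here is a back-of-the-envelope sketch. The architecture values (layer count, KV heads, head dimension) are hypothetical placeholders rather than the real Qwen configs, which I haven't pulled from the GGUF metadata; the shape of the math is the point:

    # Back-of-the-envelope: KV cache footprint and prompt-processing stall.
    # Architecture numbers below are hypothetical placeholders, not Qwen's real config.
    n_layers, n_kv_heads, head_dim = 48, 8, 128  # assumed GQA layout
    bytes_per_elem = 2                           # fp16 KV cache

    def kv_cache_gb(n_tokens: int) -> float:
        # Keys + values (the leading 2x), per layer, per KV head, per token.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

    def ttft_minutes(n_tokens: int, prompt_tps: float) -> float:
        # At large contexts, TTFT is dominated by prompt processing.
        return n_tokens / prompt_tps / 60

    print(f"KV cache @ 54k tokens: {kv_cache_gb(54_000):.1f} GB")        # ~10.6 GB
    print(f"TTFT at 146 t/s pp:    {ttft_minutes(54_000, 146):.1f} min")  # ~6.2 min

The 146 t/s prompt-processing rate is the dense model's measured figure from the second half of this post; plug in your own numbers to predict the stall before committing to a six-minute wait.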

The Takeaway

If you are strictly evaluating a single, self-contained document, a dense model forced to read the entire file via "Bypass Embedding" provides incredible, consistent adherence to complex constraints. But if you value your time and require high-speed generation across massive context windows, a highly quantized MoE is still the undisputed king of the home lab.


The 122B MoE "Bypass" Revelation

After analyzing the dense 27B model, I had to know if the impressive speed of a Mixture of Experts (MoE) architecture came at the cost of analytical depth. Generating text rapidly is useless if the model hallucinates clinical data or fails strict formatting constraints.

To find out, I ran one final, definitive test: I fed the exact same 54,513-token Psycho script directly into the Qwen3.5-122B-A10B-UD-Q4_K_XL model with Open WebUI’s "Bypass Embedding and Retrieval" enabled.
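
For anyone who wants to reproduce the timing methodology, the sketch below shows roughly how TTFT and decode speed can be measured against llama-server's OpenAI-compatible streaming endpoint. The URL, file name, and token-counting shortcut are assumptions for your own setup, not a dump of my exact harness:

    # Measure TTFT and decode speed against llama-server's OpenAI-compatible API.
    # URL and file name are placeholders; adjust for your own stack.
    import json
    import time
    import requests

    URL = "http://localhost:8080/v1/chat/completions"
    payload = {
        "stream": True,
        "messages": [
            {"role": "user", "content": open("psycho_script_prompt.txt").read()},
        ],
    }

    start = time.time()
    first_token_at = None
    n_chunks = 0
    with requests.post(URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            # Server-sent events arrive as lines prefixed with "data: ".
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                if first_token_at is None:
                    first_token_at = time.time()
                n_chunks += 1  # counts stream chunks, a close proxy for tokens

    print(f"TTFT:   {first_token_at - start:.1f} s")
    print(f"Decode: {n_chunks / (time.time() - first_token_at):.2f} t/s")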

For those of us serving up local LLMs, the results of this test represent the holy grail of local inference:

1. The Ingestion Bottleneck Shrinks Dramatically When the dense 27B model ingested the 54k-token prompt, prompt processing crawled along at 146 t/s, resulting in a 6.2-minute Time-to-First-Token (TTFT) stall. The 122B MoE, however, chewed through the exact same prompt at a blistering 227.84 t/s and built the entire KV cache in just 3.9 minutes. Because only about 10 billion of its parameters fire per token, the MoE sidesteps most of the ingestion penalty that dense models pay (a quick sanity check of this arithmetic follows the list below).

2. Near-20 t/s Generation with 54k Tokens in VRAM Once it started generating, the 122B MoE sustained 19.52 tokens per second. It only had to push roughly 10 billion active parameters through the memory bus per token, letting it generate text roughly 2.6x faster than the 27B dense model while holding a 54,000-token KV cache in memory.

3. PhD-Level Qualitative Synthesis The real victory, however, was the output quality. Smaller models given massive texts often fall into the trap of "lazy synthesis"—summarizing the plot and awkwardly bolting the requested theories onto the end. The 122B MoE provided a masterclass in clinical evaluation:

  • Deep Critique: Under the Clinical Accuracy section, it didn't just point out that DID patients aren't usually violent. It correctly identified that the script itself (via the Dr. Simon character) clumsily conflates DID, psychosis, and transvestism.

  • Perfect Constraint Adherence: The model navigated the strict citation trap flawlessly. It used the [1] format exactly as requested, tying every citation to a direct quote from the script rather than scattering them randomly. Furthermore, it completely suppressed the raw XML tags it was instructed to avoid.

  • Zero Knowledge Bleed: Despite having a vastly larger reservoir of training data regarding Alfred Hitchcock and Psycho, the MoE aggressively tethered itself to the provided text. It resisted the urge to hallucinate visual scenes from the movie that weren't explicitly described in the script excerpts.
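
Before the verdict, the ingestion speedup from point 1 is easy to sanity-check from the measured prompt-processing rates:

    # Sanity-check the TTFT figures against the measured prompt-processing rates.
    prompt_tokens = 54_513
    for name, pp_tps in [("27B dense", 146.0), ("122B MoE", 227.84)]:
        print(f"{name}: {prompt_tokens / pp_tps / 60:.1f} min to first token")
    # -> 27B dense: 6.2 min; 122B MoE: 4.0 min (vs. the observed 3.9)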

The Final Verdict

If you have the memory capacity to host a heavily quantized MoE model, there is no contest. Bypassing the RAG vector database and feeding 54,000 tokens directly into the Qwen3.5-122B MoE doesn't just give you a massive speed boost. It delivers a rigorously compliant, clinically nuanced, and academically structured response that the smaller dense model, running at less than half the speed, simply cannot match.


Disclaimer: This post documents a personal benchmarking experiment with local LLMs (Open WebUI, Qwen3.5-122B MoE, Qwen3.6-27B). The original prompt was authored by Gemini using Dineshika Muthushani's Academia.edu paper as source material; the model outputs analyzed within are those of the local inference stack. Claude (an AI assistant by Anthropic) helped with editing and visual formatting of this write-up. All conclusions and observations are the author's own.
