Zero Parse Failures at Scale: Structured Output with Local LLMs
We validated Ollama's grammar-constrained structured output at scale — 557 consecutive calls, zero parse failures. On-premises AI pipelines now get the same reliability guarantees as cloud APIs.
Healthcare, legal, and finance clients need AI that processes sensitive data without that data leaving their environment. Cloud APIs are off the table. That means on-premises inference. That means rebuilding, from scratch, the reliability and developer-experience guarantees that come free with managed APIs.
This post documents one piece of that work: making structured output from a locally-hosted LLM as reliable as calling OpenAI’s API directly.
The Problem
We run an on-premises inference server for AI R&D — a dedicated machine for benchmarking local model performance, validating architectures before client recommendations, and building pipelines that operate entirely within a client’s firewall. One active workstream: a voice content processing pipeline that classifies podcast transcript segments, detects speaker changes, and filters non-dialogue content like audience laughter.
The pipeline runs hundreds of LLM calls per batch. The question was binary: does Ollama’s grammar-level schema enforcement guarantee zero parse failures at scale, or do edge cases slip through under production conditions? The answer matters beyond this pipeline. It is a prerequisite for any on-premises AI deployment where downstream code must trust the LLM’s output.
What We Built
We migrated from raw requests calls against Ollama's /api/generate endpoint to the openai Python SDK pointed at Ollama's OpenAI-compatible /v1/ endpoint. This unlocked the SDK's structured outputs interface.
Instead of prompting for JSON and parsing text, we define a Pydantic model and pass it as the response schema:
from pydantic import BaseModel
from openai import OpenAI

# Point the OpenAI client at the local Ollama server. The api_key is
# required by the SDK but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

class TranscriptClassification(BaseModel):
    is_monologue: bool
    speaker_count: int
    confidence_score: int
    contains_audience_reaction: bool

response = client.beta.chat.completions.parse(
    model="qwen3:8b",
    messages=[{"role": "user", "content": segment}],
    response_format=TranscriptClassification,
)
result = response.choices[0].message.parsed  # a TranscriptClassification instance
Ollama uses grammar-constrained generation under the hood. The model can only produce tokens that conform to the schema. Not approximately. Not usually. A hard constraint at the token level.
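To make that guarantee concrete, here is a stdlib-only sketch of the contract the grammar enforces for the schema above. The names SCHEMA_FIELDS and conforms are illustrative, not part of Ollama or the pipeline; the point is that every response the constrained decoder can emit would pass this check, so downstream code never needs a fallback parse path.

```python
import json

# The field/type contract that a Pydantic model like TranscriptClassification
# expresses. Ollama derives its token-level grammar from a schema of this
# shape, so generation cannot leave it.
SCHEMA_FIELDS = {
    "is_monologue": bool,
    "speaker_count": int,
    "confidence_score": int,
    "contains_audience_reaction": bool,
}

def conforms(raw: str) -> bool:
    """Check a raw model response against the field/type contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if set(data) != set(SCHEMA_FIELDS):
        return False
    # isinstance(True, int) is True in Python, so exclude bools from int fields.
    return all(
        isinstance(data[k], t) and not (t is int and isinstance(data[k], bool))
        for k, t in SCHEMA_FIELDS.items()
    )

sample = ('{"is_monologue": true, "speaker_count": 1, '
          '"confidence_score": 5, "contains_audience_reaction": false}')
print(conforms(sample))                            # True
print(conforms("Sure! Here is the JSON: {..."))    # False: prompt-based parsing fails here
```

With prompt-based JSON, the second case is exactly the failure mode you spend your time defending against.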
Alongside this, we built a 15-case eval harness to stress-test classification logic: clean monologues, sitcom [laughter] tags, mid-sentence cuts, and multi-speaker exchanges. Two prompt variants. Pass rate measured against typed assertions.
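The harness itself is simple. A minimal sketch of its shape, where the case names, segments, and stub classifier are invented for illustration (the real suite has 15 cases and calls the model):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    segment: str
    expect_monologue: bool

# Illustrative cases mirroring the categories named above.
CASES = [
    EvalCase("clean_monologue", "So I was thinking about this the other day...", True),
    EvalCase("laughter_tag", "And then he just left! [laughter] Anyway...", True),
    EvalCase("two_speakers", "- Did you see it?\n- I did, last night.", False),
]

def run_eval(classify: Callable[[str], bool]) -> float:
    """Run every case through the classifier and return the pass rate."""
    passed = sum(classify(c.segment) == c.expect_monologue for c in CASES)
    return passed / len(CASES)

rate = run_eval(lambda seg: True)  # stub classifier: always predicts monologue
print(f"pass rate: {rate:.0%}")    # 2 of 3 cases pass
```

Because the outputs are typed, the assertions compare booleans and integers, not substrings of free text.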
What We Found
557 consecutive production calls. Zero parse failures.
That figure covers a full run of the eval harness plus a complete production batch. Not a cherry-picked result.
The eval work exposed a useful pattern: the baseline prompt misclassified segments containing [laughter] annotations as multi-speaker dialogue (0/4 passing). One explicit clarification brought it to 4/4. Multi-speaker detection improved from 0/5 to 5/5 by naming bare dash dialogue markers as a rejection criterion. Schema enforcement handles structure. Prompt quality drives classification accuracy.
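Both fixes were one explicit rule each. A hypothetical reconstruction, since the exact production wording is not reproduced here:

```python
# Hypothetical prompt wording: the shape of the fix, not the production text.
BASELINE_PROMPT = (
    "Classify this transcript segment. Decide whether it is a monologue "
    "and how many speakers are present."
)

CLARIFIED_PROMPT = BASELINE_PROMPT + (
    " Bracketed annotations such as [laughter] or [applause] describe "
    "audience reactions, not additional speakers."
    " Treat lines beginning with a bare dash (- ) as dialogue turns from "
    "distinct speakers."
)
```

The eval harness makes this a measurable loop: add one rule, re-run the suite, compare pass rates per category.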
The confidence-score threshold needed one calibration pass — adjusted from ≤3 to ≤4 after observing consistently borderline scores on genuinely ambiguous embedded dialogue.
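As a sketch, assuming low scores mean low confidence and trigger rejection (the field name matches the schema above; the gate logic is illustrative):

```python
REJECT_THRESHOLD = 4  # raised from 3 after reviewing borderline scoring

def should_reject(confidence_score: int) -> bool:
    """Flag segments the model itself scored as low-confidence."""
    return confidence_score <= REJECT_THRESHOLD

print(should_reject(4))  # True: borderline embedded dialogue is now caught
print(should_reject(5))  # False
```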
Why It Matters
The Pydantic + Ollama /v1/ pattern is now our default for any on-premises inference task that produces structured data. It eliminates an entire class of runtime errors. Pipelines become deterministic in a way that prompt-based parsing cannot match.
Regulated industries. Healthcare, legal, and financial services clients need AI that operates entirely within their own infrastructure. This architecture delivers the same reliability as a managed cloud API — zero parse failures, typed outputs, schema validation — with zero data leaving the building. For a HIPAA-covered entity, that is not a nice-to-have. It is the only viable path.
On-premises AI infrastructure. This is part of a larger research program into production-grade local inference. Knowing that grammar-constrained generation is reliable at hundreds of calls is a building block. We can architect pipelines for clients where the LLM layer is on-prem without adding a reliability tax.
Cost economics. At scale, the break-even math between on-premises GPU inference and cloud API costs shifts decisively. Validating that local inference is reliable enough for production workloads is the prerequisite for making that argument with confidence.
If you are building AI pipelines where data governance matters and you are still parsing text responses from a local model, this is the upgrade path.
Ready to build something that works?
Every engagement starts with a free 30-minute consultation. Let's talk about your project.
Start the Conversation