Probably Raises $9M to Solve AI Hallucinations via Deterministic Validation

Saran K | June 16, 2026 | 8 min read

The Reliability Gap in Generative AI

For all the hype surrounding Large Language Models (LLMs), a fundamental limitation persists: they are probabilistic, not deterministic. When you ask a model like GPT-4 or Claude 3.5 to analyze a dataset, it isn’t calculating an answer in the traditional sense; it is predicting the next most likely token. This design is why AI hallucinations, the tendency for models to confidently assert falsehoods, remain a primary barrier to the widespread adoption of AI in medicine, accounting, and high-stakes data science.

While the industry has attempted to patch this with Retrieval-Augmented Generation (RAG) and human-in-the-loop reviews, these are mitigations layered around the model: RAG supplies reference material before generation, and human review catches errors afterward. Neither fixes the core instability of the model. This is the gap that Probably, a startup that recently secured $9 million in seed funding led by Andreessen Horowitz, intends to close. By implementing a rigorous “harness” around the AI, Probably is attempting to move LLM accuracy from the “mostly correct” range to a 99.99% reliability standard, mirroring the stability found in traditional software engineering.

The Core Problem: LLMs predict patterns, they don’t verify facts, leading to hallucinations.
The Solution: A deterministic validator system that acts as a factual “guardrail” for AI outputs.
The Business Impact: Higher reliability allows the use of smaller, cheaper models on local hardware, slashing token costs.
The Target: Precision-sensitive industries including financial auditing, medical research, and complex data analytics.

Breaking the Probabilistic Loop: How Deterministic Validation Works

To understand Probably’s approach, one must first understand the difference between a probabilistic system and a deterministic one. In a deterministic system—like a basic calculator—the input 2+2 will always result in 4. There is no chance of the calculator “guessing” that the answer might be 5 based on a pattern it saw in another dataset. LLMs, conversely, are probabilistic; they provide the most likely answer, which is often correct but occasionally catastrophically wrong.
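The contrast can be made concrete with a toy example. The sketch below is illustrative only: the token distribution is made up, and `sample_next_token` is a hypothetical stand-in for how a language model samples its next token, not a description of any real model.

```python
import random

def calculator(a: int, b: int) -> int:
    """Deterministic: the same inputs always give the same output."""
    return a + b

def sample_next_token(distribution: dict[str, float], rng: random.Random) -> str:
    """Probabilistic: draw one token according to its assigned probability."""
    tokens = list(distribution)
    weights = list(distribution.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up probabilities for the token that follows "2+2=".
next_token_probs = {"4": 0.97, "5": 0.02, "22": 0.01}

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_next_token(next_token_probs, rng) for _ in range(1000)]
wrong = sum(1 for token in samples if token != "4")

print(calculator(2, 2))                      # always 4
print(f"{wrong} of 1000 samples were not '4'")
```

The calculator never varies, while the sampler usually returns the most likely token and occasionally returns something else.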

Founder Peter Elias describes the company’s architecture as a “data science mech suit.” Rather than letting the LLM operate freely, Probably wraps the model in a deterministic validator. When the LLM generates a response based on a complex dataset, that response is not sent directly to the user. Instead, it is bounced back to a validator system that checks the answer against the raw, hard data. If the LLM’s summary contradicts the source data, the validator rejects it and forces the model to iterate until the output is factually aligned.
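The reject-and-retry loop described here might look roughly like the following. This is a minimal sketch under stated assumptions, not Probably's implementation: `validated_answer`, `stub_model`, and the feedback message are hypothetical names invented for illustration, and the "model" is a stub that gets the answer wrong on its first try.

```python
from typing import Callable, Optional

def validated_answer(
    generate: Callable[[str, Optional[str]], int],
    ground_truth: Callable[[], int],
    question: str,
    max_attempts: int = 3,
) -> int:
    """Release a model answer only once it matches the value recomputed from the data."""
    expected = ground_truth()  # deterministic: same data, same result
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        claimed = generate(question, feedback)
        if claimed == expected:
            return claimed  # validated, safe to show the user
        # Reject the answer and tell the model why before it tries again.
        feedback = f"attempt {attempt}: claimed {claimed}, which contradicts the source data"
    raise ValueError(f"no validated answer after {max_attempts} attempts")

sales = [120, 80, 100]

def stub_model(question: str, feedback: Optional[str]) -> int:
    # Hypothetical stand-in for an LLM call: wrong at first, corrected after feedback.
    return 310 if feedback is None else 300

result = validated_answer(stub_model, lambda: sum(sales), "What were total sales?")
print(result)
```

The design point is that the user never sees an unvalidated answer: the function either returns a value that matches the data or raises an error.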

The Role of Harness Engineering

The most provocative claim from the Probably team is that harness engineering, the design of the environment and constraints around the AI, is more important than the model itself. Elias argues that if the harness is sufficiently refined, the underlying model can actually be less powerful. This is a significant departure from the current industry focus on “scaling laws,” which hold that performance improves predictably as models grow in parameters, training data, and compute.

By reducing ambiguity through a strict validation harness, Probably has successfully deployed a tool that runs on models “four classes weaker than frontier models.” This has a direct impact on the bottom line. Because these smaller models require fewer computational resources, they can be run on local desktop hardware rather than expensive cloud-based data centers, drastically reducing the token costs that have made many enterprise AI deployments prohibitively expensive.

Practical Implications: What This Means for Enterprise AI

The shift toward deterministic validation marks a transition from “AI as a creative assistant” to “AI as a reliable utility.” For the average business, this changes the risk profile of integrating AI into core workflows.

Reduction in ‘Human-in-the-Loop’ Fatigue

Currently, most companies employing AI for data analysis require a human expert to audit every single output. This creates a bottleneck that negates much of the efficiency gained by using AI. If a system can guarantee 99.99% accuracy through an automated audit trail and citations, the human role shifts from verification to strategy.

Lowering the Barrier to Entry for Specialized AI

Because Probably’s approach allows for the use of smaller models, it opens the door for highly specialized, on-premise AI. In sectors like healthcare or defense, where data privacy prevents the use of cloud-based LLMs, the ability to run a reliable, small-scale model locally is a critical requirement. This removes the “privacy vs. power” trade-off that currently plagues these sectors.

The Economic Incentive Conflict

One of the more critical insights shared by Elias is the systemic reason why the major AI labs—OpenAI, Google, and Anthropic—may not be prioritizing this type of deterministic rigor. The current business model of AI labs is heavily reliant on token consumption. A model that is slightly unreliable requires the user to prompt it multiple times, refine the query, and regenerate the answer—all of which consume more tokens and increase revenue for the provider.

“I think it’s really interesting that the big AI labs have not even attempted to do this,” says Elias. “They’re incentivized not to, because they make money the more times you have to correct the model.”

This creates a market opportunity for “middleware” companies like Probably. By positioning themselves as the reliability layer, they aren’t competing with the frontier models but are instead making those models viable for a class of users who cannot afford a 5% error rate.

Technical Breakdown: The Audit Trail and Citations

Probably’s first product is focused on data science, where the primary goal is to extract insights from massive, often messy, datasets. To ensure trust, the system generates a transparent audit trail for every answer. This isn’t just a list of sources; it is a step-by-step record of how the AI arrived at its conclusion, which is then verified by the deterministic layer.
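One way to picture a verifiable audit trail is as a list of steps, each naming an operation that the deterministic layer can re-run on the source data. The sketch below is an assumption about how such a record could be structured; `AuditStep`, `OPERATIONS`, and `verify_trail` are hypothetical names, and the article does not describe Probably's actual format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Whitelist of operations the verifier knows how to re-run (illustrative only).
OPERATIONS: Dict[str, Callable[[List[int]], int]] = {
    "count": len,
    "sum": sum,
    "max": max,
}

@dataclass(frozen=True)
class AuditStep:
    description: str  # human-readable explanation of the step
    operation: str    # key into OPERATIONS
    column: str       # which column of the source data the step used
    claimed: int      # the value the model reported for this step

def verify_trail(data: Dict[str, List[int]], trail: List[AuditStep]) -> List[bool]:
    """Re-execute every step against the raw data and compare with the claim."""
    results = []
    for step in trail:
        actual = OPERATIONS[step.operation](data[step.column])
        results.append(actual == step.claimed)
    return results

data = {"revenue": [120, 80, 100]}
trail = [
    AuditStep("Count revenue rows", "count", "revenue", 3),
    AuditStep("Total revenue", "sum", "revenue", 300),
    AuditStep("Largest single entry", "max", "revenue", 150),  # deliberately wrong claim
]
checks = verify_trail(data, trail)
print(checks)
```

Because each step is re-executed rather than merely read, a wrong intermediate value is flagged at the step where it occurs instead of surfacing only in the final answer.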

Feature	Standard LLM Approach	Probably’s Approach
Output Nature	Probabilistic (Best Guess)	Deterministic (Verified)
Hardware	Cloud Data Centers (H100s)	Local Hardware / Desktop
Cost Structure	High Token Consumption	Optimized / Low Token Cost
Verification	Manual Human Review	Automated Validator + Audit Trail
Reliability	Variable (hallucinations common)	Targeting 99.99% accuracy

Expanding Beyond Data Science

While the initial rollout focuses on data science tools, the architecture is designed for any “precision-sensitive use case.” The company has identified accounting and medical services as the next logical frontiers. In accounting, where a single misplaced decimal can lead to regulatory failure, the probabilistic nature of current AI is a non-starter. In medical diagnostics, hallucinations can be life-threatening.

By decoupling the reasoning (handled by the LLM) from the verification (handled by the deterministic harness), Probably creates a template that can be adapted to any industry where the cost of being wrong is higher than the benefit of being fast.

Frequently Asked Questions

What exactly are AI hallucinations?

AI hallucinations occur when a Large Language Model generates text that is grammatically correct and confident but factually incorrect or nonsensical. This happens because LLMs predict the next likely word based on patterns rather than accessing a database of verified facts.

How does a deterministic validator differ from a standard AI filter?

A standard filter often uses another AI to check if the first AI’s answer “looks” correct. A deterministic validator, however, uses hard logic and direct data comparison. It checks the AI’s output against the actual raw data; if the numbers don’t match exactly, the answer is rejected regardless of how confident the AI sounds.
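As a rough illustration of "hard logic and direct data comparison", the sketch below pulls every number out of a generated summary and accepts the summary only if each one exactly equals a value recomputed from the raw data. The function names and the regular expression are assumptions made for this example, and `Decimal` is used so that the comparison is exact rather than subject to floating-point rounding.

```python
import re
from decimal import Decimal

def numbers_in(text: str) -> list[Decimal]:
    """Extract plain numbers such as 3 or 1,250.50 from free text."""
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return [Decimal(m.replace(",", "")) for m in matches]

def accept(summary: str, verified: set[Decimal]) -> bool:
    """Accept only if every number in the summary is a verified value."""
    return all(n in verified for n in numbers_in(summary))

# Raw data and the values recomputed from it by ordinary code.
invoices = [Decimal("500.25"), Decimal("700.25"), Decimal("50.00")]
total = sum(invoices, Decimal("0"))
verified = {total, Decimal(len(invoices))}

good = accept("Total revenue was 1,250.50 across 3 invoices.", verified)
bad = accept("Total revenue was 1,250.05 across 3 invoices.", verified)
print(good, bad)
```

The second summary reads just as confidently as the first, but its transposed digits do not match the recomputed total, so it is rejected.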

Why does this approach allow for smaller AI models?

When a model is wrapped in a high-quality harness, the harness handles the “guardrails” and the strict formatting. This reduces the amount of “reasoning weight” the model needs to carry. Essentially, the harness does the hard work of keeping the model on track, allowing a smaller, more efficient model to perform as well as a giant one.

Will this replace the need for human data scientists?

No. Instead, it changes their role. Rather than spending hours manually verifying that an AI didn’t make up a number in a summary, data scientists can focus on interpreting the verified data and making strategic decisions.

Is this technology available for public use yet?

Probably is currently in its early stages following the $9 million seed round. The focus is on refining the data science tool before expanding into other precision-sensitive verticals like accounting and healthcare.

The Path Toward Verifiable Intelligence

The venture capital interest from Andreessen Horowitz suggests a growing realization in Silicon Valley: the “bigger is better” era of AI is hitting a wall of diminishing returns. For AI to move from a novelty tool to a core piece of infrastructure, it must be predictable. The transition from probabilistic outputs to deterministic verification is not just a technical upgrade; it is a fundamental shift in how we define machine intelligence.

By prioritizing the harness over the model, Probably is betting that the future of AI isn’t just about smarter models, but about smarter ways to keep those models honest. For enterprises currently hesitant to fully integrate AI due to reliability concerns, this approach provides a potential roadmap toward a truly trustworthy digital workforce.

