← back to blog

LLM-generated SQL is mostly correct — that's the problem

Why 'looks right and runs' is the most dangerous failure mode for AI-generated SQL, and what a validation layer needs to catch before the dashboard updates.

· #validation#llm#sql

The story goes like this. An analytics engineer is in a hurry. The marketing team wants weekly active users segmented by channel, and the existing query is wrong in a small way. They paste the schema into a chat window, paste the broken query, and ask the model to fix the join. Forty seconds later there is a beautifully formatted block of SQL. It parses. It runs. It returns numbers. The dashboard refreshes. Marketing is happy. The engineer closes the laptop.

The problem is that the numbers are wrong.

Not catastrophically wrong. Maybe 8% too high because the new join silently duplicated rows for users with multiple channel attributions. Maybe a LEFT JOIN quietly became an INNER JOIN and dropped the people the team most wanted to count. Maybe the date predicate uses CURRENT_DATE when it should have used the warehouse’s session timezone. The query compiles. The query runs. The query is wrong. And nothing in the toolchain has any opinion about that.

This is the failure mode that makes LLM-generated SQL dangerous as a primary tool rather than a supporting one. The errors are subtle, locally plausible, and globally costly. They are not the kind of bugs that throw exceptions. They are the kind of bugs that move quarterly numbers by a percent and a half without anyone noticing for two weeks.

”It compiled” is not “it’s correct”

A SQL compiler answers a specific question: is this syntactically valid and does it bind to objects that exist? That is a lower bar than most people realize. A query can pass that check while:

  • Joining on columns that share a name but represent different concepts.
  • Aggregating across a fan-out join, silently inflating sums and counts.
  • Filtering on a WHERE predicate that uses NULL semantics the author didn’t expect.
  • Using DISTINCT to paper over a join that should have been narrower.
  • Running against a view that was deprecated six months ago and now returns a truncated subset.
  • Casting a string to a date in a way that drops time-zone information.

None of these throw errors. None of these make the query “fail.” All of them produce wrong numbers, often very close to right numbers, which is the worst possible kind of wrong.

LLMs are very good at generating SQL that passes the compiler. That is what they have been trained to do. They have read every query on the public internet, and they have learned that real SQL looks confident and uses joins and produces output. They have not learned, because nothing about the loss function asked them to, whether the output matches the question.

What validation actually needs to mean

If LLMs are going to write SQL, then something else has to read it. Not in the sense of compiling it, but in the sense of asking the question a careful reviewer would ask. l0l1’s position is that this second pass should be a first-class layer, not an afterthought, and it should run on every query the model produces before any human signs off.

The checks that matter most are the ones that the database engine cannot do for you:

Schema awareness. A query that says SELECT email FROM users is fine only if users has an email column. The model thinks it does because users usually do. The validator checks the actual schema you handed it.

Fan-out detection. If two tables join on a non-unique key, the result set contains duplicates. The model rarely warns you. A validator that knows the cardinality of the join columns can flag this before the dashboard goes wrong.

Star-select hygiene. SELECT * from a wide table when only three columns are used downstream is a smell, not always a bug — but worth surfacing, especially when one of those columns turns out to be customer_email_raw.

Predicate sanity. A WHERE clause that filters on created_at > NOW() - INTERVAL 7 DAY might be exactly what the user meant. It might also be a silent timezone bug waiting to happen. Validators can ask for confirmation.

Aggregation discipline. Counting rows in a joined result is almost never what people want. Counting distinct entity IDs usually is. The shape of the query gives away which one the author tried to do; the LLM almost always picks the simpler one.

None of these are heroic checks. They are the things a careful senior would notice in a code review. The point of the validation layer is to do them mechanically, every time, before the model’s output becomes anyone’s production query.

Why people skip this step

Because it’s slow, mostly. Because the LLM took forty seconds and the validator takes another twenty, and that feels like the validator is the expensive part even though it’s the cheap part. Because the failure mode of skipping validation is silent and shows up later, and humans have very poor intuition for risks that don’t bite immediately.

There is also an honest concern that validation produces a lot of false positives, and after the fifth time you’ve been warned about a SELECT * that was perfectly fine, you stop reading the warnings. This is real. The answer is not to remove the validation; the answer is to tune it per workspace, learn which warnings the team has been ignoring for good reasons, and demote those. l0l1 keeps per-workspace pattern history specifically so the second pass gets quieter over time without going silent.

What “mostly correct” looks like in practice

We have looked at a lot of LLM-generated SQL. The error distribution is not uniform. The model is excellent at:

  • Choosing reasonable column names.
  • Producing a query shape that looks like other queries on similar topics.
  • Adding GROUP BY and ORDER BY clauses in plausible places.
  • Picking idiomatic functions for date arithmetic in whatever dialect the prompt implies.

The model is bad at:

  • Knowing your conventions. Your status = 'active' versus their is_active = TRUE.
  • Respecting your soft-delete logic. The model does not know that every query needs AND deleted_at IS NULL.
  • Honoring your row-level security. The model will happily query a table the current user can’t see.
  • Remembering that your revenue column is in cents, not dollars.

The model isn’t wrong because LLMs are bad at SQL. The model is wrong because the model doesn’t have your context. That’s not a model problem to solve. That is a problem for the layer between the model and the warehouse.

Treating the LLM as a junior

A useful way to frame it: the LLM is a junior who started Monday, has memorized every SQL textbook ever written, and has never seen your codebase. You would not let that junior ship to production unreviewed. You would not let that junior near a table called pii_pending_deletion. You would want their output to pass through someone — or something — that knows your conventions.

l0l1 is built to be that something. The model writes the draft. The validator reads it against your schema. The PII layer checks that no sensitive literal is about to leak into a model provider’s logs. The learning layer remembers which patterns your team has approved before, so the next draft is closer to the one you would have written by hand.

That doesn’t make the LLM less useful. It makes it usable for the things it’s genuinely good at — generating a first draft that’s seventy percent of the way there — without confusing that seventy percent for the whole job.

The next post in this series gets specific about what a validation layer should look like when you treat it as a real engineering surface rather than a warning popup. It’s not glamorous. It is, however, the part of the stack that quietly stops you from publishing wrong numbers.

Want this in your stack? Read the docs or browse more posts.