Your Model Hit 99% Accuracy. Then You Shipped It.
March 2, 2026
Your training loop finished. Loss curve looked clean. Accuracy: 99.3%. You saved the model, wrote the deployment script, maybe even sent a message to your team.
Then production happened.
58%. Sometimes lower. Users complaining. You re-run training. 99% again. You check for bugs. Nothing. The model is not broken. The model never saw the real world.
The dataset was broken before you touched it.
The Problem Is Not Your Model
When a model performs perfectly in training and falls apart in production, the instinct is to blame the architecture. Add more layers. Tune the learning rate. Try a different optimizer.
But the model learned exactly what you showed it. That is the problem.
Dataset filtering is the part of AI development nobody talks about seriously. It sounds like data cleaning. Like a chore before the real work starts. It is not. It is the most consequential decision in your entire pipeline, and most developers make it wrong, or make it without realizing they made it at all.
What Filtering Actually Does
When you filter a dataset, you are not just removing rows. You are making a claim about what the real world looks like.
Every row you drop is a decision: this does not represent the distribution my model will face. Every row you keep is a decision: this does. Most of the time, developers make these decisions out of convenience, not by reasoning about that distribution.
You remove nulls because nulls cause errors. You remove outliers because they mess up the loss curve. You remove duplicates because that is what the tutorial said to do. None of these are wrong. But none of them are the same as asking: what does production traffic actually look like?
If production has nulls, and you trained on a null-free dataset, your model has never learned what to do with missing information. It will guess. The guess will be confident. It will be wrong.
Where It Breaks
Here is the specific failure. You have a dataset of 100,000 rows, and 40,000 of them have missing values in a key feature column. You drop them. You now have 60,000 clean rows. Training accuracy: 99%.
But the reason those 40,000 rows had missing values is not random. It is because that feature is hard to collect in certain conditions. Certain user types. Certain geographies. Certain edge cases. The rows you dropped are not noise. They are a class of real-world inputs your model will encounter every day and has never seen.
This is overfitting to a filtered distribution. You trained on the data that survived your cleaning process. Production sends you everything.
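The gap is easy to demonstrate with a toy sketch (synthetic data, scikit-learn assumed available; all names here are made up for illustration): the feature goes missing far more often for one slice of inputs, training happens only on the survivors, and then production scores the model on everything.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Feature 0 is "hard to collect" for one slice of the real world:
# rows with strongly negative x0 go missing far more often (not at random).
missing = rng.random(n) < np.where(x[:, 0] < -0.5, 0.9, 0.05)

# "Clean" training set: drop every row with a missing value.
train_x, train_y = x[~missing], y[~missing]
model = LogisticRegression().fit(train_x, train_y)
clean_acc = model.score(train_x, train_y)   # looks excellent

# Production sends everything; missing values get imputed with the
# training mean, which was computed from the survivors only.
prod_x = x.copy()
prod_x[missing, 0] = train_x[:, 0].mean()
prod_acc = model.score(prod_x, y)           # noticeably worse
```

The model itself is fine; the survivors simply never contained the slice of inputs that production sends every day.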
The 58% accuracy is not a failure of the model. It is an accurate reflection of how well your model understands the actual problem.
The Other Way It Breaks
Filtering too little is also a failure, just a slower one.
You keep everything. Training data includes mislabeled rows, data entry errors, rows that belong to a completely different distribution because someone ran a test campaign six months ago and the data never got separated. The model trains on noise as if it were signal.
Accuracy looks mediocre in training. You tune hyperparameters trying to fix it. The actual fix was three filters you never ran.
This is underfitting by contamination. The model is not too simple for the problem. The problem you gave it is not the problem you meant to give it.
The Mental Model
Think of your dataset as a contract.
Every filter you apply is a clause in that contract: I am telling this model that inputs matching this description do not exist or do not matter. Before you add a clause, ask whether you can honor it in production.
Dropping nulls? Can you guarantee production will never send a null? If not, either handle nulls explicitly in preprocessing or keep some null-containing rows in training so the model learns behavior for that case.
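One way to honor that clause, sketched with scikit-learn's `SimpleImputer`: impute instead of drop, and set `add_indicator=True` so each feature that had missing values also gets a was-missing flag column. No rows are lost, and the model gets an explicit signal it can learn from.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: two features, two rows with a missing value.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
])

# Mean-impute each column and append a missing-indicator column
# for every feature that contained nulls.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# X_out keeps all three rows: imputed values plus indicator columns,
# so "this value was missing" is information the model can see.
```

The same idea works with a plain fill-and-flag step in pandas; the point is that missingness becomes a feature rather than a deleted row.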
Dropping outliers? Are they errors, or are they rare real events your model needs to handle correctly? Fraud detection built on outlier-free training data is a fraud detection model that has never seen fraud.
Forcing the dataset into a clean, balanced split? Did you verify that the real-world distribution matches that balance, or did you create an artificial world your model will never actually live in?
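Validating the balance clause can be as small as comparing class frequencies between the training split and a sample of real traffic. A minimal sketch (the labels and counts here are invented for illustration):

```python
from collections import Counter

train_labels = ["fraud"] * 500 + ["ok"] * 500   # artificially balanced split
prod_sample = ["fraud"] * 12 + ["ok"] * 988     # what traffic actually sends

def class_rates(labels):
    """Return each class as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

train_rates = class_rates(train_labels)
prod_rates = class_rates(prod_sample)

# A large gap means the balanced split is a claim production will not honor.
gap = abs(train_rates["fraud"] - prod_rates["fraud"])
```

Here the gap is nearly 0.5: the model was trained in a world where half of everything is fraud, and it will be scored in a world where almost nothing is.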
The filter is a claim. Validate the claim before you make it.
Before You Filter Anything
Look at what you are about to remove. Not the count. The content.
If you are dropping 40% of your dataset, that is not cleaning. That is a signal. That 40% is telling you something about the problem that the other 60% is not. Understand it before you delete it.
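Inspecting the content can be two lines of pandas: split the frame into the slice you are about to drop and the slice you are keeping, then compare how they are composed. A sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, None, 2.0, None, 3.0, None],
    "region": ["US", "APAC", "US", "APAC", "US", "APAC"],
})

# Look at the rows you are about to delete, not just how many there are.
dropped = df[df["feature"].isna()]
kept = df[df["feature"].notna()]

# If the dropped slice clusters in one segment, the nulls are not random
# noise; they are a class of inputs you are about to erase.
dropped_regions = dropped["region"].value_counts(normalize=True)
kept_regions = kept["region"].value_counts(normalize=True)
```

In this toy frame every dropped row is APAC and every kept row is US: dropping the nulls silently deletes an entire geography.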
Log every filter. Write down why. Not in a comment in the notebook you will close tomorrow. In documentation you will read six months from now when production accuracy has drifted and you are trying to figure out when it started.
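A filter log does not need infrastructure. One hypothetical helper that wraps every drop and records what was removed and why is enough to answer the six-months-later question:

```python
import pandas as pd

FILTER_LOG = []

def apply_filter(df, mask, reason):
    """Keep rows where mask is True; record what was removed and why."""
    FILTER_LOG.append({
        "reason": reason,
        "dropped": int((~mask).sum()),
        "kept": int(mask.sum()),
    })
    return df[mask]

df = pd.DataFrame({"value": [1, None, 3, 400, 5]})
df = apply_filter(df, df["value"].notna(), "null value: model input required")
df = apply_filter(df, df["value"] < 100, "value > 100: suspected entry error")

# FILTER_LOG now documents every clause of the contract, in order,
# with counts — ready to be dumped to a file next to the model artifact.
```

Persisting `FILTER_LOG` alongside the trained model means the filtering decisions travel with the artifact they shaped.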
And always ask: does my test set have the same filtering applied? If your test set is also null-free and outlier-free and perfectly balanced, your 99% accuracy is a measure of how well your model performs in the world you built, not the world that exists.
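That question can be a concrete check. One hedged sketch (toy data, invented names): compare the null rate of the test set against a sample of production traffic before trusting the accuracy number.

```python
import pandas as pd

test_set = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})            # suspiciously clean
prod_sample = pd.DataFrame({"feature": [1.0, None, 3.0, None]})  # real traffic

test_null_rate = test_set["feature"].isna().mean()
prod_null_rate = prod_sample["feature"].isna().mean()

# If the rates diverge, test accuracy measures the world you built,
# not the world that exists. The 0.05 tolerance is an arbitrary example.
representative = abs(test_null_rate - prod_null_rate) < 0.05
```

Here `representative` comes out false: the test set hides the nulls that production sends, so its accuracy number is answering a different question.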
The Real World Does Not Cooperate
Production data is messy. It has nulls and duplicates and edge cases and inputs no one anticipated. A model trained on a surgically clean dataset is a model trained on a fiction.
The goal is not a perfectly clean dataset. The goal is a dataset that honestly represents the distribution your model will face. Sometimes that means keeping the messy rows. Sometimes that means building a preprocessing layer that handles the mess at inference time. Sometimes that means going back to the data collection step and asking why certain inputs are consistently missing.
The 99% accuracy was real. It just measured the wrong thing.
Go deeper
- Chapter 3: Classical Machine Learning — thinking in features, how training data shapes what a model learns
- Chapter 7: Building AI That Survives Reality — what happens when models meet production and how to prepare for it
- Appendix: Common Traps — a master list of the mistakes that catch developers, including data pitfalls
Related guide: RAG Works in Theory. Here's Why It Fails in Production. — the same gap between training and production, applied to retrieval systems