AI Engineering and Evals as New Layers of Software Work

look quite the same as before. As a software engineer in the AI space, my work has been a hybrid of software engineering, AI engineering, product intuition, and doses of user empathy.

With so much going on, I wanted to take a step back and reflect on the bigger picture, and the kind of skills and mental models engineers need to stay ahead. A recent read of O’Reilly’s AI Engineering gave me the nudge to also wanted to deep dive into how to think about evals — a core component in any AI system.

One thing stood out: AI engineering is often more software than AI.

Outside of research labs like OpenAI or Anthropic, most of us aren’t training models from scratch. The real work is about solving business problems with the tools we already have — giving models enough relevant context, using APIs, building RAG pipelines, tool-calling — all on top of the usual SWE concerns like deployment, monitoring and scaling.

In other words, AI engineering isn’t replacing software engineering — it’s layering new complexity on top of it.

This piece is me teasing out some of those themes. If any of them resonates, I’d love to hear your thoughts — feel free to reach out here!

The three layers of an AI application stack

Think of an AI app as being built on three layers: 1) Application development 2) Model development 3) Infrastructure.

Most teams start from the top. With powerful models readily available off the shelf, it often makes sense to begin by focusing on building the product and only later dip into model development or infrastructure as needed.

As O’Reilly puts it, “AI engineering is just software engineering with AI models thrown into the stack.”

Why evals matter and why they’re tough

In software, one of the biggest headaches for fast-moving teams is regressions. You ship a new feature, and in the process unknowingly break something else. Weeks later, a bug surfaces in a dusty corner of the codebase, and tracing it back becomes a nightmare.

Having a comprehensive test suite helps catch these regressions.

AI development faces a similar problem. Every change — whether it’s prompt tweaks, RAG pipeline updates, fine-tuning, or context engineering — can improve performance in one area while quietly degrading another.

In many ways, evaluations are to AI what tests are to software: they catch regressions early and give engineers the confidence to move fast without breaking things.

But evaluating AI isn’t straightforward. Firstly, the more intelligent models become, the harder evaluation gets. It’s easy to tell if a book summary is bad if it’s gibberish, but much harder if the summary is actually coherent. To know whether it’s actually capturing the key points, not just sounding fluent or factually correct, you might have to read the book yourself.

Secondly, tasks are often open-ended. There’s rarely a single “right” answer and impossible to curate a comprehensive list of correct outputs.

Thirdly, foundation models are treated as black boxes, where details of model architecture, training data and training process are often scrutinised or even made public. These details reveal alot about a model’s strengths and weaknesses and without it, people only evaluate models based by observing it’s outputs.

How to think about evals

I like to group evals into two broad realms: quantitative and qualitative.

Quantitative evals have clear, unambiguous answers. Did the math problem get solved correctly? Did the code execute without errors? These can often be tested automatically, which makes them scalable.

Qualitative evals, on the other hand, live in the grey areas. They’re about interpretation and judgment — like grading an essay, assessing the tone of a chatbot, or deciding whether a summary “sounds right.”

Most evals are a mix of both. For example, evaluating a generated website means not only testing whether it performs its intended functions (quantitative: can a user sign up, log in, etc.), but also judging whether the user experience feels intuitive (qualitative).

Functional correctness

At the heart of quantitative evals is functional correctness: does the model’s output actually do what it’s supposed to do?

If you ask a model to generate a website, the core question is whether the site meets its requirements. Can a user complete key actions? Does it work reliably? This looks a lot like traditional software testing, where you run a product against a suite of test cases to verify behaviour. Often, this can be automated.

Similarity against reference data

Not all tasks have such clear, testable outputs. Translation is a good example: there’s no single “correct” English translation for a French sentence, but you can compare outputs against reference data.

The downside: This relies heavily on the availability of reference datasets, which are expensive and time-consuming to create. Human-generated data is considered the gold standard, but increasingly, reference data is being bootstrapped by other AIs.

There are a few ways to measure similarity:

Human judgement
Exact match: whether the generated response matches one of the reference responses exactly. These produces boolean results.
Lexical similarity: measuring how similar the outputs look (e.g., overlap in words or phrases).
Semantic similarity: measuring whether the outputs mean the same thing, even if the wording is different. This usually involves turning data into embeddings (numerical vectors) and comparing them. Embeddings aren’t just for text — platforms like Pinterest use them for images, queries, and even user profiles.

Lexical similarity only checks surface-level resemblance, while semantic similarity digs deeper into meaning.

AI as a judge

Some tasks are nearly impossible to evaluate cleanly with rules or reference data. Assessing the tone of a chatbot, judging the coherence of a summary, or critiquing the persuasiveness of ad copy all fall into this category. Humans can do it, but human evals don’t scale.

Here’s how to structure the process:

Define a structured and measurable evaluation criteria. Be explicit about what you care about — clarity, helpfulness, factual accuracy, tone, etc. Criteria can use a scale (1–5 rating) or binary checks (pass/fail).
The original input, the generated output, and any supporting context are given to the AI judge. A score, label or even an explanation for evaluation is then returned by the judge.
Aggregate over many outputs. By running this process across large datasets, you can uncover patterns — for example, noticing that helpfulness dropped 10% after a model update.

Because this can be automated, it enables continuous evaluation, borrowing from CI/CD practices in software engineering. Evals can be run before and after pipeline changes (from prompt tweaks to model upgrades), or used for ongoing monitoring to catch drift and regressions.

Of course, AI judges aren’t perfect. Just as you wouldn’t fully trust a single person’s opinion, you shouldn’t fully trust a model’s either. But with careful design, multiple judge models, or running them over many outputs, they can provide scalable approximations of human judgment.

Eval driven development

O’Reilly talked about the concept of eval-driven development, inspired by test-driven development in software engineering, something I felt is worth sharing.

The idea is simple: Define your evals before you build.
In AI engineering, this means deciding what “success” looks like and how it’ll be measured.

Impact still matters most — not hype. The right evals ensure that AI apps demonstrate value in ways that are relevant to users and the business.

When defining evals, here are some key considerations:

Domain knowledge

Public benchmarks exist across many domains — code debugging, legal knowledge, tool use — but they’re often generic. The most meaningful evals usually come from sitting down with stakeholders and defining what truly matters for the business, then translating that into measurable outcomes.

Correctness isn’t enough if the solution is impractical. For example, a text-to-SQL model might generate a correct query, but if it takes 10 minutes to run or consumes huge resources, it’s not useful at scale. Runtime and memory usage are important metrics too.

Generation capability

For generative tasks — whether text, image, or audio — evals may include fluency, coherence, and task-specific metrics like relevance.

A summary might be factually accurate but miss the most important points — an eval should capture that. Increasingly, these qualities can themselves be scored by another AI.

Factual consistency

Outputs need to be checked against a source of truth. This can happen in two ways:

Local consistency
This means verifying outputs against a provided context. This is especially useful for specific domains that are unique to themselves and have limited scope. For instance, extracted insights should be consistent with the data.
Global consistency
This means verifying outputs against open knowledge sources such as by fact checking via a web search or a market research and so on.
Self verification
This happens when a model generates multiple outputs, and measures how consistent these responses are with each other.

Safety

Beyond the usual concept of safety such as to not include profanity and explicit content, there are actually many ways in which safety can be defined. For instance, chatbots should not reveal sensitive customer data and should be able to guard against prompt injection attacks.

To sum up

As AI capabilities grow, robust evals will only become more important. They’re the guardrails that let engineers move quickly without sacrificing reliability.

I’ve seen how challenging reliability can be and how costly regressions are. They damage a company’s reputation, frustrate users, and create painful dev experiences, with engineers stuck chasing the same bugs over and over.

As the boundaries between engineering roles blur, especially in smaller teams, we’re facing a fundamental shift in how we think about software quality. The need to maintain and measure reliability now extends beyond rule-based systems to those that are inherently probabilistic and stochastic.

Source link

Sign Up to Our Newsletter

Top Categories

Uncategorized

Tech News

Tech

Software development

Popular Tech News

Top Categories

Uncategorized

Tech News

Tech

Software development

Popular Tech News

The three layers of an AI application stack

Why evals matter and why they’re tough

How to think about evals

Functional correctness

Similarity against reference data

AI as a judge

Eval driven development

Domain knowledge

Generation capability

Factual consistency

Safety

To sum up

“Our work is never done” – Logitech CEO on the trends shaping the future of work, and how hardware can be “the eyes, the ears and the hands of AI”

TDS Newsletter: September Must-Reads on ML Career Roadmaps, Python Essentials, AI Agents, and More

Team TeachToday

About Author

You may also like

Our Company

Categories

Get Latest Updates and big deals

Our expertise, as well as our passion for web design, sets us apart from other agencies.