How to Make Your AI App Faster and More Interactive with Response Streaming

In my latest posts, I talked a lot about prompt caching, and caching in general, and how it can improve your AI app in terms of cost and latency. However, even in a fully optimized AI app, some responses are simply going to take time to generate, and there's nothing we can do about it. When we request large outputs from the model, or require reasoning or deep thinking, the model naturally takes longer to respond. Reasonable as this is, waiting longer for an answer can be frustrating for users and degrade their overall experience of the app. Happily, there is a simple and straightforward way to mitigate this issue: response streaming.

Streaming means receiving the model's response incrementally, little by little, as it is generated, rather than waiting for the entire response to be generated before displaying it to the user. Normally (without streaming), we send a request to the model's API, wait for the model to generate the response, and once the response is complete, we get it back from the API in one step. With streaming, however, the API sends back partial outputs while the response is being generated. This is a familiar concept, because most user-facing AI apps like ChatGPT have used streaming to display their responses from the moment they first appeared. But beyond ChatGPT and LLMs, streaming is used essentially everywhere on the web and in modern applications: in live notifications, multiplayer games, and live news feeds, for instance. In this post, we are going to explore how we can integrate streaming into our own requests to model APIs and achieve a similar effect in custom AI apps.

There are several different mechanisms to implement the concept of streaming in an application. Nonetheless, for AI applications, there are two widely used types of streaming. More specifically, those are:

  • HTTP Streaming Over Server-Sent Events (SSE): A relatively simple, one-way type of streaming, allowing live communication only from server to client.
  • Streaming with WebSockets: A more advanced and complex type of streaming, allowing live, two-way communication between server and client.

In the context of AI applications, HTTP streaming over SSE can support simple AI applications where we just need to stream the model’s response for latency and UX reasons. Nonetheless, as we move beyond simple request–response patterns into more advanced setups, WebSockets become particularly useful as they allow live, bidirectional communication between our application and the model’s API. For example, in code assistants, multi-agent systems, or tool-calling workflows, the client may need to send intermediate updates, user interactions, or feedback back to the server while the model is still generating a response. However, for most simple AI apps where we just need the model to provide a response, WebSockets are usually overkill, and SSE is sufficient.

In the rest of this post, we’ll be taking a better look at streaming for simple AI apps using HTTP streaming over SSE.

. . .

What about HTTP Streaming Over SSE?

HTTP Streaming Over Server-Sent Events (SSE) is based on HTTP streaming.

. . .

HTTP streaming means that the server can send whatever it has to send in parts, rather than all at once. This is achieved by the server not terminating the connection to the client after sending a response, but instead leaving it open and pushing each additional piece of data to the client as soon as it occurs.

For example, instead of getting the response in one chunk:

Hello world!

we could get it in parts using raw HTTP streaming:

Hello

World

!

If we were to implement HTTP streaming from scratch, we would need to handle everything ourselves, including parsing the streamed text, managing errors, and reconnecting to the server. In our example, using raw HTTP streaming, we would have to somehow signal to the client that 'Hello world!' is conceptually one event, and that anything after it belongs to a separate event. Fortunately, there are several frameworks and wrappers that simplify HTTP streaming, one of which is HTTP Streaming Over Server-Sent Events (SSE).
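To make this concrete, here is a minimal sketch of consuming a raw HTTP-style stream on the client side. The `stream_chunks` generator is a hypothetical stand-in for a server sending a response in parts over a kept-open connection; note that the client receives bare text fragments with no built-in notion of where one event ends and the next begins.

```python
# Hypothetical stand-in for a server sending a response in chunks
# over an HTTP connection that is left open.
def stream_chunks():
    yield "Hello"
    yield " world"
    yield "!"

received = ""
for chunk in stream_chunks():
    # The client only sees raw text fragments; deciding what counts
    # as "one complete event" is entirely up to us.
    received += chunk

print(received)  # → Hello world!
```

This is exactly the bookkeeping that SSE, discussed next, standardizes for us.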

. . .

Server-Sent Events (SSE) provide a standardized way to implement HTTP streaming by structuring server outputs into clearly defined events. This structure makes it much easier to parse and process streamed responses on the client side.

Each event typically includes:

  • an id
  • an event type
  • a data payload

or, more formally:

id: 
event: 
data: 

Our example using SSE could look something like this:

id: 1
event: message
data: Hello world!

But what is an event? Anything can qualify as an event – a single word, a sentence, or thousands of words. What actually qualifies as an event in our particular implementation is defined by the setup of the API or the server we are connected to.

On top of this, SSE comes with various other conveniences, like automatically reconnecting to the server if the connection is terminated. In addition, the stream is served with the text/event-stream content type, allowing the client to recognize it and handle it appropriately.
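To illustrate how this structure helps, here is a small sketch of parsing a `text/event-stream` payload on the client side. The `parse_sse` helper is our own illustrative function, not part of any library; it relies on the SSE convention that events are separated by a blank line and each line holds a `field: value` pair.

```python
def parse_sse(raw: str):
    """Split a text/event-stream payload into a list of event dicts.

    Events are separated by a blank line; each line is 'field: value'.
    """
    events = []
    for block in raw.strip().split("\n\n"):
        event = {}
        for line in block.splitlines():
            field, _, value = line.partition(":")
            event[field.strip()] = value.strip()
        events.append(event)
    return events

raw = (
    "id: 1\nevent: message\ndata: Hello\n\n"
    "id: 2\nevent: message\ndata: world!\n"
)
for ev in parse_sse(raw):
    print(ev["event"], "->", ev["data"])
```

With the blank-line delimiters, the client no longer has to guess where one event ends and the next begins.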

. . .

Roll up your sleeves

Frontier LLM APIs like OpenAI's API or the Claude API natively support HTTP streaming over SSE. As a result, integrating streaming into your requests is relatively simple: it usually amounts to changing a parameter in the request (e.g., setting stream=True).

Once streaming is enabled, the API no longer waits for the full response before replying. Instead, it sends back small parts of the model’s output as they are generated. On the client side, we can iterate over these chunks and display them progressively to the user, creating the familiar ChatGPT typing effect.

Let's walk through a minimal example of this using, as usual, the OpenAI API:

from openai import OpenAI

client = OpenAI(api_key="your_api_key")

stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain response streaming in 3 short paragraphs.",
    stream=True,
)

full_text = ""

for event in stream:
    # only print text delta as text parts arrive
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_text += event.delta

print("\n\nFinal collected response:")
print(full_text)

In this example, instead of receiving a single completed response, we iterate over a stream of events and print each text fragment as it arrives. At the same time, we accumulate the chunks into full_text, so the complete response is available afterwards if we need it.
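If you want to experiment with this loop without an API key, the same pattern can be exercised against a mock event stream. The `FakeEvent` class and `fake_stream` generator below are stand-ins of our own that mimic the shape of the events in the example above; they are not part of the OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass
class FakeEvent:
    type: str
    delta: str = ""


def fake_stream():
    # Mimics the API yielding text deltas, followed by a completion event.
    for piece in ["Streaming ", "shows ", "partial ", "output."]:
        yield FakeEvent(type="response.output_text.delta", delta=piece)
    yield FakeEvent(type="response.completed")


full_text = ""
for event in fake_stream():
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_text += event.delta

print()  # newline after the streamed text
```

Swapping `fake_stream()` for the real API call leaves the consumption loop unchanged, which makes this a convenient way to test display logic offline.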

. . .

So, should I just slap streaming = True on every request?

The short answer is no. As useful as it is, with great potential for significantly improving user experience, streaming is not a one-size-fits-all solution for AI apps, and we should use our discretion in evaluating where it should be implemented and where it shouldn't.

More specifically, adding streaming to an AI app is very effective in setups where we expect long responses and value, above all, the user experience and responsiveness of the app. A typical case is consumer-facing chatbots.

On the flip side, for simple apps where we expect responses to be short, adding streaming is unlikely to provide significant gains to the user experience and doesn't make much sense. On top of this, streaming mainly makes sense when the model's output is free text rather than structured output (e.g., JSON).

Most importantly, the major drawback of streaming is that we cannot review the full response before displaying it to the user. Remember, LLMs generate tokens one by one, and the meaning of the response forms as it is generated, not in advance. If we make 100 requests to an LLM with the exact same input, we may get 100 different responses; no one knows what a response will say before it is complete. As a result, with streaming activated, it is much more difficult to review the model's output before displaying it to the user and to apply any guarantees on the produced content.

We can always try to evaluate partial completions, but partial completions are harder to evaluate, since we have to guess where the model is going. Given that this evaluation has to happen in real time, and not just once but repeatedly on successive partial responses, the process becomes even more challenging. In practice, validation is therefore typically run on the entire output after the response is complete. The issue is that at that point it may already be too late: we may have already shown the user inappropriate content that doesn't pass our validations.
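To illustrate the timing problem, here is a toy sketch. Chunks are displayed the moment they arrive, while the (hypothetical) `violates_policy` check, standing in for a real moderation or validation step, can only run on the full text once the stream has finished, i.e., after everything has already been shown.

```python
BLOCKED_TERMS = {"forbidden"}  # hypothetical content policy


def violates_policy(text: str) -> bool:
    # Toy stand-in for a real moderation / validation step.
    return any(term in text.lower() for term in BLOCKED_TERMS)


displayed = []  # what the user has already seen
full_text = ""
for chunk in ["This answer ", "streams out ", "chunk by chunk."]:
    displayed.append(chunk)  # shown immediately — cannot be unseen
    full_text += chunk

# Validation only runs after the stream ends, when it may be too late.
if violates_policy(full_text):
    print("Validation failed after display — damage already done.")
else:
    print("Response passed validation.")
```

The structural point is that `displayed` fills up before `violates_policy` ever runs; any mitigation (retracting the message, replacing it) happens after the fact.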

. . .

On my mind

Streaming is a feature that doesn't actually change an AI app's capabilities, or its associated cost and latency. Nonetheless, it can have a great impact on how users perceive and experience an AI app. Streaming makes AI systems feel faster, more responsive, and more interactive, even when the time to generate the complete response stays exactly the same. That said, streaming is not a silver bullet. Different applications and contexts may benefit more or less from introducing it. Like many decisions in AI engineering, it's less about what's possible and more about what makes sense for your specific use case.

. . .

If you made it this far, you might find pialgorithms useful — a platform we’ve been building that helps teams securely manage organizational knowledge in one place.

. . .

Loved this post? Join me on 💌Substack and 💼LinkedIn

. . .

All images by the author, except mentioned otherwise.


