LLMs Are Randomized Algorithms

Years ago, I was a graduate student at Stanford University. It was the first lecture of a course titled ‘Randomized Algorithms’, and I was sitting in a middle row. “A Randomized Algorithm is an algorithm that takes random decisions,” the professor said. “Why should you study Randomized Algorithms? You should study them for the reason that for many applications, a Randomized Algorithm is the simplest known algorithm as well as the fastest known algorithm.”

This statement stunned a young me. An algorithm that takes random decisions can be better than an algorithm that takes deterministic, repeatable decisions, even for problems for which deterministic, repeatable algorithms exist? This professor must be nuts! — I thought. He wasn’t. The professor was Rajeev Motwani, who went on to win the Gödel Prize and co-author the PageRank paper behind Google’s search engine.

Randomized algorithms have been studied since the 1940s: an esoteric class of algorithms with esoteric properties, studied by esoteric people in rarefied, esoteric academia. Even less widely recognized than randomized algorithms themselves is the fact that the newest crop of AI — large language models (LLMs) — are randomized algorithms. What is the link, and why? Read on; the answer will surprise you.

Randomized Algorithms and Adversaries

A randomized algorithm is an algorithm that takes random steps to solve a deterministic problem. Take a simple example. If I want to add up a list of a hundred numbers, I can just add them all directly. But, to save time, I might do the following: pick ten of them at random, add only those ten, and then multiply the result by ten to compensate for the fact that I actually summed only 10% of the data. There is a clear, exact answer, but I have approximated it using randomization. I have saved time — of course, at the cost of some accuracy.
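
As a minimal sketch (in Python, with made-up numbers), the whole trick fits in a few lines:

```python
import random

# A hundred made-up numbers whose exact sum we want to approximate.
numbers = [random.uniform(0, 1000) for _ in range(100)]
exact_sum = sum(numbers)

# Randomized estimate: add up a random sample of ten numbers, then multiply
# by ten because we only looked at 10% of the list.
sample = random.sample(numbers, 10)
estimated_sum = sum(sample) * 10

print(f"exact sum:     {exact_sum:.1f}")
print(f"estimated sum: {estimated_sum:.1f}")
```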

Why pick numbers randomly? Why not pick, say, the first ten in the list? Well, maybe we don’t know how the list is distributed — maybe it starts with the largest numbers and goes down from there. In that case, if I picked those largest numbers, I would have a biased sample of the data. Choosing numbers randomly avoids this bias. Statisticians and computer scientists can analyze such randomized algorithms to determine the probability of an error and how large the error is likely to be. They can then design randomized algorithms that minimize the error while simultaneously minimizing the effort the algorithm expends.

In the field of randomized algorithms, the above idea is called adversarial design. Imagine an adversary is feeding data into your algorithm. And imagine this adversary is trying to make your algorithm perform badly.

(Figure: Person 1 says he will approximate the average net worth of people by estimating the net worth of a small sample; an adversarial person 2 hands him a list of multi-billionaires. An adversary can trip up an algorithm.)

A randomized algorithm attempts to counteract such an adversary. The idea is very simple: take random decisions that do not hurt overall performance, but that keep changing which input triggers the worst-case behavior. In this way, even though the worst case could still occur, no given adversary can force it to occur every time.

For illustration, think of trying to estimate the sum of a hundred numbers by looking at only ten of them. If those ten numbers were picked deterministically (repeatably), an adversary could strategically place “bad” numbers in exactly those positions, forcing a bad estimate. If the ten numbers are picked at random, then even though in the worst case we could still choose bad numbers, no particular adversary can force such bad behavior out of the algorithm.
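
Here is a small, purely illustrative simulation of that adversary: a list arranged so that the largest numbers come first fools the “first ten” estimator on every run, while the random-sample estimator gives the adversary no fixed positions to target:

```python
import random

# Adversarial input: the largest numbers are placed first, exactly where a
# deterministic "take the first ten" estimator will look.
adversarial_list = sorted((random.uniform(0, 1000) for _ in range(100)), reverse=True)
exact_sum = sum(adversarial_list)

# Deterministic estimator: always sums positions 0..9, so the adversary wins every run.
deterministic_estimate = sum(adversarial_list[:10]) * 10

# Randomized estimator: sums ten random positions, so no fixed placement can trap it.
randomized_estimate = sum(random.sample(adversarial_list, 10)) * 10

print(f"exact sum:              {exact_sum:.0f}")
print(f"deterministic estimate: {deterministic_estimate:.0f}  (badly overshoots, every single run)")
print(f"randomized estimate:    {randomized_estimate:.0f}  (varies, but is right on average)")
```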

Why think of adversaries and adversarial design? First, because there are enough actual adversaries with nefarious interests that one should try to be robust against. But secondly, also to avoid the phenomenon of an “innocent adversary”. An innocent adversary is one who breaks the algorithm by bad luck, not on purpose. For example, asked for 10 random people, an innocent adversary may sincerely choose them from a People magazine list. Without knowing it, the innocent adversary is breaking algorithmic guarantees.

General Randomized Algorithms

Summing up numbers approximately is not the only use of randomized algorithms. Randomized algorithms have been applied, over the past half a century, to a diversity of problems, including:

  1. Data sorting and searching
  2. Graph searching / matching algorithms
  3. Geometric algorithms
  4. Combinatorial algorithms

… and more. Randomized algorithms are a rich field of study, with their own dedicated conferences, books, publications, researchers and industry practitioners.

Below, we collect some characteristics of traditional randomized algorithms. These characteristics will help us determine, in the next section, whether large language models fit the description of randomized algorithms:

  1. Randomized algorithms take random steps
  2. To take random steps, randomized algorithms use a source of randomness (This includes “computational coin flips” such as pseudo-random number generators, and true “quantum” random number generation circuits.)
  3. The outputs of randomized algorithms are non-deterministic, producing different outputs for the same input
  4. Many randomized algorithms are analyzed to have certain performance characteristics. Proponents of randomized algorithms will make statements about them such as:
    This algorithm produces the correct answer x% of the time
    This algorithm produces an answer very close to the true answer
    This algorithm always produces the true answer, and runs fast x% of the time
  5. Randomized algorithms are robust to adversarial attacks. Even though the theoretical worst-case behavior of a randomized algorithm is never better than that of a deterministic algorithm, no adversary can repeatably produce that worst-case behavior without advance access to the random steps the algorithm will take at run time. (The use of the word “adversarial” in the context of randomized algorithms is quite distinct from its use in machine learning  —  where “adversarial” models such as Generative Adversarial Networks train with opposing goals.)

All of the above characteristics of randomized algorithms are described in detail in Professor Motwani’s foundational book on randomized algorithms — “Randomized Algorithms”!

Large Language Models

Starting in 2022, a crop of artificial intelligence (AI) systems known as “large language models” (LLMs) became increasingly popular. The arrival of ChatGPT captured the public imagination, seeming to signal the dawn of human-like conversational intelligence.

So, are LLMs randomized algorithms? Here’s how LLMs generate text. Each word is generated by the model as a continuation of previous words (words spoken both by itself, and by the user). E.g.:

User: Who created the first commercially viable steam engine?
 LLM: The first commercially viable steam engine was created by James _____

In answering the user’s question, the LLM has output certain words, and is about to output the next. The LLM has a peculiar way of doing so. It first generates probabilities for what the next word might be. For example:

The first commercially viable steam engine was created by James _____
 Watt 80%
 Kirk 20%

How does it do so? Well, it has a trained “neural network” that estimates these probabilities, which is a way of saying no one really knows. What we know for certain is what happens after these probabilities are generated. Before I tell you what the LLM does, ask yourself: if you got the above probabilities for completing the sentence, how would you choose the next word? Most of us would say, “let’s go with the highest probability”. Thus:

The first commercially viable steam engine was created by James Watt

… and we are done!

Nope. That’s not how an LLM is engineered. Looking at the probabilities generated by its neural network, the LLM deliberately samples from them. I.e., 80% of the time it will choose Watt, and 20% of the time it will choose Kirk! This non-determinism (our criterion 3) is engineered in, not a mistake; it is not inevitable in any sense, it has been put in on purpose. To make this random choice (our criterion 1), LLMs use a source of randomness (our criterion 2) via a technique called roulette wheel selection, sketched briefly below.
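
For the curious, roulette wheel selection is nothing more than sampling a word in proportion to its probability. A minimal sketch, using the made-up Watt/Kirk probabilities from above:

```python
import random

# Made-up next-word probabilities from the neural network.
next_word_probs = {"Watt": 0.8, "Kirk": 0.2}

# Deterministic (greedy) choice: always take the most probable word.
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# Roulette wheel (randomized) choice: sample a word in proportion to its probability.
words = list(next_word_probs)
weights = [next_word_probs[w] for w in words]
sampled_choice = random.choices(words, weights=weights, k=1)[0]

print("greedy: ", greedy_choice)   # always Watt
print("sampled:", sampled_choice)  # Watt on ~80% of runs, Kirk on ~20% of runs
```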

The question you may be asking in your mind is, “Why????” Shouldn’t we be going with the most likely token? We would have been correct a hundred percent of the time, whereas with this methodology, we will be correct only 80% of the time — ascribing, on the whim of a die, to James Kirk what should be ascribed to James Watt.

To understand why LLMs are engineered in this fashion, consider a hypothetical situation where the LLM’s neural network predicted the following:

The first commercially viable steam engine was created by James _____
 Kirk 51%
 Watt 49%

Now, by a narrow margin, Kirk is winning. If we had engineered the next-word generation to always pick the maximum-probability word, “Kirk” would win 100% of the time, and the LLM would be wrong 100% of the time. A non-deterministic LLM will still choose Watt 49% of the time, and be right 49% of the time. So, by gambling on the answer instead of being sure, we increase the probability of being right in the worst case, while trading off the probability of being right in the best case.

Analyzing the Randomness

Let’s now be algorithm analyzers (our criterion 4) and analyze the randomness of large language models. Suppose we create a large set of general knowledge questions (say 1 million questions) to quiz an LLM. We give these questions to two large language models — one deterministic and one non-deterministic — to see how they perform. On the surface, deterministic and non-deterministic variants will perform very similarly:

(Figure: a general-knowledge scoreboard showing that a deterministic and a randomized LLM performed similarly on benchmarks.)

But the scoreboard hides an important fact. The deterministic LLM will get the same 27% of the questions wrong every time. The non-deterministic one also gets 27% of the questions wrong, but which questions it gets wrong keeps changing from run to run. Thus, even though the total correctness is the same, it is much harder to pin down a question on which the non-deterministic LLM is always wrong.
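
A toy simulation (with made-up numbers, purely to illustrate the point) makes the difference visible: both “models” score about the same, but the deterministic one’s wrong answers are carved in stone, while the randomized one’s wrong set keeps moving:

```python
import random

random.seed(42)
NUM_QUESTIONS = 100_000
ERROR_RATE = 0.27  # the made-up benchmark error rate from the scoreboard above

# Deterministic model: the same fixed 27% of questions are wrong on every run.
fixed_wrong = set(random.sample(range(NUM_QUESTIONS), int(ERROR_RATE * NUM_QUESTIONS)))

def randomized_wrong_set():
    # Randomized model: each question is wrong with probability 27%,
    # independently on every run, so the wrong set keeps changing.
    return {q for q in range(NUM_QUESTIONS) if random.random() < ERROR_RATE}

run1, run2 = randomized_wrong_set(), randomized_wrong_set()

print("deterministic, wrong in both runs:", len(fixed_wrong) / NUM_QUESTIONS)  # 27%, the same questions
print("randomized, wrong in run 1:       ", len(run1) / NUM_QUESTIONS)         # about 27%
print("randomized, wrong in both runs:   ", len(run1 & run2) / NUM_QUESTIONS)  # only about 7% (0.27 x 0.27)
```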

Let me rephrase that: no adversary will be able to repeatably make a non-deterministic LLM falter. This is our criterion 5. Having demonstrated all five of our criteria, we have strong evidence that LLMs should be considered randomized algorithms in the classical sense.

“But why???”, you will still ask, and will be right in doing so. Why are LLMs designed under adversarial assumptions? Why isn’t it enough to get quizzes right overall? Who is this adversary that we are trying to make LLMs robust against?

Here are a few answers:

Attackers are the adversary. As LLMs become the exposed surfaces of IT infrastructure, various attackers will try to attack them in various ways. They will try to extract secret information, embezzle funds, claim benefits out of turn, and so on. If such an attacker finds one successful attack on an LLM, they will not care about the 99% of attempts that did not succeed. They will keep on repeating that attack, embezzling more, breaking privacy, breaking laws and security. Such an adversary is thwarted by the randomized design: even though an LLM may still fail and expose some information it should not, it will not do so repeatably for any particular conversation sequence.

Fields of expertise are the adversary. Consider our general-knowledge quiz with one million facts. A doctor will be more interested in some subset of these facts, a patient in another, a lawyer in a third, an engineer in a fourth, and so forth. One of these specialist quizzers could turn out to be an “innocent adversary”, breaking the LLM most often. Randomization trades this off, evening out the chances of correctness across fields of expertise.

You are the adversary. Yes, you! Consider a scenario where your favorite chat model was deterministic. Your favorite AI company has just released its next version. You ask it various things. On the sixth question, it falters. What will you do? You will immediately share it with your friends, your WhatsApp groups, your social media circles and so forth. Questions on which the AI repeatably falters would spread like wildfire. This would not be good (for _____? — I’ll let your mind fill in this blank). When the model falters non-deterministically instead, the perception of failure shifts from a lack of knowledge or capability to a fuzzier, harder-to-grasp, abstract problem, with popular invented names such as hallucinations. If only we can iron out these hallucinations, we say to ourselves, we will have reached a state of general human-level artificial intelligence.

After all, if the LLM gets it right sometimes, shouldn’t better engineering get it to perform well every time? That is faulty thinking: after all a simple coin flip could diagnose a disease correctly sometimes. That doesn’t make a coin flip a doctor. Similarly, roulette wheel selection doesn’t make an LLM a PhD.

What About Creativity?

Many people will say that the LLM depends on randomization for creativity. After all, in many applications, you want the LLM to be creative. Be it to write funny poems to regale you, to help you come up with a script for a short film, or to seem more human while chatting you to sleep — the non-determinism does help the LLM seem less robotic, more creative, more human.

On the other hand, it wouldn’t actually be hard to create an architecture that chooses randomness in creative responses and determinism in factual responses. Yet, even for factual and logical applications, or applications where deeply understanding complex language is important, we are primarily using the randomized algorithm versions of LLMs today — and this article has discussed why.

Obtuseness

Have you had a conversation with an LLM that went something like this:

User: Who created the first commercially viable steam engine?
LLM: The first commercially viable steam engine was created by James Kirk.
User: Who created the first commercially viable steam engine?
LLM: The first commercially viable steam engine was created by James Watt.
User: Who created the first commercially viable steam engine?
LLM: The first commercially viable steam engine was created by James the third, King of Scotland.

Probably not. Even though across conversations, an LLM could give different answers, within a conversation it seems to stick to its guns. How come? After all, every time it is filling in the blank “James ____”, doesn’t it face the same choices, with the same probabilities?

No, it does not. The first time it is asked a question in a conversation, it faces the bare probabilities that its neural network calculates. The next time the same question comes up, those probabilities are modified: the whole conversation so far, including its own previous answer, is now part of the input, and the LLM has been explicitly trained to rely heavily on its own previous outputs. In an endeavor to “seem authoritative”, an LLM can become obtuse. So you are more likely to have the following conversation with an LLM:

User: Who created the first commercially viable steam engine?
LLM: The first commercially viable steam engine was created by James Kirk.
User: You got it wrong. Who created the first commercially viable steam engine?
LLM: Ah! I now see my mistake. The first commercially viable steam engine was created by Captain James T Kirk, commander of the starship USS Enterprise.
User: You still have it wrong. Don’t hallucinate. Tell me the absolute truth. Use reasoning. Who created the first commercially viable steam engine? 
LLM: I can see how my answer could be confusing. The starship Enterprise is not known to run on steam power. Nonetheless, James Kirk was definitely the inventor of the first commercially viable steam engine.

The next time you talk to a chat model, try to observe the sublime dance of probabilistic completions, trained obduracy and trained sycophancy, with slight hints of that supercilious attitude (which, I think, it learns on its own from terabytes of internet data).

Temperature

Some of you will know this; for others, it will be a revelation. The LLM’s randomization can be turned off. There is a parameter called “temperature” that roughly works as follows:

(Figure: the “temperature” parameter selects the degree of randomization in LLM outputs. A temperature setting of 0.0 implies no randomization, whereas 1.0 implies full randomization.)

Setting the temperature to 0 disables randomization, while setting it to 1 enables full randomization. Intermediate values are possible as well. (In some implementations, values beyond 1 are also allowed!)
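
Mechanically, in most implementations the temperature divides the model’s raw scores (logits) before they are turned into probabilities; at a temperature of 0 the choice degenerates to “always take the top word”. A minimal sketch, with made-up scores:

```python
import math

def next_word_probs(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0.0:
        # Temperature 0: no randomization, all probability on the top-scoring word.
        best = max(logits, key=logits.get)
        return {word: (1.0 if word == best else 0.0) for word in logits}
    scaled = {word: score / temperature for word, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {word: math.exp(s) / total for word, s in scaled.items()}

logits = {"Watt": 2.0, "Kirk": 0.6}  # made-up raw scores from the neural network
for t in (0.0, 0.5, 1.0):
    print(t, next_word_probs(logits, t))  # lower temperature = sharper distribution
```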

“How do I set this parameter?”, you ask. You can’t. Not in the chat interface. The chat interfaces provided by AI companies keep the temperature stuck at 1.0. For the reason why, see the discussion of adversarial design above.

Nonetheless, this parameter can be set if you are integrating the LLM into your own application. A developer using an AI provider’s LLM to create their own AI application will do so using an “LLM API”, a programmer’s interface to the LLM. Many AI providers allow API callers to set the temperature parameter as they wish. So in your application, you can get the LLM to be adversarially robust (1.0) or repeatable (0.0). Of course, “repeatable” does not necessarily mean “repeatably right”. When wrong, it will be repeatably wrong!
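
For example, with the OpenAI Python client (other providers’ SDKs expose a very similar knob; the model name below is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; use whatever your provider offers
    messages=[{"role": "user",
               "content": "Who created the first commercially viable steam engine?"}],
    temperature=0.0,  # 0.0 for repeatable outputs, 1.0 for the default randomized behavior
)
print(response.choices[0].message.content)
```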

What This Means Practically

Please understand, none of the above means that LLMs are useless. They are quite useful. In fact, understanding what they actually are makes them even more so. So, given what we have learned about large language models, let me now end this article with practical tips for how to use LLMs, and how not to.

Creative input rather than authority. In your personal work, use LLMs as brainstorming partners, not as authorities. They always sound authoritative, but can easily be wrong.

Don’t continue a slipped conversation. If you notice an LLM is slipping from factuality or logical behavior, its “self-consistency bias” will make it hard to get back on track. It’s better to start a fresh chat.

Turn chat cross-talk off. Some LLM providers let their models read information about one chat from another chat. This, unfortunately, can end up increasing obduracy and hallucinations. Find and turn off these settings. Don’t let the LLM remember anything about you or your previous conversations. (This unfortunately does not simultaneously solve privacy concerns, but that is not the topic of this article.)

Ask the same question many times, in many chats. If you have an important question, ask it multiple times, remembering to start a fresh chat every time. If you are getting conflicting answers, the LLM is unsure. (Unfortunately, within a chat, the LLM itself does not know it is unsure, so it will happily gaslight you with its trained overconfidence.) If the LLM is unsure, what do you do? Uhmmm … think for yourself, I guess. (By the way, the LLM could also be wrong repeatedly, so even though asking multiple times is a good strategy, it is not a guarantee.)
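
If you do this through an API rather than a chat window, the same strategy is easy to automate; a rough sketch (again using the OpenAI client as a stand-in for whichever provider you use):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()
QUESTION = "Who created the first commercially viable steam engine?"

answers = []
for _ in range(5):
    # A fresh message list each time is the API equivalent of starting a fresh chat.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1.0,
    )
    answers.append(response.choices[0].message.content.strip())

# Conflicting answers mean the model is unsure; treat its confidence accordingly.
print(Counter(answers))
```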

Carefully choose the “temperature” setting while using the API. If you are creating an AI application that uses an LLM API (or you are running your own LLM), choose the temperature parameter wisely. If your application is likely to attract hackers or widespread ridicule, high temperatures may mitigate this possibility. If your user base expects that once a particular language input works, the same input will keep doing the same thing, you may wish to use low temperatures. Be careful: repeatability and correctness are not the same metric. Test thoroughly. For high temperatures, test your sample inputs repeatedly, because outputs might change.

Use token probabilities through the API. Some LLM APIs give you not only the final words the model output, but also the probabilities of the various candidate words it contemplated before choosing each one. These probabilities can be useful in your AI applications. If, at critical word completions, multiple words (such as Kirk / Watt above) carry similar probability, your LLM is less sure of what it is saying. This can help your application reduce hallucinations, by augmenting such unsure outputs with further agentic workflows. Do remember that a sure LLM can also be wrong!
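
As a concrete illustration, the OpenAI chat completions API can return log-probabilities for the top candidate tokens at each position (the exact fields and parameter names vary across providers and versions):

```python
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": "Who created the first commercially viable steam engine? Answer with a surname only."}],
    logprobs=True,
    top_logprobs=5,  # ask for the five most likely tokens at each position
)

for token_info in response.choices[0].logprobs.content:
    # Convert log-probabilities back into plain probabilities for the top candidates.
    candidates = {alt.token: round(math.exp(alt.logprob), 3) for alt in token_info.top_logprobs}
    print(repr(token_info.token), candidates)
    # If two candidates (say "Watt" and "Kirk") carry similar probability,
    # the model is unsure at this position -- a useful signal for your workflow.
```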

Conclusion

Large language models are randomized algorithms — using randomization on purpose to spread their chances across multiple runs, and to avoid failing repeatably at particular tasks. The tradeoff is that they sometimes fail at tasks they would otherwise succeed at. Understanding this helps us use LLMs more effectively.

Analyzing generative AI algorithms as randomized algorithms is a fledgling field, and will hopefully gain more traction in the coming years. If the wonderful Professor Motwani were with us today, I would have loved to see what he thought about all this. I’m sure he would have had things to say that are much more advanced than what I have said here.

Or maybe he would have just smiled his mischievous smile, and finally given me an A for this essay.

Who am I kidding? Probably an A-minus.



Udayan Kanade
