ETL is about to be transformed
Large language models (LLMs) can extract and generate information, but they can also transform it, making extract, transform, and load (ETL) a potentially different effort entirely. I’ll provide an example that illustrates these ideas, which should also show how LLMs can, and should, be used for many related tasks, including transforming unstructured text into structured text.
Google recently made its large language model (LLM) suite of offerings publicly available in preview and has branded part of the offering “Generative AI Studio.” In short, GenAI Studio within the Google Cloud Platform Console is a UI to Google’s LLMs. However, unlike with Google Bard (a commercial application built on an LLM), Google does not keep your data for any reason. Note that Google also released an API for many of the capabilities outlined here.
Getting into GenAI Studio is pretty straightforward — from the GCP Console, simply use the navigation bar on the left, hover over Vertex AI, and select Overview under GENERATIVE AI STUDIO.
As of late May 2023, there are two options — Language and Speech. (Before long, Google is also expected to release a Vision category here.) Each option contains some sample prompt styles, which can help you spawn ideas and focus your existing ideas into useful prompts. But more than that, this is a “safe” Bard-like experience in that your data is not kept by Google.
The landing page for Language, which is the only feature used for this example, has several different capabilities, while also containing an easy way to tune the foundation model (currently, tuning can only be done in certain regions).
Create Prompt
The Get started area is where unguided interactions with Google’s models (one or more, depending on the timing and interaction type) are quickly created.
Selecting TEXT PROMPT invokes a Bard-like UI with some important differences (in addition to data privacy):
- The underlying LLM can be changed. Currently, the text-bison@001 model is the only one available, but others will appear over time.
- Model parameters can be changed. Google provides explanations for each parameter using the question marks next to each.
- The filter for blocking unsafe responses can be adjusted (options include “Block few,” “Block some,” and “Block most”).
- Inappropriate responses can be easily reported.
Aside from the obvious differences with Bard, using the models this way also lacks some of the Bard “add-ons,” such as current events. For example, if a prompt asking about yesterday’s weather in Chicago is entered, this model will not give the correct answer, but Bard will.
The large text section is where a prompt is entered.
A prompt is created by entering text in the Prompt section, (optionally) adjusting parameters, and then selecting the SUBMIT button. In this example, the prompt is “What is 1+1?” using the text-bison@001 model and default parameter values. Notice the model simply returns the number 2, which is a good example of the effect Temperature has on replies. Repeating this prompt (by selecting SUBMIT repeatedly) yields “2” most of the time, but occasionally a different reply appears. Changing the Temperature to 1.0 yields, “The answer is 2. 1+1=2 is one of the most basic mathematical equations that everyone learns in elementary school. It is the foundation for all other math that is learned later on.”
This happens because Temperature adjusts the probabilistic selection of tokens: the lower the value, the less variable (i.e., more deterministic) the replies are. If the value is set to 0 in this example, the model will always return “2.” Pretty cool, and very Bard-like but better. You can also save prompts and view the code for a prompt. The following is the code for “What is 1+1?”
import vertexai
from vertexai.preview.language_models import TextGenerationModel

def predict_large_language_model_sample(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    content: str,
    location: str = "us-central1",
    tuned_model_name: str = "",
):
    """Predict using a Large Language Model."""
    vertexai.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    if tuned_model_name:
        model = model.get_tuned_model(tuned_model_name)
    response = model.predict(
        content,
        temperature=temperature,
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,
    )
    print(f"Response from Model: {response.text}")

predict_large_language_model_sample(
    "mythic-guild-339223",
    "text-bison@001", 0, 256, 0.8, 40,
    '''What is 1+1?''', "us-central1")
The generated code contains the prompt, but it’s easy to see that the function predict_large_language_model_sample is general-purpose and can be used for any text prompt.
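For instance, the Temperature effect described earlier can be reproduced from code rather than the UI by calling the same function with different values. A minimal sketch, where "your-project-id" is a placeholder for your own GCP project:

# Compare replies at two Temperature settings by reusing the
# generated function. "your-project-id" is a placeholder.
for temp in (0.0, 1.0):
    print(f"--- temperature={temp} ---")
    predict_large_language_model_sample(
        "your-project-id", "text-bison@001",
        temp, 256, 0.8, 40,
        "What is 1+1?", "us-central1")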
In my day job, I spend lots of time figuring out how to extract information from text (including documents). LLMs can do this in surprisingly easy and accurate ways, and in doing so can also change the data. An example illustrates this potential.
Presume for the sake of this example, that the following email message is received by a fictitious ACME Incorporated:
Purchaser: Galveston Widgets

Dear Purchasing,
Can you please send me the following items, and provide an invoice for them?
Item       Number
Widget 11  22
Widget 22  4
Widget 67  1
Widget 99  44
Thank you.
Arthur Galveston
Purchasing Agent
(312)448-4492
Also presume that the objectives for the system are to extract specific data from the email, apply prices (and subtotals) for each item entered, and also generate a grand total.
If you’re thinking an LLM can’t do all that, think again!
There’s a prompt style called extractive Q&A that fits the bill very nicely in some situations (maybe all situations, if you tune the model rather than relying on prompt engineering alone). The idea is simple:
- Provide a Background, which is the original text.
- Provide a Q (for Question), which should be something extractive, such as “Extract all the information as JSON.”
- Optionally provide an A (for Answer) that has the desired output.
If no A is provided, zero-shot prompting is applied (and this works better than I expected). You can provide one-shot or multi-shot examples as well, up to a point: there’s a limit on the size of a prompt, which restricts how many examples you can provide.
In summary, an extractive Q&A prompt has the following form:
Background: [the text]
Q: [the extractive question]
A: [nothing, or an example desired output]
In the example, the email is the text, and “Extract all information as JSON” is the extractive question. If nothing is provided as A:, the LLM will attempt the extraction zero-shot. (JSON, or JavaScript Object Notation, is a lightweight data-interchange format.)
Here is the zero-shot prompt:
Background: Purchaser: Galveston Widgets

Dear Purchasing,
Can you please send me the following items, and provide an invoice for them?
Item       Number
Widget 11  22
Widget 22  4
Widget 67  1
Widget 99  44
Thank you.
Arthur Galveston
Purchasing Agent
(312)448-4492
Q: Extract all information as JSON
A:
You don’t need to bold Background:, Q:, and A:; I did so only for clarity.
In the UI, I left the prompt as FREEFORM and I entered the prompt above in the Prompt area. Then, I set the Temperature to 0 (I want the same answer for the same input every time) and increased the Token limit to 512 to allow for a longer response.
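The same zero-shot call can be made through the API using the function from earlier. This is a minimal sketch, with the email text inlined and a placeholder project ID:

# Minimal sketch of the zero-shot extraction via the API.
# "your-project-id" is a placeholder for your own GCP project.
email_text = """Purchaser: Galveston Widgets

Dear Purchasing,

Can you please send me the following items, and provide an invoice for them?

Item       Number
Widget 11  22
Widget 22  4
Widget 67  1
Widget 99  44

Thank you.

Arthur Galveston
Purchasing Agent
(312)448-4492"""

prompt = f"Background: {email_text}\nQ: Extract all information as JSON\nA:"
predict_large_language_model_sample(
    "your-project-id", "text-bison@001",
    0, 512, 0.8, 40,   # Temperature 0 and Token limit 512, as in the UI
    prompt, "us-central1")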
Here is what the zero-shot prompt and reply look like:
The “E”xtract works and even does a nice job of putting the line items in a list within the JSON. But that’s not really good enough. Assume my requirements call for specific labels for the data, and presume I also want to capture the purchasing agent and their phone number. Finally, assume I want line-item subtotals and a grand total (this presumption requires that a line-item price exists).
My ideal output, which is both an “E”xtract and “T”ransform, looks like this:
{"company_name": "Galveston Widgets",
"items" : [
{"item_name": "Widget 11",
"quantity": "22",
"unit_price": "$1.50",
"subtotal": "$33.00"},
{"item_name": "Widget 22",
"quantity": "4",
"unit_price": "$50.00",
"subtotal": "$200.00"},
{"item_name": "Widget 67",
"quantity": "1",
"unit_price": "$3.50",
"subtotal": "$3.50"},
{"item_name": "Widget 99",
"quantity": "44",
"unit_price": "$1.00",
"subtotal": "$44.00"}],
"grand_total": "$280.50",
"purchasing_agent": "Arthur Galveston",
"purchasing_agent_phone": "(312)448-4492"}
For this prompt, I changed the UI from FREEFORM to STRUCTURED, which makes laying out the data a bit easier. With this UI, I can set a Context for the LLM (which can have a surprising effect on model responses). Then, I provide one Example — both the input text and the output text — and then a Test input.
The parameters are the same for STRUCTURED and FREEFORM. Here are the Context and Example (both Input and Output) for the invoice ETL example.
I added a Test email, with entirely different data (same widgets though). Here’s everything, shown in the UI. I then selected SUBMIT, which filled in the Test JSON, which is in the bottom right pane in the image.
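Through the API, the STRUCTURED layout amounts to a one-shot prompt. The exact string the UI assembles isn’t shown anywhere, so the layout below is an assumption, and the placeholder variables stand in for the texts entered in the UI:

# Rough one-shot equivalent of the STRUCTURED prompt. The exact
# serialization the UI uses is an assumption; the placeholder values
# stand in for the Context, Example, and Test texts entered in the UI.
context = "Extract the purchase data from the email as JSON, pricing each line item."
example_input = "(the example email entered in the UI)"
example_output = "(the ideal JSON shown above)"
test_input = "(the test email with different data)"

one_shot_prompt = (f"{context}\n\n"
                   f"input: {example_input}\n"
                   f"output: {example_output}\n\n"
                   f"input: {test_input}\n"
                   f"output:")
predict_large_language_model_sample(
    "your-project-id", "text-bison@001",
    0, 512, 0.8, 40, one_shot_prompt, "us-central1")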
That right there is voodoo magic. Yes, the math is completely correct.
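That claim is easy to sanity-check. A few lines of Python verify the subtotals and grand total for output of this shape, run here against the example output shown above:

import json
from decimal import Decimal

# The example output from above (non-monetary fields omitted).
doc = json.loads('''
{"company_name": "Galveston Widgets",
 "items": [
   {"item_name": "Widget 11", "quantity": "22", "unit_price": "$1.50", "subtotal": "$33.00"},
   {"item_name": "Widget 22", "quantity": "4", "unit_price": "$50.00", "subtotal": "$200.00"},
   {"item_name": "Widget 67", "quantity": "1", "unit_price": "$3.50", "subtotal": "$3.50"},
   {"item_name": "Widget 99", "quantity": "44", "unit_price": "$1.00", "subtotal": "$44.00"}],
 "grand_total": "$280.50"}
''')

def dollars(s: str) -> Decimal:
    return Decimal(s.lstrip("$"))

grand_total = Decimal("0")
for item in doc["items"]:
    expected = dollars(item["unit_price"]) * int(item["quantity"])
    assert expected == dollars(item["subtotal"]), item["item_name"]
    grand_total += expected
assert grand_total == dollars(doc["grand_total"])  # $280.50 checks out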
At this point, I’ve shown extract and transform — it’s time for the load bit. That part is actually very simple and works zero-shot (if this is done with the API, it’s two calls: one for E+T, one for L).
I provided the JSON from the last step as the Background and changed the Q: to “Convert the JSON to a SQL insert statement.” Here’s the result, which deduces an invoices table and an invoice_items table. (You can fine-tune that SQL with the question and/or an example SQL.)
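As a second API call, the load step looks like this sketch, where extracted_json (abbreviated here) stands in for the JSON produced by the previous step:

# Second call: turn the extracted/transformed JSON into SQL.
# extracted_json is abbreviated; in practice it is the full JSON
# returned by the E+T step.
extracted_json = '{"company_name": "Galveston Widgets", "grand_total": "$280.50"}'

load_prompt = (f"Background: {extracted_json}\n"
               "Q: Convert the JSON to a SQL insert statement\n"
               "A:")
predict_large_language_model_sample(
    "your-project-id", "text-bison@001",
    0, 512, 0.8, 40, load_prompt, "us-central1")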
This example demonstrates a pretty amazing LLM capability, which may very well change the nature of ETL work. I have no doubt there are limits to what LLMs can do in this space, but I don’t know what those limits are yet. Working with the model on your problems is critical in understanding what can, cannot, and should be done with LLMs.
The future looks bright, and GenAI Studio can get you going very quickly. Remember, the UI gives you simple copy/paste code so you can use the API rather than the UI; the API is what actual applications doing this type of work require.
This also means that the hammer still doesn’t make houses. By this I mean that the model didn’t figure out this ETL example. The LLM is the very elaborate “hammer” — I was the carpenter, just like you.
This article is the author’s opinion and perspective and does not reflect those of his employer. (Just in case Google is watching.)