What ChatGPT Knows about You: OpenAI’s Journey Towards Data Privacy | by Andrea Valenzuela | May, 2023

What ChatGPT Knows about You: OpenAI’s Journey Towards Data Privacy | by Andrea Valenzuela | May, 2023

[ad_1]

Companies allowing their users to ask for their personal data make them comply with the aforementioned GDPR regulation. Nevertheless, there is a catch: the file format can make the data unreadable for most of the population. In this case, we got both html and json files. While html can be read directly, json files can be more difficult to interpret. I personally think that new regulations should also enforce a readable format of the data. But for the time being…

Let’s explore the files one by one to get the most out of this new feature!

The first file is chat.html which contains my entire chat history with ChatGPT. Conversations are stored with their corresponding title. The user’s questions and ChatGPT’s answers are labeled as assistantand user, respectively.

If you have ever trained an AI model yourself, this labeling system will sound familiar to you.

Let’s observe a sample conversation from my history:

Self-made screenshot from my ChatGPT history. The conversation title is highlighted in blue. User/Assistant labels are highlighted in red and green, respectively.

Have you ever seen the thumbs-up, thumbs-down icons (👍👎) next to any ChatGPT answer?

This information is seen by ChatGPT as the feedback for a given answer, which will then help in the chatbot training.

This information is stored in the message_feedback.json file containing any feedback you provided to ChatGPT using the thumbs icons. Information is stored in the following format:

[{"message_id": <MESSAGE ID>, "conversation_id": <CONVERSATION ID>, "user_id": <USER ID>, "rating": "thumbsDown", "content": "{\"tags\": [\"not-helpful\"]}"}]

The thumbsDown rating accounts for wrongly-generated answers while the thumbsUp accounts for the correctly-generated ones.

There is also a file (user.json) containing the following personal data from the user:

{"id": <USER ID>, "email": <USER EMAIL>, "chatgpt_plus_user": [true|false], "phone_number": <USER PONE>}

Some platforms are known for creating a model of the user based on their usage of the platform. For example, if the Google searches of a user are mostly about programming, Google is likely to infer that the user is a programmer and use this information to show personalized advertisements.

ChatGPT could do the same with the information from the conversations, but they are currently obliged to include this inferred information in the exported data.

⚠️ FYI, One can access What Google knows about them from Gmail by clicking on Account >> Data & Privacy >> Personalized Ads >> My Ad Center.

There is another file containing the conversation history, and also including some metadata. This file is named conversations.json and includes information such as the creation time, several identifiers, and the model behind ChatGPT, among others.

⚠️ The metadata provides information about the main data. It may include information such as the origin of the data, its meaning, its location, its ownership, and its creation. Metadata accounts for information related to the main data, but it is not part of it.

Let’s explore the same conversation about the A320 Hydraulic System Failure exposed in the first example in this json format. The conversation itself consists of the following Q&A:

From this simple conversation, OpenAI keeps quite some information. Let’s review the stored information:

  • The main fields of the json file contain the following information:

The field moderation_results is empty since no feedback was provided to ChatGPT in this concrete case. In addition, the [+] symbol in the mapping field means that more information is available.

  • In fact, the mapping field contains all the information about the conversation itself. Since the conversation has four interactions, the mapping stores one children entry per interaction.

Again, the [+] symbol indicates that more information is available. Let’s review the different entries!

  • mapping_id: It contains an id for the conversation as well as information about the creation time and the type of content, among others. As far as one can infer, it also creates a parent_id for the conversation and a children_id that corresponds to the following interaction of the user with ChatGPT. Here is an example:
  • children_idX: A new children entry is created for each interaction either from the user or from the assistant. Since the conversation has four interactions, the json file displays four children entries. Each children entry has the following structure:

The first children entry is nested within the conversation by having the mapping_id as a parent and the second interaction — the answer from ChatGP — as a second child.

  • Children that correspond to a ChatGPT answer contain additional fields. For example, for the second interaction:

In the case of a ChatGPT answer, we get information about the model behind ChatGPT and the stopping words. It also shows the first children as it parent and the third children as the following interaction.

The full file can be found in this GitHub gist.

Have you ever used the “Regenerate response” button when you were not fully convinced by the response provided by ChatGPT?

Self-made screenshot from the Regenerate response button in ChatGPT.

This feedback information is also stored!

There is a last file named model_comparisons.json that contains snippets of the conversations and the consecutive attempts anytime ChatGPT regenerated the response. The information contains only the text without the title but including some other metadata. Here is the basic structure of this file:

{
"id":"<id>",
"user_id":"<user_id>",
"input":{[+]},
"output":{[+]},
"metadata":{[+]},
"create_time": "<time>"
}

The metadata field contains some important information such as the country and continent where the conversation took place, and information about the https access schema, among others. The interesting part of this file comes in the input/output entries:

Input

The input contains a collection of messages from the original conversation. Interactions are labeled depending on the author and, as in the previous cases, some additional information is also stored. Let’s observe the messages stored for our sample conversation:

User/Assistant entries are expected, but I am sure at this point we are all wondering why is there a system label?

And moreover, why do they feed an initial statement like this at the beginning of each conversation?

Is ChatGPT pre-feed with the current date in any new conversation?

Yes, those entries are the so-called system messages.

System Messages

System messages give overall instructions to the assistant. They help to set the behavior of the assistant. In the web interface, system messages are transparent to the user, which is why we do not see them directly.

The benefit of the system message is that it allows the developer to tune the assistant without making the request itself part of the conversation. System messages can be fed by using the API. For example, if you are building a car sales assistant, one possible system message could be “You are a car sales assistant. Use a friendly tone and ask questions to the users until you understand their necessity. Then, explain the available cars that match their preferences”. You could even feed the list of vehicles, specifications, and prices so that the assistant can give this information too.

[ad_2]
Source link

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *