The synthetic data field guide


A guide to the various species of fake data: Part 2

Cassie Kozyrkov
Towards Data Science

If you want to work with data, what are your options? Here’s an answer that’s as coarse as possible: you could get hold of real data or you could get hold of fake data.

In my previous article, we made friends with the concept of synthetic data and discussed the thought process around creating it. We compared real data, noisy data, and handcrafted data. Let’s dig into the species of synthetic data that’s fancier than asking a human to pick a number, any number…

A classic of British sketch comedy.

(Note: the links in this post take you to explainers by the same author.)

Duplicated data

Maybe you measured 10,000 real human heights but you want 20,000 datapoints. One approach you could take is to assume your existing dataset already represents your population fairly well. (Assumptions are always dangerous, proceed with caution.) Then you could simply duplicate the dataset, or duplicate some portion of it, using ye olde copy-paste. Ta-da! More data! But is it good and useful data? That always depends on what you need it for. For most situations, the answer would be no. But hey, there are reasons you were born with a head, and those reasons are to chew and to apply your best judgment.
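
If you'd like to see how boring the copy-paste really is, here's a minimal sketch in Python with NumPy; the `heights` array below is just a stand-in for your 10,000 real measurements:

```python
import numpy as np

# Stand-in for 10,000 real height measurements (in cm).
rng = np.random.default_rng(0)
heights = rng.normal(170, 10, size=10_000)

# "Duplicate the dataset": stack a copy on top of the original.
doubled = np.concatenate([heights, heights])
print(doubled.shape)  # (20000,)
```

Whether those extra 10,000 points help or hurt depends entirely on what you're going to do with them.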

Resampled data

Speaking of duplicating only a portion of your data, there’s a way to inject a spot of randomness to assist you in figuring out which portion to pick. You can use a random number generator to assist you in picking which height to draw from your existing list of heights. You could do this “without replacement”, meaning that you make at most one copy of each existing height, but…
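
A minimal sketch of sampling without replacement, again using a stand-in `heights` array:

```python
import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(170, 10, size=10_000)  # stand-in for your real data

# Draw 5,000 heights WITHOUT replacement: each original value
# can be copied at most once into the new sample.
sample_without = rng.choice(heights, size=5_000, replace=False)
```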

Bootstrapped data

You’ll more often see people doing this “with replacement”, meaning that every time you randomly pick a height to copy, you immediately forget you did this so that the same height could make its way into your dataset as a second, third, fourth, etc. copy. Perhaps if there’s enough interest in the comments, I’ll explain why this is a powerful and effective technique (yes, it sounds like witchcraft at first, I thought so too) for population inference.
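
Here's a hedged sketch of the bootstrap idea, using a stand-in `heights` array and the sample mean as the statistic of interest:

```python
import numpy as np

rng = np.random.default_rng(7)
heights = rng.normal(170, 10, size=10_000)  # stand-in for your real data

# Bootstrap: resample WITH replacement, same size as the original,
# and record the statistic of interest for each resample.
boot_means = [
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(1_000)
]

# The spread of boot_means approximates the uncertainty in the sample mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"~95% bootstrap interval for the mean: [{lo:.1f}, {hi:.1f}]")
```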

Augmented data

Augmented data might sound fancy, and there *are* fancy ways to augment data, but usually when you see this term, it means you took your resampled data and added some random noise to it. In other words, you generated a random number from a statistical distribution and typically you simply added it to the resampled datapoint. That’s it. That’s the augmentation.
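
A minimal sketch of that vanilla augmentation, with an assumed noise scale of 1 cm (a choice you'd tune for your own data):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, size=10_000)  # stand-in for your real data

# Resample with replacement, then add small Gaussian noise to each copy.
resampled = rng.choice(heights, size=20_000, replace=True)
augmented = resampled + rng.normal(loc=0.0, scale=1.0, size=resampled.size)
```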


Oversampled data

Speaking of duplicating only a portion of your data, there’s a way to be intentional about boosting certain characteristics over others. Maybe you took your measurements at a typical AI conference, so female heights are underrepresented in your data (sad but true these days). That’s called the problem of unbalanced data. There are techniques for rebalancing the representation of those characteristics, such as SMOTE (Synthetic Minority Oversampling TEchnique), which is pretty much what it sounds like. The most naive way to smite the problem is to simply limit your resampling to the minority datapoints, ignoring the others. So in our example, you’d just resample the female heights while ignoring the other data. You could also consider more sophisticated augmentation, still limiting your efforts to the female heights.
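
To make the naive version concrete, here's a sketch that resamples only a (hypothetical) minority group until the group sizes match; proper SMOTE and ADASYN implementations live in libraries such as imbalanced-learn:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced sample: far fewer female heights than male heights.
female = rng.normal(163, 7, size=1_000)
male = rng.normal(177, 7, size=9_000)

# Naive oversampling: resample ONLY the minority group (with replacement)
# until the two groups are the same size, then recombine.
female_oversampled = rng.choice(female, size=male.size, replace=True)
balanced = np.concatenate([female_oversampled, male])
print(balanced.size)  # 18000
```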

If you wanted to get even fancier, you’d look up techniques like ADASYN (Adaptive Synthetic Sampling) and follow the breadcrumbs on a trail that’s out of scope for a quick intro to this topic.

Edge case data

You could also make up (handcrafted) data that’s totally unlike anything you (or anyone) has ever seen. This would be a very silly thing to do if you were trying to use it to create models of the real world, but it’s clever if you’re using it to, for example, test your system’s ability to handle weird things. To get a sense of whether your model/theory/system chokes when it meets an outlier, you might make synthetic outliers on purpose. Go ahead, put in a height of 3 meters and see what explodes. Kind of like a fire drill at work. (Do not leave an actual fire in the building or an actual monster outlier in your dataset.)
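
A tiny sketch of that fire drill, injecting a deliberately absurd 3-meter height into a throwaway copy of the data:

```python
import numpy as np

rng = np.random.default_rng(3)
heights = rng.normal(170, 10, size=10_000)  # stand-in for real data (in cm)

# Deliberately inject an impossible outlier into a TEST copy only,
# then run your pipeline on it and see what explodes.
test_copy = np.append(heights, 300.0)

# ...and throw test_copy away afterwards; don't let the monster
# outlier wander back into your real dataset.
```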

http://bit.ly/quaesita_ytoutliers

Simulated data

Once you’re getting comfy with the idea of making data up according to your specifications, you might like to go a step further and create a recipe to describe the underlying nature of the kind of data that you’d like in your dataset. If there’s a random component, then what you’re actually doing is simulating from a statistical distribution that allows you to specify what the core principles are, as described by a model (which is just a fancy way of saying “a formula that you’re going to use as a recipe”) with a rule for how the random bits work. Instead of adding random noise to an existing datapoint as the vanilla data augmentation techniques do, you can add noise to a set of rules you came up with, either by meditating or by doing some statistical inference with a related dataset. Learn more about that here.
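
Here's a minimal sketch of simulating from such a recipe, where the choice of a normal distribution and its parameters are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2023)

# The "recipe": assume heights follow a normal distribution whose mean and
# spread you chose, by judgment or by fitting a related dataset.
ASSUMED_MEAN_CM = 170.0  # hypothetical parameter choice
ASSUMED_SD_CM = 10.0     # hypothetical parameter choice

# Simulate a brand-new dataset from the rules instead of copying real points.
simulated_heights = rng.normal(ASSUMED_MEAN_CM, ASSUMED_SD_CM, size=20_000)
```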


Heights? Wait, you’re asking me for a dataset of nothing but one height at a time? How boring! How… floppy disk era of us. We call this univariate data and it’s rare to see it collected in the wild these days.

Now that we have incredible storage capacity, data can come in much more interesting and complex forms. It’s very cheap to grab some extra characteristics along with heights while we’re at it. We could, for example, record hairstyle, making our dataset bivariate. But why stop there? How about the age too, so our data’s multivariate? How fun!

But these days, we can go wild and combine all that with image data (take a photo during the height measurement) and text data (that essay they wrote about how unnecessarily boring their statistics class was). We call this multimodal data and we can synthesize that too! If you’d like to learn more about that, let me know in the comments.

Why might someone want to make synthetic data? There are good reasons to love it and some solid reasons to avoid it like the plague (article coming soon), but if you’re a data science professional, head over to this article to find out which reason I think should be your favorite to use it often.

If you had fun here and you’re looking for an entire applied AI course designed to be fun for beginners and experts alike, here’s the one I made for your amusement:

Enjoy the course on YouTube here.

P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️
