Introduction (X-45)
Forecasting fundamentally changes when we try to predict a very rare event: we must shift what we are modelling to focus on tail events. From model performance metrics and target definition to the tail model and the transformer output heads, rare-event forecasting is difficult. Difficult, yet worth it.
The Halloween storms of 2003 began as a disturbance on the Sun: a single dark spot that grew into one of the strongest space weather events of the satellite era. From late October to early November, a series of enormous active regions churned across the solar disk, releasing powerful flares and clouds of magnetized plasma towards Earth. The event was as visually spectacular as it was disruptive to radio communications.
Satellites malfunctioned, GPS and radio were disrupted, and airlines rerouted polar flights. According to NOAA, power grids worldwide were affected, with geomagnetically induced currents exceeding 100 amps in places, leading to the Malmö blackout in Sweden: at 20:07 UT, a power outage hit the region, leaving approximately 50,000 customers without electricity for 20 to 50 minutes.

Image credit: NASA / Solar Dynamics Observatory (SDO) / AIA. Public domain
The event was so intense that it saturated the GOES X-ray sensors, so the true size of the flare could only be estimated through reconstruction. It is often called X-45, after its reconstructed magnitude: 450 times larger than an M-1, a medium flare. The table below shows the flare "Richter scale", where each class is ten times stronger than the last and the number multiplies the class base (an X-45 peaks at 45 × 10⁻⁴ W/m²).

| Class | Peak soft X-ray flux (W/m²) |
|-------|-----------------------------|
| A | < 10⁻⁷ |
| B | 10⁻⁷ to 10⁻⁶ |
| C | 10⁻⁶ to 10⁻⁵ |
| M | 10⁻⁵ to 10⁻⁴ |
| X | ≥ 10⁻⁴ |
The Prediction Problem
A paradoxical property of catastrophes is that the more catastrophic they are, the rarer they tend to be. Think floods, snowstorms and avalanches: every 50-year storm happens, on average, once in fifty years. This is usually a good thing, but their very rarity makes them incredibly hard to predict.
There are several things that make predicting rare events a particularly interesting challenge in machine learning:
- Our metrics for model evaluation must change
- Features need to be engineered from magnetogram data
- A tail model must be built specifically to capture rare events
- The tail model must be combined with the full-distribution model using a transformer
A note on accuracy, which is typically a good metric for binary classification: with only 100 major flares in 10,000 forecasts, a model that simply guesses "no flare" every single time achieves 99% accuracy while missing every flare.
Accuracy = (10,000 − 100) / 10,000 = 9,900 / 10,000 = 0.99 = 99%
True positives = 0
The Data
If you’re interested in where this data comes from: all the data we have on solar flares comes from an altogether different layer of the Sun than where the flares occur. It comes from the photosphere, the Sun’s lowest visible layer, while flares occur in the corona and chromosphere. The data is collected by the Solar Dynamics Observatory (SDO), a NASA spacecraft that continuously observes the Sun, using its Helioseismic and Magnetic Imager (HMI) instrument.

Model Input
Fortunately, thanks to NASA, the satellite’s construction, deployment, and journey into orbit have already been completed, and we can focus on our model input. A vector magnetogram estimates the magnetic field vector B; HMI’s magnetic-field observations come in two flavours:
- Line-of-sight magnetograms, which measure only the component of B along the viewing direction
- Vector magnetograms, which estimate the full three-component field
From this starting point, the Space Weather HMI Active Region Patch (SHARP) pipeline does two things:
- Localization: selecting active-region patches on the Sun
- Feature engineering: computing magnetic parameters that better describe the magnetic structure of each patch
The important lesson here is that, to cope with how rare the event we are trying to predict is, we focus on gathering data from the locations where it is most likely to happen. We take our starting measurements of the magnetic fields and compute different features, such as:
- Total unsigned magnetic flux (USFLUX)
- Total unsigned vertical current (TOTUSJZ)
- Mean shear angle (MEANSHR)
- Total photospheric magnetic free energy (TOTPOT)
- Unsigned flux near the polarity inversion line (R_VALUE)
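As a sketch of how such parameters can be pulled, the `drms` Python package queries the JSOC archive that hosts SHARP data. The HARP number and date below are illustrative placeholders, not values from this article:

```python
import drms

# Query the JSOC archive for SHARP summary parameters of one
# active-region patch (HARP 377, over one day, 12-minute cadence).
client = drms.Client()
keys = client.query(
    "hmi.sharp_cea_720s[377][2011.02.15_00:00_TAI/1d]",
    key="T_REC, USFLUX, TOTUSJZ, MEANSHR, TOTPOT, R_VALUE",
)
print(keys.head())  # one row of engineered features per magnetogram
```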
Our input data become a function of time and engineered features:

$$X \in \mathbb{R}^{T \times F},$$

where $T$ is the number of timesteps in the history window and $F$ is the number of engineered features.
If our model uses the past 24 hours (sampled hourly) and 9 engineered features, our input would be

$$X \in \mathbb{R}^{24 \times 9}.$$
Model Target
We might as well make our target precise now. We define it as the probability of observing an M-1 class (or larger) event in the next 24 hours, given the magnetic history, where the magnetic history is our entire input:

$$y_t = \Pr\big(\text{an M-1+ flare occurs in } (t,\ t + 24\,\text{h}] \ \big|\ X_t\big)$$
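A minimal sketch of how the input windows and targets could be built, assuming hourly-sampled feature rows and a per-hour peak-flux series (both arrays, and the helper name, are hypothetical):

```python
import numpy as np

M1_FLUX = 1e-5  # W/m^2, the M-1 threshold in GOES soft X-ray flux

def make_dataset(features, peak_flux, history=24, horizon=24):
    """features: (n_hours, 9) engineered SHARP parameters, hourly cadence.
    peak_flux: (n_hours,) peak soft X-ray flux observed in each hour.
    Returns X of shape (n_samples, history, 9) and binary labels y."""
    X, y = [], []
    for t in range(history, len(features) - horizon):
        X.append(features[t - history:t])                    # past 24 h
        y.append(peak_flux[t:t + horizon].max() >= M1_FLUX)  # next 24 h
    return np.stack(X), np.array(y, dtype=np.float32)
```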
But there are many implicit design decisions we’ve made; the following table makes them explicit.

| Design choice | Our value | Alternatives |
|---------------|-----------|--------------|
| Severity threshold | M-1 | C-1, X-1 |
| Forecast horizon | 24 h | 6 h, 48 h |
| History window | 24 h | 12 h, 72 h |
| Target type | binary exceedance probability | peak-flux regression, time-to-event |
Notice that there are many options when constructing the target, which is a major problem when comparing different models. It’s also worth noting that simply taking more history is not automatically better: events that happened further in the past tend to be weaker predictors of future events, which introduces a signal-to-noise trade-off in the choice of history window.

The Metric TSS
To solve the problem presented earlier, a model with 99% accuracy and zero recall, we introduce a different statistic: the True Skill Statistic (TSS), defined as the difference between the true positive rate and the false positive rate. TSS rewards true positives while also punishing false positives.

$$\mathrm{TSS} = \mathrm{TPR} - \mathrm{FPR} = \frac{TP}{TP + FN} - \frac{FP}{FP + TN}$$
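In code, TSS makes the failure of the always-negative model obvious. A short sketch (the function name and toy arrays are illustrative):

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """TSS = TPR - FPR; ranges from -1 to 1, with 0 meaning no skill."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr

# The always-"no flare" model from earlier: 99% accuracy, but zero skill.
y_true = np.array([1] * 100 + [0] * 9900)
y_pred = np.zeros_like(y_true)
print(true_skill_statistic(y_true, y_pred))  # 0.0
```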
Making a tail model
Because of flare rarity, if we use the standard empirical risk objective below, we will find that common events, where no solar flare was present, dominate the loss term. Rare events barely contribute, as they happen so seldom, even though they are the most relevant to what we are trying to predict. The model can become very good at the bulk of the distribution while learning very little about the extreme events we are interested in. This is why it makes sense to model the tail separately.

$$R(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)$$
We can describe the problem more precisely by saying that our objective is frequency-weighted: frequent events dominate the loss term, while less frequent (rare) events contribute least, even though they are exactly what our model needs to learn.

$$R(\theta) = p_{\text{bulk}}\,\mathbb{E}\big[\ell \mid \text{bulk}\big] + p_{\text{tail}}\,\mathbb{E}\big[\ell \mid \text{tail}\big], \qquad p_{\text{tail}} \ll p_{\text{bulk}}$$
So that our model can learn from the rare events themselves, we choose a constant threshold $u$ for a continuous severity variable $s$, such as soft X-ray flux (anything that measures flare severity would work). We set our target to the excess of the observed severity over the threshold, and use only data from the tail of the distribution.

Then the data we model are the exceedances:

$$\{\, z_i = s_i - u \ :\ s_i > u \,\}$$
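Under the (hypothetical) assumption that severities arrive as a flat array of peak fluxes, extracting the tail data is a single masking step. The synthetic lognormal data below only stands in for real flux measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
flux = rng.lognormal(mean=-16.0, sigma=2.0, size=100_000)  # synthetic severities, W/m^2

u = 1e-5                       # threshold: the M-1 flux level
excesses = flux[flux > u] - u  # z_i = s_i - u, the only data the tail model sees
print(f"{excesses.size} exceedances out of {flux.size:,} samples")
```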
Using Transformers
We can now combine our original model and the tail model using a transformer, to achieve a more robust solution that ideally learns what happens both below the rare-event threshold and above it. In other words, we would like the model to learn both the original discrete outcome (flare or no flare) and the shape of the excess risk defined by the tail model. For this, we can use a transformer with different heads: the model encodes the magnetic history into a representation h, and separate heads estimate different quantities such as flare probability, uncertainty, tail exceedance, and precursor signals, as sketched below.
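A minimal PyTorch sketch of this multi-head design. All layer sizes and the pooling choice here are illustrative assumptions, not prescribed by the article:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlareTransformer(nn.Module):
    """Shared encoder; one head for P(exceedance), one for GPD tail params."""

    def __init__(self, n_features=9, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_head = nn.Linear(d_model, 1)   # flare-probability head
        self.tail_head = nn.Linear(d_model, 2)  # GPD (sigma, xi) head

    def forward(self, x):                  # x: (batch, 24, 9) magnetic history
        h = self.encoder(self.embed(x))    # (batch, 24, d_model)
        h = h.mean(dim=1)                  # pool over time: the representation h
        p = torch.sigmoid(self.cls_head(h)).squeeze(-1)
        raw = self.tail_head(h)
        sigma = F.softplus(raw[..., 0]) + 1e-6  # scale must stay positive
        xi = raw[..., 1]                        # tail heaviness, unconstrained
        return p, sigma, xi

p, sigma, xi = FlareTransformer()(torch.randn(8, 24, 9))
```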

The classification head, which estimates the probability that our target is one given our data, is often trained with the binary cross-entropy, perhaps weighted to account for class imbalance.
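Written out, with a class weight $w$ upweighting the rare positives (choosing $w$ is a tuning decision, not specified here):

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w\,y_i\log\hat p_i + (1 - y_i)\log(1 - \hat p_i)\,\Big]$$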
We can use the Generalized Pareto Distribution (GPD), which provides a compact model for the excesses (our tail distribution). Here, $\sigma$ controls the scale and $\xi$ controls the tail heaviness. The transformer produces a representation $h$ of the recent solar state and maps it into GPD parameters, so different magnetic histories imply different tail distributions for each active region (sunspot group).

$$f(z;\, \sigma, \xi) = \frac{1}{\sigma}\left(1 + \frac{\xi z}{\sigma}\right)^{-1/\xi - 1}, \qquad z > 0,\ \sigma > 0$$
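A sketch of the corresponding negative log-likelihood in PyTorch, assuming $\xi \neq 0$ (the $\xi \to 0$ exponential limit would need a separate branch):

```python
import torch

def gpd_nll(z, sigma, xi, eps=1e-6):
    """Mean negative log-likelihood of excesses z under GPD(sigma, xi).
    Clamping keeps the log defined when 1 + xi*z/sigma dips toward zero."""
    t = torch.clamp(1.0 + xi * z / sigma, min=eps)
    return (torch.log(sigma) + (1.0 / xi + 1.0) * torch.log(t)).mean()
```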
The full objective combines the two forecasting tasks. The classification term teaches the model to estimate whether a flare crosses the chosen threshold, while the tail term teaches it what the excess severity looks like once that threshold has been crossed. This matters because the model should not only learn “flare or no flare.” It should also learn how large the event might be once it enters the dangerous part of the distribution.

$$\mathcal{L} = \mathcal{L}_{\text{BCE}} + \lambda \sum_{i\,:\, s_i > u} -\log f\big(z_i;\, \sigma_i, \xi_i\big)$$
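Putting the two terms together, reusing the `gpd_nll` sketch above; the weighting `lam` is a hypothetical tuning knob:

```python
import torch
import torch.nn.functional as F

def combined_loss(p, sigma, xi, y, z, lam=1.0):
    """Joint objective: BCE on the exceedance label over all samples, plus
    lam * GPD NLL on the excesses of the positive samples only.
    z holds s - u and is only meaningful where y == 1."""
    bce = F.binary_cross_entropy(p, y)
    mask = y > 0.5                      # tail term applies only where s > u
    if mask.any():
        tail = gpd_nll(z[mask], sigma[mask], xi[mask])
    else:
        tail = torch.zeros((), device=p.device)
    return bce + lam * tail
```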



NASA, Sunspots 1302 Sep 2011 by NASA.jpg, September 24, 2011, via Wikimedia Commons. Public domain
Conclusion
When it comes to getting a good forecast for a very rare event using a transformer, it’s not enough to just plug in the data and minimize a loss function. For solar flares, localization and feature engineering must first be applied to the data. Then we need to specify a model target that distinguishes positive from negative events, and choose a metric that both rewards true positives and penalizes false positives. Because of the huge class imbalance, it also makes sense to build a tail model that uses the generalized Pareto distribution to model exceedances beyond a threshold. These techniques and loss functions can serve as different heads of a transformer that both classifies and estimates severity, learning how large an event might be once it enters the dangerous part of the distribution. What we get is improved predictive performance and a better-specified model.
Website | LinkedIn | GitHub



