We start by importing some helpful libraries.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import powerlaw
from scipy.stats import kurtosis
Next, we will load each dataset and store them in a dictionary.
filename_list = ['medium-followers', 'YT-earnings', 'LI-impressions']
df_dict = {}

for filename in filename_list:
    df = pd.read_csv('data/' + filename + '.csv')
    df = df.set_index(df.columns[0]) # set index
    df_dict[filename] = df
At this point, looking at the data is always a good idea. We can do that by plotting histograms and printing the top 5 records for each dataset.
for filename in filename_list:
    df = df_dict[filename]

    # plot histograms (function below is defined in the notebook on GitHub)
    plot_histograms(df.iloc[:,0][df.iloc[:,0]>0], filename, filename.split('-')[1])
    plt.savefig("images/"+filename+"_histograms.png")

    # print top 5 records
    print("Top 5 Records by Percentage")
    print((df.iloc[:,0]/df.iloc[:,0].sum()).sort_values(ascending=False)[:5])
    print("")
Based on the histograms above, each dataset appears fat-tailed to some extent. Let’s see the top 5 records by percentage to get another look at this.
From this view, Medium followers appear the most fat-tailed, with 60% of followers coming from just 2 months. YouTube earnings are also strongly fat-tailed, where about 60% of revenue comes from just 4 videos. LinkedIn impressions seem the least fat-tailed.
While we may get a qualitative sense of the fat-tailedness just by looking at the data, let’s make this more quantitative via our 4 heuristics.
Heuristic 1: Power Law Tail Index
To obtain an α for each dataset, we can use the powerlaw library as we did in the previous article. This is done in the code block below, where we perform the fit and print the parameter estimates for each dataset in a for loop.
for filename in filename_list:
    df = df_dict[filename]

    # perform power law fit
    results = powerlaw.Fit(df.iloc[:,0])

    # print results (powerlaw's alpha is the PDF exponent; subtracting 1 gives the tail index)
    print("")
    print(filename)
    print("-"*len(filename))
    print("Power Law Fit")
    print("alpha = " + str(results.power_law.alpha - 1))
    print("xmin = " + str(results.power_law.xmin))
    print("")
The results above match our qualitative assessment that Medium followers are the most fat-tailed, followed by YouTube earnings and LinkedIn impressions (remember, a smaller α means a fatter tail).
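As an optional visual check (not part of the original walkthrough), the powerlaw library can overlay the fitted power law on the empirical complementary CDF, which makes it easier to judge how well the tail is captured. A minimal sketch for one dataset, with an illustrative output path, might look like this.
results = powerlaw.Fit(df_dict['medium-followers'].iloc[:,0])

# empirical CCDF with the fitted power law overlaid
ax = results.plot_ccdf(label='empirical')
results.power_law.plot_ccdf(ax=ax, linestyle='--', label='power law fit')
ax.legend()
plt.savefig("images/medium-followers_ccdf.png")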
Heuristic 2: Kurtosis
An easy way to compute kurtosis is to use an off-the-shelf implementation. Here, I use SciPy and print the results in a similar way as before.
for filename in filename_list:
    df = df_dict[filename]

    # print results
    print(filename)
    print("-"*len(filename))
    print("kurtosis = " + str(kurtosis(df.iloc[:,0], fisher=True)))
    print("")
Kurtosis tells us a different story than Heuristic 1. The ranking of fat-tailedness according to this measure is as follows: LinkedIn > Medium > YouTube.
However, these results should be taken with a grain of salt. As we saw with the power law fits above, all 3 datasets fit a power law with α < 4, for which the kurtosis is infinite (a power law only has finite kurtosis when α > 4). So, while the computation returns a value, it's probably wise to be suspicious of these numbers.
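To see why the reported kurtosis is untrustworthy, here is a small illustrative experiment of my own (not from the original analysis): the sample kurtosis of draws from a Pareto distribution with tail index α = 3, whose true kurtosis is infinite, tends to keep growing as the sample gets larger rather than settling down to a stable value.
# illustrative only: sample kurtosis does not converge when the true kurtosis is infinite
np.random.seed(0)
for size in [1_000, 10_000, 100_000, 1_000_000]:
    pareto_sample = np.random.pareto(3, size) + 1  # Pareto with tail index alpha = 3
    print(str(size) + ": kurtosis = " + str(kurtosis(pareto_sample, fisher=True)))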
Heuristic 3: Log-normal’s σ
We can again use the powerlaw library to obtain σ estimates similar to what we did for Heuristic 1. Here’s what that looks like.
for filename in filename_list:
    df = df_dict[filename]

    # perform fit (the same powerlaw.Fit call also fits a log-normal)
    results = powerlaw.Fit(df.iloc[:,0])

    # print results
    print("")
    print(filename)
    print("-"*len(filename))
    print("Log Normal Fit")
    print("mu = " + str(results.lognormal.mu))
    print("sigma = " + str(results.lognormal.sigma))
    print("")
Looking at the σ values above, we see all fits imply the data are fat-tailed, with Medium followers and LinkedIn impressions having similar σ estimates. YouTube earnings, on the other hand, have a significantly larger σ value, implying a (much) fatter tail.
One cause for suspicion, however, is the negative μ estimate, which suggests a log-normal fit may not explain the data well.
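One way to probe that suspicion (my own addition, not done in the original analysis) is the powerlaw library's built-in likelihood-ratio test, which compares two candidate distributions for the tail directly.
for filename in filename_list:
    results = powerlaw.Fit(df_dict[filename].iloc[:,0])

    # R > 0 favors the first distribution (power law), R < 0 the second (log-normal);
    # p is the significance of that sign
    R, p = results.distribution_compare('power_law', 'lognormal')
    print(filename + ": R = " + str(R) + ", p = " + str(p))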
Heuristic 4: Taleb’s κ
Since I couldn’t find an off-the-shelf Python implementation for computing κ (I didn’t look very hard), this computation requires a few extra steps. Namely, we need to define 3 helper functions, as shown below.
def mean_abs_deviation(S):
    """
    Computation of the mean absolute deviation of an input sample S
    """
    M = np.mean(np.abs(S - np.mean(S)))
    return M

def generate_n_sample(X,n):
    """
    Function to generate n random samples of size len(X) from an array X
    """
    # initialize sample
    S_n = 0
    for i in range(n):
        # randomly sample len(X) observations from X and add them to the running sum
        S_n = S_n + X[np.random.randint(len(X), size=int(np.round(len(X))))]
    return S_n

def kappa(X,n):
    """
    Taleb's kappa metric from n0=1 as described here: https://arxiv.org/abs/1802.05495

    Note: K_1n = kappa(1,n) = 2 - ((log(n)-log(1))/log(M_n/M_1)), where M_n denotes the mean absolute deviation of the sum of n random samples
    """
    S_1 = X
    S_n = generate_n_sample(X,n)

    M_1 = mean_abs_deviation(S_1)
    M_n = mean_abs_deviation(S_n)

    K_1n = 2 - (np.log(n)/np.log(M_n/M_1))
    return K_1n
The first function, mean_abs_deviation(), computes the mean absolute deviation as defined earlier.
Next, we need a way to generate and sum n samples from our empirical data. Here, I take a naive approach and randomly sample an input array (X) n times and sum the samples together.
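For what it's worth, the indexing trick in generate_n_sample() is just resampling with replacement; an equivalent version using np.random.choice (a hypothetical alternative, not the author's code) makes that a bit more explicit.
def generate_n_sample_alt(X, n):
    """Alternative sketch of the same idea: sum n bootstrap resamples (with replacement) of X."""
    S_n = np.zeros(len(X))
    for _ in range(n):
        S_n = S_n + np.random.choice(X, size=len(X), replace=True)
    return S_n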
Finally, I bring together mean_abs_deviation(S) and generate_n_sample(X,n) to implement the κ calculation defined before and compute it for each dataset.
n = 100 # number of samples to include in kappa calculation

for filename in filename_list:
    df = df_dict[filename]

    # print results
    print(filename)
    print("-"*len(filename))
    print("kappa_1n = " + str(kappa(df.iloc[:,0].to_numpy(), n)))
    print("")
The results above give us yet another story. However, given the inherent randomness of this calculation (recall the generate_n_sample() definition) and the fact that we’re dealing with fat tails, point estimates (i.e. just running the computation once) cannot be trusted.
Accordingly, I run the same calculation 1000x and print the mean κ(1,100) for each dataset.
num_runs = 1_000
kappa_dict = {}

for filename in filename_list:
    df = df_dict[filename]

    kappa_list = []
    for i in range(num_runs):
        kappa_list.append(kappa(df.iloc[:,0].to_numpy(), n))

    kappa_dict[filename] = np.array(kappa_list)

    print(filename)
    print("-"*len(filename))
    print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
    print("")
These more stable results indicate Medium followers are the most fat-tailed, followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref [3] to better understand each κ value. Namely, these values are comparable to a Pareto distribution with α between 2 and 3.
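As a rough sanity check on that comparison (my own addition, not from the original post), one can run the same κ(1,100) computation on synthetic Pareto samples with known tail indices and see where the empirical values fall. The sample size and number of runs below are arbitrary choices.
# illustrative calibration: mean kappa_1n for synthetic Pareto samples with known alpha
np.random.seed(0)
for alpha in [2, 2.5, 3]:
    synthetic = np.random.pareto(alpha, 10_000) + 1  # Pareto with tail index alpha, x_min = 1
    kappa_runs = [kappa(synthetic, n) for _ in range(100)]
    print("alpha = " + str(alpha) + ": mean kappa_1n = " + str(np.mean(kappa_runs)))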
Although each heuristic told a slightly different story, all signs point toward Medium followers gained being the most fat-tailed of the 3 datasets.
While binary labeling data as fat-tailed (or not) may be tempting, fat-tailedness lives on a spectrum. Here, we broke down 4 heuristics for quantifying how fat-tailed data are.
Although each approach has its limitations, they provide practitioners with quantitative ways of comparing the fat-tailedness of empirical data.
👉 More on Power Laws & Fat Tails: Introduction | Power Law Fits