Interpreting Machine Learning Models Using Data-Centric Explainable AI | by Aditya Bhattacharya | Feb, 2023


Source: Pixabay

Explainable AI (XAI) is an emerging concept that aims to bridge the gap between AI and end-users, thereby increasing AI adoption. XAI can make AI/ML models more transparent, trustworthy, and understandable. It is a necessity, especially for critical domains such as healthcare, finance, and law enforcement.

For an introduction to XAI, my 45-minute presentation from the AI Accelerator Festival APAC, 2021, is a helpful starting point.

Popular XAI methods, such as LIME, SHAP, and Saliency Maps, are model-centric explanation methods. They approximate the important features that machine learning models use to generate predictions. However, because of the inductive bias of ML models, these estimates of feature importance are not always correct. Consequently, model-centric feature importance methods are not always useful.

Additionally, following the principles of Data-Centric AI, the quality of ML models is only as good as the quality of the data used to train them. Data quality issues caused by correlated features, data drift, outliers, skewed data, and so on can degrade the performance of trained ML models. Yet non-technical consumers of AI are rarely aware of the quality of the datasets used to train ML models. Hence, Data-Centric Explainable AI (DCXAI) is a better choice than model-centric explanations when potential data issues are detected during the training and inference of ML models.

If you are interested in how Data-Centric Explainable AI can be leveraged to explain ML models in a high-stakes domain, such as healthcare, please take a look at my research publication — Directive Explanations for Monitoring the Risk of Diabetes Onset: Introducing Directive Data-Centric Explanations and Combinations to Support What-If Explorations.

As discussed in my book Applied Machine Learning Explainability Techniques, Data-Centric Explainable AI (DCXAI) is an XAI method that explains how an ML model can behave by generating insights about the underlying dataset used to train it.

Source: Pixabay

Examples of data-centric explanation approaches include summarizing datasets using common statistical measures such as mean, mode, and variance; visualizing data distributions to compare an instance's feature values against the rest of the dataset; and observing changes in model predictions through what-if analysis to probe the sensitivity of the features. Data-centric explanations also include data-driven, rule-based approaches that are commonly adopted in decision support systems. Furthermore, DCXAI creates awareness about data quality by sharing insights about data issues, such as data drift, skewed data, outliers, and correlated features, that can impact the overall performance of ML models.
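
The snippet below is a minimal sketch of the first two of these approaches (statistical summaries and comparing an instance against the data distribution), assuming a tabular dataset held in a pandas DataFrame. The feature names and synthetic values are purely illustrative.

```python
# A minimal sketch of two data-centric explanation building blocks:
# (1) summarizing the training data with simple statistics, and
# (2) comparing one instance's feature values against the rest of the dataset.
import numpy as np
import pandas as pd

# Synthetic stand-in for a real training set.
rng = np.random.default_rng(seed=42)
train_df = pd.DataFrame({
    "age": rng.normal(45, 12, 1000).round(),
    "bmi": rng.normal(27, 4, 1000),
    "glucose": rng.normal(100, 15, 1000),
})

# Global explanation: a compact statistical profile of the dataset.
profile = train_df.agg(["mean", "median", "std", "min", "max"]).round(2)
print(profile)

# Local explanation: where does one instance sit within each feature's distribution?
instance = {"age": 62, "bmi": 33.5, "glucose": 140}
for feature, value in instance.items():
    percentile = (train_df[feature] < value).mean() * 100
    print(f"{feature} = {value}: higher than {percentile:.0f}% of the training data")
```

Statements of this form ("this person's glucose value is higher than most of the training data") are the kind of local, data-centric explanation that non-technical users can interpret without knowing anything about the model's internals.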

More recently, due to the failure of ML models trained on biased, inconsistent, and poor-quality data, the ML research community has been exploring data-centric approaches to training ML models instead of relying solely on hyperparameter tuning and trying different ML algorithms. If the data is consistent, unambiguous, balanced, and available in sufficient quantity, ML models can be trained faster, reach higher accuracy, and be deployed more quickly in any production-level system.

Unfortunately, not all AI and ML systems in production today are aligned with the principles of data-centric AI. Consequently, there can be severe issues with the underlying data that seldom get detected but eventually lead to the failure of ML systems. That is why DCXAI is important for inspecting and evaluating the quality of the data being used.

There are different approaches to data-centric explanations, which can be categorized into the following types:

  • Generating insights about the training data — Exploratory Data Analysis (EDA) is an important practice carried out by all data scientists and ML experts before building ML models. However, the insights generated through EDA are rarely communicated to the non-technical consumers of ML models. So, one DCXAI approach is to communicate these insights to end-users to explain the potential behavior of ML models. This is particularly useful for domain experts who may not have ML knowledge but are experts in their own domains.
    Moreover, visualizing the data distribution can indicate how well the dataset is balanced and can reveal skewness and outliers in the training data that may affect the model. Generating insights by building data profiles through statistical measures can also be very useful for both local and global explanations.
  • Highlighting the data quality — Most of the time, the poor performance of ML models is related to the poor quality of the data used to train them. However, information on data quality is rarely communicated to end-users. Consequently, when ML models fail to generate good predictions, end-users are never made aware of the issues in the dataset. So, DCXAI involves explaining the data quality by communicating potential data issues such as data drift, correlated features, class imbalance, and biased datasets (simple checks of this kind are sketched in the code after this list).
    Some of these data issues are admittedly technical concepts that can be complicated to understand. But when presented through simplified and interactive visualizations, they create awareness about data quality, which highlights the true reason for the failure of ML models.
  • Estimating data forecastability — Sometimes datasets are too noisy, and getting beyond a certain level of accuracy is simply not possible with such data. How, then, do we gain the trust of end-users if we know that the trained model is not highly accurate? The best way to gain trust is to be transparent and clearly communicate what is feasible. So, measuring data forecastability and communicating the model's efficiency to end-users helps set the right expectations. Data forecastability is an estimate of the achievable model performance given the underlying data.
    For example, suppose we have a model to predict the stock price of a particular company, and the available data only allows the price to be predicted with at most 60% accuracy. Beyond that point, it is not practically possible to generate a more accurate outcome from the given dataset. But if other external factors are considered to supplement the current data, the model's accuracy can be boosted. This shows that it is not the ML algorithm that is limiting the performance of the system; rather, the dataset used for modeling does not contain sufficient information for better model performance. Hence, it is a limitation of the dataset, and it can be estimated by measuring data forecastability. It is better to estimate data forecastability at a granular level to give additional insights about the performance of the ML model at different values of demographic variables, as illustrated by the following diagram and sketched in the code after it.
Data forecastability estimate for different demographic variables (image by author)
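
As a rough illustration of the data-quality and forecastability checks described above, the sketch below flags highly correlated features, reports class imbalance, runs a simple mean-shift drift check between training and inference data, and then estimates forecastability separately for each demographic group as cross-validated accuracy. The column names, thresholds, and choice of model are assumptions made for illustration only.

```python
# A minimal, hypothetical sketch of data-quality and forecastability checks.
# The synthetic data below stands in for a real training set.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=0)
n = 1000
train_df = pd.DataFrame({
    "age": rng.normal(45, 12, n),
    "bmi": rng.normal(27, 4, n),
    "glucose": rng.normal(100, 15, n),
    "gender": rng.choice(["male", "female"], n),
})
train_df["diabetes"] = (train_df["glucose"] + rng.normal(0, 10, n) > 110).astype(int)
features = ["age", "bmi", "glucose"]

# 1. Correlated features: flag feature pairs above a correlation threshold.
corr = train_df[features].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print("Highly correlated pairs:\n", upper[upper > 0.8])

# 2. Class imbalance: report the target class proportions.
print("Class balance:\n", train_df["diabetes"].value_counts(normalize=True))

# 3. Simple drift check: compare training and inference feature means.
inference_df = train_df.sample(200, random_state=1).assign(glucose=lambda d: d["glucose"] + 20)
mean_shift = (inference_df[features].mean() - train_df[features].mean()).abs()
print("Mean shift per feature:\n", mean_shift.round(2))

# 4. Data forecastability per demographic group: cross-validated accuracy
#    computed separately for each gender sub-population.
for group, subset in train_df.groupby("gender"):
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             subset[features], subset["diabetes"], cv=5)
    print(f"Estimated forecastability for {group}: {scores.mean():.2f}")
```

In practice, more robust drift measures (for example, two-sample statistical tests) and calibrated performance metrics would be preferable, but the structure of the explanation stays the same: report the data issues and the achievable performance for each sub-population.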

Now that we have understood the different approaches to DCXAI, let us also summarize its benefits:

  • Easy detection of biased and unfair data.
  • Creating awareness about issues with data quality, purity, and integrity to explain the failure of ML models.
  • DCXAI is simpler for non-technical consumers of ML to understand than popular model-centric explanation methods such as LIME, SHAP, and saliency maps.
  • Domain experts tend to have more trust in DCXAI than in LIME and SHAP, as DCXAI creates more transparency about the datasets used to train ML models. They can use DCXAI to justify model-generated predictions by referring to the underlying training data.

In my recent research publication — Directive Explanations for Monitoring the Risk of Diabetes Onset — I presented an elaborate user-centered design process for an XAI dashboard that includes DCXAI. We further made DCXAI more actionable by making the following adaptations to tailor these explanations for healthcare experts:

  • Providing interactive visual explanations for exploring what-if scenarios (a minimal what-if sketch follows this list).
  • Considering only actionable feature variables instead of non-actionable features.
  • Providing explicit visual indicators that enable users to explore the system and understand the workings of the ML models.
  • Obtaining local explanations with a global perspective.
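
The sketch below illustrates the first two adaptations in their simplest form, assuming a hypothetical tabular risk model: only actionable features (here, bmi and glucose) are perturbed in the what-if probes, while the non-actionable feature (age) is left fixed. The model, feature names, and perturbation values are illustrative assumptions, not the dashboard or dataset from the paper.

```python
# A minimal what-if sketch: perturb only actionable features of a single
# instance and report how the predicted risk changes. Everything here is
# synthetic and stands in for a real health-risk model and dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data standing in for a real health-risk dataset.
train_X = pd.DataFrame({
    "age": [35, 52, 61, 44, 58, 39],
    "bmi": [24.0, 31.5, 29.0, 27.5, 33.0, 22.0],
    "glucose": [90, 130, 125, 105, 150, 85],
})
train_y = [0, 1, 1, 0, 1, 0]
model = LogisticRegression(max_iter=1000).fit(train_X, train_y)

# A single patient whose prediction we want to explain.
patient = pd.DataFrame([{"age": 58, "bmi": 33.0, "glucose": 140}])
baseline_risk = model.predict_proba(patient)[0, 1]
print(f"Current predicted risk: {baseline_risk:.2f}")

# What-if: vary one actionable feature at a time; age stays untouched.
actionable_changes = {"bmi": -3.0, "glucose": -20}
for feature, delta in actionable_changes.items():
    what_if = patient.copy()
    what_if[feature] += delta
    new_risk = model.predict_proba(what_if)[0, 1]
    print(f"If {feature} changes by {delta}: risk {baseline_risk:.2f} -> {new_risk:.2f}")
```

In a dashboard setting, such probes would typically be exposed through interactive visual controls rather than code, but the underlying computation is the same kind of controlled perturbation of actionable features.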

With these additional modifications to traditional data-centric explanation approaches, we have designed and developed Visually Directive Data-Centric Explanations. You can find out more about this research study in the research paper.

In this article, we have covered Data-Centric Explainable AI (DCXAI), the different approaches to providing data-centric explanations, and how DCXAI differs from model-centric XAI methods such as LIME and SHAP. We have also discussed the benefits of DCXAI and how it can be further adapted to generate more actionable explanations for domain experts and lay users. You can learn more about data-centric explanations in my book Applied Machine Learning Explainability Techniques and in the code examples presented in the GitHub repo from the book.

  1. Directive Explanations for Monitoring the Risk of Diabetes Onset: Introducing Directive Data-Centric Explanations and Combinations to Support What-If Explorations
  2. Applied Machine Learning Explainability Techniques
  3. GitHub repo from the book Applied Machine Learning Explainability Techniques — https://github.com/PacktPublishing/Applied-Machine-Learning-Explainability-Techniques/