Conceptual Frameworks for Data Science Projects

Conceptual frameworks are analytical structures for representing abstract concepts and organizing data. Data scientists regularly use such frameworks, knowingly or unknowingly, to derive project plans, select machine learning models that balance various trade-offs, and present findings and recommendations to stakeholders. This article provides an overview of common types of conceptual frameworks, a simple three-step process for building custom frameworks, and tips for successfully doing so.

Note: All figures in the following sections have been created by the author of this article.

Common Framework Types

Although conceptual frameworks come in many different shapes and sizes, four basic framework types stand out as being especially common in data science projects: hierarchies, matrices, process flows, and relational maps. We will briefly go over each of these framework types below.

Hierarchies

Hierarchical frameworks often take the form of tree diagrams, starting with a root node and ending with several leaf nodes, as shown in Figure 1. For example, the root node may represent an overarching concept in a taxonomy or an initial binary question in a decision tree. A node’s position in the hierarchy (or tree) gives us valuable information about its relationship to other nodes. Although Figure 1 labels the items in the hierarchy as “concepts,” they can be any kind of entity. Entities may be neutral (e.g., concepts, topics, segments) or have some positive or negative valence (e.g., revenues, costs, problems, issues). The hierarchical structure can vary in depth and breadth.

Figure 1: Generic Structure of a Hierarchical Framework

In visual representations of hierarchies, vertical links between two entities are typically drawn explicitly and can be non-directional (simple lines) or directional (downward or upward arrows, depending on the flow of the relationship). By contrast, horizontal links between entities at the same level of a hierarchy are typically not shown explicitly. Same-level entities may be subject to a natural ordering (e.g., temporal or spatial), which can be shown by placing them accordingly in the framework. For instance, entities that occur earlier in an ordering should be placed to the left of entities that occur later. If the entities do not come with a natural ordering, you can still consider ordering them in some way (e.g., by level of importance or priority) to aid reasoning. Entities at the same level in a hierarchy should generally also be at the same level of abstraction.

In many situations, it helps if the nodes of a hierarchy are mutually exclusive and collectively exhaustive, or MECE (pronounced “me-see”), to a large extent. Being mutually exclusive means that the concepts represented by individual nodes have no major overlaps (i.e., no redundancies), while being collectively exhaustive means that the framework leaves out nothing important. A MECE hierarchy can be useful for breaking down a broad concept into sub-concepts (or components) to identify key drivers of the whole.
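As a minimal sketch, a hierarchy can be stored as a nested dictionary, and the MECE property can be checked mechanically on its leaf nodes. The example taxonomy, the `universe` set of items that must be covered, and the helper names `leaves` and `is_mece` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical taxonomy: each key is a node, each value holds its children.
# Leaf nodes map to an empty dict.
hierarchy = {
    "Profit": {
        "Revenues": {"Product sales": {}, "Service fees": {}},
        "Costs": {"Fixed costs": {}, "Variable costs": {}},
    }
}


def leaves(tree):
    """Return the leaf labels of a nested-dict hierarchy, left to right."""
    result = []
    for label, children in tree.items():
        if children:
            result.extend(leaves(children))
        else:
            result.append(label)
    return result


def is_mece(tree, universe):
    """Check the leaves against an assumed set of items that must be covered.

    Mutually exclusive: no leaf label appears more than once.
    Exhaustive: every item in `universe` appears as a leaf.
    """
    counts = Counter(leaves(tree))
    exclusive = all(n == 1 for n in counts.values())
    exhaustive = set(universe) <= set(counts)
    return exclusive and exhaustive


universe = {"Product sales", "Service fees", "Fixed costs", "Variable costs"}
print(leaves(hierarchy))
print(is_mece(hierarchy, universe))
```

In practice, deciding what belongs in `universe` is the hard part; the check only makes overlaps and gaps visible once that judgment has been made.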

Matrices

A matrix is a tabular data structure consisting of n rows and m columns. Data scientists working on tabular use cases routinely leverage matrices for storing training data and model weights. Training machine learning models can yield high-dimensional matrices of weights that capture complex relationships between predictors and targets. Low-dimensional matrices like the one shown in Figure 2 can be useful for analyzing problems and communicating key insights.

Figure 2: Generic Structure of a Two-by-Two Matrix Framework

The generic two-by-two matrix shown in Figure 2 compares two different dimensions against each other. Such a matrix naturally yields four quadrants. By convention, the bottom-left quadrant (where both dimensions are “low”) is typically the undesirable region of the matrix, and the top-right quadrant (where both dimensions are “high”) represents the desirable region. For example, the market research firm Gartner publishes two-by-two matrices, known as “Magic Quadrants,” to analyze the competitive landscape in various technology markets; the market leaders are plotted in the top-right quadrant.
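The quadrant logic can be sketched in a few lines. The function name `quadrant`, the 0.5 thresholds, and the scored items are illustrative assumptions; any real analysis would need its own justified cut-offs.

```python
def quadrant(x, y, x_threshold=0.5, y_threshold=0.5):
    """Place a point in a two-by-two matrix.

    Scores at or above a threshold count as "high", the rest as "low".
    The default thresholds are illustrative assumptions, not a standard.
    """
    horizontal = "right" if x >= x_threshold else "left"
    vertical = "top" if y >= y_threshold else "bottom"
    return f"{vertical}-{horizontal}"


# Hypothetical items scored on Dimension 1 (x) and Dimension 2 (y), both in [0, 1].
items = {"A": (0.9, 0.8), "B": (0.2, 0.7), "C": (0.1, 0.3), "D": (0.6, 0.4)}

for name, (x, y) in items.items():
    print(name, quadrant(x, y))
```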

The dimensions of a matrix may represent continuous, ordinal or categorical data types. Ideally, these dimensions (or axes) should be important to the overarching framework objective in some way (e.g., key sub-concepts, problems, or drivers in a given context). The interactions between these dimensions should be of particular interest as a source of insight, since it is these interactions that matrices can capture well.

In general, the MECE principle also applies to the choice of dimensions — they should collectively cover the important sub-concepts or drivers of the problem being investigated and avoid redundancies. Otherwise, looking at the interaction will be no different from looking at an individual dimension. If analyzing the interaction is not important, a hierarchical framework may be more suitable. Converting between a matrix framework and its hierarchical analog can be straightforward. For instance, to transform the matrix in Figure 2 into a hierarchy, create a root node that defines the overall context, let its child nodes be Dimensions 1 and 2, and let their respective child nodes be “high” and “low.”
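The conversion from a matrix to its hierarchical analog can be sketched as follows. The function name `matrix_to_hierarchy` and the labels are hypothetical. The sketch also lists the matrix cells to show what the hierarchy no longer represents explicitly, namely the combinations of levels across dimensions.

```python
from itertools import product


def matrix_to_hierarchy(context, dimensions):
    """Build a nested-dict hierarchy: root -> dimensions -> levels."""
    return {
        context: {
            dimension: {level: {} for level in levels}
            for dimension, levels in dimensions.items()
        }
    }


dimensions = {"Dimension 1": ["low", "high"], "Dimension 2": ["low", "high"]}

# Hierarchical view: each dimension is broken down on its own.
tree = matrix_to_hierarchy("Overall context", dimensions)
print(tree)

# Matrix view: every combination of levels is a cell (the four quadrants).
cells = list(product(*dimensions.values()))
print(cells)
```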

Process Flows

A process flow defines a sequence of logically ordered activities that interact to achieve an overarching objective. For instance, tools such as Dataiku and KNIME allow users to construct data science pipelines as process flows, going from data ingestion all the way to modeling and report generation. Figure 3 depicts a generic process framework.

Figure 3: Generic Structure of a Process Framework

The entities of the process in Figure 3 are labeled as activities, but these could be steps, stages, operations, etc. The process starts with an activity (Activity 1), ends with an activity (Activity 3), and has one or more activities in between (Activity 2). Some inputs are typically fed into the process at the start and transformed over the sequence of activities to yield an output. Note that inputs and outputs can also enter and leave at intermediate steps within the process.

As with hierarchies and matrices, the MECE principle can be important in formulating the different activities of the process. If two activities have significant conceptual overlap, you could consider either grouping them into a single activity or breaking them up into a more granular set of distinct activities. For instance, the intermediate activities in Figure 3 may have resulted from this sort of analysis; Activity 2 could be the outcome of merging some overlapping activities, while Activities 2.1 to 2.3 could be a granular breakdown of a special subset of those merged activities. If an activity or a larger part of the process repeats, then it can be represented as a cycle, whereby an activity transitions to another activity that has already occurred before.

The transition from one activity to another should meaningfully transform the inputs of the process (e.g., by increasing, reducing, combining or otherwise altering the inputs in some way) with the aim of producing the desired output. If a transition does not change the inputs, then the two activities on either side of the transition are likely redundant and should be merged or split up differently, as discussed above.
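One way to make this check concrete is to run a process as an ordered list of functions and record whether each activity actually changed its input. This is only a sketch: the helper `run_process` and the activities are hypothetical, and an activity that leaves one particular input unchanged is a hint of redundancy rather than proof.

```python
def run_process(data, activities):
    """Apply each activity in order and record whether it changed its input."""
    log = []
    for name, activity in activities:
        result = activity(data)
        log.append((name, result != data))
        data = result
    return data, log


# Hypothetical activities operating on a list of raw distance readings in metres.
activities = [
    ("Drop missing values", lambda rows: [r for r in rows if r is not None]),
    ("Convert metres to kilometres", lambda rows: [r / 1000 for r in rows]),
    ("Sort ascending", lambda rows: sorted(rows)),
    ("Sort ascending again", lambda rows: sorted(rows)),  # redundant on purpose
]

output, log = run_process([3, None, 1, 2], activities)
for name, changed in log:
    print(f"{name}: {'changed input' if changed else 'no change'}")
```

Here the second sorting step reports no change, which flags it as a candidate for merging or removal.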

Relational Maps

Relational maps shift the focus from individual concepts (or entities) to the relationships between them. Data scientists working with knowledge graphs or box-and-arrow “path diagrams” of causal relationships (as shown in Figure 4) will be familiar with this framework type.

Figure 4: Generic Structure of a Path Diagram

A relationship can generally be any function that links two different concepts together. Four types of relationships are especially common:

Transactional: A relationship can represent one or more transactions between entities. The transactions may involve the flow of tangible things (e.g., products bought and sold) or intangible things (e.g., information, money). Transactional relationships can incorporate directionality; a transaction can flow from A to B, from B to A, or in both directions, and each of these cases has a different meaning for the entities (e.g., they may be receivers, senders, or both).
Causal: Entities A and B may be causally related if A is responsible — at least in part — for the occurrence or state of B (or vice versa). The nature of the causal relationship may vary. The role of A is strong if its presence is sufficient to fully cause B (although A may not be the only entity that can fully cause B). The role of A is also strong if it is necessary to cause B (although A may not be able to do this alone). Moreover, if A causes B, it does not necessarily follow that B causes A; the notion of directionality is clearly important for specifying causal relationships.
Similarity-based: Entities may be related because they are similar or dissimilar in some way. For example, entities A and B can be similar because they tend to appear in the same place or happen at the same time (and dissimilar if the occurrence of one entity tends to preclude the occurrence of the other). The notion of correlation is a mathematical formalization often used to construct measurable, similarity-based relationships. Note that a correlation between two entities does not necessarily mean that they are causally related (and although causally related entities are often correlated, a nonlinear causal relationship can yield little or no linear correlation).
Membership-based: Entities can be linked together by being members of the same group, community, or category. For instance, people can be related by being in the same neighborhood, grocery items can be part of the same product category, and a set of sub-concepts may be part of an overarching concept. Indeed, one could apply a hierarchical framework to drill down into successively deeper levels of membership within entities under consideration.
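The four relationship types can be represented in a simple typed edge list, where each edge records its type and whether direction matters. The entities, the tuple layout, and the helper `neighbors` below are hypothetical; a dedicated graph library would be a more natural choice for anything larger.

```python
# Each edge: (source, target, relationship type, is_directed)
edges = [
    ("Customer", "Store", "transactional", True),
    ("Advertising", "Sales", "causal", True),
    ("Ice cream sales", "Sunscreen sales", "similarity", False),
    ("Apples", "Pears", "membership", False),
]


def neighbors(edges, node, relation=None):
    """Return entities reachable from `node`, optionally for one relationship type.

    Directed edges are followed from source to target only;
    undirected edges are followed both ways.
    """
    found = []
    for source, target, kind, is_directed in edges:
        if relation is not None and kind != relation:
            continue
        if source == node:
            found.append(target)
        elif target == node and not is_directed:
            found.append(source)
    return found


print(neighbors(edges, "Advertising", "causal"))  # follows the causal arrow
print(neighbors(edges, "Sales"))                  # nothing: the arrow points into Sales
print(neighbors(edges, "Pears"))                  # membership is symmetric
```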

How to Build Your Own Frameworks

The following three-step process can be used to build a custom framework:

1. Define the framework’s objective.
2. Identify the right building blocks (i.e., the framework type and dimensions).
3. Put the building blocks together in an effective manner to answer the framework’s objective.

Step 1: Define the Objective

In defining the framework’s objective, ask yourself: In what context will the framework be used? What should the framework accomplish? Can an existing framework be reused — perhaps with some minor modifications — or does a new one need to be built to fit your specific needs?

The construction of the framework should be tied to a higher goal, such as the delivery of a project, formulation of a decision, or creation of some documentation. Once the context has been properly understood, careful consideration should be given to what the framework should accomplish in concrete terms. Is the framework intended as a decision-making tool? Is the framework meant to structure the flow of an argument in a report or a presentation?

Just because you need a framework does not mean that you must build one yourself. In many situations, existing conceptual frameworks can be reused without significant modification. Spending some effort to maintain a solid, up-to-date overview of relevant existing frameworks avoids downstream costs of “reinventing the wheel.” Reusing existing frameworks has benefits beyond not having to start from scratch; if the framework has been around for some time, its main features, as well as its strengths and limitations, may be well-documented and tested in different settings. Platforms such as Towards Data Science are a great source for keeping abreast of conceptual frameworks related to data science projects.

Step 2: Identify the Framework Type and Dimensions

Having clarified the objective of the framework, it is time to think more concretely about the construction of the framework itself. One of the main difficulties here is that conceptual frameworks are inherently not as tangible as physical ones (like molds in a factory). We tend to intuit the link between form and function — the framework and its purpose — more easily when the framework and its object are tangible. The hallmark of a good conceptual framework is its ability to turn a seemingly intangible argument or decision into something more tangible, and the key to this is representation.

Broadly speaking, there are two aspects that determine the representation of conceptual frameworks: the type of the framework and the dimensions of the framework. You are likely to notice the framework type first since it determines how the framework appears as a whole. The previous sections covered the four common framework types. The framework dimensions dictate what the framework can specifically represent (e.g., in terms of granularity and ordering). By adjusting the dimensions, the same framework type can be reused to generate a wide range of different insights. Following are three common classes of framework dimensions:

Categorical: These dimensions consist of a finite set of discrete categories that fully describe the dimension. The categories need not be ordered (e.g., a set of products, customer segments, gender).
Ordinal: These dimensions are ordered, which means that you can analyze whether something is “less than,” “greater than,” “equal to,” and so on, in relation to something else (e.g., negative/positive, low/medium/high).
Continuous: Such dimensions can take the notion of ordinal dimensions to a much more granular level. Being continuous means that the dimension is numerical and can include decimals (e.g., 1.23, -2.718, 3.14159).
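The three classes of dimensions can be illustrated with plain Python values. The segment names, the level labels, and the cut points are assumptions made for this sketch. It also shows how a continuous dimension can be coarsened into an ordinal one by binning.

```python
from bisect import bisect_right

SEGMENTS = {"retail", "wholesale", "online"}  # categorical: membership only, no order
LEVELS = ["low", "medium", "high"]            # ordinal: position in the list is the order
CUT_POINTS = [10.0, 20.0]                     # assumed bin edges for a continuous score


def to_ordinal(value):
    """Map a continuous value to an ordinal level using the cut points."""
    return LEVELS[bisect_right(CUT_POINTS, value)]


def is_greater(level_a, level_b):
    """Compare two ordinal levels by their position in LEVELS."""
    return LEVELS.index(level_a) > LEVELS.index(level_b)


print("online" in SEGMENTS)         # categorical check
print(to_ordinal(3.14159))          # below the first cut point
print(to_ordinal(25.0))             # above the last cut point
print(is_greater("high", "medium"))
```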

Step 3: Put It All Together

Once the framework type and dimensions have been identified, they can be combined to produce a custom framework. Often, the identification and combination steps are not explicitly separated, since you rarely do one without the other. But the framework type and its dimensions (the basic building blocks) are not necessarily wedded to each other. Some combinations may make more sense than others, and you can generally mix and match the building blocks in many ways, over several iterations, until the framework feels right. Being able to spot and exploit this combinatorial flexibility is a crucial skill that you should start developing from the outset of your framework-building journey.
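To get a feel for the size of the design space, the building blocks can be enumerated mechanically. The sketch below simply pairs each of the four framework types with each of the three dimension classes; the variable names are hypothetical, and judging which pairings are sensible remains a manual step.

```python
from itertools import product

FRAMEWORK_TYPES = ["hierarchy", "matrix", "process flow", "relational map"]
DIMENSION_CLASSES = ["categorical", "ordinal", "continuous"]

# Every pairing of a framework type with a dimension class is a candidate design.
candidates = list(product(FRAMEWORK_TYPES, DIMENSION_CLASSES))

print(len(candidates))
for framework_type, dimension_class in candidates[:3]:
    print(framework_type, "+", dimension_class)
```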

Moreover, there are broadly four “pathways of analysis” that capture the link between the framework and its objective:

Descriptive: Approaches the framework’s objective by gathering and organizing past information (e.g., using visuals such as graphs and tables, or written summaries). Doing so allows us to better describe and analyze what happened in the past, but it may not necessarily tell us why something happened, or whether it will happen again.
Diagnostic: Takes descriptive information of past events and goes a step further to look at why something happened. This is done by drilling down into the data, mining for clues and correlations, and trying to find a plausible link between cause and effect. As with the descriptive pathway, the focus is on the past.
Predictive: Differs from the prior two by asking and answering questions about the future. The focus is on making an educated guess about what will happen in the future by relying on a host of typically quantitative techniques that range from the simple (e.g., basic probability theory, linear models) to the more complex (e.g., neural nets).
Prescriptive: Goes beyond merely predicting future events to recommending ways to deal with them. The focus is on figuring out how to make something happen — or whether it should happen — in the future. The reasoning for the prescription can be quantitative (e.g., based on statistics or simulation modeling) or qualitative (e.g., based on personal experience).

Framework types and dimensions can therefore be combined in different ways to produce custom frameworks that lend themselves to descriptive, diagnostic, predictive, and prescriptive use cases.
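The four pathways can be sketched on a toy dataset. The spend and sales figures, the sales target, and the helper names `pearson` and `fit_line` are all assumptions for illustration. The predictive step uses an ordinary least-squares line, which is one simple option among many.

```python
from statistics import mean

# Hypothetical monthly data: advertising spend and units sold.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: do spend and sales move together? (A clue, not proof of causation.)
correlation = pearson(spend, sales)

# Predictive: what might sales be at a spend of 6.0?
slope, intercept = fit_line(spend, sales)
forecast = slope * 6.0 + intercept

# Prescriptive: what spend does the fitted line suggest for an assumed target?
target = 26.0
recommended_spend = (target - intercept) / slope

print(average_sales, correlation, forecast, recommended_spend)
```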

Top Tips

This section gives five tips for building good conceptual frameworks. The tips are by no means an exhaustive list of the points that you should consider, but represent a basic set of things to keep in mind.

Tip 1: Focus on the Objective and Audience

The process of building frameworks broadly consists of three steps, namely defining the objective, then identifying and combining the building blocks (framework types and dimensions) accordingly. While the first step will, by its nature, emphasize the strategic objective and target audience of the framework, the focus in the latter two steps shifts to the nitty-gritty details of the framework’s building blocks. The deeper you get into the mechanics of the framework, the harder it can be to maintain visibility of the original objective. To maintain visibility of the bigger picture, it can help to take a step back from time to time during the framework-building process and remind yourself of the strategic objective and target audience. It may also help to delay part of the analysis until the necessary data becomes available and to seek regular feedback from colleagues and the target audience of your framework where possible.

Tip 2: Keep It as Simple as Possible

To paraphrase a quote often attributed to Albert Einstein — one of the most accomplished builders of conceptual frameworks of the last century — we can say that a framework should be made as simple as possible, but not simpler. Since the process inherently involves trying out different combinations of framework types and dimensions, it can sometimes be tempting to snap more and more pieces together. Yet sacrificing simplicity can potentially diminish the broader value of the framework in practice. Complex frameworks can be difficult to understand, apply, evaluate, and build — you may need to verify several assumptions and preconditions, and adjust many different levers within the framework.

Tip 3: Make It MECE

Ensuring that a framework is MECE has some important advantages. From a theoretical standpoint, being MECE means that the sub-concepts follow a consistent, additive part-whole logic; you expect the sub-concepts to “add up” to form the bigger concept. Crucially, this logic allows you to substitute the set of sub-concepts for the bigger concept (and vice versa) throughout your analysis. The additive logic of MECE also lets you compare different concepts in a rigorous manner; instead of saying that two concepts are similar, you can state precisely the extent to which they are similar by identifying the sub-concepts they share. From a practical perspective, being MECE means that you can “divide and conquer” big problems efficiently and solutions to some sub-problems may be reusable. Sometimes you can even reach the solution of the bigger problem without solving all the sub-problems (e.g., if the bigger problem can be represented as a disjunction of the sub-problems). Bypassing sub-problems also works when you are solving the bigger problem inductively (e.g., as in mathematical induction).
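Two of these points lend themselves to a short sketch. The first measures how similar two concepts are by the sub-concepts they share, using the Jaccard index as one possible measure (this choice is an assumption; the article does not prescribe one). The second shows how a problem posed as a disjunction can be settled without evaluating every sub-problem. All names and example data are hypothetical.

```python
def overlap(a, b):
    """Jaccard index: shared sub-concepts divided by all sub-concepts."""
    return len(a & b) / len(a | b)


concept_a = {"price", "quality", "delivery time"}
concept_b = {"price", "quality", "brand"}
print(overlap(concept_a, concept_b))  # 2 shared out of 4 distinct sub-concepts


def solved_by_any(checks):
    """Evaluate sub-problems in order and stop at the first one that holds.

    Returns the overall answer and the names of the checks actually evaluated.
    """
    evaluated = []
    for name, check in checks:
        evaluated.append(name)
        if check():
            return True, evaluated
    return False, evaluated


# Hypothetical disjunction: the bigger problem holds if any one sub-problem holds.
checks = [
    ("segment A at risk", lambda: False),
    ("segment B at risk", lambda: True),
    ("segment C at risk", lambda: True),  # never evaluated
]
print(solved_by_any(checks))
```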

Tip 4: Make It Flexible

Fundamentally, a conceptual framework should be designed to meet its overall objective, so you may be wondering why flexibility is an important aspect to consider. In practice, there are at least two types of situations in which flexibility can be a big help. In the first situation, you may be dealing with an objective that is a moving target, with some parts of the objective’s full scope changing (even slightly) from time to time; responding to such scope changes can be a pain if some flexibility is not baked into the framework. In the second situation, your framework may have to undergo several iterations, in which different framework types and dimensions are added, modified and removed over the course of the framework’s evolution; a flexible design makes it much easier to facilitate such alterations of the framework’s shape and content. Modularity, scalability, robustness, extensibility, and portability — while typically associated with software engineering and architecture — are also relevant design considerations for building flexible conceptual frameworks.

Tip 5: Build It Iteratively

It would be great if you could come up with the perfect framework in one go, but it rarely works out that way. Several factors can make the first iteration more of a first draft, to be followed by at least a few more. The overarching objective — and especially the operational implications when it comes to building the framework — may not be fully clear at first. Over a couple of iterations, however, you will probably begin to get the hang of which framework types and dimensions work and which do not. While your output after a given iteration may be far from perfect, it could nevertheless amount to a minimum viable product (MVP) if it yields a viable solution to the overarching objective with minimal effort and complexity. The MVP can be tested (e.g., with actual data and real users) to understand its strengths and weaknesses. Each successive iteration can produce an improved MVP by adding, removing or changing features of the previous iteration.


The Wrap

Conceptual frameworks help us turn abstract ideas into concrete, tangible products that other people can see, use, and appreciate. This can be especially important for data scientists, who as so-called “knowledge workers” collect, analyze, and derive conclusions from data. If you are reading this article, you are probably a knowledge worker. To paraphrase management thinker Peter Drucker, it is data that enables knowledge workers to do their job, but it is the ability to meaningfully organize this data that leads to a job well done. That, in a nutshell, is why the proper use of conceptual frameworks can aid the successful design and delivery of data science projects.
