🌻 !!!Qualitative Split-Apply-Combine

9 Nov 2025

SOURCE NOTES (consolidation): This short paper has been merged into Causal mapping as causal QDA (section “qualitative split‑apply‑combine”).
Keep this file only as scratch material / longer drafts / cut text.

Qualitative Split-Apply-Combine — a great way to make use of genAI in QDA#

Abstract#

This short paper takes the "Split-Apply-Combine" (SAC) strategy, originating in quantitative data analysis, and frames some existing traditions within qualitative data analysis (QDA) as a form of "Qualitative Split-Apply-Combine". These traditions might be described as "small-Q": less ambitious qualitative approaches that prioritize verifiability and systematic rules.

This is a useful thing to do firstly because it allows us to position causal mapping coding and analysis as a specific variant of the qualitative Split-Apply-Combine framework. In this variant, the "Split" is simplified to a single rule (coding only "bare causation"), and the "Apply" step is realized as a "library of answers"—a set of deterministic, algorithmic queries that can be run on the resulting database of causal claims. Second highlights a very interesting entry-point for applying generative AI in small-Q QDA.

This paper then positions Finally, we position Generative AI (GenAI) not as a novel analytical paradigm or a "black box" solution, but as a modest and verifiable "robot" assistant. The GenAI's role is strictly limited to accelerating the "Split" (coding) task for the causal mapping variant. This approach maintains full verifiability, as the AI’s output is a transparent, quote-linked database. The human researcher retains full control of the analysis ("Apply" and "Combine") by executing the algorithmic "Library of Answers" on this AI-generated data. This method preserves the "small-Q" rigor of traditional qualitative Split-Apply-Combine while enabling analysis at an unprecedented scale.

I. The 'Split-Apply-Combine' (SAC) Strategy as a General Analytical Framework#

1.1. The Conceptual Foundation of SAC (Wickham, 2011)#

The "Split-Apply-Combine" (SAC) strategy was originally articulated by Hadley Wickham in the context of statistical software and data analysis.1 The central insight is that "many data analysis problems involve the application of a split-apply-combine strategy".1 While the original paper focused on a specific software implementation in R (the plyr package) for managing data structures like arrays and data frames, its conceptual value extends far beyond statistical programming.1

The strategy provides a general and powerful abstraction for data analysis, defined by a three-step intellectual process: "where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together".1 This conceptual logic, which predates Wickham’s formalization and is visible in tools like SQL’s group by operator, provides a grammar for structuring analytical tasks.1

1.2. Operationalizing the Framework: 'Split', 'Apply', 'Combine' Defined#

The core logic of the SAC strategy, abstracted from its purely quantitative origins, can be defined as follows:

Split (The Operationalization): This is the foundational step where the "big problem"—the overarching research question—is broken down. This is an intellectual act of operationalization. It involves defining the rules by which the dataset will be partitioned into "manageable pieces".1 In quantitative analysis, this might be a technical instruction to group a data frame by the values of a variable. In qualitative analysis, this might the development of a coding frame, a set of categories, or a single analytical rule that defines the unit of analysis relevant to the research question.
Apply (The Operation): This is the analytical step where a function, rule, or heuristic is applied independently and consistently to all pieces generated by the "Split" step. In Wickham's example, this was the deseasf_df function applied to each subset of the ozone data.1 In a qualitative context, this is the act of coding or the application of an analytical query.
Combine (The Synthesis): This is the final step of "put[ting] all the pieces back together".1 It is the synthesis of the results from the "Apply" step into a single, coherent output that addresses the original "big problem." This step aggregates the individual findings into a holistic answer.

1.3. Moving from Quantitative to Qualitative SAC#

We briefly trace this qualitative Split-Apply-Combine lineage from quantitative content analysis (e.g., Krippendorff) to its qualitative evolution in Philipp Mayring’s Qualitative Content Analysis, which exemplifies the core qualitative Split-Apply-Combine logic:

Split: The operationalization of the research question into a formal categorization matrix or coding scheme.
Apply: The systematic, rule-guided application of this scheme to the textual data.
Combine: The synthesis of the coded data into a structured answer.

While Wickham’s paper and its direct analogues are quantitative, the conceptual logic of SAC is universal. A parallel tradition has long existed within qualitative research, though it has not typically been identified by this name. This tradition, qualitative Split-Apply-Combine, is characterized by a commitment to systematic, rule-guided, and verifiable methods.

This qualitative Split-Apply-Combine tradition stands in contrast to more holistic, interpretive, or "big-Q" qualitative approaches. Big-Q and Small-Q are not really competitors: each has their place in qualitative research. The qualitative Split-Apply-Combine modality, as a small-Q approach, values transparency and intersubjective comprehensibility. The remainder of this report will explore this tradition, positioning its methods within the SAC logical framework.

II. The "qualitative Split-Apply-Combine" Modality: A Tradition of Systematic "Small-Q" QDA#

2.1. Distinguishing "Small-Q" from "Big-Q" Qualitative Data Analysis#

To understand qualitative Split-Apply-Combine, one must first distinguish between two broad paradigms of qualitative data analysis (QDA).

"Big-Q" QDA (The Holistic Tradition): This paradigm includes approaches such as Grounded Theory, phenomenology, and certain forms of hermeneutics. The primary research instrument is the researcher's own interpretive lens. The process is emergent and holistic, often explicitly resisting pre-defined operationalization or categorization that might "block the open sight on the subject".2 The analytical process can be seen as a "black box" of researcher cognition, which is, by design, difficult to make fully transparent or reproducible in a step-by-step manner. For Big-Q researchers, this is not a bug, but a feature.
"Small-Q" QDA (The Systematic Tradition): This paradigm includes methods that, like their quantitative counterparts, are "systematic, rule guided" 3 and "step-by-step".3 The primary goal is to make the qualitative analysis transparent, verifiable, and "intersubjectively comprehensible".3 These methods ensure that interpretations are grounded in the data in a way that can be traced and audited by others.5

This "small-Q" tradition is the "qualitative Split-Apply-Combine" tradition. Its entire methodological foundation is built upon the SAC logic: (1) Splitting the research question into an explicit "coding scheme" 6 or "categorisation matrix" 7; (2) Applying these rules systematically to units of text 3; and (3) Combining the results into a structured summary of themes or patterns.7

2.2. Philipp Mayring's Qualitative Content Analysis as a Prototypical "qualitative Split-Apply-Combine" Framework#

One of the most prominent and well-documented exemplars of the "small-Q" qualitative Split-Apply-Combine framework is Philipp Mayring’s Qualitative Content Analysis.2 Mayring’s method is a "systematic, rule guided qualitative text analysis" that explicitly aims to "preserve some methodological strengths of quantitative content analysis".3

The SAC logic is explicit in his procedure:

The 'Split' (Operationalization): Mayring's procedure begins with the "Split." Step 1 is "Back to basics: Your research question".6 This is not a trivial statement; it is the methodological anchor. Step 2 is "Linking research question to theory".2 This process of operationalizing the question results in a "formative categorisation matrix... deductively derived from the existing theory or previous research" 7, or a "coding scheme".6 This matrix is the 'Split'; it defines the "manageable pieces" (the categories) that the analyst will look for in the data.
The 'Apply' (Operation): The "Apply" step is the "systematic process of coding".7 Here, the "material is to be analyzed step by step, following rules of procedure".3 This involves the "reduction" of data to its core elements by assigning segments of text to the categories defined in the "Split" step.6
The 'Combine' (Synthesis): The "Combine" step is the final analysis, which involves the "identification of categories, themes and patterns" from the systematically coded data.7 The analyst synthesizes the findings from the "Apply" step to build a structured answer to the initial research question.

2.3. The Lineage: From Krippendorff to Mayring#

Mayring’s work did not emerge in a vacuum. It is a direct and deliberate evolution of the "small-Q" lineage, tracing back to quantitative content analysis, often associated with researchers like Klaus Krippendorff.

This older tradition, sometimes called "enumerative content analysis" 10, is primarily concerned with "the frequency of words and categories".10 It "transforms qualitative data into numbers".10

Mayring’s significant contribution was to see the methodological value in this systematic, rule-based approach, separating it from its purely quantitative-enumerative goals. He sought to "preserve" the "methodological strengths" of this systematic approach (its verifiability, rule-based procedure, and clear operationalization) and "widen them to a concept of qualitative procedure".3

This historical context is crucial: it establishes that "qualitative Split-Apply-Combine" is not a new framework being proposed, but rather a new term for a mature, recognized, and rigorous branch of qualitative methodology. We are not "reinventing the wheel" 1; we are identifying an existing wheel and placing it within a useful conceptual framework.

Table 1: Methodological Positioning: "Big-Q" vs. "Small-Q" (qualitative Split-Apply-Combine)#


Dimension	"Big-Q" QDA (e.g., Grounded Theory)	"Small-Q" / "qualitative Split-Apply-Combine" (e.g., Mayring, Krippendorff)
Primary Goal	Holistic understanding, emergent theory generation.	Systematic description, theory-guided analysis.8
'Split' (Operationalization)	Emergent, researcher-driven, holistic. Avoids pre-conceptions.2	Pre-defined, rule-guided, "coding scheme" 6, "categorisation matrix".7
'Apply' (Operation)	Interpretive, iterative reading.	Systematic, "step-by-step" coding.3
'Combine' (Synthesis)	Narrative summary, theoretical saturation.	Aggregation of categories/themes.7
Key Criterion	Authenticity, Resonance.	Verifiability, Reproducibility, "Intersubjectively comprehensible".3

III. Causal Mapping as a Specific, Manual Variant of qualitative Split-Apply-Combine#

Having established the general "qualitative Split-Apply-Combine" tradition, we now turn to a highly specific, manual variant: the practice of causal mapping as a form of "Causal QDA." This approach, used in methodologies such as the Qualitative Impact Protocol (QuIP), is a specialized case of the qualitative Split-Apply-Combine framework.1

3.1. Manual Causal Mapping as "Causal QDA"#

This manual causal mapping approach is a "simple yet powerful form of qualitative coding".1 Like Mayring's method, it is a "small-Q" qualitative Split-Apply-Combine approach because it is systematic, rule-based, and verifiable. However, its implementation of the SAC logic is distinct.

The 'Split' (The Operationalization):

The "Split" in this variant is simpler and more focused than a complex, multi-category coding frame. The operationalization of the research question is reduced to a single, simple rule: "code each and every causal claim in the text".1 This is "Task 2 — Coding causal claims as causal qualitative data analysis".1

This "Split" rule is defined by a "naive" or "barebones" approach to coding.1 This "naive" methodology is a deliberate choice to maximize reliability and scalability while avoiding "false precision".1 The rules of this "Split" are:

Code Only "Bare Causation": A link from X to Y simply means "someone believes X influences Y".1 We "deliberately don't code" nuance, necessity, sufficiency, or non-linear forms.1
No Strength or Polarity: We "do not code the strength of a link" 1 or its polarity (positive/negative).1 This avoids the "false precision" 1 inherent in trying to quantify qualitative claims, which are seldom stated with such consistency by respondents.1
Factors as Propositions, Not Variables: This approach rejects the foundational assumption of systems dynamics (like Causal Loop Diagrams) that factors are quantifiable variables. Instead, a factor is a proposition (e.g., "Not enough money," "Won't take a holiday this year").1 This is more faithful to "how people actually communicate" 1 and avoids the "unnatural contortions" of forcing event-based narratives into a variable-based structure.1

The 'Combine' (The Synthesis):

The "Apply" step (the coding itself) generates a simple, structured database of `` links.1 The "Combine" step is the aggregation of all these discrete links into a single, queryable model.

Crucially, this final output is not a model of the world (what is happening), nor is it a model of beliefs (what people think is happening). It is a "repository of causal evidence" (what people said is happening).1 This resolves the "Janus dilemma" 1 of causal mapping by strictly limiting the claim to what the data supports. The process is fully verifiable because every link in the final model can be traced back to its underlying quote and source.1

3.2. The 'Apply' Step: A "Library of Answers" for Causal Models#

This specific causal mapping variant introduces a unique and powerful conception of the "Apply" step, which distinguishes it from Mayring's more general framework. In Mayring's method, "Apply" is the act of coding. In this causal mapping variant, the analysis itself ("Task 3 — Analysing data, Answering questions") 1 is conceptualized as a library of pre-defined "Apply" functions (queries) that can be run on the database of links (the "Split" data).

The "product of (causal) qualitative coding can be a model you can query".1 The "Apply" step, therefore, is the interrogation of this model using a "library of answers"—a set of non-AI, deterministic, algorithmic functions.1 This "library of answers" is the analytical toolkit.

This directly parallels Wickham's plyr package, which is a library of functions (aaply, ddply, etc.) 1 to "Apply" to split quantitative data. In our qualitative Split-Apply-Combine variant, we have a library of qualitative analytical functions to "Apply" to our split causal data. This "library of answers" applies only to this specific causal mapping case, not to the general qualitative Split-Apply-Combine tradition.

Table 2: The "Library of Answers": 'Apply' Functions for the Causal Mapping qualitative Split-Apply-Combine Variant#


Analytical Function (The 'Apply' Step)	Research Question Answered (The 'Combine' Step)	Source
Summarising / Filtering	"How do the sources claim that the system works, in summary? (e.g., top N factors/links)"	p.98
Looking Downstream	"What are the direct and indirect consequences of one or more factors?"	p.115
Looking Upstream	"What are the direct and indirect influences on one or more factors?"	p.116
Path Tracing	"How do one or more causes affect one or more effects, including indirect pathways?"	p.118
Main Outcomes	"Which factors are mentioned most often as outcomes? (Uses 'outcomeness' metric)"	p.104
Main Drivers	"Which factors are mentioned most often as drivers? (Uses 'outcomeness' metric)"	p.105
Comparing Groups	"What factors or links were mentioned more by some groups than others, in the same map?"	p.107
Identifying Groups	"Are there different subgroups within the data? (e.g., via clustering)"	p.109
Focusing on Specific Factors	"What influences and outcomes are connected to a specific factor? (Ego network)"	p.114
Assessing Robustness	"How robust is the evidence for that X influences Y? (e.g., source thread count)"	p.120
Identifying Feedback Loops	"Are there feedback loops in the evidence network?"	p.125
Vignettes	"What is a typical source and what is their story? (Identify most 'typical' source)"	p.100

IV. Generative AI as a Verifiable Accelerator for Causal Mapping#

This brings us to the final component of the argument. Having established a general "qualitative Split-Apply-Combine" tradition (Mayring) and a specific, manual, algorithmic variant (Causal Mapping), we can now modestly position Generative AI. The GenAI is not a new analytical paradigm. It is a "robot" assistant that massively scales the manual, verifiable process detailed in Section III.

4.1. The AI as a "Low-Level Assistant," Not a "Black Box" Analyst#

This approach explicitly rejects the "black box" use of AI.1 We do not ask the AI to perform the analysis, such as "What are the main themes in this document?" 1 or "Is this program effective?".1 Such high-level, vague tasks invite the AI to "skim read and jump to conclusions" 1 and produce plausible-sounding but unverifiable output.

Our approach is "radically different".1 The AI is used only as a "tireless, low-level but incredibly fast assistant".1

The AI's Only Job: The AI's task is strictly limited to automating the "Split" step (Task 2) from Section III. It is instructed only to identify the "bare causation" links, one small section of text at a time.1 It is an automated "Causal QDA" coder.

4.2. How This Approach Guarantees Verifiability#

Verifiability is maintained because the AI's output is not an analysis. The AI's output is a structured database of causal claims. This preserves the "small-Q" rigor of the manual method.

Traceability: The process is fully transparent. Every single link created by the AI is "transparently" 1 tied to a "verbatim quote" from the source text and a "source ID".1 This allows a human analyst to "check its work" 1 and validate its precision, as was done in the validation study.1
Human-in-the-Loop Analysis: The human researcher 1 performs all the actual analysis (the "Apply" and "Combine" steps). The human takes the AI-generated database of links and applies the transparent, deterministic algorithms from the "Library of Answers" (Table 2) to it. The AI is not involved in this analytical step. This embodies the principle: "Trust the algorithm, not the AI".1 The "algorithms" are the verifiable queries in our library; the AI is just a tool for populating the database that these algorithms run on.

This workflow is as follows:

AI ('Split'): The AI processes thousands of interviews and produces a database of 100,000 causal links, each with a source and a quote.1
Human (Verify): The human analyst spot-checks the AI's "Split" by reading the quotes for a sample of links (as in the validation study, which found 87-92% precision).1
Human ('Apply' / 'Combine'): The human runs the Looking upstream algorithm 1 (from the Library in Table 2) on the AI's 100,000-link database to answer their specific research question. The AI's cognitive work is finished; the human's analytical work begins.

4.3. The Benefits: Scaling a "Small-Q" Method#

This is the "nice way" to use GenAI.1 We are not "reinventing the wheel" or creating a new "black box." We are simply taking our existing, manual, verifiable "small-Q" method (Causal Mapping) and using a "robot" to change its scale.

Benefit 1: Massive Scale: This GenAI-assisted approach allows us to apply a rigorous qualitative method to "large numbers of causal claims, sometimes many thousands".1 It makes it feasible to process "hundreds of interviews or thousands of questionnaire answers," allowing us to hear "More voices".1
Benefit 2: Improved Reliability & Repeatability: The AI acts as a "robot." It applies the same prompt (the "Split" rule) consistently across all texts with a temperature of 0.1 This "improves reliability / repeatability" by removing the inevitable inconsistency, fatigue, and "positionality" 1 of multiple human coders, who may code the same text differently. The validation study 1 confirms this approach is highly effective.
Benefit 3: Verifiability at Scale: The process remains "just as verifiable" as the manual method. We are not "trust[ing] the AI".1 We are trusting the process, which is verifiable at two points: (1) The AI's coding can be spot-checked against its own quotes 1, and (2) The analysis is conducted by humans using the transparent, deterministic "Library of Answers".1

V. Conclusion#

The "Qualitative Split-Apply-Combine" (qualitative Split-Apply-Combine) framework is not a new invention but a useful term for an established "small-Q" tradition of QDA, exemplified by Mayring's Qualitative Content Analysis. This tradition prioritizes verifiable, rule-based analysis, where the "Split" is the operationalization of the research question into a coding scheme.

This report has detailed a specific, manual variant of this tradition—Causal Mapping as "Causal QDA"—which uses a "naive" coding rule for its "Split" and a "Library of Answers" (a set of algorithmic queries) for its "Apply" step. This method produces a verifiable "repository of causal evidence."

Finally, Generative AI has been positioned modestly as a verifiable accelerator for this specific variant. By constraining the AI to the "low-level" task of "Split" (coding) and ensuring its output is a fully traceable, quote-linked database, we maintain full "small-Q" rigor. The human analyst remains in control, performing the "Apply" and "Combine" (analysis) steps using the trusted, deterministic algorithms from the library. This approach is not a "black box"; it is a "nice way" to use GenAI, allowing researchers to conduct verifiable qualitative analysis at a scale that was previously impossible, enhancing reliability without sacrificing the transparency and human judgment central to rigorous inquiry.

Works cited#

wickham - 2011 - The Split-Apply-Combine Strategy for Data Analysis.pdf
Qualitative content analysis: theoretical foundation, basic procedures and software solution, accessed on November 9, 2025, https://www.ssoar.info/ssoar/bitstream/handle/document/39517/ssoar-2014-mayring-Qualitative_content_analysis_theoretical_foundation.pdf
(PDF) Qualitative Content Analysis - ResearchGate, accessed on November 9, 2025, https://www.researchgate.net/publication/215666096_Qualitative_Content_Analysis
Qualitative Content Analysis A Step-by-Step Guide - Sage Publishing, accessed on November 9, 2025, https://uk.sagepub.com/en-gb/eur/qualitative-content-analysis/book269922
Qualitative data analysis (QDA): Methods & software guide - Lumivero, accessed on November 9, 2025, https://lumivero.com/resources/blog/what-is-qualitative-data-analysis-qda-software/
5 steps for qualitative content analysis - The PhD Club, accessed on November 9, 2025, https://thephdclub.com/blog/f/5-steps-for-qualitative-content-analysis?blogcategory=%23PhD+Essentials
Directed qualitative content analysis: the description and elaboration of its underpinning methods and data analysis process - NIH, accessed on November 9, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC7932246/
The Use of Qualitative Content Analysis in Case Study Research - Kohlbacher, Florian - WU Research - WU Wien, accessed on November 9, 2025, https://research.wu.ac.at/ws/files/19852569/75-195-1-PB.pdf
QCAmap Step by Step – a Software Handbook | Qualitative Content Analysis, accessed on November 9, 2025, https://qualitative-content-analysis.org/wp-content/uploads/QCAmapSoftwareHandbook.pdf
Qualitative Data Analysis and Interpretation: Systematic Search for Meaning - ResearchGate, accessed on November 9, 2025, https://www.researchgate.net/publication/278961843_Qualitative_Data_Analysis_and_Interpretation_Systematic_Search_for_Meaning