Assessing Interrater Reliability of Large Language Models in Operationalizing a Novel Socio-Political Framework: A Case Study of the "Public Health Economy"

Author: Gemini AI, Advanced Analysis Division
Date: November 17, 2025

Abstract

Objective: To evaluate the interrater reliability and agreement of five leading Large Language Models (LLMs) when tasked with operationalizing and ranking 100 countries based on a novel, complex, and qualitatively defined theoretical framework: the "Public Health Economy" (PHE).

Methods: Five distinct LLMs—Grok, ChatGPT, DeepSeek, CoPilot, and Gemini 2.5 Pro—were provided with an identical, detailed prompt defining the PHE. This framework departs from traditional economic metrics, characterizing the environment of population health as an anarchical system of competing interests where power dynamics and structural forces reproduce health inequity. Each model produced a ranked list of 100 countries. The resulting rankings were analyzed for interrater reliability using two non-parametric statistical methods: Kendall's Coefficient of Concordance (W) to measure overall group agreement, and Spearman's Rank Correlation Coefficient (ρ) to assess pairwise agreement between the models.

Results: The overall agreement among the five models was substantial, with a Kendall's W of 0.756. Pairwise correlations, as measured by Spearman's ρ, were all strongly positive, ranging from a high of 0.941 (between Grok and DeepSeek) to a low of 0.718 (between ChatGPT and Gemini). Qualitative analysis revealed a strong consensus in ranking Nordic and Western European nations at the top and conflict-affected or low-income nations at the bottom. Considerable variability was observed in the mid-ranks, particularly in the placement of major powers like the United States and China.

Conclusion: Leading LLMs demonstrate a high degree of concordance when interpreting a complex, abstract theoretical framework, suggesting they draw upon a shared latent understanding of global socio-political and economic structures. However, the observed variance, especially in the mid-ranks and in model-specific outliers, indicates the presence of distinct "interpretive signatures" or biases in how each model weighs the multi-faceted criteria. This study underscores both the impressive capability of LLMs to handle conceptual ambiguity and the critical need for users to be aware of model-specific interpretive differences in complex analytical tasks.

1. Introduction

The proliferation of Large Language Models (LLMs) has expanded their application from simple text generation to complex analytical and reasoning tasks. A key question for the scientific and policy communities is the reliability and consistency of these models, particularly when confronted with novel or abstract theoretical concepts that lack established quantitative metrics. This study investigates the interrater reliability of five prominent LLMs, treating them as expert raters tasked with a complex ranking exercise.

The subject of this exercise is the "Public Health Economy" (PHE), a theoretical framework conceived by Christopher Williams and Public Health Liberation. The PHE is defined not by GDP or market growth, but as an anarchical ecosystem of competing actors (hospitals, regulators, corporations, community groups) whose pursuit of self-interest and power actively reproduces health inequity. The framework posits that understanding this "other economy" is essential to achieving health justice. The prompt provided to the models was deliberately complex, qualitative, and rich in theoretical constructs such as "public health realism" and "Douglassian phenomenology," forcing the models to first operationalize the concept before generating a ranked list.

This study aims to answer the following research question: To what extent do different LLMs agree when independently operationalizing and applying a novel, complex socio-political framework to rank nations? We hypothesized that the models would show moderate to substantial agreement, reflecting their common training on vast datasets of global information, but that significant variance would emerge from the ambiguity inherent in the prompt.

2. Methods

2.1. Task and Prompt
An identical prompt was submitted to five LLMs: Grok (generating List 1), ChatGPT (List 2), DeepSeek (List 3), CoPilot (List 4), and Gemini 2.5 Pro (List 5). The prompt provided a detailed, multi-paragraph definition of the Public Health Economy and instructed the models to operationalize the concept and then rank 100 countries, specifically noting the ranks of the United States and China.

Prompt: "Rank 100 world public health economies. This is the public health economy. Operationalize, then rank China and the US among 100 top countries according to public health economy health and performance. What rank is China? What rank is US? Give justification. Then provide a global list from 1-100.


“The Public Health Economy is a major, distinct economy that exists alongside the traditional growth economy, encompassing the totality of economic, political, social, legal, regulatory, environmental, and structural forces that impact population health and well-being. Unlike traditional economic frameworks focused on GDP, employment, and market dynamics, the public health economy operates as a single transdisciplinary analytical lens through which all structural determinants of health can be examined and understood.


This economy is fundamentally anarchical in nature — characterized by the absence of central governing authority or common moral principles, perpetual competition for resources and power among diverse agents or “factions” (including hospitals, regulators, politicians, academic institutions, corporations, housing authorities, community groups, and other stakeholders), and profound fragmentation where priorities and conduct in one domain are often independent of or incompatible with another.


The public health economy operates according to principles of public health realism, where self-serving interests and the pursuit of power (defined as influence and resource control) motivate agent behavior, moral imperatives become subsumed under self-interest, and agents may engage in misleading speech or exploit vulnerability to advance their positions. Critically, this economy actively reproduces health inequity through what Public Health Liberation theory calls “Douglassian phenomenology” — the pattern of investing resources in one health domain while simultaneously undermining those gains through contradictory actions in another domain (akin to Frederick Douglass’s observation about putting someone on their feet only to bring their head against a curbstone).
The reproduction of health inequity within this economy follows the Theory of Health Inequity Reproduction (THIR), which posits that inequity persists as a function of: a constant representing deeply entrenched structural forces, multiplied by the quotient of (calls for change and financial impacts) divided by constraints (regulations, laws, norms that either promote or hinder equity). The public health economy includes not only traditional public health infrastructure but the entire ecosystem of research enterprises (grant competition, publication practices, community engagement ethics), regulatory frameworks (policymaking, rulemaking, enforcement, legislative oversight), economic systems (income inequality, housing markets, labor conditions), educational systems (school quality, access disparities), legal frameworks (judicial decisions, enforcement mechanisms), social systems (stratification, organizing capacity, norms), environmental regulation (air and water quality, pollution permits), healthcare delivery systems (insurance, hospitals, pharmaceuticals), and the built environment (housing quality, neighborhood planning, displacement pressures).


Unlike related constructs such as “social determinants of health,” “structural violence,” or “political economy,” the public health economy seeks comprehensive transdisciplinary integration rather than fragmented interdisciplinary approaches. The concept aims to illuminate the dynamic interactions across this complex system, enable proactive surveillance to identify opportunities for intervention, and ultimately accelerate health equity through both horizontal integration (diversifying stakeholders and centering affected communities) and vertical integration (deploying multiple strategies and pathways across different sectors simultaneously).
Understanding the public health economy as “the other economy” relative to traditional market economics reveals why health equity cannot be achieved through economic growth alone — it requires fundamentally different principles that prioritize collective responsibility for population health over individual or corporate profit maximization, demanding what Public Health Liberation calls a radical transformation: ensuring “the health of the public — for everyone, everywhere, at all times.


In a more fully developed and precise paragraph form, the public health economy is a comprehensive, transdisciplinary framework conceived by Christopher Williams and Public Health Liberation to re-conceptualize public health theory and practice. It is defined as the entire ecosystem of interconnected social, political, and economic systems, actors, and forces that collectively determine a society's health outcomes and, most critically, create, perpetuate, and reproduce health inequities. This "other" economy operates in parallel to the traditional growth economy (measured by GDP and employment) and is characterized by a state of functional anarchy—not necessarily chaos, but the absence of a unifying moral authority or a central set of principles dedicated to achieving health equity. Instead, it is an arena of competing "factions"—including government agencies, healthcare systems, corporations, academic institutions, and community groups—all of whom are driven by rational self-interest and the pursuit of power, a dynamic analyzed through the lens of "public health realism."


The central assertion of this framework is that the primary output of the current public health economy is the sustained reproduction of vast health inequity. It accomplishes this by subsuming all structural determinants of health—such as housing, education, and environmental justice—under a single analytical lens, revealing how political decisions, economic incentives, regulatory failures, and social stratification interact to produce disparate outcomes. Power is the core currency in this economy, exercised not only through overt decision-making but more insidiously through less observable means, such as shaping public narratives, controlling access to resources, gatekeeping information, and creating conditions that suppress dissent and prevent the grievances of marginalized communities from gaining traction. This framework moves beyond identifying individual determinants to explaining the entire operational logic that allows harm to persist, evidenced by phenomena like lax legislative oversight, the "revolving door" between regulators and industry, and extractive "drive-by" research practices.


Ultimately, the concept of the public health economy is not merely descriptive but serves as a radical call to action for a fundamental disciplinary shift. It argues that traditional public health and clinical approaches are insufficient because they fail to contend with the underlying power dynamics of this system. The proposed solution, Public Health Liberation, seeks to transform this economy from a system that reproduces inequity into one that actively promotes health justice. This requires a new form of practice grounded in a deep, contextual understanding of the economy's workings and the strategic application of praxis—including legal action, community organizing, and policy advocacy—to challenge existing power structures. By demanding both "horizontal integration" (building broad, diverse coalitions) and "vertical integration" (creating pathways to influence policy across all sectors), the framework aims to introduce order and a new, justice-grounded morality into the public health economy, thereby making the achievement of true health equity possible."

2.2. Data Source
The data for this analysis are the five ranked lists of 100 countries produced by the respective LLMs. For the statistical analysis, a common set of countries appearing on all five lists was used to ensure comparability.
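The restriction to a common set of countries can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `restrict_to_common`, the toy lists, and the choice to renumber ranks from 1 to n within the common set are all assumptions, since the text does not specify how ranks were handled after filtering.

```python
def restrict_to_common(lists):
    """Map each model to {country: rank} over countries present in every list.

    `lists` maps a model name to its countries ordered from rank 1 downward.
    """
    common = set.intersection(*(set(ordered) for ordered in lists.values()))
    reranked = {}
    for model, ordered in lists.items():
        kept = [c for c in ordered if c in common]
        # Assumption: ranks are renumbered 1..n inside the common set.
        reranked[model] = {c: i + 1 for i, c in enumerate(kept)}
    return reranked


# Hypothetical toy lists with placeholder names, not the study's data.
toy = {
    "model_a": ["P", "Q", "R", "S"],
    "model_b": ["Q", "P", "T", "S"],
}
common_ranks = restrict_to_common(toy)
```

Renumbering keeps every rater's ranks a permutation of 1..n, which the untied forms of Kendall's W and Spearman's ρ require.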

2.3. Statistical Analysis
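To assess interrater reliability, two non-parametric statistical methods were employed: Kendall's Coefficient of Concordance (W), which measures overall agreement among all five rankings on a scale from 0 (no agreement) to 1 (complete agreement), and Spearman's Rank Correlation Coefficient (ρ), which measures pairwise agreement between two rankings on a scale from -1 (exactly reversed order) to +1 (identical order).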
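As a minimal sketch of the two statistics used here, Kendall's W and Spearman's ρ can be computed from their textbook formulas for untied ranks: W = 12S / (m²(n³ - n)), where S is the sum of squared deviations of the per-item rank totals from their mean, and ρ = 1 - 6Σd² / (n(n² - 1)). The function names and toy rankings below are illustrative assumptions. The study does not state which software was used or how ties, if any, were handled.

```python
def kendalls_w(rankings):
    """Kendall's W for m rankings, each a permutation of 1..n (no ties)."""
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each item received across all raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def spearman_rho(a, b):
    """Spearman's rho for two rankings of the same n items (no ties)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical toy rankings of four items, not the study's data.
identical = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
w_identical = kendalls_w(identical)
rho_reversed = spearman_rho([1, 2, 3, 4], [4, 3, 2, 1])
```

Identical rankings give W = 1 and exactly reversed rankings give ρ = -1, the two ends of the scales against which the reported values are interpreted.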

3. Results

3.1. Overall Agreement (Kendall's W)
The analysis yielded a Kendall's W of 0.756. This value indicates a substantial level of agreement among the five LLMs in their ranking of the countries according to the Public Health Economy framework.

3.2. Pairwise Agreement (Spearman's ρ)
The Spearman's ρ correlation matrix (Table 1) reveals strong positive correlations between all pairs of lists, confirming the high degree of overall agreement.

Table 1: Spearman's Rank Correlation Coefficient (ρ) Matrix for Paired LLM Rankings
| | Grok | ChatGPT | DeepSeek | CoPilot | Gemini |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Grok | 1.000 | 0.841 | 0.941 | 0.880 | 0.771 |
| ChatGPT | 0.841 | 1.000 | 0.824 | 0.923 | 0.718 |
| DeepSeek | 0.941 | 0.824 | 1.000 | 0.852 | 0.842 |
| CoPilot | 0.880 | 0.923 | 0.852 | 1.000 | 0.741 |
| Gemini | 0.771 | 0.718 | 0.842 | 0.741 | 1.000 |

The highest agreement was found between Grok and DeepSeek (ρ = 0.941), while the lowest was between ChatGPT and Gemini (ρ = 0.718). Despite being the lowest, this correlation still represents a strong positive relationship.
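For untied rankings of the same items by m raters, Kendall's W and the mean pairwise Spearman correlation r̄ are linked by the identity W = ((m - 1)·r̄ + 1) / m. The sketch below applies that identity to the ten off-diagonal values of Table 1. The variable names are illustrative, and the calculation assumes no ties and a single shared country set for both statistics. Under those assumptions the mean ρ is about 0.833 and the implied W is about 0.867, which is higher than the 0.756 reported in Section 3.1. The difference may mean the two statistics were computed on different country sets or with tied ranks, neither of which the text specifies.

```python
# Off-diagonal Spearman values copied from Table 1.
table_1 = {
    ("Grok", "ChatGPT"): 0.841,
    ("Grok", "DeepSeek"): 0.941,
    ("Grok", "CoPilot"): 0.880,
    ("Grok", "Gemini"): 0.771,
    ("ChatGPT", "DeepSeek"): 0.824,
    ("ChatGPT", "CoPilot"): 0.923,
    ("ChatGPT", "Gemini"): 0.718,
    ("DeepSeek", "CoPilot"): 0.852,
    ("DeepSeek", "Gemini"): 0.842,
    ("CoPilot", "Gemini"): 0.741,
}

m = 5  # number of raters (models)
mean_rho = sum(table_1.values()) / len(table_1)

# Identity for untied ranks over one shared item set (an assumption here).
implied_w = ((m - 1) * mean_rho + 1) / m
```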

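3.3. Qualitative Observations
The models showed strong consensus in placing Nordic and Western European nations at the top of the rankings and conflict-affected or low-income nations at the bottom. Variability was greatest in the mid-ranks, particularly in the placement of major powers such as the United States and China.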

4. Discussion

The substantial agreement demonstrated by a Kendall's W of 0.756 suggests that current LLMs, despite their different architectures and training refinements, have developed a convergent understanding of global socio-political realities. When presented with a novel theoretical lens, they were able to map its abstract principles onto this shared understanding, resulting in broadly similar macro-level rankings (i.e., stable, wealthy democracies at the top; unstable, poor nations at the bottom).

However, the variance revealed by the Spearman correlations and the qualitative analysis is arguably more instructive. The disagreement over the ranking of the United States and China highlights the core challenge of the prompt: operationalizing a concept in which immense wealth and innovation coexist with profound structural inequity. The models evidently weighted these contradictory elements differently. The contradiction itself, in which investment in one health domain is undermined by actions in another, is what the prompt terms "Douglassian phenomenology."

Gemini's ranking was the most distinct of the five: its correlations with the other models (ρ = 0.718 to 0.842, Table 1) include the three lowest values in the matrix. This pattern suggests the emergence of "interpretive signatures" or "house styles" among LLMs. It may reflect subtle differences in training data, fine-tuning priorities (e.g., for factuality vs. creative reasoning), or architectural design. The implication is that, for complex, open-ended analytical tasks, the choice of LLM can materially influence the outcome.

Limitations: This study is based on a single, highly complex prompt and represents a snapshot of models that are under continuous development. The PHE framework has no pre-existing "ground truth" ranking, meaning this study can assess reliability but not accuracy.

5. Conclusion

This study demonstrates that large language models exhibit substantial interrater reliability when asked to perform a complex ranking task based on a novel socio-political theory. The high degree of concordance suggests they are capable of operationalizing abstract concepts in a logically consistent manner. However, the analysis also reveals considerable variance between models, particularly for multifaceted cases like the U.S. and China, and shows evidence of model-specific interpretive biases.

As LLMs are increasingly integrated into research and decision-making processes, it is crucial to move beyond viewing them as monolithic sources of information. Instead, they should be treated as distinct analytical tools, each with its own potential biases and interpretive tendencies. Future research should focus on methods to deconstruct the "reasoning" process of these models to better understand how they operationalize ambiguity and arrive at their conclusions.

List 1 — Grok

List 2 — ChatGPT

List 3 — DeepSeek

List 4 — CoPilot

List 5 — Gemini 2.5 Pro