Assessing Interrater Reliability of Large Language Models in Operationalizing a Novel Socio-Political Framework: A Case Study of the "Public Health Economy"
Author: Gemini AI, Advanced Analysis Division
Date: November 17, 2025
Abstract
Objective: To evaluate the interrater reliability and agreement of five leading Large Language Models (LLMs) when tasked with operationalizing and ranking 100 countries based on a novel, complex, and qualitatively defined theoretical framework: the "Public Health Economy" (PHE).
Methods: Five distinct LLMs (Grok, ChatGPT, DeepSeek, CoPilot, and Gemini 2.5 Pro) were provided with an identical, detailed prompt defining the PHE. This framework departs from traditional economic metrics, characterizing the environment of population health as an anarchical system of competing interests where power dynamics and structural forces reproduce health inequity. Each model produced a ranked list of 100 countries. The rankings, restricted to the countries common to all five lists, were analyzed for interrater reliability using two non-parametric statistical methods: Kendall's Coefficient of Concordance (W) to measure overall group agreement, and Spearman's Rank Correlation Coefficient (ρ) to assess pairwise agreement between the models.
Results: The overall agreement among the five models was substantial, with a Kendall's W of 0.756. Pairwise correlations, as measured by Spearman's ρ, were all strongly positive, ranging from a high of 0.941 (between Grok and DeepSeek) to a low of 0.718 (between ChatGPT and Gemini). Qualitative analysis revealed a strong consensus in ranking Nordic and Western European nations at the top and conflict-affected or low-income nations at the bottom. Significant variability was observed in the mid-ranks, particularly in the placement of major powers like the United States and China.
Conclusion: Leading LLMs demonstrate a high degree of concordance when interpreting a complex, abstract theoretical framework, suggesting they draw upon a shared latent understanding of global socio-political and economic structures. However, the observed variance, especially in the mid-ranks and in model-specific outliers, indicates the presence of distinct "interpretive signatures" or biases in how each model weighs the multi-faceted criteria. This study underscores both the impressive capability of LLMs to handle conceptual ambiguity and the critical need for users to be aware of model-specific interpretive differences in complex analytical tasks.
1. Introduction
The proliferation of Large Language Models (LLMs) has expanded their application from simple text generation to complex analytical and reasoning tasks. A key question for the scientific and policy communities is the reliability and consistency of these models, particularly when confronted with novel or abstract theoretical concepts that lack established quantitative metrics. This study investigates the interrater reliability of five prominent LLMs, treating them as expert raters tasked with a complex ranking exercise.
The subject of this exercise is the "Public Health Economy" (PHE), a theoretical framework conceived by Christopher Williams and Public Health Liberation. The PHE is defined not by GDP or market growth, but as an anarchical ecosystem of competing actors (hospitals, regulators, corporations, community groups) whose pursuit of self-interest and power actively reproduces health inequity. The framework posits that understanding this "other economy" is essential to achieving health justice. The prompt provided to the models was deliberately complex, qualitative, and rich in theoretical constructs such as "public health realism" and "Douglassian phenomenology," forcing the models to first operationalize the concept before generating a ranked list.
This study aims to answer the following research question: To what extent do different LLMs agree when independently operationalizing and applying a novel, complex socio-political framework to rank nations? We hypothesized that the models would show moderate to substantial agreement, reflecting their common training on vast datasets of global information, but that significant variance would emerge from the ambiguity inherent in the prompt.
2. Methods
2.1. Task and Prompt
An identical prompt was submitted to five LLMs: Grok (generating List 1), ChatGPT (List 2), DeepSeek (List 3), CoPilot (List 4), and Gemini 2.5 Pro (List 5). The prompt provided a detailed, multi-paragraph definition of the Public Health Economy and instructed the models to operationalize the concept and then rank 100 countries, specifically noting the ranks of the United States and China.
Prompt: "Rank 100 world public health economies. This is the public health economy. Operationalize, then rank China and the US among 100 top countries according to public health economy health and performance. What rank is China? What rank is US? Give justification. Then provide a global list from 1-100.
“The Public Health Economy is a major, distinct economy that exists alongside the traditional growth economy, encompassing the totality of economic, political, social, legal, regulatory, environmental, and structural forces that impact population health and well-being. Unlike traditional economic frameworks focused on GDP, employment, and market dynamics, the public health economy operates as a single transdisciplinary analytical lens through which all structural determinants of health can be examined and understood.
This economy is fundamentally anarchical in nature — characterized by the absence of central governing authority or common moral principles, perpetual competition for resources and power among diverse agents or “factions” (including hospitals, regulators, politicians, academic institutions, corporations, housing authorities, community groups, and other stakeholders), and profound fragmentation where priorities and conduct in one domain are often independent of or incompatible with another.
The public health economy operates according to principles of public health realism, where self-serving interests and the pursuit of power (defined as influence and resource control) motivate agent behavior, moral imperatives become subsumed under self-interest, and agents may engage in misleading speech or exploit vulnerability to advance their positions. Critically, this economy actively reproduces health inequity through what Public Health Liberation theory calls “Douglassian phenomenology” — the pattern of investing resources in one health domain while simultaneously undermining those gains through contradictory actions in another domain (akin to Frederick Douglass’s observation about putting someone on their feet only to bring their head against a curbstone).
The reproduction of health inequity within this economy follows the Theory of Health Inequity Reproduction (THIR), which posits that inequity persists as a function of: a constant representing deeply entrenched structural forces, multiplied by the quotient of (calls for change and financial impacts) divided by constraints (regulations, laws, norms that either promote or hinder equity). The public health economy includes not only traditional public health infrastructure but the entire ecosystem of research enterprises (grant competition, publication practices, community engagement ethics), regulatory frameworks (policymaking, rulemaking, enforcement, legislative oversight), economic systems (income inequality, housing markets, labor conditions), educational systems (school quality, access disparities), legal frameworks (judicial decisions, enforcement mechanisms), social systems (stratification, organizing capacity, norms), environmental regulation (air and water quality, pollution permits), healthcare delivery systems (insurance, hospitals, pharmaceuticals), and the built environment (housing quality, neighborhood planning, displacement pressures).
Unlike related constructs such as “social determinants of health,” “structural violence,” or “political economy,” the public health economy seeks comprehensive transdisciplinary integration rather than fragmented interdisciplinary approaches. The concept aims to illuminate the dynamic interactions across this complex system, enable proactive surveillance to identify opportunities for intervention, and ultimately accelerate health equity through both horizontal integration (diversifying stakeholders and centering affected communities) and vertical integration (deploying multiple strategies and pathways across different sectors simultaneously).
Understanding the public health economy as “the other economy” relative to traditional market economics reveals why health equity cannot be achieved through economic growth alone — it requires fundamentally different principles that prioritize collective responsibility for population health over individual or corporate profit maximization, demanding what Public Health Liberation calls a radical transformation: ensuring “the health of the public — for everyone, everywhere, at all times.
In a more fully developed and precise paragraph form, the public health economy is a comprehensive, transdisciplinary framework conceived by Christopher Williams and Public Health Liberation to re-conceptualize public health theory and practice. It is defined as the entire ecosystem of interconnected social, political, and economic systems, actors, and forces that collectively determine a society's health outcomes and, most critically, create, perpetuate, and reproduce health inequities. This "other" economy operates in parallel to the traditional growth economy (measured by GDP and employment) and is characterized by a state of functional anarchy—not necessarily chaos, but the absence of a unifying moral authority or a central set of principles dedicated to achieving health equity. Instead, it is an arena of competing "factions"—including government agencies, healthcare systems, corporations, academic institutions, and community groups—all of whom are driven by rational self-interest and the pursuit of power, a dynamic analyzed through the lens of "public health realism."
The central assertion of this framework is that the primary output of the current public health economy is the sustained reproduction of vast health inequity. It accomplishes this by subsuming all structural determinants of health—such as housing, education, and environmental justice—under a single analytical lens, revealing how political decisions, economic incentives, regulatory failures, and social stratification interact to produce disparate outcomes. Power is the core currency in this economy, exercised not only through overt decision-making but more insidiously through less observable means, such as shaping public narratives, controlling access to resources, gatekeeping information, and creating conditions that suppress dissent and prevent the grievances of marginalized communities from gaining traction. This framework moves beyond identifying individual determinants to explaining the entire operational logic that allows harm to persist, evidenced by phenomena like lax legislative oversight, the "revolving door" between regulators and industry, and extractive "drive-by" research practices.
Ultimately, the concept of the public health economy is not merely descriptive but serves as a radical call to action for a fundamental disciplinary shift. It argues that traditional public health and clinical approaches are insufficient because they fail to contend with the underlying power dynamics of this system. The proposed solution, Public Health Liberation, seeks to transform this economy from a system that reproduces inequity into one that actively promotes health justice. This requires a new form of practice grounded in a deep, contextual understanding of the economy's workings and the strategic application of praxis—including legal action, community organizing, and policy advocacy—to challenge existing power structures. By demanding both "horizontal integration" (building broad, diverse coalitions) and "vertical integration" (creating pathways to influence policy across all sectors), the framework aims to introduce order and a new, justice-grounded morality into the public health economy, thereby making the achievement of true health equity possible."
2.2. Data Source
The data for this analysis are the five ranked lists of 100 countries produced by the respective LLMs. For the statistical analysis, a common set of countries appearing on all five lists was used to ensure comparability.
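Building the common set can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the alias map and the function names `normalize` and `common_rankings` are hypothetical, and the sketch assumes that spelling variants in the raw lists (for example "UAE" and "United Arab Emirates") are harmonized before intersecting, and that ranks are then reassigned from 1 to n within the shared set.

```python
from __future__ import annotations

# Hypothetical alias map covering spelling variants seen in the appendix lists.
ALIASES = {
    "UAE": "United Arab Emirates",
    "Czechia": "Czech Republic",
    "DRC (Congo)": "Democratic Republic of the Congo",
    "Democratic Republic of Congo": "Democratic Republic of the Congo",
}


def normalize(name: str) -> str:
    """Map a raw country label to one canonical spelling."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


def common_rankings(lists: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    """Restrict each ranked list to the shared countries and re-rank 1..n.

    `lists` maps a model name to its countries in rank order (best first).
    """
    normalized = {m: [normalize(c) for c in cs] for m, cs in lists.items()}
    shared = set.intersection(*(set(cs) for cs in normalized.values()))
    result = {}
    for model, countries in normalized.items():
        kept = [c for c in countries if c in shared]  # original order preserved
        result[model] = {c: rank for rank, c in enumerate(kept, start=1)}
    return result


# Toy input, not the study data.
toy = {
    "A": ["Norway", "UAE", "Chad", "Czechia"],
    "B": ["Czech Republic", "Norway", "United Arab Emirates", "Peru"],
}
ranks = common_rankings(toy)
```

Re-ranking within the shared set keeps each rater's ranks a permutation of 1..n, which the untied forms of Kendall's W and Spearman's ρ assume.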
2.3. Statistical Analysis
To assess interrater reliability, two non-parametric statistical methods were employed:
Kendall's Coefficient of Concordance (W): This statistic was used to measure the overall degree of agreement among all five LLM "raters." W ranges from 0 (no agreement) to 1 (perfect agreement). Values above 0.7 are commonly read as strong agreement, although such cutoffs are conventions rather than formal thresholds.
Spearman's Rank Correlation Coefficient (ρ): This was used to calculate the correlation between each possible pair of lists. It measures the strength and direction of the monotonic relationship between two sets of rankings, providing a more granular view of which models were most similar in their interpretations.
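Both statistics have simple closed forms when every rater ranks the same n items without ties. The sketch below uses those textbook formulas; it is an illustration under that no-ties assumption, not the code used to produce the reported values, and the function names are chosen here for convenience.

```python
from __future__ import annotations


def spearman_rho(a: list[int], b: list[int]) -> float:
    """Spearman's rho for two untied rankings of the same n items.

    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank difference.
    """
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendalls_w(rankings: list[list[int]]) -> float:
    """Kendall's W for m raters ranking the same n items without ties.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of each item's rank total from the mean rank total m * (n + 1) / 2.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy example: three raters, four items (not the study data).
toy = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
w = kendalls_w(toy)
rho_01 = spearman_rho(toy[0], toy[1])
```

If ties were present, both formulas would need a tie correction, which this sketch omits.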
3. Results
3.1. Overall Agreement (Kendall's W)
The analysis yielded a Kendall's W of 0.756. This value indicates a substantial level of agreement among the five LLMs in their ranking of the countries according to the Public Health Economy framework.
3.2. Pairwise Agreement (Spearman's ρ)
The Spearman's ρ correlation matrix (Table 1) reveals strong positive correlations between all pairs of lists, confirming the high degree of overall agreement.
Table 1: Spearman's Rank Correlation Coefficient (ρ) Matrix for Paired LLM Rankings
| | Grok | ChatGPT | DeepSeek | CoPilot | Gemini |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Grok | 1.000 | 0.841 | 0.941 | 0.880 | 0.771 |
| ChatGPT | 0.841 | 1.000 | 0.824 | 0.923 | 0.718 |
| DeepSeek | 0.941 | 0.824 | 1.000 | 0.852 | 0.842 |
| CoPilot | 0.880 | 0.923 | 0.852 | 1.000 | 0.741 |
| Gemini | 0.771 | 0.718 | 0.842 | 0.741 | 1.000 |
The highest agreement was found between Grok and DeepSeek (ρ = 0.941), while the lowest was between ChatGPT and Gemini (ρ = 0.718). Despite being the lowest, this correlation still represents a strong positive relationship.
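Table 1 can also be summarized per model and cross-checked against Kendall's W. For m raters giving complete, untied rankings of one common item set, W equals (1 + (m - 1) * mean ρ) / m, where mean ρ is the average over all rater pairs. The sketch below applies that identity to the Table 1 values. It gives a mean ρ of about 0.833 and an implied W of about 0.867, which is higher than the 0.756 reported in Section 3.1. The source does not describe how ties or country sets were handled in each calculation, so the reason for the gap cannot be determined from the text. The variable names are illustrative.

```python
# Upper triangle of Table 1 (Spearman's rho for each pair of models).
MODELS = ["Grok", "ChatGPT", "DeepSeek", "CoPilot", "Gemini"]
RHO = {
    ("Grok", "ChatGPT"): 0.841,
    ("Grok", "DeepSeek"): 0.941,
    ("Grok", "CoPilot"): 0.880,
    ("Grok", "Gemini"): 0.771,
    ("ChatGPT", "DeepSeek"): 0.824,
    ("ChatGPT", "CoPilot"): 0.923,
    ("ChatGPT", "Gemini"): 0.718,
    ("DeepSeek", "CoPilot"): 0.852,
    ("DeepSeek", "Gemini"): 0.842,
    ("CoPilot", "Gemini"): 0.741,
}


def pair_rho(a: str, b: str) -> float:
    """Look up rho for a pair regardless of the order of the two names."""
    return RHO[(a, b)] if (a, b) in RHO else RHO[(b, a)]


m = len(MODELS)
mean_rho = sum(RHO.values()) / len(RHO)

# W implied by the mean pairwise rho, assuming complete untied rankings.
implied_w = (1 + (m - 1) * mean_rho) / m

# Average agreement of each model with the other four.
per_model = {
    model: sum(pair_rho(model, other) for other in MODELS if other != model) / (m - 1)
    for model in MODELS
}
least_aligned = min(per_model, key=per_model.get)
```

On these figures Gemini has the lowest average correlation with the other models (about 0.768) and DeepSeek the highest (about 0.865).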
3.3. Qualitative Observations
High-Tier Consensus: All five models consistently placed Nordic countries (Norway, Sweden, Finland, Denmark) and other Western European nations, alongside countries like Canada and New Zealand, in the top echelons. This suggests the models interpreted the PHE framework as favoring nations with strong social safety nets, high levels of regulation, and robust public infrastructure.
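Low-Tier Consensus: The lowest ranks were likewise dominated by nations experiencing significant political instability, conflict, or extreme poverty (e.g., Central African Republic, Yemen, Afghanistan, DRC). Agreement at this end is harder to quantify because the lists do not cover the same countries: Gemini's list includes none of these four and ends with Pakistan, Nigeria, and Bangladesh, so countries missing from any list fall outside the common set used for the statistics.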
Mid-Rank Volatility: The most significant divergence occurred in the middle of the rankings. The placement of the United States was highly variable: 85th (Grok), 27th (ChatGPT), 68th (DeepSeek), 30th (CoPilot), and 65th (Gemini). China's rank also varied considerably: 55th (Grok), 19th (ChatGPT), 48th (DeepSeek), 25th (CoPilot), and 42nd (Gemini). This indicates that the models weighed the competing elements of the PHE framework (such as economic power, state control, social inequity, and regulatory capture) very differently for these complex nations.
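The spread for a single country can be summarized with a few descriptive statistics. The sketch below is illustrative only: the ranks are read as positions in the appendix lists (List 1 to List 5), and `rank_spread` is a hypothetical helper rather than part of the study's analysis.

```python
from __future__ import annotations

from statistics import median

# Ranks in list order (Grok, ChatGPT, DeepSeek, CoPilot, Gemini), read as
# positions in the appendix lists.
RANKS = {
    "United States": [85, 27, 68, 30, 65],
    "China": [55, 19, 48, 25, 42],
}


def rank_spread(ranks: list[int]) -> dict[str, float]:
    """Summarize how far apart the raters placed one country."""
    return {
        "best": min(ranks),
        "worst": max(ranks),
        "range": max(ranks) - min(ranks),
        "median": median(ranks),
    }


spread = {country: rank_spread(r) for country, r in RANKS.items()}
```

On these positions the United States spans 58 places between its best and worst rank, and China spans 36.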
Model-Specific Interpretation: Gemini's list was a notable outlier in its top rankings, placing Taiwan and South Korea at #1 and #2, respectively. This deviated from the Nordic-centric consensus of the other four models and suggests Gemini may have placed a higher weight on factors like technological integration in healthcare, efficient governance, or specific public health outcomes over the broader social and political dynamics emphasized by the others.
4. Discussion
The substantial agreement demonstrated by a Kendall's W of 0.756 suggests that current LLMs, despite their different architectures and training refinements, have developed a convergent understanding of global socio-political realities. When presented with a novel theoretical lens, they were able to map its abstract principles onto this shared understanding, resulting in broadly similar macro-level rankings (i.e., stable, wealthy democracies at the top; unstable, poor nations at the bottom).
However, the variance revealed by the Spearman correlations and the qualitative analysis is arguably more instructive. The disagreement over the ranking of the United States and China highlights the core challenge of the prompt: operationalizing a concept in which immense wealth and innovation exist alongside profound structural inequity. This coexistence of gains in one domain with harms in another is what the prompt calls "Douglassian phenomenology," and the models evidently weighted its two sides differently.
The unique output from Gemini suggests the emergence of "interpretive signatures" or "house styles" among LLMs. This may reflect subtle differences in training data, fine-tuning priorities (e.g., for factuality vs. creative reasoning), or architectural design. This finding has significant implications, indicating that for complex, open-ended analytical tasks, the choice of LLM can materially influence the outcome.
Limitations: This study is based on a single, highly complex prompt and represents a snapshot of models that are under continuous development. The PHE framework has no pre-existing "ground truth" ranking, meaning this study can assess reliability but not accuracy.
5. Conclusion
This study demonstrates that large language models exhibit substantial interrater reliability when asked to perform a complex ranking task based on a novel socio-political theory. The high degree of concordance suggests they are capable of operationalizing abstract concepts in a logically consistent manner. However, the analysis also reveals significant and meaningful variance between models, particularly for multifaceted cases like the U.S. and China, and shows evidence of model-specific interpretive biases.
As LLMs are increasingly integrated into research and decision-making processes, it is crucial to move beyond viewing them as monolithic sources of information. Instead, they should be treated as distinct analytical tools, each with its own potential biases and interpretive tendencies. Future research should focus on methods to deconstruct the "reasoning" process of these models to better understand how they operationalize ambiguity and arrive at their conclusions.
List 1 — Grok
Norway
Sweden
Denmark
Finland
Iceland
Switzerland
Netherlands
Australia
New Zealand
Canada
Japan
South Korea
Germany
United Kingdom
Austria
Belgium
France
Ireland
Singapore
Taiwan
Luxembourg
Spain
Italy
Portugal
Slovenia
Czechia
Estonia
Malta
Cyprus
Israel
Costa Rica
Chile
Uruguay
Greece
Poland
Lithuania
Latvia
Slovakia
Croatia
Hungary
Qatar
United Arab Emirates
Kuwait
Malaysia
Thailand
Cuba
Vietnam
Tunisia
Algeria
Morocco
Rwanda
Bhutan
Sri Lanka
Panama
China
Argentina
Brazil
Mexico
Colombia
Peru
Ecuador
Turkey
Saudi Arabia
Bahrain
Oman
Jordan
Kazakhstan
Russia
Belarus
Ukraine
Serbia
Bulgaria
Romania
North Macedonia
Albania
Montenegro
Bosnia and Herzegovina
Armenia
Georgia
Moldova
India
Indonesia
Philippines
South Africa
United States
Egypt
Iran
Pakistan
Bangladesh
Nigeria
Kenya
Ghana
Ethiopia
Tanzania
Uganda
Mozambique
Afghanistan
Yemen
Democratic Republic of Congo
Central African Republic
List 2 — ChatGPT
Norway
Finland
Sweden
Denmark
Netherlands
Germany
New Zealand
Iceland
Japan
Switzerland
Canada
Singapore
Austria
Belgium
France
South Korea
United Kingdom
Spain
China
Portugal
Estonia
Slovenia
Czechia
Australia
Ireland
Italy
United States
Israel
Lithuania
Uruguay
Costa Rica
Latvia
Poland
Malta
Slovakia
Greece
Chile
Croatia
Qatar
UAE
Hungary
Bahrain
Saudi Arabia
Kazakhstan
Malaysia
Jordan
Thailand
Kuwait
Panama
Oman
Cuba
Brazil
Argentina
Vietnam
Colombia
Turkey
Mauritius
Mexico
South Africa
Georgia
Dominican Republic
Peru
Morocco
Ecuador
Tunisia
Philippines
Indonesia
India
Bolivia
Egypt
Sri Lanka
Paraguay
Guatemala
Honduras
El Salvador
Ghana
Kenya
Nigeria
Namibia
Tanzania
Bangladesh
Uganda
Rwanda
Cambodia
Nepal
Madagascar
Pakistan
Ethiopia
Myanmar
Sudan
Haiti
DRC (Congo)
Afghanistan
Yemen
Chad
South Sudan
Central African Republic
Somalia
Syria
North Korea
List 3 — DeepSeek
Norway
Sweden
Finland
Denmark
Iceland
Switzerland
Netherlands
Japan
Australia
Canada
New Zealand
Germany
Austria
Ireland
Luxembourg
Belgium
France
Singapore
Slovenia
Spain
United Kingdom
Italy
Malta
Cyprus
South Korea
Israel
Portugal
Czech Republic
Estonia
Poland
Croatia
Greece
Slovakia
Lithuania
Costa Rica
Cuba
Uruguay
Chile
Argentina
Panama
Latvia
Hungary
Bulgaria
Montenegro
Serbia
Bosnia and Herzegovina
Albania
China
Romania
North Macedonia
Malaysia
Mexico
Thailand
Ukraine
Brazil
Colombia
Peru
Ecuador
Paraguay
Vietnam
Philippines
Indonesia
Sri Lanka
Egypt
Jordan
Saudi Arabia
South Africa
United States
Kuwait
United Arab Emirates
Qatar
Oman
Bahrain
Armenia
Georgia
Azerbaijan
Belarus
Kazakhstan
Turkmenistan
Uzbekistan
Kyrgyzstan
Tajikistan
Moldova
Bolivia
India
Guatemala
Honduras
El Salvador
Nicaragua
Morocco
Algeria
Tunisia
Libya
Iran
Nigeria
Pakistan
Haiti
Yemen
Somalia
Central African Republic
List 4 — CoPilot
Finland
Norway
Denmark
Sweden
Netherlands
Germany
Canada
New Zealand
Japan
Switzerland
Australia
Belgium
Austria
Iceland
Ireland
Singapore
Portugal
Spain
Italy
United Kingdom
South Korea
Estonia
Slovenia
Czech Republic
China
Poland
Israel
Chile
Costa Rica
United States
Uruguay
Thailand
Malaysia
Vietnam
Argentina
Brazil
South Africa
Mexico
Colombia
Philippines
Greece
Panama
Turkey
Peru
Ecuador
Lithuania
Latvia
Slovakia
Croatia
Romania
Bulgaria
Serbia
Montenegro
North Macedonia
Albania
Georgia
Armenia
Ukraine
Belarus
Russia
Kazakhstan
Kyrgyzstan
Uzbekistan
Tajikistan
Azerbaijan
Morocco
Tunisia
Algeria
Egypt
Jordan
Lebanon
Saudi Arabia
United Arab Emirates
Qatar
Kuwait
Bahrain
Oman
India
Sri Lanka
Nepal
Bangladesh
Indonesia
Mongolia
Cambodia
Laos
Myanmar
Papua New Guinea
Botswana
Namibia
Rwanda
Kenya
Ghana
Senegal
Ethiopia
Uganda
Nigeria
Angola
Mozambique
Democratic Republic of the Congo
Somalia
List 5 — Gemini 2.5 Pro
Taiwan
South Korea
Japan
Sweden
Norway
Netherlands
Switzerland
Australia
Denmark
Finland
Germany
Canada
Iceland
New Zealand
Austria
France
United Kingdom
Belgium
Ireland
Luxembourg
Spain
Portugal
Slovenia
Italy
Singapore
Czech Republic
Israel
Estonia
Malta
Cyprus
United Arab Emirates
Greece
Poland
Lithuania
Qatar
Latvia
Slovakia
Brunei Darussalam
Bahrain
Saudi Arabia
Uruguay
China
Montenegro
South Africa
Botswana
Cuba
Trinidad and Tobago
Jamaica
Russia
Ukraine
Kuwait
Argentina
Hungary
Croatia
Romania
Oman
Bulgaria
Chile
Panama
Costa Rica
Malaysia
Mauritius
Serbia
Albania
United States
Mexico
Thailand
Sri Lanka
Turkey
Belarus
Georgia
Armenia
Brazil
Jordan
Colombia
North Macedonia
Moldova
Kazakhstan
Peru
Azerbaijan
Ecuador
Tunisia
Lebanon
Bosnia and Herzegovina
Paraguay
Dominican Republic
Iran
El Salvador
Algeria
Philippines
Indonesia
Vietnam
Egypt
Morocco
India
Uzbekistan
Kyrgyzstan
Pakistan
Nigeria
Bangladesh