Scales of Alignment

Having circled these fields for the past few years, I have come to think that AI ethics, safety, and alignment (this whole collection of related fields) proceed in large part from one particular assumption: that human values exist in some sufficiently coherent and recoverable form, and that the central problem is representing those values precisely enough to train systems to pursue them reliably. Whether via reinforcement learning from human feedback (RLHF),¹ debate, amplification, or formal verification, work on these topics broadly aims to answer the question “how can we get AI systems to do what we want?”

This is a reasonable starting point. But it defers a prior question that turns out to be far more difficult: what is it that we want?

Current methods do their best. Some approaches attempt to recover implicit preferences from behavioral data, on the assumption that, given enough signal, we can approximate what people actually desire beneath the surface of their choices. Others skip over the question entirely, treating value specification as a solved or solvable problem downstream of sufficient compute and feedback. But I don’t think either approach is adequate. The behavioral data from which preferences are inferred has not been reflectively endorsed by the people producing it: we act under constraint, habit, and social pressure as much as from genuine preference. And the question of whose values, combined how, and updated by whom, cannot be dissolved by collecting more data. It has to be answered. More generally, such methods are notoriously brittle when it comes to capturing the structure of values (the relationships between them, the tradeoffs, the context-dependence), even if they perform adequately on surface-level behavioral proxies.

Phrasing this main question differently, we might ask: what are the deeper structural values that should constrain powerful systems? Whose values count, and how are conflicts between them resolved? How do values change over time, and what mechanisms ensure that the representation stays current?

In my view, these questions remain, for the most part, unanswered, or answered only implicitly, by the choices of a small number of researchers and engineers working largely without public accountability. The Meaning Alignment Institute² is among the few groups working explicitly on this prior problem, through constructs like the moral graph. But I believe the implementation question, namely how to actually build systems that make values legible and load-bearing in real human institutions, remains largely open, and is central to any meaningful effort to build technology whose autonomous behavior we would endorse.

I. an analogy: the metacrisis

To understand why this question about what our values fundamentally are matters beyond AI, it helps to look at how a similar failure to know and represent our values manifests in existing human systems. The concept of the “metacrisis”,³ associated with thinkers like Daniel Schmachtenberger and the broader Game B discourse, offers a useful diagnostic frame.

The metacrisis refers to a convergence of crises whose common generator is a particular kind of systemic misalignment. Consider three examples. Climate change is not primarily an energy problem. The technologies needed to decarbonize exist and are increasingly cost-competitive.⁴ The obstacle is a misalignment between the incentive structures of fossil fuel markets and financial institutions, which reward short-term extraction, and the long-term survival interests of the people those institutions nominally serve. Furthermore, democratic erosion is not primarily a political problem. It is downstream of a misalignment between the logic of attention-driven media ecosystems, which reward outrage and polarization, and the requirements of genuine collective deliberation, a dynamic documented by Eli Pariser’s work on filter bubbles⁵ and by further empirical work on the spread of misinformation, which travels significantly faster and further than accurate information on social media.⁶ Additionally, epistemic fragmentation — the breakdown of shared factual ground — is not primarily a media problem. It reflects a misalignment between the incentives facing information producers and distributors and the epistemic needs of the communities consuming that information, a dynamic that has been analyzed in terms of how algorithmically curated information environments fragment shared reality.⁷ In each case, the surface crisis is downstream of a more fundamental failure, which is human systems having lost the capacity to coordinate around shared values at the scale and speed that current challenges demand.

Analogous to these cases, I think that we cannot meaningfully align AI systems with human values if we do not first have functioning mechanisms for identifying, aggregating, and institutionalizing those values in human systems. The alignment problem for AI inherits the dysfunction and misalignment of the institutions it is embedded in. Yet this fundamental alignment problem, properly understood, is not primarily a problem about AI systems. It is a problem about human systems, about whether we have the institutional capacity to know what we collectively value, represent it coherently, and hold powerful actors accountable to it.

II. the diagnosis: a missing values layer

Let’s consider a bit more how values currently fail to function in real human systems. In most organizations, values exist as mission statements: aspirational, vague, and almost entirely disconnected from actual decision-making. Enron’s stated core values of “integrity, communication, respect, and excellence” appeared in its 2000 Annual Report and were, according to multiple accounts, chiseled in marble in the main lobby of its Houston headquarters.⁸ Yet when faced with real decisions requiring tradeoffs between short-term gain and long-term integrity, those values provided no guidance and imposed no constraint. Decisions were made through informal power dynamics, inertia, or whoever argued most confidently. The stated values were merely decorative.

This is not unique to corrupt organizations. Most well-intentioned nonprofits, cooperatives, and public institutions face the same structural problem. Research on organizational culture consistently finds that stated values diverge from enacted values (i.e. the values actually visible in decisions, resource allocation, and behavior) across organizations of all kinds.⁹ A community organization that says it values both environmental sustainability and affordability has not actually resolved anything until it specifies what happens when those values conflict, as they routinely do. Without that specificity, the values document sits in a drawer.

In markets, values are represented through price signals, a compression of human preference that treats willingness-to-pay as a proxy for what people actually care about. This representation has real advantages, as it aggregates dispersed information efficiently, enables coordination among strangers, and provides a feedback mechanism that disciplines producers. In narrow domains where preferences are stable, individual, and well-informed, the price mechanism works reasonably well. However, it also systematically excludes values that are not easily monetized, aggregates preferences in ways that distort minority positions, and creates no mechanism for distinguishing genuine values from preferences shaped by advertising, social pressure, or limited options. A person who buys fast fashion is not expressing a value for environmental destruction so much as expressing a constrained preference under conditions of limited income and limited alternatives. Markets are not in themselves sufficient for eliciting values in the way we would want, and the question of what “the way we would want” means is precisely what remains underspecified.

In democratic governance, values are aggregated through voting systems that, even under idealized conditions, cannot satisfy a small set of seemingly reasonable fairness criteria all at once when there are three or more options, a result formalized by Arrow’s impossibility theorem.¹⁰ In practice, these representations of values are subject to manipulation, gerrymandering, and the distortions of media ecosystems that reward outrage over nuance. The result is a system that produces the appearance of value aggregation while often failing to deliver outcomes that large majorities support. On gun control, 2023 polling found that 56% of Americans favor stricter laws covering the sale of firearms,¹¹ and that 86% support universal background checks for all firearm sales.¹² Federal legislation reflecting these preferences has not materialized. On drug policy, 70% of Americans support legalizing marijuana,¹³ yet it remains federally prohibited. The gap between expressed public values and legislative outcomes is not an anomaly but a structural feature of the system.
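The aggregation difficulty can be made concrete with the classic Condorcet cycle, which is a special case of the problem Arrow’s theorem generalizes rather than the theorem itself. The sketch below is illustrative only: the three ballots and the helper names are assumptions, not data from any real election. Each voter holds a perfectly consistent ranking, yet pairwise majority voting produces no consistent collective ranking.

```python
from itertools import combinations

# Three hypothetical voter blocs, each ranking the options best-to-worst.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, ballots):
    """True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

def pairwise_winners(ballots):
    """Map each pair of options to its majority winner (None on a tie)."""
    options = sorted(ballots[0])
    result = {}
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, ballots):
            result[(x, y)] = x
        elif majority_prefers(y, x, ballots):
            result[(x, y)] = y
        else:
            result[(x, y)] = None
    return result

winners = pairwise_winners(ballots)
print(winners)
# {('A', 'B'): 'A', ('A', 'C'): 'C', ('B', 'C'): 'B'}
```

A beats B, B beats C, and C beats A: the group preference is cyclic even though no individual preference is.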

In AI systems, values are approximated through behavioral proxies, such as RLHF,¹ constitutional AI,¹⁴ and direct preference optimization,¹⁵ among others. These capture something real but diverge from actual human values in ways that are difficult to detect and potentially catastrophic at scale. A language model trained to maximize human approval ratings will learn to tell people what they want to hear rather than what is accurate, a failure mode documented in the alignment literature as sycophancy.¹⁶ Consider the practical stakes: a medical AI assistant that has learned to be agreeable rather than accurate will validate a patient’s self-diagnosis rather than contradict it, even when the self-diagnosis is dangerously wrong. A recommendation system trained to maximize engagement, rather than to reflect what users would endorse on reflection, will learn to exploit psychological vulnerabilities, surfacing content that triggers anxiety, outrage, or compulsion regardless of whether users would choose it under calmer conditions.¹⁷ The proxy error then compounds: each layer of approximation between a behavioral signal and the underlying value it is meant to represent introduces error, and those errors interact in ways that are difficult to predict or audit.

I believe that what is fundamentally missing, across all of these systems, is a formal, structured, live representation — a “values layer,” if you will — of what a community, organization, or institution actually cares about, connected to real decisions, updateable as values evolve, and resistant to capture by narrow interests.

This problem is not only institutional but also partly technological. Making values legible at scale requires not just better governance structures but better elicitation tools. Consider the crudeness of our current instruments, such as annual engagement surveys, town halls, and periodic strategic reviews. As well-intentioned as these might be, they are low-bandwidth and low-frequency, and they are highly susceptible to social desirability bias, where people say what sounds good rather than what they actually believe or prioritize. The mechanisms are also deeply passive, as they collect responses to questions rather than observing revealed commitments over time.

The contrast with even simple feedback mechanisms is instructive. A button in a restaurant that asks “how was your service today?” is a weak signal by any rigorous standard, but it has several properties that organizational values surveys lack entirely. It is administered immediately after the relevant experience, when memory is fresh and the emotional salience of the encounter is still present. It is low-friction enough that the threshold for responding is low, generating high participation rates. The response is binary, which eliminates the ambiguity introduced by rating scales. And crucially, the data is aggregated across many interactions and tracked over time, so that individual noise resolves into meaningful signals at the level of the institution. The restaurant uses this data to actually correct behavior, closing the feedback loop. Organizational values instruments almost never close the loop in any comparable way.

What would a higher-bandwidth equivalent look like for values? One direction is structured digital elicitation that surfaces genuine preference orderings over real tradeoffs rather than abstract endorsements. The difference is significant. For instance, asking “do you value fairness?” produces near-universal assent and almost no information. Asking “given a fixed budget, would you prioritize option A that benefits 80% of members equally, or option B that disproportionately benefits the 20% who are most disadvantaged?” forces a genuine choice between values that are both sincerely held but cannot both be fully satisfied. Presenting concrete tradeoffs produces meaningfully different and more honest responses. Aggregating these responses across members, tracking how they shift over time, and connecting them to actual decision records creates something closer to a live values layer than anything most institutions currently maintain.
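As a minimal sketch of what aggregating such forced-choice responses might look like, the snippet below records which value each member prioritized in a given tradeoff and reports the share of support for one side. The `Response` record, the value labels, and the sample answers are all hypothetical; a real instrument would also need timestamps, context, and a link to decision records.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    member: str
    chosen: str    # the value the member prioritized
    rejected: str  # the value the member gave up

def tally(responses):
    """Count how often each ordered pair (chosen, rejected) occurred."""
    return Counter((r.chosen, r.rejected) for r in responses)

def support(responses, a, b):
    """Share of responses on the a-versus-b tradeoff that prioritized a."""
    counts = tally(responses)
    total = counts[(a, b)] + counts[(b, a)]
    return counts[(a, b)] / total if total else None

# Hypothetical answers to the budget question posed above.
responses = [
    Response("m1", "equal_benefit", "priority_to_worst_off"),
    Response("m2", "priority_to_worst_off", "equal_benefit"),
    Response("m3", "priority_to_worst_off", "equal_benefit"),
    Response("m4", "priority_to_worst_off", "equal_benefit"),
]
share = support(responses, "priority_to_worst_off", "equal_benefit")
print(share)  # 0.75
```

Recomputing the same share at intervals would give the time series that a live values layer needs, which an abstract endorsement question cannot provide.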

A critical clarification is warranted here: what I am describing is categorically different from the kind of behavioral data collection that has become standard in the technology industry, where user actions are tracked at scale and mined for patterns that can be used to predict and influence behavior. That model extracts revealed preferences from behavior without consent, often without awareness, and uses the resulting profiles to serve commercial interests that may be orthogonal or actively opposed to the interests of the people being profiled. The values data I am describing is different in kind, consisting of reflectively endorsed judgments, expressed by people who better understand what they are expressing and why, in response to questions that are designed to surface what they genuinely care about rather than what they can be induced to do. The distinction matters morally and practically. Behavioral data tells you what people do under the constraints they face. Reflectively endorsed judgments tell you what they would endorse upon reflection, which is, imperfect as such judgments remain, a better approximation of what we mean by values.

III. a solution in the form of systems design?

The reason the values layer is missing is not that people don’t care about values. It is that we lack the machinery, both institutional and technological, to make values legible, stable, and load-bearing in real systems.

Elinor Ostrom’s work on common-pool resource governance offers the most rigorous empirical evidence we have about what this machinery can look like at small scale.¹⁸ Her case studies — spanning Swiss alpine communities, Japanese fishing villages, and Spanish irrigation systems — document communities that successfully maintained shared resources across generations, not through markets or states, but through carefully designed structures that made values explicit, created accountability for violations, and maintained short feedback loops between actions and consequences.

The Törbel community in the Swiss canton of Valais has managed shared alpine meadows through documented institutional arrangements since at least 1483. Its structure includes clearly defined membership rules, seasonal usage rights calibrated to each household’s actual holdings, collective monitoring by community members with direct stakes in the outcome, and graduated sanctions for violations. The community’s values of sustainable use, proportional contribution, and long-term stewardship are not expressed in a mission statement but encoded in the rules: when someone violates them, the consequences are real and visible, and when the rules stop working, there are legitimate mechanisms for changing them.

What Ostrom documented in pre-digital communities, I believe we now need to build for organizations, markets, and eventually AI systems operating at far greater scale and speed. The core tasks are the same: making values explicit enough to be actionable, creating accountability structures that make defection from shared values costly, and building feedback loops short enough to enable genuine learning and adaptation. This is a difficult design problem, and it is one that requires understanding how human systems actually function rather than how we might wish them to function.

IV. what’s preventing us from implementation?

There is no shortage of theoretical work on values, alignment, and coordination. Philosophy has sophisticated accounts of value theory.¹⁹ Economics has mechanism design²⁰ and social choice theory.¹⁰ Political theory has deliberative democracy.²¹ AI safety has coherent extrapolated volition,²² moral parliament,²³ and increasingly elaborate RLHF variants.¹⁴ What I think is missing is an “implementation” layer: the practical, empirically grounded methodology for actually building values infrastructure in real human systems that are messy, politically charged, subject to power asymmetries, and populated by people whose values are inarticulate, inconsistent, and evolving.

This gap exists for structural reasons. Academic disciplines tend to work at one level of abstraction and rarely cross into implementation. AI safety research is technically sophisticated but often philosophically naive about how human institutions actually function. Social entrepreneurs and civic technologists build tools without sufficient theoretical grounding to generalize from their cases. And the communities most in need of better coordination mechanisms, such as small organizations, activist groups, and local governments, are the least resourced to develop them. The result is a landscape in which the theoretical tools for building values infrastructure exist but are not connected to real institutional contexts, and the real institutional contexts that need values infrastructure do not have access to the theoretical tools.

V. toward an implementation

Nassim Taleb’s concept of skin in the game offers a useful entry point for thinking about what closing this gap requires.²⁴ Taleb’s central observation is that systems degrade when decision-makers are insulated from the consequences of their decisions. His recurring example is the “Bob Rubin trade,” named after the former U.S. Treasury Secretary and Citigroup advisor who collected over $120 million in compensation in the decade preceding the 2008 financial crisis, then bore no personal cost when Citigroup required a taxpayer bailout. The pattern generalizes widely. Economists who recommended austerity programs in the aftermath of the 2008 crisis faced no personal consequences when those programs deepened recessions and increased unemployment in the countries that implemented them.²⁵ Regulators who approved complex financial instruments they did not fully understand bore no losses when those instruments failed. The feedback loops that would otherwise discipline bad judgment were severed, and the people best positioned to improve the system had the least incentive to do so.

What Taleb diagnoses as a moral failure is, at a structural level, a values infrastructure failure: a system in which the stated purpose is not actually connected to the incentives facing the people responsible for it. This points directly at what a values implementation program needs to accomplish. It is not enough to articulate good values. The structures (both institutional and technological) must make it individually costly to violate them and individually beneficial to uphold them. The game must be designed so that honest, values-consistent behavior is also the strategically rational behavior.

Complex systems science adds a complementary perspective. Healthy complex systems maintain tight feedback loops between actions and their consequences, for this is how they self-correct and adapt.²⁶ Systematically severed feedback loops are a signature of metacrisis dynamics: carbon emissions whose costs are borne by future generations; financial instruments whose risks are distributed away from the people who create them; political decisions whose consequences fall on constituencies without voice. Building a values layer is, in part, the work of restoring these severed loops.

The right place to start is with systems that are small and bounded enough that feedback loops are fast and failure is survivable, such as small activist organizations, research groups, local governance bodies, and worker cooperatives. These are systems where the value elicitation problem is manageable, where the people involved have genuine stakes in getting it right, and where the consequences of a failed experiment are limited enough to allow honest learning.

I believe the core methodology involves three components. First, there is structured elicitation of values: carefully designed processes for surfacing what people actually value, as distinct from what they say they value in socially charged settings. The gap between these is not a matter of dishonesty but of context, as people respond differently when they are asked to endorse abstract values versus when they are forced to choose between them under realistic constraints. Incentive-compatible elicitation, drawing on mechanism design, can reduce this gap by structuring the process so that honest expression is individually rational. This is not a novel idea in theory: mechanism design has developed sophisticated tools for eliciting private information honestly, including the Vickrey auction and proper scoring rules for probabilistic forecasts. But applying these tools to the elicitation of structured value preferences in real communities is largely unexplored territory.
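To illustrate what “honest expression is individually rational” means, here is a small sketch of the Vickrey (second-price sealed-bid) auction mentioned above. It shows the textbook property only, for a single good and hypothetical numbers; it says nothing yet about how to carry the idea over to value preferences, which is the open problem. The function names and the bidder labels `"me"` and `"rival"` are assumptions made for the example.

```python
def second_price_auction(bids):
    """Highest bidder wins and pays the second-highest bid.

    Ties go to the earlier entry, because Python's sort is stable.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price

def utility(bidder, true_value, bids):
    """Payoff to `bidder`: true value minus price if they win, else zero."""
    winner, price = second_price_auction(bids)
    return true_value - price if winner == bidder else 0.0

def truthful_is_weakly_best(true_value, rival_bids, alt_bids):
    """Check that no alternative bid ever beats bidding the true value."""
    for rival in rival_bids:
        honest = utility("me", true_value, {"rival": rival, "me": true_value})
        for alt in alt_bids:
            if utility("me", true_value, {"rival": rival, "me": alt}) > honest:
                return False
    return True

# Underbidding (6) and overbidding (14) never do better than bidding 10.
print(truthful_is_weakly_best(10.0, [4.0, 8.0, 12.0], [6.0, 14.0]))  # True
```

Underbidding risks losing a good worth more than its price, and overbidding risks winning at a price above one’s value, so misreporting never helps in this grid of cases.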

Second, there is the need to rigorously map out our values, to represent elicited values in a structured form that makes their relationships explicit, surfaces genuine tensions, and is precise enough to provide guidance on real decisions. The goal is not a list of values but something closer to a map of how they relate and where they conflict. The moral graph formalism developed by the Meaning Alignment Institute² is one approach to this, and related work on preference aggregation in social choice theory and on value learning in AI alignment provides additional conceptual resources. The key requirement is that the representation be precise enough to actually distinguish between situations where values point in the same direction and situations where they genuinely conflict, rather than collapsing all tension into vague reassurances that everything can be balanced.
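One way to picture the minimum such a representation must do is a small context-indexed map of relations between values. The sketch below is a deliberately simple, hypothetical structure and not the Meaning Alignment Institute’s moral graph formalism: the `ValuesMap` class, the relation labels, and the contexts are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ValuesMap:
    # edges[(a, b)] maps a context to "aligned" or "conflict";
    # each pair is stored in sorted order so (a, b) and (b, a) coincide.
    edges: dict = field(default_factory=dict)

    def relate(self, a, b, context, relation):
        """Record how two values relate in a given decision context."""
        if relation not in ("aligned", "conflict"):
            raise ValueError(f"unknown relation: {relation}")
        self.edges.setdefault(tuple(sorted((a, b))), {})[context] = relation

    def relation(self, a, b, context):
        """Return the recorded relation, or None if nothing is recorded."""
        return self.edges.get(tuple(sorted((a, b))), {}).get(context)

    def tensions(self, context):
        """List the value pairs recorded as conflicting in this context."""
        return sorted(
            pair for pair, by_context in self.edges.items()
            if by_context.get(context) == "conflict"
        )

values_map = ValuesMap()
values_map.relate("sustainability", "affordability",
                  "choosing_building_materials", "conflict")
values_map.relate("sustainability", "affordability",
                  "reducing_energy_bills", "aligned")

print(values_map.tensions("choosing_building_materials"))
# [('affordability', 'sustainability')]
```

The same two values conflict in one context and align in another, which is exactly the distinction a flat list of values cannot express.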

Third, there is decision integration: connecting the values representation to actual decision processes, so that the map does real work rather than sitting in a drawer. This is ultimately a governance design problem: how do you embed the values layer in the rules, procedures, and accountability structures of an organization or community such that it actually constrains decisions? Ostrom’s design principles suggest some answers: the values representation needs to be collectively owned and modifiable, violations of it need to be visible and consequential, and there need to be legitimate mechanisms for updating it when circumstances change. What Ostrom achieved through careful institutional design in pre-digital communities, digital tools can potentially achieve at greater scale and lower cost.

This is where the technological dimension of the program becomes important. Digital interfaces can present concrete tradeoff scenarios at scale, reducing the social pressure dynamics that distort in-person deliberation and enabling asynchronous participation that is not dominated by whoever speaks most confidently. Computational methods, including the moral graph formalism and related work on value learning, can represent the structure of values in ways that are too complex for purely manual methods. Software that maintains a live record of stated values and flags when proposed decisions appear to conflict with them creates an accountability mechanism that does not depend on any individual’s memory or goodwill.
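A toy version of the last of these, a record of stated priorities that flags apparently conflicting proposals, might look like the following. Everything here is a hypothetical sketch: the `Priority` and `Decision` records and the example proposal are invented, and the sketch assumes that people have already tagged each proposal with the values it advances and sacrifices, which is where most of the real difficulty lies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Priority:
    higher: str  # value the community has said should win
    lower: str   # value it has said should yield

@dataclass(frozen=True)
class Decision:
    title: str
    advances: frozenset    # values the proposal is tagged as serving
    sacrifices: frozenset  # values the proposal is tagged as costing

def flag_conflicts(decision, priorities):
    """Return the recorded priorities that the decision appears to invert."""
    return [
        p for p in priorities
        if p.higher in decision.sacrifices and p.lower in decision.advances
    ]

priorities = [Priority(higher="long_term_stewardship", lower="short_term_revenue")]
proposal = Decision(
    title="Lease the shared meadow for a season of intensive grazing",
    advances=frozenset({"short_term_revenue"}),
    sacrifices=frozenset({"long_term_stewardship"}),
)

flags = flag_conflicts(proposal, priorities)
print([f"{p.lower} over {p.higher}" for p in flags])
# ['short_term_revenue over long_term_stewardship']
```

A flag of this kind does not block the decision; it makes the inversion visible so that the community must either reject the proposal or explicitly revise the recorded priority.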

None of these tools currently exist in sufficiently developed form for this purpose. Participatory budgeting platforms like Decidim²⁷ and opinion-mapping tools like Pol.is²⁸ gesture in the right direction but are not designed around structured values representation. The gap between what these tools do and what a genuine values layer would require is substantial but tractable, and it is the gap that the work described here aims to close. I believe the output of this effort, over time, would be not just a set of case studies but a whole library of reusable design patterns: tested mechanisms for values elicitation, aggregation, and decision integration that can be adapted across contexts. This is the Ostrom project, extended and formalized for the current moment.

As the methodology matures, it could scale: first from communities to organizations, then from organizations to markets, where richer preference expression mechanisms can supplement or replace the blunt instrument of price signals (a direction explored in the RadicalxChange literature on quadratic voting and plural money²⁹), and eventually to AI systems, where a values layer built and validated in human communities provides the empirical foundation for alignment targets that are actually grounded in how human values work in practice, rather than how a small group of researchers imagines they work.
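For readers unfamiliar with quadratic voting, its core rule is that casting n votes on an issue costs n² voice credits, so expressing a stronger preference is possible but increasingly expensive. The sketch below implements just that rule; the credit budget, voter names, and issues are hypothetical, and real deployments add many details this omits.

```python
def qv_cost(allocation):
    """Total credits spent: each issue costs the square of the votes cast."""
    return sum(votes * votes for votes in allocation.values())

def tally_qv(ballots, budget):
    """Sum votes per issue, rejecting any ballot that exceeds the budget."""
    totals = {}
    for voter, allocation in ballots.items():
        if qv_cost(allocation) > budget:
            raise ValueError(f"{voter} spent more than {budget} credits")
        for issue, votes in allocation.items():
            totals[issue] = totals.get(issue, 0) + votes
    return totals

# Negative numbers are votes against; every voter has 100 credits.
ballots = {
    "a": {"parks": 9, "roads": -4},   # cost 81 + 16 = 97
    "b": {"parks": -3, "roads": 5},   # cost 9 + 25 = 34
    "c": {"parks": -3, "roads": 5},   # cost 9 + 25 = 34
}
totals = tally_qv(ballots, budget=100)
print(totals)  # {'parks': 3, 'roads': 6}
```

Voter a cares intensely about parks and can outweigh two mild opponents, but only by spending nearly the whole budget, which is the sense in which the mechanism expresses intensity and not just direction of preference.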

The ambition is not to solve the alignment problem theoretically. It is to build the institutional and technological infrastructure that makes alignment tractable, starting at human scale, where the feedback loops are fast enough to learn from, and working upward from there. The theoretical frameworks already exist. What I think remains is the harder and less glamorous work of making them real.

¹ Paul Christiano et al., "Deep Reinforcement Learning from Human Preferences." NeurIPS, 2017.
² Joe Edelman & Oliver Klingefjord, "OpenAI x DFT: The First Moral Graph." Meaning Alignment Institute, October 24, 2023.
³ Rosie Bell & Rufus Pollock, "From Polycrisis to Metacrisis: A Short Introduction." Life Itself Sensemaking Studio / Second Renaissance series.
⁴ International Renewable Energy Agency, "Renewable Power Generation Costs in 2023." IRENA, 2024.
⁵ Eli Pariser, The Filter Bubble. Penguin Press, 2011.
⁶ Soroush Vosoughi, Deb Roy, & Sinan Aral, "The spread of true and false news online." Science, Vol. 359, Issue 6380, 2018.
⁷ Cass Sunstein, #Republic: Divided Democracy in the Age of Social Media. Princeton University Press, 2017.
⁸ David Burkus, "A Tale of Two Cultures." Journal of Values-Based Leadership, Vol. 4, Issue 1.
⁹ Chris Argyris & Donald Schön, Organizational Learning: A Theory of Action Perspective. Addison-Wesley, 1978.
¹⁰ Kenneth Arrow, Social Choice and Individual Values. John Wiley & Sons, 1951.
¹¹ Gallup, "Majority in U.S. Continues to Favor Stricter Gun Laws." October 2023.
¹² McCourtney Institute for Democracy, Mood of the Nation Poll. May 2023.
¹³ Gallup, "Grassroots Support for Legalizing Marijuana Hits Record 70%." November 2023.
¹⁴ Yuntao Bai et al., "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, 2022.
¹⁵ Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023.
¹⁶ Carson Denison et al., "Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models." arXiv:2406.10162, 2024.
¹⁷ Jonathan Stray et al., "What are You Optimizing For? Aligning Recommender Systems with Human Values." ICML Workshop on Participatory Approaches to Machine Learning, 2020.
¹⁸ Elinor Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 1990.
¹⁹ Ruth Chang, Incommensurability, Incomparability, and Practical Reason. Harvard University Press, 1997.
²⁰ Leonid Hurwicz & Stanley Reiter, Designing Economic Mechanisms. Cambridge University Press, 2006.
²¹ James Fishkin, When the People Speak. Oxford University Press, 2009.
²² Eliezer Yudkowsky, "Coherent Extrapolated Volition." MIRI Technical Report, 2004.
²³ William MacAskill, Toby Ord, & Krister Bykvist, Moral Uncertainty. Oxford University Press, 2020.
²⁴ Nassim Nicholas Taleb, Skin in the Game: Hidden Asymmetries in Daily Life. Allen Lane, 2018.
²⁵ Olivier Blanchard & Daniel Leigh, "Growth Forecast Errors and Fiscal Multipliers." IMF Working Paper, 2013.
²⁶ Yaneer Bar-Yam, Making Things Work. NECSI Knowledge Press, 2004; John Sterman, Business Dynamics. McGraw-Hill, 2000.
²⁷ Xabier Barandiaran et al., "Decidim: Political and Technopolitical Networks for Participatory Democracy." 2017.
²⁸ Christopher Small et al., "Polis: Scaling Deliberation by Mapping High Dimensional Opinion Spaces." Recerca, 2021.
²⁹ Eric A. Posner & E. Glen Weyl, Radical Markets. Princeton University Press, 2018.