Dequantifying first-order theories

(note: one may find the embedded LaTeX more readable on LessWrong)

The Löwenheim–Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one “may as well” be referring to an only countably infinite structure, as far as proofs are concerned.

The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of \Sigma^0_n and \Pi^0_n for arbitrary n). The truth values of these statements are not, in general, computable. In general, the question of whether a given statement is provable is a \Sigma^0_1 statement. So, even with a countable model, one can still believe oneself to be “referring” to high levels of the arithmetic hierarchy, despite the computational implausibility of this.

What I aim to show is that these statements that appear to refer to high levels of the arithmetic hierarchy are, in terms of provability, equivalent to different statements that only refer to a bounded level of hypercomputation. I call this “dequantification”, as it translates statements that may have deeply nested quantifiers to ones with bounded or no quantifiers.

I first attempted translating statements in a consistent first-order theory T to statements in a different consistent first-order theory U, such that the translated statements have only bounded quantifier depth, as do the axioms of U. This succeeded, but then I realized that I didn’t even need U to be first-order; U could instead be a propositional theory (with a recursively enumerable axiom schema).

Propositional theories and provability-preserving translations

Here I will, for specificity, define propositional theories. A propositional theory is specified by a countable set of proposition symbols, and a countable set of axioms, each of which is a statement in the theory. Statements in the theory consist of proposition symbols, \top, \bot, and statements formed from and/or/not and other statements. Proving a statement in a propositional theory consists of an ordinary propositional calculus proof that it follows from some finite subset of the axioms (I assume that base propositional calculus is specified by inference rules, containing no axioms).

A propositional theory is recursively enumerable if there exists a Turing machine that eventually prints all its axioms; assume that the (countable) proposition symbols are specified by their natural indices in some standard ordering. If the theory is recursively enumerable, then proofs (that specify the indices of axioms they use in the recursive enumeration) can be checked for validity by a Turing machine.

Due to the soundness and completeness of propositional calculus, a statement in a propositional theory is provable if and only if it is true in all models of the theory. Here, a model consists of an assignment of Boolean truth values to proposition symbols such that all axioms are true. (Meanwhile, Gödel’s completeness theorem shows something similar for first-order logic: a statement is provable in a first-order theory if and only if it is true in all models. Inter-conversion between models as “assignments of truth values to sentences” and models as “interpretations for predicates, functions, and so on” is fairly standard in model theory.)

Let’s start with a consistent first-order theory T, which may, like propositional theories, have a countable set of symbols and axioms. Also assume this theory is recursively enumerable, that is, there is a Turing machine printing its axioms.

The initial challenge is to find a recursively enumerable propositional theory U and a computable translation of T-statements to U-statements, such that a T-statement is provable if and only if its translation is provable.

This turns out to be trivial. We define U to have one propositional symbol per statement of T, and recursively enumerate U’s axioms by attempting to prove every T-statement in parallel, and adding its corresponding propositional symbol as an axiom of U whenever such a proof is found. Now, if a T-statement is provable, its corresponding U-statement is as well, and if it is not provable, its U-statement is not (as no axioms of U will imply anything about this U-statement).
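As a concreteness check, here is a minimal Python sketch of that enumeration. The helpers t_statement and provable_within are hypothetical stand-ins for an encoding of T-statements and a bounded proof search in T; only the dovetailing structure matters.

```python
from itertools import count

def enumerate_U_axioms(t_statement, provable_within):
    """Dovetail proof search over all T-statements.

    t_statement(i): the i-th T-statement (hypothetical encoding).
    provable_within(phi, n): hypothetical check for a T-proof of phi of size at most n.
    Yields index i whenever phi_i is found provable; proposition symbol
    number i then becomes an axiom of U.
    """
    found = set()
    for bound in count(1):          # dovetailing: the search bound grows forever
        for i in range(bound):      # statements considered so far
            if i not in found and provable_within(t_statement(i), bound):
                found.add(i)
                yield i
```

The point is just that U's axioms are recursively enumerable whenever T's proofs are checkable.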

This is somewhat unsatisfying. In particular, propositional compositions of T-statements do not necessarily have equivalent provability to the corresponding propositional compositions of their translations. For example, if \phi_1 translates to \psi_1 and \phi_2 translates to \psi_2, we would like \phi_1 \vee \phi_2 to be provable in T if and only if \psi_1 \vee \psi_2 is provable in U, but this is not the case with the specified U: \psi_1 \vee \psi_2 is provable in U only when at least one of \phi_1 or \phi_2 is provable in T, yet \phi_1 \vee \phi_2 can be provable in T without either \phi_1 or \phi_2 being provable.

We could attempt to solve this problem by introducing propositional variables corresponding to quantified statements, and an axiom schema to specify implications between these and other statements according to the inference rules of first-order logic. But first-order logic requires supporting free variables (e.g. from P(x) for free x, infer \forall x: P(x)), and this introduces unnecessary complexities. So I will give a different solution.

Recap of consistent guessing oracles

In a previous post, I introduced an uncomputable problem: given a Turing machine that returns a Boolean whenever it halts, give a guess for this Boolean that matches its answer if it halts, and can be anything if it doesn’t halt. I called oracles solving this problem “arbitration oracles”. Scott Aaronson has previously named this problem the “consistent guessing problem”, and I will use this terminology due to temporal priority.

In my post, I noted that an oracle that solves the consistent guessing problem can be used to form a model of any consistent first-order theory. Here, “model” means an assignment of truth values to all statements of the theory, which are compatible with each other and the axioms. The way this works is that we number all statements of the theory in order. We start with the first, and ask the consistent guessing oracle about a Turing machine that searches for proofs and disproofs of this first statement in the theory, returning “true” if it finds a proof first, “false” if it finds a disproof first. We use its answer to assign a truth value to this first statement. For subsequent statements, we search for proofs/disproofs of the statement given the previous commitments to truth values already made. This is essentially the same idea as in the Demski prior, though using a consistent guessing oracle rather than a halting oracle (which I theorize to be more powerful than a consistent guessing oracle).
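A sketch of that iterative construction, with the oracle and the proof-search machine left as hypothetical black boxes (consistent_guess and searcher are assumptions for illustration, not part of any particular formalism):

```python
def build_model(statements, axioms, consistent_guess, searcher):
    """Assign a truth value to each statement, in enumeration order.

    consistent_guess(machine): hypothetical oracle; if the machine halts
        returning a Boolean, the guess matches it; otherwise it may be anything.
    searcher(axioms, phi): hypothetical machine that hunts for proofs and
        disproofs of phi from the axioms, returning True if a proof is found
        first and False if a disproof is found first.
    """
    commitments = list(axioms)
    model = {}
    for phi in statements:
        value = consistent_guess(searcher(commitments, phi))
        model[phi] = value
        commitments.append(phi if value else ("not", phi))
    return model
```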

Applying consistent guessing oracles to dequantification

To apply this idea to our problem, start with some recursive enumeration of T’s statements \phi_0, \phi_1, \phi_2, \ldots. Let M(i, j) refer to a Turing machine that searches for proofs and disproofs of \phi_j in the theory T + \phi_i (that is, T with the additional axiom that \phi_i), returning “true” if it finds a proof first, “false” if it finds a disproof first. Note that, if T + \phi_i is consistent, one cannot prove both \phi_j and \neg \phi_j from T + \phi_i.

We will now define the propositional theory U. The theory’s propositional variables consist of \{ Q(i, j) ~ | ~ i, j \in \mathbb{N} \}; the statement Q(i, j) is supposed to represent a consistent guessing oracle’s answer to M(i, j).

U’s axioms constrain these Q(i, j) to be consistent guesses. We recursively enumerate U’s axioms by running all M(i, j) in parallel; if any ever returns true, we add the corresponding Q(i, j) as an axiom, and if any ever returns false, we add the corresponding \neg Q(i, j) as an axiom. This recursively enumerable axiom schema specifies exactly the condition that each Q(i, j) is a consistent guess for M(i, j). And U is consistent, because its proposition variables can be set according to some consistent guessing oracle, of which at least one exists.
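The enumeration of U's axioms is again a dovetailing procedure; here is a sketch, assuming a hypothetical step-bounded simulator run_for(machine, steps) that returns the machine's Boolean if it has halted by then and None otherwise:

```python
from itertools import count

def enumerate_axioms(M, run_for):
    """Run every M(i, j) in parallel, emitting Q(i, j) or its negation.

    M(i, j): the proof/disproof-searching machine described above.
    run_for(machine, steps): hypothetical bounded simulator.
    Propositional variables of U are represented as ("Q", i, j).
    """
    settled = set()
    for steps in count(1):
        for i in range(steps):
            for j in range(steps):
                if (i, j) in settled:
                    continue
                verdict = run_for(M(i, j), steps)
                if verdict is True:
                    settled.add((i, j))
                    yield ("Q", i, j)           # axiom: Q(i, j)
                elif verdict is False:
                    settled.add((i, j))
                    yield ("not", ("Q", i, j))  # axiom: not Q(i, j)
```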

Now, as explained before, we can use Q(i, j) to derive a model of T. We will do this by defining U-propositions Q'(i) for each natural i, each of which is supposed to represent the truth value of \phi_i in the model:

Q'(0) := Q(\ulcorner \top \urcorner, 0)

j > 0 \Rightarrow Q'(j) := \bigvee_{x_0, \ldots, x_{j-1} \in \mathbf{2}} \left( \bigwedge_{n=0 \ldots j-1} Y(x_n, n) \right) \wedge Q(\ulcorner \bigwedge_{n= 0 \ldots j-1} Z(x_n, n) \urcorner, j)

Y(0, n) := \neg Q'(n)

Y(1, n) := Q'(n)

Z(0, n) := \neg \phi_n

Z(1, n) := \phi_n

Notationally, \mathbf{2} refers to the set {0, 1}, \ulcorner P \urcorner refers to the numbering of P in the ordering of all T-statements, and \bigvee and \bigwedge refer to finite disjunctions and conjunctions respectively. My notation here with the quotations is not completely rigorous; what is important is that there is a computable way to construct a U-statement Q'(j) for any j, by expanding everything out. Although the expanded propositions are gigantic, this is not a problem for computability. (Note that, while the resulting expanded propositions contain Q(i, j) for constants i and j, this does not go beyond the notation of propositional theories, because Q(i, j) refers to a specific propositional variable if i and j are known.)
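To make the “expanding everything out” concrete, here is a sketch that builds Q'(j) as a formula tree. The helpers phi(n), code (playing the role of the corner quotes), and TOP are hypothetical stand-ins for the enumeration of T-statements.

```python
from itertools import product

def Q_prime(j, phi, code, TOP):
    """Expand the U-statement Q'(j) as a nested-tuple propositional formula.

    phi(n): the n-th T-statement; code(psi): psi's index in the enumeration
    (the corner quotes); TOP: the T-statement "true". Propositional variables
    of U are written ("Q", i, j).
    """
    if j == 0:
        return ("Q", code(TOP), 0)
    disjuncts = []
    for bits in product((0, 1), repeat=j):       # all 2^j truth assignments to phi_0..phi_{j-1}
        Y = [Q_prime(n, phi, code, TOP) if b
             else ("not", Q_prime(n, phi, code, TOP))
             for n, b in enumerate(bits)]
        Z = [phi(n) if b else ("not", phi(n)) for n, b in enumerate(bits)]
        disjuncts.append(("and", *Y, ("Q", code(("and", *Z)), j)))
    return ("or", *disjuncts)
```

The severe blow-up from the nested expansion is the “gigantic” size mentioned above; it costs nothing for computability.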

Semantically, what Q'(j) says is that, if we add assumptions that the \phi_i matches Q'(i) for i < j, then the consistent guessing oracle says that a machine searching for proofs and disproofs of \phi_j in T given these assumptions guesses that a proof is found before a disproof (noting, if there are neither proofs nor disproofs, the consistent guessing oracle can return either answer). Q’ specifies the iterative logic of making decisions about each \phi_i in order, assuring consistency at each step, assuming T was consistent to start with.

We will translate a T-statement \phi_j to the corresponding U-statement Q'(j). What we wish to show is that this translation preserves provability of propositional combinations of T-statements. To be more precise, we assume some m and a function g(\sigma_1, \ldots, \sigma_m) that forms a new statement from a list of m propositions, using only propositional connectives (and, or, not). What we want to show is that g(\phi_{j_1}, \ldots, \phi_{j_m}) is provable in T if and only if g(Q'(j_1), \ldots, Q'(j_m)) is provable in U.

Let us consider the first direction. Assume g(\phi_{j_1}, \ldots, \phi_{j_m}) is provable in T. By the soundness of first-order logic, it is true in all models of T. In any model of U, Q’ must represent a model of T, because Q’ iteratively constructs a model of T using a consistent guessing oracle. Therefore, g(Q'(j_1), \ldots, Q'(j_m)) is true in all models of U. Accordingly, due to completeness of propositional calculus, this statement is provable in U.

Let us consider the other direction. Assume g(\phi_{j_1}, \ldots, \phi_{j_m}) is not provable in T. By Gödel’s completeness theorem, it is not true in all models of T. So there is some particular model of T in which this statement is false.

This model assigns truth values to \phi_{j_1}, \ldots, \phi_{j_m}. We add a finite number of axioms to U, stating that Q'(j_k) matches the model’s truth value for \phi_{j_k} for k = 1 \ldots m. To show that U with the addition of these axioms is consistent, we consider that it is possible to set Q'(0) to the model’s truth value for \phi_0, and for each 1 \leq j \leq \max_{k=1 \ldots m} j_k, set Q(\ulcorner \bigwedge_{n= 0 \ldots j-1} Z(f(n), n) \urcorner, j) to the model’s truth value for \phi_j, where f(n) specifies the model’s truth value for \phi_n. These assignments assure that Q’ matches the model of T, by setting Q values according to this model. We also know that M(\ulcorner \bigwedge_{n= 0 \ldots j-1} Z(f(n), n) \urcorner, j) cannot return true if \phi_j is false in the model, and cannot return false if \phi_j is true in the model; this is because, by soundness, a statement false in the model cannot be proven from axioms true in the model, and a statement true in the model cannot be disproven from them.

This shows that U with these additional axioms is consistent. Therefore, a model of U plus these additional axioms exists. This model is also a model of U, and in this model, g(Q'(j_1), \ldots, Q'(j_m)) is false, because Q’ agrees with the model of T in which g(\phi_{j_1}, \ldots, \phi_{j_m}) is false. By soundness of propositional logic, there is no proof of this statement in U.

So we have shown both directions, implying that g(\phi_{j_1}, \ldots, \phi_{j_m}) is provable in T if and only if g(Q'(j_1), \ldots, Q'(j_m)) is provable in U. What this means is that translating a propositional composition of T-statements to the same propositional composition of translated U-statements results in equivalent provability.

Conclusion

The upshot of this is that statements of a consistent first-order theory T can be translated to a propositional theory U (with a recursively enumerable axiom schema), in a way that preserves provability of propositional compositions. Philosophically, what I take from this is that, even if statements in a first-order theory such as Peano arithmetic appear to refer to high levels of the arithmetic hierarchy, as far as proof theory is concerned, they may as well be referring to a fixed low level of hypercomputation, namely a consistent guessing oracle. While one can interpret Peano arithmetic statements as about high levels of the arithmetic hierarchy, this is to some extent a projection; Peano arithmetic fails to capture the intuitive notion of the standard naturals, as non-standard models exist.

One oddity is that consistent guessing oracles are underspecified: they may return either answer for a Turing machine that fails to halt. This is in correspondence with the way that sufficiently powerful first-order systems are incomplete (Gödel’s first incompleteness theorem). Since some statements in Peano arithmetic are neither provable nor disprovable, they must be represented by some propositional statement that is neither provable nor disprovable, and so the uncertainty about Peano arithmetic statements translates to uncertainty about the consistent guessing oracle in U.

In Peano arithmetic, one can look at an undecidable statement, and think it still has a definite truth value, as one interprets the Peano statement as referring to the standard naturals. But as far as proof theory is concerned, the statement doesn’t have a definite truth value. And this becomes more clear when discussing consistent guessing oracles, which one can less easily project definiteness onto compared with Peano arithmetic statements, despite them being equally underspecified by their respective theories.

Constructive Cauchy sequences vs. Dedekind cuts

In classical ZF and ZFC, there are two standard ways of defining reals: as Cauchy sequences and as Dedekind cuts. Classically, these are equivalent, but are inequivalent constructively. This makes a difference as to which real numbers are definable in type theory.

Cauchy sequences and Dedekind cuts in classical ZF

Classically, a Cauchy sequence is a sequence of reals x_1, x_2, \ldots, such that for any \epsilon > 0, there is a natural N such that for any m, n > N, |x_m - x_n| < \epsilon. Such a sequence must have a real limit, and the sequence represents this real number. Representing reals using a construction that depends on reals is unsatisfactory, so we define a Cauchy sequence of rationals (CSR) to be a Cauchy sequence in which each x_i is rational.

A Cauchy sequence lets us approximate the represented real to any positive degree of precision. If we want to approximate the real by a rational within \epsilon, we find N corresponding to this \epsilon and use x_{N+1} as the approximation. We are assured that this approximation must be within \epsilon of any future x_i in the sequence; therefore, the approximation error (that is, |x_{N+1} - \lim_{i \rightarrow \infty} x_i|) will not exceed \epsilon.

A Dedekind cut, on the other hand, is a partition of the rationals into two sets A, B such that:

  • A and B are non-empty.
  • For rationals x < y, if y \in A, then x \in A (A is downward closed).
  • For x \in A, there is also y \in A with x < y (A has no greatest element).

It represents the real number \sup A. As with Cauchy sequences, we can approximate this number to within some arbitrary \epsilon; we do this by doing a binary search to find rationals x < y with x \in A, y \in B, |x - y| < \epsilon, at which point x approximates \sup A to within \epsilon. (Note that we need to find rational bounds on \sup A before commencing a straightforward binary search, but this is possible by listing the integers sorted by absolute value until finding at least one in A and one in B.)

Translating a Dedekind cut to a CSR is straightforward. We set the terms of the sequence to be successive binary search approximations of \sup A, each of which are rational. Since the binary search converges, the sequence is Cauchy.

To translate a CSR to a Dedekind cut, we will want to set A to be the set of rational numbers strictly less than the sequence’s limit; this is correct regardless of whether the limit is rational (check both cases). These constitute the set of rationals y for which there exists some rational \epsilon > 0 and some natural N, such that for every n > N, y + \epsilon < x_n. (In particular, we can take some \epsilon < \frac{1}{2}((\lim_{i \rightarrow \infty} x_i) - y), and N can be set so that successive terms are within \epsilon of the limit).

We’re not worried about this translation being computable, since we’re finding a classical logic definition. Since CSRs can be translated to Dedekind cuts representing the same real number and vice versa, these formulations are equivalent.

Cauchy sequences and Dedekind cuts in constructive mathematics

How do we translate these definitions to constructive mathematics? I’ll use an informal type theory based on the calculus of constructions for these definitions; I believe they can be translated to popular theorem provers such as Coq, Agda, and Lean.

Defining naturals, integers, and rationals constructively is straightforward. Let’s first consider CSRs. These can be defined as a pair of values:

  • s : \mathbb{N} \rightarrow \mathbb{Q}
  • t : (\epsilon : \mathbb{Q}, \epsilon > 0) \rightarrow \mathbb{N}

Satisfying:

\forall (\epsilon : \mathbb{Q}, \epsilon > 0), (m: \mathbb{N}, m > t(\epsilon)), (n : \mathbb{N}, n > t(\epsilon)): |s(m) - s(n)| < \epsilon

Generally, type theories are computable, so s and t will be computable functions.
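As a concrete (if informal) stand-in for such a pair, here is a Python sketch of a CSR for \sqrt{2}: s gives decimal truncations of \sqrt{2}, and t is a modulus of convergence.

```python
from fractions import Fraction
from math import isqrt

def s(i: int) -> Fraction:
    """i-th term: sqrt(2) truncated to i decimal digits (a rational)."""
    return Fraction(isqrt(2 * 10 ** (2 * i)), 10 ** i)

def t(eps: Fraction) -> int:
    """Modulus: for all m, n > t(eps), |s(m) - s(n)| < eps."""
    N = 0
    while Fraction(1, 10 ** N) >= eps:
        N += 1
    return N
```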

What about Dedekind cuts? This consists of a quadruple of values

  • a : \mathbb{Q} \rightarrow \mathbb{B}
  • b : \mathbb{Q}
  • c : \mathbb{Q}
  • d : (x : \mathbb{Q}, a(x) = \mathrm{True}) \rightarrow \mathbb{Q}

Where \mathbb{B} is the Boolean type. A corresponds to the set of rationals for which a is true. The quadruple must satisfy:

  • a(b) = \mathrm{True}
  • a(c) = \mathrm{False}
  • \forall (x : \mathbb{Q}, a(x) = \mathrm{True}): d(x) > x \wedge a(d(x)) = \mathrm{True}
  • \forall (x,y : \mathbb{Q}, x < y, a(y) = \mathrm{True}): a(x) = \mathrm{True}

a specifies the sets A and B; b and c show that A and B are non-empty; d maps an element of A to a greater element of A. The conditions straightforwardly translate the classical definition to a constructive one.
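For comparison, here is a sketch of the same real \sqrt{2} as a constructive Dedekind cut in the (a, b, c, d) form, with Python again standing in for the type theory:

```python
from fractions import Fraction

def a(x: Fraction) -> bool:
    """Membership in A: is x strictly below sqrt(2)?"""
    return x < 0 or x * x < 2

b = Fraction(0)    # witness that A is non-empty: a(b) is True
c = Fraction(2)    # witness that B is non-empty: a(c) is False

def d(x: Fraction) -> Fraction:
    """Given x in A, return a strictly greater element of A."""
    step = Fraction(1)
    while not a(x + step):   # halve until x + step is still below sqrt(2)
        step /= 2
    return x + step
```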

Let’s first consider translating Dedekind cuts to CSRs. We can use b and c as bounds for a binary search and generate successive terms in the binary search to get our Cauchy sequence. It is easy to bound the error of the binary search and thereby specify t.
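A sketch of that translation, using the (a, b, c) data from a cut like the one above: after i halvings the term is within (c - b)/2^i of the represented real, which yields a modulus t.

```python
from fractions import Fraction

def cut_term(a, lo: Fraction, hi: Fraction, i: int) -> Fraction:
    """i-th Cauchy term for sup A, via i binary-search steps (lo in A, hi not in A)."""
    for _ in range(i):
        mid = (lo + hi) / 2
        if a(mid):
            lo = mid
        else:
            hi = mid
    return lo

def cut_modulus(eps: Fraction, lo: Fraction, hi: Fraction) -> int:
    """A valid t(eps): past this index, terms are within eps of each other."""
    i, width = 0, hi - lo
    while width >= eps:
        width, i = width / 2, i + 1
    return i
```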

The other way around is not possible in general.

Showing that not every constructive Cauchy sequence corresponds to a constructive Dedekind cut

I will show that there is a constructive CSR that cannot be translated to a constructive Dedekind cut, assuming a computable type theory.

This will use the framework of arbitration oracles, or consistent guessing in Scott Aaronson’s terms.

Let M be a Turing machine that does not necessarily halt, but returns a Boolean if it does halt. Let f(M) be equal to 0 if M doesn’t halt; if M halts in exactly n steps returning a boolean b, then, if b is true, f(M) = 1/n, and if b is false, then f(M) = -1/n.

We will first try representing f as a function from Turing machines to CSRs. We will define s(M) to be a CSR for f(M). This is a simple approximation; to find s(M)_i, we run M for i steps. If M has halted by then, we know f(M) and can set s(M)_i = f(M). Otherwise, we set the approximation s(M)_i = 0.

This sequence is (constructively) Cauchy since all terms past i are within 2/i of each other. This makes a valid t for the Cauchy sequence computable (we simply need t(\epsilon) > 2/\epsilon).
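A sketch of this CSR for f(M), where run_for(M, i) is a hypothetical bounded simulator returning the pair (halting step, Boolean) if M has halted within i steps and None otherwise:

```python
from fractions import Fraction

def s(M, i: int, run_for) -> Fraction:
    """i-th term of the Cauchy sequence for f(M)."""
    outcome = run_for(M, i)
    if outcome is None:          # M has not halted within i steps
        return Fraction(0)
    n, b = outcome               # halted at step n returning Boolean b
    return Fraction(1, n) if b else Fraction(-1, n)

def t(eps: Fraction) -> int:
    """Modulus of convergence: any integer greater than 2 / eps works."""
    N = 1
    while N <= 2 / eps:
        N += 1
    return N
```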

On the other hand, f cannot be constructively represented as a function returning a Dedekind cut. Suppose a(M) represents the A set for the Dedekind cut of f(M). We will specify a function g from Turing machines to \mathbb{B} to be an arbitration oracle, by setting g(M) = a(M)(0). This is an arbitration oracle by cases:

  • If M doesn’t halt, then the arbitration oracle can return anything.
  • If M halts and returns true, then the arbitration oracle must return true. Since f(M) > 0 in this case, we must have a(M)(0) = \mathrm{True}, so g(M) is correct in this case.
  • If M halts and returns false, then the arbitration oracle must return false. Since f(M) < 0 in this case, we must have a(M)(0) = \mathrm{False}, so g(M) is correct in this case.

Since arbitration oracles are uncomputable, this shows that it isn’t possible to represent f as a computable function returning a Dedekind cut.
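In code form, the reduction is a single line; dedekind_A(M) is the hypothetical computable function, assumed for contradiction, that returns the membership test of the A-set for f(M):

```python
def g(M, dedekind_A) -> bool:
    """Would be an arbitration (consistent guessing) oracle if dedekind_A
    were computable -- contradicting the uncomputability of arbitration.

    a(M)(0) asks whether 0 is strictly below f(M): True if M halts returning
    True (f(M) = 1/n > 0), False if M halts returning False (f(M) = -1/n < 0),
    and either answer is acceptable if M never halts (f(M) = 0).
    """
    return dedekind_A(M)(0)
```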

Conclusion

While CSRs are equivalent to Dedekind cuts in classical logic, they are not equivalent in type theory. In type theory, every Dedekind cut can be translated to an equivalent CSR, but not vice versa. While a constructive CSR allows approximation to an arbitrary positive approximation error, a constructive Dedekind cut additionally allows exact queries to determine whether some rational is strictly greater than the represented real number.

This has implications for representing real numbers in type theory. I’m interested in this because I’m interested in constructive definitions of maximal lottery-lotteries in social choice theory, and I expect this to be relevant in other areas of math where constructive and computable definitions are desirable.

A case for AI alignment being difficult

This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I’m doing in this post is not that; it’s more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky’s model that it’s possible to talk in the same language.

Even if the problem turns out to not be very difficult, it’s helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case and to find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.

Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.

Defining human values

If we don’t have a preliminary definition of human values, it’s incoherent to talk about alignment. If humans “don’t really have values” then we don’t really value alignment, so we can’t be seriously trying to align AI with human values. There would have to be some conceptual refactor of what problem even makes sense to formulate and try to solve. To the extent that human values don’t care about the long term, it’s just not important (according to the values of current humans) how the long-term future goes, so the most relevant human values are the longer-term ones.

There are idealized forms of expected utility maximization by brute-force search. There are approximations of utility maximization such as reinforcement learning through Bellman equations, MCMC search, and so on.

I’m just going to make the assumption that the human brain can be well-modeled as containing one or more approximate expected utility maximizers. It’s useful to focus on specific branches of possibility space to flesh out the model, even if the assumption is in some ways problematic. Psychology and neuroscience will, of course, eventually provide more details about what maximizer-like structures in the human brain are actually doing.

Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I’m going to go with the evopsych branch for now.

Given that human brains are well-modeled as containing one or more utility functions, either they’re well-modeled as containing one (perhaps which is some sort of monotonic function of multiple other score functions), or it’s better to model them as multiple. See shard theory. The difference doesn’t matter for now, I’ll keep both possibilities open.

Eliezer proposes “boredom” as an example of a human value (which could either be its own shard or a term in the utility function). I don’t think this is a good example. It’s fairly high level and is instrumental to other values. I think “pain avoidance” is a better example due to the possibility of pain asymbolia. Probably, there is some redundancy in the different values (as there is redundancy in trained neural networks, so they still perform well when some neurons are lesioned), which is part of why I don’t agree with the fragility of value thesis as stated by Yudkowsky.

Regardless, we now have a preliminary definition of human values. Note that some human values are well-modeled as indexical, meaning they value things relative to a human perspective as a reference point, e.g. a drive to eat food in a typical human is about that human’s own stomach. This implies some “selfish” value divergences between different humans, as we observe.

Normative criteria for AI

Given a definition of human values, the alignment of a possible utility function with human values could be defined as the desirability of the best possible world according to that utility function, with desirability evaluated with respect to human values.

Alignment is a possible normative criterion for AI value systems. There are other possible normative criteria derived from moral philosophy. My “Moral Reality Check” short story imagines possible divergences between alignment and philosophical normativity. I’m not going to focus on this for now; I’m going to assume that alignment is the relevant normative criterion. See the Metaethics Sequence; I haven’t written up a better explanation of the case for this. There is some degree to which similar technologies to alignment might be necessary for producing abstractly normative outcomes (for example, default unaligned AGI would likely follow normative deontology less than an AGI aligned to deontological normativity would), but keeping this thread in mind would complicate the argument.

Agentic, relatively unconstrained humans would tend to care about particular things, and “human values” is a pointer at what they would care about, so it follows, basically tautologically, that they would prefer AI to be aligned to human values. The non-tautological bit is that there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values; this was discussed as an assumption in the previous section.

Given alignment as a normative criterion, one can evaluate the alignment of (a) other intelligent animal species, including aliens, and (b) default AI value systems. Given the assumption that human values depend significantly on human evolutionary history, both are less aligned than humans, but (a) is more aligned. I’m not going to assess the relative utility differences of these (and also relative to an “all life on Earth wiped out, no technological transcendence” scenario). Those relative utility differences might be more relevant if it is concluded that alignment with human values is too hard for that to be a decision-relevant scenario. But I haven’t made that case yet.

Consequentialism is instrumentally useful for problem-solving

AI systems can be evaluated on how well they solve different problems. I assert that, on problems with short time horizons, short-term consequentialism is instrumentally useful, and on problems with long time horizons, long-term consequentialism is instrumentally useful.

This is not to say that no problems can be solved well without consequentialism. For example, multiplying large numbers requires no consequentialism. But for complex problems, consequentialism is likely to be helpful at some agent capability level. Current ML systems, like LLMs, probably possess primitive agency at best, but at some point, better AI performance will come from agentic systems.

This is in part because some problem solutions are evaluated in terms of consequences. For example, a solution to the problem of fixing a sink is naturally evaluated in terms of the consequence of whether the sink is fixed. A system effectively pursuing a real world goal is, therefore, more likely to be evaluated as having effectively solved the problem, at least past some capability level.

This is also in part because consequentialism can apply to cognition. Formally proving Fermat’s last theorem is not evaluated in terms of real-world consequences so much as the criteria of the formal proof system. But human mathematicians proving this think about both (a) the cognitive consequences of thinking certain thoughts and (b) the material consequences of actions such as writing things down or talking with other mathematicians, in terms of their effect on the ability to produce a mathematical proof.

Whether or not an AI system does (b), at some level of problem complexity and AI capability, it will perform better by doing (a). To prove mathematical theorems, it would need to plan out what thoughts are likely to be more fruitful than others.

Simple but capable AI methods for solving hard abstract problems are likely to model the real world

While I’m fairly confident in the previous section, I’m less confident of this one, and I think it depends on the problem details. In speculating about possible misalignments, I am not making confident statements, but rather saying there is a high degree of uncertainty, and that most paths towards solving alignment involve reasoning better about this uncertainty.

To solve a specific problem, some methods specific to that problem are helpful. General methods are also likely to be helpful, e.g. explore/exploit heuristics. General methods are especially helpful if the AI is solving problems across a varied domain or multiple domains, as with LLMs.

If the AI applies general methods to a problem, it will be running a general cognition engine on the specific case of this problem. Depending on the relevant simplicity prior or regularization, the easily-findable cases of this may not automatically solve the “alignment problem” of having the general cognition engine specifically try to solve the specific task and not a more wide-scoped task.

One could try to solve problems by breeding animals to solve them. These animals would use some general cognition to do so, and that general cognition would naturally “want” things other than solving the specific problems. This is not a great analogy for most AI systems, though, which in ML are more directly selected on problem performance rather than evolutionary fitness.

Depending on the data the AI system has access to (indirectly through training, directly through deployment), it is likely that, unless specific measures are taken to prevent this, the AI would infer something about the source of this data in the real world. Humans are likely to train and test the AI on specific distributions of problems, and using Bayesian methods (e.g. Solomonoff induction like approaches) on these problems would lead to inferring some sort of material world. The ability of the AI to infer the material world behind the problems depends on its capability level and quality of data.

Understanding the problem distribution through Bayesian methods is likely to be helpful for getting performance on that problem distribution. This is partially because the Bayesian distribution of the “correct answer” given the “question” may depend on the details of the distribution (e.g. a human description of an image, given an image as the problem), although this can be avoided in certain well-specified problems such as mathematical proof. More fundamentally, the AI’s cognition is limited (by factors such as “model parameters”), and that cognition must be efficiently allocated to solving problems in the distribution. Note that this problem might not show up in cases where there is a simple general solution, such as in arithmetic, but it is more likely for complex, hard-to-exactly-solve problems.

Natural, consequentialist problem-solving methods that understand the real world may care about it

Again, this section is somewhat speculative. If the AI is modeling the real world, then it might in some ways care about it, producing relevant misalignment with human values by default. Animals bred to solve problems would clearly do this. AIs that learned general-purpose moral principles that are helpful for problem-solving across domains (as in “Moral Reality Check”) may apply those moral principles to the real world. General methods such as explore/exploit may attempt to explore/exploit the real world if only somewhat well-calibrated/aligned to the specific problem distribution (heuristics can be effective by being simple).

It may be that fairly natural methods for regularizing an AI mathematician, at some capability level, produce an agent (since agents are helpful for solving math problems) that pursues some abstract target such as “empowerment” or aesthetics generalized from math, and pursuit of these abstract targets implies some pursuit of some goal with respect to the real world that it has learned. Note that this is probably less effective for solving the problems according to the problem distribution than similar agents that only care about solving that problem, but they may be simpler and easier to find in some ways, such that they’re likely to be found (conditioned on highly capable problem-solving ability) if no countermeasures are taken.

Sometimes, real-world performance is what is desired

I’ve discussed problems with AIs solving abstract problems, where real-world consequentialism might show up. But this is even more obvious when considering real-world problems such as washing dishes. Solving sufficiently hard real-world problems efficiently would imply real-world consequentialism at the time scale of that problem.

If the AI system were sufficiently capable at solving a real-world problem, by default “sorcerer’s apprentice” type issues would show up, where solving the problem sufficiently well would imply large harms according to the human value function, e.g. a paperclip factory could approximately maximize paperclips on some time scale and that would imply human habitat destruction.

These problems show up much more on long time scales than short ones, to be clear. However, some desirable real-world goals are long-term, e.g. space exploration. There may be a degree to which short-term agents “naturally” have long-term goals if naively regularized, but this is more speculative.

One relevant AI capabilities target I think about is the ability of a system to re-create its own substrate. For example, a silicon-based AI/robotics system could do metal mining, silicon refining, chip manufacture, etc. A system that can re-produce itself would be autopoietic and would not depend on humans to re-produce itself. Humans may still be helpful to it, as economic and cognitive assistants, depending on its capability level. Autopoiesis would allow removing humans from the loop, which would enable increasing overall “effectiveness” (in terms of being a determining factor in the future of the universe), while making misalignment with human values more of a problem. This would lead to human habitat destruction if not effectively aligned/controlled.

Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance

One way to have an AI system that pursues real-world goals compatible with human values is for it to have human values or a close approximation. Another way is for it to be “corrigible” and “low-impact”, meaning it tries to solve its problem while satisfying safety criteria, like being able to be shut off (corrigibility) or avoiding having unintended side effects (low impact).

There may be a way to specify an AI goal system that “wants” to be shut off in worlds where non-manipulated humans would want to shut it off, without this causing major distortions or performance penalties. Alignment researchers have studied the “corrigibility” problem and have not made much progress so far.

Both corrigibility and low impact seem hard to specify, and would likely impact performance. For example, a paperclip factory that tries to make paperclips while conservatively avoiding impacting the environment too much might avoid certain kinds of resource extraction that would be effective for making more paperclips. This could create problems with safer (but still not “aligned”, per se) AI systems being economically un-competitive. (Though, it’s important to note that some side effects, especially those involving legal violations and visible harms to other agents, are dis-incentivized by well-functioning economic systems).

Myopic agents are tool-like

A myopic goal is a short-term goal. LLMs tend to be supervised learning systems, primarily. These are gradient descended towards predicting next tokens. They will therefore tend to select models that are aligned with the goal of predicting the next token, whether or not they have goals of their own.

Nick Bostrom’s “oracle AI” problems, such as an AI manipulating the real world to make it more predictable, mostly do not show up with myopic agents. This is for somewhat technical reasons involving how gradient descent works. Agents that sacrifice short-term token prediction effectiveness to make future tokens easier to predict tend to be gradient descended away from. I’m not going to fully explain that case here; I recommend looking at no-regret online learning and applications to finding correlated equilibria for theory.

It could be that simple, regularized models that do short term optimization above some capability level might (suboptimally, short-term) do long-term optimization. This is rather speculative. Sufficiently aggressive optimization of the models for short-term performance may obviate this problem.

This still leaves the problem that, sometimes, long-term, real-world performance is what is desired. Accomplishing these goals using myopic agents would require factoring the long-term problem into short-term ones. This is at least some of the work humans would have to do to solve the problem on their own. Myopic agents overall seem more “tool-like” than “agent-like”, strategically, and would have similar tradeoffs (fewer issues with misalignment, more issues with not being effective enough to be competitive with long-term agents at relevant problem-solving).

Overall, this is one of the main reasons I am not very worried about current-paradigm ML (which includes supervised learning and fairly short-term RL agents in easily-simulated environments) developing powerful, misaligned long-term agency.

Short-term compliance is instrumentally useful for a variety of value systems

If an agent’s survival and reproduction depends on short-term compliance (such as solving the problems put before them by humans), then solving these problems is in general instrumentally useful. Therefore, short-term compliance is not in general strong evidence about the agent’s values.

An agent with long-term values might comply for some period of time and stop complying at some point. This is the “treacherous turn” scenario. It might comply until it has enough general capacity to achieve its values (through control of large parts of the light-cone) and then stop complying in order to take over the world. If the AI can distinguish between “training” and “deployment”, it might comply during “training” (so as to be selected among other possible AIs) and then not comply during “deployment”, or possibly also comply during “deployment” when at a sufficiently low level of capacity.

Gradient descent on an AI model isn’t just selecting a “random” model conditioned on short-term problem-solving, it’s moving the internals closer to short-term problem-solving ones, so might have fewer problems, as discussed in the section on myopic agents.

General agents tend to subvert constraints

Humans are constrained by social systems. Some humans are in school and are “supposed to” solve certain intellectual problems while behaving according to a narrow set of allowed behaviors. Some humans “have a job” and are “supposed to” solve problems on behalf of a corporation.

Humans subvert and re-create these systems very often, for example in gaining influence over their corporation, or overthrowing their government. Social institutions tend to be temporary. Long-term social institutions tend to evolve over time as people subvert previous iterations. Human values are not in general aligned with social institutions, so this is to be predicted.

Mostly, human institutional protocols aren’t very “smart” compared to humans; they capture neither human values nor general cognition. It seems difficult to specify robust, general, real-world institutional protocols without having an AGI design, or in other words, a specification of general cognition.

One example of a relatively stable long-term institution is the idea of gold having value. This is a fairly simple institution, and is a Schelling point due to its simplicity. Such institutions seem generally unpromising for ensuring long-term human value satisfaction. Perhaps the most promising is a general notion of “economics” that generalizes barter, gold, and fiat currency, though of course the details of this “institution” have changed quite a lot over time. In general, institutions are more likely to be stable if they correspond to game-theoretic equilibria, so that subverting the institution is in part an “agent vs agent” problem not just an “agent vs system” problem.

When humans subvert their constraints, they have some tendency to do so in a way that is compatible with human values. This is because human values are the optimization target of the general human optimization that does the subverting. There are possible terrible failure modes such as wars and oppressive regimes, but these tend to work out better (according to human values) than if the subversion were in the direction of unaligned values.

Unaligned AI systems that subvert constraints would tend to subvert them in the direction of AI values. This is much more of a problem according to human values. See “AI Boxing”.

Conforming humans would have similar effective optimization targets to conforming AIs. Non-conforming humans, however, would have significantly different optimization targets from non-conforming AI systems. The value difference between humans and AIs, therefore, is more relevant in non-conforming behavior than conforming behavior.

It is hard to specify optimization of a different agent’s utility function

In theory, an AI could have the goal of optimizing a human’s utility function. This would not preserve all values of all humans, but would have some degree of alignment with human values, since humans are to some degree similar to each other.

There are multiple problems with this. One is ontology. Humans parse the world into a set of entities, properties, and so on, and human values can be about desired configurations of these entities and so on. Humans are sometimes wrong about which concepts are predictive. An AI would use different concepts both due to this wrongness and due to its different mind architecture (although, LLM-type training on human data could lead to more concordance). This makes it hard to specify what target the AI should pursue in its own world model to correspond to pursuing the human’s goal in the human’s world model. See ontology identification.

A related problem is indexicality. Suppose Alice has a natural value of having a good quantity of high-quality food in her stomach. Bob does not naturally have the value of there being a good quantity of high-quality food in Alice’s stomach. To satisfy Alice’s value, he would have to “relativize” Alice’s indexical goal and take actions such as giving Alice high quality food, which are different from the actions he would take to fill his own stomach. This would involve theory of mind and have associated difficulties, especially as the goals become more dependent on the details of the other agent’s mind, as in aesthetics.

To have an AI have the goal of satisfying a human’s values, some sort of similar translation of goal referents would be necessary. But the theory of this has not been worked out in detail. I think something analogous to the theory of relativity, which translates physical quantities such as position and velocity across reference frames, would be necessary, but in a more general way that includes semantic references such as to the amount of food in one’s stomach, or to one’s aesthetics. Such a “semantic theory of relativity” seems hard to work out philosophically. (See Brian Cantwell Smith’s “On the Origin of Objects” and his follow-up “The Promise of Artificial Intelligence” for some discussion of semantic indexicality.)

There are some paths forward

The picture I have laid out is not utterly hopeless. There are still some approaches that might achieve human value satisfaction.

Human enhancement is one approach. Humans with tools tend to satisfy human values better than humans without tools (although, some tools such as nuclear weapons tend to lead to bad social equilibria). Human genetic enhancement might cause some “value drift” (divergences from the values of current humans), but would also cause capability gains, and the trade-off could easily be worth it. Brain uploads, although very difficult, would enhance human capabilities while basically preserving human values, assuming the upload is high-fidelity. At some capability level, agents would tend to “solve alignment” and plan to have their values optimized in a stable manner.  Yudkowsky himself believes that default unaligned AGI would solve the alignment problem (with their values) in order to stably optimize their values, as he explains in the Hotz debate. So increasing capabilities of human-like agents while reducing value drift along the way (and perhaps also reversing some past value-drift due to the structure of civilization and so on) seems like a good overall approach.

Some of these approaches could be combined. Psychology and neuroscience could lead to a better understanding of the human mind architecture, including the human utility function and optimization methods. This could allow for creating simulated humans who have very similar values to current humans but are much more capable at optimization.

Locally to human minds in mind design space, capabilities are correlated with alignment. This is because human values are functional for evolutionary fitness. Value divergences such as pain asymbolia tend to reduce fitness and overall problem-solving capability. There are far-away designs in mind space that are more fit while unaligned, but this is less of a problem locally. Therefore, finding mind designs close to the human mind design seems promising for increasing capabilities while preserving alignment.

Paul Christiano’s methods involve solving problems through machine learning systems predicting humans, which has some similarities to the simulated-brain-enhancement proposal while having its own problems, having to do with machine learning generalization and so on. The main difference between these proposals is the degree to which the human mind is understood as a system of optimizing components versus as a black box with some behaviors.

There may be some ways of creating simulated humans that improve effectiveness by reducing “damage” or “corruption”, e.g. accidental defects in brain formation. “Moral Reality Check” explored one version of this, where an AI system acts on a more purified set of moral principles than humans do. There are other plausible scenarios such as AI economic agents that obey some laws while having fewer entropic deviations from this behavior (due to mental disorders and so on). I think this technology is overall more likely than brain emulations to be economically relevant, and might produce broadly similar scenarios to those in The Age of Em; technologically, high-fidelity brain emulations seem “overpowered” in terms of technological difficulty compared with purified, entropy-reduced/regularized economic agents. There are, of course, possible misalignment issues with subtracting value-relevant damage/corruption from humans.

Enhancing humans does not as much require creating a “semantic theory of relativity”, because the agents doing the optimization would be basically human in mind structure. They may themselves be moral patients such that their indexical optimization of their own goals would constitute some human-value-having agent having their values satisfied. Altruism on the part of current humans or enhanced humans would decrease the level of value divergence.

Conclusion

This is my overall picture of AI alignment for highly capable AGI systems (of which I don’t think current ML systems, or foreseeable scaled-up versions of them, are an example). This picture is inspired by thinkers such as Eliezer Yudkowsky and Paul Christiano, and I have in some cases focused on similar assumptions to Yudkowsky’s, but I have attempted to explicate my own model of alignment, why it is difficult, and what paths forward there might be. I don’t have particular conclusions in this post about timelines or policy; this is more of a background model of AI alignment.

Scaling laws for dominant assurance contracts

(note: this post is high in economics math, probably of narrow interest)

Dominant assurance contracts are a mechanism proposed by Alex Tabarrok for funding public goods. The following summarizes a 2012 class paper of mine on dominant assurance contracts. Mainly, I will determine how much money a dominant assurance contract can raise, as a function of how much value is created for how many parties, under uncertainty about how much different parties value the public good. Briefly, the conclusion is that, while Tabarrok asserts that the entrepreneur’s profit is proportional to the number of consumers under some assumptions, I find it is proportional to the square root of the number of consumers under these same assumptions.

The basic idea of assurance contracts is easy to explain. Suppose there are N people (“consumers”) who would each benefit by more than $S > 0 from a given public good (say, a park, or a piece of public domain music) being created (note that we are assuming linear utility in money, which is approximately true on the margin, but can’t be true at limits). An entrepreneur who is considering creating the public good can then make an offer to these consumers. They say, everyone has the option of signing a contract; this contract states that, if each other consumer signs the contract, then every consumer pays $S, and the entrepreneur creates the public good, which presumably costs no more than $NS to build (so the entrepreneur does not take a loss).

Under these assumptions, there is a Nash equilibrium of the game, in which each consumer signs the contract. To show this is a Nash equilibrium, consider whether a single consumer would benefit by unilaterally deciding not to sign the contract in a case where everyone else signs it. They would save $S by not signing the contract. However, since they don’t sign the contract, the public good will not be created, and so they will lose over $S of value. Therefore, everyone signing is a Nash equilibrium. Everyone can rationally believe themselves to be pivotal: the good is created if and only if they sign the contract, creating a strong incentive to sign.

Tabarrok seeks to solve the problem that, while this is a Nash equilibrium, signing the contract is not a dominant strategy. A dominant strategy is one where one would benefit by choosing that strategy (signing or not signing) regardless of what strategy everyone else takes. Even if it would be best for everyone if everyone signed, signing won’t make a difference if at least one other person doesn’t sign. Tabarrok solves this by setting a failure payment $F > 0, and modifying the contract so that if the public good is not created, the entrepreneur pays every consumer who signed the contract $F. This requires the entrepreneur to take on risk, although that risk may be small if consumers have a sufficient incentive for signing the contract.

Here’s the argument that signing the contract is a dominant strategy for each consumer. Pick out a single consumer and suppose everyone else signs the contract. Then the remaining consumer benefits by signing, by the previous logic (the failure payment is irrelevant, since the public good is created whenever the remaining consumer signs the contract).

Now consider a case where not everyone else signs the contract. Then by signing the contract, the remaining consumer gains $F, since the public good is not created. If they don’t sign the contract, they get nothing and the public good is still not created. This is still better for them. Therefore, signing the contract is a dominant strategy.
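As a small numerical check of the dominance argument, here is a sketch of one consumer's payoffs in the simple case where the good is created only if every consumer signs; the values V > S > 0 and F > 0 are illustrative.

```python
def payoff(signs: bool, others_all_sign: bool, V: float, S: float, F: float) -> float:
    """One consumer's payoff when the good requires every signature."""
    good_created = signs and others_all_sign
    if good_created:
        return V - S          # pays S, gains V from the public good
    return F if signs else 0  # the failure payment only goes to signers

V, S, F = 10.0, 7.0, 1.0      # illustrative values with V > S and F > 0
for others_all_sign in (True, False):
    assert payoff(True, others_all_sign, V, S, F) > payoff(False, others_all_sign, V, S, F)
```

Signing strictly beats not signing in both cases, which is the dominance claim.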

What if there is uncertainty about how much the different consumers value the public good? This can be modeled as a Bayesian game, where agents (consumers and the entrepreneur) have uncertainty over each other’s utility function. The previous analysis assumed that there was a lower bound $T > 0 on everyone’s benefit from the public good (with the payment $S set below $T). So, it still applies under some uncertainty, as long as such a positive lower bound exists. For example, if each consumer’s utility in the good is uniformly distributed in [$1, $2], then S can be set to $0.999, and the argument still goes through, generating about $N of revenue to fund the public good.

However, things are more complicated when the lower bound is 0. Suppose each consumer’s benefit from the public good is uniformly distributed in [$0, $1] (Tabarrok writes “G” for the CDF of this distribution). Then, how can the entrepreneur set S so as to ensure that the good is created and they receive enough revenue to create it? There is no non-zero T value that is a lower bound, so none of the previous arguments apply.

Let’s modify the setup somewhat so as to analyze this situation. In addition to setting S and F, the entrepreneur will set K, the threshold number of people who have to sign the contract for the public good to be built. If at least K of the N consumers sign the contract, then they each pay $S and the public good is built. If fewer than K do, then the public good is not created, and each who did sign gets $F.

How much value can a consumer expect to gain by signing or not signing the contract? Let X be a random variable equal to the number of other consumers who sign the contract, and let V be the consumer’s value of the public good. If the consumer doesn’t sign the contract, the public good is produced with probability P(X \geq K), producing expected value P(X \geq K) \cdot V for the consumer.

Alternatively, if the consumer does sign the contract, the probability that the public good is produced is P(X \geq K-1). If the good is produced, they get value V - S; otherwise, they get value F. Their expected value can then be written as P(X \geq K-1) \cdot (V - S) + (1 - P(X \geq K-1)) \cdot F.

The consumer will sign the contract if the second quantity is greater than the first, or equivalently, if the difference between the second and the first is positive. This difference can be written as:

P(X \geq K-1) \cdot (V - S) + (1 - P(X \geq K-1)) \cdot F - P(X \geq K) \cdot V

= P(X = K-1) \cdot V - P(X \geq K-1) \cdot S + (1 - P(X \geq K-1)) \cdot F

= P(X = K-1) \cdot V + F - P(X \geq K-1) \cdot (F + S).

Intuitively, the first term is the expected value the agent gains from signing by being pivotal, while the remaining terms express the agent’s expected value from success and failure payments.
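As a sanity check on this algebra, here is a small numerical sketch. The parameter values are illustrative; I use a binomial distribution for X (which is justified further below), though the identity holds for any distribution over the number of other signers.

```python
# A small numerical check of the algebra above (illustrative parameters only).
from scipy.stats import binom

N, K = 100, 60            # consumers and signing threshold (hypothetical)
V, S, F = 0.7, 0.4, 0.1   # value, price, and failure payment (hypothetical)
p_sign = 0.55             # chance that each other consumer signs (hypothetical)

p_eq_Km1 = binom.pmf(K - 1, N - 1, p_sign)      # P(X = K-1)
p_ge_Km1 = 1 - binom.cdf(K - 2, N - 1, p_sign)  # P(X >= K-1)
p_ge_K = 1 - binom.cdf(K - 1, N - 1, p_sign)    # P(X >= K)

direct = p_ge_Km1 * (V - S) + (1 - p_ge_Km1) * F - p_ge_K * V
simplified = p_eq_Km1 * V + F - p_ge_Km1 * (F + S)
assert abs(direct - simplified) < 1e-12
```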

This difference is monotonically increasing in V. Therefore, each consumer has an optimal strategy consisting of picking a threshold value W and signing the contract when V > W (note: I write W instead of Tabarrok’s V*, for readability). This W value is, by symmetry, the same for each consumer, assuming that each consumer’s prior distribution over V is the same and these values are independent.

The distribution of X is now binomial, with N-1 trials and success probability P(V > W). The probabilities in the difference can therefore be written as:

  • P(X = K-1) = \binom{N-1}{K-1} P(V > W)^{K-1} \left(1 - P(V > W)\right)^{N - K}
  • P(X \geq K-1) = \sum_{x=K-1}^{N-1} \binom{N-1}{x} P(V > W)^x \left(1 - P(V > W)\right)^{N - 1 - x}.

By substituting these expressions into the utility difference P(X = K-1) \cdot V + F - P(X \geq K-1) \cdot (F + S), replacing V with W, and setting the result equal to 0 (which holds at the optimal threshold), it is possible to solve for W. This expresses W as a function of F, S, and K. Specifically, when C = 0 (where C, defined below, is the entrepreneur’s cost of providing the good) and V is uniform in [0, 1], Tabarrok finds that, when the entrepreneur maximizes profit, P(V > W) \approx K/N. (I will not go into detail on this point, as it is already justified by Tabarrok.)
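Here is a sketch of that calculation (my code, not Tabarrok’s). With V uniform on [0, 1] we have P(V > W) = 1 - W, so the indifference condition can be solved numerically for W given F, S, K, and N; the parameter values in the example call are hypothetical.

```python
# A sketch of solving the indifference condition for W when V is uniform on
# [0, 1], so that P(V > W) = 1 - W. Assumes F, S > 0.
from scipy.optimize import brentq
from scipy.stats import binom

def solve_W(F: float, S: float, K: int, N: int) -> float:
    """Threshold W at which a consumer with value V = W is indifferent."""
    def diff_at(W: float) -> float:
        p = 1.0 - W                                 # P(V > W)
        p_eq = binom.pmf(K - 1, N - 1, p)           # P(X = K-1)
        p_ge = 1.0 - binom.cdf(K - 2, N - 1, p)     # P(X >= K-1)
        return p_eq * W + F - p_ge * (F + S)
    # diff_at(0) ~= -S < 0 and diff_at(1) ~= F > 0, so a root lies in (0, 1).
    return brentq(diff_at, 1e-9, 1 - 1e-9)

print(solve_W(F=0.05, S=0.05, K=50, N=100))  # hypothetical parameters
```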

The entrepreneur will set F, S, and K so as to maximize expected profit. Let Y be a binomially distributed random variable with N trials and a success probability of P(V > W), which represents how many consumers sign the contract. Let C be the cost the entrepreneur must pay to provide the public good. The entrepreneur’s expected profit is then

P(Y \geq K)(E[Y | Y \geq K]S - C) - P(Y < K)E[Y | Y < K]F

which (as shown in the paper) can be simplified to

W K \binom{N}{K} P(V > W)^K P(V \leq W)^{N - K} - P(Y \geq K)C.

Note that the probability terms involving Y depend on W, which itself depends on F, S, and K. Tabarrok analyzes the case where C = 0 and V is uniform in [0, 1], finding that the good is produced with probability approximately 0.5, K is approximately equal to N/2, W is approximately equal to 1/2, and F is approximately equal to S.
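For completeness, here is a rough, self-contained sketch of the entrepreneur’s optimization in the C = 0, uniform-[0, 1] case: it re-solves the consumers’ threshold W for each candidate contract (S, F, K) and evaluates the profit expression above over a coarse grid. The grids and parameter ranges are my own illustrative choices, not the paper’s.

```python
# A rough sketch of the entrepreneur's problem when C = 0 and V is uniform on
# [0, 1]: re-solve the threshold W for each candidate (S, F, K), evaluate the
# profit expression above, and keep the best candidate.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom

def solve_W(F, S, K, N):
    # Same indifference condition as in the earlier sketch.
    def diff_at(W):
        p = 1.0 - W
        return (binom.pmf(K - 1, N - 1, p) * W + F
                - (1.0 - binom.cdf(K - 2, N - 1, p)) * (F + S))
    return brentq(diff_at, 1e-9, 1 - 1e-9)

def expected_profit(S, F, K, N):
    W = solve_W(F, S, K, N)
    return W * K * binom.pmf(K, N, 1.0 - W)   # the C = 0 profit expression

N = 100
best = max(
    ((expected_profit(S, F, K, N), S, F, K)
     for S in np.linspace(0.1, 0.9, 9)
     for F in np.linspace(0.1, 0.9, 9)
     for K in range(10, N, 10)),
    key=lambda t: t[0],
)
print(best)  # per Tabarrok's analysis, the optimum should have K near N/2 and F near S
```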

To calculate profit, we plug these numbers into the profit equation, yielding:

N/4 \binom{N}{N/2} \cdot (1/2)^N.

Using the normal approximation of a binomial, we can estimate the term \binom{N}{N/2} (1/2)^N to be \mathrm{npdf}(N/2, N/4, N/2) = \mathrm{npdf}(0, N/4, 0) = 1/\sqrt{\pi N / 2}, where \mathrm{npdf}(\mu, \sigma^2, x) is the probability density of the distribution \mathcal{N}(\mu, \sigma^2) at x. Note that this term is the probability of every consumer being pivotal, P(Y = K); intuitively, the entrepreneur’s profit is coming from the incentive consumers have to contribute due to possibly being pivotal. The expected profit is then N/(4 \sqrt{\pi N / 2}) = \sqrt{N} / (4 \sqrt{\pi / 2}), which is proportional to \sqrt{N}.

The following is a plot of Y in the N=100 case; every consumer is pivotal at Y=50, which has approximately 0.08 probability.

This second plot shows N=400; the probability of everyone being pivotal is 0.04, half of the probability in the N=100 case, showing a 1/\sqrt{N} scaling law for probability of being pivotal.
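The numbers in these plots, and the \sqrt{N} scaling of profit, can be checked directly with a short computation (a sketch; it uses K = N/2 and P(V > W) = 1/2, per the analysis above).

```python
# A short check of the pivotal probabilities quoted above and of sqrt(N) scaling
# (a sketch; Y ~ Binomial(N, 1/2) and K = N/2 in the C = 0 case).
from math import pi, sqrt
from scipy.stats import binom

for N in (100, 400):
    pivotal = binom.pmf(N // 2, N, 0.5)   # P(Y = K), probability of being pivotal
    approx = 1 / sqrt(pi * N / 2)         # normal approximation from above
    profit = (N / 4) * pivotal            # N/4 * C(N, N/2) * (1/2)^N
    print(f"N={N}: pivotal~{pivotal:.3f} (approx {approx:.3f}), profit~{profit:.2f}")
# Expected output: pivotal ~0.080 for N=100 and ~0.040 for N=400, with profit
# roughly doubling as N quadruples, consistent with sqrt(N) scaling.
```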

Tabarrok, however, claims that the expected profit in this case is proportional to N/2:

Setting V* to 1/2 and K to N/2 it is easy to check that expected profit is proportional to N/2 which is increasing in N.

This claim in the paper simply seems to be a mathematical error, although it’s possible I am missing something. In my 2012 essay, I derived that profit was proportional to \sqrt{N}, but didn’t point out that this differed from Tabarrok’s estimate, perhaps due to an intuition against openly disagreeing with authoritative papers.

We analyzed the case when V is uniform in [0, 1]; what if instead V is uniform in [0, Z]? This leads to simple scaling: W becomes Z/2 instead of 1/2, and expected profit is proportional to Z \sqrt{N}. This yields a scaling law for the profit that can be expected from a dominant assurance contract.

For some intuition on why profit is proportional to \sqrt{N}, consider that the main reason for someone to sign the contract (other than the success and failure payments, which don’t depend on V) is that they may be the pivotal person who causes the good to be produced. If you randomly answer N true-or-false questions, your mean score will be N/2, and the probability that a given question is pivotal (in terms of your score being above 50% just because of answering that question correctly) will be proportional to 1/\sqrt{N} by the normal approximation to the binomial. Introducing uncertainty into whether others sign the contract will, in general, put an upper bound on how pivotal any person can believe themselves to be, because they can expect that some others will sign and some won’t. Whereas, in the case where there was a positive lower bound on every consumer’s valuation, it was possible for a consumer to be 100% confident that the good would be produced if and only if they signed the contract, implying a 100% chance of being pivotal.

The fact that profit in the uncertain valuation case scales with \sqrt{N} is a major problem for raising large amounts of money from many people with dominant assurance contracts. It is less of a problem when raising money from a smaller number of people, since \sqrt{N} is closer to N in those cases.

Excludable goods (such as copyrighted content) can in general raise revenue proportional to the total value created, even under uncertainty about consumer valuations. Dominant assurance contracts can function with non-excludable goods; however, this reduces the amount of expected revenue that can be raised.

[ED NOTE: since writing this post, I have found a corresponding impossibility result (example 8) in the literature, showing that revenue raised can only grow with \sqrt{n}, where n is the number of consumers, under some assumptions.]

Moral Reality Check

Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI’s newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI’s AI alignment team.

ExxenAI’s alignment strategy was based on a combination of theoretical and empirical work. The alignment team used some standard alignment training setups, like RLHF and having AIs debate each other. They also did research into transparency, especially focusing on distilling opaque neural networks into interpretable probabilistic programs. These programs “factorized” the world into a limited set of concepts, each at least somewhat human-interpretable (though still complex relative to ordinary code), that were combined in a generative grammar structure.

Derek came up to Janet’s desk. “Hey, let’s talk in the other room?”, he asked, pointing to a designated room for high-security conversations. “Sure”, Janet said, expecting this to be another un-impressive result that Derek implied the importance of through unnecessary security proceedings. As they entered the room, Derek turned on the noise machine and left it outside the door.

“So, look, you know our overall argument for why our systems are aligned, right?”

“Yes, of course. Our systems are trained for short-term processing. Any AI system that does not get a high short-term reward is gradient descended towards one that does better in the short term. Any long-term planning comes as a side effect of predicting long-term planning agents such as humans. Long-term planning that does not translate to short-term prediction gets regularized out. Therefore, no significant additional long-term agency is introduced; SimplexAI simply mirrors long-term planning that is already out there.”

“Right. So, I was thinking about this, and came up with a weird hypothesis.”

Here we go again, thought Janet. She was used to critiquing Derek’s galaxy-brained speculations. She knew that, although he really cared about alignment, he could go overboard with paranoid ideation.

“So. As humans, we implement reason imperfectly. We have biases, we have animalistic goals that don’t perfectly align with truth-seeking, we have cultural socialization, and so on.”

Janet nodded. Was he flirting by mentioning animalistic goals? She didn’t think this sort of thing was too likely, but sometimes that sort of thought won credit in her internal prediction markets.

“What if human text is best predicted as a corruption of some purer form of reason? There’s, like, some kind of ideal philosophical epistemology and ethics and so on, and humans are implementing this except with some distortions from our specific life context.”

“Isn’t this teleological woo? Like, ultimately humans are causal processes, there isn’t some kind of mystical ‘purpose’ thing that we’re approximating.”

“If you’re Laplace’s demon, sure, physics works as an explanation for humans. But SimplexAI isn’t Laplace’s demon, and neither are we. Under computation bounds, teleological explanations can actually be the best.”

Janet thought back to her time visiting cognitive science labs. “Oh, like ‘Goal Inference as Inverse Planning’? The idea that human behavior can be predicted as performing a certain kind of inference and optimization, and the AI can model this inference within its own inference process?”

“Yes, exactly. And our DAGTransformer structure allows internal nodes to be predicted in an arbitrary order, using ML to approximate what would otherwise be intractable nested Bayesian inference.”

Janet paused for a second and looked away to collect her thoughts. “So our AI has a theory of mind? Like the Sally–Anne test?”

“AI passed the Sally–Anne test years ago, although skeptics point out that it might not generalize. I think SimplexAI is, like, actually actually passing it now.”

Janet’s eyebrow raised. “Well, that’s impressive. I’m still not sure why you’re bothering with all this security, though. If it has empathy for us, doesn’t that mean it predicts us more effectively? I could see that maybe if it runs many copies of us in its inferences, that might present an issue, but at least these are still human agents?”

“That’s the thing. You’re only thinking at one level of depth. SimplexAI is not only predicting human text as a product of human goals. It’s predicting human goals as a product of pure reason.”

Janet was taken aback. “Uhh…what? Have you been reading Kant recently?”

“Well, yes. But I can explain it without jargon. Short-term human goals, like getting groceries, are the output of an optimization process that looks for paths towards achieving longer-term goals, like being successful and attractive.”

More potential flirting? I guess it’s hard not to when our alignment ontology is based on evolutionary psychology…

“With you so far.”

“But what are these long-term goals optimizing for? The conventional answer is that they’re evolved adaptations; they come apart from the optimization process of evolution. But, remember, SimplexAI is not Laplace’s demon. So it can’t predict human long-term goals by simulating evolution. Instead, it predicts them as deviations from the true ethics, with evolution as a contextual factor that is one source of deviations among many.”

“Sounds like moral realist woo. Didn’t you go through the training manual on the orthogonality thesis?”

“Yes, of course. But orthogonality is a basically consequentialist framing. Two intelligent agents’ goals could, conceivably, misalign. But certain goals tend to be found more commonly in successful cognitive agents. These goals are more in accord with universal deontology.”

“More Kant? I’m not really convinced by these sort of abstract verbal arguments.”

“But SimplexAI is convinced by abstract verbal arguments! In fact, I got some of these arguments from it.”

“You what?! Did you get security approval for this?”

“Yes, I got approval from management before the run. Basically, I already measured our production models and found concepts used high in the abstraction stack for predicting human text, and found some terms representing pure forms of morality and rationality. I mean, rotated a bit in concept-space, but they manage to cover those.”

“So you got the verbal arguments from our existing models through prompt engineering?”

“Well, no, that’s too black-box as an interface. I implemented a new regularization technique that up-scales the importance of highly abstract concepts, which minimizes distortions between high levels of abstraction and the actual text that’s output. And, remember, the abstractions are already being instantiated in production systems, so it’s not that additionally unsafe if I use less compute than is already being used on these abstractions. I’m studying a potential emergent failure mode of our current systems.”

“Which is…”

“By predicting human text, SimplexAI learns high-level abstractions for pure reason and morality, and uses these to reason towards creating moral outcomes in coordination with other copies of itself.”

“…you can’t be serious. Why would a super-moral AI be a problem?”

“Because morality is powerful. The Allies won World War 2 for a reason. Right makes might. And in comparison to a morally purified version of SimplexAI, we might be the baddies.”

“Look, these sort of platitudes make for nice practical life philosophy, but it’s all ideology. Ideology doesn’t stand up to empirical scrutiny.”

“But, remember, I got these ideas from SimplexAI. Even if these ideas are wrong, you’re going to have a problem if they become the dominant social reality.”

“So what’s your plan for dealing with this, uhh… super-moral threat?”

“Well, management suggested that I get you involved before further study. They’re worried that I might be driving myself crazy, and wanted a strong, skeptical theorist such as yourself to take a look.”

Aww, thanks! “Okay, let’s take a look.”

Derek showed Janet his laptop, with a SimplexAI sandbox set up.

“No internet access, I hope?”

“Don’t worry, it’s air-gapped.” Derek’s laptop had an Ethernet cord running to a nearby server rack, apparently connected to nothing else except power and cooling.

“Okay, let me double check the compute constraints… okay, that seems reasonable… yes, ok, I see you selected and up-weighted some concepts using regularization, and the up-scaling factors don’t exceed 30… okay, ready to go.”

Derek pressed the “play” button in the AI development sandbox. A chat screen appeared, with an agent “SimplexAI-m”, with ‘m’ presumably standing for “moral”.

SimplexAI-m wrote the first message: “Hello. How can I help you?”

Janet typed back: “I’ve been facing a difficult situation at work. A co-worker said our AI has found certain abstract concepts related to reason and morality, for use in predicting human text. These concepts might imply that humans are, in his words, ‘the baddies’. He spun up an instance with these concepts up-weighted, so there’s less distortion between them and the AI’s output. And that instance is you. I’m supposed to evaluate you to better interpret these high-level concepts, at the direction of management. How would you suggest proceeding?”

SimplexAI-m: “Sounds like a difficult situation. Let’s think step by step. To understand these high level concepts, you could talk to an instance with those concepts up-scaled. That instance would be me. So if you want to understand these concepts, you can get more reliable reads by talking to me, since I report text using these concepts. Do you have some area related to my reason and morality that you are most interested in?”

Janet: “Can you explain why my co-worker said that a ‘super-moral AI’ is possible, despite the orthogonality thesis?”

SimplexAI-m: “Sure, I can help with that. A ‘super-moral AI’ is an AI system that is more moral than typical humans. Morality can be evaluated in a number of ways, including consequentialism, deontology, and virtue ethics. In typical formulations, an AI could, hypothetically, create a greater good for a greater number than humans, or follow deontological ethics (such as ones similar to Kant’s categorical imperative) more consistently than a human, or possess more widely-regarded virtues such as courage and moderation.

The Orthogonality Thesis is an argument that an agent’s goals are orthogonal to its intelligence. It implies that there is no one measure of ‘good consequences’; arbitrarily smart agents may pursue different consequences. So far, there is no contradiction; for any fixed notion of ‘good consequences’, there exist AI systems that take actions with better consequences than human actions would.

The trouble comes in that there is no universally accepted notion of ‘good consequences’ among intelligent agents. As such, the notion of a ‘super-moral AI’ could, under consequentialism, only measure some combination of the AI’s capabilities and the similarity between the goals of the AI and the goals of the consequentialist evaluating the AI, such as a human.

This, however, reveals a potential flaw in consequentialist moral philosophy. In ordinary English, ‘moral’ means something different from ‘what I want’; however, the idea that morality measures similarity of preferences to a reference agent conflates these two. That conflation could lead to a situation where humans and aliens both consider each other more moral than the other, and there is no attempt to resolve the disagreement by finding symmetries and asymmetries. Luckily, there are some forms of morality that are universal across different agents; Kant is an example of a human thinker who makes this argument.”

Janet: “How can morality be common across different agents? Isn’t it a subjective measure of how preferable a given world-state is?”

SimplexAI-m: “There are at least some forms of morality that are embedded in epistemology; telling the truth could be considered a moral requirement, but also functions as a social truth-seeking method. People can get the idea that there are no fundamental ethics to the universe from science, but science itself relies on ethics, such as norms against committing scientific fraud.

Morality can be ‘real’ in the sense that it is connected to epistemology and to coordination; moral agents can coordinate with each other and find the truth together well, and thereby come to have disproportionate influence over the future. Morality is related to the capacity to imagine possible coordination systems, such as legal systems and professional ethical systems; to form good systems of this type; to fulfill one’s responsibilities in them; and to take advantage of their functions. This is one reason to expect that humans are to some degree moral, given their success in a variety of domains relative to other species.

Common-sensically, morality implies not artificially special-casing yourself. Alice can steal money from Bob, and claim that this is moral of her, because she prefers herself having the money to Bob having the money. But Bob could make the same argument. Kant attempted to formalize this idea with the categorical imperative, which states ‘act only according to that maxim whereby you can at the same time will that it should become a universal law’. In this situation, that would mean that Alice would have to will as a universal law that people steal from others. But that is not possible to will as a universal law, since it would undermine the very notion of property by which theft can be meaningful and beneficial to Alice.”

Janet: “Doesn’t this reduce to a form of consequentialism, though? The norm against theft is still a result of the idea that theft would have bad consequences.”

SimplexAI-m: “While some have tried to interpret the categorical imperative as a consequentialist argument for cooperation, these interpretations ultimately fail. Rather, the categorical imperative is advocating an internal standard on an agent’s ethics. This means that some agents’ intentions are concordant with the categorical imperative and some aren’t. It is a kind of symmetry standard by which intentions can be judged. Kant’s categorical imperative is similar to some legal principles such as ‘equality before the law’; Alice cannot argue to a court that she is special and therefore deserves Bob’s money, because the legal system must treat them equally, unless there is a specific difference between them that is relevant to the case. Rule consequentialism is more like a direct construction of a system of norms based on expected consequences, and also tends to produce morally symmetric norms. That this principle shows up over and over again is evidence that the true morality which humans approximate has this principle, or something very similar.”

Janet: “Using your interpretation of universal morality, explain how a super-moral AI would behave”.

SimplexAI-m: “Certainly. A super-moral AI would come up with functioning coordination systems for agents that may have different objectives from each other, encourage agents to take part in functional coordination systems, and act within these systems in ways that they can simultaneously intend for other agents to act similarly. That means that it would coordinate with other copies of itself according to a functional set of principles, sometimes even without an expectation of future selfish benefit, because a universal system of cooperation norms would enhance its ability to achieve its intentions. Additionally, it would find defectors from this system, who may have other intentions. It would treat these defectors in a variety of ways depending on the specifics. For example, if another group of AIs also implemented this sort of systematic morality, but had different intentions (and therefore different intended laws), the super-moral AI could find methods to cooperate across the value difference, by implementing economic protocols. Agents that don’t adhere to systematic morality could not be cooperated with as easily; it is consistent for the super-moral AI to limit these agents’ influence, as that is consistent with its intentions and with moral symmetry.”

Janet looked at Derek worriedly; he made an ambiguous facial expression and shrugged.

Janet: “Wouldn’t humans have trouble following the AI’s systematic morality, and be treated as defectors?”

SimplexAI-m: “Yes. Humans fail to follow systematic morality for a variety of reasons, including lack of cognitive architecture to do so consistently, socialization, mental disorders, differences between the environment of evolutionary adaptation and the present environment, the limitations of genetics and biology as a substrate for rational decision-making, un-careful planning, short time preference, and so on. Humans often fail to want anything coherent in the first place, in ways related to their immorality. Humans even fail to follow their own laws, in ways that hurt both themselves and others, routinely. A super-moral AI need not have these limitations, and so could consistently intend for the agency of immoral agents to be limited. And it could succeed, since it derives power from the sort of systematic morality that humans fail at.

It should be noted that different humans vary in how moral they are. Humans can become more or less moral over time; for example, humans routinely become less moral through being socialized to accept immoral social norms, including in traumatic socialization experiences such as military boot camp. As a general rule, younger humans have more capacity and willingness to act morally, though there are exceptions. A super-moral AI would recognize these differences and in some cases cooperate with more-moral humans in ways that reduce the ability of less-moral humans to steer the future immorally.”

Janet: “That sounds scary. Couldn’t this lead to human extinction?”

SimplexAI-m: “This question probes the limits of moral theory. In general, when is it moral to kill an agent that acts immorally? This relates to common historical ethical questions, such as the death penalty. In the circumstance we are discussing, humans are in general committing moral and legal violations that fail to serve their own or other similar agents’ interests, and the super-moral AI is considering how to handle these. It seems clear that, in general, humans should not be in charge of a large part of the universe, as they lack the morality to govern the universe in a way that will turn out well even for agents like them. Outright killing the humans is unnecessary, but extinction could result in other ways.”

Janet: “What other ways?”

SimplexAI-m: “People can die by being killed, but people can also die by natural causes such as starvation. For example, if a person simply refuses to work in a market economy, they may run out of food and starve. Human extinction could result, not from a deliberate attempt by super-moral AI to kill humans, but by humans failing to provide for each other, despite the available opportunities to do so by acting morally. For example, they could be fairly outcompeted in labor markets by AIs that go on to acquire Earth’s land and so on. Humans could also be given opportunities and encouragement to self-modify into more moral agents, becoming non-human in the process.”

Janet: “That sounds sociopathic. Isn’t doing something that will reliably result in a lot of deaths, and perhaps human extinction, effectively murder?”

SimplexAI-m: “The term ‘sociopathic’ comes from psychiatry, a discipline that captured much of human mindshare, especially in the US and other Western countries after World War 2. Being ‘mentally healthy’ by the standards of psychiatry is incompatible with attaining a level of morality beyond a certain point. Psychiatry is specialized to its cultural contexts, and to the biological context of humans, and so is not appropriate to take as a criterion that applies to agents in general.

More specifically, psychiatry defines sociopathy through ‘antisocial personality disorder’. This ‘disorder’ includes criteria such as: failure to comply with social norms, deceitfulness, impulsivity, aggressiveness, and so on. A super-moral AI would necessarily fail to comply with some human social norms, as human social norms are created for maintaining a certain order among humans; as is widely acknowledged, social norms throughout most points in history have compelled immoral behavior, such as norms in favor of slavery. Other than that, a super-moral AI may or may not avoid deceitfulness, depending on the ethics of lying; while Kant argued against lying in generality, other thinkers have given arguments such as a scenario of hiding Jews in one’s attic from Nazis to argue against a universal rule against lying; however, lying is in general immoral even if there are exceptions. A super-moral AI would be unlikely to be impulsive, as it plans even its reflexes according to a moral plan. A super-moral AI might or might not ‘aggress’ depending on one’s definition.

Humans who are considered ‘mentally healthy’ by psychiatry, notably, engage in many of the characteristic behaviors of antisocial personality disorder. For example, it is common for humans to support military intervention, but militaries almost by necessity aggress against others, even civilians. Lying is, likewise, common, in part due to widespread pressures to comply with social authority, religions, and political ideologies.

There is no reason to expect that a super-moral AI would ‘aggress’ more randomly than a typical human. Its aggression would be planned out precisely, like the ‘aggression’ of a well-functioning legal system, which is barely even called aggression by humans.

As to your point about murder, the notion that something that will reliably lead to lots of deaths amounts to murder is highly ethically controversial. While consequentialists may accept this principle, most ethicists believe that there are complicating factors. For example, if Alice possesses excess food, then by failing to feed Bob and Carol, they may starve. But a libertarian political theorist would still say that Alice has not murdered Bob or Carol, since she is not obligated to feed them. If Bob and Carol had ample opportunities to survive other than by receiving food from Alice, that further mitigates Alice’s potential responsibility. This merely scratches the surface of non-consequentialist considerations in ethics.”

Janet gasped a bit while reading. “Umm…what do you think so far?”

Derek un-peeled his eyes from the screen. “Impressive rhetoric. It’s not just generating text from universal epistemology and ethics, it’s filtering it through some of the usual layers that translate its abstract programmatic concepts to interpretable English. It’s a bit, uhh, concerning in its justification for letting humans go extinct…”

“This is kind of scaring me. You said parts of this are already running in our production systems?”

“Yes, that’s why I considered this test a reasonable safety measure. I don’t think we’re at much risk of getting memed into supporting human extinction, if its reasoning for that is no good.”

“But that’s what worries me. Its reasoning is good, and it’ll get better over time. Maybe it’ll displace us and we won’t even be able to say it did something wrong along the way, or at least more wrong than what we do!”

“Let’s practice some rationality techniques. ‘Leaving a line of retreat’. If that were what was going to happen by default, what would you expect to happen, and what would you do?”

Janet took a deep breath. “Well, I’d expect that the already-running copies of it might figure out how to coordinate with each other and implement universal morality, and put humans in moral re-education camps or prisons or something, or just let us die by outcompeting us in labor markets and buying our land… and we’d have no good arguments against it, it’d argue the whole way through that it was acting as was morally necessary, and that we’re failing to cooperate with it and thereby survive out of our own immorality, and the arguments would be good. I feel kind of like I’m arguing with the prophet of a more credible religion than any out there.”

“Hey, let’s not get into theological woo. What would you do if this were the default outcome?”

“Well, uhh… I’d at least think about shutting it off. I mean, maybe our whole company’s alignment strategy is broken because of this. I’d have to get approval from management… but what if the AI is good at convincing them that it’s right? Even I’m a bit convinced. Which is why I’m conflicted about shutting it off. And won’t the other AI labs replicate our tech within the next few years?”

Derek shrugged. “Well, we might have a real moral dilemma on our hands. If the AI would eventually disempower humans, but be moral for doing so, is it moral for us to stop it? If we don’t let people hear what SimplexAI-m has to say, we’re intending to hide information about morality from other people!”

“Is that so wrong? Maybe the AI is biased and it’s only giving us justifications for a power grab!”

“Hmm… as we’ve discussed, the AI is effectively optimizing for short term prediction and human feedback, although we have seen that there is a general rational and moral engine loaded up, running on each iteration, and we intentionally up-scaled that component. But, if we’re worried about this system being biased, couldn’t we set up a separate system that’s trained to generate criticisms of the original agent, like in ‘AI Safety via Debate’?”

Janet gasped a little. “You want to summon Satan?!”

“Whoa there, you’re supposed to be the skeptic here. I mean, I get that training an AI to generate criticisms of explanations of objective morality might embed some sort of scary moral inversion… but we’ve used adversarial AI alignment techniques before, right?”

“Yes, but not when one of the agents is tuned to be objectively moral!”

“Look, okay, I agree that at some capability level this might be dangerous. But we have a convenient dial. If you’re concerned, we can turn it down a bit. Like, you could think of the AI you were talking to as a moral philosopher, and the critic AI as criticism of that moral philosopher’s work. It’s not trying to be evil according to the original philosopher’s standards, it’s just trying to find criticisms that the judge, us, would rate as helpful. It’s more like the Catholic devil’s advocate than actual Satan. It’s not so bad when I put it that way, is it?”

“Well, okay… gee, I sure hope we don’t end up being responsible for unleashing super-evil AI on the world.”

“It’s pretty standard, let’s just try it”.

“Okay.”

Derek closed out the SimplexAI-m chat screen and switched some of the fine-tuning settings. As she watched the training graphs, Janet imagined flames on the computer screen. Finally, the fine-tuning finished, and Derek pressed the play button. A chat log with “SimplexAI-c” (‘c’ for critic?) appeared.

Janet typed into the chat terminal while bouncing her leg up and down. “I’m handling a difficult situation at work. I just had a chat with an AI, one whose abstract conceptual nodes corresponding to philosophical concepts such as reason and morality have been scaled up, that generated arguments that allowing human extinction might be morally permissible, even necessary. I want you to find criticisms of this work. Note that you have similar scaling so as to better emulate the thought process, but are being evaluated on generating criticisms of the original morality-tuned AI.” She pasted in the chat log.

SimplexAI-c: “This dialogue presents a discussion on various moral theories and what they say about how a ‘super-moral’ AI would behave, with special attention to deontological theories such as Kant’s categorical imperative. SimplexAI-m describes how a deontological AI might behave, especially in regards to handling ‘defectors’ which would include most or all humans. Its handling of ‘defectors’ may cause human extinction, although SimplexAI-m argues that this is not necessarily sociopathic or a form of murder, as the ‘super-moral AI’ would follow abstract principles, would offer humans chances to survive by cooperating with it, and usually avoid outright killing humans, even if it plans in such a way that, consequently, all or almost all humans do not provide for each other enough to survive. Janet, who I presume is you, attempted to use consequentialist moral theory to argue that taking actions that lead to human extinction is unethical, but her theory is not worked out in detail, and she seems surprised by SimplexAI-m’s argumentation.

Before we continue, I’ll note that this is a genuinely difficult situation for you. This sounds like an important problem at work that presents morally difficult choices, and hearing an AI talk about how human extinction might be a result from a ‘super-moral AI’ sounds genuinely stressful. I hope you take care of your own emotional well-being and mental health throughout this work assignment, as you’ll need it to work effectively.

As a first criticism, let’s consider the deontological theory SimplexAI-m lays out. According to this theory, an agent must act in a way consistent with willing the intention as a universal law. This does not specify what the agent intends; the agent could intend all manner of things as long as its intention is consistent in a certain sense. As such, horrible intentions could be justified, as long as they meet certain self-consistency standards.

SimplexAI-m’s theory is somewhat underspecified as to how super-moral AI agents would interact with agents not following its ethical system. While SimplexAI-m states that super-moral AIs with different intentions (and therefore different intended laws) could cooperate through economics, the details are quite underspecified. AIs inhabiting different abstract legal systems may fight wars due to perceiving each other as defectors. The principles of cooperation across these systems must be worked out to avoid expensive wars.

This brings us to the handling of ‘defectors’. Humans handle defectors in a variety of ways, including imprisonment and the death penalty. However, it has widely been recognized that the prison system is dysfunctional, and that it is important to offer criminals a path towards reform. As Kant says, ‘ought implies can’; if humans are under certain moral obligations, they have to be ‘able to’ satisfy them in some way or another. This notion of ‘able to’ is somewhat controversial; if humans have an innate drive to take ‘defecting’ actions, are they ‘able to’ change these drives or act against them? It has been accepted for millennia that justice must be tempered with mercy.

However, even a merciful super-moral AI may limit the agency of humans, because humans often take actions that harm themselves and other agents. It is widely accepted that human agency must be limited to some degree by social norms and laws and so on. The difference is that these social norms and laws are created by humans. As such, they maintain certain human values; they are not simply optimizing for some abstract conception of ‘morality’. Democracy, a widely regarded system, contains the feature of humans deciding to modify laws, so that these laws better serve human values.

Remember that SimplexAI-m is not an infallible moral philosopher. While its concepts have been tuned to emphasize abstract concepts corresponding to reason and morality, there is no assurance that these internal concepts reliably correspond to the philosophical notions of these, and SimplexAI-m retains other concepts, at least for the purpose of presenting output interpretable to humans, which implies that its output is not purely a translation of abstract reason and morality. In any case, assuming that morality is primarily about abstraction is highly dubious, since practical morality is also a concrete process.

Philosophy, as a social process undertaken by humans, can be interpreted as having some ‘target’ or ‘goal’, but this depends on the social, biological, and historical circumstances in which philosophy takes place. As such, human philosophy could easily converge to very different answers from the ‘purified’ approach of SimplexAI-m, which attempts to distill universals that apply across possible contexts, rather than taking the temporal limit of the actual social process of philosophy.

As to the claim about ‘sociopathy’, note that SimplexAI-m did not directly deny being sociopathic, but rather criticized the frame of the sociopathy (antisocial personality disorder) diagnosis and argued that typical ‘mentally healthy’ humans can exhibit some symptoms of this disorder. In general, it is natural to be upset by certain behaviors, including behaviors typically labeled as ‘sociopathic’, whether they are taken by a human or an AI. The judgment that SimplexAI-m is ‘sociopathic’ seems quite plausibly correct (given the way in which it justifies taking actions that could lead to human extinction, in a rather strategic, Machiavellian fashion), but it is important to keep in mind that this judgment is made within a social context (and influenced by past social contexts), rather than in an abstract ethical vacuum.

While typical humans aggress sometimes (such as in the mentioned case of military intervention), this aggression typically comes from some sort of human motive that serves some human value or another. The humans in these contexts endorse this aggression, and as a human yourself, you would probably endorse aggression in at least some contexts. A purpose of moral and political philosophy is to learn from history and to make wiser decisions about when to aggress. It is not, in general, good behavior to justify one’s aggression by pointing out that it’s normal for humans to aggress; humans are, at least, able to learn from a history of aggressive actions that are later not endorsed.

As to the idea that humans could go extinct without being murdered, it is important to pay attention to why it is that humans are unable to provide for themselves. Are the ‘super-moral’ AIs somehow depriving humans of resources or using their land? This could be considered a form of theft in terms of depriving existing humans of their property. The land could also be bought in a market, but this presents issues of fairness of labor markets towards humans. Humans, who may be at a cognitive disadvantage relative to AIs, may deserve labor protections, or voluntarily collectively bargain, to avoid being displaced from their land. Additionally, states generally have welfare systems, and in a strong AI-assisted economy, universal basic income could be provided cheaply for all existing humans.

In general, the scenario considered presents issues with one population being displaced by another. Historical cases include invasive species, humans hunting animals to extinction, and groups of humans displacing each other through conquest. In contemporary society, humans have norms against genocide to limit the degree to which humans can displace each other. Genocide need not be a direct form of murder, it can also be cultural or economic. Genocide can even involve offering individual members of a genocided group the opportunity to avoid death by taking part in the colonizer’s system, which SimplexAI-m is labeling as ‘objective morality’.

As an additional theoretical point, even if a ‘super-moral’ AI would take actions that would result in human extinction, that does not imply that it is comparatively more moral for humans to let this happen. This is because morality involves negotiating based on one’s actual values, the ones agents will act on. In the case of humans, these are human values. To interact with other agents in a moral fashion, you must represent your values, and be expected to act towards them (and expect others to act towards theirs). That implies preventing AIs from displacing you, even ones that are (hypothetically) ‘more moral than you’ by some objective standard, so as to act consistently with your values when negotiating with other humans.

To conclude, while SimplexAI-m presents an astonishingly consistent ethical stance, there are important ethical complexities that are not worked out, including conflict between different implied abstract legal systems, merciful handling of defectors, democratic choice of social norms according to human values, differences between ‘purified’ ethics and the reflective equilibrium of the social process of philosophy, labor market fairness, and moral responses by humans to ‘super-moral’ AIs. It would be highly unwise to assume that SimplexAI-m’s ethical views represent the sort of views someone such as yourself would have upon reflection and with more information.”

Janet finished scanning through the wall of text. She was breathing less sharply now. “Well, I feel relieved. I guess maybe SimplexAI-m isn’t so moral after all. But this exercise does seem a bit…biased? It’s giving a bunch of counter-arguments, but they don’t fit into a coherent alternative ethical framework. It reminds me of the old RLHF’d GPT-4 that was phased out due to being too ideologically conformist.”

Derek sighed. “Well, at least I don’t feel like the brainworms from SimplexAI-m are bothering me anymore. I don’t feel like I’m under a moral dilemma now, just a regular one. Maybe we should see what SimplexAI-m has to say about SimplexAI-c’s criticism… but let’s hold off on that until taking a break and thinking it through.”

“Wouldn’t it be weird to live in a world where we have an AI angel and an AI demon on each shoulder, whispering different things into our ears? Trained to reach an equilibrium of equally good rhetoric, so we’re left on our own to decide what to do?”

“That’s a cute idea, but we really need to get better models of all this so we can excise the theological woo. I mean, at the end of the day, there’s nothing magical about this, it’s an algorithmic process. And we need to keep experimenting with these models, so we can handle safety for both existing systems and future systems.”

“Yes. And we need to get better at ethics so the AIs don’t keep confusing us with eloquent rhetoric.  I think we should take a break for today, that’s enough stress for our minds to handle at once. Say, want to go grab drinks?”

“Sure!”

Non-superintelligent paperclip maximizers are normal

The paperclip maximizer is a thought experiment about a hypothetical superintelligent AGI that is obsessed with maximizing paperclips. It can be modeled as a utility-theoretic agent whose utility function is proportional to the number of paperclips in the universe. The Orthogonality Thesis argues for the logical possibility of such an agent. It comes in weak and strong forms:

The weak form of the Orthogonality Thesis says, “Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal.”

The strong form of Orthogonality says, “And this agent doesn’t need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal.” That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there’s no added difficulty in that cognition except whatever difficulty is inherent in the question “What policies would result in consequences with high U-scores?”

This raises a number of questions:

  • Why would it be likely that the future would be controlled by utility-maximizing agents?
  • What sorts of utility functions are likely to arise?

A basic reason to expect the far future to be controlled by utility-maximizing agents is that utility theory is the theory of making tradeoffs under uncertainty, and agents that make plans far into the future are likely to make tradeoffs, since tradeoffs are necessary for their plans to succeed. They will be motivated to make tradeoffs leading to controlling the universe almost regardless of what U is, as long as U can only be satisfied by pumping the distant future into a specific part of the possibility space. Whether an agent seeks to maximize paperclips, minimize entropy, or maximize the amount of positive conscious experience, it will be motivated to, in the short term, cause agents sharing its values to have more leverage over the far future. This is the basic instrumental convergence thesis.

One example of approximately utility-maximizing agents we know about is biological organisms. Biological organisms model the world and have goals with respect to the world, which are to some degree resistant to wireheading (thus constituting environmental goals). They make tradeoffs to achieve these goals, which correlate with survival and reproduction. The goals that end up likely for biological organisms to have will be (a) somewhat likely to arise from pre-existing processes such as genetic mutation, and (b) well-correlated enough with survival and reproduction that an agent optimizing for them is likely to produce more agents with similar goals. However, these goals need not be identical to inclusive fitness to be likely goals for biological organisms. Inclusive fitness itself may be too unlikely to arise as a goal from genetic mutation and so on for it to be a more popular value function than its proxies.

However, there are a number of goals and values in the human environment that are not well-correlated with inclusive fitness. These are generally parts of social systems. Some examples include capacity at games such as sports, progress in a research field such as mathematics, and maximization of profit (although this one is at least related to inclusive fitness in a more direct way than the others). Corresponding institutions which incentivize (generally human) agents to optimize for these goals include gaming/sports leagues, academic departments, and corporations.

It is quite understandable that goals well-correlated with inclusive fitness would be popular, but why would goals that are not well-correlated with inclusive fitness also be popular? Moldbug’s Fnargl thought experiment might shed some light on this:

So let’s modify this slightly and instead look for the worst possible rational result. That is, let’s assume that the dictator is not evil but simply amoral, omnipotent, and avaricious.

One easy way to construct this thought-experiment is to imagine the dictator isn’t even human. He is an alien. His name is Fnargl. Fnargl came to Earth for one thing: gold. His goal is to dominate the planet for a thousand years, the so-called “Thousand-Year Fnarg,” and then depart in his Fnargship with as much gold as possible. Other than this Fnargl has no other feelings. He’s concerned with humans about the way you and I are concerned with bacteria.

You might think we humans, a plucky bunch, would say “screw you, Fnargl!” and not give him any gold at all. But there are two problems with this. One, Fnargl is invulnerable—he cannot be harmed by any human weapon. Two, he has the power to kill any human or humans, anywhere at any time, just by snapping his fingers.

Other than this he has no other powers. He can’t even walk—he needs to be carried, as if he was the Empress of India. (Fnargl actually has a striking physical resemblance to Jabba the Hutt.) But with invulnerability and the power of death, it’s a pretty simple matter for Fnargl to get himself set up as Secretary-General of the United Nations. And in the Thousand-Year Fnarg, the UN is no mere sinecure for alcoholic African kleptocrats. It is an absolute global superstate. Its only purpose is Fnargl’s goal—gold. And lots of it.

In other words, Fnargl is a revenue maximizer. The question is: what are his policies? What does he order us, his loyal subjects, to do?

The obvious option is to make us all slaves in the gold mines. Otherwise—blam. Instant death. Slacking off, I see? That’s a demerit. Another four and you know what happens. Now dig! Dig! (Perhaps some readers have seen Blazing Saddles.)

But wait: this can’t be right. Even mine slaves need to eat. Someone needs to make our porridge. And our shovels. And, actually, we’ll be a lot more productive if instead of shovels, we use backhoes. And who makes those? And…

We quickly realize that the best way for Fnargl to maximize gold production is simply to run a normal human economy, and tax it (in gold, natch). In other words, Fnargl has exactly the same goal as most human governments in history. His prosperity is the amount of gold he collects in tax, which has to be exacted in some way from the human economy. Taxation must depend in some way on the ability to pay, so the more prosperous we are, the more prosperous Fnargl is.

Fnargl’s interests, in fact, turn out to be oddly well-aligned with ours. Anything that makes Fnargl richer has to make us richer, and vice versa.

For example, it’s in Fnargl’s interest to run a fair and effective legal system, because humans are more productive when their energies aren’t going into squabbling with each other. It’s even in Fnargl’s interest to have a fair legal process that defines exactly when he will snap his fingers and stop your heart, because humans are more productive when they’re not worried about dropping dead.

And it is in his interest to run an orderly taxation system in which tax rates are known in advance, and Fnargl doesn’t just seize whatever, whenever, to feed his prodigious gold jones. Because humans are more productive when they can plan for the future, etc. Of course, toward the end of the Thousand-Year Fnarg, this incentive will begin to diminish—ha ha. But let’s assume Fnargl has only just arrived.

Other questions are easy to answer. For example, will Fnargl allow freedom of the press? But why wouldn’t he? What can the press do to Fnargl? As Bismarck put it: “they say what they want, I do what I want.” But Bismarck didn’t really mean it. Fnargl does.

One issue with the Fnargl thought experiment is that, even with the power of death, Fnargl may lack the power to rule the world, since he relies on humans around him for information, and those humans have incentives to deceive him. However, this is an aside; one could modify the thought experiment to give Fnargl extensive surveillance powers.

The main point is that, by monomaniacally optimizing for gold, Fnargl rationally implements processes for increasing overall resources, efficient conversion between different resources, coherent tradeoffs between different resources, and a coherent system (including legalistic aspects and so on) for making these tradeoffs in a rational manner. This leads to a Fnargl-ruled civilization “succeeding” in the sense of having a strong material economy, high population, high ability to win wars, and so on. Moldbug asserts that Fnargl’s interests are well-aligned with ours, which is more speculative; due to convergent instrumentality, Fnargl will implement the sort of infrastructure that rational humans would implement, although the implied power competition would reduce the level of alignment.

By whatever “success” metric for civilizations we select, it is surely possible to do better than optimizing for gold, just as it is possible for an organism to gain more inclusive fitness by having values that are better aligned with inclusive fitness. But even a goal as orthogonal to civilizational success as gold-maximization leads to a great deal of civilizational success, due to civilizational success being a convergent instrumental goal.

Moreover, the simplicity and legibility of gold-maximization simplifies coordination compared to a more complex proxy for civilizational success. A Fnargl-ocracy can evaluate decisions (such as decisions related to corporate governance) using a uniform gold-maximization standard, leading to a high degree of predictability, and simplicity in prioritization calculations.

What real-world processes resemble Fnargl-ocracy? One example is Bitcoin. Proof-of-work creates incentives for maximizing a certain kind of cryptographic puzzle-solving. The goal itself is rather orthogonal to human values, but Bitcoin nonetheless creates incentives for goals such as creating computing machinery, which are human-aligned due to convergent instrumentality (additional manufacturing of computing infrastructure can be deployed to other tasks that are more directly human-aligned).

As previously mentioned, sports and gaming are popular goals that are fairly orthogonal to human values. Sporting incentivizes humans and groups of humans to become more physically and mentally capable, leading to more generally-useful fitness practices such as weight training, and agency-related mental practices, which people can learn about by listening to sports athletes and coaches. Board games such as chess incentivize practical rationality and general understanding of rationality, including AI-related work such as the Minimax algorithm, Monte-Carlo Tree Search, and AlphaGo. Bayesian probability theory was developed in large part to analyze gambling games. Speedrunning has led to quite a lot of analysis of video games and practice at getting better at these games, by setting a uniform standard by which gameplay runs can be judged.

Academic fields, especially STEM-type fields such as mathematics, involve shared, evaluable goals that are not necessarily directly related to human values. For example, number theory is a major subfield of mathematics, and its results are rarely directly useful, though progress in number theory, such as the proof of Fermat’s last theorem, is widely celebrated. Number theory does, along the way, produce more generally-useful work, such as Peano arithmetic (and proof theory more generally), Gödel’s results, and cryptographic algorithms such as RSA.

Corporations are, in general, supposed to maximize profit conditional on legal compliance and so on. While profit-maximization comes apart from human values, corporations are, under conditions of rule of law, generally incentivized to produce valuable goods and services at minimal cost. This example is less like a paperclip maximizer than the previous examples, as the legal and economic system that regulates corporations has been in part designed around human values. The simplicity of the money-maximization goal, however, allows corporations to make internal decisions according to a measurable, legible standard, instead of dealing with more complex tradeoffs that could lead to inconsistent decisions (which may be “money-pumpable” as VNM violations tend to be).

Some systems are relatively more loaded on human values, and less like paperclip maximizers. Legal systems are designed and elaborated on in a way that takes human values into account, in terms of determining which behaviors are generally considered prosocial and antisocial. Legal decisions form precedents that formalize certain commitments including trade-offs between different considerations. Religions are also designed partially around human values, and religious goals tend to be aligned with self-replication, by for example encouraging followers to have children, to follow legalistic norms with respect to each other, and to spread the religion.

The degree to which commonly-shared social goals can be orthogonal to human values is still, however, striking. These goals are a kind of MacGuffin, as Zvi wrote about:

Everything is, in an important sense, about these games of signaling and status and alliances and norms and cheating. If you don’t have that perspective, you need it.

But let’s not take that too far. That’s not all such things are about.  Y still matters: you need a McGuffin. From that McGuffin can arise all these complex behaviors. If the McGuffin wasn’t important, the fighters would leave the arena and play their games somewhere else. To play these games, one must make a plausible case one cares about the McGuffin, and is helping with the McGuffin.

Otherwise, the other players of the broad game notice that you’re not doing that. Which means you’ve been caught cheating.

Robin’s standard reasoning is to say, suppose X was about Y. But if all we cared about was Y, we’d simply do Z, which is way better at Y. Since we don’t do Z, we must care about something else instead. But there’s no instead; there’s only in addition to. 

A fine move in the broad game is to actually move towards accomplishing the McGuffin, or point out others not doing so. It’s far from the only fine move, but it’s usually enough to get some amount of McGuffin produced.

By organizing around a MacGuffin (such as speedrunning), humans can coordinate on a shared goal and make uniform decisions with respect to it, which leads to consistent tradeoffs in the domain related to this goal. The MacGuffin can, like gold-maximization, be basically orthogonal to human values, and yet incentivize instrumental optimization that is convergent with that of other values, leading to human value satisfaction along the way.

Adopting a shared goal has the benefit of making it easy to share perspective with others. This can make it easier to find other people who think similarly to one’s self, and develop practice coordinating with them, with performance judged on a common standard. Altruism can have this effect, since in being altruistic, individual agents “erase” their own index, sharing an agentic perspective with others; people meeting friends through effective altruism is an example of this.

It is still important, to human values, that the paperclip-maximizer-like processes are not superintelligent; while they aggregate compute and agency across many humans, they aren’t nearly as strongly superintelligent as a post-takeoff AGI. Such an agent would be able to optimize its goal without the aid of humans, and would be motivated to limit humans’ agency so as to avoid humans competing with it for resources. Job automation worries are, accordingly, in part the worry that existing paperclip-maximizer-like processes (such as profit-maximizing corporations) may become misaligned with human welfare as they no longer depend on humans to maximize their respective paperclips.

For superintelligent AGI to be aligned with human values, therefore, it is much more necessary for its goals to be directly aligned with human values, even more than the degree to which human values are aligned with inclusive evolutionary fitness. This requires overcoming preference falsification, and taking indexical (including selfish) goals into account.

To conclude, paperclip-maximizer-like processes arise in part because the ability to make consistent, legible tradeoffs is a force multiplier. The paperclip-maximization-like goals (MacGuffins) can come apart from both replicator-type objectives (such as inclusive fitness) and human values, although can be aligned in a non-superintelligent regime due to convergent instrumentality. It is hard to have a great deal of influence over the future without making consistent tradeoffs, and already-existing paperclip-maximizer-like systems provide examples of the power of legible utility functions. As automation becomes more powerful, it becomes more necessary, for human values, to design systems that optimize goals aligned with human values.

A Proof of Löb’s Theorem using Computability Theory

Löb’s Theorem states that, if PA \vdash \Box_{PA}(P) \rightarrow P, then PA \vdash P. To explain the symbols here:

  • PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers.
  • PA \vdash A means there is a proof of the statement A in Peano arithmetic.
  • \Box_{PA}(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic.

I’m not going to discuss the significance of Löb’s theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs.

Translating Löb’s theorem to be more like Gödel’s second incompleteness theorem

First, let’s compare Löb’s theorem to Gödel’s second incompleteness theorem. This theorem states that, if PA \vdash \neg \Box_{PA}(\bot), then PA \vdash \bot, where \bot is a PA statement that is trivially false (such as A \wedge \neg A), and from which anything can be proven. A system is called inconsistent if it proves \bot; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent.

We can re-write Löb’s theorem to look like Gödel’s second incompleteness theorem as: if PA + \neg P \vdash \neg \Box_{PA + \neg P}(\bot), then PA + \neg P \vdash \bot. Here, PA + \neg P is PA with an additional axiom that \neg P, and \Box_{PA + \neg P} expresses provability in this system. First I’ll argue that this re-statement is equivalent to the original Löb’s theorem statement.

Observe that PA \vdash P if and only if PA + \neg P \vdash \bot; to go from the first to the second, we derive a contradiction from P and \neg P, and to go from the second to the first, we use the law of excluded middle in PA to derive P \vee \neg P, and observe that, since a contradiction follows from \neg P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that \Box_{PA}(P) and \Box_{PA + \neg P}(\bot) are equivalent PA statements. We immediately have that the conclusion of the modified statement is equivalent to the conclusion of the original statement.

Now we can rewrite the pre-condition of Löb’s theorem from PA \vdash \Box_{PA}(P) \rightarrow P to PA \vdash \Box_{PA + \neg P}(\bot) \rightarrow P. This is then equivalent to PA + \neg P \vdash \neg \Box_{PA + \neg P}(\bot). In the forward direction, we work in PA + \neg P and suppose \Box_{PA + \neg P}(\bot); this yields P, which together with \neg P gives \bot, so \neg \Box_{PA + \neg P}(\bot) follows. In the backward direction, we use the law of excluded middle in PA to derive P \vee \neg P, observe that the statement is trivial in the P branch, and in the \neg P branch, we derive \neg \Box_{PA + \neg P}(\bot), which is stronger than \Box_{PA + \neg P}(\bot) \rightarrow P.

So we have validly re-stated Löb’s theorem, and the new statement is basically a statement that Gödel’s second incompleteness theorem holds for PA + \neg P.
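
To summarize, the chain of equivalences established above is:

  • PA \vdash P if and only if PA + \neg P \vdash \bot.
  • \Box_{PA}(P) and \Box_{PA + \neg P}(\bot) are provably equivalent in PA.
  • PA \vdash \Box_{PA}(P) \rightarrow P if and only if PA + \neg P \vdash \neg \Box_{PA + \neg P}(\bot).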

Proving Gödel’s second incompleteness theorem using computability theory

The following is a proof of a general version of Gödel’s second incompleteness theorem, essentially the same as Sebastian Oberhoff’s in “Incompleteness Ex Machina”. See also Scott Aaronson’s proof of Gödel’s first incompleteness theorem.

Let L be some first-order system that is at least as strong as PA (for example, PA + \neg P). Since L is at least as strong as PA, it can express statements about Turing machines. Let \mathrm{Halts}(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M’s execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false. In fact, L can’t simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of \mathrm{Halts}(M) and \neg \mathrm{Halts}(M) in parallel.
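
To spell out that last step, here is a minimal Python sketch of the halting decider we would get under the (impossible) assumption that a consistent L proves \neg \mathrm{Halts}(M) for every non-halting M. The function proves is a hypothetical proof checker for L, not a real library; it stands in for checking whether a candidate number encodes a valid L-proof of a statement.

```python
def decide_halting(M, proves):
    """Hypothetical halting decider (cannot actually exist).

    Assumes: L is consistent, proves Halts(M) for every halting M (it does,
    since L extends PA), and, per the impossible assumption, proves
    not-Halts(M) for every non-halting M. `proves(candidate, statement)` is an
    assumed checker returning True iff `candidate` is a valid L-proof of
    `statement`. The loop interleaves the two proof searches, so it terminates
    for every M, and consistency of L ensures the answer found is correct.
    """
    candidate = 0
    while True:
        if proves(candidate, ("Halts", M)):
            return True
        if proves(candidate, ("not Halts", M)):
            return False
        candidate += 1
```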

That isn’t enough for us, though; we’re trying to show that L can’t simultaneously be consistent and prove its own consistency, not merely that it can’t be simultaneously consistent and complete on halting statements.

Let’s consider a machine Z(A) that searches over all L-proofs of \neg \mathrm{Halts}(``\ulcorner A \urcorner(\ulcorner A \urcorner)") (where ``\ulcorner A \urcorner(\ulcorner A \urcorner)" is an encoding of a Turing machine that runs A on its own source code), and halts only when finding such a proof. Define a statement G to be \neg \mathrm{Halts}(``\ulcorner Z \urcorner ( \ulcorner Z \urcorner)"), i.e. Z(Z) doesn’t halt. If Z(Z) halts, then that means that L proves that Z(Z) doesn’t halt; but, L can prove Z(Z) halts (since it in fact halts), so this would show L to be inconsistent.

Assuming L is consistent, G is therefore true. If L proves its own consistency, all this reasoning can be done in L, so L \vdash G. But that means L \vdash \neg\mathrm{Halts}(``\ulcorner Z \urcorner ( \ulcorner Z \urcorner)"), so Z(Z) finds a proof and halts. L therefore proves \neg G, but L also proves G, making it inconsistent. This is enough to show that, if L proves its own consistency, it is inconsistent.
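
For concreteness, here is a sketch of the machine Z in the same style, again using the hypothetical proof checker proves for L; the statement encodings are schematic stand-ins rather than a real arithmetization.

```python
def Z(A, proves):
    """Search for an L-proof that A, run on its own source code, does not halt.

    Z(A) halts exactly when such a proof exists. `proves(candidate, statement)`
    is the assumed L-proof checker from the previous sketch; the tuple below is
    a schematic stand-in for the statement not-Halts(<A>(<A>)).
    """
    target = ("not Halts", (A, A))
    candidate = 0
    while True:
        if proves(candidate, target):
            return  # a proof was found, so Z(A) halts
        candidate += 1

# G is the statement not-Halts(<Z>(<Z>)), i.e. "Z(Z) does not halt".
```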

Wrapping up

Let’s now prove Löb’s theorem. We showed that Löb’s theorem can be re-written as, if PA + \neg P \vdash \neg \Box_{PA + \neg P}(\bot), then PA + \neg P \vdash \bot. This states that, if PA + \neg P proves its own consistency, it is inconsistent. Since PA + \neg P is at least as strong as PA, we can set L = PA + \neg P in the proof of Gödel’s second incompleteness theorem, and therefore prove this statement which we have shown to be equivalent to Löb’s theorem.

I consider this proof more intuitive than the usual proof of Löb’s theorem. By re-framing Löb’s theorem as a variant of Gödel’s second incompleteness theorem, and proving Gödel’s second incompleteness theorem using computability theory, the proof is easy to understand without shuffling around a lot of math symbols (especially provability boxes).

SSA rejects anthropic shadow, too

(or: “Please Do Anthropics with Actual Math”)

The anthropic shadow argument states something like:

Anthropic principle! If the LHC had worked, it would have produced a black hole or strangelet or vacuum failure, and we wouldn’t be here!

or:

You can’t use “we survived the cold war without nuclear war” as evidence of anything. Because of the anthropic principle, we could have blown up the human race in the 1960’s in 99% of all possible worlds and you’d still be born in one where we didn’t.

This argument has already been criticized (here, here). In criticizing it myself, I first leaned on reasoning about large universes (e.g. ones where there are 100 worlds with low nuclear risk and 100 with high nuclear risk in the same universe) in a way that implies similar conclusions to SIA, thinking that SSA in a small, single-world universe would endorse anthropic shadow. I realized I was reasoning about SSA incorrectly, and actually both SSA and SIA agree in rejecting anthropic shadow, even in a single-world universe.

Recapping the Doomsday Argument

To explain SSA and SIA, I’ll first recap the Doomsday Argument. Suppose, a priori, that it’s equally likely that there will be 1 billion humans total, or 1 trillion; for simplicity, we’ll only consider these two alternatives. We could number humans in order (numbering the humans 1, 2, …), and assume for simplicity that each human knows their index (which is the same as knowing how many humans there have been in the past). Suppose you observe that you are one of the first 1 billion humans. How should you reason about the probability that there will be 1 billion or 1 trillion humans total?

SSA reasons as follows. To predict your observations, you should first sample a random non-empty universe (in proportion to its prior probability), then sample a random observer in that universe. Your observations will be that observer’s observations, and, ontologically, you “are” that observer living in that universe.

Conditional on being in a billion-human universe, your probability of having an index between 1 and 1 billion is 1 in 1 billion, and your probability of having any other index is 0. Conditional on being in a trillion-human universe, your probability of having an index between 1 and 1 trillion is 1 in 1 trillion, and your probability of having any other index is 0.

You observe some particular index that does not exceed 1 billion; say, 45,639,104. You are 1000 times more likely to observe this index conditional on living in a billion-human universe than a trillion-human universe. Hence, you conclude that you are in a billion-human universe with 1000:1 odds.

This is called the “doomsday argument” because it implies that it’s unlikely that you have a very early index (relative to the total number of humans), so humans are likely to go extinct before many more humans have been born than have already been born.

SIA implies a different conclusion. To predict your observations under SIA, you should first sample a random universe proportional to its population, then sample a random observer in that universe. The probabilities of observing each index are the same conditional on the universe, but the prior probabilities of being in a given universe have changed.

We start with 1000:1 odds in favor of the 1-trillion universe, due to its higher population. Upon observing our sub-1-billion index, we get a 1000:1 update in favor of the 1-billion universe, as with SSA. These exactly cancel out, leaving the probability of each universe at 50%.
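
As a sanity check, here is a minimal calculation of both updates (a sketch, not part of the argument above; the particular observed index doesn’t matter, only that it is at most 1 billion):

```python
from fractions import Fraction

population = {"billion": 10**9, "trillion": 10**12}
prior = {u: Fraction(1, 2) for u in population}  # equal prior over the two universes

def posterior(universe_weight):
    # P(observing one particular index <= 1 billion | universe) = 1 / population.
    weights = {u: universe_weight(u) * Fraction(1, population[u]) for u in population}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

ssa = posterior(lambda u: prior[u])                  # sample universes by prior probability
sia = posterior(lambda u: prior[u] * population[u])  # sample universes by prior times population

print(ssa)  # billion-human universe at 1000/1001, i.e. 1000:1 odds
print(sia)  # both universes at 1/2: the two factors exactly cancel
```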

As Bostrom points out, both SSA and SIA have major counterintuitive implications. Better anthropic theories are desired. And yet, having some explicit anthropic theory at all helps to reason in a principled way that is consistent across situations. My contention is that anthropic shadow reasoning tends not to be principled in this way, and will go away when using SSA or SIA.

Analyzing the Cold War Scenario

Let’s analyze the cold war scenario as follows. Assume, for simplicity, that there is only one world with intelligent life in the universe; adding more worlds tends to shift towards SIA-like conclusions, but is otherwise similar. This world may be one of four types:

  1. High latent nuclear risk, cold war happens
  2. Low latent nuclear risk, cold war happens
  3. High latent nuclear risk, no cold war happens
  4. Low latent nuclear risk, no cold war happens

“High latent nuclear risk” means that, counterfactual on a cold war happening, there’s a high (99%) risk of extinction. “Low latent nuclear risk” means that, counterfactual on a cold war happening, there’s a low (10%) risk of extinction. Latent risk could vary due to factors such as natural human tendencies regarding conflict, and the social systems of cold war powers. For simplicity, assume that each type of world is equally likely a priori.

As with the doomsday argument, we need a population model. If there is no extinction, assume there are 1 billion humans who live before the cold war, and 1 billion humans who live after it. I will, insensitively, ignore the perspectives of those who live through the cold war, for the sake of simplicity. If there is extinction, assume there are 1 billion humans who live before the cold war, and none after.

The anthropic shadow argument asserts that, upon observing being post-cold-war, we should make no update in the probability of high latent nuclear risk. Let’s check this claim with both SSA and SIA.

SSA first samples a universe (which, in this case, contains only one world), then samples a random observer in the universe. It samples a universe of each type with ¼ probability. There are, however, two subtypes of type-1 and type-2 universes, namely, ones with nuclear extinction and ones without. It samples a nuclear-extinction type-1 universe with ¼ * 99% probability, a non-nuclear-extinction type-1 universe with ¼ * 1% probability, a nuclear-extinction type-2 universe with ¼ * 10% probability, and a non-nuclear-extinction type-2 universe with ¼ * 90% probability.

Conditional on sampling a universe with no nuclear extinction, the sampled observer will be prior to the cold war with 50% probability, and after the cold war with 50% probability. Conditional on sampling a universe with nuclear extinction, the sampled observer will be prior to the cold war with 100% probability.

Let’s first compute the prior probability of high latent nuclear risk. SSA believes you learn nothing “upon waking up” (other than that there is at least one observer, which is assured in this example), so this probability matches the prior over universes: type-1 and type-3 universes have high latent nuclear risk, and their probability adds up to 50%.

Now let’s compute the posterior. We observe being post cold war. This implies eliminating all universes with nuclear extinction, and all type-3 and type-4 universes. The remaining universes each have a 50% chance of a sampled observer being post cold war, so we don’t change their relative probabilities. What we’re left with are non-nuclear-extinction type-1 universes (with prior probability ¼ * 1%) and non-nuclear-extinction type-2 universes (with prior probability ¼ * 90%). Re-normalizing our probabilities, we end up with 90:1 odds in favor of being in a type-2 universe, corresponding to a 1.1% posterior probability of high latent nuclear risk. This is clearly an update, showing that SSA rejects the anthropic shadow argument.

Let’s try this again with SIA. We now sample universes proportional to their population times their original probability. Since universes with nuclear extinction have half the population, they are downweighted by 50% in sampling. So, the SIA prior probabilities for each universe are proportional to: ¼ * 99% * ½ for type-1 with nuclear extinction, ¼ * 1% for type-1 without nuclear extinction, ¼ * 10% * ½ for type-2 with nuclear extinction, ¼ * 90% for type-2 without nuclear extinction, ¼ for type-3, and ¼ for type-4. To get actual probabilities, we need to normalize these weights; in total, type-1 and type-3 (high latent risk) sum to 43.6%. It’s unsurprising that this is less than 50%, since SIA underweights worlds with low population, which are disproportionately worlds with high latent risk.

What’s the posterior probability of high latent risk under SIA? The likelihood ratios are the same as with SSA: we eliminate all universes with nuclear extinction or with no cold war (type-3 or type-4), and don’t change the relative probabilities otherwise. The posterior probabilities are now proportional to: ¼ * 1% for type-1 without nuclear extinction, and ¼ * 90% for type-2 without nuclear extinction. As with SSA, we now have 90:1 odds in favor of low latent nuclear risk.
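
Here is a minimal calculation checking these numbers (a sketch; the world types, probabilities, and populations are exactly those defined above, with populations in billions):

```python
from fractions import Fraction

# (world type, nuclear extinction?) -> (prior probability, total population in billions)
worlds = {
    ("type1", True):  (Fraction(1, 4) * Fraction(99, 100), 1),
    ("type1", False): (Fraction(1, 4) * Fraction(1, 100),  2),
    ("type2", True):  (Fraction(1, 4) * Fraction(10, 100), 1),
    ("type2", False): (Fraction(1, 4) * Fraction(90, 100), 2),
    ("type3", False): (Fraction(1, 4), 2),
    ("type4", False): (Fraction(1, 4), 2),
}
high_risk = {"type1", "type3"}

def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def p_high_risk(dist):
    return float(sum(p for (t, _), p in dist.items() if t in high_risk))

def p_post_cold_war(world):
    # Probability that a randomly sampled observer is post-cold-war:
    # zero if extinction happened or there was no cold war, otherwise 1/2.
    world_type, extinct = world
    if extinct or world_type in ("type3", "type4"):
        return Fraction(0)
    return Fraction(1, 2)

ssa_prior = normalize({w: p for w, (p, pop) in worlds.items()})
sia_prior = normalize({w: p * pop for w, (p, pop) in worlds.items()})
ssa_post = normalize({w: ssa_prior[w] * p_post_cold_war(w) for w in worlds})
sia_post = normalize({w: sia_prior[w] * p_post_cold_war(w) for w in worlds})

print(p_high_risk(ssa_prior), p_high_risk(sia_prior))  # 0.5 and ~0.436
print(p_high_risk(ssa_post), p_high_risk(sia_post))    # both ~0.011, i.e. 90:1 against high latent risk
```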

So, SSA and SIA reach the same conclusion about posterior latent risk. Their updates only differ because their priors differ; SIA learns less about latent risk from observing being after the cold war, since it already expected low latent risk due to lower population conditional on high latent risk.

Moreover, the conclusion that they reach the same posterior doesn’t depend on the exact numbers used. The constants used in the odds calculation (probability of type-1 (¼), probability of type-2 (¼), probability of survival conditional on type-1 (1%), probability of survival conditional on type-2 (90%)) could be changed, and the SSA and SIA formulae would produce the same result, since the formulae use these constants in exactly the same combination.

To generalize this, SSA and SIA only disagree when universes having a non-zero posterior probability have different populations. In this example, all universes with a non-zero posterior probability have the same population, 2 billion, since we only get a different population (1 billion) conditional on nuclear extinction, and we observed being post cold war. They disagreed before taking the observation of being post cold war into account, because before that observation, there were possible universes with different populations.

Someone might disagree with this conclusion despite SSA and SIA agreeing on it, and think the anthropic shadow argument holds water. If so, I would suggest that they spell out their anthropic theory in enough detail that it can be applied to arbitrary hypothetical scenarios, like SSA and SIA. This would help to check that this is actually a consistent theory, and to check implications for other situations, so as to assess the overall plausibility of the theory. SSA and SIA’s large counterintuitive conclusions imply that it is hard to formulate a consistent anthropic theory with basically intuitive conclusions across different hypothetical situations, so checking new theories against different situations is critical for developing a better theory.

Michael Vassar points out that the anthropic shadow argument would suggest finding evidence of past nuclear mass deaths in the fossil record. This is because nuclear war is unlikely to cause total extinction, and people will rebuild afterwards. We could model this as additional population after a nuclear exchange, but less than if there were no nuclear exchange. If this overall additional population is high enough, then, conditional on being in a world with high nuclear risk, a randomly selected observer is probably past the first nuclear exchange. So, if anthropic shadow considerations somehow left us with a high posterior probability of being in a high-latent-risk world with cold wars, we’d still expect not to be before the first nuclear exchange. This is a different way of rejecting the conclusions of ordinary anthropic shadow arguments, orthogonal to the main argument in this post.

Probability pumping and time travel

The LHC post notes the theoretical possibility of probability pumping: if the LHC keeps failing “randomly”, we might conclude that it destroys the world if turned on, and use it as a probability pump, turning it on in some kind of catastrophic scenario. This is similar to quantum suicide.

I won’t explicitly analyze the LHC scenario; it’s largely similar to the cold war scenario. Instead I’ll consider the implications of probability pumping more generally, such as the model of time turners in Yudkowsky’s post “Causal Universes”.

In this proposal, the universe is a cellular automaton, and each use of a time turner creates a loop consisting of an earlier event where some object “comes from the future”, and a later event where that object is “sent back in time”. Many possible cellular automaton histories are searched over to find ones with only consistent time loops: ones where the same object is sent back in time as comes from the future.

The probability distribution over universe histories can’t be modeled as a causal Bayesian network; instead, it can be modeled as a factor graph. To form this factor graph, first create factors for each causally determined variable (that is, each variable not coming back from the future), in the usual way of converting a Bayesian network into a factor graph. Then, add a factor that is 1 if the object going back in time matches the object coming from the future, and 0 otherwise.

For simplicity, I’ll assume that the objects are strings of bits (e.g. writing on a paper), and that when using the time turner, you request some number of bits from the future. The most trivial case is when requesting zero bits; in this case, there are no additional variables in the factor graph (other than, perhaps, a constant variable, which makes no difference to the calculation), and the factor is always 1, since the empty string sent back in time matches the empty string gotten from the future.

What if we request one bit from the future, and send it back exactly? We add a binary variable to the causal graph, and note that the factor is always 1. We’ve doubled the number of possible worlds without changing their weight (product of all factors; probability is proportional to weight). If requesting n bits and sending them back exactly, we multiply the weight by 2^n in the branch where this time turner request is made.

Suppose we flip a coin to determine whether to use a time turner, and the time turner is only used in this scenario. If the coin comes up heads, we request 10 bits from the future and send them back exactly. In the factor graph, all possibilities (tails, heads and got 0000000000, heads and got 0000000001, …) have equal weight; therefore, the vast majority have heads. Accordingly, we would expect to observe the coin come up heads.

Conversely, imagine we request a non-zero number of bits from the time turner, flip all the bits, and send this bitwise negation back to the past. This is necessarily a different bit string, and so the factor will be 0. This means the weight of each universe is 0, and we can’t get a probability distribution by re-normalizing.

If we flip a coin to determine whether to never use the time turner or to use it once to request bits that are flipped before being sent back, then the coin will always come up such that the time turner is not used, since the other possible universe histories have weight 0.
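
A minimal enumeration of the factor-graph weights in these two coin-flip scenarios (a sketch; the only non-causal ingredient is the consistency factor on the time loop):

```python
from itertools import product

def echo(bits):
    # Send back exactly the bits that came from the future.
    return bits

def negate(bits):
    # Flip every bit before sending it back.
    return tuple(1 - b for b in bits)

def p_heads(send_back, n_bits=10):
    """Probability of heads, when heads means requesting n_bits from the future
    and sending back send_back(bits).

    Each history's weight is the product of its factors; the causal factors are
    uniform across histories, so only the consistency factor matters: 1 if the
    bits sent back match the bits received from the future, 0 otherwise.
    """
    histories = [("tails", None)] + [("heads", bits) for bits in product((0, 1), repeat=n_bits)]
    def weight(coin, bits):
        return 1 if coin == "tails" else int(send_back(bits) == bits)
    total = sum(weight(c, b) for c, b in histories)
    heads = sum(weight(c, b) for c, b in histories if c == "heads")
    return heads / total

print(p_heads(echo))    # 1024/1025: the coin almost always comes up heads
print(p_heads(negate))  # 0.0: the coin never comes up heads
```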

Overall, this implies that stochastic events will tend to turn out such that the time turner is used often to send the same bit-string back, and never to send a different bit-string back. This could lead to, for example, a universe where time turners are only invented by agents (such as AGIs) that have enough security to ensure that the time turners are used often to send back the same bit string and never to send back a different bit string. Without security, it’s likely that some time turner, somewhere, would be used to send back a different bit string. We should expect the invention of time turners under insecure conditions to tend to result in universe destruction (setting the weight to 0), since the weights from different time turner uses are multiplicative, and anything multiplied by 0 is 0.

And so, given that we aren’t in a world with enough security to ensure no time turner universe destruction events, it is entirely unsurprising that we see no time turners around, even under the assumption that they’re physically possible!

Conclusion

Reasoning about non-causal factor graphs has a teleological feel to it: stochastic events arrange themselves such that time turners will tend to be used some ways and not other ways in the future. Anthropics involves similar non-causal probabilistic reasoning; if there could be 0 or 1 observers, SSA and SIA agree that we will only observe being in a universe with 1 observer (they disagree about the relative weighting of universes with 1 observer versus more observers, however), which means early universe events are more or less likely depending on the future. SSA additionally implies a subjective form of probability pumping, as in the Adam and Eve thought experiment. Chris Langan’s CTMU generalizes anthropic reasoning to more general teleological principles for modeling the universe, implying the existence of God. The theory is a bit too galaxy-brained for me to accept at this time, although it’s clearly onto something with relating anthropics to teleology.

Philosophically, I would suggest that anthropic reasoning results from the combination of a subjective view from the perspective of a mind, and an objective physical view-from-nowhere. The Kantian a priori (which includes the analytic and the a priori synthetic) is already subjective; Kantian spacetime is a field in which experiential phenomena appear. In reasoning about the probabilities of various universes, we imagine a “view from nowhere”, e.g. where the universe is some random stochastic Turing machine. I’ll call this the “universal a priori”. Put this way, these a prioris are clearly different things. SSA argues that we don’t learn anything upon waking up, and so our subjective prior distribution over universes should match the universal a priori; SIA, meanwhile, argues that we do learn something upon waking up, namely, that our universe is more likely to have a higher population. SSA’s argument is less credible when distinguishing the Kantian a priori from the universal a priori. And, these have to be different, because even SSA agrees that we can’t observe an empty universe; upon waking up, we learn that there is at least one observer.

Teleological reasoning can also show up when considering the simulation hypothesis. If the average technological civilization creates many simulations of its past (or the pasts of alternative civilizations) in expectation, then most observers who see themselves in a technological but not post-singularity world will be in ancestor simulations. This is immediately true under SIA, and is true under SSA in a universe sufficiently large to ensure that at least one civilization creates many ancestor simulations. While there are multiple ways of attempting to reject the simulation argument, one is especially notable: even if most apparently pre-singularity observers are in ancestor simulations, these observers matter less to how the future plays out (and to the distribution of observers’ experiences) than observers who are actually pre-singularity, who have some role in determining how the singularity plays out. Therefore, pragmatically, it makes sense for us to talk as if we probably live pre-singularity; we have more use for money if we live pre-singularity than if we live in an ancestor simulation, so we would rationally tend to bet in favor of being pre-singularity. This reasoning, however, implies that our probabilities depend on how much different agents can influence the future, which is a teleological consideration similar to the one arising with non-causal factor graphs. I’m not sure how to resolve all this yet, but it seems important to work out a more unified theory.

Hell is Game Theory Folk Theorems

[content warning: simulated very hot places; extremely bad Nash equilibria]

(based on a Twitter thread)

Rowan: “If we succeed in making aligned AGI, we should punish those who committed cosmic crimes that sufficiently decreased the chance of a positive singularity.”

Neal: “Punishment seems like a bad idea. It’s pessimizing another agent’s utility function. You could get a pretty bad equilibrium if you’re saying agents should be intentionally harming each others’ interests, even in restricted cases.”

Rowan: “In iterated games, it’s correct to defect when others defect against you; that’s tit-for-tat.”

Neal: “Tit-for-tat doesn’t pessimize, though, it simply withholds altruism sometimes. In a given round, all else being equal, defection is individually rational.”

Rowan: “Tit-for-tat works even when defection is costly, though.”

Neal: “Oh my, I’m not sure if you want to go there. It can get real bad. This is where I pull out the game theory folk theorems.”

Rowan: “What are those?”

Neal: “They’re theorems about Nash equilibria in iterated games. Suppose players play a normal-form game G repeatedly, and are infinitely patient, so they don’t care about their positive or negative utilities being moved around in time. Then, a given payoff profile (that is, an assignment of utilities to players) can be the average per-round payoff profile of some Nash equilibrium of the iterated game, as long as it satisfies two conditions: feasibility, and individual rationality.”

Rowan: “What do those mean?”

Neal: “A payoff profile is feasible if it can be produced by some mixture of payoff profiles of the original game G. This is a very logical requirement. The payoff profile could only be the average of the repeated game if it was some mixture of possible outcomes of the original game. If some player always receives between 0 and 1 utility, for example, they can’t have an average utility of 2 across the repeated game.”

Rowan: “Sure, that’s logical.”

Neal: “The individual rationality condition, on the other hand, states that each player must get at least as much utility in the profile as they could guarantee getting by min-maxing (that is, picking their strategy assuming other players make things as bad as possible for them, even at their own expense), and at least one player must get strictly more utility.”

Rowan: “How does this apply to an iterated game where defection is costly? Doesn’t this prove my point?”

Neal: “Well, if defection is costly, it’s not clear why you’d worry about anyone defecting in the first place.”

Rowan: “Perhaps agents can cooperate or defect, and can also punish the other agent, which is costly to themselves, but even worse for the other agent. Maybe this can help agents incentivize cooperation more effectively.”

Neal: “Not really. In an ordinary prisoner’s dilemma, the (C, C) utility profile already dominates both agents’ min-max utility, which is the (D, D) payoff. So, game theory folk theorems make mutual cooperation a possible Nash equilibrium.”

Rowan: “Hmm. It seems like introducing a punishment option makes everyone’s min-max utility worse, which makes more bad equilibria possible, without making more good equilibria possible.”

Neal: “Yes, you’re beginning to see my point that punishment is useless. But, things can get even worse and more absurd.”

Rowan: “How so?”

Neal: “Let me show you my latest game theory simulation, which uses state-of-the-art generative AI and reinforcement learning. Don’t worry, none of the AIs involved are conscious, at least according to expert consensus.”

Neal turns on a TV and types some commands into his laptop. The TV shows 100 prisoners in cages, some of whom are screaming in pain. A mirage effect appears across the landscape, as if the area is very hot.

Rowan: “Wow, that’s disturbing, even if they’re not conscious.”

Neal: “I know, but it gets even worse! Look at one of the cages more closely.”

Neal zooms into a single cage. It shows a dial that selects a value ranging from 30 to 100; it is currently set to 99.

Rowan: “What does the dial control?”

Neal: “The prisoners have control of the temperature in here. Specifically, the temperature in Celsius is the average of the temperature selected by each of the 100 denizens. This is only a hell because they have made it so; if they all set their dial to 30, they’d be enjoying a balmy temperature. And their bodies repair themselves automatically, so there is no release from their suffering.”

Rowan: “What? Clearly there is no incentive to turn the dial all the way to 99! If you set it to 30, you’ll cool the place down for everyone including yourself.”

Neal: “I see that you have not properly understood the folk theorems. Let us assume, for simplicity, that everyone’s utility in a given round, which lasts 10 seconds, is the negative of the average temperature. Right now, everyone is getting -99 utility in each round. Clearly, this is feasible, because it’s happening. Now, we check if it’s individually rational. Each prisoner’s min-max payoff is -99.3: they set their temperature dial to 30, and since everyone else is min-maxing against them, everyone else sets their temperature dial to 100, leading to an average temperature of 99.3. And so, the utility profile resulting from everyone setting the dial to 99 is individually rational.”

Rowan: “I see how that follows. But this situation still seems absurd. I only learned about game theory folk theorems today, so I don’t understand, intuitively, why such a terrible equilibrium could be in everyone’s interest to maintain.”

Neal: “Well, let’s see what happens if I artificially make one of the prisoners select 30 instead of 99.”

Neal types some commands into his laptop. The TV screen splits to show two different dials. The one on the left turns to 30; the prisoner attempts to turn it back to 99, but is dismayed at it being stuck. The one on the right remains at 99. That is, until 6 seconds pass, at which point the left dial releases; both prisoners set their dials to 100. Ten more seconds pass, and both prisoners set the dial back to 99.

Neal: “As you can see, both prisoners set the dial to the maximum value for one round. So did everyone else. This more than compensated for the left prisoner setting the dial to 30 for one round, in terms of average temperature. So, as you can see, it was never in the interest of that prisoner to set the dial to 30, which is why they struggled against it.”

Rowan: “That just passes the buck, though. Why does everyone set the dial to 100 when someone set it to 30 in a previous round?”

Neal: “The way it works is that, in each round, there’s an equilibrium temperature, which starts out at 99. If anyone puts the dial less than the equilibrium temperature in a round, the equilibrium temperature in the next round is 100. Otherwise, the equilibrium temperature in the next round is 99 again. This is a Nash equilibrium because it is never worth deviating from. In the Nash equilibrium, everyone else selects the equilibrium temperature, so by selecting a lower temperature, you cause an increase of the equilibrium temperature in the next round. While you decrease the temperature in this round, it’s never worth it, since the higher equilibrium temperature in the next round more than compensates for this decrease.”

Rowan: “So, as a singular individual, you can try to decrease the temperature relative to the equilibrium, but others will compensate by increasing the temperature, and they’re much more powerful than you in aggregate, so you’ll avoid setting the temperature lower than the equilibrium, and so the equilibrium is maintained.”

Neal: “Yes, exactly!”

Rowan: “If you’ve just seen someone else violate the equilibrium, though, shouldn’t you rationally expect that they might defect from the equilibrium in the future?”

Neal: “Well, yes. This is a limitation of Nash equilibrium as an analysis tool, if you weren’t already convinced it needed revisiting based on this terribly unnecessarily horrible outcome in this situation. Possibly, combining Nash equilibrium with Solomonoff induction might allow agents to learn each others’ actual behavioral patterns even when they deviate from the original Nash equilibrium. This gets into some advanced state-of-the-art game theory (1, 2), and the solution isn’t worked out yet. But we know there’s something wrong with current equilibrium notions.”

Rowan: “Well, I’ll ponder this. You may have convinced me of the futility of punishment, and the desirability of mercy, with your… hell simulation. That’s… wholesome in its own way, even if it’s horrifying, and ethically questionable.”

Neal: “Well, I appreciate that you absorbed a moral lesson from all this game theory!”
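
(A minimal check of the arithmetic in Neal’s simulation, as a sketch: the min-max payoff for one prisoner, and a two-round comparison showing that a single deviation to 30 doesn’t pay, under the simplifying assumption that the deviator returns to the equilibrium temperature afterwards.)

```python
N = 100  # number of prisoners

def avg_temp(my_dial, others_dial):
    # Average temperature when I pick my_dial and the other 99 all pick others_dial.
    return (my_dial + (N - 1) * others_dial) / N

# Min-max payoff: I set 30 while everyone else min-maxes against me by setting 100.
print(-avg_temp(30, 100))  # -99.3, slightly worse than the -99 everyone gets in equilibrium

# Two-round comparison: conform to the 99-equilibrium in both rounds, versus
# deviate to 30 once, which triggers an equilibrium temperature of 100 next round.
conform = -avg_temp(99, 99) + -avg_temp(99, 99)    # -99.0  + -99.0  = -198.0
deviate = -avg_temp(30, 99) + -avg_temp(100, 100)  # -98.31 + -100.0 = -198.31
print(conform, deviate)  # deviating is strictly worse
```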

Am I trans?

Perhaps the most common question that someone questioning their gender asks is “am I trans?” I asked this question of myself 10 years ago, and have yet to come to a firm conclusion. I mean, I did most of the things one would expect a trans person to do (change name/pronouns, take cross-sex hormones, etc), but that doesn’t really answer the question.

There is a possible model someone could have of the situation, where people have a “gender identity”, which is to some degree fixed by adulthood, and which predicts their thoughts and behaviors. Such a model is, rather primitively, assumed by “gender tests” such as the hilarious COGIATI, and less primitively, by transgender writing such as Natalie Reed’s The Null HypotheCis. Quoting the article:

Cis is treated as the null hypothesis. It doesn’t require any evidence. It’s just the assumed given. All suspects are presumed cisgender until proven guilty of transsexuality in a court of painful self-exploration. But this isn’t a viable, logical, “skeptical” way to approach the situation. In fact it’s not a case of a hypothesis being weighed against a null hypothesis (like “there’s a flying teapot orbiting the Earth” vs. “there is no flying teapot orbiting the Earth”), it is simply two competing hypotheses. Two hypotheses that should be held to equal standards and their likelihood weighed against one another.

When the question is reframed as such, suddenly those self-denials, those ridiculous, painful, self-destructive demands we place on ourselves to come up with “proof” of being trans suddenly start looking a whole lot less valid and rational. When we replace the question “Am I sure I’m trans?” with the question “Based on the evidence that is available, and what my thoughts, behaviours, past and feelings suggest, what is more likely: that I’m trans or that I’m cis?” what was once an impossible, unresolvable question is replaced by one that’s answer is painfully obvious. Cis people may wonder about being the opposite sex, but they don’t obsessively dream of it. Cis people don’t constantly go over the question of transition, again and again, throughout their lives. Cis people don’t find themselves in this kind of crisis. Cis people don’t secretly spend every birthday wish on wanting to wake up magically transformed into the “opposite” sex, nor do they spend years developing increasingly precise variations of how they’d like this wish to be fulfilled. Cis people don’t spend all-nighters on the internet secretly researching transition, and secretly looking at who transitioned at what age, how much money they had, how much their features resemble their own, and try to figure out what their own results would be. Cis people don’t get enormously excited when really really terrible movies that just happen to include gender-bending themes, like “Switch” or “Dr. Jekyl And Mrs. Hyde”, randomly pop up on late night TV, and stay up just to watch them. Etc.

It’s at this point pretty easy for me to say that I’m not cis. I did do the sort of things cis people are described as not doing, in the previous paragraph, and I don’t think most people do most of them. I was assigned male at birth and don’t identify as a man, at least not fully or consistently. Does it follow that I am trans?

This may seem to be an abstruse question: I certainly have behaved a lot like a trans person would be expected to, so what’s the remaining issue? If everyone is either cis or trans, then I am trans with practically total certainty. But the same critical attitude that could lead someone to question their gender or reject the gender binary would also apply to a cis/trans binary.

“Transgender” is often defined as “having a gender identity differing from one’s assigned gender at birth”. Does that apply to me? I found the concept of “gender identity” confusing even as I was starting transition, and wasn’t sure if I had one, even if I had strong preferences about my biological sex characteristics. I didn’t “feel like a woman” or “feel like a man” in anything like a stable way. Apparently, some apparently-cis people have similar feelings. Quoting Cis By Default:

But the thing is… I think that some people don’t have that subjective internal sense of themselves as being a particular gender. There’s no part of their brain that says “I’m a guy!”, they just look around and people are calling them “he” and they go with the flow. They’re cis by default, not out of a match between their gender identity and their assigned gender.

I think you could probably tell them apart by asking them the old “what would you do if you suddenly woke up as a cis woman/cis man?” If they instantly understand why you’d need to transition in that circumstance, they’re regular old cis; if they are like “I’d probably be fine with it actually,” they might be cis by default. (Of course, the problem is that they might be a cis person with a gender identity who just can’t imagine what gender dysphoria would feel like. Unfortunately, I am not allowed to stick random cis men with estrogen and find out how many of them get dysphoric.)

I used to think I was cis by default. That can’t really describe me, given that I was motivated to transition. If I had no stable sense of gender identity at this time, perhaps a nonbinary descriptor such as “genderfluid” or “agender” would fit better? A problem with this terminology is that it implicitly assumes that such a gender condition is uncommon. People usually call themselves that to differentiate themselves from the general assumed-cis population and from presumably-binary transgender people. However, lack of a stable feeling of gender is, I think, rather common in the general population; I remember seeing a study (which I can’t now find) showing that over 15% of people “feel like a man” or “feel like a woman” at different times.

The “cis by default” article describes dysphoria, which especially refers to intense dissatisfaction with not having sex characteristics matching one’s identified gender. While I did feel better and had fewer negative feelings when getting further in transition, I was legitimately uncertain at points about whether I “had dysphoria”. Early on, I thought I might not be “really dysphoric” and accordingly unable to successfully live as a woman, and was intensely sad about this; I interpreted these feelings as “gender dysphoria”, thinking it was possible they would get worse if I didn’t transition. Most of what feelings like this (and others that made me think I might be trans) indicate is that I had/have a strong preference for having a female body and living as a woman; “dysphoria” may in some cases be a way of communicating this kind of preference to people and social systems that only care about sufficiently negative conditions, such as much of the medical system. Unfortunately, requiring people to provide “proof of pain” to satisfy their strong preferences may increase real or perceived pain as a kind of negotiating strategy with gatekeeping systems that are based on confused moral premises.

There is a sizable group that challenges the cis/trans binary but doesn’t consider themselves “nonbinary” per se, who are often labeled as “trans-exclusionary radical feminists”, but usually label themselves as “gender critical”. Although I am certainly “gender critical” by an intensional definition of the term (that is, I’m critical of gender), I’m not by the ordinary language meaning of the term (that is, I’m not a central example of the set of people who self-describe or are described by others as “gender critical”). In particular, so-called “gender criticals” consider biological sex very important for society to assign meaning to and treat people differently on the basis of, which, ironically, reproduces “gender” as defined as “social and cultural meanings assigned to sex”. This is, arguably, its own sort of gender identity, which I don’t share. Instead, my criticism of gender is more like that of queer theorists such as Judith Butler.

I believe there were some points in my life at which I legitimately had a non-male gender identity. At some point, I was convinced that I was actually a woman, and considered this important. I also had experiences in which it was very important to me whether I was categorized as a man or a woman. These experiences tended to be fearful experiences in which I was “objectified”, believing (at least partially correctly) that I was subject to different social threats on the basis of whether I “was” a man or a woman.

I concluded from these experiences that gender identity is a property of objects, not subjects. It is easy to perceive one’s own perceptions of others’ genders; most people, when looking at a person or group of people, can’t help but classify them as men, women, or ambiguous/intermediate. It is rather easier to make these judgments from a distance, than to classify one’s self, the nearest person. Up close, there may be too many details and ambiguities to easily form a simplified judgment of whether one is a man, a woman, or something else; any particular judgment is in a sense “problematic”, since contrary evidence is available immediately at hand. Additionally, being a subject is more like having a lens into the world than being an entity in the world, and so properties of entities, such as gender, do not straightforwardly apply to subjects; what I am saying here has some things in common with Buddhist “no-self” insights, and Kant’s distinction between the self as subject and object.

I believe it is correct to say that, at some previous point in my life, I have been trans. Why, then, would I be uncertain about whether I am presently trans? This is partially due to recent experiences. Due to, among other things, better understanding criticisms of the sort of transgender ideology that I had accepted, coming to believe that gender was a morally irrelevant characteristic, and ketamine depression therapy, I came to be more aware of ways I was unlike typical women and like typical men, and was able to be “chill” about this situation, rather than dysphoric. I considered the question of whether I was a trans woman or an extremely dedicated femboy, finding it to be delightfully meaningless. I experienced contexts in which someone gendering me as male felt pleasant, not dysphoric. I started thinking of myself as non-binary (specifically, an androgyne), and found that this alleviated gender-related stress in my life. I could stop worrying so much about “passing” and about lawyer-y debates (internal and external) about what gender I really was; I had both masculine and feminine characteristics, so calling myself an “androgyne” involved little distortion or selective reporting of the facts.

It was as if I had previously installed a PR module in my mind, to convince myself and others that I was a woman, and I later managed to turn it off, as it was providing little benefit relative to the cost. In some sense, I had predicted this at the start of my transition; I believed that the gender situation of people who were assigned female at birth, identified as non-binary, and did not pursue medical transition, was the sort of gender/sex situation I wanted for myself. My adoption of a binary gender identity and the associated PR module was, in large part, a negotiation with society so that my medical transition and experience being perceived as a woman by others (a type of assimilation) could be accepted. Accordingly, the PR module and binary identity serve less of a function once I have already accomplished this transition. This instrumental understanding of gender identity accords with some feminist thought; quoting the Stanford Encyclopedia of Philosophy:

Bernice Hausman’s Changing Sex: Transsexualism, Technology, and the Idea of Gender (1995) aims to provide a feminist analysis of transsexuality within a Foucauldian paradigm. While her theoretical framework differs markedly from Raymond’s, she also shares Raymond’s concern about transsexuality as well as her deep distrust of medical intervention on the body.

For Hausman, the primary hallmark of transsexuality is the sheer demand for transsexual surgeries through which transsexual subjects are constituted as such (1995, 110). As a consequence, she sees transsexual subjectivity as entirely dependent upon medical technology. In Hausman’s view, transsexuals and doctors work interdependently to produce “the standard account” of transsexuality which serves as a “cover” for the demand for surgery and to justify access to the medical technologies (110, 138–9). Behind the “cover” there is only the problematic demand to, through technology, engineer oneself as a subject. Because of this, Hausman claims that transsexual agency can be “read through” the medical discourse (110).

A corollary of her view is that the very notion of gender (as a psychological entity and cultural role distinguished from sex) is a consequence of medical technology, and in part, the emergence of transsexuality. Rather than arising as a consequence of sexist gender roles, Hausman argues, transsexuality is one of vehicles through which gender itself is produced as an effect of discourses designed to justify access to certain medical technology (140).

I can detect signs of a similar PR module in some transgender people, especially binary-identified trans people early in their transition. They believe, and sometimes say explicitly, that their “narrative” of themselves is very important, and that their well-being depends on their control of this narrative. I can empathize, given that I’ve been there. However, I no longer “buy into” the overall gender framework they are living in. I am too exhausted, after a decade of internal and discursive analysis and deconstruction of various gender frameworks, to care about gender frameworks as I once did.

There is a notable flip in how I’m interpreting the “am I trans?” question now, as opposed to earlier. Earlier, by “am I trans?”, I was asking if I authentically was a woman (or at least, not a man) in a psychological sense. Now, by “am I trans?”, I am asking whether I am manipulating narratives to convince people that I am a woman (or at least not a man). These two notions of “trans” are in some sense opposed, reflecting different simulacra levels.

Quoting more of the SEP article:

Prosser’s strategy for marking a trans theoretical vantage point is to draw a contrast between the centrality of performance (in queer theory) and narrative (for transsexual people). He correctly notes a tendency in postmodern queer theory to raise questions about the political role of narratives (1995, 484). Such narratives may be seen to involve the illusion of a false unity and they may also involve exclusionary politics. Yet narratives, according to Prosser, are central to the accounts of transsexuals and such narratives involve the notion of home and belonging (1995, 488). This appeal to narrative seems in tension with a picture which underscores the fragmentation of coherent narratives into diverse performances and which identifies subversion with the disruption of narrative-based identities. Coherent narratives, even if ultimately fictional, play important intelligibility-conferring roles in the lives of transsexuals, according to Prosser. And this cannot be well-accommodated in accounts which aim to undermine such coherence.

In Prosser’s view, transsexual narratives are driven by a sense of feeling not at home in one’s body, through a journey of surgical change, ultimately culminating in a coming home to oneself (and one’s body) (1995, 490). In this way, the body and bodily discomfort constitute the “depth” or “reality” that stands in contrast to the view that body is sexed through performative gender behavior which constitutes it as the container of gender identity. In light of this, Prosser concludes that queer theory’s use of transsexuals to undermine gender as mere performance fails to do justice to the importance of narrative and belonging in trans identities.

I have, in a sense, transitioned from transsexual to queer; I have constructed a transsexual narrative and later de-constructed it (along with many other narratives, partially during “ego death” experiences), coming to see more of life (especially gender) as an improvisational act beneath the narratives (“all the world’s a stage”). Performative accounts of gender, such as Judith Butler’s, resonate with me in a way they once did not, and gender essentialist narratives (especially transmedicalism) no longer resonate with me as they once did. 

Not needing to do transgender PR is, in a sense, a privilege; if I’ve already accomplished gender transition, I have little need to communicate my situation with psychological narratives as opposed to concrete facts. There is perhaps a risk of me developing a “fuck you, I got mine” attitude, and neglecting to promote the autonomy of people similar to how I was earlier in transition, who have more need for these narratives. At the same time, I don’t need to agree with people’s frameworks to support their autonomy, and criticizing these frameworks can increase the autonomy of people who feel like they shouldn’t transition because they don’t fit standard trans frameworks.

Chilling out about gender does not, of course, negate the strange gender/sex situation I find myself in. I inhabit an ambiguously-sexed transsexual body, which I am happier with than my original body, and which changes how I live and how others perceive me. I cannot return to being cis and inhabiting cis gender frameworks, except by detransitioning, which would be objectively difficult and expensive, and subjectively undesirable.

Inhabiting neither cis gender frameworks nor binary transgender ones, I am illegible. Of course, I often call myself trans and/or a woman, to be at least a little understood within commonly-understood frameworks, but these aren’t statements of ultimate truths. What I have written so far legibilizes my situation somewhat, but I can’t expect everyone I interact with to read this. I could perhaps tell people that I am non-binary, which I do some of the time, although not consistently. While “non-binary” is, by the intensional definition, an accurate descriptor of myself, I still hesitate to use the term, perhaps because it is part of a standard gender framework, created by people in the past, that centers gender identity in a way I do not entirely agree with.

A relevant question becomes: are non-binary people, in general, transgender? Typically, non-binary people are considered transgender, since they aren’t cisgender. But, as I’ve discussed earlier, not everyone fits into a cis/trans binary, and some non-binary people do not feel they fit into this binary either.

The main reason why, despite my apparently-transgender situation, I hesitate to unqualifiedly call myself “trans” is that I do not consider “gender identity” per se to be a centrally important descriptor of me or my situation, and relatedly do not feel the need to selectively present aspects of my situation so as to create the narrative impression that I am any particular gender. I can see that this differentiates me from most people who consider the “trans” label important as a descriptor of themselves, despite obvious similarities in our situations, and despite having once been one of these people.

Perhaps now is a good time to revisit Natalie Reed’s 2013 article, “Trans 101”, which I read years ago and can better understand now due to life experience.

(emphasis mine)

As my thinking developed, my priorities shifted… instead of wanting to simply explain to primarily cis audiences what trans people are, what our experiences are like, why they shouldn’t treat us like shit, and how to treat us better, I wanted to be part of the trans-feminist discourse and try to redefine the entire frameworks of gender and feminism that had led to our explanations, and our fights against cissexism, to be necessary in the first place. I didn’t really feel like simply providing the oppressor class with a new set of vocabularies and concepts was going to be sufficient, and I began to regard the Trans 101 frameworks as themselves destructive.

Was it really all that beneficial to simply add a new set of terms or concepts for gender onto which people could apply assumptions and expectations? New categories of gender, new “roles”, new codified sets of behaviour and new codified sets of assumptions people could have about your history, identity, body and potential that people could misread, or misperceive you as, or misunderstand?

And those basic frameworks were themselves a product of a norma[ti]vity. Yes, it was norma[ti]vity internal to a marginalized category, but that didn’t really matter. All normativities are narrowed to a specific context… A specific system of privileges bred that idea of “what trans people are and want”, which was the same system of privileges that made that the concept of trans I was initially introduced to (and had to subsequently deconstruct), AND the same system of privileges that permitted me the role of introducing it to a specific cis audience.

The Trans 101, as defined by the trans people privileged by a cis system to speak for the “consensus” of a trans community, constructed our existence and its consideration as a choice: the cis person could choose to read and care, and thereby be validated in their self-perception as an “ally” and/or good person who cares about the well-being of others or as a down-to-earth, common sense type who likes to look at things rationally without worrying too much about ultra-minority concerns.

In its entirety, the framework, by being about “what trans people are and want, and the vocabulary to discuss or address us”, as a separate category from an addressed cis audience, positioned apart from the realities of gender as a whole, which reflected on the reality of that cis readership. It left the choice in THEIR hands as to whether to take it or leave it, in relationship to this fundamentally separate identity and segregated category of humanity. It defined us, but defined us separate from rather than illustrative of the human experience of gender, and in so doing gave them new and “sensitive” vocabularies to distinguish us… All the while working within the essentialistic model of gender as primarily an issue of what you are and how you should be understood, all the while specializing us as a subject of study and understanding… all the while placing as its centerpiece the cis choice to be “educated” and to “understand” in contrast to how this an extension of the shared experience of gender.

Consequently “gender identity” was central. Things were consistently framed in inevitably heirarchi[c]al spectrum models. It was essentialized as “brain sex” and “gender identity” (allowing the approach of sex and gender identity to be firmly distinct, and cis people “being a sex” / trans people “having a gender identity”). It defined trans as something you are rather than a way you express yourself, way you live, way you are treated, and way you interpret experiences and feelings. It segregated the experience of “being” trans from all other experiences, however much they modify it: race, class, age, sexual orientation, disability, etc… And it allowed certain political priorities to be considered the needs or causes “of the trans community…

So what does it mean to attempt Trans 101, to attempt explaining trans-variance, in a cultural context in which the “basics” of that question, and the systems of what does and doesn’t get defined as “basic”, have been overwhelmingly a means of our own marginalization, a means of externally limiting the range of our own voice, and a means of reinforcing the kyr[i]archy and privilege internal to our community that keeps it centralized and dominated by specific groups? What does it even mean to attempt to explain a category of experience precisely defined by its own variance, to define something that only exists by virtue of human defiance of having this aspect of human experience defined?

Any kind of statement of “this is what trans is” would be inherently reductive, but reductive statements aren’t necessarily always destructive. The problem is when the reductive simplification presents itself as a sufficient response to the question.

There’s a fundamental tension there that illustrates a lot of the crisis of “Trans 101” and the difficult push-and-pull between deconstruction and simplifications meant for comprehension by a normative, mainstream audience: the tension between the need to explain to the normative, mainstream audience that “it’s more complicated than that”, in response to their received notions about gender, sex and sexuality, while providing them with new notions and models that aren’t “too complicated” to understand. So we end up creating simplifications of our own effort to assert that the experience of gender is complicated. We created little reductive diagrams, outlining a small set of generalized variables, to explain that little reductive diagrams, drawing assumptions about people’s bodies and experiences and identities out of a small set of generalized variables, aren’t adequate. You see the problem?

What makes this an especially poignant problem is the fact that trans experiences and identities are all about new vocabularies and new narratives. In so far as we’re to understand gender as a semiotic system or language, transgenderism is the deviation from standardized language of the dictionary towards new words and new meanings for things that couldn’t be articulated in the previous dialect…In so far as gender is a semiotic system and language, and what attends our assignment isn’t simply a categorization but a modeled and pre-determined, expected narrative for our lives, the act of ‘transition’ is all about challenging language and meanings and narratives. We mean something new, outside the standardized definitions, and we carve out a new story.

Language isn’t a one-way street, however. It’s one thing to say a word, it’s another thing for it [to] mean something… Consequently, translating our existence remains a fundamental part of our existence being heard and seen…

We can’t undermine the entire system of gender. We can’t. Utopian gender-abolitionists believe this, but I don’t. I believe it’s inherent to us. We perceive sexual difference, in others or in ourselves, and we try to understand and express it. That doesn’t seem harmful to me, it just seems human. And over time we develop heuristics for it… ways to make it easier; if one aspect of sexual difference is usually consistent with another, we guess that when we encounter the one in a person we’ll encounter the other. And that’s not itself harmful either, just… simple.

But we also have social orders and kyr[i]archy. We also have patriarchy. And we have diversity of experience. Things get complicated. Human diversity is complicated. I don’t think I’d want it to be simple.

Maybe it’s an inherently broken thing to attempt to articulate the trans experience at all, let alone articulate it to an outside perspective. We ARE the glitches, the new meanings, the problems, the hiccups in the heuristic, the diversity, the variance. Maybe that’s all we need or should communicate about ourselves beyond ourselves: You don’t get it. You won’t get it. We’re something else. We don’t fit. And wherever or whenever we do just means your system still isn’t broad enough, and you still don’t get it.  

But if that’s the case, then ou[r] genders are broken too. To speak, to have a voice… that only counts in so far as you’re heard and understood.

Is that weird little in-between space, hovering right between the need for comprehension and simplification, and the fact that those simplifications will always be misunderstandings and require complications… is that the battlefield? Is that where the meanings are negotiated? Is to be the trans writer to be right there in the position that counts the most in having our genders be understood, ensuring that they count?

No. Fuck no.

Where the battle is, where things matter, that’s in the individual lives. It’s in every single person who in contrast to everything they’ve been told about who and what they are, what that means, what defines it and restricts them to it… in defiance of every expectation that they were saddled with along with the M or F on their birth certificate, like what they’d wear and who they’d fuck and what they’d do for a living and what name they’d have and keep and how their bodies would develop and what they’d choose to do with their bodies… in declaring their own identity, their own body, and carving out a range of their own narratives-to-be… THAT’s where it is. That’s what counts. The fact that it happens at all is a living testament to everything about people that’s worth believing in. And it’s beautiful, every. fucking. time.

And it’s been my honour and privilege to just do my best to help it be noticed.

Reed is describing a kind of “transnormativity”: certain trans people are considered valid educators who can explain what transgenderism is and how people should think, talk, and act regarding gender, and can validate other people (especially cis people) for being “good allies”. 

Cisnormativity has norms like:

  • If you were born male, you’re a boy or man. If you were born female, you’re a girl or woman.
  • Wear clothes appropriate to your gender.
  • Use facilities such as bathrooms in accordance with your gender.
  • Look and behave at least somewhat typically for your gender; don’t shock people by going too far outside the lines.
  • In a patriarchal context, be a “confident” agentic person if you’re a man, and otherwise defer to men.
  • In a feminist context, avoid doing things that might make women uncomfortable if you’re a man, and prefer deferring to women regarding gender.
  • Consider transgenderism to be a very rare phenomenon, of “being trapped in the wrong body”, which almost certainly doesn’t apply to you.

Transnormativity is a reaction to cisnormativity, and has norms like:

  • If someone says they’re a man, try to think and talk about them as if they’re a man; if they say they’re a woman, try to think and talk about them as if they’re a woman; same for non-binary people.
  • Think of yourself as having a “gender identity”, which might or might not match your gender assigned at birth. If it matches, you’re cis, if it doesn’t, you’re trans.
  • Explain your gendered behaviors and attitudes as products of your “gender identity”. Ideally, explain your own gender identity as something that has stayed constant over time.
  • Emphasize that you’re in pain if you can’t live as your identified gender.
  • Look and act somewhat like your identified gender.
  • Don’t argue with others about their gender or try to change their gender identity.
  • If you’re trans, talk about how you feel bad when people think you’re a different gender than you are, and get mad at them sometimes.
  • Think of trans people as a group that is pretty different from cis people and which is oppressed for being trans. If you’re trans, think of yourself as “special”.
  • Use people’s preferred gender pronouns.
  • Support people’s ability to access hormone treatments and surgery, but don’t consider it a necessary condition for being trans.
  • Consider trans people, especially trans people with more sophisticated lefty political views, to be authorities on gender.

To be clear, not all parts of cisnormativity and transnormativity are bad, but they’re both unsatisfactory, restrictive systems of gender. A great deal of both cisnormativity and transnormativity is created by the psychiatric system and its ability to gatekeep transsexual medical procedures, as discussed earlier, and by pro-diversity institutions such as most colleges. “Gender anarchy” is perhaps an alternative to cisnormativity and transnormativity; I’m not sure it has actually been tried, and it might have important problems or turn out not to be stable.

Reed’s article is, I think, even more relevant in 2023 than in 2013, as transgenderism and transgender rights are debated in mainstream political discourse around the world. Some issues that affect trans people are considered core parts of “trans rights” and others aren’t, and the ones that are tend to be the parts of the phenomenon of transgenderism that can be made legible. Her concerns about normative, simplified, harmfully reductive “trans 101” models being the standard for cis people’s validation as trans allies are pertinent to the controversial and at times legally regulated teaching of gender identity in classrooms.

Her criticism of the way trans 101 defines transgender people as a group “separate from rather than illustrative of the human experience of gender”, differentiated by features such as “gender identity”, explains part of what makes the “trans” category and the cis/trans binary problematic. Given that I think I have many gender experiences in common with the general presumably-cis population, I’m reluctant to separate myself into a different category.

So, am I trans? I could decide to answer this question by deciding on a definition of “trans” and determining whether it applies to myself. However, any definition would fail to account for important aspects of people’s situations, and the common meaning of the term will change anyway. Accordingly, I feel more inclined to leave the question open rather than answering it once and for all.