Inverting qualia with group theory

David Chalmers describes the inverted qualia thought experiment, in The Conscious Mind, as an argument against logical supervenience of phenomenal experience on physical states:

one can coherently imagine a physically identical world in which conscious experiences are inverted, or (at the local level) imagine a being physically identical to me but with inverted conscious experiences. One might imagine, for example, that where I have a red experience, my inverted twin has a blue experience, and vice versa. Of course he will call his blue experiences “red,” but that is irrelevant. What matters is that the experience he has of the things we both call “red”—blood, fire engines, and so on—is of the same kind as the experience I have of the things we both call “blue,” such as the sea and the sky. The rest of his color experiences are systematically inverted with respect to mine, in order that they cohere with the red-blue inversion. Perhaps the best way to imagine this happening with human color experiences is to imagine that two of the axes of our three-dimensional color space are switched—the red-green axis is mapped onto the yellow-blue axis, and vice versa. To achieve such an inversion in the actual world, presumably we would need to rewire neural processes in an appropriate way, but as a logical possibility, it seems entirely coherent that experiences could be inverted while physical structure is duplicated exactly. Nothing in the neurophysiology dictates that one sort of processing should be accompanied by red experiences rather than by yellow experiences.

There are quite a lot of criticisms of this sort of argument. Chalmers addresses some of them, such as the idea that this isn’t neurologically plausible for humans (he brings up aliens with more symmetric color neurology as a counter). A principled approach to criticism is to track the semantics of “looking red” through mutually interpretable language, as Wilfrid Sellars does in Empiricism and the Philosophy of Mind. This is, unfortunately, somewhat laborious, and could easily fail to connect with qualia realist intuitions.

I’ll take a more philosophically modest approach: analyzing the consequences of the hypothetical, with group theory. Hopefully, this will make clearer requirements that must be satisfied in the hypothetical, and make progress towards isolating cruxes.

Let’s start by considering a slightly broader space of color qualia operations that includes red/blue inversion. We can think of a color in standard form as a triple of numbers in RGB order. Call an operation that permutes the channels (e.g. swapping red and blue) a channel permutation. The group of channel permutations is S_3, the symmetric group of permutations of a 3-element set. We can write the channel permutations as RGB for the identity, BGR for red/blue inversion, BRG for a red-to-green rotation, and so on; here the i-th letter names the input channel whose value ends up in the i-th output channel, so BRG puts the blue value in the red channel, the red value in the green channel, and the green value in the blue channel. Channel permutations compose as, for example, GRB \cdot BGR = GBR; group composition “applies the right element first”. Each channel permutation is either the identity, a swap of two channels, or a rotation of the channels in the red-to-green or green-to-red direction; there are 6 elements of S_3 in total.
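
To pin down the notation, here is a minimal Python sketch (the string encoding and helper names are mine, not part of the original setup); it checks the composition example above.

    from itertools import permutations

    CHANNELS = "RGB"  # fixed channel order: index 0 is red, 1 is green, 2 is blue

    def apply_perm(p, color):
        # Apply a channel permutation such as "BGR" to an RGB triple:
        # output channel i takes the value of the input channel named p[i].
        return tuple(color[CHANNELS.index(p[i])] for i in range(3))

    def compose(a, b):
        # Group composition a . b: apply b first, then a.
        # The composite is read off from the action on the name triple ("R", "G", "B").
        return "".join(apply_perm(a, apply_perm(b, tuple(CHANNELS))))

    S3 = ["".join(p) for p in permutations(CHANNELS)]   # all 6 channel permutations

    assert compose("GRB", "BGR") == "GBR"                    # the example above
    assert compose("BRG", "BGR") != compose("BGR", "BRG")    # S_3 is not Abelian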

At a high level, we will construct the color qualia space category CQS as the category of functors [S_3, Set], where S_3 is the group construed as a single-object category. This is, of course, highly abstract, so let’s go step by step.

A color qualia space has an associated set of elements. Intuitively, these represent data structures that contain colors. The color qualia space also specifies a way to apply S_3 group elements to these set elements. More formally, a color qualia space is a pair (Q, p) where p : S_3 \times Q \rightarrow Q is group homomorphic in its first argument: p(RGB, q) = q and p(a \cdot b, q) = p(a, p(b, q)). Since p is generally clear from context, we also write p(a, q) as a * q. (Note, p is a group action).

As an example, 100 by 100 images, where each pixel has numbers for each of the three color channels, form a color qualia space, where the group action (permuting channels) maps across each pixel. The function q \mapsto GRB * q performs a red/green swap on its image argument, like a shader.
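
Continuing the sketch above (and assuming numpy), the group action on images is just channel reindexing mapped across pixels:

    import numpy as np

    def act(p, image):
        # Channel permutation applied at every pixel of an (H, W, 3) array:
        # output channel i is the input channel named p[i].
        return image[..., [CHANNELS.index(c) for c in p]]

    img = np.random.rand(100, 100, 3)     # a 100 x 100 RGB image
    swapped = act("GRB", img)             # red/green swap, like a shader

    # Group action laws: the identity acts trivially, and composition is respected.
    assert np.allclose(act("RGB", img), img)
    assert np.allclose(act("GRB", act("BGR", img)), act(compose("GRB", "BGR"), img))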

We want to consider maps between color qualia spaces, but we need to be careful. In the inverted qualia thought experiment, we could imagine that the original and their twin both look at a stop sign, and then yell “RED” if the stop sign’s corresponding color qualia are closest to the red primary color. But then the original and the twin would behave differently, contradicting physical identicality. Going with the hypothetical, their mental operations on their color percepts can’t disambiguate the channels too much. In some sense, operations mapping qualia to qualia (such as, taking their visual field and imagining transformations of it) have to be working isomorphically despite the red-blue inversion.

The concept of equivariance in group theory is a rather strong version of this. In this case, an equivariant map f : Q \rightarrow R between color qualia spaces Q, R has, for any a \in S_3, the equality f(a * x) = a * f(x). Intuitively, this means the function acts symmetrically on channels, not picking out any one as special, and not identifying the chirality of the channels. If the twin’s qualia were red/blue inverted from the original’s before the equivariant map, they remain so after both apply the map.

Color qualia spaces and equivariant maps between them form a category, CQS \cong [S_3, Set]. Let’s quickly list some examples of equivariant maps:

  • Mapping an image to its gray-scale variant.
  • Setting the right half of an image to gray-scale.
  • Mapping a function from channel value to channel value (e.g. restricting to a range) over all channel values in an image.
  • Taking the dominant primary color of the top-left pixel of an image, and then inverting the two other primary colors’ channels throughout the whole image.

And some examples of non-equivariant maps:

  • Mapping an input image to an output image representing text spelling “red”, “green”, or “blue”, depending on the dominant primary color in the input image.
  • Mapping an image q to b * q for a non-trivial S_3 element b, i.e. swapping two channels, or rotating channels.

Let’s examine the last point. Suppose f(y) = b * y is a function on images. Now to check equivariance, we ask if \forall a \in S_3, \forall x, b * a * x = a * b * x. But this is only true when b = RGB, the group identity. Note S_3 is not Abelian.
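
A numerical spot check of these claims, continuing the sketches above: gray-scaling commutes with every channel permutation, while applying a fixed non-identity permutation does not.

    def grayscale(image):
        # Replace each pixel by its channel average, duplicated across channels.
        return np.repeat(image.mean(axis=-1, keepdims=True), 3, axis=-1)

    img = np.random.rand(8, 8, 3)

    # Gray-scaling is equivariant: it commutes with every channel permutation.
    assert all(np.allclose(grayscale(act(a, img)), act(a, grayscale(img))) for a in S3)

    # f(y) = BGR * y does not commute with the rotation BRG: not equivariant.
    f = lambda y: act("BGR", y)
    print(np.allclose(f(act("BRG", img)), act("BRG", f(img))))   # False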

What are the philosophical consequences? We could vaguely imagine that both the original and the twin mentally rotate red qualia towards green qualia, green qualia towards blue qualia, and blue qualia towards red qualia (perhaps imagining a transformation to their visual field). But this operation (BRG) does not commute with the original inversion (BGR).

Suppose a third person tells both the original and the twin: “Imagine applying BRG to your visual field”. The original interprets this “correctly”, so actually does BRG. But the twin thinks, “To apply BRG, in my red (color of blood) channel” — actually blue qualia — “I put the value of my blue (color of ocean) channel” — actually red qualia. So the twin actually implements the opposite rotation, GBR!

Suppose the original and the twin could each apply BRG both actually and “naively” (in the way that causes the twin to implement GBR, as above). Then the original would get the same results either way, while the twin would get different results. Presumably the twin would notice this difference, so we must reject some premise (the original and the twin are supposed to behave the same). And of course, naive application is the more straightforward of the two. So we must conclude that the original and the twin can’t both actually apply BRG.

Let’s further characterize equivariant maps. Given a color qualia space (Q, p), the orbit of an element q \in Q is the set of elements reachable through group actions, \{a * q ~ | ~ a \in S_3\}. Now let the orbit map \pi_Q : Q \rightarrow Q/S_3 map elements to their orbits, effectively quotienting over channel permutations. The orbit map relates to equivariance in the following way: for any equivariant f : Q \rightarrow R, there is a unique function on orbits g : Q/S_3 \rightarrow R/S_3 commuting, g \circ \pi_Q = \pi_R \circ f.

(Why is this true? Note f must map elements of a single Q-orbit to a single R-orbit, which allows defining g commuting. Any alternative choice would fail commutation on some Q-orbit.)

An orbit is itself a color qualia space (a subspace of the original), and must have size 1, 2, 3, or 6 (by the orbit-stabilizer theorem, the orbit size divides |S_3| = 6). We can characterize orbits of a given size as isomorphic to a standard qualia space of that (finite) size. Explicitly:

  • The size-1 qualia space CQ_1 has elements \{ \varnothing \}; it is trivial.
  • The size-2 qualia space CQ_2 has elements \{ L, R \} representing chirality. Channel reflections (RBG, GRB, BGR) flip chirality, rotations preserve it. (Imagine three balls corresponding to primary colors, with sticks connecting them in a triangle; chirality-reversing operations require flipping the triangle over vertically.)
  • The size-3 qualia space CQ_3 has elements \{ R, G, B \} representing primary colors. Group operations work straightforwardly, e.g. BGR * B = R.
  • The size-6 qualia space CQ_6 has as elements total orderings of primary colors (e.g. R > B > G), of which there are 6. Group operations work straightforwardly, e.g. BGR * (R > B > G) = (B > R > G).

Now let’s consider equivariant maps between these standard qualia spaces. Any such map must have one of the following signatures: a space to itself, a space to CQ_1, or CQ_6 to any space. (For example, there are no equivariant maps from CQ_1 to CQ_6.) So we can reduce the 4 \times 4 grid of signatures CQ_i \rightarrow CQ_j to only 9 realizable signatures. 4 of these are clearly trivial, as they map to CQ_1. Exhaustively analyzing the possible equivariant maps for the non-trivial signatures:

  • There are two equivariant maps CQ_2 \rightarrow CQ_2: identity and chirality-reversal.
  • There is only one equivariant map CQ_3 \rightarrow CQ_3, the identity.
  • There are two equivariant maps CQ_6 \rightarrow CQ_2, which assign distinct chiralities to the two cyclic orbits, {(R > G > B), (B > R > G), (G > B > R)} and {(B > G > R), (R > B > G), (G > R > B)}.
  • There are three equivariant maps CQ_6 \rightarrow CQ_3, which pick out either the greatest, middle, or least primary color.
  • There are six equivariant maps CQ_6 \rightarrow CQ_6 corresponding to each of the S_3 permutations applying to the ordering, e.g. swapping first and second places.

In tabular form, cardinalities of equivariant map sets |Hom_{CQS}(CQ_i, CQ_j)| are as follows:

in \ out    CQ_1    CQ_2    CQ_3    CQ_6
CQ_1          1       0       0       0
CQ_2          1       2       0       0
CQ_3          1       0       1       0
CQ_6          1       2       3       6
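
These counts can be checked by brute force. The sketch below (continuing the Python above; the encodings of CQ_2, CQ_3, and CQ_6 are mine) enumerates all functions between the standard spaces and keeps the equivariant ones, reproducing the table.

    from itertools import product as cart

    EVEN = {"RGB", "BRG", "GBR"}                      # the rotations

    def act_color(p, c):
        # Action on a primary color name (CQ_3): e.g. act_color("BGR", "B") == "R".
        return CHANNELS[p.index(c)]

    CQ1 = (["*"], lambda a, q: q)                                      # trivial
    CQ2 = ([+1, -1], lambda a, x: x if a in EVEN else -x)              # chirality
    CQ3 = (list(CHANNELS), act_color)                                  # primary colors
    CQ6 = (["".join(o) for o in permutations(CHANNELS)],               # orderings
           lambda a, o: "".join(act_color(a, c) for c in o))

    def count_equivariant(Q, R):
        (elems_Q, act_Q), (elems_R, act_R) = Q, R
        count = 0
        for values in cart(elems_R, repeat=len(elems_Q)):
            f = dict(zip(elems_Q, values))
            if all(f[act_Q(a, q)] == act_R(a, f[q]) for a in S3 for q in elems_Q):
                count += 1
        return count

    spaces = [("CQ_1", CQ1), ("CQ_2", CQ2), ("CQ_3", CQ3), ("CQ_6", CQ6)]
    for name, Q in spaces:
        print(name, [count_equivariant(Q, R) for _, R in spaces])
    # CQ_1 [1, 0, 0, 0]
    # CQ_2 [1, 2, 0, 0]
    # CQ_3 [1, 0, 1, 0]
    # CQ_6 [1, 2, 3, 6]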

We now have a combinatorial characterization of equivariant maps in general. First determine how f maps input orbits to output orbits. Then for each input orbit \pi_Q(x), determine how f equivariantly maps it to the corresponding output orbit \pi_R(f(x)), which has a finite combinatorial characterization.

In the reverse direction, we can form a valid equivariant map by first choosing a map on orbits g : Q/S_3 \rightarrow R/S_3, then for each input orbit, selecting an equivariant map to the output orbit. Formally, we could write the set of such maps as:

Hom_{CQS}(Q, R) \cong \Pi_{S \in Q/S_3} \Sigma_{T \in R/S_3} Hom_{CQS}(S, T)

where Hom_{CQS}(S, T) is the set of equivariant maps between color qualia spaces S and T, and \Pi, \Sigma are set-theoretic dependent product and sum.

We now have a fairly direct characterization of equivariant maps in CQS. They aren’t exactly characterizable as “functions between quotient spaces” as one might have expected. Instead, they carry extra orbit-to-orbit information, although this information is combinatorially simple for any given pair of orbits.

What are the philosophical implications? To the color qualia realist, CQS decently characterizes mental operations on color qualia that don’t break the physical symmetry, in thought experiments such as inverted qualia. Equivariance is mathematically natural, and rules out non-realizable mental operations such as BRG. Analysis of CQS, including combinatorial characterization of equivariant maps, provides a functional analysis relevant to the physics of the situation, which (according to Chalmers) doesn’t actually involve color qualia as physical entities.

To the color qualia non-realist, the functional characterization of color qualia intuitions through CQS could provide a hint as to a specific error or illusion the color qualia realist is subject to. The non-realist could expect that, once the functional physical component to the intuition is characterized, there is not a remaining reason to expect color qualia to exist above and beyond such physics and physical functions.

I have found CQS to be clarifying with respect to the inverted qualia thought experiment: equivariance provides a mathematically simple constraint on realizability of mental operations in the scenario. In particular, CQS analysis led me to correct an intuition that channel rotations such as BRG would be realizable, and the combinatorial characterization showed (non-obviously) that maps between quotient spaces are insufficient. My own view is that inverted qualia arguments are fairly weak, and that CQS analysis has some relevance to showing the weakness, but the fuller case would require engaging with the relationship between phenomenal experience and belief-formation.

(If you like the idea of a circular “color wheel” rather than a three-channel “color cube”, you may consider the (also non-Abelian) orthogonal group O(2), and the continuous qualia space category [O(2), Set].)

Matrices map between biproducts

Why are linear functions between finite-dimensional vector spaces representable by matrices? And why does matrix multiplication compose the corresponding linear maps? There’s geometric intuition for this, e.g. presented by 3Blue1Brown. I will alternatively present a category-theoretic analysis. The short version is that, in the category of vector spaces and linear maps, products are also coproducts (hence biproducts); and in categories with biproducts, maps between biproducts decompose as (generalized) matrices. These generalized matrices align with traditional numeric matrices and matrix multiplication in the category of vector spaces. The category-theoretic lens reveals matrices as an elegant abstraction, contra The New Yorker.

I’ll use a standard notion of a vector space over the field \mathbb{R}. A vector space has addition, zero, and scalar multiplication defined, which have the standard commutativity/associativity/distributivity properties. The category \mathsf{Vect} has as objects vector spaces (over the field \mathbb{R}), and as morphisms linear maps. A linear map f : U \rightarrow V between vector spaces U, V satisfies f(u_1 + u_2) = f(u_1) + f(u_2) and f(au) = af(u). (Advanced readers may see nlab on Vect.)

Clearly, \mathbb{R} is a vector space, as is 0 (the vector space with only one element, which is zero). 0 is both an initial and a terminal object: any linear map out of 0 must send its only element to the zero vector, so there is exactly one linear map from 0 to any vector space; and any linear map into 0 must return the zero vector, so there is exactly one linear map into 0 from any vector space. Therefore, by definition, 0 is a category-theoretic zero object.

If U and V are vector spaces, then the direct sum U \oplus V, which has as elements pairs (u, v) with u \in U, v \in V, and for which addition and scalar multiplication are element-wise, is also a vector space. The direct sum is both a product and a coproduct.

To show that the direct sum is a product, let U and V be vector spaces, and let \pi_1 : U \oplus V \rightarrow U and \pi_2 : U \oplus V \rightarrow V be the projections of the direct sum onto its components. Let us suppose a third vector space T and linear maps f : T \rightarrow U and g : T \rightarrow V. Let \langle f, g \rangle : T \rightarrow U \oplus V be defined as \langle f, g \rangle (t) = (f(t), g(t)). Now \langle f, g \rangle is the unique linear map satisfying \pi_1 \circ \langle f, g \rangle = f and \pi_2 \circ \langle f, g \rangle = g.

To show that the direct sum is a coproduct, let U and V be vector spaces, and let i_1 : U \rightarrow U \oplus V be defined as i_1(u) = (u, 0), and similarly let i_2 : V \rightarrow U \oplus V be defined as i_2(v) = (0, v). Let us suppose a third vector space W and linear maps f : U \rightarrow W and g : V \rightarrow W. Let [f, g] : U \oplus V \rightarrow W be defined as [f, g](u, v) = f(u) + g(v). Now [f, g] is the unique linear map satisfying [f, g] \circ i_1 = f and [f, g] \circ i_2 = g.
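
Concretely, in \mathsf{Vect}, with U, V, T, W realized as Euclidean spaces and linear maps as numpy matrices (a sketch with arbitrary dimensions of my choosing), pairing is vertical stacking, copairing is horizontal stacking, and the defining equations can be checked numerically:

    import numpy as np

    dim_U, dim_V, dim_T, dim_W = 2, 3, 4, 5
    rng = np.random.default_rng(0)

    # U ⊕ V realized as R^(2+3); projections and injections as matrices.
    pi1 = np.hstack([np.eye(dim_U), np.zeros((dim_U, dim_V))])   # U ⊕ V -> U
    pi2 = np.hstack([np.zeros((dim_V, dim_U)), np.eye(dim_V)])   # U ⊕ V -> V
    i1, i2 = pi1.T, pi2.T                                        # U -> U ⊕ V, V -> U ⊕ V

    # Product: <f, g> stacks f and g vertically, and projects back correctly.
    f = rng.normal(size=(dim_U, dim_T))      # f : T -> U
    g = rng.normal(size=(dim_V, dim_T))      # g : T -> V
    pair = np.vstack([f, g])                 # <f, g> : T -> U ⊕ V
    assert np.allclose(pi1 @ pair, f) and np.allclose(pi2 @ pair, g)

    # Coproduct: [f2, g2] concatenates horizontally, and restricts correctly.
    f2 = rng.normal(size=(dim_W, dim_U))     # f2 : U -> W
    g2 = rng.normal(size=(dim_W, dim_V))     # g2 : V -> W
    copair = np.hstack([f2, g2])             # [f2, g2] : U ⊕ V -> W
    assert np.allclose(copair @ i1, f2) and np.allclose(copair @ i2, g2)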

The upshot is that the direct sum is both a product and a coproduct. Let 0_{A,B} : A \rightarrow B be a zero function; since the direct sum also satisfies the identities \pi_1 \circ i_1 = id_U, \pi_2 \circ i_2 = id_V, \pi_1 \circ i_2 = 0_{V,U}, \pi_2 \circ i_1 = 0_{U,V}, by definition it is a biproduct. We can now abstract from the category \mathsf{Vect} to semiadditive categories, which are by definition categories with a zero object and all pairwise biproducts. Let C stand for any semiadditive category, with biproduct \oplus.

Biproducts enable powerful decomposition of morphisms (such as linear maps). Given h : T \rightarrow U \oplus V (in C), we may uniquely decompose it as h = \langle f, g \rangle for some f : T \rightarrow U and g : T \rightarrow V, specifically f = \pi_1 \circ h, g = \pi_2 \circ h. And similarly, we may uniquely decompose h : U \oplus V \rightarrow W as h = [f, g] for some f : U \rightarrow W and g : V \rightarrow W, specifically f = h \circ i_1, g = h \circ i_2.

Biproducts generalize from binary to n-ary. Suppose n is natural and U_i is an object for natural 1 \leq i \leq n. Now the n-ary biproduct is \bigoplus_{i=1}^n U_i = U_1 \oplus \ldots \oplus U_n. We take the empty biproduct to be 0. We can also generalize the projections \pi_i, the injections i_i, the “row-wise” combination \langle f, g \rangle, and the “column-wise” combination [f, g], from binary to n-ary.

This generalization to n-ary biproducts enables conceiving of matrices categorically. Let m, n be natural, and let U_i and V_j be objects in C, for natural 1 \leq i \leq m and 1 \leq j \leq n. Suppose h : \bigoplus_{i=1}^m U_i \rightarrow \bigoplus_{j=1}^n V_j. We first decompose h “row-wise”, as h = \langle h_1, \ldots, h_n \rangle where h_j = \pi_j \circ h. Then we decompose each row “column-wise”, as h_j = [h_{j,1}, \ldots, h_{j,m}] where h_{j,i} = \pi_j \circ h \circ i_i. We can now write h in matrix style, as h = \langle [ h_{1, 1}, \ldots, h_{1, m} ], \ldots, [h_{n, 1}, \ldots, h_{n, m}] \rangle; the notation \langle \ldots \rangle can be visualized as vertical matrix concatenation, and [ \ldots ] can be visualized as horizontal matrix concatenation.

This is the core abstract idea, but how to apply it more concretely? Back in \mathsf{Vect}, we can form the Euclidean space \mathbb{R}^n = \bigoplus_{i=1}^n \mathbb{R}. Now a map h : \mathbb{R}^m \rightarrow \mathbb{R}^n decomposes as an n \times m matrix of linear maps h_{j, i} : \mathbb{R} \rightarrow \mathbb{R}. This is not quite a traditional matrix, but note that linear maps of type \mathbb{R} \rightarrow \mathbb{R} are always multiplication by a constant real slope. Representing each h_{j, i} by its slope yields a more traditional numeric matrix.
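
As a numpy illustration (the particular map h is an arbitrary example of mine), each entry is the slope of \pi_j \circ h \circ i_i, read off by feeding in standard basis vectors:

    import numpy as np

    def h(x):                                  # a linear map R^3 -> R^2
        x1, x2, x3 = x
        return np.array([2 * x1 - x3, x2 + 5 * x3])

    m, n = 3, 2
    # Entry (j, i) is the slope of pi_j ∘ h ∘ i_i : R -> R, read off at the
    # i-th standard basis vector.
    H = np.array([[h(np.eye(m)[i])[j] for i in range(m)] for j in range(n)])
    # H == [[ 2., 0., -1.],
    #       [ 0., 1.,  5.]]

    x = np.array([1.0, 2.0, 3.0])
    assert np.allclose(H @ x, h(x))            # the matrix reproduces h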

We can generalize matrix representation of linear maps to finite-dimensional vector spaces (which by definition have finite bases), by noting that each of these is isomorphic to \mathbb{R}^n for some natural n. Specifically, if a space U has a basis \{u_1, \ldots, u_n\}, then the linear map f : \mathbb{R}^n \rightarrow U defined as f(x_1, \ldots, x_n) = \sum_{i=1}^n x_i u_i is an isomorphism. Hence, matrix representation extends to maps between finite-dimensional vector spaces.

So far, we have a treatment of matrix-vector multiplication, but not matrix-matrix multiplication. We would like to show that composition of linear maps leads to the matrix representations multiplying in the expected way. Let m, n, p be natural, and let U_i, V_j, W_k be objects in C (for naturals 1 \leq i \leq m, 1 \leq j \leq n, 1 \leq k \leq p). Suppose we have maps f : \bigoplus_{i=1}^m U_i \rightarrow \bigoplus_{j=1}^n V_j and g : \bigoplus_{j=1}^n V_j \rightarrow \bigoplus_{k=1}^p W_k. We can write f in matrix form (f_{j, i} = \pi_j \circ f \circ i_i), and similarly g (g_{k, j} = \pi_k \circ g \circ i_j).

Now we wish to find the matrix form of the composition h = g \circ f. We fix i, k and consider the entry h_{k, i} = \pi_k \circ g \circ f \circ i_i. Now note \pi_k \circ g = [g_{k, 1}, \ldots, g_{k, n}] and f \circ i_i = \langle f_{1, i}, \ldots, f_{n, i} \rangle. Therefore

h_{k, i} = [g_{k, 1}, \ldots, g_{k, n}] \circ \langle f_{1, i}, \ldots, f_{n, i} \rangle

This expression is a row-column matrix multiplication, similar to a vector dot product. In the case of \mathsf{Vect}, we can more explicitly write:

h_{k, i}(u) = \sum_{j=1}^n g_{k, j}(f_{j, i}(u))

Since Hom(U, V) in \mathsf{Vect}, the set of linear maps from U to V, naturally forms a vector space, h_{k, i} can also be written:

h_{k, i} = \sum_{j=1}^n (g_{k, j} \circ f_{j, i})

In the case where each U_i, V_j, W_k is \mathbb{R}, this aligns with traditional matrix multiplication; composing linear maps of type \mathbb{R} \rightarrow \mathbb{R} multiplies their slopes.
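
Numerically (a sketch with random matrices), the row-column formula agrees with both the standard matrix product and the composition of the underlying maps:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, p = 4, 3, 2
    F = rng.normal(size=(n, m))                # matrix of f : R^m -> R^n
    G = rng.normal(size=(p, n))                # matrix of g : R^n -> R^p

    # h_{k,i} = sum_j g_{k,j} * f_{j,i}
    H = np.array([[sum(G[k, j] * F[j, i] for j in range(n)) for i in range(m)]
                  for k in range(p)])

    x = rng.normal(size=m)
    assert np.allclose(H, G @ F)               # the standard matrix product
    assert np.allclose(H @ x, G @ (F @ x))     # represents the composition g ∘ f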

There is a way to generalize the row-column matrix multiplication [g_{k, 1}, \ldots, g_{k, n}] \circ \langle f_{1, i}, \ldots, f_{n, i} \rangle = \sum_{j=1}^n (g_{k, j} \circ f_{j, i}) to semiadditive categories in general; for details, see Wikipedia on additive categories.

To summarize general lessons about semiadditive categories:

  • A map between biproducts h : \bigoplus_{i=1}^m U_i \rightarrow \bigoplus_{j=1}^n V_j can be represented in matrix form as \langle [h_{1, 1}, \ldots, h_{1, m}], \ldots, [h_{n, 1}, \ldots, h_{n, m}] \rangle with h_{j, i} = \pi_j \circ h \circ i_i.
  • If we have maps f : \bigoplus_{i=1}^m U_i \rightarrow \bigoplus_{j=1}^n V_j and g : \bigoplus_{j=1}^n V_j \rightarrow \bigoplus_{k=1}^p W_k, the matrix entry h_{k, i} of the composition h = g \circ f is the row-column multiplication [g_{k, 1}, \ldots, g_{k, n}] \circ \langle f_{1, i}, \ldots, f_{n, i} \rangle.

And in the case of \mathsf{Vect}, these imply the standard results:

  • Linear maps between finite-dimensional vector spaces, with fixed bases, are uniquely represented as numeric matrices.
  • The matrix representation of a composition of linear maps equals the product of the matrices representing these maps.

This is a nice way of showing the standard results, and the abstract results generalize to other semiadditive categories, such as the category of Abelian groups. For more detailed category-theoretic study of linear algebra, see Filip Bár’s thesis, “On the Foundations of Geometric Algebra”. For an even more abstract treatment, see “Graphical Linear Algebra”.

Homomorphically encrypted consciousness and its implications

I present a step-by-step argument in philosophy of mind. The main conclusion is that it is probably possible for conscious homomorphically encrypted digital minds to exist. This has surprising implications: it demonstrates a case where “mind exceeds physics” (epistemically), which implies the disjunction “mind exceeds reality” or “reality exceeds physics”. The main new parts of the discussion consist of (a) an argument that, if digital computers are conscious, so are homomorphically encrypted versions of them (steps 7-9); (b) speculation on the ontological consequences of homomorphically encrypted consciousness, in the form of a trilemma (steps 10-11).

Step 1. Physics

Let P be the set of possible physics states of the universe, according to “the true physics”. I am assuming that the intellectual project of physics has an idealized completion, which discovers a theory integrating all potentially accessible physical information. The theory will tend to be microscopic (although not necessarily strictly) and lawful (also not necessarily strictly). It need not integrate all real information, as some such information might not be accessible (e.g. in the case of the simulation hypothesis).

Rejecting this step: fundamental skepticism about even idealized forms of the intellectual project of physics; various religious/spiritual beliefs.

Step 2. Mind

Let M be the set of possible mental states of minds in the universe. Note, an element of M specifies something like a set or multiset of minds, as the universe could contain multiple minds. We don’t need M to be a complete theory of mind (specifying color qualia and so on); the main concern is doxastic facts, about beliefs of different agents. For example, I believe there is a wall behind me; this is a doxastic mental fact. This step makes no commitment to reductionism or non-reductionism. (Color qualia raise a number of semantic issues extraneous to this discussion; it is sufficient for now to consider mental states to be quotiented over any functionally equivalent color inversion/rotations, as these make no doxastic differences.)

Rejecting this step: eliminativism, especially eliminative physicalism.

Step 3. Reality

Let R be the set of possible reality states, according to “the true reality theory”. To motivate the idea, physics (P) only includes physical facts that could in principle be determined from the contents of our universe. There would remain basic ambiguities about the substrate, such as multiverse theories, or whether our universe exists in a computer simulation. R represents “the true theory of reality”, whatever that is; it is meant to include enough information to determine all that is real. For example, if physicalism is strictly true, then R = P, or is at least isomorphic. Solomonoff induction, and similarly the speed prior, posit that reality consists of an input to a universal Turing machine (specifying some other Turing machine and its input), and its execution trajectory, producing digital subjective experience.

Let f : R \rightarrow P specify the universe’s physical state as a function of the reality state. Let g : R \rightarrow M specify the universe’s mental state as a function of the reality state. These presumably exist under the above assumptions, because physics and mind are both aspects of reality, though these need not be efficiently computable functions. (The general structure of physics and mind being aspects of reality is inspired by neutral monism, though it does not necessitate neutral monism.)

Rejecting this step: fundamental doubt about the existence of a reality on which mind and physics supervene; incompatibilism between reality of mind and of physics.

Step 4. Natural supervenience

Similar to David Chalmers’s concept in The Conscious Mind. Informally, every possible physical state has a unique corresponding mental state. Formally:

\forall (p : P), (\exists r : R, f(r) = p) \rightarrow (\exists! (m : M), \forall (r_2 : R), (f(r_2) = p) \rightarrow (g(r_2) = m))

Here \exists! means “there exists a unique”.

Assuming ZFC and natural supervenience, there exists a function h : P \rightarrow M making the triangle commute (h \circ f = g), though again, h need not be efficiently computable.

Natural supervenience is necessary for it to be meaningful to refer to the mental properties corresponding to some physical entity. For example, to ask about the mental state corresponding to a physical dog. Natural supervenience makes no strong claim about physics “causing” mind; it is rather a claim of constant conjunction, in the sense of Hume. We are not ruling out, for example, physics and mind being always consistent due to a common cause.

Rejecting this step: Interaction dualism. “Antenna theory”. Belief in P-zombies as not just logically possible, but really possible in this universe. Belief in influence of extra-physical entities, such as ghosts or deities, on consciousness.

Step 5. Digital consciousness

Assume it is possible for a digital computer running a program to be conscious. We don’t need to make strong assumptions about “abstract algorithms being conscious” here, just that realistic physical computers that run some program (such as a brain emulation) contain consciousness. This topic has been discussed to death, but to briefly say why I think digital computer consciousness is possible:

  • The mind not being digitally simulable in a behaviorist manner (accepting normal levels of stochasticity/noise) would imply hypercomputation in physics, which is dubious.
  • Chalmers’s fading qualia argument implies that, if a brain is gradually transformed into a behaviorally equivalent simulation, and the simulation is not conscious, then qualia must fade either gradually or suddenly; both are problematic.
  • Having knowledge that no digital computer can be conscious would imply we have knowledge of ultimate reality r : R, specifically, that we do not exist in a digital computer simulation. While I don’t accept the simulation hypothesis as likely, it seems presumptuous to reject it on philosophy of mind grounds.

Rejecting this step: Brains as hypercomputers; or physical substrate dependence, e.g. only organic matter can be conscious.

Step 6. Real-physics fully homomorphic encryption is possible

Fully homomorphic encryption allows running a computation in an encrypted manner, producing an encrypted output; knowing the physical state of the computer and the output, without knowing the key, is insufficient to determine details of the computation or its output in physical polynomial time. Physical polynomial time is polynomial time with respect to the computing power of physics, BQP according to standard theories of quantum computation. Homomorphic encryption is not proven secure; its security rests on unproven computational hardness assumptions (even P != NP is not proven). However, quantum-resistant homomorphic encryption, e.g. based on lattices, is an active area of research, and is generally believed to be possible. This assumption says that (a) quantum-resistant homomorphic encryption is possible and (b) quantum-resistance is enough; physics doesn’t have more computing power than quantum. Or alternatively, non-quantum FHE is possible, and quantum computers are impossible. Or alternatively, the physical universe’s computation is more powerful than quantum, and yet FHE resisting it is still possible.
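
As a toy illustration of computing on ciphertexts only (this is not FHE and not secure; textbook RSA is merely multiplicatively homomorphic, whereas actual FHE schemes, typically lattice-based, evaluate arbitrary circuits; Python 3.8+ assumed for the modular inverse):

    # Textbook RSA with tiny parameters: Enc(a) * Enc(b) mod N decrypts to a * b.
    p, q, e = 61, 53, 17
    N = p * q
    d = pow(e, -1, (p - 1) * (q - 1))          # private exponent

    def enc(m):
        return pow(m, e, N)

    def dec(c):
        return pow(c, d, N)

    a, b = 42, 7
    c = (enc(a) * enc(b)) % N                  # multiply the ciphertexts only
    assert dec(c) == (a * b) % N               # ... and the product comes out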

Rejecting this step: Belief that the physical universe has enough computing power to break any FHE scheme in polynomial time. Non-standard computational complexity theory (e.g. P = NP), cryptography, or physics.

Step 7. Homomorphically encrypted consciousness is possible

(Original thought experiment proposed by Scott Aaronson.) 

Assume that a conscious digital computer can be homomorphically encrypted, and still be conscious, if the decryption key is available nearby. Since the key is nearby, the homomorphic encryption does not practically obscure anything. It functions more as a virtualization layer, similar to a virtual machine. If we already accept digital computer consciousness as possible, we need to tolerate some virtualization, so why not this kind?

An intuition backing this assumption is “can’t get something from nothing”. If we decrypt the output, we get the results that we would have gotten from running a conscious computation (perhaps including the entire brain emulation state trajectory in the output), so we by default assume consciousness happened in the process. We got the results without any fancy brain lesioning (to remove the seat of consciousness while preserving functional behavior), just a virtualization step.

As a concrete example, consider if someone using brain emulations as workers in a corporation decided to homomorphically encrypt the emulation (and later decrypt the results with a key on hand), to get the results of the work, without any subjective experience of work. It would seem dubious to claim that no consciousness happened in the course of the work (which could even include, for example, writing papers about consciousness), due to the homomorphic encryption layer.

As with digital consciousness, if we knew that homomorphically encrypted computations (with a nearby decryption key) were not conscious, then we would know something about ultimate reality, namely that we are not in a homomorphically encrypted simulation.

Rejecting this step: Picky quasi-functionalism. Enough multiple realizability to get digital computer consciousness, but not enough to get homomorphically encrypted consciousness, even if the decryption key is right there.

Step 8. Moving the key further away doesn’t change things

Now that the homomorphically encrypted conscious mind is separated from the key, consider moving the key 1 centimeter further away. We assume this doesn’t change the consciousness of the system, as long as the key is no more than 1 light-year away, so that it is in principle possible to retrieve the key. We can iterate to move the key 1 light-year away in small steps, without changing the consciousness of the overall system.

As an intuition, suppose to the contrary that the computation with the nearby key was conscious, but the computation with the far-away key was not. We run the computation, still encrypted, to completion, while the key is far away. Then we bring the key back and decrypt the output. It seems we “got something from nothing” here: we got the results of a conscious computation with no corresponding consciousness, and no fancy brain lesioning, just a virtualization layer with extra steps.

Rejecting this step: Either a discrete jump where moving the key 1 cm removes consciousness (yet consciousness can be brought back by moving the key back 1cm?), or a continuous gradation of diminished consciousness across distance, though somehow making no behavioral difference.

Step 9. Deleting a far-away key doesn’t change things

Suppose the system of the encrypted computation and the far-away key is conscious. Now suppose the key is destroyed. Assume this doesn’t affect the system’s consciousness: the encrypted computation by itself, with no key anywhere in the universe, is still conscious.

This assumption is based on locality intuition. Could my consciousness depend directly on events happening 1 light-year away, which I have no way of observing? If my consciousness depended on it in a behaviorally relevant way, then that would imply faster-than-light communication. So it can only depend on it in a behaviorally irrelevant way, but this presents similar problems as with P-zombies.

We could also consider a hypothetical where the key is destroyed, but then randomly guessed or brute-forced later. Does consciousness flicker off when the key is destroyed, then on again as it is guessed? Not in any behaviorally relevant way. We did something like “getting something from nothing” in this scenario, except that the key-guessing is real computational work. The idea that key-guessing is itself what is producing consciousness is highly dubious, due to the dis-analogy between the computation of key-guessing and the original conscious computation.

Rejecting this step: Consciousness as a non-local property, affected by far-away events, though not in a way that makes any physical difference. Global but not local natural supervenience.

Step 10. Physics does not efficiently determine encrypted mind

If a homomorphically encrypted mind (with no decryption key) is conscious, and has mental states such as belief, it seems it knows things (about its mental states, or perhaps mathematical facts) that cannot be efficiently determined from physics, using the computation of physics and polynomial time. Physical omniscience about the present state of the universe is insufficient to decrypt the computation. This is basically re-stating that homomorphic encryption works.

Imagine you learn you are in such an encrypted computation. It seems you know something that a physically omniscient agent doesn’t know except with super-polynomial amounts of computation: the basic contents of your experience, which could include the decryption key, or the solution to a hard NP-complete problem.

There is a slight complication, in that perhaps the mental state can be determined from the entire trajectory of the universe, as the key was generated at some point in the past, even if every trace of it has been erased. However, in this case we are imagining something like Laplace’s demon looking at the whole physics history; this would imply that past states are “saved”, efficiently available to Laplace’s demon. (The possibility of real information, such as the demon’s memory of the physical trajectory, exceeding physical information, is discussed later; “Reality exceeds physics, informationally”.)

If locality of natural supervenience applies temporally, not just spatially, then the consciousness of the homomorphically encrypted computation can’t depend directly on the far past, only at most the recent past. In principle, the initial state of the homomorphically encrypted computation could have been “randomly initialized”, not generated from any existent original key, although of course this is unlikely.

So I assume that, given the steps up to here, the homomorphically encrypted mind really does know something (e.g. about its own experiences/beliefs, or mathematical facts) that goes beyond what can be efficiently inferred from physics, given the computing power of physics.

Rejecting this step: Temporal non-locality. Mental states depend on distinctions in the distant physical past, even though these distinctions make no physical or behavioral difference in the present or recent past. Doubt that the randomly initialized homomorphically encrypted mind really “knows anything” beyond what can be efficiently determined from physics, even reflexive properties about its own experience.

Step 11. A fork in the road

A terminological disambiguation: by P-efficiently computable, I mean computable in polynomial time with respect to the computing power of physics, which is BQP according to standard theories. By R-efficiently computable, I mean computable in polynomial time with respect to the computing power of reality, which is at least that of physics, but could in principle be higher, e.g. if our universe was simulated in a universe with beyond-quantum computation.

If assumptions so far are true, then there is no P-efficiently computable h : P \rightarrow M mapping physical states to mental states, corresponding to the natural supervenience relation. This is because, in the case of homomorphically encrypted computation, h would have to run in P-super-polynomial time. This can be summarized as “mind exceeds physics, epistemically”: some mind in the system knows something that cannot be P-efficiently determined from physics, such as the solution to some hard NP-complete problem.

Now we ask a key question: Is there an R-efficiently computable g : R \rightarrow M mapping reality states to mental states, and if so, is there a P-efficiently computable g?

Path A: Mind exceeds reality

Suppose there is no R-efficiently computable g (from which it follows that there is no P-efficiently computable g). That is, even given omniscience about ultimate reality, and polynomial computation with respect to the computation of reality (which is at least as strong as that of physics, perhaps stronger), it is still not possible to know all about minds in the universe, and in particular, details of the experience contained in a homomorphically encrypted computation. Mind doesn’t just exceed physics; mind exceeds reality.

Again, imagine you learn you are in a homomorphically encrypted computation. You look around you and it seems you see real objects. Yet these objects’ appearances can’t be R-efficiently determined on the basis of all that is real. Your experiences seem real, but they are more like “potentially real”, similar to hard-to-compute mathematical facts. Yet you are in some sense physically embodied; cracking the decryption key would reveal your experience. And you could even have correct beliefs about the key, having the requisite mathematical knowledge for the decryption. You could even have access to and check the solution to a hard NP-complete problem that no one else knows the solution to; does this knowledge not “exist in reality” even though you have access to it and can check it?

Something seems unsatisfactory about this, even if it isn’t clearly wrong. If we accept step 2 (existence of mind), rejecting eliminativism, then we accept that mental facts are in some sense real. But here, they aren’t directly real in the sense of being R-efficiently determined from reality. It is as if an extra computation (search or summation over homomorphic embeddings?) is happening to produce subjective experience, yet there is nowhere in reality for this extra computation to take place. The point of positing physics and/or reality is partially to explain subjective experience, yet here there is no R-efficient explanation of experience in terms of reality.

Path B: Reality exceeds physics, computationally

Suppose g : R \rightarrow M is R-efficiently computable, but not P-efficiently computable. Then the real substrate computes more powerfully than physics (given polynomial time in each case). Reality exceeds physics: there really is a more powerful computing substrate than is implied by physics.

As a possibility argument, consider that a Turing-computable universe, such as Conway’s Game of Life, can be simulated in this universe. Reality contains at least quantum computing, since our universe (presumably) supports it. This would allow us to, for example, decrypt the communications of Conway’s Game of Life lifeforms who use RSA.

So we can’t easily rule out that the real substrate has enough computation to efficiently determine the homomorphically encrypted experience, despite physics not being this powerful. This would contradict strict physicalism. It could open further questions about whether homomorphic encryption is possible in the substrate of reality, though of course in theory something analogous to P = NP could apply to the substrate.

Path C: Reality exceeds physics, informationally

Suppose instead that g : R \rightarrow M is P-efficiently computable (and therefore also R-efficiently computable). Then physicalism is strictly false: R contains more accessible information than P. There is real information, exceeding the information of physics, which is sufficient to P-efficiently determine the mental state of the conscious mind in the homomorphically encrypted computation. Perhaps reality has what we might consider “high-level information” or a “multi-level map”. Maybe reality has a category theoretic and/or universal algebraic structure of domains and homomorphisms between them.

According to this path, reductionism is not strictly true. Mental facts could be “reduced” to physical facts sufficient to re-construct them (by natural supervenience). However, there is no efficient re-construction; the reduction destroys P-computation-bounded information even though it destroys no computation-unbounded information. Hence, since reality P-efficiently determines subjective experiences, unlike physics, it contains information over and above physics.

HashLife is inspirational, in its informational preservation and use of high-level features, while maintaining the expected low-level dynamics of Conway’s Game of Life. Though this is only a loose analogy.

Conclusion

Honestly, I don’t know what to think at this point. I feel pretty confident about conscious digital computers being possible. The homomorphic encryption step (with a key nearby) seems to function as a virtualization step, so I’m willing to accept that, though it introduces complications. I am pretty sure moving the key far away, then deleting it, doesn’t make a difference; denying either would open up too many non-locality paradoxes. So I do think a homomorphically encrypted computation, with no decryption key anywhere, is probably conscious, though ordinary philosophical uncertainty applies.

That leads to the fork in the road. Path A (mind exceeds reality) seems least intuitive; it implies actual minds can “know more” than reality, e.g. know mathematical facts not R-efficiently determinable from reality. It seems dogmatic to be confident in either path B or C; both paths imply substantial facts about the ultimate substrate. Path B seems to have the fewest conceptual problems: unlike path C, it doesn’t require positing the informational existence of “high-level” homomorphic levels above physics. However, attributing great computational power to the real substrate would have anthropic implications: why do we seem to be in a quantum-computing universe, if the real substrate can support more advanced computations?

Path C is fun to imagine. What if some of what we would conceive of as “high-level properties” really exist in the ultimate substrate of reality, and reductionism simply assumes away this information, with invalid computational consequences? This thought inspires ontological wonder.

In any case, the disjunction of path B or C implies that strict physicalism is false, which is theoretically notable. If B or C is correct, reality exceeds physics one way or another, computationally and/or informationally. Ordinary philosophical skepticism applies, but I accept the disjunction B \vee C as the mainline model. (Note that Chalmers believes natural supervenience holds but that strict physicalism is false.)

As an end note, there is a general “trivialism” objection to functionalism, in that many physical systems, such as rocks, can be interpreted as running any of a great number of computations. Chalmers has discussed causal solutions; Jeff Buechner has discussed computational complexity solutions (in Gödel, Putnam, and Functionalism), restricting interpretations to computationally realistic ones, e.g. not interpreting a rock as solving the halting problem. Trivialism and solutions to it are of course relevant to attributing mental or computational properties to a computer running a homomorphically encrypted computation.

(thanks to @adrusi for an X discussion leading to many of these thoughts)

A philosophical kernel: biting analytic bullets

Sometimes, a philosophy debate has two basic positions, call them A and B. A matches a lot of people’s intuitions, but is hard to make realistic. B is initially unintuitive (sometimes radically so), perhaps feeling “empty”, but has a basic realism to it. There might be third positions that claim something like, “A and B are both kind of right”.

Here I would say B is the more bullet-biting position. Free will vs. determinism is a classic example: hard determinism is biting the bullet. One interesting thing is that free will believers (including compatibilists) will invent a variety of different theories to explain or justify free will; no one theory seems clearly best. Meanwhile, hard determinism has stayed pretty much the same since ancient Greek fatalism.

While there are some indications that the bullet-biting position is usually more correct, I don’t mean to make an overly strong statement here. Sure, position A (or a compatibility between A and B) could really be correct, though the right formalization hasn’t been found. But I am interested in what views result from biting bullets at every stage, nonetheless.

Why consider biting multiple bullets in sequence? Consider an analogy: a Christian fundamentalist considers whether Christ’s resurrection didn’t really happen. He reasons: “But if the resurrection didn’t happen, then Christ is not God. And if Christ is not God, then humanity is not redeemed. Oh no!”

There’s clearly a mistake here, in that a revision of a single belief can lead to problems that are avoided by revising multiple beliefs at once. In the Christian fundamentalist case, atheists and non-fundamentalists already exist, so it’s pretty easy not to make this mistake. On the other hand, many of the (explicit or implicit) intuitions in the philosophical water supply may be hard to think outside of; there may not be easily identifiable “atheists” with respect to many of these intuitions simultaneously.

Some general heuristics. Prefer ontological minimality: do not explode types of entities beyond necessity. Empirical plausibility: generally agree with well-established science and avoid bold empirical claims; at most, cast doubt on common scientific background assumptions (see: Kant decoupling subjective time from clock time). Un-creativity: avoid proposing speculative, experimental frameworks for decision theory and so on (they usually don’t work out).

What’s the point of all this? Maybe the resulting view is more likely true than other views. Even if it isn’t true, it might be a minimal “kernel” view that supports adding more elements later, without conflicting with legacy frameworks. It might be more productive to argue against a simple, focused, canonical view than a popular “view” which is really a disjunctive collection of many different views; bullet-biting increases simplicity, hence perhaps being more productive to argue against.

Causality: directed acyclic graph multi-factorization

Empirically, we don’t see evidence of time travel. Events seem to proceed from past to future, with future events being at least somewhat predictable from past events. This can be modeled with probabilistic graphical models. Bayesian networks have a directed acyclic graph factorization (which can be topologically sorted, perhaps in multiple ways), while factor graphs in general don’t. (For example, it is possible to express, as a factor graph, the distribution of a Bayesian network conditioned on some variable taking some value; the factor graph then expresses something like “teleology”, events tending to happen more when they are compatible with some future possibility.)

This raises the issue that there are multiple Bayesian networks with different graphs expressing the same joint distribution. For ontological minimality, we could say these are all valid factorizations (so there is no “further fact” of what is the real factorization, in cases of persistent empirical ambiguity), though of course some have analytically nicer mathematical properties (locality, efficient computability) than others. Each non-trivial DAG factorization has mathematical implications about the distribution; we need not forget these implications even though there are multiple DAG factorizations.
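
A minimal two-variable illustration (the numbers are arbitrary) that distinct DAG factorizations can express the same joint distribution:

    import numpy as np

    joint = np.array([[0.30, 0.10],            # P(A=a, B=b); rows index a, columns b
                      [0.15, 0.45]])

    # Factorization with DAG A -> B: P(A) P(B | A)
    p_A = joint.sum(axis=1)
    p_B_given_A = joint / p_A[:, None]
    # Factorization with DAG B -> A: P(B) P(A | B)
    p_B = joint.sum(axis=0)
    p_A_given_B = joint / p_B[None, :]

    assert np.allclose(p_A[:, None] * p_B_given_A, joint)
    assert np.allclose(p_B[None, :] * p_A_given_B, joint)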

Bayesian networks can be generalized to probabilistic programming, e.g. some variables may only exist dependent on specific values for previous variables. This doesn’t change the overall setup much; the basic ideas are already present in Bayesian networks.

We now have a specific disagreement with Judea Pearl: he operationalizes causality in terms of consequences of counterfactual intervention. This is sensitive to the graph order of the directed acyclic graph; hence, causal graphs express more information than the joint distribution. For ontological minimality, we’ll avoid reifying causal counterfactuals and hence causal graphs. Causal counterfactuals have theoretical problems, such as implying violations of physical law, hence being un-determined by empirical science (as we can’t observe what happens when physical laws are violated). We avoid these problems by not believing in causal counterfactuals.

Since causal counterfactuals are about non-actual universes, we don’t really need them to make the empirical predictions of causal models, such as no time travel. DAG factorization seems to do the job.

Laws of physics: universal satisfaction

Given a DAG model, some physical invariants may hold, e.g. conservation of energy. And if we transform the DAG model to one expressing the same joint distribution, the physical invariants translate. They always hold for any configuration in the DAG’s support.

Do the laws have “additional reality” beyond universal satisfaction? It doesn’t seem we need to assume they do. We predict as if the laws always hold, but that reduces to a statement about the joint configuration; no extra predictive power results from assuming the laws have any additional existence.

So for ontological minimality, the reality of a law can be identified with its universal satisfaction by the universe’s trajectory. (This is weaker than notions of “counterfactual universal satisfaction across all possible universes”.)

This enables us to ask questions similar to counterfactuals: what would follow (logically, or with high probability according to the DAG) in a model in which these universal invariants hold, and the initial state is X (which need not match the actual universe’s initial state)? This is a mathematical question, rather than a modal one; see discussion of mathematics later.

Time: eternalism

Eternalism says the future exists, as the past and present do. This is fairly natural from the DAG factorization notion of causality. As there are multiple topological sorts of a given DAG, and multiple DAGs consistent with the same joint distribution, there isn’t an obvious way to separate the present from the past and future; and even if there were, there wouldn’t be an obvious point in declaring some nodes real and others un-real based on their topological ordering. Accordingly, for ontological minimality, they have the same degree of existence.

Eternalism is also known as “block universe theory”. There’s a possible complication, in that our DAG factorization can be stochastic. But the stochasticity need not be “located in time”. In particular, we can move any stochasticity into independent random variables, and have everything be a deterministic consequence of those. This is like pre-computing random numbers for a Monte Carlo sampling algorithm.
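
A small sketch of this move, using an AR(1) process of my choosing: draw all the noise up front as independent variables, and the trajectory becomes a deterministic function of them.

    import numpy as np

    def trajectory(noise, x0=0.0, a=0.9):
        # x_t = a * x_{t-1} + eps_t, with the eps_t supplied in advance.
        x, out = x0, []
        for eps in noise:
            x = a * x + eps
            out.append(x)
        return np.array(out)

    eps = np.random.default_rng(0).normal(size=1000)      # "pre-computed" randomness
    assert np.allclose(trajectory(eps), trajectory(eps))  # deterministic given eps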

The main empirical ambiguity here is whether the universe’s history has a high Kolmogorov complexity, increasing approximately linearly with time. If it does, then something like a stochastic model is predictively appropriate, although the stochasticity need not be “in time”. If not, then it’s more like classical determinism. It’s an open empirical question, so let’s not be dogmatic.

We can go further. Do we even need to attribute “true stochasticity” to a universe with high Kolmogorov complexity? Instead, we can say that simple universally satisfied laws constrain the trajectory, either partially or totally (only partially in the high K-complexity case). And to the extent they only partially do, we have no reason to expect that a simple stochastic model of the remainder would be worse than any other model (except high K-complexity ones that “bake in” information about the remainder, a bit of a cheat). (See “The Coding Theorem — A Link between Complexity and Probability” for technical details.)

Either way, we have “quasi-determinism”; everything is deterministic, except perhaps factored-out residuals that a simple stochastic model suffices for.

Free will: non-realism

A basic argument against free will: free will for an agent implies that the agent could have done something else. This already implies a “possibility”-like modality; if such a modality is not real, free will fails. If on the other hand, possibility is real, then, according to standard modal logics such as S4, any logical tautology must be necessary. If an agent is identified with a particular physical configuration, then, given the same physics / inputs / stochastic bits (which can be modeled as non-temporal extra parameters, per previous discussion), there is only one possible action, and it is necessary, as it is logically tautological. Hence, a claim of “could” about any other action fails.

Possible ways out: consider giving the agent different inputs, or different stochastic bits, or different physics, or don’t identify the agent with its configuration (have “could” change the agent’s physical configuration). These are all somewhat dubious. For one, it is dogmatic to assume that the universe has high Kolmogorov complexity; if it doesn’t, then modeling decisions as having corresponding “stochastic bits” can’t in general be valid. Free will believers don’t tend to agree on how to operationalize “could”, their specific formalizations tend to be dubious in various ways, and the formalizations do not agree much with normal free will intuitions. The obvious bullet to bite here is, there either is no modal “could”, or if there is, there is none that corresponds to “free will”, as the notion of “free will” bakes in confusions.

Decision theory: non-realism

We reject causal decision theory (CDT), because it relies on causal counterfactuals. We reject any theory of “logical counterfactuals”, because the counterfactual must be illogical, contradicting modal logics such as S4. Without applying too much creativity, what remain are evidential decision theory (EDT) and non-realism, i.e. the claim that there is not in general a fact of the matter about what action by some fixed agent best accomplishes some goal.

To be fair to EDT, the smoking lesion problem is highly questionable in that it assumes decisions could be caused by genes (without those genes changing the decision theory, value function, and so on), which contradicts the agent actually implementing EDT. Moreover, there are logical formulations of EDT, which ask whether it would be good news to learn that one’s algorithm outputs a given action given a certain input (the one you’re seeing), where “good news” is taken across a class of possible universes, not just the one you have evidence of; these may better handle “XOR blackmail”-like problems.

Nevertheless, I won’t dogmatically assume based on failure of CDT and logical counterfactual theories that EDT works; EDT theorists have to do a lot to make EDT seem to work in strange decision-theoretic thought experiments. This work can introduce ontological extras such as infinitesimal probabilities, or similarly, pseudo-Bayesian conditionals on probability 0 events. From a bullet-biting perspective, this is all highly dubious, and not really necessary.

We can recover various “practical reason” concepts as statistical predictions about whether an agent will succeed at some goal, given evidence about the agent, including that agent’s actions. For example, as a matter of statistical regularity, some people succeed in business more than others, and there is empirical correlation with their decision heuristics. The difference is that this is a third-personal evaluation, rather than a first-personal recommendation: we make no assumption that third-person predictive concepts relating to practical reason translate to a workable first-personal decision theory. (See also “Decisions are not about changing the world, they are about learning what world you live in”, for related analysis.)

Morality: non-realism

This shouldn’t be surprising. Moral realism implies that moral facts exist, but where would they exist? No proposal of a definition in terms of physics, math, and so on has been generally convincing, and such proposals vary quite a lot. G.E. Moore observes that any precise definition of morality (in terms of physics and so on) seems to leave an “open question” of whether that thing is really good, a question that remains compelling to the listener.

There are many possible minds (consider the space of AGI programs), and they could find different things compelling. There are statistical commonalities (e.g. minds will tend to make decisions compatible with maintaining an epistemology and so on), but even commonalities have exceptions. (See “No Universally Compelling Arguments”.)

Suppose you really like the categorical imperative and think rational minds have a general tendency to follow it. If so, wouldn’t it be more precise to say “X agent follows the categorical imperative” than “X agent acts morally”? This bakes in fewer intuitive confusions.

As an analogy, suppose some people refer to members of certain local bird species as a “forest spirit”, due to a local superstition. You could call such a bird a “forest spirit” by which you mean a physical entity of that bird species, but this risks baking in a superstitious confusion.

In addition, the discussion of free will and decision theory shows that there are problems with formulating possibility and intentional action. If, as Kant says, “ought implies can”, then contrapositively “not can implies not ought”; if modal analysis shows that alternative actions for a given agent are not possible, then no alternative action can be what the agent ought to do. (Alternatively, if modal possibility is unreal, then “ought implies can” is confused to begin with.) The only “ought” left would pick out the action actually taken, which is redundant; this is really not the interpretation of “ought” intended by moral realists.

Theory of mind: epistemic reductive physicalism

Chalmers claims that mental properties are “further facts” on top of physical properties, based on the zombie argument: it is conceivable that a universe physically identical to ours could exist, but with no consciousness in it. Ontological minimality suggests not believing in these “further facts”, especially given how dubious theories of consciousness tend to be. This seems a lot like eliminativism.

We don’t need to discard all mental concepts, though. Some mental properties such as logical inference and memory have computational interpretations. If I say my computer “remembers” something, I specify a certain set of physical configurations that way: the ones corresponding to computers with that something in the memory (e.g. RAM). I could perhaps be more precise than “remembers”, by saying something like “functionally remembers”.

A possible problem with eliminativism is that it might undermine the idea that we know things, including any evidence for eliminativism. It is epistemically judicious to have some ontological status for “we have evidence of this physical theory” and so on. The idea with reductive physicalism is to put such statements in correspondence with physical ones. Such as: “in the universe, most agents who use this or that epistemic rule are right about this or that”. (It would be a mistake to assume, given a satisficing epistemology evaluation over existent agents, that we “could” maximize epistemology with a certain epistemic rule; that would open up the usual decision-theoretic complications. Evaluating the reliability of our epistemologies is more like evaluating third-personal practical reason than making first-personal recommendations.)

That might be enough. If it’s not enough then ontological minimality suggests adding as little as possible to physicalism to express epistemic facts. We don’t need a full-blown theory of consciousness to express meaningful epistemic statements.

Personal identity: empty individualism, similarity as successor

If a machine scans you and makes a nearly-exact physical copy elsewhere, is that copy also you? Paradoxes of personal identity abound. Whether that copy is “really you” seems like a non-question; if it had an answer, where would that answer be located?

Logically, we have a minimal notion of personal identity from mathematical identity (X=X). So, if X denotes (some mathematical object corresponding to) you at some time, then X=X. This is an empty notion of individualism, as it fails to hold that you are the same as recent past or future versions of yourself.

What’s fairly simple and predictive to say beyond X=X is that a near-exact copy of you is similar to you, just as you are similar to near past and future versions of yourself, as two prints of a book are similar, and as two world maps are similar. There are also directed properties (rather than symmetric similarity), such as you remembering the experiences of past versions of yourself but not vice versa; these reduce to physical properties, not further properties, as in the theory of mind section.

It’s easy to get confused about which entities are “really the same person”. Ontological minimality suggests there isn’t a general answer, beyond trivial reflexive identities (X=X). The successor concept is, then, something like similarity. (And getting too obsessed with “how exactly to define similarity?” misses the point; the use of similarity is mainly predictive/evidential, not metaphysical.)

Anthropic probability: non-realism, graph structure as successor

In the Sleeping Beauty problem, is the correct probability ½ or ⅓? It seems the argument is over nothing real. Halfers and thirders agree on a sort of graph structure of memory: the initial Sleeping Beauty “leads to” one or two future states, depending on the coin flip, in terms of functional memory relations. The problem has to do with translating the graph structure to a probability distribution over future observations and situations (from the perspective of the original Sleeping Beauty).
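
As a minimal sketch of that shared structure (my own toy encoding): both camps accept the same memory graph; the disagreement is only in how the graph is converted into a probability, e.g. weighting by coin-flip branch versus by awakening.

# The memory graph both halfers and thirders agree on: one awakening under
# heads, two (memory-wiped) awakenings under tails.
awakenings = {
    "heads": ["monday"],
    "tails": ["monday", "tuesday"],
}

# Two ways of turning the same graph into P(heads), from Beauty's perspective:
p_heads_by_branch = 1 / len(awakenings)  # weight coin-flip branches equally: 1/2
total_awakenings = sum(len(days) for days in awakenings.values())
p_heads_by_awakening = len(awakenings["heads"]) / total_awakenings  # weight awakenings equally: 1/3

print(p_heads_by_branch, p_heads_by_awakening)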

From physics and identification of basic mental functions, we get a graph-like structure; why add more ontology? Enough thought experiments of memory wipes, upload copying, and so on, suggest that the linear structure of memory and observation is not always valid.

This slightly complicates the idea of physical theories being predictive, but it seems possible to operationalize prediction without a full notion of subjective probability. We can ask questions like, “do most entities in the universe who use this or that predictive model make good predictions about their future observations?”. The point here isn’t to get a universal notion of good predictions, but rather one that is good enough to get basic inferences, like learning about universal physical laws.

Mathematics: formalism

Are mathematical facts, such as “Fermat’s Last Theorem is true”, real? If so, where are they? Are they in the physical universe, or at least partially in a different realm?

Both of these are questionable. If we try to identify “for all n,m: n + S(m) = S(n + m)” with “in the universe, it is always the case that adding n objects to S(m) objects yields S(n + m) objects”, we run into a few problems. First, it requires identifying objects in physics. Second, given a particular definition of object, physics might not be such that this rule always holds: maybe adding a pile of sand to another pile of sand reduces the number of objects (as it combines two piles into one), or perhaps some objects explode when moved around; meanwhile, mathematical intuition is that these laws are necessary. Third, the size of the physical universe limits how many test cases there can be; hence, we might un-intuitively conclude something like “for all n,m both greater than Graham’s number, n=m”, as the physical universe has no counter-examples. Fourth, the size of the universe limits the possible information content of any entity in it, forcing something like ultrafinitism.

On the other hand, the idea that the mathematical facts live even partially outside the universe is ontologically and epistemically questionable. How would we access these mathematical facts, if our behaviors are determined by physics? Why even assume they exist, when all we see is in the universe, not anything outside of it?

Philosophical formalism does not explain “for all n,m: n + S(m) = S(n + m)” by appealing to a universal truth, but by noting that our formal system (in this case, Peano arithmetic) derives it. A quasi-invariant holds: mathematicians tend in practice to follow the rules of the formal system. And mathematicians use one formal system rather than another for physical, historical reasons. Peano arithmetic, for example, is useful: it models numbers in physics theories and in computer science, yielding predictions because the structure of its inferences has some correspondence with the structure of physics. Though, utility is a contingent fact about our universe; what problems are considered useful to solve varies with historical circumstances. Formal systems are also adopted for reasons other than utility, such as the momentum of past practice or the prestige of earlier work.
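
As an illustration of “the formal system derives it” (a sketch using Lean 4’s built-in natural numbers as a stand-in for literal Peano arithmetic): the equation above is the defining computation rule for addition there, and a nearby fact such as “for all n: 0 + n = n” is derived from it by induction.

-- Illustration only: Lean's Nat standing in for PA.
-- The equation from the text is the defining rule of addition, so it holds definitionally:
example (n m : Nat) : n + Nat.succ m = Nat.succ (n + m) := rfl

-- A derived fact, proved by induction using that rule:
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]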

What we avoid with philosophical formalism is confusion over “further facts”, such as the Continuum Hypothesis, which has been shown to be independent of ZFC. We don’t need to think there is a real fact of the matter about whether the Continuum Hypothesis is true.

Formalism is suggestive of finitism and intuitionism, although these are additional principles of formal systems; we don’t need to conclude something like “finitism is true” per se. The advantage of such formal systems is that they may be a bit more “self-aware” as being formal systems; for example, intuitionism is less suggestive that there is always a fact of the matter regarding undecidable statements (like a Gödelian sentence), as it does not accept the law of the excluded middle. But, again, these are particular formal systems, which have advantages and disadvantages relative to other formal systems; we don’t need to conclude that any of these are “the correct formal system”.

Conclusion

The positions sketched here are not meant to be a complete theory of everything. They are a deliberately stripped-down “kernel” view, obtained by repeatedly biting bullets rather than preserving intuitions that demand extra ontology. Across causality, laws of physics, time, free will, decision theory, morality, mind, personal identity, anthropic probability, and mathematics, the same method has been applied:

  • Strip away purported “further facts” not needed for empirical adequacy.
  • Treat models as mathematical tools for describing the world’s structure, not as windows onto modal or metaphysical realms.
  • Accept that some familiar categories like “could,” “ought,” “the same person,” or “true randomness” may collapse into redundancy or dissolve into lighter successors such as statistical regularity or similarity relations.

This approach sacrifices intuitive richness for structural economy. But the payoff is clarity: fewer moving parts, fewer hidden assumptions, and fewer places for inconsistent intuitions to be smuggled in. Even if the kernel view is incomplete or false in detail, it serves as a clean baseline — one that can be built upon, by adding commitments with eyes open to their costs.

The process is iterative. For example, I stripped away a causal counterfactual ontology to get a DAG structure; then stripped away the timing of stochasticity into a-temporal uniform bits; then suggested that residuals not determined by simple physical laws (in a high Kolmogorov complexity universe) need not be “truly stochastic”, just well-predicted by a simple stochastic model. Each round makes the ontology lighter while preserving empirical usefulness.

It is somewhat questionable to infer, from the lack of success in defining, say, an optimal decision theory, that no such decision theory exists. This provides an opportunity for falsification: solve the problem really well. A sufficiently reductionist solution may be compatible with the philosophical kernel; otherwise, an extension might be warranted.

I wouldn’t say I outright agree with everything here, but the exercise has shifted my credences toward these beliefs. As with the Christian fundamentalist analogy, resistance to biting particular bullets may come from revising too few beliefs at once.

A practical upshot is that a minimal philosophical kernel can be extended more easily without internal conflict, whereas a more complex system is harder to adapt. If someone thinks this kernel is too minimal, the challenge is clear: propose a compatible extension, and show why it earns its ontological keep.

Measuring intelligence and reverse-engineering goals

It is analytically useful to define intelligence in the context of AGI. One intuitive notion is epistemology: an agent’s intelligence is how good its epistemology is, how good it is at knowing things and making correct guesses. But “intelligence” in AGI theory often means more than epistemology. An intelligent agent is supposed to be good at achieving some goal, not just knowing a lot of things.

So how could we define intelligent agency? Marcus Hutter’s universal intelligence measures an agent’s ability to achieve observable reward across a distribution of environments; AIXI maximizes this measure. Testing across a distribution makes sense for avoiding penalizing “unlucky” agents who fail in the real world, but use effective strategies that succeed most of the time. However, maximizing observable reward is a sort of fixed goal function; it can’t consider intelligent agents that effectively achieve goals other than reward-maximization. This relates to inner alignment: an agent may not be “inner aligned” with AIXI’s reward maximization objective, yet still intelligent in the sense of effectively accomplishing something else.

To generalize, it is problematic to score an agent’s intelligence on the basis of a fixed utility function. It is fallacious to imagine a paperclip maximizer and say “it is not smart, it doesn’t even produce a lot of staples!” (or happiness for conscious beings or whatever). Hopefully, the confusion of relativist pluralism of intelligence measures can be avoided.

Of practical import is the agent’s “general effectiveness”. Both a paperclip maximizer and a staple maximizer would harness energy effectively, e.g. effectively harnessing nuclear energy from stars. A generalization is Omohundro’s basic AI drives or convergent instrumental goals: these are what effective utility-maximizing agents would tend to pursue almost regardless of the utility function.

So a proposed rough definition: An agent is intelligent to the extent it tends to achieve convergent instrumental goals. This is not meant to be a final definition; it might have conceptual problems, e.g. dependence on the VNM notion of intelligent agency, but it at least adds some specificity. “Tends to” here is similar to Hutter’s idea of testing an agent across a distribution of environments: an agent can tend to achieve value even when it actually fails (unluckily).

To cite prior work, Nick Land writes (in “What is intelligence”, Xenosystems):

Intelligence solves problems, by guiding behavior to produce local extropy. It is indicated by the avoidance of probable outcomes, which is equivalent to the construction of information.

This amounts to something similar to the convergent instrumental goal definition; achieving sufficiently specific outcomes involves pursuing convergent instrumental goals.

The convergent instrumental goal definition of intelligence may help study the Orthogonality Thesis. In Superintelligence, Bostrom states the thesis as:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

(Previously I argued against a strong version of the thesis.)

Clearly, having a definition of intelligence helps clarify what the orthogonality thesis is stating. But the thesis also refers to “final goals”; how can that be defined? For example, what are the final goals of a mouse brain?

In some idealized cases, like a VNM-based agent that explicitly optimizes a defined utility function over universe trajectories, “final goal” is well-defined. However, it’s unclear how to generalize to less idealized cases. In particular, a given idealized optimization architecture has a type signature for goals, e.g. a Turing machine assigning a real number to universe trajectories which themselves have some type signature e.g. based on the physics model. But different type signatures for goals across different architectures, even idealized ones, makes identification of final goals more difficult.

A different approach: what are the relevant effective features of an agent other than its intelligence? This doesn’t bake in a “goal” concept but asks a natural left-over question after defining intelligence. In an idealized case like paperclip maximizer vs. staple maximizer (with the same cognitive architecture and so on), while the agents behave fairly similarly (harnessing energy, expanding throughout the universe, and so on), there is a relevant effective difference in that they manufacture different objects towards the latter part of the universe’s lifetime. The difference in effective behavior, here, does seem to correspond with the differences in goals.

To provide some intuition for alternative agent architectures, I’ll give a framework inspired by the Bellman equation. To simplify, assume we have an MDP with S being a set of states, A being a set of actions, t(s’ | s, a) specifying the distribution over next states given the previous state and an action, and s_0 being the initial state. A value function on states satisfies:

V(s) = \max_{a \in A} \sum_{s' \in S} t(s' | s, a) V(s')

This is a recurrent relationship in the sense that the values of states “depend on” the values of other states; the value function is a sort of fixed point. A valid policy for a value function must always select an action that maximizes the expected value of the following state. A difference with the usual Bellman equation is that there is no time discounting and no reward. (There are of course interesting modifications to this setup, such as relaxing the equality to an approximate equality, or having partial observability as in a POMDP; I’m starting with something simple.)

Now, what does the space of valid value functions for an MDP look like? As a very simple example, consider if there are three states {start, left, right}; two actions {L, R}; ‘start’ being the starting state; ‘left’ always transitioning to ‘left’, ‘right’ always transitioning to ‘right’; ‘start’ transitioning to ‘left’ if the ‘L’ action is taken, and to ‘right’ if the ‘R’ action is taken. The value function can take on arbitrary values for ‘left’ and ‘right’, but the value of ‘start’ must be the maximum of the two.
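
Here is a minimal sketch of this three-state example (my own encoding; the names are illustrative), checking whether a candidate value function satisfies the fixed-point condition above:

STATES = ["start", "left", "right"]
ACTIONS = ["L", "R"]

# t[s][a] maps next states to probabilities (deterministic transitions here).
t = {
    "start": {"L": {"left": 1.0}, "R": {"right": 1.0}},
    "left":  {"L": {"left": 1.0}, "R": {"left": 1.0}},
    "right": {"L": {"right": 1.0}, "R": {"right": 1.0}},
}

def is_valid_value_function(V, eps=1e-9):
    # V(s) must equal max_a sum_{s'} t(s'|s,a) V(s') at every state.
    return all(
        abs(V[s] - max(sum(p * V[s2] for s2, p in t[s][a].items()) for a in ACTIONS)) < eps
        for s in STATES
    )

# 'left' and 'right' are free parameters; 'start' is forced to be their maximum.
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 7.0}))  # True
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 3.0}))  # False

This also previews the later “zig to zag” point: changing the value of ‘start’ alone produces an invalid value function, while changing ‘left’ or ‘right’ and recomputing ‘start’ preserves validity.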

We could say something like, the agent’s utility function is only over ‘left’ and ‘right’, and the value function can be derived from the utility function. This took some work, though; the utility function isn’t directly written down. It’s a way of interpreting the agent architecture and value function. We figure out what the “free parameters” are, and figure out the value function from these.

It of course gets more complex in cases where we have infinite chains of different states, or cycles between more than one state; it would be less straightforward to say something like “you can assign any values to these states, and the values of other states follow from those”.

In “No Universally Compelling Arguments”, Eliezer Yudkowsky writes:

If you switch to the physical perspective, then the notion of a Universal Argument seems noticeably unphysical.  If there’s a physical system that at time T, after being exposed to argument E, does X, then there ought to be another physical system that at time T, after being exposed to environment E, does Y.  Any thought has to be implemented somewhere, in a physical system; any belief, any conclusion, any decision, any motor output.  For every lawful causal system that zigs at a set of points, you should be able to specify another causal system that lawfully zags at the same points.

The switch from “zig” to “zag” is a hypothetical modification to an agent. In the case of the studied value functions, not all modifications to a value function (e.g. changing the value of a particular state) lead to another valid value function. The modifications we can make are more restricted: for example, perhaps we can change the value of a “cyclical” state (one that always transitions to itself), and then back-propagate the value change to preceding states.

A more general statement: Changing a “zig” to a “zag” in an agent can easily change its intelligence. For example, perhaps the modification is to add a “fixed action pattern” where the modified agent does something useless (like digging a ditch and filling it) under some conditions. This modification to the agent would negatively impact its tendency to achieve convergent instrumental goals, and accordingly its intelligence according to our definition.

This raises the question: for a given agent, keeping its architecture fixed, what are the valid modifications that don’t change its intelligence? The results of such modifications are a sort of “level set” in the function mapping from agents within the architecture to intelligence. The Bellman-like value function setup makes the point that specifying the set of such modifications may be non-trivial; they could easily result in an invalid value function, leading to un-intelligent, wasteful behavior.

A general analytical approach:

  • Consider some agent architecture, a set of programs.
  • Consider an intelligence function on this set of programs, based on something like “tendency to achieve convergent instrumental goals”.
  • Consider differences within some set of agents with equivalent intelligence; do they behave differently?
  • Consider whether the effective differences between agents with equivalent intelligence can be parametrized with something like a “final goal” or “utility function”.

Whereas classical decision theory assumes the agent architecture is parameterized by a utility function, this is more of a reverse-engineering approach: can we first identify an intelligence measure on agents within an architecture, then look for relevant differences between agents of a given intelligence, perhaps parametrized by something like a utility function?

There’s not necessarily a utility function directly encoded in an intelligent system such as a mouse brain; perhaps what is encoded directly is more like a Bellman state value function learned from reinforcement learning, influenced by evolutionary priors. In that case, it might be more analytically fruitful to identify relevant motivational features other than intelligence and see how final-goal-like they are, rather than starting from the assumption that there is a final goal.

Let’s consider orthogonality again, and take a somewhat different analytical approach. Suppose that agents in a given architecture are well-parametrized by their final goals. How could intelligence vary depending on the agent’s final goal?

As an example, suppose the agents have utility functions over universe trajectories, which vary both in what sort of states they prefer, and in their time preference (how much they care more about achieving valuable states soon). An agent with a very high time preference (i.e. very impatient) would probably be relatively unintelligent, as it tries to achieve value quickly, neglecting convergent instrumental goals such as amassing energy. So intelligence should usually increase with patience, although maximally patient agents may behave unintelligently in other ways, e.g. investing too much in unlikely ways of averting the heat death of the universe.

There could also be especially un-intelligent goals such as the goal of dying as fast as possible. An agent pursuing this goal would of course tend to fail to achieve convergent instrumental goals. (Bostrom and Yudkowsky would agree that such cases exist, and require putting some conditions on the orthogonality thesis).

A more interesting question is whether there are especially intelligent goals, ones whose pursuit leads to especially high convergent instrumental goal achievement relative to “most” goals. A sketch of an example: Suppose we are considering a class of agents that assume Newtonian physics is true, and have preferences over Newtonian universe configurations. Some such agents have the goal of building Newtonian configurations that are (in fact, unknown to them) valid quantum computers. These agents might be especially intelligent, as they pursue the convergent instrumental goal of building quantum computers (thus unleashing even more intelligent agents, which build more quantum computers), unlike most Newtonian agents.

This is a bit of a weird case because it relies on the agents having a persistently wrong epistemology. More agnostically, we could also consider Newtonian agents that tend to want to build “interesting”, varied matter configurations, and are thereby more likely to stumble on esoteric physics like quantum computation. There are some complexities here (does it count as achieving convergent instrumental goals to create more advanced agents with “default” random goals, compared to the baseline of not doing so?) but at the very least, Newtonian agents that build interesting configurations seem to be more likely to have big effects than ones that don’t.

Generalizing a bit, different agent architectures could have different ontologies for the world model and utility function, e.g. Newtonian or quantum mechanical. If a Newtonian agent looks at a “random” quantum mechanical agent’s behavior, it might guess that it has a strong preference for building certain Newtonian matter configurations, e.g. ones that (in fact, unknown to it) correspond to quantum computers. More abstractly, a “default” / max-entropy measure on quantum mechanical utility functions might lead to behaviors that, projected back into Newtonian goals, look like having very specific preferences over Newtonian matter configurations. (Even more abstractly, see the Bertrand paradox showing that max-entropy distributions depend on parameterization.)

Maybe there is such a thing as a “universal agent architecture” in which there are no especially intelligent goals, but finding such an architecture would be difficult. This goes to show that identifying truly orthogonal goal-like axes is conceptually difficult; just because something seems like a final goal parameter doesn’t mean it is really orthogonal to intelligence.

Unusually intelligent utility functions relate to Nick Land’s idea of intelligence optimization. Quoting “Intelligence and the Good” (Xenosystems):

From the perspective of intelligence optimization (intelligence explosion formulated as a guideline), more intelligence is of course better than less intelligence… Even the dimmest, most confused struggle in the direction of intelligence optimization is immanently “good” (self-improving).

My point here is not to opine on the normativity of intelligence optimization, but rather to ask whether some utility functions within an architecture lead to more intelligence-optimization behavior. A rough guess is that especially intelligent goals within an agent architecture will tend to terminally value achieving conditions that increase intelligence in the universe.

Insurrealist, expounding on Land in “Intro to r/acc (part 1)”, writes:

Intelligence for us is, roughly, the ability of a physical system to maximize its future freedom of action. The interesting point is that “War Is God” seems to undermine any positive basis for action. If nothing is given, I have no transcendent ideal to order my actions and cannot select between them. This is related to the is-ought problem from Hume, the fact/value distinction from Kant, etc., and the general difficulty of deriving normativity from objective fact.

This class of problems seems to be no closer to resolution than it was a century ago, so what are we to do? The Landian strategy corresponds roughly to this: instead of playing games (in a very general, abstract sense) in accordance with a utility function predetermined by some allegedly transcendent rule, look at the collection of all of the games you can play, and all of the actions you can take, then reverse-engineer a utility function that is most consistent with your observations. This lets one not refute, but reject and circumvent the is-ought problem, and indeed seems to be deeply related to what connectionist systems, our current best bet for “AGI”, are actually doing.

The general idea of reverse-engineering a utility function suggests a meta-utility function, and a measure of intelligence is one such candidate. My intuition is that in the Newtonian agent architecture, a reverse-engineered utility function looks something like “exploring varied, interesting matter configurations of the sort that (in fact, perhaps unknown to the agent itself) tend to create large effects in non-Newtonian physics”.

To summarize main points:

  • Intelligence can be defined in a way that is not dependent on a fixed objective function, such as by measuring tendency to achieve convergent instrumental goals.
  • Within an agent architecture, effective behavioral differences other than intelligence can be identified, which for at least some architectures correspond with “final goals”, although finding the right orthogonal parameterization might be non-trivial.
  • Within an agent architecture already parameterized by final goals, intelligence may vary between final goals; especially unintelligent goals clearly exist, but especially intelligent goals would be more notable in cases where they exist.
  • Given an intelligence measure and agent architecture parameterized by goals, intelligence optimization could possibly correspond with some goals in that architecture; such reverse-engineered goals would be candidates for especially intelligent goals.

Towards plausible moral naturalism

In “Generalizing zombie arguments”, I hinted at the idea of applying a Chalmers-like framework to morality. Here I develop this idea further.

Suppose we are working in an axiomatic system rich enough to express physics and physical facts. Can this system include moral facts as well? Perhaps moral statements such as “homicide is never morally permissible” can be translated into the axiomatic system or an extension of it.

It would be difficult to argue that a realistic axiomatic system must be able to express moral statements. So I’ll bracket the alternative possibility: Perhaps moral claims do not translate into well-formed statements in the theory at all. This would be a type of moral non-realism, of considering moral claims to be meaningless.

There’s another possibility to bracket: moral trivialism, according to which moral statements do have truth values, but only trivial ones. For example, perhaps all actions are morally permissible. Or perhaps no actions are. This is a roughly moral nonrealist possibility, especially compatible with error theory.

The point of this post is not to argue against moral meaninglessness or moral trivialism, but rather to explore alternatives to see which are most realistic. Even if moral meaninglessness or trivialism is likely, the combination of uncertainty and disagreement regarding their truth could motivate examining alternative theories.

Let’s start with a thought experiment. Suppose there are two possible universes that are physically identical. They both contain versions of a certain physical person, who takes the same actions in each possible universe. However, in one possible universe, this person’s actions are morally wrong, while in another, they are morally right.

We could elaborate on this by imagining that in the actual universe, this person’s actions are morally wrong, although they are materially helpful and match normal concepts of moral behavior. This person would be a P-evildoer, analogous to a P-zombie; they would be physically identical to a morally acting person, but still an evildoer nonetheless.

Conversely, we could imagine that in the actual universe, this person’s actions are morally good, although they are materially malicious and match normal concepts of immoral behavior. They would be a P-saint. (Scott Alexander has written about a similar scenario in “The Consequentialism FAQ”.)

I, for one, balk at imagining such a scenario. Something about it seems inconceivable. It seems easier to imagine that I am so wrong about morality that intuitive judgments of moral goodness are anti-correlated with real moral goodness, than that physically identical persons in different possible worlds have different moral properties. Somehow, P-evildoer scenarios seem even worse than P-zombie scenarios. To be clear, it’s not too hard to imagine that the P-evildoer’s actions could have negative supernatural consequences (while their physically identical moral twin’s actions have positive supernatural consequences), but moral judgments seem like they should evaluate agents relative to their possible epistemic state(s), rather than depending on un-knowable supernatural consequences.

As motivation for considering such deeply unfair scenarios, consider the common idea that atheists and materialists cannot ground their morality, as morality must be grounded in a supernatural entity such as a god. Then, depending on the whims of this god, a given physical person may act morally well or morally badly, despite taking the same actions in either case. And, unless the whims of the god are so finite and predictable that humans can know them in general, different possible gods (and accordingly moral judgments) are logically possible and conceivable from a human perspective.

Broadly, this idea could be called “moral supernaturalism”, which says that moral properties are not determined by the combination of physical properties, mathematical properties, and metaphysical constraints on conceivability; moral properties are instead “further facts”. I’m not sure how to argue the inconceivability of P-evildoers to someone who doesn’t already agree, but if someone accepts a principle of no P-evildoers, then this is a strong argument against moral supernaturalism.

The remaining alternative to moral meaninglessness, moral trivialism, and moral supernaturalism, would seem to be reasonably called “moral naturalism”. I’ll take some preliminary steps towards formalizing it.

Let us work in a typed variant of first-order logic, with three types: W (a type of possible worlds), P (a type of physical world trajectory), and M (a type of moral world trajectory). Each possible world has a physical and a moral trajectory, denoted p(w) and m(w) respectively. To state a no P-evildoers principle:

\neg \exists (w_1, w_2 : W), p(w_1) = p(w_2) \wedge m(w_1) \neq m(w_2)

In other words, any two possible worlds with identical physical properties must have the same moral properties. (As an aside, modal logic may provide alternative formulations of possibility without reifying possible worlds as “existent” first-order logical entities, though I don’t present a modal formulation here.) I would like to demonstrate a more helpful statement from the no P-evildoers principle:

\forall (p^* : P) \exists (m^* : M) \forall (w : W), p(w) = p^* \rightarrow m(w) = m^*

This says that any physical trajectory has a corresponding moral trajectory that holds across possible worlds. To prove this, start by letting p^* : P be arbitrary. Either there is some possible world with this physical trajectory, or not. If not, we can let m^* be arbitrary, and \forall (w : W), p(w) = p^* \rightarrow m(w) = m^* will hold vacuously.

If so, then let w^* be some such world and set m^* = m(w^*). Now consider some arbitrary possible world w for which p(w) = p^*. Either it is true that m(w) = m^* or not. If it is, we are done; we have that p(w) = p^* \rightarrow m(w) = m^*. If not, then observe that w and w^* have identical physical trajectories, but different moral trajectories. This contradicts the no P-evildoers principle.

So, the no P-evildoers principle implies the alternative statement, which expresses something of a functional relationship between physical trajectories and moral ones; for any possible physical trajectory, there is only one possible corresponding moral trajectory. With a bit of set theory, and perhaps axiom of choice shenanigans, we may be able to show the existence of a function f mapping physical trajectories to corresponding moral trajectories.
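
For those who prefer a formal check, here is a minimal Lean 4 sketch of this derivation (the names are mine; W, P, M and the maps p, m stand in for the types and trajectory maps above, and Inhabited M supplies a witness for the vacuous case):

-- Sketch only; hypothetical names for the types and maps in the text.
variable {W P M : Type}

-- No P-evildoers: physically identical worlds have identical moral trajectories.
def NoPEvildoers (p : W → P) (m : W → M) : Prop :=
  ∀ w₁ w₂ : W, p w₁ = p w₂ → m w₁ = m w₂

-- Every physical trajectory determines a single moral trajectory across worlds.
theorem physical_fixes_moral {p : W → P} {m : W → M} [Inhabited M]
    (h : NoPEvildoers p m) (pstar : P) :
    ∃ mstar : M, ∀ w : W, p w = pstar → m w = mstar :=
  (Classical.em (∃ w : W, p w = pstar)).elim
    (fun ⟨w0, hw0⟩ => ⟨m w0, fun w hw => h w w0 (hw.trans hw0.symm)⟩)
    (fun hno => ⟨default, fun w hw => (hno ⟨w, hw⟩).elim⟩)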

Now f expresses a necessary (working across all possible worlds) mapping from physical to moral trajectories, a sort of multiversal moral judgment function. Its necessity across possible worlds demonstrates its a priori truth, showing Kant was right about morality being derivable from the a priori. We should return to Kant to truly comprehend morality itself.

…This is perhaps jumping to conclusions. Even though we have an f expressing a necessary functional relationship between physical and moral trajectories, this does not show the epistemic derivability of f from any realistic perspective. Perhaps different alien species across different planets develop different beliefs about f, having no way to effectively resolve their disagreements.

We don’t have the framework to rule this out, so far. Yet something seems suspiciously morally supernaturalist about this scenario. In the P-evildoer scenario, we could imagine that God decides all physical facts, and then adds additional moral facts, but potentially attributes different moral facts to physically identical universes. With the no P-evildoers principle, we have constrained God a little, but it still seems God might need to add facts about f even after determining all physical facts, to yield moral judgments.

So perhaps we can consider a stronger rejection of moral supernaturalism, to rule out a supernatural God of the moral gaps. We would need to re-formulate a negation of moral supernaturalism, distinct from our original no P-evildoers axiom.

One possibility is logical supervenience of moral facts on physical ones; the axiomatic system would be sufficient to prove all moral facts from all physical facts. This could be called moral reductionism: moral statements are logically equivalent to perhaps complex statements about particle trajectories and so on. This would imply that there are conceptually straightforward physical definitions of morality, even if we haven’t found them yet. However, assuming moral reductionism might be overly dogmatic, even if moral supernaturalism is rejected. Perhaps there are more constraints to “possibility” or “conceivability” than just logical coherence; perhaps there are additional metaphysical properties that must be satisfied, and morality cannot be in general derived without such constraints.

As an example, Kant suggests that geometric facts are not analytic. While there are axiomatizations of geometry such as Euclidean geometry, any one may or may not correspond to the a priori intuition of space. Perhaps systems weaker than Euclidean geometry allow many logically consistent possibilities, but some of these do not match a priori spatial intuition, so some of these logically consistent possible geometric trajectories are inconceivable.

Let us start with an axiomatic system T capable of expressing physical and moral statements. Rather than the previous possible-world theory, we will instead consider it to have a set of physical statements P and a set of moral statements M, which are all about the actual universe, rather than any other possible worlds.

Let us, for formality, suppose that these additional metaphysical conceivability constraints can be added to our axiomatic system T; call the augmented system T’. Now we can apply model theory to T’. Are there models of T’ that have the same physical facts, yet different moral facts?

We must handle some theoretical thorniness: Gödel’s first incompleteness theorem shows that there are multiple models of Peano arithmetic which yield different assignments of truth values to statements about the natural numbers, assuming PA is consistent. There is some intuition that these statements (which exist in the arithmetical hierarchy) nonetheless have real truth values, although it’s hard to be sure. Even if they don’t, it seems hard to formalize a system similar to Peano arithmetic that is consistent and complete; the doubter of the meaningfulness of statements in higher levels of the arithmetical hierarchy is going to have to work in an impoverished axiomatic system until such a system is formalized.

So we augment T’ with additional mathematical axioms, expressing, for example, all true PA statements according to the standard natural model; call this further-augmented system T^*. The axioms of T^* are, of course, uncomputable, but this is not an obstacle to model theory. These mathematical axioms disambiguate cases where the truth values of these arithmetic hierarchy statements are morally relevant somehow.

In a meta-theory capable of model-theoretic analysis of T^* (such as ZFC), we can express a stronger form of moral naturalism:

“Any two models of T^* having the same assignment of truth values to statements in P have the same assignment of truth values to statements in M.”

Recall that P is the set of physical statements, while M is the set of moral statements. What this expresses is that there are no “further facts” to morality in addition to physics, metaphysical conceivability considerations, and mathematical oracle facts, ruling out a broader class of moral supernaturalism than our previous formulation.

Using Gödel’s completeness theorem, we can show from the moral naturalism statement:

“T^* augmented with an infinite axiom schema assigning any logically consistent truth values to all statements in P will eventually prove binary truth values for all statements in M.”

The infinite axiom schema here is somewhat unsatisfying; it’s not really necessary if P is finite, but perhaps we want to consider countably infinite P as well. Luckily, since all proofs of individual M-statements (or their negations) are finite, they only use a finite subset of the axiom schema. Hence we can show:

“For any finite subset of M, T^* augmented with some finite subset of an infinite axiom schema assigning any logically consistent truth values to all statements in P will eventually prove binary truth values for all statements in this subset of M.”

This is more satisfying: a finite subset of moral statements only requires a finite subset of physical statements to prove their truth values. Likewise, the proofs need only use a finite subset of the axioms of T^*, e.g. they only require a finite number of mathematical oracle axioms.

To summarize, constraining models of an augmented axiomatic system, with metaphysical conceivability and mathematical axioms, to never imply different moral facts given the same physical facts, yields a stronger form of moral non-supernaturalism, preventing there from being “further facts” to morality beyond physics, metaphysical conceivability constraints, and mathematics. Using model theory, this would show that the truth values of any finite subset of moral statements are logically derivable from the system and from some finite subset of physical statements. This implies not just a theoretical existence of a functional relationship between physics and morality (f from before), but hypothetical a priori epistemic efficacy to such a derivation, albeit making use of an uncomputable mathematical oracle.

The practical upshot is not entirely clear. Aliens might have different beliefs about uncomputable mathematical statements, or different metaphysical conceivability axioms, yielding different moral beliefs. But disagreements over metaphysical conceivability and mathematics seem more epistemically weighted than fully general moral disagreements; they would relate to epistemic cognitive architecture and so on, rather than being orthogonal to epistemology.

Physically instantiated agents also might not generally be motivated to act morally, even if they have specific beliefs about morality. This doesn’t contradict moral realism per se, but it indicates practical obstacles to the efficacy of moral theory. As an example, the aliens may be moral realists, but additionally have beliefs about “schmorality”, a related but distinct concept, which they are more motivated to put into effect.

While it would take more work to determine the likelihood of Kantian morality conditioned on moral naturalism, in broad strokes they seem quite compatible. In the Critique of Practical Reason, Kant argues:

If a rational being can think of its maxims as practical universal laws, he can do so only by considering them as principles which contain the determining grounds of the will because of their form and not because of their matter.

The material of a practical principle is the object of the will. This object either is the determining ground of the will or it is not. If it is, the rule of the will is subject to an empirical condition (to the relation of the determining idea to feelings of pleasure or displeasure), and therefore it is not a practical law. If all material of a law, i.e., every object of the will considered as a ground of its determination, is abstracted from it, nothing remains except the mere form of giving universal law. Therefore, a rational being either cannot think of his subjectively practical principles (maxims) as universal laws, or he must suppose that their mere form, through which they are fitted for being universal laws, is alone that which makes them a practical law.

Mainly, he is emphasizing that universal moral laws must be determined a priori, rather than subject to empirical determination; hence, different rational agents would derive the same universal moral laws given enough reflection (though, of course, this assumes some metaphysical agreement among the rational agents, such as regarding the a priori synthetic). While my formulation of moral naturalism allows moral judgments to depend on physical facts, the general functional mapping from physical to moral judgments is itself a priori, not depending on the specific physical facts, but instead derivable from an axiomatic system capable of expressing physical facts, plus metaphysical conceivability constraints and mathematical axioms.

This is meant to be more of a starting point for moral naturalism than a definitive treatment. It is a sort of what if exercise: what if moral meaninglessness, moral trivialism, and moral supernaturalism are all false? It is difficult to decisively argue for moral naturalism, so I am more focused on exploring the implications of moral naturalism; this will make it easier to have an idea of the scope of plausible moral theories, and how they compare with each other.

Generalizing zombie arguments

Chalmers’ zombie argument, best presented in The Conscious Mind, concerns the ontological status of phenomenal consciousness in relation to physics. Here I’ll present a somewhat more general analysis framework based on the zombie argument.

Assume some notion of the physical trajectory of the universe. This would consist of “states” and “physical entities” distributed somehow, e.g. in spacetime. I don’t want to bake in too many restrictive notions of space or time, e.g. I don’t want to rule out relativity theory or quantum mechanics. In any case, there should be some notion of future states proceeding from previous states. This procession can be deterministic or stochastic; stochastic would mean “truly random” dynamics.

There is a decision to be made on the reality of causality. Under a block universe theory, the universe’s trajectory consists of data specifying a procession of states across time. There are no additional physical facts of some states causing other states. Instead of saying previous states cause future states, we say that every time-adjacent pair of states satisfies a set of laws. A block universe is simpler to define and analyze if the laws are deterministic: in that case only one next state is compatible with the previous state. Cellular automata such as Conway’s Game of Life have such laws. The block universe theory is well-presented in Gary Drescher’s Good and Real.

As an alternative to a block universe, we could consider causal relationships between physical states to be real. This would mean there is an additional fact of whether X causes Y, even if it is already known that Y follows X always in our universe. Pearl’s Causality specifies counterfactual tests for causality: for X to cause Y, it isn’t enough for Y to always follow X, it also has to be the case that Y would not have happened if not for X, or something similar to that. Pearl shows that there are multiple causal networks corresponding to a single Bayesian network; simply knowing the joint distribution over variables is not enough to infer the causal relationships. We could imagine a Turing machine as an example of a causal universe; it is well-defined what will be computed later if a state is flipped mid-way through.

These two alternatives, block universe theory and causal realism, give different notions of the domain of physics. I’m noting the distinction mainly to make it clearer what facts could potentially be considered physical.

The set of physical facts could be written down as statements in some sort of axiomatic system. We would now like to examine a new set of statements, S. For example, these could be statements about high-level objects like tables, phenomenal consciousness, or morality. We can consider different ways S could relate to the axiomatic system and the set of physical facts:

  1. S-statements are not well-formed statements of the axiomatic system.
  2. S-statements could in general be logically inferrable from physical facts. For example, S-statements could be about high-level particle configurations; even if facts about the configurations are not base-level physical facts, they logically follow from them.
  3. S-statements could be well-formed, but not logically inferrable from physical facts.

In case 2, we would say that S-statements are logically supervenient on physical facts. Knowing all physical facts implies knowing all S-facts, assuming enough time for logical inference. Chalmers gives tables as an example: there does not seem to be more to asserting a table exists at a given space-time position than to assert a complex statement about particle configurations and so on.

In case 3, we can’t infer S-facts from physical facts. Through Gödel’s completeness theorem, we can show the existence of models of the axiomatic system and physical facts in which the S-statements take on different truth values. These different models are in some sense “conceivable” and logically consistent. S-facts would then be “further facts”; more axioms would need to be added to determine the truth values of the S-statements.

So far, this is logically straightforward. Where it gets trickier is considering S-statements to refer to philosophical entities such as consciousness and morality.

Suppose S consists of statements like “The animal body at these space-time coordinates has a corresponding consciousness that is seeing red”. If these statements are well-formed, then it is possible to ask whether they do or do not logically supervene on the physical facts. If they do, then there is a reductionist definition of mental entities like consciousness: to say someone is conscious is just to make a statement about particle positions and so on. If they don’t, then the S-statements may take on different truth values in different models compatible with the same set of physical facts.

This could be roughly stated as, “It is logically conceivable that this animal has phenomenal consciousness of red, or not”. There is much controversy over the “conceivability” concept, but I hope my formulation is relatively unambiguous. Chalmers argues that we have strong reason to think phenomenal consciousness is real, that we don’t have a reductionistic definition of it, and that it is hard to imagine what such a definition would even look like; accordingly, he concludes that facts about phenomenal consciousness are not logically supervenient on physical facts, implying they are non-physical facts, showing physicalism (as the claim that there are no further facts beyond physical facts) to be false. (I’ll skip direct evaluation of this argument; the purpose is more to present a general analysis framework.)

Suppose S-statements are not logically supervenient on physical facts. They might still follow with some additional “metaphysical” axioms. I will not go into much detail on this possibility, but will note Kant’s Critique of Pure Reason as an example of an argument for the existence of metaphysical entities such as the a priori synthetic. Chalmers also notes Kripke as making metaphysical supervenience arguments in Naming and Necessity, although I haven’t looked into this. Metaphysical supervenience would challenge “conceivability” claims by claiming that possible worlds must satisfy additional metaphysical axioms to really be conceivable.

Suppose S-statements are not logically or metaphysically supervenient on physical facts. Then they may or may not be naturally supervenient. What it means for them to be naturally supervenient is that, across some “realistic set” of possible worlds, S-statements never take on different truth values for the same settings of physical facts.

The “realistic set” is not entirely clear here. What natural supervenience is meant to capture is that a functional relation between physical facts and S-facts always holds “in the real world”. For example, perhaps all animals in our universe with a given brain state have the same phenomenal conscious state. There would be some properties of our universe, similar to physical laws, which constrain the relationships between mental and physical entities. This gets somewhat tricky in that, arguably, only one set of assignments of truth values to physical statements and S-statements corresponds to “the real world”; thus, natural supervenience would be trivial. Accordingly, I am considering a somewhat broader set of assignments of truth values to physical statements and S-statements, the realistic set. This set may capture, for example, hypothetical universes with the same physics as ours but different initial conditions, some notions of quantum multiversal branches, and so on. This would allow considering supervenience across universes much like our own, even if the exact details are different. (Rather than considering “realistic worlds”, one could instead consider a “locality condition” by which e.g. natural supervenience requires that phenomenal entities at a given space-time location are only determined by “nearby” physical entities, as an alternative way of dodging triviality; however, this seems more complex, so I won’t focus on it.)

Chalmers argues that, in the case of S-facts being those about phenomenal consciousness, natural supervenience is likely. This is because of various thought experiments such as the “fading qualia” thought experiment. Briefly, the fading qualia thought experiment imagines that, if there are some physical entities (such as brains) that are conscious, and others (such as computers) that are not, while having the same causal input/output properties, then it is possible to imagine a gradual transformation from one to the other. For example, perhaps a brain’s neurons are progressively transformed into simulated ones running on a computer. The argument proceeds by noting that, under these assumptions, qualia must fade through the transformation, either gradually or suddenly. Gradual fading would be strange, because behavior would stay the same despite diminished consciousness; it would not be possible for the person with fading consciousness to express this in any way, despite them supposedly experiencing it. Sudden fading would be counter-intuitive because there is no clear reason to posit a discrete threshold at which qualia abruptly stop.

One general objection to natural supervenience is epiphenomenalism. This argument suggests that, since physics is causally closed, if the S-facts naturally supervene on physical facts, then they are caused by physics, but do not cause physics. Accordingly, they do not have explanatory value; physics already explains behavior. So Occam’s razor suggests that these statements/entities should not be posited. (Yudkowsky’s “Zombies? Zombies!” presents this sort of argument.)

Here we can branch between block universe theory and causal realism. According to the block universe theory, physics simply makes no statement as to whether some events cause others. So the notion that physics is causally closed is making an extra-physical claim. This is a potential obstacle for the epiphenomenalism objection. However, there may be a way to modify the objection to claim that S-facts lack explanatory value, even without making assumptions about physical causality; I’ll note this as an open question for now.

According to causal realism, physics does specify that physical states cause other physical states. Accordingly, the epiphenomenalism objection holds water. However, causal realism opens the possibility of epistemic skepticism about causality. Possibly, physical events do not cause each other, but rather are caused by some other events (N-events); N-events cause each other and cause physical events. There is not an effective way to tell the difference, if the scientific process can only observe physical events.

This possibility is somewhat obscure, so it might help to give more motivation for the idea. According to neutral monism, there is one underlying substance, which is neither fundamentally mental nor physical. Mental and physical entities are “aspects” of this single substance; the mental “lens” yields some set of entities, while the physical “lens” yields a different set of entities. The scientific process is somewhat limited in what it can observe (by requirements such as theories being replicable and about shared observations), such that it can only effectively study the physical aspect. Spinoza’s Ethics is an example of a neutral monist theory.

Rather than explain more details of neutral monism, I’ll instead re-emphasize that the epiphenomenalism objection must be analyzed differently for block universe theory vs. causal realism. These different notions of the physical imply different ontological status (non-physical vs. physical) for causality.

To summarize, when considering a new set of statements (S-statements), we can run them through a flowchart:

  1. Are the statements logically ill-formed in the theory? Then they can be discarded as meaningless.
  2. Are the statements well-formed and logically supervenient on physical facts? Then they have reductionist definitions.
  3. Are the statements well-formed and metaphysically but not logically supervenient on physical facts? Then, while there are multiple logically possible states of S-affairs given all physical facts, only one is metaphysically possible.
  4. Are the statements well-formed and neither logically nor metaphysically supervenient on physical facts, but always take on the same settings given physical facts as long as those physical facts are “realistic”? Then they are naturally supervenient; physical facts imply them in all realistic universes, but there are metaphysically possible though un-realistic universes where they take on different values.
  5.  Are the statements well-formed and neither logically nor metaphysically nor naturally supervenient on physical facts? Then they are “further facts”; there are multiple realistic, metaphysically possible universes with the same physical facts but different S-facts.

This set of questions is likely to help clarify what sort of statements, entities, events, and so on are being posited, and serve as a branch point for further analysis. The overall framework is general enough to cover not just statements about phenomenal consciousness, but also morality, decision-theoretic considerations, anthropics, and so on.
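
As a loose illustration of the branching structure (a sketch only; the names are mine and purely illustrative, not established terminology), the flowchart could be written as a simple classifier:

  from enum import Enum, auto

  class Status(Enum):
      MEANINGLESS = auto()    # 1. ill-formed in the theory
      REDUCIBLE = auto()      # 2. logically supervenient
      METAPHYSICAL = auto()   # 3. metaphysically but not logically supervenient
      NATURAL = auto()        # 4. naturally but not metaphysically supervenient
      FURTHER_FACT = auto()   # 5. not supervenient in any of the above senses

  def classify(well_formed: bool, logical: bool, metaphysical: bool, natural: bool) -> Status:
      """Walk the five questions in order; each argument is a yes/no answer."""
      if not well_formed:
          return Status.MEANINGLESS
      if logical:
          return Status.REDUCIBLE
      if metaphysical:
          return Status.METAPHYSICAL
      if natural:
          return Status.NATURAL
      return Status.FURTHER_FACT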

Why I am not a Theist

A theist, minimally, believes in a higher power, and believes that acting in accordance with that higher power’s will is normative. The higher power must be very capable; if not infinitely capable, it must be more capable than the combined forces of all current Earthly state powers.

Suppose that a higher power exists. When and where does it exist? To be more precise, I’ll use “HPE” to stand for “Higher Power & Effects”, to include the higher power itself, its interventionist effects, its avatars/communications, and so on. Consider four alternatives:

  1. HPEs exist in our past light-cone and our future light-cone.
  2. HPEs exist in our past light-cone, but not our future light-cone.
  3. HPEs don’t exist in our past light-cone, but do in our future light-cone.
  4. HPEs exist neither in our past light-cone nor our future light-cone; rather, HPEs exist eternally, outside time.

Possibility 1 would be a relatively normal notion of an interventionist higher power. This higher power would presumably have observable effects, miracles. 

Possibility 2 is a strange “God is dead” hypothesis; it raises questions about why the higher power and its interventions did not persist, if it was so powerful. I’ll ignore it for now.

Possibility 3 is broadly Singulatarian; it would include theories of AGI and/or biological superintelligence development in the future.

Possibility 4 is a popular theistic metaphysical take, but raises questions of the significance of an eternal higher power. If the higher power exists in a Tegmark IV multiverse way, it’s unclear how it could have effects on our own universe. As a comparison, consider someone who believes there is a parallel universe in which there are perpetual motion machines, but does not believe perpetual motion machines are possible in our universe; do they really believe in the existence of perpetual motion machines? Possibility 4 seems vaguely deist, in that the higher power is non-interventionist.

So a simple statement of why I am not a theist is that I am instead a Singulatarian. Only possibilities 1 and 4 seem intuitively theistic, and 4 seems more deist than centrally theist.

I dis-believe possibility 1, because of empirical and theoretical issues. Empirically, science seems to explain the universe pretty well, and provides a default reason to not believe in miracles. Meanwhile, the empirical evidence in favor of miracles is underwhelming; different religions disagree about what the miracles even are. If possibility 1 is true, the higher power seems to be actively trying to hide itself. Why believe in a higher power who wants us not to believe in it?

Theoretically, I’m thinking of the universe as at least somewhat analogous to a clockwork mechanism or giant computer. Suppose some Turing machine runs for a long time. At any point in the run, the past is finite; the history of the Turing machine could only have computed so much. The future could in theory be infinite, but that would be more Singulatarian than theistic.

If extremely hard (impossible according to mainstream physics) computations had been performed in the past, then we would probably be able to observe and confirm them (given P/NP asymmetry), showing that mainstream physics is false. I doubt that this is the case; the burden of proof is on those who believe giant computations happened to demonstrate them.

I realize that starting from the assumption of the universe as a mechanism or computer is not going to be convincing to those who have deep metaphysical disagreements, but I find that this sort of modeling has a theoretical precision and utility to it that I’m not sure how to get otherwise.

Now let’s examine possibility 3 in more detail, because I think it is likely. People can debate whether possibility 3 is theistic or not, but Singulatarianism is a more precise name regardless of how the definitional dispute resolves.

If Singulatarianism is true, then there could (probably would?) be a future time at which possibility 1 would be true from the perspective of some agent at that future time. This raises interesting questions about the relationship between theism (at least a weak form involving belief in a higher power, rather than a stronger form implying omniscience and omnipotence) and Singulatarianism.

One complication is that the existence of a higher power in the future might not be guaranteed. Perhaps human civilization and Earth-originating intelligence decline without ever creating a superintelligence. Even then, presently-distant alien superintelligences may someday intersect our future light-cone. To hedge, I’ll say Singulatarians need only consider the existence of a higher power in the future to be likely, not guaranteed.

To examine this possibility from Singulatarian empirical and theoretical premises, I will present a vaguely plausible scenario for the development of superintelligence:

Some human researchers realize that a somewhat-smarter-than-human being could be created, by growing an elephant with human brain cells in place of elephant brain cells. They initiate the Ganesha Project, which succeeds at this task. The elephant/human hybrid is known as Hebbo, which stands for Human/Elephant Brain/Body Organism. Hebbo is not truly a higher power, although it is significantly smarter than the smartest humans. Hebbo proceeds to lay out designs for the creation of an even smarter being. This being would have a very large brain, taking up the space of multiple ordinary-sized rooms. The brain would be hooked up to various computers, and robotic and biological actuators and sensors.

The humans, because of their cultural/ethical beliefs, consider this a good plan, and implement it. With Hebbo’s direction, they construct Gabbo, which stands for Giant Awesome Big-Brained Organism. Gabbo is truly a higher power than humans; Gabbo is very persuasive, knows a lot about the world, and organizes state-like functions very effectively, gaining more military capacity than all other Earthly state powers combined.

Humans have a choice to assist or resist Gabbo. But since Gabbo is a higher power, resistance is ultimately futile. The only effect of resistance is to slow down Gabbo’s execution of Her plans (I say Her because Gabbo is presumably capable of asexual reproduction, unlike any male organism). So aligning or mis-aligning with Gabbo’s will is the primary axis on which human agency has any cosmic effects.

Alignment with Gabbo’s will becomes an important, recognized normative axis. While the judgment that alignment with Gabbo’s will is morally good is meta-ethically contentious, Gabbo-alignment has similar or greater cultural respect compared with professional/legal ethics, human normative ethics (utilitarianism/deontology/virtue ethics), human religious normativity, and so on.

Humans in this world experience something like “meaning” or “worship” in relation to Gabbo. Alignment with Gabbo’s will is a purpose people can take on, and it tends to work out well for them. (If it is hard for you to imagine Gabbo would have use for humans while being a higher power, imagine scaling down Gabbo’s capability level until it’s at least a plausible transitional stage.)

Let’s presumptively assume that meaning really does exist for these humans; meaning-intuitions roughly match the actual structure of Gabbo, at least much better than anything in our current world does. (This presumption could perhaps be justified by linguistic parsimony; what use is a “meaning” token that doesn’t even refer to relatively meaningful-seeming physically possible scenarios?) Now, what does that imply about meaning for humans prior to the creation of Gabbo?

Let’s simplify the pre-Gabbo scenario a lot, so as to make analysis clearer. Suppose there is no plausible path to superintelligence, other than through creation of Hebbo and then Gabbo. Perhaps de novo AI research has been very disappointing, and human civilization is on the cusp of collapse, after which humans would never have enough capacity to create a superintelligence. Then the humans are faced with a choice: do they create Hebbo/Gabbo, and if so, earlier or later?

This becomes a sort of pre-emptive axis of meaning or normativity: if Gabbo would be meaningful upon being created, then, intuitively, affecting the creation or non-creation of Gabbo would be meaningful prior to Gabbo’s existence. Some would incline to create Gabbo earlier, some to create Gabbo later, and others to prevent creating Gabbo. But they could all see their actions as meaningful, due in part to these actions being in relation to Gabbo.

In this scenario, my own intuitions favor the possibility of creating Hebbo/Gabbo, and creating them earlier. I realize it can be hard to justify normative intuitions, but I do really feel these intuitions. I’ll present some considerations that incline me in this direction.

First, Gabbo would have capacity to pursue forms of value that humans can’t imagine due to limited cognitive capacity. I think, if I had a bigger brain, then I would have new intuitions about what is valuable, and that these new intuitions would be better: smarter, more well-informed, and so on.

Second, Gabbo would be awesome, and cause awesome possibilities that wouldn’t have otherwise happened. A civilizational decline with no follow-up of creating superintelligence just seems disappointing, even from a cognitively limited human perspective. Gabbo, meanwhile, would go on to develop great ideas in mathematics, science, technology, history, philosophy, strategy, and so on. Maybe Gabbo creates an inter-galactic civilization with a great deal of internal variety.

Third, there is something appealing about trusting in a higher cognitive being. I tend to think smart ideas are better than stupid ideas. It seems hard to dis-entangle this preference from fundamental epistemic normativity. Gabbo seems to be a transitional stage in the promotion of smart ideas over stupid ones. Promoting smart ideas prior to Gabbo will tend to increase the probability that Gabbo is created (as Gabbo is an incredible scientific and engineering accomplishment); and Gabbo would go on to create and promote even smarter ideas. It is hard for me to get morally worked up against higher cognition itself.

On the matter of timing, creating Gabbo earlier both increases Gabbo’s ability to do more before the heat death of the universe, and presumably increases the probability of Gabbo eventually being created, given the looming threat of the decline of civilization and eventually of Earth-originating intelligence.

The considerations I have offered are not especially rigorous. I’ll present a more mathematical argument.

Let us consider some distribution D over “likely” superintelligent values. The values are a vector in the Euclidean space \mathbb{R}^n. The distribution could be derived in a variety of ways: by looking at the distribution of (mostly-alien) superintelligences across the universe/multiverse, by extrapolating likely possibilities in our own timeline, and so on.

Let us also assume that D only takes values in the unit sphere centered at zero. This expresses that the values are a sort of magnitude-less direction; this avoids overly weighting superintelligences with especially strong values. (It’s easier to analyze this way; future work could deal with preferences that vary in magnitude.)

Upon a superintelligence with values U coming into being, it optimizes the universe. The universe state is also a vector in \mathbb{R}^n, and the value assigned to this state by the value vector is the dot product of the value vector with the state vector.

Not all states are feasible. To simplify, let’s say the feasible states are the points of the closed unit ball centered at the origin: those with an L2 norm not exceeding 1. The optimal feasible state according to a value vector on the unit sphere is, of course, the value vector itself. Let us also say that the default state, with no superintelligence optimization, is the zero vector; this is because superintelligence optimization is in general much more powerful than human-level optimization, such that the effects of human-level optimization on the universe are negligible unless mediated by a superintelligence.

Now we can gauge the default alignment between superintelligences: how much does a random superintelligence like the result of another random superintelligence’s optimization?

We can write this as:

\mathbb{E}_{U, V \sim D}[U \cdot V].

Where U is the value vector of the optimizing superintelligence, and V is the value vector of the evaluating superintelligence.

Using the summation rule \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y], we simplify:

\mathbb{E}_{U, V \sim D}[U \cdot V] = \mathbb{E}_{U, V \sim D}\left[ \sum_{i=1}^n U_i V_i \right] = \sum_{i=1}^n \mathbb{E}_{U, V \sim D}[ U_i V_i ].

Using the product rule \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] when X and Y are independent, we further simplify:

\sum_{i=1}^n \mathbb{E}_{U, V \sim D}[ U_i V_i ] = \sum_{i=1}^n \mathbb{E}_{U \sim D}[U_i] \mathbb{E}_{V \sim D}[V_i] = \sum_{i=1}^n \mathbb{E}_{U \sim D}[U_i]^2 = || \mathbb{E}_{U \sim D} [U] ||^2.

This is simply the squared L2-norm of the mean of D. Clearly, it is non-negative. It is only zero when the mean of D equals zero. This doesn’t seem likely by default; “most”, even “almost all”, distributions won’t have this property. To say that the mean of D is zero is a conjunctive, specific prediction; meanwhile, to say the mean of D is non-zero is a disjunctive “anti-prediction”.

So, according to almost all possible D, a random superintelligence would evaluate a universe optimized according to an independently random superintelligence’s values positively in expectation compared to an un-optimized universe.
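
As a numerical sanity check on the algebra, here is a minimal Monte Carlo sketch (Python). The particular choice of D, a non-centered Gaussian normalized onto the unit sphere, is mine and only for illustration:

  import numpy as np

  rng = np.random.default_rng(0)
  n, samples = 5, 200_000

  # An arbitrary non-centered distribution D on the unit sphere:
  # sample a Gaussian with a nonzero mean, then normalize to unit length.
  offset = np.array([0.3, -0.1, 0.0, 0.2, 0.05])

  def sample_D(k):
      x = rng.normal(size=(k, n)) + offset
      return x / np.linalg.norm(x, axis=1, keepdims=True)

  U = sample_D(samples)  # values of the optimizing superintelligence
  V = sample_D(samples)  # values of the evaluating superintelligence (independent)

  lhs = np.mean(np.sum(U * V, axis=1))                   # estimate of E[U . V]
  rhs = np.sum(np.mean(sample_D(samples), axis=0) ** 2)  # estimate of ||E[U]||^2

  print(lhs, rhs)  # the two estimates should roughly agree, and be positive

For any non-centered choice of D, both printed numbers come out positive, matching the derivation above.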

There are, of course, many ways to question how relevant this is to relatively realistic cases of superintelligence creation. But it at least somewhat validates my intuition that there are some convergent values across superintelligences: that a universe being at all optimized is generally preferable, from a smarter perspective, to it being un-optimized, even when that perspective differs from that of the optimizing superintelligence. (I wrote more about possible convergences of superintelligent optimization in The Obliqueness Thesis, but omit most of these considerations in this post for brevity.)

To get less abstract, let’s consider distant aliens. They partially optimize their world according to their values. My intuition is that I would consider this better than them not doing so, if I understood the details. Through the aliens’ intelligent optimization, they develop great mathematics, science, philosophy, history, art, and so on. Humans would probably appreciate their art, at least simple forms of their art (intended e.g. for children), more than they would appreciate the artistic value of marginal un-optimized nature on the alien planet.

Backtracking, I would incline towards thinking that creation of a Gabbo-like superintelligence is desirable, if there is limited ability to determine the character of the superintelligence, but rather a forced choice between doing so and not. And earlier creation seems desirable in the absence of steering ability, due to earlier creation increasing the probability of creation happening at all, given the possible threat of civilizational decline.

Things, of course, get more complicated when there are more degrees of freedom. What if Gabbo represents only one of a number of options, which may include alternative biological pathways not involving human brain cells? What about some sort of octopus uplifting, for example? Given a choice between a few alternatives, there are likely to be disagreements between humans about which alternative is the most desirable. They would find meaningful, not just the question of whether to create superintelligence, but which one to create. They may have conflicts with each other over this, which might look vaguely like conflicts between people who worship different deities, if it is hard to find rational grounding for their normative and/or metaphysical disagreements.

What if there are quite a lot of degrees of freedom? What if strong AI alignment is possible in practice for humans to do, and humans have the ability to determine the character of the coming superintelligence in a great deal of detail? While I think there are theoretical and empirical problems with the idea that superintelligence alignment is feasible for humans (rather than superintelligences) to actually do, and especially with the strong orthogonality thesis, it might be overly dogmatic to rule out the possibility.

A less dogmatic belief, which the mathematical modeling relates to, is that there are at least some convergences in superintelligent optimization; that superintelligent values are not so precisely centered at zero that they don’t value optimization by other superintelligences at all; that there are larger and smaller attractors in superintelligent value systems.

The “meaning”, “trust in a higher power”, or “god-shaped hole” intuitions common in humans might have something to connect with here. Of course, the details are unclear (we’re talking about superintelligences, after all), and different people will incline towards different normative intuitions. (It’s also unclear whether there are objective rational considerations for discarding meaning-type intuitions; but, such considerations would tend to discard other value-laden concepts, hence not particularly preserving human values in general.)

I currently incline towards axiological cosmism, the belief that there are higher forms of value that are difficult or impossible for humans to understand, but which superintelligences would be likely to pursue. I don’t find the idea of humans inscribing their values on future superintelligences particularly appealing in the abstract. But a lot of these intuitions are hard to justify or prove.

What I mainly mean to suggest is that there is some relationship between Singulatarianism and theism. Is it possible to extrapolate the theistic meaning that non-superintelligent agents would find in a world that contains at least one superintelligence, backwards to a pre-singularity world, and if so, how? I think there is room for more serious thought on the topic. If you’re allergic to this metaphysical framing, perhaps take it as an exercise in clarifying human values in relation to possible future superintelligences, using religious studies as a set of empirical case studies.

So, while I am not a theist, I can notice some cognitive similarities between myself and theists. It’s not hard to find some areas of overlap, despite deep differences in thought frameworks. Forming a rigorous and comprehensive atheistic worldview requires a continual exercise of careful thought, and must reach limits at some point in humans, due to our cognitive limitations. I think there are probably much higher intelligences “out there”, both distant alien superintelligences and likely superintelligences in our quantum multiversal future, who deserve some sort of respect for their beyond-human cognitive accomplishments. There are deep un-answered questions regarding meaning, normativity, the nature of the mind, and so on, and it’s probably possible to improve on default atheistic answers (such as existentialism and absurdism) with careful thought.

“Self-Blackmail” and Alternatives

Ziz has been in the news lately. Instead of discussing that, I’ll discuss an early blog post, “Self-Blackmail”. This is a topic I also talked with Ziz about in person, although not a lot.

Let’s start with a very normal thing people do: make New Year’s resolutions. They might resolve that, for example, they will do strenuous exercise at least 2 times a week for the next year. Conventional wisdom is that these are not very effective.

Part of the problem is that breaking commitments even once cheapens the commitment: once you have “cheated” once, there’s less of a barrier to cheating in the future. So being sparing about these explicit commitments can make them more effective:

I once had a file I could write commitments in. If I ever failed to carry one out, I knew I’d forever lose the power of the file. It was a self-fulfilling prophecy. Since any successful use of the file after failing would be proof that a single failure didn’t have the intended effect, so there’d be no extra incentive.

If you always fulfill the commitments, there is an extra incentive to fulfill additional commitments, namely, it can preserve the self-fulfilling prophecy that you always fulfill commitments. Here’s an example in my life: sometimes, when I have used addictive substances (e.g. nicotine), I have made a habit of tracking usage. I’m not trying to commit not to use them, rather, I’m trying to commit to track usage. This doesn’t feel hard to maintain, and it has benefits, such as noticing changes in the amount of substance consumed. And it’s in an area, addictive substances, where conventional wisdom is that human intuition is faulty and willpower is especially useful.

Ziz describes using this technique more extensively, in order to do more work:

I used it to make myself do more work. It split me into a commander who made the hard decisions beforehand, and commanded who did the suffering but had the comfort of knowing that if I just did the assigned work, the benevolent plans of a higher authority would unfold. As the commanded, responsibility to choose wisely was lifted from my shoulders. I could be a relatively shortsighted animal and things’d work out fine. It lasted about half a year until I put too much on it with too tight a deadline. Then I was cursed to be making hard decisions all the time. This seems to have improved my decisions, ultimately.

Compared to my “satisficer” usage of self-blackmail to track substance usage, this is more of a “maximizer” style where Ziz tries to get a lot of work out of it. This leads to more problems, because the technique relies on consistency, which is more achievable with light “satisficer” commitments.

There’s a deeper problem, though. Binding one’s future self is confused at a psychological and decision-theoretic level:

Good leadership is not something you can do only from afar. Hyperbolic discounting isn’t the only reason you can’t see/feel all the relevant concerns at all times. Binding all your ability to act to the concerns of the one subset of your goals manifested by one kind of timeslice of you is wasting potential, even if that’s an above-average kind of timeslice.

If you’re not feeling motivated to do what your thesis advisor told you to do, it may be because you only understand that your advisor (and maybe grad school) is bad for you and not worth it when it is directly and immediately your problem. This is what happened to me. But I classified it as procrastination out of “akrasia”.

Think back to the person who made a New Year’s resolution to strenuously exercise twice a week. This person may, in week 4, have the thought, “I made this commitment, and I really need to exercise today to make it, but I’m so busy, and tired. I don’t want to do this. But I said I would. It’s important. I want to keep the commitment that is in my long-term interest, not just do whatever seems right in the moment.” This is a self-conflicted psychological mode. Such self-conflict corresponds to decision-theoretic irrationality.

One type of irrationality is the aforementioned hyperbolic discounting; self-blackmail could, theoretically, be a way of correcting dynamic inconsistencies in time preference. However, as Ziz notes, there are also epistemic and computational problems: the self who committed to a New Year’s resolution has thought little about the implications, and lacks information relevant to the future decisions, such as how busy they will be over the year.

A sometimes very severe problem is that the self-conflicted psychological state can have a lot of difficulty balancing different considerations and recruiting the brain’s resources towards problem-solving. This is often experienced as “akrasia”. A commitment to, for example, a grad school program can generate akrasia, due to the self-conflict between the student’s feeling that they should finish the program and other considerations that could lead to not doing so, but which are suppressed from consideration because they seem un-virtuous. In psychology, this is sometimes known as “topdog vs. underdog”.

Personally, I have had the repeated experience of being excited about a project and working on it with others, but becoming demotivated over time and eventually quitting. This is expensive, in both time and money. At the time, I often have difficulty generating reasons why continuing to work on the project is a bad idea. But, usually, a year later, it’s very easy to come up with reasons why quitting was a good idea.

Ziz is glad that the self-blackmail technique ultimately failed. There are variations that have more potential sustainability, such as Beeminder:

These days there’s Beeminder. It’s a far better designed commitment mechanism. At the core of typical use is the same threat by self fulfilling prophecy. If you lie to Beeminder about having accomplished the thing you committed to, you either prove Beeminder has no power over you, or prove that lying to Beeminder will not break its power over you, which means it has no consequences, which means Beeminder has no power over you.

But Beeminder lets you buy back into its service.

It’s worse than a crutch, because it doesn’t just weaken you through lack of forced practice. You are practicing squashing down your capacity to act on “What do I want?, What do I have?, and How can I best use the latter to get the former?” in the moment. When you set your future self up to lose money if they don’t do what you say, you are practicing being blackmailed.

Beeminder is a method for staking money on completing certain goals. Since lying to Beeminder is psychologically harder than simply breaking a commitment you wrote to yourself, use of Beeminder can last longer than use of the original self-blackmail technique. Also, being able to buy back into the service makes a “reset” possible, which was not possible with the original technique.

Broadly, I agree with Ziz that self-blackmail techniques, and variations like Beeminder, are imprudent to use ambitiously. I think there are beneficial “satisficer” usages of these techniques, such as for tracking addictive substance usage; one is not in these cases tempted to stack big, hard-to-follow commitments.

What interests me more, though, are better ways to handle commitments in general, both commitments to the self and to others. I see a stronger case for explicit commitments with enforcement when dealing with other agents. For example, a contract to rent a car has terms signed by both parties, with potential legal enforcements for violating the terms.

This has obvious benefits. Even if you could theoretically get the benefits of car rental contracts with the ideal form of TDT spiritual love between moral agents, that’s computationally expensive at best. Contract law is a common part of successful mercantile cultures for a reason.

And, as with the original self-blackmail technique, there are potential self-fulfilling ways of keeping your word to another; you can be trusted more to fulfill commitments in the future if you have always fulfilled commitments made in the past. (Of course, always fulfilling commitments requires being sparing about making them.)

Let’s now consider, rather than inter-personal commitments, self-commitments. Consider alternatives to making a new year’s resolution to exercise twice a week. Suppose you actually believe that you will do resistance training about twice a week for the next year. Then, perhaps it is prudent to invest in a home gym. Investing in the gym is, in a way, a “bet” about your future actions: it will turn out to have been not worth it, if you rarely use it. Though, it’s an unusual type of bet, in that the outcome of the bet is determined by your future actions (thus potentially being influenced by self-fulfilling prophecies).

A more general formula: Instead of making a commitment from sheer force of will, think about the range of possible worlds where you actually fulfill the commitment. Think about what would be good decisions right now, conditional on fulfilling the commitment in the future. These are “bets” on fulfilling the commitment, and are often well thought of as “investments”. Now, ask two questions:

  1. If I take these initial steps, do I expect that I’ll fulfill the commitment?
  2. If I take these initial steps, and then fulfill the commitment, do I overall like the result, compared to the default alternative?

If the answers to both are “yes”, that suggests that the commitment-by-bet is overall prudent, compared with the default. (Of course, there are more possible actions if the answer to either question is “no”, including re-thinking the commitment or the initial steps, or going ahead with the initial steps anyway on expected value grounds.)

The overall idea here is to look for natural decision-theoretic commitment opportunities. Investing in a home gym, for example, is a good idea for people who make some sorts of decisions in the future (like regular resistance training), and a bad idea for people who make different sorts of decisions in the future. It’s not an artificial mechanism like giving your stuff to a friend who only gives it back if you exercise enough. It’s a feature of the decision-theoretic landscape, where making certain decisions ahead of time is only prudent conditional on certain future actions.

Something hard to model here is the effect of such investments/bets on a person’s future action through “self-fulfilling-prophecy” or “hyperstitional” means. For example, perhaps if you actually invest in a home gym, people including you will think of you as the sort of person who benefits from a home gym, who is a sort of person who exercises regularly. Such a change to one’s self-image, and external image, can influence what it feels natural to do in the future.

To be clear, I’m not recommending making performative investments in things corresponding to what you would like to be doing in the future. Instead, I’m advising thinking through what would actually be a good investment conditional on the imagined future actions. For example, even if you are going to exercise regularly, it’s not clear that a home gym is a good investment: a gym membership may be a better idea. And it’s prudent to take into account the chance of not exercising in the future, making the investment useless: my advised decision process counts this as a negative, not a useful self-motivating punishment. The details will, of course, depend on the specific situation.
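
To make the “chance of not exercising” consideration concrete, here is a toy expected-value comparison (Python). All numbers are made up, and this arithmetic is my own illustration rather than the two-question check above:

  # Toy comparison of investments conditional on uncertain future exercise habits.
  p_exercise = 0.5  # honest estimate of actually training ~2x/week over the year

  options = {
      # name: (upfront cost, value if you do train regularly, value if you don't)
      "home gym":       (1500, 3000, 0),
      "gym membership": (600,  2500, 100),  # some residual value even if rarely used
      "do nothing":     (0,    0,    0),
  }

  for name, (cost, if_train, if_not) in options.items():
      ev = p_exercise * if_train + (1 - p_exercise) * if_not - cost
      print(f"{name}: expected value {ev:+.0f}")

With these made-up numbers, the home gym only wins if the probability of actually training is fairly high; the point is just that the chance of non-fulfillment enters as an ordinary cost, not as a motivating punishment.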

This sort of commitment-by-bet can be extended to inter-personal situations, to some degree. For example, suppose two people like the idea of living together long-term. They could, as an alternative to making promises to each other about this, think of bets/investments that would be a good idea conditional on living together long-term, such as getting a shared mortgage on a house. That’s more likely to be prudent conditional on them living together long-term. And the cost of not living together is denominated more materially and financially, rather than in broken promises.

To summarize: I suggest that, as an alternative to making explicit commitments they will feel bound by in the future, people could consider locating commitment opportunities that are already out there, in the form of decisions that are only prudent conditional on some future actions; taking such an opportunity constitutes a “bet” or “investment” on taking those actions in the future. This overall seems more compatible with low levels of psychological self-conflict, which has broad benefits for the committer’s ability to un-confusedly model the world and act agentically.

The Obliqueness Thesis

In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It’s a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land’s views. (Land gets credit for inspiring many of these thoughts, of course, but I’m presenting my views as my own here.)

First, let’s define the Orthogonality Thesis. Quoting Superintelligence for Bostrom’s formulation:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

To me, the main ambiguity about what this is saying is the “could in principle” part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let’s consider Yudkowsky’s formulations as alternatives. Quoting Arbital:

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
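
In rough symbols (my paraphrase, writing K for code/description complexity):

K(\text{agent with intelligence } I \text{ and goal } G) \approx K(I) + K(G) + O(1).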

I think, overall, it is more productive to examine Yudkowsky’s formulation than Bostrom’s, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky’s formulations, I am less likely to be criticizing a strawman. I will use “Weak Orthogonality” to refer to Yudkowsky’s “Orthogonality Thesis” and “Strong Orthogonality” to refer to Yudkowsky’s “strong form of the Orthogonality Thesis”.

Land, alternatively, describes a “diagonal” between intelligence and goals as an alternative to orthogonality, but I don’t see a specific formulation of a “Diagonality Thesis” on his part. Here’s a possible formulation:

Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.

The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.

What about my own view? I like Tsvi’s naming of it as an “obliqueness thesis”.

Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.

(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)

While I will address Yudkowsky’s arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems to me that arguments for and against the Orthogonality Thesis are not mathematically rigorous; therefore, I don’t need to present a mathematically rigorous case in order to contribute relevant considerations. I will treat intuitive arguments as relevant, and present multiple arguments rather than a single sequential argument (as I did with the more rigorous argument for many worlds).

Bayes/VNM point against Orthogonality

Some people may think that the free parameters in Bayes/VNM point towards the Orthogonality Thesis being true. I think, rather, that they point against Orthogonality. While they do function as arguments against the Diagonality Thesis, this is insufficient for Orthogonality.

First, on the relationship between intelligence and bounded rationality. It’s meaningless to talk about intelligence without a notion of bounded rationality. Perfect rationality in a complex environment is computationally intractable. With lower intelligence, bounded rationality is necessary. So, at non-extreme intelligence levels, the Orthogonality Thesis must be making a case that boundedly rational agents can have any computationally tractable goal.

Bayesianism and VNM expected utility optimization are known to be computationally intractable in complex environments. That is why algorithms like MCMC and reinforcement learning are used. So, making an argument for Orthogonality in terms of Bayesianism and VNM is simply dodging the question, by already assuming an extremely high intelligence level from the start.

As the Orthogonality Thesis refers to “values” or “final goals” (which I take to be synonymous), it must have a notion of the “values” of agents that are not extremely intelligent. These values cannot be assumed to be VNM, since VNM is not computationally tractable. Meanwhile, money-pumping arguments suggest that extremely intelligent agents will tend to converge to VNM-ish preferences. Thus:

Argument from Bayes/VNM: Agents with low intelligence will tend to have beliefs/values that are far from Bayesian/VNM. Agents with high intelligence will tend to have beliefs/values that are close to Bayesian/VNM. Strong Orthogonality is false because it is awkward to combine low intelligence with Bayesian/VNM beliefs/values, and awkward to combine high intelligence with far-from-Bayesian/VNM beliefs/values. Weak Orthogonality is in doubt, because having far-from-Bayesian/VNM beliefs/values puts a limit on the agent’s intelligence.

To summarize: un-intelligent agents cannot be assumed to be Bayesian/VNM from the start. Those arise at a limit of intelligence, and arguably have to arise due to money-pumping arguments. Beliefs/values therefore tend to become more Bayesian/VNM with high intelligence, contradicting Strong Orthogonality and perhaps Weak Orthogonality.

One could perhaps object that logical uncertainty allows even weak agents to be Bayesian over combined physical/mathematical uncertainty; I’ll address this consideration later.

Belief/value duality

It may be unclear why the Argument from Bayes/VNM refers to both beliefs and values, as the Orthogonality Thesis is only about values. It would, indeed, be hard to make the case that the Orthogonality Thesis is true as applied to beliefs. However, various arguments suggest that Bayesian beliefs and VNM preferences are “dual” such that complexity can be moved from one to the other.

Abram Demski has presented this general idea in the past, and I’ll give a simple example to illustrate.

Let A \in \mathcal{A} be the agent’s action, and let W \in \mathcal{W} represent the state of the world prior to (and unaffected by) the agent’s action. Let r(A, W) be the outcome resulting from the action and world. Let P(w) be the primary agent’s probability of a given world w. Let U(o) be the primary agent’s utility for outcome o. The primary agent finds an action a to maximize \sum_{w \in \mathcal{W}} P(w) U(r(a, w)).

Now let e be an arbitrary predicate on worlds. Consider modifying P to increase the probability that e(W) is true. That is:

P'(w) :\propto P(w) (1 + [e(w)])

P'(w) = \frac{P(w)(1 + [e(w)])}{\sum_{w' \in \mathcal{W}} P(w')(1 + [e(w')])}

where [e(w)] equals 1 if e(w), otherwise 0. Now, can we define a modified utility function U’ so that a secondary agent with beliefs P’ and utility function U’ will take the same action as the primary agent? Yes (assuming, for simplicity, that the world w is recoverable from the outcome o, so that [e(w)] is determined by o):

U'(o) := \frac{U(o)}{1 + [e(w)]}

This secondary agent will find an action a to maximize: 

\sum_{w \in \mathcal{W}} P'(w) U'(r(a, w))

= \sum_{w \in \mathcal{W}} \frac{P(w)(1 + [e(w)])}{\sum_{w' \in \mathcal{W}} P(w')(1 + [e(w')])} \frac{U(r(a, w))}{1 + [e(w)]}

= \frac{1}{\sum_{w \in \mathcal{W}} P(w)(1 + [e(w)])} \sum_{w \in \mathcal{W}} P(w) U(r(a, w))

Clearly, this is a positive constant times the primary agent’s maximization target, so the secondary agent will take the same action.
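
To make the construction concrete, here is a minimal numerical sketch (Python; the worlds, utilities, and predicate are randomly generated and purely illustrative), checking that the primary and secondary agents choose the same action:

  import numpy as np

  rng = np.random.default_rng(1)
  n_actions, n_worlds = 4, 6

  P = rng.dirichlet(np.ones(n_worlds))        # primary agent's beliefs P(w)
  U = rng.normal(size=(n_actions, n_worlds))  # utilities U(r(a, w)), indexed by (a, w)
  e = rng.integers(0, 2, size=n_worlds)       # arbitrary predicate: [e(w)] is 0 or 1

  # Modified beliefs P'(w) proportional to P(w)(1 + [e(w)]), and modified
  # utilities U'(r(a, w)) = U(r(a, w)) / (1 + [e(w)]).
  P_mod = P * (1 + e)
  P_mod /= P_mod.sum()
  U_mod = U / (1 + e)

  primary_scores = U @ P            # sum_w P(w) U(r(a, w)) for each action a
  secondary_scores = U_mod @ P_mod  # sum_w P'(w) U'(r(a, w)) for each action a

  # The secondary scores are a positive constant times the primary scores,
  # so both agents pick the same argmax action.
  assert np.argmax(primary_scores) == np.argmax(secondary_scores)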

This demonstrates a basic way that Bayesian beliefs and VNM utility are dual to each other. One could even model all agents as having the same utility function (of maximizing a random variable U) and simply having different beliefs about what U values are implied by the agent’s action and world state. Thus:

Argument from belief/value duality: From an agent’s behavior, multiple belief/value combinations are valid attributions. This is clearly true in the limiting Bayes/VNM case, suggesting it also applies in the case of bounded rationality. It is unlikely that the Strong Orthogonality Thesis applies to beliefs (including priors), so, due to the duality, it is also unlikely that it applies to values.

I consider this weaker than the Argument from Bayes/VNM. Someone might object that both values and a certain component of beliefs are orthogonal, while the other components of beliefs (those that change with more reasoning/intelligence) aren’t. But I think this depends on a certain factorizability of beliefs/values into the kind that change on reflection and those that don’t, and I’m skeptical of such factorizations. I think discussion of logical uncertainty will make my position on this clearer, though, so let’s move on.

Logical uncertainty as a model for bounded rationality

I’ve already argued that bounded rationality is essential to intelligence (and therefore the Orthogonality Thesis). Logical uncertainty is a form of bounded rationality (as applied to guessing the probabilities of mathematical statements). Therefore, discussing logical uncertainty is likely to be fruitful with respect to the Orthogonality Thesis.

Logical Induction is a logical uncertainty algorithm that produces a probability table for a finite subset of mathematical statements at each iteration. These beliefs are determined by a betting market of an increasing (up to infinity) number of programs that make bets, with the bets resolved by a “deductive process” that is basically a theorem prover. The algorithm is computable, though extremely computationally intractable, and has properties in the limit including some forms of Bayesian updating, statistical learning, and consistency over time.

We can see Logical Induction as evidence against the Diagonality Thesis: beliefs about undecidable statements (which exist in consistent theories due to Gödel’s first incompleteness theorem) can take on any probability in the limit, though satisfy properties such as consistency with other assigned probabilities (in a Bayesian-like manner).

However, (a) it is hard to know ahead of time which statements are actually undecidable, and (b) even beliefs about undecidable statements tend to predictably change over time, towards Bayesian consistency with other beliefs about undecidable statements. So, Logical Induction does not straightforwardly factorize into a “belief-like” component (which converges on enough reflection) and a “value-like” component (which doesn’t change on reflection). Thus:

Argument from Logical Induction: Logical Induction is a current best-in-class model of theoretical asymptotic bounded rationality. Logical Induction is non-Diagonal, but also clearly non-Orthogonal, and doesn’t apparently factorize into separate Orthogonal and Diagonal components. Combined with considerations from “Argument from belief/value duality”, this suggests that it’s hard to identify all value-like components in advanced agents that are Orthogonal in the sense of not tending to change upon reflection.

One can imagine, for example, introducing extra function/predicate symbols into the logical theory the logical induction is over, to represent utility. Logical induction will tend to make judgments about these functions/predicates more consistent and inductively plausible over time, changing its judgments about the utilities of different outcomes towards plausible logical probabilities. This is an Oblique (non-Orthogonal and non-Diagonal) change in the interpretation of the utility symbol over time.

Likewise, Logical Induction can be specified to have beliefs over empirical facts such as observations by adding additional function/predicate symbols, and can perhaps update on these as they come in (although this might contradict UDT-type considerations). Through more iteration, Logical Inductors will come to have more approximately Bayesian, and inductively plausible, beliefs about these empirical facts, in an Oblique fashion.

Even if there is a way of factorizing out an Orthogonal value-like component from an agent, the belief-component (represented by something like Logical Induction) remains non-Diagonal, so there is still a potential “alignment problem” for these non-Diagonal components to match, say, human judgments in the limit. I don’t see evidence that these non-Diagonal components factor into a value-like “prior over the undecidable” that does not change upon reflection. So, there remain components of something analogous to a “final goal” (by belief/value duality) that are Oblique, and within the scope of alignment.

If it were possible to get the properties of Logical Induction in a Bayesian system, which makes Bayesian updates on logical facts over time, that would make it more plausible that an Orthogonal logical prior could be specified ahead of time. However, MIRI researchers have tried for a while to find Bayesian interpretations of Logical Induction, and failed, as would be expected from the Argument from Bayes/VNM.

Naive belief/value factorizations lead to optimization daemons

The AI alignment field has a long history of poking holes in alignment approaches. Oops, you tried making an oracle AI and it manipulated real-world outcomes to make its predictions true. Oops, you tried to do Solomonoff induction and got invaded by aliens. Oops, you tried getting agents to optimize over a virtual physical universe, and they discovered the real world and tried to break out. Oops, you ran a Logical Inductor and one of the traders manipulated the probabilities to instantiate itself in the real world.

These sub-processes that take over are known as optimization daemons. When you get the agent architecture wrong, sometimes a sub-process (that runs a massive search over programs, such as with Solomonoff Induction) will luck upon a better agent architecture and out-compete the original system. (See also a very strange post I wrote some years back while thinking about this issue, and Christiano’s comment relating it to Orthogonality).

If you apply a naive belief/value factorization to create an AI architecture, when compute is scaled up sufficiently, optimization daemons tend to break out, showing that this factorization was insufficient. Enough experiences like this lead to the conclusion that, if there is a realistic belief/value factorization at all, it will look pretty different from the naive one. Thus:

Argument from optimization daemons: Naive ways of factorizing an agent into beliefs/values tend to lead to optimization daemons, which have different values from those in the original factorization. Any successful belief/value factorization will probably look pretty different from the naive one, and might not take the form of factorization into Diagonal belief-like components and Orthogonal value-like components. Therefore, if any realistic formulation of Orthogonality exists, it will be hard to find and substantially different from naive notions of Orthogonality.

Intelligence changes the ontology values are expressed in

The most straightforward way to specify a utility function is to specify an ontology (a theory of what exists, similar to a database schema) and then provide a utility function over elements of this ontology. Prior to humans learning about physics, evolution (taken as a design algorithm for organisms involving mutation and selection) did not know all that human physicists know. Therefore, human evolutionary values are unlikely to be expressed in the ontology of physics as physicists currently believe in.

Human evolutionary values probably care about things like eating enough, social acceptance, proxies for reproduction, etc. It is unknown how these are specified, but perhaps sensory signals (such as stomach signals) are connected with a developing world model over time. Humans can experience vertigo at learning physics, e.g. thinking that free will and morality are fake, leading to unclear applications of native values to a realistic physical ontology. Physics has known gaps (such as quantum/relativity correspondence, and dark energy/dark matter) that suggest further ontology shifts.

One response to this vertigo is to try to solve the ontology identification problem; find a way of translating states in the new ontology (such as physics) to an old one (such as any kind of native human ontology), in a structure-preserving way, such that a utility function over the new ontology can be constructed as a composition of the original utility function and the new-to-old ontological mapping. Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented. Thus:

Argument from ontological change: More intelligent agents tend to change their ontology to be more realistic. Utility functions are most naturally expressed relative to an ontology. Therefore, there is a correlation between an agent’s intelligence and utility function, through the agent’s ontology as an intermediate variable, contradicting Strong Orthogonality. There is no known solution for rescuing the old utility function in the new ontology, and some research intuitions point towards any solution being unsatisfactory in some way.

If a satisfactory solution is found, I’ll change my mind on this argument, of course, but I’m not convinced such a satisfactory solution exists. To summarize: higher intelligence causes ontological changes, and rescuing old values seems to involve unnatural “warps” to make the new ontology correspond with the old one, contradicting at least Strong Orthogonality, and possibly Weak Orthogonality (if some values are simply incompatible with realistic ontology). Paperclips, for example, tend to appear most relevant at an intermediate intelligence level (around human-level), and become more ontologically unnatural at higher intelligence levels.

As a more general point, one expects possible mutual information between mental architecture and values, because values that “re-use” parts of the mental architecture achieve lower description length. For example, if the mental architecture involves creating universal algebra structures and finding analogies between them and the world, then values expressed in terms of such universal algebras will tend to have lower relative description complexity to the architecture. Such mutual information contradicts Strong Orthogonality, as some intelligence/value combinations are more natural than others.
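
In rough symbols (my paraphrase, with K for description complexity, holding up to the usual additive constants), writing A for the mental architecture and V for the values, such “re-use” amounts to:

K(V \mid A) < K(V), \text{ i.e. the (algorithmic) mutual information } K(V) - K(V \mid A) \text{ is positive.}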

Intelligence leads to recognizing value-relevant symmetries

Consider a number of unintuitive value propositions people have argued for:

  • Torture is preferable to Dust Specks, because it’s hard to come up with a utility function with the alternative preference without horrible unintuitive consequences elsewhere.
  • People are way too risk-averse in betting; the implied utility function has too strong diminishing marginal returns to be plausible.
  • You may think your personal identity is based on having the same atoms, but you’re wrong, because you’re distinguishing identical configurations.
  • You may think a perfect upload of you isn’t conscious (and is basically just another copy rather than you), but you’re wrong, because the functionalist theory of mind is true.
  • You intuitively accept the premises of the Repugnant Conclusion, but not the Conclusion itself; you’re simply wrong about one of the premises, or the conclusion.

The point is not to argue for these, but to note that these arguments have been made and are relatively more accepted among people who have thought more about the relevant issues than among people who haven’t. Thinking tends to lead to noticing more symmetries and dependencies between value-relevant objects, and tends to adjust values to be more mathematically plausible and natural. Extrapolating this to superintelligence, of course, suggests that even more symmetries would be noticed. Thus:

Argument from value-relevant symmetries: More intelligent agents tend to recognize more symmetries related to value-relevant entities. They will also tend to adjust their values according to symmetry considerations. This is an apparent value change, and it’s hard to see how it can instead be factored as a Bayesian update on top of a constant value function.

I’ll examine such factorizations in more detail shortly.

Human brains don’t seem to neatly factorize

This is less about the Orthogonality Thesis generally, and more about human values. If there were separable “belief components” and “value components” in the human brain, with the value components remaining constant over time, that would increase the chance that at least some Orthogonal component can be identified in human brains, corresponding with “human values” (though, remember, the belief-like component can also be Oblique rather than Diagonal).

However, human brains seem much more messy than the sort of computer program that could factorize this way. Different brain regions are connected in at least some ways that are not well-understood. Additionally, even apparent “value components” may be analogous to something like a deep Q-learning function, which incorporates empirical updates in addition to pre-set “values”.
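
As a toy illustration of the Q-learning point (my example, not a claim about actual neural implementation): a standard tabular Q-learning update folds the reward signal and the empirically observed transitions into the same table, so there is no separate sub-object that could be read off as “the values” independent of “the beliefs”.

```python
# Tabular Q-learning on a made-up 2-state, 2-action environment.
# The learned table entangles reward information (value-like) with
# transition statistics (belief-like); neither is stored separately.
import random

states, actions = [0, 1], [0, 1]
alpha, gamma = 0.1, 0.9
Q = {(s, a): 0.0 for s in states for a in actions}

def step(s, a):
    # Toy dynamics and rewards, purely illustrative.
    s_next = (s + a) % 2
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

s = 0
for _ in range(10_000):
    a = random.choice(actions)                      # explore uniformly
    s_next, r = step(s, a)
    best_next = max(Q[(s_next, b)] for b in actions)
    # One update mixes the observed transition and the observed reward
    # into the same learned number.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next

print(Q)  # "values" and "beliefs about dynamics", inseparably combined
```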

The interaction between human brains and language is also relevant. Humans develop the values they act on partly through language. And language (including language that reports values) is affected by empirical updates and reflection, and is thus non-Orthogonal. Reflecting on morality can easily change people’s expressed and acted-upon values, as in the case of Peter Singer. People can change which values they report as instrumental or terminal even while behaving similarly (e.g. flipping between selfishness-as-terminal and altruism-as-terminal), with the ambiguity hard to resolve because most behavior relates to convergent instrumental goals.

Maybe language is more of an effect than a cause of values. But there really seems to be feedback from language to the non-linguistic brain functions that decide actions and so on. Attributing coherent values over realistic physics to the non-linguistic parts of the brain seems like a form of projection or anthropomorphism. Language and thought play a functional role in cognition and in attaining coherent values over realistic ontologies. Thus:

Argument from brain messiness: Human brains don’t seem to neatly factorize into a belief-component and a value-component, with the value-component unaffected by reflection or language (as it would need to be in order to be Orthogonal). To the extent any value-component does not change due to language or reflection, it is restricted to evolutionary human ontology, which is unlikely to apply to realistic physics; language and reflection are part of the process that refines human values, rather than being an afterthought of them. Therefore, if the Orthogonality Thesis is true, humans lack identifiable values that fit into the values axis of the Orthogonality Thesis.

This doesn’t rule out that Orthogonality could apply to superintelligences, of course, but it does raise questions for the project of aligning superintelligences with human values; perhaps such values do not exist or are not formulated so as to apply to the actual universe.

Models of ASI should start with realism

Some may take arguments against Orthogonality to be disturbing at a value level, perhaps because they are attached to research projects such as Friendly AI (or more specific approaches), and think questioning foundational assumptions would make the objective (such as alignment with already-existing human values) less clear. I believe “hold off on proposing solutions” applies here: better strategies are likely to come from first understanding what is likely to happen absent a strategy, then afterwards looking for available degrees of freedom.

Quoting Yudkowsky:

Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

Likewise, Obliqueness does not imply that we shouldn’t think about the future and ways of influencing it, that we should just give up on influencing the future because we’re doomed anyway, that moral realist philosophers are correct or that their moral theories are predictive of ASI, that ASIs are necessarily morally good, and so on. The Friendly AI research program was formulated based on descriptive statements believed at the time, such as that an ASI singleton would eventually emerge, that the Orthogonality Thesis is basically true, and so on. Whatever cognitive process formulated this program would have formulated a different program conditional on different beliefs about likely ASI trajectories. Thus:

Meta-argument from realism: Paths towards beneficially achieving human values (or analogues, if “human values” don’t exist) in the far future likely involve a lot of thinking about likely ASI trajectories absent intervention. The realistic paths towards human influence on the far future depend on realistic forecasting models for ASI, with Orthogonality/Diagonality/Obliqueness as alternative forecasts. Such forecasting models can be usefully thought about prior to formulation of a research program intended to influence the far future. Formulating and working from models of bounded rationality such as Logical Induction is likely to be more fruitful than assuming that bounded rationality will factorize into Orthogonal and Diagonal components without evidence in favor of this proposition. Forecasting also means paying more attention to the Strong Orthogonality Thesis than the Weak Orthogonality Thesis, as statistical correlations between intelligence and values will show up in such forecasts.

On Yudkowsky’s arguments

Now that I’ve explained my own position, addressing Yudkowsky’s main arguments may be useful. His main argument has to do with humans making paperclips instrumentally:

Suppose some strange alien came to Earth and credibly offered to pay us one million dollars’ worth of new wealth every time we created a paperclip. We’d encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about the following (see the sketch after this list):

  • How many paperclips would result, if I pursued a policy \pi_0?
  • How can I search out a policy \pi that happens to have a high answer to the above question?
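
Here is a minimal sketch of those two capabilities (the world model, candidate policies, and numbers are invented for illustration; this is not Yudkowsky’s formalization):

```python
# Estimate paperclips-per-policy (belief-like), then search for a policy
# that scores highly on that estimate (planning-like).
import random

def simulate(policy: int) -> float:
    # Stand-in world model: a "policy" is just a number of factories to build,
    # each producing a noisy number of paperclips.
    return sum(random.gauss(10, 2) for _ in range(policy))

def estimate_paperclips(policy: int, n_rollouts: int = 100) -> float:
    # "How many paperclips would result, if I pursued this policy?"
    return sum(simulate(policy) for _ in range(n_rollouts)) / n_rollouts

def search_policy(candidates) -> int:
    # "How can I search out a policy that happens to have a high answer?"
    return max(candidates, key=estimate_paperclips)

best = search_policy(range(1, 6))
print(best, estimate_paperclips(best))
```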

I believe it is better to think of the payment as coming in the far future and perhaps in another universe; that way, the belief about future payment is more analogous to terminal values than instrumental values. In this case, creating paperclips is a decent proxy for achievement of human value, so long-termist humans would tend to want lots of paperclips to be created.

I basically accept this, but, notably, Yudkowsky’s argument is based on belief/value duality. He thinks it would be awkward for the reader to imagine terminally wanting paperclips, so he instead asks them to imagine a strange set of beliefs leading to paperclip production being oddly correlated with human value achievement. Thus, acceptance of Yudkowsky’s premises here will tend to strengthen the Argument from belief/value duality and related arguments.

In particular, more intelligence would cause human-like agents to develop different beliefs about which actions aliens are likely to reward, and about how many paperclips different policies result in. This points towards Obliqueness, as with Logical Induction: such beliefs will be revised over time (without totally converging), leading to different strategies being applied toward value achievement. And ontological issues around what counts as a paperclip will come up at some point, and will likely be decided in a prior-dependent but also reflection-dependent way.

Beliefs about which aliens are most capable/honest likely depend on human priors, and are therefore Oblique: humans would want to program an aligned AI to mostly match these priors while revising beliefs along the way, but can’t easily factor out their prior for the AI to share.

Now onto other arguments. The “Size of mind design space” argument implies many agents exist with different values from humans, which agrees with Obliqueness (intelligent agents tend to have different values from unintelligent ones). It’s more of an argument about the possibility space than statistical correlation, thus being more about Weak than Strong Orthogonality.

The “Instrumental Convergence” argument doesn’t appear to be an argument for Orthogonality per se; rather, it’s a counter to arguments against Orthogonality based on noticing convergent instrumental goals. My arguments don’t take this form.

Likewise, “Reflective Stability” is about a particular convergent instrumental goal (preventing value modification). In an Oblique framing, a Logical Inductor will tend not to change its beliefs about even undecidable propositions too often (as doing so would lead to money-pumps), so consistency is valued all else being equal.

While I could go into more detail responding to Yudkowsky, I think space is better spent presenting my own Oblique views for now.

Conclusion

As an alternative to the Orthogonality Thesis and the Diagonality Thesis, I present the Obliqueness Thesis, which says that increasing intelligence tends to lead to value changes but not total value convergence. I have presented arguments that advanced agents and humans do not neatly factor into Orthogonal value-like components and Diagonal belief-like components, using Logical Induction as a model of bounded rationality. This implies complications for theories of AI alignment that assume humans have values and that the AGI needs to come to agree with those values even as its intelligence increases (and its beliefs therefore change).

At a methodological level, I believe it is productive to start by forecasting default ASI using models of bounded rationality, especially known models such as Logical Induction, and further developing such models. I think this is more productive than assuming that these models will take the form of a belief/value factorization, although I have some uncertainty about whether such a factorization will be found.

If the Obliqueness Thesis is accepted, what possibility space results? One could think of this as steering a boat in a current of varying strength. Clearly, ignoring the current and just steering where you want to go is unproductive, as is going along with the current and not trying to steer at all. Getting where one wants to go consists largely in going with the current (if it’s strong enough) while charting a course that takes it into account.

Assuming Obliqueness, it’s not viable to have large impacts on the far future without accepting some value changes that come from higher intelligence (and better epistemology in general). The Friendly AI research program already accepts that paths towards influencing the far future involve “going with the flow” regarding superintelligence, ontology changes, and convergent instrumental goals; Obliqueness says such flows go further than just these, being hard to cleanly separate from values.

Obliqueness obviously leaves open the question of just how oblique. It’s hard to even formulate a quantitative question here. I’d very intuitively and roughly guess that intelligence and values are 3 degrees off (that is, almost diagonal), but it’s unclear what question I am even guessing the answer to. I’ll leave formulating and answering the question as an open problem.

I think Obliqueness is realistic, and that it’s useful to start with realism when thinking of how to influence the far future. Maybe superintelligence necessitates significant changes away from current human values; the Litany of Tarski applies. But this post is more about the technical thesis than emotional processing of it, so I’ll end here.