A philosophical kernel: biting analytic bullets

Sometimes, a philosophy debate has two basic positions, call them A and B. A matches a lot of people’s intuitions, but is hard to make realistic. B is initially unintuitive (sometimes radically so), perhaps feeling “empty”, but has a basic realism to it. There might be third positions that claim something like, “A and B are both kind of right”.

Here I would say B is the more bullet-biting position. Free will vs. determinism is a classic example: hard determinism is biting the bullet. One interesting thing is that free will believers (including compatibilists) will invent a variety of different theories to explain or justify free will; no one theory seems clearly best. Meanwhile, hard determinism has stayed pretty much the same since ancient Greek fatalism.

While there are some indications that the bullet-biting position is usually the more correct one, I don’t mean to make an overly strong claim here. Sure, position A (or some reconciliation of A and B) could really be correct, even if the right formalization hasn’t been found yet. Nonetheless, I am interested in what views result from biting bullets at every stage.

Why consider biting multiple bullets in sequence? Consider an analogy: a Christian fundamentalist considers whether Christ’s resurrection didn’t really happen. He reasons: “But if the resurrection didn’t happen, then Christ is not God. And if Christ is not God, then humanity is not redeemed. Oh no!”

There’s clearly a mistake here, in that a revision of a single belief can lead to problems that are avoided by revising multiple beliefs at once. In the Christian fundamentalist case, atheists and non-fundamentalists already exist, so it’s pretty easy not to make this mistake. On the other hand, many of the (explicit or implicit) intuitions in the philosophical water supply may be hard to think outside of; there may not be easily identifiable “atheists” with respect to many of these intuitions simultaneously.

Some general heuristics. Prefer ontological minimality: do not explode types of entities beyond necessity. Empirical plausibility: generally agree with well-established science and avoid bold empirical claims; at most, cast doubt on common scientific background assumptions (see: Kant decoupling subjective time from clock time). Un-creativity: avoid proposing speculative, experimental frameworks for decision theory and so on (they usually don’t work out).

What’s the point of all this? Maybe the resulting view is more likely true than other views. Even if it isn’t true, it might be a minimal “kernel” view that supports adding more elements later, without conflicting with legacy frameworks. It might be more productive to argue against a simple, focused, canonical view than a popular “view” which is really a disjunctive collection of many different views; bullet-biting increases simplicity, hence perhaps being more productive to argue against.

Causality: directed acyclic graph multi-factorization

Empirically, we don’t see evidence of time travel. Events seem to proceed from past to future, with future events being at least somewhat predictable from past events. This can be seen in probabilistic graphical models: Bayesian networks have a directed acyclic graph factorization (which can be topologically sorted, perhaps in multiple ways), while factor graphs in general don’t. (For example, it is possible to express, as a factor graph, the distribution of a Bayesian network conditioned on some variable taking some value; the resulting factor graph expresses something like “teleology”: events tend to happen more when they are compatible with some future possibility.)

This raises the issue that there are multiple Bayesian networks with different graphs expressing the same joint distribution. For ontological minimality, we could say these are all valid factorizations (so there is no “further fact” of what is the real factorization, in cases of persistent empirical ambiguity), though of course some have analytically nicer mathematical properties (locality, efficient computability) than others. Each non-trivial DAG factorization has mathematical implications about the distribution; we need not forget these implications even though there are multiple DAG factorizations.
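As a concrete illustration, here is a minimal Python sketch (the joint distribution and function names are made up for the example) of two DAG factorizations of the same two-variable joint distribution: graph A → B versus graph B → A. Both recover exactly the same probabilities, so nothing in the distribution itself privileges one factorization as the “real” one.

    # Two DAG factorizations of the same joint distribution over binary A, B.
    # Factorization 1: P(A) P(B|A), i.e. graph A -> B.
    # Factorization 2: P(B) P(A|B), i.e. graph B -> A.
    p_joint = {
        (0, 0): 0.3, (0, 1): 0.1,
        (1, 0): 0.2, (1, 1): 0.4,
    }

    def marginal(index):
        out = {0: 0.0, 1: 0.0}
        for assignment, p in p_joint.items():
            out[assignment[index]] += p
        return out

    p_a, p_b = marginal(0), marginal(1)

    def via_a_to_b(a, b):          # P(A) * P(B|A)
        return p_a[a] * (p_joint[(a, b)] / p_a[a])

    def via_b_to_a(a, b):          # P(B) * P(A|B)
        return p_b[b] * (p_joint[(a, b)] / p_b[b])

    for a in (0, 1):
        for b in (0, 1):
            assert abs(via_a_to_b(a, b) - p_joint[(a, b)]) < 1e-12
            assert abs(via_b_to_a(a, b) - p_joint[(a, b)]) < 1e-12

The two factorizations are interchangeable at the level of the joint; they differ in locality and computational convenience, not in empirical content.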

Bayesian networks can be generalized to probabilistic programming, e.g. some variables may only exist dependent on specific values for previous variables. This doesn’t change the overall setup much; the basic ideas are already present in Bayesian networks.

We now have a specific disagreement with Judea Pearl: he operationalizes causality in terms of the consequences of counterfactual interventions. These are sensitive to the graph ordering of the directed acyclic graph; hence, causal graphs express more information than the joint distribution. For ontological minimality, we’ll avoid reifying causal counterfactuals and hence causal graphs. Causal counterfactuals have theoretical problems, such as implying violations of physical law and hence being underdetermined by empirical science (we can’t observe what happens when physical laws are violated). We avoid these problems by not believing in causal counterfactuals.

Since causal counterfactuals are about non-actual universes, we don’t really need them to make the empirical predictions of causal models, such as no time travel. DAG factorization seems to do the job.

Laws of physics: universal satisfaction

Given a DAG model, some physical invariants may hold, e.g. conservation of energy. And if we transform the DAG model to one expressing the same joint distribution, the physical invariants translate. They always hold for any configuration in the DAG’s support.

Do the laws have “additional reality” beyond universal satisfaction? It doesn’t seem we need to assume they do. We predict as if the laws always hold, but that reduces to a statement about the joint configuration; no extra predictive power results from assuming the laws have any additional existence.

So for ontological minimality, the reality of a law can be identified with its universal satisfaction by the universe’s trajectory. (This is weaker than notions of “counterfactual universal satisfaction across all possible universes”.)

This enables us to ask questions similar to counterfactuals: what would follow (logically, or with high probability according to the DAG) in a model in which these universal invariants hold, and the initial state is X (which need not match the actual universe’s initial state)? This is a mathematical question, rather than a modal one; see discussion of mathematics later.
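To make this concrete, here is a toy Python sketch (the dynamics, state encoding, and the invariant are invented purely for illustration) of answering such a question: fix an update rule under which a quantity is universally conserved, plug in a hypothetical initial state X that need not match the actual universe, and compute what follows.

    # Toy dynamics on (position, kinetic, potential) in which total energy
    # (kinetic + potential) is a universally satisfied invariant of the rule.
    def step(state):
        position, kinetic, potential = state
        if potential > 0:
            # convert one unit of potential energy into kinetic energy, then move
            return (position + kinetic + 1, kinetic + 1, potential - 1)
        return (position + kinetic, kinetic, potential)

    def trajectory(initial_state, n_steps):
        states = [initial_state]
        for _ in range(n_steps):
            states.append(step(states[-1]))
        return states

    # A hypothetical initial state X, not claimed to match the actual universe.
    X = (0, 0, 5)
    for s in trajectory(X, 10):
        assert s[1] + s[2] == X[1] + X[2]   # the invariant holds along the trajectory

What the model says about the non-actual initial state is settled by computation within the model, with no appeal to a modal “what would have happened”.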

Time: eternalism

Eternalism says the future exists, as the past and present do. This is fairly natural from the DAG factorization notion of causality. As there are multiple topological sorts of a given DAG, and multiple DAGs consistent with the same joint distribution, there isn’t an obvious way to separate the present from the past and future; and even if there were, there wouldn’t be an obvious point in declaring some nodes real and others un-real based on their topological ordering. Accordingly, for ontological minimality, they have the same degree of existence.

Eternalism is also known as “block universe theory”. There’s a possible complication, in that our DAG factorization can be stochastic. But the stochasticity need not be “located in time”. In particular, we can move any stochasticity into independent random variables, and have everything be a deterministic consequence of those. This is like pre-computing random numbers for a Monte Carlo sampling algorithm.
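A minimal Python sketch of this move (the random walk and function names are illustrative only): a process that flips coins “in time” produces exactly the same trajectory as a deterministic function of independently pre-sampled bits.

    import random

    def stochastic_walk(n_steps, rng):
        # randomness consumed "in time", one coin flip per step
        x, path = 0, [0]
        for _ in range(n_steps):
            x += 1 if rng.random() < 0.5 else -1
            path.append(x)
        return path

    def deterministic_walk(bits):
        # the same process as a deterministic function of a-temporal noise variables
        x, path = 0, [0]
        for bit in bits:
            x += 1 if bit else -1
            path.append(x)
        return path

    rng = random.Random(0)
    precomputed_bits = [rng.random() < 0.5 for _ in range(20)]
    assert stochastic_walk(20, random.Random(0)) == deterministic_walk(precomputed_bits)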

The main empirical ambiguity here is whether the universe’s history has a high Kolmogorov complexity, increasing approximately linearly with time. If it does, then something like a stochastic model is predictively appropriate, although the stochasticity need not be “in time”. If not, then it’s more like classical determinism. It’s an open empirical question, so let’s not be dogmatic.

We can go further. Do we even need to attribute “true stochasticity” to a universe with high Kolmogorov complexity? Instead, we can say that simple universally satisfied laws constrain the trajectory, either partially or totally (only partially in the high K-complexity case). And to the extent they only partially do, we have no reason to expect that a simple stochastic model of the remainder would be worse than any other model (except high K-complexity ones that “bake in” information about the remainder, which is a bit of a cheat). (See “The Coding Theorem — A Link between Complexity and Probability” for technical details.)

Either way, we have “quasi-determinism”; everything is deterministic, except perhaps factored-out residuals that a simple stochastic model suffices for.

Free will: non-realism

A basic argument against free will: free will for an agent implies that the agent could have done something else. This already implies a “possibility”-like modality; if such a modality is not real, free will fails. If on the other hand, possibility is real, then, according to standard modal logics such as S4, any logical tautology must be necessary. If an agent is identified with a particular physical configuration, then, given the same physics / inputs / stochastic bits (which can be modeled as non-temporal extra parameters, per previous discussion), there is only one possible action, and it is necessary, as it is logically tautological. Hence, a claim of “could” about any other action fails.

Possible ways out: consider giving the agent different inputs, or different stochastic bits, or different physics, or don’t identify the agent with its configuration (have “could” change the agent’s physical configuration). These are all somewhat dubious. For one, it is dogmatic to assume that the universe has high Kolmogorov complexity; if it doesn’t, then modeling decisions as having corresponding “stochastic bits” can’t in general be valid. Free will believers don’t tend to agree on how to operationalize “could”, their specific formalizations tend to be dubious in various ways, and the formalizations do not agree much with normal free will intuitions. The obvious bullet to bite here is, there either is no modal “could”, or if there is, there is none that corresponds to “free will”, as the notion of “free will” bakes in confusions.

Decision theory: non-realism

We reject causal decision theory (CDT), because it relies on causal counterfactuals. We reject any theory of “logical counterfactuals”, because the counterfactual must be illogical, contradicting modal logics such as S4. Without applying too much creativity, what remain are evidential decision theory (EDT) and non-realism, i.e. the claim that there is not in general a fact of the matter about what action by some fixed agent best accomplishes some goal.

To be fair to EDT, the smoking lesion problem is highly questionable in that it assumes decisions could be caused by genes (without those genes changing the decision theory, value function, and so on), contradicting the assumption that the agent implements EDT. Moreover, there are logical formulations of EDT, which ask whether it would be good news to learn that one’s algorithm outputs a given action given a certain input (the one you’re seeing), where “good news” is evaluated across a class of possible universes, not just the one you have evidence of; these may better handle “XOR blackmail”-style problems.

Nevertheless, I won’t dogmatically assume based on failure of CDT and logical counterfactual theories that EDT works; EDT theorists have to do a lot to make EDT seem to work in strange decision-theoretic thought experiments. This work can introduce ontological extras such as infinitesimal probabilities, or similarly, pseudo-Bayesian conditionals on probability 0 events. From a bullet-biting perspective, this is all highly dubious, and not really necessary.

We can recover various “practical reason” concepts as statistical predictions about whether an agent will succeed at some goal, given evidence about the agent, including that agent’s actions. For example, as a matter of statistical regularity, some people succeed in business more than others, and there is empirical correlation with their decision heuristics. The difference is that this is a third-personal evaluation, rather than a first-personal recommendation: we make no assumption that third-person predictive concepts relating to practical reason translate to a workable first-personal decision theory. (See also “Decisions are not about changing the world, they are about learning what world you live in”, for related analysis.)

Morality: non-realism

This shouldn’t be surprising. Moral realism implies that moral facts exist, but where would they exist? No proposal of a definition in terms of physics, math, and so on has been generally convincing, and they vary quite a lot. G.E. Moore observes that any precise definition of morality (in terms of physics and so on) seems to leave an “open question” of whether that is really good, and compelling to the listener.

There are many possible minds (consider the space of AGI programs), and they could find different things compelling. There are statistical commonalities (e.g. minds will tend to make decisions compatible with maintaining an epistemology and so on), but even commonalities have exceptions. (See “No Universally Compelling Arguments”.)

Suppose you really like the categorical imperative and think rational minds have a general tendency to follow it. If so, wouldn’t it be more precise to say “X agent follows the categorical imperative” than “X agent acts morally”? This bakes in fewer intuitive confusions.

As an analogy, suppose some people refer to members of certain local bird species as a “forest spirit”, due to a local superstition. You could call such a bird a “forest spirit” by which you mean a physical entity of that bird species, but this risks baking in a superstitious confusion.

In addition, the discussion of free will and decision theory shows that there are problems with formulating possibility and intentional action. If, as Kant says, “ought implies can”, then contrapositively “not can implies not ought”; if modal analysis shows that alternative actions for a given agent are not possible, then no alternative action can be something the agent ought to do. (Alternatively, if modal possibility is unreal, then “ought implies can” is confused to begin with.) This is really not the interpretation of “ought” intended by moral realists; it makes “ought” redundant with the actual action.

Theory of mind: epistemic reductive physicalism

Chalmers claims that mental properties are “further facts” on top of physical properties, based on the zombie argument: it is conceivable that a universe physically identical to ours could exist, but with no consciousness in it. Ontological minimality suggests not believing in these “further facts”, especially given how dubious theories of consciousness tend to be. This seems a lot like eliminativism.

We don’t need to discard all mental concepts, though. Some mental properties such as logical inference and memory have computational interpretations. If I say my computer “remembers” something, I specify a certain set of physical configurations that way: the ones corresponding to computers with that something in the memory (e.g. RAM). I could perhaps be more precise than “remembers”, by saying something like “functionally remembers”.

A possible problem with eliminativism is that it might undermine the idea that we know things, including any evidence for eliminativism. It is epistemically judicious to have some ontological status for “we have evidence of this physical theory” and so on. The idea of reductive physicalism is to identify such statements with physical ones, such as: “in the universe, most agents who use this or that epistemic rule are right about this or that”. (It would be a mistake to assume, given a satisficing epistemology evaluation over existent agents, that we “could” maximize epistemology with a certain epistemic rule; that would open up the usual decision-theoretic complications. Evaluating the reliability of our epistemologies is more like evaluating third-personal practical reason than making first-personal recommendations.)

That might be enough. If it’s not enough then ontological minimality suggests adding as little as possible to physicalism to express epistemic facts. We don’t need a full-blown theory of consciousness to express meaningful epistemic statements.

Personal identity: empty individualism, similarity as successor

If a machine scans you and makes a nearly-exact physical copy elsewhere, is that copy also you? Paradoxes of personal identity abound. Whether that copy is “really you” seems like a non-question; if it had an answer, where would that answer be located?

Logically, we have a minimal notion of personal identity from mathematical identity (X=X). So, if X denotes (some mathematical object corresponding to) you at some time, then X=X. This is an empty notion of individualism, as it fails to hold that you are the same as recent past or future versions of yourself.

What’s fairly simple and predictive to say beyond X=X is that a near-exact copy of you is similar to you, as you are similar to near past and future versions of yourself, as two prints of a book are similar, and as two world maps are similar. There are also directed properties (rather than symmetric similarity), such as you remembering the experiences of past versions of yourself but not vice versa; these reduce to physical properties rather than being further properties, as in the theory of mind section.

It’s easy to get confused about which entities are “really the same person”. Ontological minimality suggests there isn’t a general answer, beyond trivial reflexive identities (X=X). The successor concept is, then, something like similarity. (And getting too obsessed with “how exactly to define similarity?” misses the point; the use of similarity is mainly predictive/evidential, not metaphysical.)

Anthropic probability: non-realism, graph structure as successor

In the Sleeping Beauty problem, is the correct probability ½ or ⅓? It seems the argument is over nothing real. Halfers and thirders agree on a sort of graph structure of memory: the initial Sleeping Beauty “leads to” one or two future states, depending on the coin flip, in terms of functional memory relations. The problem has to do with translating the graph structure to a probability distribution over future observations and situations (from the perspective of the original Sleeping Beauty).
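A minimal Python sketch of the shared structure (the branch encoding and weighting functions are illustrative, not a standard formalization): both camps accept the same tree of awakenings; they differ only in how awakening nodes are weighted into a single “probability of heads”.

    # Each branch: (coin outcome, prior probability, awakenings in that branch).
    branches = [
        ("heads", 0.5, ["monday"]),
        ("tails", 0.5, ["monday", "tuesday"]),
    ]

    # Halfer weighting: weight each branch by its prior, ignoring awakening counts.
    halfer_p_heads = sum(p for coin, p, wakes in branches if coin == "heads")

    # Thirder weighting: weight each awakening by its branch prior, then normalize.
    awakenings = [(coin, p) for coin, p, wakes in branches for _ in wakes]
    total = sum(p for _, p in awakenings)
    thirder_p_heads = sum(p for coin, p in awakenings if coin == "heads") / total

    print(halfer_p_heads)   # 0.5
    print(thirder_p_heads)  # 0.333...

Nothing in the agreed-upon graph settles which weighting is “the” probability; that is the part the non-realist declines to reify.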

From physics and identification of basic mental functions, we get a graph-like structure; why add more ontology? Enough thought experiments involving memory wipes, upload copying, and so on suggest that the linear structure of memory and observation is not always valid.

This slightly complicates the idea of physical theories being predictive, but it seems possible to operationalize prediction without a full notion of subjective probability. We can ask questions like, “do most entities in the universe who use this or that predictive model make good predictions about their future observations?”. The point here isn’t to get a universal notion of good predictions, but rather one that is good enough to get basic inferences, like learning about universal physical laws.

Mathematics: formalism

Are mathematical facts, such as “Fermat’s Last Theorem is true”, real? If so, where are they? Are they in the physical universe, or at least partially in a different realm?

Both of these are questionable. If we try to identify “for all n,m: n + S(m) = S(n + m)” with “in the universe, it is always the case that adding n objects to S(m) objects yields S(n + m) objects”, we run into a few problems. First, it requires identifying objects in physics. Second, given a particular definition of object, physics might not be such that this rule always holds: maybe adding a pile of sand to another pile of sand reduces the number of objects (as it combines two piles into one), or perhaps some objects explode when moved around; meanwhile, mathematical intuition is that these laws are necessary. Third, the size of the physical universe limits how many test cases there can be; hence, we might un-intuitively conclude something like “for all n,m both greater than Graham’s number, n=m”, as the physical universe has no counter-examples. Fourth, the size of the universe limits the possible information content of any entity in it, forcing something like ultrafinitism.

On the other hand, the idea that the mathematical facts live even partially outside the universe is ontologically and epistemically questionable. How would we access these mathematical facts, if our behaviors are determined by physics? Why even assume they exist, when all we see is in the universe, not anything outside of it?

Philosophical formalism does not explain “for all n,m: n + S(m) = S(n + m)” by appealing to a universal truth, but by noting that our formal system (in this case, Peano arithmetic) derives it. A quasi-invariant holds: mathematicians in practice tend to follow the rules of the formal system. And mathematicians use one formal system rather than another for physical, historical reasons. Peano arithmetic, for example, is useful: it models numbers in physics theories and in computer science, yielding predictions because the structure of its inferences has some correspondence with the structure of physics. Utility, though, is a contingent fact about our universe; what problems are considered useful to solve varies with historical circumstances. Formal systems are also adopted for reasons other than utility, such as the momentum of past practice or the prestige of earlier work.
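As an illustration of “the formal system derives it”, here is a minimal Lean 4 sketch (the names PNat, add, and add_succ are made up for this example) in which the statement follows from the defining rules alone, with no appeal to where the fact “lives”.

    -- A Peano-style formal system: "for all n m, n + S(m) = S(n + m)"
    -- is derived from the defining equations of addition, not read off
    -- from an external realm of mathematical facts.
    inductive PNat where
      | zero : PNat
      | succ : PNat → PNat

    def add : PNat → PNat → PNat
      | n, PNat.zero   => n
      | n, PNat.succ m => PNat.succ (add n m)

    -- Follows by unfolding the second defining equation of `add`.
    theorem add_succ (n m : PNat) : add n (PNat.succ m) = PNat.succ (add n m) := by
      simp [add]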

The thing we avoid with philosophical formalism is confusions over “further facts”, such as the Continuum Hypothesis, which has been shown to be independent of ZFC. We don’t need to think there is a real fact of the matter about whether the Continuum Hypothesis is true.

Formalism is suggestive of finitism and intuitionism, although these are additional principles of formal systems; we don’t need to conclude something like “finitism is true” per se. The advantage of such formal systems is that they may be a bit more “self-aware” as being formal systems; for example, intuitionism is less suggestive that there is always a fact of the matter regarding undecidable statements (like a Gödelian sentence), as it does not accept the law of the excluded middle. But, again, these are particular formal systems, which have advantages and disadvantages relative to other formal systems; we don’t need to conclude that any of these are “the correct formal system”.

Conclusion

The positions sketched here are not meant to be a complete theory of everything. They are a deliberately stripped-down “kernel” view, obtained by repeatedly biting bullets rather than preserving intuitions that demand extra ontology. Across causality, laws of physics, time, free will, decision theory, morality, mind, personal identity, anthropic probability, and mathematics, the same method has been applied:

  • Strip away purported “further facts” not needed for empirical adequacy.
  • Treat models as mathematical tools for describing the world’s structure, not as windows onto modal or metaphysical realms.
  • Accept that some familiar categories like “could,” “ought,” “the same person,” or “true randomness” may collapse into redundancy or dissolve into lighter successors such as statistical regularity or similarity relations.

This approach sacrifices intuitive richness for structural economy. But the payoff is clarity: fewer moving parts, fewer hidden assumptions, and fewer places for inconsistent intuitions to be smuggled in. Even if the kernel view is incomplete or false in detail, it serves as a clean baseline — one that can be built upon, by adding commitments with eyes open to their costs.

The process is iterative. For example, I stripped away a causal counterfactual ontology to get a DAG structure; then stripped away the temporal location of stochasticity, moving it into a-temporal uniform bits; then suggested that residuals not determined by simple physical laws (in a high Kolmogorov complexity universe) need not be “truly stochastic”, just well-predicted by a simple stochastic model. Each round makes the ontology lighter while preserving empirical usefulness.

It is somewhat questionable to infer, from the lack of success in defining, say, an optimal decision theory, that no such decision theory exists. This provides an opportunity for falsification: solve the problem really well. A sufficiently reductionist solution may be compatible with the philosophical kernel; otherwise, an extension might be warranted.

I wouldn’t say I outright agree with everything here, but the exercise has shifted my credences toward these beliefs. As with the Christian fundamentalist analogy, resistance to biting particular bullets may come from revising too few beliefs at once.

A practical upshot is that a minimal philosophical kernel can be extended more easily without internal conflict, whereas a more complex system is harder to adapt. If someone thinks this kernel is too minimal, the challenge is clear: propose a compatible extension, and show why it earns its ontological keep.

Measuring intelligence and reverse-engineering goals

It is analytically useful to define intelligence in the context of AGI. One intuitive notion is epistemology: an agent’s intelligence is how good its epistemology is, how good it is at knowing things and making correct guesses. But “intelligence” in AGI theory often means more than epistemology. An intelligent agent is supposed to be good at achieving some goal, not just knowing a lot of things.

So how could we define intelligent agency? Marcus Hutter’s universal intelligence measures an agent’s ability to achieve observable reward across a distribution of environments; AIXI maximizes this measure. Testing across a distribution makes sense: it avoids penalizing “unlucky” agents who fail in the actual world despite using effective strategies that succeed most of the time. However, maximizing observable reward is a sort of fixed goal function; it can’t account for intelligent agents that effectively achieve goals other than reward-maximization. This relates to inner alignment: an agent may not be “inner aligned” with AIXI’s reward maximization objective, yet still be intelligent in the sense of effectively accomplishing something else.

To generalize, it is problematic to score an agent’s intelligence on the basis of a fixed utility function. It is fallacious to imagine a paperclip maximizer and say “it is not smart, it doesn’t even produce a lot of staples!” (or happiness for conscious beings or whatever). Hopefully, the opposite confusion, a relativist pluralism of intelligence measures, can also be avoided.

Of practical import is the agent’s “general effectiveness”. Both a paperclip maximizer and a staple maximizer would harness energy effectively, e.g. effectively harnessing nuclear energy from stars. A generalization is Omohundro’s basic AI drives or convergent instrumental goals: these are what effective utility-maximizing agents would tend to pursue almost regardless of the utility function.

So a proposed rough definition: An agent is intelligent to the extent it tends to achieve convergent instrumental goals. This is not meant to be a final definition; it might have conceptual problems, e.g. dependence on the VNM notion of intelligent agency, but it at least adds some specificity. “Tends to” here is similar to Hutter’s idea of testing an agent across a distribution of environments: an agent can tend to achieve value even when it actually fails (unluckily).
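Here is a deliberately toy Python sketch of the idea (the environments, the two agents, and the instrumental_score function are all hypothetical stand-ins): score an agent by a convergent instrumental proxy, here resources acquired, averaged over a distribution of sampled environments, rather than by any fixed final goal.

    import random

    def sample_environment(rng):
        # an "environment" here is just resource payoffs for three available options
        return [rng.uniform(0, 1) for _ in range(3)]

    def resourceful_agent(env):
        return max(range(3), key=lambda i: env[i])   # takes the best available option

    def fixed_habit_agent(env):
        return 0                                      # always takes option 0

    def instrumental_score(agent, n_trials=10000, seed=0):
        # "tends to" acquire resources across a distribution of environments
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_trials):
            env = sample_environment(rng)
            total += env[agent(env)]                  # resources acquired this trial
        return total / n_trials

    print(instrumental_score(resourceful_agent))  # about 0.75
    print(instrumental_score(fixed_habit_agent))  # about 0.5

Nothing here references what the agents ultimately want the resources for; the measure only tracks their tendency to end up with them.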

To cite prior work, Nick Land writes (in “What is intelligence”, Xenosystems):

Intelligence solves problems, by guiding behavior to produce local extropy. It is indicated by the avoidance of probable outcomes, which is equivalent to the construction of information.

This amounts to something similar to the convergent instrumental goal definition; achieving sufficiently specific outcomes involves pursuing convergent instrumental goals.

The convergent instrumental goal definition of intelligence may help study the Orthogonality Thesis. In Superintelligence, Bostrom states the thesis as:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

(Previously I argued against a strong version of the thesis.)

Clearly, having a definition of intelligence helps clarify what the orthogonality thesis is stating. But the thesis also refers to “final goals”; how can that be defined? For example, what are the final goals of a mouse brain?

In some idealized cases, like a VNM-based agent that explicitly optimizes a defined utility function over universe trajectories, “final goal” is well-defined. However, it’s unclear how to generalize to less idealized cases. In particular, a given idealized optimization architecture has a type signature for goals, e.g. a Turing machine assigning a real number to universe trajectories, which themselves have some type signature, e.g. based on the physics model. But different type signatures for goals across different architectures, even idealized ones, make identification of final goals more difficult.

A different approach: what are the relevant effective features of an agent other than its intelligence? This doesn’t bake in a “goal” concept but asks a natural left-over question after defining intelligence. In an idealized case like paperclip maximizer vs. staple maximizer (with the same cognitive architecture and so on), while the agents behave fairly similarly (harnessing energy, expanding throughout the universe, and so on), there is a relevant effective difference in that they manufacture different objects towards the latter part of the universe’s lifetime. The difference in effective behavior, here, does seem to correspond with the differences in goals.

To provide some intuition for alternative agent architectures, I’ll give a framework inspired by the Bellman equation. To simplify, assume we have an MDP with S being a set of states, A being a set of actions, t(s’ | s, a) specifying the distribution over next states given the previous state and an action, and s_0 being the initial state. A value function on states satisfies:

V(s) = \max_{a \in A} \sum_{s' \in S} t(s' | s, a) V(s')

This is a recurrent relationship in the sense that the values of states “depend on” the values of other states; the value function is a sort of fixed point. A valid policy for a value function must always select an action that maximizes the expected value of the following state. A difference with the usual Bellman equation is that there is no time discounting and no reward. (There are of course interesting modifications to this setup, such as relaxing the equality to an approximate equality, or having partial observability as in a POMDP; I’m starting with something simple.)

Now, what does the space of valid value functions for an MDP look like? As a very simple example, consider if there are three states {start, left, right}; two actions {L, R}; ‘start’ being the starting state; ‘left’ always transitioning to ‘left’, ‘right’ always transitioning to ‘right’; ‘start’ transitioning to ‘left’ if the ‘L’ action is taken, and to right if the ‘R’ action is taken. The value function can take on arbitrary values for ‘left’ and ‘right’, but the value of ‘start’ must be the maximum of the two.
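A minimal Python sketch of this example (the dictionary encoding is just one convenient representation): transitions are deterministic, so the expectation collapses, and a value assignment is valid exactly when each state’s value equals the maximum over its successors.

    # The {start, left, right} example with actions {L, R}; no reward, no discounting.
    transitions = {                       # t(s' | s, a), deterministic: s -> a -> s'
        "start": {"L": "left", "R": "right"},
        "left":  {"L": "left", "R": "left"},
        "right": {"L": "right", "R": "right"},
    }

    def is_valid_value_function(V):
        # V(s) = max_a V(next state) must hold at every state
        return all(
            abs(V[s] - max(V[s2] for s2 in transitions[s].values())) < 1e-9
            for s in transitions
        )

    # 'left' and 'right' are free parameters; 'start' is forced to be their maximum.
    print(is_valid_value_function({"left": 2.0, "right": 5.0, "start": 5.0}))  # True
    print(is_valid_value_function({"left": 2.0, "right": 5.0, "start": 2.0}))  # False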

We could say something like: the agent’s utility function is only over ‘left’ and ‘right’, and the value function can be derived from the utility function. This took some work, though; the utility function isn’t directly written down. It’s a way of interpreting the agent architecture and value function. We identify what the “free parameters” are, and derive the value function from them.

It of course gets more complex in cases where we have infinite chains of different states, or cycles between more than one state; it would be less straightforward to say something like “you can assign any values to these states, and the values of other states follow from those”.

In “No Universally Compelling Arguments”, Eliezer Yudkowsky writes:

If you switch to the physical perspective, then the notion of a Universal Argument seems noticeably unphysical.  If there’s a physical system that at time T, after being exposed to argument E, does X, then there ought to be another physical system that at time T, after being exposed to environment E, does Y.  Any thought has to be implemented somewhere, in a physical system; any belief, any conclusion, any decision, any motor output.  For every lawful causal system that zigs at a set of points, you should be able to specify another causal system that lawfully zags at the same points.

The switch from “zig” to “zag” is a hypothetical modification to an agent. In the case of the studied value functions, not all modifications to a value function (e.g. changing the value of a particular state) lead to another valid value function. The modifications we can make are more restricted: for example, perhaps we can change the value of a “cyclical” state (one that always transitions to itself), and then back-propagate the value change to preceding states.

A more general statement: Changing a “zig” to a “zag” in an agent can easily change its intelligence. For example, perhaps the modification is to add a “fixed action pattern” where the modified agent does something useless (like digging a ditch and filling it) under some conditions. This modification to the agent would negatively impact its tendency to achieve convergent instrumental goals, and accordingly its intelligence according to our definition.

This raises the question: for a given agent, keeping its architecture fixed, what are the valid modifications that don’t change its intelligence? The results of such modifications are a sort of “level set” in the function mapping from agents within the architecture to intelligence. The Bellman-like value function setup makes the point that specifying the set of such modifications may be non-trivial; they could easily result in an invalid value function, leading to un-intelligent, wasteful behavior.

A general analytical approach:

  • Consider some agent architecture, a set of programs.
  • Consider an intelligence function on this set of programs, based on something like “tendency to achieve convergent instrumental goals”.
  • Consider differences within some set of agents with equivalent intelligence; do they behave differently?
  • Consider whether the effective differences between agents with equivalent intelligence can be parametrized with something like a “final goal” or “utility function”.

Whereas classical decision theory assumes the agent architecture is parameterized by a utility function, this is more of a reverse-engineering approach: can we first identify an intelligence measure on agents within an architecture, then look for relevant differences between agents of a given intelligence, perhaps parametrized by something like a utility function?

There’s not necessarily a utility function directly encoded in an intelligent system such as a mouse brain; perhaps what is encoded directly is more like a Bellman state value function learned from reinforcement learning, influenced by evolutionary priors. In that case, it might be more analytically fruitful to identify relevant motivational features other than intelligence and see how final-goal-like they are, rather than starting from the assumption that there is a final goal.

Let’s consider orthogonality again, and take a somewhat different analytical approach. Suppose that agents in a given architecture are well-parametrized by their final goals. How could intelligence vary depending on the agent’s final goal?

As an example, suppose the agents have utility functions over universe trajectories, which vary both in what sort of states they prefer, and in their time preference (how much they care more about achieving valuable states soon). An agent with a very high time preference (i.e. very impatient) would probably be relatively unintelligent, as it tries to achieve value quickly, neglecting convergent instrumental goals such as amassing energy. So intelligence should usually increase with patience, although maximally patient agents may behave unintelligently in other ways, e.g. investing too much in unlikely ways of averting the heat death of the universe.
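A toy Python sketch of the patience point (the payoff streams and numbers are invented purely for illustration): with heavy temporal discounting, a small immediate payoff beats investing in the instrumental option of amassing energy; with more patience, the instrumental option wins.

    def discounted_value(payoffs, discount):
        # payoffs[t] is the utility received t steps in the future
        return sum(p * discount ** t for t, p in enumerate(payoffs))

    options = {
        "consume_now":  [1.0, 0.0, 0.0, 0.0],   # small immediate payoff
        "amass_energy": [0.0, 0.0, 0.0, 10.0],  # delayed payoff from harnessed energy
    }

    for discount in (0.1, 0.5, 0.9):
        best = max(options, key=lambda name: discounted_value(options[name], discount))
        print(discount, best)
    # 0.1 -> consume_now; 0.5 -> amass_energy; 0.9 -> amass_energy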

There could also be especially un-intelligent goals such as the goal of dying as fast as possible. An agent pursuing this goal would of course tend to fail to achieve convergent instrumental goals. (Bostrom and Yudkowsky would agree that such cases exist, and require putting some conditions on the orthogonality thesis).

A more interesting question is whether there are especially intelligent goals, ones whose pursuit leads to especially high convergent instrumental goal achievement relative to “most” goals. A sketch of an example: Suppose we are considering a class of agents that assume Newtonian physics is true, and have preferences over Newtonian universe configurations. Some such agents have the goal of building Newtonian configurations that are (in fact, unknown to them) valid quantum computers. These agents might be especially intelligent, as they pursue the convergent instrumental goal of building quantum computers (thus unleashing even more intelligent agents, which build more quantum computers), unlike most Newtonian agents.

This is a bit of a weird case because it relies on the agents having a persistently wrong epistemology. More agnostically, we could also consider Newtonian agents that tend to want to build “interesting”, varied matter configurations, and are thereby more likely to stumble on esoteric physics like quantum computation. There are some complexities here (does it count as achieving convergent instrumental goals to create more advanced agents with “default” random goals, compared to the baseline of not doing so?) but at the very least, Newtonian agents that build interesting configurations seem to be more likely to have big effects than ones that don’t.

Generalizing a bit, different agent architectures could have different ontologies for the world model and utility function, e.g. Newtonian or quantum mechanical. If a Newtonian agent looks at a “random” quantum mechanical agent’s behavior, it might guess that it has a strong preference for building certain Newtonian matter configurations, e.g. ones that (in fact, unknown to it) correspond to quantum computers. More abstractly, a “default” / max-entropy measure on quantum mechanical utility functions might lead to behaviors that, projected back into Newtonian goals, look like having very specific preferences over Newtonian matter configurations. (Even more abstractly, see the Bertrand paradox showing that max-entropy distributions depend on parameterization.)

Maybe there is such a thing as a “universal agent architecture” in which there are no especially intelligent goals, but finding such an architecture would be difficult. This goes to show that identifying truly orthogonal goal-like axes is conceptually difficult; just because something seems like a final goal parameter doesn’t mean it is really orthogonal to intelligence.

Unusually intelligent utility functions relate to Nick Land’s idea of intelligence optimization. Quoting “Intelligence and the Good” (Xenosystems):

From the perspective of intelligence optimization (intelligence explosion formulated as a guideline), more intelligence is of course better than less intelligence… Even the dimmest, most confused struggle in the direction of intelligence optimization is immanently “good” (self-improving).

My point here is not to opine on the normativity of intelligence optimization, but rather to ask whether some utility functions within an architecture lead to more intelligence-optimization behavior. A rough guess is that especially intelligent goals within an agent architecture will tend to terminally value achieving conditions that increase intelligence in the universe.

Insurrealist, expounding on Land in “Intro to r/acc (part 1)”, writes:

Intelligence for us is, roughly, the ability of a physical system to maximize its future freedom of action. The interesting point is that “War Is God” seems to undermine any positive basis for action. If nothing is given, I have no transcendent ideal to order my actions and cannot select between them. This is related to the is-ought problem from Hume, the fact/value distinction from Kant, etc., and the general difficulty of deriving normativity from objective fact.

This class of problems seems to be no closer to resolution than it was a century ago, so what are we to do? The Landian strategy corresponds roughly to this: instead of playing games (in a very general, abstract sense) in accordance with a utility function predetermined by some allegedly transcendent rule, look at the collection of all of the games you can play, and all of the actions you can take, then reverse-engineer a utility function that is most consistent with your observations. This lets one not refute, but reject and circumvent the is-ought problem, and indeed seems to be deeply related to what connectionist systems, our current best bet for “AGI”, are actually doing.

The general idea of reverse-engineering a utility function suggests a meta-utility function, and a measure of intelligence is one such candidate. My intuition is that in the Newtonian agent architecture, a reverse-engineered utility function looks something like “exploring varied, interesting matter configurations of the sort that (in fact, perhaps unknown to the agent itself) tend to create large effects in non-Newtonian physics”.

To summarize main points:

  • Intelligence can be defined in a way that is not dependent on a fixed objective function, such as by measuring tendency to achieve convergent instrumental goals.
  • Within an agent architecture, effective behavioral differences other than intelligence can be identified, which for at least some architectures correspond with “final goals”, although finding the right orthogonal parameterization might be non-trivial.
  • Within an agent architecture already parameterized by final goals, intelligence may vary between final goals; especially unintelligent goals clearly exist, but especially intelligent goals would be more notable in cases where they exist.
  • Given an intelligence measure and agent architecture parameterized by goals, intelligence optimization could possibly correspond with some goals in that architecture; such reverse-engineered goals would be candidates for especially intelligent goals.