Following my statement of the Problem, the next step is to record some of my relevant beliefs. As an exercise, this is useful for clarifying my thoughts and uncovering inconsistencies. Having them in writing will also make it easier for others to interpret my work, as they won’t have to guess so much about the assumptions I’m making.
This list is not exhaustive. Also, some of the points are better thought through than others. I could have spent a bunch of time expanding and refining everything, but I don’t think doing so would have been good value. My views are always going to be shifting, so let’s take a snapshot and then move on to the next step.
As I went through my notes I found loose themes kept coming up, so I have organised this post into sections. Not everything will be a perfect fit and there will be a lot of overlap. The sections are:
Risk: My beliefs about the overall risks from AI, including timelines and takeoff speeds.
Capabilities: My beliefs about how AI will develop certain capabilities, and how I think about or classify them.
Complexity: My beliefs about the complexity of the universe and how this impacts AI risk.
Philosophy: My more philosophical beliefs, roughly split across ontology and ethics.
Each point will be stated flatly, i.e. ‘this is such and such’, but please read them as meaning ‘I believe this is such and such’, with the self-awareness that implies.
1. Risk
My beliefs about the overall risks from AI, including timelines and takeoff speeds.
Smarter-than-human AI will be extremely dangerous for the simple reason that smarter things can outcompete stupider things1.
We have about five years until stuff gets weird. I’m highly uncertain about how weird, but the world is definitely going to feel different. In the faster case, we could see discontinuous changes up to and including catastrophe. In the slower case, it will probably feel like an increasingly crazy version of recent trends towards more mediation, with algorithmic fog confusing and addicting us.
Slow takeoff is more likely than fast. Fast takeoff requires a bunch of specific things to be true, such as various important skills being learnable without slow real-world experimentation or new infrastructure (e.g. ‘solving’ physics without needing new particle accelerators). By contrast, slow takeoff can occur incrementally via many paths.
Fast takeoff is impossible to usefully prepare for2, much like you can’t prepare for a hostile alien invasion.
Slow takeoff could still feel fast.
Even highly capable AI will make meaningful mistakes. The universe is too complex to assume an advanced ‘scientific method doer’ will infallibly figure it all out on the first try3. By analogy, consider LLM hallucinations as the kind of mistake a child might make. Adults do not make the same mistakes as children – they make more consequential ones instead. Put another way, there will always be another unsaturated benchmark that could be built.
It is important to consider the existence of an ‘incompetence gap’ between what an AI can reliably do and what it attempts to do, either of its own volition or because it has been asked. For more powerful models, danger will live in this gap. These models will need an exceptionally well-calibrated sense of their own limitations.
If the first artificial AI researchers to exceed human level essentially set the trajectory for the future, then the errors they make will have an outsized impact.
2. Capabilities
My beliefs related to how AI will develop certain capabilities, and how I think about and classify them.
Effective learning about the world requires an interplay of theory and experiment. It is not likely that an advanced AI can just think hard or run a bunch of simulations and successfully figure out the universe without any real-world trial and error.
Intelligence is not distinct from what you do with it. AI can’t become superintelligent unless it learns to complete super-advanced tasks. Put another way, intelligence is a generalised ‘knowing-how’.
LLMs are good at code because the generating process that best predicts extant code includes being able to write good code. For science, the generating process that best predicts extant papers will not itself do good research. Scientists do a lot of things that are not captured in the literature or even formattable as text.
In practice, alignment is a capability. While you can imagine a model ‘really’ having the right goal but lacking the capability to follow it, this is not very useful. What we want is for the model to consistently succeed at the desired goal. It must be able to conduct itself correctly in any given situation, like a kind of propriety. This must be trained into your model.
Different training runs (e.g. pre-training, RLHF) reward models for contradictory behaviour (e.g. helpfulness vs harmlessness), creating some composite, context-dependent emergent goal in the final model. It will not by default have the capability to navigate tradeoffs between sub-goals in the way you might like4.
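To make this concrete, here is a minimal sketch of what I mean by a composite, context-dependent goal. Everything in it – the reward functions, the weights, the trigger phrase – is invented for illustration; it is a cartoon of two sub-goals being blended differently depending on context, not a description of any real training pipeline.

```python
# Toy sketch (mine, purely illustrative): the 'effective goal' of a model as a
# context-dependent blend of rewards from different training stages. All the
# functions, strings and weights below are invented for illustration only.

def helpfulness_reward(response: str) -> float:
    """Stand-in for a signal that rewards giving the user a substantive answer."""
    return 1.0 if response and "I can't help" not in response else 0.0

def harmlessness_reward(response: str) -> float:
    """Stand-in for a signal that rewards declining risky requests."""
    return 1.0 if "I can't help" in response else 0.0

def effective_goal(response: str, request_is_risky: bool) -> float:
    """The blend the final model effectively optimises. The weights are not set
    explicitly by anyone; they emerge from how the training stages interacted."""
    w_help, w_harm = (0.2, 0.8) if request_is_risky else (0.9, 0.1)
    return w_help * helpfulness_reward(response) + w_harm * harmlessness_reward(response)

# In a borderline context the two sub-goals pull in opposite directions, and
# nothing in training guarantees the tradeoff is resolved the way we would want.
print(effective_goal("Here is a detailed answer...", request_is_risky=True))  # 0.2
print(effective_goal("I can't help with that.", request_is_risky=True))       # 0.8
```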
The orthogonality thesis5 is only true in the limit, and this obscures thinking about actual near-term AI. There isn't an objective ‘right’ that a model will figure out as it gets more capable, but some versions of right are more useful for building capable models than others.
3. Complexity
My beliefs about the complexity of the universe and how this impacts AI risk.
The classical alignment problem cannot be solved. It is predicated on mistaking the complex for the complicated6. In reality, abstractions are leaky and everything interacts with everything else. To be useful in this complex world, AI itself must be complex, which means its behaviour is not something that can be ‘solved’. Highly capable AI can at best be ‘managed’.
Davidad’s project working towards ‘guaranteed safe AI’7 is the best attempt I have seen to reduce the problem from a complex one to a complicated one. However, while I believe the attempt to find formal guarantees is important, I do not believe it is possible to reduce the complexity of the problem space enough for this to work for really advanced AI.
Alignment is not a static target: it is a process that must evolve with the environment. This follows from there not being a clean separation between a model and its environment – what is aligned in one environment may not be in another. For example, humans are less aligned to the goal of having lots of children in the modern environment than in the ancestral one.
Conceptualising an AI future in terms of symbiosis may be productive. We want the role the AI plays in our society to require continuous positive feedback from humans, much like humans constantly have to ‘behave’. Or put another way, it must be instrumentally useful to the AI for humans to flourish.
There is crossover between AI risks and generic technocratic problems. In particular, consider the situation where a class of ‘experts’ implements policy on behalf of non-experts, who cannot evaluate whether the experts really are experts, or whether the experts are acting in the non-experts’ interests as the non-experts themselves understand them.
The kind of cascading failures typical of complex systems, where feedback loops cause sudden and unexpected problems8, will be a serious danger with highly capable AI.
It is important to investigate ‘systemic’ evals. By systemic I do not mean evaluating the impact of AI on society or existing systems, which is how the term is usually used in the field (and which is important too), but rather evaluating problems that arise from system complexity itself.
Safety is not just a model property. You cannot neatly decouple a model from its environment. Whether a model deployment is safe depends on properties of both.
Society progresses, roughly speaking, by re-engineering our environment to be more predictable. It’s easier to build a reliable machine if someone has first created standardised parts and materials. We do this to turn the complex into the merely complicated, the fuzzy into the crisp9. Re-engineering our environment must be part of successfully navigating AI risk.
4. Philosophy
My more philosophical beliefs related to AI risk, roughly split across ontology and ethics.
Ontology10 precedes epistemology. It is not useful to try to figure out what you know about various categories if you aren’t using the right categories for your purpose in the first place.
We need a better ontology for thinking about AI risk. Our current categories do not make it easy to think about the impact of highly advanced AI, which makes it hard to judge risk and make useful plans. This implies an ontological remodeling11 is needed.
P(doom) as a concept is of limited use. We are not operating in a quasi-frequentist forecasting competition with a stable ontology and meaningful base rates. You can’t bet against a consensus forecast to win points. Probabilities are ill-defined in this context. What is important about your subjective understanding of AI risk is how it compels you to act.
Theoretical concepts and arguments are useful to the extent that they allow you to make different decisions. A lot of philosophy fails this test12.
Academic ethics is not very useful. It is preoccupied with terminological distinctions and frequently makes questionable assumptions. For example, it is common in utilitarianism to build complicated theoretical arguments (such as the repugnant conclusion) on top of a scalar measure of utility, yet people and the universe are both high-dimensional and diverse.
Morality is not objective in a stance-independent way. Human moral values are embedded at two different scales, in individuals and in their societies, and these two levels interact in complex ways.
We do not want AI to have human values. Humans with human values are often dangerous. We want AI to have values complementary to human flourishing, which will be different to literal human values.
It is better to think about the moral role an AI will play, which must change as society changes, than a static set of ‘correct’ values to align to. Even if we do a good job figuring out what values the AI should have, those chosen will quickly become outdated as the environment changes13.
The future moral role of AI should be worked out collaboratively with humans on an ongoing basis. This is a middle ground between preserving human agency and leveraging the superior knowledge of advanced AI.
As mentioned above, this is not an exhaustive list of my beliefs! But I do think it captures a bunch of important and load-bearing points in my world view.
If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my feedback form. Thanks!
This substack will not be trying to persuade you that smarter-than-human AI will be dangerous. Please read List of arguments that AI poses an existential risk by Katja Grace if you want some arguments for this (along with counter-arguments). For me, it isn’t that every point is watertight; rather, taken collectively I believe they are worrying enough that it’s worth trying to do something about it.
Short of shutting down all AI research, which would be a political nightmare to pull off.
Indeed, that the scientific method is constructed around experimentation – not deducing in advance what the right answer is – should make this clear.
Ryan Greenblatt et al., Alignment faking in large language models
That any level of capabilities can be combined with any goal.
Davidad et al., Towards guaranteed safe AI
Richard I. Cook, How complex systems fail
Jan Leike, Crisp and fuzzy tasks
I define ontology following David Chapman: “An ontology is an explanation of what there is. Key ontological questions for philosophy are: “What fundamental categories of things are there? What properties and relationships do they have?”… [E.g.] “do Thai eggplants count as eggplants at all” and “does this new respiratory virus count as a ‘cold’ virus?” Answers to ontological questions mostly can’t be true or false. Categories are more or less useful depending on purposes.”
David Chapman, Interlude: Ontological remodeling
Paul Graham, How to do philosophy
As far as I can tell, this is what Nate Soares calls a ‘sharp left turn’ – i.e. the environment changes and your model decides to follow a new path in response. Or put another way, consider that agent and environment are coupled: the agent developing new capabilities might allow it to change its environment in ways that make its old values inappropriate.