
Assembly Theory Provides A Measure of Specified Complexity That Quantifies but does Not Explain Selection and Evolution


Introduction

Arriving at a precise definition for life has proven to be a thorny scientific and philosophical challenge (2). Phenomenologically, however, it is apparent that life is built out of complex objects and that biotic processes – whether cells or third-graders – are, in turn, capable of constructing additional complex objects – whether proteins or poems.

Yet complex objects, by any meaningful measure of syntactic complexity (3), abound throughout the universe. What distinguishes non-living complex objects from living ones? And how might the living objects have come about? In particular, if living matter emerges from non-living matter then the causal chain of complex living objects must emanate (whether abruptly or gradually) from non-living objects. What sort of process could have accomplished this?

Assembly theory (AT) – as recently laid out in Sharma et al. (1) and popularized in Walker (4) – aims to make progress on these questions. AT claims to provide a mechanism for distinguishing biological complex objects from abiotic ones and to lay a foundational framework for understanding how these objects came to be via a process of object discovery dubbed selection. AT has received substantial engagement and critique [e.g. Uthamacumaran et al. (5) expanded upon in Abrahão et al. (6)] and much of this critique has centered around AT’s measure of complexity: the assembly index, ai, which counts the minimum number of steps needed to recursively construct an object from its constituent parts. Critics have noted that ai essentially implements a simple form of compression and that superior implementations commonly used in the field of Algorithmic Information Theory (AIT) exist.

To this critique, AT proponents have replied (7) arguing that AT is distinct from AIT for two reasons. First, they argue that AT is modeled on a physically-motivated process (object construction via recursive concatenation) thereby allowing it to constrain the probabilities of undirected physical processes. Second, AT does not just measure complexity; it identifies complex objects that exhibit evidence of a directed process of construction (i.e. selection). This is done by augmenting an object’s ai complexity measure with its copy number – the number of copies of the object observed. In short, Sharma et al. (1) claim that AT provides a novel method to distinguish complex functional information from random, and equivalently improbable, non-functional information.

But is this contribution novel? As we shall see: no, not exactly.

Moreover, Sharma et al. (1) and Walker (4) push further, claiming that AT provides a framework with the potential to build a ‘new physics’ of life; one that sheds light on the operation of selection over deep time ‘thus unifying key features of life with physics’. Only a few authors (6) have engaged with this particular aspect of AT and the nascent models explored in Sharma et al. (1).

Our present goal is two-fold. We first provide a pedagogical overview of AT and situate it in the broader context of the study of specified complexity (SC). We then explore what AT does, and does not, say about physical processes capable of selection. We argue that while AT points to the presence of an information-laden process like selection and can provide some interesting quantitative parameterizations, it does not (indeed, cannot) explain the origin or modes of operation of such a process in any concrete detail.

Specified complexity

In 1998, Bill Dembski published The Design Inference (8) to establish how and when the observation of low probability events can be leveraged to rule out undirected causes. Subsequent refinements of this work [most recently the 2nd edition of The Design Inference; Dembski and Ewert (9)] have placed Dembski’s original ideas on a firmer mathematical footing. The core notion underlying The Design Inference is that of SC.

SC is observed when a low probability event (complexity) that conforms to some independently specified pattern (specification) occurs. Dembski argues that it is precisely SC at work when we intuit that a low-probability event carries outsize significance.

Consider, for example, a hand in a game of spades in which 13 cards have been dealt to a player. Any given hand is rare as the odds of getting that particular hand are low: if we assume an undirected physical process in which cards are drawn at random with uniform distribution, then the odds of any given hand are ∼1 in 6 × 10^11. However, the majority of hands are inconsequential in the context of spades and we are unlikely to attribute anything other than the contingencies of chance and necessity to a given hand. But some hands are special and raise suspicion. If the dealer has a hand with 13 spades – a hand guaranteed to win 13 tricks – we might rightly grow suspicious.

Why? The 13-spade hand has exactly the same probability of occurring as any other 13-card hand so it is not simply the improbability of the hand that raises our suspicion. Rather, it is the independent pattern ‘always wins all 13 tricks’ coupled with this improbability that leads us to suspect that there may be more at play than merely chance and necessity. The presence of the specification marks the event as special and its combination with a low-probability estimate motivates the rejection of the null hypothesis. Specifically: we have cause to doubt that the undirected physical process underlying the probability estimate is a plausible explanation for the event.

Two questions should be addressed immediately. First, what determines whether a probability is small enough to rule out chance? To answer this we must evaluate the relevant available probabilistic resources (9) and use them to set a conservative bound. In our spades example the chance of randomly drawing a 13-spade hand is roughly one in a trillion – is this small? If we imagine every human on earth playing about 10 hands of spades every day for a year we would see ∼10^13 hands of spades. With this many trials, some dealer drawing a 13-spade hand eventually is much more likely than not. However, should we observe multiple 13-spade hands from the same dealer in a given year the available probabilistic resources would be insufficient to quell our suspicions.
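As a quick arithmetic check, the following Python sketch (our own illustration, assuming a world population of roughly 8 billion and the ten-hands-per-day rate used above) tallies the probabilistic resources directly:

```python
import math

# Probability of being dealt 13 spades from a shuffled 52-card deck.
p_perfect = 1 / math.comb(52, 13)              # ~1 in 6.35e11

# Probabilistic resources (assumed): ~8e9 players x 10 hands/day x 365 days.
hands_per_year = 8e9 * 10 * 365                # ~3e13 hands

expected_hits = hands_per_year * p_perfect     # expected number of 13-spade hands
p_at_least_one = 1 - math.exp(-expected_hits)  # Poisson approximation

print(f"P(one hand)  = {p_perfect:.2e}")
print(f"hands/year   = {hands_per_year:.1e}")
print(f"P(at least one 13-spade hand this year) ≈ {p_at_least_one:.6f}")
```

With dozens of expected occurrences per year, a perfect hand somewhere on earth is essentially guaranteed; repeated perfect hands from a single dealer remain beyond the available resources.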

Second, how do we rigorously define specification? Different contexts can prompt different definitions for specification; however, Dembski and Ewert (9) take an AIT approach to generically define specification as the description of the object with minimum length. This allows for the use of English phrases to describe objects (e.g. ‘always wins all 13 tricks’) and the authors apply this model to numerous problems such as analyzing texts, identifying fraud, and even evaluating claims about the extra-terrestrial origin of the asteroid ‘Oumuamua.

Indeed, several examples of SC measures exist, and many have been proposed for biological systems (10). These different models may adopt different measures for complexity and specificity; however, Montañez (10) proposed a unified schema and showed that all such measures can be expressed in terms of a common-form Canonical Specified Complexity (CSC) model: (1) $SC(x) = -\log_2\!\left(r\,\frac{p(x)}{\nu(x)}\right)$ where x denotes an observation taking place in some space χ according to a probability distribution p(x), and ν(x) is a specification function ν : χ → ℝ≥0 that measures the specification of x (higher values of ν denote a higher degree of specification). r is a normalization constant that can be freely chosen; however, for an SC model to be a canonical SC model, it must be true that ν(χ) ≤ r, where $\nu(\chi) = \sum_{x \in \chi} \nu(x)$. Given a CSC measure, Montañez (10) shows that the probability of observing an object x with SC(x) ≥ b satisfies: (2) $\Pr(SC(x) \ge b) \le 2^{-b}$

This result is referred to as the Conservation of CSC and allows the use of SC(x) as a hypothesis test to rule out the chance explanations underlying p(x).
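To make Eqs (1) and (2) concrete, the following Python sketch (our own toy construction: a uniform distribution over three-letter strings and an arbitrarily chosen specification function) computes SC(x) exhaustively and checks the conservation bound for several thresholds b:

```python
import itertools
import math

# Toy observation space: all three-letter strings over {a, b, c}, uniform p(x).
chi = [''.join(s) for s in itertools.product('abc', repeat=3)]
p = {x: 1 / len(chi) for x in chi}

# Toy specification: strings built from a single repeated letter are "special".
nu = {x: 1.0 if len(set(x)) == 1 else 0.1 for x in chi}

r = sum(nu.values()) + 0.5            # any r with nu(chi) <= r is admissible

def sc(x):
    """Canonical specified complexity, Eq. (1)."""
    return -math.log2(r * p[x] / nu[x])

# Conservation of CSC, Eq. (2): Pr(SC >= b) <= 2^-b for every threshold b.
for b in (0.5, 1.0, 1.5, 2.0):
    pr = sum(p[x] for x in chi if sc(x) >= b)
    print(f"b = {b}: Pr(SC >= b) = {pr:.3f} <= 2^-b = {2 ** -b:.3f}")
```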

We shall show that AT can be expressed as a CSC measure. First we introduce and explore AT.

Assembly theory

AT (1) explores the idea that the discovery and production of complex objects can be modeled as a directed assembly process. Complex objects are recursively constructed from a pool of smaller constituent building blocks. At each step of the construction process two elements in the pool are selected and joined together. The resulting object is then added to the pool and made available for future steps; the pool is presumed to be large with ample provision of each member. Thus a given object, Oi, in AT is the result of some contingent construction-path (sometimes referred to as a lineage); in fact many such paths can exist for Oi. The complexity of Oi is measured via its assembly index, ai – defined to be the number of steps in the shortest possible construction-path.

We can build a quick intuition for ai by thinking of the assembly of strings in the English alphabet, see Figure 1. Here the basic building blocks are the 26 letters and ai can be computed by looking for repeating blocks. The shortest path for a word of length ℓ with no repeating subunits is simply ℓ − 1 (e.g. ai = 4 for ‘short’). However, words with repeating subunits will have a smaller assembly index (e.g. the 12-letter word ‘hubbubbubboo’ can reuse ‘ubb’ twice and has assembly index ai = 7).

Figure 1.

The assembly of English strings. On the left we assemble the word ‘short’ (ai = 4), on the right the word ‘hubbubbubboo’ (ai = 7). Shortest pathways are shown, as is the growing pool of building blocks. Note that at each step the resulting product is inserted back in the pool. This allows the reuse of ‘ubb’ to construct ‘hubbubbubboo’ in fewer steps.
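The two pathways in Figure 1 can be checked mechanically. The short Python sketch below (our own illustration, not the AT authors' code) verifies that each join uses only objects already present in the pool and reports the number of joining steps, an upper bound on ai (and, for these two pathways, equal to the values quoted above):

```python
def pathway_steps(target, steps, alphabet):
    """Validate an assembly pathway (a list of (left, right) joins) and return
    its length. Each join must use objects already in the pool; the product is
    added back to the pool so it can be reused in later steps."""
    pool = set(alphabet)
    product = None
    for left, right in steps:
        assert left in pool and right in pool, f"{left!r}/{right!r} not yet assembled"
        product = left + right
        pool.add(product)
    assert product == target, "pathway does not end at the target"
    return len(steps)

letters = [chr(c) for c in range(ord('a'), ord('z') + 1)]

pathways = {
    "short": [("s", "h"), ("sh", "o"), ("sho", "r"), ("shor", "t")],
    "hubbubbubboo": [("u", "b"), ("ub", "b"), ("h", "ubb"), ("hubb", "ubb"),
                     ("hubbubb", "ubb"), ("o", "o"), ("hubbubbubb", "oo")],
}

for word, steps in pathways.items():
    print(f"{word}: {pathway_steps(word, steps, letters)} steps")  # 4 and 7
```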

For ordered linear chains like words we can see that ai is bounded from above by the length of the chain, ℓ, and from below by log2 ℓ, which represents a chain consisting entirely of a single repeated subunit. Since it tracks object reuse, ai is related to the compressibility of the object; however, AT is not particularly interested in identifying an optimal compression scheme.

Rather, the assembly index was first introduced in Marshall et al. (11) and subsequently formalized in Marshall et al. (12), in order to develop a model-free biosignature to detect the activity of life in the context of astrobiology. By ‘model-free’ we mean a measure of life that does not assume a particular (i.e. terrestrial) model of life and can therefore be used to agnostically identify signs of alien life. Indeed, the authors of AT refer to ai as an intrinsic property of the object – though we note that the numerical value of ai will depend on the choice of constituent building blocks and the set of allowed joining operations.

The assembly index is also a physically-motivated measure of complexity because ai operates as a lower bound on the number of physical operations necessary to construct the first instance of Oi. The reasoning here is subtle: in AT each intermediate object in a construction-path is required to be a physically plausible object according to the rules of physics governing the object. However, the specific shortest (smallest-ai) construction-path need not itself represent a physically plausible construction trajectory.

An example will help clarify this distinction. Consider the case of molecular assembly in which atoms are the building blocks and bonds are the joining operation. Each object along a construction-path must correspond to a physically plausible molecule (e.g. the rules of chemical bonds apply, so hydrogen atoms can have at most one bond). However, the combination of molecules from prior steps to produce the product at a subsequent step need not correspond to a physically plausible reaction pathway. A pictorial combination of atoms and bonds is sufficient, even if there is no known chemical pathway for the transformation. Despite this, ai nonetheless represents a valid (if potentially weak) lower bound on the construction history of the object in question as the constraints of a physically plausible path would necessarily require more steps and/or more intermediates.

AT leverages this lower bound to identify evidence for the action of a directed process behind the construction of complex objects. To do this AT needs an additional ingredient. AT argues that:

  • The number of objects with high ai explodes combinatorially as ai grows. Therefore a given Oi with high ai has a low probability of being discovered by an undirected process.

  • Observing multiple copies of Oi signifies that some sort of directed process must be at work to discover and produce them.

and this second clause is crucial to AT. To capture both elements, Sharma et al. (1) define a quantity A, called assembly, that can be computed for an ensemble of objects: (3) $A = \sum_{i=0}^{N} A_i = \sum_{i=0}^{N} e^{a_i}\,\frac{n_i - 1}{N_T}$ Here N represents the number of unique objects, Oi, in the ensemble. Each Oi has a copy number ni representing the number of copies of Oi and the ensemble contains a total of $N_T = \sum_{i=0}^{N} n_i$ objects. The assembly Ai of each Oi is composed of the object’s exponentiated assembly index e^ai and its normalized copy number (ni − 1)/NT. The assembly index is exponentiated to capture the combinatorial explosion implied by a high ai, while the normalized copy number acts as a filter screening out Oi that do not appear in substantial quantities: (ni − 1)/NT is small or even 0 for such objects.

Thus, an ensemble containing many identical simple objects (low ai, high ni) will have low assembly, as will an ensemble containing many unique individual instances of complex objects (high ai, but ni = 1). However, complex objects with high copy numbers will constitute an ensemble with high A. AT argues that a directed process, dubbed selection, must be postulated to explain the existence of such an ensemble.
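A small Python sketch (ours, with made-up assembly indices and copy numbers) shows how Eq. (3) separates the three kinds of ensembles just described:

```python
import math

def assembly(ensemble):
    """Assembly A of an ensemble given as (a_i, n_i) pairs, per Eq. (3)."""
    N_T = sum(n for _, n in ensemble)
    return sum(math.exp(a) * (n - 1) / N_T for a, n in ensemble)

ensembles = {
    "many copies of a simple object (low a_i, high n_i)":   [(2, 1000)],
    "many unique complex objects (high a_i, n_i = 1)":      [(15, 1)] * 1000,
    "many copies of a complex object (high a_i, high n_i)": [(15, 1000)],
}

for label, ens in ensembles.items():
    print(f"{label}: A = {assembly(ens):.3g}")
```

Only the last ensemble registers a large A.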

For each step in the construction process of an object directed by selection, a specific set of products from earlier steps are selected and combined to produce the next iteration of the object. In this way, selection winnows down the combinatorial space of options making the construction of complex objects possible given limited probabilistic resources. Since assembly can be computed for a broad range of contexts, no universal concrete mechanism is proposed for selection. Rather, AT’s authors simply argue that given an ensemble of objects with high assembly some sort of selection-like process must have been at work to produce the ensemble.

Assembly as a measure of SC

The parallels between SC and assembly are apparent. In the construction of A, e^ai captures the notion that complex objects inhabit a combinatorially large space and the probability of a random undirected process constructing such an object is small: p(Oi) ∼ e^−ai. But low-probability events occur all the time; recall that drawing any particular hand in a game of spades is a low-probability event, but there is no need to attribute anything other than chance to such a hand. AT, however, seeks to identify objects for which there is reason to rule out an undirected construction process. This is done by providing a specification in the form of the normalized copy number: ν(Oi) ∼ (ni − 1)/NT. Quoting Sharma et al. (1) ‘finding more than one identical copy indicates the presence of a non-random process generating the object’.

Such an argument mirrors the formal framing of SC with assembly index acting as a measure of (im)probability, and copy number acting as an external specification. The combination of complexity (high ai) and specification (high ni) resists explanation by chance alone – and requires the action of a directed selection process capable of winnowing down combinatorial space.

We explore a concrete application of assembly-as-SC to the problem space of protein discovery below. But first, we further motivate why copy number can serve as a viable specification of interest and explore the properties of assembly index as a measure of complexity.

Copy number as specification

We have proposed that copy number operates as a specification. But what does copy number specify? Let’s consider two examples from different domains.

First, take the case of written human artifacts. Imagine an archaeologist happening upon a heretofore undiscovered ancient manuscript (given a pool of alphabetical letters any non-trivial manuscript of moderate length will have a high assembly index). At first the copy number of such an artifact will be one – it is unique in the world. But as the manuscript is carefully extracted, preserved, and interpreted, copies will begin to proliferate. There will be copies sent to museums around the world. Copies in academic papers interpreting the text. Copies in anthologies collecting such texts. Over time we would measure ni ≫ 1 reflecting the value ascribed to the manuscript by society.

Contrast this with a random collection of words with assembly index commensurate to the ancient manuscript. These two objects may have similar ai (and therefore be equivalently improbable); however, the copy number of the random collection of words is unlikely to exceed one. Why? Because such a text carries no value to the society in question. In this context, the long-term copy number acts as a reasonable – if blunt – external specification for ‘value ascribed to a written text by a society’.

Now consider a biological case, that of proteins: long chains of sequentially arranged amino acids that can fold into complex functional three-dimensional shapes. Given that cells contain machinery to transcribe and translate sequences of DNA into multiple copies of identical proteins it is common for proteins to exhibit high copy number both within an individual cell and across the biosphere.

But now, imagine a scenario in which random mutation results in a new pair of start and stop codons appearing around a heretofore untranscribed string of DNA. This will result in a new protein. At first the protein will have copy number 1, but as the machinery of the cell repeatedly transcribes and translates the novel DNA strand ni will increase. If, however, the protein serves no function for the cell (or, worse, is detrimental to the cell) we can expect that future mutations, selection effects, or even error-detection mechanisms will winnow out the production and replication of this random protein, reducing ni back down to zero.

Here again we see that the long-term behavior of ni serves as a reasonable proxy for ‘value’ or ‘function’. In the case of proteins in the context of existing cellular infrastructure, the presence and persistence of high copy number serves as an external specification that points towards functionality.

We note, however, that there are issues with the use of copy number as a specification. Consider, for example, the presence of stochastic error in the copying process of an object. These errors may prove superficial – the copy still retains functional value commensurate to that of the original – however, without specifying a tolerance range for such variations AT may fail to categorize these superficial variants as members of the same class of object. This would lower the assembly measure for the object (by spreading ni across multiple Oi) and potentially result in a false negative.

In addition, the normalization of ni by NT – the total number of objects in the ensemble – makes the resulting assembly value somewhat ambiguous. The actual value of A will depend on what is and is not included in the ensemble, making it challenging to speak of a rigorous absolute Amin threshold for rejecting an undirected process.

Nonetheless, such limitations are not fatal to the core structure of the theory. As we have seen in both the artifact and biological cases, long-term copy number can serve as a workable specification to call out that the sequence in question – as unlikely as any other random sequence of equivalent ai – plays some sort of externally specified role.

Assembly index as measure of complexity

In the framework of SC the probability under consideration must be computed for some hypothesized chance process. If this probability is low, and the event in question is highly specified, then we have good reason to rule out the chance process as the event’s cause. We have proposed that assembly index operates as a probability measure – but what chance process is it associated with?

This is where the physical motivation behind ai proves valuable. Consider an object with assembly index ai that is an ordered linear chain of initial building blocks. We assume there are β unique building block elements in the initial pool (for words in the English language β = 26, for proteins constructed of amino acids β = 20). An undirected construction process is one in which the choice of objects to combine at each assembly step is random (we assume a uniform distribution). We can estimate a bound on the probability of a given object being produced by such a process by estimating a bound on the number of possible product objects with index ai. Doing this is not entirely straightforward, however, as it entails characterizing the minimal-length construction trajectory for objects to determine their ai. Nonetheless we argue in the Appendix that there are at least β^ai objects with index ai. Thus the probability of a given object being generated by an undirected process is bounded above by p(Oi) ≲ β^−ai = e^(−ai ln β).

Thus, we see how e^−ai operates, heuristically, as a measure of the probability of an undirected process successfully identifying an object of interest in combinatorial construction space. Therefore, when understood as a measure of SC, assembly allows us to conclude that an undirected construction process is unlikely to explain the origin of an object with high ai and high ni.
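As a toy illustration of this heuristic (entirely our own, and adopting one simple reading of ‘undirected’, namely that each step joins an ordered pair of pool members chosen uniformly at random, with reuse allowed), the following Python sketch estimates how often a random concatenation process of length a happens to produce one specific target string over a small alphabet, and compares the empirical frequency with the β^−a scale:

```python
import random

def random_assembly_frequency(target, alphabet, steps, trials, seed=0):
    """Fraction of trials in which `steps` random joins ever produce `target`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pool = list(alphabet)
        for _ in range(steps):
            left, right = rng.choice(pool), rng.choice(pool)
            pool.append(left + right)        # the product re-enters the pool
        if target in pool:
            hits += 1
    return hits / trials

beta, target, a = 3, "abca", 3               # 'abca' needs 3 joins from {a, b, c}
freq = random_assembly_frequency(target, "abc", a, trials=200_000)
print(f"empirical frequency ≈ {freq:.4f}; heuristic scale beta^-a = {beta ** -a:.4f}")
```

In this small case the empirical frequency sits well below the β^−a scale, consistent with its role as a loose upper bound.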

Nonetheless, there are some noteworthy issues with ai. For example, a highly ordered crystalline structure that occurs naturally by the laws of physics can readily have a high copy number and large ai. If the crystal is simply the repeated presence of a single subunit, then ai ∼ log2 ℓ, where ℓ is a measure of the size of the crystal and can grow quite large. Clearly one need not invoke a directed process to explain the existence of the crystal. Alternative measures of complexity are able to handle such edge cases and this limitation of AT is noted by some of its critics [see Zenil (13)].

Dembski and Ewert (9) emphasize that SC measures must avoid such false positives by carefully considering the space of possible causes when constructing their probability measures. A well-formed SC argument would note that the probability of a repeated crystalline structure is high on the known laws of physics and avoid concluding that a directed process must be at work. Though not articulated in Sharma et al. (1), this nuance is actually described in Figure 4 of Marshall et al. (11), reproduced here in Figure 2, which provides a technique for determining whether to infer a directed process from a given measure of ai. The authors argue for analyzing the combination of ai and object size ℓ. Recall that log2 ℓ ≤ ai ≤ ℓ. If ai is ‘too close’ to its lower bound, log2 ℓ, then objects can be deemed too simple and therefore possibly attainable by an undirected process. Perhaps future iterations of AT will formally include such guards by conjoining all three of (ai, ni, ℓ) into A.

Figure 2.

Illustrative rejection regions for ai and ℓ. Though not referenced in Sharma et al. (1), early work by Marshall et al. (11) (their Figure 4, governed by CC BY 4.0; reproduced here with slight aesthetic modifications) illustrates biosignature interpretations for different observed combinations of ai and ℓ. The yellow area where ai is too close to log2 ℓ is rejected as being too simple – objects in this region can be explained by appeals to random chance and so a directed/biotic selection process cannot be securely inferred. On the other hand, objects in the red region where ai is too close to ℓ are rejected for being ‘so complex that even living systems might have been unlikely to create them’ [Marshall et al. (11)]; we discuss this in Section ‘Conclusion’. The green Goldilocks zone sits in-between and marks where a directed process can be inferred. Note that while the outer envelope is rigorously defined, these internal subregions are not; they are provided for illustrative purposes only.

Before proceeding, we should note that there are several alternative measures of complexity that predate AT. Perhaps most comparable is the work of Böttcher (14, 15) who develops a local measure of molecular complexity Cm derived from the information content of the degrees of freedom on a per-atom basis. Cm has many advantageous qualities including its computability and its linear nature. Under Böttcher’s schema, the molecular complexity of an ensemble of objects is largely given by the sum of the complexities of each member. ai by comparison can behave in unintuitive ways when we consider ensembles of objects with overlapping assembly histories (see, for example, the complexity in Section ‘Computing a bound on N (ai) for proteins’ of the Appendix). Moreover, Böttcher pairs Cm with a measure of the sequence information of biomolecules Ci to develop a model-free schema for understanding biosignatures across a wide spectrum of contexts – much like AT. Indeed, the parallels are striking: both Böttcher and the AT authors apply their measures to the problems of drug discovery, the detection of life in an astrobiological context, the use of experimental methods to estimate complexity [nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS), respectively], and questions of life’s origins. A deeper comparison of the two is warranted.

A more detailed example: proteins

To further illustrate how assembly operates as a form of SC we shall work through a concrete example and recast the assembly Eq. (3) in terms of the common-form CSC Eq. (1).

We consider the case of some protein O with assembly index a observed with copy number n ≥ 1 in an ensemble of N ≥ n total proteins. We explore concrete scenarios and estimate values for these quantities later.

The assembly for this protein, in the context of this ensemble, is: (4) $A = e^{a}\,\frac{n - 1}{N}$

To recast this example in terms of SC we need a probability measure for how (im)probable it is for O to be discovered by an undirected process. As discussed in Section ‘Assembly index as measure of complexity’ this probability is bounded by p ≲ β^−a where β = 20 is the size of the pool of amino acid building blocks available for proteins. Since e^(a ln β) = β^a, we can express p in terms of assembly: (5) $p < \left(\frac{n - 1}{AN}\right)^{\ln\beta}$

Plugging Eq. (5) into Eq. (1), we find (6) $SC = -\log_2\frac{r\,p}{\nu} > -\log_2\!\left[\left(\frac{n - 1}{AN}\right)^{\ln\beta}\frac{r}{\nu}\right]$, which rearranges to $SC > \ln\beta\,\log_2\!\left[\frac{AN}{n - 1}\left(\frac{\nu}{r}\right)^{1/\ln\beta}\right]$

To ensure this is a measure of CSC we must choose a specification function ν and a normalization r such that $\nu(\chi) = \sum_{x \in \chi}\nu(x) < r$. Recall that χ is the space of observations. Since AT’s specification is a function of copy number we shall take χ = {n | 1 ≤ n ≤ N} – i.e. we consider the range of outcomes for observing up to N copies of O in the ensemble.

If we set (7) $\nu = \left(\frac{n - 1}{N}\right)^{\ln\beta}$ and (8) $r = N$ then it follows that (9) $\nu(\chi) = \sum_{n = 1}^{N}\left(\frac{n - 1}{N}\right)^{\ln\beta} < N = r$ because ln β > 0 and (n − 1)/N < 1. Thus these choices of ν and r give us ν(χ) < r.
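A quick numerical check of Eq. (9), using an arbitrary ensemble size of our own choosing (N = 10^5, β = 20):

```python
import math

beta, N = 20, 10 ** 5
nu_chi = sum(((n - 1) / N) ** math.log(beta) for n in range(1, N + 1))
# The sum is roughly N / (ln(beta) + 1), i.e. about N / 4, comfortably below r = N.
print(f"nu(chi) ≈ {nu_chi:.0f} < r = N = {N}")
```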

Plugging Eqs (7) and (8) into Eq. (6), we find (10) $SC > \ln\beta\,\log_2\frac{A}{N^{1/\ln\beta}}$ and we see that this normalized form of A serves as a lower bound on a CSC measure. Because of the conservation of CSC it follows that the probability of observing O is: (11) $\Pr \le 2^{-SC} \le \frac{N}{A^{\ln\beta}}$

To expose the parameters in our model we can expand A: (12) $\Pr \le \frac{N^{1 + \ln\beta}}{\beta^{a}\,(n - 1)^{\ln\beta}}$

These probability equations reflect the features and limitations of assembly we have been discussing. We see that proteins with larger assembly index a and copy number n are, indeed, less probable to attain by undirected processes. However, the assembly (and the resulting probability) is poorly normalized and can be modified substantially by considering different sized ensembles N.

To put some numbers to this we note that proteins routinely have ℓ ≫ 100 amino acids. Since a < ℓ, we will assume an assembly index of a = 100 to probe proteins of moderate complexity. We consider two scenarios.

First, we consider observing n ∼ 10 copies of O within the proteome of a single E. coli cell. Such a cell typically has N ∼ 10^6 proteins (16). This yields A ∼ 10^38 and Pr ≲ 10^−109. This low probability correctly rules out the chance occurrence of n copies of O within E. coli: O is only present in multiple copies because its amino acid sequence is specified (i.e. selected) by the DNA code for O and translated by the machinery in the cell.

Second, we consider a scenario more relevant to evolutionary timescales. Now O is a protein present in the majority of bacteria. There are currently ∼10^30 bacteria on earth (17) so we assume at least n ∼ 10^30 copies of O exist. We draw these n copies from the reservoir of all bacterial proteins that have existed over the course of evolution. This number is difficult to estimate but we’ll assume ∼10^40 bacteria have ever existed, each with ∼10^6 proteins giving us N ∼ 10^46. With these numbers A ∼ 10^27 and Pr ≲ 10^−36. Again, we see a low probability correctly rule out undirected processes. Some sort of selection mechanism must have been at work to generate O.
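Both scenarios can be reproduced with a few lines of Python (our sketch; the ensemble sizes are the rough estimates quoted above, and results are reported as log10 values to avoid overflow):

```python
import math

def log10_assembly(a, n, N):
    """log10 of A = e^a (n - 1) / N, per Eq. (4)."""
    return a / math.log(10) + math.log10(n - 1) - math.log10(N)

def log10_prob_bound(a, n, N, beta=20):
    """log10 of Pr <= N^(1 + ln beta) / (beta^a (n - 1)^ln beta), per Eq. (12)."""
    lb = math.log(beta)
    return (1 + lb) * math.log10(N) - a * math.log10(beta) - lb * math.log10(n - 1)

scenarios = {
    "single E. coli cell": dict(a=100, n=10, N=1e6),
    "bacterial biosphere": dict(a=100, n=1e30, N=1e46),
}

for name, s in scenarios.items():
    print(f"{name}: log10(A) ≈ {log10_assembly(**s):.1f}, "
          f"log10(Pr) <= {log10_prob_bound(**s):.1f}")   # roughly (38, -109) and (27, -36)
```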

Of course these numbers are somewhat contrived – the results are sensitive to the choice of ensemble, N, and the choice of a = 100 is somewhat arbitrary.

In particular, if a = 100 is too high then our probability bound is too low and we are at risk of incorrectly ruling out the efficacy of an undirected process. However, there is reason to believe that a is not substantially lower than the length of the protein in question, and there are many proteins with length ℓ ≫ 100 – several lines of evidence point in this direction, particularly the incompressibility and high entropy content of proteins (18, 19).

In fact, the study of protein compressibility is closely related to the notion of assembly index. Some state-of-the-art approaches can compress protein sequences substantially: by looking for long-range correlations at the genome-scale of a species, Adjeroh and Nan (20) are able to compress protein coding regions by 50%–80% depending on the species. While impressive from a data-storage point of view, this does little on its own to paint a convincing picture that proteins are, in fact, broadly constructed recursively from fundamental building blocks at a higher level of abstraction than individual amino acids. Perhaps more sophisticated complexity analyses such as the Block Decomposition Method (BDM) (21) will show that proteins are recursively constructed from simpler transformations of building blocks – this would provide a tighter bound on ai. Or perhaps the presence of minor or synonymous mutations is confounding compression algorithms, and so analyzing three-dimensional protein structure instead of sequences will be necessary to improve compressibility (22). However, absent such an analysis our current knowledge of the field indicates that ai will likely scale more-or-less linearly with ℓ and that the scaling factor will be modest (i.e. ∼O(0.1)).

Thus our estimates, though contrived, may nonetheless serve as plausible upper limits. This illustrates how assembly operates as a measure of SC to argue that an undirected process is unlikely to yield high-assembly objects. Indeed, in the case of proteins it is unlikely that random sequences will yield novel complex proteins [see, e.g., Tian and Best (23)]; instead a more complex search process is typically invoked to explain the discovery of novel proteins (24, 25). Assembly correctly points to the need for such a process.

Selection as a material mechanism

Workers in the AT community have applied AT to various domains to explore its utility. Marshall et al. (26) compute the assembly index of molecules and show that these correlate with the number of MS2 peaks observed in tandem MS experiments. They use this correlation to estimate ai for a series of samples via MS and find that a cutoff value of ai ∼ 15 serves as a workable threshold for detecting biosignatures. In the context of MS the copy number specification is implied as only samples containing millions of copies of a molecule will yield measurable MS peaks (4, 7). Uthamacumaran et al. (5) use the same data sets to show that alternative compression algorithms applied to MS results outperform AT. Notably, Uthamacumaran et al. (5) apply compression to a binary representation of the MS distance matrices whereas Marshall et al. (26) simply count the number of peaks (after some filtering and thresholding) to measure ai.

In more recent work, Jirasek et al. (27) have shown that peak-counting in infrared spectroscopy and NMR spectra also correlates with ai. These correlations are routinely touted by AT’s authors as evidence that assembly is empirically measurable (4). We simply note that the underlying analysis (peak-counting) is relatively straightforward and that these correlations are unsurprising and would be expected with any well-formed measure of complexity (5, 14). They do not, in and of themselves, constitute evidence for the physical interpretation of one model of complexity over another.

Liu et al. (28) use assembly to explore a variety of chemical contexts: for example, they illustrate how assembly can inform a search for novel drugs by focusing on candidate molecules that are, themselves, assembled from the common pool of building blocks that underlie existing drugs. They also explore how assembly can help adjudicate how closely related a group of molecules is by examining the family of basic biomolecules as well as a set of plasticizers.

Thus Marshall et al. (12) use AT as a biosignature detector and Liu et al. (28) use it as a mechanism to explore molecular groupings and potentially extend them. Importantly, the authors of both papers are clear that assembly index and the underlying assembly pathways are not intended to represent actual physical processes. For example, according to Liu et al. (28) the ‘assembly pathway of a molecule does not necessarily correspond to a realistic sequence of chemical reactions that produce this molecule. Instead, the shortest assembly pathway bounds the likelihood of the molecule forming probabilistically… No matter which methods or synthetic approaches are used, there will be no shorter way than this ideal one, which makes it an intrinsic property of a molecule’. This is consistent with our framing of ai as a physically motivated measure that can set a bound on physical processes but does not necessarily map on to a known actual physical process.

A shift occurs in Sharma et al. (1), however: according to the paper’s title, AT ‘explains and quantifies selection and evolution’. The abstract claims that AT ‘enables us to incorporate novelty generation and selection into the physics of complex objects. It explains how these objects can be characterized through a forward dynamical process […and…] discloses a new aspect of physics emerging at the chemical scale’. And later, the paper introduces a parameterization scheme (discussed below) that ‘suggests that selectivity in an unknown physical process can be explained by experimentally detecting the number of objects, their assembly index and copy number as a function of time’. The usage of ‘explains’ here (emphasis added) is notably vague: does AT provide explanatory mechanisms for selection? Or does it merely explain how the presence and action of selection can be measured and characterized? In contrast, popular-level press releases about AT have been quick to proclaim that AT ‘explains both the discovery of new objects and the selection of existing ones’ (29). It would seem that AT is no longer simply a measure to rule out undirected processes, but also a framework for quantifying and, somehow, explaining directed mechanisms that can form complex objects.

Modeling selection?

The authors of AT refer to such mechanisms as selection: processes that can winnow combinatorial space along an object’s construction trajectory. Since their goal is to establish a broad framework for the construction of complex objects, AT’s authors choose to focus on the generic properties of selection by exploring abstract models in lieu of concrete physical systems. For example, to explore the dynamics of object discovery under selection, Sharma et al. (1) present a toy model represented by the equation: (13) $\frac{dN_{a+1}}{dt} = k_d\,(N_a)^{\alpha}$ where Na is the number of unique objects at assembly index a, kd represents the rate of discovery, and α represents the degree of selection. When α = 1 there is no selection (all objects at prior assembly index are available to construct subsequent objects) and the combinatorial space of possibilities explodes exponentially. However, if α < 1 then a selection process is at work reducing the number of objects from which to generate future objects. In principle, this alleviates the combinatorial explosion at high a and allows a selection process to construct high ai objects within limited available resources.
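The qualitative behavior of this toy model is easy to see numerically. The sketch below (our own, using forward-Euler integration with arbitrary choices of kd = 1 and a fixed pool of N1 = 10 basic objects, and, as in the published solutions, no cap on Na) shows how α < 1 damps the growth of higher assembly levels relative to α = 1:

```python
import numpy as np

def integrate(alpha, k_d=1.0, N1=10.0, a_max=6, t_max=5.0, dt=1e-3):
    """Forward-Euler integration of dN_{a+1}/dt = k_d * N_a**alpha (Eq. 13)."""
    N = np.zeros(a_max + 1)
    N[1] = N1                               # basic building blocks, held fixed
    for _ in range(int(t_max / dt)):
        dN = np.zeros_like(N)
        dN[2:] = k_d * N[1:-1] ** alpha     # level a+1 grows at rate k_d * N_a^alpha
        N += dt * dN
        N[1] = N1
    return N[2:]                            # N_2 ... N_{a_max}

for alpha in (1.0, 0.6):
    print(f"alpha = {alpha}: N_2..N_6 =", np.round(integrate(alpha), 1))
```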

But what does this model describe in practice? Solutions to Eq. (13) are provided in the supplementary information of Sharma et al. (1) and take the form Na ∼ t^x, where x is a function of a and α and always satisfies x ≥ 1 (its detailed behavior is unimportant here). This means the number of unique objects at assembly index a grows unbounded with time, even for small a. This is an unphysical result: the proposed solutions do not account for the combinatorial reality that even for large numbers of building blocks the possible number of unique objects at a given assembly index must be finite. Thus, it is unclear how to apply these solutions to realistic physical systems. Other exemplar models in Sharma et al. (1) are similarly limited. For example, the authors explore the construction of ‘linear chains defined as integers which are equivalent to linear polymers constructed from a single monomeric unit’. For such linear monomers the only distinguishing feature is their length; however, objects of identical length that have different construction histories are treated as distinct objects (e.g. there are two chains of length 4: $1 \xrightarrow{1+1} 2 \xrightarrow{2+2} 4$ and $1 \xrightarrow{1+1} 2 \xrightarrow{2+1} 3 \xrightarrow{3+1} 4$). This is a confusing choice that obscures the relationship between actual objects (e.g. ‘4’) and the various trajectories that could have constructed the objects.

These linear monomers are subsequently used to explore the effects of directed selection vs undirected construction. A simulation is performed in which an undirected process constructs chains by drawing two chains from the assembly pool at random, attaching them together, noting the product, then adding it to the pool. This is compared to a directed process in which, at each step, the longest chain in the pool is chosen and combined with a randomly chosen chain from the pool. The authors then note that the products of the directed process are substantially longer than the products of the undirected process. Since assembly index simply scales with the log of length for these monomeric chains, the directed process therefore generates more complex objects. But both the model and this result are trivial in light of the sorts of real-life physical problems that selection must be invoked to solve.
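This comparison is straightforward to re-create. The following Python sketch is our own reconstruction of the procedure described above (not the authors' code): the undirected variant joins two randomly chosen chains, while the directed variant always extends the current longest chain with a randomly chosen partner:

```python
import random

def grow(steps, directed, seed=1):
    """Grow integer 'chains' by joining; lengths add under concatenation."""
    rng = random.Random(seed)
    pool = [1]                                # the monomeric unit
    for _ in range(steps):
        first = max(pool) if directed else rng.choice(pool)
        second = rng.choice(pool)
        pool.append(first + second)           # the product re-enters the pool
    return max(pool)

steps = 30
print("undirected, longest chain:", grow(steps, directed=False))
print("directed,   longest chain:", grow(steps, directed=True))
```

The directed variant yields chains orders of magnitude longer; since ai only scales with the log of chain length here, this mirrors the (rather modest) result described above.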

These various toy models illustrate a key limitation of selection as construed in AT. Selection is conceived to operate via object concatenation and is parameterized with α as such. However, the real-life construction of objects (whether by the hands of human workers or the blind forces of evolution) does not only operate via concatenation. Consider, again, the example of proteins: it is well known that promiscuous enzymes can be optimized to catalyze reactions on related substrates through the stochastic, mutation-driven exploration of navigable fitness landscapes (30). Modeling such a change-by-modification process is awkward in AT given its change-by-concatenation perspective of selection.

Finally, Sharma et al. (1) further characterize the selection process by stipulating that the timescale for producing copies, τP, and the timescale for discovering new objects, τD, must be commensurate. If τP ≫ τD then many unique objects with low copy number will be discovered and proliferate (these have low A); and if τD ≫ τP then there will be many copies of objects with low assembly index (these will also have low A). Expressing these constraints (α < 1, τD ∼ τP) provides helpful language; however, neither these constraints nor the simple models explored in Sharma et al. (1) describe a concrete set of mechanisms for implementing an actual viable material selection process.

Selection and information

Indeed, for a material selection process to actually exist, function, and succeed, a few things need to be true:

  • There must be a mechanism for creating copies of objects such that earlier elements in an assembly trajectory are available for combination into future elements.

  • There must be a mechanism for generating variations of objects – envisioned in AT as recombining prior elements in new ways.

  • There must be a mechanism for filtering the objects at each step in the trajectory in a way that collapses combinatorial space.

Readers will recognize these as the three Lewontin conditions for evolution to hold (31). These are necessary but not sufficient. A viable selection process must also be shown to actually succeed in discovering high-ai objects within the available probabilistic resources (time, number of trials, material, etc.) and in spite of any confounding factors (e.g. decay rates, polluting cross-linkages, etc.). AT does not attempt to concretely address any of these thorny issues for any actual physical system of interest – at best, AT narrowly focuses on the degree of challenge posed solely by the information dynamics of the system in question.

Nonetheless, AT’s senior authors argue that a material selection process must exist by dint of the existence of objects with high assembly. Lee Cronin makes precisely this argument in the context of abiogenesis, arguing that 'existence is the proof' (32). Such a conviction, however, neither demonstrates that a material process capable of the necessary selection exists nor does it shed light on how it might operate. In fact, Marshall et al. (26) explicitly cast doubt, noting that high-ai objects are ‘those so complex they require a vast amount of encoded information to produce their structures, which means that their abiotic formation is very unlikely’.

More pointedly, Sharma et al. (1) note that some known prebiotic synthesis reactions, ‘such as the formose reaction, end up producing tar, which is composed of a large number of molecules with too low a copy number to be individually identifiable’. This is one of many paradoxes associated with origin of life research (33) and it is not clear how a prebiotic selection process might emerge that is capable of selecting subcomponents of the combinatorially complex tar and moving the ensemble towards a high-copy number target such as life (34, 35). Indeed, it is increasingly apparent that origin of life experiments that manage to make progress typically do so only because of human intervention (36).

This brings us to the crux of the issue. In pointing to selection, AT draws attention to a core problem underpinning the formation of complex objects: that of information. Consider the toy model in Eq. (13): the parameterization of selection with a simple scalar α is, on its own, somewhat misleading. The challenge to solve is not merely identifying a low-α process capable of selecting an arbitrary subset, $N_a^{\alpha}$, of objects. The challenge is to identify a directed process that can select particular subsets of objects that result in the eventual construction of relevant high-ai objects. This is what it means for a directed selection process to tame the combinatorial explosion of possibilities: there must be some non-random bias that drives the exclusion of some possibilities and the inclusion of others.

Indeed we can estimate the information content of such a process (37). As described in Shannon (3), information can be computed in terms of probability via: (14) $I = -\log_2 P$ so at each step in which a viable subset $N_a^{\alpha}$ is selected the selection process must utilize (15) $I = -\log_2\frac{N_a^{\alpha}}{N_a} = (1 - \alpha)\log_2 N_a$ bits of information to propagate the system forward. Where does this information come from? And how is it marshaled by a viable physical process to do the relevant work? Assembly theory does not (indeed, cannot) answer these questions; it merely helps us ask them in a more quantitative frame.
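For concreteness, a minimal Python rendering of Eq. (15), with an arbitrary example of our own choosing (Na = 10^6 objects at level a and α = 0.5):

```python
import math

def selection_bits(N_a, alpha):
    """Information consumed per selection step, Eq. (15): (1 - alpha) * log2(N_a)."""
    return (1 - alpha) * math.log2(N_a)

print(f"{selection_bits(10 ** 6, 0.5):.1f} bits must be supplied at this step")
```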

It is striking to compare this conclusion with what senior authors Cronin and Walker have argued in the past. In Cronin and Walker (38), both advocate for the need to ‘[focus] on information’ to make headway in the origin-of-life field. And in Walker and Davies (39) and Walker and Davies (40), Walker eloquently argues that the origin-of-life represents a phase shift in which information transitions from playing a passive role to taking on a more active role as a top-down causal force [drawing on Ellis (41)]. Walker describes this as ‘the hard problem’ of life. However, AT, as an adjudicator of undirected processes, can at best point us to where this ‘hard problem’ is at work; it does not make progress explaining how it works.

This can be seen in the shape of Cronin and Walker’s latest research program. Walker (4) discusses Cronin’s development of a ‘chemputer’ (42) – an impressive ‘grad-student-in-a-box’ programmatically driven chemical synthesis machine. According to Walker (4):

Lee right now has several chemputers running in his lab with primordial-soup chemistries in them, which he is evolving by programmable iteration over possible environments… Because we have assembly theory, we can quantify the assembly of the starting molecules and track how assembled things become over time to look for evidence of the emergence of evolution and selection within the boundary conditions of the experiment… Because the chemputer is automated, we hope to scale this up to search large volumes of chemical space all at once… [with] a cloud of potentially millions of chemical reactors.

Notably, while AT can help define success criteria for these experiments, it offers little insight into how to conduct them. Instead, the experimental approach most resembles a brute force search [albeit one in which each search attempt is heavily information laden – see Šiaučiulis et al. (43) for a sense of how much information is encoded in the protocols]. Ironically, AT itself casts doubt that such a random-sampling approach will succeed; indeed we can hazard a prediction: To the extent that the chemputer cluster can be directed towards finding a solution (perhaps by an AI-driven algorithm iterating on protocols to eventually generate high assembly objects) it will be because of a target-oriented design – an externally imposed selection – that has no material analogue in the prebiotic context of the early earth.

Conclusion

We have shown that AT includes a measure of SC: the conjoining of assembly index (complexity) and copy number (specification) can rule out undirected processes as explanations for high assembly objects. But while AT helps us identify the action of a directed process such as selection, and provides some helpful vocabulary for parameterizing such a process – it does not provide insight into how a physically viable material mechanism for the selection process might itself arise and operate. Such a selection process would need to overcome a wide range of confounding factors [see, e.g., Benner (33) for the origin-of-life context] that AT does not attempt to address.

What might explain the origin and nature of a viable selection process, particularly in the context of abiogenesis? Walker and Davies (40) argue that the causal agency of information is the ‘hard part of the problem’ of life, drawing an analogy to Chalmers’ ‘hard problem of consciousness’ (44). What if these two problems are actually one and the same, and the unified ‘hard problem’ to grapple with is the nature of conscious agents and their capacity to infuse and affect the material world with information? After all, in our uniform and repeated experience (45) we routinely observe that conscious agents are capable of sifting through combinatorial space by providing and acting upon the information necessary to perform selection and accomplish outcomes. The presence of selection may well be evidence of the action of a selector (46).


© 2025 Onsi Joe Fakhouri, published by The Israel Biocomplexity Center
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.