Quantum computing is one of the most fundamental aspects of quantum physics, with its origins traceable to Feynman’s early insights into quantum simulation [1]. It is essential to note that quantum algorithms often rely on the efficient encoding of classical Boolean functions into quantum circuits through a process known as oracle construction. An oracle is a reversible quantum circuit that implements an unknown Boolean function and is integral to several landmark quantum algorithms, including Deutsch-Jozsa [2], Grover’s search [3], Simon’s hidden shift finding [4], and Shor’s factoring algorithm [5].
Quantum gates, represented by unitary matrices, serve as the basic building blocks of quantum circuits. Unlike classical gates, they are inherently reversible. In fault-tolerant quantum computing [6], oracle construction becomes especially resource-intensive due to the high cost of implementing multi-controlled Toffoli (MCT) gates. These gates must be decomposed into a universal gate set, such as Clifford+T [7], where the key resource metrics like gate count, circuit depth, and qubit overhead become critical for the feasibility of large-scale quantum computations. Therefore, minimizing these resources is essential for improving algorithmic efficiency and mitigating computational errors.
Circuit depth is calculated by the number of layers of gates, assuming parallel execution on disjoint qubits. The T-depth explicitly measures the number of such layers containing T gates, a crucial measure in fault-tolerant quantum computation. Since T gates are resource-intensive to implement, often requiring costly magic state distillation, minimizing T-depth directly reduces circuit latency and execution time. This is particularly important for near-term quantum devices with limited coherence times. As a result, T-depth optimization has become a key focus in the design of efficient quantum circuits [8,9].
While recent advances in surface code implementations suggest that T-depth may not be the sole bottleneck in large-scale quantum circuit design [10,11], it nonetheless remains a critical metric. The overall cost of fault-tolerant quantum computation involves several factors, including T-count, qubit overhead, and scheduling strategies such as resource state distillation and teleportation. However, in the context of near-term quantum hardware with coherence times as the primary constraint, T-depth still plays a significant role in determining circuit latency and execution feasibility. Throughout the paper, we present our results under the standard assumption that T-depth is among the most resource-intensive parameters, and optimizing it is crucial for practical quantum circuit implementation.
In [12, Page 11], the authors emphasized the importance of precise resource estimation in implementing arbitrary n-input, m-output Boolean functions, generalizing beyond the scope of multivariate AND functions. Motivated by this, we present the first generic construction of optimal T-depth quantum circuits for arbitrary Boolean functions. Our approach generalizes and subsumes the methodology of [12], achieving the minimum possible Toffoli (and thus T) depth by exploiting the Algebraic Normal Form (ANF) of the Boolean function. While this construction incurs a considerable qubit overhead, it offers a compelling trade-off by substantially reducing T-depth. This makes our approach particularly valuable for efficient oracle construction, reversible circuit synthesis, and quantum implementations of cryptographic components such as S-boxes. We illustrate the practical relevance of our method by applying it to the quantum circuit design of AES, a widely used and structurally complex block cipher.
Numerous efforts have been made to reduce the quantum resource requirements for AES in last few years, including [13–19]. For example, [14, Table 7] and [16, Table VI] report a T-depth of 60 for AES and AES† combined (i.e., 30 per instance), while [19, Table 5] reports T-depths of 30, 36, and 42 for AES-128, AES-192, and AES-256, respectively. More recently, Huang et al. [18, Table 3] demonstrated an implementation of T-depth 3 for the 8-bit input 8-bit output AES S-box with a formal proof of its minimality.
In contrast, our approach is cipher-independent and provides a generic quantum circuit construction that achieves minimal T-depth for any Boolean function specified by its ANF, without putting any constraint on other resources. This construction also establishes a naive yet complete upper bound on ancillary resources, thereby setting benchmarks applicable to a broader class of ciphers beyond AES, or in general for any application that is understood by Boolean functions. In particular, our method generalizes the optimal T-depth MCT decomposition via Clifford plus Toffoli gates, as described in [12, Corollary 1]. All results are presented under the assumption that T-depth remains one of the most critical cost metrics for near-term quantum devices.
The structure of the paper and the contributions of each section are summarized below.
Section 2 outlines the preliminaries. Section 3 is the prime contributory section in terms of theoretical result, which provides a worst-case resource bound for optimal T-depth quantum circuit implementation of any n-input, m-output Boolean function. We illustrate the result with examples and highlight its implications for S-box design, including a summary of optimal T-depth resource estimates for different standard S-boxes in Table 1. In Section 4, to explain with an example, we focus on AES, providing a step-by-step resource estimation for an optimal T-depth quantum implementation, and discuss the broader impact for a large class of block ciphers executed over multiple rounds. Section 5 explores the cryptanalytic implications of our results, particularly in the context of Grover’s algorithm, and determines the optimal T-depth for a full-round Grover’s search for a large class of block ciphers. Finally, Section 6 concludes the paper with a summary of contributions and future research directions.
Optimal T-depth quantum circuit synthesis for various standard S-boxes with corresponding resource estimates using Theorem 2. *A detailed analysis for the AES S-box is provided in Section 4.
| S-box | #Variables | Ancilla count | CNOT count | CNOT depth | T-count | T-depth |
|---|---|---|---|---|---|---|
| LowMC [22] | 3 | 9 | 39 | 13 | 12 | 1 |
| DEFAULT [23] | 4 | 17 | 79 | 27 | 24 | 2 |
| GIFT [24] | 4 | 14 | 76 | 27 | 24 | 2 |
| PRESENT [25] | 4 | 19 | 105 | 30 | 32 | 2 |
| PRINCE [26] | 4 | 24 | 128 | 28 | 40 | 2 |
| ASCON [27] | 5 | 27 | 120 | 22 | 32 | 1 |
| AES [28]* | 8 | 596 | 3647 | 184 | 984 | 3 |
Let 𝔽2 = {0, 1} be the prime field of characteristic 2 and
The Algebraic Normal Form (ANF) of a Boolean function f ∈ ℬn can be represented by a polynomial over 𝔽2 in n binary variables, given by
For a Boolean function
Given the ANF of a Boolean function f ∈ ℬn, a reversible quantum circuit implementing f can be constructed using CNOT gates for linear terms and multi-controlled Toffoli (MCT) gates for nonlinear terms, where the relevant input variable(s) act as control qubit(s) and the output qubit serves as the target. However, since MCT gates are not native to most quantum hardware, they must be decomposed into a universal gate set for practical implementation.
The Clifford+T gate set is the most widely adopted universal gate set in quantum computing due to its compatibility with fault-tolerant quantum computation. The Clifford group consists of the Hadamard, Phase, and CNOT gates. Augmenting it with the non-Clifford T gate yields a universal set capable of approximating any unitary operation to arbitrary precision.
Recently, substantial efforts have been directed toward optimizing the Clifford+T decomposition of the Toffoli gate. State-of-the-art techniques employ measurement-based uncomputation, achieving Toffoli decompositions with four T gates. Notably, Gidney’s construction [20] attains an effective T-depth marginally above 1 without employing any ancilla qubit (see Figure 1a) by executing the initial T gates in parallel. In contrast, design provided in Jaques et al. [13] achieves a T-depth of 1 using a single reusable ancilla qubit (see Figure 1b). A comprehensive summary of optimized Toffoli decompositions across various resource metrics is provided in [12, Table 1].

In addition to optimizing basic Toffoli gates, significant progress has been made toward efficient implementations of multi-controlled Toffoli (MCT) gates. A recent work [12] demonstrates that employing a binary tree structure in the decomposition of an n-MCT gate achieves an optimal T-depth of ⌈log2 n⌉, requiring at most 2(n – 1) ancilla qubits and 4(n – 1) T gates. In [12], the optimality was first established in terms of Toffoli depth, assuming the MCT gate is decomposed into Clifford and Toffoli gates. Subsequently, by applying the Toffoli gate decomposition shown in Figure 1b, the optimality result was extended to T-depth as well.
In this regard, one may note that, under a different set-up, the T-depth can be reduced to as low as one by employing different constructions, such as using a large number of ancilla qubits and CCZ resource states, and leveraging teleportation-based techniques, as outlined in [21]. However, such explorations are beyond the scope of this paper. In the present paper, we focus on a specific computational model and establish the optimality within the framework specified in [12].
A comprehensive summary of recent advancements in MCT circuit decompositions is presented in [12, Table II]. In the following section, we extend this line of work by presenting a generic construction for quantum circuits with optimal Toffoli (and hence T) depth for evaluating arbitrary Boolean functions. Our key insight generalizes the approach of [12] by utilizing the ANF framework, particularly the XOR (𝔽2-addition) of AND (𝔽2-multiplication) components, in conjunction with tree-based circuit synthesis. We detail this construction in the next section.
In this section, we present an optimal T-depth quantum circuit for implementing any arbitrary n-input, m-output Boolean function, thereby completing the extension proposed in [12]. Before proceeding, we revisit [12, Corollary 1] and formally provide the idea in Theorem 1 to make the paper self-contained.
In [12], the optimality of T-depth was established by structuring the MCT decomposition as a binary tree, where the Toffoli depth corresponds to the height of the tree, which is lower bounded by ⌈log2 n⌉ for n leaf nodes. Each Toffoli gate was further decomposed into the Clifford+T gate set using the T-depth-1 construction shown in Figure 1b, yielding an overall T-depth of ⌈log2 n⌉. This is the optimal T-depth for implementing an n-MCT, assuming the decomposition is via Clifford plus Toffoli gates. We refer to the T-depth optimality obtained from this specific framework, [12] as MCT-Toffoli-T optimality.
Throughout this paper, any reference to ‘optimal T-depth’ adheres to this MCT-Toffoli-T optimality, obtained via Clifford plus Toffoli decomposition.
([12, Corollary 1]) Following the Toffoli decomposition circuit proposed in [13] (see Figure 1b), the Clifford+T decomposition of an n-MCT gate, implemented via Clifford plus Toffoli decomposition, is lower bounded by ⌈log2 n⌉, requiring 2n – 2 additional ancilla qubits and 4n – 4 T gates.
As discussed earlier, implementing higher-degree terms in the ANF of a Boolean function requires multi-controlled Toffoli (MCT) gates, where the input variables of the monomial serve as control qubits and the corresponding output qubit as the target. Specifically, a degree-k monomial requires a k-MCT gate. According to Theorem 1, a k-MCT gate can be decomposed into the Clifford+T gate set (via Toffoli gates) with an optimal T-depth ⌈log2 k⌉. Similarly, implementing the highest-degree term x1x2 ⋯ xn using a binary tree structure incurs a T-depth of ⌈log2 n⌉.
Notably, the ANF of an arbitrary n-input, m-output Boolean function
Additionally, the lower bound can be justified by noting that any Boolean circuit can be expressed as a composition of additions and multiplications. Since each layer of parallel multiplication increases the algebraic degree by at most one, the algebraic degree of the function is upper bounded by the maximum multiplicative depth.
To enable parallel MCT operations, multiple copies of the input variables and additional ancilla qubits are required to store intermediate results. These copies can be generated using CNOT gates, which, being Clifford operations, do not affect the T-depth. The result is summarized formally in the following theorem.
Let
There are
After realizing all the unique nonlinear monomials from the ANF, any m-output Boolean function can be constructed by copying up to
Finally, by [12, Corollary 1], each k-MCT gate in its optimal T-depth decomposition uses 2(k – 1) ancilla qubits and 9(k – 1) CNOT gates. Ancilla qubits are made reusable via uncomputation. Since each T-depth layer corresponds to a CNOT depth of 9, the overall CNOT depth contribution is 9⌈log2 n⌉. Hence, the overall resource requirement is given by:
#Ancilla qubits:
\left[ {\sum\limits_{k = 2}^n {k\left( {\matrix{ n \cr k \cr } } \right)} - n} \right] + \sum\limits_{k = 2}^n {\left( {\matrix{ n \cr k \cr } } \right)} + m + \sum\limits_{k = 2}^n {2(k - 1)\left( {\matrix{ n \cr k \cr } } \right)} = {2^{n - 1}}(3n - 2) - 3n + m + 1, #T gates:
\sum\limits_{k = 2}^n {4(k - 1)\left( {\matrix{ n \cr k \cr } } \right)} = {2^{n + 1}}(n - 2) + 4, #CNOT gates:
2\left[ {\sum\limits_{k = 2}^n {k\left( {\matrix{ n \cr k \cr } } \right)} - n} \right] + m\left[ {\sum\limits_{k = 2}^n {\left( {\matrix{ n \cr k \cr } } \right)} + n} \right] + \sum\limits_{k = 2}^n {9(k - 1)\left( {\matrix{ n \cr k \cr } } \right)} = {2^{n - 1}}(11n + 2m - 18) - 4n - m + 9, CNOT depth:
2(n - 1) + ({2^n} - 1) + 9\left\lceil {{{\log }_2}n} \right\rceil = {2^n} + 2n + 9\left\lceil {{{\log }_2}n} \right\rceil - 3.
Theorem 2 directly applies to the quantum circuit synthesis of S-boxes, where an n-bit S-box is modeled as an n-input, n-output Boolean function. The resulting circuit requires at most 2n–1(3n – 2) – 2n + 1 reusable ancilla qubits, 2n+1 (n – 2) + 4 T gates, and up to 2n–1(13n – 18) – 5n + 9 CNOT gates. The CNOT depth is bounded by 2n + 2n + 9⌈log2 n⌉ – 3, while achieving an optimal T-depth of ⌈log2 n⌉. To illustrate the construction, we present the 3-bit S-box used in LowMC [22] as an example.
The coordinate Boolean functions are given by: f0 = x0 ⊕ x1x2, f1 = x0 ⊕ x1 ⊕ x0x2, and f2 = x0 ⊕ x1 ⊕ x2 ⊕ x0x1. Since the maximum algebraic degree is 2, the corresponding quantum circuit (see Figure 2) achieves the optimal T-depth ⌈log2 2⌉ = 1, using 9 ancilla qubits, 12 T gates, and at most 33 CNOT gates. Notably, as the ANFs of these Boolean functions do not contain all possible monomials, the actual resource requirements are significantly lower than the worst-case estimates provided in Theorem 2. In practice, such sparsity often results in substantially reduced overheads compared to theoretical upper bounds.

Quantum circuit implementing a 3-bit S-box (used in LowMC) with T-depth 1.
In Table 1, we summarize the quantum resource requirements for synthesizing various standard S-boxes with optimal T-depth following Theorem 2.
From Theorem 2, the optimal T-depth Clifford+T decomposition of an n-input, single-output Boolean function f ∈ ℬn requires at most 2n–1(3n – 2) – 3n + 2 reusable ancilla qubits, 2n+1 (n – 2) + 4 T gates, and up to 2n–1(11n – 16) – 4n + 8 CNOT gates, with a CNOT depth of 2n + 2n + 9⌈log2 n⌉ – 3. We illustrate the circuit construction with a representative example.
Let f = x0x2 ⊕ x1x3 ⊕ x0x1x2x3 be a Boolean function consisting of three nonlinear terms. Since deg(f) = 4, its Clifford+T decomposition yields a T-depth of ⌈log2 4⌉ = 2. The resulting quantum circuit requires 12 ancilla qubits, 20 T gates, and 46 CNOT gates, with a CNOT depth of 12 (see Figure 3). Notably, as f does not contain all possible monomials, the resource overhead is significantly lower than the theoretical worst-case bound.

Quantum circuit implementing f (x0, x1, x2, x3) = x0x2 ⊕ x1x3 ⊕ x0x1x2x3 with T-depth 2.
While our construction achieves optimal T-depth, it incurs an exponential overhead in ancilla qubits and CNOT gates with respect to the number of variables. A more practical alternative is to allow a slight increase in T-depth in exchange for a substantial reduction in ancilla and CNOT counts. For instance, replacing the T-depth-1 Toffoli decomposition from [13] (Figure 1b) with the logical-AND-based decomposition from [20] (Figure 1a) in Theorem 2 reduces the ancilla count by 2n–1 (n – 2) + 1 and the CNOT count by 3 · 2n–1 (n – 2) + 3, while marginally increasing the T-depth from ⌈log2 n⌉ to ⌈log2 n⌉ + 1. This trade-off highlights a promising direction for future research, where our construction can serve as a T-depth benchmark while optimizing other quantum resources, even at the cost of slight T-depth sub-optimality, rather than prioritizing depth reduction alone.
This observation also clarifies the optimal T-depth for ANF-based implementations. In particular, for quantum circuits designed for cryptanalytic purposes, this optimal depth, viewed as a benchmark, has not been systematically explored as a performance metric. In the next section, we illustrate this gap through concrete examples, focusing on the widely studied AES block cipher. The idea naturally extends to any standard block ciphers in general.
This section presents an optimal T-depth quantum circuit for the AES algorithm. We first construct an optimal T-depth implementation of the AES S-box and subsequently extend the construction to the full AES algorithm for key sizes of 128, 192, and 256 bits, corresponding to 10, 12, and 14 rounds, respectively.
AES is a symmetric-key block cipher standardized by NIST, operating on 128-bit data blocks. The internal state is represented as a 4 × 4 matrix
The AES algorithm can be abstracted as a Boolean function
In practice, AES is implemented in a round-wise. Hence, a more feasible approach is to design the quantum circuit for each round individually and estimate the overall T-depth based on actual implementations. Each AES round (except the final one) consists of four transformations, described as follows.
SubBytes: Applies the AES S-box to each element Si,j ∈ 𝔽28 of the internal state. This is the only nonlinear operation and the primary contributor to the T-depth, requiring 16 parallel S-box evaluations: Si,j ← S-box(Si,j).
ShiftRows: Performs a cyclic left shift on each row of the state matrix. For row i, the transformation is defined as Si,j ← Si(j+i) mod 4. As this is essentially a SWAP operation, it is implemented via rewiring in the quantum circuit and does not incur additional quantum resources.
MixColumns: Applies a linear transformation to each column
of the internal state S using an MDS matrix{\bf{c}} \in _{{2^8}}^4 . Since this transformation is linear, it can be realized using only CNOT gates. Note that the final round of AES omits the MixColumns operation.M \in _{{2^8}}^{4 \times 4}:{\bf{c}} \leftarrow M \cdot {\bf{c}} AddRoundKey: The AES key schedule expands the primary key K into round keys
, each used in a specific round. In this step, the internal state S is XORed with the current round key Ki, defined as S ← S ⊕ Ki.{K_0},{K_1}, \ldots ,{K_r} \in _{{2^8}}^{4 \times 4}
Since only the SubBytes transformation introduces nonlinearity via S-boxes, specifically through multi-controlled Toffoli (MCT) gates contributing to the T-depth, the overall T-depth of an AES implementation is dominated by the S-box operations. As 16 S-boxes are evaluated in parallel per round, the total T-depth of the AES circuit can be estimated as r times the T-depth of a single S-box, where r denotes the number of rounds. We now present a detailed quantum resource estimation for the AES S-box with optimal T-depth, as derived from Theorem 2.
The coordinate Boolean functions of the AES S-box have a maximum algebraic degree of 7. By Theorem 2, an optimal T-depth quantum circuit for the AES S-box can thus be constructed with a T-depth of ⌈log2 7⌉ = 3. Notably, the ANF of all coordinate functions collectively includes all monomials up to degree 7. We construct these using three layers of parallel Toffoli gates.
First Layer: We begin by constructing all
degree-2 monomials. This requires 7 copies of each input variable, 56 copies in total, 8 of which already exist. Thus, 48 ancilla qubits and 48 CNOT gates are needed to copy the inputs, with a CNOT depth of 3. Storing the 28 quadratic terms requires 28 additional ancilla and 28 parallel Toffoli gates, contributing a Toffoli depth of 1. In total, the first layer uses 48 + 28 = 76 ancilla qubits, 48 CNOT gates (CNOT depth 3), and 28 Toffoli gates.\left( {\matrix{ 8 \cr 2 \cr } } \right) = 28 Second Layer: This layer constructs all
degree-3 and\left( {\matrix{ 8 \cr 3 \cr } } \right) = 56 degree-4 monomials by combining previously computed terms. To create 56 cubic terms, we combine degree-2 terms with single-variable terms. As we have 56 single-variable qubits and 28 degree-2 terms, we need to duplicate the quadratic terms using 28 ancilla and 28 CNOT gates (depth 1). Storing the 56 cubic terms adds 56 ancilla. To construct 70 quartic terms, we need five additional copies of all degree-2 monomials (140 ancilla and 140 CNOT gates), and 70 more ancilla to store the outputs. The second layer thus requires a maximuma of (28 + 56 + 140 + 70) = 294 ancilla, (28 + 140) = 168 CNOT gates (CNOT depth 3), and (56 + 70) = 126 Toffoli gates.\left( {\matrix{ 8 \cr 4 \cr } } \right) = 70 Third Layer: At this stage, we have: 70 quartic, 56 cubic, 196 quadratic, and 56 single-variable qubits. We now build all degree-5, 6, and 7 monomials. The 8 septics
are computed by combining eight degree-4 and eight degree-3 terms. The 56 quintics\left( {\matrix{ 8 \cr 7 \cr } } \right) = 8 are obtained by combining 56 degree-4 and degree-1 terms. The 28 sextics\left( {\left( {\matrix{ 8 \cr 5 \cr } } \right)} \right) require combinations of four degree-4 and degree-2, and 24 degree-3 and degree-3 terms. In total, constructing and storing these 92 higher-degree monomials requires (56 + 28 + 8) = 92 ancilla qubits and 92 Toffoli gates.\left( {\left( {\matrix{ 8 \cr 6 \cr } } \right)} \right)
Across all eight coordinate functions, 1001 monomials appear in total, requiring 1001 CNOT gates and 8 ancilla qubits for output storage. The CNOT depth is dominated by the coordinate function with the most monomials, which is 145.
Using the T-depth-1 Toffoli-to-T decomposition (see Figure 1b), implementing 246 Toffoli gates require (246 × 4) = 984 T gates and (246 × 9) = 2214 CNOT gates with a CNOT depth of (9 × 3) = 27. Since Toffoli gates are executed in three different layers (with a maximum of 126 in the second layer), this incurs an additional (reusable) ancilla count of 126. Although, the Toffoli gates can be uncomputed without extra cost, uncomputing intermediate qubits requires (48 + 168) = 216 CNOT gates.
Hence, the optimal T-depth implementation of the AES S-box requires a total (76 + 294 + 92 + 8 + 126) = 596 ancilla qubits, (48 +168 + 1001 + 2214 + 216) = 3647 CNOT gates, with a CNOT depth of 2(3 + 3) + 27 +145 = 184, and 984 T gates, achieving the T-depth 3.
Alternatively, replacing the T-depth-1 design with Gidney’s logical-AND decomposition (see Figure 1a) reduces the ancilla count by 126 and the CNOT count by (246 × 3) = 738, at the cost of increasing the T-depth to ⌈log2 7⌉ + 1 = 4. In this case, each Toffoli gate has CNOT depth 6, resulting in a total CNOT depth of 2 × (3 + 3) + 145 + (3 × 6) = 175. Table 2 summarizes the quantum resource requirements for implementing the AES S-box.
In the SubBytes transformation, 16 AES S-boxes are executed in parallel, resulting in a 16-fold increase in the number of ancilla qubits, CNOT gates, and T gates compared to a single S-box, while the circuit depths remain unchanged. The ShiftRows transformation introduces no additional quantum resource overhead. According to [29, Table 5], the MixColumns transformation can be implemented in-place (i.e., without additional ancilla) using 98 CNOT gates with a CNOT depth of 13. As MixColumns is applied in parallel to the four columns of the internal state, the total CNOT count becomes 4 × 98 = 392, while the CNOT depth remains 13. Finally, the bit-wise XOR with the round key requires 128 parallel CNOT gates, contributing a CNOT depth of 1. The overall quantum resource requirements for a single round of AES are summarized in Table 3.
Quantum implementation of a single round of AES with corresponding resource estimates.
| Toffoli-to-T | Ancilla count | CNOT count | CNOT depth | T-count | T-depth |
|---|---|---|---|---|---|
| Using Figure 1a | 7520 | 47064 | 189 | 15744 | 4 |
| Using Figure 1b | 9536 | 58872 | 198 | 15744 | 3 |
Assume that AES operates for r ∈ {10, 12, 14} rounds, depending on the key length. Ancilla qubits used in the S-boxes of each round are reclaimed for subsequent rounds through uncomputation, except for the 8 qubits required to store the functional output. As a result, the ancilla count increases by 8 × 16 = 128 per round, yielding a total ancilla requirement of 9536 + 128(r – 1). Furthermore, since the final round omits the MixColumns operation, the total CNOT count is given by 58872r – 392, and the CNOT depth is 198r – 13. Both the T-count and T-depth scale linearly with the number of rounds.
Table 4 presents the resource estimates for complete quantum implementations of AES with different key sizes, using our optimal T-depth construction (see Figure 1b). The T-depth achieved in this construction is provably optimal (from Theorem 1), as no quantum circuit for AES, when the rounds are implemented sequentially, can attain a lower T-depth, when decomposed via Toffoli plus Clifford gates, irrespective of the number of ancilla qubits or other resource overheads.
Optimal T-depth quantum implementation of full round AES with corresponding resources.
| Key size | No. of round | Ancilla count | CNOT count | CNOT depth | T-count | T-depth |
|---|---|---|---|---|---|---|
| 128 | 10 | 10688 | 588328 | 1967 | 157440 | 30 |
| 192 | 12 | 10944 | 706072 | 2363 | 188928 | 36 |
| 256 | 14 | 11200 | 823816 | 2759 | 220416 | 42 |
This additional resource overhead in Table 4 demonstrates the practical viability of our optimal T-depth quantum circuit constructions for AES. While earlier implementations have reported fewer ancilla qubits, lower gate counts, and in some cases, comparable T-depths (e.g., [14, Table 7] and [16, Table VI] report a T-depth of 60 for AES and AES† combined, i.e., 30 per instance, and [19, Table 5] reports T-depths of 30, 36, and 42 for AES-128, AES-192, and AES-256, respectively), none of these works formally establish the optimality of their constructions. Given that presently AES is one of the most popular ciphers in symmetric-key cryptography, numerous heuristic and brute-force efforts have been made to optimize its quantum implementation. The T-depth reductions observed in prior works primarily derived from those efforts, rather than from systematic constructions with provable optimality over a general class of block ciphers.
Optimal T-depth quantum implementation of full round AES: A comparison with prior results.
| Key size | References | Ancilla count | CNOT count | CNOT depth | T-count | T-depth |
|---|---|---|---|---|---|---|
| 128 | [13, Table 4] | 4244 | 284420 | NA | 54400 | 120 |
| 128 | [30, Table 2] | 9384 | NA | NA | 33600 | 50 |
| 128 | [15, Table 13] | 3689 | 132376 | NA | 27200 | 40 |
| 128 | [14, Table 7] | 5576 | 285393 | NA | 62400 | 30 |
| 128 | [16, Table VI] | NA | 228020 | NA | 52800 | 30 |
| 128 | [18, Table 8] | NA | 176580 | NA | 33600 | 30 |
| 128 | [19, Table 5] | 6128 | 120812 | NA | 117984 | 30 |
| 128 | [This paper] | 10688 | 588328 | 1967 | 157440 | 30 |
| 192 | [13, Table 4] | 4564 | 321021 | NA | 60928 | 144 |
| 192 | [30, Table 2] | 10456 | NA | NA | 37632 | 60 |
| 192 | [15, Table 13] | 3945 | 149256 | NA | 30464 | 48 |
| 192 | [19, Table 5] | 6448 | 136812 | NA | 132960 | 36 |
| 192 | [This paper] | 10944 | 706072 | 2363 | 188928 | 36 |
| 256 | [13, Table 4] | 4884 | 393534 | NA | 75072 | 168 |
| 256 | [30, Table 2] | 46368 | NA | NA | 12704 | 70 |
| 256 | [15, Table 13] | 4457 | 187128 | NA | 38080 | 56 |
| 256 | [19, Table 5] | 6768 | 168548 | NA | 165264 | 42 |
| 256 | [This paper] | 11200 | 823816 | 2759 | 220416 | 42 |
In this context, we acknowledge the recent work of Huang et al. [18], which also reports a T-depth 3 implementation of the AES S-box (see [18, Table 3]) and establishes its minimality, similar in spirit to our own findings. However, we emphasize that our work is independent and presents a generic quantum circuit construction method that achieves minimal the T-depth for any arbitrary Boolean function derived from its Algebraic Normal Form. This construction provides a naive yet complete upper bound on ancillary quantum resources, thereby setting new benchmarks applicable to a broad class of S-boxes beyond AES. Furthermore, our approach is a generalization of the optimal T-depth MCT decomposition (via Clifford plus Toffoli gates), as outlined in [12, Corollary 1]). Consequently, optimization strategies such as logical-AND and conditionally clean ancilla techniques can also be integrated to reduce quantum resource overheads significantly, at the cost of a marginal increase in T-depth beyond the optimal bound.
It is important to note that a T-depth 3 implementation of the AES S-box does not necessarily imply the optimality for the full-round AES circuit. Algebraic manipulations across multiple rounds may reduce the combined T-depth below the additive bound and must be analyzed individually for each block cipher. In contrast, we propose a generic, step-by-step construction framework that guarantees optimal T-depth for a broad class of block ciphers beyond AES, establishing a definitive benchmark for quantum cryptographic implementations.
For reference, Table 5 summarizes prior AES implementations focused on T-depth reduction, along with associated resource estimates.
We know that the Shor’s factoring algorithm [5] poses the most critical threat to existing classical public key cryptographic protocols due to its implications in attacking the RSA and the discrete logarithm-based schemes. On the other hand, the impact of Grover’s algorithm [3] can be temporarily mitigated by doubling the key size. Still there exist numerous instances where the Grover’s search, often in conjunction with other quantum algorithms, has demonstrated significant cryptanalytic potential. For example, in [31], Grover’s algorithm is combined with Simon’s hidden shift technique to break the security of Even-Mansour constructions. Needless to mention that the resource requirements for a full-scale key recovery attack on symmetric ciphers using Grover’s algorithm remain infeasible in the near term. Still, substantial work has been conducted on Grover-based cryptanalysis, particularly on AES [13–19]. In this context, we present the following result on the optimality of T-depth required for full-round cryptanalysis using Grover’s search on a block cipher B, where the only source of nonlinearity generates from a single S-box.
Suppose B is a block cipher whose only nonlinearity arises from an n-bit S-box S. Then, from Theorem 2, implementing the n-bit S-box S requires an optimal T-depth of ⌈log2 n⌉. Since the S-box is the sole source of nonlinearity in the cipher, each round incurs an optimal T-depth of ⌈log2 n⌉. Assuming the cipher runs for r rounds and produces a ciphertext of length m, the quantum implementation of the full-round cipher B has a total T-depth of r ⌈log2 n⌉.
To perform an exhaustive key search using Grover’s algorithm, one must implement B, compare the output with a known m-bit ciphertext using an m-MCT gate (which requires an optimal T-depth of ⌈log2 m⌉), and then apply the inverse of the full-round cipher, B†, which adds another r⌈log2 n⌉ to the T-depth. Therefore, a single Grover iteration requires an optimal T-depth of 2r⌈log2 n⌉ + ⌈log2 m⌉. Since Grover’s algorithm requires 2k/2 iterations for a key of length k, the exhaustive key recovery attack using sequential Grover’s search requires an overall T-depth of (2r⌈log2 n⌉ + ⌈log2 m⌉)2k/2.
As an immediate corollary, an exhaustive key recovery attack on AES using Grover’s algorithm can be executed with an optimal T-depth of (2r⌈log2 8⌉ + ⌈log2 128⌉)2k/2, which evaluates to 67 × 264, 79 × 296, and 91 × 2128 for key lengths of 128, 192, and 256 bits, respectively. In this regard, we make the following remarks based on different execution scenarios of Grover’s algorithm.
If the block cipher B is implemented out-of-place, then the inverse full-round cipher can be realized via measurement-based uncomputation, which does not incur any additional T-depth. Consequently, Grover’s search can be executed with an optimal T-depth of (r⌈log2 n⌉ + ⌈log2 m⌉)2k/2. The corresponding T-depth requirements for AES would be 37 × 264, 43 × 296, and 49 × 2128 for key lengths of 128, 192, and 256 bits, respectively.
Furthermore, if Grover’s search is executed in parallel, i.e., all the 2k copies of the oracle B is running concurrently, the T-depth for the parallel Grover’s search becomes (r⌈log2 n⌉ + ⌈log2 m⌉). Accordingly, for AES, the corresponding minimal T-depth would be 37, 43, and 49 for key lengths of 128, 192, and 256 bits, respectively.
Given the ANF of an arbitrary n-input, m-output Boolean function f having algebraic degree k, this work presents the construction of an optimal T-depth quantum circuit for f, along with a complete resource estimation. The primary focus of this paper is on minimizing T-depth, an essential metric in quantum circuit design due to its direct impact on circuit latency and coherence time. While the overall resource usage may be high, we argue that establishing the benchmark T-depth should be the first step in any quantum circuit synthesis process, after which other resource parameters may be optimized. This work conclusively settles the minimum achievable T-depth for Boolean functions and demonstrates its practical relevance through block cipher constructions, as well as in cryptanalysis.