Quantum computing represents a revolutionary computational paradigm, offering substantial advantages in various applications, including quantum system simulation [1], linear equation solving [2], and combinatorial optimization [3]. Although fully functional quantum computers remain unrealized, their potential impact on computational capabilities is profound. A major challenge in quantum computing is the reconstruction of quantum states. Specifically, a system with n qubits requires approximately 4^n measurements for complete state reconstruction, resulting in exponential computational costs. For example, reconstructing a 10-qubit system requires 8,192 measurements across 59,049 distinct quantum circuits, taking roughly 130 hours on IBM quantum processors [4]. Consequently, developing efficient methods for quantum state reconstruction is crucial for practical quantum computing applications.
Several approaches have been developed to address the challenges of quantum state reconstruction, including maximum likelihood estimation [5], projected gradient descent [6], and compressed sensing [7]. For instance, compressed sensing reduces the computational complexity from 𝒪(d²) to 𝒪(d log d), where d = 2^n is the dimension of the density matrix. Other works address quantum state tomography for specific cases, such as matrix product density operators [8] and low-rank density matrices [9]. While these methods have demonstrated promising results, they primarily focus on full density matrix reconstruction and fail to achieve exponential acceleration. Furthermore, their reliance on iterative procedures complicates both convergence analysis and complexity estimation.
Recently, neural network-based approaches have emerged as promising tools for quantum state reconstruction. Beginning with the application of restricted Boltzmann machines (RBMs) [10], this field has expanded to incorporate deep learning methods [11,12], adaptive measurement basis selection techniques [13], and quantum machine learning methods [14]. These approaches compress quantum states into statistical learning models, enabling information extraction without exhaustive measurements. A key advantage of these methods is their ability to infer measurement outcomes from partial data, eliminating the need for complete state reconstruction. For example, when computing the trace of the product between an n-qubit density matrix ρ and a given matrix M, full density matrix reconstruction can be avoided if M can be decomposed as:
Despite these algorithmic advances, theoretical justifications remain scarce. An important theoretical foundation for neural network applications in quantum state reconstruction was established by Aaronson [15], who introduced the concept of quantum shadow tomography. While his work demonstrated that two-outcome measurements can be predicted using polynomially many samples, it left open the question of whether distributional measurement outcomes, similar to classical shadow tomography [16,17], can be predicted with polynomial complexity. In this paper, we extend this theoretical framework by employing local Rademacher complexity analysis to prove the effectiveness of quantum sample tomography for measurements with distributional outcomes. Relying on the theorem of Bartlett et al. [18], we demonstrate that the required number of measurements scales polynomially with the qubit number, as stated in Theorem 1. By incorporating random projection techniques, we show that this relationship holds even for high-dimensional quantum systems. To validate our findings, we conduct numerical experiments using RNNs. We configure the sample size and the number of RNN parameters to scale polynomially with the qubit number and introduce a penalty term in the loss function to satisfy the conditions of our theorem. Our experiments on the ground state of the transverse field Ising model (TFIM) show agreement between theoretical predictions and numerical results. Furthermore, we compare our RNN approach against other models and show its effectiveness on various quantum states, including cat states, random states, and W states.
The remainder of this paper is structured as follows. In Section 2, we introduce the fundamentals of quantum sample tomography and provide the main result. Sections 3 and 4 address the theoretical challenges using our new framework and give a rigorous proof of the main theorem. In Sections 5 and 6, we validate the theoretical findings through several quantum models, with particular emphasis on the performance of the proposed neural network. Section 7 concludes the paper.
We first review the fundamental concepts in quantum computation and give some necessary notations.
A quantum system is mathematically represented by a density matrix ρ, which is characterized by three essential properties: Hermiticity (ρ = ρ†), positive semidefiniteness (ρ ≽ 0), and unit trace (tr ρ = 1). The matrix elements may take complex values, reflecting the quantum mechanical nature of the system. Under physical operations implemented through quantum circuits, the system undergoes unitary transformations: when a unitary matrix U is applied, the density matrix transforms as ρ → UρU†. Modern quantum computers, despite their varied physical implementations, fundamentally operate by applying sequences of unitary transformations. According to the Solovay-Kitaev theorem [19], arbitrary unitary operations can be efficiently approximated to any desired precision using a finite universal set of elementary quantum gates.
A fundamental challenge in quantum state characterization is the reconstruction of an unknown quantum state ρ. The standard approach to this problem, known as quantum state tomography, relies on measurements using Pauli operators. The Pauli matrices are defined as follows:
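\[
\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \qquad
\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.
\]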
The Pauli matrices, together with the identity matrix I, constitute a complete basis for two-dimensional Hermitian matrices. In quantum mechanical measurements, the expectation value of a Pauli operator M with respect to a quantum state ρ is given by tr(ρM).
The measurement principles extend naturally to multi-qubit systems, where measurements can be performed on individual qubits. For n-qubit systems, the density matrix can be expressed in terms of Pauli operator measurements through the following decomposition:
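\[
\rho \;=\; \frac{1}{2^{n}} \sum_{P \in \{I,\,\sigma_x,\,\sigma_y,\,\sigma_z\}^{\otimes n}} \operatorname{tr}(\rho P)\, P .
\]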
To address the exponential scaling challenge, Aaronson [15] introduced the concept of shadow tomography. The problem can be formally stated as follows:
Given an unknown D = 2^n dimensional quantum mixed state ρ and a distribution 𝒟 over two-outcome measurements, we sample m measurements {E1, …, Em} independently from 𝒟 and obtain their measurement results bi = tr(ρEi) for i = 1, …, m. For any chosen failure probability δ > 0, accuracy γ > 0, and error tolerance ε > 0, what is the minimum number of measurements m (depending on D, ε, and δ), such that any hypothesis state σ minimizing the quadratic functional
This formulation essentially asks whether a statistical learning model exists that can extract exponentially large amounts of information from a relatively small dataset. Aaronson provided an affirmative answer by proving that only
While Aaronson’s result focuses on predicting two-outcome measurements, our work addresses quantum sample tomography, building on the experimental investigations by Smith et al. [4]. In the subsequent analysis, let 𝔼 denote the expectation. The problem can be formally defined as follows:
Given an unknown D = 2^n dimensional quantum mixed state ρ and a distribution 𝒟 over unitary matrices, we sample m unitaries {U1, …, Um} independently from 𝒟 and obtain their measurement results
While structurally similar to Problem 1, our formulation differs in three key aspects. First, instead of two-outcome measurements, we consider unitary transformations as independent variables. Second, our output consists of complete probability distributions, which can be represented as vectors in practical implementations, rather than binary outcomes. Third, we employ expectation rather than probability, which is more general. In fact, using Markov’s inequality, if we choose ε = ε′γ′, where ε′, γ′ > 0, then it holds
In the subsequent analysis, we let the loss function
Let ρ be an n-qubit quantum state and 𝒟 be a distribution over unitary matrices. Assume that the number of nonzero elements in ρ is only polynomial in n. For any given failure probability 0 < δ < 1 and error tolerance γ > 0, consider a collection of unitary matrices
- (1) f corresponds to some quantum state σ such that
for all i = 1, …, m,
- (2) f achieves an empirical error bound such that
for a given η > 0.
Then with probability at least 1 – δ, the expected loss satisfies:
The theorem requires three assumptions. First, the number of nonzero elements in the density matrix should be polynomial in the qubit number, which enables dimension reduction of the feature map in Section 4. Second, the output of the learning algorithm should be consistent with measurement results from some quantum state, allowing us to express the algorithm’s output as a functional of feature functions. In Section 5, for the neural network, we meet this requirement through a penalty term in the loss function. Third, the loss function should remain below a certain threshold, which is achievable for algorithms with sufficient learning capacity.
For the proof, we diverge from the method employed in [21], which relies on amplification techniques and adaptive protocols for iterative density matrix reconstruction. Instead, we follow the theoretical framework established in [15], which offers a more direct approach to estimate measurement statistics without requiring complete state reconstruction. In particular, we employ local Rademacher complexity, a fundamental concept in statistical learning theory [18], to establish our proof. Our proof strategy focuses on controlling the local Rademacher complexity of the loss function. We first decompose the local Rademacher complexity of the loss function into the local Rademacher complexities of its components. Next we construct a kernel function and the corresponding feature map such that the components can be expressed as functionals of the feature map. We then estimate the expected loss over the entire distribution using empirical loss and local Rademacher complexity. Finally, we employ random projection techniques to reduce the required dataset size to polynomial scale.
In order to prove Theorem 1, we need to bound the actual error of the model, which requires measuring the learning capacity through the Rademacher complexity. This useful tool helps us measure how effectively a learning algorithm can capture patterns in random data. The definition of Rademacher complexity is given below.
(Rademacher Complexity). Let (χ, P) be a probability space and ℱ be a set of measurable functions mapping from χ to ℝ. Given a positive integer m, consider:
- (1) m i.i.d. samples from (χ, P), denoted by X1, X2, …, Xm,
- (2) m i.i.d. Rademacher variables with ℙ(σi = 1) = ℙ(σi = −1) = 1/2, denoted by σ = {σ1, σ2, …, σm}.
For any f ∈ ℱ, we define:
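\[
P_m f \;=\; \frac{1}{m}\sum_{i=1}^{m} f(X_i),
\qquad
R_m f \;=\; \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, f(X_i).
\]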
The empirical Rademacher complexity of ℱ is then defined as:
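\[
R_m \mathcal{F} \;=\; \sup_{f \in \mathcal{F}} R_m f
  \;=\; \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, f(X_i),
\]
with the expectation 𝔼σ over the Rademacher variables taken when writing 𝔼σRmℱ below.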
The local Rademacher complexity (LRC) is defined in the same way but with the additional constraint Pm f² < r, and is denoted by 𝔼σ Rm ℱ(r). The LRC serves as a fundamental tool in theoretical machine learning proofs, especially through its sub-root property [18].
A function φ : [0, +∞) → ℝ is called sub-root if it satisfies the following conditions:
- (1) φ is non-negative,
- (2) φ is non-decreasing,
- (3) r ↦ φ(r)/√r is non-increasing for r > 0.
This sub-root property ensures the existence of a unique fixed point for φ, which is shown in the following lemma.
A sub-root function φ, which is not identically zero, possesses exactly one positive fixed point, denoted as r*. For any r > 0, the inequality φ(r) ≤ r holds if and only if r* ≤ r. Furthermore, given two sub-root functions φ1 and φ2 with φ1(r) ≤ φ2(r) for all r > 0, the fixed point of φ1 does not exceed that of φ2.
The proof begins by establishing the continuity of φ. For any r1 > r2 > 0, the non-decreasing property of φ implies |φ(r1) – φ(r2)| = φ(r1) – φ(r2). Given that
The continuity of φ follows as |φ(r1) – φ(r2)| approaches zero when r1 approaches r2 from either direction. The function φ(r)/r inherits continuity in (0, +∞) and maintains non-negativity. The strict monotonic decrease of
If this ratio φ(r)/r consistently exceeds 1, then lim
Finally, for two sub-root functions φ1 and φ2, where φ1 (r) < φ2 (r) for all r > 0, we can conclude that
Lemma 1 establishes a fundamental property of sub-root functions: each has a unique fixed point, and the lemma provides a method to bound it. It turns out that the LRC itself has a sub-root nature, as shown in the following lemma [18].
Let ℱ be a class of measurable functions. Consider a functional T : ℱ → [0, +∞) that satisfies T(αf) ≤ α²T(f) for all f ∈ ℱ and α ∈ [0, 1]. Given
Then both φ and its expectation 𝔼φ(r) are sub-root functions.
Now consider a supervised learning framework with input space χ and output space
Define a star-hull around zero as:
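Following [18],
\[
\operatorname{star}(\mathcal{F}, 0) \;=\; \bigl\{\, \alpha f \;:\; f \in \mathcal{F},\ \alpha \in [0, 1] \,\bigr\}.
\]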
Building upon Lemma 2, one can establish the following theorem [18]:
Consider a bounded loss function
Let
This theorem establishes a relationship between the empirical error and the actual error through local Rademacher complexity. When the Rademacher term on the right-hand side of equation (2) can be sufficiently small and the fixed point
For any set χ and a sequence (X1, …, Xm) ∈ χm, consider a class of functions ℱ mapping from χ to ℓ2 space. Given a collection of functions hi : ℓ2 → ℝ with Lipschitz constant L, the following inequality holds:
Theorem 3 establishes a fundamental relationship between the Rademacher complexity of vector-valued functions and their scalar components. In particular, it provides a practical way for complexity estimation by showing that the complexity of a composite vector-valued function can be bounded by the complexities of its components. To obtain a more accurate complexity bound, we adapt the theorem from Cortes et al. [23] with an additional norm scaling factor.
Let
Then, for any r > 0, it holds the following bound on the expected Rademacher complexity:
Theorem 4 shows that if we know the eigenvalues of the kernel function, we can estimate the local Rademacher complexity of functionals of its feature map. The construction of such a feature map will be presented in Section 4.
In the preceding sections, we introduced the local Rademacher complexity and related theorems. This section focuses on proving Theorem 1. We begin by establishing the Lipschitz continuity of the loss function class, which is needed in Theorem 3. In the subsequent analysis, 𝒢 represents the function class that maps the input space χ to the distribution space ℝ^N, where N = 2^n is the dimension of the n-qubit Hilbert space.
For a given distribution function
Consider the function
It is sufficient to prove
Let us expand the left-hand side of equation (5):
The second inequality uses the fact that all
We then introduce a specific feature map for unitary matrices that plays a crucial role in our estimation. Let SU(N) denote the special unitary group of degree N = 2^n, consisting of all N × N unitary matrices with determinant equal to 1. For a unitary matrix U := (uij) ∈ SU(N) and 1 ≤ k ≤ N, we define the feature map
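Consistent with Theorem 5 below, a natural choice is
\[
\Phi^{(k)}(U) \;=\; \bigl(u_{ki}\,\overline{u_{kj}}\bigr)_{i,j=1}^{N} \in \mathbb{C}^{N^{2}},
\qquad\text{so that}\qquad
\bigl(U\rho U^{\dagger}\bigr)_{kk} \;=\; \sum_{i,j=1}^{N} \rho_{ij}\, u_{ki}\,\overline{u_{kj}}
\;=\; w \cdot \Phi^{(k)}(U)
\quad\text{with } w = (\rho_{ij}).
\]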
By Theorem 4, this kernel admits a spectral decomposition:
Let ρ be a density matrix of order N = 2^n and U ∈ SU(N). Then each element of the vector h = diag(UρU†) can be expressed as an inner product between the feature map and a vector whose ℓ2-norm is strictly bounded by 1.
For k = 1, …, N, we have
Let w = (ρij) ∈ ℂ^(N²). Then we can express hk(U) as the inner product:
To establish the norm bound, observe that ∥w∥2 equals the Frobenius norm of
We also require the Johnson-Lindenstrauss (JL) lemma [24], which is fundamental to the dimensional reduction analysis.
(Johnson-Lindenstrauss Lemma). For any sample size m ≥ 1, any finite point set
A canonical example is simply choosing vij ~ 𝒩 (0, 1/q).
Now we have all the tools necessary to establish our main result. We begin by presenting a weaker variant of Theorem 1:
Let ρ be an n-qubit quantum state and 𝒟 be a distribution over unitary matrices. For any given failure probability 0 < δ < 1 and error tolerance γ > 0, consider a collection of unitary matrices
- (1) f corresponds to some quantum state σ such that
for all i = 1, …, m,
- (2) f achieves an empirical error bound such that
for a given η > 0.
Then with probability at least 1 – δ, the expected loss satisfies:
Theorems 1 and 7 differ only in their sample complexity: m = poly(n) in Theorem 1 versus m = poly(2^n) in Theorem 7.
According to the assumption, we can represent each probability distribution h(Ui), i = 1, …, m, as an N-dimensional vector, where N = 2^n is the dimension of the n-qubit Hilbert space. Let hk(Ui) denote the k-th component of h(Ui), i.e., the probability of measuring the system in the k-th computational basis state, where the basis states are indexed by their binary representation: |0〉⊗n corresponds to index 0, |0〉⊗(n−1)|1〉 corresponds to index 1, and so forth.
We denote the unitary matrices Ui as input variables xi. According to Theorem 3 and Lemma 3, it holds:
Equation (7) enables us to estimate the Rademacher complexity of the loss function through its component-wise Rademacher complexities. Given r > 0, considering the constraint
Summing over i yields
Denote
Based on Theorem 5, we can get that gk = w · Φ(U) and hk = v · Φ(U), which implies
From Theorem 4, we obtain
Substituting equation (10) into equation (9) produces
Combining equation (11) and equation (1) in Theorem 2, we get that
Based on Lemma 1, we now aim to find the upper bound of the fixed point
This equation can be rewritten as
Then
Applying Theorem 2, we can establish:
While our analysis shows that quantum state reconstruction is possible, the exponential scaling of measurement requirements m = poly(2^n) remains a fundamental challenge for practical applications. To address this limitation, we extend our analysis to Theorem 1 by incorporating sparsity constraints on the density matrix representation of quantum states, which yields a substantial reduction in measurement complexity.
Here we follow the notations established in Theorem 7. From Theorem 7, for random samples
Choose a constant
Following the proof structure of Theorem 7 with distribution
Using the sparsity hypothesis, we can reduce the dimension of the feature map Φ(k), thus reducing the dimensionality factor K in equation (10) to M (the number of nonzero elements in ρ, which is poly(n)). This yields the refined bound:
Choosing
In this section, we address the practical implementation of Theorem 1. A challenge lies in determining an efficient parameterization of unitary matrices that can serve as viable input to our model. Since the dimension of the unitary group SU(2^n) grows exponentially with n, direct parameterization of arbitrary unitary matrices becomes impractical. Therefore, for practical implementation, we restrict our attention to a subset of unitary matrices that can be decomposed as Kronecker products of 2 × 2 unitary matrices:
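\[
U \;=\; U_1 \otimes U_2 \otimes \cdots \otimes U_n, \qquad U_i \in SU(2),\quad i = 1, \dots, n.
\]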
This reduction to local operations simplifies our approach by focusing on the parameterization of SU(2) matrices. We express the unitary matrices as:
To validate Theorem 1, we design a series of numerical experiments using the Long Short-Term Memory (LSTM) as our main statistical learning model. The LSTM model, known for its capability to process sequential data with long-term dependencies, is particularly suitable for our application. In our framework, each measurement outcome can be represented as a sequence of binary digits (0 or 1), analogous to a sentence in natural language processing. The model generates conditional probabilities for each subsequent measurement outcome based on current and previous results. The joint probability of the entire measurement sequence is then computed as the product of these conditional probabilities.
The prediction phase follows a different process from the training phase. During prediction, the neural network accepts specified conditions as input and generates conditional probabilities at each sequential step. These probabilities guide a sampling process where each binary outcome (“word”) is selected according to the computed probability distribution. The selected outcome is then fed as input to the subsequent step. The final probability of the complete quantum state is computed as the product of these sequential conditional probabilities. This prediction mechanism is depicted in Figure 1.

The diagram illustrates the RNN architecture for sampling. FNN represents feedforward neural network. The output’s 0-1 distribution allows for sampling a specific state, which is then passed on to the next step as a measurement result.
To enhance the predictive capability, we incorporate quantum measurement context through a dual-input structure: (1) the measurement outcome of the previous particle, and (2) the parameterization of the unitary operator U corresponding to the current particle’s measurement basis. This design enables the network to learn the correlation between measurement outcomes and their corresponding basis transformations, thereby capturing the quantum state’s behavior under different measurement bases. For sequence initialization, we employ a designated token ‘2’, which appears exclusively at the beginning of each measurement sequence. This initialization token clearly marks sequence boundaries in the training data.
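As a minimal illustration of this architecture (a sketch under our own naming and layer-size assumptions, not the exact implementation), an autoregressive LSTM with the dual input and start token described above could be organized as follows:

```python
import torch
import torch.nn as nn

class MeasurementLSTM(nn.Module):
    """Sketch of the autoregressive model: each step receives the previous
    outcome token (0, 1, or the start token 2) together with the three angles
    parameterizing the local unitary on the current qubit, and emits a
    conditional distribution over the outcomes {0, 1}."""

    def __init__(self, n_qubits):
        super().__init__()
        hidden = 2 * n_qubits ** 2            # hidden/embedding size 2n^2, as in the text
        self.embed = nn.Embedding(3, hidden)  # tokens: outcome 0, outcome 1, start token 2
        self.lstm = nn.LSTM(hidden + 3, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 2), nn.LogSoftmax(dim=-1))

    def forward(self, prev_tokens, angles):
        # prev_tokens: (batch, n) outcomes shifted right, starting with the token 2
        # angles:      (batch, n, 3) parameters (theta, phi, psi) of each local unitary
        x = torch.cat([self.embed(prev_tokens), angles], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                 # (batch, n, 2) log conditional probabilities

    @torch.no_grad()
    def sample(self, angles):
        # Autoregressive sampling: each drawn outcome is fed into the next step,
        # and the joint probability is the product of the conditional probabilities.
        batch, n, _ = angles.shape
        token = torch.full((batch, 1), 2, dtype=torch.long)
        state, outcomes, logp = None, [], 0.0
        for i in range(n):
            x = torch.cat([self.embed(token), angles[:, i:i + 1]], dim=-1)
            out, state = self.lstm(x, state)
            log_probs = self.head(out)[:, 0]              # (batch, 2)
            token = torch.multinomial(log_probs.exp(), 1)
            outcomes.append(token)
            logp = logp + log_probs.gather(1, token)
        return torch.cat(outcomes, dim=1), logp           # bit strings and their log-probabilities
```

Training then minimizes the cross-entropy of these conditional log-probabilities against the observed measurement sequences, as described next.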
We now introduce the loss function employed in this work. We adopt the cross-entropy loss as our objective function due to its effectiveness in distribution approximation tasks. The cross-entropy loss between the true distribution p and the predicted distribution q is defined as:
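\[
H(p, q) \;=\; -\sum_{i} p_i \log q_i .
\]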
Since our training samples are drawn independently from the true distribution p, by the law of large numbers, the empirical loss converges to:
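\[
-\frac{1}{m}\sum_{j=1}^{m} \log q(x_j) \;\longrightarrow\; -\sum_{i} p_i \log q_i \;=\; H(p, q),
\qquad x_j \sim p,
\]
where q(x_j) denotes the model probability assigned to the observed sample x_j.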
This formulation eliminates the need to compute the true probabilities pi, thereby reducing memory requirements.
To ensure that our neural network correctly represents the quantum state transformation h(U) = diag(UρU†), we incorporate two constraints into the loss function. The first is a parameterization invariance constraint: the parameter sets {θ, φ, ψ} and {–θ, φ + π, ψ + π} must produce identical outputs, as they represent the same unitary transformation. The second constraint is derived from the properties of h(U) = diag(UρU†). For the single-qubit case, these properties are derived as follows; they extend naturally to multi-qubit systems through tensor products. Consider the following matrix representations:
This yields the functional form:
Under the chosen parameterization, we derive a symmetric relation:
Furthermore, let
and
These two constraints are incorporated into the total loss function via ℓ1 regularization:
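One plausible combination, consistent with the single regularization coefficient λ and the penalty terms ℒsymm and ℒinvariant referenced below, is
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda\bigl(\mathcal{L}_{\mathrm{symm}} + \mathcal{L}_{\mathrm{invariant}}\bigr),
\]
where ℒCE denotes the cross-entropy loss defined above.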
We have also investigated replacing the LSTM model with a multi-head attention mechanism, which has shown remarkable success in various sequence modeling tasks. However, the LSTM maintains higher accuracy in our case and offers an easier training procedure. This observation suggests that the sequential nature of quantum measurements may be better captured by the LSTM’s memory structure, as demonstrated in the following section.
Before proceeding with the experimental analysis, we detail the configuration of our numerical experiments. To quantify the performance of the model, we employ the classical fidelity measure between the predicted distribution p and the true distribution q:
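In its standard form for discrete distributions, the classical fidelity reads
\[
F(p, q) \;=\; \Bigl(\sum_{i} \sqrt{p_i\, q_i}\Bigr)^{2}.
\]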
This metric provides a natural measure of the similarity between the probability distributions of the measurement.
We now detail the hyperparameters employed in our implementation. The model was trained for 50 epochs in all experiments. While this choice lacks a rigorous foundation, it consistently achieves convergence across various system sizes. The regularization coefficient λ was scaled dynamically with the particle number, ranging from 5 to 50. For smaller systems, we found that a smaller λ leads to faster loss convergence, while larger systems benefit from larger λ values that enforce stronger constraints on the neural network’s behavior.
To address the computational complexity of evaluating ℒsymm and ℒinvariant, we implement a stochastic index sampling strategy. Rather than computing the losses over all indices, we randomly select 50 indices for loss calculation in each iteration. This approach reduces computational overhead while maintaining the effectiveness of the constraints. To ensure polynomial scaling of computational resources, we impose strategic constraints on the model architecture: the dimensionality of both the RNN hidden states and the word vector embeddings is set to 2n², where n denotes the qubit number.
For initial validation, we test our model against the ground state of the transverse field Ising model (TFIM), a system for studying quantum phase transitions. The TFIM Hamiltonian is expressed as:
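In one common convention, with Jx denoting the transverse-field strength referenced below, the Hamiltonian takes the form
\[
H \;=\; -\sum_{i} \sigma^{z}_{i}\,\sigma^{z}_{i+1} \;-\; J_x \sum_{i} \sigma^{x}_{i}.
\]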
The model’s performance, measured by fidelity and illustrated in Figure 2, consistently exceeds 90% across different parameter regimes. These results demonstrate the model’s robustness and its capability to capture essential features of the quantum state’s probability distribution.

The prediction fidelity for different numbers of qubits. The fidelity is computed by averaging over 400 random samples.
We test the neural network on quantum systems of 6 and 12 qubits at Jx = 1 using 600 random unitary matrices. The prediction fidelity under these transformations is shown as heatmaps in Figure 3. The model achieves prediction fidelity above 90%, validating its effectiveness in capturing the characteristics of the quantum state.

The performance of the neural network on TFIM states. (a) is for 6-qubit system, and (b) is for 12-qubit system. Both show the accuracy of the neural network predictions on randomly sampled unitary matrices.
We also examine the relationship between sample size and model performance at a fixed transverse field of 1.0. The model accuracy is evaluated using average fidelity across 400 random unitary matrices. Figure 4 indicates that high fidelity can be achieved with limited measurements: 6-qubit systems reach approximately 90% fidelity with 200 samples, while 12-qubit systems require approximately 600 samples for comparable performance. Overall, both the model parameters and required training samples scale polynomially with qubit number, indicating efficient scaling of the neural network architecture.

The average prediction accuracy as the number of samples increases. Generally, as the number of samples increases, the average prediction accuracy also improves.
Finally, we show how the minimum required sample size varies with qubit number for a given fidelity threshold, as in Figure 5. We use the same method as in Figure 4 to generate results for quantum systems ranging from 6 to 12 qubits, and define the first intersection point with the fidelity level line as the minimum sample size. We select two fidelity levels, 0.9 and 0.95, to plot. Due to the inherent randomness in the data generation and training process, the result is not strictly monotonically increasing as expected. However, the trend follows an approximately power-law relation with a power less than 2.

Relation between qubit number and sample size. The overall trend suggests a power-law relationship between the sample size and qubit number. The dashed lines represent the fitted curves for the corresponding data.
Given that Cha et al. [25] showed the applicability of attention mechanisms to quantum state tomography, we conduct a comparative analysis between attention-based approaches and our proposed method. We investigate two distinct attention-based implementations: (1) reformulating our quantum state reconstruction task as a sequence-to-sequence translation problem, and (2) substituting the LSTM architecture with a multi-head attention mechanism while maintaining the overall framework. To ensure fair comparison and prevent information leakage, we use causal masking in the attention mechanisms to maintain the sequential nature of quantum measurements. Our empirical results, as illustrated in Figure 6, demonstrate that our LSTM-based architecture outperforms both attention-based variants.

Performance of different neural network architectures. The fidelity is computed by averaging over 400 random samples. Our LSTM model performs well, while the attention-based neural networks (ANN (1) and ANN (2)) are unstable.
To further validate our findings, we extend our analysis to a more complex quantum system: the cat state from quantum optics, which has the form
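In its usual n-qubit form, this state is written as
\[
|\mathrm{cat}\rangle \;=\; \frac{1}{\sqrt{2}}\bigl(\,|0\rangle^{\otimes n} + |1\rangle^{\otimes n}\bigr).
\]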

Performance of the neural network on different quantum states. Each panel displays the accuracy of the neural network predictions on randomly sampled unitary matrices. The first row consists of 6-qubit systems, while the second row comprises 12-qubit systems. From left to right, the columns correspond to different quantum states: cat state, random state, and W-state.
We also conduct experiments on two further quantum states. The first is a randomly generated state produced with QuTiP, with a dense density matrix. From the results in Figure 7, the neural network shows impressive performance on this state, with the fidelity of each measurement very close to 1. The second is a W-state with n² nonzero elements, where n is the qubit number. It has the following form:
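\[
|W\rangle \;=\; \frac{1}{\sqrt{n}}\bigl(\,|10\cdots 0\rangle + |01\cdots 0\rangle + \cdots + |00\cdots 1\rangle\,\bigr).
\]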
The results in this state show that the fidelity remains approximately above 0.9.
This work presents a novel approach to quantum sample tomography that differentiates itself from conventional methods. We have developed both theoretical foundations and practical implementations for this challenging problem. Our analysis, grounded in local Rademacher complexity theory, establishes a fundamental theorem that rigorously justifies the application of machine learning methods to quantum measurement prediction.
Building upon this mathematical foundation, we have developed a specialized Long Short-Term Memory (LSTM) architecture incorporating symmetry-preserving constraints and invariance properties inherent to quantum systems. Our model maintains polynomial complexity with system size. Through numerical experiments across various quantum systems, including the TFIM, we demonstrate that our approach consistently achieves good accuracy compared to other methods, including recent attention-based models. However, RNNs have several limitations. One challenge is gradient propagation when processing long sequences of quantum states. Additionally, our model assumes noise-free conditions, and small perturbations in the quantum system may lead to large changes in the RNN parameters. Future work will focus on improving the neural network performance for large-scale quantum systems.