Quantum computing represents a revolutionary computational paradigm, offering substantial advantages in various applications, including quantum system simulation [1], linear equation solving [2], and combinatorial optimization [3]. Although fully functional quantum computers remain unrealized, their potential impact on computational capabilities is profound. A major challenge in quantum computing is the reconstruction of quantum states. Specifically, a system with n qubits requires approximately 4^n measurements for complete state reconstruction, resulting in exponential computational costs. For example, reconstructing a 10-qubit system requires 8,192 measurements across 59,049 distinct quantum circuits, taking roughly 130 hours on IBM quantum processors [4]. Consequently, developing efficient methods for quantum state reconstruction is crucial for practical quantum computing applications.
Several approaches have been developed to address the challenges of quantum state reconstruction, including maximum likelihood estimation [5], projected gradient descent [6], and compressed sensing [7]. For instance, compressed sensing reduces the computational complexity from 𝒪(d²) to 𝒪(d log d), where d = 2^n is the dimension of the density matrix. Other works address quantum state tomography for specific cases, such as matrix product density operators [8] and low-rank density matrices [9]. While these methods have demonstrated promising results, they primarily focus on full density matrix reconstruction and fail to achieve exponential acceleration. Furthermore, their reliance on iterative procedures complicates both convergence analysis and complexity estimation.
Recently, neural network-based approaches have emerged as promising tools for quantum state reconstruction. Beginning with the application of restricted Boltzmann machines (RBMs) [10], this field has expanded to incorporate deep learning methods [11,12], adaptive measurement basis selection techniques [13], and quantum machine learning methods [14]. These approaches compress quantum states into statistical learning models, enabling information extraction without exhaustive measurements. A key advantage of these methods is their ability to infer measurement outcomes from partial data, eliminating the need for complete state reconstruction. For example, when computing the trace of the product between an n-qubit density matrix ρ and a given matrix M, full density matrix reconstruction can be avoided if M can be decomposed as:
Despite these algorithmic advances, theoretical justifications remain scarce. An important theoretical foundation for neural network applications in quantum state reconstruction was established by Aaronson [15], who introduced the concept of quantum shadow tomography. While his work demonstrated that two-outcome measurements can be predicted using polynomially many samples, it left open the question of whether distributional measurement outcomes, similar to classical shadow tomography [16,17], can be predicted with polynomial complexity. In this paper, we extend this theoretical framework by employing local Rademacher complexity analysis to prove the effectiveness of quantum sample tomography for measurements with distributional outcomes. Relying on the theorem of Bartlett et al. [18], we demonstrate that the required number of measurements scales polynomially with the qubit number, as stated in Theorem 1. By incorporating random projection techniques, we show that this relationship holds even for high-dimensional quantum systems. To validate our findings, we conduct numerical experiments using RNNs. We configure the sample size and the number of RNN parameters to scale polynomially with the qubit number and introduce a penalty term in the loss function to satisfy the conditions of our theorem. Our experiments on the ground state of the transverse field Ising model (TFIM) show agreement between theoretical predictions and numerical results. Furthermore, we compare our RNN approach against other models and show its effectiveness on various quantum states, including cat states, random states, and W states.
The remainder of this paper is structured as follows. In Section 2, we introduce the fundamentals of quantum sample tomography and provide the main result. Sections 3 and 4 address the theoretical challenges using our new framework and give a rigorous proof of the main theorem. In Sections 5 and 6, we validate the theoretical findings through several quantum models, with particular emphasis on the performance of the proposed neural network. Section 7 concludes the paper.
We first review the fundamental concepts in quantum computation and give some necessary notations.
A quantum system is mathematically represented by a density matrix ρ, which is characterized by three essential properties: Hermiticity (ρ = ρ†), positive semidefiniteness (ρ ≽ 0), and unit trace (tr ρ = 1). The matrix elements may take complex values, reflecting the quantum mechanical nature of the system. Under physical operations implemented through quantum circuits, the system undergoes unitary transformations: when a unitary matrix U is applied, the density matrix transforms as ρ → UρU†. Modern quantum computers, despite their varied physical implementations, fundamentally operate by applying sequences of unitary transformations. According to the Solovay-Kitaev theorem [19], arbitrary unitary operations can be efficiently approximated to any desired precision using a finite universal set of elementary quantum gates.
A fundamental challenge in quantum state characterization is the reconstruction of an unknown quantum state ρ. The standard approach to this problem, known as quantum state tomography, relies on measurements using Pauli operators. The Pauli matrices are defined as follows:
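\[
\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \qquad
\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.
\]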
The Pauli matrices, together with the identity matrix I, constitute a complete basis for two-dimensional Hermitian matrices. In quantum mechanical measurements, the expectation value of a Pauli operator M with respect to a quantum state ρ is given by tr(ρM).
The measurement principles extend naturally to multi-qubit systems, where measurements can be performed on individual qubits. For n-qubit systems, the density matrix can be expressed in terms of Pauli operator measurements through the following decomposition:
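\[
\rho \;=\; \frac{1}{2^{n}} \sum_{P \in \{I,\,\sigma_x,\,\sigma_y,\,\sigma_z\}^{\otimes n}} \operatorname{tr}(\rho P)\, P .
\]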
To address the exponential scaling challenge, Aaronson [15] introduced the concept of shadow tomography. The problem can be formally stated as follows:
Given an unknown D = 2^n dimensional quantum mixed state ρ and a distribution 𝒟 over two-outcome measurements, we sample m measurements {E1, …, Em} independently from 𝒟 and obtain their measurement results bi = tr(ρEi) for i = 1, …, m. For any chosen failure probability δ > 0, accuracy γ > 0, and error tolerance ε > 0, what is the minimum number of measurements m (depending on D, ε, and δ), such that any hypothesis state σ minimizing the quadratic functional
This formulation essentially asks whether a statistical learning model exists that can extract exponentially large amounts of information from a relatively small dataset. Aaronson provided an affirmative answer by proving that only
While Aaronson’s result focuses on predicting two-outcome measurements, our work addresses quantum sample tomography, building on the experimental investigations by Smith et al. [4]. In the subsequent analysis, let 𝔼 denote the expectation. The problem can be formally defined as follows:
Given an unknown D = 2^n dimensional quantum mixed state ρ and a distribution 𝒟 over unitary matrices, we sample m unitaries {U1, …, Um} independently from 𝒟 and obtain their measurement results
While structurally similar to Problem 1, our formulation differs in three key aspects. First, instead of two-outcome measurements, we consider unitary transformations as independent variables. Second, our output consists of complete probability distributions, which can be represented as vectors in practical implementations, rather than binary outcomes. Third, we employ expectation rather than probability, which is more general. In fact, using Markov’s inequality, if we choose ε = ε′γ′, where ε′, γ′ > 0, then it holds
In the subsequent analysis, we let the loss function
Let ρ be an n-qubit quantum state and 𝒟 be a distribution over unitary matrices. Assume that the number of nonzero elements in ρ is only polynomial in n. For any given failure probability 0 < δ < 1 and error tolerance γ > 0, consider a collection of unitary matrices
- (1) f corresponds to some quantum state σ such that
for all i = 1, …, m,
- (2) f achieves an empirical error bound such that
for a given η > 0.
Then with probability at least 1 – δ, the expected loss satisfies:
The theorem requires three assumptions. First, the number of nonzero elements in the density matrix should be polynomial in the qubit number, which enables dimension reduction of the feature map in Section 4. Second, the output of the learning algorithm should be consistent with measurement results from some quantum state, allowing us to express the algorithm’s output as a functional of feature functions. In Section 5, for the neural network, we meet this requirement through a penalty term in the loss function. Third, the loss function should remain below a certain threshold, which is achievable for algorithms with sufficient learning capacity.
For the proof, we diverge from the method employed in [21], which relies on amplification techniques and adaptive protocols for iterative density matrix reconstruction. Instead, we follow the theoretical framework established in [15], which offers a more direct approach to estimate measurement statistics without requiring complete state reconstruction. In particular, we employ local Rademacher complexity, a fundamental concept in statistical learning theory [18], to establish our proof. Our proof strategy focuses on controlling the local Rademacher complexity of the loss function. We first decompose the local Rademacher complexity of the loss function into the local Rademacher complexities of its components. Next we construct a kernel function and the corresponding feature map such that the components can be expressed as functionals of the feature map. We then estimate the expected loss over the entire distribution using empirical loss and local Rademacher complexity. Finally, we employ random projection techniques to reduce the required dataset size to polynomial scale.
In order to prove Theorem 1, we need to bound the actual error of the model, which requires measuring the learning capacity through the Rademacher complexity. This useful tool helps us measure how effectively a learning algorithm can capture patterns in random data. The definition of Rademacher complexity is given below.
(Rademacher Complexity). Let (χ, P) be a probability space and ℱ be a set of measurable functions mapping from χ to ℝ. Given a positive integer m, consider:
- (1) m i.i.d. samples from (χ, P), denoted by X1, X2, …, Xm,
- (2) m i.i.d. Rademacher variables with ℙ(σi = 1) = ℙ(σi = −1) = 1/2, denoted by σ = {σ1, σ2, …, σm}.
For any f ∈ ℱ, we define:
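\[
P_m f \;=\; \frac{1}{m}\sum_{i=1}^{m} f(X_i),
\qquad
R_m f \;=\; \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, f(X_i).
\]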
The empirical Rademacher complexity of ℱ is then defined as:
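\[
R_m \mathcal{F} \;=\; \sup_{f \in \mathcal{F}} R_m f
  \;=\; \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, f(X_i),
\]
with the expectation 𝔼σ over the Rademacher variables taken when writing 𝔼σRmℱ below.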
The local Rademacher complexity (LRC) is defined in the same way but with the additional constraint Pm f² < r, and is denoted by 𝔼σ Rm ℱ(r). The LRC serves as a fundamental tool in theoretical machine learning proofs, especially through its sub-root property [18].
A function φ : [0, +∞) → ℝ is called sub-root if it satisfies the following conditions:
- (1) φ is non-negative,
- (2) φ is non-decreasing,
- (3) r ↦ φ(r)/√r is non-increasing for r > 0.
This sub-root property ensures the existence of a unique fixed point for φ, which is shown in the following lemma.
A sub-root function φ, which is not identically zero, possesses exactly one positive fixed point, denoted as r*. For any r > 0, the inequality φ(r) ≤ r holds if and only if r* ≤ r. Furthermore, given two sub-root functions φ1 and φ2 with φ1(r) ≤ φ2(r) for all r > 0, the fixed point of φ1 does not exceed that of φ2.
The proof begins by establishing the continuity of φ. For any r1 > r2 > 0, the non-decreasing property of φ implies |φ(r1) – φ(r2)| = φ(r1) – φ(r2). Given that
The continuity of φ follows as |φ(r1) – φ(r2)| approaches zero when r1 approaches r2 from either direction. The function φ(r)/r inherits continuity in (0, +∞) and maintains non-negativity. The strict monotonic decrease of
If this ratio φ(r)/r consistently exceeds 1, then lim
Finally, for two sub-root functions φ1 and φ2, where φ1 (r) < φ2 (r) for all r > 0, we can conclude that
Lemma 1 establishes a fundamental property of sub-root functions: each has a unique fixed point, and the lemma provides a method to bound it. It turns out that the LRC itself has a sub-root nature, as shown in the following lemma [18].
Let ℱ be a class of measurable functions. Consider a functional T : ℱ → [0, +∞) that satisfies T(αf) ≤ α²T(f) for all f ∈ ℱ and α ∈ [0, 1]. Given
Then both φ and its expectation 𝔼φ(r) are sub-root functions.
Now consider a supervised learning framework with input space χ and output space
Define a star-hull around zero as:
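Following [18],
\[
\operatorname{star}(\mathcal{F}, 0) \;=\; \bigl\{\, \alpha f \;:\; f \in \mathcal{F},\ \alpha \in [0, 1] \,\bigr\}.
\]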
Building upon Lemma 2, one can establish the following theorem [18]:
Consider a bounded loss function
Let
This theorem establishes a relationship between the empirical error and the actual error through local Rademacher complexity. When the Rademacher term on the right-hand side of equation (2) can be sufficiently small and the fixed point
For any set χ and a sequence (X1, …, Xm) ∈ χm, consider a class of functions ℱ mapping from χ to ℓ2 space. Given a collection of functions hi : ℓ2 → ℝ with Lipschitz constant L, the following inequality holds:
Theorem 3 establishes a fundamental relationship between the Rademacher complexity of vector-valued functions and their scalar components. In particular, it provides a practical way for complexity estimation by showing that the complexity of a composite vector-valued function can be bounded by the complexities of its components. To obtain a more accurate complexity bound, we adapt the theorem from Cortes et al. [23] with an additional norm scaling factor.
Let
Then, for any r > 0, it holds the following bound on the expected Rademacher complexity:
Theorem 4 shows that if we know the eigenvalues of the kernel function, we can estimate the local Rademacher complexity of functionals of its feature map. The construction of such a feature map will be presented in Section 4.
In the preceding sections, we introduced the local Rademacher complexity and related theorems. This section focuses on proving Theorem 1. We begin by establishing the Lipschitz continuity of the loss function class, which is needed in Theorem 3. In the subsequent analysis, 𝒢 represents the function class that maps the input space χ to the distribution space ℝ^N, where N = 2^n is the dimension of the n-qubit Hilbert space.
For a given distribution function
Consider the function
It is sufficient to prove
Let us expand the left-hand side of equation (5):
The second inequality uses the fact that all
We then introduce a specific feature map for unitary matrices that plays a crucial role in our estimation. Let SU(N) denote the special unitary group of degree N = 2^n, consisting of all N × N unitary matrices with determinant equal to 1. For a unitary matrix U := (uij) ∈ SU(N) and 1 ≤ k ≤ N, we define the feature map
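Consistent with Theorem 5 below, a natural choice is
\[
\Phi^{(k)}(U) \;=\; \bigl(u_{ki}\,\overline{u_{kj}}\bigr)_{i,j=1}^{N} \in \mathbb{C}^{N^{2}},
\qquad\text{so that}\qquad
\bigl(U\rho U^{\dagger}\bigr)_{kk} \;=\; \sum_{i,j=1}^{N} \rho_{ij}\, u_{ki}\,\overline{u_{kj}}
\;=\; w \cdot \Phi^{(k)}(U)
\quad\text{with } w = (\rho_{ij}).
\]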
By Theorem 4, this kernel admits a spectral decomposition:
Let ρ be a density matrix of order N = 2^n and U ∈ SU(N). Then each element of the vector h = diag(UρU†) can be expressed as an inner product between the feature map and a vector whose ℓ2-norm is strictly bounded by 1.
For k = 1, …, N, we have
Let w = (ρij) ∈ ℂ^(N²). Then we can express hk(U) as the inner product:
To establish the norm bound, observe that ∥w∥2 equals the Frobenius norm of
We also require the Johnson-Lindenstrauss (JL) lemma [24], which is fundamental to the dimensional reduction analysis.
(Johnson-Lindenstrauss Lemma). For any sample size m ≥ 1, any finite point set
A canonical example is simply choosing vij ~ 𝒩 (0, 1/q).
Now we have all the tools necessary to establish our main result. We begin by presenting a weaker variant of Theorem 1:
Let ρ be an n-qubit quantum state and 𝒟 be a distribution over unitary matrices. For any given failure probability 0 < δ < 1 and error tolerance γ > 0, consider a collection of unitary matrices
- (1) f corresponds to some quantum state σ such that
for all i = 1, …, m,
- (2) f achieves an empirical error bound such that
for a given η > 0.
Then with probability at least 1 – δ, the expected loss satisfies:
Theorems 1 and 7 differ only in their sample complexity: m = poly(n) in Theorem 1 versus m = poly(2^n) in Theorem 7.
According to the assumption, we can represent each probability distribution h(Ui), i = 1, …, m, as an N-dimensional vector, where N = 2^n is the dimension of the n-qubit Hilbert space. Let hk(Ui) denote the k-th component of h(Ui), i.e., the probability of measuring the system in the k-th computational basis state, where the basis states are indexed by their binary representation: |0〉⊗n corresponds to index 0, |0〉⊗(n−1)|1〉 corresponds to index 1, and so forth.
We denote the unitary matrices Ui as input variables xi. According to Theorem 3 and Lemma 3, it holds:
Equation (7) enables us to estimate the Rademacher complexity of the loss function through its component-wise Rademacher complexities. Given r > 0, considering the constraint
Summing over i yields
Denote
Based on Theorem 5, we can get that gk = w · Φ(U) and hk = v · Φ(U), which implies
From Theorem 4, we obtain
Substituting equation (10) into equation (9) produces
Combining equation (11) and equation (1) in Theorem 2, we get that
Based on Lemma 1, we now aim to find the upper bound of the fixed point
This equation can be rewritten as
Then
Applying Theorem 2, we can establish:
While our analysis shows that quantum state reconstruction is possible, the exponential scaling of measurement requirements m = poly(2^n) remains a fundamental challenge for practical applications. To address this limitation, we extend our analysis to Theorem 1 by incorporating sparsity constraints on the density matrix representation of quantum states, which yields a substantial reduction in measurement complexity.
Here we follow the notations established in Theorem 7. From Theorem 7, for random samples
Choose a constant
Following the proof structure of Theorem 7 with distribution
Using the sparsity hypothesis, we can reduce the dimension of the feature map Φ(k), thus reducing the dimensionality factor K in equation (10) to M (the number of nonzero elements in ρ, which is poly(n)). This yields the refined bound:
Choosing
In this section, we address the practical implementation of Theorem 1. A challenge lies in determining an efficient parameterization of unitary matrices that can serve as viable input to our model. Since the dimension of the unitary group SU(2^n) grows exponentially with n, direct parameterization of arbitrary unitary matrices becomes impractical. Therefore, for practical implementation, we restrict our attention to a subset of unitary matrices that can be decomposed as Kronecker products of 2 × 2 unitary matrices:
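\[
U \;=\; U_1 \otimes U_2 \otimes \cdots \otimes U_n, \qquad U_i \in SU(2),\quad i = 1, \dots, n.
\]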
This reduction to local operations simplifies our approach by focusing on the parameterization of SU(2) matrices. We express the unitary matrices as:
To validate Theorem 1, we design a series of numerical experiments using the Long Short-Term Memory (LSTM) as our main statistical learning model. The LSTM model, known for its capability to process sequential data with long-term dependencies, is particularly suitable for our application. In our framework, each measurement outcome can be represented as a sequence of binary digits (0 or 1), analogous to a sentence in natural language processing. The model generates conditional probabilities for each subsequent measurement outcome based on current and previous results. The joint probability of the entire measurement sequence is then computed as the product of these conditional probabilities.
The prediction phase follows a different process from the training phase. During prediction, the neural network accepts specified conditions as input and generates conditional probabilities at each sequential step. These probabilities guide a sampling process where each binary outcome (“word”) is selected according to the computed probability distribution. The selected outcome is then fed as input to the subsequent step. The final probability of the complete quantum state is computed as the product of these sequential conditional probabilities. This prediction mechanism is depicted in Figure 1.

The diagram illustrates the RNN architecture for sampling. FNN represents feedforward neural network. The output’s 0-1 distribution allows for sampling a specific state, which is then passed on to the next step as a measurement result.
To enhance the predictive capability, we incorporate quantum measurement context through a dual-input structure: (1) the measurement outcome of the previous particle, and (2) the parameterization of the unitary operator U corresponding to the current particle’s measurement basis. This design enables the network to learn the correlation between measurement outcomes and their corresponding basis transformations, thereby capturing the quantum state’s behavior under different measurement bases. For sequence initialization, we employ a designated token ‘2’, which appears exclusively at the beginning of each measurement sequence. This initialization token clearly marks sequence boundaries in the training data.
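As a minimal illustration of this architecture (a sketch under our own naming and layer-size assumptions, not the exact implementation), an autoregressive LSTM with the dual input and start token described above could be organized as follows:

```python
import torch
import torch.nn as nn

class MeasurementLSTM(nn.Module):
    """Sketch of the autoregressive model: each step receives the previous
    outcome token (0, 1, or the start token 2) together with the three angles
    parameterizing the local unitary on the current qubit, and emits a
    conditional distribution over the outcomes {0, 1}."""

    def __init__(self, n_qubits):
        super().__init__()
        hidden = 2 * n_qubits ** 2            # hidden/embedding size 2n^2, as in the text
        self.embed = nn.Embedding(3, hidden)  # tokens: outcome 0, outcome 1, start token 2
        self.lstm = nn.LSTM(hidden + 3, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 2), nn.LogSoftmax(dim=-1))

    def forward(self, prev_tokens, angles):
        # prev_tokens: (batch, n) outcomes shifted right, starting with the token 2
        # angles:      (batch, n, 3) parameters (theta, phi, psi) of each local unitary
        x = torch.cat([self.embed(prev_tokens), angles], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                 # (batch, n, 2) log conditional probabilities

    @torch.no_grad()
    def sample(self, angles):
        # Autoregressive sampling: each drawn outcome is fed into the next step,
        # and the joint probability is the product of the conditional probabilities.
        batch, n, _ = angles.shape
        token = torch.full((batch, 1), 2, dtype=torch.long)
        state, outcomes, logp = None, [], 0.0
        for i in range(n):
            x = torch.cat([self.embed(token), angles[:, i:i + 1]], dim=-1)
            out, state = self.lstm(x, state)
            log_probs = self.head(out)[:, 0]              # (batch, 2)
            token = torch.multinomial(log_probs.exp(), 1)
            outcomes.append(token)
            logp = logp + log_probs.gather(1, token)
        return torch.cat(outcomes, dim=1), logp           # bit strings and their log-probabilities
```

Training then minimizes the cross-entropy of these conditional log-probabilities against the observed measurement sequences, as described next.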
We now introduce the loss function employed in this work. We adopt the cross-entropy loss as our objective function due to its effectiveness in distribution approximation tasks. The cross-entropy loss between the true distribution p and the predicted distribution q is defined as:
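\[
H(p, q) \;=\; -\sum_{i} p_i \log q_i .
\]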
Since our training samples are drawn independently from the true distribution p, by the law of large numbers, the empirical loss converges to:
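\[
-\frac{1}{m}\sum_{j=1}^{m} \log q(x_j) \;\longrightarrow\; -\sum_{i} p_i \log q_i \;=\; H(p, q),
\qquad x_j \sim p,
\]
where q(x_j) denotes the model probability assigned to the observed sample x_j.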
This formulation eliminates the need to compute the true probabilities pi, thereby reducing memory requirements.
To ensure that our neural network correctly represents the quantum state transformation h(U) = diag(UρU†), we incorporate two constraints into the loss function. The first is a parameterization invariance constraint: the parameter sets {θ, φ, ψ} and {–θ, φ + π, ψ + π} must produce identical outputs, as they represent the same unitary transformation. The second constraint is derived from the properties of h(U) = diag(UρU†). For the single-qubit case, these properties are derived as follows; they extend naturally to multi-qubit systems through tensor products. Consider the following matrix representations:
This yields the functional form:
Under the chosen parameterization, we derive a symmetric relation:
Furthermore, let
and
These two constraints are incorporated into the total loss function via ℓ1 regularization:
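One plausible combination, consistent with the single regularization coefficient λ and the penalty terms ℒsymm and ℒinvariant referenced below, is
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda\bigl(\mathcal{L}_{\mathrm{symm}} + \mathcal{L}_{\mathrm{invariant}}\bigr),
\]
where ℒCE denotes the cross-entropy loss defined above.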
We have also investigated replacing the LSTM model with a multi-head attention mechanism, which has shown remarkable success in various sequence modeling tasks. However, the LSTM maintains higher accuracy in our case and offers an easier training procedure. This observation suggests that the sequential nature of quantum measurements may be better captured by the LSTM’s memory structure, as demonstrated in the following section.
Before proceeding with the experimental analysis, we detail the configuration of our numerical experiments. To quantify the performance of the model, we employ the classical fidelity measure between the predicted distribution p and the true distribution q:
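In its standard form for discrete distributions, the classical fidelity reads
\[
F(p, q) \;=\; \Bigl(\sum_{i} \sqrt{p_i\, q_i}\Bigr)^{2}.
\]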
This metric provides a natural measure of the similarity between the probability distributions of the measurement.
We now detail the hyperparameters employed in our implementation. The model was trained for 50 epochs in all experiments. While this choice lacks a rigorous foundation, it consistently achieves convergence across various system sizes. The regularization coefficient λ was scaled dynamically with the particle number, ranging from 5 to 50. For smaller systems, we found that a smaller λ leads to faster loss convergence, while larger systems benefit from larger λ values that enforce stronger constraints on the neural network’s behavior.
To address the computational complexity of evaluating ℒsymm and ℒinvariant, we implement a stochastic index sampling strategy. Rather than computing the losses over all indices, we randomly select 50 indices for loss calculation in each iteration. This approach reduces computational overhead while maintaining the effectiveness of the constraints. To ensure polynomial scaling of computational resources, we impose strategic constraints on the model architecture: the dimensionality of both the RNN hidden states and the word vector embeddings is set to 2n², where n denotes the qubit number.
For initial validation, we test our model against the ground state of the transverse field Ising model (TFIM), a system for studying quantum phase transitions. The TFIM Hamiltonian is expressed as:
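In one common convention, with Jx denoting the transverse-field strength referenced below, the Hamiltonian takes the form
\[
H \;=\; -\sum_{i} \sigma^{z}_{i}\,\sigma^{z}_{i+1} \;-\; J_x \sum_{i} \sigma^{x}_{i}.
\]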
The model’s performance, measured by fidelity and illustrated in Figure 2, consistently exceeds 90% across different parameter regimes. These results demonstrate the model’s robustness and its capability to capture essential features of the quantum state’s probability distribution.

The prediction fidelity for different numbers of qubits. The fidelity is computed by averaging over 400 random samples.
We test the neural network on quantum systems of 6 and 12 qubits at Jx = 1 using 600 random unitary matrices. The prediction fidelity under these transformations is shown as heatmaps in Figure 3. The model achieves prediction fidelity above 90%, validating its effectiveness in capturing the characteristics of the quantum state.

The performance of the neural network on TFIM states. (a) is for 6-qubit system, and (b) is for 12-qubit system. Both show the accuracy of the neural network predictions on randomly sampled unitary matrices.
We also examine the relationship between sample size and model performance at a fixed transverse field of 1.0. The model accuracy is evaluated using average fidelity across 400 random unitary matrices. Figure 4 indicates that high fidelity can be achieved with limited measurements: 6-qubit systems reach approximately 90% fidelity with 200 samples, while 12-qubit systems require approximately 600 samples for comparable performance. Overall, both the model parameters and required training samples scale polynomially with qubit number, indicating efficient scaling of the neural network architecture.

The average prediction accuracy as the number of samples increases. Generally, as the number of samples increases, the average prediction accuracy also improves.
Finally, we show how the minimum required sample size varies with qubit number for a given fidelity threshold, as in Figure 5. We use the same method as in Figure 4 to generate results for quantum systems ranging from 6 to 12 qubits, and define the first intersection point with the fidelity level line as the minimum sample size. We select two fidelity levels, 0.9 and 0.95, to plot. Due to the inherent randomness in the data generation and training process, the result is not strictly monotonically increasing as expected. However, the trend follows an approximately power-law relation with a power less than 2.

Relation between qubit number and sample size. The overall trend suggests a power-law relationship between the sample size and qubit number. The dashed lines represent the fitted curves for the corresponding data.
Given that Cha et al. [25] showed the applicability of attention mechanisms to quantum state tomography, we conduct a comparative analysis between attention-based approaches and our proposed method. We investigate two distinct attention-based implementations: (1) reformulating our quantum state reconstruction task as a sequence-to-sequence translation problem, and (2) substituting the LSTM architecture with a multi-head attention mechanism while maintaining the overall framework. To ensure fair comparison and prevent information leakage, we use causal masking in the attention mechanisms to maintain the sequential nature of quantum measurements. Our empirical results, as illustrated in Figure 6, demonstrate that our LSTM-based architecture outperforms both attention-based variants.

Performance of different neural network architectures. The fidelity is computed by averaging over 400 random samples. Our LSTM model performs well, while the attention-based neural networks (ANN (1) and ANN (2)) are unstable.
To further validate our findings, we extend our analysis to a more complex quantum system: the cat state from quantum optics, which has the form
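In its usual n-qubit form, this state is written as
\[
|\mathrm{cat}\rangle \;=\; \frac{1}{\sqrt{2}}\bigl(\,|0\rangle^{\otimes n} + |1\rangle^{\otimes n}\bigr).
\]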

Performance of the neural network on different quantum states. Each panel displays the accuracy of the neural network predictions on randomly sampled unitary matrices. The first row consists of 6-qubit systems, while the second row comprises 12-qubit systems. From left to right, the columns correspond to different quantum states: cat state, random state, and W-state.
We also conduct experiments on two further quantum states. The first is a randomly generated state produced with QuTiP, with a dense density matrix. From the results in Figure 7, the neural network shows impressive performance on this state, with the fidelity of each measurement very close to 1. The second is a W-state with n² nonzero elements, where n is the qubit number. It has the following form:
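\[
|W\rangle \;=\; \frac{1}{\sqrt{n}}\bigl(\,|10\cdots 0\rangle + |01\cdots 0\rangle + \cdots + |00\cdots 1\rangle\,\bigr).
\]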
The results in this state show that the fidelity remains approximately above 0.9.
This work presents a novel approach to quantum sample tomography that differentiates itself from conventional methods. We have developed both theoretical foundations and practical implementations for this challenging problem. Our analysis, grounded in local Rademacher complexity theory, establishes a fundamental theorem that rigorously justifies the application of machine learning methods to quantum measurement prediction.
Building upon this mathematical foundation, we have developed a specialized Long Short-Term Memory (LSTM) architecture incorporating symmetry-preserving constraints and invariance properties inherent to quantum systems. Our model maintains polynomial complexity with system size. Through numerical experiments across various quantum systems, including the TFIM, we demonstrate that our approach consistently achieves good accuracy compared to other methods, including recent attention-based models. However, RNNs have several limitations. One challenge is gradient propagation when processing long sequences of quantum states. Additionally, our model assumes noise-free conditions, and small perturbations in the quantum system may lead to large changes in the RNN parameters. Future work will focus on improving the neural network performance for large-scale quantum systems.