With the continuous development of quantum science, quantum technology and theory are gradually maturing. This has led to the emergence of diverse quantum devices, each with its own technical approach and associated software tools. In recent years, industry players such as IBM, Quantinuum, Rigetti, and Xanadu have made significant strides, launching their own quantum devices and software development kits (SDKs), including Qiskit [1], TKET [2], Forest [3], and PennyLane [4]. These developments have accelerated the application of quantum computing and have shown its potential for problems that are hard for classical computing to solve. Examples include Grover’s algorithm [5] for unstructured database search, Shor’s algorithm [6] for integer factorization, and the Harrow-Hassidim-Lloyd (HHL) algorithm [7] for systems of linear equations. However, as in classical computing, these quantum algorithms must be compiled with appropriate software so that they conform to the native gate set, connectivity, and other constraints of the target quantum device.
The compilation of quantum circuits involves multiple key steps, each presenting different choices–namely, compilation options. These include the selection of qubit technology, quantum device, quantum compiler, as well as specific compilation steps and optimization strategies (e.g., decomposition, mapping, and optimization). Faced with such a multitude of options, effectively selecting the optimal combination is undoubtedly a highly challenging task for ordinary end-users, especially those lacking background knowledge in quantum computing. The selection of the optimal compilation option combination has a crucial impact on the compilation efficiency and performance of quantum circuits (e.g., depth and fidelity). This reliance on expert knowledge and the high computational costs pose a significant barrier to the widespread adoption of quantum computing.
To address the selection of optimal compilation option combinations for quantum circuits, a series of studies has been carried out. In 2020, Salm et al. [8] proposed a rule-based method for selecting quantum circuit compilation options. This method automates the selection of quantum devices for specific quantum algorithms, thereby improving the efficiency of quantum circuit execution. In 2022, Salm et al. [9] proposed a selection method based on multi-criteria decision analysis, which automates the selection of the quantum device and quantum compiler for a given quantum circuit and achieves stable execution results. In 2023, Salm et al. [10] proposed a selection method based on machine learning (ML), which automatically selects the quantum device and quantum compiler for a given quantum circuit according to user requirements, reducing resource consumption and improving scalability. Also in 2023, Quetschlich et al. [11] proposed an ML-based method that automatically selects the qubit technology, quantum device, quantum compiler, and the specific compilation steps and optimization strategies for a given quantum circuit, thereby reducing compilation time and enhancing compilation efficiency. In another 2023 study, Quetschlich et al. [12] proposed a method based on reinforcement learning that selects appropriate compilation steps and optimization strategies for different quantum circuits, improving the quality of quantum circuit execution on the quantum device. Furthermore, in 2023, Quetschlich et al. [13] proposed a method that combines machine learning and reinforcement learning. This method automatically selects the qubit technology, quantum device, quantum compiler, and the specific compilation steps and optimization strategies for different quantum algorithms. It not only significantly improves the compilation quality of quantum circuits but also provides the MQT Predictor framework as a practical tool that lets non-experts use quantum computing. Although these automated selection methods have eased manual selection to some extent, the accuracy of predicting the best combination of options can still be improved. In particular, the representation of quantum circuits and the prediction model for compilation options need further research to improve the accuracy and reliability of prediction.
Regarding the representation of quantum circuits, in 2022, Tomesh et al. [14] proposed a composite feature representation method for quantum circuits based on application-level metrics. This method can better represent the interaction features within quantum circuits and comprehensively represent their hardware resource requirements. In 2023, Bandic et al. [15] proposed a quantum circuit representation method based on interaction graphs, which more effectively extracts the edge features of quantum circuit interaction graphs to optimize the mapping of quantum circuits. Also in 2023, Qian et al. [16] proposed a quantum circuit representation method based on the priority of quantum gate timing weights. This method combines the interaction graph and the dependency graph to extract the interaction frequency and importance of qubits in the quantum circuit, so as to optimize the mapping and conversion of the quantum circuit. Currently, the prediction of quantum circuit compilation options uses the width (i.e., the number of qubits involved) and depth (including the number of single-qubit gates, two-qubit gates, and sequentially executable gates) of quantum circuits, as well as the number of gates of each type and the application-level composite features (program communication, critical depth, entanglement ratio, parallelism, liveness, measurement) [14]. However, existing representation methods do not fully consider other characteristic forms of quantum circuits and pay insufficient attention to how features are arranged and combined. As a result, they do not use all of the available information to represent quantum circuits. Because the representation of quantum circuits directly affects the accuracy of the compilation option prediction model, it is necessary to explore and develop new representation methods that capture quantum circuit information more comprehensively.
Regarding the research on the prediction model problem of quantum circuit compilation options, in 2023, Quetschlich et al. [11] applied a variety of traditional machine learning prediction models to predict quantum circuit compilation options. These include Random Forest [17], Gradient Boosting [18], Decision Tree [19], Nearest Neighbor [20], Multi-layer Perceptron [21], Support Vector Machine [22], and Naive Bayes [23]. Among them, Random Forest model has been proved to be the most effective choice due to its excellent performance [11]. However, with the continuous development of machine learning, its branch deep learning has made significant progress in various fields compared with traditional machine learning. For example, in 2016, He et al. [24] proposed an image recognition method based on deep residual network model (ResNet). This method employed skip connections to effectively solve the issues of gradient vanishing and exploding during model training. In 2017, Vaswani et al. [25] proposed a machine translation method completely based on the Transformer network model with an attention mechanism. This method greatly improves the model’s perception and processing ability of important information. It also improved parallelism, thereby significantly reducing the time required for model training. Also in 2017, Wang et al. [26] proposed an image classification method based on the residual attention network model. This method can generate attention-aware features and adapt to the depth change of the model, so that the model can perform arbitrary depth transformation for specific problems. These studies indicate that deep learning has great application potential in the prediction of quantum circuit compilation options, further enhancing the accuracy and reliability of such predictions.
Therefore, we propose a new prediction method for quantum circuit compilation options based on a Deep Residual Attention Neural Network (DRAN-Net). It not only enables ordinary end-users to automatically select the optimal compilation options but also improves the efficiency and accuracy of quantum circuit compilation, thereby paving the way for building fully adaptive and scalable intelligent compilation systems in the future. Specifically, the main contributions of this paper are as follows:
- (1)
A feature extraction algorithm of quantum circuit based on Time-Weighted Interaction Graph (TWIG) is proposed. This algorithm transforms quantum circuit into TWIG and extracts the features of average weighted shortest path, maximum degree, minimum degree, average clustering coefficient, and standard deviation of the adjacency matrix. By combining these new features with traditional features, we construct a 36-dimensional vector that can more fully represent quantum circuit information and make the deep learning model scalable.
- (2)
A quantum circuit compiler option prediction model based on DRAN-Net is proposed. The model combines the residual connection mechanism and the attention mechanism. Specifically designed for existing features, the residual connections mechanism addresses the issue of gradient vanishing in deep networks, while the attention mechanism allows the model to focus on important features. By integrating the above technologies, a model with greater adaptability to the quantum circuit compilation option prediction task is constructed. This model not only effectively handles complex data patterns but also significantly improves generalization ability, thereby achieving fast and accurate prediction of quantum circuit compilation option sequences.
- (3)
Extensive experimental evaluations and comparisons are conducted on the MQT Bench [11] dataset, which contains 3,003 quantum circuits, to verify the effectiveness of the proposed method. The feature correlation experiment shows that the proposed features are not redundant with the traditional features: their correlation with every other feature is below 95%. The feature importance experiment shows that the importance of some proposed features in model prediction exceeds 0.1, which further supports their usefulness for model training. In the ablation experiments, compared with traditional representation methods, the proposed quantum circuit representation method improves all or most metrics for every baseline, which demonstrates its effectiveness and robustness. Compared with traditional machine learning methods, the proposed prediction model DRAN-Net significantly improves all performance metrics, which demonstrates the effectiveness and robustness of DRAN-Net. In the comparison with the current state-of-the-art method, Random Forest [11], the proposed method increases accuracy by 5.44%, Top-3 by 2.6%, and F1 by 2.05%.
The rest of this paper is organized as follows. Section 2 reviews the fundamentals of quantum computing and the compilation option flow, as well as the principles of the deep residual neural network and the attention mechanism. Section 3 introduces the details of the proposed method, Section 4 analyzes and evaluates the experimental results, and Section 5 summarizes the work.
This section provides a brief introduction to the fundamentals of quantum computing and compilation, as well as a technical introduction to deep residual neural networks and attention mechanisms.
Quantum Bit (Qubit): Qubits are the basic unit of quantum computing. A classical bit can only take the value 0 or 1, whereas a qubit can be in a superposition of 0 and 1. A qubit is represented by a state vector |ψ〉 of the form:
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$
where α and β are complex probability amplitudes: |α|² and |β|² are the probabilities that the qubit is measured as |0〉 and |1〉, respectively, and they satisfy |α|² + |β|² = 1. This superposition gives qubits great potential when dealing with complex problems.
Quantum Gate: Quantum gates are the basic operation units in quantum computing that are used to transform qubits. Unlike logic gates in classical computing, quantum gates are linear transformations of quantum states, commonly represented by unitary matrices. Common quantum gates include the single-qubit Pauli-X gate (analogous to the classical NOT gate) and Hadamard gate (H-gate, used to create superposition states), and the two-qubit Controlled NOT gate (CNOT-gate). These gates can be represented by the following matrices:
X-gate (Pauli-X Gate):
$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
It flips the state of the qubit, changing |0〉 to |1〉 and vice versa. Its symbol in a quantum circuit is shown in Figure 1a.
H-gate (Hadamard Gate):
$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$
It converts a qubit from a computational basis state (|0〉 or |1〉) to an equal superposition state. Its symbol in a quantum circuit is shown in Figure 1b.
CNOT-gate (Controlled NOT Gate):
$$CNOT = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
This is a two-qubit quantum gate: the target qubit is flipped only when the control qubit is |1〉. Its symbol in a quantum circuit is shown in Figure 1c.
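The action of these three gates can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`apply`, `kron`, `probabilities`) are illustrative and not part of any SDK, and the two-qubit basis order |00〉, |01〉, |10〉, |11〉 with the first qubit as control is an assumption that matches the CNOT matrix above.

```python
import math

# Single-qubit states are lists of amplitudes [alpha, beta].
KET0 = [1.0, 0.0]
KET1 = [0.0, 1.0]

X = [[0.0, 1.0],
     [1.0, 0.0]]

S = 1.0 / math.sqrt(2.0)
H = [[S, S],
     [S, -S]]

# Assumed basis order: |00>, |01>, |10>, |11>, first qubit is the control.
CNOT = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0]]


def apply(gate, state):
    """Multiply a gate matrix by a state vector."""
    return [sum(g * a for g, a in zip(row, state)) for row in gate]


def kron(u, v):
    """Tensor product of two state vectors."""
    return [a * b for a in u for b in v]


def probabilities(state):
    """Measurement probability |amplitude|^2 of each basis state."""
    return [abs(a) ** 2 for a in state]


flipped = apply(X, KET0)               # X|0> = |1>
plus = apply(H, KET0)                  # equal superposition of |0> and |1>
bell = apply(CNOT, kron(plus, KET0))   # (|00> + |11>) / sqrt(2)

print(flipped)
print(probabilities(plus))
print(probabilities(bell))
```

Applying H and then CNOT yields an entangled state in which only |00〉 and |11〉 have nonzero probability, which is why this gate pair appears so often in example circuits.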
Quantum Circuit: A quantum circuit is a sequence of operations that implements a quantum computation by cascading multiple quantum gates. It consists of qubits and quantum gates, where the qubits serve as input and are transformed by a series of quantum gates to produce the output. The quantum gates act on the qubits from left to right, forming the process of quantum computing. Figure 2 shows a simple quantum circuit example, which includes an H-gate, two X-gates, and multiple CNOT-gates.

Figure 1. Circuit symbol representation of common quantum gates.

Figure 2. A simple quantum circuit example, including H-gate, X-gate, and CNOT-gate.
Quantum circuit compilation faces many choices (i.e., compilation options), which mainly include the selection of qubit technology, quantum device, quantum compiler, compilation steps, and optimization strategies. The combination of compilation options is crucial to the efficiency and performance of quantum circuit compilation.
Qubit Technology: Qubit technology determines the design and implementation of quantum devices. Currently, the main qubit technologies include superconducting qubits, ion traps, and topological qubits. Each technology has its own advantages and limitations. For example, superconducting qubits scale well but require an extremely low-temperature environment, whereas ion trap qubits have long coherence times and high operational precision but are constrained by slower gate operations.
Quantum Device: A quantum device is the physical hardware that performs quantum computing, and its design is based on a specific qubit technology. Quantum devices differ in qubit count, coupling topology, noise characteristics, and gate operation integration. Among superconducting devices, one of the most widely used in the academic literature is the 27-qubit IBM Q Toronto, which adopts a two-dimensional planar coupling structure. Its coupling topology is shown in Figure 3a, where each qubit is connected to its neighbors through coupling links. Among ion trap devices, one of the most widely used in the academic literature is the 11-qubit IonQ Harmony, whose coupling topology is shown in Figure 3b. Unlike superconducting devices, ion trap devices typically have a fully connected coupling graph, meaning that each qubit can interact directly with any other qubit. This fully connected topology greatly simplifies the compilation of quantum circuits, but the technology places high demands on control and is prone to crosstalk.
Quantum Compiler: Quantum compilers are tools that translate high-level quantum algorithms into quantum instructions that can be executed on specific quantum devices. Quantum circuits may be handled differently by different compilers, such as IBM’s Qiskit, Google’s Cirq, Quantinuum’s TKET, etc. They each provide compilation steps and optimization strategies.
Compilation Steps and Optimization Strategies: Quantum circuit compilation mainly includes the following steps: quantum gate decomposition, qubit mapping, and optimization. The following examples explain the process of these main compilation steps.

Figure 3. Common schematic diagram of quantum computing devices.
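To illustrate why a fully connected coupling graph simplifies compilation, the sketch below measures the coupling distance between two physical qubits with a breadth-first search. The topologies are hypothetical stand-ins (an 11-qubit complete graph and a 5-qubit line), not the actual coupling maps of IonQ Harmony or IBM Q Toronto, and the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def fully_connected(n):
    """Hypothetical all-to-all coupling graph on n physical qubits."""
    return set(combinations(range(n), 2))


def line(n):
    """Hypothetical linear coupling graph 0 - 1 - ... - (n - 1)."""
    return {(i, i + 1) for i in range(n - 1)}


def coupling_distance(edges, source, target):
    """Number of coupling links on the shortest path between two qubits."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two qubits are not connected


print(coupling_distance(fully_connected(11), 0, 10))
print(coupling_distance(line(5), 0, 4))
```

A distance of 1 means a two-qubit gate can be applied directly. A larger distance means the mapper must first move the qubits next to each other, typically by inserting SWAP gates, which is the extra work that restricted topologies impose on the compiler.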
Take the quantum circuit shown in Figure 2 as an example. To make the circuit satisfy the device’s native gate set {Ry, Rz, X, CNOT}, the H gate is equivalently replaced by

Figure 4. Quantum circuit compilation.
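As an illustration of gate decomposition into the native gate set {Ry, Rz, X, CNOT} named in the example, the sketch below checks numerically that applying Rz(π) followed by Ry(π/2) reproduces the H gate up to a global phase. This particular decomposition is an assumption chosen for illustration; it is not necessarily the replacement used in the paper's example, and the helper names are hypothetical.

```python
import cmath
import math


def ry(theta):
    """Rotation about the y axis by angle theta."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def rz(theta):
    """Rotation about the z axis by angle theta."""
    return [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def equal_up_to_global_phase(u, v, tol=1e-9):
    """True if u = e^{i*phi} * v; assumes v[0][0] is nonzero (true for H)."""
    phase = u[0][0] / v[0][0]
    if abs(abs(phase) - 1.0) > tol:
        return False
    return all(abs(u[r][c] - phase * v[r][c]) < tol
               for r in range(2) for c in range(2))


S = 1.0 / math.sqrt(2.0)
H = [[S, S], [S, -S]]

# Matrix product Ry(pi/2) * Rz(pi): in circuit order, Rz acts first.
candidate = matmul(ry(math.pi / 2), rz(math.pi))
print(equal_up_to_global_phase(candidate, H))
```

The product equals -i·H, and a global phase has no observable effect, so the two-gate sequence can stand in for H on a device that does not support H natively.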
The compilation steps differ across compilers, and each step involves multiple choices. These choices directly impact the execution performance of the final quantum circuit. For example, Qiskit has four optimization levels (i.e., O0, O1, O2, and O3). O0 (No Optimization) does not do any optimization and is suitable for fast compilation and debugging. O1 (Level 1 Optimization) carries out basic gate-level optimizations by removing unnecessary gate operations to moderately optimize the circuit structure. O2 (Level 2 Optimization) includes all optimizations of O1. Additional circuit transformations and optimizations are performed to balance compilation time and circuit performance. O3 (Level 3 Optimization) uses all available optimization techniques in addition to containing all the optimizations of O2. It minimizes the gate count and depth of the circuit to improve circuit performance, but may significantly increase compilation time. Different optimization levels provide distinct circuit optimization strategies, leading to variations in quantum gate decomposition, mapping, circuit optimization results, and computational complexity. The choice of optimization level depends on the quantum circuit itself and the target device. TKET offers two different qubit mapping strategies: Line Placement and Graph Placement. These strategies are used to map logical qubits from quantum circuits to physical qubits from quantum devices. Line Placement maps logical qubits into a linear arrangement of physical qubits, suitable for simple device topologies. Graph Placement maps logical qubits to a graph structure of physical qubits. It optimizes operations across qubits taking into account the topological structure of the device and is suitable for complex device architectures. Therefore, the choice of mapping strategy depends on the structural constraints of the quantum device in use and the specific requirements of the circuit.
All available compilation options are combined into compilation option combinations. A complete combination accomplishes the compilation of a quantum circuit, and different combinations yield different compilation results. The currently available compilation options are shown in Figure 5. They include two qubit technologies (i.e., superconducting and ion trap), five devices (i.e., 8-, 27-, 80-, and 127-qubit superconducting devices and an 11-qubit ion trap device), two compilers (i.e., Qiskit and TKET), and the compilation steps and optimization strategies (i.e., Qiskit’s O0 to O3 optimization levels and TKET’s Line Placement and Graph Placement mapping strategies). These compilation options form a total of 30 combinations, and the sequence of options highlighted in red is one such combination.

Figure 5. The currently available compilation options.
Residual Neural Network (ResNet) is a deep neural network composed of multiple residual blocks. It addresses the problems of gradient vanishing and feature loss in deep network training, making the training of deep networks simpler and more effective. As research on residual networks has deepened, many improved variants have appeared, such as the deep residual shrinkage network, but their core idea is still the construction of residual blocks.
Residual Block is the basic unit of ResNet, and each residual block contains multiple convolutional layers (as well as other possible layers such as batch normalization and activation layers). However, the key lies in its introduction of a skip connection, which allows the input to directly bypass certain network layers and be added to the output of subsequent layers. It enables the network layers to learn the changes in the input data directly rather than the input data itself. This design not only improves the trainability of the network but also solves the degradation problem of the deep network. A common residual block diagram is shown in Figure 6.

Figure 6. Residual block.
When the model input is X, let ℋ(X) denote the mapping that a certain stack of layers in the network needs to learn. In a residual block, these layers only need to learn the residual function ℱ(X) = ℋ(X) – X. Therefore, the output of the block can be expressed as:
$$\mathcal{H}(X) = \mathcal{F}(X) + X$$
With this design, the network no longer learns the complex mapping ℋ(X) directly, but instead learns the relatively simple residual ℱ(X). This not only simplifies the learning task of the network but also effectively retains the input information during the training process and reduces the risk of gradient vanishing.
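The skip connection can be sketched in a few lines. The block below is a minimal illustration in plain Python, assuming a residual function ℱ built from two fully connected layers with a ReLU in between; the layer choice and all names are hypothetical and do not describe the DRAN-Net architecture.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def dense(weights, bias, v):
    """Fully connected layer: weights @ v + bias."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Return F(x) + x, where F is dense -> ReLU -> dense (assumed form)."""
    f = dense(w2, b2, relu(dense(w1, b1, x)))
    return [fi + xi for fi, xi in zip(f, x)]   # skip connection


x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
no_bias = [0.0, 0.0, 0.0]

# With all-zero weights F(x) = 0, so the block passes its input through.
print(residual_block(x, zeros, no_bias, zeros, no_bias))
# With identity weights F(x) = relu(x), which is added on top of x.
print(residual_block(x, identity, no_bias, identity, no_bias))
```

The first call shows the property that makes deep stacks trainable: when the learned residual is zero, the block reduces to the identity mapping, so adding more blocks does not have to degrade the signal.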
Attention Mechanism is a technique that performs well in sequence data processing tasks. By simulating the focusing characteristics of human attention, it lets the model dynamically focus on the key parts of the input data, thereby improving processing quality and efficiency. Attention mechanisms can generally be divided into two categories: hard attention and soft attention. Hard attention selects a small number of important items in the input through discrete selection. Because discrete selection is non-differentiable, hard attention is typically trained with reinforcement learning methods. Soft attention assigns a continuous weight to each input and computes a weighted average based on these weights. Because it is differentiable, it is easier to train with optimization methods such as gradient descent. Compared with hard attention, soft attention not only allows the model to control the degree of attention to information more precisely but also supports end-to-end training through backpropagation. Therefore, soft attention is used in this paper.
Attention Model is a neural network that incorporates an attention mechanism. The basic structure of the attention model is shown in Figure 7, which shows the entire process from input to output of the model. In the attention model, the vector s of the new hidden layer state is obtained by multiplying each initial hidden layer state hi with the corresponding weight αi, and then summing these products. The weight αi reflects the weight of each initial hidden layer state in the new hidden layer state. Its calculation formula can be expressed in the following mathematical form:

Figure 7. Basic structure of attention model.
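The weighted sum described for the attention model can be sketched as follows. This is a minimal illustration that assumes the weights αi are obtained by a softmax over per-state scores, which is the common choice for soft attention; how the scores themselves are computed is not specified here, so they are passed in as plain numbers, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(hidden_states, scores):
    """Return (alphas, s) with s = sum_i alpha_i * h_i."""
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
               for d in range(dim)]
    return alphas, context


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_alphas, uniform_s = soft_attention(hidden, [0.0, 0.0, 0.0])
peaked_alphas, peaked_s = soft_attention(hidden, [0.0, 0.0, 5.0])
print(uniform_alphas, uniform_s)
print(peaked_alphas, peaked_s)
```

With equal scores every hidden state contributes equally; raising one score shifts the new hidden state toward the corresponding hi, and every step stays differentiable, which is what allows training by backpropagation.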
This section introduces a method for predicting quantum circuit compilation options based on DRAN-Net. Specifically, the proposed method is divided into the following three steps: (1) Firstly, the quantum circuit compilation option selection problem is modeled as a classification prediction task, then (2) the quantum circuit features are extracted to form a representation vector, and (3) a suitable deep neural network model is designed and trained to effectively solve the classification prediction task of selecting quantum circuit compilation options.
Given a quantum circuit LC, it contains NQ qubits Q = {q1, q2, …, qNQ} and NG quantum gates G = {g1, g2, …, gNG}, among which there are NTG two-qubit gates TG = {tg1, tg2, …, tgNTG}. We are also given a set of quantum circuit compilation options O = {A, B, C, D}, which is designed to cover the most frequently used and decision-uncertain compilation options in current practice. It should be noted that this set is extensible. This paper instantiates the abstract compilation option set with 30 combinations drawn from two leading quantum compilers, Qiskit and TKET, as a representative instance of the abstract problem. The elements of O correspond, respectively, to the qubit technology option set A, the quantum device option set B, the quantum compiler option set C, and the compilation step and optimization strategy option set D. Each option set has multiple choices, namely, A = {a1, a2}, B = {b1, b2, b3, b4, b5}, C = {c1, c2}, D = {d1, d2, d3, d4, d5, d6}. Among them, a1 and a2 are the superconducting and ion trap qubit technologies, respectively. b1, b2, b3, and b4 are the 8-qubit, 27-qubit, 80-qubit, and 127-qubit superconducting quantum devices, respectively, and b5 is the 11-qubit ion trap quantum device. c1 and c2 are the Qiskit and TKET quantum compilers, respectively. d1, d2, d3, and d4 are Qiskit’s optimization levels O0, O1, O2, and O3, respectively. d5 and d6 are TKET’s qubit mapping strategies Line Placement and Graph Placement, respectively. Therefore, b1, b2, b3, and b4 have a conditional constraint (dependency) relationship with a1, while b5 has a conditional constraint relationship with a2. Similarly, d1, d2, d3, and d4 have a conditional constraint relationship with c1, while d5 and d6 have a conditional constraint relationship with c2. In other words, b1, b2, b3, and b4 can be selected only after a1 is selected; b5 can be selected only after a2 is selected; d1, d2, d3, and d4 can be selected only after c1 is selected; d5 and d6 can be selected only after c2 is selected. For a quantum circuit LC compiled by any combination of quantum circuit compilation options o = {ai, bj, ck, dl} (where i = 1, 2; j = 1, 2, 3, 4, 5; k = 1, 2; l = 1, 2, 3, 4, 5, 6; o ∈ O), it is scored based on the following fidelity formula:
In order to find the compilation option combination with the highest fidelity score for a specific quantum circuit, we model the selection of quantum circuit compilation options as a classification prediction task. That is, for a quantum circuit LC, we aim to find a combination of compilation options o (o ∈ O), which is defined as a label yi. All combinations of compilation options form the label set ϒ = {y1, y2, …, y30}. The prediction task is divided into the following two stages: (1) A quantum circuit feature extraction method based on the TWIG is proposed. The extracted NZ-dimensional graph feature vector Z is combined with the NT-dimensional traditional feature vector T [11] to form an NF-dimensional representation vector F, which more fully represents the quantum circuit information. (2) A quantum circuit compilation option prediction model based on DRAN-Net is designed and trained. By mapping the quantum circuit representation vector F to the label set ϒ (F → ϒ), the label y corresponding to the highest-fidelity compilation option combination o can be predicted quickly and accurately. The flowchart for the two stages is shown in Figure 8.
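The label set can be enumerated directly from the dependency constraints described above. In this sketch, each device determines its qubit technology and each compilation setting determines its compiler, so a label is fixed by a (device, setting) pair. The assignment of indices y1 to y30 is a hypothetical ordering chosen for illustration; the source does not specify one.

```python
from itertools import product

# Each device b_j is tied to one qubit technology a_i.
TECH_OF_DEVICE = {"b1": "a1", "b2": "a1", "b3": "a1", "b4": "a1", "b5": "a2"}

# Each compilation setting d_l is tied to one compiler c_k.
COMPILER_OF_SETTING = {"d1": "c1", "d2": "c1", "d3": "c1", "d4": "c1",
                       "d5": "c2", "d6": "c2"}


def enumerate_labels():
    """Map label names y1..y30 to valid (a, b, c, d) combinations."""
    labels = {}
    pairs = product(TECH_OF_DEVICE, COMPILER_OF_SETTING)
    for index, (device, setting) in enumerate(pairs, start=1):
        labels[f"y{index}"] = (TECH_OF_DEVICE[device], device,
                               COMPILER_OF_SETTING[setting], setting)
    return labels


labels = enumerate_labels()
print(len(labels))
print(labels["y1"], labels["y30"])
```

Five devices times six compilation settings gives the 30 valid combinations; unconstrained enumeration of A × B × C × D would instead produce 2 × 5 × 2 × 6 = 120 tuples, most of which violate the dependencies.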

Figure 8. The flowchart for the two phases.
The compilation optimization and mapping of quantum circuits are highly dependent on their structural characteristics. Existing traditional features (e.g., circuit depth, entanglement ratio) reflect global statistical information, but they fail to capture the complex, time-evolving interaction patterns between qubits. To address this limitation, this paper proposes a new feature representation method that combines 31-dimensional traditional features [11] with graph features extracted from the TWIG to more comprehensively characterize quantum circuits.
As shown in Figure 8, a core innovation of this paper lies in introducing TWIG to encode the spatiotemporal structural information of the circuit and extracting graph features with clear physical meanings from it. Finally, we conduct correlation analysis and importance selection on all features. This forms a concise but effective feature vector. The vector serves as input for DRAN-Net, which is another core innovation of this paper. The following sections will first elaborate on the construction principle of TWIG, and then explain the extraction and selection rationale of graph features.
Converting a quantum circuit into a graph allows new circuit features to be extracted from a different perspective. Therefore, the Time-Weighted Interaction Graph (TWIG) and its graph features are introduced. Building on the Interaction Graph (IG), the TWIG incorporates the execution order of the two-qubit gates in the circuit by assigning each gate a weight and adding that weight to the IG, thereby introducing the dependencies and ordering constraints between two-qubit gates. The graph thus contains more information about the quantum circuit, which helps the graph features represent the circuit better. TWIG is designed to convert the execution semantics of quantum circuits (e.g., interaction strength) into a graph structure. The core motivation for its construction is to model interaction intensity: the more two-qubit gate operations two qubits jointly participate in, and the ‘earlier’ these operations occur (i.e., the shallower their logical layer), the more important their interaction is deemed, and the greater its influence on the selection of compilation strategies (e.g., mapping). Notably, multi-qubit gates (e.g., three-qubit gates) are all converted into two-qubit gates during final execution. Since the specific gates generated by this conversion can only be determined once the target device is selected, only two-qubit gates are considered when constructing the TWIG in this paper.
For the quantum circuit LC given above, all qubits Q = {q1, q2, …, qNQ} and two-qubit gates TG = {tg1, tg2, …, tgNTG} are used to construct the TWIG. First, we establish NQ graph nodes in one-to-one correspondence with the qubits Q (i.e., qi → qi). Next, we want the graph to reflect that the mutual influence between two-qubit gates diminishes with their logical topological distance: even before any optimization, the relative execution order of the gates retains local ‘temporal proximity’ signals, which benefits the subsequent selection of the mapping device. To achieve this, a temporal level weighting is applied to each two-qubit gate. Specifically, the circuit undergoes topological levelization: gates that can be executed in parallel are grouped into the same level. Weights decrease with level depth, and gates within the same level have equal weights. Assuming that the subscript of each two-qubit gate in TG = {tg1, tg2, …, tgNTG} represents its execution level, we assign the decreasing weights {NTG, NTG – 1, …, 1} to the gates in order (i.e., wtg1 = NTG, wtg2 = NTG – 1, …, wtgNTG = 1). Finally, for any gate tgk in TG that acts on two qubits qi and qj, we create an edge eij and assign the weight of the gate as the weight of the edge (i.e., wij = wtgk). If an edge eij already exists between qi and qj, the weight of the gate is added to the edge weight (i.e., wij = wij + wtgk). This process is repeated until all two-qubit gates have been traversed. The final result is the TWIG, in which each node represents a qubit, each edge represents the interaction between two qubits, and the weight of the edge reflects the importance of that interaction. The TWIG is an undirected weighted graph, whose properties are described as follows:
- (1)
Nodes: Each qubit in the quantum circuit corresponds to a node in the graph.
- (2)
Edges: If at least one two-qubit gate acts on a pair of qubits, an edge connects the corresponding two nodes in the graph.
- (3)
Weights: All two-qubit gate weights acting between a certain pair of qubits are accumulated to their corresponding edge weights, thus highlighting the importance of the qubit pairs. The weights can represent the number of two-qubit gate operations between qubits and the timing information of execution.
The following is an example of the construction of a TWIG.
Take the quantum circuit shown in Figure 9 as an example. The circuit is executed in order from g1 to g7. Gates g1 and g3 are parallel, as are g4 and g5, so each pair is grouped into the same level. The gates are therefore assigned the weights wg1 = 5, wg2 = 4, wg3 = 5, wg4 = 3, wg5 = 3, wg6 = 2, and wg7 = 1, respectively. Since both g1 and g5 act on q1 and q2, we create an edge (q1, q2) and assign it the weight w12 = 5 + 3 = 8. Similarly, we create an edge (q2, q3) with weight w23 = 4, an edge (q4, q5) with weight w45 = 5 + 2 = 7, an edge (q3, q5) with weight w35 = 1, and an edge (q3, q4) with weight w34 = 3. Finally, the quantum circuit is transformed into the TWIG shown on the right in Figure 9.

Figure 9. Transformation of TWIG.
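The construction can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_twig` is hypothetical, the levelization is a simple as-soon-as-possible pass, and the largest weight is taken to be the number of levels, which is the convention that reproduces the weights of the Figure 9 example. The qubit pairs of g1 to g7 are inferred from the edge weights stated in that example.

```python
from collections import defaultdict


def build_twig(num_qubits, gates):
    """Build a TWIG as {(qa, qb): weight} from two-qubit gates in program order.

    gates: list of (qa, qb) pairs with 1-indexed qubits.
    Returns the edge-weight dictionary and the level of every gate.
    """
    # ASAP levelization: a gate sits one level after the latest gate
    # that already used either of its qubits.
    last_level = [0] * (num_qubits + 1)
    levels = []
    for qa, qb in gates:
        level = max(last_level[qa], last_level[qb]) + 1
        last_level[qa] = last_level[qb] = level
        levels.append(level)

    depth = max(levels, default=0)
    edges = defaultdict(int)
    for (qa, qb), level in zip(gates, levels):
        weight = depth - level + 1  # earliest level gets the largest weight
        edges[tuple(sorted((qa, qb)))] += weight  # accumulate repeated pairs
    return dict(edges), levels


# g1..g7, with qubit pairs inferred from the edge weights of the example.
FIG9_GATES = [(1, 2), (2, 3), (4, 5), (3, 4), (1, 2), (4, 5), (3, 5)]
twig, levels = build_twig(5, FIG9_GATES)
print(levels)
print(twig)
```

Accumulating weights in a dictionary keyed by the sorted qubit pair keeps the graph undirected, and it merges repeated interactions such as g1 and g5 into a single edge.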
After constructing the TWIG, we extract feature information from the graph that can fully represent the quantum circuit. Many graph metrics exist, but not all of them are relevant to our goals. After carefully studying Hernandez’s classification of graph features [27] and Bandic’s analysis of interaction graph features [15], we select a set of main graph features for extraction. The definition, meaning, and calculation formula of each graph feature are as follows:
- (1)
Average Weighted Shortest Path (AP): refers to the average weighted length of the shortest paths between all node pairs in the TWIG. The size of the average weighted shortest path can indicate the average interaction strength and importance between qubits. The calculation formula is as follows:
$$AP = \frac{1}{N_Q\left(N_Q - 1\right)}\sum_{i \ne j} d_w\left(q_i, q_j\right) \tag{4}$$
where dw(qi, qj) represents the shortest weighted path length between node qi and node qj, and NQ is the number of nodes.
- (2)
Max Degree (MXd): refers to the maximum sum of connection weights of nodes in the graph. The maximum degree reveals the strength of the most important qubit interactions in a quantum circuit. The calculation formula is as follows:
$$MX_d = \max_{i \in N_Q}\left(deg\left(q_i\right)\right) \tag{5}$$
where deg(qi) is the weighted degree of node qi, and max() is a function that returns the maximum value.
- (3)
Min Degree (MId): refers to the minimum sum of connection weights of nodes in the graph. The minimum degree can reveal whether there are non-interacting qubits in a quantum circuit. The calculation formula is as follows:
$$MI_d = \min_{i \in N_Q}\left(deg\left(q_i\right)\right) \tag{6}$$
It is calculated in the same way as the maximum degree, except that min() returns the minimum value.
- (4)
Average Clustering Coefficient (AC): refers to the average value of the clustering coefficients of all nodes in a graph. A high clustering coefficient indicates close interactions between qubits and helps to understand the density of global interactions in quantum circuits. The calculation formula is as follows:
$$AC = \frac{1}{N_Q}\sum_{i \in N_Q} C\left(q_i\right); \quad C\left(q_i\right) = \sum_{i \in N_Q} \frac{2\sum_{\exists q_i q_j q_k}\left(w_{ij}w_{jk}w_{ki}\right)^{\frac{1}{3}}}{eg\left(q_i\right)\left(eg\left(q_i\right) - 1\right)MX_d} \tag{7}$$
where C(qi) is the weighted clustering coefficient of node qi based on the strength of triangle subgraphs [28], eg(qi) is the number of edges directly connected with node qi, and Δ is the triangle symbol.
- (5)
Standard Deviation of Adjacency Matrix (SD): refers to the degree of dispersion in the distribution of connection strengths between nodes in the graph. A high standard deviation indicates diverse types of interactions between qubits and an uneven connection structure among qubits. The calculation formula is as follows:
$$SD = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(A_{ij} - \mu_A\right)^2}; \quad \mu_A = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} \tag{8}$$
where A is the adjacency matrix of the graph, Aij is an element of the adjacency matrix, and μA is the average value of all matrix elements.
By extracting these features to form the NZ = 5-dimensional feature vector Z = [AP, MXd, MId, AC, SD], we can better understand the structural properties of quantum circuits and optimize the representation. These features play an important role in the performance analysis of quantum circuits and the prediction of compilation options. Here is an example of extracting features.
Taking the TWIG in Figure 9 as an example, its corresponding graph features can be calculated as AP = 7.064, MXd = 12, MId = 7.66, AC = 0.1169331126, SD = 2.86557. The resulting NZ = 5-dimensional feature vector Z = [7.064, 12, 7.66, 0.1169331126, 2.86557]. Then, it is combined with the NT = 31-dimensional traditional feature T = [u3, u2, u1, cx, u, p, x, h, t, tdg, rx, ry, rz, cz, swap, ccx, cswap, cry, cu1, cp, cu3, cu, rzz, rccx, num_qubits, depth, prog_comm, crit_depth, entang_ratio, parallelism, liveness] to form an NF = 36-dimensional representation vector F, which is used to represent the quantum circuit.
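To make the feature definitions concrete, the following sketch computes AP, MXd, MId, and SD for a TWIG given as an edge-weight dictionary. It is an illustration under stated assumptions, not the authors' code: the function name `graph_features` is hypothetical, each edge weight is used directly as a path length for AP, and SD is taken over all n × n adjacency entries including the zero diagonal. The clustering coefficient AC is omitted because its normalization depends on implementation details not given here. Under different conventions the printed numbers will differ, so they should not be expected to match the values quoted in the text.

```python
import math


def graph_features(num_nodes, edges):
    """Compute AP, MXd, MId and SD for an undirected weighted graph.

    edges: {(i, j): weight} with 1-indexed node labels.
    """
    n = num_nodes
    adj = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        adj[i - 1][j - 1] = adj[j - 1][i - 1] = float(w)

    # Weighted degree: sum of the weights of all incident edges.
    degrees = [sum(row) for row in adj]

    # Floyd-Warshall shortest paths; edge weight used as length (assumption).
    dist = [[0.0 if i == j else (adj[i][j] or math.inf) for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    pair_sum = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    ap = pair_sum / (n * (n - 1))  # infinite if the graph is disconnected

    # Population standard deviation over all n*n adjacency entries.
    mean = sum(map(sum, adj)) / (n * n)
    sd = math.sqrt(sum((a - mean) ** 2 for row in adj for a in row) / (n * n))

    return {"AP": ap, "MXd": max(degrees), "MId": min(degrees), "SD": sd}


# Edge weights of the TWIG described for Figure 9.
FIG9_TWIG = {(1, 2): 8, (2, 3): 4, (4, 5): 7, (3, 5): 1, (3, 4): 3}
feats = graph_features(5, FIG9_TWIG)
print(feats)
```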
After the features of the quantum circuit are combined into the 36-dimensional representation vector F = [f1, f2,…, f36], a prediction model is needed to map F to the label set ϒ = {y1, y2,…, y30} (F → ϒ). To this end, this paper designs a prediction model, DRAN-Net. DRAN-Net integrates an attention mechanism into a deep residual neural network, enabling the model to gain a deeper understanding of the features of quantum circuits and thus provide more accurate prediction results. The design details of the prediction model are introduced below.
The DRAN-Net network model is divided into three parts: input layer, hidden layer, and output layer. The model structure is shown in Figure 10. Each of the three parts is described below:
- (1)
Input layer. The input layer receives the representation vector F = [f1, f2,…, f36] of the input quantum circuit and processes it into a high-dimensional vector. As shown in Figure 10, it first uses a linear layer to map the representation vector F from 36 dimensions to the 256-dimensional vector $F_1^{inp}$ for more complex processing. Because the relationships between features in the representation vector are complex and nonlinear, the higher dimensionality helps to capture them. The layer then applies the ReLU (Rectified Linear Unit) activation function, which helps the model learn nonlinear features, and Dropout, which randomly sets some activation values to zero. The final output is the 256-dimensional feature vector $F_3^{inp}$.
The linear layer, also known as the fully connected layer (FC layer) or dense layer, is a basic block in the neural network architecture. It is primarily used for linearly transforming input feature vector data to generate new feature representations. This transformation is realized through the learned weight matrix and bias term. For the input vector F of the input layer in Figure 10, its mathematical expression is:
$$F_1^{inp} = W^{inp}F + b^{inp} \tag{9}$$
where Winp is the weight matrix of the linear layer of the input layer, and binp is the bias term of the linear layer of the input layer.
- (2)
Hidden layer. The hidden layer is the part of the network model that further learns deeper features from the high-dimensional feature vector output by the input layer. It is also the most important part of the network model for learning feature data. As shown in Figure 10, the hidden layer adopts a combination of 4 residual blocks (RBs) and 3 attention residual blocks (ARBs) stacked alternately in sequence. First, the 256-dimensional feature vector $F_3^{inp}$ is learned through a residual block (RB) to obtain $F_1^{hid}$, and this output is then used as the input of an attention residual block (ARB) for deeper learning to obtain $F_2^{hid}$. By alternating in this way, the 256-dimensional high-level feature $F_7^{hid}$ is finally extracted.
The Residual Block (RB) makes the deep network easier to train and improves the performance of the model by introducing skip connections. The RB can prevent the loss of features in the network model, and a local residual network can be formed by stacking multiple RBs. It helps alleviate the problems of gradient vanishing and gradient explosion in deep network training. Figure 11 illustrates the structure of the RB. To more effectively learn global feature information over a period of time, the input feature map $F_3^{inp} \in R^{1 \times 256}$ is first processed through a trunk branch consisting of linear layers and a ReLU activation function. Then the skip connection branch transfers the original input $F_3^{inp}$ to the trunk branch for element-wise addition. Finally, the ReLU activation function is applied to introduce nonlinearity, and the output is:
$$F_1^{hid} = S^{rb_1}\left(F_3^{inp}\right) = Relu\left(\left(W_2^{rb_1}Relu\left(W_1^{rb_1}F_3^{inp} + b_1^{rb_1}\right) + b_2^{rb_1}\right) + F_3^{inp}\right) \tag{10}$$
where $F_1^{hid}$ is the output of the first RB of the hidden layer, and $S^{rb_1}()$ is its corresponding residual function. $W_1^{rb_1}$ and $W_2^{rb_1}$ are the weight matrices of the first and second linear layers in the first RB, respectively, and $b_1^{rb_1}$ and $b_2^{rb_1}$ are their corresponding bias terms, respectively.
The Attention Residual Block (ARB) can make the deep prediction model pay more attention to important regions, concentrating on learning crucial features while weakening unimportant ones. As shown in the structural diagram in Figure 12, the ARB mainly consists of one attention block and three RBs. The three stacked RBs serve as the trunk branch, while the attention block acts as the mask branch. The output feature map $F_1^{hid}$ of the previous RB is input to the trunk branch and the mask branch at the same time. Through the trunk branch, more comprehensive feature information can be learned using the RBs, resulting in $S_3^{arb_1}\left(S_2^{arb_1}\left(S_1^{arb_1}\left(F_1^{hid}\right)\right)\right)$. Through the mask branch, a feature weight value $A^{arb_1}\left(F_1^{hid}\right)$ is learned. Then, after adding 1 to each element of the feature weight value $A^{arb_1}\left(F_1^{hid}\right)$, an element-wise product is performed with the output of the three RBs. Finally, the output $F_2^{hid}$ of the ARB is:
$$F_2^{hid} = H\left(F_1^{hid}\right) = S_3^{arb_1}\left(S_2^{arb_1}\left(S_1^{arb_1}\left(F_1^{hid}\right)\right)\right) \times \left(1 + A^{arb_1}\left(F_1^{hid}\right)\right) \tag{11}$$
where $S_1^{arb_1}()$, $S_2^{arb_1}()$, and $S_3^{arb_1}()$ are the first, second, and third RB in the first ARB, respectively. Each element of $A^{arb_1}\left(F_1^{hid}\right)$ has a value range of [0,1]. By applying $1 + A^{arb_1}\left(F_1^{hid}\right)$, each element is assigned a weight within the range of [1,2], which makes the feature data become clearer as the depth increases. This can serve as a feature selector to enhance good features and suppress noise from the trunk branch features.
The attention block (Attention) can adaptively adjust the importance of various features, so that more important features receive higher weights in subsequent computations. Figure 13 illustrates the structure of the attention block. To locate important target features in the input feature map, the input feature map $F_1^{hid}$ is simultaneously fed into three branches. Each branch first applies a linear transformation to the features through a linear layer and then introduces nonlinearity through the ReLU activation function to obtain preliminary feature relationships. Subsequently, a second linear layer further transforms these features to obtain more expressive feature relationships. The outputs of the three branches are then multiplied element-wise to obtain $F^{a_1}$, which assigns a score to each element. Finally, the Sigmoid activation function generates the attention weights. This process assigns a larger weight to important features within the feature relationships, and the weight output is:
$$A^{arb_1}\left(F_1^{hid}\right) = Sigmoid\left(\prod_{i=1}^{3}\left(W_2^{a_1(i)}Relu\left(W_1^{a_1(i)}F_1^{hid} + b_1^{a_1(i)}\right) + b_2^{a_1(i)}\right)\right) \tag{12}$$
Similarly, $W_1^{a_1(i)}$ and $W_2^{a_1(i)}$ are the weight matrices of the first and second linear layers of the i-th branch of the first attention block, respectively, and $b_1^{a_1(i)}$ and $b_2^{a_1(i)}$ are their corresponding bias terms, respectively. For the input $F^{a_1} = \left[f_1^{(3)}, f_2^{(3)}, \ldots, f_{256}^{(3)}\right]$, the mathematical expression of the Sigmoid activation function is:
$$A^{arb_1}\left(F_1^{hid}\right) = Sigmoid\left(F^{a_1}\right) = \left[\frac{1}{1 + e^{-f_1^{(3)}}}, \frac{1}{1 + e^{-f_2^{(3)}}}, \ldots, \frac{1}{1 + e^{-f_{256}^{(3)}}}\right] \tag{13}$$
- (3)
Output layer. The output layer performs the final aggregation on the 256-dimensional high-level mapping features $F_7^{hid}$ output by the hidden layer. As shown in Figure 10, it first applies Dropout to the high-level mapped features to prevent overfitting. A linear layer then maps them to the 30-dimensional vector $F_1^{out} = \left[f_1^{(4)}, f_2^{(4)}, \ldots, f_{30}^{(4)}\right]$, which matches the number of label categories, and the Softmax activation function is used to calculate the probability distribution P = [p1, p2,…, p30] corresponding to the prediction results of ϒ. The mathematical expression corresponding to the Softmax activation function is:
$$P = \left[p_1, p_2, \ldots, p_{30}\right] = Softmax\left(F_1^{out}\right) = \left[\frac{e^{f_1^{(4)}}}{\sum_{i=1}^{30} e^{f_i^{(4)}}}, \frac{e^{f_2^{(4)}}}{\sum_{i=1}^{30} e^{f_i^{(4)}}}, \ldots, \frac{e^{f_{30}^{(4)}}}{\sum_{i=1}^{30} e^{f_i^{(4)}}}\right] \tag{14}$$
The label with the highest probability pi (pi ∈ P) serves as the predicted label yi (yi ∈ ϒ) for the quantum circuit. Each label corresponds to one compilation option combination in Figure 5. All compilation option combinations form the label set ϒ = {y1, y2,…, y30}. The core task of this paper is precisely to find the compilation option combination with the highest fidelity score for a given quantum circuit. Thus, the probabilities of these classification prediction results are used to calculate the cross-entropy loss, which serves to evaluate the model’s performance.

Figure 10. DRAN-Net.

Figure 11. Residual block (RB).

Figure 12. Attention residual block (ARB).

Figure 13. Attention block.
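The three building blocks and their stacking can be sketched in PyTorch as follows. This is a minimal sketch assembled from the description above and the layer sizes in Table 1, not the authors' released code: all class names are hypothetical, the residual connection is implemented as an addition, and the model returns raw scores (logits) so that the softmax can be applied by the caller or folded into the loss.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two linear layers with a skip connection: ReLU(W2 ReLU(W1 x + b1) + b2 + x)."""

    def __init__(self, dim=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.fc2(torch.relu(self.fc1(x))) + x)


class AttentionBlock(nn.Module):
    """Sigmoid of the element-wise product of three Linear-ReLU-Linear branches."""

    def __init__(self, dim=256):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(3)]
        )

    def forward(self, x):
        score = self.branches[0](x)
        for branch in self.branches[1:]:
            score = score * branch(x)
        return torch.sigmoid(score)  # every element lies in [0, 1]


class AttentionResidualBlock(nn.Module):
    """Trunk of three RBs, rescaled element-wise by (1 + attention mask)."""

    def __init__(self, dim=256):
        super().__init__()
        self.trunk = nn.Sequential(*[ResidualBlock(dim) for _ in range(3)])
        self.mask = AttentionBlock(dim)

    def forward(self, x):
        return self.trunk(x) * (1.0 + self.mask(x))


class DRANNet(nn.Module):
    def __init__(self, in_dim=36, dim=256, num_classes=30, p_drop=0.01):
        super().__init__()
        self.inp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Dropout(p_drop))
        blocks = []
        for _ in range(3):
            blocks += [ResidualBlock(dim), AttentionResidualBlock(dim)]
        blocks.append(ResidualBlock(dim))  # RB, ARB, RB, ARB, RB, ARB, RB
        self.hidden = nn.Sequential(*blocks)
        self.out = nn.Sequential(nn.Dropout(p_drop), nn.Linear(dim, num_classes))

    def forward(self, x):
        return self.out(self.hidden(self.inp(x)))  # logits, shape (batch, 30)


model = DRANNet().eval()
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(4, 36)), dim=1)
print(probs.shape)
```

Returning logits rather than probabilities is a deliberate choice in this sketch: PyTorch's `nn.CrossEntropyLoss` expects unnormalized scores and applies the log-softmax internally.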
In this study, in order to train the proposed DRAN-Net, we adopt a series of standardized steps to ensure the model’s effectiveness and generalization ability to different quantum circuits. The following is a detailed description of the model training process.
- (1)
Data preprocessing. To accelerate convergence during model training and enhance model performance, we standardize all quantum circuit representation vectors before training. This step speeds up the convergence of gradient descent and improves training efficiency. The standardization formula is as follows:
$$F = \left[\frac{F^1 - \mu^1}{\sigma^1}, \frac{F^2 - \mu^2}{\sigma^2}, \ldots, \frac{F^{36} - \mu^{36}}{\sigma^{36}}\right] \tag{15}$$
where $F^i$ is a vector composed of all feature values of the i-th feature in all quantum circuit representation vectors, and $\mu^i$ and $\sigma^i$ are the mean and standard deviation of the corresponding feature, respectively. Through this formula, the mean of each feature is transformed to 0, and the standard deviation to 1. This process ensures that features of varying scales are converted to a common scale, thereby preventing any single feature with a larger scale from disproportionately influencing the model training process.
- (2)
Loss function. In forward propagation, the model uses the CrossEntropyLoss function to evaluate the discrepancy between the predicted results and the actual labels, thereby optimizing the model’s parameters. The formula is as follows:
$$L^* = -\frac{\sum_{i=1}^{N_{bs}}\sum_{j=1}^{30} y_i \log p_{ij}}{N_{bs}} \tag{16}$$
where Nbs is the number of samples (batch_size), yi is the real category label corresponding to the i-th sample, and pij is the probability predicted by the model that the i-th sample belongs to category j.
To prevent overfitting and keep the model’s parameters from becoming excessively large, we introduce L1 and L2 regularization during the training process. L1 regularization achieves feature selection by adding the sum of the absolute values of the weights to the loss function, with the formula as follows:
$$L_1 = \lambda_1 \sum_{j=1}^{N_W} \left|w_j\right| \tag{17}$$
L2 regularization reduces the values of the weights by adding the sum of the squares of the weights to the loss function, with the formula as follows:
$$L_2 = \lambda_2 \sum_{j=1}^{N_W} w_j^2 \tag{18}$$
where λ1 and λ2 are hyperparameters representing the strength of the regularization, NW is the total number of model weight parameters, and wj (wj ∈ W) is any parameter in any weight matrix. Therefore, the final loss function L of the model is:
$$L = L^* + L_1 + L_2 \tag{19}$$
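The two training steps can be illustrated with a dependency-free sketch. It is written under several assumptions: the function names are hypothetical, the standardization uses the population standard deviation, the L1 and L2 terms are the plain weighted sums of absolute and squared weights described above, and the model weights are passed as one flat list for simplicity.

```python
import math


def standardize(rows):
    """Column-wise z-score for a list of equal-length feature rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; a constant column falls back to 1.0
    # so that the division below is always defined.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_loss(batch_logits, labels, weights, lam1=5e-5, lam2=5e-5):
    """Mean cross-entropy plus L1 and L2 penalties on a flat weight list."""
    ce = -sum(math.log(softmax(z)[y])
              for z, y in zip(batch_logits, labels)) / len(labels)
    l1 = lam1 * sum(abs(w) for w in weights)
    l2 = lam2 * sum(w * w for w in weights)
    return ce + l1 + l2


scaled = standardize([[1.0, 10.0], [3.0, 30.0]])
loss = regularized_loss([[0.0, 0.0]], [0], [1.0, -2.0], lam1=0.1, lam2=0.01)
print(scaled, loss)
```

In a real training loop the means and standard deviations would be estimated on the training split only and then reused for the test split, so that no test statistics leak into training.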
This section first elaborates on the experimental dataset and evaluation metrics, then gives detailed information on the parameter settings of the proposed model method, and finally explains and analyzes the experimental results of feature correlation and importance experiments, ablation experiments, and advancedness comparison experiments.
The dataset used in this paper contains 3,003 quantum circuits, all sourced from the MQT Bench benchmark library (Version 0.2.2) [11]. We included all circuits in it that meet the following criteria: belonging to the ‘targetindependent’ level, with the number of qubits ranging from 2 to 127, and the number of quantum gates ranging from 2 to approximately 14,000. This ensures the dataset has sufficient breadth and diversity. Since these circuits involve two qubit technologies (i.e., superconducting and ion trap), five devices (i.e., four superconducting-based devices with 8, 27, 80, and 127 qubits, and one ion trap-based device with 11 qubits), two compilers (i.e., Qiskit and TKET), and six compilation steps and optimization strategies (i.e., optimization levels O0 to O3 in Qiskit, and Line Placement and Graph Placement mapping strategies in TKET), the dataset is divided into 30 label categories.
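The count of 30 label categories follows from combining the five devices with the six compiler settings. The short sketch below enumerates the combinations; the device identifiers are placeholders, since only the technology and qubit count of each device are given here, and the ordering of the labels is an assumption.

```python
from itertools import product

# Placeholder device names: technology plus qubit count.
DEVICES = ["sc_8q", "sc_27q", "sc_80q", "sc_127q", "ion_11q"]

# Four Qiskit optimization levels and two TKET placement strategies.
COMPILER_SETTINGS = [
    ("qiskit", "O0"), ("qiskit", "O1"), ("qiskit", "O2"), ("qiskit", "O3"),
    ("tket", "line_placement"), ("tket", "graph_placement"),
]

LABELS = [
    (device, compiler, setting)
    for device, (compiler, setting) in product(DEVICES, COMPILER_SETTINGS)
]
print(len(LABELS))
```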
To systematically evaluate the effectiveness of different feature representation methods, we define three input representation sets based on the 3003 quantum circuits:
- (1)
T-Pre (Traditional Representation Set) is based on the 31-dimensional traditional features from Reference [11] and is mainly used to verify the effectiveness of the new features. It includes the counts of 24 types of gate operations described in accordance with the Open QASM 2.0 specification [29] (i.e., u3, u2, u1, cx, u, p, x, h, t, tdg, rx, ry, rz, cz, swap, ccx, cswap, cry, cu1, cp, cu3, cu, rzz, and rccx) and 7 global circuit attributes (num_qubits, depth, prog_comm, crit_depth, entang_ratio, parallelism, and liveness). This set only characterizes the statistical properties and macro-features of the circuit, without involving topological interactions between qubits. As a baseline input, its core value lies in isolating the variable of ‘topological information’ through comparison with the new representation sets, thereby ensuring the interpretability of experimental results.
- (2)
N-Pre (Proposed Novel Representation Set) is the hybrid representation set proposed in this paper and serves as the default input for DRAN-Net. It has 36 dimensions in total, formed by fusing the 31-dimensional traditional features from T-Pre with the 5 new graph features extracted from the TWIG. Its design aims to fix T-Pre’s weakness, namely its lack of topological information. Traditional features reflect the circuit’s macro-attributes, while graph features describe how qubits interact, so the two parts complement each other and enable a fuller representation of quantum circuits. As the primary input, it has two purposes: to improve the model’s prediction accuracy, and to verify that feature synergy works in the new representation method.
- (3)
TWIG-Pre (Graph-Based Representation Set) is a graph-based representation set for quantum circuits, constructed as described in Section 3.2.1. It converts the structured interactions of quantum circuits into computable structured features that can be fed to Graph Neural Networks (GNNs). Specifically, it first extracts four types of core node-level features from the TWIG: node degree characterizes the frequency of qubit interactions; weighted node degree distinguishes the influence of gate operations; clustering coefficient reflects local topological compactness; and sum of gate counts measures the operation load of qubits. It then applies per-graph zero-mean normalization, which eliminates scale bias across samples. Finally, it adapts the features into a GNN-compatible data format while retaining graph-level attributes. As a baseline input, it has two core purposes: to provide an additional baseline experiment, and to verify that N-Pre’s hybrid representation method is superior to pure graph-based representation methods.
To comprehensively evaluate the model’s prediction performance, this paper adopts the accuracy and Top-3 accuracy used in Reference [11] as the primary assessment metrics. Accuracy directly reflects a model’s overall ability to classify all samples correctly. It serves as a fundamental metric for evaluating classification model performance. Top-3 Accuracy relaxes the prediction requirements: as long as the true label falls within the top three categories with the highest predicted probabilities by the model, the prediction is considered correct. This metric is more suitable for application scenarios with a large number of categories or fine distinctions between categories, and can assess the model’s recognition ability in the sense of ‘approximately correct’. In addition, to comprehensively evaluate the model’s performance on class-imbalanced datasets, this paper also introduces the Weighted-Averaging F1-score. This metric integrates Precision and Recall, and assigns each category a weight proportional to its sample proportion in the dataset. This effectively alleviates the impact of uneven sample sizes across categories on evaluation results. Thus, the Weighted-Averaging F1-score can more fairly reflect the model’s classification robustness across different categories, in particular those with small sample sizes, and acts as a valuable supplement to the aforementioned accuracy metrics. The formulas for calculating these metrics are as follows:
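$$Accuracy = \frac{\text{number of correctly predicted samples}}{\text{total number of samples}}$$
$$Top\text{-}3\ Accuracy = \frac{\text{number of samples whose true label is among the three most probable predicted labels}}{\text{total number of samples}}$$
$$F1_{weighted} = \sum_{c} W_c \cdot \frac{2 \cdot Precision_c \cdot Recall_c}{Precision_c + Recall_c}$$
where
$$Precision_c = \frac{TP_c}{TP_c + FP_c}, \quad Recall_c = \frac{TP_c}{TP_c + FN_c}$$
Here, Wc is the weight of category c (the proportion of samples in that class within the dataset), Precisionc and Recallc are the precision and recall rates of category c, respectively, and TPc, FPc, and FNc denote the true positives, false positives, and false negatives of category c, respectively.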
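The three metrics can be computed directly from their definitions. The sketch below is a plain-Python illustration with hypothetical function names; it assumes integer class labels, uses each class's share of the true labels as its weight, and assigns a score of 0 to a class whose precision and recall are both 0.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def top_k_accuracy(y_true, prob_rows, k=3):
    """A sample counts as correct if its true label is among the k most probable."""
    hits = 0
    for t, probs in zip(y_true, prob_rows):
        top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        hits += t in top
    return hits / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's sample share."""
    support = Counter(y_true)
    total = 0.0
    for c, n_c in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = n_c - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += (n_c / len(y_true)) * f1
    return total


y_true = [0, 0, 1, 2]
y_pred = [0, 1, 1, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```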
In the quantum circuit compilation option prediction experiment, the hyperparameter settings required for training the model are as follows: the training and test sets are divided in an 80%:20% ratio, the batch size is set to 128, and the Adam algorithm is used to optimize the loss function with an initial learning rate of 0.001. Weight decay, a regularization method that can further prevent overfitting, is also added to the optimizer, with the weight decay parameter set to 0.0001. Both λ1 and λ2, the hyperparameters representing the strength of the L1 and L2 regularization, are set to 0.00005. For detailed settings of other parameters, refer to Table 1. The model in this paper is implemented using the PyTorch deep learning framework and programmed in Python 3.9.1. The version of PyTorch used is 1.7.1, the version of Qiskit is 0.39.3, the operating system is Ubuntu 20.04 LTS, the CPU is an Intel(R) Xeon(R) Silver 4114 @ 2.20GHz, and the GPU is a Tesla V100 graphics card with 32GB of memory.
Table 1. DRAN-Net parameters.
| Layer name | Parameters |
|---|---|
| Linear layer | 36×256 |
| Dropout | 0.01 |
| Residual Block1 | 256×256 |
| Attention Residual Block1 | 256×256 |
| Residual Block2 | 256×256 |
| Attention Residual Block2 | 256×256 |
| Residual Block3 | 256×256 |
| Attention Residual Block3 | 256×256 |
| Residual Block4 | 256×256 |
| Dropout | 0.01 |
| Linear layer | 256×30 |
To verify the effectiveness and advancement of the proposed method, we design three sets of experiments: a feature correlation and importance experiment, an ablation experiment, and an advancedness comparison experiment. The feature correlation and importance experiment aims to validate the effectiveness and robustness of the newly proposed features. The ablation experiment is conducted to separately verify the effectiveness of the proposed representation method and the proposed model. The advancedness comparison experiment is intended to validate the overall advancement of the proposed method. The following sections provide detailed analyses of the different experimental results.
To build an efficient and interpretable representation set, we followed a systematic, data-driven feature engineering process. This subsection aims to achieve three goals: (1) verify the effectiveness of the newly proposed graph features; (2) identify and eliminate excessive redundancy between features; and (3) finally provide quantitative evidence for the effectiveness of our hybrid representation method.
Correlations between all features are unavoidable [27]. Multicollinearity (high correlation) between features not only increases model complexity and the risk of overfitting, but also indicates redundancy in the representation set. We calculated the Spearman’s rank correlation coefficients [30] between all 36 initial features (including 31-dimensional traditional features and 5 new graph features). We also plotted a heatmap of the correlation matrix, as shown in Figure 14 (-1/1 indicates perfect correlation, and 0 indicates no correlation).

Figure 14. Feature correlation.
Analysis of feature correlations reveals that many feature pairs are correlated, with some showing perfect correlation–meaning their metric values are identical. For example, the count of t gates is perfectly correlated with that of tdg gates, while other features also exhibit some degree of correlation. Notably, the correlation between the proposed new features and other features does not exceed 95%. This indicates that the new features successfully capture novel structural information not described by traditional statistical features, statistically validating their unique value as an effective supplement. Based on this analysis, we established strict screening rules: for any highly correlated feature pair (with similarity ≥ 95%), only one is retained. We prioritize retaining features that are more commonly used in domain knowledge or have clearer physical meanings (e.g., retaining the count of t gates while removing the perfectly correlated count of tdg gates). Through this step, we removed 3 redundant traditional features (tdg, parallelism, rzz), reducing the feature dimensionality from 36 to 33.
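The screening rule can be sketched as a greedy filter over Spearman correlations. This is an illustrative implementation, not the authors' procedure: the function names are hypothetical, the threshold is applied to the absolute correlation, and the order of the input dictionary stands in for the retention priority (features listed first are preferred).

```python
def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return cov / (var_x * var_y) ** 0.5


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with any kept feature.

    columns: {name: list of values}, ordered by retention priority.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(spearman(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: 'tdg' duplicates 't', so only 't' survives the filter.
toy = {"t": [1, 2, 3, 4], "tdg": [1, 2, 3, 4], "depth": [4, 1, 3, 2]}
print(drop_redundant(toy))
```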
To further quantify the actual contribution of each feature to model predictions, this paper employs the permutation feature importance method. The trained DRAN-Net model is used to generate a feature importance ranking chart, sorted by permutation feature importance in descending order, as shown in Figure 15. The results indicate significant differences in the influence of various features. Among them, the number of qubits has an importance greater than 0.6 in the model’s predictions, the highest of all features. This is consistent with a successfully trained model, as a quantum circuit cannot be executed if the device has fewer qubits than the circuit requires. Among the new features proposed in this paper, the Average Clustering Coefficient has an importance greater than 0.3 in the model’s predictions, which supports the effectiveness of the proposed features. Moreover, a permutation importance of zero does not necessarily imply that a feature is useless, because redundancy among features and the model’s independence can also result in a permutation importance of zero. This analysis not only reveals the model’s decision-making basis but also provides a clearer understanding of the role of each feature in the prediction task, further validating the effectiveness of the proposed new features.
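Permutation feature importance measures how much a model's accuracy drops when the values of one feature are shuffled across samples. The sketch below shows the idea for any prediction function; the name `permutation_importance`, the number of repeats, and the toy predictor are illustrative assumptions rather than the configuration used for DRAN-Net.

```python
import random


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    predict: callable mapping a list of feature rows to a list of labels.
    rows: list of feature rows, each a list of numbers.
    """
    rng = random.Random(seed)

    def acc(data):
        preds = predict(data)
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    baseline = acc(rows)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(rows, shuffled)]
            drops.append(baseline - acc(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy predictor that only looks at feature 0, so feature 1 scores exactly 0.
def toy_predict(data):
    return [int(row[0] > 0) for row in data]


toy_rows = [[-1.0, 5.0], [1.0, 5.0], [-2.0, 7.0], [2.0, 7.0],
            [-3.0, 9.0], [3.0, 9.0], [-4.0, 1.0], [4.0, 1.0]]
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1]
importance = permutation_importance(toy_predict, toy_rows, toy_labels)
print(importance)
```

The toy case also illustrates the caveat in the text: a feature that the model ignores, or one whose information is fully duplicated elsewhere, receives an importance of zero regardless of how informative it is on its own.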

Figure 15. Feature importance.
To deeply verify the effectiveness of the proposed method, we design a series of ablation experiments to evaluate their impact on the overall performance. In the experimental results section, we detail the model performance metrics for each experimental configuration, including accuracy, Top-3 accuracy, and F1. All experiments are conducted under the same evaluation criteria, and results are reported as the mean ± standard deviation from five independent runs.
First, we apply the proposed quantum circuit representation method to several traditional models and analyze the results. The purpose of this step is to observe the impact of the proposed representation method on prediction performance, as well as its effectiveness and robustness. The evaluation results are shown in Table 2.
Table 2. Prediction results of traditional models combined with the traditional and new representation sets.
| Model+Representation Set | Accuracy(%) | Top-3(%) | F1(%) |
|---|---|---|---|
| Random Forest+T-Pre | 77.39 ± 0.05 | 95.54 ± 0.12 | 80.75 ± 0.12 |
| Random Forest+N-Pre | 78.88 ± 0.07 | 95.87 ± 0.16 | 81.90 ± 0.14 |
| Gradient Boosting+T-Pre | 76.63 ± 0.21 | 94.58 ± 0.17 | 80.47 ± 0.76 |
| Gradient Boosting+N-Pre | 77.31 ± 0.05 | 95.24 ± 0.61 | 79.95 ± 0.27 |
| Decision Tree+T-Pre | 72.93 ± 0.10 | 95.17 ± 0.00 | 78.52 ± 0.20 |
| Decision Tree+N-Pre | 73.47 ± 0.34 | 95.17 ± 0.64 | 74.35 ± 1.15 |
| Nearest Neighbor+T-Pre | 72.48 ± 0.00 | 93.68 ± 0.00 | 77.13 ± 0.00 |
| Nearest Neighbor+N-Pre | 73.73 ± 0.00 | 93.34 ± 0.00 | 78.56 ± 0.00 |
Figure 16 shows that the proposed quantum circuit representation method (N-Pre) improves the accuracy, Top-3 accuracy, and F1 of Random Forest, which performs best among the traditional models. Compared with the traditional representation method (T-Pre), N-Pre improves the Random Forest accuracy by 1.49 percentage points, the Top-3 accuracy by 0.33 percentage points, and the F1 by 1.15 percentage points, demonstrating the effectiveness of the proposed representation method. However, since N-Pre introduces more features than T-Pre, and the Gradient Boosting, Decision Tree, and Nearest Neighbor models are sensitive to feature scales, the computational complexity of these three models increases. This leads to a decrease in the F1 of Gradient Boosting, the F1 of Decision Tree, and the Top-3 accuracy of Nearest Neighbor. Even so, N-Pre improves the accuracy of all three of these models, which demonstrates the robustness of the proposed method. In summary, although TWIG features show certain correlations with conventional features (Figure 14), the ablation results indicate that they provide incremental predictive information. This suggests that the correlations primarily stem from these features describing the same underlying circuit complexity from different perspectives, while the temporal and structural information captured by TWIG features is the key driver of the performance gain. Thus, the value of these novel features lies in providing complementary perspectives for a more comprehensive understanding of quantum circuits.

Performance Comparison of T-Pre and N-Pre.
Notably, in the permutation importance analysis, only one new feature, Average Clustering Coefficient, showed high importance, while the other four TWIG features ranked lower. To investigate the actual contribution of these features, we conducted an ablation experiment: we removed the four lower-importance features, leaving the traditional features plus Average Clustering Coefficient alone (T-Pre+AC). This feature set was then evaluated with the traditional models, and the results are presented in Table 3.
Prediction results after removal of four lower-importance features.
| Model+Representation Set | Accuracy (%) | Top-3 (%) | F1 (%) |
|---|---|---|---|
| Random Forest+T-Pre+AC | 78.13 ± 0.08 | 95.84 ± 0.11 | 80.93 ± 0.22 |
| Random Forest+N-Pre | 78.88 ± 0.07 | 95.87 ± 0.16 | 81.90 ± 0.14 |
| Gradient Boosting+T-Pre+AC | 76.84 ± 0.19 | 94.91 ± 0.17 | 79.82 ± 0.20 |
| Gradient Boosting+N-Pre | 77.31 ± 0.05 | 95.24 ± 0.61 | 79.95 ± 0.27 |
| Decision Tree+T-Pre+AC | 73.05 ± 0.12 | 94.64 ± 0.22 | 76.14 ± 1.46 |
| Decision Tree+N-Pre | 73.47 ± 0.34 | 95.17 ± 0.64 | 74.35 ± 1.15 |
| Nearest Neighbor+T-Pre+AC | 72.48 ± 0.00 | 93.68 ± 0.00 | 77.13 ± 0.00 |
| Nearest Neighbor+N-Pre | 73.73 ± 0.00 | 93.34 ± 0.00 | 78.56 ± 0.00 |
The experimental results in Figure 17 demonstrate that models using the condensed representation set (T-Pre+AC) achieve consistently lower accuracy than those using the complete representation set (N-Pre). Apart from fluctuations in the Decision Tree and Nearest Neighbor models, which we attribute to their sensitivity to feature scaling, the same pattern holds for the other two metrics. This finding indicates that although the four features individually have limited impact on model output when perturbed in isolation, together they contribute complementary information that enhances the model’s overall representational capacity and generalization performance within the complete representation set. Consequently, we retained all TWIG features to achieve optimal performance.

Performance Comparison of T-Pre+AC and N-Pre.
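The relationship between the three representation sets compared above can be sketched as column selections over a single feature vector. The sketch is illustrative only: it assumes the 36-dimensional vector holds 31 traditional features followed by the five TWIG features, with Average Clustering Coefficient first among them. That layout, the index positions, and the names `REPRESENTATIONS` and `project` are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence

N_TOTAL = 36                        # full N-Pre vector length stated in the paper
N_TWIG = 5                          # Average Clustering Coefficient + four other TWIG features
N_TRADITIONAL = N_TOTAL - N_TWIG    # assumption: the remaining 31 entries are traditional
AC_INDEX = N_TRADITIONAL            # assumption: AC is stored right after the traditional block

# Each representation set is a list of column indices into the full vector.
REPRESENTATIONS: Dict[str, List[int]] = {
    "T-Pre": list(range(N_TRADITIONAL)),
    "T-Pre+AC": list(range(N_TRADITIONAL)) + [AC_INDEX],
    "N-Pre": list(range(N_TOTAL)),
}


def project(vector: Sequence[float], representation: str) -> List[float]:
    """Keep only the columns that belong to the named representation set."""
    if len(vector) != N_TOTAL:
        raise ValueError(f"expected {N_TOTAL} features, got {len(vector)}")
    return [vector[i] for i in REPRESENTATIONS[representation]]


# Toy vector whose values equal their own index, so selections are easy to inspect.
example = [float(i) for i in range(N_TOTAL)]
t_pre = project(example, "T-Pre")
t_pre_ac = project(example, "T-Pre+AC")
n_pre = project(example, "N-Pre")
```

Under this layout, the ablation in Table 3 amounts to training the same model on `T-Pre+AC` columns and on `N-Pre` columns and comparing the metrics.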
Next, we apply the proposed DRAN-Net model to the traditional quantum circuit representation (T-Pre) and analyze the resulting performance. The purpose of this step is to evaluate the effectiveness and robustness of DRAN-Net when it is given traditional circuit representations. The evaluation results are shown in Table 4. Comparing the performance of the traditional models and DRAN-Net on the same representation gives insight into the advantages of DRAN-Net.
Prediction results between the new model and the traditional representation combination.
| Model+Representation Set | Accuracy (%) | Top-3 (%) | F1 (%) |
|---|---|---|---|
| DRAN-Net+T-Pre | 82.32 ± 1.16 | 97.87 ± 0.48 | 82.18 ± 1.08 |
| Random Forest+T-Pre | 77.39 ± 0.05 | 95.54 ± 0.12 | 80.75 ± 0.12 |
| Gradient Boosting+T-Pre | 76.63 ± 0.21 | 94.58 ± 0.17 | 80.47 ± 0.76 |
| Decision Tree+T-Pre | 72.93 ± 0.10 | 95.17 ± 0.00 | 78.52 ± 0.20 |
| Nearest Neighbor+T-Pre | 72.48 ± 0.00 | 93.68 ± 0.00 | 77.13 ± 0.00 |
| Multilayer Perceptron+T-Pre | 65.35 ± 0.54 | 82.06 ± 1.18 | 58.49 ± 1.57 |
| Support Vector Machine+T-Pre | 62.49 ± 0.00 | 78.70 ± 0.00 | 69.51 ± 0.00 |
| Naive Bayes+T-Pre | 33.60 ± 0.00 | 54.08 ± 0.00 | 29.48 ± 0.00 |
As shown in Figure 18, the proposed DRAN-Net model outperforms every traditional model on all metrics. Compared with Random Forest, the best-performing traditional model, DRAN-Net improves the accuracy by 4.93 percentage points, the Top-3 accuracy by 2.33 percentage points, and the F1 by 1.43 percentage points, which supports the effectiveness of DRAN-Net. Furthermore, the prediction results obtained by DRAN-Net using the traditional representation are close to the best prediction performance currently achievable, which indicates the robustness of DRAN-Net.

Performance Comparison of model using T-Pre.
Through these ablation experiments, we can understand more clearly how the components of the method work together and how each affects prediction accuracy. These supplementary experiments further demonstrate the contribution of each improvement to the overall performance and verify the effectiveness, robustness, and reliability of the proposed representation method and prediction model.
To comprehensively verify the superiority of the method proposed in this paper, we systematically compare it with two types of baseline methods. One type includes traditional machine learning models used in [11]: Random Forest, Gradient Boosting, Decision Tree, Nearest Neighbor, Multilayer Perceptron, Support Vector Machine, and Naive Bayes. The other type consists of more advanced graph neural network models: GIN (Graph Isomorphism Network), GCN (Graph Convolutional Network), and GraphSAGE. All comparative experiments are conducted under the same dataset and evaluation criteria. Results are reported as the mean ± standard deviation from five independent runs, as shown in Table 5.
Prediction results of various model methods.
| Model+Representation Set | Accuracy (%) | Top-3 (%) | F1 (%) |
|---|---|---|---|
| Random Forest+T-Pre | 77.39 ± 0.05 | 95.54 ± 0.12 | 80.75 ± 0.12 |
| Gradient Boosting+T-Pre | 76.63 ± 0.21 | 94.58 ± 0.17 | 80.47 ± 0.76 |
| Decision Tree+T-Pre | 72.93 ± 0.10 | 95.17 ± 0.00 | 78.52 ± 0.20 |
| Nearest Neighbor+T-Pre | 72.48 ± 0.00 | 93.68 ± 0.00 | 77.13 ± 0.00 |
| Multilayer Perceptron+T-Pre | 65.35 ± 0.54 | 82.06 ± 1.18 | 58.49 ± 1.57 |
| Support Vector Machine+T-Pre | 62.49 ± 0.00 | 78.70 ± 0.00 | 69.51 ± 0.00 |
| Naive Bayes+T-Pre | 33.60 ± 0.00 | 54.08 ± 0.00 | 29.48 ± 0.00 |
| GraphSAGE+TWIG-Pre | 71.10 ± 1.02 | 95.77 ± 0.88 | 60.41 ± 2.49 |
| GIN+TWIG-Pre | 48.28 ± 0.64 | 84.52 ± 1.57 | 21.82 ± 1.50 |
| GCN+TWIG-Pre | 30.97 ± 0.25 | 56.54 ± 3.23 | 3.83 ± 0.49 |
| Ours (DRAN-Net+N-Pre) | 82.83 ± 1.63 | 98.14 ± 0.32 | 82.80 ± 1.52 |
Figure 19 demonstrates that Ours (DRAN-Net+N-Pre) achieves the best performance in terms of accuracy, Top-3 accuracy, and F1, reaching 82.83%, 98.14%, and 82.80%, respectively. In particular, the Top-3 accuracy is close to error-free prediction, which spares end users a large number of unsuccessful attempts. Compared with Random Forest, the best of the traditional baselines, the proposed method improves on all metrics: by 5.44 percentage points in accuracy, 2.60 in Top-3 accuracy, and 2.05 in F1. It also shows significant advantages over the graph neural network (GNN) baselines. Compared with the best-performing GNN model (GraphSAGE), it achieves improvements of 11.73 percentage points in accuracy, 2.37 in Top-3 accuracy, and 22.39 in F1, the last of which indicates a markedly better balance across classes. Even when compared with advanced GNNs designed specifically for graph-structured data, DRAN-Net+N-Pre still demonstrates clear advantages. Together, these results verify the superiority of the method proposed in this paper.

Performance Comparison of various model methods.
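Top-3 accuracy, as used above, counts a prediction as correct when the truly best option appears among the three highest-scoring options. A minimal sketch of that metric follows, assuming the model emits one score per compilation option for each circuit. The function name `top_k_accuracy` and the toy scores are illustrative and not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 3,
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring options."""
    hits = 0
    for row, label in zip(scores, labels):
        # Option indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: three circuits, four candidate options each.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],   # best guess is option 1
    [0.40, 0.30, 0.20, 0.10],   # best guess is option 0
    [0.05, 0.15, 0.30, 0.50],   # best guess is option 3
]
toy_labels = [1, 2, 0]          # truly optimal option for each circuit

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

In the toy data only the first circuit is a top-1 hit, while the second becomes a hit once three candidates are allowed, which is why Top-3 accuracy is always at least as high as accuracy.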
To evaluate the applicability and scalability of the proposed method in broader quantum compilation decision-making scenarios, we migrate our approach to the device selection task defined in [13] and compare it directly with the baseline method (Random Forest) adopted in that study. The experimental results are summarized in Table 6. This task aims to select the optimal execution device for a given quantum circuit and is formulated as a classification problem; the original dataset comprises 600 circuits.
Prediction results on the device selection task from [13].
| Model | Accuracy (%) | Top-3 (%) | F1 (%) |
|---|---|---|---|
| DRAN-Net | 98.82 ± 0.41 | 100 ± 0.00 | 98.71 ± 0.50 |
| Random Forest | 98.15 ± 0.63 | 100 ± 0.00 | 97.01 ± 1.00 |
As shown in Figure 20, our DRAN-Net model outperformed the original baseline method on the task introduced by [13]. Specifically, our model achieved statistically significant improvements in both accuracy (an increase of 0.67 percentage points) and F1 (an increase of 1.70 percentage points). Notably, while maintaining a perfect Top-3 accuracy of 100%, our model is more reliable on the more stringent top-1 accuracy metric. These results indicate that the proposed method captures features applicable to various compilation decision scenarios and exhibits good transferability and generalization potential. This provides empirical evidence for its integration as an intelligent component into more complex compilation frameworks such as [13].

Performance comparison on the [13] device selection task.
In addition, compared with DRAN-Net+T-Pre, the top-performing configuration in Table 4, DRAN-Net+N-Pre delivers further gains: accuracy improves by 0.51 percentage points, Top-3 accuracy by 0.27, and F1 by 0.62. Since both configurations use the same model, this result further validates the representation method proposed in this paper. Separately, compared with Random Forest+N-Pre, the top-performing configuration in Table 2, DRAN-Net+N-Pre improves accuracy by 3.95 percentage points, Top-3 accuracy by 2.27, and F1 by 0.90. As both approaches use the same input representation set, this finding further confirms the advantage of the DRAN-Net model.
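The gains quoted above are absolute differences between the mean scores reported in the result tables. The short sketch below recomputes them from those means; the dictionary `RESULTS` and the helper `gain` are illustrative names introduced here.

```python
from typing import Dict, Tuple

# Mean (accuracy, Top-3 accuracy, F1) in percent, copied from the result tables.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "DRAN-Net+N-Pre": (82.83, 98.14, 82.80),
    "DRAN-Net+T-Pre": (82.32, 97.87, 82.18),
    "Random Forest+N-Pre": (78.88, 95.87, 81.90),
}


def gain(method: str, reference: str) -> Tuple[float, ...]:
    """Absolute difference per metric, rounded to two decimals."""
    return tuple(
        round(m - r, 2) for m, r in zip(RESULTS[method], RESULTS[reference])
    )


# Same model, different representation: isolates the effect of N-Pre.
representation_gain = gain("DRAN-Net+N-Pre", "DRAN-Net+T-Pre")
# Same representation, different model: isolates the effect of DRAN-Net.
model_gain = gain("DRAN-Net+N-Pre", "Random Forest+N-Pre")
```

Holding one factor fixed in each comparison is what lets the two differences be read as the separate contributions of the representation and of the model.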
Finally, to ensure the reliability of the evidence, we conducted two-sample t-tests on all test-set metrics of the proposed method and the baseline methods in order to evaluate the statistical significance of the performance differences. For each comparison, Welch’s t-test was used when the variances were unequal, and the standard independent-samples t-test was applied when the variances were equal. The sample size was 5 per method, corresponding to the five independent runs.
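The two test statistics mentioned above can be sketched directly from summary statistics. The sketch assumes the reported standard deviations are sample standard deviations over five runs; the function names `welch_t` and `student_t` are illustrative, and the p-value step (which needs the t-distribution CDF) is left out.

```python
import math
from typing import Tuple


def welch_t(mean_a: float, std_a: float, n_a: int,
            mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se_a = std_a ** 2 / n_a          # squared standard error of group A
    se_b = std_b ** 2 / n_b          # squared standard error of group B
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


def student_t(mean_a: float, std_a: float, n_a: int,
              mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Equal-variance (pooled) t statistic and its degrees of freedom."""
    pooled = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / (n_a + n_b - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, float(n_a + n_b - 2)


# Top-3 accuracy: Ours (98.14 ± 0.32) vs. Random Forest+T-Pre (95.54 ± 0.12), n = 5 each.
t_welch, df_welch = welch_t(98.14, 0.32, 5, 95.54, 0.12, 5)
t_student, df_student = student_t(98.14, 0.32, 5, 95.54, 0.12, 5)
```

For these inputs the Welch statistic is about 17.01, in line with the corresponding Top-3 entry in Table 7. With equal group sizes the two statistics coincide, so the choice of test only changes the degrees of freedom and hence the p-value.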
Table 7 shows that, for all performance metrics, the differences between Ours (DRAN-Net+N-Pre) and all baseline methods are statistically significant (p < 0.05). This indicates that the advantage is not merely the result of random variation but stems from the combination of the N-Pre representation with DRAN-Net’s capability for modeling feature fusion. These results validate the superiority and effectiveness of the overall method for the quantum circuit compilation option prediction task.
Statistical significance tests.
| Ours (DRAN-Net+N-Pre) vs. | Metric | t | p-Value (α = 0.05) |
|---|---|---|---|
| Random Forest+T-Pre | Accuracy | 8.4356 | 0.0009 |
| | Top-3 | 17.0113 | 0.0000 |
| | F1 | 3.0064 | 0.0391 |
| Gradient Boosting+T-Pre | Accuracy | 7.4592 | 0.0017 |
| | Top-3 | 21.9686 | 0.0000 |
| | F1 | 3.0658 | 0.0154 |
| Decision Tree+T-Pre | Accuracy | 13.5555 | 0.0002 |
| | Top-3 | 20.7535 | 0.0000 |
| | F1 | 6.2425 | 0.0030 |
| Nearest Neighbor+T-Pre | Accuracy | 14.1983 | 0.0001 |
| | Top-3 | 31.1652 | 0.0000 |
| | F1 | 8.3411 | 0.0011 |
| Multilayer Perceptron+T-Pre | Accuracy | 22.7628 | 0.0000 |
| | Top-3 | 29.4089 | 0.0000 |
| | F1 | 24.8754 | 0.0000 |
| Support Vector Machine+T-Pre | Accuracy | 27.9028 | 0.0000 |
| | Top-3 | 135.8411 | 0.0000 |
| | F1 | 19.5509 | 0.0000 |
| Naive Bayes+T-Pre | Accuracy | 67.5347 | 0.0000 |
| | Top-3 | 307.8786 | 0.0000 |
| | F1 | 78.4389 | 0.0000 |
| GraphSAGE+TWIG-Pre | Accuracy | 13.6408 | 0.0000 |
| | Top-3 | 5.6596 | 0.0005 |
| | F1 | 17.1617 | 0.0000 |
| GIN+TWIG-Pre | Accuracy | 44.1176 | 0.0000 |
| | Top-3 | 19.0074 | 0.0000 |
| | F1 | 63.8515 | 0.0000 |
| GCN+TWIG-Pre | Accuracy | 70.3203 | 0.0000 |
| | Top-3 | 28.6586 | 0.0000 |
| | F1 | 110.5693 | 0.0000 |
To evaluate the practical value of the proposed method, we randomly selected 100 representative circuits from MQT Bench. Because of the limited computational resources of our device, the selected circuits have qubit counts ranging from 5 to 50 and gate counts ranging from 50 to 5000. For each circuit, we separately ran an exhaustive search (enumeration) over all 30 compilation options and a DRAN-Net prediction, and compared the two in terms of execution efficiency (time) and effectiveness (quality) for single-circuit compilation decision-making. The experimental results are presented in Table 8.
Efficiency and effectiveness of different methods.
| Method | Average Time (s) | Time Range (s) | Accuracy (%) | Top-3 (%) |
|---|---|---|---|---|
| Exhaustive Search | 118 | 5-243 | 100 ± 0.00 | 100 ± 0.00 |
| DRAN-Net Prediction | 0.27 | 0.19-0.51 | 90.40 ± 0.80 | 99.40 ± 0.49 |
As shown in Figure 21, the DRAN-Net model reduces the compilation decision time from an average of 118 seconds to 0.27 seconds, a 437-fold speedup. The model thus moves from minute-level to sub-second response times, which provides crucial support for interactive quantum program development. While achieving this speedup, the DRAN-Net model maintains competitive decision quality. Its Top-3 accuracy reaches 99.4%, meaning that for nearly all circuits the optimal compilation option is among the model’s top three recommendations, so in practice users only need to try a very limited number of options to obtain near-optimal compilation results. Its accuracy is 90.4%, meaning that in approximately nine out of ten cases the top recommendation is the globally optimal option. Although there is a gap relative to the optimum guaranteed by exhaustive search, this accuracy comes with a 437-fold acceleration, and for the vast majority of application scenarios the small risk of a suboptimal choice is an acceptable price for such an efficiency gain. In summary, the DRAN-Net model achieves an effective trade-off in the compilation decision-making problem, further demonstrating its validity.

Exhaustive search vs. DRAN-Net prediction.
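The trade-off described above can be sketched as two search strategies over the 30 compilation options: trying every option, or trying only the few that a model ranks highest. Everything in the sketch is a hypothetical stand-in: the `evaluate` callback plays the role of compiling with one option and measuring the result, and the quality values and model scores are toy data, not outputs of DRAN-Net.

```python
from typing import Callable, List, Sequence


def exhaustive_search(n_options: int, evaluate: Callable[[int], float]) -> int:
    """Compile with every option and return the index of the best one."""
    return max(range(n_options), key=evaluate)


def shortlist_search(scores: Sequence[float],
                     evaluate: Callable[[int], float],
                     k: int = 3) -> int:
    """Compile only the k options the model ranks highest and return the best."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return max(ranked[:k], key=evaluate)


# Toy data: 30 options with distinct qualities; option 17 is the true optimum.
quality: List[float] = [((7 * j) % 30) / 30 for j in range(30)]
calls: List[int] = []


def evaluate(option: int) -> float:
    calls.append(option)             # record every compilation that is actually run
    return quality[option]


# Toy model scores that rank the true optimum second rather than first.
scores = list(quality)
scores[4] = 1.0

best_exhaustive = exhaustive_search(30, evaluate)
n_exhaustive = len(calls)
calls.clear()
best_shortlist = shortlist_search(scores, evaluate, k=3)
n_shortlist = len(calls)

speedup = 118 / 0.27                 # Table 8 averages: exhaustive time / prediction time
```

In this toy case the model's first choice is wrong, yet the shortlist still recovers the optimum after 3 evaluations instead of 30. Note that the 0.27 s in Table 8 covers prediction only; compiling the shortlisted options would add to that time.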
This paper presents an efficient quantum compilation option prediction method. However, with the rapid development of quantum computing technology, the current study still has some limitations. The following directions warrant further exploration:
Evolution of Model Architecture. Although DRAN-Net performs excellently in compilation option prediction tasks, the current model still faces challenges when processing extremely large-scale quantum circuits. In the future, Graph Neural Networks (GNN) could be introduced for end-to-end circuit representation, combined with Reinforcement Learning (RL) to achieve dynamic compilation strategy sequence generation as explored in [13], thereby enhancing the model’s adaptability to complex quantum circuits.
Dataset Expansion and Benchmark Establishment. The current research is based on the MQT Bench benchmark dataset, which primarily contains arithmetic and logical quantum circuits. To improve the model’s generalization capability, we will incorporate more representative quantum algorithms such as VQE [31], QAOA [32], and QFT [33] to build a more diverse training dataset and conduct systematic comparative analysis with advanced models like Transformer [25].
Exploration of Fundamental Compilation Issues. This paper focuses on the high-level decision-making problem of compilation option prediction and has not yet deeply explored the dependencies between compilation options or the impact of hardware noise. In the future, we will commit to explicit dependency modeling between compilation options and develop noise-aware adaptive compilation strategies to address the noise challenges in real quantum hardware [34].
In this paper, we propose a quantum circuit compilation option prediction method based on DRAN-Net that automatically predicts the optimal compilation scheme. It improves the efficiency and accuracy of quantum circuit compilation, allowing end users from outside quantum computing to implement their applications more efficiently on actual hardware. First, we propose a new quantum circuit representation method that introduces graph features based on TWIG and combines them with traditional features into a fixed 36-dimensional representation vector, which captures quantum circuit information more fully. Second, we propose a new quantum circuit compilation option prediction model, DRAN-Net. Its residual connections and attention mechanism mitigate the vanishing gradient problem during training, which greatly improves the prediction performance of the model. Finally, to verify the effectiveness and advancement of the proposed method, we design three groups of experiments. In the feature correlation and importance experiments, we verify the effectiveness of the proposed features and obtain a new feature arrangement. In the ablation experiments, we verify the effectiveness of the proposed representation method and prediction model, respectively. In the comparison experiments, the proposed method improves the accuracy by 5.44 percentage points, the Top-3 accuracy by 2.60 percentage points, and the F1 by 2.05 percentage points over the strongest existing baseline (Random Forest).