Masahiro Nakadai, y Shiro Tamiya, y - arXiv

Qulacs: a fast and versatile quantum circuit simulator forresearch purposeYasunari Suzuki1,2, Yoshiaki Kawase3, Yuya Masumura4, Yuria Hiraga5, Masahiro Nakadai6,Jiabao Chen7, Ken M. Nakanishi7,8, Kosuke Mitarai3,7,9, Ryosuke Imai7, Shiro Tamiya7,10,Takahiro Yamamoto7, Tennin Yan7, Toru Kawakubo7, Yuya O. Nakagawa7, Yohei Ibe7,Youyuan Zhang7,8, Hirotsugu Yamashita11, Hikaru Yoshimura11, Akihiro Hayashi12, andKeisuke Fujii2,3,9,13

1NTT Computer and Data Science Laboratories, NTT Corporation, Musashino 180-8585, Japan2JST PRESTO, Kawaguchi, Saitama 332-0012, Japan3Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan4Graduate School of Information Science and Technology, Osaka University, 1-1 Yamadaoka, Suita, Osaka 565-0871, Japan5Graduate School of Information and Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara 630-0192,Japan

6Graduate School of Science, Kyoto University, Yoshida-Ushinomiya, Sakyo, Kyoto 606-8302, Japan7QunaSys Inc., Aqua Hakusan Building 9F, 1-13-7 Hakusan, Bunkyo, Tokyo 113-0001, Japan8Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan9Center for Quantum Information and Quantum Biology, Institute for Open and Transdisciplinary Research Initiatives, OsakaUniversity, Japan

10Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan11Individual researcher12School of Computer Science, Georgia Institute of Technology, Atlanta, GA, 30332, USA13Center for Emergent Matter Science, RIKEN, Wako Saitama 351-0198, Japan

To explore the possibilities of a near-term intermediate-scale quantum algo-rithm and long-term fault-tolerant quan-tum computing, a fast and versatile quan-tum circuit simulator is needed. Here,we introduce Qulacs, a fast simulator forquantum circuits intended for researchpurpose. We show the main concepts ofQulacs, explain how to use its features viaexamples, describe numerical techniquesto speed-up simulation, and demonstrateits performance with numerical bench-marks.

Yasunari Suzuki: [email protected] Chen: [email protected] Imai: [email protected] Yan: [email protected] Kawakubo: [email protected] O. Nakagawa: [email protected]

1 Introduction

Many theoretical groups have explored quantumcomputing applications due to the rapid improve-ments in quantum technologies and huge effortsof experimental groups to develop quantum com-puters [1, 2]. Although classical simulation ofquantum circuits is a vital tool to develop quan-tum computers, the simulation time increases ex-ponentially with the number of qubits.

The primary reasons for using a classical sim-ulator are the following: (i) Quantum devicessuffer from much higher error rates. A classicalsimulator is necessary to determine the ideal re-sults for comparison. (ii) In certain cases, not allnecessary values can be directly measured fromexperiments such as the full quantum state vec-tor and marginal probabilities of measurements.(iii) To analyze the performance of quantum errorcorrection and noise characterization for an arbi-trary noise model, noisy quantum circuits mustbe simulated. Hence, there are broad demandson classical simulators for research on quantum

Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 1

arX

iv:2

011.

1352

4v4

[qu

ant-

ph]

5 O

ct 2

021

https://quantum-journal.org/?s=Qulacs:%20a%20fast%20and%20versatile%20quantum%20circuit%20simulator%20for%20research%20purpose&reason=title-click

https://quantum-journal.org/?s=Qulacs:%20a%20fast%20and%20versatile%20quantum%20circuit%20simulator%20for%20research%20purpose&reason=title-click

mailto:[email protected]






computing. Although full simulations of quan-tum circuits are not efficient in the sense of com-putational complexity theory, it is important toimplement a fast classical simulator as much aspossible.

Here, we introduce a quantum circuit simula-tor, which is called Qulacs [3]. The main featureof Qulacs is that it meets many popular demandsin quantum computing research such as evaluat-ing near-term applications, quantum error correc-tion, and quantum benchmark methods. In ad-dition, Qulacs is available on many popular en-vironments, and is one of the fastest quantumcircuit simulators. Herein we demonstrate thestructure of our library, optimization methods,and numerical benchmarks for simulating typicalquantum gates and circuits.

This paper consists of the following topics. Sec-tion. 2 overviews the main features and struc-ture of Qulacs as well as compares it with exist-ing libraries. Section. 3 introduces the expecteduse of Qulacs. Section. 4 discusses how to writecodes with Qulacs using examples codes. Sec-tion. 5 introduces further optimization techniquesto speed-up the simulation of quantum circuits.Section. 6 shows the numerical benchmarks forQulacs. Section 7 compares the performance ofQulacs with that of existing simulators. Finally,Section. 8 is devoted to summary and discussion.

2 Overview

2.1 Features of Qulacs

Qulacs is designed to accelerate research on quan-tum computing. Thus, Qulacs prioritizes the fol-lowing:

Fast simulation of large quantum circuits:A full simulation of quantum circuits requires atime that grows exponentially with the numberof qubits. This problem can be mitigated by op-timizing and parallelizing the simulation codesfor single- or multi-core CPUs and SIMD (Sin-gle Instruction Multiple Data) units and evenGPUs (Graphics Processing Units), and optimiz-ing quantum circuits before a simulation. Thesetechniques can enable a few orders of magnitudeperformance improvement compared to naive im-plementations. Although this is a constant fac-tor speed-up, the effect on practical research is

tremendous. Qulacs offers optimized and paral-lelized codes to update a quantum state and eval-uate probability distributions, observables, andso on.

Small overhead for simulating small quan-tum circuits: Often small but noisy quantumcircuits (up to 10 qubits) are simulated manytimes rather than simulating a single-shot largeideal quantum circuit. In this case, the over-head due to pre- and post-processing for callingcore API functions is not negligible relative tothe overall simulation time. Qulacs is designedto minimize such overhead by focusing on corefeatures and avoiding complicated functionalities.

Available on different environments:While numerical analysis is typically performedon workstations or high-performance computers,most software development occurs on laptopsor desktop personal computers. For versatility,Qulacs provides interfaces for both Python andC++ languages, while most of the codes ofQulacs are written in C and C++ languages.In addition, Qulacs supports several compilerssuch as GNU Compiler Collection (GCC) andMicrosoft Visual Studio C++ (MSVC). Qulacsis tested on several operating systems such asLinux, Windows, and Mac OS.

Many useful utilities for research: To meetvarious demands in quantum computing research,a simulator should support general quantum op-erations. Qulacs can create not only common uni-tary gates and projection measurements, but alsogeneral operations such as completely positiveinstruments and adaptive quantum gates condi-tioned on measurement results.

2.2 Structure of Qulacs

Qulacs consists of three shared libraries. The firstone is a core library written in C language for op-timized memory management, update functionsof quantum states, and property evaluation ofquantum states. The second library is built ontop of the first library, and is written in C++language. This allows users to easily create andcontrol quantum gates and circuits in an object-oriented way, thereby improving the programma-bility. Also, at runtime, it adaptively chooses


the best performing implementation of quantumgates depending on the number of qubits. Thethird library explores variational methods withquantum circuits. This library wraps quantumgates and circuits so that quantum circuits canbe treated as variational objects.

We expect that users who work on variationalquantum algorithms use Qulacs via the third li-brary, and the other users access Qulacs via thesecond library. Additionally, Qulacs can be usedas a Python library, while there is some over-head in interfacing between Python and C++.Figure 1 shows the overview of the structure ofQulacs. The components of Qulacs and their us-ages are explained in Sec. 4 with example codes.Qulacs uses Eigen [4] to treat a sparse matrixand to manage matrix representations of quan-tum gates and pybind11 [5] for exporting func-tions and classes from C++ to python. The fea-tures introduced in this paper is fully tested withGoogleTest [6] and pytest [7].

2.3 Simulation methods

There are several approaches to simulate generalquantum circuits with classical computers. Thesimplest approach is to update quantum statesrepresented by state vectors or density matricessequentially by applying quantum gates as gen-eral maps. This method is called Schrödinger’smethod [1]. Qulacs implements this method forsimulating quantum circuits due to its fast andversatile simulations of quantum circuits com-pared with the other methods introduced in thissection. A detail of implementation with thismethod is described in Sec. 4.

Another method is Feynman’s approach [1, 8,9], which computes the sum of all Feynman’spath contributions. This technique greatly de-creases the memory size requirement, allowingfor the single amplitude of the final quantumstate to be quickly known. While the numberof Feynman’s paths increases exponentially ac-cording to not only the number of qubits but alsothe number of quantum gates, this requirementcan be relaxed by using tensor-network-based ap-proach. With tensor-network-based simulator,we can make the simulation time increases ex-ponentially according to a tree-width of the net-work [9], which is a characteristic value of graphrepresenting how a tensor network is close to atree graph. However, except special cases where

tree-width is small, such as shallow quantum cir-cuits with a large number of qubits, a tree-widthbecomes equal to the number of qubits. More-over, the Schrödinger’s method is much fasterthan Feynman’s path integral when the numberof qubits is limited and the memory size is suf-ficient for storing a full state vector. Althoughwe can also consider Schrödinger-Feynman ap-proach [1, 10] as a hybrid method, this method istypically faster than Schrödinger’s method simu-lating large quantum circuits with a small depth.

If quantum circuits have specific features, thenthe simulation time can be reduced. For exam-ple, if quantum circuits are dominated by Cliffordgates, non-Clifford gates can be treated as a per-turbation to the simulation. This treatment canreduce the time for simulation [11, 12]. This con-dition is satisfied in several situations, e.g., quan-tum circuits of stabilizer measurements wherenon-Clifford errors happens with a very smallprobability or fault-tolerant quantum computingwith a limited number of T and TOFFOLI-gates.However, most of the quantum circuits of typicalquantum algorithms do not satisfy these condi-tions.

2.4 Relation to the existing libraries

To date, many groups have published a vari-ety of quantum circuit simulators. Cirq [13],Qiskit [14], PyQuil [15], and PennyLane [16] arepublished by Google, IBM, Rigetti computing,and Xanadu, respectively. Since these groupsare developers of hardware for quantum com-puting, these libraries are designed for submit-ting quantum tasks as a job without thinkingabout detailed experimental procedures or se-tups. Q# [17] by Microsoft focuses on provid-ing higher levels of abstraction, such as pack-aged instructions for integer arithmetic or tool-chains for compilation. To simulate quantumcircuits with a low depth and a large num-ber of qubits, tensor-network-based simulatorsare used [18–20]. However, these are not goodfor higher depths or a fewer number of qubits.For quantum circuit simulations on state-of-the-art supercomputers, several works have reportedon the performance and optimizations [21–24].Qiskit-Aer [14], Intel-QS [25, 26], QX Simula-tor [27, 28], ProjectQ [29], QuEST [30], qsim [31],Yao [32], QCGPU [33], and Qibo [34] were all de-veloped with motivations similar to ours. They


Quantum State

Quantum Circuit

Quantum Gates

Front-end C++ classes Optimized low-level functions

CPUs

GPUs

Computing devices

Qulacs

Pyt

hon

in

terf

aces

C++ user codes

Memory management

Quantum state evaluation

Quantum state manipulationBasic gate

General quantum maps

CPTP-map

Adaptive op

CP-instrument

Dense matrix Diagonal matrix

Sparse matrix Permutation matrix

Pauli matrix ⋯

Python user codes

Expectation value

Marginal probability

Sampling with Z-basis

Inner product

Observable

෨𝑂 =

𝑃∈𝒫𝑛

𝜃𝑃𝑃

𝜓 =𝜓0𝜓1⋮

𝜌 =𝜌00 𝜌01 ⋯𝜌10 ⋱⋮

𝜌 ↦ 𝑈𝜌𝑈† 𝜌 ↦

𝑖

𝐾𝑖𝜌𝐾𝑖†

𝑅𝑋 𝜃

𝐻

𝑅𝑍𝑋(𝜙)

depolarize

𝑓 𝑣

Circuit optimization for fast simulation

Concatenate / drop qubits

Copy / load state

Allocate / deallocate array

⋯

⋯

⋯

Parameter controls and differentiation for variational algorithms - SIMD

- Multi-threading

Figure 1: The overview of the structure of Qulacs.

focused on optimizing quantum circuit simula-tions for quantum computing researchers. Bycontrast, Qulacs is one of the fastest simulators,which minimizes overhead even for small simula-tions, supports general quantum operations suchas completely-positive and trace-preserving mapsand completely-positive instruments, and runs onvarious environments. In addition, Qulacs pro-vides many useful utilities frequently used in re-search such as calculations of transition ampli-tudes and reversible Boolean functions.

3 Expected usages of Qulacs

Quantum circuit simulators should be designedfor specific targets. Qulacs is designed to helpresearchers of quantum computing. In particular,we expect the following usages:

3.1 Exploration of near-term applications anderror mitigation techniques

To highlight typical evaluation targets, here weshow two popular directions of near-term appli-cations. One is optimizing a target function usingvariational quantum circuits such as the varia-tional quantum eigensolver (VQE) [35]. In a typ-ical scenario, we assume rotation angles of Pauli

gates as variational parameters of a cost func-tion and optimize them by repeatedly simulatingquantum circuits with a relatively small numberof qubits. The other application is a quantumsimulator [36]. In this approach, large quantumsystems are simulated to explore physics in many-body quantum systems. In a typical scenario, weperform a single simulation with quantum circuitsas large as possible. Thus, the size of memory orallowed time limits the size of quantum circuits.

To date, Qulacs has been used in a few tensof research papers. For example, Qulacs isused by papers related to noisy intermediate-scale quantum (NISQ) applications [37, 37–42]and fault-tolerant quantum computing [43]. Al-though Qulacs does not support gate decompo-sition, which is supported by high-layer libraries,Qulacs can be used as a faster backend library.For example, Qulacs can serve as a backend ofCirq [13] using a library cirq-qulacs [44]. Qulacshas also been chosen as a fast backend simulatorin several libraries and services such as Penny-Lane (Xanadu) [16], t|ket> (Cambridge Quan-tum Computing) [45], Orquestra (Zapata com-puting) [46], and Tequila [47].


3.2 Performance analysis of quantum errorcorrection schemes

Another important usage of the simulator forquantum circuits is performance analysis ofthe quantum error correction and fault-tolerantquantum computing. To construct a quantumcomputer large enough for Shor’s algorithm [48,49], a quantum simulation for quantum many-body systems [36, 50], or algorithms for linearsystems [51], quantum error correction [52] is nec-essary to reduce logical error rates to an arbitrar-ily small value. Many types of quantum error-correcting codes and schemes have been pro-posed. However, the number of qubits neededfor a specific application is highly dependenton the performance of quantum error-correctingschemes. Consequently, it is difficult to controlthe noise properties on real devices and the per-formance of quantum error correction must beanalyzed with classical simulation in near-termdevelopment. Unfortunately, the time to accu-rately simulate quantum error-correcting codeswith practical noise models grows exponentiallywith the number of qubits. Thus, we need a fastand accurate simulator of noisy quantum circuitsof quantum error correction.

3.3 Generation of a reference of experimentaldata

To characterize and calibrate controls of qubits,sometimes experimental data must be comparedwith the ideal one. For example, several verifi-cation methods for large quantum devices [1, 53]require a full simulation of large quantum sys-tems. While quantum circuits in quantum com-putational supremacy regime require supercom-puters for simulations, portable and fast quan-tum circuit simulators remain useful for generat-ing small-scale experimental references.

4 Implementation of Qulacs

4.1 Overview

In Qulacs, any state of a quantum system is rep-resented as a subclass of the QuantumStateBaseclass. Thus far, Qulacs supports two represen-tations of quantum states: state vector and den-sity matrix. The StateVector class representsa state vector, while the DensityMatrix class

represents a density matrix. These classes havesome basic utilities as their member functionssuch as initialization to a certain quantum state,computing marginal probabilities, and samplingmeasurement results. The QuantumStateBaseclass also contains a variable-length integer arraycalled classical registers, which are used to storemeasurement results.

When a quantum state is updated by quan-tum operations, subclasses of QuantumGateBaseare instantiated and applied to a quantum state.This class supports not only unitary operationsand projection measurements but also a vari-ety of operations for general quantum mappingsuch as a completely-positive instrument and acompletely-positive trace-preserving map.

To evaluate the expectation values of observ-ables, there is a class named Observable. Weassume the Observable class is described as alinear combination of Pauli operators. Thus, theObservable instance can be constructed directlyor from an output of an external library such asOpenFermion [54]. Additionally, a Trotterizedquantum circuit can be created from a given ob-servable.

By default, Qulacs performs a simulation byallocating and manipulating StateVector on aCPU. Also, since the use of GPUs can signifi-cantly outperform a CPU in certain cases, Qulacssupports GPU execution. This can be done byusing the StateVectorGpu. Once a state vectoris allocated on a GPU, all the computations suchas update and evaluation of state vectors are per-formed in a GPU unless the state is explicitlyconverted into StateVector. Qulacs does notsupport allocating a quantum state on multipleGPUs. One GPU can be selected by supplyingan index if multiple GPUs are installed.

In the discussion below, we explain the casewhere quantum states are allocated as state vec-tors on the main RAM and processed within aCPU. Although Qulacs can be used as a C++library, we show examples in the Python lan-guage for simplicity in the main text. See Ap-pendix.A for the example codes of the C++ lan-guage. Here, we show typical examples and theirbasic features. Qulacs supports more operationsthan those explained here. For a detailed expla-nation, please see the documentation on the offi-cial website [3].


4.2 Quantum state

4.2.1 Initialization

In Qulacs, instantiating the StateVector classallocates a state vector. By default, a statevector is initialized to zero-state |0〉⊗n. How-ever, the state can be initialized to other statessuch as a computational basis, a state vector ofa given complex array, or a random pure statewith its member functions. The instance of theStateVector class can create its copy or loadthe contents of another StateVector. Listing. 1shows example codes.

1 import numpy as np2 from qulacs import StateVector3

4 # Allocate a state vector5 num_qubit = 26 state = StateVector ( num_qubit )7

8 # Reset to a computational basis9 ## (0:|00 > , 1:|01 > , 2:|10 > , 3:|11 >)

10 ## Note that the right -most digitcorresponds to

11 ## the 0-th qubit in Qulacs .12 state. set_computational_basis (index = 2)13

14 # Create a copy of the state vector15 sub_state = state.copy ()16

17 # Load a given list , numpy array ,18 # or another StateVector19 state.load(state =[0.5 , 0.5, 0.5, -0.5])20 state.load(np.ones (4) /2)21 state.load( sub_state )22

23 # Prepare a randomized pure quantumstate

24 state. set_Haar_random_state (seed = 42)

Listing 1: An example Python program that initializesquantum states.

4.2.2 Analysis

Qulacs implements several functions to evaluatethe properties of quantum states. Although theget_vector functions provide a full state vec-tor, evaluating the properties with built-in func-tions is fast. For example, Listing. 2 shows ex-ample codes to compute a marginal probability,squared norm, and inner-product of two states.Note that the sampling functions create a cumu-lative probability distribution as pre-processingfor the fast sampling, which temporally allocatesan additional 2n-length array.

1 from qulacs import StateVector2 num_qubit = 33 state = StateVector ( num_qubit )4 state. set_Haar_random_state (0)5

6 # Get the state vector as numpy array7 vec = state. get_vector ()8

9 # Get the marginal probability10 ## The below example obtains prob of

|021 >11 ## where "2" is a wild card12 ## matching to both 0 and 1.13 prob = state. get_marginal_probability (

measured_value =[1, 2, 0])14

15 # Sampling results of the Z-basismeasurements

16 samples = state. sampling (count =100 , seed=42)

17

18 # Computing the squared norm19 squared_norm = state. get_squared_norm ()20

21 # Computing the inner product of twoquantum states

22 from qulacs .state import inner_product23 state_bra = StateVector ( num_qubit )24 state_bra . set_Haar_random_state ()25 state_ket = StateVector ( num_qubit )26 state_ket . set_Haar_random_state ()27 value = inner_product (state_bra ,

state_ket )

Listing 2: An example Python program that evaluatesthe properties of quantum states.

4.2.3 Update

Several member functions of StateVector can beused for quickly updating a state vector. Themultiply_coef function multiplies a complexnumber to each element of a quantum state, whilethe multiply_elementwise_function multi-plies index-dependent coefficients to a state vec-tor with a function that returns a coefficient ac-cording to a given index. The add_state func-tion adds two state vectors. While these op-erations are not physically achievable, they areuseful for analysis in theoretical studies suchas creating a superposition of two given statesand supplying specific phases to each elementof a state vector. A list of qubits in a statevector can be concatenated, permutated, or re-duced with tensor_product, permutate_qubit,or drop_qubit, respectively. Listing. 3 showscode examples for multiplication kernels.

1 from qulacs import StateVector


2 from qulacs .state import tensor_product ,permutate_qubit , drop_qubit

3

4 state = StateVector (2)5 state. set_Haar_random_state ()6

7 # Normalize the state vector8 squared_norm = state. get_squared_norm ()9 state. normalize ( squared_norm )

10

11 # Multiply a complex number to eachelement

12 state. multiply_coef (0.5+0.1 j)13

14 # Perform element -wise multiply15 def func(index):16 return (0.5 if index %2 else 0)17

18 state. multiply_elementwise_function (func)

19

20 # Perform element -wise addition21 state. add_state (state)22

23 # Make a tensor product of states .24 # The resultant state has 4 qubits25 sub_state = StateVector (2)26 state = tensor_product (state , sub_state )27

28 # Permutate qubit indices from [0 ,1 ,2 ,3]to [3 ,1 ,2 ,0]

29 state = permutate_qubit (state ,[3 ,1 ,2 ,0])

30

31 # Drop the 1-st and 2-nd qubits fromstate

32 # and project to |0> and |0> subspace .33 new_state = drop_qubit (state , [1,2],

[0 ,0])

Listing 3: An example Python program that update andmodify quantum states.

4.3 Quantum gates

4.3.1 Gate type

As explained in the overview, quantum gates in-clude not only typical quantum gates such asunitary operators and Pauli-Z basis measure-ments, but also contain all operations that up-date quantum states. All the classes of quan-tum gates are defined as a subclass of theQuantumGateBase class, which has a functionupdate_quantum_state that acts on derivedclasses of QuantumStateBase. In Qulacs, quan-tum gates in which the action can be written as|ψ〉 7→ K |ψ〉, where |ψ〉 is a state vector andK is a certain complex matrix, are called basicgates. Examples include a Pauli rotation on mul-

tiple qubits, TOFFOLI-gate, and stabilizer pro-jection to +1 eigenspace. Quantum maps thatare not basic gates such as CPTP-map, projec-tion measurements, and adaptive operations, arerepresented using basic gates. Here, we show sev-eral popular types of operations which are imple-mented as basic operations.

4.3.2 Basic gate

A basic gate is an operation that can be repre-sented by ρ 7→ KρK†. In the case of a purestate, its action on the state vector is representedby |ψ〉 7→ K |ψ〉. Note that K is not neces-sarily unitary. Suppose that this quantum gateacts on a pure quantum state |ψ〉 and obtain anupdated quantum state |ψ′〉 = K |ψ〉. We de-note a state vector of the computational basis as|x〉 =

⊗i |xi〉 for x ∈ {0, 1}n, a coefficient of the

state vector along with the computational basisas ψx = 〈x|ψ〉, and a matrix representation ofK as Kx,y = 〈x|K|y〉. Then, a coefficient of theupdated quantum state can be written by

ψ′x =∑

y∈{0,1}n

Kx,yψy. (1)

Typically, quantum gates non-trivially act onlyon a few qubits. Suppose that the total number ofqubits is n, the list of qubits on which the quan-tum gate non-trivially acts is M , and the num-ber of the target qubits is m. Then, the action ofthe quantum gate K can be simplified by using a2m× 2m complex matrix K̃, which we call a gatematrix, as follows. We define the following twolists of n-bit strings:

B(0) = {x ∈ {0, 1}n|∀i ∈M,xi = 0} (2)B(1) = {x ∈ {0, 1}n|∀i /∈M,xi = 0}. (3)

Here, B(0) is a list of n-bit strings where all thevalues at indices contained in M are zero, andB(1) is a list of n-bit strings where all the valuesat indices not contained inM are zero. There are2n−m elements in B(0) and 2m elements in B(1).An arbitrary n-bit string x can be uniquely de-composed as x = x(0) + x(1) where x(i) ∈ Bi.We use this decomposition implicitly in the fol-lowing discussion. Since the quantum gate onlyacts on the target qubits, a transition amplitudebetween two computational basis states is zeroif their n-bit strings are different at indices notcontained in the target qubits, and a transition


amplitude is independent of the values at indicesnot contained in the target qubit, i.e.,

Kx,y = Kx(0)+x(1),y(0)+y(1)

= Kx(1),y(1)δx(0),y(0) (4)

for an arbitrary x,y ∈ {0, 1}n. Then, Eq. (1) canbe rephrased as

ψ′x(0)+x(1) =∑

y(1)∈B(1)

Kx(1),y(1)ψx(0)+y(1) . (5)

We define a bijective function r : B(1) → {0, 1}msuch that (x0, · · · , xn−1) 7→ (xM0 , · · · , xMm−1)where Mi is the qubit index of the i-th targetqubit, and also define s : {0, 1}m → B(1) as theinverse of r. Then, a 2m × 2m gate matrix K̃ ofthe quantum gate K is defined as

K̃z,w = 〈s(z)|K|s(w)〉 (6)

for z,w ∈ {0, 1}m. With the gate matrix, we canwrite Eq. (5) as follows:

ψ′x(0)+s(z) =∑

w∈{0,1}m

K̃z,wψx(0)+s(w). (7)

We also define a temporal state vector |ψ̃x(0)〉 as a2m-dim complex vector of which the i-th elementis ψx(0)+s(bin(i)) where bin(i) represents the m-bitbinary representation of the integer i. Then, wecan simplify Eq. (7) as

|ψ̃′x(0)〉 = K̃ |ψ̃x(0)〉 . (8)

Since for all x(0) ∈ B(0), Eq. (8) has to be calcu-lated, an update function for K consists of 2n−m

matrix-vector multiplications with a gate matrixK̃ and the temporal state vectors |ψ̃x(0)〉. List-ing. 4 shows a naive implementation of updatefunctions in the Python language, where B0 andB1 are a list of n-bit integers corresponding to B0and B1, respectively.

1 import numpy as np2

3 def func( state_vector , gate_matrix , B0 ,B1 , m):

4 temp_vector = np.zeros (2**m)5 for x0 in B0:6 # Read values from the state vector7 for ind , x1 in enumerate (B1):8 temp_vector [ind] = state_vector [x0

+x1]9 # Perform matrix - vector

multiplication

10 temp_vector = np.dot( gate_matrix ,temp_vector )

11 # Write values to the state vector12 for ind , x1 in enumerate (B1):13 state_vector [x0+x1] = temp_vector [

ind]

Listing 4: An example Python code that naivelyimplements an update function of a quantum state

Note that in the example code, without loss ofgenerality, B1 is considered to be arranged so thatthe i-th element of B1 is s(bin(i)). In practice,a time for reading/writing temporal state vectors(i.e., temp_vector in the example code) from/tothe whole state vector is not negligible. Thus,the update function essentially performs 2n−m it-erations of the following three parts: read 2m-dim complex numbers from a memory, performmatrix-vector multiplication, and write 2m com-plex numbers to the memory.

The DenseMatrix function generates a ba-sic gate K with a gate matrix K̃, and theRandomUnitary function generates that with arandom unitary matrix sampled from a Haar-random distribution. Listing. 5 shows quantumgates instantiated with dense matrices.

1 from qulacs import StateVector2 from qulacs .gate import DenseMatrix ,

RandomUnitary3 state = StateVector (10)4

5 # Update quantum state with given gatematrix

6 target_list = [1]7 gate_matrix = [[0, 1],[1, 0]]8 gate = DenseMatrix ( target_list ,

gate_matrix )9 gate. update_quantum_state (state)

10

11 # Update quantum state with randomunitary

12 ## Matrix is drawn from Haar - randomdistribution

13 random_gate = RandomUnitary ( target_list )14 random_gate . update_quantum_state (state)

Listing 5: An example Python program that appliesdense matrix gates to quantum states.

The computation time to apply a quantum gateis mainly determined by two factors: time for pro-cessing arithmetic operations to perform a calcu-lation with complex numbers and time for pro-cessing memory operations to transfer complexnumbers between the CPU and main RAM; wecall these arithmetic-operation cost and memory-operation cost, respectively. They are propor-tional to the number of arithmetic and mem-


ory operations in update functions. The heavierof the two determines the application time of afunction. The number of arithmetic operationsin each iteration of an update function of densematrix gates is O(22m) and that of memory op-erations is O(2m + 22m), where O is a Landaunotation. Since this iteration is looped for 2n−m

times, the total number of arithmetic and mem-ory operations are O(2n+m) and O(2n + 2n+m).In the case of small m, the gate matrix K̃, whichis re-used in all the iterations, is expected to re-side on cache memory, and the memory-operationcost for the gate matrix is counted at once. Thus,the number of memory operation is effectivelyO(2n + 22m). Typically, because we consider thecase when m� n, the memory-operation cost isfurther approximated as O(2n).

Although any basic gate can be treated asa dense matrix gate, quantum gates in quan-tum computing research sometimes have an addi-tional structure in a gate matrix K̃. By utilizingthis structure, the arithmetic-operation cost, thememory-operation cost, or both can be decreased.This motivates us to define specialized subclassesof dense matrix gates for quantum gates withstructured gate matrices. Here, we show func-tions for several types of basic gates with a struc-ture. Note that the relation between arithmetic-and memory-operation costs and the total com-putation time is discussed in Sec. 5.

Let Mc be a subset of target qubits and c (0 ≤c < 2|Mc|) be an integer, where | · | representsthe number of elements in a given set. Anygate matrix K̃ can be represented in the formK̃ =

∑2|Mc|−1x,y=0 |x〉〈y| ⊗ Lx,y, where the first part

of the tensor product acts on the space of qubitsin Mc, and the latter part acts on the space oftarget qubits except for Mc. Suppose a gate ma-trix K̃ which satisfies Lx,y = 0 if x 6= y andLx,x = I if x 6= c. A quantum gate with sucha gate matrix K̃ is called controlled quantumgates. We can describe a gate matrix of a con-trolled quantum gate with an integer c and com-plex matrix Lc,c. Let mc := |Mc| be the num-ber of control qubits, and mt := m −mc be thenumber of qubits on which Lc,c act. Then, thisgate can be applied with arithmetic-operationcosts O(2n−mc+mt) and memory-operation costsO(2n−mc). In Qulacs, we can specify thepair of the digit of control index c and corre-sponding item in Mc with a member function

add_control_qubit.When a gate matrix has a small number of

non-zero elements, it is called sparse. When K̃is a sparse matrix, the arithmetic- and memory-operation costs can be decreased according to thenumber of non-zero elements. A quantum gatewith a sparse gate matrix can be generated withSparseMatrix.

Another special case is a diagonal matrix,which is a matrix with non-zero elements only inthe diagonal elements. In this case, arithmetic-and memory-operation costs become O(2n).These costs are independent of the number of tar-get qubits m. In this case, DiagonalMatrix canbe used for generating diagonal matrix gates.

An action of reversible Boolean functions inclassical computing can always be represented asa permutation matrix, which is a matrix wherethere is a single unity element in each row andcolumn. ReversibleBoolean creates a unitaryoperation with a permutation matrix by supply-ing a function that returns the index of a columnwith a unity element from the index of a givenrow. This function is applicable not only for re-versible circuits but also for creating and anni-hilating operators as a product of a permutationmatrix and a diagonal matrix. Its arithmetic-and memory-operation costs are O(2n) and arealso independent of the number of target qubitsm.

A set of m-qubit Pauli matrices is defined as atensor product of Pauli matrices {I,X, Y, Z}⊗m,where

I =(

1 00 1

), X =

(0 11 0

),

Y =(

0 −ii 0

), Z =

(1 00 −1

). (9)

A Pauli gate is a gate in which the gate matrix is aPauli matrix. By assigning numbers {0, 1, 2, 3} toPauli matrices {I,X, Y, Z}, respectively, a Paulimatrix can be represented with a sequence of in-tegers {0, 1, 2, 3}m. The Pauli function generatesa basic gate with a Pauli matrix represented bya sequence of assigned integers. Its arithmetic-and memory-operation costs are O(2n) and areindependent of m.

Since a set of m-qubit Pauli matrices is a ba-sis of 2m × 2m matrices, any 2m × 2m matrixcan be represented as a linear combination ofm-qubit Pauli matrices. Furthermore, any self-


adjoint matrix can be represented as a linear com-bination of m-qubit Pauli matrices with real co-efficients. Therefore, any unitary gate matrix canbe represented in the form K̃ = exp(i

∑P θPP ),

where θP is a real coefficient. Suppose a quan-tum gate such that θP = 0 if P 6= Q, where Qis a certain Pauli matrix. Such a quantum gateis called a Pauli rotation gate. PauliRotationcan be used for generating a Pauli rotation gatewith a description of Pauli matrix Q and rotationangle θQ. Its arithmetic- and memory-operationcosts are also O(2n). Note that quantum gateswith multiple non-zero rotation angles, which arevital for simulating the dynamics of quantum sys-tems under a given Hamiltonian, can be gener-ated with DenseMatrix function with an explicitmatrix representation of exp(i

∑P θPP ) or gen-

erated as a Trotterized quantum circuit with ob-servable. For the latter, see Sec. 4.5. Listing. 6shows examples. Qulacs has several other spe-cializations for basic gates, which are detailed inthe online manuals [3].

1 import numpy as np2 from scipy. sparse import csr_matrix3 from qulacs import StateVector4 from qulacs .gate import DenseMatrix ,

SparseMatrix , DiagonalMatrix , Pauli ,PauliRotation , ReversibleBoolean

5 state = StateVector (10)6

7 # Update a quantum state with acontrolled dense matrix gate

8 target_list = [1]9 gate_matrix = [[0, 1],[1, 0]]

10 control_gate = DenseMatrix ( target_list ,gate_matrix )

11 ## Act when the 2-nd qubit is |0>12 control_gate . add_control_qubit (2, 0)13 ## Act when the 3-rd qubit is |1>14 control_gate . add_control_qubit (3, 1)15 control_gate . update_quantum_state (state)16

17 # Update a quantum state with a sparsematrix gate

18 target_list = [2, 1]19 sparse_matrix = csr_matrix (20 ([1 ,1] , ([0 ,3] , [0 ,3])),21 shape =(4 ,4) , dtype= complex )22 sparse_gate = SparseMatrix ( target_list ,

sparse_matrix )23 sparse_gate . update_quantum_state (state)24

25 # update a quantum state with a diagonalmatrix gate

26 target_list = [3, 5]27 diagonal_element = [1, -1, -1, 1]28 diagonal_gate = DiagonalMatrix (

target_list , diagonal_element )

29 diagonal_gate . update_quantum_state (state)

30

31 # Update a quantum state with apermutation matrix gate

32 def basis_to_basis (index , dim):33 return (index +3)%dim34

35 target_list = [0, 3, 4]36 rev_gate = ReversibleBoolean ( target_list

, basis_to_basis )37 rev_gate . update_quantum_state (state)38

39 # Update a quantum state with Pauli gate40 target_list = [1 ,2]41 pauli_ids = [3 ,2]42 pauli_gate = Pauli( target_list ,

pauli_ids )43 pauli_gate . update_quantum_state (state)44

45 # Update a quantum state with a Paulirotation gate

46 target_list = [1 ,2]47 pauli_ids = [3 ,2]48 rotation_angle = np.pi/549 rot_gate = PauliRotation ( target_list ,

pauli_ids , rotation_angle )50 rot_gate . update_quantum_state (state)

Listing 6: An example Python program that appliesseveral basic gates to quantum states.

4.3.3 Quantum map

Quantum maps are a general operation, whichincludes all the quantum maps that cannot berepresented as basic gates such as measurement,noisy operation, and feedback operation.

The most general form of physical operationswithout measurements is a completely-positivetrace-preserving (CPTP) [55]. According to theoperator-sum representation, this map can berepresented as ρ 7→

∑iKiρK

†i , where ρ is the den-

sity matrix and Ki is called the Kraus operator.The map must satisfy the condition

∑iK†iKi =

I. In Qulacs, a list of basic gates, which representthe action of Kraus operators {Ki}, is required tocreate a CPTP-map. When a density matrix isused as a representation of quantum states, ρ ismapped to

∑iKiρK

†i . On the other hand, when

a state vector is used as a representation of quan-tum states, the i-th Kraus operator is chosen withprobability pi = |Ki |ψ〉 |2. Then, the state vector

is mapped toKi |ψ〉√

pi. Suppose that the number

of Kraus operators is k and each Kraus opera-tor acts on m-qubits, then in the worst case, the


arithmetic- and memory-operation costs becomeO(2n+mk). The CPTP-map can be created witha CPTP function.

One of the most general representations of allphysically achievable operations, including mea-surements, is the completely-positive (CP) in-strument. In Qulacs, this operation is the sameas a CPTP-map except that the index of thechosen Kraus operator is stored in the classicalregister of QuantumStateBase and can be usedlater. This map can be generated with a func-tion Instrument. The arithmetic- and memory-operation costs are the same as CPTP-map.

A CPTP-map is called unital when it maps amaximally mixed state to itself. A unital CPTP-map can be represented as a probabilistic applica-tion of unitary operations (i.e., for all i, Ki has aform Ki = √piUi, where pi is a real value and Ui

is unitary). Unlike a CPTP-map, the arithmetic-and memory-operation costs of unital maps de-crease to O(2n+m). The costs become indepen-dent of the number of Kraus operators since theprobability distribution for sampling a Kraus op-erator is independent of the input state and canbe given in advance. A unital CPTP-map cangenerated with Probabilistic function.

An adaptive map is one that acts on the quan-tum states only when a given classical conditionis satisfied. This map requires a Boolean func-tion that determines an output according to theclassical registers. Then, a map is applied onlywhen the returned value of the Boolean functionis True. This map is useful for treating feedbackand feedforward operations such as the heraldedoperation, readout initialization, measurement-based quantum computation, and look-up ta-ble decoder for quantum error correction. Thisgate can be generated with a function namedAdaptive. Its arithmetic- and memory-operationcosts are dependent on a given gate and the prob-ability that the given condition is satisfied.

Listing. 7 shows the example codes of these gen-eral maps. There are several other forms of gen-eral gates in Qulacs for a specific research pur-pose. See the online manual [3] for details.

1 from qulacs import StateVector2 from qulacs .gate import X, Y, Z, P0 , P13 from qulacs .gate import Instrument , CPTP

, Probabilistic , Adaptive4

5 state = StateVector (3)6 gate_list = [X(0) , Y(0) , Z(0)]7

8

9 # Update a quantum state with a CPTP map10 gate_list = [P0 (0) , P1 (0)]11 cptp_gate = CPTP( gate_list )12 cptp_gate . update_quantum_state (state)13

14 # Update a quantum state with a CPinstrument

15 classical_register = 016 gate_list = [P0 (0) , P1 (0)]17 inst_gate = Instrument (gate_list ,

classical_register )18 inst_gate . update_quantum_state (state)19

20 # Get and set values in the classicalregister

21 value = state. get_classical_value (classical_register )

22 state. set_classical_value (classical_register , 1-value)

23

24 # Update a quantum state with a unitalgate

25 gate_list = [X(0) , Y(0) , Z(0)]26 prob_list = [0.2 , 0.3, 0.1]27 prob_gate = Probabilistic (prob_list ,

gate_list )28 prob_gate . update_quantum_state (state)29

30 # Update a quantum state with anadaptive gate

31 def func( classical_register_list ):32 return classical_register_list [0] ==

033

34 gate = X(0)35 adap_gate = Adaptive (gate , condition =

func)36 adap_gate . update_quantum_state (state)

Listing 7: An example Python program that appliesgeneral quantum maps to quantum states.

4.3.4 Named gate

Since quantum gates with several specific gatematrices and Kraus operators are frequently usedin quantum computing research, functions togenerate these gates are defined. TABLE.1lists these named gates. Although some ofthese function calls are simply redirected tothe definition as quantum maps, the followinggates are redirected optimized update functions:X,Y,Z,H,CNOT,SWAP,CZ. Thus, when these quan-tum gates are used, they should be instantiatedusing these functions.


Category Name DescriptionSingle-qubit gate X Pauli-X gate

Y Pauli-Y gateZ Pauli-Z gate

sqrtX π/4 rotation of Pauli-X gatesqrtXdag −π/4 rotation of Pauli-X gate

sqrtY π/4 rotation of Pauli-Y gatesqrtYdag −π/4 rotation of Pauli-Y gate

S π/4 rotation of Pauli-Z gateSdag −π/4 rotation of Pauli-Z gate

T π/8 rotation of Pauli-Z gateTdag −π/8 rotation of Pauli-Z gate

H Hadamard gateTwo-qubit gate CNOT Controlled-NOT gate

CZ Controlled-Z gateSWAP SWAP gate

Three-qubit gate TOFFOLI TOFFOLI gateFREDKIN FREDKIN gate

Single-qubit RX Pauli-X rotation: exp(iθX/2)rotation gate RY Pauli-Y rotation: exp(iθY/2)

RZ Pauli-Z rotation: exp(iθZ/2)U1 Rotate phase of LO (Local oscillator)U2 Rotate phase of LO with single π/2-pulseU3 Rotate phase of LO with two π/2-pulses

Projection and P0 Projection matrix to |0〉 statemeasurement P1 Projection matrix to |1〉 state

Measurement Single qubit measurement with Z-basisNoise BitFlipNoise Probabilistic Pauli-X operation

DephasingNoise Probabilistic Pauli-Z operationDepolarizingNoise Single-qubit uniform depolarizing noise

TwoQubitDepolarizingNoise Two-qubit uniform depolarizing noiseAmplitudeDampingNoise Single-qubit amplitude damping noise

Table 1: Partial listing of named gates in Qulacs. These are all defined in the qulacs.gate module. Several gateshave optimized functions, while others are alias to quantum gates. For the definition of U1, U2, and U3, see thereference of IBMQ [56]

4.4 Quantum circuit

In Qulacs, a quantum circuit is representedas a simple array of quantum gates. TheQuantumCircuit class instantiates quantum cir-cuits. The add_gate function inserts a quan-tum gate to a circuit with a given position.If a position is not given, the gate is ap-pended to the last of a quantum circuit. Theremove_gate function removes a quantum gateat a given position. By calling a member function

update_quantum_state, all the contained quan-tum gates are applied to a given state sequen-tially.

When users need to treat paramet-ric quantum gates and circuits, theParametricQuantumCircuit class shouldbe used, which provides functions to treatparameters in quantum circuits. By addingParametricRX, ParametricRY, ParametricRZ,and PrametricPauliRotation gates with theadd_parametric_gate function, their rota-


tion angles can be set and varied with theget_parameter and set_parameter functions.Listing. 8 shows some examples.

1 from qulacs import StateVector ,QuantumCircuit ,ParametricQuantumCircuit

2 from qulacs .gate import DenseMatrix , H,CNOT

3 from qulacs .gate import ParametricRX ,ParametricRY , ParametricPauliRotation

4

5 # Create a quantum circuit and addquantum gates

6 n = 57 circuit = QuantumCircuit (n)8 circuit . add_gate ( DenseMatrix ([1] ,

[[0 ,1] ,[1 ,0]]))9 for index in range(n):

10 circuit . add_gate (H(index))11

12 # Insert quantum gates into a givenposition

13 position = 114 circuit . add_gate (CNOT (2 ,3) , position )15

16 # Remove a quantum gate at a position17 position = 018 circuit . remove_gate ( position )19

20 # Compute the depth of the quantumcircuit

21 depth = circuit . calculate_depth ()22

23 # Get the number of quantum gates24 gate_count = circuit . get_gate_count ()25

26 # Get a copy of the quantum gate at a

position27 position = 028 gate = circuit . get_gate ( position )29

30 # Update a state vector with the quantumcircuit

31 state = StateVector (n)32 circuit . update_quantum_state (state)33

34 # Create parametric quantum circuit35 par_circuit = ParametricQuantumCircuit (n

)36 par_circuit . add_parametric_gate (

ParametricRX (0, 0.1))37 par_circuit . add_parametric_gate (

ParametricRY (1, 0.1))38 par_circuit . add_parametric_gate (

ParametricPauliRotation ([0 ,1] , [1 ,1],0.1))

39

40 # Get the number of parameters in thequantum circuit

41 par_count = par_circuit .get_parameter_count ()

42

43 # Get and set a parameter at a position

44 index = 045 angle = 0.246 value = par_circuit . get_parameter (index)47 par_circuit . set_parameter (index , angle)48

49 # Get the position of a parametric gatefrom the index of a parameter

50 position = par_circuit .get_parametric_gate_position (index)

Listing 8: An example Python program that generatesquantum circuits and parametric ones.

4.5 Observable

In quantum physics, physical values are obtained as an expectation value of a self-adjoint operatornamed observable. In Qulacs, any observable O is represented as a linear combination of Pauli ma-trices with real coefficients, i.e., O =

∑P αPP where αP ∈ R. A Pauli term in observable αPP is

constructed with the PauliOperator class. An observable is generated with the Observable class.The add_operator function adds a Pauli term to an observable. Then, the get_expectation_valuefunction computes the expectation value of an observable according to a given quantum state, and theget_transition_amplitude function computes the transition amplitude of an observable accordingto two quantum states.

A Hamiltonian is an observable for the energy of a quantum system, and a unitary operator for timeevolution is described as an exponential of the Hamiltonian operator with an imaginary coefficient.The add_observable_rotation_gate function adds a set of quantum gates for simulating the timeevolution under a given Hamiltonian, which is generated with the Trotter decomposition, to a quantumcircuit.

Expectation values or transition amplitudes of Hamiltonian of molecules are frequently studiedin the field of NISQ applications [41, 57, 58]. They are usually represented as a linear combi-nation of products of fermionic operators. They can be converted to an observable with Pauli


operators using the Jordan-Wigner transformation [59] or the Bravyi-Kitaev transformation [60],which are implemented in OpenFermion [54]. The format of OpenFermion is loaded with thecreate_quantum_operator_from_openfermion_text function, which allows the output of Open-Fermion to be interpreted as a format of Qulacs. The GeneralQuantumOperator class can be used togenerate observables that are not self-adjoint. Listing. 9 shows example codes for treating observablesand evaluating expectation values.

1 from qulacs import Observable , PauliOperator , StateVector , QuantumCircuit2 from qulacs . quantum_operator import create_quantum_operator_from_openfermion_text3

4 # Construct a Pauli operator5 coef = 2.06 Pauli_string = "X 0 X 1 Y 2 Z 4"7 pauli = PauliOperator ( Pauli_string , coef)8

9 # Create an observable acting on n qubits10 n = 511 observable = Observable (n)12 # Add a Pauli operator to the observable13 observable . add_operator (pauli)14 # or directly add it with coef and str15 observable . add_operator (0.5 , "Y 1 Z 4")16

17 # Get the number of terms in the observable18 term_count = observable . get_term_count ()19

20 # Get the number of qubit on which the observable acts21 qubit_count = observable . get_qubit_count ()22

23 # Get a specific term as PauliOperator24 index = 125 pauli = observable . get_term (index)26

27 # Calculate the expectation value <a|H|a>28 state = StateVector (n)29 state. set_Haar_random_state (0)30 expect = observable . get_expectation_value (state)31

32 # Calculate the transition amplitude <a|H|b>33 bra = StateVector (n)34 bra. set_Haar_random_state (1)35 trans_amp = observable . get_transition_amplitude (bra , state)36

37 # Create a quantum circuit to simulate38 # the time evolution by a given observable39 # Observable is Trotterized with given slice count.40 circuit = QuantumCircuit (n)41 angle = 0.142 t_slice = 10043 circuit . add_observable_rotation_gate (obs , angle , t_slice )44 circuit . update_quantum_state (state)45

46 # Load an observable from OpenFermion text47 open_fermion_text = """48 ( -0.8126100000000005+0 j) [] +49 (0.04532175+0 j) [X0 Z1 X2] +50 (0.04532175+0 j) [X0 Z1 X2 Z3] +51 (0.04532175+0 j) [Y0 Z1 Y2] +52 (0.04532175+0 j) [Y0 Z1 Y2 Z3] +53 (0.17120100000000002+0 j) [Z0] +54 (0.17120100000000002+0 j) [Z0 Z1] +55 (0.165868+0 j) [Z0 Z1 Z2] +56 (0.165868+0 j) [Z0 Z1 Z2 Z3] +


57 (0.12054625+0 j) [Z0 Z2] +58 (0.12054625+0 j) [Z0 Z2 Z3] +59 (0.16862325+0 j) [Z1] +60 ( -0.22279649999999998+0 j) [Z1 Z2 Z3] +61 (0.17434925+0 j) [Z1 Z3] +62 ( -0.22279649999999998+0 j) [Z2]63 """64 obs_of = create_quantum_operator_from_openfermion_text ( open_fermion_text )

Listing 9: An example Python program that generates and evaluates observables.

5 Optimizations

In this section, we discuss possible performancebottlenecks in quantum simulation, and discussseveral optimization techniques for different com-puting devices.

5.1 Background

Since an application of a quantum gate is a largenumber of simple iterations, a time for circuitsimulation can be roughly estimated as the sumof constant-time overheads for invoking functionsand times for processing arithmetic and memoryoperations. Several factors such as an overheadto call C++ function via python interfaces, func-tions with parallelization, and GPU kernels in-cur additional overheads in computation. Theseadditional overheads are quantitatively discussedlater in Sec. 6. The times for processing arith-metic and memory operations are determined bythe number of operations divided by throughput.For the time for arithmetic operations, the num-ber of operations that can be processed in a unitsecond is vital, which is known as floating-pointoperations per second (FLOPS). To process com-plex numbers in the CPU, we need to load com-plex numbers representing quantum states fromthe CPU cache or main RAM to the CPU reg-isters. The size of data per unit time that wecan transfer between a memory and processor iscalled the bandwidth of the memory. Let the timefor the additional overheads be Tover, the numberof arithmetic operations be Ncom, the number ofmemory operations be Nmem, the FLOPS of CPUbe VFLOPS, and the memory bandwidth, i.e., thenumber of complex numbers which we can trans-fer, be VBW. Tover is determined by a design ofa library, Nmem and Ncom are determined by aquantum gate to apply, and VFLOPS and VBW aredetermined by a computing device. Then, an ap-proximate total time for applying a quantum gate

Tgate is lower bounded as

Tgate ≥ Tover + max(Ncom/VFLOPS, Nmem/VBW)(10)

when computation and memory operations do notblock each other. While actual processing is notnecessarily simplified to this equation, estimationwith this equation works well when we develop aquantum circuit simulator.

To develop a quantum circuit simulator satis-fying the demands shown in Sec. 3, we can findseveral basic directions from this equation. In thecase of a small number of qubits, Tover becomesa dominant factor. Thus, the pre- and post-processing for applying quantum gates should beminimized. In Qulacs, every core function is de-signed to minimize overheads. When the numberof qubits n increases, values of Ncom/VFLOPS andNmem/VBW grow exponentially to n, and theybecome larger than the overhead Tover. In thisregion, the values Ncom/VFLOPS and Nmem/VBWshould be minimized. If there are two ways toupdate quantum states, and if one has smallerNcom and Nmem than the other, the first oneshould be chosen to minimize Tgate. To thisend, Qulacs provides several specialized updatefunctions that utilize the structure of gate ma-trices as introduced in Sec. 4. In this section,we show four additional techniques to minimizeTgate when Ncom/VFLOPS and Nmem/VBW aredominant: SIMD Optimization, multi-threadingwith OpenMP, quantum circuit optimization, andGPU acceleration.

5.2 SIMD optimization

Recent processors support SIMD (single-instruction, multiple data) instructions, whichcan apply the same operation to multiple datasimultaneously. Qulacs utilizes instructionsnamed Intel AVX2 [61], in which up to 256-bit


data can be processed simultaneously. Whena quantum state is represented as an array ofdouble-precision real values, we can load, store,and process four real values (i.e. two complexnumbers) simultaneously. Thus, the use of AVX2can reduce the number of instructions Ncom by afactor of four at most. In Qulacs, several updatefunctions are optimized with AVX2 instructionsby hand. When Qulacs is being installed, theinstaller checks if a system supports such afeature. If it is supported, the library is builtwith AVX2 instructions enabled.

Here, we discuss our SIMD optimization tech-niques using dense matrix gates. However, it isworth noting that our techniques are applicableto the other quantum gates. The naive imple-mentation of dense matrix gates is shown in List-ing. 4. We implemented two SIMD versions of theB1-loop. When all the indices of the target qubitsare large, it is not possible to SIMDize it as it isbecause the state vector elements required in oneiteration of the B0-loop are scattered across non-contiguous memory locations. However, since thevalue of the adjacent memory location is alwaysloaded in the next iteration, we unroll the B0-loop according to the number of target qubits toenable AVX2’s SIMD load/store operations. Onthe other hand, when there is a target qubit witha small index, we can SIMDize it without suchan unrolling because the required state vector el-ements are already adjacent. Note that the over-head of enumerating B0 and B1 is not negligiblewhen the number of target qubits m is small. Wereduce the cost of the enumeration of B0 and B1as follows: Instead of listing all the items in B0beforehand, the i-th element of B0 is computedfrom the index i in each iteration using bit-wiseoperation techniques. In contrast, the list of B1 iscomputed before the B0-loop since the size of B1is typically small. This implementation is usefulfor parallelizing iterations with multi-threadingby OpenMP, which is explained in the Sec. 5.3.When a gate matrix has a structure and basicgates other than a dense matrix can be utilized,an updated state vector can be calculated withoutmatrix-vector multiplication, and thus a differentoptimization is applied.

5.3 Multi-threading with OpenMP

Since recent CPUs contain multiple processingcores, executing iterations in update and eval-

uation functions in parallel can increase the ef-fective instruction throughput VFLOPS. The useof multiple cores is effective particularly whenFLOPS is a performance bottleneck (compute-bound), and each core has a certain amount ofworkload. Qulacs parallelizes the execution ofupdate functions using OpenMP directives [62].The number of threads used in these func-tions can be controlled with environment vari-able OMP_NUM_THREADS. The naive implementa-tion shown in Listing. 4 consists of two loops:B0-loop and B1-loop. In Qulacs, the B0-loop isparallelized to maximize the amount of workloadfor each core and to minimize the overhead in-curred by multi-threading. Specifically, the par-allelized loop iterates over [0..2n−m − 1] and theiteration space is chunked into T chunks, whereT is the number of threads. In the loop body, thei-th element of B0 is computed on-the-fly fromthe loop index, which enables the even distribu-tion of workload across threads. In contrast, thelist of B1 is created before executing the parallelloop. While the data of B1 is accessed from ev-ery thread, its overhead is expected to be not sohigh because B1 is not updated during the loopand its size |B1| = 2m is typically small enoughto store within CPU registers or caches. A bufferfor a temporal state vector (temp_vector in List-ing. 4) is a thread-local array that can be readand written by each thread independently. Thus,a buffer space for T × 2m data is allocated be-fore the B0-loop, so that each 2m block can beused by a corresponding thread, and is deallo-cated at the end of the parallel loop. Note thatwhen n and m are small, the amount of workloadfor each thread becomes small. In that case, theoverhead due to multi-threading becomes largerthan the speed-up by multi-threading. Therefore,Qulacs automatically disables multi-threading ifthe number of qubits n is smaller than a thresh-old value even when OMP_NUM_THREADS is set to2 or more. This threshold value is empiricallydetermined according to the number of m and atype of basic gates.

5.4 Circuit optimization

When the SIMD and multi-threading withOpenMP are enabled, a time for processing arith-metic operations Ncom/VFLOPS becomes smallerthan that for processing memory operationsNmem/VBW, and a computing time Tgate is de-


voted to transferring data between the mainRAM or CPU cache and the CPU. Circuit opti-mization is a technique to trade Nmem and Ncom,i.e., we can reduce Nmem by sacrificing Ncom witha circuit optimization technique. For simplicity,we suppose the case when we are given a quan-tum circuit that consists only of dense matrixgates. As we discussed in Sec. 4, the number ofarithmetic operations for a dense matrix gate act-ing on a set of qubits M increases as 2n+|M | andthat of memory operations as 2n. We suppose asituation where we need to apply two successivedense matrix gates which act on a set of qubitsM1 and M2, where |M2| ≤ |M1| � n. Here, wecan merge these two quantum gates to a singlequantum gate and apply it to a quantum stateinstead of applying two gates one by one. Then,the number of arithmetic operations changes from2n+|M1| + 2n+|M2| to 2n+|M1∪M2|, and the num-ber of memory operations changes from 2n+1 to2n. Note that while there is a cost for comput-ing a gate matrix of a merged quantum gate,the complexity for computing the gate matrix isat most O(23|M1∪M2|), which is negligible com-pared with the cost for applying quantum gatesto quantum states when |M1 ∪M2| � n. Thus,we can halve the number of memory operationsby multiplying that of arithmetic operations by2|M1∪M2|−|M1|. This means whenM2 is a subset ofM1, we can decrease the number of memory oper-ations with a negligible penalty. Even if there is apenalty, we should perform merge operations un-til Nmem/VBW is balanced to Ncom/VFLOPS. Theoptimal size of merged quantum gates is relevantto a value of VBW divided by VFLOPS, which iscalled a BF ratio. Although the BF ratio variesdepending on a chosen CPU and memory andthe best strategy is dependent on the structureof given quantum circuits, it is typically optimalto merge quantum gates until every quantum gateacts on at most two qubits in the case of randomquantum circuits.

To minimize the time for simulating quantumcircuits with this technique, Qulacs provides thegate.merge function to create a merged quantumgate from two basic gates. The merge_all func-tion of QuantumCircuitOptimizer class mergesall the basic quantum gates in a given quantumcircuit to a single gate. While Qulacs expects thatthis kind of optimization should be performed bythe user according to the tasks, Qulacs provides

two strategies to quickly merge several quantumgates in quantum circuits: light and heavy opti-mizations. The light optimization finds a pair ofgates such that they are neighboring and targetqubits of one gate is a subset of the other witha greedy algorithm and merge it. As discussed,such a pair of gates can be merged without anypenalty. By contrast, the heavy optimizationmerges two quantum gates when the followingconditions are satisfied; two gates can be movedto neighboring positions by repetitively swap thecommutative quantum gates, and the number oftarget qubits of the merged quantum gate is notlarger than the given block size. While the heavyoptimization can decrease the number of quan-tum gates compared to the light optimization, itconsumes a longer time for optimization. Whenwe measure a time for optimization as a part ofthe simulation time, overall performance bene-fits from the heavy optimization varies dependingon the structure of quantum circuits and com-puting devices. Note that every quantum gatekeeps the information about commutable Pauli-basis for each qubit index. For example, a CNOTgate can commute with the Pauli-Z basis at thecontrolled qubit and with the Pauli-X basis atthe target qubit. By utilizing this information,the heavy optimization can check whether twoquantum gates can commute or not quickly. Notethat several quantum gates such as CPTP-mapsand parametric quantum gates cannot be mergedin the optimization process. Listing. 10 shows ex-ample codes for circuit optimization.

1 from qulacs import StateVector ,QuantumCircuit

2 from qulacs .gate import RandomUnitary ,CNOT , merge

3 from qulacs . circuit importQuantumCircuitOptimizer

4 n = 45 gate1 = RandomUnitary ([0 ,1])6 gate2 = RandomUnitary ([2 ,1])7

8 # Create a merged gate9 merged_gate = merge(gate1 , gate2)

10

11

12 # Create an example circuit13 layer_count = 514 circuit = QuantumCircuit (n)15 for layer_index in range( layer_count ):16 for index in range(n):17 circuit . add_gate ( RandomUnitary ([

index ]))18 for index in range( layer_index %2, n-1,


2):19 circuit . add_gate (CNOT(index , index

+1))20

21 for index in range(n):22 circuit . add_gate ( RandomUnitary ([ index

]))23

24 # Optimize the quantum circuit25 optimizer = QuantumCircuitOptimizer ()26 ## Merge all the gates to a single

unitary27 whole_unitary = optimizer . merge_all (

circuit )28 ## Light optimization29 circuit_opt1 = circuit .copy ()30 optimizer . optimize_light ( circuit_opt1 )31 ## Heavy optimization32 circuit_opt2 = circuit .copy ()33 optimizer . optimize ( circuit_opt2 ,

block_size =3)

Listing 10: An example Python program that performscircuit optimization.

5.5 GPU acceleration

A GPU is a computing device that has a highermemory bandwidth than a CPU, and it also hashigher peak performance than a CPU since aGPU has a large number of processing cores.Therefore, it is possible that a quantum simula-tion on a GPU is significantly faster than that ona CPU in certain cases. Also, since a GPU hasseveral types of memories with different perfor-mance characteristics, it is important to considerhow to access and where to place state vectorsand gate matrices for achieving the performanceclose to the peak FLOPS. Here, we discuss op-timization techniques for NVIDIA GPUs, How-ever, our techniques should apply to other GPUssuch as AMD GPUs. In NVIDIA GPUs, thereare six types of memories in a GPU: registers,shared memory, local memory, constant memory,texture memory, and global memory. Qulacs usesregisters, shared memory, constant memory, andglobal memory for calculation. The global mem-ory has the largest capacity but has the highestlatency and lowest bandwidth in the memories.By contrast, the registers can be accessed withthe lowest latency in the memories but their ca-pacity is limited. The shared memory is a mem-ory that is shared by and synchronized amongthreads in a block. This memory has a largercapacity than the registers and has higher band-width and lower latency than the global mem-

ory. The constant memory is a memory that isread-only for GPU kernels and writable from aCPU host. Since memory accesses to the con-stant memory are cached, we can use the con-stant memory as a read-only memory of whichthe bandwidth and latency are almost the sameas those of the registers and the capacity is largerthan the registers. In our implementation, awhole state vector is allocated in the global mem-ory since the size of the other memories is typi-cally not sufficient for storing the state vector. Ineach iteration, a gate matrix, which consists of4m complex numbers, and a temporal state vec-tor, which consists of 2m complex numbers of thewhole state vector (i.e., a variable temp_vectorin Listing. 4), should be temporally stored in ahigh-bandwidth memory to minimize the time formemory operations. To this end, in Qulacs, mem-ories used for storing a gate matrix and a tempo-ral state vector is chosen according to the numberof target qubits. The memory used for storing atemporal state vector is chosen as follows: Whenthe number of target qubits is no more than four,we expect that a temporal state vector is fetchedfrom the global memory to registers and arith-metic operations are performed in each thread.When the number of target qubits is between fiveto eleven, a temporal state vector is loaded to theshared memory since it must be synchronized inthe block. Otherwise, a temporal state vectoris allocated on the global memory, and matrix-vector multiplication is performed with that. Thememory used for storing a gate matrix is chosenas follows: If the number of the target qubits isno more than three, it tends to be placed in regis-ters by the compiler by giving elements of the gatematrix as arguments of the function call. Whenthe number of target qubits is four or five, weuse constant memory since a gate matrix is com-mon and constant for all the threads. When thenumber of qubits is six or more, gate matrices arestored in the global memory.

A state vector can be allocated on a GPU withthe StateVectorGpu class. When the computingnode has multiple GPUs, the index of GPU to al-locate state vectors can be identified with the sec-ond argument of the constructor function. Once astate vector is allocated on a GPU, every process-ing on it is performed with the selected GPU un-less a state vector is converted to StateVector.Hence, multiple tasks can be simulated with mul-


tiple GPUs simultaneously. Listing. 9 shows ex-ample codes for GPU computing.

1 from qulacs import StateVectorGpu ,StateVector

2 from qulacs .gate import H, CNOT3

4 # Create a state vector within the 0-thGPU.

5 n = 106 gpu_id = 07 state = StateVectorGpu (n, gpu_id )8

9 # Apply Hadamard and CNOT gates toStateVector on GPU

10 H(0). update_quantum_state (state)11 CNOT (0 ,1). update_quantum_state (state)12

13 # Load the state vector on GPU to mainRAM

14 state_cpu = StateVector (n)15 state_cpu .load(state)16

17 # Get the state vector as numpy array18 numpy_array = state. get_vector ()

Listing 11: An example Python program that createsand manipulates quantum states within a GPU.

6 Numerical performance

Here, we compare the simulation times for sev-eral computing tasks with Qulacs using vari-ous types of settings. For the benchmark, weuse Qulacs 0.2.0 with Python 3.8.5 on Ubuntu20.04.1 LTS. Qulacs is is compiled with GNUCompiler Collection (GCC) 9.3.0 and with theoptions: -O2 -mtune=native -march=native-mfpmath=both. Benchmarks are conducted us-ing a workstation with two CPUs and a processorname of Intel(R) Xeon(R) Platinum 8276 CPU@ 2.20 GHz, which has 28 physical cores. Thus,there are 56 physical cores in total. The cachesize of each CPU is 39,424 KB. GPU benchmarksare performed with NVIDIA Tesla V100 SXM232GB, where the driver version is 440.100 andthe CUDA version is 10.2. Since CUDA 10.2 isnot compatible with GCC 9.3.0, binaries for GPUbenchmarks are compiled with GCC 8.4.0. Thefollowing benchmarks are performed with Pythonunless specified otherwise. To evaluate the execu-tion times of Python functions, we use pytest [7]with the default settings and use the minimumexecution time for several runs. For C++ func-tions, we measure time with the std::chrono li-brary. We repeat functions until the sum of times

is 1 second or a function is executed 103 times.Then we take use average time for the benchmark.

6.1 Performance of basic gates

First, we compare the times for applying basicgates via Python interfaces. Figure 2 shows thetime to apply gates as a function of the numberof total qubits. The post-fix of legends such asQ1 represents the number of target qubits. Notethat the time of the sparse matrix gate dependson the number of non-zero elements. We choose amatrix such that the top leftmost element is unityand the others are zero. All computational timesgrow exponentially with the number of qubits.When the number of qubits is small, the times forapplying each quantum gate converge to a certainvalue, which is because of the overhead for callingC++ functions from Python. This overhead isevaluated in Sec. 6.3.

The times for dense matrix gates also grow ac-cording to the number of target qubits m. Notethat the times for m = 1, 2 are much faster thanthose for m ≥ 3. This is because specific op-timization is performed for dense matrix gateswith m = 1, 2. They show almost the samevalues since the bottleneck in their simulationtimes is not due to arithmetic-operation costs butmemory-operation costs, which are independentof the number of target qubits.

As expected, the times for diagonal matrixgates, Pauli rotation gates, and Permutationgates are independent of the number of targetqubits m. Note that the time for diagonal matrixgate with m = 1 is much faster than the other msince specific optimization is performed for diago-nal matrix gates with m = 1. Thus, diagonal ma-trix gates should be used instead of dense matrixgates when the number of target qubits is large.Note that the times for permutation gates aremuch slower because a given reversible Booleanfunction is a Python function and its executiontime has a large overhead compared with a func-tion written in C++ language. Using the Qulacsas a C++ library should enhance the performanceof permutation gates.

For sparse gates, the simulation time decreasesaccording to the number of gates, as expected.Although the computational costs of controlledgates (i.e., Pauli-X, CNOT, TOFFOLI, and CC-CNOT where CCCNOT is a bit-flip gate con-trolled by three qubits) should decrease as the


5 10 15 20 25# of qubits

10 6

10 5

10 4

10 3

10 2

10 1

100

101

Tim

e [s

ec]

XCNOTTOFFOLICCCNOTDenseMatrix1QDenseMatrix2QDenseMatrix3QDenseMatrix4QDiagonalMatrix1QDiagonalMatrix2QDiagonalMatrix3QDiagonalMatrix4QSparseMatrix1QSparseMatrix2QSparseMatrix3QSparseMatrix4QPermutation1QPermutation2QPermutation3QPermutation4QPauliRotation1QPauliRotation2QPauliRotation3QPauliRotation4Q

Figure 2: Times for applying basic gates are plotted as a function of the total number of qubits.The post-fix such as 1Q represents the number of target qubits. CCCNOT means a Pauli-Xgate controlled by three qubits.

5 10 15 20 25# of qubits

10 6

10 5

10 4

10 3

10 2

10 1

100

Tim

e [s

ec]

DenseMatrix1QCPTP1QInstrument1QProbabilistic1Q

Figure 3: Times for applying general quantum maps are plotted as a function of the total numberof qubits.

number of control qubits increases, they do notshow a clear scaling. This is because Pauli-X andCNOT gates have SIMD optimized functions andbecause the bottleneck of their computing timesis due to the memory-operation cost rather thanthe arithmetic-operation cost. Note that thereare jumps at 17 and 22 qubits in the plots of con-trolled gates. This is because the memory sizeof the state vector exceeds the cache size, whichcauses discontinuous changes of the bandwidthfor memory operations.

6.2 Performance of general quantum maps

Next, we compare the simulation times for densematrix gates with general single-qubit quantummaps: CPTP-map, instrument, and probabilisticgates. We choose a quantum map such that oneof two dense matrix gates is chosen with a prob-ability of 0.5 as a benchmark target of a CPTP-map and probabilistic gates. We also choose theZ-basis measurement as that of an instrument.Figure 3 shows the benchmark results. The prob-abilistic map has a similar performance to thedense matrix gate since it only requires overheadto randomly draw a gate whose probability isknown in advance. On the other hand, CPTP-


map and instrument are about 10 times slower.This is because they must not only calculate theprobabilities of the Kraus operators but also al-locate and release a temporal buffer to computethem.

6.3 Overhead due to a function call fromPython

While Qulacs is written in C++ language, itsfunctions can be called from Python. However,there is an overhead for calling a C++ functionfrom Python. In the case of small quantum cir-cuits, this overhead is not negligible. To eval-uate this overhead, we compare the times forQulacs with the Python interfaces to those forQulacs without the Python interfaces. Figure 4shows the results. Although we made efforts tominimize the overhead due to Python interfaces,Qulacs still consumes about 0.3 µs to call C++functions from Python. This overhead is not neg-ligible when the number of qubits is below 10.Thus, using Qulacs as a C++ library should in-crease the speed up to about seven times whenthe number of qubits is small.

6.4 Speed-up by SIMD, OpenMP, and GPU

We then evaluate performance improvement us-ing the SIMD optimization, multi-threading withOpenMP, and GPU acceleration. We testthe following settings: single-thread withoutSIMD, single-thread with SIMD, multi-threadwith SIMD, and computing on a single GPU. Fig-ure 5 shows the times for applying dense matrixgates with several numbers of target qubits. TheSIMD variants are always faster than the exe-cution times without SIMD optimization. TheOpenMP variant shows better performance whenthe number of qubits is greater than about 14for m = 1, 2 and about 10 for m = 3, 4. Notethat since multi-threading increases the overhead,and this overhead is not negligible in the case ofa small number of qubits, Qulacs automaticallyuses the function without OpenMP in such cases.This is because the times for the OpenMP variantchanges at a certain point. The GPU variants sig-nificantly improve the computing time when thenumber of qubits is large, but it requires about10 µs overhead.

6.5 Parallelization efficiency

We discuss the parallelization efficiency of multi-threading with OpenMP. Figure 6 shows the ex-ecution time relative to the execution time ofsingle-thread run with its standard error. We de-note this ratio as parallelization efficiency, whichbecomes equal to the number of threads in thecase of the linear speed-up. This ideal line isplotted as black lines in the figure. Since the over-head of parallelization becomes dominant when nis small, Qulacs automatically disables the multi-threading when n is smaller than about 13. Thus,we performed benchmark from n = 15 to n = 25.While we show the case of two-qubit gates, itsbehavior is almost the same as the case of single-qubit gates. Note that this evaluation is per-formed on the two CPUs workstation, each ofwhich has 28 physical core, resulting in 56 phys-ical cores in total. We vary the number of avail-able threads from 1 to 56 with OMP_NUM_THREADS.

When the number of qubits is n = 15, theparallelization efficiency saturates around threadnumber t = 10, and then decreases as the numberof threads increases due to the overhead of par-allelization. As the number of qubits increases,the granularity of the arithmetic and memoryoperations becomes coarser, and the paralleliza-tion efficiency improves gradually. Then, the per-formance trend drastically changes at n = 22,where the performance improves beyond the lin-ear speed-up. This can be explained with the sizeof cache and state vector; Our benchmark ma-chine has two CPUs and each CPU has 40 MBL3 cache, and the size of the state vector is67 MB at n = 22, respectively. Thus, this isan exceptional situation where the whole statevector can reside in the L3 cache when we usetwo or more threads, assuming thread N goesto CPU N%2 (OMP_PLACES=socket). Thus, inthis situation, the effective bandwidth is super-linearly increased compared to the single-threadcase. While the amount of speed-up depends onthe index of quantum gates, we always observedsuper-linear speed-up at n = 22 due to this rea-son. To validate this explanation, we also per-formed the evaluation with OMP_PLACES=cores;OMP_PROC_BIND=close. These environment vari-ables force CPUs to use a single CPU until thenumber of threads is below the number of physi-cal cores per CPU, and use two CPUs the number


5 10 15 20 25# of qubits

10 7

10 6

10 5

10 4

10 3

10 2

10 1

100

Tim

e [s

ec]

Python DenseMatrix1QPython DenseMatrix2QPython DenseMatrix3QPython DenseMatrix4QCPP DenseMatrix1QCPP DenseMatrix2QCPP DenseMatrix3QCPP DenseMatrix4Q

Figure 4: Times of function calls for applying dense matrix gates via python and C++ arecompared.

5 10 15 20 25# of qubits

10 6

10 5

10 4

10 3

10 2

10 1

100

Tim

e [s

ec]

SingleThread NoSIMD DenseMatrix1QSingleThread NoSIMD DenseMatrix2QSingleThread NoSIMD DenseMatrix3QSingleThread NoSIMD DenseMatrix4QSingleThread SIMD DenseMatrix1QSingleThread SIMD DenseMatrix2QSingleThread SIMD DenseMatrix3QSingleThread SIMD DenseMatrix4QMultiThread SIMD DenseMatrix1QMultiThread SIMD DenseMatrix2QMultiThread SIMD DenseMatrix3QMultiThread SIMD DenseMatrix4QGPU DenseMatrix1QGPU DenseMatrix2QGPU DenseMatrix3QGPU DenseMatrix4Q

Figure 5: Times for applying dense matrix gates with several optimization settings are plottedas a function of the total number of qubits.

of threads exceeds it. Figure 7 shows the perfor-mance with these options. We see that there is noimprovement beyond the linear speed-up with asingle CPU, and the performance drastically im-proves from t > 28, which allows using the twoCPUs. This behavior agrees with our explana-tion.

When the number of qubits becomes largerthan n = 22, the parallelization efficiency quicklyreduces again. This is because the whole statevector cannot be stored in the cache in this re-gion, and communication between the main RAMand CPUs is required. This reduces the effectivebandwidth and the memory operation costs be-come dominant in the execution time. Since themulti-threading improves mainly the arithmetic

operation costs, the advantage of parallelizationbecomes small.

6.6 Circuit optimization

We evaluate the performance of two circuit opti-mization strategies. Here, we choose the followingrandom quantum circuits for the benchmark. Ann-qubit random quantum circuit consists of n lay-ers. In each layer, three random rotations (RZ, RX,RZ) act on each qubit, and controlled-Z gates areapplied to each neighboring qubit. Whether thestarting index of the controlled-Z gates is even orodd depends on the index of the layer. Finally,three random rotations act on each qubit. List-ing. 12 shows the source code to generate random


0 10 20 30 40 50# of threads

0

10

20

30

40

50

60

70

Para

lleliz

atio

n ef

ficie

ncy

(ratio

to si

ngle

thre

ad)

linear speedupDenseMatrix2Q n=15DenseMatrix2Q n=16DenseMatrix2Q n=17DenseMatrix2Q n=18DenseMatrix2Q n=19DenseMatrix2Q n=20DenseMatrix2Q n=21DenseMatrix2Q n=22DenseMatrix2Q n=23DenseMatrix2Q n=24DenseMatrix2Q n=25

Figure 6: Times for applying dense matrix gates with several numbers of gates are plotted as afunction of the total number of threads.

0 10 20 30 40 50# of threads

0

10

20

30

40

50

60

Para

lleliz

atio

n ef

ficie

ncy

(ratio

to si

ngle

thre

ad)

linear speedupDenseMatrix2Q n=15DenseMatrix2Q n=16DenseMatrix2Q n=17DenseMatrix2Q n=18DenseMatrix2Q n=19DenseMatrix2Q n=20DenseMatrix2Q n=21DenseMatrix2Q n=22DenseMatrix2Q n=23DenseMatrix2Q n=24DenseMatrix2Q n=25

Figure 7: Times for applying dense matrix gates with several numbers of gates are plotted as afunction of the total number of threads, where the setting of thread affinity is changed.

circuits.

1 import numpy as np2 from qulacs import QuantumCircuit3 from qulacs .gate import RX , RZ , CZ4 def generate_random_circuit ( nqubits : int

, depth: int) -> QuantumCircuit :5 qc = QuantumCircuit ( nqubits )6 for layer_count in range(depth +1):7 for index in range( nqubits ):8 angle1 = np. random .rand ()*np.pi*29 angle2 = np. random .rand ()*np.pi*2

10 angle3 = np. random .rand ()*np.pi*211 qc. add_gate (RZ(index , angle1 ))12 qc. add_gate (RX(index , angle2 ))13 qc. add_gate (RZ(index , angle3 ))14 if layer_count == depth:15 break16 for index in range( layer_conut %2,

nqubits -1, 2):

17 qc. add_gate (CZ(index , index +1))18 return qc

Listing 12: An Python program that generates randomquantum circuits for our benchmark.

We simulate these circuits by enabling SIMD op-timizations and disabling OpenMP. We choosea block size of two for the heavy optimizations.Figure 8 shows the simulation times for these cir-cuits, where the solid and dashed lines excludeand include the circuit optimization time, respec-tively. When the circuit optimization time is ig-nored, the performance of the heavy optimizationis slightly better than that of the light optimiza-tion. However, when the optimization time is in-cluded in the computing time, there is no advan-


4 6 8 10 12 14 16 18 20# of qubits

10 6

10 5

10 4

10 3

10 2

10 1

100

Tim

e [s

ec]

No OptimizationLight Optimization (+opt_time)Light OptimizationHeavy Optimization (+ opt time)Heavy Optimization

Figure 8: Times for simulating random quantum circuits are plotted with several strategies ofquantum circuit optimization. "+ opt time" in legend represents that its plot includes a time forcircuit optimization.

4 6 8 10 12 14 16 18 20# of qubits

10 5

10 4

10 3

10 2

10 1

Tim

e [s

ec]

No OptimizationLight Optimization (+opt_time)Light OptimizationHeavy Optimization (+ opt time)Heavy Optimization

Figure 9: Times for simulating random quantum circuits dominated by commutative gates areplotted with several strategies of quantum circuit optimization.

tage in the heavy optimization. Thus, the lightoptimization (or no optimization) is more suit-able in cases when a merged circuit is used onlya few times.

Since the heavy optimization looks for pairsof quantum gates that can be merged consider-ing commutation relations of gates, it is effec-tive when there are many commutative gates inquantum circuits. When the three rotations (RZ,RX, RZ) are replaced with a single rotation RZ, allgates in quantum circuits become commutative inthe Z-basis. In such a case, quantum circuits canbe compressed to a constant depth by consideringcommutation relations. Figure 9 shows the simu-lation times for these circuits. As expected, the

heavy optimization is effective in this case even ifthe optimization time is taken into account.

7 Comparison with existing simulators

In this section, we compare the performance ofQulacs with that of existing libraries under thesettings of single-thread, multi-thread, and withGPU acceleration. We create a benchmark frame-work based on Ref. [63], of which the benchmarkcodes are reviewed by contributors of each li-brary. This repository chooses the following ran-dom quantum circuits for the benchmark: Sup-pose we generate n-qubit random circuits. A


layer with random RZ, RX, RZ rotations on eachqubit is named a rotation layer. A layer withCNOT gates acting on the i-th qubit as a targetqubit and (i + 1 mod n)-th qubit as a controlqubit for (0 ≤ i < n) is named a CNOT layer. Ina random circuit, a rotation layer and a CNOTlayer are alternately repeated ten times, and arotation layer follows it. Note that the first RZrotations for all the qubits in the first rotationlayer and the last RZ rotations in the last rotationlayer are eliminated since they are meaningless ifa quantum state is prepared in |0〉⊗n and mea-sured in the Pauli-Z basis. See Refs. [63, 64] forsource codes for circuit generation. The bench-marks with CPU are performed with a worksta-tion with two CPUs and a processor name of In-tel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz onCentOS 7.2.1511. This workstation has 24 phys-ical cores in total. The benchmarks with GPUare performed with two CPUs and a processorname of Intel(R) Xeon(R) Silver 4108 CPU @1.80 GHz and with Tesla V100 PCIe 32GB onCentOS 7.7.1908. The benchmarks are performedwith double precision, i.e., each complex numberis represented with 128 bits. Qulacs is compiledwith the same options as those used in Sec. 6.Versions of quantum circuit simulators and re-lated libraries for CPU and GPU benchmarks arelisted in Table. 2 and Table. 3, respectively. Allthe simulators are installed with the latest stableversions as of November 2020. In benchmarks, atime for simulating a circuit to obtain the finalstate vector is evaluated, and a time for creatingquantum circuits is not included. If an evaluatedsimulator offers the option of enabling circuit op-timizations, we plot times both with and withoutquantum circuit optimization to discuss the ad-vantage of acceleration by circuit optimization.Our benchmark codes can be found at Ref. [64].

While we carefully make the comparison fair,it is non-trivial to compare all the libraries in theexactly same conditions since they are developedwith different purposes and designs. To clarifybenchmark settings, here we describe severalnotes on how we install or evaluate simulators.Qiskit [14] is a large software development kit,and Qiskit Aer is a component of Qiskit that im-plements fast simulation of quantum circuits. Weused a backend named StatevectorSimulatorfor the benchmark since it is expected to be thefastest for simulating typical random circuits

Library VersionGCC 9.2.0Python 3.7.9Julia 1.5.2NumPy 1.19.2MKL 2020.2TensoFlow 2.3.1Intel-QS [25, 26] see main textProjectQ [29] 0.5.1PyQuEST-cffi [30] 3.2.3.1Qibo [34] 0.1.2Qiskit [14] 0.23.1Qiskit Aer [14] 0.7.1Qiskit Terra [14] 0.16.1Qulacs 0.2.0qxelarator [27] 0.3.0Yao [32] 0.6.3

Table 2: A list of libraries and versions for CPU bench-mark

among the implemented simulators. In thedefault setting, we need to compile quantumcircuits with qiskit.compiler.transpile andqiskit.compiler.assemble to create a jobto call a core function of Qiskit Aer. A corefunction of Qiskit Aer is executed by submittingthe compiled job to a thread pool. However,the overheads for these processes are sometimeslonger than the time for simulation itself whenthe number of qubits is small. Thus, we elim-inate the times for these processes from thebenchmark and directly evaluate an executiontime of a core function. Qiskit Aer performscircuit optimization similar to our technique,which is called gate fusion in Qiskit Aer. In thefollowing discussion for Qiskit, we plot executiontimes with and without gate fusion, which canbe switched via flag enable_fusion. We notethat execution times of qiskit.execute arelonger than plotted ones. Intel-QS [25], whichis also known as qHiPSTER, is a C++ librarythat has two official repositories in GitHub. Inthe benchmark, actively maintained one [65]is used. In this repository, Intel-QS does nothave stable release, so we installed the latestmaster branch of which the latest commit hash isb625e1fb09c5aa3c146cb7129a2a29cdb6ff186a.Intel-QS is compiled with GCC and with SIMD


Library VersionGCC 7.3.0Python 3.7.9Julia 1.5.2NumPy 1.19.2MKL 2020.2NVIDIA driver 440.33.01CUDA 10.2Qiskit [14] 0.23.1Qiskit Aer [14] 0.7.1Qiskit Aer GPU [14] 0.7.1Qiskit Terra [14] 0.16.1Qulacs 0.2.0Yao [32] 0.6.3

Table 3: A list of libraries and versions for GPU bench-mark

optimization, i.e., compiled with the optionsCXX=g++ and IqsNative=ON. While Intel-QSis a C++ library, we use Intel-QS via pythoninterface for a fair comparison. QuEST [30]is a C++ library of which the python inter-faces are provided by another project namedPyQuEST-cffi. While QuEST itself providesacceleration by parallelization and GPU, we skipthe benchmark of QuEST with multi-thread andGPU acceleration since we cannot enable themvia the python interfaces. ProjectQ [29] raiseserrors to users when there is no measurement incircuits, which may cause an additional overheadin benchmarks. While we expect this overheadis constant, we note that an execution timeof ProjectQ may be faster than plotted whenquantum circuits have measurements. Qibo [34]is a library that supports GPU acceleration usingTensorFlow, and we expect its performance de-pends on that of TensorFlow. However, since thelatest TensorFlow is not compatible with CUDA10.2, we skip the GPU benchmark of Qibo. Inaddition, since we installed TensorFlow with pipcommands, TensorFlow does not support AVX2extension. Thus, the performance of Qibo in theCPU benchmark would be a few times fasterthan plotted ones by installing TensorFlow fromthe source. QX Simulator [27] is used with itspython interface named qxelerator, which acceptsa file with QASM-format [56] strings as an input,we generate a benchmark circuit, convert it to aQASM string, save it as a file, load it with qxel-

erator, and evaluated a time for simulation, i.e.,a time for executing qxelerator.QX.executefunction. Since QCGPU and qsim only supportsimulation with single precision, we did notperform the benchmarks for them. For all thebenchmark libraries, their execution times mayvary depending on the structure of quantumcircuits for benchmarks, library versions, compil-ers, the status of the CPU and GPU, simulationenvironment, etc. In particular, while executiontimes for single-thread simulation are stable,those for multi-thread and GPU fluctuate bya few percent each time we run a benchmarkscript. Thus, it should be noted that a fewpercent difference in multi-thread and GPUbenchmarks are meaningless. It should be alsonoted that several libraries among the above suchas Intel-QS, QuEST, QX Simulator, and Pro-jectQ support quantum circuit simulation withdistributed computing and are not necessarilyoptimized for the single-node performance.

First, we perform benchmarking with a sin-gle thread simulation with CPU. Figure 10 showsthe results. For Qulacs and Qiskit, the perfor-mance of them with and without circuit optimiza-tion is plotted as a bold line and broken line, re-spectively. Qulacs without optimization is fasterthan that with optimization when the number ofqubits is small. On the other hand, Qulacs withoptimization becomes faster than Qulacs withoutoptimization when the number of qubits is greaterthan 7 since the time for circuit optimizationbecomes negligible compared to the simulationtime. As far as we know, Qiskit with optimizationand Yao utilize the structure of quantum circuitsto improve the performance, and thus their per-formance overcomes that of Qulacs without cir-cuit optimization. Since Qiskit without optimiza-tion is slower than Qulacs without optimizationbut they are comparable with optimization, weguess circuit optimization by Qiskit is more time-consuming but near-optimal than that by Qulacs.

Second, we perform benchmarking with multi-thread computing. Figure 11 shows the results.Due to a small overhead of Qulacs, its executiontime is the fastest when the number of qubitsis small. When the number of qubits becomeslarge, several libraries without circuit optimiza-tion achieve almost the same performance. Whenmulti-threading is enabled, an execution time isdetermined by memory operations rather than


5 10 15 20 25# of qubits

10 5

10 4

10 3

10 2

10 1

100

101

102

Tim

e [s

ec]

QulacsQulacs with optYaoQiskitQiskit with optProjectQPyQuEST-cffiQiboIntel-QSqxelerator

Figure 10: Times for simulating random quantum circuits with a single thread using severallibraries.

5 10 15 20 25# of qubits

10 5

10 4

10 3

10 2

10 1

100

101

Tim

e [s

ec]

QulacsQulacs with optQiskitQiskit with optProjectQQiboIntel-QSqxelerator

Figure 11: Times for simulating random quantum circuits with parallelization using severallibraries.

arithmetic operations. Since the number of mem-ory operations is almost independent of the de-tail of implementation, it is natural that timesof several libraries converge to a certain value.When we compare the performance of librarieswith circuit optimization, Qulacs with optimiza-tion shows better performance than that with-out optimization above 9 qubits. Since Qiskitis expected to perform near-optimal circuit opti-mization, Qiskit shows better performance thanQulacs when the number of qubits is larger than21. Note that there is a small bump around 14qubits in the plot of Qiskit, which happens be-cause Qiskit enables multi-thread when the num-ber of qubits is more than 14 qubits in the defaultsetting, thus the times of Qiskit around 14 qubits

can be slightly faster by optimization of settings.Finally, we perform the benchmark with GPU

acceleration. Figure 12 shows the results. In theGPU benchmarking, Qulacs is also one of thefastest libraries. In the GPU simulation, the over-head due to circuit optimization becomes neg-ligible since there is a larger overhead due toGPU function calls. Therefore, the performanceof Qulacs with circuit optimization is always bet-ter than that without circuit optimization. Sincetimes for circuit optimization is negligible, we canfurther improve the performance of Qulacs by uti-lizing the heavy optimization. As far as we tried,the heavy optimization with the block size of fouris optimal. The performance with the heavy op-timization is plotted in the figure with the leg-


5 10 15 20 25# of qubits

10 3

10 2

10 1

100

Tim

e [s

ec]

QulacsQulacs with optQulacs with heavy optYaoQiskitQiskit with opt

Figure 12: Times for simulating random quantum circuits with GPU acceleration using severallibraries.

end "Qulacs with heavy opt". The performanceof Qulacs with the heavy optimization overcomesthat with the light optimization above 19 qubits.Note that the time for Qulacs with the heavy op-timization at n = 4 is significantly faster thanthe other number of qubits. This is because acircuit is merged to a single quantum gate by anoptimizer and a GPU function is called at once.

In conclusion, Qulacs is one of the fastest sim-ulators among existing libraries in several regionsthat are vital for researches. In particular, Qulacsshows significant speed-up when the number ofqubits is small, which is essential for exploringthe possibilities of quantum computing.

8 Conclusion and Outlook

Here, we introduced Qulacs, which is a fast andversatile simulator. First, we showed the ba-sic concept and intended usages of Qulacs. Sec-ond, we explained the library structure and pro-vided several examples. We optimized the up-date functions of Qulacs according to the prop-erties of gate matrices. We then utilized addi-tional optimization techniques such as SIMD op-timization, multi-threading with OpenMP, GPUacceleration, and circuit optimization for numeri-cal speed-up. We also showed concrete simulationtimes for several quantum gates and evaluatedspeed-up by optimization techniques. Finally, wecompared the performance of Qulacs with that ofthe existing libraries. With the benchmarks, we

showed our simulator has advantages in severalscenarios. Although Qulacs focuses on support-ing fundamental operations, we can use Qulacs toexplore simulations with many layers using it as abackend of other libraries, for example, Cirq [13]and OpenFermion [54].

When quantum circuits are constructed onlywith efficiently simulatable quantum gates suchas Clifford gates [66, 67] or matchgates [68–70],these circuits are efficiently simulated via special-ized algorithms. We plan to implement these al-gorithms and support faster simulations of quan-tum circuits in the future.

Acknowledgment

Yasunari Suzuki, Yoshiaki Kawase, Yuria Hiraga,Yuya Masumura, Masahiro Nakadai, and KeisukeFujii are the core contributors of Qulacs. Ten-nin Yan, Yohei Ibe, Toru Kawakubo, HirotsuguYamashita, Hikari Yoshimura, Jiabao Chen, andYouyuan Zhang have made significant efforts tomaintain manuals, repositories, and web sites.Ken M. Nakanishi, Kosuke Mitarai, Yuya O.Nakagawa, Shiro Tamiya, Takahiro Yamamoto,Ryosuke Imai, and Akihiro Hayashi providedmany essential comments and directions for im-proving the performance and usability of Qulacs.

Yasunari Suzuki would like to thank TysonJones for the fruitful discussion on parallel anddistributed computing, Xiu-Zhe Luo for the vitalcomments on vectorization, and Shinya Morino


for the advice on GPU acceleration. We wouldalso like to thank all the contributors and usersof Qulacs for supporting this project.

This work is supported by PRESTO, JST,Grant No. JPMJPR1916; ERATO, JST, GrantNo. JPMJER1601; MEXT Q-LEAP GrantNo. JPMXS0120319794, JPMXS0118068682, andJPMXS0120319794.

A C++ example codes

To show Qulacs can be used as a C++ libraryin almost the same way as Python, we show theexample codes of Qulacs in the C++ language.Listing. 13 shows an example procedure to cre-ate, modify, update, and release quantum statesusing quantum gates. This program outputs amessage shown in Listing. 14. As you can see,the names and design of API are almost the sameas the python library, and we can use Qulacs asa C++ library with small difference, e.g., com-plex matrices are supplied using Eigen instead ofNumPy, users have a responsibility to release al-located state vector, and so on. For more detailedexamples, see the online manual of Qulacs [3].

1 # include <vector >2 # include <complex >3 # include <Eigen/Core >4 # include <cppsim /state.hpp >5 # include <cppsim / gate_matrix .hpp >6 # include <cppsim / gate_factory .hpp >7

8 int main (){9 unsigned int num_qubit = 2;

10

11 // create state vector12 QuantumState * state = new QuantumState

( num_qubit );13 state -> set_computational_basis (2);14 QuantumState * sub_state = state ->copy

();15

16 std :: vector <std :: complex <double >>values = {0.5 , 0.5, 0.5, -0.5};

17 state ->load( values );18 state ->load( sub_state );19 state -> set_Haar_random_state (42);20

21 Eigen :: MatrixXcd gate_matrix (4 ,4);22 gate_matrix <<23 1, 0, 0, 0,24 0, 1, 0, 0,25 0, 0, 0, 1,26 0, 0, 1, 0;27 QuantumGateBase * dense_gate = gate ::

DenseMatrix ({0, 1}, gate_matrix );

28 dense_gate -> update_quantum_state (state);

29

30 QuantumGateBase * swap_gate = gate ::SWAP (0 ,1);

31 swap_gate -> update_quantum_state (state);

32 std :: cout << dense_gate << std :: endl;33 std :: cout << swap_gate << std :: endl;34 std :: cout << state << std :: endl;35

36 delete dense_gate ;37 delete swap_gate ;38 delete state;39 delete sub_state ;40 }

Listing 13: An example C++ program that initializesquantum states.

1 *** gate info ***2 * gate name : DenseMatrix3 * target :4 0 : commute5 1 : commute6 * control :7 * Pauli : no8 * Clifford : no9 * Gaussian : no

10 * Parametric : no11 * Diagonal : no12 * Matrix13 (1 ,0) (0 ,0) (0 ,0) (0 ,0)14 (0 ,0) (1 ,0) (0 ,0) (0 ,0)15 (0 ,0) (0 ,0) (0 ,0) (1 ,0)16 (0 ,0) (0 ,0) (1 ,0) (0 ,0)17

18 *** gate info ***19 * gate name : SWAP20 * target :21 0 : commute22 1 : commute23 * control :24 * Pauli : no25 * Clifford : yes26 * Gaussian : no27 * Parametric : no28 * Diagonal : no29

30 *** Quantum State ***31 * Qubit Count : 232 * Dimension : 433 * State vector :34 (0.343453 ,0.418285)35 ( -0.287486 , -0.663243)36 (0.309688 , -0.266554)37 ( -0.0511105 , -0.122348)

Listing 14: An output message of the program shown inListing. 13


References[1] Frank Arute, Kunal Arya, Ryan Bab-

bush, Dave Bacon, Joseph C Bardin, RamiBarends, Rupak Biswas, Sergio Boixo, Fer-nando GSL Brandao, David A Buell, et al.Quantum supremacy using a programmablesuperconducting processor. Nature, 574(7779):505–510, 2019. DOI: 10.1038/s41586-019-1666-5.

[2] Laird Egan, Dripto M Debroy, CrystalNoel, Andrew Risinger, Daiwei Zhu, De-bopriyo Biswas, Michael Newman, MuyuanLi, Kenneth R Brown, Marko Cetina,et al. Fault-tolerant operation of a quan-tum error-correction code. arXiv preprintarXiv:2009.11482, 2020.

[3] Qulacs website. https://github.com/qulacs/qulacs, 2018.

[4] Gaël Guennebaud, Benoît Jacob, et al. Eigenv3. http://eigen.tuxfamily.org, 2010.

[5] Wenzel Jakob, Jason Rhinelander, and DeanMoldovan. pybind11 – seamless operabil-ity between c++11 and python. https://github.com/pybind/pybind11, 2017.

[6] GoogleTest. https://github.com/google/googletest, 2019.

[7] Holger Krekel, Bruno Oliveira, RonnyPfannschmidt, Floris Bruynooghe, BriannaLaugher, and Florian Bruhin. pytestx.y. https://github.com/pytest-dev/pytest, 2004.

[8] Sergio Boixo, Sergei V Isakov, Vadim NSmelyanskiy, and Hartmut Neven. Simula-tion of low-depth quantum circuits as com-plex undirected graphical models. arXivpreprint arXiv:1712.05384, 2017.

[9] Igor L Markov and Yaoyun Shi. Simu-lating quantum computation by contract-ing tensor networks. SIAM Journal onComputing, 38(3):963–981, 2008. DOI:10.1137/050644756. URL https://doi.org/10.1137/050644756.

[10] Igor L Markov, Aneeqa Fatima, Sergei VIsakov, and Sergio Boixo. Quantumsupremacy is both closer and farther than itappears. arXiv preprint arXiv:1807.10749,2018.

[11] Sergey Bravyi and David Gosset. Im-proved classical simulation of quantumcircuits dominated by clifford gates. Phys.Rev. Lett., 116:250501, Jun 2016. DOI:

10.1103/PhysRevLett.116.250501. URLhttps://link.aps.org/doi/10.1103/PhysRevLett.116.250501.

[12] Sergey Bravyi, Dan Browne, Padraic Calpin,Earl Campbell, David Gosset, and MarkHoward. Simulation of quantum cir-cuits by low-rank stabilizer decompositions.Quantum, 3:181, September 2019. ISSN2521-327X. DOI: 10.22331/q-2019-09-02-181. URL https://doi.org/10.22331/q-2019-09-02-181.

[13] Quantum AI team and collaborators. Cirq,October 2020. URL https://doi.org/10.5281/zenodo.4062499.

[14] Héctor Abraham et al. Qiskit: An open-source framework for quantum computing,2019. URL https://doi.org/10.5281/zenodo.2562110.

[15] Robert S Smith, Michael J Curtis, andWilliam J Zeng. A practical quantum in-struction set architecture. arXiv preprintarXiv:1608.03355, 2016.

[16] Ville Bergholm, Josh Izaac, Maria Schuld,Christian Gogolin, Carsten Blank, KeriMcKiernan, and Nathan Killoran. Pen-nylane: Automatic differentiation of hy-brid quantum-classical computations. arXivpreprint arXiv:1811.04968, 2018.

[17] Krysta Svore, Alan Geller, Matthias Troyer,John Azariah, Christopher Granade, Bet-tina Heim, Vadym Kliuchnikov, MariiaMykhailova, Andres Paz, and Martin Roet-teler. Q#: Enabling scalable quantumcomputing and development with a high-level dsl. RWDSL2018, New York, NY,USA, 2018. Association for Computing Ma-chinery. ISBN 9781450363556. DOI:10.1145/3183895.3183901. URL https://doi.org/10.1145/3183895.3183901.

[18] Benjamin Villalonga, Sergio Boixo, BronNelson, Christopher Henze, Eleanor Rief-fel, Rupak Biswas, and Salvatore Mandrà.A flexible high-performance simulator forverifying and benchmarking quantum cir-cuits implemented on real hardware. npjQuantum Information, 5(1):86, Oct 2019.ISSN 2056-6387. DOI: 10.1038/s41534-019-0196-1. URL https://doi.org/10.1038/s41534-019-0196-1.

[19] Chase Roberts, Ashley Milsted, MartinGanahl, Adam Zalcman, Bruce Fontaine, Yi-


https://doi.org/10.1038/s41586-019-1666-5

https://doi.org/10.1038/s41586-019-1666-5

https://github.com/qulacs/qulacs

https://github.com/qulacs/qulacs

http://eigen.tuxfamily.org

https://github.com/pybind/pybind11

https://github.com/pybind/pybind11

https://github.com/google/googletest

https://github.com/google/googletest

https://github.com/pytest-dev/pytest

https://github.com/pytest-dev/pytest

https://doi.org/10.1137/050644756

https://doi.org/10.1137/050644756

https://doi.org/10.1137/050644756

https://doi.org/10.1137/050644756

https://doi.org/10.1103/PhysRevLett.116.250501


https://link.aps.org/doi/10.1103/PhysRevLett.116.250501

https://link.aps.org/doi/10.1103/PhysRevLett.116.250501

https://doi.org/10.22331/q-2019-09-02-181

https://doi.org/10.22331/q-2019-09-02-181

https://doi.org/10.22331/q-2019-09-02-181

https://doi.org/10.22331/q-2019-09-02-181

https://doi.org/10.5281/zenodo.4062499




https://doi.org/10.1145/3183895.3183901

https://doi.org/10.1145/3183895.3183901

https://doi.org/10.1145/3183895.3183901

https://doi.org/10.1145/3183895.3183901

https://doi.org/10.1038/s41534-019-0196-1

https://doi.org/10.1038/s41534-019-0196-1

https://doi.org/10.1038/s41534-019-0196-1

https://doi.org/10.1038/s41534-019-0196-1

jian Zou, Jack Hidary, Guifre Vidal, and Ste-fan Leichenauer. Tensornetwork: A libraryfor physics and machine learning. arXivpreprint arXiv:1905.01330, 2019.

[20] Matthew Fishman, Steven R White, andE Miles Stoudenmire. The ITensor SoftwareLibrary for Tensor Network Calculations.arXiv preprint arXiv:2007.14822, 2020.

[21] Benjamin Villalonga, Dmitry Lyakh, Ser-gio Boixo, Hartmut Neven, Travis S Hum-ble, Rupak Biswas, Eleanor G Rieffel, AlanHo, and Salvatore Mandrà. Establishingthe quantum supremacy frontier with a281 pflop/s simulation. Quantum Scienceand Technology, 5(3):034003, 2020. DOI:10.1088/2058-9565/ab7eeb. URL https://doi.org/10.1088/2058-9565/ab7eeb.

[22] Koen De Raedt, Kristel Michielsen,Hans De Raedt, Binh Trieu, GuidoArnold, Marcus Richter, Th Lippert,Hiroshi Watanabe, and Nobuyasu Ito.Massively parallel quantum computersimulator. Computer Physics Commu-nications, 176(2):121–136, 2007. DOI:10.1016/j.cpc.2006.08.007. URL https://doi.org/10.1016/j.cpc.2006.08.007.

[23] Hans De Raedt, Fengping Jin, DennisWillsch, Madita Willsch, Naoki Yosh-ioka, Nobuyasu Ito, Shengjun Yuan, andKristel Michielsen. Massively paral-lel quantum computer simulator, elevenyears later. Computer Physics Com-munications, 237:47–61, 2019. DOI:10.1016/j.cpc.2018.11.005. URL https://doi.org/10.1016/j.cpc.2018.11.005.

[24] Thomas Häner and Damian S Steiger. 0.5petabyte simulation of a 45-qubit quan-tum circuit. In Proceedings of the In-ternational Conference for High Perfor-mance Computing, Networking, Storageand Analysis, pages 1–10, 2017. DOI:10.1145/3126908.3126947. URL https://doi.org/10.1145/3126908.3126947.

[25] Gian Giacomo Guerreschi, JustinHogaboam, Fabio Baruffa, and Nico-las PD Sawaya. Intel Quantum Simulator:A cloud-ready high-performance simulatorof quantum circuits. Quantum Scienceand Technology, 5(3):034007, 2020. DOI:10.1088/2058-9565/ab8505. URL https://doi.org/10.1088/2058-9565/ab8505.

[26] Mikhail Smelyanskiy, Nicolas PD Sawaya,and Alán Aspuru-Guzik. qHiPSTER:The quantum high performance soft-ware testing environment. arXiv preprintarXiv:1601.07195, 2016.

[27] Nader Khammassi, Imran Ashraf, Xiang Fu,Carmen G Almudever, and Koen Bertels.QX: A high-performance quantum computersimulation platform. In Design, Automation& Test in Europe Conference & Exhibition(DATE), 2017, pages 464–469. IEEE, 2017.DOI: 10.23919/DATE.2017.7927034. URLhttps://doi.org/10.23919/DATE.2017.7927034.

[28] Nader Khammassi, Imran Ashraf, J vSomeren, Razvan Nane, AM Krol, M Adri-aan Rol, L Lao, Koen Bertels, and Carmen GAlmudever. OpenQL: A portable quantumprogramming framework for quantum accel-erators. arXiv preprint arXiv:2005.13283,2020.

[29] Damian S Steiger, Thomas Häner, andMatthias Troyer. ProjectQ: an open sourcesoftware framework for quantum comput-ing. Quantum, 2:49, 2018. DOI: 10.22331/q-2018-01-31-49. URL https://doi.org/10.22331/q-2018-01-31-49.

[30] Tyson Jones, Anna Brown, Ian Bush,and Simon C Benjamin. QuEST andHigh Performance Simulation of Quan-tum Computers. Scientific reports, 9(1):1–11, 2019. DOI: 10.1038/s41598-019-47174-9. URL https://doi.org/10.1038/s41598-019-47174-9.

[31] Quantum AI team and collaborators. qsim,September 2020. URL https://doi.org/10.5281/zenodo.4023103.

[32] Xiu-Zhe Luo, Jin-Guo Liu, Pan Zhang,and Lei Wang. Yao.jl: Extensible, Effi-cient Framework for Quantum AlgorithmDesign. Quantum, 4:341, October 2020.ISSN 2521-327X. DOI: 10.22331/q-2020-10-11-341. URL https://doi.org/10.22331/q-2020-10-11-341.

[33] Adam Kelly. Simulating quantum com-puters using OpenCL. arXiv preprintarXiv:1805.00988, 2018.

[34] Stavros Efthymiou, Sergi Ramos-Calderer,Carlos Bravo-Prieto, Adrián Pérez-Salinas,Diego García-Martín, Artur Garcia-Saez,José Ignacio Latorre, and Stefano Carrazza.


https://doi.org/10.1088/2058-9565/ab7eeb




https://doi.org/10.1016/j.cpc.2006.08.007








https://doi.org/10.1145/3126908.3126947

https://doi.org/10.1145/3126908.3126947

https://doi.org/10.1145/3126908.3126947

https://doi.org/10.1145/3126908.3126947

https://doi.org/10.1088/2058-9565/ab8505

https://doi.org/10.1088/2058-9565/ab8505

https://doi.org/10.1088/2058-9565/ab8505

https://doi.org/10.1088/2058-9565/ab8505

https://doi.org/10.23919/DATE.2017.7927034



https://doi.org/10.22331/q-2018-01-31-49

https://doi.org/10.22331/q-2018-01-31-49

https://doi.org/10.22331/q-2018-01-31-49

https://doi.org/10.22331/q-2018-01-31-49

https://doi.org/10.1038/s41598-019-47174-9

https://doi.org/10.1038/s41598-019-47174-9

https://doi.org/10.1038/s41598-019-47174-9

https://doi.org/10.1038/s41598-019-47174-9



https://doi.org/10.22331/q-2020-10-11-341

https://doi.org/10.22331/q-2020-10-11-341

https://doi.org/10.22331/q-2020-10-11-341

https://doi.org/10.22331/q-2020-10-11-341

Qibo: a framework for quantum simulationwith hardware acceleration. arXiv preprintarXiv:2009.01845, 2020. DOI: 10.5281/zen-odo.3997194. URL https://doi.org/10.5281/zenodo.3997194.

[35] Alberto Peruzzo, Jarrod McClean, PeterShadbolt, Man-Hong Yung, Xiao-Qi Zhou,Peter J Love, Alán Aspuru-Guzik, andJeremy L O’brien. A variational eigenvaluesolver on a photonic quantum processor. Na-ture communications, 5:4213, 2014. DOI:10.1038/ncomms5213. URL https://doi.org/10.1038/ncomms5213.

[36] Seth Lloyd. Universal quantum simu-lators. Science, pages 1073–1078, 1996.DOI: 10.1126/science.273.5278.1073. URLhttps://doi.org/10.1126/science.273.5278.1073.

[37] Suguru Endo, Iori Kurata, and Yuya ONakagawa. Calculation of the green’sfunction on near-term quantum comput-ers. Physical Review Research, 2(3):033281, 2020. DOI: 10.1103/PhysRevRe-search.2.033281. URL https://doi.org/10.1103/PhysRevResearch.2.033281.

[38] Kosuke Mitarai, Yuya O Nakagawa, andWataru Mizukami. Theory of analytical en-ergy derivatives for the variational quantumeigensolver. Physical Review Research, 2(1):013129, 2020. DOI: 10.1103/PhysRevRe-search.2.013129. URL https://doi.org/10.1103/PhysRevResearch.2.013129.

[39] Kosuke Mitarai, Tennin Yan, and KeisukeFujii. Generalization of the output of avariational quantum eigensolver by param-eter interpolation with a low-depth ansatz.Phys. Rev. Applied, 11:044087, Apr 2019.DOI: 10.1103/PhysRevApplied.11.044087.URL https://link.aps.org/doi/10.1103/PhysRevApplied.11.044087.

[40] Yuta Matsuzawa and Yuki Kurashige.Jastrow-type decomposition in quantumchemistry for low-depth quantum circuits.Journal of Chemical Theory and Com-putation, 16(2):944–952, 2020. DOI:10.1021/acs.jctc.9b00963. URL https://doi.org/10.1021/acs.jctc.9b00963.

[41] Hiroki Kawai and Yuya O. Nakagawa.Predicting excited states from ground statewavefunction by supervised quantum ma-chine learning. Machine Learning: Science

and Technology, 1(4):045027, oct 2020.DOI: 10.1088/2632-2153/aba183. URLhttps://doi.org/10.1088%2F2632-2153%2Faba183.

[42] Jakob Kottmann, Mario Krenn, Thi HaKyaw, Sumner Alperin-Lea, and AlánAspuru-Guzik. Quantum computer-aideddesign of quantum optics hardware. Quan-tum Science and Technology, 2021. DOI:10.1088/2058-9565/abfc94. URL https://doi.org/10.1088/2058-9565/abfc94.

[43] Yasunari Suzuki, Suguru Endo, and YuukiTokunaga. Quantum error mitigation forfault-tolerant quantum computing. arXivpreprint arXiv:2010.03887, 2020.

[44] Cirq-Qulacs. https://github.com/qulacs/cirq-qulacs, 2019.

[45] Seyon Sivarajah, Silas Dilkes, AlexanderCowtan, Will Simmons, Alec Edgington, andRoss Duncan. t|ket〉: A retargetable com-piler for NISQ devices. Quantum Scienceand Technology, 2020. DOI: 10.1088/2058-9565/ab8e92. URL https://doi.org/10.1088/2058-9565/ab8e92.

[46] Orquestra. https://orquestra.io/, 2020.[47] Jakob S. Kottmann and Sumner Alperin-

Lea, Teresa Tamayo-Mendoza, AlbaCervera-Lierta, Cyrille Lavigne, Tzu-ChingYen, Vladyslav Verteletskyi, AbhinavAnand, Matthias Degroote, Maha Kesebi,and Alán Aspuru-Guzik. tequila: Ageneralized development library for novelquantum algorithms. https://github.com/aspuru-guzik-group/tequila, 2020.

[48] Peter W Shor. Polynomial-time algo-rithms for prime factorization and dis-crete logarithms on a quantum computer.SIAM review, 41(2):303–332, 1999. DOI:10.1137/S0097539795293172. URL https://doi.org/10.1137/S0097539795293172.

[49] Craig Gidney and Martin Ekerå. How tofactor 2048 bit rsa integers in 8 hours us-ing 20 million noisy qubits. Quantum, 5:433, 2021. DOI: 10.22331/q-2021-04-15-433. URL https://doi.org/10.22331/q-2021-04-15-433.

[50] Ian D Kivlichan, Craig Gidney, Dominic WBerry, Nathan Wiebe, Jarrod McClean, WeiSun, Zhang Jiang, Nicholas Rubin, AustinFowler, Alán Aspuru-Guzik, et al. Im-proved fault-tolerant quantum simulation






https://doi.org/10.1038/ncomms5213




https://doi.org/10.1126/science.273.5278.1073



https://doi.org/10.1103/PhysRevResearch.2.033281








https://doi.org/10.1103/PhysRevApplied.11.044087

https://link.aps.org/doi/10.1103/PhysRevApplied.11.044087

https://link.aps.org/doi/10.1103/PhysRevApplied.11.044087

https://doi.org/10.1021/acs.jctc.9b00963




https://doi.org/10.1088/2632-2153/aba183

https://doi.org/10.1088%2F2632-2153%2Faba183

https://doi.org/10.1088%2F2632-2153%2Faba183

https://doi.org/10.1088/2058-9565/abfc94




https://github.com/qulacs/cirq-qulacs

https://github.com/qulacs/cirq-qulacs

https://doi.org/10.1088/2058-9565/ab8e92

https://doi.org/10.1088/2058-9565/ab8e92

https://doi.org/10.1088/2058-9565/ab8e92

https://doi.org/10.1088/2058-9565/ab8e92

https://orquestra.io/

https://github.com/aspuru-guzik-group/tequila

https://github.com/aspuru-guzik-group/tequila

https://doi.org/10.1137/S0097539795293172

https://doi.org/10.1137/S0097539795293172

https://doi.org/10.1137/S0097539795293172

https://doi.org/10.1137/S0097539795293172

https://doi.org/10.22331/q-2021-04-15-433

https://doi.org/10.22331/q-2021-04-15-433

https://doi.org/10.22331/q-2021-04-15-433

https://doi.org/10.22331/q-2021-04-15-433

of condensed-phase correlated electrons viatrotterization. Quantum, 4:296, 2020. DOI:10.22331/q-2020-07-16-296. URL https://doi.org/10.22331/q-2020-07-16-296.

[51] Aram W Harrow, Avinatan Hassidim, andSeth Lloyd. Quantum algorithm for linearsystems of equations. Physical review letters,103(15):150502, 2009. DOI: 10.1103/Phys-RevLett.103.150502. URL https://doi.org/10.1103/PhysRevLett.103.150502.

[52] Austin G Fowler, Matteo Mariantoni,John M Martinis, and Andrew N Cleland.Surface codes: Towards practical large-scalequantum computation. Physical Review A,86(3):032324, 2012. DOI: 10.1103/Phys-RevA.86.032324. URL https://link.aps.org/doi/10.1103/PhysRevA.86.032324.

[53] Sergio Boixo, Sergei V Isakov, Vadim NSmelyanskiy, Ryan Babbush, Nan Ding,Zhang Jiang, Michael J Bremner, John MMartinis, and Hartmut Neven. Char-acterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595–600, 2018. DOI: 10.1038/s41567-018-0124-x. URL https://doi.org/10.1038/s41567-018-0124-x.

[54] Jarrod McClean, Nicholas Rubin, KevinSung, Ian David Kivlichan, Xavier Bonet-Monroig, Yudong Cao, Chengyu Dai,Eric Schuyler Fried, Craig Gidney, Bren-dan Gimby, et al. OpenFermion: theelectronic structure package for quan-tum computers. Quantum Science andTechnology, 2020. DOI: 10.1088/2058-9565/ab8ebc. URL https://doi.org/10.1088/2058-9565/ab8ebc.

[55] Michael A. Nielsen and Isaac L. Chuang.Quantum Computation and Quantum Infor-mation: 10th Anniversary Edition. Cam-bridge University Press, 2010. DOI:10.1017/CBO9780511976667. URL https://doi.org/10.1017/CBO9780511976667.

[56] Andrew W Cross, Lev S Bishop, John ASmolin, and Jay M Gambetta. Open quan-tum assembly language. arXiv preprintarXiv:1707.03429, 2017.

[57] Shiro Tamiya and Yuya O Nakagawa. Cal-culating nonadiabatic couplings and Berry’sphase by variational quantum eigensolvers.arXiv preprint arXiv:2003.01706, 2020.DOI: 10.1103/PhysRevResearch.3.023244.

URL https://doi.org/10.1103/PhysRevResearch.3.023244.

[58] Yohei Ibe, Yuya O Nakagawa, TakahiroYamamoto, Kosuke Mitarai, Qi Gao, andTakao Kobayashi. Calculating transitionamplitudes by variational quantum eigen-solvers. arXiv preprint arXiv:2002.11724,2020.

[59] Pascual Jordan and Eugene P Wigner.About the pauli exclusion princi-ple. Z. Phys, 47(631):14–75, 1928.DOI: 10.1007/BF01331938. URLhttps://doi.org/10.1007/BF01331938.

[60] Sergey B Bravyi and Alexei Yu Kitaev.Fermionic quantum computation. Annalsof Physics, 298(1):210–226, 2002. DOI:10.1006/aphy.2002.6254. URL https://doi.org/10.1006/aphy.2002.6254.

[61] Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2020.

[62] OpenMP Specifications. https://www.openmp.org/specifications/, 2020.

[63] quantum-benchmarks. https://github.com/Roger-luo/quantum-benchmarks,2020.

[64] Benchmark codes of this paper will be up-loaded to. https://github.com/qulacs/benchmark-qulacs, 2020.

[65] Intel-QS repository . https://github.com/iqusoft/intel-qs, 2020.

[66] Daniel Gottesman. The heisenberg represen-tation of quantum computers. arXiv preprintquant-ph/9807006, 1998.

[67] Scott Aaronson and Daniel Gottesman.Improved simulation of stabilizer circuits.Physical Review A, 70(5):052328, 2004. DOI:10.1103/PhysRevA.70.052328. URL https://10.1103/PhysRevA.70.052328.

[68] Leslie G Valiant. Quantum circuits thatcan be simulated classically in polyno-mial time. SIAM Journal on Com-puting, 31(4):1229–1254, 2002. DOI:10.1137/S0097539700377025. URL https://doi.org/10.1137/S0097539700377025.

[69] Barbara M Terhal and David P DiVincenzo.Classical simulation of noninteracting-fermion quantum circuits. Physical ReviewA, 65(3):032325, 2002. DOI: 10.1103/Phys-RevA.65.032325. URL https://doi.org/10.1103/PhysRevA.65.032325.


https://doi.org/10.22331/q-2020-07-16-296

https://doi.org/10.22331/q-2020-07-16-296

https://doi.org/10.22331/q-2020-07-16-296

https://doi.org/10.22331/q-2020-07-16-296





https://doi.org/10.1103/PhysRevA.86.032324


https://link.aps.org/doi/10.1103/PhysRevA.86.032324

https://link.aps.org/doi/10.1103/PhysRevA.86.032324

https://doi.org/10.1038/s41567-018-0124-x

https://doi.org/10.1038/s41567-018-0124-x

https://doi.org/10.1038/s41567-018-0124-x

https://doi.org/10.1038/s41567-018-0124-x

https://doi.org/10.1088/2058-9565/ab8ebc




https://doi.org/10.1017/CBO9780511976667

https://doi.org/10.1017/CBO9780511976667

https://doi.org/10.1017/CBO9780511976667

https://doi.org/10.1017/CBO9780511976667




https://doi.org/10.1007/BF01331938

https://doi.org/10.1007/BF01331938

https://doi.org/10.1006/aphy.2002.6254




https://software.intel.com/sites/landingpage/IntrinsicsGuide/



https://www.openmp.org/specifications/

https://www.openmp.org/specifications/

https://github.com/Roger-luo/quantum-benchmarks

https://github.com/Roger-luo/quantum-benchmarks

https://github.com/qulacs/benchmark-qulacs

https://github.com/qulacs/benchmark-qulacs

https://github.com/iqusoft/intel-qs

https://github.com/iqusoft/intel-qs



https://10.1103/PhysRevA.70.052328

https://10.1103/PhysRevA.70.052328

https://doi.org/10.1137/S0097539700377025

https://doi.org/10.1137/S0097539700377025

https://doi.org/10.1137/S0097539700377025

https://doi.org/10.1137/S0097539700377025





[70] Emanuel Knill. Fermionic linear opticsand matchgates. arXiv preprint quant-ph/0108033, 2001.


Documents

Masahiro Nakadai, y Shiro Tamiya, y - arXiv