Transition role of entangled data in quantum machine learning

We first recap the task of learning quantum dynamics. Let $\boldsymbol{U}\in {\mathbb{S}}{\mathbb{U}}(2^{n})$ be the target unitary and $\boldsymbol{O}\in \mathbb{C}^{2^{n}\times 2^{n}}$ be the observable, a Hermitian matrix acting on an n-qubit quantum system. Here we specify the observable as the projective measurement $\boldsymbol{O}=\left\vert \boldsymbol{o}\right\rangle \left\langle \boldsymbol{o}\right\vert$, since any observable reads out the classical information from the quantum system via its eigenvectors. The goal of quantum dynamics learning is to predict functions of the form

$$\mathrm{f}_{\boldsymbol{U}}(\boldsymbol{\psi})=\mathrm{Tr}\left(\boldsymbol{O}\boldsymbol{U}\left\vert \boldsymbol{\psi}\right\rangle \left\langle \boldsymbol{\psi}\right\vert \boldsymbol{U}^{\dagger}\right), \tag{1}$$

where $\left\vert {{{{{{{\boldsymbol{\psi }}}}}}}}\right\rangle$ is an n-qubit quantum state living in a 2ⁿ-dimensional Hilbert space ${{{{{{{{\mathcal{H}}}}}}}}}_{{{{{{{{\mathcal{X}}}}}}}}}$. This task can be done by employing the training data ${{{{{{{\mathcal{S}}}}}}}}$ to construct a unitary ${{{{{{{{\boldsymbol{V}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}}$, i.e., the learned hypothesis has the form of ${{{{{{{{\rm{h}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}}({{{{{{{\boldsymbol{\psi }}}}}}}})={{{{{{{\rm{Tr}}}}}}}}({{{{{{{\boldsymbol{O}}}}}}}}{{{{{{{{\boldsymbol{V}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}}\left\vert {{{{{{{\boldsymbol{\psi }}}}}}}}\right\rangle \left\langle {{{{{{{\boldsymbol{\psi }}}}}}}}\right\vert {{{{{{{{\boldsymbol{V}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}}^{{{{\dagger}}} })$, which is expected to accurately approximate f_U(ψ) for the unseen data. While the learned unitary acts on an n-qubit system ${{{{{{{{\mathcal{H}}}}}}}}}_{{{{{{{{\mathcal{X}}}}}}}}}$, the input state could be entangled with a reference system ${{{{{{{{\mathcal{H}}}}}}}}}_{{{{{{{{\mathcal{R}}}}}}}}}$, i.e., $\left\vert {{{{{{{\boldsymbol{\psi }}}}}}}}\right\rangle \in {{{{{{{{\mathcal{H}}}}}}}}}_{{{{{{{{\mathcal{X}}}}}}}}}\otimes {{{{{{{{\mathcal{H}}}}}}}}}_{{{{{{{{\mathcal{R}}}}}}}}}$. We suppose that all input states have the same Schmidt rank r ∈ {1, ⋯ , 2ⁿ}. Then the response of the state $\left\vert {{{{{{{{\boldsymbol{\psi }}}}}}}}}_{j}\right\rangle$ is given by the measurement output ${{{{{{{{\boldsymbol{o}}}}}}}}}_{j}={\sum }_{k=1}^{m}{{{{{{{{\boldsymbol{o}}}}}}}}}_{jk}/m$, where m is the number of measurements and o_jk is the output of the k-th measurement of the observable O on the output quantum state $({{{{{{{\boldsymbol{U}}}}}}}}\otimes {{\mathbb{I}}}_{{{{{{{{\mathcal{R}}}}}}}}})\vert {{{{{{{{\boldsymbol{\psi }}}}}}}}}_{j}\rangle$. 
In this manner, the training data with N examples takes the form $\mathcal{S}=\{(\vert \boldsymbol{\psi}_{j}\rangle,\boldsymbol{o}_{j}):\vert \boldsymbol{\psi}_{j}\rangle \in \mathcal{H}_{\mathcal{X}}\otimes \mathcal{H}_{\mathcal{R}},\ \mathbb{E}[\boldsymbol{o}_{j}]=u_{j}\}_{j=1}^{N}$, where $u_{j}=\mathrm{Tr}((\boldsymbol{U}^{\dagger}\boldsymbol{O}\boldsymbol{U}\otimes \mathbb{I}_{\mathcal{R}})\vert \boldsymbol{\psi}_{j}\rangle \langle \boldsymbol{\psi}_{j}\vert )$ is the expectation value of the observable O on the state $(\boldsymbol{U}\otimes \mathbb{I}_{\mathcal{R}})\vert \boldsymbol{\psi}_{j}\rangle$ and N is the size of the training data. Notably, in quantum dynamics learning, sample complexity refers to the size of the training data N, or equivalently, the number of quantum states in the training data; query complexity refers to the total number of queries to the explored quantum system, i.e., the product Nm of the sample complexity and the number of measurements.
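To make the data model concrete, the following is a minimal classical sketch of how a single training pair could be simulated. It relies on the fact that, for $\boldsymbol{O}=\vert \boldsymbol{o}\rangle \langle \boldsymbol{o}\vert$, the operator $\boldsymbol{U}^{\dagger}\boldsymbol{O}\boldsymbol{U}$ is the projector onto $\boldsymbol{U}^{\dagger}\vert \boldsymbol{o}\rangle$, so that $u_j=\sum_k c_k\vert \langle \boldsymbol{\phi}\vert \boldsymbol{a}_k\rangle \vert^2$ for an input with Schmidt coefficients $c_k$ and orthonormal system vectors $\vert \boldsymbol{a}_k\rangle$. All function names, the choice of random Schmidt coefficients, and the single-shot Bernoulli outcome model are illustrative assumptions and not the sampling rule of Supplementary Note 2.

```python
import math
import random

def random_state(dim, rng):
    """Haar-random pure state: a normalized vector of complex Gaussians."""
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(x) ** 2 for x in v))
    return [x / norm for x in v]

def inner(a, b):
    """Inner product <a|b>."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def orthonormal_vectors(dim, r, rng):
    """r orthonormal vectors in C^dim via Gram-Schmidt on random vectors."""
    basis = []
    while len(basis) < r:
        v = random_state(dim, rng)
        for b in basis:
            c = inner(b, v)
            v = [x - c * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in v))
        if norm > 1e-9:
            basis.append([x / norm for x in v])
    return basis

def random_schmidt_coefficients(r, rng):
    """Assumed choice: probabilities c_1..c_r drawn uniformly from the simplex."""
    w = [rng.expovariate(1.0) for _ in range(r)]
    total = sum(w)
    return [x / total for x in w]

def expected_response(phi, coeffs, vectors):
    """u_j = sum_k c_k |<phi|a_k>|^2, where phi stands in for U^dagger|o>."""
    return sum(c * abs(inner(phi, a)) ** 2 for c, a in zip(coeffs, vectors))

def measured_response(u, m, rng):
    """o_j: mean of m single-shot outcomes, each equal to 1 with probability u."""
    return sum(1 if rng.random() < u else 0 for _ in range(m)) / m

rng = random.Random(0)
n, r, m = 3, 2, 200                       # qubits, Schmidt rank, measurements
phi = random_state(2 ** n, rng)           # plays the role of U^dagger|o>
coeffs = random_schmidt_coefficients(r, rng)
a_vectors = orthonormal_vectors(2 ** n, r, rng)
u_j = expected_response(phi, coeffs, a_vectors)
o_j = measured_response(u_j, m, rng)      # finite-m estimate of u_j
```

With these definitions, `o_j` fluctuates around `u_j` with a spread that shrinks as m grows, which is the finite-measurement effect that the query complexity Nm accounts for.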

The risk function is a crucial measure in statistical learning theory to quantify how well the hypothesis function ${{{{{{{{\rm{h}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}}$ performs in predicting f_U, defined as

$$\mathrm{R}_{\boldsymbol{U}}(\boldsymbol{V}_{\mathcal{S}})=\int \mathrm{d}\boldsymbol{\psi}\,\left(\mathrm{f}_{\boldsymbol{U}}(\boldsymbol{\psi})-\mathrm{h}_{\mathcal{S}}(\boldsymbol{\psi})\right)^{2}, \tag{2}$$

where the integral is over the uniform Haar measure dψ on the state space. Intuitively, $\mathrm{R}_{\boldsymbol{U}}(\boldsymbol{V}_{\mathcal{S}})$ amounts to the mean squared error between the true output f_U(ψ) and the hypothesis output $\mathrm{h}_{\mathcal{S}}(\boldsymbol{\psi})$. Moreover, we follow the treatment in ref. ⁴⁸ and choose a Haar-random unitary as the target unitary. Additionally, we construct a sampling rule for the training input states that approximates the uniform distribution over all entangled states with Schmidt rank r (refer to Supplementary Note 2).
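As an illustration of the risk in Eq. (2), the sketch below estimates it by Monte Carlo sampling of Haar-random test states and compares the estimate with a closed form. It assumes the projective observable $\boldsymbol{O}=\vert \boldsymbol{o}\rangle \langle \boldsymbol{o}\vert$, so that the target and the hypothesis are represented by the pure states $\boldsymbol{U}^{\dagger}\vert \boldsymbol{o}\rangle$ and $\boldsymbol{V}_{\mathcal{S}}^{\dagger}\vert \boldsymbol{o}\rangle$ (called `phi_u` and `phi_v` here, hypothetical names). The closed form follows from the standard Haar second-moment identity $\mathbb{E}_{\boldsymbol{\psi}}[\langle \boldsymbol{\psi}\vert \boldsymbol{A}\vert \boldsymbol{\psi}\rangle^{2}]=(\mathrm{Tr}(\boldsymbol{A})^{2}+\mathrm{Tr}(\boldsymbol{A}^{2}))/(d(d+1))$ with $d=2^{n}$; it is included only as a sanity check and is not taken from the source.

```python
import math
import random

def random_state(dim, rng):
    """Haar-random pure state: a normalized vector of complex Gaussians."""
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(x) ** 2 for x in v))
    return [x / norm for x in v]

def overlap_sq(a, b):
    """|<a|b>|^2, i.e. f(psi) when a stands in for U^dagger|o>."""
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2

def risk_monte_carlo(phi_u, phi_v, samples, rng):
    """Average of (f_U(psi) - h_S(psi))^2 over Haar-random test states psi."""
    dim = len(phi_u)
    total = 0.0
    for _ in range(samples):
        psi = random_state(dim, rng)
        diff = overlap_sq(phi_u, psi) - overlap_sq(phi_v, psi)
        total += diff * diff
    return total / samples

def risk_closed_form(phi_u, phi_v):
    """2 (1 - |<phi_u|phi_v>|^2) / (d (d + 1)) for rank-one projectors."""
    d = len(phi_u)
    return 2 * (1 - overlap_sq(phi_u, phi_v)) / (d * (d + 1))

rng = random.Random(0)
phi_u = random_state(4, rng)   # target, 2 qubits
phi_v = random_state(4, rng)   # an arbitrary (untrained) hypothesis
estimate = risk_monte_carlo(phi_u, phi_v, 5000, rng)
exact = risk_closed_form(phi_u, phi_v)
```

The closed form makes the $1/4^{n}$ scaling of the risk visible: even a hypothesis orthogonal to the target incurs a risk of only $2/(d(d+1))$.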

Under the above setting, we prove the following quantum NFL theorem in learning quantum dynamics, where the formal statement and proof are deferred to Supplementary Note 3.

Theorem 1

(Quantum NFL theorem in learning quantum dynamics, informal). Following the settings in Eq. (1), suppose that the training error of the learned hypothesis on the training data $\mathcal{S}$ is less than $\varepsilon=\mathcal{O}(1/2^{n})$. Then the averaged prediction error in Eq. (2) is lower bounded as

$${{\mathbb{E}}}_{{{{{{{{\boldsymbol{U}}}}}}}},{{{{{{{\mathcal{S}}}}}}}}}{{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\boldsymbol{U}}}}}}}}}({{{{{{{{\boldsymbol{V}}}}}}}}}_{{{{{{{{\mathcal{S}}}}}}}}})\ge \Omega \left(\frac{{\tilde{\varepsilon }}^{2}}{{4}^{n}}\left(1-\frac{N\cdot \min \{m/({2}^{n}r{c}_{1}),rn\}}{{2}^{n}{c}_{2}}\right)\right),$$

where $c_{1}=128/\tilde{\varepsilon}^{2}$, $c_{2}=\min \{(1-2\tilde{\varepsilon})^{2},(64\tilde{\varepsilon}^{2}-1)^{2}\}$, $\tilde{\varepsilon}=\Theta(2^{n}\varepsilon)$, and the expectation is taken over all target unitaries U, entangled states $\left\vert \boldsymbol{\psi}_{j}\right\rangle$ and measurement outputs o_j.

The achieved results indicate the transition role of the entangled data in determining the prediction error. In particular, when a sufficient number of measurements m is allowed such that the Schmidt rank r obeys $r < \sqrt{m/(c_{1}2^{n}n)}$, the minimum term in the lower bound equals rn, so the numerator becomes Nrn and increasing r consistently decreases the prediction error. Accordingly, in the two extreme cases of r = 1 and r = 2ⁿ, achieving zero averaged risk requires N = 2ⁿc₂/n and N = 1 training input states, respectively, where the latter achieves an exponential reduction in the number of training data compared with the former. This observation implies that entangled data empower QML with provable quantum advantage, which accords with the results of ref. ⁴⁸ for the ideal coherent learning protocol with infinite measurements.

By contrast, in the scenario with $r\ge \sqrt{m/(c_{1}2^{n}n)}$, increasing r could enlarge the prediction error. This result indicates that entangled data can be harmful to achieving quantum advantages, which contrasts with previous results where entanglement (e.g., entangled operations or measurements) is believed to contribute to the quantum advantage^{48,53,54,55}. This counterintuitive phenomenon stems from the fact that, when incoherently learning quantum dynamics, the information obtained from each measurement decreases as r increases, and hence a small m is incapable of extracting all the information about the target unitary carried by the entangled state.

Another implication of Theorem 1 is that although a larger number of measurements m contributes to a smaller prediction error, it does not by itself determine the ultimately achievable prediction error. Specifically, when m ≥ r²c₁2ⁿn, further increasing m cannot help decrease the prediction error, which is then determined by the entanglement and the size of the training data, i.e., r and N. Meanwhile, at least r²c₁2ⁿn measurements are required to fully utilize the power of entangled data. These results suggest that the value of m should be adapted to r to pursue a low prediction error.
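The transition and the saturation in m can be read off directly from the minimum term of Theorem 1. The sketch below evaluates that term, the bracketed factor of the bound, and the critical rank $\sqrt{m/(c_{1}2^{n}n)}$; the function names are hypothetical, the constants hidden in Ω(·) are ignored, and the numbers therefore indicate only the shape of the lower bound rather than actual prediction errors.

```python
import math

def constants(eps_tilde):
    """c1 and c2 as defined below the bound in Theorem 1."""
    c1 = 128 / eps_tilde ** 2
    c2 = min((1 - 2 * eps_tilde) ** 2, (64 * eps_tilde ** 2 - 1) ** 2)
    return c1, c2

def min_term(n, m, r, eps_tilde):
    """min{ m / (2^n r c1), r n }: grows with r while r*n is the smaller entry."""
    c1, _ = constants(eps_tilde)
    return min(m / (2 ** n * r * c1), r * n)

def bound_bracket(n, N, m, r, eps_tilde):
    """1 - N * min{...} / (2^n c2); non-positive values mean the bound is vacuous."""
    _, c2 = constants(eps_tilde)
    return 1 - N * min_term(n, m, r, eps_tilde) / (2 ** n * c2)

def critical_rank(n, m, eps_tilde):
    """Schmidt rank at which the two entries of the minimum coincide."""
    c1, _ = constants(eps_tilde)
    return math.sqrt(m / (c1 * 2 ** n * n))

n, eps_tilde = 4, 0.25
m = 2 ** 21  # chosen so that the critical rank is exactly 4 for these constants
for r in (1, 2, 4, 8, 16):
    print(r, min_term(n, m, r, eps_tilde))
print("critical rank:", critical_rank(n, m, eps_tilde))
```

In this example the minimum term rises with r up to the critical rank and falls beyond it, so the lower bound first decreases and then increases with r; for a fixed r, raising m beyond r²c₁2ⁿn leaves the term unchanged.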

We next examine the scenario in which the lower bound of the averaged risk in Theorem 1 reaches zero and relate it to the results in quantum state learning and quantum dynamics learning^{26,27,29,30,56,57}. In particular, the main focus of those studies is proving the minimum query complexity of the target unitary to warrant zero risk. The results in Theorem 1 indicate that the minimum query complexity is Nm = Ω(4ⁿrc₁c₂), implying a proportional relation between the entanglement degree r and the query complexity. Notably, this lower bound is tighter than that achieved in ref. ²⁶ in the same setting. The achieved results in terms of query complexity are also non-trivial, as previous works show that query complexity can benefit from using entanglement in quantum data^{58,59} and quantum measurements^{26,30}. The improvement of our results stems from the fact that ref. ²⁶ simply employs Holevo’s theorem to give an upper bound on the information extracted in a single measurement, while our bound integrates a more refined analysis, such as the consideration of the Schmidt rank r, the direct use of a connection between the mutual information of the target unitary U and the measurement outputs o_j, and the KL-divergence of related distributions (refer to Supplementary Note 3 for more details). Moreover, the adopted projective measurement O in Eq. (1) hints that the learning task explored in our study amounts to learning a pure state U^†OU. From the perspective of state learning, the derived lower bound in Theorem 1 is optimal for nonadaptive measurements with a constant number of outcomes⁶⁰. Taken together, while entangled data hold the promise of gaining advantages in terms of the sample complexity for achieving the same level of prediction error, they may be inferior to training data without entanglement in terms of query complexity.
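For concreteness, the query-complexity scaling can be recovered from the bound in Theorem 1 by setting the bracketed factor to zero in each regime. The following is a sketch of that arithmetic with the constants hidden in Ω(·) suppressed, not the formal argument of Supplementary Note 3:

$$\begin{aligned}
m < r^{2}c_{1}2^{n}n:&\quad \frac{Nm}{2^{n}rc_{1}}=2^{n}c_{2}\;\Longrightarrow\; Nm=4^{n}rc_{1}c_{2},\\
m \ge r^{2}c_{1}2^{n}n:&\quad Nrn=2^{n}c_{2}\;\Longrightarrow\; N=\frac{2^{n}c_{2}}{rn},\quad Nm\ge \frac{2^{n}c_{2}}{rn}\cdot r^{2}c_{1}2^{n}n=4^{n}rc_{1}c_{2}.
\end{aligned}$$

In both regimes the product Nm is at least 4ⁿrc₁c₂, which grows linearly with r even though the required N alone shrinks as 1/r in the second regime.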

The transition role of entanglement explained above suggests the following guidelines for constructing quantum learning models. First, when a large number of measurements is allowed, using entangled data is encouraged to improve the prediction performance. To this end, initial research efforts^{61,62,63,64,65,66}, which develop effective methods for preparing and storing entangled states, may contribute to QML. Second, when the total number of measurements is limited, it is advisable to refrain from using entangled data for learning quantum dynamics.

Remark. (i) The training error scaling $\varepsilon=\mathcal{O}(1/2^{n})$ in Theorem 1 and the factor $\tilde{\varepsilon}^{2}/4^{n}$ in the achieved lower bound come from considering the average performance over Haar unitaries, where the expectation value of the observable O scales as $\mathrm{Tr}(\boldsymbol{O})/2^{n}$ (refer to Supplementary Note 2). (ii) The results on the transition role of entangled data achieved in Theorem 1 can be generalized to mixed states, because a mixed state can be produced by taking the partial trace of a pure entangled state.

In a more generic learning setting, the observable used in the target function defined in Eq. (1) and the measurement used for collecting the responses o of the training data could be arbitrary and varied. In particular, we consider that the observable O defined in Eq. (1) could be an arbitrary Hermitian operator satisfying ∥O∥₁≤∞. The response a_j for a given input state $\left\vert \boldsymbol{\psi}_{j}\right\rangle$ could be obtained by measuring the output state on system $\mathcal{X}$ with an ℓ-outcome POVM. The training dataset in this case is $\mathcal{S}_{\ell}=\{(\left\vert \boldsymbol{\psi}_{j}\right\rangle,\boldsymbol{a}_{j}):\left\vert \boldsymbol{\psi}_{j}\right\rangle \in \mathcal{H}_{\mathcal{X}\mathcal{R}},\ \boldsymbol{a}_{j}=(\boldsymbol{a}_{j1},\cdots,\boldsymbol{a}_{jm}),\ \boldsymbol{a}_{jk}\in \{z_{1},\cdots,z_{\ell}\}\}_{j=1}^{N}$, where $\vert \boldsymbol{\psi}_{j}\rangle$ refers to an entangled state with Schmidt rank r, a_j collects the m measurement outputs of the ℓ-outcome POVM, and $\{z_{i}\}_{i=1}^{\ell}$ are the ℓ possible outcomes of the employed POVM. In this case, denoting the learned unitary as $\boldsymbol{V}_{\mathcal{S}_{\ell}}$, we get the following quantum NFL theorem in learning quantum dynamics for generic measurements, where the formal statement and proof are deferred to Supplementary Note 4.

Theorem 2

(Quantum NFL theorem in learning quantum dynamics for generic measurements, informal). Following the settings in Eq. (1) with arbitrary O satisfying ∥O∥₁≤∞, suppose the hypothesis is learned from the training data $\mathcal{S}_{\ell}$. Then the averaged prediction error in Eq. (2) is lower bounded as

$$\mathbb{E}_{\boldsymbol{U},\mathcal{S}_{\ell}}\mathrm{R}_{\boldsymbol{U}}(\boldsymbol{V}_{\mathcal{S}_{\ell}})\ge \varepsilon^{2}\left(1-\frac{N\cdot \min \{4m/r,\,6m\ell /(2^{n}r),\,rn\}}{\log (|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|)}\right),$$

where $|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|$ refers to the model complexity and only depends on ε and the employed observable O. For the projective measurement $\boldsymbol{O}=\left\vert \boldsymbol{o}\right\rangle \left\langle \boldsymbol{o}\right\vert$, $\log (|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|)=2^{n}c_{2}$, as given in the denominator of the lower bound achieved in Theorem 1.

The achieved results in Theorem 2 deliver three implications. First, the transition role of entangled data still holds for arbitrary observables and POVMs. In particular, no matter how large the number of possible POVM outcomes ℓ is, increasing the Schmidt rank decreases the prediction error as long as the number of measurements m satisfies $\min \{4m/r,6m\ell /(2^{n}r)\}\ge rn$, so that rn is the minimum term in the bound, and increases the prediction error otherwise. Second, when the observable is a projective measurement and the number of possible outcomes ℓ is of constant order, the result in Theorem 2 reduces, up to a constant factor, to the result achieved in Theorem 1 for the case of employing projective measurement. Third, increasing the number of possible POVM outcomes ℓ can exponentially reduce the number of measurements required to achieve the same level of prediction error. In particular, considering the two extreme cases in which ℓ scales as a constant Θ(1) and exponentially as Θ(2ⁿ), achieving the same level of prediction error requires a query complexity of the order of $2^{n}r\log (|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|)$ and $r\log (|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|)$, respectively, where the latter case achieves an exponential reduction in query complexity.
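The dependence on ℓ can be checked numerically from the bound in Theorem 2, reading the second entry of the minimum as 6mℓ/(2ⁿr). The sketch below evaluates the minimum term and the query complexity Nm at which the bracketed factor vanishes in the measurement-limited regime, i.e., when one of the m-dependent entries is the minimum. The function names and the placeholder value `log_cover` standing in for $\log (|\mathcal{X}_{2\varepsilon}(\boldsymbol{O})|)$ are assumptions for illustration only.

```python
def min_term_povm(n, m, r, ell):
    """min{ 4m/r, 6 m ell / (2^n r), r n } from the bound in Theorem 2."""
    return min(4 * m / r, 6 * m * ell / (2 ** n * r), r * n)

def zero_bound_queries(n, r, ell, log_cover):
    """Nm solving N * min{4m/r, 6 m ell/(2^n r)} = log_cover."""
    per_query = min(4, 6 * ell / 2 ** n) / r   # min term contributed per measurement
    return log_cover / per_query

n, r, log_cover = 4, 2, 12.0                   # log_cover is a placeholder value
few_outcomes = zero_bound_queries(n, r, 1, log_cover)        # ell = Theta(1)
many_outcomes = zero_bound_queries(n, r, 2 ** n, log_cover)  # ell = Theta(2^n)
ratio = few_outcomes / many_outcomes           # grows proportionally to 2^n
```

For constant ℓ the second entry of the minimum dominates and Nm scales as 2ⁿ r log(·), whereas for ℓ of order 2ⁿ the first entry takes over and Nm scales as r log(·).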

Numerical results

We conduct numerical simulations to exhibit the transition role of entangled data, the effect of the number of measurements, and the training data size in determining the prediction error. The omitted construction details and results are deferred to Supplementary Note 5.

We focus on the task of learning an n-qubit unitary under a fixed projective measurement $\boldsymbol{O}=(\left\vert \boldsymbol{0}\right\rangle \left\langle \boldsymbol{0}\right\vert )^{\otimes n}$. The number of qubits is n = 4. The target unitary U_X is chosen uniformly from a discrete set $\{\boldsymbol{U}_{i}\}_{i=1}^{M}$, where M = 2ⁿ is the set size and the operators $\boldsymbol{U}_{j}^{\dagger}\boldsymbol{O}\boldsymbol{U}_{j}$ with U_j in this set are mutually orthogonal and hence well distinguished. The entangled states in $\mathcal{S}$ are uniformly sampled from the set $\{\sum_{j=1}^{r}\sqrt{c_{j}}\boldsymbol{U}_{j}\left\vert \boldsymbol{0}\right\rangle \otimes \left\vert \boldsymbol{\xi}_{j}\right\rangle \,|\,(\sqrt{c_{1}},\cdots,\sqrt{c_{r}})^{\top}\in {\mathbb{S}}{\mathbb{U}}(r),\,\left\vert \boldsymbol{\xi}_{j}\right\rangle \in {\mathbb{S}}{\mathbb{U}}(2^{n})\}$. The size of the training data is N ∈ {1, 2, ⋯ , 16} and the Schmidt rank takes r ∈ {2⁰, ⋯ , 2⁴}. The number of measurements takes m ∈ {10, 100, 300, ⋯ , 5000, 20000}. We record the prediction error averaged over learning four different 4-qubit unitaries for 10 training datasets.

The simulation results are displayed in Fig. 2. In particular, Fig. 2a shows that for both N = 2 and N = 8, the prediction error consistently decreases with an increased number of measurements m and an increased Schmidt rank r when the number of measurements is large enough, namely m > 1000. On the other hand, for a small number of measurements with m ≤ 100 in the case of N = 8, as the Schmidt rank is continually increased, the averaged prediction error initially decreases and then increases after the Schmidt rank surpasses a critical point, which is r = 3 for m = 10 and r = 4 for m = 100. This phenomenon accords with the theoretical results in Theorem 1 in the sense that the entangled data play a transition role in determining the prediction error for a limited number of measurements. This observation is also verified in Fig. 2b for varied sizes of training data, where for the small number of measurements m = 10, increasing the Schmidt rank may not help decrease the prediction error. By contrast, a large training data size consistently contributes to a small prediction error, which echoes Theorem 1.