# Parallel-TEBD

Parallel-TEBD is a version of the TEBD algorithm adapted to run on multiple hosts. TEBD can be parallelized in several ways.

• As a first option, one could use the OpenMP API, probably the simplest approach, in which directives embedded in the serial code mark the portions that should be parallelized. The drawback is that one is confined to symmetric multiprocessing (SMP) architectures and the user has no control over how the code is parallelized. An Intel extension of OpenMP, called Cluster OpenMP [2], is a socket-based implementation of OpenMP that can make use of a whole cluster of SMP machines; it spares the user from writing explicit messaging code while giving access to multiple hosts via a distributed shared-memory system.
• The second option is the Message Passing Interface (MPI) API. MPI can treat each core of a multi-core machine as a separate execution host, so a cluster of, say, 10 compute nodes with dual-core processors appears as 20 compute nodes over which the MPI application can be distributed. MPI offers the user more control over the way the program is parallelized. Its drawback is that it is not very easy to implement, and the programmer needs a certain understanding of parallel simulation systems.
• For the determined programmer, the third option is probably the most appropriate: writing one's own routines, using a combination of threads and TCP/IP sockets. The threads are needed to make the socket-based communication between the programs non-blocking: the communication takes place in separate threads, so the main thread does not have to wait for it to finish and can execute other parts of the code. This option gives the programmer complete control over the code and eliminates any overhead that might come from the Cluster OpenMP or MPI libraries.
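
The non-blocking pattern described in the third option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_from_neighbour` and `run_step` are hypothetical names, and the socket receive is replaced by a short sleep so that the example runs on a single machine. The worker thread plays roughly the role that `MPI_Irecv` followed by `MPI_Wait` would play in an MPI program.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_from_neighbour(delay=0.01):
    """Hypothetical stand-in for a blocking socket receive from a neighbour."""
    time.sleep(delay)  # simulates network latency
    return "border data"


def run_step(local_tasks):
    """Start the communication, do the local work, then use the received data."""
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # submit() returns immediately; the receive runs in a worker thread.
        pending = pool.submit(fetch_from_neighbour)
        # The main thread keeps working on tasks that need no remote data.
        for task in local_tasks:
            log.append(f"local:{task}")
        # result() blocks only if the data has not arrived yet.
        border = pending.result()
        log.append(f"border:{border}")
    return log


if __name__ == "__main__":
    print(run_step(["gate 2", "gate 4"]))
```

The main thread waits only at the point where the remote data is actually needed, which is what hides the communication latency behind the local computation.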

This article introduces the conceptual basis of the implementation, using MPI-based pseudo-code for illustration without being restricted to MPI; the same basic scheme could be implemented with home-grown messaging routines.

## Introduction

The TEBD algorithm is a good candidate for parallel computing because the exponential operators used to calculate the time-evolution factorize under the Suzuki-Trotter expansion. A detailed presentation of the way TEBD works is given in the main article. Here we concern ourselves only with its parallel implementation.

## Implementation

For our purposes, we will use the canonical form of the MPS as introduced by Guifré Vidal in his original papers. Hence, we will write the state ${\displaystyle |\Psi \rangle }$ as:

${\displaystyle |\Psi \rangle =\sum \limits _{i_{1},..,i_{N}=1}^{M}\sum \limits _{\alpha _{1},..,\alpha _{N-1}=0}^{\chi }\Gamma _{\alpha _{1}}^{[1]i_{1}}\lambda _{\alpha _{1}}^{[1]}\Gamma _{\alpha _{1}\alpha _{2}}^{[2]i_{2}}\lambda _{\alpha _{2}}^{[2]}\Gamma _{\alpha _{2}\alpha _{3}}^{[3]i_{3}}\lambda _{\alpha _{3}}^{[3]}\cdot ..\cdot \Gamma _{\alpha _{N-2}\alpha _{N-1}}^{[{N-1}]i_{N-1}}\lambda _{\alpha _{N-1}}^{[N-1]}\Gamma _{\alpha _{N-1}}^{[N]i_{N}}|{i_{1},i_{2},..,i_{N-1},i_{N}}\rangle }$

This state describes an N-point lattice which we would like to compute on P different compute nodes. Let us suppose, for the sake of simplicity, that N = 2k*P, where k is an integer. This means that if we distribute the lattice points evenly among the compute nodes (the easiest scenario), an even number of lattice points, 2k, is assigned to each compute node. Indexing the lattice points from 0 to N-1 (note that the usual indexing runs from 1 to N) and the compute nodes from 0 to P-1, the lattice points are distributed among the nodes as follows:

    NODE 0: 0, 1, ..., 2k-1
    NODE 1: 2k, 2k+1, ..., 4k-1
    ...
    NODE m: m*2k, ..., (m+1)*2k - 1
    ...
    NODE P-1: (P-1)*2k, ..., N-1
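
The distribution above can be written down as a small Python sketch. The function names `sites_of_node`, `node_of_site` and `distribute` are chosen here for illustration only; the code assumes, as in the text, that N is a multiple of 2P and that sites and nodes are indexed from 0.

```python
def sites_of_node(m, k):
    """Lattice sites held by node m when every node holds 2k sites."""
    return range(m * 2 * k, (m + 1) * 2 * k)


def node_of_site(l, k):
    """Index of the node that holds lattice site l."""
    return l // (2 * k)


def distribute(N, P):
    """Map each node index to its list of sites, assuming N = 2k*P."""
    if N <= 0 or P <= 0 or N % (2 * P) != 0:
        raise ValueError("N must be a positive multiple of 2*P")
    k = N // (2 * P)
    return {m: list(sites_of_node(m, k)) for m in range(P)}


if __name__ == "__main__":
    for node, sites in distribute(12, 3).items():
        print(f"NODE {node}: {sites}")
```

For N = 12 and P = 3 (so k = 2), node 1 holds sites 4 to 7, in agreement with the range 2k, ..., 4k-1 listed above.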


Using the canonical form of the MPS, we define ${\displaystyle \lambda _{\alpha _{l}}^{[l]}}$ as "belonging" to node m if m*2k ≤ l ≤ (m+1)*2k - 1. Similarly, we use the index l to assign each ${\displaystyle \Gamma }$ to a lattice point. This means that ${\displaystyle \Gamma _{\alpha _{0}}^{[0]i_{0}}}$ and ${\displaystyle \Gamma _{\alpha _{l-1}\alpha _{l}}^{[l]i_{l}},\ l=1,\ldots ,2k-1}$, belong to NODE 0, as do ${\displaystyle \lambda _{\alpha _{l}}^{[l]},\ l=0,\ldots ,2k-1}$. A parallel version of TEBD requires the compute nodes to exchange information with one another. The information exchanged consists of the MPS matrices and singular values lying at the border between neighbouring compute nodes. How this is done is explained below.
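
To make "the border between neighbouring compute nodes" concrete, the following Python sketch lists the indices of the tensors that sit at the border between node m and node m+1. The helper `border_tensors` and its dictionary keys are hypothetical; the code only does index arithmetic for the 0-based layout with 2k sites per node and says nothing about how the tensors are actually transferred.

```python
def border_tensors(m, k, P):
    """Indices of the tensors at the border between node m and node m + 1."""
    if not 0 <= m < P - 1:
        raise ValueError("node m has no higher neighbour")
    last_site_of_m = (m + 1) * 2 * k - 1
    return {
        # Gamma^[l] on the last site held by node m
        "gamma_on_m": last_site_of_m,
        # lambda^[l] on the bond that links the two nodes
        "lambda_bond": last_site_of_m,
        # Gamma^[l+1] on the first site held by node m + 1
        "gamma_on_m_plus_1": last_site_of_m + 1,
    }


if __name__ == "__main__":
    # Three nodes with four sites each (k = 2): two borders.
    for m in range(2):
        print(m, border_tensors(m, k=2, P=3))
```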

The TEBD algorithm divides the exponential operator performing the time-evolution into a sequence of two-qubit gates of the form:

${\displaystyle e^{{\frac {-i\delta }{\hbar }}H_{k,k+1}}.}$

Setting the reduced Planck constant ħ to 1, the time-evolution is expressed as:

${\displaystyle |\Psi (t+\delta )\rangle =e^{{-i\delta }{\frac {F}{2}}}e^{{-i\delta }G}e^{{-i\delta }{\frac {F}{2}}}|\Psi (t)\rangle ,}$

where H = F + G,

${\displaystyle F\equiv \sum _{k=0}^{{\frac {N}{2}}-1}(H_{2k,2k+1})=\sum _{k=0}^{{\frac {N}{2}}-1}(F_{2k}),}$

${\displaystyle G\equiv \sum _{k=0}^{{\frac {N}{2}}-2}(H_{2k+1,2k+2})=\sum _{k=0}^{{\frac {N}{2}}-2}(G_{2k+1}).}$
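
The terms ${\displaystyle F_{2k}}$ act on disjoint pairs of sites, so they commute with one another, and the same holds for the terms ${\displaystyle G_{2k+1}}$. Each exponential in the time-evolution therefore splits exactly into a product of two-site gates. The only approximation is the Suzuki-Trotter splitting between F and G, whose error for this symmetric (second-order) form is of order ${\displaystyle \delta ^{3}}$ per time step. Written out with the definitions above:

${\displaystyle e^{-i{\frac {\delta }{2}}F}=\prod _{k=0}^{{\frac {N}{2}}-1}e^{-i{\frac {\delta }{2}}F_{2k}},\qquad e^{-i\delta G}=\prod _{k=0}^{{\frac {N}{2}}-2}e^{-i\delta G_{2k+1}}.}$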

What we can explicitly compute in parallel is the sequence of gates ${\displaystyle e^{{-i}{\frac {\delta }{2}}F_{2k}},e^{{-i\delta }{G_{2k+1}}}.}$ Each compute node can apply most of the two-qubit gates without needing information from its neighbours. The compute nodes need to exchange information only at the borders, where a two-qubit gate either crosses the border or needs information from the other side. We will now consider all three sweeps, two even and one odd, and see what information has to be exchanged. Let us see what happens on node m during the sweeps.
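
Which gates cross a border follows directly from the site layout, and can be checked with a short Python sketch. The function `gates_for_node` is a hypothetical helper, and attributing a border-crossing gate to the node that holds its left site is an assumption made here for bookkeeping only. The layout is the 0-based one with 2k sites per node.

```python
def gates_for_node(m, k, P, parity):
    """List (left_site, crosses_border) for the gates whose left site is on node m.

    parity is "even" for the F gates and "odd" for the G gates.
    """
    if parity not in ("even", "odd"):
        raise ValueError("parity must be 'even' or 'odd'")
    N = 2 * k * P
    first, last = m * 2 * k, (m + 1) * 2 * k - 1
    start = first if parity == "even" else first + 1
    gates = []
    for left in range(start, last + 1, 2):
        right = left + 1
        if right > N - 1:
            continue  # there is no gate past the end of the lattice
        gates.append((left, right > last))
    return gates


if __name__ == "__main__":
    k, P = 2, 3
    for m in range(P):
        print(m, "even:", gates_for_node(m, k, P, "even"))
        print(m, "odd: ", gates_for_node(m, k, P, "odd"))
```

Because every node starts on an even site and holds an even number of sites, no even gate straddles two nodes; in the odd sweep, exactly one gate per border acts on sites held by two different nodes.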

### First (even) sweep

The sequence of gates that has to be applied in this sweep is:

${\displaystyle e^{{-i}{\frac {\delta }{2}}F_{m*2k}},e^{{-i}{\frac {\delta }{2}}F_{m*2k+2}},...,e^{{-i}{\frac {\delta }{2}}F_{(m+1)*2k-2}}}$

Already for computing the first gate, process m needs information from its lower neighbour, m-1. On the other hand, m needs nothing from its "higher" neighbour, m+1, because it already has all the information required to apply the last gate. So the best strategy for m is to send a request to m-1, postpone the calculation of the first gate, and continue with the calculation of the other gates in the meantime. This is called non-blocking communication. Let us look at this in detail. The tensors involved in the calculation of the first gate are:[1]

## References

1. Guifré Vidal, Efficient Classical Simulation of Slightly Entangled Quantum Computations, PRL 91, 147902 (2003)