SYCL

Original author(s)	Khronos Group
Developer(s)	Khronos Group
Initial release	March 2014
Operating system	Cross-platform
Platform	Cross-platform
Type	High-level programming language
Website	www.khronos.org/sycl

SYCL is a higher-level programming model for OpenCL. It takes the form of a single-source domain-specific embedded language (DSEL) based on pure C++11 for SYCL 1.2 and on C++14 for SYCL 2.2, and is intended to improve programming productivity. It is a standard developed by the Khronos Group and was announced in March 2014.

Purpose

SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL. It enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++. With single-source development, C++ template functions can contain both host and device code, so developers can construct complex algorithms that use OpenCL acceleration and then reuse them throughout their source code on different types of data.
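
The idea of writing a generic algorithm once and reusing it on different types of data can be sketched in standard C++ alone. The following is a host-only illustration, not SYCL code: `parallel_for_host` is a hypothetical stand-in that runs the kernel sequentially where a SYCL runtime would dispatch it to a device, and `scaled_add` is an example name chosen here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a SYCL parallel_for. It runs the
// kernel sequentially; a SYCL runtime would dispatch it to a device instead.
template <typename Kernel>
void parallel_for_host(std::size_t count, Kernel kernel) {
  for (std::size_t i = 0; i < count; ++i)
    kernel(i);
}

// Generic algorithm written once: the set-up code and the kernel body (the
// lambda) live in the same template and are instantiated per element type.
template <typename T>
std::vector<T> scaled_add(T alpha, const std::vector<T>& x,
                          const std::vector<T>& y) {
  assert(x.size() == y.size());
  std::vector<T> result(x.size());
  // Capture raw pointers by value so the lambda does not copy the vectors
  const T* px = x.data();
  const T* py = y.data();
  T* pr = result.data();
  parallel_for_host(x.size(), [=](std::size_t i) {
    pr[i] = alpha * px[i] + py[i];
  });
  return result;
}
```

Calling `scaled_add` with `std::vector<int>` arguments and again with `std::vector<double>` arguments instantiates the same source for both element types, which is the kind of reuse the single-source style is meant to allow.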

Although originally developed for use with OpenCL and SPIR, it is a more general heterogeneous framework that can target other systems.

Versions

SYCL was introduced at GDC in March 2014 with provisional version 1.2^[1]. The current version, SYCL 1.2 final, was introduced at IWOCL 2015 in May 2015^[2].

SYCL 2.2 provisional was introduced at IWOCL 2016 in April 2016^[3].

The public versions are:

SYCL 1.2 targeting OpenCL 1.2 hardware features with an OpenCL 1.2 interoperability mode;
provisional SYCL 2.2 targeting OpenCL 2.2 hardware features with an OpenCL 2.2 interoperability mode.

Example

The following example shows the single-source pure C++ programming model. It defines an implicit task graph of three kernels running on a default accelerator.

#include <CL/sycl.hpp>
#include <cstdlib>
#include <iostream>
using namespace cl::sycl;
// Size of the matrices
constexpr size_t N = 2000;
constexpr size_t M = 3000;
int main() {
  // Create a queue to work on default device
  queue q;
  // Create some 2D buffers with N×M double values for our matrices
  buffer<double, 2> a{{ N, M }};
  buffer<double, 2> b{{ N, M }};
  buffer<double, 2> c{{ N, M }};
  // Launch a first asynchronous kernel to initialize buffer "a"
  q.submit([&](auto &cgh) {
      // The kernel writes "a", so get a write accessor on it
      auto A = a.get_access<access::mode::write>(cgh);

      // Enqueue parallel kernel on an N×M 2D iteration space
      cgh.parallel_for<class init_a>(range<2>{ N, M },
                         [=] (auto index) {
                           A[index] = index[0]*2 + index[1];
                         });
    });
  // Launch an asynchronous kernel to initialize buffer "b"
  q.submit([&](auto &cgh) {
      // The kernel writes "b", so get a write accessor on it
      auto B = b.get_access<access::mode::write>(cgh);
      // Enqueue a parallel kernel on an N×M 2D iteration space
      cgh.parallel_for<class init_b>(range<2>{ N, M },
                         [=] (auto index) {
                           B[index] = index[0]*2014 + index[1]*42;
                         });
    });
  // Launch an asynchronous kernel to compute matrix addition c = a + b
  q.submit([&](auto &cgh) {
      // In the kernel "a" and "b" are read, but "c" is written
      // Since the kernel reads "a" and "b", the runtime will add implicitly
      // a producer-consumer dependency to the previous kernels producing them.
      auto A = a.get_access<access::mode::read>(cgh);
      auto B = b.get_access<access::mode::read>(cgh);
      auto C = c.get_access<access::mode::write>(cgh);

      // Enqueue a parallel kernel on an N×M 2D iteration space
      cgh.parallel_for<class matrix_add>(range<2>{ N, M },
                                     [=] (auto index) {
                                       C[index] = A[index] + B[index];
                                     });
    });
  /* Request access to read "c" from the host side. The SYCL runtime
     will wait until "c" is available on the host side before
     returning the accessor.
     This means that no communication happens in the loop nest below.  */
  auto C = c.get_access<access::mode::read>();
  std::cout << std::endl << "Result:" << std::endl;
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < M; j++)
      // Compare the result to the analytic value
      if (C[i][j] != i*(2 + 2014) + j*(1 + 42)) {
        std::cout << "Wrong value " << C[i][j] << " on element "
                  << i << ' ' << j << std::endl;
        exit(-1);
      }

  std::cout << "Good computation!" << std::endl;
  return 0;
}
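
The implicit producer-consumer dependencies mentioned in the comments above can be illustrated with a small host-only model. The following is a minimal sketch in standard C++, not the SYCL runtime or its API: `TaskGraph`, `Access` and `Mode` are hypothetical names introduced here, and the rule used (a task depends on the last writer of every buffer it touches, and a writer also depends on earlier readers) is an assumed simplification of what a real implementation does.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: this is not the SYCL runtime or its API.
enum class Mode { read, write };

// One requested access: which buffer (by name) and in which mode
struct Access {
  std::string buffer;
  Mode mode;
};

class TaskGraph {
public:
  // Register a task described by its accesses and return its index.
  // Dependencies are derived from the access modes alone.
  std::size_t submit(const std::vector<Access>& accesses) {
    const std::size_t id = deps_.size();
    std::set<std::size_t> deps;
    for (const Access& a : accesses) {
      const BufferState& st = state_[a.buffer];
      // Any access must wait for the last task that wrote the buffer
      if (st.has_writer)
        deps.insert(st.last_writer);
      // A write must also wait for tasks still reading the old contents
      if (a.mode == Mode::write)
        deps.insert(st.readers.begin(), st.readers.end());
    }
    // Record this task as the new writer or as an additional reader
    for (const Access& a : accesses) {
      BufferState& st = state_[a.buffer];
      if (a.mode == Mode::write) {
        st.has_writer = true;
        st.last_writer = id;
        st.readers.clear();
      } else {
        st.readers.insert(id);
      }
    }
    deps_.push_back(deps);
    return id;
  }

  // Indices of the tasks that must complete before the given task starts
  const std::set<std::size_t>& dependencies(std::size_t task) const {
    return deps_.at(task);
  }

private:
  struct BufferState {
    bool has_writer = false;
    std::size_t last_writer = 0;
    std::set<std::size_t> readers;
  };
  std::map<std::string, BufferState> state_;
  std::vector<std::set<std::size_t>> deps_;
};
```

Submitting tasks that mirror the three kernels above (write "a", write "b", then read "a" and "b" and write "c") gives the first two tasks no dependencies and makes the third depend on both. In this model the two initialisation tasks are therefore independent of each other, while the addition must wait for both of them.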

Tutorials

Several tutorials are available in the ComputeCpp SYCL guides^[4].

References

[1] Khronos Group (19 March 2014). "Khronos Releases SYCL 1.2 Provisional Specification". Khronos. Retrieved 20 August 2017.
[2] Khronos Group (11 May 2015). "Khronos Releases SYCL 1.2 Final Specification". Khronos. Retrieved 20 August 2017.
[3] Khronos Group (18 April 2016). "Khronos Releases OpenCL 2.2 Provisional Specification with OpenCL C++ Kernel Language". Khronos. Retrieved 18 September 2017.
[4] "Introduction to GPGPU programming with SYCL". Codeplay. Retrieved 3 October 2017.
