Jump to content

Roofline model: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Undid revision 769993637 by Pranith (talk)
Pranith (talk | contribs)
Line 2: Line 2:
The '''Roofline model''' is an intuitive visual [[Analytical Performance Modeling|performance model]] used to provide [[Computer performance|performance]] estimates of a given [[compute kernel]]<nowiki/> or application running on [[Multi-core processor|multi-core]], [[Many-core processing unit|many-core]], or [[Hardware acceleration|accelerator]] [[Microarchitecture|processor architecture]]<nowiki/>s, by showing inherent hardware limitations, and potential benefit and priority of [[Program optimization|optimizations]]. By combining [[Locality of reference|locality]], [[Bandwidth (computing)|bandwidth]], and different [[Parallel computing|parallelization]] paradigms into a single performance figure, the model can be an effective alternative to assess the quality of attained performance instead of using simple percent-of-peak estimates, as it provides insights on both the implementation and inherent performance limitations.
The '''Roofline model''' is an intuitive visual [[Analytical Performance Modeling|performance model]] used to provide [[Computer performance|performance]] estimates of a given [[compute kernel]]<nowiki/> or application running on [[Multi-core processor|multi-core]], [[Many-core processing unit|many-core]], or [[Hardware acceleration|accelerator]] [[Microarchitecture|processor architecture]]<nowiki/>s, by showing inherent hardware limitations, and potential benefit and priority of [[Program optimization|optimizations]]. By combining [[Locality of reference|locality]], [[Bandwidth (computing)|bandwidth]], and different [[Parallel computing|parallelization]] paradigms into a single performance figure, the model can be an effective alternative to assess the quality of attained performance instead of using simple percent-of-peak estimates, as it provides insights on both the implementation and inherent performance limitations.


The most basic Roofline model can be visualized by plotting [[FLOPS|floating-point performance]] as a function of machine peak performance, machine peak bandwidth, and [[operational intensity]]. The resultant curve is effectively a performance bound under which kernel or application performance exists, and includes two platform-specific performance ceilings: a ceiling derived from the memory bandwidth and one derived from the processor’s peak performance.
The most basic Roofline model can be visualized by plotting [[FLOPS|floating-point performance]] as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity. The resultant curve is effectively a performance bound under which kernel or application performance exists, and includes two platform-specific performance ceilings: a ceiling derived from the memory bandwidth and one derived from the processor’s peak performance.


== Related terms and performance metrics ==
== Related terms and performance metrics ==
Line 14: Line 14:
The ''memory traffic'' <math>Q</math> denotes the number of [[byte]]s of memory transfers incurred during the execution of the kernel or application.<ref name=":0" /> In contrast to <math>W</math>, <math>Q</math> is heavily dependent on the properties of the chosen platform, such as for instance the structure of the [[Cache (computing)|cache]] hierarchy.<ref name=":0" />
The ''memory traffic'' <math>Q</math> denotes the number of [[byte]]s of memory transfers incurred during the execution of the kernel or application.<ref name=":0" /> In contrast to <math>W</math>, <math>Q</math> is heavily dependent on the properties of the chosen platform, such as for instance the structure of the [[Cache (computing)|cache]] hierarchy.<ref name=":0" />


=== Operational intensity ===
=== Arithmetic intensity ===
The ''[[operational intensity]]'' <math>I</math>, also referred to as ''arithmetic intensity'',<ref name="WilliamsPhD" /><ref>{{Cite web|url=https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/|title=Roofline Performance Model|last=|first=|date=|website=|publisher=Lawrence Berkeley National Laboratory|access-date=19 June 2016}}</ref> is the ratio of the work <math>W</math> to the memory traffic <math>Q</math>:<ref name=":0" /><math display="block">I = {W \over Q}</math>and denotes the number of operations per byte of memory traffic. When the work <math>W</math> is expressed as ''FLOPS'', the resulting operational intensity <math>I</math> will be the ratio of floating point operations to total data movement (''FLOPS/byte'').
The ''[[arithmetic intensity]]'' <math>I</math>, also referred to as ''operational intensity'',<ref name="WilliamsPhD" /><ref>{{Cite web|url=https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/|title=Roofline Performance Model|last=|first=|date=|website=|publisher=Lawrence Berkeley National Laboratory|access-date=19 June 2016}}</ref> is the ratio of the work <math>W</math> to the memory traffic <math>Q</math>:<ref name=":0" /><math display="block">I = {W \over Q}</math>and denotes the number of operations per byte of memory traffic. When the work <math>W</math> is expressed as ''FLOPS'', the resulting arithmetic intensity <math>I</math> will be the ratio of floating point operations to total data movement (''FLOPS/byte'').


== Naïve Roofline ==
== Naïve Roofline ==
[[File:Example of a naive Roofline model.svg|thumb|Example of a naïve Roofline plot where two [[Compute kernel|kernels]] are reported. The first has an [[operational intensity]] <math>O_1</math> that is underneath the peak bandwidth ceiling, and is then ''[[Memory bound function|memory-bound]]''. Instead, the second has an operational intensity <math>O_2</math> that is underneath the peak performance ceiling, and thus is ''[[CPU-bound|compute-bound]]''.]]
[[File:Example of a naive Roofline model.svg|thumb|Example of a naïve Roofline plot where two [[Compute kernel|kernels]] are reported. The first has an arithmetic intensity <math>O_1</math> that is underneath the peak bandwidth ceiling, and is then ''[[Memory bound function|memory-bound]]''. Instead, the second has an arithmetic intensity <math>O_2</math> that is underneath the peak performance ceiling, and thus is ''[[CPU-bound|compute-bound]]''.]]
The ''naïve Roofline''<ref name="WilliamsPhD">{{cite thesis |type=Ph.D. |last=Williams |first=Samuel W. |date=2008 |title=Auto-tuning Performance on Multicore Computers |publisher=University of California at Berkeley}}</ref> is obtained by applying simple bound and bottleneck analysis.<ref>{{Cite journal|last=Kourtis|first=Kornilios|last2=Goumas|first2=Georgios|last3=Koziris|first3=Nectarios|date=2008-01-01|title=Optimizing Sparse Matrix-vector Multiplication Using Index and Value Compression|url=http://doi.acm.org/10.1145/1366230.1366244|journal=Proceedings of the 5th Conference on Computing Frontiers|series=CF '08|location=New York, NY, USA|publisher=ACM|pages=87–96|doi=10.1145/1366230.1366244|isbn=9781605580777}}</ref> In this formulation of the Roofline model, there are only two parameters: peak [[Computer performance|performance]] and peak [[Bandwidth (computing)|bandwidth]] <nowiki/>of the specific [[Microarchitecture|architecture]]; and one variable: [[operational intensity]]. The peak performance, in general expressed as [[FLOPS|GFLOPS]], can be usually derived from architectural manuals, while the peak bandwidth, that references to peak [[Dynamic random-access memory|DRAM]] bandwidth to be specific, is instead obtained via [[Benchmark (computing)|benchmarking]].<ref name=":0" /><ref name="WilliamsPhD" /> The resulting plot, in general with both [[Cartesian coordinate system|axes]] in [[logarithmic scale]], is then derived by the following formula:<ref name=":0">{{Cite journal|last=Ofenbeck|first=G.|last2=Steinmann|first2=R.|last3=Caparros|first3=V.|last4=Spampinato|first4=D. G.|last5=Püschel|first5=M.|date=2014-03-01|title=Applying the roofline model|url=http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6844463|journal=2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)|pages=76–85|doi=10.1109/ISPASS.2014.6844463}}</ref><math display="block">P = \min \begin{cases} \pi\\ \beta \times I \end{cases}</math>where <math>P</math> is the ''attainable performance,'' <math>\pi</math> is the ''peak performance'', <math>\beta</math> is the ''peak bandwidth'' and <math>I</math> is the ''operational intensity.'' The point at which the performance saturates at the peak performance level <math>\pi</math> - i.e. where the diagonal and horizontal roof meet - is defined as ''ridge point.''<ref name=":1">{{Cite journal|last=Williams|first=Samuel|last2=Waterman|first2=Andrew|last3=Patterson|first3=David|date=2009-04-01|title=Roofline: An Insightful Visual Performance Model for Multicore Architectures|url=http://doi.acm.org/10.1145/1498765.1498785|journal=Commun. ACM|volume=52|issue=4|pages=65–76|doi=10.1145/1498765.1498785|issn=0001-0782}}</ref> The ridge point offers insight on the machine's overall performance, by providing the minimum operational intensity required to be able to achieve peak performance, and by suggesting at a glance the amount of effort required by the programmer to achieve peak performance.<ref name=":1" />
The ''naïve Roofline''<ref name="WilliamsPhD">{{cite thesis |type=Ph.D. |last=Williams |first=Samuel W. |date=2008 |title=Auto-tuning Performance on Multicore Computers |publisher=University of California at Berkeley}}</ref> is obtained by applying simple bound and bottleneck analysis.<ref>{{Cite journal|last=Kourtis|first=Kornilios|last2=Goumas|first2=Georgios|last3=Koziris|first3=Nectarios|date=2008-01-01|title=Optimizing Sparse Matrix-vector Multiplication Using Index and Value Compression|url=http://doi.acm.org/10.1145/1366230.1366244|journal=Proceedings of the 5th Conference on Computing Frontiers|series=CF '08|location=New York, NY, USA|publisher=ACM|pages=87–96|doi=10.1145/1366230.1366244|isbn=9781605580777}}</ref> In this formulation of the Roofline model, there are only two parameters: peak [[Computer performance|performance]] and peak [[Bandwidth (computing)|bandwidth]] <nowiki/>of the specific [[Microarchitecture|architecture]]; and one variable: arithmetic intensity. The peak performance, in general expressed as [[FLOPS|GFLOPS]], can be usually derived from architectural manuals, while the peak bandwidth, that references to peak [[Dynamic random-access memory|DRAM]] bandwidth to be specific, is instead obtained via [[Benchmark (computing)|benchmarking]].<ref name=":0" /><ref name="WilliamsPhD" /> The resulting plot, in general with both [[Cartesian coordinate system|axes]] in [[logarithmic scale]], is then derived by the following formula:<ref name=":0">{{Cite journal|last=Ofenbeck|first=G.|last2=Steinmann|first2=R.|last3=Caparros|first3=V.|last4=Spampinato|first4=D. G.|last5=Püschel|first5=M.|date=2014-03-01|title=Applying the roofline model|url=http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6844463|journal=2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)|pages=76–85|doi=10.1109/ISPASS.2014.6844463}}</ref><math display="block">P = \min \begin{cases} \pi\\ \beta \times I \end{cases}</math>where <math>P</math> is the ''attainable performance,'' <math>\pi</math> is the ''peak performance'', <math>\beta</math> is the ''peak bandwidth'' and <math>I</math> is the ''arithmetic intensity.'' The point at which the performance saturates at the peak performance level <math>\pi</math> - i.e. where the diagonal and horizontal roof meet - is defined as ''ridge point.''<ref name=":1">{{Cite journal|last=Williams|first=Samuel|last2=Waterman|first2=Andrew|last3=Patterson|first3=David|date=2009-04-01|title=Roofline: An Insightful Visual Performance Model for Multicore Architectures|url=http://doi.acm.org/10.1145/1498765.1498785|journal=Commun. ACM|volume=52|issue=4|pages=65–76|doi=10.1145/1498765.1498785|issn=0001-0782}}</ref> The ridge point offers insight on the machine's overall performance, by providing the minimum airthmetic intensity required to be able to achieve peak performance, and by suggesting at a glance the amount of effort required by the programmer to achieve peak performance.<ref name=":1" />


A given [[Compute kernel|kernel]] or application is then characterized by a point given by its operational intensity <math>I</math> (on the x-axis). The attainable performance <math>P</math> is then computed by drawing a vertical line that hits the Roofline curve. Hence. the kernel or application is said to be ''[[Memory bound function|memory-bound]]'' if <math>I \leq \pi/\beta</math>. Conversely, if <math>I \geq \pi/\beta</math>, the [[computation]] is said to be [[CPU-bound|''compute-bound'']].<ref name=":0" />
A given [[Compute kernel|kernel]] or application is then characterized by a point given by its arithmetic intensity <math>I</math> (on the x-axis). The attainable performance <math>P</math> is then computed by drawing a vertical line that hits the Roofline curve. Hence. the kernel or application is said to be ''[[Memory bound function|memory-bound]]'' if <math>I \leq \pi/\beta</math>. Conversely, if <math>I \geq \pi/\beta</math>, the [[computation]] is said to be [[CPU-bound|''compute-bound'']].<ref name=":0" />


== Adding ceilings to the model ==
== Adding ceilings to the model ==
Line 44: Line 44:
=== Locality walls ===
=== Locality walls ===


If the ideal assumption that [[operational intensity]] is solely a function of the kernel is removed, and the cache topology - and therefore [[CPU cache#Cache miss|cache misses]] - is taken into account, the operational intensity clearly becomes dependent on a combination of kernel and architecture. This may result in a degradation in performance depending on the balance between the resultant operational intensity and the ''ridge point''. Unlike "proper" ceilings, the resulting lines on the Roofline plot are vertical barriers through which operational intensity cannot pass without optimization. For this reason, they are referenced to as ''locality walls'' or ''operational intensity'' ''walls.''<ref name="WilliamsPhD" /><ref name=":1" />
If the ideal assumption that arithmetic intensity is solely a function of the kernel is removed, and the cache topology - and therefore [[CPU cache#Cache miss|cache misses]] - is taken into account, the arithmetic intensity clearly becomes dependent on a combination of kernel and architecture. This may result in a degradation in performance depending on the balance between the resultant arithmetic intensity and the ''ridge point''. Unlike "proper" ceilings, the resulting lines on the Roofline plot are vertical barriers through which arithmetic intensity cannot pass without optimization. For this reason, they are referenced to as ''locality walls'' or ''arithmetic intensity'' ''walls.''<ref name="WilliamsPhD" /><ref name=":1" />


== Extension of the model ==
== Extension of the model ==

Revision as of 04:30, 6 July 2017

An example of a Roofline model in its basic form. As the image shows, the curve consists of two platform-specific performance ceilings: the processor’s peak performance.and a ceiling derived from the memory bandwidth. Both axes are in logarithmic scale

The Roofline model is an intuitive visual performance model used to provide performance estimates of a given compute kernel or application running on multi-core, many-core, or accelerator processor architectures, by showing inherent hardware limitations, and potential benefit and priority of optimizations. By combining locality, bandwidth, and different parallelization paradigms into a single performance figure, the model can be an effective alternative to assess the quality of attained performance instead of using simple percent-of-peak estimates, as it provides insights on both the implementation and inherent performance limitations.

The most basic Roofline model can be visualized by plotting floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity. The resultant curve is effectively a performance bound under which kernel or application performance exists, and includes two platform-specific performance ceilings: a ceiling derived from the memory bandwidth and one derived from the processor’s peak performance.

Work

The work denotes the number of operations performed by a given kernel or application.[1] This metric may refer to any type of operation, from number of array points updated per second, to number of integer operations per second, to number of floating point operations per second (FLOPS), and the choice of one or another is driven by convenience. In the majority of the cases however, is expressed as FLOPS.[1][2][3][4][5]

Note that the work is a property of the given kernel or application and thus depend just partially on the platform characteristics.

Memory traffic

The memory traffic denotes the number of bytes of memory transfers incurred during the execution of the kernel or application.[1] In contrast to , is heavily dependent on the properties of the chosen platform, such as for instance the structure of the cache hierarchy.[1]

Arithmetic intensity

The arithmetic intensity , also referred to as operational intensity,[2][6] is the ratio of the work to the memory traffic :[1]and denotes the number of operations per byte of memory traffic. When the work is expressed as FLOPS, the resulting arithmetic intensity will be the ratio of floating point operations to total data movement (FLOPS/byte).

Naïve Roofline

Example of a naïve Roofline plot where two kernels are reported. The first has an arithmetic intensity that is underneath the peak bandwidth ceiling, and is then memory-bound. Instead, the second has an arithmetic intensity that is underneath the peak performance ceiling, and thus is compute-bound.

The naïve Roofline[2] is obtained by applying simple bound and bottleneck analysis.[7] In this formulation of the Roofline model, there are only two parameters: peak performance and peak bandwidth of the specific architecture; and one variable: arithmetic intensity. The peak performance, in general expressed as GFLOPS, can be usually derived from architectural manuals, while the peak bandwidth, that references to peak DRAM bandwidth to be specific, is instead obtained via benchmarking.[1][2] The resulting plot, in general with both axes in logarithmic scale, is then derived by the following formula:[1]where is the attainable performance, is the peak performance, is the peak bandwidth and is the arithmetic intensity. The point at which the performance saturates at the peak performance level - i.e. where the diagonal and horizontal roof meet - is defined as ridge point.[3] The ridge point offers insight on the machine's overall performance, by providing the minimum airthmetic intensity required to be able to achieve peak performance, and by suggesting at a glance the amount of effort required by the programmer to achieve peak performance.[3]

A given kernel or application is then characterized by a point given by its arithmetic intensity (on the x-axis). The attainable performance is then computed by drawing a vertical line that hits the Roofline curve. Hence. the kernel or application is said to be memory-bound if . Conversely, if , the computation is said to be compute-bound.[1]

Adding ceilings to the model

The naïve Roofline provides just an upper bound (the theoretical maximum) to performance. Although it can still give useful insights on the attainable performance, it does not provide a complete picture of what is actually limiting it. If for instance the considered kernel or application performs far below the Roofline, it might be useful to capture other performance ceilings, other than simple peak bandwidth and performance, to better guide the programmer on which optimization to implement, or even to assess the suitability of the architecture used with respect to the analyzed kernel or application.[2] The added ceilings impose then a limit on the attainable performance that is below the actual Roofline, and indicate that the kernel or application cannot break through anyone of these celining without first performing the associated optimization.[2][3]

The Roofline plot can be expanded upon three different aspects: communication, adding the bandwidth ceilings; computation, adding the so-called in-core ceilings; and locality, adding the locality walls.

Bandwidth ceilings

The bandwidth ceilings are bandwidth diagonals placed below the idealized peak bandwidth diagonal. Their existence is due to the lack of some kind of memory related architectural optimization, such as cache coherence, or software optimization, such as poor exposure of concurrency (that in turn limit bandwidth usage).[2][3]

In-core ceilings

The in-core ceilings are roofline-like curve beneath the actual roofline that may be present due to the lack of some form of parallelism. These ceilings effectively limit how high performance can reach. Performance cannot exceed an in-core ceiling until the underlying lack of parallelism is expressed and exploited. The ceilings can be also derived from architectural optimization manuals other than benchmarks.[2][3]

Locality walls

If the ideal assumption that arithmetic intensity is solely a function of the kernel is removed, and the cache topology - and therefore cache misses - is taken into account, the arithmetic intensity clearly becomes dependent on a combination of kernel and architecture. This may result in a degradation in performance depending on the balance between the resultant arithmetic intensity and the ridge point. Unlike "proper" ceilings, the resulting lines on the Roofline plot are vertical barriers through which arithmetic intensity cannot pass without optimization. For this reason, they are referenced to as locality walls or arithmetic intensity walls.[2][3]

Extension of the model

Since its introduction,[2][3] the model has been further extended to account for a broader set of metrics and hardware-related bottlenecks. Already available in literature there are extensions that take into account the impact of NUMA organization of memory,[5] of out-of-order execution,[8] of memory latencies,[8][9] and to model at a finer grain the cache hierarchy[4][8] in order to better understand what is actually limiting performance and drive the optimization process.

Also, the model has been extended to better suit specific architectures and the related characteristics, such as FPGAs.[10]

See also

References

  1. ^ a b c d e f g h Ofenbeck, G.; Steinmann, R.; Caparros, V.; Spampinato, D. G.; Püschel, M. (2014-03-01). "Applying the roofline model". 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): 76–85. doi:10.1109/ISPASS.2014.6844463.
  2. ^ a b c d e f g h i j Williams, Samuel W. (2008). Auto-tuning Performance on Multicore Computers (Ph.D.). University of California at Berkeley.
  3. ^ a b c d e f g h Williams, Samuel; Waterman, Andrew; Patterson, David (2009-04-01). "Roofline: An Insightful Visual Performance Model for Multicore Architectures". Commun. ACM. 52 (4): 65–76. doi:10.1145/1498765.1498785. ISSN 0001-0782.
  4. ^ a b Ilic, A.; Pratas, F.; Sousa, L. (2014-01-01). "Cache-aware Roofline model: Upgrading the loft". IEEE Computer Architecture Letters. 13 (1): 21–24. doi:10.1109/L-CA.2013.6. ISSN 1556-6056.
  5. ^ a b Lorenzo, Oscar G.; Pena, Tomás F.; Cabaleiro, José C.; Pichel, Juan C.; Rivera, Francisco F. (2014-03-31). "Using an extended Roofline Model to understand data and thread affinities on NUMA systems". Annals of Multicore and GPU Programming. 1 (1): 56–67. ISSN 2341-3158.
  6. ^ "Roofline Performance Model". Lawrence Berkeley National Laboratory. Retrieved 19 June 2016.
  7. ^ Kourtis, Kornilios; Goumas, Georgios; Koziris, Nectarios (2008-01-01). "Optimizing Sparse Matrix-vector Multiplication Using Index and Value Compression". Proceedings of the 5th Conference on Computing Frontiers. CF '08. New York, NY, USA: ACM: 87–96. doi:10.1145/1366230.1366244. ISBN 9781605580777.
  8. ^ a b c Cabezas, V. C.; Püschel, M. (2014-10-01). "Extending the roofline model: Bottleneck analysis with microarchitectural constraints". 2014 IEEE International Symposium on Workload Characterization (IISWC): 222–231. doi:10.1109/IISWC.2014.6983061.
  9. ^ Lorenzo, O. G.; Pena, T. F.; Cabaleiro, J. C.; Pichel, J. C.; Rivera, F. F. (2014-03-26). "3DyRM: a dynamic roofline model including memory latency information". The Journal of Supercomputing. 70 (2): 696–708. doi:10.1007/s11227-014-1163-4. ISSN 0920-8542.
  10. ^ da Silva, Bruno; Braeken, An; D'Hollander, Erik H.; Touhafi, Abdellah (2013-01-01). "Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools". Int. J. Reconfig. Comput. 2013: 7:7–7:7. doi:10.1155/2013/428078. ISSN 1687-7195.{{cite journal}}: CS1 maint: unflagged free DOI (link)

Available Tools