<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Kjartan Thor Wikfeldt</title>
    <description>The latest articles on Julia Community 🟣 by Kjartan Thor Wikfeldt (@wikfeldt).</description>
    <link>https://forem.julialang.org/wikfeldt</link>
    <image>
      <url>https://forem.julialang.org/images/DL3jJ2OkMVPROmqfcLKq3XYjpaJ_1mE6suZI1IVEkKs/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS80ODMv/YzFiZTgyZjQtODdm/Yy00YmJhLTg1MDUt/MDRiZTM0MTFhYzAy/LmpwZWc</url>
      <title>Julia Community 🟣: Kjartan Thor Wikfeldt</title>
      <link>https://forem.julialang.org/wikfeldt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/wikfeldt"/>
    <language>en</language>
    <item>
      <title>A brief tour of Julia for High Performance Computing</title>
      <dc:creator>Kjartan Thor Wikfeldt</dc:creator>
      <pubDate>Tue, 05 Jul 2022 13:07:15 +0000</pubDate>
      <link>https://forem.julialang.org/wikfeldt/a-brief-tour-of-julia-for-high-performance-computing-5deb</link>
      <guid>https://forem.julialang.org/wikfeldt/a-brief-tour-of-julia-for-high-performance-computing-5deb</guid>
<description>&lt;p&gt;This blog post first appeared in an &lt;a href="https://enccs.se/news/2022/07/julia-for-hpc/"&gt;ENCCS newsletter&lt;/a&gt;. ENCCS is the Swedish &lt;a href="https://www.eurocc-access.eu/"&gt;EuroCC center&lt;/a&gt;, which aims to empower Swedish industry, academia and the public sector to leverage HPC, AI and HPDA efficiently and effectively, with a particular focus on &lt;a href="https://eurohpc-ju.europa.eu/index_en"&gt;EuroHPC resources&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post gives an overview of Julia's features and capabilities for high performance computing (HPC) and presents a walkthrough on how to benchmark, optimize, parallelize and GPU-port a simple but realistic example problem. It builds on topics covered in the ENCCS workshop &lt;a href="https://enccs.github.io/Julia-for-HPC/"&gt;Julia for high-performance scientific computing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To follow along with the code examples and do your own experiments you will need access to the Julia REPL. The official &lt;a href="https://julialang.org/downloads/"&gt;installation instructions&lt;/a&gt; provide a quick and painless way to install Julia on all major operating systems, but you can also run Julia in the cloud for a small cost via &lt;a href="https://juliahub.com/"&gt;JuliaHub&lt;/a&gt; (who sponsored the ENCCS workshop by giving participants free credits to run Julia on GPUs!). Code examples below can be copy-pasted into the Julia REPL. If you prefer an integrated development environment, there is an excellent &lt;a href="https://www.julia-vscode.org/"&gt;Julia VSCode extension&lt;/a&gt;, which is the preferred environment for many Julia developers.&lt;/p&gt;

&lt;p&gt;To install the packages needed to run the code in this post, copy-paste the following dependency list into a file named Project.toml in a new directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[deps]&lt;/span&gt;
&lt;span class="py"&gt;BenchmarkTools&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"&lt;/span&gt;
&lt;span class="py"&gt;CUDA&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"052768ef-5323-5732-b1bb-66c8b64840ba"&lt;/span&gt;
&lt;span class="py"&gt;Distributed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"8ba89e20-285c-5b6f-9357-94700520ee1b"&lt;/span&gt;
&lt;span class="py"&gt;Plots&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"91a5bcdd-55d7-5caf-9e0b-520d859cae80"&lt;/span&gt;
&lt;span class="py"&gt;SharedArrays&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1a1011a3-84de-559e-8e89-a11a2f7dc383"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then start Julia in that directory using the command &lt;code&gt;julia --project&lt;/code&gt;, which tells Julia to look for a Project.toml file in the current directory. Finally, run the following commands inside the REPL to activate the new software environment and download the dependencies listed in Project.toml:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instantiate&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Introducing the toy problem&lt;/h2&gt;

&lt;p&gt;As an example numerical problem, we will consider the discretized Laplace operator which is used widely in applications including numerical analysis, many physical problems, image processing and even machine learning. Here we will consider a simple two-dimensional implementation with a finite-difference formula, which reads:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  u_{out}(i,j) = 0.25\left[u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1)\right]
&lt;/div&gt;



&lt;p&gt;In Julia, this can be implemented as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt; 
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we follow the Julia convention of appending an exclamation mark to functions that mutate their arguments.&lt;/p&gt;
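&lt;p&gt;Base itself follows this convention: for example, &lt;code&gt;sort&lt;/code&gt; returns a sorted copy while &lt;code&gt;sort!&lt;/code&gt; sorts its argument in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;v = [3, 1, 2]
sort(v)   # returns a new sorted vector; v is unchanged
sort!(v)  # sorts v in place
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;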

&lt;p&gt;We now start by initializing the two 64-bit floating point arrays &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;unew&lt;/code&gt;, and arbitrarily set the boundaries to 10.0 to have something interesting to simulate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;
&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# set boundary conditions&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;.=&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
&lt;span class="n"&gt;unew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To simulate something that resembles, say, the evolution of temperature in a 2D heat conductor (although we've completely ignored the physical constants and time-stepping involved in actually solving the heat equation), we can run a loop of 50,000 "time" steps and visualize the result with the &lt;code&gt;heatmap&lt;/code&gt; function of the &lt;code&gt;Plots&lt;/code&gt; package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50_000&lt;/span&gt;
    &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;# copy new computed field to old array&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Plots&lt;/span&gt;
&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/E_whbnj3km68285qBIyWqgTJNFYBXXNepbsxjZ9q510/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3I2/cnk1NmJkY2NycTRo/Zmg4dTB6LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/E_whbnj3km68285qBIyWqgTJNFYBXXNepbsxjZ9q510/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3I2/cnk1NmJkY2NycTRo/Zmg4dTB6LnBuZw" alt="heatmap" width="598" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This might, however, take around an hour on a standard CPU, so it's time to start optimizing our function, and later to parallelize and GPU-port it! But first we need a way to benchmark our code.&lt;/p&gt;

&lt;h2&gt;Benchmarking&lt;/h2&gt;

&lt;p&gt;Base Julia already has the &lt;code&gt;@time&lt;/code&gt; macro to print the time it takes to execute an expression. However, to get more accurate values it is better to rely on the &lt;a href="https://juliaci.github.io/BenchmarkTools.jl/dev/manual/"&gt;BenchmarkTools.jl&lt;/a&gt; framework, which provides convenient macros for benchmarking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@btime&lt;/code&gt;: quick timing, prints the time an expression takes and the memory allocated&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@benchmark&lt;/code&gt;: run a fuller benchmark on a given expression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BenchmarkTools&lt;/span&gt;

&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#  24.982 ms (0 allocations: 0 bytes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
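
&lt;p&gt;For a fuller statistical picture than the single number printed by &lt;code&gt;@btime&lt;/code&gt;, we can use &lt;code&gt;@benchmark&lt;/code&gt;, which collects many samples and reports minimum, median and mean times together with a histogram. Note the &lt;code&gt;$&lt;/code&gt; interpolation, which BenchmarkTools recommends to avoid measuring the overhead of accessing global variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using BenchmarkTools

# collects many samples and prints timing statistics and a histogram
@benchmark lap2d!($u, $unew)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;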



&lt;h2&gt;Optimization&lt;/h2&gt;

&lt;p&gt;Before we start optimizing, it is worth mentioning an important performance pitfall: multidimensional arrays in Julia are stored in &lt;a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order"&gt;column-major order&lt;/a&gt;, as in Fortran, Matlab and R, but opposite to C/C++ and Python (numpy). If we were to mistakenly have the inner for-loop run over the second dimension, i.e. along rows, of the array (&lt;code&gt;for i in 2:M-1; for j in 2:N-1; unew[i, j] = ...&lt;/code&gt;), the execution time could increase by 2-3 orders of magnitude because of cache misses in every single inner-loop iteration! The Julia manual contains a useful list of additional &lt;a href="https://docs.julialang.org/en/v1/manual/performance-tips/"&gt;performance pitfalls and tips&lt;/a&gt;.&lt;/p&gt;
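
&lt;p&gt;To see the effect yourself, here is a sketch comparing against a deliberately cache-unfriendly variant (&lt;code&gt;lap2d_rowmajor!&lt;/code&gt; is a made-up name used only for this illustration; the exact slowdown depends on the hardware):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# same stencil, but the inner loop now strides across the rows of the
# column-major array, causing a cache miss on nearly every access
function lap2d_rowmajor!(u, unew)
    M, N = size(u)
    for i in 2:M-1
        for j in 2:N-1
            unew[i,j] = 0.25 * (u[i+1,j] + u[i-1,j] + u[i,j+1] + u[i,j-1])
        end
    end
end

using BenchmarkTools
@btime lap2d_rowmajor!($u, $unew)  # expect a much longer time than lap2d!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;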

&lt;p&gt;We first consider the &lt;code&gt;@inbounds&lt;/code&gt; macro, which eliminates array bounds checking within expressions. This can save considerable time, but should only be used if you are sure that no out-of-bounds indices are used. A good practice to ensure this is to use the &lt;code&gt;eachindex&lt;/code&gt; function when iterating over collections. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eachindex&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the &lt;code&gt;lap2d!&lt;/code&gt; function, we'll have to content ourselves with manually ensuring that no out-of-bounds errors can occur. The &lt;code&gt;@inbounds&lt;/code&gt; macro can be inserted in front of an expression or a code block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="nd"&gt;@inbounds&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should result in a considerable speedup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#  15.205 ms (0 allocations: 0 bytes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another option for optimizing serial performance in Julia is the experimental &lt;code&gt;@simd&lt;/code&gt; macro. It grants the LLVM compiler extra liberties to reorder loop iterations, which can in some cases yield a speedup, although the compiler is often able to vectorize inner loops without &lt;code&gt;@simd&lt;/code&gt;, in which case no speedup will be seen. Using &lt;code&gt;@simd&lt;/code&gt; comes with a few requirements which you can read about in the manual (type &lt;code&gt;?@simd&lt;/code&gt; in the REPL). For our Laplacian function, adding &lt;code&gt;@simd&lt;/code&gt; to the innermost loop does not improve performance on an AMD Epyc 7H12 CPU in the &lt;a href="https://www.izum.si/en/hpc-en/"&gt;Vega EuroHPC supercomputer&lt;/a&gt;.&lt;/p&gt;
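
&lt;p&gt;For reference, this is what the attempt looks like (functionally identical to the &lt;code&gt;@inbounds&lt;/code&gt; version above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;function lap2d!(u, unew)
    M, N = size(u)
    for j in 2:N-1
        # @simd allows the compiler to reorder iterations of the inner loop
        @simd for i in 2:M-1
            @inbounds unew[i,j] = 0.25 * (u[i+1,j] + u[i-1,j] + u[i,j+1] + u[i,j-1])
        end
    end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;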

&lt;h2&gt;Parallelization&lt;/h2&gt;

&lt;p&gt;We now move on to parallelization. It's worth mentioning first that some operations in Julia are automatically multithreaded. Consider the multiplication of two large matrices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we run this in a Julia session and monitor the resource usage (e.g. via &lt;code&gt;top&lt;/code&gt; on macOS/Linux or the Task Manager on Windows), we can see that all cores on our computer are used!&lt;/p&gt;
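
&lt;p&gt;This automatic multithreading comes from the BLAS library that Julia ships with rather than from Julia itself. Assuming Julia 1.6 or later, the BLAS thread count can be inspected and adjusted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using LinearAlgebra

BLAS.get_num_threads()   # how many threads BLAS currently uses
BLAS.set_num_threads(4)  # explicitly set the number of BLAS threads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;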

&lt;p&gt;The &lt;code&gt;lap2d!&lt;/code&gt; function will not be automatically multithreaded, so we first resort to the built-in &lt;code&gt;Threads&lt;/code&gt; module. Let's see how many threads we have access to in our Julia session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nthreads&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will probably return 1, unless you have configured your Julia REPL, Jupyter Notebook kernel or VSCode Julia extension to use multiple threads. To start Julia with multiple threads, we need to add the &lt;code&gt;-t/--threads&lt;/code&gt; flag when starting the REPL or set the &lt;code&gt;Julia: Num Threads&lt;/code&gt; option in VSCode. The &lt;code&gt;-t&lt;/code&gt; flag supports an &lt;code&gt;auto&lt;/code&gt; keyword which automatically sets &lt;code&gt;nthreads&lt;/code&gt; to the number of CPU threads on the machine.&lt;/p&gt;

&lt;p&gt;After restarting a Julia session with e.g. 2 threads (&lt;code&gt;julia -t 2&lt;/code&gt;), we can use the main multithreading approach of adding the &lt;code&gt;Threads.@threads&lt;/code&gt; macro to the outermost for loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Threads is in the standard library so append the module with a dot&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;

&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@threads&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="nd"&gt;@inbounds&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How efficiently does it parallelize?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#  11.943 ms (11 allocations: 1.02 KiB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A modest but not insignificant speedup. The overhead of managing the threads offsets much of the benefit of parallelization, since the computation in the inner loop is very simple and fast; adding more threads will not help either.&lt;/p&gt;

&lt;p&gt;Another way to accomplish parallelization on a single computer (i.e. utilizing shared memory) is to use the &lt;code&gt;SharedArrays&lt;/code&gt; package. A SharedArray is an array shared between multiple Julia &lt;em&gt;processes&lt;/em&gt; on a shared-memory machine (in contrast to a &lt;a href="https://github.com/JuliaParallel/DistributedArrays.jl"&gt;DistributedArray&lt;/a&gt;, which is distributed over multiple machines). It can be operated on with methods from the in-built &lt;code&gt;Distributed&lt;/code&gt; module, which handles message-passing parallelism between concurrent Julia processes.&lt;/p&gt;

&lt;p&gt;We can implement a new method for the &lt;code&gt;lap2d!&lt;/code&gt; function which takes SharedArrays. To distribute the work in the double for-loop among parallel processes (&lt;em&gt;workers&lt;/em&gt;), we add the &lt;code&gt;@distributed&lt;/code&gt; macro to spawn independent workers and the &lt;code&gt;@sync&lt;/code&gt; macro to synchronize the workers at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;SharedArrays&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Distributed&lt;/span&gt;

&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;SharedArray&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;SharedArray&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@sync&lt;/span&gt; &lt;span class="nd"&gt;@distributed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="nd"&gt;@inbounds&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though we still intend to run our code on a single computer, the use of the &lt;code&gt;Distributed&lt;/code&gt; module brings us into the territory of distributed computing which deserves a short discussion. &lt;code&gt;Distributed&lt;/code&gt; enables us to create workers not only on single shared-memory machines but also on multiple interconnected workstations or across multiple nodes of a supercomputer. Each worker is an independent Julia process which receives work from the master process. One typical parallelization strategy with &lt;code&gt;Distributed&lt;/code&gt; is to perform parallel reductions using constructs like &lt;code&gt;@distributed (+) for i=1:N ; f(i) ; ...&lt;/code&gt; (summation with &lt;code&gt;(+)&lt;/code&gt; can be replaced by any other reducer function). This divides work evenly between workers and is efficient for problems with fast inner loops. Another is to do parallel mapping with &lt;code&gt;pmap&lt;/code&gt; in constructs like &lt;code&gt;pmap(f, iterable)&lt;/code&gt;, which is efficient to use with expensive function calls since it performs automatic load-balancing. More details about these methods can be found &lt;a href="https://enccs.github.io/Julia-for-HPC/parallelization/"&gt;elsewhere&lt;/a&gt;.&lt;/p&gt;
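
&lt;p&gt;As a minimal sketch of both patterns (&lt;code&gt;slow_work&lt;/code&gt; is a made-up stand-in for an expensive function, and &lt;code&gt;@everywhere&lt;/code&gt; is needed to define it on all workers):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using Distributed
addprocs(4)  # add workers (discussed further below)

# parallel reduction: each worker computes a partial sum, (+) combines them
total = @distributed (+) for i in 1:1_000_000
    i^2
end

# parallel map: pmap load-balances the expensive calls across the workers
@everywhere slow_work(x) = (sleep(0.01); x^2)
results = pmap(slow_work, 1:100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;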

&lt;p&gt;Returning to our Laplacian function, we now need to create multiple &lt;em&gt;workers&lt;/em&gt;. This can be done similarly to multithreading by adding the &lt;code&gt;-p/--procs&lt;/code&gt; flag when we launch Julia, which also accepts an &lt;code&gt;auto&lt;/code&gt; keyword. Alternatively, we can use methods from &lt;code&gt;Distributed&lt;/code&gt; to manage worker processes within a Julia session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Distributed&lt;/span&gt;

&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nprocs&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;   &lt;span class="c"&gt;# returns 1&lt;/span&gt;
&lt;span class="n"&gt;addprocs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;         &lt;span class="c"&gt;# add 4 workers&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nprocs&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;   &lt;span class="c"&gt;# all processes, returns 5&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nworkers&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt; &lt;span class="c"&gt;# only worker processes, returns 4&lt;/span&gt;
&lt;span class="n"&gt;rmprocs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;  &lt;span class="c"&gt;# remove all worker processes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us add 4 worker processes, create new SharedArrays out of &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;unew&lt;/code&gt;, and take advantage of multiple dispatch to measure the performance of the SharedArray-specialized method of the &lt;code&gt;lap2d!&lt;/code&gt; function!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Distributed&lt;/span&gt;
&lt;span class="n"&gt;addprocs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;su&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SharedArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;sunew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SharedArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;

&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;su&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sunew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#   14.106 ms (678 allocations: 31.56 KiB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not as efficient as using multithreading. The conclusion that can be drawn from these tests is that very simple calculations benefit little from parallelization due to the overhead in managing threads or processes.&lt;/p&gt;

&lt;h2&gt;GPU programming&lt;/h2&gt;

&lt;p&gt;Last but not least, we will turn our attention to porting our code to GPUs using the &lt;a href="https://github.com/JuliaGPU/CUDA.jl"&gt;CUDA.jl&lt;/a&gt; package. For AMD and Intel GPUs there are the corresponding packages &lt;a href="https://amdgpu.juliagpu.org/stable/"&gt;AMDGPU.jl&lt;/a&gt; and &lt;a href="https://github.com/JuliaGPU/oneAPI.jl"&gt;oneAPI.jl&lt;/a&gt;, respectively. CUDA.jl offers both user-friendly high-level abstractions that require little programming effort and a lower-level approach for writing kernels with fine-grained control. Unless you have an NVIDIA GPU along with the CUDA libraries on your own computer, you will need to use JuliaHub or a GPU-powered supercomputer for the following code examples!&lt;/p&gt;

&lt;p&gt;GPU programming with Julia can be as simple as using a &lt;code&gt;CuArray&lt;/code&gt; instead of a regular Julia array. The CuArray type closely resembles Base.Array, which enables us to write generic code that works on both types. The following code copies an array to the GPU, executes a simple operation on the GPU, and then copies the result back to the CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;

&lt;span class="n"&gt;A_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;A_d&lt;/span&gt; &lt;span class="o"&gt;.+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the overhead of copying data to the GPU makes such simple calculations very slow.&lt;/p&gt;

&lt;p&gt;Let's see just how much performance one can get out of a GPU by performing a large matrix multiplication instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BenchmarkTools&lt;/span&gt;

&lt;span class="c"&gt;# array on CPU&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# array on GPU&lt;/span&gt;
&lt;span class="n"&gt;A_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="c"&gt;#   3.130 s (2 allocations: 512.00 MiB)&lt;/span&gt;
&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;A_d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A_d&lt;/span&gt;
&lt;span class="c"&gt;#   15.190 μs (32 allocations: 640 bytes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Judging by the printed timings, that's a speedup of over five orders of magnitude!&lt;/p&gt;
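
&lt;p&gt;One caveat: GPU operations execute asynchronously, so the timing above mostly measures the kernel launch rather than the multiplication itself. To include the actual computation in the measurement, the benchmark should synchronize:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# CUDA.@sync blocks until the GPU has finished the computation
@btime CUDA.@sync A_d * A_d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;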

&lt;p&gt;A powerful way to program GPUs with arrays is through Julia’s higher-order array abstractions. The simple element-wise addition we saw above, &lt;code&gt;A_d .+= 1&lt;/code&gt;, is an example of this, but more general constructs can be created with &lt;code&gt;broadcast&lt;/code&gt;, &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;reduce&lt;/code&gt;, &lt;code&gt;accumulate&lt;/code&gt; etc:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;accumulate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, the NVIDIA libraries contain precompiled kernels for common operations like matrix multiplication (cuBLAS), fast Fourier transforms (cuFFT), linear solvers (cuSOLVER), etc., and these kernels are wrapped in CUDA.jl and can be used directly with CuArrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# create a 100x100 Float32 random array and an uninitialized array&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Float32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}(&lt;/span&gt;&lt;span class="nb"&gt;undef&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# use cuBLAS for matrix multiplication&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;LinearAlgebra&lt;/span&gt;
&lt;span class="n"&gt;mul!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# use cuSOLVER for QR factorization&lt;/span&gt;
&lt;span class="n"&gt;qr&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# solve equation A*X == B&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;\&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="c"&gt;# use cuFFT for FFT&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUFFT&lt;/span&gt;
&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, not all algorithms can be made to work with the higher-level array abstractions or preexisting kernels. In cases like our Laplacian operator it’s necessary to explicitly write our own GPU kernel.&lt;/p&gt;

&lt;p&gt;In principle we could run our function as-is on a GPU by converting the input arrays to CuArrays and then compiling and launching a GPU kernel using the &lt;code&gt;@cuda&lt;/code&gt; macro (the suffix &lt;code&gt;_d&lt;/code&gt; is often used for arrays living on the GPU device):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;

&lt;span class="n"&gt;u_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unew_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# WRONG!&lt;/span&gt;
&lt;span class="nd"&gt;@cuda&lt;/span&gt; &lt;span class="n"&gt;lap2d!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the performance would be terrible because each thread on the GPU would perform the exact same loop over all elements. What we have to do instead is remove the loop and use the special &lt;code&gt;threadIdx&lt;/code&gt;, &lt;code&gt;blockIdx&lt;/code&gt; and &lt;code&gt;blockDim&lt;/code&gt; functions to split the array indices between the computing elements of the GPU.&lt;/p&gt;

&lt;p&gt;The following image shows how threads and thread blocks map to the hardware architecture of a GPU in the case of a 1D array:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/UGR_GEwo4zaHx1lbpHQB1ChCMJbTLEARM1OM3tqdI-w/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3Nl/YnlyN2I3ZXR3czZv/aWdhMnU1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/UGR_GEwo4zaHx1lbpHQB1ChCMJbTLEARM1OM3tqdI-w/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3Nl/YnlyN2I3ZXR3czZv/aWdhMnU1LnBuZw" alt="GPU threads and blocks" width="880" height="393"&gt;&lt;/a&gt;&lt;br&gt;
(adapted from &lt;a href="https://enccs.github.io/CUDA"&gt;this CUDA tutorial&lt;/a&gt;, distributed under &lt;a href="https://creativecommons.org/licenses/by/4.0/"&gt;CC BY 4.0&lt;/a&gt;, Copyright (c) Artem Zhmurov and individual contributors)&lt;/p&gt;

&lt;p&gt;SM here stands for "streaming multiprocessor", the highest level of granularity in the GPU architecture. We see, for example, that index 11 of the array falls in block 2 with threadIdx=3, and that there are 4 threads per block. Thus, a for loop &lt;code&gt;for i=1:N&lt;/code&gt; is usually replaced by &lt;code&gt;i = threadIdx().x + (blockIdx().x - 1) * blockDim().x&lt;/code&gt; when writing a GPU kernel.&lt;/p&gt;

&lt;p&gt;Let's see how our Laplacian function looks after porting it to the GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="nd"&gt;@inbounds&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; lap2d_gpu!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;
        &lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;nothing&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use &lt;code&gt;@inbounds&lt;/code&gt; for the same purpose as earlier, and the conditional ensures that we only update interior points (&lt;code&gt;i&lt;/code&gt; indexes the first array dimension, of size &lt;code&gt;M&lt;/code&gt;, and &lt;code&gt;j&lt;/code&gt; the second, of size &lt;code&gt;N&lt;/code&gt;). Note also that GPU kernels always need to return &lt;code&gt;nothing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To launch this kernel we need to choose the number of threads and blocks in two dimensions (in each dimension, &lt;code&gt;nthreads * nblocks&lt;/code&gt; should cover the array size along that dimension). Setting the number of threads to 1 would be inefficient, since only one GPU core per SM would be doing actual work, but we also cannot simply launch one block with one thread per array element: different families of GPUs come with different hardware limits. We can query the maximum number of threads per block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an NVIDIA A100, a high-end GPU found in many current supercomputers (including &lt;a href="https://www.izum.si/en/vega-en/"&gt;Vega&lt;/a&gt;), the maximum number of threads per block is 1024. We can, for example, choose the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# 32*32=1024&lt;/span&gt;
&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N÷threads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;M÷threads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
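
&lt;p&gt;One caveat: the integer divisions above only work out when the array dimensions are exact multiples of the block size. For general sizes, one option (a sketch, not part of the timing runs below) is to round the number of blocks up with &lt;code&gt;cld&lt;/code&gt; (ceiling division) and rely on the bounds check inside the kernel to mask out the excess threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# round up so all elements are covered even if M or N is not divisible
# by the block size; the if-condition in the kernel makes the extra
# threads simply do nothing
blocks = (cld(M, threads[1]), cld(N, threads[2]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;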



&lt;p&gt;We launch the kernel with the &lt;code&gt;@cuda&lt;/code&gt; macro and time it with &lt;code&gt;@btime&lt;/code&gt;. Since kernel launches are asynchronous, we need to add &lt;code&gt;CUDA.@sync&lt;/code&gt; to synchronize the GPU so that we don't measure only the launch overhead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;u_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unew_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nd"&gt;@sync&lt;/span&gt; &lt;span class="nd"&gt;@cuda&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;lap2d_gpu2!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#   248.392 μs (73 allocations: 4.25 KiB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Around a factor of 45 faster than the multithreaded version!&lt;/p&gt;
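
&lt;p&gt;It is worth sanity-checking the GPU result before celebrating the speedup. A quick check (a sketch, recomputing one update step on the CPU with plain loops, and assuming &lt;code&gt;unew&lt;/code&gt; was initialized from &lt;code&gt;u&lt;/code&gt; so that the boundaries agree):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# redo a single stencil update on the CPU and compare with the GPU result
unew_check = copy(u)
M, N = size(u)
for j in 2:N-1, i in 2:M-1
    unew_check[i, j] = 0.25 * (u[i+1, j] + u[i-1, j] + u[i, j+1] + u[i, j-1])
end
@assert Array(unew_d) ≈ unew_check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;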

&lt;p&gt;But we might be able to do even better by using more threads in the x-dimension, since arrays in Julia are stored column-major in memory (see the short illustration below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N÷threads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;M÷threads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nd"&gt;@sync&lt;/span&gt; &lt;span class="nd"&gt;@cuda&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;lap2d_gpu2!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#   214.671 μs (26 allocations: 1.33 KiB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
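
&lt;p&gt;As a quick illustration of the column-major layout mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;A = [10 30; 20 40]
vec(A)   # [10, 20, 30, 40]: elements within a column sit next to each other
         # in memory, so consecutive threads along x (the first index) read
         # and write consecutive addresses, which the GPU can coalesce
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;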



&lt;p&gt;Fortunately, the performance is not overly sensitive to this choice: thread configurations ranging from &lt;code&gt;(128, 8)&lt;/code&gt; to &lt;code&gt;(512, 2)&lt;/code&gt; all yield similar results.&lt;/p&gt;
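
&lt;p&gt;Such a comparison is easy to script. A sketch using &lt;code&gt;CUDA.@elapsed&lt;/code&gt; (run the kernel once beforehand so that compilation time is not included in the first measurement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# time a few configurations that all use 1024 threads per block
for threads in ((128, 8), (256, 4), (512, 2))
    blocks = (M ÷ threads[1], N ÷ threads[2])
    t = CUDA.@elapsed @cuda blocks=blocks threads=threads lap2d_gpu!(u_d, unew_d)
    println(threads, ": ", round(t * 1e6, digits=1), " μs")
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;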

&lt;p&gt;GPUs typically perform significantly better in single precision, and some algorithms (e.g. molecular dynamics and neural networks) are sufficiently robust to handle the reduced precision. Indeed, switching from double to single precision reduces the runtime by around 25% here (from 214.7 to 163.1 μs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;unew_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Float32&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;u_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CuArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Float32&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;span class="nd"&gt;@btime&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nd"&gt;@sync&lt;/span&gt; &lt;span class="nd"&gt;@cuda&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;lap2d_gpu2!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#   163.131 μs (26 allocations: 1.33 KiB)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
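
&lt;p&gt;One caveat with this conversion: the literal &lt;code&gt;0.25&lt;/code&gt; inside the kernel is a &lt;code&gt;Float64&lt;/code&gt;, so parts of the arithmetic may still be promoted to double precision even though the arrays are now &lt;code&gt;Float32&lt;/code&gt;. A small demonstration of the promotion rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;x = 1.0f0             # a Float32 value
typeof(0.25 * x)      # Float64: the Float64 literal promotes the product
typeof(0.25f0 * x)    # Float32: a Float32 literal keeps single precision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inside the kernel one would therefore write &lt;code&gt;0.25f0&lt;/code&gt;, or &lt;code&gt;eltype(u)(0.25)&lt;/code&gt; to stay generic over the element type.&lt;/p&gt;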



&lt;p&gt;We can now easily afford to run a million iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;
    &lt;span class="nd"&gt;@cuda&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;lap2d_gpu!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;u_d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unew_d&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_d&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/JB4oJ1N9HEOgXSVwqbIOSjuOj2g8GoaOSkawxKOZZtc/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL20w/ZWtjbGVoYTUyZ216/M3l0YzV2LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/JB4oJ1N9HEOgXSVwqbIOSjuOj2g8GoaOSkawxKOZZtc/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL20w/ZWtjbGVoYTUyZ216/M3l0YzV2LnBuZw" alt="Heatmap-1M" width="600" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Further learning&lt;/h2&gt;

&lt;p&gt;More details on using Julia for HPC can be found in the &lt;a href="https://enccs.github.io/Julia-for-HPC/"&gt;ENCCS Julia lesson&lt;/a&gt; on which this post is based. There is also a large and rapidly growing number of online learning resources for Julia, some of which are listed on &lt;a href="https://julialang.org/learning/"&gt;https://julialang.org/learning/&lt;/a&gt;. Finally, anyone interested in hearing about what's happening in the world of Julia should consider attending the free online &lt;a href="https://juliacon.org/2022/"&gt;JuliaCon 2022 conference&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>hpc</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
