"SPARTA WWW Site"_sws :c
:link(sws,index.html)
:line
SPARTA Benchmarks :h3
This page gives SPARTA performance on several benchmark problems, run
on different machines, both in serial and parallel. When the hardware
supports it, results using the accelerator options currently available
in the code are also shown.
All the information is provided below to run these tests or similar
tests on your own machine. This includes info on how to build SPARTA,
how to launch it with the appropriate command-line arguments, and
links to input and output files generated by all the benchmark tests.
Note that input files and a few sample output files are also provided
in the {bench} directory of the SPARTA distribution. See the
bench/README file for details.
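As a quick orientation, a typical benchmark launch looks roughly like
the sketch below. The executable name, the MPI launcher, and the use
of -var switches to set the grid dimensions are assumptions based on
common SPARTA conventions; consult bench/README for the exact
commands.

```shell
# Hypothetical launch of the free-flow benchmark on 16 cores of one node.
# The executable name (spa_chama_cpu) and the grid variables (x,y,z) are
# assumptions; bench/README documents the real invocation.
cd bench
mpirun -np 16 ../src/spa_chama_cpu -in in.free \
    -var x 100 -var y 100 -var z 100   # 100^3 grid cells -> 10M particles
```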
Benchmark results: :h4
"Free"_#free_old = free molecular flow in a box, older results on a large BG/Q machine
"Collide"_#collide_old = collisional molecular flow in a box, older results on a large BG/Q machine
<IMG SRC = "images/new.gif"> "Free"_#free = same as above, with accelerator options and new machines
<IMG SRC = "images/new.gif"> "Collide"_#collide = same as above, with accelerator options and new machines
<IMG SRC = "images/new.gif"> "Sphere"_#sphere = flow around a sphere, with accelerator options and new machines :ul
Additional info: :h4
"Accelerator options"_#accelerate
"Machines and node hardware"_#machines
"How to build SPARTA and run the benchmarks"_#howto
"How to interpret the plots"_#interpret :ul
:line
Free molecular flow in a box :h4,link(free_old)
This benchmark is for particles advecting in free molecular flow (no
collisions) on a regular grid overlaying a 3d closed box with
reflective boundaries. The size of the grid was varied; the particle
count is always 10x the number of grid cells. Particles were
initialized with a thermal temperature (no streaming velocity) so they
move in random directions. Since there is very little computation to
do, this is a good stress test of the communication capabilities of
SPARTA and the machines it is run on.
The input script for this problem is bench/in.free in the SPARTA
distribution.
This plot shows timing results in particle moves/sec/node, for runs
of different sizes on varying node counts of two different machines.
Problems as small as 1M grid cells (10M particles) and as large as 10B
grid cells (100B particles) were run.
Chama is an Intel cluster with Infiniband described "below"_#machines.
Each node of chama has dual 8-core Intel Sandy Bridge CPUs. These
tests were run on all 16 cores of each node, i.e. with 16 MPI
tasks/node. Up to 1024 nodes were used (16K MPI tasks). Mira is an
IBM BG/Q machine at Argonne National Labs. It has 16 cores per node.
These tests were run with 4 MPI tasks/core, for a total of 64 MPI
tasks/node. Up to 8K nodes were used (512K MPI tasks).
The plot shows that a Chama node is about 2x faster than a BG/Q node.
Each individual curve in the plot is a strong scaling test, where the
same size problem is run on more and more nodes. Perfect scalability
would be a horizontal line. The curves show some initial super-linear
speed-up as the particle count/node decreased, due to cache effects,
then a slow-down as more nodes are added due to too-few particles/node
and increased communication costs.
Jumping from curve-to-curve as node count increases is a weak scaling
test, since the problem size is increasing with node count. Again a
horizontal line would represent perfect weak scaling.
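Since the y-axis is already normalized per node, the parallel
efficiency between any two points on a curve is simply the ratio of
their per-node rates. A minimal sketch with hypothetical numbers:

```shell
# Parallel efficiency from two (hypothetical) points on a scaling curve,
# in the plots' units of particle moves/sec/node.
rate_small=5.0e6    # e.g. on 64 nodes
rate_large=4.0e6    # e.g. on 512 nodes for the same problem (strong scaling)

# A horizontal curve (constant per-node rate) is perfect scaling, so
# efficiency = rate at the larger node count / rate at the smaller one.
eff=$(awk -v a="$rate_small" -v b="$rate_large" 'BEGIN { printf "%.0f", 100 * b / a }')
echo "parallel efficiency: ${eff}%"   # prints "parallel efficiency: 80%"
```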
:c,image(images/bench_free_small.jpg,images/bench_free.jpg)
Click on the image to see a larger version.
:line
Collisional flow in a box :h4,link(collide_old)
This benchmark is for particles undergoing collisional flow.
Everything about the problem is the same as the free molecular flow
problem described above, except that collisions were enabled, which
requires extra computation, as well as particle sorting each timestep
to identify particles in the same grid cell.
The input script for this problem is bench/in.collide in the
SPARTA distribution.
As above, this plot shows timing results in particle moves/sec/node,
for runs of different sizes on varying node counts. Data for the same
two machines is shown: "chama"_#machines (Intel cluster with Infiniband
at Sandia) and mira (IBM BG/Q at ANL). Comparing these timings to the
free molecule flow plot in the previous section shows the cost of
collisions (and sorting) slows down the performance by a factor of
about 2.5x. Cache effects (super-linear speed-up) are smaller due to
the increased computational costs.
For collisional flow, problems as small as 1M grid cells (10M
particles) and as large as 1B grid cells (10B particles) were run.
The discussion above regarding strong and weak scaling also applies to
this plot. For any curve, a horizontal line would represent perfect
weak scaling.
:c,image(images/bench_collide_small.jpg,images/bench_collide.jpg)
Click on the image to see a larger version.
:line
:line
Free benchmark :h4,link(free)
"in.free"_bench/in.free input script :ul
As described above, this benchmark is for particles advecting in free
molecular flow (no collisions) on a regular grid overlaying a 3d
closed box with reflective boundaries. The size of the grid was
varied; the particle count is always 10x the number of grid
cells. Particles were initialized with a thermal temperature (no
streaming velocity) so they move in random directions. Since there is
very little computation to do, this is a good stress test of the
communication capabilities of SPARTA and the machines it is run on.
Additional packages needed for this benchmark: none
Comments:
In the data below, K = 1000 particles, so 1M = 1024*1000. :ul
:line
[Free single core and single node performance:]
Best timings for any accelerator option as a function of problem size.
The first plot is for runs on a single CPU or KNL core; the second for
runs on a single CPU or KNL node or a single GPU. All results are
double precision.
:image(bench/plot_free_core_best_small.jpg,bench/plot_free_core_best.jpg)
:image(bench/plot_free_node_best_small.jpg,bench/plot_free_node_best.jpg)
"Table for single core"_bench/plot_free_core_best.html
"Table for single node"_bench/plot_free_node_best.html :ul
:line
[Free strong and weak scaling:]
Fastest timing for any accelerator option running on multiple CPU or
KNL nodes or multiple GPU nodes, as a function of node count. For
strong scaling, 2 problem sizes: 8M particles, 64M particles. For weak
scaling, 2 problem sizes: 1M particles/node, 16M particles/node. Only
a single GPU/node was used; all results are double precision.
Strong scaling means the same size problem is run on successively more
nodes. Weak scaling means the problem size doubles each time the node
count doubles. See a fuller description "here"_#interpret of how to
interpret these plots.
:image(bench/plot_free_strong_8M_best_small.jpg,bench/plot_free_strong_8M_best.jpg)
:image(bench/plot_free_strong_64M_best_small.jpg,bench/plot_free_strong_64M_best.jpg)
:image(bench/plot_free_weak_1M_best_small.jpg,bench/plot_free_weak_1M_best.jpg)
:image(bench/plot_free_weak_16M_best_small.jpg,bench/plot_free_weak_16M_best.jpg)
"Table for strong scaling of 8M particles"_bench/plot_free_strong_8M_best.html
"Table for strong scaling of 64M particles"_bench/plot_free_strong_64M_best.html
"Table for weak scaling of 1M particles/node"_bench/plot_free_weak_1M_best.html
"Table for weak scaling of 16M particles/node"_bench/plot_free_weak_16M_best.html :ul
:line
[Free performance details]:
Modes: per-core, per-node, strong scaling, weak scaling
Hardware: CPU, KNL, GPU options
Within plot: accelerator packages, one or multiple GPUs/node :ul
Mode | SPARTA Version | Hardware | Machine | Size | Plot | Table
core | 23Dec17 | SandyBridge | chama | 1K-16K | "plot"_bench/plot_free_chama_core_CPU.jpg | "table"_bench/plot_free_chama_core_CPU.html
core | 23Dec17 | Haswell | mutrino | 1K-16K | "plot"_bench/plot_free_mutrino_core_CPU.jpg | "table"_bench/plot_free_mutrino_core_CPU.html
core | 23Dec17 | Broadwell | serrano | 1K-16K | "plot"_bench/plot_free_serrano_core_CPU.jpg | "table"_bench/plot_free_serrano_core_CPU.html
core | 23Dec17 | KNL | mutrino | 1K-16K | "plot"_bench/plot_free_mutrino_core_KNL.jpg | "table"_bench/plot_free_mutrino_core_KNL.html
node | 23Dec17 | SandyBridge | chama | 32K-128M | "plot"_bench/plot_free_chama_node_CPU.jpg | "table"_bench/plot_free_chama_node_CPU.html
node | 23Dec17 | Haswell | mutrino | 32K-128M | "plot"_bench/plot_free_mutrino_node_CPU.jpg | "table"_bench/plot_free_mutrino_node_CPU.html
node | 23Dec17 | Broadwell | serrano | 32K-128M | "plot"_bench/plot_free_serrano_node_CPU.jpg | "table"_bench/plot_free_serrano_node_CPU.html
node | 23Dec17 | KNL | mutrino | 32K-128M | "plot"_bench/plot_free_mutrino_node_KNL.jpg | "table"_bench/plot_free_mutrino_node_KNL.html
node | 23Dec17 | K80 | ride80 | 32K-128M | "plot"_bench/plot_free_ride80_node_GPU.jpg | "table"_bench/plot_free_ride80_node_GPU.html
node | 23Dec17 | P100 | ride100 | 32K-128M | "plot"_bench/plot_free_ride100_node_GPU.jpg | "table"_bench/plot_free_ride100_node_GPU.html
strong | 23Dec17 | SandyBridge | chama | 8M | "plot"_bench/plot_free_chama_strong_8M_CPU.jpg | "table"_bench/plot_free_chama_strong_8M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 8M | "plot"_bench/plot_free_mutrino_strong_8M_CPU.jpg | "table"_bench/plot_free_mutrino_strong_8M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 8M | "plot"_bench/plot_free_serrano_strong_8M_CPU.jpg | "table"_bench/plot_free_serrano_strong_8M_CPU.html
strong | 23Dec17 | KNL | mutrino | 8M | "plot"_bench/plot_free_mutrino_strong_8M_KNL.jpg | "table"_bench/plot_free_mutrino_strong_8M_KNL.html
strong | 23Dec17 | K80 | ride80 | 8M | "plot"_bench/plot_free_ride80_strong_8M_GPU.jpg | "table"_bench/plot_free_ride80_strong_8M_GPU.html
strong | 23Dec17 | P100 | ride100 | 8M | "plot"_bench/plot_free_ride100_strong_8M_GPU.jpg | "table"_bench/plot_free_ride100_strong_8M_GPU.html
strong | 23Dec17 | SandyBridge | chama | 64M | "plot"_bench/plot_free_chama_strong_64M_CPU.jpg | "table"_bench/plot_free_chama_strong_64M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 64M | "plot"_bench/plot_free_mutrino_strong_64M_CPU.jpg | "table"_bench/plot_free_mutrino_strong_64M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 64M | "plot"_bench/plot_free_serrano_strong_64M_CPU.jpg | "table"_bench/plot_free_serrano_strong_64M_CPU.html
strong | 23Dec17 | KNL | mutrino | 64M | "plot"_bench/plot_free_mutrino_strong_64M_KNL.jpg | "table"_bench/plot_free_mutrino_strong_64M_KNL.html
strong | 23Dec17 | K80 | ride80 | 64M | "plot"_bench/plot_free_ride80_strong_64M_GPU.jpg | "table"_bench/plot_free_ride80_strong_64M_GPU.html
strong | 23Dec17 | P100 | ride100 | 64M | "plot"_bench/plot_free_ride100_strong_64M_GPU.jpg | "table"_bench/plot_free_ride100_strong_64M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 1M/node | "plot"_bench/plot_free_chama_weak_1M_CPU.jpg | "table"_bench/plot_free_chama_weak_1M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 1M/node | "plot"_bench/plot_free_mutrino_weak_1M_CPU.jpg | "table"_bench/plot_free_mutrino_weak_1M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 1M/node | "plot"_bench/plot_free_serrano_weak_1M_CPU.jpg | "table"_bench/plot_free_serrano_weak_1M_CPU.html
weak | 23Dec17 | KNL | mutrino | 1M/node | "plot"_bench/plot_free_mutrino_weak_1M_KNL.jpg | "table"_bench/plot_free_mutrino_weak_1M_KNL.html
weak | 23Dec17 | K80 | ride80 | 1M/node | "plot"_bench/plot_free_ride80_weak_1M_GPU.jpg | "table"_bench/plot_free_ride80_weak_1M_GPU.html
weak | 23Dec17 | P100 | ride100 | 1M/node | "plot"_bench/plot_free_ride100_weak_1M_GPU.jpg | "table"_bench/plot_free_ride100_weak_1M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 16M/node | "plot"_bench/plot_free_chama_weak_16M_CPU.jpg | "table"_bench/plot_free_chama_weak_16M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 16M/node | "plot"_bench/plot_free_mutrino_weak_16M_CPU.jpg | "table"_bench/plot_free_mutrino_weak_16M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 16M/node | "plot"_bench/plot_free_serrano_weak_16M_CPU.jpg | "table"_bench/plot_free_serrano_weak_16M_CPU.html
weak | 23Dec17 | KNL | mutrino | 16M/node | "plot"_bench/plot_free_mutrino_weak_16M_KNL.jpg | "table"_bench/plot_free_mutrino_weak_16M_KNL.html
weak | 23Dec17 | K80 | ride80 | 16M/node | "plot"_bench/plot_free_ride80_weak_16M_GPU.jpg | "table"_bench/plot_free_ride80_weak_16M_GPU.html
weak | 23Dec17 | P100 | ride100 | 16M/node | "plot"_bench/plot_free_ride100_weak_16M_GPU.jpg | "table"_bench/plot_free_ride100_weak_16M_GPU.html
:tb(s=|,ea=c)
:line
:line
Collide benchmark :h4,link(collide)
"in.collide"_bench/in.collide input script
"in.collide.kokkos_cuda"_bench/in.collide.gpu.steps variant for Kokkos/Cuda package :ul
As described above, this benchmark is for particles undergoing
collisional flow. Everything about the problem is the same as the free
molecular flow problem described above, except that collisions were
enabled, which requires extra computation, as well as particle sorting
each timestep to identify particles in the same grid cell.
Additional packages needed for this benchmark: none
Comments:
In the data below, K = 1000 particles, so 1M = 1024*1000. :ul
:line
[Collide single core and single node performance:]
Best timings for any accelerator option as a function of problem size.
The first plot is for runs on a single CPU or KNL core; the second for
runs on a single CPU or KNL node or a single GPU. All results are
double precision.
:image(bench/plot_collide_core_best_small.jpg,bench/plot_collide_core_best.jpg)
:image(bench/plot_collide_node_best_small.jpg,bench/plot_collide_node_best.jpg)
"Table for single core"_bench/plot_collide_core_best.html
"Table for single node"_bench/plot_collide_node_best.html :ul
:line
[Collide strong and weak scaling:]
Fastest timing for any accelerator option running on multiple CPU or
KNL nodes or multiple GPU nodes, as a function of node count. For
strong scaling, 2 problem sizes: 8M particles, 64M particles. For weak
scaling, 2 problem sizes: 1M particles/node, 16M particles/node. Only
a single GPU/node was used; all results are double precision.
Strong scaling means the same size problem is run on successively more
nodes. Weak scaling means the problem size doubles each time the node
count doubles. See a fuller description "here"_#interpret of how to
interpret these plots.
:image(bench/plot_collide_strong_8M_best_small.jpg,bench/plot_collide_strong_8M_best.jpg)
:image(bench/plot_collide_strong_64M_best_small.jpg,bench/plot_collide_strong_64M_best.jpg)
:image(bench/plot_collide_weak_1M_best_small.jpg,bench/plot_collide_weak_1M_best.jpg)
:image(bench/plot_collide_weak_16M_best_small.jpg,bench/plot_collide_weak_16M_best.jpg)
"Table for strong scaling of 8M particles"_bench/plot_collide_strong_8M_best.html
"Table for strong scaling of 64M particles"_bench/plot_collide_strong_64M_best.html
"Table for weak scaling of 1M particles/node"_bench/plot_collide_weak_1M_best.html
"Table for weak scaling of 16M particles/node"_bench/plot_collide_weak_16M_best.html :ul
:line
[Collide performance details]:
Modes: per-core, per-node, strong scaling, weak scaling
Hardware: CPU, KNL, GPU options
Within plot: accelerator packages, one or multiple GPUs/node :ul
Mode | SPARTA Version | Hardware | Machine | Size | Plot | Table
core | 23Dec17 | SandyBridge | chama | 1K-16K | "plot"_bench/plot_collide_chama_core_CPU.jpg | "table"_bench/plot_collide_chama_core_CPU.html
core | 23Dec17 | Haswell | mutrino | 1K-16K | "plot"_bench/plot_collide_mutrino_core_CPU.jpg | "table"_bench/plot_collide_mutrino_core_CPU.html
core | 23Dec17 | Broadwell | serrano | 1K-16K | "plot"_bench/plot_collide_serrano_core_CPU.jpg | "table"_bench/plot_collide_serrano_core_CPU.html
core | 23Dec17 | KNL | mutrino | 1K-16K | "plot"_bench/plot_collide_mutrino_core_KNL.jpg | "table"_bench/plot_collide_mutrino_core_KNL.html
node | 23Dec17 | SandyBridge | chama | 32K-128M | "plot"_bench/plot_collide_chama_node_CPU.jpg | "table"_bench/plot_collide_chama_node_CPU.html
node | 23Dec17 | Haswell | mutrino | 32K-128M | "plot"_bench/plot_collide_mutrino_node_CPU.jpg | "table"_bench/plot_collide_mutrino_node_CPU.html
node | 23Dec17 | Broadwell | serrano | 32K-128M | "plot"_bench/plot_collide_serrano_node_CPU.jpg | "table"_bench/plot_collide_serrano_node_CPU.html
node | 23Dec17 | KNL | mutrino | 32K-128M | "plot"_bench/plot_collide_mutrino_node_KNL.jpg | "table"_bench/plot_collide_mutrino_node_KNL.html
node | 23Dec17 | K80 | ride80 | 32K-128M | "plot"_bench/plot_collide_ride80_node_GPU.jpg | "table"_bench/plot_collide_ride80_node_GPU.html
node | 23Dec17 | P100 | ride100 | 32K-128M | "plot"_bench/plot_collide_ride100_node_GPU.jpg | "table"_bench/plot_collide_ride100_node_GPU.html
strong | 23Dec17 | SandyBridge | chama | 8M | "plot"_bench/plot_collide_chama_strong_8M_CPU.jpg | "table"_bench/plot_collide_chama_strong_8M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 8M | "plot"_bench/plot_collide_mutrino_strong_8M_CPU.jpg | "table"_bench/plot_collide_mutrino_strong_8M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 8M | "plot"_bench/plot_collide_serrano_strong_8M_CPU.jpg | "table"_bench/plot_collide_serrano_strong_8M_CPU.html
strong | 23Dec17 | KNL | mutrino | 8M | "plot"_bench/plot_collide_mutrino_strong_8M_KNL.jpg | "table"_bench/plot_collide_mutrino_strong_8M_KNL.html
strong | 23Dec17 | K80 | ride80 | 8M | "plot"_bench/plot_collide_ride80_strong_8M_GPU.jpg | "table"_bench/plot_collide_ride80_strong_8M_GPU.html
strong | 23Dec17 | P100 | ride100 | 8M | "plot"_bench/plot_collide_ride100_strong_8M_GPU.jpg | "table"_bench/plot_collide_ride100_strong_8M_GPU.html
strong | 23Dec17 | SandyBridge | chama | 64M | "plot"_bench/plot_collide_chama_strong_64M_CPU.jpg | "table"_bench/plot_collide_chama_strong_64M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 64M | "plot"_bench/plot_collide_mutrino_strong_64M_CPU.jpg | "table"_bench/plot_collide_mutrino_strong_64M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 64M | "plot"_bench/plot_collide_serrano_strong_64M_CPU.jpg | "table"_bench/plot_collide_serrano_strong_64M_CPU.html
strong | 23Dec17 | KNL | mutrino | 64M | "plot"_bench/plot_collide_mutrino_strong_64M_KNL.jpg | "table"_bench/plot_collide_mutrino_strong_64M_KNL.html
strong | 23Dec17 | K80 | ride80 | 64M | "plot"_bench/plot_collide_ride80_strong_64M_GPU.jpg | "table"_bench/plot_collide_ride80_strong_64M_GPU.html
strong | 23Dec17 | P100 | ride100 | 64M | "plot"_bench/plot_collide_ride100_strong_64M_GPU.jpg | "table"_bench/plot_collide_ride100_strong_64M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 1M/node | "plot"_bench/plot_collide_chama_weak_1M_CPU.jpg | "table"_bench/plot_collide_chama_weak_1M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 1M/node | "plot"_bench/plot_collide_mutrino_weak_1M_CPU.jpg | "table"_bench/plot_collide_mutrino_weak_1M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 1M/node | "plot"_bench/plot_collide_serrano_weak_1M_CPU.jpg | "table"_bench/plot_collide_serrano_weak_1M_CPU.html
weak | 23Dec17 | KNL | mutrino | 1M/node | "plot"_bench/plot_collide_mutrino_weak_1M_KNL.jpg | "table"_bench/plot_collide_mutrino_weak_1M_KNL.html
weak | 23Dec17 | K80 | ride80 | 1M/node | "plot"_bench/plot_collide_ride80_weak_1M_GPU.jpg | "table"_bench/plot_collide_ride80_weak_1M_GPU.html
weak | 23Dec17 | P100 | ride100 | 1M/node | "plot"_bench/plot_collide_ride100_weak_1M_GPU.jpg | "table"_bench/plot_collide_ride100_weak_1M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 16M/node | "plot"_bench/plot_collide_chama_weak_16M_CPU.jpg | "table"_bench/plot_collide_chama_weak_16M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 16M/node | "plot"_bench/plot_collide_mutrino_weak_16M_CPU.jpg | "table"_bench/plot_collide_mutrino_weak_16M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 16M/node | "plot"_bench/plot_collide_serrano_weak_16M_CPU.jpg | "table"_bench/plot_collide_serrano_weak_16M_CPU.html
weak | 23Dec17 | KNL | mutrino | 16M/node | "plot"_bench/plot_collide_mutrino_weak_16M_KNL.jpg | "table"_bench/plot_collide_mutrino_weak_16M_KNL.html
weak | 23Dec17 | K80 | ride80 | 16M/node | "plot"_bench/plot_collide_ride80_weak_16M_GPU.jpg | "table"_bench/plot_collide_ride80_weak_16M_GPU.html
weak | 23Dec17 | P100 | ride100 | 16M/node | "plot"_bench/plot_collide_ride100_weak_16M_GPU.jpg | "table"_bench/plot_collide_ride100_weak_16M_GPU.html
:tb(s=|,ea=c)
:line
:line
Sphere benchmark :h4,link(sphere)
"in.sphere"_bench/in.sphere input script
"in.sphere.kokkos_cuda"_bench/in.sphere.gpu.steps variant for Kokkos/Cuda package :ul
This benchmark is for particles flowing around a sphere.
Comments:
In the data below, K = 1000 particles, so 1M = 1024*1000. :ul
:line
[Sphere single core and single node performance:]
Best timings for any accelerator option as a function of problem size.
The first plot is for runs on a single CPU or KNL core; the second for
runs on a single CPU or KNL node or a single GPU. All results are
double precision.
:image(bench/plot_sphere_core_best_small.jpg,bench/plot_sphere_core_best.jpg)
:image(bench/plot_sphere_node_best_small.jpg,bench/plot_sphere_node_best.jpg)
"Table for single core"_bench/plot_sphere_core_best.html
"Table for single node"_bench/plot_sphere_node_best.html :ul
:line
[Sphere strong and weak scaling:]
Fastest timing for any accelerator option running on multiple CPU or
KNL nodes or multiple GPU nodes, as a function of node count. For
strong scaling, 2 problem sizes: 8M particles, 64M particles. For weak
scaling, 2 problem sizes: 1M particles/node, 16M particles/node. Only
a single GPU/node was used; all results are double precision.
Strong scaling means the same size problem is run on successively more
nodes. Weak scaling means the problem size doubles each time the node
count doubles. See a fuller description "here"_#interpret of how to
interpret these plots.
:image(bench/plot_sphere_strong_8M_best_small.jpg,bench/plot_sphere_strong_8M_best.jpg)
:image(bench/plot_sphere_strong_64M_best_small.jpg,bench/plot_sphere_strong_64M_best.jpg)
:image(bench/plot_sphere_weak_1M_best_small.jpg,bench/plot_sphere_weak_1M_best.jpg)
:image(bench/plot_sphere_weak_16M_best_small.jpg,bench/plot_sphere_weak_16M_best.jpg)
"Table for strong scaling of 8M particles"_bench/plot_sphere_strong_8M_best.html
"Table for strong scaling of 64M particles"_bench/plot_sphere_strong_64M_best.html
"Table for weak scaling of 1M particles/node"_bench/plot_sphere_weak_1M_best.html
"Table for weak scaling of 16M particles/node"_bench/plot_sphere_weak_16M_best.html :ul
:line
[Sphere performance details]:
Modes: per-core, per-node, strong scaling, weak scaling
Hardware: CPU, KNL, GPU options
Within plot: accelerator packages, one or multiple GPUs/node :ul
Mode | SPARTA Version | Hardware | Machine | Size | Plot | Table
core | 23Dec17 | SandyBridge | chama | 8K-16K | "plot"_bench/plot_sphere_chama_core_CPU.jpg | "table"_bench/plot_sphere_chama_core_CPU.html
core | 23Dec17 | Haswell | mutrino | 8K-16K | "plot"_bench/plot_sphere_mutrino_core_CPU.jpg | "table"_bench/plot_sphere_mutrino_core_CPU.html
core | 23Dec17 | Broadwell | serrano | 8K-16K | "plot"_bench/plot_sphere_serrano_core_CPU.jpg | "table"_bench/plot_sphere_serrano_core_CPU.html
core | 23Dec17 | KNL | mutrino | 8K-16K | "plot"_bench/plot_sphere_mutrino_core_KNL.jpg | "table"_bench/plot_sphere_mutrino_core_KNL.html
node | 23Dec17 | SandyBridge | chama | 32K-128M | "plot"_bench/plot_sphere_chama_node_CPU.jpg | "table"_bench/plot_sphere_chama_node_CPU.html
node | 23Dec17 | Haswell | mutrino | 32K-128M | "plot"_bench/plot_sphere_mutrino_node_CPU.jpg | "table"_bench/plot_sphere_mutrino_node_CPU.html
node | 23Dec17 | Broadwell | serrano | 32K-128M | "plot"_bench/plot_sphere_serrano_node_CPU.jpg | "table"_bench/plot_sphere_serrano_node_CPU.html
node | 23Dec17 | KNL | mutrino | 32K-128M | "plot"_bench/plot_sphere_mutrino_node_KNL.jpg | "table"_bench/plot_sphere_mutrino_node_KNL.html
node | 23Dec17 | K80 | ride80 | 32K-128M | "plot"_bench/plot_sphere_ride80_node_GPU.jpg | "table"_bench/plot_sphere_ride80_node_GPU.html
node | 23Dec17 | P100 | ride100 | 32K-128M | "plot"_bench/plot_sphere_ride100_node_GPU.jpg | "table"_bench/plot_sphere_ride100_node_GPU.html
strong | 23Dec17 | SandyBridge | chama | 8M | "plot"_bench/plot_sphere_chama_strong_8M_CPU.jpg | "table"_bench/plot_sphere_chama_strong_8M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 8M | "plot"_bench/plot_sphere_mutrino_strong_8M_CPU.jpg | "table"_bench/plot_sphere_mutrino_strong_8M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 8M | "plot"_bench/plot_sphere_serrano_strong_8M_CPU.jpg | "table"_bench/plot_sphere_serrano_strong_8M_CPU.html
strong | 23Dec17 | KNL | mutrino | 8M | "plot"_bench/plot_sphere_mutrino_strong_8M_KNL.jpg | "table"_bench/plot_sphere_mutrino_strong_8M_KNL.html
strong | 23Dec17 | K80 | ride80 | 8M | "plot"_bench/plot_sphere_ride80_strong_8M_GPU.jpg | "table"_bench/plot_sphere_ride80_strong_8M_GPU.html
strong | 23Dec17 | P100 | ride100 | 8M | "plot"_bench/plot_sphere_ride100_strong_8M_GPU.jpg | "table"_bench/plot_sphere_ride100_strong_8M_GPU.html
strong | 23Dec17 | SandyBridge | chama | 64M | "plot"_bench/plot_sphere_chama_strong_64M_CPU.jpg | "table"_bench/plot_sphere_chama_strong_64M_CPU.html
strong | 23Dec17 | Haswell | mutrino | 64M | "plot"_bench/plot_sphere_mutrino_strong_64M_CPU.jpg | "table"_bench/plot_sphere_mutrino_strong_64M_CPU.html
strong | 23Dec17 | Broadwell | serrano | 64M | "plot"_bench/plot_sphere_serrano_strong_64M_CPU.jpg | "table"_bench/plot_sphere_serrano_strong_64M_CPU.html
strong | 23Dec17 | KNL | mutrino | 64M | "plot"_bench/plot_sphere_mutrino_strong_64M_KNL.jpg | "table"_bench/plot_sphere_mutrino_strong_64M_KNL.html
strong | 23Dec17 | K80 | ride80 | 64M | "plot"_bench/plot_sphere_ride80_strong_64M_GPU.jpg | "table"_bench/plot_sphere_ride80_strong_64M_GPU.html
strong | 23Dec17 | P100 | ride100 | 64M | "plot"_bench/plot_sphere_ride100_strong_64M_GPU.jpg | "table"_bench/plot_sphere_ride100_strong_64M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 1M/node | "plot"_bench/plot_sphere_chama_weak_1M_CPU.jpg | "table"_bench/plot_sphere_chama_weak_1M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 1M/node | "plot"_bench/plot_sphere_mutrino_weak_1M_CPU.jpg | "table"_bench/plot_sphere_mutrino_weak_1M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 1M/node | "plot"_bench/plot_sphere_serrano_weak_1M_CPU.jpg | "table"_bench/plot_sphere_serrano_weak_1M_CPU.html
weak | 23Dec17 | KNL | mutrino | 1M/node | "plot"_bench/plot_sphere_mutrino_weak_1M_KNL.jpg | "table"_bench/plot_sphere_mutrino_weak_1M_KNL.html
weak | 23Dec17 | K80 | ride80 | 1M/node | "plot"_bench/plot_sphere_ride80_weak_1M_GPU.jpg | "table"_bench/plot_sphere_ride80_weak_1M_GPU.html
weak | 23Dec17 | P100 | ride100 | 1M/node | "plot"_bench/plot_sphere_ride100_weak_1M_GPU.jpg | "table"_bench/plot_sphere_ride100_weak_1M_GPU.html
weak | 23Dec17 | SandyBridge | chama | 16M/node | "plot"_bench/plot_sphere_chama_weak_16M_CPU.jpg | "table"_bench/plot_sphere_chama_weak_16M_CPU.html
weak | 23Dec17 | Haswell | mutrino | 16M/node | "plot"_bench/plot_sphere_mutrino_weak_16M_CPU.jpg | "table"_bench/plot_sphere_mutrino_weak_16M_CPU.html
weak | 23Dec17 | Broadwell | serrano | 16M/node | "plot"_bench/plot_sphere_serrano_weak_16M_CPU.jpg | "table"_bench/plot_sphere_serrano_weak_16M_CPU.html
weak | 23Dec17 | KNL | mutrino | 16M/node | "plot"_bench/plot_sphere_mutrino_weak_16M_KNL.jpg | "table"_bench/plot_sphere_mutrino_weak_16M_KNL.html
weak | 23Dec17 | K80 | ride80 | 16M/node | "plot"_bench/plot_sphere_ride80_weak_16M_GPU.jpg | "table"_bench/plot_sphere_ride80_weak_16M_GPU.html
weak | 23Dec17 | P100 | ride100 | 16M/node | "plot"_bench/plot_sphere_ride100_weak_16M_GPU.jpg | "table"_bench/plot_sphere_ride100_weak_16M_GPU.html
:tb(s=|,ea=c)
:line
:line
Accelerator options :h4,link(accelerate)
SPARTA has accelerator options implemented via the KOKKOS package; see
"accelerator packages"_doc/Section_accelerate.html. The KOKKOS package
supports multiple hardware options.
For acceleration on a CPU:
CPU = reference implementation, no package, no acceleration
Kokkos/OMP = "Kokkos package"_doc/accelerate_kokkos.html with OMP option via OpenMP
Kokkos/serial = "Kokkos package"_doc/accelerate_kokkos.html with serial option for non-threaded operation on CPUs :ul
For acceleration on an Intel KNL:
CPU/KNL = reference implementation, no package, no acceleration
Kokkos/KNL = "Kokkos package"_doc/accelerate_kokkos.html with KNL option
Kokkos/serial = "Kokkos package"_doc/accelerate_kokkos.html with KNL/serial option :ul
For acceleration on an NVIDIA GPU:
Kokkos/Cuda = "Kokkos package"_doc/accelerate_kokkos.html with CUDA option :ul
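At run time these variants are typically selected with SPARTA's Kokkos
command-line switches. The sketch below assumes LAMMPS-style -k and
-sf switches and uses illustrative task/thread/GPU counts; verify the
exact flags against the accelerator documentation linked above.

```shell
# Reference CPU run: no package, plain MPI (executable names assumed)
mpirun -np 16 spa_cpu -in in.collide

# Kokkos/OMP (assumed flags): 4 MPI tasks x 4 OpenMP threads per node
mpirun -np 4 spa_kokkos_omp -k on t 4 -sf kk -in in.collide

# Kokkos/Cuda (assumed flags): 1 MPI task per GPU, 2 GPUs on the node,
# using the GPU variant of the input script
mpirun -np 2 spa_kokkos_cuda -k on g 2 -sf kk -in in.collide.kokkos_cuda
```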
:line
Machines and node hardware :h4,link(machines)
Benchmarks were run on the following machines and node hardware.
[chama] = Intel SandyBridge CPUs
1232-node cluster
node = dual-socket Sandy Bridge:2S:8C @ 2.6 GHz, 16 cores, no hyperthreading
interconnect = Qlogic Infiniband 4x QDR, fat tree :ul
[mutrino] = Intel Haswell CPUs or Intel KNLs
~100 CPU nodes
node = dual-socket Haswell 2.3 GHz CPU, 32 cores + 2x hyperthreading
~100 KNL nodes
node = Knights Landing processor, 68 cores + 4x hyperthreading
interconnect = Cray Aries Dragonfly :ul
[serrano] = Intel Broadwell CPUs
1122-node cluster
node = dual-socket Broadwell 2.1 GHz CPU E5-2695, 36 cores + 2x hyperthreading
interconnect = Omni-Path :ul
[ride80] = IBM Power8 CPUs with NVIDIA K80 GPUs
~10 nodes
node CPU = dual Power8 3.42 GHz CPU (Firestone), 16 cores + 8x hyperthreading
each node has 2 Tesla K80 GPUs (each K80 is "dual" with 2 internal GPUs)
interconnect = Infiniband :ul
[ride100] = IBM Power8 CPUs with NVIDIA P100 GPUs
~10 nodes
one node = dual Power8 3.42 GHz CPU (Garrison), 16 cores + 8x hyperthreading
each node has 2 Pascal P100 GPUs
interconnect = Infiniband :ul
:line
How to build SPARTA and run the benchmarks :h4,link(howto)
This table shows which accelerator packages were used on which
machines:
Machine | Hardware | CPU | Kokkos/OMP | Kokkos/KNL | Kokkos/Cuda
chama | SandyBridge | yes | yes | no | no
mutrino | Haswell/KNL | yes | yes | yes | no
serrano | Broadwell | yes | yes | no | no
ride80 | K80 | no | no | no | yes
ride100 | P100 | no | no | no | yes :tb(s=|,ea=c)
These are the software environments on each machine and the Makefiles
used to build SPARTA with different accelerator packages.
[chama]
Intel 17.0.2 icc compiler, GNU 4.9.2 g++ compiler, OpenMPI-Intel 2.0
module load intel/17.0.2.174; module load gnu/4.9.2; module load openmpi-intel/2.0
Makefiles: "Makefile.chama_cpu"_bench/Makefile.chama_cpu, "Makefile.chama_kokkos_omp"_bench/Makefile.chama_kokkos_omp, "Makefile.chama_kokkos_serial"_bench/Makefile.chama_kokkos_serial :ul
[mutrino]
Intel 17.0.2 icc compiler, Cray MPICH 7.5.2
module load intel/17.0.2; module load cray-mpich/7.5.2; module load craype-haswell # for Haswell
module load intel/17.0.2; module load cray-mpich/7.5.2; module load craype-mic-knl # for KNL
Makefiles: "Makefile.mutrino_cpu"_bench/Makefile.mutrino_cpu, "Makefile.mutrino_kokkos_omp"_bench/Makefile.mutrino_kokkos_omp, "Makefile.mutrino_kokkos_serial"_bench/Makefile.mutrino_kokkos_serial, "Makefile.mutrino_knl"_bench/Makefile.mutrino_knl, "Makefile.mutrino_kokkos_knl"_bench/Makefile.mutrino_kokkos_knl, "Makefile.mutrino_kokkos_serial_knl"_bench/Makefile.mutrino_kokkos_serial_knl :ul
[serrano]
Intel 17.0.2 icc compiler, GNU 4.9.3 g++ compiler, OpenMPI-Intel 2.0
module load intel/17.0.2.174; module load gcc/4.9.3; module load openmpi-intel/2.0
Makefiles: "Makefile.serrano_cpu"_bench/Makefile.serrano_cpu, "Makefile.serrano_kokkos_omp"_bench/Makefile.serrano_kokkos_omp, "Makefile.serrano_kokkos_serial"_bench/Makefile.serrano_kokkos_serial :ul
[ride80]
GNU 4.9.3 g++ compiler, OpenMPI 1.10.6, Cuda 8.0.44
module load openmpi/1.10.6/gcc/4.9.3/cuda/8.0.44
Makefiles: "Makefile.ride80_kokkos_cuda"_bench/Makefile.ride80_kokkos_cuda :ul
[ride100]
GNU 4.9.3 g++ compiler, OpenMPI 1.10.6, Cuda 8.0.44
module load openmpi/1.10.6/gcc/4.9.3/cuda/8.0.44
Makefiles: "Makefile.ride100_kokkos_cuda"_bench/Makefile.ride100_kokkos_cuda :ul
If a specific benchmark requires a build with additional package(s)
installed, it is noted with the benchmark info below.
With the software environment initialized (e.g. modules loaded) and
the machine Makefiles copied into src/MAKE/MINE, building SPARTA is
straightforward:
cp Makefile.serrano_kokkos_omp sparta/src/MAKE/MINE # for example
cd sparta/src
make yes-kokkos # install accelerator package(s) supported by the Makefile
make serrano_kokkos_omp # target = suffix of Makefile.machine :pre
This should produce an executable named spa_machine,
e.g. spa_serrano_kokkos_omp. If desired, you can copy the executable to a
directory where you run the benchmark.
IMPORTANT NOTE: Achieving best performance for the benchmarks (or your
own input script) on a particular machine with a particular
accelerator option requires attention to the following issues.
mpirun command-line arguments which control how MPI tasks and threads
are assigned to nodes and cores. :ulb,l
SPARTA command-line arguments which invoke a specific accelerator
package and its options. This may include options that are part of
the "package"_doc/package.html command, which can be specified in the
input script, or as below, invoked from the command line. :l
Some of the benchmarks use slightly modified input scripts (indicated
below), depending on which package is used, to boost the performance
of a specific accelerator option. :l
Performance can be a strong function of problem size (see plots
below). In addition, performance of an accelerator package can vary
with MPI tasks/node, MPI tasks/GPU, threads/MPI task, or hardware
threads/core (hyperthreading). In the tables below we show which
choices gave best performance for specific problem sizes. But you may
need to experiment for your simulation or machine. :l,ule
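For example, a Kokkos/OMP run might be launched as sketched below. The
binary name, task/thread counts, binding policy, and input file are
hypothetical placeholders; the exact command used for each data point
is recorded in the tables linked below.

```shell
# Hypothetical sketch: 4 MPI tasks with 9 OpenMP threads each on a
# 36-core Broadwell node. The -k/-sf switches enable the KOKKOS
# package; see the SPARTA command-line-switch docs for details.
mpirun -np 4 --bind-to socket ./spa_serrano_kokkos_omp \
       -k on t 9 -sf kk -in in.sphere
```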
All of the plots below include a link to a table with details on all
of these issues. The table shows the mpirun (or equivalent) command
used to produce each data point on each curve in the plot, the SPARTA
command-line arguments used to get best performance with a particular
package on that hardware, and a link to the logfile produced by the
benchmark run.
:line
How to interpret the plots :h4,link(plots)
All the plots below have particles or nodes on the x-axis, and performance
on the y-axis. On all the plots, better performance is up and worse
performance is down. For all the plots:
Data is normalized so that ideal performance (with respect to particle or
node count) would be a horizontal line. :ulb,l
If a curve trends downward (moving to the right) it means scalability
is falling off. For example, in the strong-scaling plots, this is
typically because the problem size/node is getting smaller as the
number of nodes increases. :l
If a point is missing from a curve, the simulation may have run out of memory or time,
or the number of requested nodes was greater than the number of nodes on the machine. :l
If a curve trends upward, scalability is increasing. For example, in
the per-node plots for GPUs, simulations typically run faster (on a
per-particle basis) as the system size increases. :l,ule
Per-core and per-node plots:
The y-axis is millions of particle-timesteps/sec, running on one core or
an entire node. :ulb,l
To infer timesteps/sec, multiply the y-axis value by 1 million and
divide by the number of particles in the simulation. :l
The inverse of the y-axis value is seconds per million
particle-timesteps. :l
To estimate how long a simulation with N particles for M timesteps will
take in CPU seconds, multiply the inverse by N*M and divide by 1
million. :l,ule
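As a quick sanity check, the per-core/per-node arithmetic above can be
worked through with made-up numbers; the plot value, particle count,
and timestep count below are all hypothetical.

```python
# Hypothetical: read y from a per-node plot, then estimate runtime.
y = 10.0          # plot value: millions of particle-timesteps/sec on one node
N = 16_000_000    # particles in the simulation (hypothetical)
M = 1000          # timesteps to run (hypothetical)

steps_per_sec = y * 1e6 / N           # timesteps/sec
seconds = (1.0 / y) * N * M / 1e6     # inverse of y, times N*M, divided by 1 million

print(steps_per_sec)   # 0.625 timesteps/sec
print(seconds)         # 1600.0 CPU seconds
```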
Strong-scaling and weak-scaling plots:
Strong scaling means a problem of the same size is run on
successively more nodes. :ulb,l
Weak scaling means the problem size is doubled each time the node
count doubles. For example, if the problem size on 1 node is a
million particles, then the problem size on 512 nodes is ~1/2 billion
particles. :l
The y-axis is millions of particle-timesteps/sec/node. :l
To infer timesteps/sec, multiply the y-axis value by 1 million times
the number of nodes and divide by the number of particles in the
simulation. :l
The inverse of the y-axis value is node-seconds per million
particle-timesteps. :l
To estimate how long a simulation with N particles for M timesteps on P
nodes will take in CPU seconds, multiply the inverse by N*M and divide
by 1 million times P. :l,ule
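The same arithmetic for a scaling plot, again with hypothetical
numbers (plot value, node count, particles/node, and timesteps are all
made up for illustration):

```python
# Hypothetical weak-scaling estimate on P nodes.
y = 8.0               # millions of particle-timesteps/sec/node (from a plot)
P = 64                # nodes
N = P * 16_000_000    # weak scaling: 16M particles per node (hypothetical)
M = 500               # timesteps

steps_per_sec = y * 1e6 * P / N           # timesteps/sec for the whole run
seconds = (1.0 / y) * N * M / (1e6 * P)   # inverse times N*M / (1 million * P)

print(steps_per_sec)   # 0.5 timesteps/sec
print(seconds)         # 1000.0 CPU seconds
```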