
Optimizations for 3 channel models #238

Open

jfsantos wants to merge 1 commit into sdatkinson:main from jfsantos:feature/3-channel-opts

Conversation

@jfsantos (Contributor) commented Mar 3, 2026

Following previous updates for specific model sizes, this PR optimizes blocks on the hot path for models with 3 channels in / 3 channels out:

  1. NeuralAmpModelerCore/NAM/dsp.cpp — Conv1x1::process_()
    - Added a 3×1 inline path (out_ch==3, in_ch==1): a broadcast-multiply that eliminates 16 Eigen fallback calls/block (15 input_mixin + 1 rechannel)
    - Added a 1×3 inline path (out_ch==1, in_ch==3) with fused bias: a dot product for head_rechannel that eliminates 1 Eigen call/block
    - Fused bias into the 3×3 GEMM: eliminates the separate bias-addition pass for 15 layer1x1 calls/block
    - Replaced the Eigen fallback with a generic inline triple-loop GEMM: a safety net for any unseen matrix sizes
    - Added a bias_fused flag to skip the redundant bias pass when the bias is already fused
  2. NeuralAmpModelerCore/NAM/conv1d.cpp — Conv1D::Process()
    - Added a fused k=6 3×3 path: reads all 6 input taps and computes the full convolution in one pass per frame. Eliminates setZero + 6 separate ring buffer reads + 6 GEMM accumulations (reduces 7 passes to 1)
  3. NeuralAmpModelerCore/NAM/wavenet.h + wavenet.cpp — _Layer
    - Added a _skip_head_copy flag (set when there is no head1x1 and no gating)
    - GetOutputHead() returns _z directly when the flag is set
    - Skips 15 memcpys of 576 bytes each per block
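To make the Conv1x1 fast paths concrete, here is a minimal sketch of what the 3×1 broadcast-multiply and the 1×3 dot product with fused bias might look like. This is not the actual NAM code: the function names, the row-major weight layout, and the [channel × frame] buffer layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of the two specialized 1x1-conv paths (not the NAM code).
// Buffers are laid out [channel x num_frames], one contiguous row per channel.

// 3x1 path (out_ch == 3, in_ch == 1): broadcast-multiply, no GEMM needed.
void conv1x1_3x1(const float weight[3], const float bias[3],
                 const float* in, float* out, size_t num_frames)
{
  for (size_t t = 0; t < num_frames; t++)
  {
    const float x = in[t];
    out[0 * num_frames + t] = weight[0] * x + bias[0];
    out[1 * num_frames + t] = weight[1] * x + bias[1];
    out[2 * num_frames + t] = weight[2] * x + bias[2];
  }
}

// 1x3 path (out_ch == 1, in_ch == 3): per-frame dot product with fused bias,
// avoiding a separate bias-addition pass.
void conv1x1_1x3(const float weight[3], float bias,
                 const float* in, float* out, size_t num_frames)
{
  for (size_t t = 0; t < num_frames; t++)
  {
    out[t] = weight[0] * in[0 * num_frames + t]
           + weight[1] * in[1 * num_frames + t]
           + weight[2] * in[2 * num_frames + t] + bias;
  }
}
```

The point of both paths is that for such tiny shapes the loop body is a handful of fused multiply-adds, so the per-call overhead of a general GEMM dispatch dominates the actual math.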

Of all these optimizations, the only one that might not be worth keeping is the replacement of the Eigen fallback for every GEMM size. We might want some kind of size threshold above which we still use Eigen. Probably worth some more benchmarking.
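One way such a threshold could look is sketched below. The macro name and the threshold dispatch are hypothetical (not from the PR); the "large" branch here reuses the same loop purely so the sketch is self-contained, whereas the real code would call Eigen's optimized GEMM there.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time threshold: matrices whose channel dimensions are
// at most this size use the inline triple loop; larger ones would fall back
// to Eigen (macro name is illustrative, not from the PR).
#ifndef NAM_INLINE_GEMM_THRESHOLD
#define NAM_INLINE_GEMM_THRESHOLD 8
#endif

// C[out_ch x num_frames] = W[out_ch x in_ch] * X[in_ch x num_frames] + bias,
// all row-major, with the bias fused into the accumulation.
void gemm_with_bias(const float* W, const float* X, const float* bias, float* C,
                    size_t out_ch, size_t in_ch, size_t num_frames)
{
  if (out_ch <= NAM_INLINE_GEMM_THRESHOLD && in_ch <= NAM_INLINE_GEMM_THRESHOLD)
  {
    // Inline triple loop: cheap for tiny matrices (no library dispatch cost).
    for (size_t o = 0; o < out_ch; o++)
      for (size_t t = 0; t < num_frames; t++)
      {
        float acc = bias[o]; // fused bias: no separate addition pass
        for (size_t i = 0; i < in_ch; i++)
          acc += W[o * in_ch + i] * X[i * num_frames + t];
        C[o * num_frames + t] = acc;
      }
  }
  else
  {
    // In the real code this branch would call Eigen's optimized GEMM; the
    // same naive loop stands in for it here so the sketch compiles alone.
    for (size_t o = 0; o < out_ch; o++)
      for (size_t t = 0; t < num_frames; t++)
      {
        float acc = bias[o];
        for (size_t i = 0; i < in_ch; i++)
          acc += W[o * in_ch + i] * X[i * num_frames + t];
        C[o * num_frames + t] = acc;
      }
  }
}
```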

Additionally, this PR removes the prewarm call from create_dsp, since that was not ideal for embedded targets. This gives client code a chance to change MaxBufferSize before calling reset() (which then calls prewarm()).
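The intended call ordering can be illustrated with a minimal mock class (this is not the NAM API; the class and member names are made up to show the flow): since Reset() does size-dependent work and then prewarms, the caller must be able to set the max buffer size first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal mock (not the NAM API) of why prewarm was moved out of
// create_dsp(): the caller can set the max buffer size before Reset(),
// so prewarming runs with the right buffer size instead of a default.
class MockDSP
{
public:
  void SetMaxBufferSize(size_t n) { mMaxBufferSize = n; }
  void Reset()
  {
    mScratch.assign(mMaxBufferSize, 0.0f); // size-dependent allocation
    prewarm();                             // prewarm happens here, not at creation
  }
  bool prewarmed() const { return mPrewarmed; }
  size_t scratchSize() const { return mScratch.size(); }

private:
  void prewarm() { mPrewarmed = true; } // settle initial conditions
  size_t mMaxBufferSize = 4096;         // hard-coded default from the old path
  std::vector<float> mScratch;
  bool mPrewarmed = false;
};
```

Usage follows the ordering the PR enables: construct, then SetMaxBufferSize(), then Reset(); prewarming only happens inside Reset().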

Developed with support and sponsorship from TONE3000


@sdatkinson (Owner) left a comment


[Edit: This was generated by an experimental agent command.]

Reviewed locally.

Verdict: No blocking issues. A few callouts:

  1. create_dsp() / prewarm removal (get_dsp.cpp): Clients that call get_dsp() and process without calling Reset() first will see transient output. Worth documenting that Reset() (or ResetAndPrewarm()) must be called before first process().

  2. Generic inline GEMM (dsp.cpp ~727): As noted in the PR, a size threshold for switching back to Eigen on larger matrices might be worth benchmarking.

  3. film.h: The 3-channel FiLM optimizations are a nice addition; could be mentioned in the PR description for completeness.

Otherwise the optimizations look correct: weight layout and bias fusion logic are consistent with existing paths, and _skip_head_copy correctly handles the no-head1x1 / no-gating case.


@sdatkinson (Owner) left a comment


If you can get that prewarm crit addressed, that's the biggest one for me.

The second is the fallback to normal Eigen. Best would be some profiling to refine the guess, but if you're strapped for time, an "8" plus a comment flagging that it's not optimized is ok in a pinch.

```cpp
{
  // Fall back to Eigen for larger matrices where it's more efficient
  _output.leftCols(num_frames).noalias() = this->_weight * input.leftCols(num_frames);
  // Generic inline GEMM for any matrix size (avoids Eigen overhead for small matrices)
```
@sdatkinson (Owner):

Yeah, if you want, maybe make a configurable threshold for going back to Eigen and default it so that sizes of 8 and up use Eigen? That seems reasonable.

[I also wonder if this is us just re-inventing Eigen ;) ]

```cpp
// "pre-warm" the model to settle initial conditions
// Can this be removed now that it's part of Reset()?
out->prewarm();
// Prewarm is left to the caller so it can call SetMaxBufferSize() first.
```
@sdatkinson (Owner):

[C] For backward compatibility, can you instead make the "4096" configurable at compilation (and keep 4096 the default)?
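A common way to do this is a preprocessor default that a build system can override with -D, keeping 4096 when nothing is specified. The macro name below is hypothetical, not something from the PR or the repo:

```cpp
#include <cassert>

// Hypothetical compile-time knob: keeps 4096 as the default max buffer size
// but lets a build override it, e.g. -DNAM_DEFAULT_MAX_BUFFER_SIZE=256.
#ifndef NAM_DEFAULT_MAX_BUFFER_SIZE
#define NAM_DEFAULT_MAX_BUFFER_SIZE 4096
#endif

constexpr int kDefaultMaxBufferSize = NAM_DEFAULT_MAX_BUFFER_SIZE;
static_assert(kDefaultMaxBufferSize > 0, "buffer size must be positive");
```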

```cpp
  std::memcpy(this->_output_head.data(), this->_z.data(), total * sizeof(float));
}
else
// When _skip_head_copy is true, GetOutputHead() returns _z directly, so no copy needed.
```
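The shape of the _skip_head_copy optimization is the classic "alias instead of copy" pattern. A simplified, self-contained sketch (member names trimmed; not the actual wavenet.h code):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Simplified sketch of the _skip_head_copy idea (not the actual NAM code):
// when the layer applies no head1x1 and no gating, the head output is exactly
// _z, so GetOutputHead() can return a pointer to it instead of memcpy'ing.
struct Layer
{
  bool _skip_head_copy = false;
  std::vector<float> _z;
  std::vector<float> _output_head;

  const float* GetOutputHead()
  {
    if (_skip_head_copy)
      return _z.data(); // alias the internal buffer: zero-copy
    // Otherwise fall back to the copy path.
    _output_head.resize(_z.size());
    std::memcpy(_output_head.data(), _z.data(), _z.size() * sizeof(float));
    return _output_head.data();
  }
};
```

The caveat with aliasing is lifetime: the returned pointer is only valid until _z is next overwritten, which is fine here because the head output is consumed within the same block.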
@sdatkinson (Owner):

Yeah, we can tighten that sort of stuff up...thanks :)
