
Optimizations for 3 channel models #238

Open

jfsantos wants to merge 1 commit into sdatkinson:main from jfsantos:feature/3-channel-opts

Conversation

@jfsantos (Contributor) commented Mar 3, 2026

Following previous updates for specific model sizes, this PR optimizes blocks on the hot path for models with 3 channels in / 3 channels out:

  1. NeuralAmpModelerCore/NAM/dsp.cpp — Conv1x1::process_()
    - Added a 3×1 inline path (out_ch==3, in_ch==1): a broadcast-multiply that eliminates 16 Eigen fallback calls/block (15 input_mixin + 1 rechannel)
    - Added a 1×3 inline path (out_ch==1, in_ch==3) with fused bias: a dot product for head_rechannel that eliminates 1 Eigen call/block
    - Fused bias into the 3×3 GEMM: eliminates the separate bias-addition pass for 15 layer1x1 calls/block
    - Replaced the Eigen fallback with a generic inline triple-loop GEMM: a safety net for any unseen matrix sizes
    - Added a bias_fused flag to skip the redundant bias pass when the bias is already fused
  2. NeuralAmpModelerCore/NAM/conv1d.cpp — Conv1D::Process()
    - Added a fused k=6 3×3 path: reads all 6 input taps and computes the full convolution in one pass per frame. Eliminates setZero + 6 separate ring buffer reads + 6 GEMM accumulations (reduces 7 passes to 1)
  3. NeuralAmpModelerCore/NAM/wavenet.h + wavenet.cpp — _Layer
    - Added a _skip_head_copy flag (set when there is no head1x1 and no gating)
    - GetOutputHead() returns _z directly when the flag is set
    - Skips 15 memcpys of 576 bytes each per block
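To make the Conv1x1 fast paths concrete, here is a minimal sketch of what the 3×1 broadcast-multiply and the 1×3 dot product with fused bias might look like. This is not the actual NAM code: the function names, the row-major weight layout, and the [channel × frame] buffer layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of the two specialized 1x1-conv paths (not the NAM code).
// Buffers are laid out [channel x num_frames], one contiguous row per channel.

// 3x1 path (out_ch == 3, in_ch == 1): broadcast-multiply, no GEMM needed.
void conv1x1_3x1(const float weight[3], const float bias[3],
                 const float* in, float* out, size_t num_frames)
{
  for (size_t t = 0; t < num_frames; t++)
  {
    const float x = in[t];
    out[0 * num_frames + t] = weight[0] * x + bias[0];
    out[1 * num_frames + t] = weight[1] * x + bias[1];
    out[2 * num_frames + t] = weight[2] * x + bias[2];
  }
}

// 1x3 path (out_ch == 1, in_ch == 3): per-frame dot product with fused bias,
// avoiding a separate bias-addition pass.
void conv1x1_1x3(const float weight[3], float bias,
                 const float* in, float* out, size_t num_frames)
{
  for (size_t t = 0; t < num_frames; t++)
  {
    out[t] = weight[0] * in[0 * num_frames + t]
           + weight[1] * in[1 * num_frames + t]
           + weight[2] * in[2 * num_frames + t] + bias;
  }
}
```

The point of both paths is that for such tiny shapes the loop body is a handful of fused multiply-adds, so the per-call overhead of a general GEMM dispatch dominates the actual math.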

Of all these optimizations, the only one that might not be worth keeping is the replacement of the Eigen fallback for every GEMM size. We might want some kind of size threshold above which we still use Eigen. Probably worth some more benchmarking.
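One way such a threshold could look is sketched below. The macro name and the threshold dispatch are hypothetical (not from the PR); the "large" branch here reuses the same loop purely so the sketch is self-contained, whereas the real code would call Eigen's optimized GEMM there.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time threshold: matrices whose channel dimensions are
// at most this size use the inline triple loop; larger ones would fall back
// to Eigen (macro name is illustrative, not from the PR).
#ifndef NAM_INLINE_GEMM_THRESHOLD
#define NAM_INLINE_GEMM_THRESHOLD 8
#endif

// C[out_ch x num_frames] = W[out_ch x in_ch] * X[in_ch x num_frames] + bias,
// all row-major, with the bias fused into the accumulation.
void gemm_with_bias(const float* W, const float* X, const float* bias, float* C,
                    size_t out_ch, size_t in_ch, size_t num_frames)
{
  if (out_ch <= NAM_INLINE_GEMM_THRESHOLD && in_ch <= NAM_INLINE_GEMM_THRESHOLD)
  {
    // Inline triple loop: cheap for tiny matrices (no library dispatch cost).
    for (size_t o = 0; o < out_ch; o++)
      for (size_t t = 0; t < num_frames; t++)
      {
        float acc = bias[o]; // fused bias: no separate addition pass
        for (size_t i = 0; i < in_ch; i++)
          acc += W[o * in_ch + i] * X[i * num_frames + t];
        C[o * num_frames + t] = acc;
      }
  }
  else
  {
    // In the real code this branch would call Eigen's optimized GEMM; the
    // same naive loop stands in for it here so the sketch compiles alone.
    for (size_t o = 0; o < out_ch; o++)
      for (size_t t = 0; t < num_frames; t++)
      {
        float acc = bias[o];
        for (size_t i = 0; i < in_ch; i++)
          acc += W[o * in_ch + i] * X[i * num_frames + t];
        C[o * num_frames + t] = acc;
      }
  }
}
```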

Additionally, this PR removes the prewarm call from create_dsp, since that was not ideal for embedded targets. This gives client code a chance to change MaxBufferSize before calling reset() (which then calls prewarm()).
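The intended call ordering can be illustrated with a minimal mock class (this is not the NAM API; the class and member names are made up to show the flow): since Reset() does size-dependent work and then prewarms, the caller must be able to set the max buffer size first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal mock (not the NAM API) of why prewarm was moved out of
// create_dsp(): the caller can set the max buffer size before Reset(),
// so prewarming runs with the right buffer size instead of a default.
class MockDSP
{
public:
  void SetMaxBufferSize(size_t n) { mMaxBufferSize = n; }
  void Reset()
  {
    mScratch.assign(mMaxBufferSize, 0.0f); // size-dependent allocation
    prewarm();                             // prewarm happens here, not at creation
  }
  bool prewarmed() const { return mPrewarmed; }
  size_t scratchSize() const { return mScratch.size(); }

private:
  void prewarm() { mPrewarmed = true; } // settle initial conditions
  size_t mMaxBufferSize = 4096;         // hard-coded default from the old path
  std::vector<float> mScratch;
  bool mPrewarmed = false;
};
```

Usage follows the ordering the PR enables: construct, then SetMaxBufferSize(), then Reset(); prewarming only happens inside Reset().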

Developed with support and sponsorship from TONE3000


@sdatkinson (Owner) left a comment


[Edit: This was generated by an experimental agent command.]

Reviewed locally.

Verdict: No blocking issues. A few callouts:

  1. create_dsp() / prewarm removal (get_dsp.cpp): Clients that call get_dsp() and process without calling Reset() first will see transient output. Worth documenting that Reset() (or ResetAndPrewarm()) must be called before first process().

  2. Generic inline GEMM (dsp.cpp ~727): As noted in the PR, a size threshold for switching back to Eigen on larger matrices might be worth benchmarking.

  3. film.h: The 3-channel FiLM optimizations are a nice addition; could be mentioned in the PR description for completeness.

Otherwise the optimizations look correct: weight layout and bias fusion logic are consistent with existing paths, and _skip_head_copy correctly handles the no-head1x1 / no-gating case.


@sdatkinson (Owner) left a comment


If you can get that prewarm crit addressed, that's the biggest one for me.

The second is the fallback to normal Eigen. Best would be some profiling to refine the guess, but if you're strapped for time, an "8" plus a comment flagging that it's not optimized is ok in a pinch.

```cpp
{
  // Fall back to Eigen for larger matrices where it's more efficient
  _output.leftCols(num_frames).noalias() = this->_weight * input.leftCols(num_frames);
  // Generic inline GEMM for any matrix size (avoids Eigen overhead for small matrices)
```
@sdatkinson (Owner):

Yeah, if you want, maybe make a configurable threshold for going back to Eigen and default it so that sizes of 8 and up use Eigen? That seems reasonable.

[I also wonder if this is us just re-inventing Eigen ;) ]

```cpp
// "pre-warm" the model to settle initial conditions
// Can this be removed now that it's part of Reset()?
out->prewarm();
// Prewarm is left to the caller so it can call SetMaxBufferSize() first.
```
@sdatkinson (Owner):

[C] For backward compatibility, can you instead make the "4096" configurable at compilation (and keep 4096 the default)?
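A common way to do this is a preprocessor default that a build system can override with -D, keeping 4096 when nothing is specified. The macro name below is hypothetical, not something from the PR or the repo:

```cpp
#include <cassert>

// Hypothetical compile-time knob: keeps 4096 as the default max buffer size
// but lets a build override it, e.g. -DNAM_DEFAULT_MAX_BUFFER_SIZE=256.
#ifndef NAM_DEFAULT_MAX_BUFFER_SIZE
#define NAM_DEFAULT_MAX_BUFFER_SIZE 4096
#endif

constexpr int kDefaultMaxBufferSize = NAM_DEFAULT_MAX_BUFFER_SIZE;
static_assert(kDefaultMaxBufferSize > 0, "buffer size must be positive");
```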

```cpp
  std::memcpy(this->_output_head.data(), this->_z.data(), total * sizeof(float));
}
else
// When _skip_head_copy is true, GetOutputHead() returns _z directly, so no copy needed.
```
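The shape of the _skip_head_copy optimization is the classic "alias instead of copy" pattern. A simplified, self-contained sketch (member names trimmed; not the actual wavenet.h code):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Simplified sketch of the _skip_head_copy idea (not the actual NAM code):
// when the layer applies no head1x1 and no gating, the head output is exactly
// _z, so GetOutputHead() can return a pointer to it instead of memcpy'ing.
struct Layer
{
  bool _skip_head_copy = false;
  std::vector<float> _z;
  std::vector<float> _output_head;

  const float* GetOutputHead()
  {
    if (_skip_head_copy)
      return _z.data(); // alias the internal buffer: zero-copy
    // Otherwise fall back to the copy path.
    _output_head.resize(_z.size());
    std::memcpy(_output_head.data(), _z.data(), _z.size() * sizeof(float));
    return _output_head.data();
  }
};
```

The caveat with aliasing is lifetime: the returned pointer is only valid until _z is next overwritten, which is fine here because the head output is consumed within the same block.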
@sdatkinson (Owner):

Yeah, we can tighten that sort of stuff up...thanks :)
