The Waterfall Arithmetic Unit (WAU) is a configurable arithmetic compute fabric for FPGAs: a 2D grid of small ALU cores wired together by a packet-switched mesh, designed to stream pipelines of math operations (add, multiply, max, FMA, ...) from a host program. Think of it as a tiny, generator-driven dataflow accelerator you can drop onto a real board.
This repository is the toolchain that builds one. You describe your kernel in a high-level form — an arithmetic expression, a constrained pseudo-C snippet, or a .cw program — and the Python generator emits the full Verilog (cores, mesh, coordinator, host MMIO), a compiled schedule, and a software reference model used as a correctness oracle. No hand-written RTL, no separate compiler stack.
It is silicon-verified: the same flow has been taken end-to-end onto a Terasic DE0-Nano (Intel Cyclone IV E), where 795/795 random and corner-case operand pairs round-tripped through the live mesh and matched the software reference (see DE0-Nano demo).
Typical uses: experimenting with small FPGA-side math accelerators, teaching dataflow / NoC concepts on real silicon, or as a reusable reference for the "high-level kernel → generated RTL → working bitstream" path.
This repository now contains a working foundation for:
- device-aware WAU configuration (real FPGA presets included),
- flow compilation (flow stages -> core assignments with fallback cores),
- DAG/node-based flow compilation with explicit 2D placement directives,
- per-core capability constraints (operations and data types), with capability-aware CW lowering that prunes incompatible candidate cores before validation,
- multi-program scheduling with async dependency-aware execution and recurrence support,
- offline scheduling (cycle timeline + encoded schedule words),
- routing-aware (locality-weighted) core selection via scheduler.locality_bias (default off): biases candidate cores toward their dependencies' placed cores to cut transfer hops without inflating makespan/latency,
- constrained pseudo-C accumulator frontend (compile-pseudoc) and kernel-style .cw frontend (compile-cw) in addition to expression compilation,
- a real .cw language front-end (cw-lint/cw-eval: lexer → AST → host-side interpreter) with classes and magic methods for compile-time type handling: operator overloading and type-conversion hooks (__to_float__/__to_int__/__convert__) the compiler can invoke to bridge precisions dynamically,
- CW software reference model + benchmark value scoreboard (scoreboard_pass_ratio gate on top of latency/makespan),
- Verilog emission for a multi-issue coordinator (keeps up to coordinator.max_in_flight distinct flows executing concurrently across the core mesh, so independent flows actually overlap on different cores at runtime), core/station, ALU, explicit highway routers/links, top-level grid, and a memory-mapped host control/status register file (wau_host_mmio),
- reusable generated-project assembly through thirds/veribuilder, an externalizable Python package for parameterized Verilog project manifests, feature-gated files, simple templates, headers, and deterministic file emission,
- configurable station cache size and replacement policy (FIFO/LRU) via compiler.station_cache,
- runtime observability counters for highway hops/stalls/forwards/local-deliveries and per-core cache hit/lookup rate, aggregated at top-level and exposed via MMIO,
- CI matrix (python tests + randomized stress + iverilog tests + autotuned CW benchmark) with artifact archival.
From repository root:
PYTHONPATH=src/python python3 -m waugen validate --config src/python/configs/wau_de0_nano_demo.json
PYTHONPATH=src/python python3 -m waugen generate --config src/python/configs/wau_de0_nano_demo.json --out src/verilog/generated --summary
Advanced 2D multi-program example (DAG + recurrence + load-balancing directives):
PYTHONPATH=src/python python3 -m waugen validate --config src/python/configs/wau_2d_multiprogram_demo.json
PYTHONPATH=src/python python3 -m waugen generate --config src/python/configs/wau_2d_multiprogram_demo.json --out src/verilog/generated_2d --summary
Compile a basic high-level expression into a new flow and merge it into a config:
PYTHONPATH=src/python python3 -m waugen compile-expr \
--expr '((a + b) * 3) - b' \
--flow-id 30 \
--name expr_compiled_flow \
--entry 1,0 \
--base-config src/python/configs/wau_de0_nano_demo.json \
--out-config src/python/configs/wau_de0_nano_compiled_expr.json
Compile a constrained pseudo-C pipeline program into a new flow:
PYTHONPATH=src/python python3 -m waugen compile-pseudoc \
--program 'acc = a; acc = acc + b; acc = acc * 3; acc -= b;' \
--flow-id 31 \
--name pseudoc_flow \
--entry 1,1 \
--base-config src/python/configs/wau_de0_nano_demo.json \
--out-config src/python/configs/wau_de0_nano_compiled_pseudoc.json
Compile an advanced WAU kernel-style .cw program into a DAG flow and execution program:
PYTHONPATH=src/python python3 -m waugen compile-cw \
--program-file docs/example-program.cw \
--flow-id 90 \
--name cw_conv2d_residual_reference \
--entry 0,0 \
--max-in-flight 2 \
--lane-parallelism 2 \
--placement-policy balance \
--lowering-profile throughput_optimized \
--base-config src/python/configs/wau_2d_multiprogram_demo.json \
--out-config src/python/configs/wau_example_pogram_compiled.json \
--replace-existing \
--program-id 90 \
--program-name cw_reference_program \
--program-priority 4 \
--program-replicas 2 \
--program-max-parallel-flows 1 \
--program-load-balance least_busy
Execute a .cw program on the host (real parser + interpreter), including
classes with magic methods used for compile-time type conversion. This path is
separate from compile-cw (it does not lower to RTL); it is for compiler-side
behaviour that should not run on the WAU, such as custom numeric formats and
their conversions:
# Run main() and print its output + return value.
PYTHONPATH=src/python python3 -m waugen cw-eval \
--program-file docs/samples/types/fixed_point.cw
# Ask the compiler to convert an expression to a dtype via the class's
# conversion magic methods (__convert__ / __to_float__ / __to_int__).
PYTHONPATH=src/python python3 -m waugen cw-eval \
--program-file docs/samples/types/fixed_point.cw \
--convert 'new q8_8(384)' float32 # -> 1.5
Validate .cw syntax and @wau pragmas without lowering:
PYTHONPATH=src/python python3 -m waugen cw-lint \
--program-file docs/samples/types/fixed_point.cw
# Add the current compile-cw template check for RTL-lowered kernels.
PYTHONPATH=src/python python3 -m waugen cw-lint \
--program-file docs/example-program.cw \
--compile-template
.cw classes (declared with class or the legacy space keyword) support
Python-style magic methods: __init__, the arithmetic/comparison operators
(__add__, __sub__, __mul__, __div__, __mod__, __eq__, __lt__, …),
__neg__, conversion hooks (__to_int__, __to_float__, and the generic
__convert__(target_dtype)), and __str__. a + b on a class instance calls
a.__add__(b); a builtin cast like float32(x) on an instance dispatches to its
conversion hook, and the same dispatch is exposed to the toolchain through
waugen.cw_lang.Interpreter.convert(value, dtype).
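As a rough Python analogue of the dispatch described above (this is neither .cw syntax nor the waugen.cw_lang implementation), the sketch below models a Q8.8 fixed-point class with conversion hooks. The class name Q8_8, the free-standing convert helper, and the order in which hooks are tried are assumptions made for illustration.

```python
class Q8_8:
    """Hypothetical Q8.8 fixed-point value: 8 integer bits, 8 fractional bits."""

    SCALE = 1 << 8  # 256 raw steps per 1.0

    def __init__(self, raw):
        self.raw = raw

    def __add__(self, other):
        # a + b on two instances dispatches here, mirroring operator overloading.
        return Q8_8(self.raw + other.raw)

    def __to_float__(self):
        return self.raw / self.SCALE

    def __to_int__(self):
        return self.raw // self.SCALE


def convert(value, dtype):
    """Assumed dispatch order: dtype-specific hook first, then generic __convert__."""
    hook_name = {"float32": "__to_float__", "int32": "__to_int__"}.get(dtype)
    if hook_name is not None and hasattr(value, hook_name):
        return getattr(value, hook_name)()
    if hasattr(value, "__convert__"):
        return value.__convert__(dtype)
    raise TypeError(f"no conversion from {type(value).__name__} to {dtype}")


print(convert(Q8_8(384), "float32"))  # 384 / 256 = 1.5
```

The printed value matches the 1.5 shown for the cw-eval --convert example, since 384 raw steps divided by 256 is 1.5.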
The accepted host-side .cw grammar, pragma contract, and the narrower
compile-cw RTL-template requirements are documented in
docs/cw-language.md.
Rank synthesis-time architecture candidates for a workload config (core disposition/grid shape, heavy-op specialization via core capabilities, on-chip memory split, and external-DRAM reliance):
PYTHONPATH=src/python python3 -m waugen arch-search \
--config src/python/configs/wau_example_pogram_compiled.json \
--out-report .build/arch_search/report.json \
--out-summary .build/arch_search/summary.txt \
--top 10
Every candidate runs through the real compile_project -> build_schedule
pipeline, so makespan/transfer-hop/fallback numbers are the generator's own;
area/BRAM/DSP figures come from the versioned wau_resource_model_v1
estimator checked against the device preset's datasheet capacity, and DRAM
traffic from dram_model_v1. Ranking is arch_search_rank_v1: feasible
first, then lower makespan, transfer hops, DRAM bytes, peak utilization.
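The ranking order above maps naturally onto a lexicographic sort key. The following is a minimal sketch of that idea, not the arch_search_rank_v1 code itself: the Candidate fields, candidate names, and all numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Field names are assumptions; the real report schema may differ.
    name: str
    feasible: bool
    makespan_cycles: int
    transfer_hops: int
    dram_bytes: int
    peak_utilization: float


def rank_key(c):
    # False sorts before True, so `not c.feasible` puts feasible candidates first.
    return (not c.feasible, c.makespan_cycles, c.transfer_hops,
            c.dram_bytes, c.peak_utilization)


# Illustrative values only, not tool output.
candidates = [
    Candidate("3x2_overbudget", False, 30, 80, 0, 1.20),
    Candidate("2x2_balanced", True, 42, 104, 0, 0.37),
    Candidate("4x1_line", True, 42, 120, 0, 0.35),
    Candidate("2x2_dram", True, 40, 110, 4096, 0.30),
]

ranked = sorted(candidates, key=rank_key)
print([c.name for c in ranked])
```

Note that the infeasible candidate ranks last even though it has the lowest makespan, and that the two candidates tied on makespan are separated by transfer hops.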
Run all RTL test cases with iverilog (generation + compile + simulation):
./scripts/run_iverilog_tests.sh
This runs:
- tests/rtl/tb_wau_operation_alu.v (ALU opcode behavior),
- tests/rtl/tb_wau_top_demo.v (end-to-end flow execution via coordinator/highway/core grid),
- tests/rtl/tb_wau_highway_mesh.v (neighbor forwarding, backpressure, and router_hop_count advancement),
- tests/rtl/tb_wau_host_mmio.v (MMIO register map: writes, reads, output_pending sticky semantics, observability counter readback).
Run the python unit-test suite (compiler/scheduler/CW frontends/program stress matrix/CW reference scoreboard):
PYTHONPATH=src/python python3 -m unittest discover -s tests/python -p "test_*.py" -v
Run randomized multi-flow scheduler stress (also sweeps compiler.station_cache.{entries, replacement_policy}) and emit a coverage-style summary:
PYTHONPATH=src/python python3 scripts/run_randomized_stress.py --start-seed 2000 --count 25 --report .build/randomized_stress_report.json
Run fast end-to-end compile/validate/generate/RTL checks for the .cw reference and write a benchmark snapshot:
./scripts/run_cw_example_benchmark.sh
This script updates benchmarks/example_pogram_benchmark.txt as the persistent latest-reference log, including:
- compile/validate/generate timing,
- schedule metrics,
- effective CW execution stress-benchmark latency/results from generated RTL simulation,
- per-case expected_value and scoreboard=match|... lines plus aggregate scoreboard_total, scoreboard_matches, scoreboard_pass_ratio,
- stress latency percentiles (p50, p95),
- placement-quality metrics (fallback_instruction_ratio, per-flow fallback ratio, true dependency-edge estimated transfer hops, critical-path tail),
- bottleneck summaries (busiest_core, core hotspots, node latency hotspots, dependency hotspots),
- reproducibility profile metadata and benchmark ranking score.
The testbench $fatals on any value mismatch against the software reference in
waugen.cw_reference, so the scoreboard is a hard correctness gate on top of
the latency/makespan targets. The reference is also exposed as
.build/cw_iverilog/cw_scoreboard.json for downstream tooling.
Latest tuned result as of 2026-06-14 UTC (retaining the staged autotune winner and re-validating deterministic scheduling, capability-aware CW lowering, configurable station cache, and the value scoreboard):
- selected tuning point: lane=2, placement=balance, profile=throughput_optimized, priority=4, replicas=2, max_parallel=1, max_in_flight=2, load_balance=least_busy, scheduler_policy=weighted_fair
- exec_latency_cycles_avg=68.00, exec_latency_cycles_p95=70.00, makespan_cycles=42
- fallback_instruction_ratio=0.3043 (21/69) and dependency_edges_v1=104 hops across 105 true data-dependency edges
- 3-run stability check: median=68.00, p95=68.00, 3/3 passing
- scoreboard: 8/8 deterministic cases match the software reference (scoreboard_pass_ratio=1.0)
Run autotune sweep to search best score (lowest exec_latency_cycles_avg, then makespan_cycles, then total_ms):
TUNE_MODE=1 ./scripts/run_cw_example_benchmark.sh
Autotune writes:
- best/latest benchmark log: benchmarks/example_pogram_benchmark.txt
- full sweep summary: benchmarks/example_pogram_tuning_latest.txt
- JSON sidecars:
  - benchmarks/example_pogram_benchmark_latest.json
  - benchmarks/example_pogram_benchmark_best.json
  - benchmarks/example_pogram_benchmark_history.json
Default autotune now uses a staged coordinate search rather than one flat exhaustive grid:
- topology stage: lane_parallelism, placement_policy, lowering_profile
- program stage: replicas, max_parallel_flows, priority, max_in_flight
- scheduler stage: load_balance, scheduler.program_policy
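To illustrate why a staged coordinate search is cheaper than a flat grid, here is a small sketch under stated assumptions: the knob value lists are examples, toy_score is a made-up stand-in for a real benchmark run, and the function names are hypothetical rather than taken from the benchmark script.

```python
import itertools


def staged_search(stages, defaults, score):
    """Tune one stage at a time, freezing each stage's winner before the next."""
    best = dict(defaults)
    evaluations = 0
    for knobs in stages:
        names = list(knobs)
        stage_best, stage_score = best, None
        for values in itertools.product(*(knobs[n] for n in names)):
            trial = {**best, **dict(zip(names, values))}
            trial_score = score(trial)
            evaluations += 1
            if stage_score is None or trial_score < stage_score:
                stage_best, stage_score = trial, trial_score
        best = stage_best
    return best, evaluations


STAGES = [
    {"lane_parallelism": [1, 2, 4],
     "placement_policy": ["locality", "balance"],
     "lowering_profile": ["reference", "latency_optimized", "throughput_optimized"]},
    {"replicas": [1, 2], "max_parallel_flows": [1, 2],
     "priority": [1, 4], "max_in_flight": [1, 2, 4]},
    {"load_balance": ["least_busy"],
     "program_policy": ["weighted_fair", "strict_priority", "round_robin"]},
]
DEFAULTS = {name: values[0] for knobs in STAGES for name, values in knobs.items()}


def toy_score(cfg):
    # Made-up cost model; tuples compare lexicographically (latency, then makespan).
    latency = 80 - 4 * cfg["lane_parallelism"] + 2 * abs(cfg["replicas"] - 2)
    latency += 0 if cfg["placement_policy"] == "balance" else 3
    return (latency, 40 + cfg["max_in_flight"])


best, evaluations = staged_search(STAGES, DEFAULTS, toy_score)
flat = 1
for knobs in STAGES:
    for values in knobs.values():
        flat *= len(values)
print(best, evaluations, flat)
```

With these example value lists the staged search evaluates 18 + 24 + 3 = 45 points, where one flat grid would need 1296. The trade-off is that interactions between knobs in different stages are only explored around the earlier stages' winners.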
Replay saved autotune candidates without rerunning the full sweep:
REPLAY_MODE=best-and-stage-winners ./scripts/run_cw_example_benchmark.sh
Supported modes are best, stage-winners, best-and-stage-winners, and
worst. REPLAY_SUMMARY_FILE selects the source summary; replay uses isolated
configs/build directories and writes
benchmarks/example_pogram_replay_latest.txt without replacing the canonical
benchmark sidecars. The report compares saved and current latency/makespan and
labels both hop-metric versions so historical proxy values are not treated as
directly comparable to dependency_edges_v1.
Run stability mode with repeated samples (median and p95 latency summary):
MULTI_RUNS=5 ./scripts/run_cw_example_benchmark.sh
This writes:
- benchmarks/example_pogram_benchmark.txt (best sample with appended stability section),
- benchmarks/example_pogram_multirun_latest.txt (full multi-run summary).
Run regression-guard mode against the best sidecar baseline:
REGRESSION_CHECK=1 ./scripts/run_cw_example_benchmark.sh
Useful guardrail knobs:
- REGRESSION_MAX_LATENCY_DELTA (default 0.00)
- REGRESSION_MAX_MAKESPAN_DELTA (default 0)
- REGRESSION_MAX_TOTAL_MS_DELTA (default 250)
- REGRESSION_BASELINE_JSON (default benchmarks/example_pogram_benchmark_best.json)
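One plausible reading of these knobs is "fail when the current run is worse than the baseline by more than the allowed delta". The sketch below encodes that reading; the metric key names, the comparison semantics, and the check_regression helper are assumptions, not the script's actual logic.

```python
import os


def check_regression(current, baseline, env=os.environ):
    """Return a list of human-readable failures (empty list means pass)."""
    limits = {
        # Defaults mirror the documented knob defaults.
        "exec_latency_cycles_avg": float(env.get("REGRESSION_MAX_LATENCY_DELTA", "0.00")),
        "makespan_cycles": float(env.get("REGRESSION_MAX_MAKESPAN_DELTA", "0")),
        "total_ms": float(env.get("REGRESSION_MAX_TOTAL_MS_DELTA", "250")),
    }
    failures = []
    for metric, max_delta in limits.items():
        delta = current[metric] - baseline[metric]
        if delta > max_delta:
            failures.append(f"{metric}: +{delta:g} exceeds allowed +{max_delta:g}")
    return failures


# Illustrative numbers only.
baseline = {"exec_latency_cycles_avg": 68.0, "makespan_cycles": 42, "total_ms": 1000}
current = {"exec_latency_cycles_avg": 68.0, "makespan_cycles": 43, "total_ms": 1100}
print(check_regression(current, baseline, env={}))
```

Here only makespan_cycles fails: latency is unchanged, and the 100 ms wall-clock increase stays within the 250 ms default allowance.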
Manual tuning knobs are available as environment variables:
- CW_LANE_PARALLELISM (example: 4)
- CW_PLACEMENT_POLICY (locality or balance)
- CW_LOWERING_PROFILE (reference, latency_optimized, throughput_optimized)
- PROGRAM_REPLICAS and PROGRAM_MAX_PARALLEL
- PROGRAM_PRIORITY and PROGRAM_LOAD_BALANCE
- SCHEDULER_PROGRAM_POLICY
- CW_MAX_IN_FLIGHT
- CW_DTYPE
- RUN_PROFILE (tag run intent in benchmark metadata)
Optional direct syntax check of generated RTL:
iverilog -g2005-sv -I src/verilog/generated -o /tmp/wau_sim \
src/verilog/generated/wau_operation_alu.v \
src/verilog/generated/wau_core_station.v \
src/verilog/generated/wau_core.v \
src/verilog/generated/wau_coordinator.v \
src/verilog/generated/wau_top.v
- src/python/waugen/: generator package
  - config.py: JSON schema parsing + validation (includes compiler.station_cache and compiler.core_capabilities)
  - device_library.py: real device presets
  - operation_library.py: built-in operation templates
  - basic_compiler.py: basic high-level expression compiler to WAU flow stages
  - cw_compiler.py: .cw kernel-style lowering with capability-aware candidate pruning
  - benchmark_replay.py: saved autotune summary parser and replay-plan selection
  - cw_reference.py: software reference model for CW flows (drives the value scoreboard)
  - compiler.py: flow-to-core compilation with adaptive fallbacks
  - scheduler.py: offline schedule timeline + 64-bit word encoding
  - verilog_emit.py: WAU-specific RTL + report renderers (router/cache observability counters and the wau_host_mmio register file live here); generated-project assembly is delegated to thirds/veribuilder
  - cli.py: CLI entrypoint
- thirds/veribuilder/: standalone-ready Python package for dynamic Verilog project construction
  - src/veribuilder/core.py: VerilogProject, GeneratedFile, VerilogHeader, and TemplateRenderer
  - pyproject.toml: package metadata for publishing or installing separately
- src/python/configs/wau_de0_nano_demo.json: example configuration
- src/python/configs/wau_de0_nano_compiled_expr.json: example output of compile-expr
- src/python/configs/wau_de0_nano_compiled_pseudoc.json: example output of compile-pseudoc
- src/python/configs/wau_example_pogram_compiled.json: example output of compile-cw
- src/python/configs/wau_2d_multiprogram_demo.json: advanced DAG + multi-program example
- src/verilog/generated/: generated output artifacts
- tests/rtl/: SystemVerilog/Verilog testbenches (ALU, top demo, highway mesh + hop counters, MMIO register file)
- tests/python/: Python unit tests for compiler helpers, CW reference scoreboard, and program-level priority/replicas/policy stress matrix
- scripts/run_randomized_stress.py: randomized multi-flow stress (CI input)
- scripts/run_iverilog_tests.sh: iverilog test runner
- scripts/run_cw_example_benchmark.sh: CW kernel benchmark, autotune, saved-candidate replay, multi-run stability, regression check
- .github/workflows/ci.yml: CI matrix (python tests, randomized stress, iverilog tests, autotuned CW benchmark) with artifact uploads
- benchmarks/example_pogram_benchmark.txt: tracked benchmark/reference metrics for .cw flow compilation
- benchmarks/example_pogram_tuning_latest.txt: latest autotune sweep summary
- benchmarks/example_pogram_replay_latest.txt: latest saved-candidate replay comparison
- benchmarks/example_pogram_multirun_latest.txt: latest multi-run stability summary
- benchmarks/example_pogram_benchmark_latest.json: machine-readable latest benchmark snapshot
- benchmarks/example_pogram_benchmark_best.json: machine-readable best-known benchmark snapshot
- benchmarks/example_pogram_benchmark_history.json: benchmark history for trend checks
- benchmarks/de0_nano_basic_benchmark.txt: silicon-verified reference run on the DE0-Nano (resource fit, per-corner Fmax, 795/795 scoreboard pass, live observability counters)
- demo/de0-nano/basic-example/: end-to-end physical deployment (Quartus 25.1 project + reusable vJTAG MMIO bridge RTL + reusable Python/TCL host stack + automation scripts); produces the artifact above
A generate run emits:
- wau_defs.vh: project/device/operation constants (now also WAU_STATION_CACHE_ENTRIES and WAU_STATION_CACHE_POLICY_{FIFO,LRU})
- wau_operation_alu.v: arithmetic opcode execution unit
- wau_neighbor_forward.v: directional valid/ready packet forwarding link
- wau_highway_router.v: per-core XY router with local/neighbor arbitration, plus 32-bit hop_count/stall_count/local_delivered_count/forward_count observability counters
- wau_highway_mesh.v: generated 2D router mesh interconnect, exposing per-router counter buses
- wau_core_station.v: per-core station (dispatch, latency control, configurable FIFO/LRU multi-entry input/result cache, cache_hit_count/cache_lookup_count)
- wau_core.v: core wrapper
- wau_coordinator.v: flow orchestrator with runtime adaptive fallback selection and packetized dispatch/result channels
- wau_host_mmio.v: 32-bit memory-mapped host control/status register file with observability counter readback
- <output_module_name>.v (demo: wau_top.v): top-level 2D core grid, exporting obs_total_hop_count/stall_count/forward_count/local_delivered_count/cache_hit_count/cache_lookup_count
- wau_de0_nano_top.v (for DE0-NANO preset): board wrapper that instantiates wau_host_mmio for external Avalon-MM-style hosts and emulates writes from KEY[1]/SW[3:0] for stand-alone demos
- wau_program.json: compiled flow program
- wau_schedule.json: human-readable schedule timeline
- wau_schedule.hex: encoded 64-bit schedule words
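To make the FIFO versus LRU choice in the station cache concrete, here is a small behavioural model in Python. It is a sketch of the two replacement policies and the hit/lookup counters only; it does not mirror the generated RTL, and the StationCache class and its key format are assumptions.

```python
from collections import OrderedDict


class StationCache:
    """Behavioural model: bounded cache with FIFO or LRU replacement."""

    def __init__(self, entries=4, replacement_policy="fifo"):
        if not 1 <= entries <= 32:
            raise ValueError("entries must be in 1..32")
        if replacement_policy not in ("fifo", "lru"):
            raise ValueError("replacement_policy must be 'fifo' or 'lru'")
        self.entries = entries
        self.policy = replacement_policy
        self._store = OrderedDict()  # oldest entry first
        self.cache_lookup_count = 0
        self.cache_hit_count = 0

    def lookup(self, key):
        self.cache_lookup_count += 1
        if key not in self._store:
            return None
        self.cache_hit_count += 1
        if self.policy == "lru":
            self._store.move_to_end(key)  # a hit refreshes recency under LRU only
        return self._store[key]

    def insert(self, key, value):
        if key not in self._store and len(self._store) >= self.entries:
            self._store.popitem(last=False)  # evict the front entry
        self._store[key] = value
        if self.policy == "lru":
            self._store.move_to_end(key)

    def keys(self):
        return list(self._store)


def demo(policy):
    cache = StationCache(entries=2, replacement_policy=policy)
    cache.insert("A", 1)
    cache.insert("B", 2)
    cache.lookup("A")     # touch A before the cache overflows
    cache.insert("C", 3)  # forces one eviction
    return sorted(cache.keys())


print(demo("fifo"), demo("lru"))
```

Under FIFO the touch on A does not matter and A (the oldest insert) is evicted; under LRU the touch protects A and B is evicted instead.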
Main JSON fields:
- project, output_module_name
- device
  - preset (e.g. intel_de0_nano, intel_agilex7_fm, xilinx_artix7_100t)
  - grid.x, grid.y
  - widths/depths (data_width, flow_id_width, opcode_width, local_ram_depth, global_ram_depth)
  - data_types (e.g. ["int32", "float16", "float32"])
  - coordinator_mode, enable_runtime_auto_adapt
- abstraction
  - language (wau_flow_ir or wau_pseudoc)
  - version (integer, currently 1)
- operations
  - library-driven (library + overrides) and/or custom
- compiler
  - routing (waterfall, serpentine, manual)
  - allow_adaptive_reroute, fallback_radius, allow_cycle_recurrence
  - core_capabilities: per-core operation/data type constraints (also consumed by CW lowering to prune incompatible candidate cores up-front)
  - station_cache: { "entries": <1..32>, "replacement_policy": "fifo" | "lru" } (default entries=4, replacement_policy=fifo)
- scheduler
  - strategy (round_robin, serial, or dependency_aware)
  - program_policy (weighted_fair, strict_priority, round_robin)
  - locality_bias (float >= 0, default 0.0): routing-aware core-selection tiebreaker that weights each candidate core by its Manhattan hop distance to the cores holding the node's true data-dependency results. Applied only after the earliest-free-cycle key, so it shrinks transfer hops without inflating makespan/latency; 0.0 disables locality weighting. Scheduler ties use explicit replica/runtime-node keys, so output is stable across Python hash seeds. wau_schedule.json exports the matching dependency_edges_v1 metric name, hop total/count/average, and unresolved-edge count.
- coordinator
  - max_in_flight (int [1,16], default 4): hardware capacity of the generated wau_coordinator, i.e. the number of distinct flows it can keep executing concurrently across the core mesh (one accumulator context per slot). Independent flows injected back-to-back overlap on different cores instead of running strictly one-at-a-time. 1 reproduces the legacy serial coordinator. Emitted as WAU_COORD_MAX_IN_FLIGHT. Per-flow results are unchanged; a single in-flight flow keeps identical timing.
- flows
  - id, name, entry, optional exit
  - per-stage: op, optional core, fallback_core, immediate_b, allow_adaptive, dtype
  - per-node (DAG): id, op, deps, placement (core/fallback_core/candidate_cores/fixed/directive), dtype, recurrent, max_iterations
- programs
  - id, name, flows, priority, replicas, max_parallel_flows, load_balance
  - allow_async, allow_out_of_order
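The locality_bias behaviour can be pictured as a secondary sort key. The sketch below is one way to express "earliest free cycle first, then weighted Manhattan hops"; the pick_core helper, the coordinate tuples, and the final coordinate tiebreak are assumptions and not the scheduler's actual code.

```python
def manhattan(a, b):
    """Hop distance between two (x, y) core coordinates on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pick_core(candidates, free_cycle, dep_cores, locality_bias=0.0):
    """Choose a core: earliest free cycle wins, locality only breaks ties."""
    def key(core):
        hops = sum(manhattan(core, dep) for dep in dep_cores)
        # The coordinate is the last key element so the choice is deterministic.
        return (free_cycle[core], locality_bias * hops, core)
    return min(candidates, key=key)


candidates = [(0, 0), (1, 1)]
deps = [(1, 1)]  # the node's input was produced on core (1, 1)

tied = {(0, 0): 5, (1, 1): 5}
print(pick_core(candidates, tied, deps, locality_bias=0.0))  # (0, 0): bias off
print(pick_core(candidates, tied, deps, locality_bias=1.0))  # (1, 1): zero hops

busy = {(0, 0): 3, (1, 1): 5}
print(pick_core(candidates, busy, deps, locality_bias=1.0))  # (0, 0): frees earlier
```

Because the hop term sits behind the free-cycle term in the key, a nearer core never displaces one that becomes free earlier, which is why the bias cannot lengthen the schedule in this model.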
compile-cw supports optional .cw pragmas for practical tuning:
// @wau lane_parallelism=4
// @wau max_in_flight=4
// @wau preferred_dtype=float32
// @wau placement_policy=locality
// @wau lowering_profile=latency_optimized
// @wau program_priority=4
// @wau program_load_balance=least_busy
Precedence is:
- explicit CLI flags (--lane-parallelism, --max-in-flight, --dtype) win,
- otherwise pragma values are used,
- otherwise compile defaults apply.
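The precedence rule can be sketched in a few lines of Python. This is an illustration only: the regular expression, the helper names, and the default values in the example are assumptions, and the real compile-cw parser may accept a different pragma grammar.

```python
import re

# Matches lines such as: // @wau lane_parallelism=4
PRAGMA_RE = re.compile(r"^\s*//\s*@wau\s+(\w+)\s*=\s*(\S+)\s*$")


def parse_pragmas(source):
    """Collect key=value pairs from @wau comment lines (values stay strings)."""
    pragmas = {}
    for line in source.splitlines():
        match = PRAGMA_RE.match(line)
        if match:
            pragmas[match.group(1)] = match.group(2)
    return pragmas


def resolve(name, cli_value, pragmas, defaults):
    """CLI flag wins, then pragma, then compile default."""
    if cli_value is not None:
        return cli_value
    if name in pragmas:
        return pragmas[name]
    return defaults[name]


SOURCE = """\
// @wau lane_parallelism=4
// @wau preferred_dtype=float32
"""
DEFAULTS = {"lane_parallelism": "1", "max_in_flight": "4"}  # hypothetical defaults

pragmas = parse_pragmas(SOURCE)
print(resolve("lane_parallelism", "2", pragmas, DEFAULTS))   # CLI flag wins
print(resolve("lane_parallelism", None, pragmas, DEFAULTS))  # pragma value
print(resolve("max_in_flight", None, pragmas, DEFAULTS))     # compile default
```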
Use cw-lint --compile-template as a fast preflight for .cw sources intended
for compile-cw; use plain cw-lint for host-side language programs that are
not meant to lower onto the WAU grid.
wau_host_mmio exposes a small 32-bit register file with a simple
mmio_read/mmio_write/mmio_address/mmio_writedata/mmio_readdata bus that
external host software (Avalon-MM, NIOS-II, on-chip CPU, etc.) can drive. The
DE0-NANO wrapper instantiates it and additionally emulates writes from KEY[1]
plus SW[3:0] for stand-alone board demos.
Word-addressed map:
| Addr | Name | Access | Meaning |
|---|---|---|---|
| 0x00 | CTRL | RW | [0] soft_reset_request (auto-clears), [1] enable_auto_adapt |
| 0x01 | STATUS | R | [0] host_in_ready, [1] host_out_valid, [2] output_pending (sticky) |
| 0x02 | FLOW_ID | RW | Flow id used by next TRIGGER |
| 0x03 | IN_A | RW | Operand A latched into the coordinator on TRIGGER |
| 0x04 | IN_B | RW | Operand B latched into the coordinator on TRIGGER |
| 0x05 | TRIGGER | W1S | Any write raises host_in_valid until accepted |
| 0x10 | OUT_FLOW | R | Last host_out_flow_id (reading also clears output_pending) |
| 0x11 | OUT_VAL | R | Last host_out_value (reading also clears output_pending) |
| 0x12 | HOPS | R | obs_total_hop_count (sum across control/data router meshes) |
| 0x13 | STALLS | R | obs_total_stall_count |
| 0x14 | FORWARDS | R | obs_total_forward_count (packets forwarded between neighbors) |
| 0x15 | DELIVRD | R | obs_total_local_delivered_count (packets exiting the mesh locally) |
| 0x16 | CACHE_H | R | obs_total_cache_hit_count (sum across all core stations) |
| 0x17 | CACHE_L | R | obs_total_cache_lookup_count |
The same counters are also available as direct ports on wau_top for
non-MMIO integrations.
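A host-side transaction against this register map might look like the sketch below. The register addresses and STATUS bit positions come from the table; the FakeBus class is a stand-in that simply adds the operands on TRIGGER so the example runs without hardware, and run_flow, its polling loop, and the read/write method names are assumptions rather than the repository's host API.

```python
CTRL, STATUS, FLOW_ID, IN_A, IN_B, TRIGGER = 0x00, 0x01, 0x02, 0x03, 0x04, 0x05
OUT_FLOW, OUT_VAL = 0x10, 0x11
STATUS_IN_READY = 1 << 0
STATUS_OUT_VALID = 1 << 1
STATUS_OUTPUT_PENDING = 1 << 2
MASK32 = 0xFFFFFFFF


class FakeBus:
    """Stand-in device: computes a + b immediately on TRIGGER (not the real WAU)."""

    def __init__(self):
        self.regs = {CTRL: 0, FLOW_ID: 0, IN_A: 0, IN_B: 0, OUT_FLOW: 0, OUT_VAL: 0}
        self.pending = False

    def write(self, addr, value):
        if addr == TRIGGER:
            self.regs[OUT_FLOW] = self.regs[FLOW_ID]
            self.regs[OUT_VAL] = (self.regs[IN_A] + self.regs[IN_B]) & MASK32
            self.pending = True
        else:
            self.regs[addr] = value & MASK32

    def read(self, addr):
        if addr == STATUS:
            return STATUS_IN_READY | (STATUS_OUTPUT_PENDING if self.pending else 0)
        value = self.regs[addr]
        if addr in (OUT_FLOW, OUT_VAL):
            self.pending = False  # reading an output register clears the sticky bit
        return value


def run_flow(bus, flow_id, a, b, max_polls=1000):
    """Load operands, trigger, poll output_pending, then read the result."""
    bus.write(FLOW_ID, flow_id)
    bus.write(IN_A, a & MASK32)
    bus.write(IN_B, b & MASK32)
    bus.write(TRIGGER, 1)
    for _ in range(max_polls):
        if bus.read(STATUS) & STATUS_OUTPUT_PENDING:
            break
    else:
        raise TimeoutError("output_pending never asserted")
    return bus.read(OUT_FLOW), bus.read(OUT_VAL)


print(run_flow(FakeBus(), flow_id=1, a=2, b=3))
```

Polling the sticky output_pending bit before touching OUT_FLOW or OUT_VAL matters because either read clears it; a host that reads the outputs early could miss the completion.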
.github/workflows/ci.yml runs on every push and PR:
- python-tests: full unittest discovery on tests/python (compiler, scheduler, CW frontends, CW reference scoreboard, program-stress matrix).
- randomized-stress: 50-seed sweep of scripts/run_randomized_stress.py with JSON report artifact.
- iverilog-tests: installs Icarus Verilog and runs scripts/run_iverilog_tests.sh (uploads generated RTL as artifact).
- cw-benchmark: runs scripts/run_cw_example_benchmark.sh with the autotuned knobs, surfaces a summary into the GitHub Step Summary, and uploads benchmarks/* plus cw_scoreboard.json as artifacts (30-day retention).
This is a robust basis, not final silicon architecture:
- Control-plane dispatch and data-plane results now traverse explicit neighbor-linked highway meshes with valid/ready backpressure.
- Runtime adaptation is implemented as primary/fallback/candidate core selection per node, constrained by per-core capability metadata.
- Compiler and scheduler outputs are designed so an external compiler/scheduler stack can replace or augment coordinator behavior.
- The current pseudo-C frontend targets accumulator-style pipelines (acc = a; acc = acc <op> ...) to stay compatible with the present coordinator execution model.
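To show what an accumulator-style pipeline means in practice, the sketch below lowers the pseudo-C program from the compile-pseudoc example into an ordered stage list and evaluates it. It is not the waugen frontend: the regular expressions, the add/sub/mul stage names, and the helper functions are assumptions, and only the three operators used in that example are handled.

```python
import re

OPS = {"+": "add", "-": "sub", "*": "mul"}  # assumed stage names


def lower_accumulator(program):
    """Turn 'acc = a; acc = acc + b; ...' into [(op, operand), ...]."""
    statements = [s.strip() for s in program.split(";") if s.strip()]
    if not statements or not re.fullmatch(r"acc\s*=\s*a", statements[0]):
        raise ValueError("program must start with 'acc = a'")
    stages = []
    for stmt in statements[1:]:
        match = (re.fullmatch(r"acc\s*=\s*acc\s*([+\-*])\s*(\w+)", stmt)
                 or re.fullmatch(r"acc\s*([+\-*])=\s*(\w+)", stmt))
        if match is None:
            raise ValueError(f"not an accumulator update: {stmt!r}")
        op, operand = match.groups()
        stages.append((OPS[op], operand))
    return stages


def run(stages, a, b):
    """Evaluate the stage list with plain Python integers."""
    acc = a
    for op, operand in stages:
        rhs = {"a": a, "b": b}[operand] if operand in ("a", "b") else int(operand)
        if op == "add":
            acc += rhs
        elif op == "sub":
            acc -= rhs
        else:
            acc *= rhs
    return acc


stages = lower_accumulator("acc = a; acc = acc + b; acc = acc * 3; acc -= b;")
print(stages)
print(run(stages, a=5, b=2))  # ((5 + 2) * 3) - 2 = 19
```

Each statement reads and rewrites the single accumulator, so the program is a straight chain of stages with no branching, which is the shape the compile-expr example ((a + b) * 3) - b also produces.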
The demo/de0-nano/basic-example/ project is the first end-to-end physical
deployment of the WAU on actual FPGA silicon: a Terasic DE0-Nano (Intel
Cyclone IV E EP4CE22F17C6) talking to a Python host over USB-Blaster + Altera
virtual JTAG. It exists as a working reference for everyone who wants to take
the generator's RTL and put it onto a real board.
What the demo bundles:
- Reusable RTL: a generic wau_vjtag_bridge.v (4-bit IR JTAG↔MMIO master, with TCK↔CLOCK_50 CDC done right via toggle-sync + double-FF data crossing) and a thin vJTAG.v wrapping sld_virtual_jtag. Drop them into any Altera design that needs a host-driven Avalon-MM-style register file.
- Reusable host stack: a layered Python library (waujtag.TCLClient → MMIO → WAU → Bench) plus a quartus_stp-hosted TCL line-protocol server. The lower layers know nothing about WAU and can drive any compatible bridge.
- A working Quartus 25.1 project: pin assignments, SDC, board top wiring the WAU wau_host_mmio and the bridge, and Make/PowerShell automation that goes from JSON config → RTL → .sof → programmed board → benchmark report.
Reference run captured 2026-05-24 (see
benchmarks/de0_nano_basic_benchmark.txt
for the full machine-readable snapshot):
| Flow | Stages | Reference | n | Pass | Throughput | p50 / p95 |
|---|---|---|---|---|---|---|
| flow1_accumulate_and_scale | 3 | ((a + b) * 3) - b | 265 | 265/265 | 85.2 op/s | 15 / 16 ms |
| flow2_max_then_scale | 3 | (max(a, b) - b) * 2 | 265 | 265/265 | 90.7 op/s | 15 / 16 ms |
| flow3_fma_a_b_plus_b | 2 | a * b + b | 265 | 265/265 | 92.7 op/s | 15 / 16 ms |
| Aggregate scoreboard | | | 795 | 795/795 (100 %) | ~90 op/s | |
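The reference column can be turned into a tiny software oracle. The sketch below is not the repository's reference model: it assumes int32 two's-complement wraparound on the result, and the REFERENCE table, to_int32, and scoreboard helpers are hypothetical names. The observed values in the example are invented to show one mismatch.

```python
def to_int32(x):
    """Wrap a Python integer to signed 32-bit two's complement (assumed semantics)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


REFERENCE = {
    "flow1_accumulate_and_scale": lambda a, b: to_int32((a + b) * 3 - b),
    "flow2_max_then_scale": lambda a, b: to_int32((max(a, b) - b) * 2),
    "flow3_fma_a_b_plus_b": lambda a, b: to_int32(a * b + b),
}


def scoreboard(results):
    """results: iterable of (flow_name, a, b, observed_value)."""
    total = matches = 0
    for flow, a, b, observed in results:
        total += 1
        if to_int32(observed) == REFERENCE[flow](a, b):
            matches += 1
    return {
        "scoreboard_total": total,
        "scoreboard_matches": matches,
        "scoreboard_pass_ratio": matches / total if total else 0.0,
    }


# Invented observations: the last one is deliberately wrong.
observed = [
    ("flow1_accumulate_and_scale", 5, 2, 19),
    ("flow2_max_then_scale", 7, 3, 8),
    ("flow3_fma_a_b_plus_b", 4, 5, 24),
]
print(scoreboard(observed))
```

A hard gate in this style would require scoreboard_pass_ratio to equal 1.0, which is what the 795/795 result corresponds to.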
Live router/cache observability deltas confirm the data plane really does
traverse the mesh (not a degenerate short-circuit): 6 890 total hops,
0 stall events, 4 240 packets locally delivered, 93 / 2 120 station-cache
hits (4.4 % — expected for random operand pairs).
Post-fit on EP4CE22F17C6 (Quartus Standard 25.1, 2×2 grid, int32, 4 ops):
| Metric | Used | Available | % |
|---|---|---|---|
| Total logic elements | 8 248 | 22 320 | 37 % |
| Dedicated logic registers | 3 652 | 22 320 | 16 % |
| Embedded 9-bit multipliers | 24 | 132 | 18 % |
| Total memory bits | 0 | 608 256 | 0 % |
| I/O pins | 66 | 154 | 43 % |
Setup timing closes at the Fast corner (+4.06 ns slack) and the empirically verified room-temperature build runs cleanly at 50 MHz. Per-corner Fmax: 36 MHz @ slow-85 °C, 40 MHz @ slow-0 °C, > 50 MHz @ fast-0 °C — see section 4 of the benchmark txt for the honest worst-case story.
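The relationship between these Fmax figures and the 50 MHz clock can be checked with simple period arithmetic. The sketch below uses the idealised relation slack = target period - 1/Fmax, which ignores clock uncertainty and other constraint details, so treat the outputs as rough estimates rather than timing-report values.

```python
TARGET_MHZ = 50.0


def period_ns(freq_mhz):
    return 1000.0 / freq_mhz


def slack_ns(fmax_mhz, target_mhz=TARGET_MHZ):
    """Positive when the critical path fits inside the target clock period."""
    return period_ns(target_mhz) - period_ns(fmax_mhz)


def fmax_from_slack_mhz(slack, target_mhz=TARGET_MHZ):
    """Invert the same idealised relation to estimate Fmax from setup slack."""
    return 1000.0 / (period_ns(target_mhz) - slack)


for corner, fmax in (("slow-85C", 36.0), ("slow-0C", 40.0)):
    print(f"{corner}: {slack_ns(fmax):+.2f} ns at {TARGET_MHZ:.0f} MHz")
print(f"fast corner: about {fmax_from_slack_mhz(4.06):.1f} MHz")
```

Under this model the two slow corners come out several nanoseconds short of the 20 ns period, while the fast corner's +4.06 ns slack corresponds to an Fmax a little above 60 MHz, consistent with the "> 50 MHz" figure.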
- The WAU works on real silicon. 795 / 795 random + corner-case operand pairs round-tripped through the live mesh and matched the software reference, across 3 signed flows spanning add / sub / mul / max on all four cores of the 2×2 grid, with zero stall events recorded.
- The generator's flow IR → Verilog pipeline is production-faithful. The same Python compiler that produces wau_program.json for the testbench also produces the RTL behind the bitstream that just passed on hardware, without any per-board manual RTL edits.
- The vJTAG bridge + Python stack are reusable. They were written device-agnostic and the demo deliberately uses them as libraries, so any follow-on project (different grid, different ops, different board) only has to write its own board-level pin wrapper.
- Two real architectural issues were uncovered and documented honestly rather than papered over:
  - dst_core % GRID_X in wau_highway_router.v infers an LPM_DIVIDE per router port when GRID_X is not a power of 2. A 3×2 grid blows past the EP4CE22 LE budget (26 866 vs 22 320). Power-of-2 grids collapse the mod/div to bit-selects and fit with room to spare.
  - wau_operation_alu.v emits a purely combinational signed div whose 32-bit settling time exceeds one 50 MHz period on Cyclone IV E, and wau_core_station.v latches alu_out_value on the first cycle after dispatch, so divide results are captured before the divider settles and read back as garbage. The benchmark excludes div for this reason; the upstream fix is to defer the result-latch to wait_cycles == 0 or to swap in a pipelined LPM_DIVIDE.
- Where the throughput goes. Per-trigger wall-clock latency (~15 ms)
is dominated by USB-Blaster JTAG round-trip, not by the WAU. The WAU
itself completes a 2–3 stage flow in well under 20 cycles at 50 MHz
(< 400 ns). To turn this into a real compute benchmark instead of a
control benchmark, the natural next step is a host-side burst loader
  that streams many operands through MMIO before draining results; the
  wau_host_mmio register file already supports the pattern.
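The power-of-2 observation in the router issue above follows from a standard identity: when GRID_X is a power of two, modulo and division reduce to a mask and a shift. The Python sketch below checks that identity; the row-major core numbering (core = y * GRID_X + x) and the core_xy helper are assumptions for illustration, not taken from the RTL.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def core_xy(dst_core, grid_x):
    """Split a core index into (x, y), assuming row-major numbering."""
    if is_power_of_two(grid_x):
        shift = grid_x.bit_length() - 1          # log2(grid_x)
        return dst_core & (grid_x - 1), dst_core >> shift  # bit-select only
    return dst_core % grid_x, dst_core // grid_x  # needs a real divider


# The mask/shift path agrees with true mod/div for every power-of-two width.
for grid_x in (1, 2, 4, 8):
    for core in range(grid_x * 4):
        assert core_xy(core, grid_x) == (core % grid_x, core // grid_x)

print(core_xy(5, 4))  # core 5 on a 4-wide grid
print(core_xy(5, 3))  # core 5 on a 3-wide grid takes the divider path
```

In hardware terms the first branch is pure wiring, while the second branch is what makes synthesis infer a divider for each use when the grid width is not a power of two.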
See ROADMAP.md for the full plan. Recommended follow-ups now that observability/MMIO/CI/cache-policy basics are in place:
- closed-loop on-FPGA benchmarking that pushes new schedules through the MMIO bus without reflashing the bitstream,
- deepen the waugen arch-search reports (first simulation-side slice landed: ranked grid-shape/op-specialization/memory-split/DRAM candidates) with synthesis-tool-calibrated area/fmax numbers and board-measured scores,
- CW software reference parity across the wider operation set (currently calibrated against add/mul/max paths used by the example kernel).
PolyForm Noncommercial License 1.0.0 - Copyright 2026 Riccardo Cecchini
