Skip to content

fix: input validation for env vars and narrowing conversions#378

Open
randomizedcoder wants to merge 3 commits into
NVIDIA:masterfrom
randomizedcoder:sa/02-atoi-timer
Open

fix: input validation for env vars and narrowing conversions#378
randomizedcoder wants to merge 3 commits into
NVIDIA:masterfrom
randomizedcoder:sa/02-atoi-timer

Conversation

@randomizedcoder

Copy link
Copy Markdown

Static analysis findings

flawfinder (Level 2, CWE-190) and clang-tidy (cert-err34-c) flagged unsafe integer/float parsing from environment variables. GCC -Wconversion and -Wsign-conversion flagged implicit narrowing conversions in timer.cc.

Finding Tool Location Severity
atoi() with no range check or error detection flawfinder common.cu:1330, util.cu:555 CWE-190 integer overflow
atof() with no error detection flawfinder common.cu:1596 CWE-20 improper input validation
Implicit signed → unsigned narrowing GCC -Wsign-conversion timer.cc:10 Data loss risk
Implicit uint64_tdouble narrowing GCC -Wconversion timer.cc:20,25 Precision loss risk

Why these matter

atoi() silently returns 0 on garbage input — indistinguishable from atoi("0"). For NCCL_TESTS_DEVICE=abc, the current code silently selects GPU 0 instead of reporting an error. For NCCL_TESTS_DEVICE=99999999999, atoi() has undefined behavior on integer overflow — it typically wraps to a garbage value, which is then passed as a GPU device ID to CUDA.

atof() has the same problem. atof("not_a_number") silently returns 0.0, so NCCL_TESTS_MIN_BW=garbage silently sets the minimum bandwidth threshold to 0.0 instead of the intended default of -1 (disabled).

The codebase already has a parseInt() helper using strtoul() with error checking, but it's behind #ifdef MPI_SUPPORT and not used for these env vars.

timer.cc narrowing conversions are semantically safe (durations are non-negative, nanosecond counts fit in double precision) but should be explicit to satisfy -Wconversion and signal intentionality to reviewers.

Git history

The affected code spans three files and three different time periods:

Commit Date Author File Description
d313d20 2022-09-20 Sylvain Jeaugey common.cu, timer.cc "Update NCCL tests" — introduced NCCL_TESTS_MIN_BW with atof() and the timer.cc narrowing patterns
00f5281 2025-10-17 David Addison common.cu, util.cu "Add support for JSON output to perf test framework" — moved NCCL_TESTS_DEVICE handling with atoi() into the current structure; added the util.cu copy
760c467 2025-11-02 David Addison common.cu "Add memory usage report option" — re-introduced atof() for NCCL_TESTS_MIN_BW in the current location

The atoi pattern for NCCL_TESTS_DEVICE traces back to d313d20 (Sep 2022) — nearly 4 years old. The atof pattern for NCCL_TESTS_MIN_BW was originally introduced in the same 2022 commit. Both patterns have been carried forward through multiple refactors by different contributors. The timer.cc narrowing has been unchanged since its creation in the same 2022 commit.

These are all inherited patterns rather than deliberate choices by the current maintainers. The atoi/atof idiom was standard practice in older C code but has been superseded by strtol/strtod which provide error detection that atoi/atof fundamentally cannot.

Changes

Commit 1: Tests (TDD — intentionally FAIL before fix)

  • tests/c/test_input_validation.c — 9 test cases:
    • test_source_verified — greps source files for unsafe patterns (intentionally FAILS)
    • test_atoi_garbage_exploit — proves atoi("abc") is indistinguishable from atoi("0")
    • test_atoi_overflow_exploit — proves atoi("99999999999") wraps to garbage value
    • test_atof_garbage_exploit — proves atof("xyz") is indistinguishable from atof("0")
    • test_strtol_validation — table-driven: 9 cases proving strtol correctness
    • test_strtod_validation — table-driven: 7 cases proving strtod correctness
    • test_strtol_device_id_pattern — reproduces the exact NCCL_TESTS_DEVICE safe parsing
    • test_signed_to_unsigned_cast — demonstrates signed→unsigned wrap behavior
    • test_uint64_to_double_precision — proves precision loss for uint64_t > 2^53

Commit 2: Fix atoi/atof

  • common.cu: atoistrtol for NCCL_TESTS_DEVICE, atofstrtod for NCCL_TESTS_MIN_BW
  • util.cu: atoistrtol for NCCL_TESTS_DEVICE
  • Invalid values now print a warning to stderr and fall back to the default

Commit 3: Fix timer.cc narrowing

  • 3x static_cast for explicit narrowing conversions

Test plan

  • make -C tests/c test — 9/9 pass after all commits
  • Commit 1 alone shows 8/9 pass (source-verification intentionally fails)
  • Commits 2+3 bring source-verification to PASS → 9/9
  • Adversarial tests prove the vulnerabilities are real

Files changed

  • src/common.cuatoistrtol (NCCL_TESTS_DEVICE), atofstrtod (NCCL_TESTS_MIN_BW)
  • src/util.cuatoistrtol (NCCL_TESTS_DEVICE)
  • src/timer.cc — 3x static_cast for narrowing conversions
  • tests/c/test_input_validation.c — 9 test cases (new file)

🤖 Generated with Claude Code

randomizedcoder and others added 3 commits April 28, 2026 16:41
Table-driven tests for two classes of unsafe implicit conversion:

  - atoi/atof: demonstrates silent failure on garbage input, integer
    overflow, and empty strings vs strtol/strtod error detection
  - narrowing: demonstrates uint64_t -> double precision loss for
    large values and signed -> unsigned sign change

Source-verification test intentionally FAILS at this commit to
demonstrate the unsafe patterns exist before the fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
atoi() silently returns 0 on non-numeric input and has undefined
behavior on integer overflow (CWE-190). atof() similarly provides
no error detection. Replace with strtol/strtod which detect:

  - Non-numeric input (endptr == start)
  - Overflow (errno == ERANGE)
  - Partial parse (trailing garbage after number)

Environment variables affected:
  - NCCL_TESTS_DEVICE: GPU device ID (common.cu + util.cu)
  - NCCL_TESTS_MIN_BW: minimum bandwidth threshold (common.cu)

Invalid values now print a warning and fall back to the default.

Signed-off-by: dave.seddon.ca@gmail.com
Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
GCC -Wconversion and -Wsign-conversion flag three implicit
narrowing conversions:

  - now(): chrono::duration::rep (signed long) to uint64_t
  - elapsed(): uint64_t to double
  - reset(): uint64_t to double

Add static_cast to make the conversions explicit and silence
the warnings. All conversions are semantically safe (durations
are non-negative, nanosecond counts fit in double precision).

Signed-off-by: dave.seddon.ca@gmail.com
Signed-off-by: randomizedcoder dave.seddon.ca@gmail.com <dave.seddon.ca@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant