Conversation
Pull request overview
Adds a new LISA test suite and sample runbook/CSV to validate that provisioned Azure VMs match declared hardware specifications (CPU, memory, NICs, disks, NVMe, and optional perf thresholds), driven by a CSV combinator.
Changes:
- Introduces a `VmSpecValidation` test suite with checks for CPU/memory, NIC counts, disk/NVMe counts, and basic fio/ntttcp perf validation.
- Adds an example runbook using the CSV combinator to iterate VM sizes and pass `expected_*` variables into each test iteration.
- Adds a sample `vm_specs.csv` with one VM size row to demonstrate the format.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| lisa/microsoft/testsuites/vm_spec_validation/vm_spec_validation.py | New VM spec validation test suite (functional + light perf checks). |
| lisa/microsoft/runbook/examples/vm_specs.csv | Sample expected-spec CSV schema + example row. |
| lisa/microsoft/runbook/examples/vm_spec_validation.yml | Example runbook wiring CSV combinator variables into the new suite. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```python
@TestCaseMetadata(
    description="""
    Verify that the VM exposes the expected number of GPU devices.

    This test is skipped when the ``expected_gpu_count`` variable
    is empty or zero.

    Steps:
    1. Read expected_gpu_count from the runbook variables.
    2. Query GPU PCI devices via ``lspci``.
    3. Assert the GPU device count matches expected_gpu_count.
    """,
    priority=5,
    requirement=simple_requirement(min_gpu_count=1),
)
```
This test case claims it is skipped when expected_gpu_count is empty/zero, but the requirement includes simple_requirement(min_gpu_count=1), which will prevent the case from running on non-GPU SKUs (so it won't actually exercise the skip logic or validate expected_gpu_count=0). Consider removing the min_gpu_count requirement and letting the test skip based on the variable, or alternatively update the description/variables to align with the enforced GPU requirement.
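The skip-based approach this comment suggests can be sketched as a pure check function. This is a minimal sketch, not the suite's actual code: `SkippedException` is a stand-in for LISA's exception of the same name, and `check_gpu_count` is a hypothetical helper isolating the decision logic.

```python
class SkippedException(Exception):
    """Stand-in for LISA's SkippedException (assumption for this sketch)."""


def check_gpu_count(expected_gpu_count, actual_gpu_count):
    # Skip, rather than fail, when no GPU expectation is declared, so the
    # case can run on non-GPU SKUs without a min_gpu_count requirement
    # gating it out of the test pass entirely.
    if not expected_gpu_count:
        raise SkippedException("expected_gpu_count is empty or zero")
    assert actual_gpu_count == expected_gpu_count, (
        f"expected {expected_gpu_count} GPU(s), found {actual_gpu_count}"
    )
```

With this shape, a zero or empty `expected_gpu_count` reliably exercises the documented skip path instead of being filtered out by the requirement.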
```python
# Start ntttcp receiver (server) in the background
server_process = server_ntttcp.run_as_server_async(
    server_nic_name,
    run_time_seconds=60,  # 60s for a stable throughput measurement
    ports_count=ports_count,
)

try:
    # Run ntttcp sender (client)
    client_ntttcp.run_as_client(
        client_nic_name,
        server_ip=server_node.internal_address,
        threads_count=1,
        run_time_seconds=60,  # match the server duration
        ports_count=ports_count,
    )
finally:
    # Stop the server and collect its result
    server_node.tools[Kill].by_name(server_ntttcp.command)

server_executable_result = server_process.wait_result()
server_result = server_ntttcp.create_ntttcp_result(
    server_executable_result
)

# ntttcp reports throughput in Gbps — convert to Mbps to match
# the runbook variable unit.
measured_bw_mbps = int(server_result.throughput_in_gbps * 1000)

# Allow tolerance below the declared max
bw_floor_mbps = int(
    expected_bw_mbps * (100 - _PERF_TOLERANCE_PERCENT) / 100
)
log.info(
    f"VM size: {vm_size} — ntttcp network bandwidth: "
    f"measured {measured_bw_mbps} Mbps, "
    f"expected >= {bw_floor_mbps} Mbps "
    f"(declared: {expected_bw_mbps} Mbps, "
    f"tolerance: {_PERF_TOLERANCE_PERCENT}%)"
)
assert_that(measured_bw_mbps).described_as(
    f"VM size {vm_size}: expected network throughput "
    f">= {bw_floor_mbps} Mbps (declared {expected_bw_mbps} Mbps "
    f"with {_PERF_TOLERANCE_PERCENT}% tolerance) but measured "
    f"only {measured_bw_mbps} Mbps"
```
For bandwidth validation, this uses ntttcp directly without the usual Ntttcp.setup_system(...)/restore_system(...) tuning and without any retry/cleanup logic. Existing ntttcp-based perf tests (e.g. lisa/microsoft/testsuites/performance/common.py) explicitly tune sysctls/firewall and include retries to reduce flakiness. Consider following the same pattern here to improve reliability of the measured throughput vs. declared expectations.
Suggested change:
```python
# Allow tolerance below the declared max
bw_floor_mbps = int(
    expected_bw_mbps * (100 - _PERF_TOLERANCE_PERCENT) / 100
)
# Retry up to 3 times to reduce transient bandwidth variance.
max_attempts = 3
best_measured_bw_mbps = 0
server_ntttcp.setup_system()
client_ntttcp.setup_system()
try:
    for attempt in range(1, max_attempts + 1):
        # Start ntttcp receiver (server) in the background
        server_process = server_ntttcp.run_as_server_async(
            server_nic_name,
            run_time_seconds=60,  # 60s for a stable throughput measurement
            ports_count=ports_count,
        )
        try:
            # Run ntttcp sender (client)
            client_ntttcp.run_as_client(
                client_nic_name,
                server_ip=server_node.internal_address,
                threads_count=1,
                run_time_seconds=60,  # match the server duration
                ports_count=ports_count,
            )
        finally:
            # Stop the server and collect its result
            server_node.tools[Kill].by_name(server_ntttcp.command)
        server_executable_result = server_process.wait_result()
        server_result = server_ntttcp.create_ntttcp_result(
            server_executable_result
        )
        # ntttcp reports throughput in Gbps — convert to Mbps to match
        # the runbook variable unit.
        measured_bw_mbps = int(server_result.throughput_in_gbps * 1000)
        best_measured_bw_mbps = max(
            best_measured_bw_mbps, measured_bw_mbps
        )
        log.info(
            f"VM size: {vm_size} — ntttcp network bandwidth attempt "
            f"{attempt}/{max_attempts}: measured {measured_bw_mbps} Mbps, "
            f"expected >= {bw_floor_mbps} Mbps "
            f"(declared: {expected_bw_mbps} Mbps, "
            f"tolerance: {_PERF_TOLERANCE_PERCENT}%)"
        )
        if measured_bw_mbps >= bw_floor_mbps:
            break
finally:
    server_node.tools[Kill].by_name(server_ntttcp.command)
    server_ntttcp.restore_system()
    client_ntttcp.restore_system()
log.info(
    f"VM size: {vm_size} — best ntttcp network bandwidth: "
    f"measured {best_measured_bw_mbps} Mbps, "
    f"expected >= {bw_floor_mbps} Mbps "
    f"(declared: {expected_bw_mbps} Mbps, "
    f"tolerance: {_PERF_TOLERANCE_PERCENT}%)"
)
assert_that(best_measured_bw_mbps).described_as(
    f"VM size {vm_size}: expected network throughput "
    f">= {bw_floor_mbps} Mbps (declared {expected_bw_mbps} Mbps "
    f"with {_PERF_TOLERANCE_PERCENT}% tolerance) but best measured "
    f"value across {max_attempts} attempt(s) was only "
    f"{best_measured_bw_mbps} Mbps"
```
```yaml
# Usage:
#   lisa -r lisa\microsoft\runbook\examples\vm_spec_validation.yml \
#     --variable "subscription_id:<your_sub_id>" \
#     --variable "csv_file:lisa\microsoft\runbook\examples\vm_specs.csv"
#
# You can also override location, marketplace_image, etc. from the CLI.

test_pass: "adhoc"

name: vm_spec_validation

include:
  - path: ../azure.yml

extension:
  - ../../testsuites

variable:
  # Path to the CSV file with VM specifications (override via CLI if needed)
  - name: csv_file
    value: 'lisa\microsoft\runbook\examples\vm_specs.csv'
```
This runbook example uses Windows-style backslashes in paths (both in the CLI example and the default csv_file value). Other runbook examples in this repo use forward slashes (e.g. lisa/microsoft/runbook/examples/git_bisect.yml), which are more portable across OSes/shells. Consider switching to forward slashes in the example paths to make the sample runbook work out-of-the-box on Linux runners.
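One portable way to present the example paths, sketched with only the standard library: `PureWindowsPath` parses backslash-separated paths on any host OS, and `as_posix()` yields the forward-slash form the comment recommends.

```python
from pathlib import PureWindowsPath

# The Windows-style default from the example runbook.
csv_file = r"lisa\microsoft\runbook\examples\vm_specs.csv"

# PureWindowsPath understands the backslashes even on Linux/macOS;
# as_posix() produces the cross-platform forward-slash form.
portable = PureWindowsPath(csv_file).as_posix()
```

Using the forward-slash form directly in the runbook avoids the issue entirely, since forward slashes also work on Windows.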
```python
) -> None:
    expected_memory_gb = _get_int_var(variables, VAR_EXPECTED_MEMORY_GB)
    expected_memory_mb = expected_memory_gb * 1024
    actual_memory_mb = node.tools[Free].get_free_memory_mb()
```
verify_vm_memory is comparing the expected total RAM against Free.get_free_memory_mb(), which returns free/unused memory and will vary with workload/OS caches. This will cause false failures even when the VM has the correct total RAM. Use total memory (e.g., Free.get_total_memory_gb() / total in MiB) for the comparison instead of free memory.
Suggested change:
```python
actual_memory_mb = int(node.tools[Free].get_total_memory_gb() * 1024)
```
get_free_memory_mb gives better granularity than "get_total_memory_gb() * 1024"
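To get total memory with MiB granularity (avoiding both the free-memory comparison and the GB round-trip), one option is parsing `MemTotal` from `/proc/meminfo` directly. A minimal sketch; `total_memory_mb` is a hypothetical helper, not part of LISA's `Free` tool.

```python
def total_memory_mb(meminfo_text):
    # /proc/meminfo reports MemTotal in KiB; integer-dividing by 1024
    # yields MiB directly, keeping full granularity.
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found in meminfo output")
```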
```python
    expected_bw * (100 + _PERF_TOLERANCE_PERCENT) / 100
)
log.info(
    f"VM size: {vm_size} - fio seq read across "
    f"{len(data_disks)} disk(s): "
    f"measured {measured_bw} MiB/s, "
    f"expected {bw_floor}-{bw_ceiling} MiB/s "
    f"(declared: {expected_bw} MBps)"
)
```
The throughput check mixes units: the docstring/CSV variable is labeled as "MBps" but the measurement derives from fio IOPS with block_size=1024K, which is effectively MiB/s. Either define the runbook variable as MiB/s, or convert MB/s to MiB/s (1000/1024) before applying the tolerance and reporting results to avoid confusion and inconsistent expectations.
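The MB/s-to-MiB/s conversion the comment asks for is a fixed scale factor. A small sketch under the stated assumption that the CSV value is decimal megabytes per second; the helper name is illustrative.

```python
# 1 decimal MB expressed in binary MiB: 10^6 / 2^20.
MB_PER_MIB = (1000 ** 2) / (1024 ** 2)


def mb_to_mib_per_s(mb_per_s):
    # Convert a CSV value declared in MB/s into MiB/s, so it can be
    # compared on the same scale as fio's 1024K-block throughput.
    return mb_per_s * MB_PER_MIB
```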
```python
iodepth=64,
numjob=4,
block_size="4K",
size_gb=8192,
```
fio.launch(..., size_gb=8192, ...) is potentially misleading: in LISA's Fio tool, size_gb is passed as --size=<value>M (MiB), so 8192 means ~8 GiB per target, not 8192 GiB. Add an inline comment clarifying the intended unit/size (or adjust the value) so future readers don't assume this is a multi-terabyte workload.
Suggested change:
```python
size_gb=8192,  # LISA Fio passes this as MiB, so this is ~8 GiB per target.
```
```python
filename=filename,
mode="randread",
iodepth=64,
numjob=thread_count,
block_size="4K",
size_gb=8192,
time=120,
overwrite=True,
```
The fio workload parameters here (e.g., iodepth=64, numjob=thread_count, time=120) materially change test duration and load but are currently undocumented magic numbers. Add short inline comments (or named module-level constants) for these parameters so it’s clear why these values were chosen and how to tune them safely.
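The named-constant version could look like the following sketch. The values are copied from the snippet above; the rationale comments are assumptions about why such values are typically chosen, not statements from the PR author.

```python
# Hypothetical module-level constants for the fio workload parameters.
FIO_IODEPTH = 64            # outstanding I/Os per job; deep enough to keep a premium SSD queue busy
FIO_NUM_JOBS = 4            # parallel fio jobs per run
FIO_BLOCK_SIZE_IOPS = "4K"  # small blocks for IOPS-oriented random reads
FIO_SIZE_MIB = 8192         # LISA's Fio passes size as MiB, so ~8 GiB per target
FIO_RUNTIME_SECONDS = 120   # long enough to smooth out burst credits and cache warm-up
```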
```python
class VmSpecValidation(TestSuite):
    """Validate VM hardware against runbook-declared specifications."""

    # ------------------------------------------------------------------
    # CPU validation
    # ------------------------------------------------------------------
    @TestCaseMetadata(
        description="""
        Verify that the VM's vCPU count matches the expected value from
```
The PR description’s listed "Key Test Cases" includes names that don’t match the implemented methods in this suite (e.g., it mentions verify_vm_max_premium_data_disks / verify_vm_premium_ssd_storage_throughput / verify_vm_spec_summary, but the code defines verify_vm_max_premium_ssd_disk_count / verify_vm_premium_ssd_throughput and has no verify_vm_spec_summary). Please update the PR description (or align method names/implement the missing case) so readers can reliably run the referenced tests.
```python
# Strip any non-digit characters (e.g. commas, units, spaces) before parsing.
digits_only = re.sub(r"[^\d]", "", str(raw))
if not digits_only:
    raise SkippedException(
        f"Variable '{name}' contains no digits - skipping check."
    )
value = int(digits_only)
if value <= 0:
    raise SkippedException(
        f"Variable '{name}' is {value} (zero or negative) - skipping check."
    )
```
_get_int_var() removes all non-digit characters before parsing. This drops the sign, so values like "-1" become "1" and will not be skipped as intended. Consider parsing with a regex that preserves an optional leading +/- (or explicitly rejecting any value containing '-' before stripping), so negative values reliably trigger the "zero or negative" skip path.
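A sign-preserving parse can be sketched as below; `parse_expected_int` is a hypothetical replacement for the stripping step inside `_get_int_var`, not the suite's actual code.

```python
import re


def parse_expected_int(raw):
    # Match an optional leading sign plus digits, so "-1" parses as -1
    # (not 1), while still tolerating commas and units like "4,096 MB".
    match = re.search(r"[+-]?\d+", str(raw).replace(",", ""))
    return int(match.group()) if match else None
```

With the sign preserved, the existing `value <= 0` check correctly routes negative inputs to the skip path.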
```python
iodepth=64,
numjob=4,
block_size="4K",
size_gb=8192,
```
The fio wrapper's size_gb argument is passed through as --size=<value>M (MiB) in lisa/tools/fio.py, so size_gb=8192 actually means ~8 GiB. This is easy to misread and accidentally change. Please add an inline comment (or define a named constant) clarifying the unit and intent for 8192 here (and consider using an expression like 8 * 1024 for readability).
Suggested change:
```python
# Fio.size_gb is passed through as MiB (--size=<value>M),
# so 8 * 1024 here intentionally means an 8 GiB fio size.
size_gb=8 * 1024,
```
```python
mode="randread",
iodepth=64,
numjob=4,
block_size="4K",
size_gb=8192,
time=120,
overwrite=True,
```
These fio parameters include several hard-coded values (iodepth=64, numjob=4, runtime=120, size_gb=8192) that materially control test behavior and result stability. Per repo guidelines, please either introduce named constants for them or add brief inline comments explaining why these specific values were chosen.
```python
filename=filename,
mode="read",
iodepth=64,
numjob=thread_count,
block_size="1024K",
size_gb=8192,
time=120,
overwrite=True,
)
```
This throughput test also uses hard-coded fio parameters (iodepth=64, numjob=thread_count, block_size=1024K, runtime=120, size_gb=8192). Please add inline comments or pull them into named constants so future maintainers can safely tune them and understand the tradeoffs.
```python
def verify_vm_max_premium_ssd_disk_count(
    self, node: Node, log: Logger, variables: Dict[str, Any]
) -> None:
    expected_max_disks = _get_int_var(variables, VAR_EXPECTED_MAX_DISKS)
    vm_size = variables.get(VAR_VM_SIZE, "unknown")
```
PR description/test list mentions verify_vm_max_premium_data_disks, but this suite defines verify_vm_max_premium_ssd_disk_count. To avoid confusion for runbook authors and reviewers, please align the method name (or update the PR description/runbook docs) so the documented test case names match the actual ones.
```python
best_iops = 0
for i in range(_PERF_ITERATIONS):
    result = fio.launch(
        name=f"iops_all_disks_{i}",
        filename=filename,
        mode="randread",
        iodepth=64,
        numjob=thread_count,
```
numjob=thread_count with iodepth=64 can create an extremely high total queue depth on large core-count VMs and trigger libaio limits (EAGAIN) or severe perf drop. Consider capping numjob (the repo’s perf helpers cap to 256 for libaio) and/or bounding total qdepth to reduce flakiness.
Suggested change:
```python
fio_numjob = min(
    thread_count, 256
)  # Cap libaio jobs to avoid EAGAIN/perf collapse on large-core VMs.
max_total_queue_depth = (
    256 * 64
)  # Preserve the prior effective upper bound of numjob * iodepth.
fio_iodepth = max(
    1, max_total_queue_depth // fio_numjob
)  # Ensure total queue depth stays bounded while keeping iodepth >= 1.
log.info(
    f"VM size: {vm_size} - fio settings for aggregate IOPS: "
    f"thread_count={thread_count}, numjob={fio_numjob}, "
    f"iodepth={fio_iodepth}, total_queue_depth={fio_numjob * fio_iodepth}"
)
best_iops = 0
for i in range(_PERF_ITERATIONS):
    result = fio.launch(
        name=f"iops_all_disks_{i}",
        filename=filename,
        mode="randread",
        iodepth=fio_iodepth,
        numjob=fio_numjob,
```
```python
fio = node.tools[Fio]
cpu = node.tools[Lscpu]
thread_count = cpu.get_thread_count()
best_bw_mbps = 0
for i in range(_PERF_ITERATIONS):
    result = fio.launch(
        name=f"storage_throughput_all_disks_{i}",
        filename=filename,
        mode="read",
        iodepth=64,
        numjob=thread_count,
        block_size="1024K",
        size_gb=8192,
        time=120,
        overwrite=True,
    )
```
Same fio concurrency concern here: numjob=thread_count (with iodepth=64) can exceed async I/O limits on large VMs. Consider capping numjob similarly to other perf code in the repo to avoid intermittent failures.
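The capping logic can be isolated as a small pure function. A sketch: the 256-job cap follows the repo convention the earlier comment cites, and the 64 iodepth ceiling mirrors the value used throughout this suite; the helper name is hypothetical.

```python
def cap_fio_concurrency(thread_count, max_jobs=256, max_iodepth=64):
    # Cap numjob (256 for libaio, per the repo's perf helpers) and bound
    # iodepth so numjob * iodepth cannot exceed the prior effective
    # maximum, avoiding async I/O limits on large-core VMs.
    numjob = min(max(1, thread_count), max_jobs)
    iodepth = max(1, min(max_iodepth, (max_jobs * max_iodepth) // numjob))
    return numjob, iodepth
```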
```python
bw_ceiling = int(
    expected_bw * (100 + _PERF_TOLERANCE_PERCENT) / 100
)
log.info(
    f"VM size: {vm_size} - fio seq read across "
    f"{len(data_disks)} disk(s): "
    f"measured {measured_bw} MBps, "
    f"expected {bw_floor}-{bw_ceiling} MBps "
    f"(declared: {expected_bw} MBps)"
)
assert_that(measured_bw).described_as(
    f"VM size {vm_size}: expected storage throughput "
    f"between {bw_floor} and {bw_ceiling} MBps "
    f"(declared {expected_bw} MBps with "
    f"{_PERF_TOLERANCE_PERCENT}% tolerance) across "
    f"{len(data_disks)} disk(s) but measured "
    f"{measured_bw} MBps"
).is_between(bw_floor, bw_ceiling)
```
The test description says it validates throughput is “at least” the expected value, but the assertion enforces an upper bound (is_between(bw_floor, bw_ceiling)). If the measured throughput exceeds the declared value by > tolerance, this will fail even though it meets the “at least” requirement. Either update the docstring to match a bounded-range check, or change the assertion to enforce only a minimum.
Suggested change:
```python
log.info(
    f"VM size: {vm_size} - fio seq read across "
    f"{len(data_disks)} disk(s): "
    f"measured {measured_bw} MBps, "
    f"expected at least {bw_floor} MBps "
    f"(declared: {expected_bw} MBps)"
)
assert_that(measured_bw).described_as(
    f"VM size {vm_size}: expected storage throughput "
    f"at least {bw_floor} MBps "
    f"(declared {expected_bw} MBps with "
    f"{_PERF_TOLERANCE_PERCENT}% tolerance) across "
    f"{len(data_disks)} disk(s) but measured "
    f"{measured_bw} MBps"
).is_greater_than_or_equal_to(bw_floor)
```
```python
# Allow tolerance around the declared max
bw_floor_mbps = int(
    expected_bw_mbps
    * (100 - _PERF_TOLERANCE_PERCENT)
    / 100
)
bw_ceiling_mbps = int(
    expected_bw_mbps
    * (100 + _PERF_TOLERANCE_PERCENT)
    / 100
)
log.info(
    f"VM size: {vm_size} — ntttcp network bandwidth: "
    f"measured {measured_bw_mbps} Mbps, "
    f"expected {bw_floor_mbps}-{bw_ceiling_mbps} Mbps "
    f"(declared: {expected_bw_mbps} Mbps, "
    f"tolerance: {_PERF_TOLERANCE_PERCENT}%)"
)
assert_that(measured_bw_mbps).described_as(
    f"VM size {vm_size}: expected network throughput "
    f"between {bw_floor_mbps} and {bw_ceiling_mbps} Mbps "
    f"(declared {expected_bw_mbps} Mbps with "
    f"{_PERF_TOLERANCE_PERCENT}% tolerance) but measured "
    f"{measured_bw_mbps} Mbps"
).is_between(bw_floor_mbps, bw_ceiling_mbps)
```
The network bandwidth docstring says “at least” expected throughput, but the assertion enforces an upper bound (is_between). This can fail if the VM measures higher than the declared spec. Either document that this is a “within tolerance” validation, or assert only a lower bound.
Suggested change:
```python
# Allow tolerance below the declared throughput to account for
# measurement variance while still enforcing a minimum expectation.
bw_floor_mbps = int(
    expected_bw_mbps
    * (100 - _PERF_TOLERANCE_PERCENT)
    / 100
)
log.info(
    f"VM size: {vm_size} — ntttcp network bandwidth: "
    f"measured {measured_bw_mbps} Mbps, "
    f"expected at least {bw_floor_mbps} Mbps "
    f"(declared: {expected_bw_mbps} Mbps, "
    f"tolerance: {_PERF_TOLERANCE_PERCENT}%)"
)
assert_that(measured_bw_mbps).described_as(
    f"VM size {vm_size}: expected network throughput "
    f"at least {bw_floor_mbps} Mbps "
    f"(declared {expected_bw_mbps} Mbps with "
    f"{_PERF_TOLERANCE_PERCENT}% tolerance) but measured "
    f"{measured_bw_mbps} Mbps"
).is_greater_than_or_equal_to(bw_floor_mbps)
```
```python
result = fio.launch(
    name=f"nvme_iops_all_disks_{i}",
    filename=filename,
    mode="randread",
    iodepth=64,
    numjob=4,
    block_size="4K",
    size_gb=8192,
    time=120,
    overwrite=True,
```
Fio.launch(size_gb=...) is passed to fio as --size=<value>M (MiB). Using size_gb=8192 is likely intended as 8192 MiB (~8 GiB), but the code reads like 8192 GB. Consider introducing a clearly named local constant (e.g., size_mb = 8192) and passing that, or add an inline note about units.
```python
def verify_vm_premium_ssd_iops(
    self, node: Node, log: Logger, variables: Dict[str, Any]
) -> None:
```
PR description’s “Key Test Cases” list doesn’t match the implemented method names (e.g., it lists verify_vm_max_premium_data_disks/verify_vm_premium_disk_iops/verify_vm_premium_ssd_storage_throughput, but this suite defines verify_vm_max_premium_ssd_disk_count/verify_vm_premium_ssd_iops/verify_vm_premium_ssd_throughput). Please align the PR description (and any run instructions) with the actual test method names to avoid confusion when selecting cases.
```yaml
# Usage:
#   lisa -r lisa\microsoft\runbook\examples\vm_spec_validation.yml \
#     --variable "subscription_id:<your_sub_id>" \
#     --variable "csv_file:lisa\microsoft\runbook\examples\vm_specs.csv"
#
# You can also override location, marketplace_image, etc. from the CLI.

test_pass: "adhoc"

name: vm_spec_validation

include:
  - path: ../azure.yml

extension:
  - ../../testsuites

variable:
  # Path to the CSV file with VM specifications (override via CLI if needed)
  - name: csv_file
    value: 'lisa\microsoft\runbook\examples\vm_specs.csv'
```
This example runbook uses Windows-style backslashes in paths (e.g., lisa\microsoft\...). On Linux/macOS runners this is usually treated as a literal backslash and can cause “file not found”. Consider switching example paths to forward slashes for cross-platform usability.
Description
This module validates that Azure VMs match their declared hardware specifications.
A sample runbook and CSV file are included in the examples.
The CSV file defines the expected properties for each VM size (CPU count, memory, NIC count, max disks, local NVMe disks, IOPS, etc.), and the LISA CSV combinator in the runbook iterates over each row, provisioning a VM and running the validation tests against it.
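The row-to-variables flow can be illustrated with the standard library's `csv` module. This is illustrative only: the exact column headers of `vm_specs.csv` are assumptions based on the `expected_*` variables described in this PR.

```python
import csv
import io

# A stand-in for the sample CSV; headers are assumed, not taken from the PR.
sample = io.StringIO(
    "vm_size,expected_vcpu_count,expected_memory_gb,expected_nic_count\n"
    "Standard_D4s_v5,4,16,2\n"
)

# Each DictReader row maps a header to a cell value; the CSV combinator
# similarly turns each row into one set of runbook variables per iteration.
rows = list(csv.DictReader(sample))
```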
Related Issue
None
Type of Change
Checklist
Test Validation
Key Test Cases:
Impacted LISA Features:
None
Tested Azure Marketplace Images:
Test Results