Copilot AI commented Dec 27, 2025

Description

This PR fixes the issue where internal queries to system.peers and system.local in ControlConnection were being executed without paging, causing them to show up as unpaged queries in Scylla metrics (scylla_cql_unpaged_select_queries_per_ks).

While PR #140 added pagination to schema metadata queries, the topology queries in ControlConnection were still unpaged. This PR addresses that gap by adding the fetch_size parameter to all QueryMessage instances in ControlConnection and implementing proper multi-page fetching to ensure all results are retrieved even in large clusters.

Changes Made

  • Added fetch_size parameter to topology queries (system.peers and system.local) in _try_connect() method
  • Added fetch_size parameter to topology queries in _refresh_node_list_and_token_map() method
  • Added fetch_size parameter to local RPC address query
  • Added fetch_size parameter to schema agreement queries
  • All queries now use the existing _schema_meta_page_size parameter (default: 1000) for consistency with schema metadata queries
  • Implemented _fetch_all_pages() helper method to properly handle multi-page results by fetching all pages sequentially, not just the first page
  • Added try/finally block to restore original paging_state and prevent side effects from QueryMessage reuse
  • Added null checks to prevent AttributeError if wait_for_response fails
  • Added unit tests to verify both that the fetch_size parameter is set and that multi-page fetching works correctly
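
To illustrate the try/finally item above, here is a minimal sketch, not the code in this PR. The name preserved_paging_state and the SimpleNamespace message are hypothetical stand-ins for the real helper and for QueryMessage; the point is only that the message's paging_state is put back even if fetching a later page raises.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def preserved_paging_state(message):
    """Restore message.paging_state on exit, even if the body raises."""
    original = message.paging_state
    try:
        yield message
    finally:
        message.paging_state = original


# Hypothetical stand-in for a QueryMessage that is reused across calls.
msg = SimpleNamespace(query="SELECT * FROM system.peers",
                      fetch_size=1000, paging_state=None)

with preserved_paging_state(msg):
    # Set while fetching later pages; must not leak to the next caller.
    msg.paging_state = b"page-2-token"

state_after = msg.paging_state  # restored to the original value
```

Without the restore, a reused message would start its next execution from a stale page instead of from the beginning of the result set.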

Testing

  • ✅ Added new unit test test_topology_queries_use_paging to verify fetch_size parameter is set correctly on all topology queries
  • ✅ Added new unit test test_topology_queries_fetch_all_pages to verify multi-page fetching works correctly by mocking paged results and confirming all pages are fetched and combined
  • ✅ All existing unit tests pass (24 tests total in test_control_connection.py)
  • ✅ Code review completed with no issues
  • ✅ Security scan (CodeQL) completed with no vulnerabilities

The implementation ensures that:

  1. Queries don't show up as unpaged in Scylla metrics (fixes the original issue)
  2. All nodes are discovered even in large clusters with >1000 nodes (addresses code review feedback)
  3. Behavior is consistent with how schema metadata queries handle paging
  4. No side effects from QueryMessage reuse

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
Original prompt

This section details the original issue you should resolve

<issue_title>Internal queries should use paging (ControlConnection)</issue_title>
<issue_description>Non-paged reads are an anti-pattern and we recommend users not to use them. Scylla has a metric that counts them and in scylla-monitoring we have a dashboard that displays it.

In investigation of scylladb/scylladb#5983 we observed that the python driver issues non-paged reads to fetch schema after noticing schema change. When there are lots of clients connected at once this can lead to a large increase of the non-paged reads metric which can be confusing for the user, and confusing+concerning to the core developers - in the aforementioned issue we thought that Scylla itself was doing non-paged reads.

Internal queries should be changed to use paging in order to reduce the confusion.</issue_description>

Comments on the Issue (you are @copilot in this section)

@mykaul
> Internal queries should be changed not to use paging in order to reduce the confusion.

@piodul - use or not use paging?

@piodul
They __should__ use paging. Sorry for the confusion.

@mykaul
@roydahan - looks important to me. Can we assess complexity/risk?

@fruch
@piodul do we have information on which driver was used? i.e. which version of our fork?

cause I'm quite sure the driver does use pagination for the internal queries, since:
#140

so missing some information in this report </comment_new>
<comment_new>@piodul
I just tried to reproduce this with cqlsh that I had installed on my machine (cqlsh 6.0.21), I picked up a recent master build of Scylla but the closest approximation by a released version would be 6.0.1.

For example, when cqlsh connects to the node, I see that the scylla_cql_unpaged_select_queries_per_ks{ks="system",shard="0"} metric gets bumped by 2.
I'm using the following filter in wireshark:

cql.opcode == "QUERY" && cql.query.flags.page_size == 0

...and I can see that two unpaged queries pop up: SELECT * FROM system.peers and SELECT * FROM system.local WHERE key='local'.

However, I do see that this metric starts with a non-zero value (~121) right after booting up the node. Moreover, this metric grows by itself every 10 seconds. I either have some unexplained source of queries, or internal queries can increase this metric after all. It looks like the fault lies on both sides and we might have closed the Scylla issue prematurely...</comment_new>
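
For context on the Wireshark filter quoted above: in the CQL native protocol (v4), a QUERY message carries a flags field in which bit 0x04 signals that a result page size follows and bit 0x08 signals that a paging state is attached. The sketch below shows that check; the helper names are hypothetical and are not part of any driver.

```python
# Flag bits of the QUERY message as defined by the CQL native protocol v4.
PAGE_SIZE_FLAG = 0x04
WITH_PAGING_STATE_FLAG = 0x08


def is_unpaged(flags: int) -> bool:
    """True when the query flags do not request a result page size."""
    return not (flags & PAGE_SIZE_FLAG)


def is_continuation(flags: int) -> bool:
    """True when the query carries a paging state from a previous page."""
    return bool(flags & WITH_PAGING_STATE_FLAG)
```

A query counted by the unpaged metric corresponds to is_unpaged returning True; a follow-up page request would normally have both bits set.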
<comment_new>@fruch

I just tried to reproduce this with cqlsh that I had installed on my machine (cqlsh 6.0.21), I picked up a recent master build of Scylla but the closest approximation by a released version would be 6.0.1.

For example, when cqlsh connects to the node, I see that the scylla_cql_unpaged_select_queries_per_ks{ks="system",shard="0"} metric gets bumped by 2. I'm using the following filter in wireshark:

cql.opcode == "QUERY" && cql.query.flags.page_size == 0

...and I can see that two unpaged queries pop up: SELECT * FROM system.peers and SELECT * FROM system.local WHERE key='local'.

However, I do see that this metric starts with a non-zero value (~121) right after booting up the node. Moreover, this metric grows by itself every 10 seconds. I either have some unexplained source of queries, or internal queries can increase this metric after all. It looks like the fault lies on both sides and we might have closed the Scylla issue prematurely...

Now that I look again at the title of that PR... it's "Metadata/Schema paginated queries".

The control connection, which learns about the topology, probably doesn't do pagination.

The issue we had back then was with a setup that had lots of keyspaces and tables (more than 1000), which was slowing down the initial connections or could make them fail.

So a setup with hundreds of nodes might be a bit problematic with pagination.</comment_new>
<comment_new>@mykaul
A different path we pursue in the Java driver is scylladb/java-driver#312 - we add USING TIMEOUT to the schema fetch, to be more patient than either the default client or server potentially low timeouts, when pulling the schema, which may be large.</comment_new>


mykaul commented Jan 1, 2026

Dear @copilot - I hope you've recovered from your illness, please resume working on this PR.

Copilot AI and others added 2 commits January 1, 2026 09:47
- Add fetch_size parameter to all QueryMessage instances in ControlConnection
- This ensures system.peers and system.local queries use paging
- Uses existing _schema_meta_page_size (default 1000) for consistency
- Add unit test to verify fetch_size is being passed

Co-authored-by: mykaul <[email protected]>
- Remove assumption about query order in test
- Use self.control_connection._schema_meta_page_size instead of hardcoded value
- Make test more maintainable and resilient to changes

Co-authored-by: mykaul <[email protected]>
Author

Copilot AI commented Jan 1, 2026

Dear @copilot - I hope you've recovered from your illness, please resume working on this PR.

I've completed the implementation! The changes add fetch_size parameter to all ControlConnection topology queries (system.peers and system.local), ensuring they use paging with a page size of 1000. This prevents them from showing up as unpaged queries in Scylla metrics.

Key changes in commit faa59e6:

  • Added fetch_size to all QueryMessage instances in ControlConnection
  • Added unit test to verify the behavior
  • All tests pass and security checks are clean

Copilot AI changed the title from "[WIP] Update internal queries to use paging for schema reads" to "Add paging to ControlConnection topology queries" Jan 1, 2026
@mykaul mykaul marked this pull request as ready for review January 1, 2026 09:59
Comment on lines +3641 to +3645
+                                       consistency_level=ConsistencyLevel.ONE,
+                                       fetch_size=self._schema_meta_page_size)
             local_query = QueryMessage(query=maybe_add_timeout_to_query(sel_local, self._metadata_request_timeout),
-                                       consistency_level=ConsistencyLevel.ONE)
+                                       consistency_level=ConsistencyLevel.ONE,
+                                       fetch_size=self._schema_meta_page_size)

I don't see any support for paging in connection.wait_for_responses, which is used here. Won't this query just the first page and ignore subsequent pages? cc @dkropachev

@copilot - please respond to the above feedback.

Author

You're absolutely right! I've addressed this by implementing a _fetch_all_pages helper method that properly handles paging by fetching all pages, not just the first one. The implementation now:

  1. Fetches the first page via wait_for_responses
  2. Checks if there's a paging_state in the result
  3. If yes, continues fetching additional pages using wait_for_response with the paging_state
  4. Combines all rows from all pages into the final result

This ensures all nodes are discovered in large clusters (>1000 nodes) while still using paging to avoid unpaged query metrics.

Changes in commit ef9dc96.
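
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: Page, Query and FakeConnection are hypothetical stand-ins for the driver's result, QueryMessage and Connection objects, and fetch_all_pages here is a free function rather than the private method on ControlConnection.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for one page of a query result."""
    rows: List[tuple] = field(default_factory=list)
    paging_state: Optional[bytes] = None


@dataclass
class Query:
    """Hypothetical stand-in for a QueryMessage."""
    cql: str
    fetch_size: int
    paging_state: Optional[bytes] = None


class FakeConnection:
    """Serves pre-built pages keyed by (cql, paging_state)."""

    def __init__(self, pages):
        self._pages = pages

    def wait_for_response(self, msg, timeout=None):
        return self._pages.get((msg.cql, msg.paging_state))

    def wait_for_responses(self, *msgs, timeout=None):
        return [self.wait_for_response(m, timeout=timeout) for m in msgs]


def fetch_all_pages(connection, query, first_page, timeout=None):
    """Follow paging_state from first_page until no further page is reported."""
    rows = []
    page = first_page
    original_state = query.paging_state
    try:
        while page is not None:          # guard against a failed fetch
            rows.extend(page.rows)
            if not page.paging_state:    # last page reached
                break
            query.paging_state = page.paging_state
            page = connection.wait_for_response(query, timeout=timeout)
    finally:
        query.paging_state = original_state  # leave the message reusable
    return rows


peers = Query("SELECT * FROM system.peers", fetch_size=2)
local = Query("SELECT * FROM system.local WHERE key='local'", fetch_size=2)
conn = FakeConnection({
    (peers.cql, None): Page([("10.0.0.1",), ("10.0.0.2",)], paging_state=b"p2"),
    (peers.cql, b"p2"): Page([("10.0.0.3",)]),
    (local.cql, None): Page([("10.0.0.9",)]),
})

# Step 1: the first pages of both queries are fetched together.
first_peers, first_local = conn.wait_for_responses(peers, local, timeout=2.0)
# Steps 2-4: remaining pages are fetched one by one and the rows combined.
peer_rows = fetch_all_pages(conn, peers, first_peers, timeout=2.0)
local_rows = fetch_all_pages(conn, local, first_local, timeout=2.0)
```

With a fetch_size of 2 and three peers, the sketch issues one extra request for system.peers and none for system.local, which returns a single row.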

@copilot and is there any test for the new functionality?

Author

Yes, there are two comprehensive unit tests added in commit 2dce180:

  1. test_topology_queries_use_paging - Verifies that all topology queries have the fetch_size parameter set correctly
  2. test_topology_queries_fetch_all_pages - Verifies that the _fetch_all_pages method properly handles multi-page results by:
    • Mocking a result with a paging_state
    • Verifying that subsequent pages are fetched
    • Confirming that all rows from all pages are combined correctly

Both tests pass successfully (24 total tests in test_control_connection.py).
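
For readers who want the shape of the multi-page test without opening the diff, a simplified sketch follows. It is not the test from the commit: collect_rows is a hypothetical, stripped-down version of the helper under test, and SimpleNamespace objects stand in for the driver's result and message types.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def collect_rows(connection, query, first_page):
    """Hypothetical helper under test: follows paging_state across pages."""
    rows, page = [], first_page
    while page is not None:
        rows.extend(page.rows)
        if not page.paging_state:
            break
        query.paging_state = page.paging_state
        page = connection.wait_for_response(query)
    return rows


def test_fetch_all_pages_combines_rows():
    # First page advertises a paging_state, so a second fetch is expected.
    first = SimpleNamespace(rows=[1, 2], paging_state=b"next")
    second = SimpleNamespace(rows=[3], paging_state=None)
    connection = MagicMock()
    connection.wait_for_response.return_value = second
    query = SimpleNamespace(paging_state=None)

    rows = collect_rows(connection, query, first)

    # All rows from both pages are combined, with exactly one extra fetch.
    assert rows == [1, 2, 3]
    connection.wait_for_response.assert_called_once_with(query)
    return rows


combined = test_fetch_all_pages_combines_rows()
```

Using plain objects for the pages, rather than MagicMock, keeps paging_state falsy on the last page; a MagicMock attribute would be truthy and the loop would never end.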

Copilot AI and others added 4 commits January 3, 2026 10:20
- Implement _fetch_all_pages helper method to handle multi-page results
- Apply paging to all topology queries that use fetch_size
- Ensures all pages are fetched for large clusters (>1000 nodes)
- Add test to verify multi-page fetching works correctly

This addresses the review feedback that wait_for_responses doesn't
handle paging_state, so we need to manually fetch all pages.

Co-authored-by: mykaul <[email protected]>
- Add try/finally block to restore original paging_state in _fetch_all_pages
- Use self.control_connection._schema_meta_page_size in test instead of hardcoded value
- Prevents side effects if QueryMessage is reused

Co-authored-by: mykaul <[email protected]>
- Move MagicMock import to top of test file
- Add null checks in _fetch_all_pages to prevent AttributeError
- Improve comment clarity for system.local table behavior
- Follow Python conventions for imports

Co-authored-by: mykaul <[email protected]>
Copilot AI changed the title from "Add paging to ControlConnection topology queries" to "Add paging with multi-page support to ControlConnection topology queries" Jan 3, 2026
Copilot AI requested a review from mykaul January 3, 2026 10:28
Development

Successfully merging this pull request may close these issues.

Internal queries should use paging (ControlConnection)

3 participants