
Update cuDF Python user guides to current/pandas 3 behaviors #22720

Open
mroeschke wants to merge 7 commits into rapidsai:main from mroeschke:cudf/doc/user_guide_ref

Conversation

@mroeschke
Contributor

Description

Similar to #22689, this updates several user guides to reflect 26.08 behavior, where pandas 3 is supported.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke self-assigned this May 29, 2026
@mroeschke mroeschke added doc Documentation non-breaking Non-breaking change labels May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2eab8a0-3fe3-46d6-b753-e46807bf8d77

📥 Commits

Reviewing files that changed from the base of the PR and between 3660638 and 2b1c0b1.

📒 Files selected for processing (1)
  • docs/cudf/source/cudf/data-types.md
✅ Files skipped from review due to trivial changes (1)
  • docs/cudf/source/cudf/data-types.md

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Clarified that copy-on-write is the default since 26.08 (aligned with pandas 3.0), removed opt-in instructions, and shortened guidance on defensive copies.
    • Expanded supported data types: updated unsigned integer defaults, timezone-aware datetimes, decimals, lists/structs; added a "Specifying dtypes" section with examples and clarified "object" is string-only.
    • Reworked pandas/cudf compatibility guide: clarified dtype exclusions, missing-value semantics (cudf.NA), iteration and GPU→CPU→GPU workflow, and ordering/nondeterminism guidance.

Walkthrough

Three cuDF docs updated for 26.08: copy-on-write is now the default and implicit; data-types guidance expanded with a new "Specifying data types" section and clarifications for object, decimal, and nested types; and the pandas-compatibility guide reorganized with clearer behavioral guidance.
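For readers unfamiliar with the term, the copy-on-write behavior described above can be illustrated with a toy example. This is a minimal pure-Python sketch, not cuDF's or pandas' implementation; `CowBuffer` and `_Storage` are hypothetical names invented here for illustration only.

```python
class _Storage:
    """Hypothetical shared payload with a simple reference count."""

    def __init__(self, values):
        self.values = list(values)
        self.refs = 1


class CowBuffer:
    """Toy copy-on-write buffer (illustration only, not a cuDF API)."""

    def __init__(self, values):
        self._storage = _Storage(values)

    def copy(self):
        # A "copy" only shares the storage and bumps the count; no data moves yet.
        other = CowBuffer.__new__(CowBuffer)
        other._storage = self._storage
        self._storage.refs += 1
        return other

    def __getitem__(self, index):
        return self._storage.values[index]

    def __setitem__(self, index, value):
        # The real copy happens on the first write to storage that is still shared.
        if self._storage.refs > 1:
            self._storage.refs -= 1
            self._storage = _Storage(self._storage.values)
        self._storage.values[index] = value

    def shares_memory_with(self, other):
        return self._storage is other._storage


a = CowBuffer([1, 2, 3])
b = a.copy()
shared_before = a.shares_memory_with(b)  # both objects point at one payload
b[0] = 99                                # the write triggers the deferred copy
shared_after = a.shares_memory_with(b)   # the payloads are now separate
```

In this sketch the write to `b` leaves `a` untouched, which is the user-visible property that makes defensive `copy()` calls unnecessary.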

Changes

cuDF 26.08 Documentation Updates

| Layer / File(s) | Summary |
| --- | --- |
| Copy-on-write default behavior (docs/cudf/source/cudf/copy-on-write.md) | Introduction and conclusion rewritten to state copy-on-write is the default since 26.08 and that defensive copy() calls are no longer needed. |
| Supported data types table and defaults (docs/cudf/source/cudf/data-types.md) | Supported data types overview and default dtype table updated (including unsigned integer defaults); note retained that pandas.PeriodDtype/pandas.SparseDtype are unsupported; footer reference removed. |
| Specifying data types and decimal intro (docs/cudf/source/cudf/data-types.md) | New "Specifying data types" section added documenting accepted pandas-like and Arrow dtype arguments with runnable examples; decimal lead-in wording adjusted. |
| Object dtype and nested types (docs/cudf/source/cudf/data-types.md) | object dtype clarified as string-only in cuDF with updated example; nested (List/Struct) intro rewritten to describe child-element typing and adjusted py:class references. |
| pandas/cudf comparison introduction (docs/cudf/source/cudf/pandas-comparison.md) | Document header/intro rewritten to clarify compatibility scope, dtype eligibility, and that missing values use cudf.NA; transitional sentence added. |
| pandas/cudf behavioral guidance (docs/cudf/source/cudf/pandas-comparison.md) | Iteration guidance updated to recommend GPU→CPU→GPU conversion (.to_arrow()/.to_pandas() then reconstruct), result-ordering and floating-point nondeterminism rephrased, duplicate-column limitation restated, and .apply() limitations relocated. |
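
The floating-point nondeterminism mentioned in the last summary above follows from floating-point addition not being associative: summing the same values in a different order can change the last bits of the result. The sketch below uses only the Python standard library (no cuDF) to show the effect and one common way to compare such results with a tolerance; the choice of `rel_tol=1e-9` is an assumption, not a value taken from the cuDF docs.

```python
import math

values = [0.1, 0.2, 0.3]

# Same numbers, two different grouping orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Exact equality is fragile when the evaluation order can vary.
exactly_equal = left_to_right == right_to_left

# A tolerance-based comparison is robust to last-bit differences.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)
```

Here `exactly_equal` is `False` while `close_enough` is `True`, which is why comparing floating-point results with a tolerance is the usual recommendation.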

🎯 3 (Moderate) | ⏱️ ~20 minutes


Possibly related PRs

  • rapidsai/cudf#22352: Overlapping changes to pandas-compatibility/nested-data documentation and py:class references in docs/cudf/source/cudf/data-types.md.

Suggested labels

Python


Suggested reviewers

  • wence-
  • rjzamora
  • bdice
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: updating cuDF Python user guides to reflect current behavior and pandas 3 compatibility, which aligns with all three modified documentation files. |
| Description check | ✅ Passed | The description clearly relates to the changeset by referencing similar PR 22689 and stating the purpose of updating user guides to 26.08 behavior for pandas 3 support, matching the documentation updates in the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/cudf/source/cudf/data-types.md (1)

158-163: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add explicit imports in this example block.

This section uses pd and cudf without showing imports in the same block. For runnable snippets, please include prerequisites locally (or add a short note that imports were defined earlier).

As per coding guidelines, documentation changes should prioritize completeness and clarity, including clear prerequisites.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cudf/source/cudf/data-types.md` around lines 158 - 163, The example
block uses pd and cudf but doesn't show their imports; update the snippet to
include explicit prerequisite imports (e.g., import pandas as pd and import
cudf) at the top of the same fenced code block (or add a short note that imports
were defined earlier) so the example using pd.Series and cudf constructs is
runnable and self-contained; target the example that references pd and cudf in
the data-types example.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cudf/source/cudf/copy-on-write.md`:
- Line 5: Fix the duplicated word in the intro sentence by replacing the phrase
"share the the same underlying data" with "share the same underlying data" in
the Copy-on-write documentation (the sentence beginning "Copy-on-write is a
memory management strategy...") so the wording is clear and free of the repeated
"the".

In `@docs/cudf/source/cudf/data-types.md`:
- Line 17: Fix the malformed datetime dtype in the table row by adding the
missing closing quote to `'datetime64[us]'` so the list reads `'datetime64[s]'`,
`'datetime64[ms]'`, `'datetime64[us]'`, `'datetime64[ns]'`; update the Datetime
table row (the string containing the datetime64 entries) to ensure all dtype
tokens are consistently quoted.
- Around line 3-6: Update the sentence that reads "cuDF also support data types
from the `[Arrow type system](...)`" to correct the grammar by changing
"support" to "supports" and fix the Arrow link formatting by removing the
surrounding backticks so it becomes a normal Markdown link (e.g., "cuDF also
supports data types from the [Arrow type system](...)"). Ensure the rest of the
sentence remains unchanged.

In `@docs/cudf/source/cudf/pandas-comparison.md`:
- Line 179: Replace the phrase "floating point results" with the hyphenated
compound adjective "floating-point results" in the sentence that begins "Series
of floats. If you need to compare floating point results, you" in the
documentation to improve readability and conform to the style guide.
- Around line 30-33: The sentence "cuDF all the data types in pandas..." is
missing the verb; update the sentence containing the phrase "cuDF all the data
types in pandas except for `pandas.PeriodDtype`, `pandas.SparseDtype`" to
include "supports" so it reads "cuDF supports all the data types in pandas
except for `pandas.PeriodDtype`, `pandas.SparseDtype` and third-party
`ExtensionDtype`s..." and keep the rest of the paragraph and links (e.g., "Data
Types") unchanged.
- Around line 145-147: The sentence mixes two different behaviors and should be
split and clarified: explain one case that to get a predictable (sorted) order
you can pass sort=True, and separately explain that to match pandas' default
behavior (which may be unsorted) you can enable mode.pandas_compatible or
explicitly use sort=False; update the text around the `sort=True` and
`sort=False` mentions and `mode.pandas_compatible` so each behavior and when to
use it is stated clearly and without contradiction.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eaf3394a-dac8-4f7d-a4a2-38500659d301

📥 Commits

Reviewing files that changed from the base of the PR and between b99f73f and 30bae30.

📒 Files selected for processing (3)
  • docs/cudf/source/cudf/copy-on-write.md
  • docs/cudf/source/cudf/data-types.md
  • docs/cudf/source/cudf/pandas-comparison.md

Comment thread docs/cudf/source/cudf/copy-on-write.md Outdated
Comment thread docs/cudf/source/cudf/data-types.md
Comment thread docs/cudf/source/cudf/data-types.md Outdated
Comment thread docs/cudf/source/cudf/pandas-comparison.md Outdated
Comment thread docs/cudf/source/cudf/pandas-comparison.md
Comment thread docs/cudf/source/cudf/pandas-comparison.md

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/cudf/source/cudf/data-types.md (1)

50-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Example output conflicts with the requested dtype.

The example creates cudf.Series(..., dtype=pd.Float64Dtype()) but shows dtype: Float32. This undermines trust in the example’s correctness.

Suggested doc fix
 >>> s = cudf.Series([1, 2, 3], dtype=pd.Float64Dtype())
 >>> s
 0    1.0
 1    2.0
 2    3.0
-dtype: Float32
+dtype: Float64

As per coding guidelines, documentation changes should prioritize Accuracy: Verify code examples compile and run correctly and Consistency.
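
One way to catch this kind of drift between an example and its printed output is to run documentation snippets through Python's standard `doctest` module. The sketch below is an assumption about how that could be done, not a description of cuDF's docs tooling; the snippet uses plain Python as a stand-in for the cuDF example so that it runs without a GPU, and the name `dtype_example` is arbitrary.

```python
import doctest

# Stand-in snippet: the expected output is deliberately wrong, mirroring the
# Float32/Float64 mismatch flagged above (plain Python replaces cuDF here).
snippet = """
>>> values = [1.0, 2.0, 3.0]
>>> type(values[0]).__name__
'float32'
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(snippet, globs={}, name="dtype_example", filename=None, lineno=0)

# Run the parsed examples; the `out` callback swallows the failure report text.
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test, out=lambda text: None)

# `results` is a (failed, attempted) named tuple.
mismatch_detected = results.failed > 0
```

With the stand-in snippet, the second example fails because the actual output is `'float'`, so `mismatch_detected` is `True`.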

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cudf/source/cudf/data-types.md` around lines 50 - 55, The example shows
a mismatch between the requested dtype and the printed dtype: the call to
cudf.Series(..., dtype=pd.Float64Dtype()) should produce a Float64 dtype but the
output shows Float32; update the example so the shown output matches the created
Series (either change the constructor to use pd.Float32Dtype() or change the
displayed dtype to Float64) and verify with cudf.Series(...) that the printed
dtype and values are accurate; reference the cudf.Series call and the
dtype=pd.Float64Dtype() token when making the correction.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cudf/source/cudf/data-types.md`:
- Line 19: Update the Timedelta (duration) row to use the canonical dtype
strings by replacing `'timedelta[s]'`, `'timedelta[ms]'`, `'timedelta[us]'`,
`'timedelta[ns]'` with `'timedelta64[s]'`, `'timedelta64[ms]'`,
`'timedelta64[us]'`, `'timedelta64[ns]'`; ensure the table cell under the
"Timedelta (duration)" entry and any nearby examples or references use the
`timedelta64[...]` form for consistency with the documented/accepted dtype
style.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 338fdd55-ba76-4e8d-9167-ecfaa19a7318

📥 Commits

Reviewing files that changed from the base of the PR and between 30bae30 and 008c8c4.

📒 Files selected for processing (3)
  • docs/cudf/source/cudf/copy-on-write.md
  • docs/cudf/source/cudf/data-types.md
  • docs/cudf/source/cudf/pandas-comparison.md
✅ Files skipped from review due to trivial changes (1)
  • docs/cudf/source/cudf/pandas-comparison.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/cudf/source/cudf/copy-on-write.md

Comment thread docs/cudf/source/cudf/data-types.md Outdated
Contributor

@vyasr vyasr left a comment


Mainly need to figure out what to do with copy-on-write.

Should we drop this section altogether? We need dev docs for how CoW is implemented, but as far as user-facing behavior goes, now that this is the default and only behavior in pandas, I don't know if we need to discuss it in cudf's user-facing docs at all anymore.

and dictionary-like data.
cuDF largely uses the same [data type objects](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) supported by pandas, including
numeric, datetime, timedelta, and string data types. cuDF also supports
data types from the [Arrow type system](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings) such as decimals, list,

Should we also mention the pandas nullable types?

@vyasr
Contributor

vyasr commented May 30, 2026

The CI failure is https://github.com/rapidsai/cudf/actions/runs/26669664425/job/78612813093?pr=22720#step:13:5928

/__w/cudf/cudf/docs/cudf/source/cudf/data-types.md:25: WARNING: py:class reference target not found: cudf.core.dtypes.IntervalDtype [ref.class]
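
This warning is the kind Sphinx emits when a cross-reference target cannot be resolved. Two general options are to point the reference at a target that exists in the built API docs, or to list the target in Sphinx's `nitpick_ignore` setting. The fragment below sketches the second option as a hypothetical `conf.py` excerpt; it is not taken from the cuDF repository, and whether suppressing the warning is appropriate here is for the maintainers to decide.

```python
# Hypothetical excerpt for a Sphinx conf.py (assumption: not copied from cuDF).
# Each entry is a (reference type, target) pair that Sphinx should not warn about.
nitpick_ignore = [
    ("py:class", "cudf.core.dtypes.IntervalDtype"),
]
```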


Labels

doc Documentation non-breaking Non-breaking change


2 participants