
Commit db22a92

timsaucer and claude authored
docs: add upstream sync process documentation (#1524)
* docs: add upstream sync process documentation

  Document the three-PR workflow used to sync to a newer upstream
  apache/datafusion version: bump crate deps + fix breakage, consolidate
  transitive deps, then fill API and documentation gaps via /check-upstream.
  Cross-reference from dev/release/README.md.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add audit-skill-md skill

  New AI agent skill at .ai/skills/audit-skill-md/SKILL.md to keep the
  user-facing skills/datafusion_python/SKILL.md in sync with the public
  Python API. Audits SessionContext, DataFrame, Expr, and functions surfaces
  for new APIs not covered, stale mentions, examples that drifted from
  idiomatic style, and missing version notes. Wired into PR 3 of the
  upstream sync workflow documented in dev/release/upstream-sync.md.

* docs: verify upstream sync completed before release

  Add a checklist item to "Preparing the main Branch" pointing release
  managers at dev/release/upstream-sync.md so the crate bump, dependency
  consolidation, and /check-upstream and /audit-skill-md passes are
  confirmed done before the release branch is cut.

* docs: scope upstream sync cargo update to datafusion family

  Replace `cargo update -p datafusion` with an explicit multi-`-p`
  invocation listing every `datafusion-*` workspace dependency, so PR 1 of
  the upstream-sync workflow refreshes only the datafusion family and
  leaves other transitives for PR 2 to consolidate.

* docs: correct datafusion-* pin location in upstream sync

  PR 1 step 1 incorrectly stated downstream `datafusion-*` crates are
  pinned in `crates/core/Cargo.toml`. Pins live in the root
  `[workspace.dependencies]`; per-crate manifests inherit via
  `workspace = true`. Reword step 1 to point at the right file.

* docs: restore workspace.package version bump in upstream sync

  PR 1 step 1 must also bump `[workspace.package].version` because the
  `datafusion-python` major version tracks the upstream `datafusion` major.
  The previous reword dropped that instruction. Reinstate it alongside the
  `[workspace.dependencies]` updates.

* docs: align audit-skill-md description with body version phrasing

  Frontmatter description referenced "requires upstream DataFusion vX", but
  the body of the skill settles on the `datafusion-python NN` form
  (consistent with the package/upstream-major equivalence). Switch the
  description to match so the skill speaks one language end to end.

* docs: fold make-pythonic step into PR 3 of upstream sync

  Audit-skill-md documents the order `/check-upstream` -> `/make-pythonic`
  (optional) -> `/audit-skill-md`, but PR 3 of the upstream-sync workflow
  only listed the first and last. Insert the make-pythonic pass as step 3
  so signatures get aligned before the SKILL.md audit, avoiding example
  churn. Drops the orphan trailing paragraph in favor of inline guidance on
  when to defer larger reshapes to their own PR.

* docs: drop literal Cargo.toml version from audit-skill-md inputs

  Replace literal `version = "53.0.0"` example with a pointer to the
  `[workspace.package]` field plus an `NN.0.0` placeholder so the skill
  prose does not drift each major bump.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 13b2c47 commit db22a92

3 files changed

Lines changed: 463 additions & 0 deletions


.ai/skills/audit-skill-md/SKILL.md

Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

---
name: audit-skill-md
description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires datafusion-python NN or newer" note. Run after upstream syncs and before each release.
argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all")
---

# Audit `skills/datafusion_python/SKILL.md`

You are auditing the user-facing skill at
[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
against the current state of the Python API. The skill is the source of truth
for how AI coding assistants are taught to write `datafusion-python` code, so
it must match what the project actually ships. This skill identifies gaps
caused by upstream syncs, refactors, or renames, and (if asked) applies the
edits directly to `SKILL.md`.

The skill is most usefully run **after** the `check-upstream` step of an
upstream sync (see `dev/release/upstream-sync.md`) — once any new APIs are
exposed, this skill makes sure they get documented.
## What the skill covers

The user-facing `SKILL.md` documents these public surfaces. This list is not
exhaustive — if a new top-level area is added (e.g., a new `Catalog` API
exposed at the package root), include it.

| Surface | Module | Sections in SKILL.md |
|---|---|---|
| `SessionContext` | `python/datafusion/context.py` | "Data Loading" |
| `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" |
| `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" |
| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout |
| Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" |
## Scope argument

The user may specify a scope via `$ARGUMENTS` to limit the audit. If no scope
is given or `all` is specified, audit every area.

| Scope | Audit target |
|---|---|
| `session-context` | `SessionContext` methods and the "Data Loading" section |
| `dataframe` | `DataFrame` methods and the operations / executing / patterns sections |
| `expr` | `Expr` methods/operators and the "Expression Building" section |
| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section |
| `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style |
| `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream |
| `version-notes` | Cross-check version annotations (see below) |
| `all` | Everything above |
## Inputs to read

Before producing the report:

1. `skills/datafusion_python/SKILL.md` — the document being audited.
2. The relevant Python module(s) for the chosen scope. Public surface is the
   `__all__` list (where defined) plus `class` and `def` symbols not prefixed
   with `_`.
3. `Cargo.toml` (root) for the current `datafusion-python` version — read
   the `version` field under `[workspace.package]` (format `NN.0.0`). The
   major version always matches the upstream `datafusion` crate, so a
   single `datafusion-python` version expresses both.
   `python/datafusion/__init__.py`'s `__version__` is the same value
   exposed at runtime.
4. Recent commits touching the relevant module(s) for context on what
   changed since the last sync:

   ```bash
   git log --oneline -- python/datafusion/dataframe.py | head -20
   ```
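The public-surface rule in item 2 can be checked mechanically. A minimal sketch (the `public_surface` helper is hypothetical, not part of the repo, and works on any importable module):

```python
import inspect


def public_surface(module):
    """Public names per the audit rule: the module's __all__ if it
    defines one, otherwise every class/function whose name does not
    start with an underscore."""
    if hasattr(module, "__all__"):
        return sorted(module.__all__)
    return sorted(
        name
        for name, obj in vars(module).items()
        if not name.startswith("_")
        and (inspect.isclass(obj) or inspect.isfunction(obj))
    )


# Demonstrated on a stdlib module; the real audit would run it against
# python/datafusion/dataframe.py, functions.py, etc.
import json

print(public_surface(json))  # json defines __all__, so that list is used
```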
## What to look for

Walk through each scoped area and flag four kinds of issues.

### 1. New APIs not mentioned

For each public symbol in the module's `__all__` (or each public class
method), check whether it appears anywhere in `SKILL.md`. A symbol is
"covered" if it shows up in:

- A code block (the strongest signal — it's demonstrated).
- The "Available Functions (Categorized)" list.
- The SQL-to-DataFrame Reference table.

**Decide whether each missing symbol deserves an entry.** Not every public
symbol belongs in `SKILL.md` — the skill is curated for the patterns users
hit daily, not exhaustive API reference. Use these heuristics:

- **Add it** if it replaces or supersedes something already in the skill
  (e.g., a new operation that is the idiomatic alternative to a documented
  workaround).
- **Add it** if it fits a category already present (a new aggregate function
  goes in the aggregate list; a new join type goes in the joining section).
- **Add it** if it changes how a documented pattern should be written.
- **Skip it** if it is genuinely niche / advanced / experimental.
- **Skip it** if it is internal plumbing exposed for FFI but not user-facing.

When you flag a missing symbol, include a one-line proposed insertion point
(which section / which table row) so a reviewer can decide quickly.
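The "covered" test above lends itself to a rough automated first pass. A sketch, assuming only that code blocks are fenced with triple backticks (`covered_symbols` and `missing_symbols` are illustrative helper names, not repo code):

```python
import re


def covered_symbols(skill_md):
    """Identifiers appearing inside any fenced code block of the
    skill document (the strongest coverage signal per the audit)."""
    symbols = set()
    for block in re.findall(r"```[^\n]*\n(.*?)```", skill_md, re.DOTALL):
        symbols.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", block))
    return symbols


def missing_symbols(public, skill_md):
    """Public names never demonstrated in a code block."""
    return sorted(set(public) - covered_symbols(skill_md))


# Tiny synthetic document for illustration.
doc = '# Demo skill\n```python\ndf.select("a").filter(col("a") > 10)\n```\n'
print(missing_symbols(["select", "filter", "with_column"], doc))
```

A hit from this pass is only a candidate; the curation heuristics above still decide whether it deserves an entry.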
### 2. Stale mentions

For each function name, method name, or import shown in `SKILL.md`, verify it
still exists in the current API:

- Function names mentioned in prose or in the categorized list should appear
  in `python/datafusion/functions.py`'s `__all__`.
- Method calls in code blocks should resolve against the current class.
- Imports (`from datafusion import ...`) should succeed against the current
  `__init__.py`.

A quick way to check that the imports resolve, without running any example:

```bash
python -c "from datafusion import SessionContext, col, lit; from datafusion import functions as F; print('ok')"
```

For each stale mention, propose either:

- a rename to the current name, or
- removal if the API is gone with no replacement.
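The first bullet can be sketched in the other direction too, assuming the categorized list writes function names as backticked `name()` references (the helper name is hypothetical):

```python
import re


def stale_mentions(skill_md, module_all):
    """Backticked `name()`-style mentions in the skill document that
    no longer appear in the module's __all__."""
    mentioned = set(re.findall(r"`([a-z_][a-z0-9_]*)\(\)`", skill_md))
    return sorted(mentioned - set(module_all))


# Synthetic fragment of a categorized list.
doc = "Aggregates: `array_agg()`, `approx_median()`, `old_percentile()`."
print(stale_mentions(doc, ["array_agg", "approx_median"]))
```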
### 3. Examples that drifted from idiomatic style

The skill teaches a Pythonic style: prefer plain strings to `col(...)` when a
column reference is all you need; prefer raw Python values to `lit(...)`
where auto-wrapping applies. Recent refactors (see the `make-pythonic`
skill) keep moving more functions toward accepting native types.

For each code example in `SKILL.md`, check:

- Does it use `lit(value)` where a raw value would work? Comparison RHS,
  arithmetic with a column, etc. all auto-wrap. (Reserve `lit()` for the
  cases listed in pitfall #2.)
- Does it use `col("name")` where a plain string would work? `select(...)`,
  `aggregate([keys], ...)`, `sort(...)`, `sort_by(...)` all accept plain
  name strings.
- Do `functions.py` calls match the current pythonic signature for that
  function? If `make-pythonic` recently changed a signature (e.g.,
  `repeat(string, n: Expr | int)`), the example should pass `3` rather than
  `lit(3)`.
- Does any example use a deprecated or removed parameter name?

For drift, propose the updated snippet. If the change is purely stylistic
and the older form still works, mark the suggestion as **non-blocking**.
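The first two checks can be approximated with a heuristic lint; the patterns below are deliberately narrow and only surface candidates for human review (the helper is illustrative, not a real linter):

```python
import re


def drift_candidates(code):
    """Flag likely-unnecessary wrappers in one example snippet:
    - lit(<number>) as a comparison/arithmetic right-hand side
    - col("name") passed directly to select/sort/sort_by/aggregate
    """
    hits = []
    # Numeric lit() immediately after an operator: auto-wrapping applies there.
    hits += [f"lit: {m}"
             for m in re.findall(r"[<>=+*/-]\s*(lit\(\s*-?\d+\s*\))", code)]
    # col("...") as the first argument of string-accepting methods.
    hits += [f"col: {m}"
             for m in re.findall(r'(?:select|sort_by|sort|aggregate)\(\s*(col\("[^"]+"\))', code)]
    return hits


example = 'df.filter(col("a") > lit(10)).select(col("a"))'
print(drift_candidates(example))  # both wrappers are flagged
```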
### 4. Missing or stale version notes

When an API depends on a specific version, the skill should say so —
otherwise an agent referencing the skill in an older project will write
code that fails at import or at runtime.

`datafusion-python` shares its major version number with the upstream
`datafusion` crate (e.g., `datafusion-python 53.x` tracks upstream
`datafusion 53`). Always express version requirements in terms of
`datafusion-python` only — there is no need to call out upstream and
package versions separately.

Add a version note when:

- A method or function shown in the skill was added in a specific release
  (e.g., a new `DataFrame` method that didn't exist before 53).
- A breaking change altered behavior in a specific release (signature
  change, default-value change, new required argument).
- A pitfall was fixed in a specific release. Either annotate the pitfall
  block with "fixed in datafusion-python NN, kept here for users on older
  versions" or remove it once the supported floor moves past that version.

Format for version notes (inline, italicized):

```markdown
*Requires datafusion-python 53 or newer.*
```

For each missing/stale version note, propose the exact line and where it
belongs.
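Cross-checking these notes starts from the current version in the root manifest. As a sketch, a naive line scan suffices when the manifest keeps the standard `[workspace.package]` layout (`tomllib` on Python 3.11+ would be the robust choice; the helper and sample manifest here are illustrative):

```python
import re


def workspace_package_version(cargo_toml):
    """Pull the version string out of the [workspace.package] table."""
    in_table = False
    for line in cargo_toml.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Entering a new TOML table; only read keys inside [workspace.package].
            in_table = stripped == "[workspace.package]"
        elif in_table:
            m = re.match(r'version\s*=\s*"([^"]+)"', stripped)
            if m:
                return m.group(1)
    raise ValueError("no [workspace.package] version found")


# Synthetic manifest for illustration; the real audit reads the repo's root Cargo.toml.
sample = """
[workspace.package]
version = "53.0.0"

[workspace.dependencies]
datafusion = { version = "53" }
"""
major = workspace_package_version(sample).split(".")[0]
print(f"datafusion-python {major}")  # the major tracks upstream datafusion
```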
## How to discover changes since the last audit

If the user supplies a previous version or commit SHA where the audit was
last run, diff against it:

```bash
# Public-API-relevant changes since SHA <prev>
git log --oneline <prev>..HEAD -- python/datafusion/

# Which signatures actually moved
git diff <prev>..HEAD -- python/datafusion/functions.py | grep '^[+-]def '
```

If no prior audit point is given, fall back to "since the last upstream
sync" by inspecting commits that touch `Cargo.toml`'s `datafusion` pin:

```bash
git log --oneline -- Cargo.toml | grep -i datafusion | head -5
```
## Output Format

Produce a report grouped by scope. Each finding is one bullet with a
proposed action, so a maintainer can review the list quickly and apply
edits in order.

````
## SKILL.md Audit (scope: <scope>)

Audited against:
- skills/datafusion_python/SKILL.md @ <git SHA / "working tree">
- datafusion-python <version>

### New APIs to cover
- `DataFrame.foo()` — added in datafusion-python 53. Insert in "DataFrame Operations Quick Reference" under <subsection>.
  Proposed snippet:
  ```python
  df.foo(...)
  ```

### Stale mentions
- "old_function_name" referenced in the categorized list (line N) — renamed to "new_function_name". Replace.

### Drifted examples
- "Filtering" section, `df.filter(col("a") > lit(10))` — drop `lit(10)`, auto-wrap applies. (non-blocking)
- "Aggregation" section, `df.aggregate([col("region")], ...)` — pass `"region"` as a plain string per "Projection" guidance.

### Version notes
- `DataFrame.foo()` block needs *Requires datafusion-python 53 or newer.*
- "Common Pitfalls" #N — fixed in datafusion-python 53; remove the pitfall and update the SQL-to-DataFrame row to no longer flag the workaround.

### No-change confirmed
- `SessionContext` data-loading section — all entries match current API.
````

If asked to apply the changes, edit `skills/datafusion_python/SKILL.md`
directly with `Edit` tool calls, one finding at a time, and re-run the
relevant doctest sanity check at the end:

```bash
pytest --doctest-modules python/datafusion -q
```
## What NOT to flag

- **Internal helpers / underscored names.** Private symbols are not part of
  the user-facing surface.
- **Functions intentionally omitted.** Niche / advanced APIs (custom
  catalogs, raw FFI plumbing, low-level execution plan accessors) live in
  the API reference, not the skill. If an omission was deliberate and a
  comment / commit explains why, leave it out.
- **Style nits inside explanatory prose.** The skill mixes example code and
  prose; only enforce the pythonic style on actual code blocks.
- **Function-by-function coverage of every `functions.py` symbol.** The
  "Available Functions (Categorized)" list is curated by category, not
  exhaustive. Adding a single new aggregate to the aggregate list is
  enough — the user follows the pointer to the API reference for the rest.
## Coordination with other skills

- Run `/check-upstream` first to expose any missing upstream APIs into the
  Python layer. Without that, this skill cannot recommend documenting
  something that is not yet exposed.
- Run `/make-pythonic` before this skill if a Pythonic-signature pass is
  planned for a release — that way this skill can update examples to the
  final signature in one shot rather than churning them twice.
- The order during an upstream sync (PR 3 of `dev/release/upstream-sync.md`)
  is therefore: `/check-upstream` → `/make-pythonic` (optional) →
  `/audit-skill-md`.

dev/release/README.md

Lines changed: 9 additions & 0 deletions
@@ -33,6 +33,12 @@ release branch without blocking ongoing development in the `main` branch.
We can cherry-pick commits from the `main` branch into `branch-53` as needed and then create new patch releases
from that branch.

## Upstream Sync

Between releases the `main` branch is periodically synced to a newer upstream `apache/datafusion` version. This is
broken into a three-PR workflow (bump + fix breakage, consolidate transitive deps, fill API and documentation gaps).
See [`upstream-sync.md`](upstream-sync.md) for the full process.

## Detailed Guide

### Pre-requisites

@@ -53,6 +59,9 @@ You will also need access to the [datafusion](https://test.pypi.org/project/data
Before creating a new release:

- We need to ensure that the main branch does not have any GitHub dependencies
- Confirm the upstream sync workflow in [`upstream-sync.md`](upstream-sync.md) has been completed for this release cycle
  (crate bump + breakage fixes, transitive dependency consolidation, and the `/check-upstream` and `/audit-skill-md`
  passes). Any gaps surfaced by those skills should land before the release branch is cut.
- a PR should be created and merged to update the major version number of the project
- A new release branch should be created, such as `branch-53`
- It is best to push this branch to the apache repository rather than a personal fork in case patch releases are required.
