feat: remote build for Python sdk #76

Open. christianalfoni wants to merge 6 commits into main from CSB-1450

@christianalfoni (Collaborator) commented May 12, 2026

Build snapshots without a local Docker daemon

❌ Current behavior

sdk.snapshots.create(CreateContextSnapshotParams(...)) always shelled out to the local Docker daemon. The Python SDK refused to build at all if Docker wasn't installed, which blocked any environment without a daemon (CI containers, restricted devboxes, web playgrounds, etc.) from creating context-based snapshots.

flowchart LR
    A[snapshots.create<br/>CreateContextSnapshotParams] --> B{docker available?}
    B -- no --> X[RuntimeError]
    B -- yes --> C[docker buildx build<br/>locally]
    C --> D[docker push<br/>to bartender registry]
    D --> E[create_snapshot API]

✅ New behavior

snapshots.create now defaults to the remote image-builder service. The local docker buildx path is still available as an opt-in via TOGETHER_LOCAL_BUILD=1. Both paths return the same {image, architecture} shape so the registration logic that follows is identical.

flowchart LR
    A[snapshots.create<br/>CreateContextSnapshotParams] --> B{TOGETHER_LOCAL_BUILD=1?}
    B -- yes --> C[_build_and_register<br/>docker buildx + push]
    B -- no, default --> D[_build_image_via_builder]
    D --> E[POST context.tar.gz<br/>to builder.* host]
    E --> F[SSE log stream<br/>via httpx-sse]
    F --> G[GET /builds/:id<br/>resolve image_ref]
    C --> H[create_snapshot API]
    G --> H

The remote path is implemented by a new RemoteImageBuilderClient that:

  • Builds an in-memory tar.gz of the build context, honoring .dockerignore via pathspec.
  • POSTs the tarball to {builder_host}/builds with nydus_convert=true.
  • Streams build logs via SSE (httpx_sse.aconnect_sse, from the httpx-sse package) and retries transient connect failures while the build pod is scheduling.
  • Cancels the build job on asyncio.CancelledError so client-side aborts don't leak running pods.
  • Resolves the final image reference from GET /builds/:id once a done control event is received.
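
The first step in the list above (an in-memory tar.gz that skips ignored files) could look roughly like the sketch below. It is not the code in this PR: the PR uses pathspec for .dockerignore matching, while this sketch swaps in the standard library's fnmatch so it runs without extra dependencies. That means negation patterns (`!keep.me`) and `**` semantics are not handled, and all function names here are hypothetical.

```python
import fnmatch
import io
import tarfile
from pathlib import Path


def load_ignore_patterns(context_dir: Path) -> list[str]:
    """Read .dockerignore, dropping blank lines and comments."""
    ignore_file = context_dir / ".dockerignore"
    if not ignore_file.is_file():
        return []
    lines = (line.strip() for line in ignore_file.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Simplified stand-in for pathspec: a pattern excludes a file if it
    matches the file's relative path or any of its parent directories."""
    parts = rel_path.split("/")
    prefixes = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return any(
        fnmatch.fnmatchcase(prefix, pattern.rstrip("/"))
        for pattern in patterns
        for prefix in prefixes
    )


def build_context_tarball(context_dir: str) -> bytes:
    """Return a gzip-compressed tar of the build context, held in memory."""
    root = Path(context_dir)
    patterns = load_ignore_patterns(root)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(root).as_posix()
            if is_ignored(rel, patterns):
                continue
            # arcname keeps paths relative to the context root.
            tar.add(path, arcname=rel)
    return buffer.getvalue()
```

The returned bytes are what would be POSTed as the build context. Keeping the archive in memory avoids temp-file cleanup, at the cost of holding the whole context in RAM.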

Both _submit and _get_status are wrapped in _with_retry from _utils for consistency with the rest of the SDK's retry behavior.
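
For readers unfamiliar with the pattern, here is a minimal sketch of what a retry wrapper of this kind might do. The real _with_retry in _utils is not shown in this description, so the name with_retry, its parameters, the retried exception types, and the backoff schedule below are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Hypothetical helper: retry an async call with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because asyncio.CancelledError derives from BaseException rather than Exception, a wrapper shaped like this does not swallow cancellation, which matters for the cancel-on-abort behavior described above.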

🤔 Assumptions

  • The image-builder host can be derived from the management API base URL by replacing the leading hostname prefix api.bartender. (or api.) with the prefix builder. and keeping the rest of the domain.
  • The same TOGETHER_API_KEY is valid for both the management API and the builder service.
  • ubuntu-latest GitHub runners ship Docker preinstalled, so the local-build e2e test still works without an explicit docker/setup-buildx-action step.
  • TOGETHER_REMOTE_ARCHITECTURE, if set, must be a value accepted by CreateSnapshotBodyArchitecture — otherwise we raise a clear RuntimeError rather than silently falling back.
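
The host-derivation and architecture assumptions above can be sketched as follows. This is an illustration only: the function names are hypothetical, example.com is a placeholder domain, and the ALLOWED_ARCHITECTURES values are a guess standing in for whatever CreateSnapshotBodyArchitecture actually accepts.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: stand-in for the values of CreateSnapshotBodyArchitecture.
ALLOWED_ARCHITECTURES = ("amd64", "arm64")


def derive_builder_url(api_base_url: str) -> str:
    """Swap the api.bartender. (or api.) hostname prefix for builder."""
    parts = urlsplit(api_base_url)
    host = parts.hostname or ""
    # Check the longer prefix first so api.bartender. is not half-replaced.
    for prefix in ("api.bartender.", "api."):
        if host.startswith(prefix):
            builder_host = "builder." + host[len(prefix):]
            break
    else:
        raise ValueError(f"cannot derive builder host from {host!r}")
    netloc = builder_host if parts.port is None else f"{builder_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, "", "", ""))


def resolve_architecture(env: Mapping[str, str]) -> Optional[str]:
    """Return the requested architecture, or None when the variable is unset."""
    value = env.get("TOGETHER_REMOTE_ARCHITECTURE")
    if value is None:
        return None
    if value not in ALLOWED_ARCHITECTURES:
        # Fail loudly instead of silently falling back to a default.
        raise RuntimeError(
            f"TOGETHER_REMOTE_ARCHITECTURE={value!r} is not one of {ALLOWED_ARCHITECTURES}"
        )
    return value
```

Passing the environment in as a mapping (rather than reading os.environ inside the function) keeps the validation easy to unit test.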

🧠 Decisions

  • Default to remote, opt into local. Local builds are an escape hatch for developers iterating quickly on a Dockerfile, not the steady state — most users don't have, or shouldn't need, a local Docker daemon.
  • Single return contract. Refactored _build_and_register to return {image, architecture} instead of completing the snapshot registration itself. This keeps create() as the single place that calls create_snapshot_api / alias_snapshot_api, so the two build paths share registration + alias logic.
  • Pass the api_key explicitly to SnapshotsNamespace. The builder client needs a bearer token, and reaching into the underlying generated ApiClient would couple us to generated code. Cleaner to pass api_key as a kwarg from TogetherSandbox.__init__.
  • Added httpx-sse>=0.4.0 and pathspec>=0.11.0 to pyproject.toml. Both are small, well-maintained, and avoid us having to hand-roll SSE parsing or .dockerignore matching.
  • Used monkeypatch.setenv / delenv(..., raising=False) in the e2e tests. Function-scoped, automatic teardown — the two tests can sit next to each other without leaking env state, and the remote test is robust to a workflow-level TOGETHER_LOCAL_BUILD=1 if someone ever sets one.
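
The first two decisions (remote by default, one return contract) can be illustrated with a small sketch. It is a synchronous simplification with hypothetical names; the SDK's actual create() and its build functions are presumably async and take more parameters. The check against the exact string "1" follows this description, and whether other truthy values are accepted is not stated.

```python
import os
from typing import Callable, Mapping, Optional, TypedDict


class BuildResult(TypedDict):
    """Shape both build paths are described as returning."""
    image: str
    architecture: str


def use_local_build(env: Mapping[str, str]) -> bool:
    # Opt-in only: anything other than "1" means the remote builder.
    return env.get("TOGETHER_LOCAL_BUILD") == "1"


def create_snapshot(
    build_local: Callable[[], BuildResult],
    build_remote: Callable[[], BuildResult],
    register: Callable[[BuildResult], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a build path at the call site, then register in one place."""
    env = os.environ if env is None else env
    build = build_local if use_local_build(env) else build_remote
    result = build()
    # Registration (and aliasing) happens once, regardless of build path.
    return register(result)
```

Because both callables return the same shape, the registration step never needs to know which path produced the image.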

🔄 Discussions

  • Considered keeping a single test_create_from_context_with_alias test that just lets the SDK pick its path silently. Rejected because the two paths now exercise totally different infrastructure (local Docker vs. builder service over HTTPS+SSE) and a single test would lose the ability to attribute failures.
  • Considered baking the local-build env-var gate inside _build_and_register itself. Rejected because the branch should be visible at the call site so future readers see both paths in one place.

🧪 Testing

  • tests/e2e/test_snapshots.py::TestSnapshots::test_create_from_context_local: calls monkeypatch.setenv("TOGETHER_LOCAL_BUILD", "1") and asserts the local docker buildx path still produces a registered snapshot with the expected alias prefix.
  • tests/e2e/test_snapshots.py::TestSnapshots::test_create_from_context_remote: calls monkeypatch.delenv("TOGETHER_LOCAL_BUILD", raising=False) and asserts the remote builder path produces a registered snapshot.
  • Manual verification of the diff against origin/main confirms _build_and_register and _build_image_via_builder now share the same {image, architecture} return shape consumed by create().
  • Not yet verified in CI: the e2e.yml workflow has not run on the CSB-1450 branch yet. The workflow trigger was updated to include this branch, so the e2e job will fire on push.

@christianalfoni changed the title from "feat: remote build" to "feat: remote build for Python sdk" on May 12, 2026