Add autogluon-cloud python setup API#213

Merged
AnirudhDagar merged 6 commits into
autogluon:masterfrom
AnirudhDagar:improve_setup
May 19, 2026

Collaborator

@AnirudhDagar AnirudhDagar commented May 8, 2026

Introduces the autogluon.cloud.bootstrap/register/status/teardown Python API. A CLI built with click and rich will follow in a subsequent PR. Both will share the cloud_setup engine, which provisions the CloudFormation stack, writes a config at ~/.autogluon/cloud.yaml, and tears it down cleanly.

Usage

import boto3
from autogluon.cloud import bootstrap, status, teardown

bootstrap()       # uses the defaults
# OR
# CloudFormation stack names allow only alphanumerics and hyphens
bootstrap(backend="sagemaker", session=boto3.Session(), stack_name="my-stack")

status()          # health check
teardown()        # cleanup; non-empty buckets are left for the user to empty

bootstrap() deploys the CloudFormation stack and then calls register() to save the stack outputs to ~/.autogluon/cloud.yaml.


Follow-up PRs, in order:

  • CLI equivalent (autogluon-cloud bootstrap/status/teardown) built on the same setup engine
  • Wire the config auto-load into CloudPredictor.__init__ so users don't need to pass cloud_output_path= after bootstrap()
  • Update docs/tutorials/autogluon-cloud.md to lead with the bootstrap() quick setup; add bootstrap/register/status/teardown to docs/api.rst

Note: Used opus 4.7 for development, please review carefully.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@AnirudhDagar AnirudhDagar force-pushed the improve_setup branch 2 times, most recently from f867ee5 to 8f35030 Compare May 12, 2026 13:05
@AnirudhDagar AnirudhDagar changed the title Add autogluon-cloud CLI and python setup API Add autogluon-cloud python setup API May 12, 2026
@AnirudhDagar AnirudhDagar marked this pull request as ready for review May 12, 2026 13:12
@AnirudhDagar AnirudhDagar requested a review from shchur May 12, 2026 13:13
Collaborator

@shchur shchur left a comment

Thanks! Left a few comments

Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py
Comment thread setup.py
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
- Public API is now `bootstrap`, `register`, `status`, `teardown`,
  exposed at top level (`from autogluon.cloud import bootstrap, ...`).
- Flat single-config YAML; removed Profile / multi-profile machinery.
- `register()` lets users persist existing role_arn/bucket without
  touching AWS; `bootstrap()` calls it after a successful CFN deploy.
- Replace `aws_profile` string with `session: Optional[boto3.Session]`.
- Verbose progress prints now include account ID and region.
- Strict RuntimeError when no AWS region can be detected.
- Rename local `Backend` Literal to `BackendName` (was shadowing the
  Backend ABC). Source `SUPPORTED_SETUP_BACKENDS` from backend/constant.py.
- Drop unused CONFIG_VERSION field.
Collaborator

@shchur shchur left a comment

Just a bunch of minor comments, but overall this looks great!

Comment thread src/autogluon/cloud/cloud_setup.py Outdated

__all__ = ["bootstrap", "register", "status", "teardown"]

BackendName = Literal[SAGEMAKER, RAY_AWS]
Collaborator

This looks like an error: it runs, but type checkers reject it. Per PEP 586, Literal requires its arguments to be literal values (such as strings), not variables.

Collaborator Author

i see, wasn't aware of that. changing this now
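
For reference, a minimal sketch of the PEP 586-compliant form. The values "sagemaker" and "ray_aws" are the backend names used in this PR; deriving the runtime tuple with get_args and the validate_backend helper are illustrative assumptions, not code from the PR.

```python
from typing import Literal, get_args

# PEP 586: Literal parameters must be literal values, not variables,
# so the backend names are spelled out here.
BackendName = Literal["sagemaker", "ray_aws"]

# Derive the runtime tuple from the type so the two cannot drift apart.
# (Assumption: the PR itself sources its constant from backend/constant.py.)
SUPPORTED_BACKENDS = get_args(BackendName)


def validate_backend(name: str) -> str:
    """Hypothetical helper: reject unknown backend names at runtime."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return name
```

The trade-off is that the strings are duplicated between the Literal and any shared constants, which is why the sketch derives the tuple from the type rather than the other way around.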

Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
def teardown(
    *,
    session: Optional[boto3.Session] = None,
    delete_bucket_contents: bool = False,
Collaborator

I wonder if we can put any more guardrails in place, this looks like a really dangerous operation 😬

Two ideas:

  1. Maybe we can just tell the user to empty the bucket themselves?
  2. Ask to put in the bucket name as a confirmation.

I am leaning towards 1

Collaborator Author

Yeah, I removed the S3 bucket deletion. If the bucket is empty, it gets removed along with the stack anyway. If it is not empty, let's not touch it.
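
A rough sketch of the "don't touch non-empty buckets" behavior, in case it helps the discussion. Both helper names are hypothetical; `s3_client` is assumed to behave like `boto3.client("s3")`, and only `list_objects_v2` is called on it.

```python
from typing import Any


def bucket_is_empty(s3_client: Any, bucket: str) -> bool:
    """Return True if the bucket has no current objects.

    Caveat: list_objects_v2 does not report noncurrent versions or delete
    markers, so a versioned bucket can look empty here and still block
    deletion.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, MaxKeys=1)
    return response.get("KeyCount", 0) == 0


def teardown_bucket_note(s3_client: Any, bucket: str) -> str:
    """Hypothetical helper: tell the user what teardown will do with the bucket."""
    if bucket_is_empty(s3_client, bucket):
        return f"Bucket {bucket!r} is empty and will be removed with the stack."
    return f"Bucket {bucket!r} is not empty and will be left untouched."
```

Requesting a single key keeps the check cheap regardless of bucket size.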

Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment on lines +168 to +173
return {
    "found": True,
    "config_path": str(get_config_path()),
    "config": config,
    "checks": checks,
}
Collaborator

Two comments

  1. status() returns a loose Dict[str, Any] with two shapes (found=True/False) — let's maybe return a TypedDict or dataclass, or None if there is no config found.

  2. Drop check_role param — instead, handle AccessDenied gracefully in _check_role/_check_stack by returning something like "ok (unverified)" rather than "failed". Currently they report failure when it's actually a caller permissions issue, not a broken resource.

Collaborator Author

Thanks that makes sense, created a dataclass
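
To illustrate the shape being discussed, here is a minimal sketch of a typed return value. The class name StatusReport appears in this PR, but the fields below are assumed from the dict keys in the snippet above, and status_sketch is a hypothetical stand-in for the real function.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StatusReport:
    """Illustrative stand-in for the PR's dataclass; the fields are assumed."""

    config_path: str
    config: Dict[str, str]
    checks: Dict[str, str] = field(default_factory=dict)

    @property
    def healthy(self) -> bool:
        # Treat any check result that starts with "ok" as passing.
        return all(result.startswith("ok") for result in self.checks.values())


def status_sketch(
    config: Optional[Dict[str, str]], config_path: str
) -> Optional[StatusReport]:
    """Return None when no config exists, instead of a second dict shape."""
    if config is None:
        return None
    checks = {key: "ok" for key in config}  # placeholder for the real AWS checks
    return StatusReport(config_path=config_path, config=config, checks=checks)
```

Callers then branch on `None` rather than on a `found` flag, and attribute access replaces string keys.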

Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
raise
print(f"Stack {stack_name!r} already exists — reusing it.")

cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
Collaborator

Comment from our good friend:

_provision_stack waits on stack_create_complete even when the stack already exists — if it's already CREATE_COMPLETE, the waiter succeeds immediately, but if it's in UPDATE_ROLLBACK_COMPLETE or another terminal state, this could hang or error confusingly. Should either use describe_stacks to check current state after AlreadyExistsException, or just report the outputs directly.
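
One way to act on this suggestion, sketched as a pure function so it can be tested without AWS. The function name and the three-way split are assumptions; the status strings are standard CloudFormation stack statuses, which would be read from `describe_stacks(StackName=...)["Stacks"][0]["StackStatus"]`.

```python
# Stable states in which the stack and its outputs can be reused as-is.
# Including UPDATE_ROLLBACK_COMPLETE is a judgment call: the stack is back
# on its last good configuration.
_REUSABLE = frozenset({"CREATE_COMPLETE", "UPDATE_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"})


def action_for_existing_stack(stack_status: str) -> str:
    """Decide what to do after AlreadyExistsException.

    Returns "reuse" (read the outputs directly), "wait" (creation is still
    running, so the create waiter is appropriate), or "fail" (anything else,
    e.g. ROLLBACK_COMPLETE, where the stack must be deleted first).
    """
    if stack_status in _REUSABLE:
        return "reuse"
    if stack_status == "CREATE_IN_PROGRESS":
        return "wait"
    return "fail"
```

Only the "wait" branch would call `get_waiter("stack_create_complete")`; the other two return or raise immediately.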

Comment thread src/autogluon/cloud/config.py
- Config restructured: cloud.yaml is now keyed by backend name, so a user
  can have entries for sagemaker and ray_aws side by side. Introduces
  BackendConfig (per-backend record); CloudConfig wraps Dict[str, BackendConfig].
- bootstrap()/register() take backend= to select the slot.
- status() returns Dict[str, StatusReport], one entry per configured backend.
- teardown(backend=...) tears down that backend's stack and removes its config
  entry; teardown() (no arg) tears down everything.
- Typed status return via StatusReport dataclass.
- AccessDenied / Forbidden errors in _check_* now surface as
  "ok (unverified — caller lacks <perm>)" instead of "failed".
- Drop delete_bucket_contents from teardown(); user empties buckets first.
- _provision_stack: skip the create-waiter when stack already existed
  (avoids confusing hangs on ROLLBACK_COMPLETE etc).
- Rename register parameter role_arn → role to match SageMaker SDK convention.
- BackendName Literal uses string literals (PEP 586 compliant).
- Add inline comment explaining iam:GetRole RoleName parsing for ARNs with paths.
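
For readers skimming the commit message, a minimal sketch of the permission-error handling it describes. Only AccessDenied and Forbidden are named above; the other codes, the helper names, and the exact result wording are assumptions. FakeClientError stands in for botocore's ClientError, which exposes the same `.response["Error"]["Code"]` shape.

```python
from typing import Callable

# Codes treated as "caller lacks permission". Only AccessDenied and Forbidden
# come from the commit message; the others are assumed additions.
_PERMISSION_CODES = frozenset({"AccessDenied", "AccessDeniedException", "Forbidden", "403"})


class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (same .response shape)."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def is_permission_error(error: Exception) -> bool:
    code = getattr(error, "response", {}).get("Error", {}).get("Code", "")
    return code in _PERMISSION_CODES


def run_check(check: Callable[[], None], permission: str) -> str:
    """Run one resource check and map the outcome to a status string."""
    try:
        check()
    except Exception as error:
        if is_permission_error(error):
            # The resource may be fine; the caller just cannot verify it.
            return f"ok (unverified: caller lacks {permission})"
        return f"failed ({error})"
    return "ok"
```

This keeps "the resource is broken" separate from "the caller cannot look", which was the point of the review comment.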
Collaborator Author

@AnirudhDagar AnirudhDagar left a comment

Thanks for the review @shchur, I've addressed the comments and pushed an update.

Comment thread src/autogluon/cloud/cloud_setup.py
Comment thread src/autogluon/cloud/cloud_setup.py Outdated

Comment thread src/autogluon/cloud/cloud_setup.py Outdated
    we only check existence, not the caller's permission to assume it.
    """
    try:
        session.client("iam").get_role(RoleName=role_arn.rsplit("/", 1)[-1])
Collaborator Author

iam:GetRole's RoleName parameter takes the bare role name without the path. Per the IAM docs, an ARN like arn:aws:iam::123:role/service-role/MyRole has path /service-role/ and role name MyRole. RoleName accepts only the bare name (MyRole) and rejects path-prefixed values.
rsplit("/", 1)[-1] always returns the last segment (the role name) regardless of how many path components precede it, so it works.

I ran a quick smoke test against real IAM to confirm:

import boto3

iam = boto3.client("iam")

iam.get_role(RoleName="NonExistentBareName")
# → NoSuchEntity (format accepted, role just doesn't exist)

iam.get_role(RoleName="service-role/NonExistentName")
# → ValidationError: roleName must contain only alphanumeric and +=,.@_-

iam.get_role(RoleName="team/prod/NonExistentName")
# → ValidationError: same
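
The parsing itself can be checked offline. A small sketch follows; the function name is hypothetical, and the account ID 123 is the placeholder from the example above.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the bare role name, dropping the 'role/' prefix and any IAM path."""
    return role_arn.rsplit("/", 1)[-1]


# No path, one path component, and several path components.
examples = [
    "arn:aws:iam::123:role/MyRole",
    "arn:aws:iam::123:role/service-role/MyRole",
    "arn:aws:iam::123:role/team/prod/MyRole",
]
names = [role_name_from_arn(arn) for arn in examples]
```

All three ARNs yield the same bare name, since role names themselves cannot contain a slash.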

Comment thread src/autogluon/cloud/config.py
Comment thread src/autogluon/cloud/config.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py
Comment thread src/autogluon/cloud/config.py Outdated
Collaborator

@shchur shchur left a comment

Thanks! Feel free to merge after addressing the remaining comments

Comment thread src/autogluon/cloud/backend/constant.py Outdated
Comment thread src/autogluon/cloud/config.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread src/autogluon/cloud/cloud_setup.py Outdated
Comment thread setup.py
- SUPPORTED_SETUP_BACKENDS → SUPPORTED_BACKENDS
- AUTOGLUON_CLOUD_CONFIG_DIR → AG_CONFIG_DIR (match repo's AG_* env-var convention)
- status(): collapse `for name in list(...)` + None-check to `for name, cfg in config.backends.items()`
- Fold _PERMISSION_ERROR_CODES into _is_permission_error (single call site)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AnirudhDagar
Collaborator Author

Thanks @shchur for multiple rounds of reviews and helping make the design much better! I'll merge once the CI is green.

@AnirudhDagar AnirudhDagar merged commit cdbd945 into autogluon:master May 19, 2026
12 checks passed