Add autogluon-cloud python setup API#213
Conversation
f867ee5 to
8f35030
Compare
8f35030 to
209855c
Compare
209855c to
3960b60
Compare
shchur
left a comment
There was a problem hiding this comment.
Thanks! Left a few comments
- Public API is now `bootstrap`, `register`, `status`, `teardown`, exposed at top level (`from autogluon.cloud import bootstrap, ...`). - Flat single-config YAML; removed Profile / multi-profile machinery. - `register()` lets users persist existing role_arn/bucket without touching AWS; `bootstrap()` calls it after a successful CFN deploy. - Replace `aws_profile` string with `session: Optional[boto3.Session]`. - Verbose progress prints now include account ID and region. - Strict RuntimeError when no AWS region can be detected. - Rename local `Backend` Literal to `BackendName` (was shadowing the Backend ABC). Source `SUPPORTED_SETUP_BACKENDS` from backend/constant.py. - Drop unused CONFIG_VERSION field.
shchur
left a comment
There was a problem hiding this comment.
Just a bunch of minor comments, but overall this looks great!
|
|
||
| __all__ = ["bootstrap", "register", "status", "teardown"] | ||
|
|
||
| BackendName = Literal[SAGEMAKER, RAY_AWS] |
There was a problem hiding this comment.
This looks like a syntax error - I think Literal requires the arguments to be strings, not variables.
There was a problem hiding this comment.
i see, wasn't aware of that. changing this now
| def teardown( | ||
| *, | ||
| session: Optional[boto3.Session] = None, | ||
| delete_bucket_contents: bool = False, |
There was a problem hiding this comment.
I wonder if we can put any more guardrails in place, this looks like a really dangerous operation 😬
Two ideas:
- Maybe we can just tell the user to empty the bucket themselves?
- Ask to put in the bucket name as a confirmation.
I am leaning towards 1
There was a problem hiding this comment.
yeah, i removed the s3 bucket deletion then. If it is empty it will anyway be removed with the stack removal. If it is not empty, let's not touch it.
| return { | ||
| "found": True, | ||
| "config_path": str(get_config_path()), | ||
| "config": config, | ||
| "checks": checks, | ||
| } |
There was a problem hiding this comment.
Two comments
-
status()returns a looseDict[str, Any]with two shapes (found=True/False) — let's maybe return aTypedDictordataclass, orNoneif there is no config found. -
Drop
check_roleparam — instead, handleAccessDeniedgracefully in_check_role/_check_stackby returning something like"ok (unverified)"rather than"failed". Currently they report failure when it's actually a caller permissions issue, not a broken resource.
There was a problem hiding this comment.
Thanks that makes sense, created a dataclass
| raise | ||
| print(f"Stack {stack_name!r} already exists — reusing it.") | ||
|
|
||
| cfn.get_waiter("stack_create_complete").wait(StackName=stack_name) |
There was a problem hiding this comment.
Comment from our good friend:
_provision_stackwaits onstack_create_completeeven when the stack already exists — if it's already CREATE_COMPLETE, the waiter succeeds immediately, but if it's in UPDATE_ROLLBACK_COMPLETE or another terminal state, this could hang or error confusingly. Should either usedescribe_stacksto check current state afterAlreadyExistsException, or just report the outputs directly.
- Config restructured: cloud.yaml is now keyed by backend name, so a user can have entries for sagemaker and ray_aws side by side. Introduces BackendConfig (per-backend record); CloudConfig wraps Dict[str, BackendConfig]. - bootstrap()/register() take backend= to select the slot. - status() returns Dict[str, StatusReport], one entry per configured backend. - teardown(backend=...) tears down that backend's stack and removes its config entry; teardown() (no arg) tears down everything. - Typed status return via StatusReport dataclass. - AccessDenied / Forbidden errors in _check_* now surface as "ok (unverified — caller lacks <perm>)" instead of "failed". - Drop delete_bucket_contents from teardown(); user empties buckets first. - _provision_stack: skip the create-waiter when stack already existed (avoids confusing hangs on ROLLBACK_COMPLETE etc). - Rename register parameter role_arn → role to match SageMaker SDK convention. - BackendName Literal uses string literals (PEP 586 compliant). - Add inline comment explaining iam:GetRole RoleName parsing for ARNs with paths.
AnirudhDagar
left a comment
There was a problem hiding this comment.
Thanks for the review @shchur, I've addressed the comments and pushed an update for the same.
|
|
||
| __all__ = ["bootstrap", "register", "status", "teardown"] | ||
|
|
||
| BackendName = Literal[SAGEMAKER, RAY_AWS] |
There was a problem hiding this comment.
i see, wasn't aware of that. changing this now
| we only check existence, not the caller's permission to assume it. | ||
| """ | ||
| try: | ||
| session.client("iam").get_role(RoleName=role_arn.rsplit("/", 1)[-1]) |
There was a problem hiding this comment.
iam:GetRole's RoleName parameter takes the bare role name without the path. Per the IAM docs, an ARN like arn:aws:iam::123:role/service-role/MyRole has path = /service-role/ and role name = MyRole. And RoleName only accepts the bare name (MyRole), it rejects path-prefixed values.
rsplit("/", 1)[-1] always gets the last segment (the role name) regardless of how many path components exist in between, so it works.
I ran a quick smoke test against real IAM to confirm:
iam.get_role(RoleName="NonExistentBareName")
# → NoSuchEntity (format accepted, role just doesn't exist)
iam.get_role(RoleName="service-role/NonExistentName")
# → ValidationError: roleName must contain only alphanumeric and +=,.@_-
iam.get_role(RoleName="team/prod/NonExistentName")
# → ValidationError: same| def teardown( | ||
| *, | ||
| session: Optional[boto3.Session] = None, | ||
| delete_bucket_contents: bool = False, |
There was a problem hiding this comment.
yeah, i removed the s3 bucket deletion then. If it is empty it will anyway be removed with the stack removal. If it is not empty, let's not touch it.
| return { | ||
| "found": True, | ||
| "config_path": str(get_config_path()), | ||
| "config": config, | ||
| "checks": checks, | ||
| } |
There was a problem hiding this comment.
Thanks that makes sense, created a dataclass
shchur
left a comment
There was a problem hiding this comment.
Thanks! Feel free to merge after addressing the remaining comments
- SUPPORTED_SETUP_BACKENDS → SUPPORTED_BACKENDS - AUTOGLUON_CLOUD_CONFIG_DIR → AG_CONFIG_DIR (match repo's AG_* env-var convention) - status(): collapse `for name in list(...)` + None-check to `for name, cfg in config.backends.items()` - Fold _PERMISSION_ERROR_CODES into _is_permission_error (single call site) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @shchur for multiple rounds of reviews and helping make the design much better! I'll merge once the CI is green. |
Introduces
autogluon.cloud.bootstrap/register/status/teardownPython API. This functionality will be extended to a CLI interface as well usingclickandrichin a subsequent PR. Both will use the sharedcloud_setupengine that provisions the CloudFormation stack, writes a per-profile config at ~/.autogluon/cloud.yaml, and tears it down cleanly.Usage
bootstrap()deploys the CloudFormation stack and calls methodregisterto save outputs to~/.autogluon/cloud.yaml.Follow Up PRs in order:
autogluon-cloud bootstrap/status/teardown) built on the same setup engineCloudPredictor.__init__so users don't need to passcloud_output_path=afterbootstrap()docs/tutorials/autogluon-cloud.mdto lead with theagc.initialize()quick setup; addinit/status/teardowntodocs/api.rstNote: Used opus 4.7 for development, please review carefully.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.