Skip to content

Add Docker-layer caching for the PyTorch build + bump PyTorch versions/clean up PRs#453

Open
puneetmatharu wants to merge 2 commits into
ARM-software:mainfrom
puneetmatharu:pytorch-docker-layer-cache
Open

Add Docker-layer caching for the PyTorch build + bump PyTorch versions/clean up PRs#453
puneetmatharu wants to merge 2 commits into
ARM-software:mainfrom
puneetmatharu:pytorch-docker-layer-cache

Conversation

@puneetmatharu
Copy link
Copy Markdown
Contributor

@puneetmatharu puneetmatharu commented May 6, 2026

This PR will rely on the Dockerfile_2_28_aarch64 to handle caching of OpenBLAS and ACL inside the builder image. We build this image and it sits around as a tagged image until the user prunes it. So even if the user wipes the final toolsolutions-pytorch:latest image, they can just restart from the previous complete builder image (i.e. BUILDER_IMAGE_NAME). Note that the trap can remove the running build container but it will not wipe BUILDER_IMAGE_NAME.

The caveat here is that it makes ccache support more difficult for the ACL and OpenBLAS build. The changes I've made in my PyTorch PR don't actually allow this to work (as I don't pass the local mounted directory and it's difficult to do with the docker build), so there's no actual ccache support for that. However, assuming the user doesn't wipe BUILDER_IMAGE_NAME, the user just pays a one-off cost and then only needs to rely on ccache for new builds of PyTorch itself. I think it's a reasonable loss for a big win.

I can (and will) therefore remove all ccache changes from pytorch #182655 to support this new direction.

Copilot AI review requested due to automatic review settings May 6, 2026 15:27
@puneetmatharu puneetmatharu force-pushed the pytorch-docker-layer-cache branch from 80da100 to 7aa2afe Compare May 6, 2026 15:29
@puneetmatharu
Copy link
Copy Markdown
Contributor Author

Hmm. We seem to be getting stuck on answer_questions.py in the smoke tests. Needs investigation.

@puneetmatharu
Copy link
Copy Markdown
Contributor Author

I think the self-tests will pass now but there's a bug in that the local ccache directory never gets used by the pytorch builder image to speed up builds. Need to figure out how to pass that. One for Monday...

Comment thread ML-Frameworks/pytorch-aarch64/build-wheel.sh
Comment thread ML-Frameworks/pytorch-aarch64/build.sh Outdated
Comment thread ML-Frameworks/pytorch-aarch64/build.sh Outdated
@puneetmatharu puneetmatharu force-pushed the pytorch-docker-layer-cache branch 3 times, most recently from 82c3752 to 05682a8 Compare May 8, 2026 12:43
@puneetmatharu
Copy link
Copy Markdown
Contributor Author

@jondea this is good to go (i.e. the tests are passing) if you're happy with the changes

- name: Print ccache disk usage
run: du -sh "${{ env.CCACHE_LOCAL_DIR }}" || true

- name: Report final ccache build stats
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, why have we got rid of this line (but not the one above)? It feels orthogonal to the rest of the change. (I won't block the merge though)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Because we won't have access to a env.TORCH_BUILD_CONTAINER_ID_FILE anymore. We also don't have access to TORCH_BUILD_CONTAINER after the build has completed because it's cleaned up with trap cleanup EXIT. However, I can start a fresh container from the final builder image and check the stats from there. Have added that as an update

jondea
jondea previously approved these changes May 13, 2026
@puneetmatharu
Copy link
Copy Markdown
Contributor Author

Good to go @jondea !

Copy link
Copy Markdown
Contributor

@nSircombe nSircombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants