Releases: kubernetes-sigs/kueue

Kueue v0.6.0-rc.1

23 Jan 19:14
v0.6.0-rc.1
44adc22

Pre-release

Changes since v0.5.0:

Changes by Kind

API Change

  • Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
  • Extend the information returned for pending workloads in a cluster queue with the data used to determine each workload's position, including the position itself. (#1362, @PBundyra)
  • Extend the visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in a LocalQueue. (#1365, @PBundyra)
  • Introduces an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
  • The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
  • Visibility.PendingWorkload no longer implements the runtime.Object interface (#1386, @PBundyra)
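
As a rough illustration of the requeuingTimestamp setting listed above, the following Python sketch orders pending workloads by either their eviction time or their creation time. The Workload class, the function names, and the setting values "Eviction" and "Creation" are assumptions made for this example; it is not Kueue's implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Workload:
    # Hypothetical, simplified stand-in for a Kueue Workload.
    name: str
    created: datetime
    evicted: Optional[datetime] = None  # set if evicted because Pods were not ready in time


def ordering_timestamp(wl: Workload, requeuing_timestamp: str = "Eviction") -> datetime:
    """Pick the timestamp used to sort a workload in the queue."""
    if requeuing_timestamp == "Eviction" and wl.evicted is not None:
        return wl.evicted
    return wl.created


def sort_pending(workloads: List[Workload], requeuing_timestamp: str = "Eviction") -> List[str]:
    """Return workload names, oldest ordering timestamp first."""
    ordered = sorted(workloads, key=lambda wl: ordering_timestamp(wl, requeuing_timestamp))
    return [wl.name for wl in ordered]


pending = [
    Workload("a", datetime(2024, 1, 1, 10, 0), evicted=datetime(2024, 1, 1, 10, 30)),
    Workload("b", datetime(2024, 1, 1, 10, 10)),
    Workload("c", datetime(2024, 1, 1, 10, 20)),
]
print(sort_pending(pending, "Eviction"))
print(sort_pending(pending, "Creation"))
```

Under these assumptions, sorting by eviction time sends the evicted workload to the back of the queue, while sorting by creation time lets it keep its original place.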

Feature

  • A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)

  • Add MultiKueue support for JobSet (#1606, @trasc)

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • Fixed the RBAC for visibility into LocalQueues (#1412, @PBundyra)

  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)

  • Support for preemption while borrowing (#1397, @mimowo)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job failures for some long-running jobs that were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh, can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)

  • Priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • The Visibility.PendingWorkload object now has the metav1.CreationTimestamp field filled with the value of the corresponding kueue.Workload (#1404, @PBundyra)
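
The ClusterQueue validation rule listed above (when cohort is empty, borrowingLimit must be nil) can be sketched as a small check. This is a minimal Python sketch, not the webhook's code; the function name, argument shapes, and error text are hypothetical.

```python
from typing import List, Optional


def validate_borrowing_limits(cohort: str, borrowing_limits: List[Optional[str]]) -> List[str]:
    """Return validation errors for a ClusterQueue-like spec.

    cohort: the cohort name, "" when the queue belongs to no cohort.
    borrowing_limits: one entry per resource quota; None means "not set" (nil).
    """
    errors = []
    if cohort == "":
        for index, limit in enumerate(borrowing_limits):
            if limit is not None:
                errors.append(
                    f"quota[{index}].borrowingLimit must be nil when cohort is empty"
                )
    return errors


print(validate_borrowing_limits("", [None, "10"]))
print(validate_borrowing_limits("team-a", [None, "10"]))
```

A queue without a cohort has nobody to borrow from, so any explicit limit is flagged; inside a cohort both set and unset limits pass.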

Bug or Regression

  • Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of the admission check is Ready (#1617, @mimowo)

  • Fix a bug in the pod integration where unexpected errors occurred when the pod wasn't found (#1512, @achernevskii)

  • Fix a bug where plain pods managed by kueue remained in a terminating condition forever. (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to a ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • A pending workload from a StrictFIFO ClusterQueue no longer blocks borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)
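
To make the borrowingLimit fix (#1561) above concrete, here is a small worked example in Python. It contrasts treating a missing borrowingLimit as unlimited with treating it as 0. The function names and the single-resource model are simplifications assumed for illustration; Kueue's scheduler is considerably more involved.

```python
import math
from typing import Optional


def remaining_borrowable(borrowing_limit: Optional[int], already_borrowed: int) -> float:
    """How much more a ClusterQueue may borrow; None (nil) means no limit."""
    if borrowing_limit is None:
        return math.inf
    return max(borrowing_limit - already_borrowed, 0)


def fits_by_borrowing(request: int, nominal: int, used: int,
                      borrowing_limit: Optional[int], cohort_free: int) -> bool:
    """Check whether a request fits using own free quota plus borrowed quota."""
    own_free = max(nominal - used, 0)
    needed_from_cohort = max(request - own_free, 0)
    already_borrowed = max(used - nominal, 0)
    allowed = min(remaining_borrowable(borrowing_limit, already_borrowed), cohort_free)
    return needed_from_cohort <= allowed


# The queue's nominal quota of 2 is fully used; the cohort has 10 units free.
print(fits_by_borrowing(request=4, nominal=2, used=2, borrowing_limit=None, cohort_free=10))
print(fits_by_borrowing(request=4, nominal=2, used=2, borrowing_limit=0, cohort_free=10))
```

With the limit treated as unlimited the workload fits by borrowing, so no preemption is needed. With the limit treated as 0 it appears not to fit, which is the situation that led to unnecessary preemptions.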

Other (Cleanup or Flake)

  • Adding a toleration to a job now leads to an update of the workload (#1304, @stuton)

Kueue v0.5.2

18 Jan 20:40
v0.5.2
cb7714c

Changes since v0.5.1:

Bug or Regression

  • Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)
  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)
  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1572, @alculquicondor)
  • Fix a bug in the pod integration where a Workload can be left with a finalizer when a pod is not found. (#1524, @achernevskii)
  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii, @woehrl01, @trasc)
  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1570, @B1F030)
  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1462, @tenzen-y)
  • Add events for transitions of the provisioning AdmissionCheck (#1394, @stuton)
  • Support for retry of provisioning request. (#1595, @mimowo)
  • Webhooks are served in non-leading replicas (#1511, @astefanutti)

Kueue v0.5.1

28 Nov 20:01
v0.5.1
8b9b1e8

Changes since v0.5.0:

Bug or Regression

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
  • Fix a bug where plain pods managed by kueue remained in a terminating condition forever. (#1342, @tenzen-y)
  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor, @KunWuLuan)

Kueue v0.5.0

25 Oct 21:39
739ebb1

Changes since v0.4.0:

Highlights

  • AdmissionChecks: a mechanism for internal or external components to influence whether a Workload can be admitted.
  • Integration with cluster-autoscaler's ProvisioningRequest via AdmissionChecks.
  • Information about pending workloads in a ClusterQueue status.
  • Metrics for resource usage of ClusterQueues and LocalQueues.
  • Policy to control whether to preempt or borrow before trying the next flavors.
  • Partial admission graduated to Beta.
  • Workload priority, independent from Pod priority.
  • New integrations:
    • All Kubeflow training APIs
    • Single plain Pods
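
The AdmissionChecks highlight above describes a two-stage flow: a Workload first gets its quota reserved and is then admitted once its admission checks pass. A minimal Python sketch of that decision follows. The function name and the dictionary shape are assumptions for illustration; the state names "Ready" and "Rejected" appear elsewhere in these notes, while "Pending" is assumed.

```python
from typing import Dict


def is_admitted(quota_reserved: bool, check_states: Dict[str, str]) -> bool:
    """A workload counts as admitted once quota is reserved and every check is Ready."""
    return quota_reserved and all(state == "Ready" for state in check_states.values())


def should_evict(check_states: Dict[str, str]) -> bool:
    """Any check in the Rejected state means the workload should not keep running."""
    return any(state == "Rejected" for state in check_states.values())


# "provisioning" and "budget" are hypothetical check names.
print(is_admitted(True, {"provisioning": "Ready", "budget": "Ready"}))
print(is_admitted(True, {"provisioning": "Pending", "budget": "Ready"}))
print(should_evict({"provisioning": "Rejected", "budget": "Ready"}))
```

In this sketch a workload with no admission checks is admitted as soon as quota is reserved, because the "all checks are Ready" condition is trivially true for an empty set.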

Changes by Kind

Feature

  • A mechanism for AdmissionChecks to provide labels, annotations, tolerations and node selectors to the pod templates when starting a job (#1180, @mimowo)
  • A reference standalone controller that can be used to support plain Pods using taints and tolerations, which can be used in Kubernetes versions that don't support scheduling gates. (#1111, @nstogner)
  • Add Active condition to AdmissionChecks (#1193, @trasc)
  • Add optional cluster queue resource quota and usage metrics. (#982, @trasc)
  • Add support for AdmissionChecks, a mechanism for internal or external components to influence whether a Workload can be admitted. (#1045, @trasc)
  • Add support for single plain Pods. (#1072, @achernevskii)
  • Add support for workload Priority (#1081, @Gekko0114)
  • Add tolerations to ResourceFlavor. Kueue injects these tolerations to the jobs that are assigned to the flavor when admitted. (#1248, @trasc)
  • Added pprof endpoints for profiling (#978, @stuton)
  • Allow the admission of multiple workloads within one scheduling cycle while borrowing. (#1039, @trasc)
  • An option to synchronize batch/job.completions with parallelism in case of partial admission (#971, @trasc)
  • Expose cluster queue information about pending workloads (#1069, @stuton)
  • Expose probe configurations to helm chart (#986, @yyzxw)
  • Graduate Partial admission to Beta. (#1221, @trasc)
  • Integrate with Cluster Autoscaler's ProvisioningRequest via two stage admission (#1154, @trasc)
  • Manage cluster queue active state based on admission checks life cycle. (#1079, @trasc)
  • Metrics for usage and reservations in ClusterQueues and LocalQueues. (#1206, @trasc)
  • Options to allow workloads to borrow quota or preempt other workloads before trying the next flavor in the list (#849, @KunWuLuan)
  • Support kubeflow.org/mxjob (#1183, @tenzen-y)
  • Support kubeflow.org/paddlejob (#1142, @tenzen-y)
  • Support kubeflow.org/pytorchjob (#995, @tenzen-y)
  • Support kubeflow.org/tfjob (#1068, @tenzen-y)
  • Support kubeflow.org/xgboostjob (#1114, @tenzen-y)
  • Workload objects have the label kueue.x-k8s.io/job-uid where the value matches the uid of the parent job, whether that's a Job, MPIJob, RayJob, JobSet (#1032, @achernevskii)

Bug or Regression

  • Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
  • Ensure the ClusterQueue status is updated as the number of pending workloads changes. (#1135, @mimowo)
  • Fix resuming of a RayJob after preemption. (#1156, @kerthcet)
  • Fixed missing create verb for webhook (#1035, @stuton)
  • Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1023, @alculquicondor)
  • Helm: Enable the JobSet integration by default (#1184, @tenzen-y)
  • Improve job controller to be resilient to API failures during preemption (#1005, @alculquicondor)
  • Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1024, @alculquicondor)
  • Terminate Kueue when there is an internal failure during setup, so that it can be retried. (#1077, @alculquicondor)

Other (Cleanup or Flake)

Kueue v0.4.2

11 Oct 20:01
417b060

Changes since v0.4.1:

Bug or Regression

  • Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
  • Fix resuming of a RayJob after preemption. (#1190, @kerthcet)

Kueue v0.4.1

15 Aug 13:40
328bb66

Bug or Regression

  • Fixed missing create verb for webhook (#1053, @stuton)
  • Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1029, @alculquicondor)
  • Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1030, @alculquicondor)

Kueue v0.4.0

07 Jul 14:41
5cc79d1

Changes since v0.3.0:

API Change

Feature

  • Add client-go libraries. (#789, @tenzen-y)
  • Add support for Kuberay's RayJobs. (#667, @trasc)
  • Add support for dynamic reclaim in the JobSet integration. (#901, @trasc)
  • Add support for partial workload admission (#771, @trasc)
  • Add the support for dynamic resources reclaim. (#756, @trasc)
  • Allow the scheduler to admit more jobs when the head job has not reached the PodReady=true status. (#708, @KunWuLuan)
  • Allow specifying the manager pod and container security context instead of hardcoded values (#878, @bh-tt)
  • Feature gates for alpha/experimental features are introduced to the Kueue project. (#788, @kerthcet)
  • Ignore integrations whose CRD isn't installed; otherwise, all integrations are enabled by default (#883, @stuton)
  • Integrate JobSet into kueue (#762, @mcariatm)

Bug or Regression

  • Add permission to update frameworkjob status. (#797, @tenzen-y)
  • Fix a bug where update events for ClusterQueues were created endlessly. (#907, @tenzen-y)
  • Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#835, @tenzen-y)
  • Fix panic in cluster queue if resources and coveredResources do not have the same length. (#787, @kannon92)
  • Fix: Enforce borrowed=0 if ClusterQueue doesn't belong to a cohort. (#759, @tenzen-y)
  • Fix: Potential over-admission within cohort when borrowing. (#805, @trasc)
  • Fixed preemption to prefer preempting workloads that were more recently admitted. (#843, @stuton)
  • Fixed a bug where the suspend=true added to the job/mpijob by the default webhook did not take effect. (#758, @fjding)
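
The preemption fix above (#843) says that more recently admitted workloads are preferred as preemption targets. One way to picture such an ordering is the Python sketch below. Sorting by priority first is an assumption added for this example, as are the Candidate class and the function name; only the "most recently admitted first" preference comes from the note itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Candidate:
    # Hypothetical, simplified view of an admitted workload.
    name: str
    priority: int
    admitted_at: datetime


def preemption_order(candidates: List[Candidate]) -> List[str]:
    """Order candidates: lowest priority first, then most recently admitted first."""
    # Python's sort is stable, so sorting by recency first and by priority second
    # keeps the recency order among candidates that share a priority.
    by_recency = sorted(candidates, key=lambda c: c.admitted_at, reverse=True)
    ordered = sorted(by_recency, key=lambda c: c.priority)
    return [c.name for c in ordered]


candidates = [
    Candidate("old-low", 1, datetime(2023, 6, 1, 9, 0)),
    Candidate("new-low", 1, datetime(2023, 6, 1, 12, 0)),
    Candidate("new-high", 5, datetime(2023, 6, 1, 13, 0)),
]
print(preemption_order(candidates))
```

Preempting the most recently admitted workload first tends to throw away the least amount of completed work.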

Other (Cleanup or Flake)

  • Add validation for child jobs without ownerReference. (#865, @tenzen-y)

Kueue v0.3.2

13 Jun 14:51
ff63c63

Changes since v0.3.1:

Bug or Regression

  • Add permission to update frameworkjob status. (#798, @tenzen-y)
  • Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#839, @tenzen-y)
  • Fix panic in cluster queue if resources and coveredResources do not have the same length. (#799, @kannon92)
  • Fix: Potential over-admission within cohort when borrowing. (#822, @trasc)
  • Fixed preemption to prefer preempting workloads that were more recently admitted. (#845, @stuton)

Kueue v0.3.1

16 May 18:55
50f628a

Changes since v0.3.0:

Bug fixes

  • Fix a bug where the validation webhook didn't validate the queue name set as a label when creating an MPIJob. #711
  • Fix a bug where the queue name in workloads was updated with an empty value when using framework jobs that use batch/job internally, such as MPIJob. #713
  • Fix a bug in which borrowed values are set to a non-zero value even though the ClusterQueue doesn't belong to a cohort. #761
  • Fixed adding suspend=true to job/mpijob by the default webhook. #765

Kueue v0.3.0

06 Apr 21:07
0e5db01

Changes since v0.2.1:

Features

  • Support for kubeflow's MPIJob (v2beta1)
  • Upgrade the config.kueue.x-k8s.io API version from v1alpha1 to v1beta1. v1alpha1 is no longer supported.
    v1beta1 includes the following changes:
    • Add namespace to propagate the namespace where kueue is deployed to the webhook certificate.
    • Add internalCertManagement with fields enable, webhookServiceName and webhookSecretName.
    • Remove enableInternalCertManagement. Use internalCertManagement.enable instead.
  • Upgrade the kueue.x-k8s.io API version from v1alpha2 to v1beta1.
    v1alpha2 is no longer supported.
    v1beta1 includes the following changes:
    • ClusterQueue:
      • Immutability of spec.queueingStrategy.
      • Refactor quota.min and quota.max into nominalQuota and borrowingLimit.
      • Swap hierarchy between resources and flavors.
      • Group flavors and resources into spec.resourceGroups to make
        co-dependent resources explicit.
      • Move admission from spec to status.
      • Add conditions field to status.
    • LocalQueue:
      • Add admitted field in status.
      • Add conditions field to status.
    • Workload:
      • Add metadata to podSet templates.
      • Move admission into status.
    • ResourceFlavor:
      • Introduce spec to hold all fields.
      • Rename labels to nodeLabels.
      • Rename taints to nodeTaints.
  • Reduce API calls by setting .status.admission and updating the Admitted condition in the same API call.
  • Obtain queue names from label kueue.x-k8s.io/queue-name. The annotation with
    the same name is still supported, but it's now deprecated.
  • Multiplatform support for linux/amd64 and linux/arm64.
  • Validating webhook for batch/v1.Job validates kueue-specific labels and
    annotations.
  • Sequential admission of jobs https://kueue.sigs.k8s.io/docs/tasks/setup_sequential_admission/
  • Preemption within ClusterQueue and cohort https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption
  • Support for LimitRanges when calculating jobs usage.
  • Library for integrating job-like CRDs (controller and webhooks) https://sigs.k8s.io/kueue/pkg/controller/jobframework
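
For the ClusterQueue change above that refactors quota.min and quota.max into nominalQuota and borrowingLimit, the following Python sketch shows one plausible conversion. It assumes that the old max was an upper bound on total usage including borrowed quota, so the borrowing limit is the difference between max and min, and that an unset max meant unlimited borrowing. The function name and the use of plain integers instead of Kubernetes quantities are simplifications for illustration.

```python
from typing import Dict, Optional


def convert_quota(min_quota: int, max_quota: Optional[int]) -> Dict[str, Optional[int]]:
    """Map a v1alpha2-style (min, max) pair to v1beta1-style fields.

    Assumption: max is the total ceiling (own quota plus borrowed),
    and None means there is no ceiling.
    """
    if max_quota is not None and max_quota < min_quota:
        raise ValueError("max must be greater than or equal to min")
    borrowing_limit = None if max_quota is None else max_quota - min_quota
    return {"nominalQuota": min_quota, "borrowingLimit": borrowing_limit}


print(convert_quota(8, 12))
print(convert_quota(8, None))
```

Under these assumptions, a queue with min 8 and max 12 becomes a nominalQuota of 8 with a borrowingLimit of 4, and a queue without a max keeps an unset borrowingLimit.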

Production Readiness

Bug fixes

  • Fix job controller ClusterRole for clusters that enable OwnerReferencesPermissionEnforcement admission control validation #392
  • Fix race condition when admission attempt and requeuing happen at the same time #427
  • Atomically release quota and requeue previously inadmissible workloads #512
  • Fix support for leader election #580
  • Fix support for RuntimeClass when calculating jobs usage #565

Acknowledgments

Thanks to our contributors in this release, in no particular order:
@tenzen-y @mcariatm @moficodes @mwielgus @trasc @mimowo @alculquicondor @fjding @kerthcet @ArangoGutierrez @Fish-pro @rbarberop @cortespao @rptaylor @kannon92 @noryev @oginskis @charlieyu1996 @kincl @ahg-g