Releases: kubeflow/training-operator
v1.8.0-rc.0 release
New features
- Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
- Adding Training image needed for train api #1963 (deepanker13)
- [SDK] Train API #1962 (deepanker13)
- Train api dataset download changes #1959 (deepanker13)
- Train api init container creation #1958 (deepanker13)
- Publish trainer hugging face image #1985 (deepanker13)
- Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
- Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
- Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
- Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
- Implement webhook validation for the TFJob #2051 (tenzen-y)
- Implement webhook warnings for the MXJob #2058 (tenzen-y)
- Implement webhook validations for the PaddleJob #2057 (tenzen-y)
- Fail job for non-retryable exit codes #2071 (kellyaa)
- Adding fine tune example with s3 as the dataset store #2006 (deepanker13)
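As a rough illustration of the behavior behind #2071, the sketch below classifies container exit codes the way an ExitCode-style restart policy might. It assumes a convention that is not stated in these notes: codes 1-127 are treated as permanent errors and codes 128-255 (typically signal-induced, such as 137 for SIGKILL) as retryable. The function names are hypothetical, and this is not the operator's actual code.

```python
# Assumption: 128 is the boundary between permanent and retryable exit codes.
RETRYABLE_MIN = 128
RETRYABLE_MAX = 255


def is_retryable_exit_code(exit_code: int) -> bool:
    """Return True when the exit code falls in the assumed retryable range."""
    return RETRYABLE_MIN <= exit_code <= RETRYABLE_MAX


def next_action(exit_code: int) -> str:
    """Map a container exit code to a hypothetical job-level action."""
    if exit_code == 0:
        return "succeeded"
    if is_retryable_exit_code(exit_code):
        return "restart"
    # Non-retryable code: mark the job as failed instead of restarting the pod.
    return "fail"


for code in (0, 1, 137, 143):
    print(code, next_action(code))
```

Under these assumptions, an application error such as exit code 1 fails the job immediately, while a pod killed by a signal is restarted.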
Bug fixes
- fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
- IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
- fix: volcano podgroup should have a non-empty queue name #1977 (lowang-bh)
- Fix Master Label for PyTorchJob #1974 (andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
- Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
- Upgrade controller-gen to v0.14.0 #2026 (champon1020)
- Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
- Fix URL in python SDK setup.py #2011 (garymm)
Misc
- Adding parallel support for coveralls #1956 (johnugeorge)
- torchrun example with cpu version pytorch #1965 (kuizhiqing)
- [SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
- [SDK] Add information about TrainingClient logging #1973 (andreyvelich)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
- Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
- E2E: Replace outdated images with latest ones #2083 (tenzen-y)
- Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)
v1.7.0 release
Breaking Changes
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)
New features
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Merge kubeflow/common to training-operator #1813 (johnugeorge)
- Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
- Implement suspend semantics #1859 (tenzen-y)
- Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
- Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)
Bug fixes
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
- Avoid depending on the local env when installing the code-generators #1810 (tenzen-y)
Misc
- Removing reconciler code #1879 (johnugeorge)
- Make Condition and ReplicaStatus optional #1862 (tenzen-y)
- Use the same reasons for Condition and Event #1854 (tenzen-y)
- Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
- Clean up /pkg/common/util/v1 #1845 (tenzen-y)
- Refactoring tests in common/controller.v1 #1843 (tenzen-y)
- remove duplicate code of add task spec annotation #1839 (lowang-bh)
- fetch volcano log when e2e failed #1837 (lowang-bh)
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
- Replace dummy client with fake client #1818 (tenzen-y)
- Add default Intel MPI env variables to MPIJob #1804 (tkatila)
- Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
- xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
- make timeout configurable from e2e tests #1787 (nagar-ajay)
v1.7.0-rc.0 release
Breaking Changes
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)
New features
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Merge kubeflow/common to training-operator #1813 (johnugeorge)
- Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
- Implement suspend semantics #1859 (tenzen-y)
- Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
- Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)
Bug fixes
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
- Avoid depending on the local env when installing the code-generators #1810 (tenzen-y)
Misc
- Removing reconciler code #1879 (johnugeorge)
- Make Condition and ReplicaStatus optional #1862 (tenzen-y)
- Use the same reasons for Condition and Event #1854 (tenzen-y)
- Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
- Clean up /pkg/common/util/v1 #1845 (tenzen-y)
- Refactoring tests in common/controller.v1 #1843 (tenzen-y)
- remove duplicate code of add task spec annotation #1839 (lowang-bh)
- fetch volcano log when e2e failed #1837 (lowang-bh)
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
- Replace dummy client with fake client #1818 (tenzen-y)
- Add default Intel MPI env variables to MPIJob #1804 (tkatila)
- Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
- xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
- make timeout configurable from e2e tests #1787 (nagar-ajay)
v1.6.0 release
Note: Since scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773
Note: The latest Python SDK (version 1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
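To illustrate the scheduler-plugins API change mentioned in the first note, here is a minimal sketch that rewrites the `apiVersion` of a PodGroup manifest held as a Python dict. The full group names `scheduling.sigs.k8s.io` and `scheduling.x-k8s.io` are assumptions (the note only gives the suffixes), and `migrate_podgroup` is a hypothetical helper, not part of the SDK. A real migration may involve more than renaming the group.

```python
import copy

# Assumed full API group names; the release note only mentions the suffixes.
OLD_GROUP = "scheduling.sigs.k8s.io"
NEW_GROUP = "scheduling.x-k8s.io"


def migrate_podgroup(manifest: dict) -> dict:
    """Return a copy of the manifest with the old API group replaced in apiVersion."""
    migrated = copy.deepcopy(manifest)
    group, _, version = migrated.get("apiVersion", "").partition("/")
    # Only touch PodGroup objects that still use the old group.
    if migrated.get("kind") == "PodGroup" and group == OLD_GROUP:
        migrated["apiVersion"] = f"{NEW_GROUP}/{version}"
    return migrated


old_manifest = {
    "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
    "kind": "PodGroup",
    "metadata": {"name": "demo"},
}
print(migrate_podgroup(old_manifest)["apiVersion"])
```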
New Features
- Support for k8s v1.25 in CI #1684 (johnugeorge)
- HPA support for PyTorch Elastic #1701 (johnugeorge)
- Adopting coscheduling plugin #1724 (tenzen-y)
- Support for Paddlepaddle #1675 (kuizhiqing)
- Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
- [SDK] Use Training Client without Kube Config #1740 (andreyvelich)
- [SDK] Create Unify Training Client #1719 (andreyvelich)
Bug fixes
- [SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
- Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
- fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
- Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
- Fix XGBoost conditions bug #1737 (tenzen-y)
- To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
- fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
- fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
- Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
- fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
- fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
- fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
- Fix status lost #1697 (ggaaooppeenngg)
- handle all restart policies #1649 (abin-thomas-by)
- [chore] fix typo #1648 (tenzen-y)
Misc
- Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
- Configure controller worker threads #1707 (HeGaoYuan)
- Validation Spec consistency #1705 (HeGaoYuan)
- [SDK] Remove Final Keyword from constants #1676 (andreyvelich)
- Fix Python installation in CI #1759 (tenzen-y)
- Update mpijob_controller.go #1755 (yshalabi)
- Set the default value of CleanPodPolicy to None #1754 (Syulin7)
- Update join Slack link #1750 (Syulin7)
- Update latest operator image #1742 (johnugeorge)
- Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
- Add Yuki to reviewer group #1739 (johnugeorge)
- Trim down CRD descriptions #1735 (tenzen-y)
- Add CI to build example images #1731 (tenzen-y)
- Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
- Fix indents on examples for tensorflow #1726 (tenzen-y)
- docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
- chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
- Removing deprecated Job Labels #1702 (johnugeorge)
- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
- Add myself to reviewer. #1689 (kuizhiqing)
- Upgrade the envtest version #1687 (tenzen-y)
- [chore] Upgrade some actions version #1686 (tenzen-y)
- Upgrade Golangci-lint #1685 (johnugeorge)
- Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
- Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
- Update deployment.yaml #1668 (OmriShiv)
- Upgrade Go version to v1.19 #1663 (tenzen-y)
- Upgrade kubernetes version for test #1667 (tenzen-y)
- Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
- style: Refine name and signature of 2 replicaName functions #1660 (houz42)
- Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
- Add finalizers to cluster-role #1646 (ArangoGutierrez)
- Update the cmd to support MPI operator in README #1656 (denkensk)
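The job-name validation introduced in #1748 checks names against the DNS-1035 label format. A minimal sketch of such a check follows, assuming the usual RFC 1035 label rules (starts with a lowercase letter, contains only lowercase alphanumerics or `-`, ends with an alphanumeric, at most 63 characters). The helper name is hypothetical and this is not the operator's implementation.

```python
import re

# RFC 1035 label: a lowercase letter first, an alphanumeric last, '-' allowed in between.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True when the name is a valid DNS-1035 label."""
    return len(name) <= DNS1035_MAX_LENGTH and DNS1035_LABEL.fullmatch(name) is not None


for candidate in ("pytorch-dist-mnist", "1-bad-start", "Bad_Name", "trailing-"):
    print(candidate, is_dns1035_label(candidate))
```

Rejecting such names at creation time avoids the situation where a job is accepted but the resources derived from its name cannot be created.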
Closed issues:
- The default value for CleanPodPolicy is inconsistent. #1753
- HPA support for PyTorch Elastic #1751
- Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [#1745](https://github.com/kubeflow/t...
v1.6.0-rc.1 release
Note: Since scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.
Merged pull requests:
- [SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
- Fix Python installation in CI #1759 (tenzen-y)
- fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
- Update mpijob_controller.go #1755 (yshalabi)
- Set the default value of CleanPodPolicy to None #1754 (Syulin7)
- Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
- Update join Slack link #1750 (Syulin7)
- Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
- Update latest operator image #1742 (johnugeorge)
- Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
- [SDK] Use Training Client without Kube Config #1740 (andreyvelich)
- Add Yuki to reviewer group #1739 (johnugeorge)
- Fix XGBoost conditions bug #1737 (tenzen-y)
- Add E2E test for gang-scheduling #1736 (tenzen-y)
- Trim down CRD descriptions #1735 (tenzen-y)
- To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
- Add CI to build example images #1731 (tenzen-y)
- Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
- Fix indents on examples for tensorflow #1726 (tenzen-y)
- Adopting coscheduling plugin #1724 (tenzen-y)
- docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
- [SDK] Create Unify Training Client #1719 (andreyvelich)
- chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
- Configure controller worker threads #1707 (HeGaoYuan)
- Validation Spec consistency #1705 (HeGaoYuan)
- Removing deprecated Job Labels #1702 (johnugeorge)
- HPA support for PyTorch Elastic #1701 (johnugeorge)
- fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
- Fix status lost #1697 (ggaaooppeenngg)
- Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
- Add myself to reviewer. #1689 (kuizhiqing)
- Upgrade the envtest version #1687 (tenzen-y)
- [chore] Upgrade some actions version #1686 (tenzen-y)
- Upgrade Golangci-lint #1685 (johnugeorge)
- Support for k8s v1.25 in CI #1684 (johnugeorge)
- Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
- [SDK] Remove Final Keyword from constants #1676 (andreyvelich)
- [PaddlePaddle] support paddlejob #1675 (kuizhiqing)
- Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
- Update deployment.yaml #1668 (OmriShiv)
- Upgrade kubernetes version for test #1667 (tenzen-y)
- Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
- Upgrade Go version to v1.19 #1663 (tenzen-y)
- style: Refine name and signature of 2 replicaName functions #1660 (houz42)
- Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
- Update the cmd to support MPI operator in README #1656 (denkensk)
- Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
- handle all restart policies #1649 (abin-thomas-by)
- [chore] fix typo #1648 (tenzen-y)
- Add finalizers to cluster-role #1646 (ArangoGutierrez)
- fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
- fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
- Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
- fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
- fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
Closed issues:
- The default value for CleanPodPolicy is inconsistent. #1753
- HPA support for PyTorch Elastic #1751
- Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
- paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
- *job API(master) cannot compatible with old job [#1725](https://github.com/kubeflow/training-opera...
v1.6.0-rc.0 release
v1.5.0 release
New Features
- Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob #1610 (tenzen-y)
- Add all generation tools to Makefile #1609 (johnugeorge)
- Adding MPI python sdk #1608 (johnugeorge)
- Adding XGBoost Python sdk #1607 (johnugeorge)
- Generating MPI python sdk #1606 (johnugeorge)
- Update k8s dependencies to v0.24.1 #1604 (johnugeorge)
- Migrate test framework to GHA #1603 (johnugeorge)
- Add mpi in update-codegen.sh #1600 (ggaaooppeenngg)
- MXNet SDK with Status check fix #1618 (johnugeorge)
Bug Fixes
- fix: MPIJob worker still running when NotEnoughResources #1621 (hackerboy01)
- fix comments for pytorch-controller #1620 (hackerboy01)
- fix: requeue when expire time is not up yet #1614 (Garrybest)
- Look for fully-qualified job role label in Python sdk #1588 (person142)
- fix torch env typo #1573 (kuizhiqing)
- Restart job on failure for Always,OnFailure Policy #1572 (georgkaleido)
- Increase success threshold #1568 (haoxins)
- update status.startTime for pytorchjob and xgboostjob #1567 (cheimu)
- fix: add mpijobs to kubeflow training role #1565 (henrysecond1)
- fix PyTorchJob status inaccuracy when task replica scale down #1593 (PeterChg)
- fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set #1557 (cheimu)
- fix api reader issue #1551 (zw0610)
- fix label and CleanPodPolicy for mpi-controller #1550 (zw0610)
- fix UpdateJobStatusInApiServer when gang-scheduling is enabled #1549 (zw0610)
- fix: add namespace filtering when getting pods/services for jobs #1545 (henrysecond1)
- fix: set mpijob runPolicy.cleanPodPolicy to default none #1554 (cheimu)
Misc
- Update training controller image to latest #1625 (johnugeorge)
- Update SDK version to 1.5.0 #1624 (johnugeorge)
- Upgrade common to v0.4.3 #1623 (johnugeorge)
- Adding GHA for automatic image build and push #1615 (johnugeorge)
- Remove presubmit test depending on optional-test-infra #1596 (aws-kf-ci-bot)
- chore: stop action on first fail #1595 (jasonliu747)
- update img url in design doc #1591 (zw0610)
- Remove uncalled mpi-controller DeletePodsAndServices() #1558 (cheimu)
- Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy #1556 (cheimu)
- Remove table-logger dependency #1544 (person142)
- Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf_operator #1542 (dependabot[bot])
v1.5.0-rc.0 release
Closed issues:
- MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
- unable to fetch TFJob when I use client.go run tfjob #1612
- Pytorchjob dist-mnist no training logs #1601
- kubectl get tfjob -o yaml, but not status output #1598
- missing image in tf_job_design_doc.md #1590
- Labels in Python client are out of date #1587
- PyTorchJob Pods "Not Ready" After Completing Training #1577
- cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
- PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
- pytorchjob doesn't have status.startTime. #1566
- Optional-test-infra Deprecation Notice - Training #1561
- Should we update MPIJob unit test CleanPodPolicy field? #1555
- --enable-gang-scheduling=true doesn't work for MPIJob #1548
- PyTorchJob fails when creating a task with a different namespace but the same name #1543
- Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
- Job TTLs not working #1533
- Support PodGroup in scheduler-plugins/coscheduling #1518
- support elastic training #1515
- Modified the configuration of RootLogger #1514
- Add checking import order in CI #1510
- Scale down of pytorchJob cause workers pod to restart #1509
- Support label selector based success/failure conditions #1507
- [feat] Support SuccessPolicy in PyTorchJob #1505
- pytorch elastic scheduler error #1504
- Could you add the example of MPIJob in this repository #1502
- [Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
- [feature] Make init container injection logic available to all jobs #1498
- Roadmaps for 1.4 release #1496
- [bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
- Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org #1492
- Python PytorchJob: no attribute openapi_types for example code #1481
- PyTorch DistributedDataParallel training with multi nodes #1475
- Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
- Deprecate ksonnet and use python/golang to submit jobs #1468
- Help Wanted in ParameterServerStrategy Example. #1459
- Bug: SomeTimes Coredumped using tfjob #1456
- [question] PyTorchJob MNIST example training speed #1454
- tfjob status not match when EnableDynamicWorker set true #1452
- training-operator set scheduler error #1447
- [sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
- SDK: wait_for_job reports typeError #1445
- Update prometheus monitoring doc #1443
- Master branch should provide a nightly image #1433
- Clean up test folder before testing #1429
- Clean up TF specific docs #1424
- [feature] Support SchedulingPolicy in PyTorchJob #1414
- Hyperlinks in the "Overview" section is incorrect/not found #1411
- add workqueue metric #1407
- Validation fails for MXJob Tune example #1402
- Rate exceeded for aws ecr image #1400
- change layout to follow the standard of kubebuilder? #1397
- [example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
- Update kubeflow/website for 1.4 release #1392
- Cut beta release of tf-operator for 1.4 release #1385
- "invalid memory address or nil pointer dereference" #1382
- some questions about job sync #1379
- Provides a default Grafana dashboard #1376
- [feature] Support different PS/worker types #1369
- Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
- Add more CRD validations markers to block invalid job on client apply #1363
- Update presubmit and post submit job triggers #1354
- Optimize post submit jobs flow #1353
- Enable leader election in controller manager using controllermanagerconfig #1350
- Support mpi jobs in universal operator #1345
- post-submit job failure in master branch #1343
- Improve observability of universal operator #1340
- Best practice to organize main.go and Dockerfile? #1333
- Should training operator keep clientset in the same repository? #1332
- Test image has incorrect tag? #1329
- Prepare e2e tests for all frameworks #1323
- Reduce e2e replica-restart-policy-tests running time #1319
- Improve logs structure by consolidating libs from controller runtime and controllers #1313
- Enable tests for all frameworks #1311
- [bug] The pod will be recreated until the expectation expires #1306
- Upgrade CRDs to apiextensions.k8s.io/v1 #1304
- Add role details as new columns to kubectl get jobs output for CRD. #1301
- How to handle long pending pods in a TF-job? #1282
- Could you release a new version of Python SDK #1279
- Update swagger.json schema for TFJobSpec to include RunPolicy [#1278](https://github.com/kubeflow...
v1.4.0
Merged pull requests:
- extends path in __init__.py for SDK correctly #1531 (cakeislife100)
- Update manifests with latest image tag #1527 (johnugeorge)
- add option for mpi kubectl delivery #1525 (zw0610)
- restore option namespace in launch arguments #1524 (zw0610)
- remove unused scripts #1521 (zw0610)
- remove ChanYiLin from approvers #1513 (ChanYiLin)
- add StacktraceLevel for zapr #1512 (qiankunli)
- add unit tests for tensorflow controller #1511 (zw0610)
- add the example of MPIJob #1508 (hackerboy01)
- Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
- Fix a typo in mpi controller log #1495 (LuBingtan)
- feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
- chore: Fix GitHub Actions script #1491 (tenzen-y)
- chore: Fix misspelling in tfjob #1490 (tenzen-y)
- chore: Update OWNERS #1489 (gaocegege)
- Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
- fix comments for mpi-controller #1485 (hackerboy01)
- add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
- Add MPI job to README now that it's supported #1480 (terrytangyuan)
- add mpi doc #1477 (zw0610)
- Set Go version of base image to 1.17 #1476 (tenzen-y)
- update label for tf-controller #1474 (zw0610)
- Add Akuity to the list of adopters #1473 (terrytangyuan)
- Add PR template with doc checklist #1470 (andreyvelich)
- Add e2e failure debugging guidance #1469 (Jeffwan)
- chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
- Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
- Minor edits to README.md #1460 (terrytangyuan)
- add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
- fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
- feat(pytorch): Support elastic training #1453 (gaocegege)
- fix: generate printer columns for job crds #1451 (henrysecond1)
- Fix README typo #1450 (davidxia)
- consistent naming for better readability #1449 (pramodrj07)
- Fix set scheduler error #1448 (qiankunli)
- Add CI to run the tests for Go #1440 (tenzen-y)
- fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
- Generate a single swagger.json file for all frameworks #1437 (alembiewski)
- Update links and files with the new URL #1434 (andreyvelich)
- chore: update CHANGELOG.md #1432 (Jeffwan)
- Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
- Add Cisco to Adopters List #1421 (andreyvelich)
- Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
- docs: Move myself to approvers #1419 (terrytangyuan)
- fix hyperlinks in the 'overview' section #1418 (pramodrj07)
- docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
- Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
- Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
- Update scripts to generate sdk for all frameworks #1389 (Jeffwan)
Closed issues:
- Question: What is the recommended way for Data Scientists to run a distributed training job #1535
- Restore KUBEFLOW_NAMESPACE options #1522
- Improve test coverage #1497
- swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
- [bug] Missing init container in PyTorchJob #1482
- PytorchJob DDP training will stop if I delete a worker pod #1478
- Write down e2e failure debug process #1467
- How can i add the Priorityclass to the TFjob? #1466
- github.com/go-logr/zapr.(*zapLogger).Error #1444
- Display coverage % in GitHub actions list #1442
- Add Go test to CI #1436
- Podgroup is constantly created and deleted after tfjob is success or failure #1426
- Cut official release of 1.3.0 #1425
- Add "not maintained" notice to other operator repos #1423
- Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
- Python SDK for Kubeflow Training Operator #1380
- Rename this repo #1348
- Universal Operator Phase III: Graduate operator to production grade #1318
v1.4.0-rc.0 release
Features and improvements:
Fixed bugs:
- [bug] Missing init container in PyTorchJob #1482
- Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
Closed issues:
- Restore KUBEFLOW_NAMESPACE options #1522
- Improve test coverage #1497
- swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
- PytorchJob DDP training will stop if I delete a worker pod #1478
- Write down e2e failure debug process #1467
- How can i add the Priorityclass to the TFjob? #1466
- github.com/go-logr/zapr.(*zapLogger).Error #1444
- Podgroup is constantly created and deleted after tfjob is success or failure #1426
- Cut official release of 1.3.0 #1425
- Add "not maintained" notice to other operator repos #1423
- Python SDK for Kubeflow Training Operator #1380
Merged pull requests:
- Update manifests with latest image tag #1527 (johnugeorge)
- add option for mpi kubectl delivery #1525 (zw0610)
- restore option namespace in launch arguments #1524 (zw0610)
- remove unused scripts #1521 (zw0610)
- remove ChanYiLin from approvers #1513 (ChanYiLin)
- add StacktraceLevel for zapr #1512 (qiankunli)
- add unit tests for tensorflow controller #1511 (zw0610)
- add the example of MPIJob #1508 (hackerboy01)
- Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
- Fix a typo in mpi controller log #1495 (LuBingtan)
- feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
- chore: Fix GitHub Actions script #1491 (tenzen-y)
- chore: Fix misspelling in tfjob #1490 (tenzen-y)
- chore: Update OWNERS #1489 (gaocegege)
- Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
- fix comments for mpi-controller #1485 (hackerboy01)
- add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
- Add MPI job to README now that it's supported #1480 (terrytangyuan)
- add mpi doc #1477 (zw0610)
- Set Go version of base image to 1.17 #1476 (tenzen-y)
- update label for tf-controller #1474 (zw0610)
- Add Akuity to the list of adopters #1473 (terrytangyuan)
- Add PR template with doc checklist #1470 (andreyvelich)
- Add e2e failure debugging guidance #1469 (Jeffwan)
- chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
- Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
- Minor edits to README.md #1460 (terrytangyuan)
- add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
- fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
- feat(pytorch): Support elastic training #1453 (gaocegege)
- fix: generate printer columns for job crds #1451 (henrysecond1)
- Fix README typo #1450 (davidxia)
- consistent naming for better readability #1449 (pramodrj07)
- Fix set scheduler error #1448 (qiankunli)
- Add CI to run the tests for Go #1440 (tenzen-y)
- fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
- Generate a single swagger.json file for all frameworks #1437 (alembiewski)
- Update links and files with the new URL #1434 (andreyvelich)
- chore: update CHANGELOG.md #1432 (Jeffwan)
- Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
- Add Cisco to Adopters List #1421 (andreyvelich)
- Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
- docs: Move myself to approvers #1419 (terrytangyuan)
- fix hyperlinks in the 'overview' section #1418 (pramodrj07)
- docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
- Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
- Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
- Update scripts to generate sdk for all frameworks #1389 (Jeffwan)