
Support for Multiple Datasets and Domain-Specific Loss Calculation in Trainer #30725

Open

ghost opened this issue May 9, 2024 · 2 comments

Labels: Feature request, trainer


ghost commented May 9, 2024

Feature request

I am currently working on a project that involves sequence-level distillation across multiple domains, which requires handling a separate dataset for each domain within a single training loop. Specifically, I need to integrate data from four distinct domains, compute the loss individually per domain, and then aggregate these losses into a global loss that guides the overall training process.

Motivation

Ideally, the Trainer class would natively support the following features:

Multiple Dataset Handling: Ability to pass multiple datasets directly into the Trainer, with each dataset potentially representing a different domain (one way to approximate this today is sketched after this list).
Domain-Specific Loss Calculation: Support for defining and computing the loss separately for each domain's dataset within the training loop, then integrating these losses into a global training objective.
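For reference, the first item can be approximated today by tagging each domain's dataset and interleaving them with the datasets library. A minimal, runnable sketch; the domain_id column name and the toy corpora are my own hypothetical choices:

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for four domain corpora (hypothetical data).
domains = {
    0: Dataset.from_dict({"text": ["legal doc one", "legal doc two"]}),
    1: Dataset.from_dict({"text": ["medical note one", "medical note two"]}),
    2: Dataset.from_dict({"text": ["news story one", "news story two"]}),
    3: Dataset.from_dict({"text": ["chat message one", "chat message two"]}),
}

# Tag every example with its domain id before merging, so a custom loss
# can tell the domains apart later.
tagged = [
    ds.map(lambda ex, d=d: {"domain_id": d}) for d, ds in domains.items()
]

# Interleave with equal sampling probabilities; Trainer then sees one dataset.
train_dataset = interleave_datasets(tagged, probabilities=[0.25] * 4, seed=42)
print(train_dataset[0])  # a tagged example from one of the four domains
```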

Your contribution

Currently, the Trainer class in the Transformers library accepts a single dataset each for training and evaluation. To handle multiple datasets or to calculate domain-specific losses, one must subclass Trainer and override methods such as compute_loss, which complicates the implementation and integration of domain-specific training strategies.
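That workaround looks roughly like the following; a minimal sketch, assuming each batch carries a hypothetical domain_id column (as in the interleaving sketch above) and that per-domain losses are combined by a weighted mean. Note that TrainingArguments(remove_unused_columns=False) is needed so the extra column actually reaches compute_loss.

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class MultiDomainTrainer(Trainer):
    def __init__(self, *args, domain_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical mapping from domain id to loss weight.
        self.domain_weights = domain_weights or {}

    def compute_loss(self, model, inputs, return_outputs=False):
        domain_ids = inputs.pop("domain_id")  # shape: (batch,)
        labels = inputs.pop("labels")         # shape: (batch, seq)
        outputs = model(**inputs)
        logits = outputs.logits               # (batch, seq, vocab)

        # Standard causal-LM shift, but with reduction="none" so we can
        # recover one loss value per example.
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        per_token = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction="none",
            ignore_index=-100,
        ).view(shift_labels.shape)
        mask = (shift_labels != -100).float()
        per_example = (per_token * mask).sum(1) / mask.sum(1).clamp(min=1)

        # Aggregate: weighted mean of the per-domain mean losses.
        loss, n = 0.0, 0
        for d in domain_ids.unique():
            weight = self.domain_weights.get(int(d), 1.0)
            loss = loss + weight * per_example[domain_ids == d].mean()
            n += 1
        loss = loss / max(n, 1)
        return (loss, outputs) if return_outputs else loss
```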

amyeroberts added the trainer and Feature request labels on May 9, 2024
amyeroberts (Collaborator) commented

cc @muellerzr @pacman100


cw235 commented May 10, 2024

Your request to have the Trainer class handle multiple domain datasets and compute domain-specific losses would indeed be valuable for projects involving sequence-level distillation across domains. Here are some potential contributions that could address it:

1. Multiple dataset handling: Modify the Trainer class to accept multiple datasets, each representing a different domain, directly as input. This would streamline the integration of diverse data sources within a single training loop.
2. Domain-specific loss calculation: Add a mechanism for defining and computing losses separately for each domain's dataset during training, then aggregating them into a global training objective.
3. Flexible loss aggregation: Let users define custom strategies for combining domain-specific losses into the global objective, based on the requirements of their projects (see the sketch after this list).
4. Unified training interface: Provide a single interface for domain-specific training strategies, abstracting away the subclassing and method overriding currently required.
5. Documentation and examples: Document the new features with clear examples of handling multiple datasets and computing domain-specific losses, to help the community adopt them.
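To make point 3 above concrete, here is a minimal sketch of what a user-supplied aggregation strategy could look like. The callable signature is purely hypothetical, not an existing Transformers API:

```python
from typing import Dict
import torch

def weighted_mean_aggregator(
    domain_losses: Dict[str, torch.Tensor],
    weights: Dict[str, float],
) -> torch.Tensor:
    """Combine per-domain losses into one scalar training objective."""
    total = sum(
        weights.get(name, 1.0) * loss for name, loss in domain_losses.items()
    )
    return total / len(domain_losses)

# Usage with dummy per-domain losses:
losses = {"legal": torch.tensor(2.1), "medical": torch.tensor(1.4)}
global_loss = weighted_mean_aggregator(losses, {"legal": 0.7, "medical": 1.3})
```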

Incorporating these enhancements into the Trainer class would streamline training for sequence-level distillation across diverse domains, and would make the library more useful for the broad range of applications that need multi-domain data integration and domain-specific training strategies.
