Model tuning does not work #254

Open
sidoncloud opened this issue Nov 5, 2023 · 2 comments
sidoncloud commented Nov 5, 2023

Like most of the code uploaded by Google developers, your model tuning code that uses the Stack Overflow data fails miserably, giving the errors below.

{
  "summary": "Found 7 errors in your file. See 'errors' field for specific details.\nValidated 4000 examples for tokenization. Found 7 examples where either 'input_text' or 'output_text' exceeds the model token limits. See 'tokenization_issues' field for some specific examples.\nValidated 1000 examples for RAI. Found 43 examples that has RAI issues. See 'rai_issues' field for some specific examples.\n",
  "max_user_input_token_length": 8177,
  "tokenization_issues": [
    "Row: 122. Token limit exceeded for 'input_text' [tokens: 15851|limit: 8192] or 'output_text' [tokens: 24|limit: 1024]",
    "Row: 362. Token limit exceeded for 'input_text' [tokens: 13474|limit: 8192] or 'output_text' [tokens: 19|limit: 1024]",
    "Row: 391. Token limit exceeded for 'input_text' [tokens: 10643|limit: 8192] or 'output_text' [tokens: 34|limit: 1024]",
    "Row: 528. Token limit exceeded for 'input_text' [tokens: 9351|limit: 8192] or 'output_text' [tokens: 17|limit: 1024]",
    "Row: 840. Token limit exceeded for 'input_text' [tokens: 16309|limit: 8192] or 'output_text' [tokens: 33|limit: 1024]",
    "Row: 868. Token limit exceeded for 'input_text' [tokens: 20337|limit: 8192] or 'output_text' [tokens: 51|limit: 1024]",
    "Row: 1535. Token limit exceeded for 'input_text' [tokens: 8969|limit: 8192] or 'output_text' [tokens: 26|limit: 1024]"
  ],
  "rai_issues": [
    "Row: 15. RAI violation. High scores for categories Finance",
    "Row: 46. RAI violation. High scores for categories Finance",
    "Row: 275. RAI violation. High scores for categories Finance",
    "Row: 401. RAI violation. High scores for categories Finance",
    "Row: 444. RAI violation. High scores for categories Health",
    "Row: 503. RAI violation. High scores for categories Finance",
    "Row: 558. RAI violation. High scores for categories Finance",
    "Row: 571. RAI violation. High scores for categories Health",
    "Row: 848. RAI violation. High scores for categories Finance",
    "Row: 934. RAI violation. High scores for categories Finance",
    "... there are more cases ..."
  ],
  "errors": [
    "Row: 122. exceeds token limit",
    "Row: 362. exceeds token limit",
    "Row: 391. exceeds token limit",
    "Row: 528. exceeds token limit",
    "Row: 840. exceeds token limit",
    "Row: 868. exceeds token limit",
    "Row: 1535. exceeds token limit"
  ],
  "max_user_output_token_length": 79
}
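
For triage, the row numbers in a report like the one above can be collected programmatically. The following is a minimal sketch, assuming the report has been loaded as a Python dict and that each entry starts with the `Row: <n>.` prefix shown above. The helper name `flagged_rows` is hypothetical and not part of any Google library. Note that `rai_issues` is truncated in the report ("... there are more cases ..."), so only the rows actually listed can be recovered this way.

```python
import re

# Matches the "Row: <n>." prefix used by the entries in the report above.
ROW_PATTERN = re.compile(r"^Row: (\d+)\.")


def flagged_rows(report, fields=("errors", "rai_issues")):
    """Return the sorted, de-duplicated row numbers listed under the given fields.

    Entries that do not start with "Row: <n>." (for example the
    "... there are more cases ..." placeholder) are ignored.
    """
    rows = set()
    for field in fields:
        for entry in report.get(field, []):
            match = ROW_PATTERN.match(entry)
            if match:
                rows.add(int(match.group(1)))
    return sorted(rows)


# Abbreviated copy of the report above, for illustration only.
report = {
    "errors": [
        "Row: 122. exceeds token limit",
        "Row: 362. exceeds token limit",
    ],
    "rai_issues": [
        "Row: 15. RAI violation. High scores for categories Finance",
        "... there are more cases ...",
    ],
}

print(flagged_rows(report))  # [15, 122, 362]
```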
@fmichaelobrien
Member

Understood. I am new to this repo but an LLM enthusiast. I can try to reproduce and triage this if you share the specific use case and the code you ran. Here to help.

paulav6 commented Nov 20, 2023

I faced similar rai_issues even with private data. Samples were flagged when they contained a person's name or asked about going to a specific bank website.
The errors went away once I removed those samples from my JSONL file. So, unless these examples are crucial, you could try removing them.
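
Removing the flagged samples can be sketched roughly as follows. This is a minimal illustration, not the repo's own tooling: the helper `drop_rows` is hypothetical, the set of row numbers is assumed to come from the validation report, and whether the report counts rows from 0 or 1 is not stated in this thread, so the `first_row` parameter is an assumption to verify against your own data.

```python
def drop_rows(lines, rows_to_drop, first_row=0):
    """Return the JSONL lines whose row number is not in rows_to_drop.

    Blank lines are discarded before numbering. `first_row` is the number
    given to the first record (0 or 1); check which one the validation
    report uses before relying on the result.
    """
    rows_to_drop = set(rows_to_drop)
    records = [line for line in lines if line.strip()]
    return [
        line.rstrip("\n")
        for index, line in enumerate(records, start=first_row)
        if index not in rows_to_drop
    ]


# Tiny made-up dataset using the 'input_text' / 'output_text' fields
# named in the validation report.
sample = [
    '{"input_text": "q0", "output_text": "a0"}\n',
    '{"input_text": "q1", "output_text": "a1"}\n',
    '{"input_text": "q2", "output_text": "a2"}\n',
]

cleaned = drop_rows(sample, {1})
print(len(cleaned))  # 2
```

With a real file, the same function could be fed `open(path).readlines()` and the result written back out with one record per line.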
