Tokenizer supports multiple encodings, compatible with .Net Standard 2.0 #218

Open. Frogley wants to merge 4 commits into base: dev


@Frogley Frogley commented Apr 7, 2023

The tokenizer supports multiple encodings (r50k_base, p50k_base, cl100k_base) and provides Encode and Decode methods.

  // Construct the tokenizer in one of three ways (pick one):
  Tokenizer tokenizer = new Tokenizer("cl100k_base");                             // by encoding name
  // Tokenizer tokenizer = new Tokenizer().FromModelName("gpt-3.5-turbo-0301");   // by model name
  // Tokenizer tokenizer = new Tokenizer().FromModel(Models.Model.TextDavinciV3); // by model enum

  // Encode a string into token ids.
  string str = @"床前明月光,疑是地上霜,举头望明月,低头思故乡。";
  int[] res = tokenizer.Encode(str);
  // res (with cl100k_base) = [ 11795 232 25580 31958 9953 6708 231 3922 163 244 239 21043 30590 17905 52597 250 3922 3574 122 65455 4916 249 31958 9953 3922 8687 236 65455 91763 8067 227 18259 94 1811]

  // Decode the token ids back into the original string.
  string str2 = tokenizer.Decode(res);
  // str2 = "床前明月光,疑是地上霜,举头望明月,低头思故乡。"
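
For readers wondering how a model name could be resolved to an encoding name, here is a minimal sketch. It is not the PR's implementation: `EncodingLookupSketch` and `EncodingNameFor` are hypothetical names, and the table holds only a few well-known pairings (gpt-3.5-turbo with cl100k_base, text-davinci-003 with p50k_base, davinci with r50k_base), with a prefix rule assumed for dated variants such as gpt-3.5-turbo-0301.

```C#
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the PR.
public static class EncodingLookupSketch
{
    // Exact model names and the encoding each one uses.
    static readonly Dictionary<string, string> ExactMatches = new Dictionary<string, string>
    {
        { "gpt-3.5-turbo", "cl100k_base" },
        { "text-davinci-003", "p50k_base" },
        { "davinci", "r50k_base" },
    };

    // Prefixes that cover dated model variants (assumed rule).
    static readonly (string Prefix, string Encoding)[] PrefixMatches =
    {
        ("gpt-3.5-turbo-", "cl100k_base"),
    };

    public static string EncodingNameFor(string modelName)
    {
        // Try the exact table first, then fall back to prefix matching.
        if (ExactMatches.TryGetValue(modelName, out string name)) return name;
        foreach (var (prefix, encoding) in PrefixMatches)
        {
            if (modelName.StartsWith(prefix, StringComparison.Ordinal)) return encoding;
        }
        throw new ArgumentException($"No known encoding for model '{modelName}'.");
    }
}
```

Under these assumptions, `EncodingLookupSketch.EncodingNameFor("gpt-3.5-turbo-0301")` resolves through the prefix rule to cl100k_base, while an unknown model name throws instead of silently picking a default.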

kayhantolga and others added 4 commits March 20, 2023 18:32
@Frogley Frogley changed the base branch from master to dev April 7, 2023 04:52
@kayhantolga
Member

Hey @Frogley, I haven't forgotten about your PR. I am just trying to understand how the tokenizer works and comparing it against your PR, which is taking up a lot of time. I apologize for the delay. :/

@Frogley
Author

Frogley commented May 18, 2023

Great. To be honest, my understanding of the tokenizer's core algorithm is somewhat vague; I didn't fully grasp it. Basically, my PR is a translation of tiktoken/lib.rs from Rust into C#, with some simplifications. After the translation was complete, I ran a few test cases and the results were consistent, but I didn't do any extensive testing or comparison. I hope my work can be of help to you.
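
For anyone else trying to follow the core algorithm, the byte-pair merge step can be sketched roughly as below. This is a simplified illustration, not the PR's code or tiktoken's: `BpeSketch` is a hypothetical name, the rank table is keyed by strings with one char per byte purely for readability, and the regex pre-splitting and special-token handling that a real tokenizer needs are left out.

```C#
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified illustration of byte-pair merging.
public static class BpeSketch
{
    // Turns the bytes of one text piece into token ids by repeatedly merging
    // the adjacent pair whose concatenation has the lowest rank.
    public static List<int> Encode(byte[] piece, Dictionary<string, int> ranks)
    {
        // Start with one part per byte (each byte stored as a single char).
        var parts = piece.Select(b => ((char)b).ToString()).ToList();

        while (parts.Count > 1)
        {
            int bestIndex = -1;
            int bestRank = int.MaxValue;

            // Find the adjacent pair that forms the lowest-ranked known token.
            for (int i = 0; i < parts.Count - 1; i++)
            {
                if (ranks.TryGetValue(parts[i] + parts[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0) break; // no adjacent pair can be merged

            // Merge the pair into a single part.
            parts[bestIndex] += parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1);
        }

        // Assumes every remaining part (including single bytes) has a rank.
        return parts.Select(p => ranks[p]).ToList();
    }
}
```

With a toy rank table of a=0, b=1, c=2, ab=3, bc=4, abc=5, the bytes of "abc" merge to the single token 5 (first "ab", then "abc"), while "bca" ends as the two tokens 4 and 0 because no rank exists for "ca" or "bca".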

@kayhantolga kayhantolga added this to the 8.0.4 milestone Apr 10, 2024