Tokenizer in different code version #125

countytown · 2024-03-08T13:49:10Z

Hi~ Thanks a lot for the new version of the code, which has made the framework much easier to understand. However, I noticed that some details have also changed, e.g., the tokenizer part:

old version:

def tokenizer_X_token(prompt, tokenizer, X_token_index, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(f'<{X_INDEX_TOKEN[X_token_index].lower()}>')]
    ...

new version:

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
    ...

Should I worry about any performance degradation? I ask because:

1. It looks like video and image inputs are now treated the same way.
2. The original training samples include symbols like <image>\n and \n<video>.

In fact, I am trying to fine-tune with new modalities like audio and depth, so is there any conflict with the current version (besides the LanguageBind part)?
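
To make the question concrete, here is a minimal sketch of the kind of multi-modality splitting I have in mind. It is not the repository's implementation: the name tokenize_with_modality_tokens, the MODALITY_TOKEN_INDEX table and its placeholder ids, and the toy_encode stand-in for tokenizer(chunk).input_ids are all hypothetical, and BOS handling is left out.

```python
import re
from typing import Callable, Dict, List

# Hypothetical placeholder ids, chosen only for illustration.
# The real repository defines its own constants.
MODALITY_TOKEN_INDEX: Dict[str, int] = {
    "<image>": -200,
    "<video>": -201,
    "<audio>": -202,
    "<depth>": -203,
}


def tokenize_with_modality_tokens(
    prompt: str,
    encode: Callable[[str], List[int]],
    token_index: Dict[str, int] = MODALITY_TOKEN_INDEX,
) -> List[int]:
    """Split the prompt on every known modality tag and splice in its id."""
    # A capturing group makes re.split keep the matched tags in the result.
    pattern = "(" + "|".join(re.escape(tag) for tag in token_index) + ")"
    input_ids: List[int] = []
    for piece in re.split(pattern, prompt):
        if piece in token_index:
            # A modality tag becomes a single placeholder id.
            input_ids.append(token_index[piece])
        elif piece:
            # Ordinary text goes through the text tokenizer.
            input_ids.extend(encode(piece))
    return input_ids


def toy_encode(text: str) -> List[int]:
    """Stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


ids = tokenize_with_modality_tokens("<image>\nhi <audio>", toy_encode)
```

With a table like this, <video> and <audio> each get their own placeholder id instead of being passed to the text tokenizer as plain text, which is what I would expect to happen if the prompt is split on '<image>' only.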

Thank you so much~☺
