
Support for Vercel Serverless and Edge #6

Open
vshei opened this issue Aug 1, 2023 · 2 comments

Comments

@vshei

vshei commented Aug 1, 2023

Hello,
The underlying package being used (https://github.com/dqbd/tiktoken) seems to run into issues in a Vercel Serverless environment. Our application is currently built on Next.js 13, and we are seeing this error in our logs:
Error: Missing tiktoken_bg.wasm

We saw this issue before when we tried using the dqbd/tiktoken library directly. We had to switch to js-tiktoken to resolve it.

Per the README in the GitHub repo it seems like this is the difference between the two:
tiktoken (formerly hosted at @dqbd/tiktoken): WASM bindings for the original Python library, providing full 1-to-1 feature parity.
js-tiktoken: Pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes).

I was wondering if it would be possible to build a version on top of the js-tiktoken library for better portability, for folks in environments where WASM is not easy to work with. The error and fix (i.e. the creation of js-tiktoken) can be seen here: transitive-bullshit/chatgpt-api#570

Thanks!

@iwasrobbed

You can just use js-tiktoken with the claude.json file from this repo:

import claude from './claude.json'
import { Tiktoken, TiktokenBPE } from 'js-tiktoken'

// Modified from: https://github.com/anthropics/anthropic-tokenizer-typescript
// (they use an old version of tiktoken that isn't edge compatible)

export function countTokens(text: string): number {
  const tokenizer = getTokenizer()
  // NFKC-normalize the input so compatibility characters collapse to a
  // canonical form; 'all' allows special tokens to be encoded normally
  const encoded = tokenizer.encode(text.normalize('NFKC'), 'all')
  return encoded.length
}

// ----------------------
// Private APIs
// ----------------------

let cachedTokenizer: Tiktoken | undefined

// Lazily build (and cache) the tokenizer, since constructing the BPE
// rank table from claude.json is relatively expensive
const getTokenizer = (): Tiktoken => {
  if (cachedTokenizer) return cachedTokenizer
  const ranks: TiktokenBPE = {
    bpe_ranks: claude.bpe_ranks,
    special_tokens: claude.special_tokens,
    pat_str: claude.pat_str,
  }
  cachedTokenizer = new Tiktoken(ranks)
  return cachedTokenizer
}
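As a side note on the `normalize('NFKC')` call above: NFKC compatibility normalization folds variant code points (ligatures, full-width forms, etc.) into canonical equivalents, so visually similar inputs produce the same token counts. A quick stdlib-only illustration (no js-tiktoken or claude.json needed):

```javascript
// NFKC normalization, as applied to the input text before encoding.
// The "fi" ligature (U+FB01) and a full-width "A" (U+FF21) fold to
// their plain ASCII equivalents, giving the tokenizer one canonical form.
const ligature = '\uFB01' // 'ﬁ'
console.log(ligature.normalize('NFKC')) // 'fi'

const fullwidthA = '\uFF21' // 'Ａ'
console.log(fullwidthA.normalize('NFKC')) // 'A'
```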

@Mypathissional

Is the explicit number of tokens listed in claude.json correct?
