
HF web service streaming response differs from OpenAI, breaking clients #1896

Open
dluc opened this issue May 14, 2024 · 0 comments
dluc commented May 14, 2024

System Info

Attempting to reuse an existing OpenAI client to stream responses from an HF endpoint doesn't work, due to a couple of differences. In my case the differences break the .NET client in the Azure AI SDK, though I suspect they affect other clients too.

Differences found:

  1. When streaming response tokens, OpenAI terminates the stream with a final [DONE] string, while HF simply stops sending tokens. Clients expecting [DONE] get stuck waiting either for another token or for the termination string.
  2. OpenAI accepts '0.0 <= top_p <= 1.0', while HF accepts only '0.0 < top_p < 1.0'.
  3. When sending top_p = 0 to the HF endpoint, the service replies 200 OK with an error {"error":"Input validation error: top_p must be > 0.0 and < 1.0","error_type":"validation"} and no final [DONE]. Given the 200 status code and the missing terminator, the error is parsed as data and the client hangs, waiting for the next token.
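To illustrate why the missing terminator matters: a minimal sketch (not the Azure AI SDK's actual code; function and variable names are illustrative) of how OpenAI-style SSE clients typically consume a stream. A loop like this only returns when it sees `data: [DONE]`, so the HF behavior in points 1 and 3 leaves it waiting forever:

```python
import json

def read_stream(lines):
    """Parse an OpenAI-style SSE stream, yielding content deltas until
    the 'data: [DONE]' terminator. Without that terminator the loop
    never returns and the client appears to hang."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank lines / SSE comments
        payload = line[len("data:"):].strip()  # HF omits the space after "data:"
        if payload == "[DONE]":               # OpenAI's explicit end-of-stream marker
            return
        chunk = json.loads(payload)
        if "error" in chunk:                  # HF sends validation errors as data, with HTTP 200
            raise RuntimeError(chunk["error"])
        yield chunk["choices"][0]["delta"].get("content", "")
```

For example, feeding it an OpenAI-shaped stream joins the deltas cleanly, while an HF error payload surfaces as an exception instead of being silently treated as a token.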

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Example 1: error with top_p = 0

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":50,
      "temperature":0,
      "top_p":0.0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

< HTTP/2 200
< date: Tue, 14 May 2024 22:33:46 GMT
< content-type: text/event-stream
< x-compute-type: 2-a100
< x-request-id: ...
< cache-control: no-cache
< access-control-allow-credentials: true
< vary: origin, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
< x-accel-buffering: no
< access-control-allow-origin: *
< x-compute-characters: 67
< x-sha: ...
<
data:{"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"}

OpenAI returns a response instead (see next).

Example 2: OpenAI response includes `[DONE]`

Request:

curl -X POST https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer ${OPENAI_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"gpt-3.5-turbo"}'

Response:

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" +"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" equals"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"length"}]}

data: [DONE]

Example 3: HF response is missing `[DONE]`

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0.01,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" result"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" of"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" the"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" mathematical"},"logprobs":null,"finish_reason":"length"}]}

Expected behavior

It would be great if it were possible to reuse OpenAI clients (and apps built on them) simply by pointing them at https://api-inference.huggingface.co.

While it's possible to work around the different top_p range by changing the code (if apps allow for it), the lack of a termination string makes these clients unusable.
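For the top_p range, the client-side workaround could be as simple as clamping the value into HF's exclusive range before sending the request. A hedged sketch (the epsilon value and function name are my own choices, not part of either API):

```python
def clamp_top_p(top_p: float, eps: float = 1e-3) -> float:
    """Map OpenAI's inclusive range (0.0 <= top_p <= 1.0) into the
    exclusive range HF accepts (0.0 < top_p < 1.0)."""
    return min(max(top_p, eps), 1.0 - eps)
```

This fixes difference 2 only; the missing [DONE] terminator (differences 1 and 3) can't be worked around from the client side without rewriting the stream-parsing logic itself.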
