
HF web service streaming response differs from OpenAI, breaking clients #1896

Open
dluc opened this issue May 14, 2024 · 0 comments
dluc commented May 14, 2024

System Info

Attempting to reuse an existing OpenAI client to stream responses from an HF endpoint doesn't work, due to a couple of differences. In my case the differences break the .NET client in the Azure AI SDK, though I suspect they affect other clients too.

Differences found:

  1. When streaming response tokens, OpenAI terminates the stream with a final [DONE] string, while HF simply stops sending tokens. Clients expecting [DONE] get stuck waiting either for another token or for the termination string.
  2. OpenAI accepts '0.0 <= top_p <= 1.0', while HF accepts only '0.0 < top_p < 1.0'.
  3. When sending top_p = 0 to the HF endpoint, the service replies 200 OK with an error {"error":"Input validation error: top_p must be > 0.0 and < 1.0","error_type":"validation"} and no final [DONE]. Given the 200 status code and the missing terminator, the error is parsed as data and the client hangs, waiting for the next token.
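To illustrate why the missing terminator matters: a minimal sketch (not the Azure AI SDK's actual code; function and variable names are illustrative) of how OpenAI-style SSE clients typically consume a stream. A loop like this only returns when it sees `data: [DONE]`, so the HF behavior in points 1 and 3 leaves it waiting forever:

```python
import json

def read_stream(lines):
    """Parse an OpenAI-style SSE stream, yielding content deltas until
    the 'data: [DONE]' terminator. Without that terminator the loop
    never returns and the client appears to hang."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank lines / SSE comments
        payload = line[len("data:"):].strip()  # HF omits the space after "data:"
        if payload == "[DONE]":               # OpenAI's explicit end-of-stream marker
            return
        chunk = json.loads(payload)
        if "error" in chunk:                  # HF sends validation errors as data, with HTTP 200
            raise RuntimeError(chunk["error"])
        yield chunk["choices"][0]["delta"].get("content", "")
```

For example, feeding it an OpenAI-shaped stream joins the deltas cleanly, while an HF error payload surfaces as an exception instead of being silently treated as a token.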

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Example 1: error with top_p = 0

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":50,
      "temperature":0,
      "top_p":0.0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

< HTTP/2 200
< date: Tue, 14 May 2024 22:33:46 GMT
< content-type: text/event-stream
< x-compute-type: 2-a100
< x-request-id: ...
< cache-control: no-cache
< access-control-allow-credentials: true
< vary: origin, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
< x-accel-buffering: no
< access-control-allow-origin: *
< x-compute-characters: 67
< x-sha: ...
<
data:{"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"}

OpenAI returns a response instead (see next).

Example 2: OpenAI response includes `[DONE]`

Request:

curl -X POST https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer ${OPENAI_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"gpt-3.5-turbo"}'

Response:

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" +"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" equals"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"length"}]}

data: [DONE]

Example 3: HF response is missing `[DONE]`

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0.01,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" result"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" of"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" the"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" mathematical"},"logprobs":null,"finish_reason":"length"}]}

Expected behavior

It would be great if it were possible to reuse OpenAI clients (and apps built on them) simply by pointing them at https://api-inference.huggingface.co.

While it's possible to work around the different top_p range by changing the code (if apps allow for it), the lack of a termination string makes these clients unusable.
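For the top_p range, the client-side workaround could be as simple as clamping the value into HF's exclusive range before sending the request. A hedged sketch (the epsilon value and function name are my own choices, not part of either API):

```python
def clamp_top_p(top_p: float, eps: float = 1e-3) -> float:
    """Map OpenAI's inclusive range (0.0 <= top_p <= 1.0) into the
    exclusive range HF accepts (0.0 < top_p < 1.0)."""
    return min(max(top_p, eps), 1.0 - eps)
```

This fixes difference 2 only; the missing [DONE] terminator (differences 1 and 3) can't be worked around from the client side without rewriting the stream-parsing logic itself.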
