Predict stop sequence matches during streaming #541

Open · wants to merge 5 commits into main
Conversation

@Y4hL (Contributor) commented Mar 6, 2024

When streaming with mlx_lm/server.py, we should detect potential stop sequence matches and keep generating tokens until we know there is no match. This prevents the server from sending part of a stop sequence to the client before the full match is found.
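To make the failure mode concrete, here is a small sketch (illustrative token IDs, not code from the PR) of a naive streamer that forwards each token as soon as it is generated, leaking part of the stop sequence to the client:

stop_sequence = [27, 28, 29]
generated = [5, 6, 27, 28, 29]  # model output that ends with the stop sequence

sent = []
for token in generated:
    sent.append(token)  # a naive server streams each token immediately
    if sent[-len(stop_sequence):] == stop_sequence:
        break  # match detected, but 27 and 28 were already sent

print(sent)  # [5, 6, 27, 28, 29] - the client has received the stop sequence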

Fixes #524 (Using token sequences as stop criteria does not work in mlx_lm).

My implementation adds a new function, sequence_overlap, which checks how much the end of sequence 1 overlaps with the start of sequence 2. It checks larger overlaps first and returns the overlap length as an integer.

The server checks for overlaps and, when one is found, keeps generating tokens instead of sending the buffered ones to the client:

if any(sequence_overlap(tokens, sequence) for sequence in stop_id_sequences):
    continue
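Because the loop continues without clearing the buffered tokens, they accumulate until either the overlap disappears (and they are sent) or a full stop sequence completes (and they are dropped), as the example below demonstrates.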

The sequence_overlap implementation can be tested with this example:

from typing import Sequence


def sequence_overlap(s1: Sequence, s2: Sequence) -> int:
    """
    Check how much overlap two sequences have.
        Only checks the end of s1 overlapping the start of s2

    Args:
        s1 (Sequence): The first sequence, which end is checked
        s2 (Sequence): The second sequence, which beginning is checked

    Returns:
        int: The amount of overlap between s1 and s2
    """
    # Count down from the length of the smaller list -> Checks for larger overlaps first
    for index in range(min(len(s1), len(s2)), 0, -1):
        # Check if they have index amount of overlap
        if s1[-index:] == s2[:index]:
            return index
    return 0


stop_sequence = [27, 28, 29]

tokens = []
new_tokens = []

for token in range(50):
    tokens.append(token)
    new_tokens.append(token)

    # This should always be the first check, since it needs to be performed on every token
    if len(tokens) >= len(stop_sequence) and tokens[-len(stop_sequence):] == stop_sequence:
        print("Contains stop sequence:", new_tokens)
        tokens = tokens[:-len(stop_sequence)]
        new_tokens.clear()
        break

    # Keep generating until we know the buffered tokens cannot start a stop sequence
    if sequence_overlap(tokens, stop_sequence):
        print("Found a possible start to a stop sequence:", new_tokens)
        continue

    # Process new tokens
    print("Processing new tokens:", new_tokens)
    new_tokens.clear()

# If generation ends while a partial match to a stop sequence is still buffered,
# we need to process the leftover tokens, since the loop above skipped them with continue
if new_tokens:
    print("Processing leftover new tokens", new_tokens)
    new_tokens.clear()

print("Full sequence:", tokens)

@awni (Member) commented Mar 6, 2024

Oh much better, thank you 😄

[Review thread on llms/mlx_lm/server.py (outdated, resolved)]
@awni (Member) commented Mar 14, 2024

What about adding your test (modified for unittest) as a test case in a new test file test_server.py in the tests directory: https://github.com/ml-explore/mlx-examples/tree/main/llms/tests ?

[Review thread on llms/mlx_lm/server.py (outdated, resolved)]
@Y4hL (Contributor, author) commented Mar 14, 2024

Yeah, I'll look into writing a unittest.
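For illustration, a minimal sketch of what such a test could look like (the import path and test names here are assumptions, not necessarily the test that was merged):

import unittest

# Assumed import path: the PR adds sequence_overlap to llms/mlx_lm/server.py
from mlx_lm.server import sequence_overlap


class TestSequenceOverlap(unittest.TestCase):
    def test_no_overlap(self):
        self.assertEqual(sequence_overlap([1, 2, 3], [27, 28, 29]), 0)

    def test_partial_overlap(self):
        # The end of the first sequence matches the start of the second
        self.assertEqual(sequence_overlap([1, 2, 27], [27, 28, 29]), 1)
        self.assertEqual(sequence_overlap([1, 27, 28], [27, 28, 29]), 2)

    def test_full_overlap(self):
        self.assertEqual(sequence_overlap([27, 28, 29], [27, 28, 29]), 3)


if __name__ == "__main__":
    unittest.main()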

Commits:
- Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated; generate tokens until we can confirm that the stop sequence is not met.
- Added a test for the sequence_overlap method.
@awni (Member) commented Mar 20, 2024

Hi! Is this ready to be reviewed again?

@Y4hL (Contributor, author) commented Mar 20, 2024

@awni yes
