How to use a prompt for text analysis? #145

Open
rsoika opened this issue Apr 15, 2024 · 6 comments

rsoika commented Apr 15, 2024

I am still confused about how to use the prompt with Mistral-7b-instruct when I want to analyze the content of a text, for example to summarize or categorize it.
My prompt contains a text and an instruction, and I want to extract information from the given text. This is not the often-discussed chat use case.

I am currently using the following Prompt Template for my approach:

<s>[INST] My Instruction............
[/INST]
My Text.......
................

Is this a correct way to build a prompt for text analysis with Mistral-7b-instruct? Or should I separate the context and the instruction in some other way? Note that I also do not use the EOS token here. Is this a problem?

I ask because the results vary depending on whether I put the context before or after the instruction, and I did not find a guideline for such a scenario.

I am also referring to this statement in the official documentation, which is hard to understand:

Note that <s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS) while [INST] and [/INST] are regular strings.

What does this sentence mean? What is the difference between a string and a regular string?

Thanks for any tips.


pandora-s-git commented May 6, 2024

Hi, it seems there was a misunderstanding about how the prompt works. The chat template you are using is supposed to work like this:

<s>[INST] My Instruction.... My Text.... Or other details... [/INST]

To better understand it, read it like this: [INST] => User: and [/INST] => Assistant:.
So it is the equivalent of this:

<s>User: My Instruction.... My Text.... Or other details... Assistant:
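
A minimal Python sketch of this single-turn layout may make it more concrete. The function name `build_single_turn_prompt` and the `add_bos` switch are hypothetical and not part of any Mistral library; the switch exists because many tokenizers add the BOS token themselves, in which case a literal `<s>` in the text could end up duplicated.

```python
BOS = "<s>"  # beginning-of-string marker as written in this thread


def build_single_turn_prompt(instruction: str, document: str, add_bos: bool = True) -> str:
    """Place the instruction and the document inside one [INST] ... [/INST] block."""
    # Everything the "user" says (instruction plus text) goes between the two markers.
    user_turn = f"{instruction.strip()}\n\n{document.strip()}"
    prompt = f"[INST] {user_turn} [/INST]"
    # Assumption: set add_bos=False if your tokenizer already prepends BOS.
    return BOS + prompt if add_bos else prompt


prompt = build_single_turn_prompt(
    "Summarize the following invoice.",
    "Invoice no. 1, total 82.50 EUR",
)
print(prompt)
```

The model's answer is then expected to follow directly after the closing [/INST].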

I hope this answers your doubts!


rsoika commented May 6, 2024

@pandora-s-git Thanks for your answer!

But I still do not fully understand the usage of [INST] and <s>.

Let me explain what I have done so far. My goal is to analyze and extract content from business documents, and Mistral-7B-Instruct has done a really good job so far!

My first prompt template looks like this:

<s>
[INST] You are a clerk in a logistics company and your job is to check incoming invoices. [/INST]		

<FILECONTEXT>
	
</s>
[INST] Briefly summarize the data in this invoice document:

- Company
- Payment data (with invoice amount, bank details, payment term)
- Invoice items (with name, quantity, price and tax)

Don't calculate any amounts yourself, but only accept amounts that actually appear in the invoice document!

[/INST]

<FILECONTEXT> is the place where I insert the OCR-extracted text from an invoice. This text looks terrible: it contains newlines, tabs, and spaces everywhere. But the result generated by Mistral-7B-Instruct is clear and well structured:

1. Company Data:
- Invoice is issued by CXA CGM, a French logistics company with VAT number FRXXXXX.
- The invoice is for AlexXXXX GmbH in Germany, with VAT number DExxxxx.
- The services were provided by Simone xxxxx in Germany.

2. Payment Data:
- Total amount due: 82.50 EUR
- Payment should be made to XX & YYY BV, with IBAN DExxxxxxx and SWIFT code DEUTxxx.
- Payments should be sent to the account of Deutsche Bank xxxx, Germany.

3. Service Items:
- There are two service items listed on the invoice, each with a container number, size/type, billable days, and tax information.
- Container TCxxxx (20ST) had 1 calendar day of billable time and a rate of 41.25 EUR per day, resulting in an invoiced amount of 41.25 EUR before tax.
- Container TRyyyyy (20ST) had 1 calendar day of billable time and a rate of 41.25 EUR per day, resulting in an invoiced amount of 41.25 EUR before tax.
- The total excluding tax for both containers is 82.50 EUR, and the total including tax is also 82.50 EUR.
- There is no VAT charge listed on the invoice.

Would you say that the placement of the <s> and [INST] tags makes sense in my prompt? It was all guesswork, as I do not understand the difference between a string and a regular string that the documentation mentions.

I then generate much more complex prompts from the first prompt result, using much more additional information and few-shot learning. But this is hard to construct when you do not understand how mistral-7b treats the different kinds of prompt tokens. For example: is it allowed to use <s> multiple times in one prompt?


pandora-s-git commented May 6, 2024

This is close. A "string" here means something like a full completion.

Let me try to re-explain what everything means.

As you may know, LLMs are text completion machines, so they only complete text.

We fine-tuned them to complete dialogs so that we can chat with them; these are the Instruct versions.

Basically, we need to train them on something similar to a chat, like:

User: User Question.
Assistant: Assistant Response.
...

But to make the keywords more unique, we renamed them to specific strings: here [INST] marks the start of the user's instruction, and [/INST] marks the end of the instruction and therefore the start of the assistant's response.

However, the model needs a token that lets us know when it has finished, because otherwise it will just keep trying to complete the text. That is the EOS token (End Of String), and there is also one for the beginning (BOS).

Here the BOS is <s> and the EOS is </s>. Basically, a string in this context is a full completion/exchange with the assistant, up to the point where the model considers the completion finished.

I think it is better for your use case to have something like this:

<s>[INST] Here is a text:
{Your document}

And then the explanation + instruction
 [/INST]

The model will then complete the text and finish with an EOS (if it doesn't, maybe your code removes it by default).

Basically, a dialog with an LLM should look like this:
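
If your inference code returns the raw decoded text, a small helper can cut the completion at the EOS marker. This is only a sketch under that assumption; the name `truncate_at_eos` is hypothetical, and many inference stacks already stop on the EOS token and strip it, in which case the marker never shows up in the text.

```python
from typing import Tuple

EOS = "</s>"  # end-of-string marker as written in this thread


def truncate_at_eos(completion: str, eos: str = EOS) -> Tuple[str, bool]:
    """Return the text before the first EOS marker and whether a marker was found."""
    head, separator, _rest = completion.partition(eos)
    # separator is empty when the marker does not occur in the completion.
    return head.strip(), bool(separator)


answer, finished = truncate_at_eos("Total amount due: 82.50 EUR</s>[INST] ignored")
print(answer, finished)
```

A `False` flag can be a hint that generation was cut off by a length limit before the model considered its answer finished.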

<s>[INST] Hi [/INST] Hi, how can I help you?</s>[INST] Tell me a joke [/INST] (and the model continues here)

There are chat templates you can use directly. It is better to use the Jinja template that is available in the tokenizer, for example.

(Check the 'mistral-common' repo, it should help you.)


rsoika commented May 6, 2024

OK, thanks a lot! This really helps me to understand mistral-7b-instruct much better. And now I am beginning to make progress :-)
Your explanation should become part of the official documentation of the chat template.


pandora-s-git commented May 6, 2024

I'm glad you liked it. Here is a chat template that may be of use to you:

"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"

If you don't know how to read this, look up how Jinja templates work and you should find some clues.
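
For readers who prefer plain code over Jinja, the same logic can be sketched in Python. The function `render_chat` is a hypothetical stand-in written for illustration, not the tokenizer's own implementation; it follows the template literally, including its whitespace (no space is inserted between [/INST] and the assistant text).

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]],
                bos_token: str = "<s>",
                eos_token: str = "</s>") -> str:
    """Render alternating user/assistant messages the way the template above does."""
    parts = [bos_token]
    for index, message in enumerate(messages):
        role = message["role"]
        # Same check as the template: even positions are user turns, odd ones assistant turns.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        if role == "user":
            parts.append("[INST] " + message["content"] + " [/INST]")
        elif role == "assistant":
            # Every finished assistant turn is closed with the EOS token.
            parts.append(message["content"] + eos_token)
        else:
            raise ValueError("Only user and assistant roles are supported!")
    return "".join(parts)


rendered = render_chat([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hi, how can I help you?"},
    {"role": "user", "content": "Tell me a joke"},
])
print(rendered)
```

Few-shot examples can be expressed the same way: each example becomes a user message followed by the desired assistant message, and the real request is the final user message.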


rsoika commented May 9, 2024

I just want to share my final prompt templates that I use now to summarize the content of complex business documents.

Categorize Document

First, I use a template that only extracts the company name and the language of the document.

<s>[INST] You are a clerk in a ... company and your job is to check business documents. Extract the language the document is written in and the company name of the sender.

Output the information in a JSON object. Create only the json object. Do not provide explanations or notes.

Example JSON Object:[/INST]
{
  "invoice.language": "English",
  "cdtr.name": "Kraxi GmbH",
}
</s>
[INST]

{My document content}

[/INST]
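
Since this template asks for a JSON object only, the reply can be parsed on the application side. The following is a minimal sketch, assuming the reply arrives as plain text that may still carry whitespace or a trailing `</s>`; the helper name `parse_json_reply` is hypothetical.

```python
import json
from typing import Any, Dict


def parse_json_reply(reply: str) -> Dict[str, Any]:
    """Extract and parse the first-to-last brace span of a model reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(reply[start:end + 1])


reply = ' {\n  "invoice.language": "English",\n  "cdtr.name": "Kraxi GmbH"\n}</s>'
parsed = parse_json_reply(reply)
print(parsed["cdtr.name"])
```

Note that `json.loads` is strict: a trailing comma after the last entry, for example, is rejected, so it is worth handling the error case explicitly.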

Summarize Document

Next I use the following template to summarize the data of the document.

<s>[INST] You are a clerk in a ... company and your job is to check business documents.....[/INST]

{My document content}

</s>
[INST] Briefly summarize this  document ......
.......
[/INST]

The results are very good now. Of course, every word in the instruction counts!
