Evaluation settings of INSTRUCTOR #67

Open
EliverQ opened this issue Jul 21, 2023 · 11 comments

Comments

@EliverQ

EliverQ commented Jul 21, 2023

Hello! I have a question that puzzles me. Since your model is fine-tuned with instructions, why are instructions not used during benchmark evaluations (e.g. MTEB)?

@hongjin-su
Collaborator

Hi, the instructions are included in the evaluation. Please refer to Table 1 in our paper.

@EliverQ
Author

EliverQ commented Jul 22, 2023

As far as I understand, when evaluating MTEB in your code, the following lines are used:

model = INSTRUCTOR(args.model_name, cache_folder=args.cache_dir)
evaluation = MTEB(tasks=[args.task_name], task_langs=["en"])
evaluation.run(model, output_folder=args.output_dir, eval_splits=[args.split], args=args, overwrite_results=True)

evaluation.run() calls INSTRUCTOR.encode() to encode the input sentences. However, when I print the sentences passed to INSTRUCTOR.encode() before tokenization, the corresponding task instructions do not appear to have been added to them.

I'm not sure whether my understanding and evaluation method are correct. I would greatly appreciate your help. Thank you very much.

@hongjin-su
Collaborator

Hi, could you share the script you use to print the sentences? Also, please make sure you have correctly installed the InstructorEmbedding library.

@EliverQ
Author

EliverQ commented Jul 22, 2023

I am just using the source code on GitHub:

https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L478-L565

    def encode(self, sentences,
               batch_size: int = 32,
               show_progress_bar: bool = None,
               output_value: str = 'sentence_embedding',
               convert_to_numpy: bool = True,
               convert_to_tensor: bool = False,
               device: str = None,
               normalize_embeddings: bool = False):
        """
        Computes sentence embeddings

        :param sentences: the sentences to embed
        :param batch_size: the batch size used for the computation
        :param show_progress_bar: Output a progress bar when encode sentences
        :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
        :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
        :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
        :param device: Which torch.device to use for the computation
        :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.

        :return:
           By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
        """
        self.eval()
        if show_progress_bar is None:
            show_progress_bar = False

        if convert_to_tensor:
            convert_to_numpy = False

        if output_value != 'sentence_embedding':
            convert_to_tensor = False
            convert_to_numpy = False

        input_was_string = False
        if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
            sentences = [sentences]
            input_was_string = True

        if device is None:
            device = self._target_device

        self.to(device)

        all_embeddings = []
        if isinstance(sentences[0],list):
            lengths = []
            for sen in sentences:
                lengths.append(-self._text_length(sen[1]))
            length_sorted_idx = np.argsort(lengths)
        else:
            length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
        sentences_sorted = [sentences[idx] for idx in length_sorted_idx]

        for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
            sentences_batch = sentences_sorted[start_index:start_index+batch_size]
            features = self.tokenize(sentences_batch)
            features = batch_to_device(features, device)

            with torch.no_grad():
                out_features = self.forward(features)

                if output_value == 'token_embeddings':
                    embeddings = []
                    for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                        last_mask_id = len(attention)-1
                        while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                            last_mask_id -= 1

                        embeddings.append(token_emb[0:last_mask_id+1])
                elif output_value is None:  #Return all outputs
                    embeddings = []
                    for sent_idx in range(len(out_features['sentence_embedding'])):
                        row =  {name: out_features[name][sent_idx] for name in out_features}
                        embeddings.append(row)
                else:   #Sentence embeddings
                    embeddings = out_features[output_value]
                    embeddings = embeddings.detach()
                    if normalize_embeddings:
                        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

                    # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                    if convert_to_numpy:
                        embeddings = embeddings.cpu()

                all_embeddings.extend(embeddings)

        all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]

I print sentences inside encode() at around line 512.
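
One way to check whether instructions reach encode() is to look at the shape of its input rather than the raw text. The quoted encode() branches on isinstance(sentences[0], list) and measures length from sen[1], which suggests that instruction-aware inputs arrive as [instruction, text] pairs. The helper below is a hypothetical debugging aid, not part of the InstructorEmbedding library; the name summarize_encode_input and the example instruction string are assumptions made for illustration.

```python
from collections import Counter


def summarize_encode_input(sentences):
    """Count [instruction, text] pairs versus plain strings in an encode() input.

    Hypothetical debugging helper. It mirrors the isinstance(sentences[0], list)
    check in the quoted encode(), but inspects every item instead of the first.
    """
    # encode() wraps a single string into a one-element list; do the same here.
    if isinstance(sentences, str):
        sentences = [sentences]

    counts = Counter()
    for item in sentences:
        if isinstance(item, (list, tuple)) and len(item) == 2:
            counts["pair"] += 1    # looks like [instruction, text]
        elif isinstance(item, str):
            counts["plain"] += 1   # bare sentence, no instruction attached
        else:
            counts["other"] += 1   # unexpected shape worth inspecting by hand
    return dict(counts)


if __name__ == "__main__":
    # Example instruction text is illustrative only.
    batch = [["Represent the sentence:", "hello"], "world"]
    print(summarize_encode_input(batch))
```

Calling print(summarize_encode_input(sentences)) at the top of encode() shows at a glance whether the caller attached instructions. If every item is counted as "plain", the code that calls encode() is passing bare sentences.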

@hongjin-su
Collaborator

For the case of MTEB, make sure that the library is correctly installed by following https://github.com/HKUNLP/instructor-embedding#mteb.

@EliverQ
Author

EliverQ commented Jul 23, 2023

Sorry, I tried to install this previously, but it failed with the following message:

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json

So I could only pip install mteb by following the instructions at https://github.com/embeddings-benchmark/mteb

Could this be the cause of the problem?

@hongjin-su
Collaborator

Yes, the customized mteb package must be installed for correct evaluation.
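
To see which mteb installation Python actually picks up, it can help to print where the module resolves from and which version is registered. The sketch below uses only the standard library. It assumes the customized package is importable under the name mteb, which this thread does not state explicitly; describe_package is a hypothetical helper name.

```python
import importlib.util
from importlib import metadata


def describe_package(import_name, dist_name=None):
    """Return where a module would be imported from and its installed version.

    Both values are None when the module or distribution cannot be found.
    """
    spec = importlib.util.find_spec(import_name)
    location = spec.origin if spec is not None else None
    try:
        version = metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        version = None
    return {"location": location, "version": version}


if __name__ == "__main__":
    # Assumption: the customized package installs under the import name "mteb".
    print(describe_package("mteb"))
```

The reported location shows whether the import resolves to a plain pip install or to a copy installed from the repository, which is left for the reader to judge from the path.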

@EliverQ
Author

EliverQ commented Jul 24, 2023

Thanks! I've corrected my evaluation method to use your customized mteb package. My replicated INSTRUCTOR results have improved but are still lower than yours. I still have some detailed questions:

  1. Do you consistently ignore the token embeddings of the instructions during both training and evaluation?
  2. Do you consistently use mean pooling as the pooling method during both training and evaluation?

Thank you again for your patient response.

@hongjin-su
Collaborator

Yes, we use mean pooling in both training and evaluation.
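
For readers comparing their own pooling code, here is a minimal NumPy sketch of masked mean pooling. It is not the INSTRUCTOR implementation. The optional instruction_lengths argument illustrates what "ignoring the instruction tokens" could look like; whether INSTRUCTOR does this is not confirmed in this thread, so treat that part as an assumption. All names are hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask, instruction_lengths=None):
    """Average token embeddings over the positions selected by the mask.

    token_embeddings:    array of shape (batch, seq_len, dim)
    attention_mask:      array of shape (batch, seq_len); 1 = real token, 0 = padding
    instruction_lengths: optional array of shape (batch,), the number of leading
                         tokens to exclude from the average (illustrative
                         assumption, not confirmed behaviour of INSTRUCTOR)
    """
    mask = np.asarray(attention_mask, dtype=np.float64).copy()
    if instruction_lengths is not None:
        # Zero out the first instruction_lengths[i] positions of row i.
        positions = np.arange(mask.shape[1])[None, :]
        mask[positions < np.asarray(instruction_lengths)[:, None]] = 0.0

    summed = (np.asarray(token_embeddings) * mask[:, :, None]).sum(axis=1)
    # Guard against division by zero when a row has no selected tokens.
    counts = np.clip(mask.sum(axis=1, keepdims=True), 1e-9, None)
    return summed / counts


if __name__ == "__main__":
    tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [0.0, 0.0]]])
    mask = np.array([[1, 1, 1, 0]])
    print(mean_pool(tokens, mask))                           # averages the 3 real tokens
    print(mean_pool(tokens, mask, instruction_lengths=[1]))  # skips the first token too
```

Comparing the two calls on the same batch is a quick way to see how much the choice of excluded positions changes the resulting sentence embedding.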

@ashokrajab
Contributor

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json

I suspect this issue is caused by the exit() statement at https://github.com/HKUNLP/instructor-embedding/blob/main/evaluation/MTEB/setup.py#L42
@hongjin-su, could you kindly check this?

@hongjin-su
Collaborator

Hi, please check the permissions of /tmp or /tmp/tmp73vjuhzp.
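
A small standard-library sketch for checking whether a directory such as /tmp is usable for temporary files. check_temp_dir is a hypothetical helper name; the script only reports what it finds and does not change any permissions.

```python
import os
import tempfile


def check_temp_dir(path=None):
    """Report whether a directory exists, is writable, and accepts a new file."""
    path = path or tempfile.gettempdir()
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        # Creating files in a directory needs both write and execute permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
        "can_create_file": False,
    }
    if report["exists"]:
        try:
            # Creating and deleting a real file is the most direct test.
            with tempfile.NamedTemporaryFile(dir=path):
                report["can_create_file"] = True
        except OSError:
            pass
    return report


if __name__ == "__main__":
    print(check_temp_dir("/tmp"))
```

If "exists" or "can_create_file" is False for the directory pip uses, a permissions or disk problem is a plausible cause of the [Errno 2] message. If both are True, the cause probably lies elsewhere.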

3 participants