Evaluation settings of INSTRUCTOR #67

EliverQ · 2023-07-21T06:06:35Z

Hello! I have a very puzzling question that I would like to ask. Since your model is fine-tuned with instructions, why not use instructions during benchmark evaluations (e.g. MTEB)?

hongjin-su · 2023-07-22T09:59:08Z

Hi, the instructions are included in the evaluation. You may refer to Table 1 in our paper.

EliverQ · 2023-07-22T10:49:15Z

As far as I understand, when evaluating MTEB in your code, the following lines are used:

model = INSTRUCTOR(args.model_name, cache_folder=args.cache_dir)
evaluation = MTEB(tasks=[args.task_name], task_langs=["en"])
evaluation.run(model, output_folder=args.output_dir, eval_splits=[args.split], args=args, overwrite_results=True)

During the execution of evaluation.run(), it utilizes INSTRUCTOR.encode() to encode the input sentences. However, when I print the sentences passed to INSTRUCTOR.encode() before tokenization, it appears that the corresponding task instructions are not added to these sentences.
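For context, the encode() implementation in instructor.py checks isinstance(sentences[0], list) and reads the text from index 1 of each item, so inputs that carry instructions should arrive as [instruction, text] pairs rather than plain strings. Below is a minimal sketch of a check for that input shape. Both helper functions are hypothetical (they are not part of InstructorEmbedding or MTEB), and the instruction string is only a placeholder.

```python
from typing import List, Sequence, Union

Sentence = Union[str, List[str]]


def has_instruction_pairs(sentences: Sequence[Sentence]) -> bool:
    """Return True if every input looks like an [instruction, text] pair."""
    return len(sentences) > 0 and all(
        isinstance(s, list) and len(s) == 2 for s in sentences
    )


def pair_with_instruction(instruction: str, texts: Sequence[str]) -> List[List[str]]:
    """Build [instruction, text] pairs (hypothetical helper for illustration)."""
    return [[instruction, text] for text in texts]


raw = ["A sentence to embed.", "Another sentence."]
# Placeholder instruction; the real per-task instructions are not reproduced here.
paired = pair_with_instruction("Represent the sentence:", raw)

print(has_instruction_pairs(raw))     # plain strings: no instructions attached
print(has_instruction_pairs(paired))  # [instruction, text] pairs
```

If the check returns False for the inputs reaching encode(), the evaluation is most likely running without instructions.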

I'm not sure if my understanding and evaluation method are correct. I would greatly appreciate it if you could provide me with answers. Thank you very much.

hongjin-su · 2023-07-22T11:42:08Z

Hi, could you share the script you use to print out the sentences? Also, make sure you have correctly installed the InstructorEmbedding library.

EliverQ · 2023-07-22T13:43:40Z

I just use the source code on GitHub:

https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L478-L565

    def encode(self, sentences,
               batch_size: int = 32,
               show_progress_bar: bool = None,
               output_value: str = 'sentence_embedding',
               convert_to_numpy: bool = True,
               convert_to_tensor: bool = False,
               device: str = None,
               normalize_embeddings: bool = False):
        """
        Computes sentence embeddings

        :param sentences: the sentences to embed
        :param batch_size: the batch size used for the computation
        :param show_progress_bar: Output a progress bar when encode sentences
        :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
        :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
        :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
        :param device: Which torch.device to use for the computation
        :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.

        :return:
           By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
        """
        self.eval()
        if show_progress_bar is None:
            show_progress_bar = False

        if convert_to_tensor:
            convert_to_numpy = False

        if output_value != 'sentence_embedding':
            convert_to_tensor = False
            convert_to_numpy = False

        input_was_string = False
        if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
            sentences = [sentences]
            input_was_string = True

        if device is None:
            device = self._target_device

        self.to(device)

        all_embeddings = []
        if isinstance(sentences[0],list):
            lengths = []
            for sen in sentences:
                lengths.append(-self._text_length(sen[1]))
            length_sorted_idx = np.argsort(lengths)
        else:
            length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
        sentences_sorted = [sentences[idx] for idx in length_sorted_idx]

        for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
            sentences_batch = sentences_sorted[start_index:start_index+batch_size]
            features = self.tokenize(sentences_batch)
            features = batch_to_device(features, device)

            with torch.no_grad():
                out_features = self.forward(features)

                if output_value == 'token_embeddings':
                    embeddings = []
                    for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                        last_mask_id = len(attention)-1
                        while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                            last_mask_id -= 1

                        embeddings.append(token_emb[0:last_mask_id+1])
                elif output_value is None:  #Return all outputs
                    embeddings = []
                    for sent_idx in range(len(out_features['sentence_embedding'])):
                        row =  {name: out_features[name][sent_idx] for name in out_features}
                        embeddings.append(row)
                else:   #Sentence embeddings
                    embeddings = out_features[output_value]
                    embeddings = embeddings.detach()
                    if normalize_embeddings:
                        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

                    # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                    if convert_to_numpy:
                        embeddings = embeddings.cpu()

                all_embeddings.extend(embeddings)

        all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]

I print the sentences inside encode() at around line 512.
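An alternative to editing the library source is to wrap encode() from the outside. The sketch below assumes only that the model object exposes an encode(sentences, ...) method and that batches are lists or tuples. log_encode_inputs and DummyModel are hypothetical names used for illustration; DummyModel stands in for a real model so the snippet runs without downloading anything.

```python
import functools


def log_encode_inputs(model, max_items: int = 3):
    """Wrap model.encode so it prints the first few inputs it receives."""
    original_encode = model.encode

    @functools.wraps(original_encode)
    def wrapped(sentences, *args, **kwargs):
        # A bare string is a single input; otherwise preview a slice of the batch.
        if isinstance(sentences, str):
            preview = [sentences]
        else:
            preview = list(sentences[:max_items])
        for item in preview:
            print("encode() input:", repr(item))
        return original_encode(sentences, *args, **kwargs)

    model.encode = wrapped
    return model


class DummyModel:
    """Stand-in for a real embedding model; returns one zero per input."""

    def encode(self, sentences, **kwargs):
        return [0.0 for _ in sentences]


model = log_encode_inputs(DummyModel())
result = model.encode(
    [["Represent the sentence:", "hello"], ["Represent the sentence:", "hi"]]
)
```

With a real model, the printed items show directly whether each input is a plain string or an [instruction, text] pair.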

hongjin-su · 2023-07-23T02:08:41Z

For the case of MTEB, make sure that the library is correctly installed by following https://github.com/HKUNLP/instructor-embedding#mteb.

EliverQ · 2023-07-23T07:33:41Z

Sorry, I tried to install this previously, but it failed with the following message:

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json

So I could only pip install mteb by following the instructions at https://github.com/embeddings-benchmark/mteb

Maybe this is the reason for the problem?

hongjin-su · 2023-07-24T04:14:56Z

Yes, the customized mteb package needs to be installed for correct evaluation.
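To see which mteb the interpreter would actually import, one option is to inspect its location and distribution metadata. This is a minimal sketch using only the standard library; describe_package is a hypothetical helper. How the customized package differs in its reported version is not stated in this thread, so the import path is the more useful signal, and even that only helps when the two copies live in different places.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def describe_package(name: str) -> Optional[dict]:
    """Return the import location and installed version of a top-level package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not importable in this environment
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # importable, but no distribution metadata under this name
    return {"origin": spec.origin, "version": version}


# Prints None if mteb is not installed, otherwise its file location and version.
print(describe_package("mteb"))
```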

EliverQ · 2023-07-24T08:31:35Z

Thanks! I've corrected my evaluation method by using your customized mteb package. My replicated INSTRUCTOR results have improved, but they are still lower than yours. I still have two detailed questions:

Do you consistently ignore the token embeddings of the instructions during both training and evaluation?
Do you consistently use mean pooling as the pooling method during both training and evaluation?

Thank you again for your patient response.

hongjin-su · 2023-07-28T13:47:34Z

Yes, we use mean pooling in both the training and evaluation processes.
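To make the two questions concrete, here is a minimal pure-Python sketch of masked mean pooling. The reply above confirms mean pooling but does not address the instruction-token question, so the num_instruction_tokens parameter is a hypothetical illustration of what excluding instruction tokens would look like, not a description of the INSTRUCTOR implementation.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    num_instruction_tokens: int = 0,
) -> List[float]:
    """Average token vectors where the mask is 1, skipping leading instruction tokens."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for position, (vector, keep) in enumerate(zip(token_embeddings, attention_mask)):
        if keep == 0 or position < num_instruction_tokens:
            continue  # padding or instruction token: excluded from the average
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        raise ValueError("no tokens left to pool")
    return [value / count for value in total]


tokens = [[1.0, 1.0], [3.0, 5.0], [5.0, 9.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]  # last position is padding

print(masked_mean_pool(tokens, mask))                            # [3.0, 5.0]
print(masked_mean_pool(tokens, mask, num_instruction_tokens=1))  # [4.0, 7.0]
```

The two outputs differ, which shows why a mismatch in how instruction tokens are pooled between training and evaluation could change the resulting embeddings.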

ashokrajab · 2023-08-04T10:10:01Z

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json

I guess this issue might be caused by the exit() statement at https://github.com/HKUNLP/instructor-embedding/blob/main/evaluation/MTEB/setup.py#L42
@hongjin-su, could you kindly check this?

hongjin-su · 2023-12-19T12:09:43Z

Hi, you may check the permissions of /tmp or /tmp/tmp73vjuhzp.
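One way to run that check from Python is sketched below, using only the standard library. check_writable is a hypothetical helper. It tests whether a file can actually be created in the directory, which may not explain the original pip error but does rule permissions in or out.

```python
import os
import tempfile


def check_writable(directory: str) -> bool:
    """Return True if a file can actually be created and removed in `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # Creating a real file catches cases that os.access alone can miss.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True


tmp_dir = tempfile.gettempdir()
print(tmp_dir, "writable:", check_writable(tmp_dir))
# Permission bits, including the sticky bit that /tmp usually carries.
print("mode:", oct(os.stat(tmp_dir).st_mode & 0o7777))
```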
