A possible solution to solve KeyError: 'xxx.txt' when using “clip-retrieval inference” command #352

ShuxunoO · 2024-01-30T08:22:52Z

I hit the same error as #345 when using the clip-retrieval inference command to extract features for images and their corresponding texts. My command was:

clip-retrieval inference \
--input_dataset /path/to/local/img-txt dataset \
--output_folder /path/to/local/embeddings \
--input_format files \
--enable_text True \
--enable_image True \
--clip_model open_clip:ViT-L-14//path/to/local/model.pt

My local directory structure is as follows:

/xxx/BAYC
        BoredApeYachtClub_0.png   BoredApeYachtClub_0.txt   
        BoredApeYachtClub_11.png   BoredApeYachtClub_11.txt
        BoredApeYachtClub_12.png   BoredApeYachtClub_12.txt
        BoredApeYachtClub_13.png  BoredApeYachtClub_13.txt
        BoredApeYachtClub_16.png   BoredApeYachtClub_16.txt
        BoredApeYachtClub_17.png  BoredApeYachtClub_17.txt
                                ……

and the output traceback is:

Traceback (most recent call last):
File "/xxx/anaconda3/envs/it-retrieval/bin/clip-retrieval", line 8, in <module>
sys.exit(main())
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/cli.py", line 18, in main
fire.Fire(
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/main.py", line 155, in main
distributor()
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/distributor.py", line 17, in __call__
worker(
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/worker.py", line 127, in worker
runner(task)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/runner.py", line 39, in __call__
batch = iterator.__next__()
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/reader.py", line 225, in __iter__
for batch in self.dataloader:
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/site-packages/clip_retrieval/clip_inference/reader.py", line 99, in __getitem__
image_file = self.image_files[key]
KeyError: 'BoredApeYachtClub_0.txt'

Traceback (most recent call last):0
File "<string>", line 1, in <module>
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/xxx/anaconda3/envs/it-retrieval/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

——————————————————————————————————————————————————————————

def __getitem__(self, ind):
  key = self.keys[ind]
  output = {}

  if self.enable_image:
    image_file = self.image_files[key]
    try:
      image_tensor = self.image_transform(Image.open(image_file))
                      ……

After some analysis, I think the problem is that "key" still carries the ".txt" suffix when it reaches the __getitem__ code above (reader.py, line 99 in the traceback), so the lookup in the image dictionary fails. The image dictionary only contains entries for files with the image extensions listed in the source code: ".png", ".jpg", ".jpeg", ".bmp", ".webp", ".PNG", ".JPG", ".JPEG", ".BMP", ".WEBP".
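The mismatch can be reproduced in isolation. The snippet below is only a minimal sketch, not the library's code: the file names are hypothetical stand-ins for the BAYC layout, and it assumes the key passed to __getitem__ was taken from the text-file dictionary.

```python
from pathlib import PurePosixPath

# Hypothetical file names mirroring the BAYC layout above.
names = ["BoredApeYachtClub_0.png", "BoredApeYachtClub_0.txt"]

# Keys that keep the suffix: an image and its caption never share a key.
image_files = {n: n for n in names if n.endswith(".png")}
text_files = {n: n for n in names if n.endswith(".txt")}

key = next(iter(text_files))  # a key that ends in ".txt"
try:
    image_files[key]
    missing = None
except KeyError as err:
    missing = err.args[0]  # the text-file key is absent from image_files

# Keys with the suffix stripped: both files map to the same key.
image_by_stem = {PurePosixPath(n).with_suffix("").as_posix(): n for n in image_files}
text_by_stem = {PurePosixPath(n).with_suffix("").as_posix(): n for n in text_files}
shared = sorted(image_by_stem.keys() & text_by_stem.keys())
```

With suffixed keys the lookup raises KeyError on the ".txt" name, as in the traceback. With suffix-free keys the image and its caption meet under one key.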

More specifically, the function folder_to_keys(folder, enable_text=True, enable_image=True, enable_metadata=False) keeps the file suffix in the keys when it builds the "text_files", "image_files", and "metadata_files" dictionaries. The keys should instead be the relative path with the suffix removed, so that an image and its text file map to the same key. Here is my modified version of the code:

from pathlib import Path


def folder_to_keys(folder, enable_text=True, enable_image=True, enable_metadata=False):
    """returns a list of keys from a folder of images and text"""
    path = Path(folder)
    text_files = None
    metadata_files = None
    image_files = None
    if enable_text:
        text_files = [*path.glob("**/*.txt")]
        # key = path relative to the root folder, with the suffix removed
        text_files = {text_file.relative_to(path).with_suffix('').as_posix(): text_file for text_file in text_files}
    if enable_image:
        image_files = [
            *path.glob("**/*.png"),
            *path.glob("**/*.jpg"),
            *path.glob("**/*.jpeg"),
            *path.glob("**/*.bmp"),
            *path.glob("**/*.webp"),
            *path.glob("**/*.PNG"),
            *path.glob("**/*.JPG"),
            *path.glob("**/*.JPEG"),
            *path.glob("**/*.BMP"),
            *path.glob("**/*.WEBP"),
        ]
        image_files = {image_file.relative_to(path).with_suffix('').as_posix(): image_file for image_file in image_files}
    if enable_metadata:
        metadata_files = [*path.glob("**/*.json")]
        metadata_files = {metadata_file.relative_to(path).with_suffix('').as_posix(): metadata_file for metadata_file in metadata_files}

    keys = None

    def join(new_set):
        return new_set & keys if keys is not None else new_set

    if enable_text:
        keys = join(text_files.keys())
    if enable_image:
        keys = join(image_files.keys())
    if enable_metadata:
        keys = join(metadata_files.keys())

    keys = list(sorted(keys))

    return keys, text_files, image_files, metadata_files
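The keying scheme used in the modified function can be checked on a throwaway directory. This is only a sketch: stem_key is a hypothetical helper written for this example (it is not part of clip-retrieval), and the file names are made up.

```python
import tempfile
from pathlib import Path


def stem_key(file_path: Path, root: Path) -> str:
    """Hypothetical helper: path relative to root, without the suffix."""
    return file_path.relative_to(root).with_suffix("").as_posix()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Two image/text pairs (one in a subfolder) and one unpaired text file.
    for name in ["a.png", "a.txt", "sub/b.png", "sub/b.txt", "c.txt"]:
        (root / name).touch()

    text_files = {stem_key(p, root): p for p in root.glob("**/*.txt")}
    image_files = {stem_key(p, root): p for p in root.glob("**/*.png")}

    # Only keys present in both dictionaries survive the intersection.
    keys = sorted(text_files.keys() & image_files.keys())
```

Here the unpaired "c.txt" is dropped and the two pairs remain. One caveat of this scheme: two images that differ only by extension (for example "a.png" and "a.jpg") end up under the same key, so only one of them is kept.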

After this change, inference ran without errors and I obtained the feature vectors for both the images and the texts.

I hope this helps anyone who runs into the same error!


rom1504 · 2024-01-30T08:54:42Z

Can you read #329 and propose a fix that makes this work without breaking what that PR fixed?

ShuxunoO · 2024-01-30T09:29:38Z

> Can you read #329 and propose a fix that makes this work without breaking what that PR fixed?

Sure. Here are my local folder layout and the command-line output:

This is expected, because the code uses paths relative to the root directory as keys, so every dictionary key includes its subdirectories, however deeply nested.

text_files = {text_file.relative_to(path).as_posix(): text_file for text_file in text_files}
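To illustrate why relative-path keys matter for nested folders, here is a small sketch with hypothetical paths ("/data", "cats", "dogs" are made up for this example). It compares keying by bare file stem with keying by suffix-free relative path.

```python
from pathlib import PurePosixPath

root = PurePosixPath("/data")  # hypothetical dataset root
files = [root / "cats/0.txt", root / "dogs/0.txt"]

# Keyed by bare stem: both files collapse onto the key "0".
by_stem = {f.stem: f for f in files}

# Keyed by relative path without the suffix: the subdirectory keeps them apart.
by_rel = {f.relative_to(root).with_suffix("").as_posix(): f for f in files}
```

Stripping only the suffix, rather than reducing the path to its stem, keeps same-named files in different subfolders distinct, which is the property a fix should preserve.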

ShuxunoO · 2024-01-31T01:59:09Z

> Can you read #329 and propose a fix that makes this work without breaking what that PR fixed?

Should I open another PR?

ShuxunoO mentioned this issue Jan 30, 2024

Update reader.py #353

Open

ShuxunoO changed the title ~~A possible solution to solve KeyError: 'xxx_0.txt' when using “clip-retrieval inference” command~~ A possible solution to solve KeyError: 'xxx.txt' when using “clip-retrieval inference” command Jan 30, 2024

ShuxunoO mentioned this issue Jan 31, 2024

issue with inference #345

Open

ChesonHuang mentioned this issue Apr 23, 2024

Delete OFA-Sys/Chinese-CLIP#304

Closed
