MIN_similarity_thresholds #198

Open
wangat opened this issue Jul 4, 2023 · 1 comment

wangat commented Jul 4, 2023

Thank you for your code. I'm comparing data-cleansing libraries such as imagedups, fastdup, and imagededup. When testing imagededup, I tried the various hash methods to determine thresholds for different data. However, when testing the cnn method, I ran into some problems. Because my version of torchvision is older, I did not use vit or efficientnet and instead used the default mobilenetv3. I set different min_similarity_threshold values, ranging from 0.1 to 0.9, and no duplicate image was found (even the exact same image was used, or the duplicate image found by hash method was used). When I then set the threshold to a negative value, the reported scores were generally on the order of 1e-5. At the same time, the speed of using cnn method is particularly slow.

I'm sorry that I don't have enough time to study the code right now. Could you tell me whether one of my settings is wrong? Is it possible that my picture is too large to distinguish the dimensions? (2560×1440 / 1920×1080)

Thank you and look forward to your reply.

Collaborator

tanujjain commented Jul 28, 2023

> no duplicate image was found (even the exact same image was used, or the duplicate image found by hash method was used)

This is quite unlikely. If the exact same image is used, the same encodings are generated and the similarity score is 1.0. Could you try to reproduce the issue with some of the pictures used in testing the package here?
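As a sanity check on that reasoning, here is a minimal sketch in plain Python. It assumes the similarity score is the cosine similarity between encoding vectors; the encodings, file names, and the `find_matches` helper are hypothetical toy values for illustration and are not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_matches(encodings, threshold):
    """Return (name_a, name_b, score) for every pair scoring >= threshold."""
    names = sorted(encodings)
    matches = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            score = cosine_similarity(encodings[name_a], encodings[name_b])
            if score >= threshold:
                matches.append((name_a, name_b, score))
    return matches


# Hypothetical toy encodings: two identical vectors and one unrelated vector.
encodings = {
    "a.jpg": [0.2, 0.7, 0.1, 0.5],
    "a_copy.jpg": [0.2, 0.7, 0.1, 0.5],
    "b.jpg": [0.9, 0.0, 0.4, 0.1],
}

matches = find_matches(encodings, threshold=0.9)
for name_a, name_b, score in matches:
    print(f"{name_a} ~ {name_b}: {score:.4f}")
```

Under this assumption, identical encodings always clear any threshold up to 1.0. If identical files score near zero in a real run, that would suggest the encodings themselves differ, so the problem probably lies before the comparison step and not in the threshold value.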

> the speed of using cnn method is particularly slow

That's expected, since the cnn method requires a forward pass through a deep learning model, which is much more computationally expensive than the hashing methods available in the package. It could be much quicker if you run it on a GPU machine.

> Is it possible that my picture is too large to distinguish the dimensions?

The preprocessing steps before feeding the image to the cnn include resizing, cropping, etc., depending on the cnn network itself. If the pictures are exactly the same, the same encodings should be generated, since the preprocessing module receives the same input. However, if the images are quite different, it's possible that for large pictures the preprocessing steps cut out significant information, so the images produce wildly different encodings. For context, the mobilenetv3 used in the package was pretrained on the ImageNet-1K dataset, which has much lower-resolution images than the ones you are dealing with. So some performance degradation can be expected.
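To give a rough sense of how much a large picture can lose in preprocessing, here is a small arithmetic sketch. It assumes a typical ImageNet-style pipeline (resize the shorter side to 256 pixels, then center-crop 224×224); those two numbers and the helper name are assumptions for illustration and are not taken from the package, whose actual preprocessing may differ.

```python
def resize_then_center_crop(width, height, resize_to=256, crop=224):
    """Estimate the effect of 'resize shorter side, then center crop'.

    Returns (resized_size, downscale_factor, kept_area_fraction).
    resize_to and crop are assumed values, not the package's settings.
    """
    short_side = min(width, height)
    long_side = max(width, height)
    # The longer side is scaled by the same factor and truncated to an int.
    new_long = int(resize_to * long_side / short_side)
    if width >= height:
        resized = (new_long, resize_to)
    else:
        resized = (resize_to, new_long)
    downscale = short_side / resize_to
    kept = (crop * crop) / (resized[0] * resized[1])
    return resized, downscale, kept


# The two resolutions mentioned in the question.
for w, h in [(2560, 1440), (1920, 1080)]:
    resized, downscale, kept = resize_then_center_crop(w, h)
    print(f"{w}x{h}: resized to {resized}, downscaled {downscale:.2f}x, "
          f"crop keeps {kept:.0%} of the resized area")
```

Under these assumptions a 16:9 frame is shrunk by a factor of four to six per side and the center crop then discards more than half of what remains, so fine detail and content near the left and right edges would not reach the network.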
