feat(server): near-duplicate detection #8228

Merged
zackpollard merged 49 commits into main from feat/duplicate-detection on May 16, 2024


mertalev
Contributor

@mertalev mertalev commented Mar 23, 2024

Description

This PR adds a new job to detect duplicate assets and aggregate them with a new duplicateId column. This PR only implements the backend for duplicate detection. It does not expose the results in the UI or take any actions relating to the assets: this is left for future work.

The data model is such that each (duplicateId, assetId) pair uniquely identifies a duplicate asset and each duplicateId can have many associated assets.
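To illustrate the data model described above, here is a minimal sketch of grouping (duplicateId, assetId) pairs by duplicateId. The `DuplicateRow` shape and the `groupByDuplicateId` helper are hypothetical names used only for illustration; they are not taken from the PR.

```typescript
// Hypothetical row shape: one row per (duplicateId, assetId) pair.
interface DuplicateRow {
  duplicateId: string;
  assetId: string;
}

// Collect asset IDs under the duplicateId they share.
function groupByDuplicateId(rows: DuplicateRow[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { duplicateId, assetId } of rows) {
    const assets = groups.get(duplicateId) ?? [];
    assets.push(assetId);
    groups.set(duplicateId, assets);
  }
  return groups;
}
```

Each key of the returned map is one duplicate group, and its value lists the assets that belong to it.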

To do:

  • Handle edge case where multiple duplicateIds exist among the found duplicates
  • Better handling of concurrency
    • Disabled concurrency to avoid race conditions and improve accuracy
  • Confirm correctness of the results
  • Tune default threshold
  • Add migration
  • Add tests
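The first to-do item (several duplicateIds among the found duplicates) could be approached roughly as follows. This is only a sketch under assumptions, not the PR's implementation: `Match`, `MergePlan`, and `planMerge` are hypothetical names, and the policy of keeping the first existing ID is an arbitrary choice.

```typescript
// Hypothetical shape of one search hit: the matched asset and its current group, if any.
interface Match {
  assetId: string;
  duplicateId: string | null;
}

interface MergePlan {
  targetId: string; // the duplicateId every involved asset should end up with
  idsToMerge: string[]; // other existing duplicateIds to fold into targetId
  assetIds: string[]; // ungrouped assets to assign to targetId directly
}

// Decide how to reconcile the groups found for `assetId`.
// `newId` is injected so the caller chooses how fresh IDs are generated.
function planMerge(assetId: string, matches: Match[], newId: () => string): MergePlan {
  const existing = Array.from(
    new Set(matches.map((m) => m.duplicateId).filter((id): id is string => id !== null)),
  );
  // Reuse the first existing group if there is one, otherwise start a new group.
  const targetId = existing.length > 0 ? existing[0] : newId();
  return {
    targetId,
    idsToMerge: existing.slice(1),
    assetIds: [assetId, ...matches.filter((m) => m.duplicateId === null).map((m) => m.assetId)],
  };
}
```

A caller would then update every asset whose duplicateId is in `idsToMerge`, plus the assets in `assetIds`, to `targetId`, ideally in one transaction.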

Implements #1968

How Has This Been Tested?

Tested by running the new job on all assets through the job panel and inspecting logs to confirm that some assets have duplicates.

Tested that the duplicates displayed in the web view are actually near-duplicates.

Tested that changing the duplicate threshold changes the strictness of the results.


cloudflare-pages bot commented Mar 23, 2024

Deploying immich with Cloudflare Pages

Latest commit: 95eac75
Status: ✅  Deploy successful!
Preview URL: https://64b0401e.immich.pages.dev
Branch Preview URL: https://feat-duplicate-detection.immich.pages.dev


@mertalev mertalev force-pushed the feat/duplicate-detection branch 2 times, most recently from 3bf6521 to 788d476 Compare April 20, 2024 19:52
@alextran1502
Contributor

Can I help with anything regarding this PR? I am happy to work on the UI.

@mertalev
Contributor Author

I cleaned it up so the backend part is essentially good to go (might need to adjust the response if you want them to be grouped by duplicates and not just sorted). The UI... has a lot of room for improvement haha. It'd be great if you could help with that 😄

@klejejs
Contributor

klejejs commented Apr 28, 2024

In terms of UI, would it make sense if photo stacks were automatically created for near-duplicate photos? It's something that the App-Which-Must-Not-Be-Named introduced a while ago and I personally find it very useful.

@mertalev
Contributor Author

The idea right now is to have the duplicates displayed in a dedicated page where there are options to convert them to stacks or deduplicate based on some criteria, but otherwise treat them as separate assets until the user elects to do this. Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

@AngelaDMerkel

Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

Is it possible to have some kind of notification so that the user can interact with stacking? It's nice when images are stacked automatically, but I find that this sometimes occurs erroneously and I'd like to at least know when a stack has been made.

@mertalev
Contributor Author

mertalev commented May 2, 2024

A notification in the web UI would be straightforward. But auto-stacking would be a later addition, so discussion on that is a bit out of scope for this PR.

@PathToLife

PathToLife commented May 4, 2024

Thanks for working on this! It's awesome to read through the implementation here.

Just wanted to add my results: the 0.02 threshold (barely) didn't detect duplicates in my resized image test case.
I added some console logs and set the max distance to 1.0, only to see 0.02084 as the distance 😭🤦‍♂️

Is it perhaps worth working on a multi-algorithm implementation? Experimentation shows pHash excels at matching resized images. I can try to find some time to help add this - please do let me know.

"duplicateId": "79b47ba0-28e5-479b-b745-9f2885299077",
"assetId": "c8c4d6e9-8d72-4e55-be22-20b9019ffca6",
"distance": 0.020842433

I was testing resized images here (download to reproduce):

images-dupe-test.zip

// Debug variant of searchDuplicates: the threshold is overridden and the results are logged.
async searchDuplicates({
    assetId,
    embedding,
    maxDistance,
    userIds,
  }: AssetDuplicateSearch): Promise<AssetDuplicateResult[]> {
    maxDistance = 1; // debug override: accept every candidate so its distance shows up in the logs
    this.logger.warn('searching duplicates', { assetId, maxDistance, userIds });
    // Inner query: the 64 nearest visible assets owned by the given users, ordered by distance (<=>).
    const cte = this.assetRepository.createQueryBuilder('asset');
    cte
      .select('search.assetId', 'assetId')
      .addSelect('asset.duplicateId', 'duplicateId')
      .addSelect(`search.embedding <=> :embedding`, 'distance')
      .innerJoin('asset.smartSearch', 'search')
      .where('asset.ownerId IN (:...userIds )')
      .andWhere('asset.id != :assetId')
      .andWhere('asset.isVisible = :isVisible')
      .orderBy('search.embedding <=> :embedding')
      .limit(64)
      .setParameters({ assetId, embedding: asVector(embedding), isVisible: true, userIds });

    // Outer query: keep only the candidates within maxDistance.
    const builder = this.assetRepository.manager
      .createQueryBuilder()
      .addCommonTableExpression(cte, 'cte')
      .from('cte', 'res')
      .select('res.*')
      .where('res.distance <= :maxDistance', { maxDistance });

    // The promise is already awaited here, so cast to the array type rather than to a Promise.
    const results = (await builder.getRawMany()) as AssetDuplicateResult[];

    this.logger.warn('found duplicates', { results });

    return results;
  }


@mertalev
Contributor Author

mertalev commented May 4, 2024

The duplicate threshold will be exposed in the admin settings. I was debating between defaulting to 0.02 or 0.03, so maybe 0.03 is the better default after all.
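For readers unfamiliar with how such a threshold is applied, here is a minimal sketch. It assumes the `<=>` operator in the query above is a cosine-distance operator (as it is in pgvector); `cosineDistance` and `isDuplicate` are hypothetical helper names, not part of the PR.

```typescript
// Cosine distance: 1 - cos(angle between a and b). 0 means the same direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('embeddings must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An asset counts as a duplicate when its distance is within the threshold,
// mirroring the `res.distance <= :maxDistance` condition in the query.
function isDuplicate(distance: number, maxDistance: number): boolean {
  return distance <= maxDistance;
}
```

With the distance reported earlier in the thread (0.020842433), `isDuplicate` returns false for a threshold of 0.02 and true for 0.03.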

@NicholasFlamy
Contributor

In terms of UI, would it make sense if photo stacks were automatically created for near-duplicate photos? It's something that the App-Which-Must-Not-Be-Named introduced a while ago and I personally find it very useful.

This

Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

Is it possible to have some kind of notification so that the user can interact with stacking? It's nice when images are stacked automatically, but I find that this sometimes occurs erroneously and I'd like to at least know when a stack has been made.

And this will be separate. Once this is implemented, these features can be worked on, so they will basically be built on the code in this feature.

@NicholasFlamy
Contributor

NicholasFlamy commented May 7, 2024

I would suggest taking some inspiration from Samsung Gallery for the UI. When you hit delete duplicates, it selects all but one of each of the duplicates (so if there are 3 it will select 2). I'm pretty sure it takes date modified or something, and if they are different resolutions or file sizes it selects the lower resolution or file size. Then you can hit delete with all of the duplicates selected.

I think implementing something similar wouldn't be too difficult, and it doesn't even have to select for you, but the side-by-side view is the most important thing. Having a button to select duplicates which could prefer selecting the lower resolution/file size would be an added bonus.
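The selection behavior described above could look roughly like this. It is a sketch under assumptions: the `DuplicateAsset` fields and the `selectForDeletion` helper are hypothetical, and the ranking (pixel count first, file size as a tie-breaker) is just one possible policy.

```typescript
// Hypothetical minimal view of an asset inside one duplicate group.
interface DuplicateAsset {
  id: string;
  width: number;
  height: number;
  fileSizeInBytes: number;
}

// Keep the asset with the most pixels (ties broken by larger file size)
// and return the IDs of all the others as deletion candidates.
function selectForDeletion(group: DuplicateAsset[]): string[] {
  if (group.length < 2) {
    return []; // nothing to deduplicate
  }
  const ranked = group.slice().sort((a, b) => {
    const pixelDiff = b.width * b.height - a.width * a.height;
    return pixelDiff !== 0 ? pixelDiff : b.fileSizeInBytes - a.fileSizeInBytes;
  });
  return ranked.slice(1).map((asset) => asset.id);
}
```

For a group of three, this returns two IDs, matching the "select all but one" behavior, and the user can still change the selection before deleting.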

Screenshot_20240507_110313_Gallery
Screenshot_20240507_110337_Gallery
Screenshot_20240507_110457_Gallery
Screenshot_20240507_110352_Gallery

FYI the Testing Immich Album has photos I copied over for testing immich and I only select this album in the mobile app to protect my photos from bugs etc., so they are duplicates.

@be1ski

be1ski commented May 7, 2024

Hey there, amazing work on this PR! Just a thought - how about we get rid of those JPEG duplicates when we've got the original HEIC files? What's your take on this?

@NicholasFlamy
Contributor

Hey there, amazing work on this PR! Just a thought - how about we get rid of those JPEG duplicates when we've got the original HEIC files? What's your take on this?

If the files are basically identical then this should pick that up. If you're suggesting that it automatically prefer HEIC, that's for the UI which is coming eventually.

@mertalev
Contributor Author

mertalev commented May 7, 2024

There will be an option to deduplicate based on resolution, file size, etc. That will get you most of the way there, except in cases where the HEIF is smaller than JPEG purely because it's a more efficient format.

Doing it based on format sounds iffy. You can have a high resolution, high quality JPEG that looks similar to a poor quality HEIF, not to mention that we'd need an arbitrary ranking for which format is better.

We can always expand on this in the future, possibly with a measure of compression artifacts and selecting the image with the least artifacts. But for the first cut, it's better to keep it simple.

@NicholasFlamy
Contributor

@mertalev what do you think?
Screenshot_20240507_185727_Chrome

I haven't done much but at least you can get out of there.

@mertalev
Contributor Author

mertalev commented May 8, 2024

Nice! I'll reduce the scope of this PR to just be the backend changes so we can do the UI separately.

@mertalev mertalev marked this pull request as ready for review May 8, 2024 06:03
@mertalev
Contributor Author

mertalev commented May 8, 2024

After removing the UI changes, this PR is ready for review. The current behavior is that the feature is disabled by default and not exposed to the user except through the config file. The only blocker is that a seemingly unrelated E2E test is failing.

@NicholasFlamy
Contributor

Another compliment about this functionality: since it's AI-based, it picks up 2 different pictures taken directly after one another at slightly different angles or distances. So when I take multiple pictures just in case one of them is blurry but then later have a bunch of extras, this should be the solution.

Long screenshot:
Screenshot_20240508_082317_Chrome

Ignore the sidebar, I tapped the button which scrolls down and adds to the screenshot and it did that.

@AngelaDMerkel

In addition to hashing, the EXIF spec contains a field for OriginalFileName which could be used to match duplicates created from an original. A lot of software writes this field, and using it would remove the need to determine whether the HEIC or the JPEG (for example) is the original.

@NicholasFlamy
Contributor

NicholasFlamy commented May 8, 2024

In addition to hashing, the EXIF spec contains a field for OriginalFileName which could be used to match duplicates created from an original. A lot of software writes this field, and using it would remove the need to determine whether the HEIC or the JPEG (for example) is the original.

I think that would be saved for later, for the UI. Alex and I discussed UI development: Alex will develop most of it, but I'll try to start on it this week. We are thinking of a Utilities page which has the deduplication page. I am taking note of what you said. I'm not sure how much logic will go into the deduplication page, but that seems like a good idea.

Member

@danieldietzler danieldietzler left a comment


Code looks good to me! Awesome :)

@NicholasFlamy
Contributor

NicholasFlamy commented May 16, 2024

Man, @mertalev, I'm sorry for bringing this up so late, but I just checked #1968 and someone raised a really good point. They were asking about the community deduplication project, but it made me realize this could also be an issue for this PR's implementation:
#1968 (reply in thread)

for duplicates that are deleted, is there some kind of block to stop these from being re-uploaded (ie through mobile auto backup)? Or the immich db can somehow retain the record of that particular upload to flag future uploads as duplicates?

This is actually a really good idea, but how would something like this be implemented? There would have to be a blacklist of some kind and preferably a way to remove items from the blacklist. If deleting files on the web could delete them on mobile this wouldn't be a problem. But I'm worried that in the current state of immich, deleting a duplicate on the server that was backed up from the phone, will be reuploaded from the phone.
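As a rough illustration of the blacklist idea raised above, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `UploadBlocklist` class is hypothetical, it keys entries by a file checksum, and a real implementation would need persistent storage. It is not something this PR implements.

```typescript
// Hypothetical blocklist keyed by file checksum (e.g. a hex digest string).
class UploadBlocklist {
  private readonly blocked = new Set<string>();

  // Remember a deleted duplicate so that later uploads of the same file can be flagged.
  block(checksum: string): void {
    this.blocked.add(checksum);
  }

  // Remove an entry again; returns true if the checksum was actually blocked.
  unblock(checksum: string): boolean {
    return this.blocked.delete(checksum);
  }

  // Check an incoming upload against the blocklist.
  shouldReject(checksum: string): boolean {
    return this.blocked.has(checksum);
  }
}
```

The `unblock` method covers the requirement that items can be removed from the list again.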

Contributor

@jrasm91 jrasm91 left a comment


I would prefer to see this moved to a DuplicateController and DuplicateService and served on the /duplicates route, but since it isn't used in the UI yet we can do this in a follow-up pull request.

@zackpollard
Contributor

Man, @mertalev, I'm sorry for bringing this up so late, but I just checked #1968 and someone raised a really good point. They were asking about the community deduplication project, but it made me realize this could also be an issue for this PR's implementation:
#1968 (reply in thread)

for duplicates that are deleted, is there some kind of block to stop these from being re-uploaded (ie through mobile auto backup)? Or the immich db can somehow retain the record of that particular upload to flag future uploads as duplicates?

This is actually a really good idea, but how would something like this be implemented? There would have to be a blacklist of some kind and preferably a way to remove items from the blacklist. If deleting files on the web could delete them on mobile this wouldn't be a problem. But I'm worried that in the current state of immich, deleting a duplicate on the server that was backed up from the phone, will be reuploaded from the phone.

Imo this is a separate, ongoing issue that we need to address on its own, as its impact is wider than just duplicates.

@zackpollard zackpollard merged commit 64636c0 into main May 16, 2024
23 checks passed
@zackpollard zackpollard deleted the feat/duplicate-detection branch May 16, 2024 17:08
@NicholasFlamy
Contributor

Imo this is a separate, ongoing issue that we need to address on its own, as its impact is wider than just duplicates.

Yeah, I guess so.

@bo0tzz
Member

bo0tzz commented May 16, 2024

deleting a duplicate on the server that was backed up from the phone, will be reuploaded from the phone.

I believe currently the mobile app will not reupload files that it has already uploaded in the past (but I could be wrong). If you reinstalled the app or something like that it still would though.

@NicholasFlamy
Contributor

deleting a duplicate on the server that was backed up from the phone, will be reuploaded from the phone.

I believe currently the mobile app will not reupload files that it has already uploaded in the past (but I could be wrong). If you reinstalled the app or something like that it still would though.

Even then, you'd have the duplicates sitting on your phone and if you reinstalled the app or switched phones and copied everything over then all of a sudden the photos reappear. But that's a different problem because it doesn't just affect duplicates.

@zackpollard
Contributor

deleting a duplicate on the server that was backed up from the phone, will be reuploaded from the phone.

I believe currently the mobile app will not reupload files that it has already uploaded in the past (but I could be wrong). If you reinstalled the app or something like that it still would though.

Even then, you'd have the duplicates sitting on your phone and if you reinstalled the app or switched phones and copied everything over then all of a sudden the photos reappear. But that's a different problem because it doesn't just affect duplicates.

We have a plan to sync deletions to synced devices; we will likely deliver that before the stable release.
