[Feature Request] Allow duplicates #4711

diesl · 2023-11-29T23:46:52Z

diesl
Nov 29, 2023

Description

Of course, duplicate detection is a good idea in general.
However, the current setup is too rigid in my opinion. It simply does not process duplicates.

I would like to have the possibility to allow duplicate documents (on a case-by-case basis), because I think there are valid use cases for having duplicate documents:

- Some documents do not change for a long period of time, e.g. Terms and Conditions, so I get the same file for different contracts. I would still like to have them in Paperless separately for each contract, although they are technically duplicates.
- I use Paperless for myself and my wife, so we have two users. Sometimes we both get the same document, but currently only one of us can add it.

Both cases already happened to me/us and I think other users can add more cases to the list. Unfortunately, I don't know how to handle this problem. Is there any workaround?

Apart from that, Paperless is doing a great job 👍

Other

No response

shamoon · 2023-11-30T00:25:45Z

shamoon
Nov 30, 2023
Maintainer

The first one feels like an edge case and I honestly don't really see the value in it, but even so, a workaround would be to add something basic like text on the PDF "Company A" to differentiate the files.

The second one would be resolved by sharing docs using permissions.

0 replies

diesl · 2023-11-30T08:34:22Z

diesl
Nov 30, 2023
Author

Thanks for your quick answer @shamoon

I will elaborate on my recent case and why I wrote this feature request:

I got a new mobile phone contract two weeks ago
Then I got an email with a bunch of PDFs: Contract details, Terms and Conditions, Payment conditions and some more
My wife got the same contract, but a week later
She got a separate email with the same kind of documents. However, some of them are identical and can not be consumed:
- Contract details: Okay, because different content
- Terms and Conditions: Same as mine and identical document with same hash
- Payment conditions: Same as mine and identical document with same hash

The ideal way would be the following:

Both emails are consumed automatically and assigned to different users (will check this out in the new version 2.0.0 👍)
Documents are not blocked as duplicates because logically, they are not
Bonus: The documents of one email will be linked automatically (see other feature request: Reference between documents #422)

With the example in mind, maybe the answers will be a bit more useful:

The first one feels like an edge case and I honestly don't really see the value in it

The value is of course to have all my documents in Paperless. And when I get them at two different points of time, then I want them in Paperless for the record and not "forget" about the later document that I actually got and that is not duplicate in a logical way.

a workaround would be to add something basic like text on the PDF "Company A" to differentiate the files.

I am not sure if I understand your workaround correctly, but I can not change the content of the file (e.g. a PDF file). But still, forcefully changing the hash of a file is not a real solution in my opinion.

The second one would be resolved by sharing docs using permissions.

Yes, that would be possible, but honestly I think this would be kind of misuse of the permission system?

Some arguments for creating a second document:

I can assign different values to all attributes (ASN, dates, tags, owner, ...)
When linking of documents will be possible in future, they could be linked separately (kind of 1. argument)
Less manual work (error prone), more automation, e.g. not forgetting to share documents, better automatic consumption, ...

1 reply

diesl Jan 24, 2024
Author

Hi @shamoon, as it turns out, I am not the only one with this problem.

In reality, there are chances that you get identical files that are different logical documents, especially when multiple persons are using the Paperless system. Somehow, it should be possible to allow duplicates, I mentioned some ideas in the other comments.

Could you please reevaluate this topic?

dsteinkopf · 2023-12-22T17:56:44Z

dsteinkopf
Dec 22, 2023

I am in a very similar situation: I am in the process of importing many years of existing documents from folders. I sometimes duplicate files into several directories (e.g. 2021-01 and 2021-Steuer (=tax)). I am importing using "PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS", so I'd expect the one duplicated document to get both tags: 2021-01 and 2021-Steuer. But this does not happen, since it is only imported once and simply skipped the second time.

In my case, the solution would be not to import the whole document a second time, but to add the tags during the second import.

Would this also be a solution for you, @diesl ?

1 reply

diesl Dec 25, 2023
Author

In your case I agree that adding a tag "Steuer" would be a good solution, because your duplication is basically a tagging workaround. I think your situation with "tag duplicates" is rather special, but in the end the workflow question is:

How to process a duplicate file, rather than just blocking it?

Import file and just add a tag "Duplicate" to second file
Show a form with further options how to proceed (Import document, drop document, ...)
Just import like any other file

Unfortunately, just using one document would not be a solution for me, see my real-world example from #4711 (comment):

It is the same file but for different users, so logically, this is not a duplicate document
I need different documents to assign different values to all attributes (ASN, dates, tags, owner, ...)

fastlane086 · 2023-12-24T13:05:33Z

fastlane086
Dec 24, 2023

Hi folks, I have a similar problem to yours.

I have documents that either refer to two different things or to two different time periods.

For example: I receive travel expenses every month, but also a lump sum for learning materials every six months. The authority that issues the document shows both payments on one document. However, I would like to file it twice with the respective allocation. Or if I suspend the travel expenses, then they are combined. But I would still like to have them separately.

I would like to store a duplicate of this document with the other document type.

So the duplicate check is already good, but it might be a solution if you could force a duplicate in the edit mode.

1 reply

diesl Dec 25, 2023
Author

Yes, I also think the duplicate check is fine, but it should only be a warning, and not a blocker.

flemmingss · 2023-12-25T15:16:50Z

flemmingss
Dec 25, 2023

I can add one more reason that the "share with other user" is not a perfect solution.
Sometimes, let's say, my wife and I both have the same "Terms and Conditions" for our respective telephone subscriptions from the same vendor. We would both be owners of that document, so sharing it from one to the other would not be accurate.

1 reply

diesl Dec 25, 2023
Author

Exactly my problem

dsteinkopf · 2023-12-27T15:58:50Z

dsteinkopf
Dec 27, 2023

In my eyes, one (hopefully) simple solution would be something like this:

Introduce a new setting (e.g. "IMPORT_DUPLICATES" default: False) that switches from the current behaviour (duplicates are just skipped) to the other "extreme": Import the second one as a new one (duplicate) and tag both (or only the new one?) with a special tag "duplicate".

And what about adding a further tag that helps to find the corresponding "other" duplicate? Idea: Add a tag "dup_HASH" to both documents (e.g. dup_123456789AB). Then one can click on this tag and get a list of these identical documents, to be able to edit them appropriately if necessary.

As far as I (very roughly) understand the code, this solution should be not too much work: If a dup is recognized: Check IMPORT_DUPLICATES: If false, raise an Exception (current behaviour). If true: "remember" that this is a duplicate by storing the document object of the original document. After the import (and processing): Add the duplicate and dup-hash tags to both documents. This would also work when further duplicates of the same document will be imported.
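
The flow described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Paperless code: the `IMPORT_DUPLICATES` setting, the `Document` class, the in-memory `store` list, the `consume` function, and the tag names are all hypothetical stand-ins for whatever the real consumer uses.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical setting from the proposal; not an existing Paperless option.
IMPORT_DUPLICATES = True


@dataclass
class Document:
    title: str
    checksum: str
    tags: set = field(default_factory=set)


class DuplicateError(Exception):
    """Raised when a duplicate is found and duplicates are not allowed."""


def consume(store, title, content, import_duplicates=IMPORT_DUPLICATES):
    """Add a document to `store`, tagging duplicates instead of skipping them."""
    checksum = hashlib.md5(content).hexdigest()
    originals = [doc for doc in store if doc.checksum == checksum]

    if originals and not import_duplicates:
        # Behaviour as described in the thread: refuse the file.
        raise DuplicateError(f"{title} is a duplicate of {originals[0].title}")

    new_doc = Document(title=title, checksum=checksum)
    store.append(new_doc)

    if originals:
        # Tag every copy so that clicking the dup_<hash> tag lists them all.
        dup_tags = {"duplicate", f"dup_{checksum[:11]}"}
        for doc in originals + [new_doc]:
            doc.tags |= dup_tags
    return new_doc
```

Because all existing copies are looked up by checksum on every import, a third or fourth identical file would simply receive the same two tags.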

Opinions?

0 replies

fastlane086 · 2023-12-28T16:42:04Z

fastlane086
Dec 28, 2023

I have now simply scanned in a document that I need twice, so the document receives a new MD5 checksum and is therefore not recognized as a duplicate.

Maybe that will help

For documents that you receive directly as a PDF (e.g. terms and conditions), it might help to save them as a "new" PDF file using Microsoft print to PDF.
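
To illustrate why re-scanning or re-saving gets around the check: a checksum depends only on the bytes of the file, so byte-identical files produce the same value and any change to the bytes produces a different one. A minimal sketch using Python's `hashlib` (the byte strings are made-up stand-ins for file contents, and `md5_of` is just a helper for this example):

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


original = b"%PDF-1.7 terms and conditions"
same_copy = bytes(original)             # byte-identical copy
resaved = original + b" (re-saved)"     # any change to the bytes

print(md5_of(original) == md5_of(same_copy))  # True: same checksum
print(md5_of(original) == md5_of(resaved))    # False: different checksum
```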

2 replies

dsteinkopf Dec 29, 2023

Hm. This surely is a good workaround in some cases. In my case, I am still importing (many) existing scanned documents where I don't have the paper anymore...

diesl Dec 29, 2023
Author

I have now simply scanned in a document that I need twice, so the document receives a new MD5 checksum and is therefore not recognized as a duplicate.

As already said, this does not work for digital data, unless printed again. The problem only exists for born-digital documents

For documents that you receive directly as a PDF (e.g. terms and conditions), it might help to save them as a "new" PDF file using Microsoft print to PDF.

This will work, but I think we agree this is just an ugly workaround. It results in bigger file size, lost content and metadata.

I am also not sure whether you get different files if you print to file multiple times. So it might work only once.

dsteinkopf · 2023-12-30T12:48:40Z

dsteinkopf
Dec 30, 2023

I've had another (incomplete) idea which uses the API and so it does not need a code change:

Set PAPERLESS_CONSUMER_DELETE_DUPLICATES to false, so that after the import, duplicates are left in the consume directory.
A standalone script iterates over the leftover duplicate files (not imported) and uses the API .../api/documents/?checksum__iexact=MD5HASH to retrieve the existing document(s).
Determine which tags and other attributes have to be set because of the other file (name). How can this be done (easily)?
Set these additional tags and attributes via API.
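
A rough sketch of what such a standalone script might look like, for the directory-names-as-tags case only. The function names are made up for this example, and the token-style `Authorization` header and the `results` key in the JSON response are assumptions about the API rather than something confirmed in this thread. The last step (writing the tags back via the API) is left out.

```python
import hashlib
import json
import urllib.parse
import urllib.request
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tags_from_subdirs(path: Path, consume_dir: Path) -> list:
    """Directory names between the consume dir and the file, used as tag names."""
    return list(path.relative_to(consume_dir).parent.parts)


def lookup_url(base_url: str, checksum: str) -> str:
    """Build the documents query that filters by checksum."""
    query = urllib.parse.urlencode({"checksum__iexact": checksum})
    return f"{base_url.rstrip('/')}/api/documents/?{query}"


def find_existing(base_url: str, token: str, checksum: str) -> list:
    """Fetch documents matching a checksum (assumes token auth and a 'results' key)."""
    request = urllib.request.Request(
        lookup_url(base_url, checksum),
        headers={"Authorization": f"Token {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("results", [])


def leftover_report(consume_dir: Path) -> dict:
    """Map each leftover file to its checksum and the tags its folders imply."""
    return {
        str(path): {
            "checksum": md5_of_file(path),
            "tags": tags_from_subdirs(path, consume_dir),
        }
        for path in sorted(consume_dir.rglob("*"))
        if path.is_file()
    }
```

For each entry in the report, `find_existing` would return the already imported document(s), and the implied tags could then be added to them.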

In my case additional tags based on directory names would be enough.

Any ideas or additions to that?

0 replies

dknx01 · 2024-02-02T20:55:31Z

dknx01
Feb 2, 2024

I have the same issue. Paperless is treating two documents as duplicates, but they're not, at least not from a legal point of view.
A way to import two files, even with the same content is mandatory.

Example:
Informationsbogen_zur_Einlagensicherung_01_2020.pdf: Not consuming Informationsbogen_zur_Einlagensicherung_01_2020.pdf: It is a duplicate of Informationsbogen_zur_Einlagensicherung_01_2021 (#174)

Another example is when you get the same kind of content from different contract partners, or you get it multiple times because it is connected to several contracts. When buying a house or flat, for instance, you get documents with some small changes, which seem not to be detected.

0 replies

bonerlog · 2024-02-13T13:08:43Z

bonerlog
Feb 13, 2024

+1 for allowing duplicates, when users are different.

Sharing is NOT always an option, especially if someone wants to keep the document, but the other one wants to delete it. Or someone wants to edit things the other one does not. It is also very time-intensive to share documents the correct way.

Depending on the access settings, a user might not even notice that the duplicate document wasn't uploaded; it just does not appear.

0 replies

asklc · 2024-02-18T08:37:32Z

asklc
Feb 18, 2024

@diesl While reading your comments and giving some thought to your point I understand your situation and can relate to it as I can see this happening to me as well.

You were probably describing a situation where, say, certain service providers like to send you contract information scattered across multiple files, one being the actual contractor information, another being the terms and conditions and so on. Thus, multiple persons receive the same files via mail if they buy the same product, where only the personal information differs.

While I see how feeding those documents separately into Paperless as they come in via email is a straightforward thing, I'd like to suggest that perhaps pre-processing them and merging them might be advisable here.

See it like this: while the documents arrive separately, they're logically one document (as you already stated) and in the physical world you would usually file them directly next to each other (or even staple them together) and so would your wife. So, if you merged them, the contractor information on the first pages would make them distinct objects immediately and you would not have to link them or whatever in Paperless later on. That way Paperless won't treat them as duplicates and they stay coherent documents all along the way.

Maybe some pre-consumption script or something could be helpful here.

1 reply

diesl Feb 19, 2024
Author

Hi @asklc, thank you for your comment. While I agree that merging different documents into one could (sometimes) circumvent the duplication problem, I don't see that this approach is a good solution to the duplication problem.

while the documents arrive separately, they're logically one document (as you already stated)

If you mean that the different files (Contract, Terms and Conditions, Payment Conditions, ...) that one person gets (in one email?) are logically one document, I disagree. The only thing they have in common is the correspondent and maybe the date. But still, I want to assign different title, document type, tags, notice, and maybe different access rights, etc.

Therefore, I do not think this is a good idea, because the only advantage is to avoid a duplicate warning, while having a lot of drawbacks.

And this approach is still not fail safe: If two people get (identical) new Terms and Conditions as a single document, there is nothing to merge.

in the physical world you would usually file them directly next to each other

Only because you usually sort by incoming date. But a few people may first sort by document type and separate Invoices and Terms and Conditions, for example.

With the merging approach, you would also lose the filtering and sorting capability.

almereyda · 2024-02-19T16:15:38Z

almereyda
Feb 19, 2024

The one request seems to be a special case of

Merge multiple documents into a single one #367

and the other qualifies for

Reference between documents #422

Maybe we need a follow up to the latter, in which we allow to create virtual dossiers of multiple documents, which are treated as a single one? An alternative path allows to type the relations using common vocabularies that provide relationship types (SKOS, OWL, Wikidata, Schema.org). Then we can distinguish the relations in a qualitative manner that allows automatic grouping and don't need additional models for collecting individual items to maintain.

SKOS, OWL, Wikidata and Schema.org reference to identity relations

Duplicates become even more challenging to consider together with versioning.

Update documents version #1218

3 replies

dsteinkopf Feb 19, 2024

#422
Maybe we need a follow up to the latter, in which we allow to create virtual dossiers of multiple documents

Nice idea - but this sounds like a whole bunch of work...

BTW. Thanks for the hint, that document linking is already possible this way :-)

An alternative path allows to type the relations using common vocabularies that provide relationship types...

Even better... and probably even more work...

Duplicates become even more challenging to consider together with versioning

Oh yes...

almereyda Feb 20, 2024

Multiple relationship types and automatic classification/grouping (around correspondents, documents, tags etc.) also allow for all sorts of interesting qualitative renderings of that graph, think • ARETE •, LogSeq, Obsidian, Zettlr et al.

diesl Feb 20, 2024
Author

Hi @almereyda, I think your first post was intended as an answer to asklc's post, correct?

Regarding references and document relationships, I think this is a completely different topic. This discussion is only about allowing duplicates. So if you have suggestions and feature requests in that direction, I would ask you to open another discussion for it.

Concerning the duplicates, I don't think that they need any kind of special relationship. The only reason I see for a "connection" is to have a UI to work with the duplicates (link to duplicate document, delete duplicate document, ...), but this could be determined by a DB query checking for identical hashes. No need for a saved reference per se in my opinion.
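
As a rough sketch of what I mean by such a query: the table and column names below are invented for the example (this is not the actual Paperless schema), and `find_duplicates` is a hypothetical helper. It uses an in-memory SQLite database so it can be run on its own.

```python
import sqlite3

# Hypothetical, simplified schema; not the real Paperless tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT, owner TEXT, checksum TEXT)"
)
conn.executemany(
    "INSERT INTO documents (id, title, owner, checksum) VALUES (?, ?, ?, ?)",
    [
        (1, "Terms and Conditions", "husband", "abc"),
        (2, "Terms and Conditions", "wife", "abc"),
        (3, "Contract details", "wife", "def"),
    ],
)


def find_duplicates(conn, doc_id):
    """Return (id, title, owner) of other documents with the same checksum."""
    return conn.execute(
        """
        SELECT other.id, other.title, other.owner
        FROM documents AS doc
        JOIN documents AS other
          ON other.checksum = doc.checksum AND other.id != doc.id
        WHERE doc.id = ?
        ORDER BY other.id
        """,
        (doc_id,),
    ).fetchall()


print(find_duplicates(conn, 1))  # [(2, 'Terms and Conditions', 'wife')]
```

The relationship is derived from the stored hash at query time, so nothing extra has to be saved when a duplicate is imported.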

diesl · 2024-02-20T11:00:05Z

diesl
Feb 20, 2024
Author

Some time has passed since this discussion has been started and so it got a little bit cluttered over time. I am trying to summarize a bit.

As the discussion has shown, I am not the only one who receives (born-digital) documents that are marked as duplicates because they have the same hash value. The catch is that they are not duplicates from a logical point of view.

Examples

These are real-world examples (see also my initial request and later answer):

Unchanged documents: Some documents do not change for a long period of time, e.g. Terms and Conditions. However, they are separate documents at separate points of time. Still they should be consumed by Paperless separately (for each contract, fiscal year, etc)
Multiple users: It occurs that some users get the same document, but currently only the first one can add the document

Suggested workarounds

The currently suggested workarounds are:

"Adding text", thus manipulating the file: This is kind of a last resort approach. It is not user friendly, it changes the original file, does not apply in many workflows. I think we do not need to discuss this further
Share document by using permissions: In my opinion, this is more a theoretical approach. In practice it does not work, because then the admin must "fix" these problems. Other aspects like deleting the document for one user etc are not even considered
Print and rescan: Well, of course it works if you have a printer, but it is neither user friendly nor environmentally friendly
Reprint as PDF: Works, but loses data and metadata and increases file size
Merge with other documents before consumption: I think this approach makes things even worse. Separate documents should stay separate

All suggested workarounds have some major drawbacks. The only way I see to solve them is to allow the creation of a second document (a "duplicate") for the following ...

Reasons

Some arguments for creating a second document:

Different values can be assigned to the attributes (Title, ASN, dates, tags, owner, ...)
Different owners, permissions, links, workflows can be assigned
Less manual work (error prone), more automation, e.g. not forgetting to share documents, better automatic consumption, ...

I think it is the easiest and best solution to allow duplicates. Are there any blocker arguments to not allow duplicates at all concerning the code and/or DB design?

Of course, details have to be discussed if following this path:

Show a duplicate warning to the user: During consumption? In a separate GUI? Using tags, similar to inbox tag?
Allow duplicates only across users? (Does not solve all use cases, see above)
...

Long story short, it would be nice to get some additional thoughts from you @shamoon and @stumpylog about how this problem could be solved in a nice and user friendly way?

0 replies

asklc · 2024-02-21T16:15:22Z

asklc
Feb 21, 2024

I guess, after all it shouldn't be that complicated introducing a new config switch and disabling the duplicate check in the document consumer based on that.

0 replies

iC0RE · 2024-04-30T07:49:00Z

iC0RE
Apr 30, 2024

Hey there,

I like the duplicate check very much. But sometimes I wish I could turn it off for just a bunch of files.

In my opinion it would be nice to have an additional Action-button next to 'Dismiss' (maybe called: 'Force consumption' or 'consume') on the failed file tasks page.

So until this point everything goes the expected way.
After clicking this new button, the consumption runs without pre-checking for duplicates and the failed task moves to complete (when finished).

I hope this was understandable.

Best regards

0 replies

NoricSteel · 2024-05-12T00:53:31Z

NoricSteel
May 12, 2024

Hi, I fully agree. I just set up pngx to finally manage my PayPal and other bills. I exported the mails as PDF and during the import some were flagged as duplicates, although they had completely different content. Just as iC0RE suggested, an additional "Import anyways" option on the failed tasks page would be absolutely great.

Thanks!

0 replies
