[Feature Request] Allow duplicates #4711
Replies: 16 comments 10 replies
-
The first one feels like an edge case, and I honestly don't really see the value in it, but even so, a workaround would be to add something basic, like the text "Company A", to the PDF to differentiate the files. The second one would be resolved by sharing docs using permissions.
-
Thanks for your quick answer, @shamoon. I will elaborate on my recent case and why I wrote this feature request:
The ideal way would be the following:
With the example in mind, maybe the answers will be a bit more useful:
The value is, of course, to have all my documents in Paperless. When I get them at two different points in time, I want both in Paperless for the record, and I do not want to "forget" about the later document, which I actually received and which is not a duplicate in a logical sense.
I am not sure I understand your workaround correctly, but I cannot change the content of the file (e.g. a PDF file). And even if I could, forcibly changing the hash of a file is not a real solution in my opinion.
Yes, that would be possible, but honestly I think this would be a kind of misuse of the permission system. Some arguments for creating a second document:
-
I am having a very similar situation: I am in the process of importing many years of existing documents from folders. I sometimes duplicate files into several directories (e.g. 2021-01 and 2021-Steuer (=tax)). I am importing using "PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS", so I'd expect the one duplicated document to get both tags. In my case, the solution would be not to import the whole document a second time but to add the tags during the second import. Would this also be a solution for you, @diesl?
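The "add tags on the second import" idea can be sketched in a few lines. This is only an illustration under assumptions: `Document`, `consume` and the in-memory `store` are hypothetical names, not Paperless code, and MD5 is used only because the thread mentions it as the checksum.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a stored document."""
    checksum: str
    tags: set[str] = field(default_factory=set)


def consume(store: dict[str, Document], content: bytes, tags: set[str]) -> Document:
    """Store new content, or merge the tags into the existing document on a duplicate."""
    checksum = hashlib.md5(content).hexdigest()
    existing = store.get(checksum)
    if existing is not None:
        # Duplicate content: keep one document, but remember the extra tags.
        existing.tags |= tags
        return existing
    doc = Document(checksum=checksum, tags=set(tags))
    store[checksum] = doc
    return doc


store: dict[str, Document] = {}
pdf = b"%PDF-1.4 placeholder bytes"  # placeholder content, not a real PDF
consume(store, pdf, {"2021-01"})
doc = consume(store, pdf, {"2021-Steuer"})
print(sorted(doc.tags))
```

With this approach the file is stored once and ends up carrying the tags of every directory it was found in.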
-
Hi folks, I have a similar problem to yours. I have documents that either refer to two different things or to two different time periods. For example: I receive travel expenses every month, but also a lump sum for learning materials every six months. The authority that issues the document shows both payments on one document. However, I would like to file it twice, each time with the respective allocation. Or if I suspend the travel expenses, then they are combined. But I would still like to have them separately. I would like to store a duplicate of this document with the other document type. So the duplicate check is already good, but it might be a solution if you could force a duplicate in edit mode.
-
I can add one more reason why "share with other user" is not a perfect solution.
-
In my eyes, one (hopefully) simple solution would be something like this: Introduce a new setting (e.g. "IMPORT_DUPLICATES", default: False) that switches from the current behaviour (duplicates are just skipped) to the other "extreme": import the second one as a new document (a duplicate) and tag both (or only the new one?) with a special tag "duplicate".

And what about adding a further tag that helps to find the corresponding "other" duplicate? Idea: Add a tag "dup_HASH" to both documents.

As far as I (very roughly) understand the code, this solution should not be too much work. If a duplicate is recognized, check IMPORT_DUPLICATES. If false, raise an exception (current behaviour). If true, "remember" that this is a duplicate by storing the document object of the original document. After the import (and processing), add the duplicate and dup-hash tags to both documents. This would also work when further duplicates of the same document are imported.

Opinions?
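To make the proposal concrete, here is a minimal sketch of that flow. Everything in it is an assumption for illustration: `IMPORT_DUPLICATES` is the hypothetical setting proposed above, and `Doc`, `DuplicateError` and `consume` are made-up names that do not reflect the actual Paperless consumer code.

```python
import hashlib
from dataclasses import dataclass, field
from itertools import count

IMPORT_DUPLICATES = False  # hypothetical setting from the proposal, default off


class DuplicateError(Exception):
    """Raised when a duplicate is found and duplicates are not allowed."""


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document."""
    id: int
    checksum: str
    tags: set[str] = field(default_factory=set)


_ids = count(1)


def consume(docs: list[Doc], content: bytes,
            allow_duplicates: bool = IMPORT_DUPLICATES) -> Doc:
    checksum = hashlib.md5(content).hexdigest()
    # "Remember" every already stored document with the same checksum.
    originals = [d for d in docs if d.checksum == checksum]
    if originals and not allow_duplicates:
        raise DuplicateError(f"duplicate of document {originals[0].id}")
    new_doc = Doc(id=next(_ids), checksum=checksum)
    docs.append(new_doc)
    if originals:
        # After the import: tag the originals and the new copy alike.
        for d in originals + [new_doc]:
            d.tags |= {"duplicate", f"dup_{checksum}"}
    return new_doc


docs: list[Doc] = []
first = consume(docs, b"same bytes", allow_duplicates=True)
second = consume(docs, b"same bytes", allow_duplicates=True)
print(first.id != second.id, first.tags == second.tags)
```

Because the tags are applied to every document sharing the checksum, a third or fourth import of the same file ends up in the same "dup_HASH" group.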
-
I have now simply scanned in a document that I need twice, so the document receives a new MD5 checksum and is therefore not recognized as a duplicate. Maybe that will help. For documents that you receive directly as a PDF (e.g. terms and conditions), it might help to save them as a "new" PDF file using Microsoft Print to PDF.
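Why this workaround gets past the check can be shown with a small sketch: a checksum depends only on the bytes, so a byte-identical copy matches while any re-scan or re-print (which produces different bytes) does not. The byte strings below are placeholders rather than real PDF files, and `md5_hex` is a helper name chosen for this example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


original = b"%PDF-1.4 terms and conditions"   # placeholder content
copy = bytes(original)                         # byte-identical copy
reprinted = original + b"\n% re-saved"        # same text, different bytes

same_as_copy = md5_hex(original) == md5_hex(copy)
same_as_reprint = md5_hex(original) == md5_hex(reprinted)
print(same_as_copy, same_as_reprint)
```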
-
I've had another (incomplete) idea which uses the API and so it does not need a code change:
In my case, additional tags based on directory names would be enough. Any ideas or additions to that?
-
I have the same issue. Paperless is treating two documents as duplicates, but they're not, at least not from a legal point of view. Example: Another example is if you get the same kind of content from different contract partners, or you get it multiple times because it is connected to several contracts. For instance, when buying a house or flat you get documents with some small changes, which do not seem to be detected.
-
+1 for allowing duplicates when the users are different. Sharing is NOT always an option, especially if one person wants to keep the document but the other wants to delete it, or if one person wants to edit things the other does not. It is also very time-intensive to share documents the correct way. Depending on the access settings, a user may not even notice that the duplicate document wasn't uploaded; it simply does not appear.
-
@diesl While reading your comments and giving some thought to your point, I understand your situation and can relate to it, as I can see this happening to me as well. You were probably describing a situation where, say, certain service providers like to send you contract information scattered across multiple files, one being the actual contractor information, another being the terms and conditions, and so on. Thus, multiple persons receive the same files via mail if they buy the same product, where only the personal information differs.

While I see how feeding those documents separately into Paperless as they come in via email is a straightforward thing, I'd like to suggest that pre-processing and merging them might be advisable here. See it like this: while the documents arrive separately, they're logically one document (as you already stated), and in the physical world you would usually file them directly next to each other (or even staple them together), and so would your wife. So, if you merged them, the contractor information on the first pages would make them distinct objects immediately, and you wouldn't have to link them or whatever in Paperless later on. That way Paperless won't treat them as duplicates and they stay coherent documents all along the way. Maybe some pre-consumption script or something similar could be helpful here.
-
The one request seems to be a special case of and the other qualifies for Maybe we need a follow-up to the latter, in which we allow creating virtual dossiers of multiple documents, which are treated as a single one? An alternative path would be to type the relations using common vocabularies that provide relationship types (SKOS, OWL, Wikidata, Schema.org). Then we can distinguish the relations in a qualitative manner that allows automatic grouping, and we don't need additional models to maintain for collecting individual items. SKOS, OWL, Wikidata and Schema.org reference to identity relations
Duplicates become even more challenging when considered together with versioning.
-
Some time has passed since this discussion was started, and it has become a little cluttered over time. I am trying to summarize a bit. As the discussion has shown, I am not the only one with the problem of receiving (born-digital) documents that are marked as duplicates because they have the same hash value. The catch is, they are not duplicates from a logical point of view.
Examples
These are real-world examples (see also my initial request and later answer):
Suggested workarounds
The currently suggested workarounds are:
All suggested workarounds have some major drawbacks. The only way I see to solve them is to allow creating a second document (a "duplicate") for the following ...
Reasons
Some arguments for creating a second document:
I think the easiest and best solution is to allow duplicates. Are there any blocking arguments against allowing duplicates at all, concerning the code and/or DB design? Of course, details have to be discussed if following this path:
Long story short, it would be nice to get some additional thoughts from you, @shamoon and @stumpylog, about how this problem could be solved in a nice and user-friendly way.
-
I guess, after all, it shouldn't be that complicated to introduce a new config switch and disable the duplicate check in the document consumer based on that.
-
Hey there, I like the duplicates check very much. But sometimes I wish I could turn it off for just a bunch of files. In my opinion it would be nice to have an additional action button next to 'Dismiss' (maybe called 'Force consumption' or 'Consume') on the failed file tasks page. So until this point everything goes the expected way. I hope this was understandable. Best regards
-
Hi, I fully agree. I just set up pngx to finally manage my PayPal and other bills. I exported the mails as PDF, and during the import some were flagged as duplicates, although they had completely different content. Just like iCORE suggested, an additional "Import anyway" option on the failed tasks page would be absolutely great. Thanks!
-
Description
Of course, duplicate detection is a good idea in general.
However, the current setup is too rigid in my opinion. It simply does not process duplicates.
I would like to have the possibility to allow duplicate documents (on a case-by-case basis), because I think there are valid use cases for having duplicate documents:
Both cases already happened to me/us and I think other users can add more cases to the list. Unfortunately, I don't know how to handle this problem. Is there any workaround?
Apart from that, Paperless is doing a great job 👍
Other
No response