Feature requests and feedback from a user's perspective #88
-
Hello! Thanks for giving such detailed feedback. This is the stuff that keeps projects like this going.
Just remember that half of the credit goes to the original project. :) For example, the entire filename format logic was done by someone else, I just reworked the implementation a bit to make it more maintainable and fail-safe.
Sure, that's easy. I put it on the list.
That's partially true. Right now, the title field allows characters that are not valid in file names. When I change this behavior, I have to take that into consideration: some users may already use such titles, and renaming would fail. The current logic makes URL-safe filenames. That's not strictly required, but the implementation is simple and bombproof. I'll consider using less restrictive conversions that just remove problematic characters or replace them with dashes, or something.
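As a rough illustration of the "less restrictive conversion" idea, here is a minimal sketch. It is not the project's code: the helper name `sanitize_title`, the character set, and the `"untitled"` fallback are assumptions made for this example.

```python
import re
import unicodedata

# Characters that are invalid in file names on at least one common platform
# (Windows is the most restrictive), plus ASCII control characters.
_PROBLEMATIC = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_title(title: str, replacement: str = "-") -> str:
    """Hypothetical helper: make a title usable as a file name while
    keeping case, spaces, and non-ASCII letters intact."""
    # Normalize so visually identical titles map to the same name.
    name = unicodedata.normalize("NFC", title)
    name = _PROBLEMATIC.sub(replacement, name)
    if replacement:
        # Collapse runs of the replacement character into a single one.
        name = re.sub(re.escape(replacement) + r"{2,}", replacement, name)
    # Windows rejects names that end in a dot or a space.
    name = name.strip(" .")
    # Fall back to a placeholder if nothing usable is left (assumption).
    return name or "untitled"


print(sanitize_title("Bank statement: December 2020"))
# Bank statement- December 2020
```

Unlike a URL-safe slug, this keeps capitalization and spaces, so the file name stays close to the title the user typed.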
There's a valid reason to have multiple documents with the same title, and putting ids in the file name is the safest way of dealing with that. I've got two documents titled "December 2020", both typed "Bank statement", one tagged "account1", the other "account2". I agree it's inconvenient. "-01", "-02" seems like a good idea; I'll keep that in mind. There won't be any errors about duplicate titles, though.
I don't see myself adding a new entity type anytime soon. However, we could use tags to specify a "category" for each document type, which you would then be able to reference in the filename format. Or maybe just provide a plain text field for that. I'll think about that once I tackle this hierarchical tags thing. I also thought about scrapping types entirely, since they're just a more specific form of tags, but it seems they are rather useful.
I'd rather not modify the original files at all. Paperless even has checksums in place that make sure they stay the same. I don't want to cause any data loss in case some library decides to wreak havoc. That's also the reason why I decided to keep originals and OCR-enhanced files next to each other with the new update. Also, we'd need the database anyway for searching and serving data to the front end in a reasonable time. Fetching that data by reading hundreds of files whenever a user searches for something just won't cut it.
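To illustrate what such a checksum guard amounts to, here is a minimal sketch. The hash algorithm (SHA-256) and the function names are assumptions for this example; they are not taken from the paperless code base.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()  # algorithm choice is an assumption
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, stored_checksum: str) -> bool:
    """Compare the file on disk with the checksum recorded at import time."""
    return file_checksum(path) == stored_checksum
```

Any tool that rewrites metadata inside the original file would change this digest, which is why in-place tagging and a checksum on the original do not mix well.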
Glad to hear that! Honestly, when I started working on this, I thought about removing this filename feature altogether, since it was giving me lots of headaches and I thought that users didn't care how the files are stored on disk anyway, but it seems I was wrong.
-
Yes, credit to Daniel Quinn, the original author of paperless, and all who contributed to that project. But don't sell yourself short. You've put in a lot of work.
1. Add a placeholder for document type to PAPERLESS_FILENAME_FORMAT
Great. I was hoping you would say that. I looked over the code, and it seemed like it might be an easy fix.
2. Don't alter the filename.
Fair enough. But I still feel that it would be much simpler and more elegant to leverage the file system and have the name of the file be the same as the document title, restricting illegal characters or raising an alert box when the user tries to use one, or something like that. Although, I have to admit, I'm not well versed in how Python deals with illegal characters in file names vs. what operating system and file system the code is running on. (As a side rant: If I were the absolute ruler of all things computer, the first thing I would do is have a dedicated path separator key on the keyboard that was not a printed character. The fact that
I've been putting a lot of thought into this--trying to see things from both sides. And, I'm afraid I don't agree. I think you are making things harder on yourself than they need to be. I find it inadvisable to design a system that allows both: A
such that
and file: December 2020 - 0000002.pdf has
and B
I think that if you want condition A, you shouldn't allow the config setting in B. One of the main features of both paperless and paperless-ng that I like is the ability to store my documents in a directory structure of my choosing with file names of my choosing. I've tried other document management systems that put all of the document files in one giant directory and managed all the file names. I ended up with a directory filled with thousands of PDF files with names like: "ffe24e1f-a5d8-40a6-bb64-708fb8e078d9". Yuk! When I go look at my files, I would much rather see something like:
or, alternatively
then
side note:
3. Add one hierarchical grouping (call it category?) above document_type
Fair enough. That would probably be a heavy lift and would perhaps tip it over the "don't overcomplicate things" edge.
I think date, correspondent, and document type are the three core attributes and warrant their own special top-level type treatment as currently implemented.
4. Perhaps store the document's metadata in the file itself
Agreed. That would be a big bite to chew. I tried doing a little more research on the subject of writing XMP tags (https://en.wikipedia.org/wiki/Extensible_Metadata_Platform, https://exiftool.org). Ugh, what a thorny mess. I still think the notion of storing a document's metadata within the file itself is the best way to do things. Unfortunately, I don't think anyone has come up with any good standard way of doing it.
-
Probably. The thing is, paperless allowed duplicate titles in the past, and therefore, some users will have documents with duplicate titles and I need to take that into consideration. I really like the idea of having _01, _02 at the end of the file name in case that happens and will look into getting that into the code. No impact for users who are not in that situation, and a working solution for users who are.
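A minimal sketch of how such a numbered fallback could work. The helper name `unique_path` and the two-digit `_01` format are assumptions for this example; this is not the code that ended up in the project.

```python
from pathlib import Path


def unique_path(directory: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Hypothetical helper: return a path that does not exist yet.

    The first document keeps the plain name; later duplicates get
    _01, _02, ... appended to the stem.
    """
    candidate = directory / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter:02d}{suffix}"
        counter += 1
    return candidate
```

With this scheme, the first "December 2020" document is stored as `December 2020.pdf` and the second as `December 2020_01.pdf`. Note that the existence check and the later write are not atomic, so a real implementation would need to guard against two documents being saved at the same time.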
I'm open to any recommendations on how to make it better. That part of the code is still in there from original paperless. The idea I'm working on right now looks somewhat like this:
Not optimal, since tags don't translate well into folders. There's also the idea of having hierarchical tags over at #56, maybe we can work out something with that.
-
I believe I've now fixed most of the issues with the filenames, except for the tags.
-
Yes, I see your point. I agree that the _01, _02 is the best workaround for the few times the problem of duplicate titles comes up.
I got version 0.9.6 up and running. Things are looking really, really nice. I really like how my file names don't change and that I can organize my files into a directory structure of my choosing with
I also really like the Details, Content, and Metadata tabs on the edit document page. Great work.
Regarding tags:
This is a tough one. Obviously tags don't translate into file system directories because a file can have multiple tags and the file system is "flat". For me, {document_type}/{correspondent}/{title} is good enough, and I don't bother with tags in my file system directory structure.
New topic: Ability to delete the original and keep the archive version
I do have one other suggestion--not sure if I should start a new thread on a new issue ticket, but since this is already in the "discussion" section, here it is: I would like to have the ability to delete the original version of the document if I'm happy with the archive version that ocrmypdf produces. I don't want to have to save two versions of the same document. My workaround is to run ocrmypdf on the file and verify the results prior to importing it into paperless-ng and setting
-
Awesome. Thank you for your feedback.
This is alright. GitHub just suggested that I enable this, and I feel it's a good place for things that aren't exactly tasks that fit into tickets.
I see. I already figured people would ask for that. Am I correct in assuming that you'd still want paperless to keep the original for each document you uploaded until you decide that the OCR'ed version is alright? I see two options:
-
Regarding tags and folders
After some more thought--the following makes sense to me: Simply concatenate all of the file's tags alphabetically into a directory name such that:
and:
results in the following directory structure:
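A minimal sketch of the concatenation rule described above. The function name, the `"-"` separator, and the `"untagged"` fallback are all assumptions made for illustration, not part of the proposal:

```python
from typing import Iterable


def tags_directory(tags: Iterable[str], separator: str = "-") -> str:
    """Hypothetical helper: build one directory name from a document's tags."""
    # De-duplicate, then sort case-insensitively so the result does not
    # depend on the order in which the tags were assigned.
    unique = sorted(set(tags), key=lambda tag: (tag.casefold(), tag))
    # Documents without tags need some fallback directory (assumption).
    return separator.join(unique) if unique else "untagged"
```

Because the tags are sorted before joining, a document always maps to the same directory no matter in which order its tags were added. The trade-off is that adding or removing a single tag moves the file to a different directory.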
Original vs. Archived document version
For my use case, I only want to keep one version of any given document. The document should be a PDF/A with an OCR text layer. I'm not necessarily interested in paperless-ng converting my PDFs for me. I am content to perform all of the document preparation before uploading to paperless-ng. So my workflow is the following:
This is not to say that incorporating PDF/A conversion into paperless-ng isn't a good idea--others may find it much more useful than me. I think it's a matter of how much of steps 1-3 you want to add. With your current setup, I think I would like to see a side-by-side comparison view that would allow me to visually inspect the "archived" and "original" versions before being confident about deleting the "original". This might not be so bad when uploading one file at a time, but might be challenging if there are multiple files to go through. Perhaps present the user with a list of documents with two versions available and let the user choose to keep the original, delete the original, or retry the OCR step on the original with different settings, on a case-by-case basis.
I wouldn't. I would keep the final PDF/A document in a separate directory from files that have yet to be OCR'd and converted to PDF/A's as currently implemented. I always want to be able to simply look at the file system and see what is going on.
-
Just starting to use paperless-ng, and I'd like to tag the document type against two users and then use the correspondent as the inciter of the correspondence. My scanner has a number of shortcuts I've set up for this, but obviously it isn't picking up on the naming convention User-Correspondent* == {document_type}-{correspondent}-. Is there going to be a patch available soon, or should I just hack something in and make a pull request?
-
Hello, I just discovered paperless-ng, and it is so intuitive compared to Mayan-EDMS (I'm in the process of moving all my documents from Mayan to paperless-ng). What is the status of these feature requests? Following the last comments, a small suggestion: add something like a group/tree function. Example:
In any case, thank you for the work accomplished :)
-
Just hijacking this thread on your final note:
Do you mind sharing your post-install steps? I managed to install all requirements (though scipy and numpy were impossible to install via pip for me; I ended up using them from the FreeBSD latest repo and changed the versions in requirements.txt, but that’s a different story). Greetings
-
I'm adding the package list for FreeBSD 12-3
and supervisord configuration for starting automatically at boot (
-
Hi Jonas,
I'm writing this post to provide my feedback on paperless-ng.
First, a little background: I consider myself a database guru, but a mere enthusiast/tinkerer when it comes to coding Python. I used Mayan (https://mayan-edms.com) to organize my home office documents for many years, but found it to be too cumbersome and overly complex for my needs. One thing it did do, however, is force me to think about how to organize my scanned documents in a way that makes sense to me. I've recently pulled all of my documents (~1500) out of Mayan and now store them in a simple file system hierarchy. I went searching for a new tool and found my way to your project, and all I can say is, thank you. Great work on a great project. It checks all the right boxes for me: keeps it simple, OCRs documents and provides robust searching, stores the documents in the file system in a hierarchical way of my choosing, uses SQLite for simple backups or PostgreSQL (yeah!!) for larger setups, runs in a browser, is written in Python, has an intuitive GUI.
I'm following the development with great interest. For what it's worth, here is my list of feature requests and, perhaps, points of discussion:
1. Add a placeholder for document type to PAPERLESS_FILENAME_FORMAT
Such that if document types = Auto, Finance-Banking, Insurance-Health
and
I end up with something like:
2. Don't alter the filename.
Currently, when the file is stored in the media directory, the filename is changed to all lower case, no spaces, and the document ID is appended on the end. For example, if I were to do the following:
the file gets stored in the media directory as:
But, I would like it to be:
I would argue the following:
3. Add one hierarchical grouping (call it category?) above document_type
Perhaps this is not worth the extra complexity, and I'm also curious to see how your idea of nested tags turns out which may be more elegant. But, I would like to have something like:
categories = Insurance, Banking, Credit Cards
document_types = Invoices, Statements, Receipts, Letters
And end up with:
4. Perhaps store the document's metadata in the file itself
I like to think of a document viewer the same way I think about music players/organizers such as Rhythmbox, Amarok, and iTunes. I keep my digital music files organized in my computer's filesystem, but I use a music playing app to help me organize and play my music. Why not store all of a document's metadata in the document itself in the same way that MP3 music files store ID3 tags. Could this be done with XMP tagging (https://en.wikipedia.org/wiki/Extensible_Metadata_Platform)?
5. One final note
I'm running paperless-ng 0.9.4 in a FreeBSD jail with a "bare metal" install. Aside from some inotify speed bumps, everything runs fine.