add resmgr server with fastapi/uvicorn #1294
Implementation could borrow heavily from ocrd.mets_server and ocrd.cli.workspace.workspace_serve_cli, with endpoints 1:1 providing ocrd.cli.resmgr commands. Could you please take this on @joschrew?
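For illustration, a minimal sketch of what such a server could look like, borrowing the FastAPI + uvicorn structure of the METS Server; the endpoint paths, port, and the delegation into OcrdResourceManager are assumptions for this sketch, not the actual API:

```python
from typing import Optional

import uvicorn
from fastapi import FastAPI

from ocrd.resource_manager import OcrdResourceManager

app = FastAPI(title="OCR-D resource manager server")
resmgr = OcrdResourceManager()  # assuming the default constructor suffices here

@app.get("/list-installed")
def list_installed(executable: Optional[str] = None):
    # would delegate to the same logic as `ocrd resmgr list-installed`
    return {"executable": executable, "resources": []}  # placeholder response

@app.post("/download")
def download(executable: str, name: str):
    # would delegate to the same logic as `ocrd resmgr download <executable> <name>`
    return {"executable": executable, "name": name, "status": "downloaded"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8901)
```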
I don't think I need this solution for the slim containers when resolving resources. My problem with all the Resource-Manager stuff is that it is very complex. Doing something like what we have done for the METS Server seems to be too much, because there already is a (nearly) working solution. I would rather change the resource manager to be able to download all desired resources to a configurable directory: check if the desired resource is already there, and if not, download it. Imo the problem with the current solution is that it wants to be flexible and smart. It would be easier if it just downloaded everything to one configurable directory.
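A rough sketch of that simpler behaviour (the helper name and signature are made up for illustration): resolve against one configurable directory and only download what is missing.

```python
from pathlib import Path
from urllib.request import urlretrieve

def ensure_resource(name: str, url: str, resource_dir: str) -> Path:
    """Return the local path of a resource, downloading it first if absent."""
    target = Path(resource_dir) / name
    if target.exists():
        return target                      # already there, nothing to do
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, target)               # plain download, no lookup magic
    return target
```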
@joschrew please read my comprehensive analysis on the resmgr problem for context. We are not anywhere near any working solution at the moment. This is not about whether or how a processor can resolve (or locate) requested resources (i.e. models) at runtime. It is about the functionality of the ocrd resmgr CLI itself.
I just gave the METS Server example because it contains a simple FastAPI + Uvicorn structure that you can borrow from. (Of course, the same can be found in the Processing Server, but there it is spread across multiple modules.)
I can basically just repeat myself; I already tried to understand what was written in the linked issue you mentioned.
The goal of ocrd resmgr is, in the end, to make the resources available. So from my point of view the central point of this is exactly how a processor can reach/resolve its resources; that is what the resmgr is ultimately responsible for. And my opinion is to throw away the resmgr, or at least how it is currently used. Regarding the linked issue, I go with future solution 1 and not 2 (not sure though if I understand all of it). What I have in mind is this: ocrd resmgr should be able to download to an arbitrary directory, and the processor should be able to show where it expects its resources. With both of these, an error like a processor aborting with "hey user, I cannot find my resources" can be resolved. And this is finally what this is all about, at least how I see it.
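To make the second half concrete, a hypothetical snippet of a processor that names the expected path when a model is missing (the environment variable and function are invented for this example, not existing OCR-D behaviour):

```python
import os
from pathlib import Path

def resolve_model(name: str) -> Path:
    # assumed variable for the single, configurable resource directory
    resource_dir = Path(os.environ.get("OCRD_RESOURCE_DIR", "/models"))
    candidate = resource_dir / name
    if not candidate.is_file():
        raise FileNotFoundError(
            f"model '{name}' not found, expected it at {candidate}; "
            f"download it to that directory with the resource manager")
    return candidate
```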
I agree with @joschrew that the resource manager is very complex to use, sometimes unpredictable in behavior, and especially hard to adapt as it is to the network architecture, partly due to the false positives when searching for resources. The resource manager should be able to list/download resources in both cases:
I prefer to concentrate more on a solution that works for the second case, since that will indirectly work for the first case as well. I have already mentioned that in one of our Monday calls, but just for the record I will write it here again. Although future solution 2 may work for the Processor Server agents (still problematic), since we have the endpoint flexibility, it would not work for the Processing Worker agents, which consume requests from the RabbitMQ queues. The main issue with the Processing Workers consuming from their respective processor-name queues is that we do not know exactly which instance of the processing worker will consume the list/download resource manager request from the queue. We assume that each processing worker instance of a specific ocrd processor is the same and that they can replace each other. Let's consider the following scenario with a single-step workflow for a better understanding of the situation:
Then the following steps are performed:
Even if the Processing Server API is extended to support the @bertsky, the only thing that I see working is if the models are downloaded/installed to all running network agents of the same type. We could implement the
We must still think about how to make the resource manager less complex and more straightforward since that will be running on the network agent side.
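For reference, this is roughly what the broadcast-queue variant (referred to again below) could look like: a RabbitMQ fanout exchange so that every worker instance of one processor receives its own copy of the request, instead of exactly one worker consuming it from the shared work queue. Exchange naming, message format and connection parameters are made up for illustration.

```python
import json
import pika

def broadcast_download(processor: str, resource: str, host: str = "localhost"):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    exchange = f"resmgr.{processor}"            # hypothetical naming scheme
    channel.exchange_declare(exchange=exchange, exchange_type="fanout")
    # every worker binds its own (exclusive) queue to this exchange,
    # so each running instance receives its own copy of the request
    channel.basic_publish(exchange=exchange, routing_key="",
                          body=json.dumps({"resource": resource}))
    connection.close()
```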
@MehmedGIT thanks for your analysis. Indeed, your scenario is illustrative.
The resmgr still has bugs, esp. around its internal database mechanism (#1044 and #1251), but I won't go down that abyss here – let's assume we get those fixed! Otherwise, I think OcrdResourceManager and
Yes, of course, I have not even touched on that aspect.
I fully agree! (But they still could be in different physical locations in each deployment, so denotation of models via resource names, i.e. unresolved paths, should still be the working principle.)
I don't think we need to differentiate by agent type, though: regardless of type, they always derive from the same Docker image and will be started with the same volumes for models.
Your proposed implementation via SSH or broadcast queues would make the processing/runtime side of the Web API much more complex. I would rather separate these concerns by providing (i.e. generating and starting) a third service merely for the new resmgr API in every processor module (i.e.
So to illustrate my rather terse description of the concept with your scenario:
Regarding name clashes (for the same processor executable): first come, first served (i.e. the second attempt receives an error message), as was the case on the CLI. If we have a central resmgr server anyway, requests will naturally be serialized.

Regarding filesystem clashes (downloading to the same file, e.g. on shared NFS): the resmgr itself must be atomic and prevent this from creating broken files. Again, first come, first served (i.e. the second attempt receives an error message).

I would not concern myself with user management at this point.
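A sketch of what "atomic, first come first served" could mean on the filesystem level (lock-file convention and helper name are illustrative only, not the actual resmgr implementation):

```python
import os
from pathlib import Path
from urllib.request import urlretrieve

def atomic_download(url: str, target: Path) -> None:
    """Download url to target; fail fast if another download is in progress."""
    target.parent.mkdir(parents=True, exist_ok=True)
    lock = Path(str(target) + ".lock")
    try:
        # O_EXCL makes lock creation atomic: a second concurrent request fails here
        lock_fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"{target} is already being downloaded elsewhere")
    try:
        part = Path(str(target) + ".part")
        urlretrieve(url, part)      # write to a temporary file first ...
        os.replace(part, target)    # ... then move it into place atomically
    finally:
        os.close(lock_fd)
        lock.unlink()
```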
That was helpful, thanks for the summary! Overall I know what to implement and a PR regarding the ocrd_network RM will come soon.
That should be just another network agent managed by the PS deployer. The PS config file does not need modification since it already contains the host information of each Processing Worker/Processor Server. To prevent port collisions, the OS will randomly assign one per host.
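For example, the deployer could probe a free port per host before starting the server (sketch only, with a placeholder app standing in for the actual resmgr server):

```python
import socket

import uvicorn
from fastapi import FastAPI

app = FastAPI()  # placeholder for the actual resmgr server app

def free_port() -> int:
    # bind to port 0 so the OS assigns an arbitrary free port, then reuse it
    # (small race window between closing the probe socket and uvicorn binding)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]

if __name__ == "__main__":
    port = free_port()
    print(f"resmgr server listening on port {port}")
    uvicorn.run(app, host="0.0.0.0", port=port)
```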
Check.
Check.
Check.
Check. I would rather not rely on that shortcut for robustness reasons. That would be something to optimize later, after resolving the bugs of the RM.
The obstacle here is not the name or filesystem clashes. Sure, FIFO helps with that, and requests can be forced to be sequential (sync instead of async methods). It is about finding a way to synchronize the Processing Workers/Processor Servers with the Resource Manager. The RM should not overwrite a local model if it is actively being used by any of the agents on that host at that time. Any ideas on how to prevent that? Even when the processing agent is idle, the model is still cached in memory.
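Just to have something concrete to discuss, one conceivable (purely advisory) mechanism would be shared/exclusive file locks: each agent holds a shared lock while a model is loaded or cached, and the RM asks for an exclusive lock before overwriting, which fails while any agent still holds the model. All names here are hypothetical, and flock is advisory only (and not reliable on every shared filesystem).

```python
import fcntl

def lock_model(model_path: str, for_overwrite: bool = False):
    """Return an open lock-file handle, or raise BlockingIOError if busy."""
    handle = open(model_path + ".lock", "w")
    # agents take a shared lock while the model is loaded/cached;
    # the resource manager asks for an exclusive lock before overwriting
    mode = fcntl.LOCK_EX if for_overwrite else fcntl.LOCK_SH
    try:
        fcntl.flock(handle, mode | fcntl.LOCK_NB)  # non-blocking: fail immediately
    except BlockingIOError:
        handle.close()
        raise
    return handle
```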
I am not considering that at this point.
Ah, indeed, if a download request is made with overwriting enabled, that would be a problem. But do we really need to support overwriting in the networked setting? This could become a pure local/CLI thing, for SSH-based maintenance by admins.
Originally posted by @bertsky in OCR-D/ocrd_all#69 (comment)