add resmgr server with fastapi/uvicorn #1294

bertsky opened this issue Nov 12, 2024 · 9 comments

@bertsky
Collaborator

bertsky commented Nov 12, 2024

Elaborating a bit on option 2: of course, the (generated) docker-compose.yml for each module could also provide an additional server entry point – a simple REST API wrapper for the resmgr CLI. Its (generated) volume and variable config would have to match the respective Processing Worker (or Processor Server) to be used. But the local resmgr would not need to "know" anything beyond what it can see in its thin container – a local ocrd-all-tool.json and ocrd-all-module-dir.json precomputed for the processors of that module at build time, plus the filesystem in that container and its mounted volumes.

In addition, to get the same central resmgr user experience (for all processor executables at the same time), one would still need

  • either a single server (with resmgr-like endpoints or even providing some /discovery/processor/resources) which delegates to the individual resmgr servers,
  • or an intelligent resmgr client doing the same.

Regardless, crucially, this central component needs to know about all the deployed resmgr services – essentially holding a mapping from processor executables to module resmgr server host-port pairs. This could be generated along with the docker-compose.yml (in a new format like ocrd-all-module-dir.json), or the latter could even be parsed directly.

Originally posted by @bertsky in OCR-D/ocrd_all#69 (comment)

@bertsky
Collaborator Author

bertsky commented Nov 12, 2024

Implementation could borrow heavily from ocrd.mets_server and ocrd.cli.workspace.workspace_serve_cli, with endpoints mapping 1:1 to the ocrd.cli.resmgr commands.
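
For illustration, a minimal sketch of what such a per-module wrapper could look like (the endpoint paths, the port, and wrapping the CLI via subprocess are assumptions here, not an agreed API):

```python
# minimal sketch: thin FastAPI wrapper around the resmgr CLI inside a module container
# (endpoint paths, port and response format are illustrative assumptions only)
import subprocess

import uvicorn
from fastapi import FastAPI, HTTPException

app = FastAPI(title="OCR-D resmgr server")

def resmgr(*args: str) -> str:
    """Run `ocrd resmgr` locally in this container and return its stdout."""
    result = subprocess.run(["ocrd", "resmgr", *args], capture_output=True, text=True)
    if result.returncode != 0:
        raise HTTPException(status_code=500, detail=result.stderr)
    return result.stdout

@app.get("/list-installed")
def list_installed():
    return {"output": resmgr("list-installed")}

@app.get("/list-available")
def list_available():
    return {"output": resmgr("list-available")}

@app.post("/download/{executable}/{name}")
def download(executable: str, name: str):
    return {"output": resmgr("download", executable, name)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8901)
```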

Could you please take this on @joschrew?

@joschrew
Contributor

joschrew commented Nov 12, 2024

I don't think I need this solution for the slim containers. When resolving resources, /usr/local/share/ocrd-resources/ is considered in most cases. For tesserocr-recognize I have to use TESSDATA_PREFIX to provide the path. So when starting the processing-workers, I volume-mount my host-local modules directory to /usr/local/share/ocrd-resources/. This way the processor should always find the resources. Downloading the resources is a bit complex on the host (for example, tesserocr-recognize refuses to download to /usr/local/share/ocrd-resources), but this only has to be done once.

My problem with all the Resource-Manager stuff is that it is very complex. Doing something like what we have done for the Mets-Server seems to be too much, because there already is a (nearly) working solution. I would rather change the resource manager to be able to download all desired resources to a configurable directory: check if a desired resource is already there, and if not, download it. Imo the problem with the current solution is that it wants to be flexible and smart. It would be easier if it just downloaded everything to /usr/local/share/ocrd-resources. Additionally, TESSDATA_PREFIX should always be set to /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize. In that case, for example, one could simply mount a directory into the Processing Server and the workers at /usr/local/share/ocrd-resources, and then in the Processing Server only the Resource Manager has to be called to download to the shared folder.

@bertsky
Collaborator Author

bertsky commented Nov 12, 2024

@joschrew please read my comprehensive analysis on the resmgr problem for context. We are nowhere near a working solution at the moment.

This is not about whether or how a processor can resolve (or locate) requested resources (i.e. models) at runtime.

It is about making the functionality of the ocrd resmgr CLI available within the thin-container regime, i.e. (for all processors)

  • listing installed models,
  • listing additional models available for download, and
  • downloading models.

I just gave the METS Server example because it contains a simple FastAPI + Uvicorn structure that you can borrow from. (Of course, the same can be found in the Processing Server, but there it is spread across multiple modules.)

@joschrew
Contributor

I basically can only repeat myself; I have already tried to understand what was written in the linked issue you mentioned.

This is not about whether or how a processor can resolve (or locate) requested resources (i.e. models) at runtime.

In the end, the goal of ocrd resmgr is to make the resources available. So from my point of view, the central point of this is exactly how a processor can reach/resolve its resources – that is what the resmgr is ultimately responsible for.

And my opinion is to throw away the resmgr, or at least how it's currently used. Regarding the linked issue, I go with future solution 1 and not 2 (though I am not sure I understand all of it).

What I have in mind is this: ocrd resmgr is called with a path to a directory. ocrd resmgr then downloads resources to this directory. It can list what is already in this directory and what resources are available online. Then the processors get a function like the --dump-module-dir function, for example called --show-resources-dir. With this they show where they expect their resources to be (this should be made configurable).

With both of these (ocrd resmgr being able to download to an arbitrary directory, and the processor being able to show where it expects its resources), an error like a processor aborting with "hey user, I cannot find my resources" can be resolved. And that is ultimately what this is all about, at least as I see it.
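
A rough sketch of that flow, purely for illustration (--show-resources-dir does not exist yet, and download stands in for whatever the simplified resmgr would offer):

```python
# rough sketch of the proposed flow; `--show-resources-dir` is a hypothetical flag,
# and `download` stands in for whatever the simplified resmgr would provide
import subprocess
from pathlib import Path

def ensure_resources(executable, wanted, download):
    # ask the processor where it expects its resources to be
    result = subprocess.run([executable, "--show-resources-dir"],
                            capture_output=True, text=True, check=True)
    resource_dir = Path(result.stdout.strip())
    resource_dir.mkdir(parents=True, exist_ok=True)
    # download only what is missing
    for name in wanted:
        if not (resource_dir / name).exists():
            download(name, resource_dir)
```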

@MehmedGIT
Contributor

I agree with @joschrew that the resource manager is very complex to use, sometimes unpredictable in behavior, and especially hard to adapt as it is to the network architecture partly due to the false positives when searching for resources.

The resource manager should be able to list/download resources in both cases:

  • Processing Worker/s and Processing Server/s are on the same host machine
  • Processing Worker/s and Processing Server/s are on different host machines

I prefer to concentrate more on a solution that works for the second case since that will indirectly work for the first case as well.

I have already mentioned this in one of our Monday calls, but just for the record I will write it here again. Although future solution 2 may work for the Processor Server agents (still problematic), since we have the endpoint flexibility, it would not work for the Processing Worker agents, which consume requests from the RabbitMQ queues. The main issue with the Processing Workers consuming from their respective processor-name queues is that we do not know exactly which instance of the processing worker will consume the list/download resource manager request from their queue. We assume that each processing worker instance of a specific ocrd processor is the same and that they can replace each other.

Let's consider the following scenario with a single-step workflow for a better understanding of the situation:

  • 1 instance of a Processing Server running on Host A
  • 1 instance of a RabbitMQ queue running on Host A
  • 3 instances of ocrd-tesserocr-recognize as a Processor Server module deployed on Hosts B, C, and D
  • 3 instances of ocrd-tesserocr-recognize as a Processing Worker module deployed on Hosts E, F, and G. These workers know the connection URL of the RabbitMQ running on Host A.
  • The Processing Server knows the hosts of the network agents (i.e., Hosts B-G)

Then the following steps are performed:

  1. The user submits a processing request of type ocrd-tesserocr-recognize to the Processing Server
  2. Based on the agent flag specified in the processing request, the request is forwarded to one of the two:
  • a Processor Server of type ocrd-tesserocr-recognize (randomly among hosts B, C, D)
  • a RabbitMQ queue with name ocrd-tesserocr-recognize, from which the request is consumed by one of the Processing Workers on hosts E, F, G (in a round-robin fashion).

Even if the Processing Server API is extended to support the ocrd resmgr CLI, the user will still have no control over which agent the list/download request lands on (somewhat more controllable for the Processor Server, but still very unpleasant). The main idea of the network architecture was to hide the host details of the running agents from the general user. Regardless of whether Processor Server or Processing Worker instances are deployed, they MUST always have the same models available locally. Otherwise, the Processing Server not only has to keep track of which worker has which models installed but also forward the requests accordingly (and this is not possible in the Processing Worker scenario).

@bertsky, the only thing that I see working is if the models are downloaded/installed to all running network agents of the same type. We could implement the ocrd resmgr CLI endpoints in the Processing Server and:

  • when a user (ideally an admin user) invokes the download endpoint, the Processing Server will iterate over all worker hosts of that processor type and install the resources locally on the remote host via SSH. There are also broadcast queues in RabbitMQ that may be better to utilize than iterating with SSH. However, currently, the workers are implemented in a blocking manner. Not sure yet how to make a single worker listen on two queues simultaneously (one for the requests, and one for list/download). The models will be installed only once when the workers are on the same host.
  • when a user (regular user) invokes the list endpoint, the Processing Server will check the installed resources via SSH on the remote host. Since all workers will be duplicates of each other, it would be enough to list the resources of a single worker of the specified type
  • the user (ideally an admin user) may even be allowed to delete models – this, though, is somewhat more dangerous since it will be hard to track whether there are running jobs or jobs in the queue that will require that model.
  • last but not least, what about multiple users running in the same domain? How do we prevent duplication/overwriting of models in the same environment?

We must still think about how to make the resource manager less complex and more straightforward since that will be running on the network agent side.

@bertsky
Collaborator Author

bertsky commented Feb 17, 2025

@MehmedGIT thanks for your analysis. Indeed, your scenario is illustrative.

I agree with joschrew that the resource manager is very complex to use, sometimes unpredictable in behavior, and especially hard to adapt as it is to the network architecture partly due to the false positives when searching for resources.

The resmgr still has bugs, esp. around its internal database mechanism (#1044 and #1251), but I won't go down that abyss here – let's assume we get those fixed!

Otherwise, I think OcrdResourceManager and ocrd resmgr are a good design and not too complex for the problems they solve (resolving files by name from various locations, including both package-distributed and user-managed ones, and showing and adding such resources).

Even if the Processing Server API is extended to support the ocrd resmgr CLI, the user will still have no control over which agent the list/download request lands on.

Yes, of course, I have not even touched on that aspect.

Regardless of whether Processor Server or Processing Worker instances are deployed, they MUST always have the same models available locally.

I fully agree!

(But they still could be in different physical locations in each deployment, so denotation of models via resource names, i.e. unresolved paths, should still be the working principle.)

the only thing that I see working is if the models are downloaded/installed to all running network agents of the same type

I don't think we need to differentiate by agent type, though: regardless of type, they always derive from the same Docker image and will be started with the same volumes for models.

We could implement the ocrd resmgr CLI endpoints in the Processing Server and: [...]

Your proposed implementation via SSH or broadcast queues would make the processing/runtime side of the Web API much more complex.

I would rather separate these concerns by providing (i.e. generating and starting) a third service merely for the new resmgr API in every processor module (i.e. resource-manager in addition to either worker or server) – as originally described.

@bertsky
Collaborator Author

bertsky commented Feb 17, 2025

So to illustrate my rather terse description of the concept with your scenario:

  • whoever deployed the Processing Worker and/or Processor Server instances (N for each processor executable on M hosts, where N ≠ M generally),
    will also have to deploy ResourceManager Servers (just 1 for each processor module on every host),
    in this case: some ocrd network resmgr-server off the ocrd/tesserocr images on hosts B-G
  • that same component (could be the PS deployer or some docker-compose script) feeds that runtime knowledge about the RM addresses to the Processing Server (so the latter can implement a new central /discovery/processor/resource by talking to them...)
  • optionally, a ocrd network client discovery processor resource ... just relays to the corresponding endpoints on the PS
  • optionally, a ocrd resmgr --processing-server host:port ... also just relays to the PS (but translates the resmgr-CLI mnemonics to discovery-API style)
  • if the user requests (the equivalent of) download for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective download for -recognize in all of them – partial failure is overall failure (see the sketch after this list)
  • if the user requests (the equivalent of) list-installed for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective list-installed in any of them – consistency is assumed
  • if the user requests (the equivalent of) list-available for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective list-available in any of them; the obvious shortcut here would be to just look up the ocrd-all-tool.json (assuming we manage to generate and install one)
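
To make the fan-out above concrete, a minimal sketch of the PS-side logic (the mapping, the endpoint paths and the use of requests are assumptions matching the wrapper sketched further up, not a fixed design):

```python
# sketch of the Processing Server side: fan out `download` to all responsible resmgr
# servers, ask only one for `list-installed`; mapping and endpoints are illustrative
import requests

# hypothetical mapping fed to the PS at deployment time:
# processor executable -> (host, port) pairs of the module's resmgr servers
RESMGR_SERVERS = {
    "ocrd-tesserocr-recognize": [("host-b", 8901), ("host-c", 8901), ("host-d", 8901),
                                 ("host-e", 8901), ("host-f", 8901), ("host-g", 8901)],
}

def download_resource(executable: str, name: str) -> None:
    """Partial failure is overall failure: any non-2xx response raises."""
    for host, port in RESMGR_SERVERS[executable]:
        requests.post(f"http://{host}:{port}/download/{executable}/{name}").raise_for_status()

def list_installed(executable: str):
    """Consistency across hosts is assumed, so any single resmgr server will do."""
    host, port = RESMGR_SERVERS[executable][0]
    return requests.get(f"http://{host}:{port}/list-installed").json()
```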

Regarding name clashes (for the same processor executable): first come, first served (i.e. the second attempt receives an error message), as was already the case on the CLI. If we have a central resmgr-server anyway, requests will naturally be serialized.

Regarding filesystem clashes (downloading to the same file, e.g. on a shared NFS): the resmgr itself must be atomic and prevent this from creating broken files. Again – first come, first served (i.e. the second attempt receives an error message).
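
One way to get that atomicity with standard means (a sketch only, the helper name is made up): write to a temporary file in the same directory and hard-link it into place – os.link fails if the target already exists, which gives exactly the first-come-first-served behaviour.

```python
# sketch of an atomic "first come, first served" download: write to a temporary file
# in the same directory, then hard-link it into place (os.link raises FileExistsError
# if someone else got there first, and readers never see partially written files)
import os
import tempfile

import requests

def atomic_download(url: str, target_path: str) -> None:
    target_dir = os.path.dirname(target_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                for chunk in response.iter_content(chunk_size=1 << 20):
                    tmp_file.write(chunk)
        os.link(tmp_path, target_path)  # atomic; fails if the target already exists
    finally:
        os.remove(tmp_path)
```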

I would not concern myself with user management at this point.

@MehmedGIT
Contributor

MehmedGIT commented Feb 18, 2025

That was helpful, thanks for the summary! Overall I know what to implement and a PR regarding the ocrd_network RM will come soon.

that same component (could be the PS deployer or some docker-compose script) feeds that runtime knowledge about the RM addresses to the Processing Server (so the latter can implement a new central /discovery/processor/resource by talking to them...)

That should be just another network agent managed by the PS deployer. The PS config file does not need modification, since it already contains the host information of each Processing Worker/Processor Server. To prevent port collisions, the OS will randomly assign a free port per host.
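
E.g. by binding to port 0 and letting the OS pick a free port – a common pattern, sketched here just for illustration:

```python
import socket

def find_free_port() -> int:
    # bind to port 0 so the OS assigns a free ephemeral port, then release it
    # for the resmgr server to bind (a small race window remains, as usual)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]
```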

optionally, a ocrd network client discovery processor resource ... just relays to the corresponding endpoints on the PS

optionally, a ocrd resmgr --processing-server host:port ... also just relays to the PS (but translates the resmgr-CLI mnemonics to discovery-API style)

Check.

if the user requests (the equivalent of) download for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective download for -recognize in all of them – partial failure is overall failure

Check.

if the user requests (the equivalent of) list-installed for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective list-installed in any of them – consistency is assumed

Check.

if the user requests (the equivalent of) list-available for ocrd-tesserocr-recognize, the PS looks up the RM Servers responsible (i.e. the ones for ocrd/tesserocr on B-G), and requests the respective list-available in any of them; the obvious shortcut here would be to just look up the ocrd-all-tool.json (assuming we manage to generate and install one)

Check. I would rather not rely on that shortcut for robustness reasons. That would be something to optimize later, after resolving the bugs of the RM.

Regarding name clashes (for the same processor executable): first come, first served (i.e. the second attempt receives an error message), as was already the case on the CLI. If we have a central resmgr-server anyway, requests will naturally be serialized.

Regarding filesystem clashes (downloading to the same file, e.g. on a shared NFS): the resmgr itself must be atomic and prevent this from creating broken files. Again – first come, first served (i.e. the second attempt receives an error message).

The obstacle here is not the name or filesystem clashes. Sure, FIFO helps with that, and requests can be forced to be sequential (sync versus async methods). It is about finding a way to synchronize Processing Workers/Processor Servers with the Resource Manager. The RM should not overwrite a local model if it is actively being used by any of the agents on that host at that time. Any ideas on how to prevent that? Even when the processing agent is idle, the model is still cached in the memory.

I would not concern myself with user management at this point.

I am not considering that at this point.

@bertsky
Collaborator Author

bertsky commented Feb 18, 2025

It is about finding a way to synchronize Processing Workers/Processor Servers with the Resource Manager. The RM should not overwrite a local model if it is actively being used by any of the agents on that host at that time. Any ideas on how to prevent that? Even when the processing agent is idle, the model is still cached in the memory.

Ah, indeed, if a download request is made with overwrite set, this could lead to races when new processors are concurrently instantiated and load these models. Also, it is unclear what state the running (and cached) instances are in when models have been updated (how do I know which version of a model a running processor has loaded?).

But do we really need to support overwriting in the networked setting? This could become a pure local/CLI thing, for SSH-based maintenance by admins.
