Bikeshed a worker/job queue system #57

Open
barzamin opened this issue Jun 8, 2018 · 3 comments
Labels: A: Backend (Anything related to the backend), P: High 🔥 (High priority ticket)
Milestone: alpha

Comments

@barzamin
Member

barzamin commented Jun 8, 2018

  • in- or out-of-process?
  • storage backend?
    • can we contort postgres into fulfilling our needs? (kroeg did, but we probably shouldn't. we'd also probably have to use pg-specific stuff and that would create porting issues with Add SQLite backend #12)

i know @yipdw has Thoughts, it would be good to serialize them here maybe?

see also my post on r/rust, where people basically went 🤷‍♀️

dumping things here for myself or other people implementing this

  • existing code for worker pools n stuff
  • what do we want the interface to look like
  • what happens if rustodon(1) crashes when we still have jobs hanging around
    • if they're, eg, deliveries, they need to be retried when it comes back up.

there are methods (for postgres, SKIP LOCKED) to build work queues in the database, but we don't really care about that because we don't have multiple worker processes, and we can dispatch jobs to worker threads entirely in-core. what we really need is a journal table of jobs that have been started but not completed; when the app comes up, we read the journal and reissue those jobs.
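
rough sketch of that journal idea, just to make it concrete. every name here (JournalEntry, Journal, reissue_incomplete) is made up for illustration; nothing like this exists in rustodon yet:

    /// one row in the journal table: a job that has been dispatched but not
    /// yet marked complete.
    #[derive(Debug, Clone)]
    struct JournalEntry {
        id: i64,
        kind: String,    // e.g. "deliver_toot"
        payload: String, // serialized job arguments
    }

    /// whatever ends up persisting the journal (postgres, sqlite, ...).
    trait Journal {
        /// record a job before handing it to a worker thread.
        fn record(&self, entry: &JournalEntry);
        /// remove a job once a worker reports success.
        fn complete(&self, id: i64);
        /// everything that was started but never completed.
        fn incomplete(&self) -> Vec<JournalEntry>;
    }

    /// called once on startup: anything still in the journal was interrupted
    /// by a crash/restart, so hand it back to the in-process dispatcher.
    fn reissue_incomplete<J: Journal>(journal: &J, dispatch: impl Fn(JournalEntry)) {
        for entry in journal.incomplete() {
            dispatch(entry);
        }
    }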

@barzamin barzamin added this to the alpha milestone Jun 8, 2018
@barzamin barzamin added the P: High 🔥 High priority ticket label Jun 8, 2018
@hannahwhy
Contributor

hannahwhy commented Jun 8, 2018

Well, the choice of job system depends on Rustodon's design goals.

If the goal is scaling down -- scaling down to small systems and situations where no system administrator is present -- I think in-process, or at least automatically managed by the main process, makes sense. (I say that so as not to preclude the possibility of the job runners being e.g. separate threads in a separate process whose lifetime is controlled by the main rustodon process, because that can be done without human intervention, and a separate worker process may be useful for taking advantage of isolation/quota mechanisms provided by the OS.)

If the goal is scaling up to multiple machines, then it may make sense to have an explicitly controllable separate process, because existing tools for multi-machine deployments are designed to expect things like separate web and worker processes.

I don't know if it's possible, or desirable, to target both spaces.

My preference is in-process, but I'm also interested in squeezing an ActivityPub server into nothing more than a single executable and state file (i.e. #12), so that should explain my position there.
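
For concreteness, the in-process shape could be as small as a channel feeding a few worker threads whose lifetime is owned by the main rustodon process. This is only an illustrative, std-only sketch and makes no claim about the eventual API; Job and spawn_workers are made-up names:

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    /// a job as the workers see it; in practice this would cover the real
    /// job kinds (deliveries, etc.). purely illustrative.
    enum Job {
        Deliver { inbox_url: String, body: String },
    }

    /// spawn `n` worker threads owned by this process; the returned Sender is
    /// the in-process queue. dropping it shuts the workers down.
    fn spawn_workers(n: usize) -> mpsc::Sender<Job> {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..n {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // take the next job; if the sender is gone, the main process
                // is shutting down, so exit this worker thread.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break,
                };
                match job {
                    Job::Deliver { inbox_url, body } => {
                        // perform the delivery here (HTTP POST, signing, etc.)
                        let _ = (inbox_url, body);
                    }
                }
            });
        }
        tx
    }

Usage would just be spawn_workers(4) at startup and queue.send(Job::Deliver { ... }) wherever a delivery needs to go out; a separate-process variant could keep the same interface and put the receiving loop behind an IPC boundary instead.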

@keiyakins

I'd have to agree with the general principle that scaling down is more useful than scaling up. Anyone with tens of thousands of users is likely going to have significant infrastructure they'll want to leverage, in the form of distributed data stores, single sign-on systems, etc. That'll mean they'd have to rewrite a lot of it anyway.

On the other hand, being able to easily run an instance on a cheap VPS or even a low-power single board computer maximizes the usefulness of federation and allows wide scaling with lots of instances.

@netshade
Contributor

  • I think that there still might be value wrt using Postgres / SQLite as the place where job records / retry counts / some level of worker memory is persisted, if only such that the atomicity of the application record that created the job and the persistence of the job can be maintained at the same level. That is to say, if I toot something and the DB definitely persisted the write, but the filesystem maybe did not, that would defeat the expectations of the user. ( "I tooted X and then my server restarted. Why is my toot not appearing in the CC'd users' timelines?" ) It strikes me as safer to be able to say that if the action that caused the job was persisted, then the need to perform the job was equally persisted. ( sketched below, after this list )

  • When I read the phrase:

    when the app comes up, we read the journal and reissue those jobs

    that suggests to me that workers are primarily responding only to new jobs, except in the case of a restart where incomplete jobs are re-dispatched, per your comment. I think that there might be a need for workers to be able to cancel their own jobs ( as much as they are able to, barring severe deadlocks ) and mark them as to-be-retried-at-some-future-date, which would sort of require the system to always be reissuing those jobs. Which is to say that the work of examining the jobs and potentially dispatching them to a worker is a thing the system should always be doing, rather than only at startup time. ( sketched below, after this list )

  • Following the above comment, I think that if there is a "most important thing" for the worker system, to me it would be the contract of a job. Where it's performed has mechanical characteristics, but what is expressed by a job in the system, eg:

    dispatch!(DeliverToot(...), retry = 5, exponential_backoff = true)
    

    where what the contract of a toot delivery is can be understood just by reading it. What sort of assurances we would want for a given job in the system strikes me as highly variable ( Forgot Password does not necessarily need the same level of durability as Deliver This Very Important Toot ), so to me, the contract being written into the journal table you are talking about seems important enough to merit its own bullet point in the feature list of this system. That is to say, the things to be addressed in this discussion are:

    1. How a job is performed ( in process, out of process )
    2. Where a job is kept ( DB, memory, filesystem, something else? )
    3. What a job is ( The details of the contract between the dispatcher and the worker )

    To that end, I think that the contract will need to carry enough information to satisfy the following questions that a worker may need to answer ( a sketch follows the list below ):

    1. Can this job be performed now?
    2. Is there another job that should be performed before this job?
    3. Is there someone else possibly working on this job?
    4. Should this job simply be deleted from the system entirely?
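
To make the first bullet concrete: the point is only that the toot and its delivery job get committed in one transaction. A hypothetical shape follows; the Db trait, insert_toot, insert_job, and post_toot are invented for illustration and are not rustodon's actual storage layer:

    struct Toot {
        content: String,
    }

    struct JobRecord {
        kind: &'static str,
        payload: String,   // serialized job arguments
        retries_left: u32, // part of the job's contract, persisted with it
    }

    trait Db {
        type Error;
        /// run `f` in a single transaction; roll everything back on Err.
        fn transaction<T, F>(&self, f: F) -> Result<T, Self::Error>
        where
            F: FnOnce(&Self) -> Result<T, Self::Error>;
        fn insert_toot(&self, toot: &Toot) -> Result<i64, Self::Error>;
        fn insert_job(&self, job: &JobRecord) -> Result<i64, Self::Error>;
    }

    /// either the toot and its delivery job both survive a crash, or neither does.
    fn post_toot<D: Db>(db: &D, content: String) -> Result<(), D::Error> {
        db.transaction(|tx| {
            let toot_id = tx.insert_toot(&Toot { content })?;
            tx.insert_job(&JobRecord {
                kind: "deliver_toot",
                payload: format!("{{\"toot_id\":{}}}", toot_id),
                retries_left: 5,
            })?;
            Ok(())
        })
    }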
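For the second bullet, the "always be examining the jobs" behaviour could just be a dispatcher loop that polls for due jobs, so a worker that gives up on a job can write a new run_at back into the journal instead of the job only being reconsidered at startup. Again, hypothetical names throughout:

    use std::thread;
    use std::time::{Duration, Instant};

    struct PendingJob {
        id: i64,
        run_at: Instant, // earliest time this job may be (re)dispatched
    }

    /// runs for the lifetime of the process, not just at startup.
    fn dispatcher_loop(
        due_jobs: impl Fn(Instant) -> Vec<PendingJob>, // read from the journal
        dispatch: impl Fn(PendingJob),                 // hand off to a worker
    ) -> ! {
        loop {
            for job in due_jobs(Instant::now()) {
                dispatch(job);
            }
            // a worker that fails or cancels a job rewrites its run_at
            // (exponential backoff, "retry tomorrow", etc.); the next pass
            // of this loop will pick it up again.
            thread::sleep(Duration::from_secs(1));
        }
    }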
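Finally, one possible shape for the contract itself, loosely following the dispatch!(DeliverToot(...), retry = 5, exponential_backoff = true) example above and mapping fields onto the four questions. None of this is an agreed interface; Contract, Job, and DeliverToot are placeholders:

    use std::time::{Duration, SystemTime};

    /// per-job policy, written into the journal alongside the job itself.
    struct Contract {
        max_retries: u32,                // how many attempts before giving up
        exponential_backoff: bool,       // spacing between retries
        run_after: Option<SystemTime>,   // 1. can this job be performed now?
        depends_on: Option<i64>,         // 2. is there a job that must run first?
        lease: Option<Duration>,         // 3. is someone else possibly working on it?
        expires_after: Option<Duration>, // 4. should it just be deleted eventually?
    }

    /// something a worker knows how to perform under a given contract.
    trait Job {
        fn contract(&self) -> Contract;
        fn perform(&self) -> Result<(), String>;
    }

    struct DeliverToot {
        toot_id: i64,
        inbox_url: String,
    }

    impl Job for DeliverToot {
        fn contract(&self) -> Contract {
            Contract {
                max_retries: 5,
                exponential_backoff: true,
                run_after: None,
                depends_on: None,
                lease: Some(Duration::from_secs(60)),
                expires_after: None,
            }
        }

        fn perform(&self) -> Result<(), String> {
            // actually POST to self.inbox_url here
            let _ = (self.toot_id, &self.inbox_url);
            Ok(())
        }
    }

A Forgot Password job would fill the same struct in very differently (no retries, short expiry), which is exactly the variability described above.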

@barzamin barzamin added the A: Backend Anything related to the backend label Jul 23, 2019