Bikeshed a worker/job queue system #57

Open
barzamin opened this issue Jun 8, 2018 · 3 comments
Labels: A: Backend (Anything related to the backend), P: High 🔥 (High priority ticket)
Milestone: alpha

Comments

@barzamin
Member

barzamin commented Jun 8, 2018

  • in- or out-of-process?
  • storage backend?
    • can we contort postgres into fulfilling our needs? (kroeg did, but we probably shouldn't. we'd also probably have to use pg-specific stuff and that would create porting issues with Add SQLite backend #12)

i know @yipdw has Thoughts, it would be good to serialize them here maybe?

see also my post on r/rust, where people basically went 🤷‍♀️

dumping things here for myself or other people implementing this

  • existing code for worker pools n stuff
  • what do we want the interface to look like
  • what happens if rustodon(1) crashes when we still have jobs hanging around
    • if they're, eg, deliveries, they need to be retried when it comes back up.

there are methods (for postgres, SKIP LOCKED) to build work queues in the database, but we don't really care about that because we don't have multiple worker processes, and we can dispatch jobs to worker threads entirely in-core. what we really need is a journal table of jobs that have been started but not completed; when the app comes up, we read the journal and reissue those jobs.
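
rough sketch of that journal idea, just to make it concrete. every name here (JournalEntry, Journal, reissue_incomplete) is made up for illustration; nothing like this exists in rustodon yet:

    /// one row in the journal table: a job that has been dispatched but not
    /// yet marked complete.
    #[derive(Debug, Clone)]
    struct JournalEntry {
        id: i64,
        kind: String,    // e.g. "deliver_toot"
        payload: String, // serialized job arguments
    }

    /// whatever ends up persisting the journal (postgres, sqlite, ...).
    trait Journal {
        /// record a job before handing it to a worker thread.
        fn record(&self, entry: &JournalEntry);
        /// remove a job once a worker reports success.
        fn complete(&self, id: i64);
        /// everything that was started but never completed.
        fn incomplete(&self) -> Vec<JournalEntry>;
    }

    /// called once on startup: anything still in the journal was interrupted
    /// by a crash/restart, so hand it back to the in-process dispatcher.
    fn reissue_incomplete<J: Journal>(journal: &J, dispatch: impl Fn(JournalEntry)) {
        for entry in journal.incomplete() {
            dispatch(entry);
        }
    }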

@barzamin barzamin added this to the alpha milestone Jun 8, 2018
@barzamin barzamin added the P: High 🔥 High priority ticket label Jun 8, 2018
@hannahwhy
Contributor

hannahwhy commented Jun 8, 2018

Well, the choice of job system depends on Rustodon's design goals.

If the goal is scaling down -- scaling down to small systems and situations where no system administrator is present -- I think in-process, or at least automatically managed by the main process, makes sense. (I say that so as not to preclude the possibility of the job runners being e.g. separate threads in a separate process whose lifetime is controlled by the main rustodon process, because that can be done without human intervention, and a separate worker process may be useful for taking advantage of isolation/quota mechanisms provided by the OS.)

If the goal is scaling up to multiple machines, then it may make sense to have an explicitly controllable separate process, because existing tools for multi-machine deployments are designed to expect things like separate web and worker processes.

I don't know if it's possible, or desirable, to target both spaces.

My preference is in-process, but I'm also interested in squeezing an ActivityPub server into nothing more than a single executable and state file (i.e. #12), so that should explain my position there.
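
For concreteness, the in-process shape could be as small as a channel feeding a few worker threads whose lifetime is owned by the main rustodon process. This is only an illustrative, std-only sketch and makes no claim about the eventual API; Job and spawn_workers are made-up names:

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    /// a job as the workers see it; in practice this would cover the real
    /// job kinds (deliveries, etc.). purely illustrative.
    enum Job {
        Deliver { inbox_url: String, body: String },
    }

    /// spawn `n` worker threads owned by this process; the returned Sender is
    /// the in-process queue. dropping it shuts the workers down.
    fn spawn_workers(n: usize) -> mpsc::Sender<Job> {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..n {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // take the next job; if the sender is gone, the main process
                // is shutting down, so exit this worker thread.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break,
                };
                match job {
                    Job::Deliver { inbox_url, body } => {
                        // perform the delivery here (HTTP POST, signing, etc.)
                        let _ = (inbox_url, body);
                    }
                }
            });
        }
        tx
    }

Usage would just be spawn_workers(4) at startup and queue.send(Job::Deliver { ... }) wherever a delivery needs to go out; a separate-process variant could keep the same interface and put the receiving loop behind an IPC boundary instead.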

@keiyakins

I'd have to agree with the general principle that scaling down is more useful than scaling up. Anyone with tens of thousands of users is likely going to have significant infrastructure they'll want to leverage, in the form of distributed data stores, single sign-on systems, etc. That'll mean they'd have to rewrite a lot of it anyway.

On the other hand, being able to easily run an instance on a cheap VPS or even a low-power single board computer maximizes the usefulness of federation and allows wide scaling with lots of instances.

@netshade
Contributor

  • I think that there still might be value wrt using Postgres / SQLite as the place where job records / retry counts / some level of worker memory is persisted, if only such that the atomicity of the application record that created the job and the persistence of the job can be maintained at the same level. That is to say, if I toot something and the DB definitely persisted the write, but the filesystem maybe did not, that would defeat the expectations of the user. ( "I tooted X and then my server restarted. Why is my toot not appearing in the CC'd users' timelines?" ) It strikes me as safer to be able to say that if the action that caused the job was persisted, then the need to perform the job was equally persisted. ( sketched below, after this list )

  • When I read the phrase:

    when the app comes up, we read the journal and reissue those jobs

    that suggests to me that workers are primarily responding only to new jobs, except in the case of a restart where incomplete jobs are re-dispatched, per your comment. I think that there might be a need for workers to be able to cancel their own jobs ( as much as they are able to, barring severe deadlocks ) and mark them as to-be-retried-at-some-future-date, which would sort of require the system to always be reissuing those jobs. Which is to say that the work of examining the jobs and potentially dispatching them to a worker is a thing the system should always be doing, rather than only at startup time. ( sketched below, after this list )

  • Following the above comment, I think that if there is a "most important thing" for the worker system, to me it would be the contract of a job. Where it's performed has mechanical characteristics, but what is expressed by a job in the system, eg:

    dispatch!(DeliverToot(...), retry = 5, exponential_backoff = true)
    

    where what the contract of a toot delivery is can be understood just by reading it. What sort of assurances we would want for a given job in the system strikes me as highly variable ( Forgot Password does not necessarily need the same level of durability as Deliver This Very Important Toot ), so to me, the contract being written into the journal table you are talking about seems important enough to merit its own bullet point in the feature list of this system. That is to say, the things to be addressed in this discussion are:

    1. How a job is performed ( in process, out of process )
    2. Where a job is kept ( DB, memory, filesystem, something else? )
    3. What a job is ( The details of the contract between the dispatcher and the worker )

    To that end, I think that the contract will need to carry enough information to satisfy the following questions that a worker may need to answer ( a sketch follows the list below ):

    1. Can this job be performed now?
    2. Is there another job that should be performed before this job?
    3. Is there someone else possibly working on this job?
    4. Should this job simply be deleted from the system entirely?
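
To make the first bullet concrete: the point is only that the toot and its delivery job get committed in one transaction. A hypothetical shape follows; the Db trait, insert_toot, insert_job, and post_toot are invented for illustration and are not rustodon's actual storage layer:

    struct Toot {
        content: String,
    }

    struct JobRecord {
        kind: &'static str,
        payload: String,   // serialized job arguments
        retries_left: u32, // part of the job's contract, persisted with it
    }

    trait Db {
        type Error;
        /// run `f` in a single transaction; roll everything back on Err.
        fn transaction<T, F>(&self, f: F) -> Result<T, Self::Error>
        where
            F: FnOnce(&Self) -> Result<T, Self::Error>;
        fn insert_toot(&self, toot: &Toot) -> Result<i64, Self::Error>;
        fn insert_job(&self, job: &JobRecord) -> Result<i64, Self::Error>;
    }

    /// either the toot and its delivery job both survive a crash, or neither does.
    fn post_toot<D: Db>(db: &D, content: String) -> Result<(), D::Error> {
        db.transaction(|tx| {
            let toot_id = tx.insert_toot(&Toot { content })?;
            tx.insert_job(&JobRecord {
                kind: "deliver_toot",
                payload: format!("{{\"toot_id\":{}}}", toot_id),
                retries_left: 5,
            })?;
            Ok(())
        })
    }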
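For the second bullet, the "always be examining the jobs" behaviour could just be a dispatcher loop that polls for due jobs, so a worker that gives up on a job can write a new run_at back into the journal instead of the job only being reconsidered at startup. Again, hypothetical names throughout:

    use std::thread;
    use std::time::{Duration, Instant};

    struct PendingJob {
        id: i64,
        run_at: Instant, // earliest time this job may be (re)dispatched
    }

    /// runs for the lifetime of the process, not just at startup.
    fn dispatcher_loop(
        due_jobs: impl Fn(Instant) -> Vec<PendingJob>, // read from the journal
        dispatch: impl Fn(PendingJob),                 // hand off to a worker
    ) -> ! {
        loop {
            for job in due_jobs(Instant::now()) {
                dispatch(job);
            }
            // a worker that fails or cancels a job rewrites its run_at
            // (exponential backoff, "retry tomorrow", etc.); the next pass
            // of this loop will pick it up again.
            thread::sleep(Duration::from_secs(1));
        }
    }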
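Finally, one possible shape for the contract itself, loosely following the dispatch!(DeliverToot(...), retry = 5, exponential_backoff = true) example above and mapping fields onto the four questions. None of this is an agreed interface; Contract, Job, and DeliverToot are placeholders:

    use std::time::{Duration, SystemTime};

    /// per-job policy, written into the journal alongside the job itself.
    struct Contract {
        max_retries: u32,                // how many attempts before giving up
        exponential_backoff: bool,       // spacing between retries
        run_after: Option<SystemTime>,   // 1. can this job be performed now?
        depends_on: Option<i64>,         // 2. is there a job that must run first?
        lease: Option<Duration>,         // 3. is someone else possibly working on it?
        expires_after: Option<Duration>, // 4. should it just be deleted eventually?
    }

    /// something a worker knows how to perform under a given contract.
    trait Job {
        fn contract(&self) -> Contract;
        fn perform(&self) -> Result<(), String>;
    }

    struct DeliverToot {
        toot_id: i64,
        inbox_url: String,
    }

    impl Job for DeliverToot {
        fn contract(&self) -> Contract {
            Contract {
                max_retries: 5,
                exponential_backoff: true,
                run_after: None,
                depends_on: None,
                lease: Some(Duration::from_secs(60)),
                expires_after: None,
            }
        }

        fn perform(&self) -> Result<(), String> {
            // actually POST to self.inbox_url here
            let _ = (self.toot_id, &self.inbox_url);
            Ok(())
        }
    }

A Forgot Password job would fill the same struct in very differently (no retries, short expiry), which is exactly the variability described above.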

@barzamin barzamin added the A: Backend Anything related to the backend label Jul 23, 2019