Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valkey Command Reference Spaghetti #31

Open
stockholmux opened this issue Apr 19, 2024 · 6 comments
Open

Valkey Command Reference Spaghetti #31

stockholmux opened this issue Apr 19, 2024 · 6 comments

Comments

@stockholmux
Copy link

stockholmux commented Apr 19, 2024

In a 103 message Slack conversations with @zuiderkwast, we discovered the inherited system of generating the command references has many moving parts and worryingly duplicated and unsynchronized information.

Currently, the website repo has to submodule both valkey-io/valkey-doc and valkey-io/valkey just to generate the command reference. And if someone needs to change/add arguments to a command in the software, they would need to touch 3 repos.

In valkey-io/valkey there is a json file for each command. This is used internally by valkey in the argument parser for things like arity, flags, and arguments. Example: ZRANGEBYLEX individual json

In valkey-io/valkey-doc there is a commands.json file which contains all the commands and is indirectly generated from the individual command files (by running an instance of Valkey and extracting the same data from command response with a python script). It ends up with almost the same data, but not quite (missing some pieces and has extra pieces vs the individual command). Example: ZRANGEBYLEX from commands.json. Of note, commands.json was out-of-date even before the split.

Additionally, there is resp2_replies.json and resp3_replies.json which appear to be generated... some how... but these have been human edited in the redis-doc repo (before the split).

The complex architecture here spreads content across many repos (docs, website, valkey itself) and increases the difficulty to do even simple things. This feels like a system that was built iteratively over years without a plan and we should take this opportunity to straighten out the spaghetti.

Goals:

  1. Remove processes that require committing generated content back to repos.
  2. Reduce/remove processes that require editing across multiple repos and in different processes.
  3. Create reproducible artifacts that can be consumed in other contexts instead of linking repos together
@madolson madolson added the help wanted Extra attention is needed label Apr 19, 2024
@zuiderkwast
Copy link
Contributor

I agree with goal 2.

Regarding goal 1, it's a good idea, but at the same time it's conflicting with another goal that I think is desirable: self-contained docs (goal 4).

Regarding goal 3, it's sounds good but where should we store these artifacts? In yet another repo? We'll have dependencies to the artifacts and then we may as well store these artifacts in the doc repo.

Here are a few things I would like the doc repo to provide:

  • Possibility to build man pages and HTML docs locally (without unreasonable deps). Think of it as a debian or redhat source package where the user runs make to get the manpages.)
  • Doc contributors can see and navigate the complete docs when editing and contributing docs.
  • Close to one-to-one mapping between doc repo structure and docs on website.

I agree that the various (outdated) json files and other junk is unacceptable. It can definitely be improved. but first let's see the current situation and what exactly were talking about.

What we have

Historically (until now) the doc repo had this property (self-contained) at the expense of having generated content checked in. The generated content we're talking about is:

  • Command syntax, group, ACL categories: Generated from a running node using a script utils/generate-commmands-json.py. Its output was stored in commands.json in the doc repo. This was supposed to be updated autmatically or at least regularily, but I suppose this was lost or never really worked.
  • Reply types in RESP2 and RESP3. These were generated at some point, I don't know exactly how, and checked in to the doc repo as resp2_replies.json and resp3_replies.json. The replies were validated using JSON-Schema in some way some time, but JSON Schema doesn't distinguish between sets and arrays, so things were edited manually afterwards.
  • Module API docs, generated from src/module.c using utils/generate-module-api-ref.rb and checked in to the doc repo as topics/module-api-ref.md. This worked relatively well in general and it was done after a new module API function was added.

What I'd like

  • Of course, we can't have such outdated in-repo artifacts, so a pre-condition is to have CI jobs that checks that they're up to date. Even better might be a CI job can update them automatically and open a doc PR to fix it. (I don't think it's very hard to do.)
  • Instead of commands in the doc repo being split into a three json files and a markdown file per command, i'd like to see just one markdown file. Metadata for each command (syntax, short description, reply types, ACL categories, etc.) can be stored as metadata at the top of each markdown file, so it has everything for that command. This is also easy to update automatically.

@stockholmux
Copy link
Author

Tell me more about your goal 4. I don't think I totally grok what 'self-contained' would mean in this context.

As for goal 3, I think that would be something we can store the artifacts using GitHub's artifact storage. It's still on GitHub but it's not stored in the git tree which makes far less problematic.

Possibility to build man pages and HTML docs locally (without unreasonable deps). Think of it as a debian or redhat source package where the user runs make to get the manpages.)

I think this so somewhat do-able, you and I might approach it slightly differently though. Being smart about it you can alternate the layout that generates the website; Jekyll probably falls into 'unreasonable deps' but there other options for SSGs that are just single, standalone binaries.

Doc contributors can see and navigate the complete docs when editing and contributing docs.

I don't think 'complete' is needed here really because lose potentially a lot to attain this goal. As an example, Let's say you want to include a how-to video. You can certainly do this with markdown that works on GitHub([![Watch the video](https://example.com/thumbnail.jpg)](https://youtu.be/dQw4w9WgXcQ)), however if you want to make that same video work in a template and scale to fit a phone, you need different markup and styles to maintain the aspect ratio. If you uses a tiny bit templating looking something like {{< youtube dQw4w9WgXcQ >}} you can make it work in the website or, say, remove it on platforms that don't support it.

For example, if I just pop a video into the Valkey website using the markdown that works for GitHub markdown preview, it will look something like this on the website:

Screenshot 2024-04-25 at 3 02 22 PM

It's not just limited to video embeds, but it also affects nicely usable diagrams, DRYing on notices, interpolate descriptions and titles and a bunch of other things that would be a massive pain to maintain over time.

Close to one-to-one mapping between doc repo structure and docs on website.

I think the /topics/... mapping was a good one and that should carry over in URLs. I don't think that should necessarily represent how the information should be organized though. We can do better than that in menus and subpages surely.

CI jobs that checks that they're up to date

This feels like an anti-pattern. You're storing information in two places and checking for synchronization. Modern SSGs can understand the format which Valkey stores the command JSON files. Instead of a CI job that checks it, I'd rather Valkey better utilize these files and store/pull more information from them.

i'd like to see just one markdown file. Metadata for each command (syntax, short description, reply types, ACL categories, etc.) can be stored as metadata at the top of each markdown file, so it has everything for that command. This is also easy to update automatically.

This isn't a bad idea, but the devil would be in the details. If you're creating a CI that makes PRs change the front matter I think you're going to be challenged to prevent people from making (or wanting to make) manual changes. I'm not a fan of having things that can't be edited by a human in GitHub (and that's partially how we got here in the first place).

I can definitely see a step that adds the information to the files as part of a build step, but it's very close to the method already used today where the command JSON is just referenced.

@zuiderkwast
Copy link
Contributor

Now you're trolling me. Rick roll. 😆

Jekyll probably falls into 'unreasonable deps' but there other options for SSGs that are just single, standalone binaries.

When building docs (say man pages) locally, getting the markdown source files from a debian source package using something like apt install valkey-doc-src and running make, then GitHub artifacts is an unreasonable dependency. In this context, you can only have dependencies to other debian packages (such as tools to convert markdown to man pages). That's what I meant.

What I mean by self-contained docs is that the docs package (ideally corresponding 1:1 to the doc repo) contain all of the doc content needed to render the docs. It's not 100% required. We could depend on the code repo (e.g. a distro package called valkey-src), but I don't think it's reasonable to depend on spinning up valkey just to dump some metadata and then use that to build the docs. (See below regarding JSON files.)

As an example, Let's say you want to include a how-to video.

I don't think we shall have embedded videos in the docs. I even think we shall avoid images, because they don't render well in a man page in a terminal. (There are only 9 images in 459 markdown files currently.) If we want diagrams, it's better to have them as ascii art in fenced code blocks. We can have links to videos but even that's not too nice in a man page.

We can have videos in other parts of the website but not in the docs.

it also affects nicely usable diagrams, DRYing on notices, interpolate descriptions and titles and a bunch of other things that would be a massive pain to maintain over time.

DRYing on notices? I don't get it. Do you mean repeating code like **Note:**? (That's less to repeat than {{% alert title="Note" type="info" %}} ... {{% /alert }} or whatever it was. We can never expect any contributor to write notices like that.)

Interpolate descriptions and titles? I don't know what you mean here. Nor do I understand "bunch of other things that would be a massive pain to maintain over time". I need concrete examples because I don't really see this.

CI jobs that checks that they're up to date

This feels like an anti-pattern. You're storing information in two places and checking for synchronization.

You're right, it's not nice. The only nice solution to all these goals (1-4) would be a monorepo maybe.

Modern SSGs can understand the format which Valkey stores the command JSON files.

The valkey-server uses those JSON files as input and computes the final ACL categories from a number of factors. If we want to use those JSON files directly and want find out the effective ACL categories, then we must duplicate the logic valkey-server uses to compute the ACL categories. That's why (I've been told) we shouldn't do that. Instead we should fetch the info from a running valkey node. That said, I think we should do it anyway if the alternatives are even worse. ;) Or perhaps better: 💡 We can store those ACL categories explicitly in those JSON files and remove that implicit juggling from valkey-server. Then it's safe to depend on the JSON files directly. (We'd need some logic to check that it's consistent though, but that can be in the python script that generates the C structs from the JSON files.)

Now I feel we're getting somewhere. I'd like to make this change so we can actually use those JSON files directly. That'd save us a lot of trouble.

If you're creating a CI that makes PRs change the front matter I think you're going to be challenged to prevent people from making (or wanting to make) manual changes. I'm not a fan of having things that can't be edited by a human in GitHub (and that's partially how we got here in the first place).

We have some generated content in the code repo too. commands.def is one such file. It's generated from the JSON files when you make and then checked in. (It rarely needs to be recompiled and never if you don't change any source code, so a user just building and installing never sees this.) We have a CI job that checks that it's up to date. Easy.

In generated code, it's pretty standard to insert a comment like "This is generated. Don't edit.". There are comments in YAML so it can be used in frontmatter I think. And if a user still edits some of these fields, a CI job can easily tell them that they did something wrong, just like we have a spell checker in a CI job.

I agree it's not ideal but I think it's less bad than the alternatives.

it's very close to the method already used today where the command JSON is just referenced

Yes, it's basically the same, just a little less scattered.

Perhaps we should just make sure we keep the commands.json updated and add some CI for that first? Of even before that, just do it once manually. Command syntax doesn't change that often anyway.

Sorry for a long and spaghetti-structured comment.

@zuiderkwast zuiderkwast removed the help wanted Extra attention is needed label Apr 26, 2024
@stockholmux
Copy link
Author

Re: 'unreasonable deps'

I'm still a bit unclear about this one, if the artifact is accessible via curl at a public URL that makes it unreasonable? That seems odd.

At any rate, I think we can swing something with make via other packages. All the new SSGs that I'm looking at are widely available in nearly every package manager and would be relatively trivial to wire up in a make file.

Re: self-contained docs

Okay, yeah, I think we're aligned that we shouldn't need Valkey to create valkey-docs.

Re: video/diagrams/images in docs

My personal preference is that we don't go through a one-way door on this.

Going to least common denominator for content types because of a technical reason will haunt us for years to come. I would rather have some sort of alternate presentation for text-only representations, like alternate links for video or diagrams.

(And ASCII art diagrams are sadly a nightmare for accessibility and responsiveness, we probably need to avoid them. I use a screen reader: ASCII art comes out as 5 minutes of a robot saying stuff like 'dot dash dash dot dot plus dot', plus it's hard to skip. Images can have a proper description that is a few seconds and skippable).

Re: DRYing on notices

I wasn't clear on that. I'm not implying that we need the Hugo alert title="Note" stuff, that's largely junk and more about HTML representation than content. However, a really common case is that you want to put a warning or notice across multiple pages. I'm making this specific case up but say you want to have a paragraph that describes a common issue people run into on the SCAN family of commands. Instead of copy/pasting the paragraph n times, you create a single notice and include it on all the pages. Later, you want to rephrase that notice, you don't have to remember every place you copy/pasted and update n files. Instead, you update one file and every page updates. It's not just this one use case, but it's an example.

We can store those ACL categories explicitly in those JSON files and remove that implicit juggling from valkey-server. Then it's safe to depend on the JSON files directly.

This is a good idea. Right now, as I understand it, the JSON files include part of the ACL picture, which feels like a design mistake. I don't quite get how Valkey infers the other ACL categories (I've looked but not deeply). I'm all for straight lines here.

Re: Committing generated code / preventing editing

I'd like to minimize this if at all possible, it makes me itch especially if the generated code pulls from other repos.
If we can't avoid it I want to be explicit/unmissable in every file that is non-editable. (Personally, I've had some bad experiences where the files weren't explicitly marked and wasted a day editing something generated; terrible contributor experience).

Re: commands.json

The website doesn't currently have a dependency on it (but we also don't have ACLs either) so I'd rather not take a dependency on something that will go away if individual command JSON files tell the whole story. That being said, I know Valkey isn't being built in a day and writing a small amount of code as a stop-gap may be unavoidable.

@zuiderkwast
Copy link
Contributor

if the artifact is accessible via curl at a public URL that makes it unreasonable? That seems odd.

It seems we are from different planets. 😄

In companies (like my employer) the build servers are essentially offline, so they can't by any chance download anything that's not been verified and approved security wise and legally wise.

In distros (at least in debian) packages are GPG signed and those who sign them are GPG signed by several other people, so they're highly trusted. (Assuming other distros are similar.) It may not be completely illegal to download stuff from elsewhere but I think it's a bit sketchy to do it without a good reason. It's normal for a package to depend on other package from the same distro package repo though, so for building the docs we can safely depend on valkey source for the JSON files.

alternate links for video or diagrams

If it can be done in Markdown (embedded HTML), then yes this is fine.

Another option is to have it on the website outside of the core docs. Getting started guides, how to download and install, etc. don't really belong in man pages anyway. We can have some different styles in different sections.

And ASCII art diagrams are sadly a nightmare for accessibility and responsiveness

Can't you skip code blocks? We have lots of output from that must be a nightmare for screen readers, like this: https://github.com/valkey-io/valkey-doc/blob/main/topics/cli.md#continuous-stats-mode

And what would you suggest for the git-merge manual?

image

the Hugo alert title="Note" stuff, that's largely junk

Thanks. I'm glad you agree on this one.

I'm making this specific case up but (...)
Instead of copy/pasting the paragraph n times, you create a single notice and include it on all the pages.
Later, you want to rephrase that notice, you don't have to remember every place you copy/pasted and update n files.

This rings 3 of my warning bells:

  1. There is no use case = YAGNI.
  2. copy/pasting the paragraph n times = boilerplate? No thanks. Any text based on boilerplate isn't a good text IMAO.
  3. "Later, you want to rephase that notice" rings my 3rd warning bell. Editing text out of context will certainly mess up the texts. They will look like artificial, generated boilerplate. Undoubtly, the surrounding text will assume something about what the included text says and refer to that, and later when the included text changes, it all doesn't make sense.

the JSON files include part of the ACL picture, which feels like a design mistake

Yeah, it's on my unwritten todo now, but I think I'll start with documenting the JSON files and see where that leads us. Incrementally.

Re: Committing generated code / preventing editing

I'd like to minimize this if at all possible

Yeah, I agree. I'm convinced. If we allow depending on the JSON files in the code repo, then we can remove almost all of that.

I'm considering adding build step for the docs, like a Makefile, to first build the full markdown files with H1, command syntax and everything. In the next step, these can be fed to a map page converter. Probably that middle-step can be used for the website too. But if such logic is done in Jekyll (etc) it's useful only for the website.

Re: commands.json

The website doesn't currently have a dependency on it

Great! (The redis website did though.) So don't depend on them! I'll try to solve the ACL problem. Let's delete commands.json (but we can wait until we have solved what it provides). I'm not sure we can get rid of the resp3_replies.json files that easily though. That info should go back into the JSON files in the code repo, but the scaffolding (validation, etc.) is not in place yet it seems.

It seems we're aligned on most of this now. 🤝

@stockholmux
Copy link
Author

In companies (like my employer) the build servers are essentially offline

Okay, so you're talking about air gapped docs. Got it. I don't see any blockers in using make files and common dependencies to generate any of this.

If it can be done in Markdown (embedded HTML), then yes this is fine.

That's the crux. It really can't. Embedded HTML will break the responsiveness of the website (see the rickroll). I really don't like embedding HTML directly into the markdown: It's likely to break the website in weird and hard to control ways.

Another option is to have it on the website outside of the core docs. Getting started guides, how to download and install, etc. don't really belong in man pages anyway. We can have some different styles in different sections.

I think this might be a really important note: man pages should be a subset of the entire documentation. We can have different rules for man vs non-man pages (which I'm generally fine with).

Re: ASCII Art diagrams and screen readers

No, you don't really want to skip code blocks (that's often the whole point). The example you gave from the git page is bad for accessibility. It comes out as "A B C topic D E F G H master" so someone using a screen reader exclusively would be totally lost as the connections are missing. We really need to be able to provide alternate text/description - Markdown doesn't provide this except in cases of images. So, we kind of need some way to make diagrams accessible, which will likely mean SSG shortcodes or macros to alternate ASCII art (for man pages) or real diagrams with alternate text for platforms that support it.

This rings 3 of my warning bells:

There is no use case = YAGNI.
copy/pasting the paragraph n times = boilerplate? No thanks. Any text based on boilerplate isn't a good text IMAO.
"Later, you want to rephase that notice" rings my 3rd warning bell. Editing text out of context will certainly mess up the texts. They will look like artificial, generated boilerplate. Undoubtly, the surrounding text will assume something about what the included text says and refer to that, and later when the included text changes, it all doesn't make sense.

Respectfully, I've used this very technique extensively on two very large oss/documentation projects, so the use case is clear and common.

Additionally, you do often have to repeat things across documentation; it's a safe assumption that on most documentation visits, users will read a single page or two of documentation and won't explore the links. So, if something needs to be mentioned across multiple pages, you write once and present where needed. Assuming that they'll find it elsewhere is pretty flimsy and re-writing it across many pages will lead to errors and (probable) madness.


Super excited when I saw valkey-io/valkey#403 ! I do think we're getting somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants