GH-41673: [Format][Docs] Add arrow format introductory page #41593

Status: Open, wants to merge 30 commits into base: main


AlenkaF (Member) commented May 8, 2024

Rationale for this change

The documentation for Arrow Format could be improved:

  • not all types are listed
  • not all layouts are explained

What changes are included in this PR?

This PR includes:

  • motivation behind the columnar format
  • the different physical layouts, each explained with a diagram that compares an example type with its physical layout
  • Arrow terminology
  • Arrow C Data interface

These are collected in a separate "introduction" page with no technical details. The Specifications index page is also restructured to include captions and to make the left sidebar menu better organised.

Note: a table listing all types together with their physical layouts will be added to the existing Columnar.rst page in a separate PR: #14752

Are these changes tested?

No, this is a docs change.

Are there any user-facing changes?

No.

AlenkaF (Member Author) commented May 9, 2024

cc @amoeba this could use a look already. I think everything I wanted to add is here. I will need to do one more general read-through before marking it ready for review, though.

AlenkaF (Member Author) commented May 9, 2024

@github-actions crossbow submit preview-docs

github-actions bot commented May 9, 2024

Revision: 3cdd97a

Submitted crossbow builds: ursacomputing/crossbow @ actions-4a1cc2326d

Task Status
preview-docs GitHub Actions

Comment on lines 40 to 42
CDataInterface
CStreamInterface
CDeviceDataInterface
Member Author

@jorisvandenbossche I have kept the C Stream Interface in a separate file, as the structure is nicer IMHO.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label May 9, 2024
AlenkaF (Member Author) commented May 13, 2024

@github-actions crossbow submit preview-docs

Revision: 4d2bf8a

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc7da250f4

Task Status
preview-docs GitHub Actions

AlenkaF (Member Author) commented May 13, 2024

Not sure why the captions in the left sidebar menu are not visible in the crossbow preview build:

Screenshot 2024-05-13 at 19 47 01

but are visible for me locally:

Screenshot 2024-05-13 at 19 46 44

AlenkaF marked this pull request as ready for review May 13, 2024 17:53
raulcd (Member) commented May 16, 2024

@github-actions crossbow submit preview-docs

Revision: 0af1708

Submitted crossbow builds: ursacomputing/crossbow @ actions-dde0f75093

Task Status
preview-docs GitHub Actions

raulcd (Member) commented May 20, 2024

I'm not sure what it is, but something seems to be off with the left-hand side panel. If I go to the columnar format page, the tutorial link does not appear:
[image]
If I manually go to the Intro page, the link appears, but there is an incorrectly increased nesting level after Arrow Columnar Format; see the image below:
[image]

raulcd (Member) left a comment

This is great!

docs/source/format/images/var-string-view-diagram.svg (outdated, resolved)
Member

Maybe not in this PR, but should we expand the explanation to cover where info like name and id (schema) is stored?

Member Author

For the case of a struct type, there is some info in the struct content:

The field (key) is saved in the schema and the values of a specific field (key) are saved in the child array.

Maybe it would be good to add more content in the "Overview of Arrow Terminology" section where Schema is mentioned.
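
To make the quoted split concrete, here is a minimal pure-Python sketch (not the pyarrow API). The names `schema_fields`, `child_arrays`, `struct_validity` and `struct_row` and the example values are assumptions made up for illustration; the only point is that field names sit in the schema while the values sit in one child array per field.

```python
# Hypothetical, simplified model of a struct array (not the pyarrow API).
# Field names live in the schema ...
schema_fields = ["name", "id"]

# ... while the values of each field live in its own child array.
child_arrays = [
    ["Joe", None, "Mark"],  # child array 0: values of field "name"
    [1, 2, None],           # child array 1: values of field "id"
]

# Validity of the struct slots themselves (all valid in this example).
struct_validity = [True, True, True]


def struct_row(i):
    """Rebuild logical row i by pairing schema field names with child values."""
    if not struct_validity[i]:
        return None
    return {name: child[i] for name, child in zip(schema_fields, child_arrays)}


rows = [struct_row(i) for i in range(len(struct_validity))]
```

Note that the child arrays never repeat the field names; those are only looked up in the schema when a row is reassembled.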

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label May 20, 2024
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 21, 2024
AlenkaF (Member Author) commented May 21, 2024

@raulcd thanks for checking the sidebar! It should be corrected with 2a990b4

AlenkaF (Member Author) commented May 21, 2024

@github-actions crossbow submit preview-docs

Revision: 830ac9a

Submitted crossbow builds: ursacomputing/crossbow @ actions-ed83863144

Task Status
preview-docs GitHub Actions

jorisvandenbossche (Member) left a comment

Did another pass!

docs/source/format/Intro.rst (five review threads, outdated, resolved)
Member

The validity bitmap of child array 1 is wrong here, I think (it is the same as for child array 0).
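
As a quick way to double-check a diagram like this, here is a small pure-Python sketch of how a validity bitmap is packed: one bit per slot, least-significant bit first, with 1 meaning "valid". The helper name `validity_bitmap` and the two child arrays are hypothetical and are not the values from the diagram; they only show that children with nulls in different slots end up with different bitmaps.

```python
def validity_bitmap(values):
    """Pack one validity bit per slot, least-significant bit first (1 = valid)."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Hypothetical child arrays with nulls in different slots.
child_0 = [1, None, 3, None]
child_1 = [None, "b", None, "d"]

bitmap_0 = validity_bitmap(child_0)  # bits 0 and 2 set
bitmap_1 = validity_bitmap(child_1)  # bits 1 and 3 set
```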

Member

To make the difference between dense and sparse union easier to understand, I think it would be valuable to use the same "logical" column with the same values for both, so you can compare the physical layouts 1:1 for the same array.
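
A rough pure-Python sketch of that 1:1 comparison, under stated assumptions: `sparse_union`, `dense_union`, the `type_of` mapping and the example column are hypothetical and only model the buffers conceptually. In the sparse layout every child is as long as the union itself, and the unselected slots (shown here as `None`) are placeholders whose content does not matter. In the dense layout each child holds only its own values and an extra offsets buffer points into it.

```python
def sparse_union(values, type_of):
    """Sparse layout: type ids plus children that all have the union's length."""
    n_children = max(type_of.values()) + 1
    type_ids = [type_of[type(v)] for v in values]
    children = [
        [v if t == child else None for v, t in zip(values, type_ids)]
        for child in range(n_children)
    ]
    return type_ids, children


def dense_union(values, type_of):
    """Dense layout: type ids, offsets, and children holding only their own values."""
    n_children = max(type_of.values()) + 1
    type_ids, offsets = [], []
    children = [[] for _ in range(n_children)]
    for v in values:
        t = type_of[type(v)]
        type_ids.append(t)
        offsets.append(len(children[t]))  # position of v inside its child
        children[t].append(v)
    return type_ids, offsets, children


# The same logical column for both layouts (hypothetical example values).
values = [1.2, 3, 3.4, 5]
type_of = {float: 0, int: 1}  # child 0 holds floats, child 1 holds ints

sparse_type_ids, sparse_children = sparse_union(values, type_of)
dense_type_ids, dense_offsets, dense_children = dense_union(values, type_of)
```

Both layouts produce the same type ids for the same column; only the offsets buffer and the length of the children differ.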

docs/source/format/Intro.rst (outdated, resolved)
Comment on lines +454 to +459
The Arrow C Data Interface defines a set of small C structures:

.. code-block::

   struct ArrowSchema {
     const char* format;
Member

For the tutorial during the conference, I included the struct to explain things. But for the intro here, I am not sure we should include it without much more explanation of the details of those structs.

For the purpose of this intro page, I would maybe also not call out the C Data Interface specifically. Instead, we could list and briefly explain the additional specifications that build on the columnar format, which (for this intro) could be the C Data Interface, for sharing data zero-copy within the same process, and IPC, as the serialization format for sharing data between processes.

docs/source/format/Intro.rst (outdated, resolved)
github-actions bot added the "awaiting review" and "awaiting changes" labels and removed the "awaiting change review" label May 23, 2024
github-actions bot added the "awaiting review" label and removed the "awaiting review" and "awaiting changes" labels May 23, 2024
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

4 participants