GH-41673: [Format][Docs] Add arrow format introductory page #41593
base: main
cc @amoeba this could use a look already. I think all I wanted to add is here. Will need to do a general look through one more time before marking it ready for review though.
@github-actions crossbow submit preview-docs
Revision: 3cdd97a Submitted crossbow builds: ursacomputing/crossbow @ actions-4a1cc2326d
docs/source/format/index.rst
CDataInterface
CStreamInterface
CDeviceDataInterface
@jorisvandenbossche I have kept the C Stream Interface in a separate file, as the structure is nicer IMHO.
803b4b9 to 4d2bf8a
@github-actions crossbow submit preview-docs
Revision: 4d2bf8a Submitted crossbow builds: ursacomputing/crossbow @ actions-cc7da250f4
@github-actions crossbow submit preview-docs
Revision: 0af1708 Submitted crossbow builds: ursacomputing/crossbow @ actions-dde0f75093
Not sure what it is, but there seems to be something going on with the left-hand side panel. If I go to the columnar format page, the tutorial link does not appear.
This is great!
Maybe not in this PR, but should we expand the explanation to cover where info like name and id (the schema) is stored?
For the case of a struct type, there is some info in the struct content:
The field (key) is saved in the schema, and the values of a specific field (key) are saved in the corresponding child array.
Maybe it would be good to add more content to the "Overview of Arrow Terminology" section, where Schema is mentioned.
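To make the split between schema and child arrays concrete, here is a rough sketch in plain Python. It only models the idea with lists and dicts; `schema` and `struct_array` are made-up names for illustration, not pyarrow API, and the sample rows are invented.

```python
# Illustrative model only: plain Python containers standing in for Arrow
# buffers. `schema` and `struct_array` are hypothetical names.

# Logical column: a struct with the fields "name" and "id".
rows = [
    {"name": "Joe", "id": 1},
    None,                      # a null struct slot
    {"name": "Mark", "id": 2},
]

# The field names (keys) and their types live in the schema, not in the data.
schema = [("name", "utf8"), ("id", "int32")]

# The values of each field live in one child array per field.
struct_array = {
    "validity": [row is not None for row in rows],
    "children": [
        [row[field] if row is not None else None for row in rows]
        for field, _type in schema
    ],
}

print(struct_array["children"][0])  # values of "name"
print(struct_array["children"][1])  # values of "id"
```

Note that the keys appear only once, in `schema`; the child arrays carry nothing but values.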
@github-actions crossbow submit preview-docs
Revision: 830ac9a Submitted crossbow builds: ursacomputing/crossbow @ actions-ed83863144
Did another pass!
The validity bitmap of child array 1 is wrong here, I think (it is the same as for child array 0).
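As a quick way to double-check bitmaps like these, here is a small sketch that derives a validity bitmap from a list of values, using the least-significant-bit numbering the columnar format specifies. The helper name `validity_bitmap` and the two child arrays are hypothetical; they are not the values from the diagram under review.

```python
def validity_bitmap(values):
    """Pack one validity bit per slot into bytes, least-significant bit first."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            # Slot i maps to bit (i % 8) of byte (i // 8).
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Hypothetical child arrays with nulls in different slots.
child_0 = [1, None, 3, 4]
child_1 = [None, "b", None, "d"]

for name, child in [("child 0", child_0), ("child 1", child_1)]:
    bits = validity_bitmap(child)
    print(name, format(bits[0], "08b"))
```

Two children with nulls in different positions should produce different bitmaps, which is the property the diagram seems to violate.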
I think that, to better understand the difference between dense and sparse unions, it might be valuable to use the same "logical" column with the same values, so you can compare the physical layouts of the same array 1:1.
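Something along these lines could serve as the basis for such a side-by-side example. It is a plain-Python sketch, not pyarrow: the helper names, the two-type union (int and str), and the sample values are all assumptions for illustration, and `None` is used as a stand-in for the unselected slots of a sparse child.

```python
def type_id_of(value):
    # Hypothetical two-type union: 0 -> int child, 1 -> str child.
    return 0 if isinstance(value, int) else 1


def to_sparse(values, n_types=2):
    type_ids = [type_id_of(v) for v in values]
    # Every child is as long as the union; unselected slots are placeholders.
    children = [
        [v if t == child else None for v, t in zip(values, type_ids)]
        for child in range(n_types)
    ]
    return {"type_ids": type_ids, "children": children}


def to_dense(values, n_types=2):
    type_ids, offsets = [], []
    children = [[] for _ in range(n_types)]
    for v in values:
        t = type_id_of(v)
        type_ids.append(t)
        offsets.append(len(children[t]))  # position inside the selected child
        children[t].append(v)
    return {"type_ids": type_ids, "offsets": offsets, "children": children}


def from_sparse(u):
    # Slot i is read from position i of the child picked by the type id.
    return [u["children"][t][i] for i, t in enumerate(u["type_ids"])]


def from_dense(u):
    # Slot i is read from the child picked by the type id, at the stored offset.
    return [u["children"][t][o] for t, o in zip(u["type_ids"], u["offsets"])]


# The same logical column, laid out both ways.
values = [5, "joe", 7, "mark"]
sparse, dense = to_sparse(values), to_dense(values)
print("sparse:", sparse)
print("dense: ", dense)
```

Both layouts share the same type ids buffer; the dense one adds an offsets buffer and in exchange keeps each child only as long as the number of values of that type.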
The Arrow C Data Interface defines a set of small C structures:

.. code-block::

   struct ArrowSchema {
     const char* format;
For the tutorial during the conference, I included the struct to explain things. But for the intro here I am not sure we should include it without much more explanation of the details of those structs.
For the purpose of this intro page, I would maybe also not call out the C Data Interface specifically. Instead, we could list and briefly explain the additional specifications built on top of the columnar format, which could be (for this intro) the C Data Interface, for sharing data zero-copy within the same process, and IPC, as the serialization format for sharing data between processes.
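If the intro wants a tiny illustration of that distinction, a standard-library analogy might be enough. The sketch below does not use the C Data Interface or Arrow IPC at all; it only contrasts handing out a view onto existing memory with serializing to bytes and decoding a copy, using `array` and `memoryview` as stand-ins.

```python
from array import array

# Producer-side data: a contiguous buffer of 32-bit style integers.
column = array("i", [1, 2, 3])

# Same process: hand out a view onto the existing memory (no copy is made).
shared = memoryview(column)

# Across processes: serialize to bytes; the receiver decodes its own copy.
payload = column.tobytes()
received = array("i")
received.frombytes(payload)

# A later change by the producer is visible through the view,
# but not in the deserialized copy.
column[0] = 99
print(shared[0])
print(received[0])
```

The view tracks the producer's buffer because it is the same memory, while the decoded payload is independent of it.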
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
The documentation for Arrow Format could be improved:
What changes are included in this PR?
This PR includes:
in a separate "introduction" page with no technical details. The Specifications index page is also restructured to include captions and to make the left sidebar menu better organised.
Note: a table listing all types together with their physical layout will be added to the existing Columnar.rst page in a separate PR: #14752
Are these changes tested?
No, this is a docs change.
Are there any user-facing changes?
No.