[R] Unable to disable url-encoding #41618

Open
r2evans opened this issue May 10, 2024 · 3 comments

r2evans commented May 10, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I have a local datamart of various table schemas using hive partitioning. Non-arrow (and non-R) tools also access the directories, and it would be nice not to have to search for names both with and without URL encoding. I cannot find an option or an argument that disables it. I recognize that S3 buckets might require it, but it seems like a bug (or mis-design?) that we cannot disable this otherwise disruptive and undocumented feature. Is this really silently hard-coded and required in all instances?

The datamart is on a local filesystem, and spaces are (obviously) fully permissible in directory names.

At a minimum, I feel documentation in write_dataset would be appropriate, though it would be really useful to not have to change all other utilities to work around this seemingly unnecessary behavior.

R-4.3.2 and arrow_15.0.1.

mt <- mtcars
mt$key <- paste(mt$cyl, mt$gear)
(td <- tempfile(fileext=".d"))
# [1] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d"
dir.create(td)
res <- arrow::write_dataset(mt, path = td, partitioning = "key")
res
# NULL
Sys.glob(paste0(td, "/*/*"))
# [1] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%203/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%204/part-0.parquet"
# [3] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%205/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%203/part-0.parquet"
# [5] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%204/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%205/part-0.parquet"
# [7] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=8%203/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=8%205/part-0.parquet"

There is nothing in the return value that suggests the partitioning keys were url-encoded.
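
Since the return value gives no hint, one way to check after the fact is to compare each directory name with its decoded form. This is only a minimal base-R sketch: `find_encoded_dirs()` is a hypothetical helper name (not an arrow function), and it assumes that a directory name which changes under `utils::URLdecode()` was percent-encoded on write.

```r
# Hypothetical helper (not part of arrow): list the subdirectories of `root`
# whose names change when percent-decoded, e.g. "key=4%203".
find_encoded_dirs <- function(root) {
  # Relative paths; the root itself is reported as "" and dropped.
  rel <- list.dirs(root, full.names = FALSE, recursive = TRUE)
  rel <- rel[nzchar(rel)]
  # utils::URLdecode() is not vectorised, so decode one name at a time.
  decoded <- vapply(basename(rel), utils::URLdecode, character(1),
                    USE.NAMES = FALSE)
  rel[decoded != basename(rel)]
}
```

Calling `find_encoded_dirs(td)` on the dataset directory from the example above would be one way to see which partition directories are affected.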

Component(s)

R

amoeba changed the title from "unable to disable url-encoding" to "[R] Unable to disable url-encoding" on May 11, 2024
Member

amoeba commented May 11, 2024

Hi @r2evans, this is something we're aware of, see #34905 (comment). It's unfortunately not as simple as one approach being clearly better than the other. I don't think anyone's actively working on it, so if you wanted to take on the work as described there, that'd be very welcome.

Author

r2evans commented May 11, 2024

Huh, I swear I searched issues for "url" and "encode"; I don't know why I didn't see that. At least it's good to know I'm not the only one who finds it non-obvious. I understand the issues with something like S3 not allowing spaces, which is why I suggested at least documenting it. The necessary steps/hints in #34905 (comment) are really useful, though it seems unlikely that somebody will be both able and willing to alter the underlying C++ as well as the R and Python.

An interesting (to me) note: despite requiring the url-encoding when writing the partitioning values, it does not require them when reading it. This means for my datamart, I can rename the directories immediately post-write (it's part of the datamart process anyway, for various reasons) and nobody is the wiser.
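
The post-write rename described above could look roughly like the following. It is a minimal base-R sketch under stated assumptions, not something arrow provides: `decode_partition_dirs()` is a hypothetical helper name, the dataset lives on a local filesystem where the decoded names are valid, and every `%XX` sequence in a directory name is assumed to come from percent-encoding.

```r
# Hypothetical helper (not part of arrow): rename percent-encoded partition
# directories under `root` to their decoded form, e.g. "key=4%203" -> "key=4 3".
decode_partition_dirs <- function(root) {
  # Relative paths; the root itself is reported as "" and dropped.
  rel <- list.dirs(root, full.names = FALSE, recursive = TRUE)
  rel <- rel[nzchar(rel)]
  # Longest (deepest) paths first, so a parent is renamed only after
  # all of its children have been handled.
  rel <- rel[order(nchar(rel), decreasing = TRUE)]
  for (r in rel) {
    old <- file.path(root, r)
    decoded <- utils::URLdecode(basename(old))
    if (identical(decoded, basename(old))) next
    new <- file.path(dirname(old), decoded)
    if (file.exists(new)) {
      # Do not clobber or merge into an existing directory.
      warning("skipping ", old, ": target already exists")
      next
    }
    file.rename(old, new)
  }
  invisible(root)
}
```

In the example from the issue description, this would be called as `decode_partition_dirs(td)` right after `arrow::write_dataset()`. Whether readers other than arrow cope with the decoded names is outside the scope of this sketch.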

Thanks.

Member

amoeba commented May 11, 2024

If that's the case, there may be some latent bugs in the implementation since the original PR that changed things was made.

PS: GitHub's search has gotten worse for people recently, so it may be harder than normal to find things for a while.
