
Support special chars in S3URI #10283

Closed
Conversation

@dimas-b (Contributor) commented May 8, 2024

  • Remove code that interpreted the URI query and fragment parts according to RFC 3986. In practice, S3 locations do not encode special chars and, therefore, do not really have query and fragment parts.

  • Add TestS3FileIOMinio, which runs a small subset of tests against MinIO as a realistic S3 protocol implementation, to validate the handling of special chars.

Fixes #10279

CC: @danielcweeks
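A minimal sketch of the parsing behavior this PR aims for (hypothetical code, not the actual `org.apache.iceberg.aws.s3.S3URI` implementation): everything after the bucket is taken as the literal object key, so `#` and `?` are not treated as fragment/query delimiters.

```java
// Hypothetical sketch: split an S3 location into bucket and key, keeping
// '#' and '?' verbatim instead of applying RFC 3986 query/fragment parsing.
static String[] parseS3Location(String location) {
  int schemeEnd = location.indexOf("://");
  if (schemeEnd < 0) {
    throw new IllegalArgumentException("Invalid S3 location: " + location);
  }
  String rest = location.substring(schemeEnd + 3);
  int slash = rest.indexOf('/');
  String bucket = slash < 0 ? rest : rest.substring(0, slash);
  String key = slash < 0 ? "" : rest.substring(slash + 1);
  return new String[] {bucket, key}; // e.g. {"bucket", "data/te#st/file"}
}
```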

@github-actions github-actions bot added the AWS label May 8, 2024
@danielcweeks danielcweeks self-requested a review May 8, 2024 01:54
@danielcweeks (Contributor) commented

We shouldn't remove the param/fragment handling: even though query parameters and fragments aren't typically used, they are accepted by S3, and there are cases where they are used. Removing this would change the existing behavior.

@snazy (Member) commented May 8, 2024

> there are cases where they are used

As is not so widely known, special characters in S3 URIs (for example, backquotes and hash/question marks) are not properly escaped by Iceberg. This makes those locations/URIs violate RFC 2396, causing trouble even with AWS SDK functionality (see, for example, S3Utilities). This change is an effort to mitigate issues with those special characters. @danielcweeks, can you point to the use cases you're thinking of in the Iceberg code base? I think that not changing it will cause non-resolvable user-facing issues sooner or later.
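To illustrate the failure mode described here, this is what `java.net.URI` does with an unescaped `#` in an S3 location (a standalone illustration, not Iceberg code):

```java
import java.net.URI;

public class FragmentTruncation {
  public static void main(String[] args) {
    // An unescaped '#' is parsed as a fragment delimiter, silently
    // truncating the object key.
    URI uri = URI.create("s3://bucket/data/te#st/file.parquet");
    System.out.println(uri.getPath());     // prints: /data/te
    System.out.println(uri.getFragment()); // prints: st/file.parquet
  }
}
```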

@dimas-b (Contributor, author) commented May 8, 2024

@jackye1995 : What do you think about special chars in S3 URIs? I'd appreciate your insight. Thanks!

@danielcweeks (Contributor) commented

@dimas-b taking a look at the original issue, it appears the problem is more likely that we're not URL-encoding the column name in the path like we do with the partition value. For example, if you used a string column for the partition, you would see the following path:

spark-sql (default)> create table t1 (i int, `p#1` string) partitioned by (`p#1`);
spark-sql (default)> insert into table t1 values (1, "value#1");

s3://.../data/dO4ITQ/p#1=value%231/00000-4-92d34017-b333-4bdd-ac01-3589843bdfaf-0-00001.parquet

p#1=value%231 shows the key is not encoded, but the value is.

We URL-encode the partition value to avoid this type of issue and should probably do so for the column name as well. S3 supports a lot of pathing that is problematic (e.g., the double-slash issue), but I think we want to avoid supporting those behaviors and be more restrictive about what we allow. This is a larger ongoing discussion, and we may even want to put something into spec v3 to clarify valid pathing.
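A sketch of the encoding described above, using a hypothetical helper (this is not Iceberg's actual partition-path code): encode the field name the same way as the value, so `p#1=value#1` becomes `p%231=value%231`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: URL-encode both the partition field name and the
// partition value before building the data path.
static String partitionPath(String fieldName, String value) {
  return URLEncoder.encode(fieldName, StandardCharsets.UTF_8)
      + "="
      + URLEncoder.encode(value, StandardCharsets.UTF_8);
}

// partitionPath("p#1", "value#1") -> "p%231=value%231"
```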

@dimas-b (Contributor, author) commented May 9, 2024

Encoding special chars in partition path elements sounds like a good idea, but I'm not sure it is that simple.

The problem is not specific to what Iceberg code puts into directory names. The base location of a table may also include # chars. This works well in S3 (as the test case in this PR shows), but Iceberg's S3FileIO will not handle those locations ATM.

How do you propose to deal with special chars in base paths?

Also, with Iceberg URL-encoding special chars in file locations, the interpretation of those locations requires special logic. They cannot be interpreted according to RFC 3986, because the URI parsing rules require decoding those chars, which subsequently leads to mismatches at the S3 API level (the latter interpreting key parameters verbatim).

As I noted in the comments, this PR attempts to interpret S3 URIs in a manner consistent with what the AWS S3 UI produces. For example, if one creates a directory named te#st in S3 and obtains its URI from the AWS UI, the # char is not encoded. This applies both to Iceberg-produced directory names and to base locations controlled by the user.
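For reference, the S3 API itself takes the object key verbatim, so a key containing `#` works when passed unencoded. An AWS SDK v2 illustration under that assumption (bucket name is hypothetical; this is not code from this PR):

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class VerbatimKey {
  public static void main(String[] args) {
    try (S3Client s3 = S3Client.create()) {
      // The key is passed as-is; S3 stores the object under
      // "data/te#st/file.txt" with no fragment or query interpretation.
      s3.putObject(
          PutObjectRequest.builder()
              .bucket("my-bucket") // hypothetical bucket
              .key("data/te#st/file.txt")
              .build(),
          RequestBody.fromString("data"));
    }
  }
}
```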

@danielcweeks (Contributor) commented May 9, 2024

@dimas-b Unfortunately, I agree there will likely be problems if people use special characters in the pathing prefix that's outside of Iceberg's control, but overall, I think we want to discourage (if not prohibit through the spec) usage of unencoded characters. While S3 supports this behavior, it causes a number of compatibility problems.

For example, this change would likely be incompatible with S3AFileSystem or any other system that uses the Java URI implementation. We have this same type of issue with GCS, but for different reasons. I'm not sure about other implementations like pyarrow, OpenDAL in Rust, or Trino.

Overall, I think it's better to fail fast where interoperability is a concern as that's more important than supporting the full s3 key space.

I would suggest:

  1. fix the field name encoding
  2. continue the larger community discussion around what are "valid" paths
  3. recommend people avoid problematic naming
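A minimal sketch of the fail-fast idea from the comment above (hypothetical validation, not proposed Iceberg code): reject locations containing characters known to break `java.net.URI`-based implementations, instead of silently mis-parsing them.

```java
// Hypothetical fail-fast check: refuse locations with characters that
// java.net.URI-based implementations (e.g. S3AFileSystem) cannot parse.
static void validateLocation(String location) {
  for (char c : new char[] {'#', '?', '`'}) {
    if (location.indexOf(c) >= 0) {
      throw new IllegalArgumentException(
          "Unsupported character '" + c + "' in location: " + location);
    }
  }
}
```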

@dimas-b (Contributor, author) commented May 9, 2024

> this change would likely be incompatible with S3AFileSystem or any other system that uses the java URI implementation. [...]

I believe we are already in this situation even without this PR, because Iceberg already permits quotes (for example) to be used in column names, but the corresponding S3 locations will not be parseable by java.net.URI (cf. projectnessie/nessie#8328).
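For example (a standalone illustration of the point above, not Iceberg code): a location containing backquotes is rejected outright by `java.net.URI`, since backquotes are illegal in a URI path.

```java
import java.net.URI;

public class QuoteParseFailure {
  public static void main(String[] args) {
    // Throws IllegalArgumentException (wrapping URISyntaxException),
    // because '`' is not a legal URI character.
    URI.create("s3://bucket/data/`p1`=x/file.parquet");
  }
}
```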

@danielcweeks (Contributor) commented

@dimas-b I just put up #10329 to address the field name encoding. This should address the quotes issue as well, since quotes will be encoded. It doesn't do anything to address other ways that paths can get introduced (including the path leading up to the table location), but that's a larger topic, and I still feel we want to discourage (if not disallow) special characters in paths due to cross-compatibility issues.

@amogh-jahagirdar (Contributor) left a comment

> Overall, I think it's better to fail fast where interoperability is a concern as that's more important than supporting the full s3 key space.

After seeing a variety of these kinds of issues, I think this is the conclusion I've arrived at as well. The alternative is that every Iceberg implementation needs to handle all these cases, and I think that's probably too much complexity to be worth it.

I'm in the camp of trying to disallow special characters, but we still have not defined a notion of what counts as an acceptable path, and I think that's the part we would want to spec out in v3?

@dimas-b (Contributor, author) commented May 14, 2024

> I still feel we want to discourage (if not disallow) special characters in paths due to cross compatibility issues.

I made a comment under #10329 proposing an escaping method that would not conflict with the URI RFC. I think that should effectively resolve the special chars problem as far as Iceberg code is concerned.
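One possible shape of such an escaping method, as a hypothetical sketch (the concrete proposal lives in the #10329 discussion and is not reproduced here): percent-encode the problematic characters, including `%` itself, so the result is a valid RFC 3986 path that round-trips unambiguously through standard percent-decoding.

```java
// Hypothetical sketch of RFC-compatible escaping: encode '%', '#', '?' and
// backquote so the resulting location is a parseable URI and the original
// key can be recovered by ordinary percent-decoding.
static String escape(String rawKey) {
  StringBuilder sb = new StringBuilder();
  for (char c : rawKey.toCharArray()) {
    if (c == '%' || c == '#' || c == '?' || c == '`') {
      sb.append(String.format("%%%02X", (int) c));
    } else {
      sb.append(c);
    }
  }
  return sb.toString();
}

// escape("te#st") -> "te%23st"
```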

On top of that, discouraging or disallowing special chars in base locations also sounds good to me.

@dimas-b (Contributor, author) commented May 27, 2024

See #10329

@dimas-b dimas-b closed this May 27, 2024
Linked issue: Unpredictable behaviour with S3FileIO when column names contain #