
Support special chars in S3URI #10283

Closed
Conversation

@dimas-b (Contributor) commented May 8, 2024

  • Remove code that interpreted the URI query and fragment parts according to RFC 3986. In practice, S3 locations do not encode special chars and, therefore, do not really have query and fragment parts.

  • Add TestS3FileIOMinio, which runs a small subset of tests against MinIO as a realistic S3 protocol implementation, to validate the handling of special chars.

Fixes #10279

CC: @danielcweeks
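A minimal sketch of the parsing behavior this PR aims for (hypothetical code, not the actual `org.apache.iceberg.aws.s3.S3URI` implementation): everything after the bucket is taken as the literal object key, so `#` and `?` are not treated as fragment/query delimiters.

```java
// Hypothetical sketch: split an S3 location into bucket and key, keeping
// '#' and '?' verbatim instead of applying RFC 3986 query/fragment parsing.
static String[] parseS3Location(String location) {
  int schemeEnd = location.indexOf("://");
  if (schemeEnd < 0) {
    throw new IllegalArgumentException("Invalid S3 location: " + location);
  }
  String rest = location.substring(schemeEnd + 3);
  int slash = rest.indexOf('/');
  String bucket = slash < 0 ? rest : rest.substring(0, slash);
  String key = slash < 0 ? "" : rest.substring(slash + 1);
  return new String[] {bucket, key}; // e.g. {"bucket", "data/te#st/file"}
}
```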

@github-actions github-actions bot added the AWS label May 8, 2024
@danielcweeks danielcweeks self-requested a review May 8, 2024 01:54
@danielcweeks (Contributor) commented

We shouldn't remove the param/fragment handling: even though query parameters and fragments aren't typically used, they are accepted by S3, and there are cases where they are used. Removing this would change the existing behavior.

@snazy (Member) commented May 8, 2024

> there are cases where they are used

As is not so widely known, special characters in S3 URIs (for example, backquotes and hash/question marks) are not properly escaped by Iceberg. This makes those locations/URIs violate RFC 2396, causing trouble even with AWS SDK functionality (see, for example, S3Utilities). This change is an effort to mitigate issues with those special characters. @danielcweeks, can you point to the use cases you're thinking of in the Iceberg code base? I think that not changing it will cause non-resolvable user-facing issues sooner or later.
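To illustrate the failure mode described here, this is what `java.net.URI` does with an unescaped `#` in an S3 location (a standalone illustration, not Iceberg code):

```java
import java.net.URI;

public class FragmentTruncation {
  public static void main(String[] args) {
    // An unescaped '#' is parsed as a fragment delimiter, silently
    // truncating the object key.
    URI uri = URI.create("s3://bucket/data/te#st/file.parquet");
    System.out.println(uri.getPath());     // prints: /data/te
    System.out.println(uri.getFragment()); // prints: st/file.parquet
  }
}
```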

@dimas-b (Contributor, author) commented May 8, 2024

@jackye1995 : What do you think about special chars in S3 URIs? I'd appreciate your insight. Thanks!

@danielcweeks (Contributor) commented

@dimas-b taking a look at the original issue, it appears the problem is more likely that we're not URL-encoding the column name in the path like we do with the partition value. For example, if you used a string column for the partition, you would see the following path:

spark-sql (default)> create table t1 (i int, `p#1` string) partitioned by (`p#1`);
spark-sql (default)> insert into table t1 values (1, "value#1");

s3://.../data/dO4ITQ/p#1=value%231/00000-4-92d34017-b333-4bdd-ac01-3589843bdfaf-0-00001.parquet

p#1=value%231 shows the key is not encoded, but the value is.

We URL-encode the partition value to avoid this type of issue and should probably do so for the column name as well. S3 supports a lot of pathing that is problematic (e.g., the double-slash issue), but I think we want to avoid supporting those behaviors and be more restrictive about what we allow. This is a larger ongoing discussion, and we may even want to put something into spec v3 to clarify valid pathing.
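A sketch of the encoding described above, using a hypothetical helper (this is not Iceberg's actual partition-path code): encode the field name the same way as the value, so `p#1=value#1` becomes `p%231=value%231`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: URL-encode both the partition field name and the
// partition value before building the data path.
static String partitionPath(String fieldName, String value) {
  return URLEncoder.encode(fieldName, StandardCharsets.UTF_8)
      + "="
      + URLEncoder.encode(value, StandardCharsets.UTF_8);
}

// partitionPath("p#1", "value#1") -> "p%231=value%231"
```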

@dimas-b (Contributor, author) commented May 9, 2024

Encoding special chars in partition path elements sounds like a good idea, but I'm not sure it is that simple.

The problem is not specific to what Iceberg code puts into directory names. The base location of a table may also include # chars. This works well in S3 (as the test case in this PR shows), but Iceberg's S3FileIO will not handle those locations ATM.

How do you propose to deal with special chars in base paths?

Also, with Iceberg URL-encoding special chars in file locations, the interpretation of those locations requires special logic. They cannot be interpreted according to RFC 3986, because the URI parsing rules require decoding those chars, which subsequently leads to mismatches at the S3 API level (the latter interpreting key parameters verbatim).

As I noted in the comments, this PR attempts to interpret S3 URIs in a manner consistent with what the AWS S3 UI produces. For example, if one creates a directory named te#st in S3 and obtains its URI from the AWS UI, the # char is not encoded. This applies both to Iceberg-produced directory names and to base locations controlled by the user.
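For reference, the S3 API itself takes the object key verbatim, so a key containing `#` works when passed unencoded. An AWS SDK v2 illustration under that assumption (bucket name is hypothetical; this is not code from this PR):

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class VerbatimKey {
  public static void main(String[] args) {
    try (S3Client s3 = S3Client.create()) {
      // The key is passed as-is; S3 stores the object under
      // "data/te#st/file.txt" with no fragment or query interpretation.
      s3.putObject(
          PutObjectRequest.builder()
              .bucket("my-bucket") // hypothetical bucket
              .key("data/te#st/file.txt")
              .build(),
          RequestBody.fromString("data"));
    }
  }
}
```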

@danielcweeks (Contributor) commented May 9, 2024

@dimas-b Unfortunately, I agree there will likely be problems if people use special characters in the pathing prefix that's outside of Iceberg's control, but overall, I think we want to discourage (if not prohibit through the spec) usage of unencoded characters. While S3 supports this behavior, it causes a number of compatibility problems.

For example, this change would likely be incompatible with S3AFileSystem or any other system that uses the Java URI implementation. We have this same type of issue with GCS, but for different reasons. I'm not sure about other implementations like pyarrow, OpenDAL in Rust, or Trino.

Overall, I think it's better to fail fast where interoperability is a concern as that's more important than supporting the full s3 key space.

I would suggest:

  1. fix the field name encoding
  2. continue the larger community discussion around what are "valid" paths
  3. recommend people avoid problematic naming
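A minimal sketch of the fail-fast idea from the comment above (hypothetical validation, not proposed Iceberg code): reject locations containing characters known to break `java.net.URI`-based implementations, instead of silently mis-parsing them.

```java
// Hypothetical fail-fast check: refuse locations with characters that
// java.net.URI-based implementations (e.g. S3AFileSystem) cannot parse.
static void validateLocation(String location) {
  for (char c : new char[] {'#', '?', '`'}) {
    if (location.indexOf(c) >= 0) {
      throw new IllegalArgumentException(
          "Unsupported character '" + c + "' in location: " + location);
    }
  }
}
```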

@dimas-b (Contributor, author) commented May 9, 2024

> this change would likely be incompatible with S3AFileSystem or any other system that uses the java URI implementation. [...]

I believe we are already in this situation even without this PR, because Iceberg already permits quotes (for example) to be used in column names, but the corresponding S3 locations will not be parseable by java.net.URI (cf. projectnessie/nessie#8328).
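For example (a standalone illustration of the point above, not Iceberg code): a location containing backquotes is rejected outright by `java.net.URI`, since backquotes are illegal in a URI path.

```java
import java.net.URI;

public class QuoteParseFailure {
  public static void main(String[] args) {
    // Throws IllegalArgumentException (wrapping URISyntaxException),
    // because '`' is not a legal URI character.
    URI.create("s3://bucket/data/`p1`=x/file.parquet");
  }
}
```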

@danielcweeks (Contributor) commented

@dimas-b I just put up #10329 to address the field name encoding. This should address the quotes issue as well, since quotes will be encoded. It doesn't do anything to address other ways that paths can get introduced (including the path leading up to the table location), but that's a larger topic, and I still feel we want to discourage (if not disallow) special characters in paths due to cross-compatibility issues.

@amogh-jahagirdar (Contributor) left a comment

> Overall, I think it's better to fail fast where interoperability is a concern as that's more important than supporting the full s3 key space.

After seeing a variety of these kinds of issues, I think this is the conclusion I've arrived at as well. The alternative is that every Iceberg implementation needs to handle all these cases, and I think that's probably too much complexity to be worth it.

I'm in the camp of trying to disallow special characters, but we still have not defined a notion of what counts as an acceptable path, and I think that's the part we would want to spec out in v3?

@dimas-b (Contributor, author) commented May 14, 2024

> I still feel we want to discourage (if not disallow) special characters in paths due to cross compatibility issues.

I made a comment under #10329 proposing an escaping method that would not conflict with the URI RFC. I think that should effectively resolve the special chars problem as far as Iceberg code is concerned.
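One possible shape of such an escaping method, as a hypothetical sketch (the concrete proposal lives in the #10329 discussion and is not reproduced here): percent-encode the problematic characters, including `%` itself, so the result is a valid RFC 3986 path that round-trips unambiguously through standard percent-decoding.

```java
// Hypothetical sketch of RFC-compatible escaping: encode '%', '#', '?' and
// backquote so the resulting location is a parseable URI and the original
// key can be recovered by ordinary percent-decoding.
static String escape(String rawKey) {
  StringBuilder sb = new StringBuilder();
  for (char c : rawKey.toCharArray()) {
    if (c == '%' || c == '#' || c == '?' || c == '`') {
      sb.append(String.format("%%%02X", (int) c));
    } else {
      sb.append(c);
    }
  }
  return sb.toString();
}

// escape("te#st") -> "te%23st"
```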

On top of that, discouraging or disallowing special chars in base locations also sounds good to me.

@dimas-b (Contributor, author) commented May 27, 2024

See #10329

@dimas-b dimas-b closed this May 27, 2024
Linked issue: Unpredictable behaviour with S3FileIO when column names contain #