introduce `HeaderValue` type to replace use of `String` #27

yskopets · 2020-08-10T22:09:36Z

Context

Related to hostcalls::get_map crashes Envoy when an HTTP header value is not a UTF-8 encoded string #25

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

lizan · 2020-08-10T22:50:11Z

@PiotrSikora PTAL. While it's not encouraged but we do have certain traffic contains non UTF-8 header value. Since spec allows it historically, we shouldn't block that.

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

PiotrSikora · 2020-08-11T16:17:18Z

@PiotrSikora PTAL. While it's not encouraged but we do have certain traffic contains non UTF-8 header value. Since spec allows it historically, we shouldn't block that.

This is actually very nuanced issue, since it exposes users to various attacks... but I agree that we should allow opaque data, and I had hostcalls returning Bytes until the week of open sourcing this SDK, so I need to figure out what I was thinking at the time.

@yskopets the HeaderValue looks nice, thanks! But where exactly do you want to use it in the SDK? hostcalls or traits? My thinking right now is that hostcalls should always return Bytes and then we could have "string" and "raw" accessors to the headers in traits. Reason being that 99%+ of the header values are valid UTF-8 strings, and I don't want to return the non-UTF-8 values to users not expecting/requesting them explicitly. Does this make sense?

(Note: I need to wrap a few things over the next 2 days, so I'll get back to this later in the week, since I don't want to make this breaking change in rush.)

yskopets · 2020-08-11T17:20:09Z

I was thinking about returning HeaderValue from the hostcalls.

A typical developer journey starts from

for (name, value) in &request_headers() {
    println!("{}={}", name, value);
}

Returning a type like HeaderValue makes this code both correct and convenient.

For comparison, request_headers_string() would loose correctness, and request_headers_raw() would loose convenience.

And if you decide to return Vec<(String, HeaderValue)> from get_map, then it seems natural to do the same for get_map_value.

Regarding set_map/set_map_value, I think they should use generics to allow any input: HeaderValue, String, &str, Vec<u8>, &[u8].

BTW, have you looked into https://crates.io/crates/http ? I started from there, but since it doesn't allow for pseudo headers like :path and :status I ended up creating a bare minimum version with only HeaderValue. Still, it would be nice to have HeaderMap and RequestBuilder as well.

yskopets · 2020-08-25T00:20:25Z

@PiotrSikora After looking into other use cases, I think, it makes sense to rename HeaderValue into ByteString and use it everywhere in the API instead of String.

In particular, Vec<(String, String)> could be replaced with Vec<(ByteString, ByteString)>, Bytes (currently, alias for Vec<u8>) could be replaced with ByteString.

ByteString, in this design, is responsible for:

equality comparison against &str, String, &[u8]
Index* operations
Display and Debug implementations that cover non-UTF-8 strings

Users who need operations such as starts_with, ends_with, replace, etc could additionally use bstr crate (ByteSlice and ByteVec traits).

Another reason to have ByteString is to move Vec<u8> out of API into implementation details. Keeping in mind iovec types from proxy-wasm/spec#1, Vec<u8> could eventually be replaced with Bytes from bytes crate (to avoid allocating a separate Vec for each header name and value).

codefromthecrypt · 2021-04-19T10:03:19Z

wondering if this is a blocking change to move off our fork? considering all libraries discussed are not api stable (even http is 0.2.4)

@lizan, there's an implication that there is customer traffic that not only exists, but is specifically using rust-wasm-sdk. If that's so, can you have them comment what non-utf8 headers they are using? I mean directly because some times engineers make mistakes, and libraries change to suit them, then they stop using that feature. For example, many core libraries lack support of non-string and instead deal on decoding side (ex substituting with ? when bytes are messed up). I would argue this isn't a blocking change if there is no current user of this, especially in such a young library ecosystem

codefromthecrypt · 2021-04-19T10:32:22Z

here's a suggested way out.

identify the data that caused the panic, assuming it was customer related and not a bytes generator
add a test that proves this data doesn't panic (rather replaces with ? or something if it is invalidly encoded)
migrate off the fork to the api as it stands, and continue a discussion of whether or not optimizing the api for binary headers is a good idea

codefromthecrypt · 2021-04-19T23:59:35Z

PS on the point of the actual change here, I would be really cautious about any binary api especially looking at the broader picture and impact of change. I've had some experience making fast parsers and binary is a double-edged sword. In worst case, people end up converting to strings in tight loops and you can't cache that anymore because they only have a ref to the binary form. Even if there is a utility to cache the string, it can be awkward to have someone do non-allocating comparisons with pre-existing strings leading to the same.

Bottom line is probably there are many reasons why nearly all http apis in languages choose to not allow arbitrary binary representations at the http logical abstraction. While primarily it would be about usability, performance would likely come quick second especially in code that runs in a proxy.

Even if the customer were able to convince this particular codebase to have a api that harms others, they won't be able to change the rest of the ecosystem other languages etc. IOTW I would backfill any missing UTF-8 test cases, but stop there. If there is code here that converts binary headers into strings, just backfill tests that prove they don't panic on malicious or poorly chosen header values.

codefromthecrypt · 2021-04-20T00:08:27Z

last note mainly to brain dump.. If it is really important non-interference, I would classify that differently than an api. For example if there are headers in as bytes and they need to stay exactly the same bytes, malicious or not, we could have a "normal" api, which is a lazy decoded view of them. At that point, only when reading back a weird value do you have to make a decision, and even if you decode into a string with ? substitution, the raw bytes stay as they are unless overwritten. This limits impact only to the ability to intentionally add poorly chosen binary values, effectively doing this intentionally is not supported by api, but pre-existing malicious or poorly chosen values remain supported.

Spoonbender · 2021-07-22T21:12:21Z

Just pointing out we were dealing with a similar issue in deno, see here: denoland/deno#7208
Notably, the Content-Disposition header is known for holding non-UTF data in the wild.

This concerns me both from a

compatibility perspective: we need envoy filters to be able to handle all real traffic, downstream and upstream, without malforming it (can't replace non-UTF8 chars with ?)
security perspective: concerned about panics, which unwind at wasm32-unknown-unknown and abort on wasm32-wasi, and might be exploitable for a DoS attack perhaps

If there is a concern about breaking the proxy-wasm-rust-sdk API, thus breaking existing filters, I propose we provide a backward-compatible build option (keep old unsafe API), thus providing ability to compile the existing unsafe filters which expects strings, using cargo features + rust conditional compilation abilities. This would exist side-by-side for some grace period before the unsafe and incorrect API is being decommissioned.

Swaagie · 2022-02-18T09:29:00Z

We've actually just run into this as well with Cookie having non utf-8 characters and https://github.com/proxy-wasm/proxy-wasm-rust-sdk/blob/master/src/hostcalls.rs#L241 panicking. We'll likely work around this by using the _bytes variants of the calls and manually parse so we can capture and handle the errors appropriately.

Having a HeaderValue or ByteString available seems like a good improvement. In the interim to enable the transition should each trait call the _bytes function variants internally? Moving utf-8 parsing to the traits and perform error propagation to allow consumers to act on it. And at the same time deprecate the non byte hostcalls functions. Thoughts?

Signed-off-by: spacewander <spacewanderlzx@gmail.com>

yskopets requested a review from PiotrSikora as a code owner August 10, 2020 22:09

yskopets force-pushed the feature/header-value branch from 6004488 to db44a7c Compare August 10, 2020 22:12

introduce HeaderValue type to replace use of String

2d818a3

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

yskopets force-pushed the feature/header-value branch from db44a7c to 2d818a3 Compare August 10, 2020 22:23

PiotrSikora pushed a commit to PiotrSikora/proxy-wasm-rust-sdk that referenced this pull request Jul 4, 2023

Add OpenResty host implementation. (proxy-wasm#27)

c8ff5a8

Signed-off-by: spacewander <spacewanderlzx@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

introduce `HeaderValue` type to replace use of `String` #27

introduce `HeaderValue` type to replace use of `String` #27

yskopets commented Aug 10, 2020 •

edited

lizan commented Aug 10, 2020

PiotrSikora commented Aug 11, 2020

yskopets commented Aug 11, 2020

yskopets commented Aug 25, 2020

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 20, 2021

Spoonbender commented Jul 22, 2021

Swaagie commented Feb 18, 2022

introduce HeaderValue type to replace use of String #27

Are you sure you want to change the base?

introduce HeaderValue type to replace use of String #27

Conversation

yskopets commented Aug 10, 2020 • edited

Context

lizan commented Aug 10, 2020

PiotrSikora commented Aug 11, 2020

yskopets commented Aug 11, 2020

yskopets commented Aug 25, 2020

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 19, 2021

codefromthecrypt commented Apr 20, 2021

Spoonbender commented Jul 22, 2021

Swaagie commented Feb 18, 2022

introduce `HeaderValue` type to replace use of `String` #27

introduce `HeaderValue` type to replace use of `String` #27

yskopets commented Aug 10, 2020 •

edited