Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding already encoded strings #181

Closed
jakubka opened this issue Apr 10, 2018 · 9 comments
Closed

Encoding already encoded strings #181

jakubka opened this issue Apr 10, 2018 · 9 comments

Comments

@jakubka
Copy link

jakubka commented Apr 10, 2018

Hi,

While debugging my aiohttp client app I encountered unexpected behaviour while sending URL params containing parts which already look like they are escaped (%aa for example).

Repro:

url = URL('http://example.com')
url.with_query(a='%xx')  # correctly produces 'http://example.com/?a=%25xx'
url.with_query(a='%aa')  # incorrectly (?) produces ''http://example.com/?a=%AA''

(see %25xx vs %AA)

It looks like yarl assumes that what I pass as a query param is already encoded and tries to 'correct' it by uppercasing the characters. This happens when I pass values between %aa and %ff which matches valid URL encoded values %AA to %FF.

Not sure if this behaviour is intended or documented. It certainly caught me off guard, because I as a library user expected to pass unecoded string.

@asvetlov
Copy link
Member

The idea is that query parameters passed to .with_query() should be normalized but already correct percent-encoded strings are not changed.
The behavior is fixed, a lot of existing code relies on it.

Another question is how to enforce percent-encoding (I see your needs)?
.with_query()/.update_query() accepts **kwargs already, a flag for controlling encoding behavior is not an option.
We can add a new .with_query_something() method to enforce encoding.
I have no idea for the name, need a suggestion.

Guys, what do you think?

@behnam
Copy link
Contributor

behnam commented Jul 13, 2018

Looks like this is the same problem as the one I mentioned in #210 (comment).

@mikenerone
Copy link

I have run into this, as well. yarl is a great lib on almost all fronts, but IMO, the original decision to have the explicit encoding process also try to make guesses about pre-encoded bits of the string is unreliable to the point of being unusable.

In my case, in a tool that has worked fine for a long time, one of my users needed to input a query parameter containing %Good or %Bad. The first encodes properly as %25Good, but thanks to yarl's uncontrollable guessing that the %Ba is meant to be a hex character, the latter results in the completely broken %BAd. For now, I will have to fix this by pre-encoding all of my param keys and values, which makes me wonder why I'm using yarl for this (aside from the fact that aiohttp makes me).

This behavior seems unfixable in yarl in a backward compatible way. Another method is needed to provide reliable encoding. IMO, aiohttp will then need to provide an option (defaulting to off, sadly) to use this new properly-functioning method.

@mikenerone
Copy link

mikenerone commented Nov 27, 2018

The right interface is to have two separate methods so the user can be explicit about whether the strings they're passing are pre-encoded (in which case no encoding is done) or the common case of not pre-encoded (in which case it is fully encoded with no guessing). I understand that the ship has sailed on with_params(), but at least a new, reliable interface could be made available going forward.

Picking a name is hard since the "right" name is already taken by the unfixable method.

@rmillet-rs
Copy link

Moreover there is different behaviour between with_query and update_query:

>>> from yarl import URL
>>> mystr = 'A %20 B + C %2B D %2520 E %252520 F'
>>> params = { 'a':  mystr }
>>> url = URL('http://127.0.0.1/')
>>> print(url.with_query(params))
http://127.0.0.1/?a=A+%20+B+%2B+C+%2B+D+%2520+E+%252520+F
>>> print(url.update_query(params))
http://127.0.0.1/?a=A+++B+%2B+C+%2B+D+%20+E+%2520+F

@mikenerone
Copy link

mikenerone commented Aug 27, 2019

>>> print(url.update_query(params))
http://127.0.0.1/?a=A+++B+%2B+C+%2B+D+%20+E+%2520+F

Emm, that's just plain wrong, isn't it, even considering it guessing that this is a pre-encoded string? %252520 shouldn't encode as %2520. The literal 25 in the middle was dropped.

At any rate, I really think yarl's encoding facilities need some love.

@webknjaz
Copy link
Member

@mikenerone it's not encoding but decoding issue I think. If you assume that %252520 is encoded, then decoding would turn %25 into % and 2 5 2 0 will stay unchanged which gives us %2520. It doesn't "remove 25", it decodes the leading %25.

@mikenerone
Copy link

@webknjaz I'd agree if we were looking at decoded output, but the %2520 we're seeing is in what's supposed to be encoded output. But with the current guessing behavior, there's a lot of ambiguity in even trying to figure out what to expect, so I'm really not even positive, to be honest. Probably not a productive rabbit hole for us to climb down anyway. :P

@asvetlov
Copy link
Member

#472 fixes this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants