You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The problem is a bit difficult to notice at first, but it's specifically found at the end of that message. \\be became \xc2\xbe instead of just \xbe. This extra byte is due to the utf-8 encoding. But before we discuss that further let's step back for a second.
As far as we can tell, the way filter parsing works is that it builds up a list of characters in the filter string one at a time. Each byte is run through the _p_escaped function in ldapfilter.py which turns the hex byte into a character. Then it stores this collection of characters as a python3 string on the LDAPFilter object.
The problem manifests itself when we call the toWire() method on the ldapfilter. Each string in the filter has the to_bytes() method from _encoder.py called on it. In this situation to_bytes(value) ends up being value.encode('utf-8'). For a binary data blob this encoding is not the appropriate choice and it actually corrupts the data.
utf-8 was designed to use just 1 byte when encoding anything in the range of 0-128. However in the range 128-256 it requires 2 bytes to encode a single character. Which is why a value like \xbe (190 in decimal) can only be stored as \xc2\xbe. Storing a literal ‘\xbe` would be invalid for the encoding.
While utf-8 is actually the correct encoding for most LDAP fields this does not work for fields that contain binary data like objectguid. The extra bytes added by the utf-8 encoding essentially corrupt the field we are trying to search on and the LDAP Server will return 0 results for your search.
We have some egregious hacks in place today to make it so that objectguid filters use a latin-1 encoding. This encoding uses only 1 byte for the range 0-256 so the toWire() method returns data that the LDAPServer likes. However, we hate this and would love to see a broader and better solution adopted by ldaptor. Unfortunately we were not able to identify a proper solution ourselves. The main difficulty we ran up against when trying to develop a real solution was that at the time of encoding for the wire the LDAPAttributeValueAssertion /BerSequence has no context and we are not able to tell whether this data is binary or not. So in lieu of a PR I wanted to create this issue so that the ldaptor team was at least aware and maybe a real solution could be found.
Thanks!
Xander
The text was updated successfully, but these errors were encountered:
Thank you for the issue and sorry for the long answer.
Most ldaptor classes can receive byte strings and unicode strings as arguments. It is encouraged to use byte strings as in that case these values are sent via network "as is". Unicode strings are converted to bytes before sending (see toWire method).
However ldapfilter functions need unicode strings in order to parse filter strings with the pyparsing library. If you pass byte string it is decoded to unicode and then encoded back in the toWire method as UTF-8 byte string.
(Back in Python 2 times ldaptor did not need these conversions because attribute values and transmitted values were of the same type - Python 2 str or bytes. Even if you passed unicode in most cases it was implicitly encoded to bytes. Encoding/decoding issues started when ldaptor was ported to Python 3.)
Right now I cannot find a good solution. I thought of rewriting ldapfilter module to allow using byte strings without decoding but it may be a too difficult task. Perhaps encoding may be passed as an optional argument but it looks like a large change: all ldaptor classes with toWire method will receive this argument in this case.
Hi all,
We are experiencing an issue when trying to parse an LDAP filter string that contains a binary attribute such as objectguid.
Using python3.7 and the latest ldaptor here is a snippet of code that can produce the error we're seeing.
Output:
The problem is a bit difficult to notice at first, but it's specifically found at the end of that message.
\\be
became\xc2\xbe
instead of just\xbe
. This extra byte is due to the utf-8 encoding. But before we discuss that further let's step back for a second.As far as we can tell, the way filter parsing works is that it builds up a list of characters in the filter string one at a time. Each byte is run through the
_p_escaped
function inldapfilter.py
which turns the hex byte into a character. Then it stores this collection of characters as a python3 string on the LDAPFilter object.The problem manifests itself when we call the
toWire()
method on the ldapfilter. Each string in the filter has theto_bytes()
method from_encoder.py
called on it. In this situationto_bytes(value)
ends up beingvalue.encode('utf-8')
. For a binary data blob this encoding is not the appropriate choice and it actually corrupts the data.utf-8
was designed to use just 1 byte when encoding anything in the range of 0-128. However in the range 128-256 it requires 2 bytes to encode a single character. Which is why a value like\xbe
(190 in decimal) can only be stored as\xc2\xbe
. Storing a literal ‘\xbe` would be invalid for the encoding.While
utf-8
is actually the correct encoding for most LDAP fields this does not work for fields that contain binary data like objectguid. The extra bytes added by the utf-8 encoding essentially corrupt the field we are trying to search on and the LDAP Server will return 0 results for your search.We have some egregious hacks in place today to make it so that objectguid filters use a
latin-1
encoding. This encoding uses only 1 byte for the range 0-256 so the toWire() method returns data that the LDAPServer likes. However, we hate this and would love to see a broader and better solution adopted by ldaptor. Unfortunately we were not able to identify a proper solution ourselves. The main difficulty we ran up against when trying to develop a real solution was that at the time of encoding for the wire theLDAPAttributeValueAssertion
/BerSequence
has no context and we are not able to tell whether this data is binary or not. So in lieu of a PR I wanted to create this issue so that the ldaptor team was at least aware and maybe a real solution could be found.Thanks!
Xander
The text was updated successfully, but these errors were encountered: