Skip to content
This repository has been archived by the owner on Feb 19, 2020. It is now read-only.

fix for malformed xml from Google and other sinners #198

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

jfarrimo
Copy link
Contributor

I've run into this bug:

https://bugs.freedesktop.org/show_bug.cgi?id=33857

Basically, the iTeleportService client is sending malformed xml through gchat, and gchat is merrily sending it along. Since this is a Google service, there is almost no hope of getting it fixed. Therefore, I'm forced to hack deep into the bowels of SleekXMPP to take care of it. It makes me feel dirty to do this. There is no shower that will make me feel clean again.

I don't have the luxury of de-friending this user. I don't see a good way to catch the malformed xml other than to sit in the socket and watch the stream as it comes by. Ugh.

Please tell me there is a better way to do this. Please.

BTW, the worst part is that there is no error message SleekXMPP emits that gives even the foggiest clue that this is the problem. It took me a good long while to drill down into the code to figure out what was going on. This is only partly SleekXMPP's fault. It's also the fault of the xml parser.

@jfarrimo
Copy link
Contributor Author

Hrm. Ok, looks like the pull request covers three commits. It's only meant to cover the last one that mentions iTeleportService in the subject.

@legastero
Copy link
Collaborator

Sadly, that looks to be the best solution for now (for Python 2, this won't work for 3 because filesocket isn't used there). The only other way to do it with the current architecture would be to make another wrapper for the socket object to do the filtering, but that's pretty much what this is.

@jfarrimo
Copy link
Contributor Author

I've done more digging, and the fault isn't so much that of SleekXMPP as it is that the expat parser doesn't provide any way to handle errors. That is probably ok most of the time, and it's the way the XML specification says things should work, but sometimes we have to be able to handle errors anyway. When Google passes us bad XML in iq messages,
we have to roll with it since they're an 800 pound gorilla. It would be just the same if Facebook was the culprit.

It looks like the lxml parser has a mode where it will try to correct errors, but I'm not entirely comfortable with that either. In a perfect world, the expat parser would hand you the offending xml, let you deal with it as best you can, then hand it back to retry.

The one thing that really irks me is that the expat parser won't even give you back the offending piece of XML. I had to put logging in the socket function just to get it to tell me what it was blowing up on. All that they give you is a useless line number and offset. Ugh!

@legastero
Copy link
Collaborator

I'm working on moving away from iterparse() and working directly with the expat parser (see https://gist.github.com/3716705). If you see a way to get the error correction hooks you need in there, we can add it in.

@jfarrimo
Copy link
Contributor Author

FYI, this has brought to light a bigger bug that I haven't found a workaround for yet. I'm counting up exceptions caught during session_start() and trying to exit the program after a certain number in a row (I'm taking this to mean something's badly broken). This, however, doesn't work. Here's what's going on:

  1. in the session_start() event, call self.send_presence() and self.get_roster(), as usual, catching exceptions they throw
  2. immediately get the xml stream error from the bad presence data received from iTeleportService
  3. call self.disconnect() to end it, since things are un-recoverably hosed at this point
  4. infinite loop encountered, program never stops, threads never die, etc...

I'm pretty sure this is because XMLStream._process() automatically calls self.reconnect() on getting this error (as it should in case of network errors, etc...). This somehow over-rides my self.disconnect() call. Basically, I can't make it stop.

I'm doing all of this so that, in the future, when more problems with malformed XML in presence stanzas crop up, I'll notice it and shut down. Otherwise the program will continually try to reconnect and it'll be difficult to detect that things have gone wrong.

@jfarrimo
Copy link
Contributor Author

Here's the offending code:

def session_start(self, blank_event):
    self.logger.debug("starting session")
    try:
        self.send_presence()
        self.get_roster()
    except IqError as err:
        self.start_errors += 1
        self.logger.error(
            'There was an error getting the roster (%r): error count = %s',
            err.iq['error']['condition'], self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    except IqTimeout:
        self.start_errors += 1
        self.logger.error(
            'Server is taking too long to respond: error count = %s',
            self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    else:
        self.start_errors = 0
        self.online = True
        self.logger.info("session started")

@legastero
Copy link
Collaborator

Interesting, I'll look into this. I know there are several issues related to state changes like this that need addressing.

For now, this looks like a good candidate for the new abort() method, which just shuts everything down without going through the full stream closing sequence.

@jfarrimo
Copy link
Contributor Author

Oh, sweetness! I hadn't noticed that. I'll plug it in now and give it a shot.

Thanks!

Conflicts:
	sleekxmpp/xmlstream/xmlstream.py
@jfarrimo
Copy link
Contributor Author

So, it turns out that abort() doesn't work out of the box in this situation. I had to comment out the self.stop.clear() call at the beginning of XMLStream._connect(). Apparently, there is a race condition between the event thread and the main thread where the event thread calls abort between the time that the main thread calls reconnect() and reconnect() transitions from the disconnected to the connected state (looks like a 2 second wait is in there). Do you know of any dangers from removing this self.stop.clear() call? The other option would be to add a similar self.aborting member and look for it before calling the self.stop.clear() call.

@legastero
Copy link
Collaborator

Good catch. The self.stop.clear() call really ought to be in connect() instead of _connect(). That should eliminate the race condition.

@jfarrimo
Copy link
Contributor Author

Ok, cool.

@azmeuk
Copy link

azmeuk commented Sep 5, 2016

Hi. Sorry to dig out a 3 years old issue. I am using Firebase Cloud Messaging and I am getting errors at the end of exchanges with the FCM server, because it seems it does not send </stream:stream> when it does not have anything more to send.

@jfarrimo , is this the kind of errors you were talking about?
I have put more details on stack overflow.

@jfarrimo
Copy link
Contributor Author

jfarrimo commented Sep 6, 2016

I don't remember the details of this issue at this point.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants