fix for malformed xml from Google and other sinners #198

jfarrimo · 2012-09-12T02:46:07Z

I've run into this bug:

https://bugs.freedesktop.org/show_bug.cgi?id=33857

Basically, the iTeleportService client is sending malformed xml through gchat, and gchat is merrily sending it along. Since this is a Google service, there is almost no hope of getting it fixed. Therefore, I'm forced to hack deep into the bowels of SleekXMPP to take care of it. It makes me feel dirty to do this. There is no shower that will make me feel clean again.

I don't have the luxury of de-friending this user. I don't see a good way to catch the malformed xml other than to sit in the socket and watch the stream as it comes by. Ugh.

Please tell me there is a better way to do this. Please.

BTW, the worst part is that there is no error message SleekXMPP emits that gives even the foggiest clue that this is the problem. It took me a good long while to drill down into the code to figure out what was going on. This is only partly SleekXMPP's fault. It's also the fault of the xml parser.

jfarrimo · 2012-09-12T02:47:24Z

Hrm. Ok, looks like the pull request covers three commits. It's only meant to cover the last one that mentions iTeleportService in the subject.

legastero · 2012-09-13T18:09:22Z

Sadly, that looks to be the best solution for now (for Python 2, this won't work for 3 because filesocket isn't used there). The only other way to do it with the current architecture would be to make another wrapper for the socket object to do the filtering, but that's pretty much what this is.

jfarrimo · 2012-09-13T18:42:16Z

I've done more digging, and the fault isn't so much that of SleekXMPP as it is that the expat parser doesn't provide any way to handle errors. That is probably ok most of the time, and it's the way the XML specification says things should work, but sometimes we have to be able to handle errors anyway. When Google passes us bad XML in iq messages,
we have to roll with it since they're an 800 pound gorilla. It would be just the same if Facebook was the culprit.

It looks like the lxml parser has a mode where it will try to correct errors, but I'm not entirely comfortable with that either. In a perfect world, the expat parser would hand you the offending xml, let you deal with it as best you can, then hand it back to retry.

The one thing that really irks me is that the expat parser won't even give you back the offending piece of XML. I had to put logging in the socket function just to get it to tell me what it was blowing up on. All that they give you is a useless line number and offset. Ugh!

legastero · 2012-09-13T18:56:13Z

I'm working on moving away from iterparse() and working directly with the expat parser (see https://gist.github.com/3716705). If you see a way to get the error correction hooks you need in there, we can add it in.

jfarrimo · 2012-09-13T22:27:24Z

FYI, this has brought to light a bigger bug that I haven't found a workaround for yet. I'm counting up exceptions caught during session_start() and trying to exit the program after a certain number in a row (I'm taking this to mean something's badly broken). This, however, doesn't work. Here's what's going on:

in the session_start() event, call self.send_presence() and self.get_roster(), as usual, catching exceptions they throw
immediately get the xml stream error from the bad presence data received from iTeleportService
call self.disconnect() to end it, since things are un-recoverably hosed at this point
infinite loop encountered, program never stops, threads never die, etc...

I'm pretty sure this is because XMLStream._process() automatically calls self.reconnect() on getting this error (as it should in case of network errors, etc...). This somehow over-rides my self.disconnect() call. Basically, I can't make it stop.

I'm doing all of this so that, in the future, when more problems with malformed XML in presence stanzas crop up, I'll notice it and shut down. Otherwise the program will continually try to reconnect and it'll be difficult to detect that things have gone wrong.

jfarrimo · 2012-09-13T22:30:45Z

Here's the offending code:

def session_start(self, blank_event):
    self.logger.debug("starting session")
    try:
        self.send_presence()
        self.get_roster()
    except IqError as err:
        self.start_errors += 1
        self.logger.error(
            'There was an error getting the roster (%r): error count = %s',
            err.iq['error']['condition'], self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    except IqTimeout:
        self.start_errors += 1
        self.logger.error(
            'Server is taking too long to respond: error count = %s',
            self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    else:
        self.start_errors = 0
        self.online = True
        self.logger.info("session started")

legastero · 2012-09-13T22:34:12Z

Interesting, I'll look into this. I know there are several issues related to state changes like this that need addressing.

For now, this looks like a good candidate for the new abort() method, which just shuts everything down without going through the full stream closing sequence.

jfarrimo · 2012-09-13T22:42:13Z

Oh, sweetness! I hadn't noticed that. I'll plug it in now and give it a shot.

Thanks!

Conflicts: sleekxmpp/xmlstream/xmlstream.py

jfarrimo · 2012-09-14T23:40:28Z

So, it turns out that abort() doesn't work out of the box in this situation. I had to comment out the self.stop.clear() call at the beginning of XMLStream._connect(). Apparently, there is a race condition between the event thread and the main thread where the event thread calls abort between the time that the main thread calls reconnect() and reconnect() transitions from the disconnected to the connected state (looks like a 2 second wait is in there). Do you know of any dangers from removing this self.stop.clear() call? The other option would be to add a similar self.aborting member and look for it before calling the self.stop.clear() call.

legastero · 2012-09-18T17:32:08Z

Good catch. The self.stop.clear() call really ought to be in connect() instead of _connect(). That should eliminate the race condition.

jfarrimo · 2012-09-19T00:24:54Z

Ok, cool.

azmeuk · 2016-09-05T14:48:44Z

Hi. Sorry to dig out a 3 years old issue. I am using Firebase Cloud Messaging and I am getting errors at the end of exchanges with the FCM server, because it seems it does not send </stream:stream> when it does not have anything more to send.

@jfarrimo , is this the kind of errors you were talking about?
I have put more details on stack overflow.

jfarrimo · 2016-09-06T17:32:05Z

I don't remember the details of this issue at this point.

Jay Farrimond added 3 commits July 10, 2012 17:30

callbacks for success/failure on stream sending

8981302

Merge remote-tracking branch 'remotes/fritzy/develop' into develop

61779da

fix for iTeleportService malformed xml

51c83a3

Merge remote-tracking branch 'upstream/develop' into develop

2ea67e2

Conflicts: sleekxmpp/xmlstream/xmlstream.py

hack to make abort() work all the time

4215525

Jay Farrimond added 3 commits December 31, 2013 01:27

bump version

35554de

more iteleport BULLSHIT

950b5e9

correctly handle xmlns:client bug

3c1007a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix for malformed xml from Google and other sinners #198

fix for malformed xml from Google and other sinners #198

jfarrimo commented Sep 12, 2012

jfarrimo commented Sep 12, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

jfarrimo commented Sep 14, 2012

legastero commented Sep 18, 2012

jfarrimo commented Sep 19, 2012

azmeuk commented Sep 5, 2016

jfarrimo commented Sep 6, 2016

fix for malformed xml from Google and other sinners #198

Are you sure you want to change the base?

fix for malformed xml from Google and other sinners #198

Conversation

jfarrimo commented Sep 12, 2012

jfarrimo commented Sep 12, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

legastero commented Sep 13, 2012

jfarrimo commented Sep 13, 2012

jfarrimo commented Sep 14, 2012

legastero commented Sep 18, 2012

jfarrimo commented Sep 19, 2012

azmeuk commented Sep 5, 2016

jfarrimo commented Sep 6, 2016