Better parsing of nested sections #1750
Comments
The issue is how the parsing code detects (or doesn't detect) headings, which has always been a problem; see e.g. mysociety/parlparse#53. I think Parliament's version is bad in the other direction, in that "New Clause 7" (the "heading" of the second vote on that page) is output as pure body text, with no real way of noticing that it starts something new. If you look at the source https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2020-06-30d.xml you'll see we have it as:
I thought there was code to combine two minor-headings like that on import when it found them, but presumably there isn't, or it isn't working in some way. I see why it might be nice to have them all on one page, but that would make large debates even more unwieldy. You'd have to introduce more structure to the output if you wanted to do anything with this, I think, and it's never been worth the effort involved.
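For illustration only, here is a minimal sketch of what "combine two minor-headings on import" could look like. It is not the actual TheyWorkForYou import code: the `Item` class, the `kind` strings, the `merge_adjacent_minor_headings` name and the `": "` separator are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One parsed element. The kind values are assumed, not taken from the real schema."""
    kind: str   # e.g. "major-heading", "minor-heading", "speech", "division"
    text: str


def merge_adjacent_minor_headings(items, sep=": "):
    """Collapse runs of consecutive minor headings into a single heading.

    Anything other than a minor heading directly following another
    minor heading is passed through unchanged.
    """
    merged = []
    for item in items:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.kind == "minor-heading"
            and item.kind == "minor-heading"
        ):
            # Fold this heading's text into the previous heading.
            merged[-1] = Item("minor-heading", prev.text + sep + item.text)
        else:
            merged.append(item)
    return merged
```

With input such as a "New Clause 7" minor heading immediately followed by a second minor heading carrying the clause title, this would yield one combined heading, so the vote that follows has a single, recognisable parent.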
Yeah, I was specifically looking for debates with multiple votes to test a motion extractor, and that flushed out ones like this where things are more spread out than I expected. If we sketched out (and funded) a project around clearer understanding of amendments and legislative process, a good approach to this would fit into it.
Related to mysociety/parlparse#171 - but I think this can be improved in the display alone.
So there's something a bit off about how TWFY is parsing some complicated debates:
The navigation structure assumes: header starts, header ends, header starts, header ends.
But in practice, the headers are sometimes nested:
e.g. https://www.theyworkforyou.com/debates/?id=2020-06-30d.191.3 logically contains all the votes in the following 'debates', but these are split off because of the new header.
Parliament's own site, by contrast, groups them all on one page: https://hansard.parliament.uk/Commons/2020-06-30/debates/581DFFF9-B3ED-4B76-9F51-A1F2325334A6/ImmigrationAndSocialSecurityCo-Ordination(EUWithdrawal)Bill
In practice, the problem I have is making the linking clearer between a vote and the debate.
Currently there isn't a good link in the tree, because the parent debate contains only the text of the amendment (which is useful) but not the discussion, while the top-level debate (which I guess we could link to instead) does not contain the vote itself.
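To make the linking idea concrete, here is a rough sketch of resolving each vote to both its immediate heading and the top-level debate in a single pass over a flat list. It assumes the data is a sequence of major headings, minor headings and divisions, each with an id, and that a new major heading resets the current minor heading. The `Entry` type, the `kind` strings and `link_divisions` are hypothetical names for this example, not existing TWFY code.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Entry(NamedTuple):
    """One item in the flat, document-ordered list (assumed shape)."""
    kind: str  # "major-heading", "minor-heading", "division", ...
    gid: str   # identifier of the item


def link_divisions(entries: Iterable[Entry]) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each division id to the ids of its enclosing headings.

    "major" is the most recent major heading (the top-level debate);
    "minor" is the most recent minor heading under that major heading,
    or None if the division sits directly under the major heading.
    """
    links: Dict[str, Dict[str, Optional[str]]] = {}
    major: Optional[str] = None
    minor: Optional[str] = None
    for entry in entries:
        if entry.kind == "major-heading":
            # A new top-level debate closes any open minor heading.
            major, minor = entry.gid, None
        elif entry.kind == "minor-heading":
            minor = entry.gid
        elif entry.kind == "division":
            links[entry.gid] = {"major": major, "minor": minor}
    return links
```

With a mapping like this, the vote page could show the amendment text from the "minor" parent while linking to the "major" debate for the discussion, without changing how the pages themselves are split.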