Bad line-breaks in long words, consider breaking at hyphenation #51

peterjc · 2016-04-13T16:52:33Z

e.g. Prokka GFF file containing this in column 9:

product=2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase

The EMBL file from converting the GFF gave:

FT                   /product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridin
FT                   e pyrophosphokinase"

Notice this has inserted the line-break mid-word, which is bad.

The Prokka GBK file had:

                     /product="2-amino-4-hydroxy-6-
                     hydroxymethyldihydropteridine pyrophosphokinase"

Notice it broke on the hyphen, which is much better.
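To make the preferred behaviour concrete, here is a minimal sketch of a wrapper that only splits a word when it cannot fit on a line, and then cuts just after the last hyphen that fits. It is not Prokka's or gff3toembl's code: the function names `split_at_hyphen` and `wrap_prefer_hyphens` are made up for illustration, and the text width of 58 is an assumption.

```python
def split_at_hyphen(word, width):
    """Cut an over-long word into pieces of at most `width` characters,
    preferring to cut just after the last hyphen that still fits."""
    pieces = []
    while len(word) > width:
        cut = word.rfind("-", 0, width) + 1  # position just past that hyphen
        if cut <= 1:                         # no usable hyphen: hard cut
            cut = width
        pieces.append(word[:cut])
        word = word[cut:]
    pieces.append(word)
    return pieces


def wrap_prefer_hyphens(text, width):
    """Greedy word wrap that only splits a word when it cannot fit on a line."""
    lines, current = [], ""
    for word in text.split():
        candidate = current + " " + word if current else word
        if len(candidate) <= width:
            current = candidate
            continue
        if current:
            lines.append(current)
        pieces = split_at_hyphen(word, width)
        lines.extend(pieces[:-1])
        current = pieces[-1]
    return lines + [current] if current else lines


qualifier = '/product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase"'
for line in wrap_prefer_hyphens(qualifier, 58):  # 58 is an assumed text width
    print(" " * 21 + line)
```

With that assumed width, the sketch splits the qualifier after `6-`, the same place as the GBK excerpt. Note that `text.split()` collapses runs of whitespace, which a real implementation might not want.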

peterjc · 2016-05-02T14:28:12Z

According to https://docs.python.org/2/library/textwrap.html, the TextWrapper class from the Python standard library breaks at hyphens by default.

Adding break_long_words=False might be helpful here? That gives:

FT                   /product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   pyrophosphokinase"

This passes the ENA validation tool, although the wrapping isn't quite as strict as in the Prokka GenBank file. Pull request to follow...
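The effect of `break_long_words=False` can be reproduced with a short script. This is only a sketch: the text width of 58 and the variable names are assumptions, not values taken from gff3toembl. The output for the default setting is printed without an expected result, since it may differ between Python versions.

```python
import textwrap

product = "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase"
qualifier = '/product="%s"' % product
indent = "FT" + " " * 19  # 21-character feature-table prefix
width = 58                # assumed text width, not taken from gff3toembl

# Default behaviour: a word longer than the width may be split to fit.
default_lines = textwrap.TextWrapper(width=width).wrap(qualifier)

# With break_long_words=False the long word stays whole and overflows the width.
unbroken_lines = textwrap.TextWrapper(
    width=width, break_long_words=False).wrap(qualifier)

for label, lines in (("default", default_lines), ("unbroken", unbroken_lines)):
    print(label)
    for line in lines:
        print(indent + line)
```

With `break_long_words=False` the first line overflows the assumed width by one character, but the indented line still fits within 80 characters.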

peterjc · 2016-11-10T15:17:16Z

Could you reopen this issue as discussed on #68 please?

peterjc · 2016-11-10T16:38:35Z

It seems that break_long_words=False in Python's textwrap considers hyphenated terms as a single word, and thus will try to avoid breaking them.

That's generally fine, but we have a problem if the hyphenated term itself is about 60+ characters: even on a line of its own, once the 21 character FT indent is added it exceeds the 80 character limit.

The "about" is because things are much tighter if this is the first word, as you also have the prefix /product=" to consider, while for the final word there is the extra closing " to include.

Sadly, if we stick with the default of break_long_words=True, Python does not seem to take advantage of hyphens when deciding where to line-break in this corner case. Python bug filed: http://bugs.python.org/issue28660
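The overflow in this corner case can be checked with a few lines of Python. This is a sketch under stated assumptions: the product name below is made up purely to get a hyphenated term longer than 59 characters, and `INDENT`, `LIMIT` and `WIDTH` are illustrative names rather than anything from gff3toembl.

```python
import textwrap

INDENT = "FT" + " " * 19     # 21-character feature-table indent
LIMIT = 80                   # maximum line length
WIDTH = LIMIT - len(INDENT)  # 59 characters left for the qualifier text

# Made-up term: its hyphens all sit next to digits, which textwrap does not
# appear to treat as break points, so it is handled as one long word.
long_term = "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine-7-carboxylate"
qualifier = '/product="putative %s synthase"' % long_term

wrapper = textwrap.TextWrapper(width=WIDTH, break_long_words=False)
lines = [INDENT + line for line in wrapper.wrap(qualifier)]
too_long = [line for line in lines if len(line) > LIMIT]

for line in lines:
    print("%3d %s" % (len(line), line))
```

Here the long term ends up alone on the middle line and still exceeds the limit once the indent is added, so `too_long` is not empty.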

andrewjpage · 2016-11-11T09:08:23Z

Thanks for filing the Python bug. This format is a royal pain in the modern age.

peterjc added a commit to peterjc/gff3toembl that referenced this issue May 2, 2016

Avoid breaking long words in product attributes.

b3b628a

This ought to close sanger-pathogens#51, not sure if it should be used on all qualifiers or (as implemented) just the product?

peterjc mentioned this issue May 2, 2016

Avoid breaking long words in product attributes #53

Merged

andrewjpage closed this as completed in #53 May 3, 2016

peterjc mentioned this issue Nov 10, 2016

More wrapping tests; Fix ignored test #68

Merged

andrewjpage reopened this Nov 10, 2016

andrewjpage added the bug label Nov 10, 2016
