Bad line-breaks in long words, consider breaking at hyphenation #51

Open
peterjc opened this issue Apr 13, 2016 · 4 comments · Fixed by #53

peterjc commented Apr 13, 2016

For example, a Prokka GFF file containing this in column 9:

product=2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase

The EMBL file from converting the GFF gave:

FT                   /product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridin
FT                   e pyrophosphokinase"

Notice that the line break has been inserted mid-word (splitting off the final "e" of the compound name), which is bad.

The Prokka GBK file had:

                     /product="2-amino-4-hydroxy-6-
                     hydroxymethyldihydropteridine pyrophosphokinase"

Notice it broke on the hyphen, which is much better.

peterjc added a commit to peterjc/gff3toembl that referenced this issue May 2, 2016
This ought to close sanger-pathogens#51, not sure if it should be used on
all qualifiers or (as implemented) just the product?

peterjc commented May 2, 2016

According to the documentation at https://docs.python.org/2/library/textwrap.html, the TextWrapper class from the Python standard library (which is what gets used here) breaks at hyphens by default.
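
As a quick illustration of that default, here is a minimal sketch using only the standard library. The hyphenated term and the width of 20 are invented for the demonstration and are not taken from gff3toembl.

```python
import textwrap

# Illustrative input: an invented hyphenated term followed by a plain word.
text = "hydroxymethyl-dihydropteridine pyrophosphokinase"

# Default behaviour: break_on_hyphens=True, so the hyphen is a legal break point.
default_lines = textwrap.wrap(text, width=20)

# With break_on_hyphens=False the term is treated as one long word, which the
# default break_long_words=True then cuts at the width.
no_hyphen_lines = textwrap.wrap(text, width=20, break_on_hyphens=False)

print(default_lines)
print(no_hyphen_lines)
```

The first call breaks right after "hydroxymethyl-", while the second cuts the term mid-word once 20 characters are used up.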

Adding break_long_words=False might be helpful here? That gives:

FT                   /product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   pyrophosphokinase"

This passes the ENA validation tool, although the wrapping isn't quite as strict as in the Prokka GenBank file. Pull request to follow...
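
A minimal sketch of that experiment, for anyone wanting to reproduce it. The content width of 58 is an assumption, inferred from the mid-word break reported at the top of this issue (58 characters after the 21-character indent); it is not read from the gff3toembl source, and the constant names are hypothetical.

```python
import textwrap

FT_PREFIX = "FT" + " " * 19  # the 21-character feature-table indent
CONTENT_WIDTH = 58           # assumption: inferred from the output in this issue

value = ('/product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridine '
         'pyrophosphokinase"')

# break_long_words=False lets an overlong word run past the width
# instead of being cut mid-word.
wrapper = textwrap.TextWrapper(width=CONTENT_WIDTH, break_long_words=False)
lines = [FT_PREFIX + line for line in wrapper.wrap(value)]

for line in lines:
    print(line)
```

The first content chunk is 59 characters, one over the assumed width, so with the indent the first line comes out at exactly 80 characters.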

peterjc commented Nov 10, 2016

Could you reopen this issue as discussed on #68 please?

andrewjpage reopened this Nov 10, 2016

peterjc commented Nov 10, 2016

It seems that break_long_words=False in Python's textwrap considers hyphenated terms as a single word, and thus will try to avoid breaking them.

That's generally fine, but we have a problem if the hyphenated term itself is about 60 or more characters long: even on a line of its own, once the 21-character FT indent is added it exceeds the 80-character limit.

The "about" is because things are much tighter if this is the first word, as you also have the prefix /product=" to consider, while for the final word there is the extra closing " to include.
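
The arithmetic can be spelled out with a small sketch. The function name and its parameters are hypothetical; the only inputs are the 80-character limit and the 21-character indent mentioned above.

```python
LINE_LIMIT = 80  # maximum line length
FT_INDENT = 21   # "FT" plus 19 spaces


def max_term_length(qualifier="product", first=False, last=False):
    """Longest unbreakable term that still fits on one line (hypothetical helper)."""
    budget = LINE_LIMIT - FT_INDENT
    if first:
        budget -= len('/%s="' % qualifier)  # opening /product=" prefix
    if last:
        budget -= 1                         # closing double quote
    return budget


term = "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine"
print(max_term_length())            # term in the middle of the value
print(max_term_length(first=True))  # term right after /product="
print(max_term_length(last=True))   # term right before the closing quote
print(len(term))
```

Under these assumptions a term in the middle of the value gets 59 characters, the first word 49 and the last word 58. The compound name from this issue is 49 characters, so it only just fits as the first word.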

Sadly, if we stick with the default of break_long_words=True, Python does not seem to take advantage of hyphens when deciding where to line-break in this corner case. Python bug filed: http://bugs.python.org/issue28660
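
One possible workaround, sketched here purely as an illustration: a hand-rolled greedy wrapper that splits an overlong word after the last hyphen that still fits, and only falls back to a hard cut when there is no hyphen. This is not what textwrap or gff3toembl does; both function names are hypothetical and the width of 58 is the same assumption as before.

```python
def split_long_word(word, space):
    """Split word so the head fits in `space` characters, preferring a hyphen."""
    # Position just after the last hyphen that still fits; 0 if there is none.
    cut = word.rfind("-", 1, space) + 1
    if cut == 0:
        cut = space  # no usable hyphen: fall back to a hard cut
    return word[:cut], word[cut:]


def wrap_prefer_hyphens(text, width):
    """Greedy word wrap that breaks overlong words at hyphens where possible."""
    lines, current = [], ""
    for word in text.split():
        while word:
            candidate = current + " " + word if current else word
            if len(candidate) <= width:
                current, word = candidate, ""   # word fits on the current line
            elif current:
                lines.append(current)           # flush and retry on a new line
                current = ""
            else:
                head, word = split_long_word(word, width)
                lines.append(head)              # word is longer than any line
    if current:
        lines.append(current)
    return lines


value = ('/product="2-amino-4-hydroxy-6-hydroxymethyldihydropteridine '
         'pyrophosphokinase"')
for line in wrap_prefer_hyphens(value, 58):
    print("FT" + " " * 19 + line)
```

For the product in this issue the sketch breaks after "6-", which matches the line break in the Prokka GBK file.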

andrewjpage commented

Thanks for filing the Python bug. This format is a royal pain in the modern age.
