Auto-threading for long posts (AP->Bsky, potentially opt-in?) #1705

Open
saschanaz opened this issue Jan 19, 2025 · 3 comments

@saschanaz

(Basically repeating #1002 but for auto-thread)

Users on ActivityPub-based servers would like to write one long post rather than having to write multiple smaller posts, e.g.:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

https://loremipsum.de #loremipsum

But this would be truncated in bsky:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

At vero eos et accusam et justo duo dolores et ea rebum [...] https://mastodon.social/@loremipsum/post/123

This strips out the URL and hashtag from the original post entirely, limiting the interaction the author intended. It would be better if it were more like:

(first post)
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

(second post in reply)
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

https://loremipsum.de #loremipsum

... so that the users won't have to leave bluesky to read more, reducing cross-platform hassle.
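
To make the idea above concrete, here is a minimal sketch of greedy sentence packing. It is not Bridgy Fed's code. The function name `pack_sentences`, the 300-character default limit, and the regex-based sentence split are all assumptions made for illustration. `len()` counts Python code points rather than graphemes, posts are joined with single spaces so paragraph breaks are lost, and a single sentence longer than the limit is emitted as-is.

```python
import re

# Assumed per-post limit for illustration; len() counts code points, not graphemes.
MAX_LEN = 300


def pack_sentences(text, limit=MAX_LEN):
    """Greedily pack sentences into posts of at most `limit` characters.

    Naive on purpose: a "sentence" ends at '.', '!' or '?' followed by
    whitespace. A sentence longer than `limit` is kept whole (oversized).
    """
    units = re.split(r"(?<=[.!?])\s+", text.strip())
    posts, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= limit or not current:
            # Either it fits, or the post is empty and we must take the unit.
            current = candidate
        else:
            # Close the current post and start the next one in the thread.
            posts.append(current)
            current = unit
    if current:
        posts.append(current)
    return posts


if __name__ == "__main__":
    sample = (
        "One two three. Four five six. Seven eight nine.\n\n"
        "https://example.com #tag"
    )
    for i, post in enumerate(pack_sentences(sample, limit=30), start=1):
        print(i, repr(post))
```

Each returned string would become one post, with every post after the first sent as a reply to the previous one. The trailing link and hashtag survive because nothing is cut off.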

@snarfed
Owner

snarfed commented Jan 20, 2025

Thanks for filing this, and for the details and examples! And for finding #1002, that is indeed useful background.

The bad news is, this is unlikely. It may often be possible to auto-thread a long post and get a reasonable result, but not always. Individual paragraphs can be too long, determining sentence breaks isn't always easy, and that's all before we consider other languages that do this kind of thing entirely differently, eg #1625.

Usually, I'm all for shipping a first pass at a feature when it's 80% good enough! I tend to be reluctant when it comes to modifying user-authored content like this, though, especially when the failure modes seem as common and as bad as they do here.

Alternatively, I could aim more for 98-99% here: use a grammar analyzer and try hard to find sentence breaks, or phrase breaks for individual sentences that are too long, and do something reasonable for other languages' line-breaking models...but honestly that's just not a high priority right now.

Regardless, this is absolutely a useful feature request. Thank you again! Happy to keep this issue open to track.

@saschanaz
Author

saschanaz commented Jan 20, 2025

> The bad news is, this is unlikely. It may often be possible to auto-thread a long post and get a reasonable result, but not always. Individual paragraphs can be too long, determining sentence breaks isn't always easy, and that's all before we consider other languages that do this kind of thing entirely differently, eg #1625.

Should it be sentence breaks though? Word breaks might be enough, with the same [...] mark that is being used now. For example:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Breaking this would be:

(first post)
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo [...]

(second post in reply)
[...] dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

And yes, detecting any kind of boundary would work better with a library like PyICU (which wraps ICU4C; unfortunately the newer ICU4X doesn't seem to be available in Python).
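
A rough sketch of the word-break approach described above, using only whitespace splitting from the standard library. This is an illustration, not the project's implementation. The name `split_at_words`, the 300-character default, and the exact marker placement are assumptions. `len()` counts code points rather than graphemes, original whitespace is normalized to single spaces, and a single word longer than the limit is emitted oversized.

```python
def split_at_words(text, limit=300, mark="[...]"):
    """Split `text` at whitespace into a thread of posts.

    Every post except the last ends with `mark`; every post except the
    first starts with it. `limit` is an assumed per-post maximum.
    """
    words = text.split()
    posts, start = [], 0
    while start < len(words):
        lead = f"{mark} " if posts else ""
        rest = lead + " ".join(words[start:])
        if len(rest) <= limit:
            # Everything that remains fits: final post, no trailing marker.
            posts.append(rest)
            break
        # Take at least one word so the loop always makes progress.
        end = start + 1
        while end < len(words):
            trial = lead + " ".join(words[start:end + 1]) + f" {mark}"
            if len(trial) > limit:
                break
            end += 1
        is_last = end == len(words)
        tail = "" if is_last else f" {mark}"
        posts.append(lead + " ".join(words[start:end]) + tail)
        start = end
    return posts


if __name__ == "__main__":
    for post in split_at_words("aa bb cc dd ee ff", limit=14):
        print(repr(post))
```

Swapping `text.split()` for a proper word-boundary iterator (for example from PyICU) would be the natural next step for languages that don't separate words with spaces.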

@snarfed
Owner

snarfed commented Jan 20, 2025

Good point! Maybe the 80% solution here, eg just break at whitespace like how we truncate now, is reasonable after all.
