Upgrade unicode-width and handle width more correctly #430

Merged
merged 1 commit into from
Oct 21, 2024
2 changes: 1 addition & 1 deletion papergrid/Cargo.toml
@@ -13,7 +13,7 @@ std = []
 ansi = ["ansi-str", "ansitok"]
 
 [dependencies]
-unicode-width = "=0.1.11"
+unicode-width = "0.2"
 bytecount = "0.6"
 fnv = "1.0"
 ansi-str = { version = "0.8", optional = true }

34 changes: 7 additions & 27 deletions papergrid/src/util/string.rs
@@ -7,33 +7,16 @@
 /// Returns string width and count lines of a string. It's a combination of [`string_width_multiline`] and [`count_lines`].
 #[cfg(feature = "std")]
 pub fn get_text_dimension(text: &str) -> (usize, usize) {
-    #[cfg(not(feature = "ansi"))]
-    {
-        let (lines, acc, max) = text.chars().fold((1, 0, 0), |(lines, acc, max), c| {
-            if c == '\n' {
-                (lines + 1, 0, acc.max(max))
-            } else {
-                let w = unicode_width::UnicodeWidthChar::width(c).unwrap_or(0);
-                (lines, acc + w, max)
-            }
-        });
-
-        (lines, acc.max(max))
-    }
-
-    #[cfg(feature = "ansi")]
-    {
-        get_lines(text)
-            .map(|line| get_line_width(&line))
-            .fold((0, 0), |(i, acc), width| (i + 1, acc.max(width)))
-    }
+    get_lines(text)
+        .map(|line| get_line_width(&line))
+        .fold((0, 0), |(i, acc), width| (i + 1, acc.max(width)))
 }
 
 /// Returns a string width.
 pub fn get_line_width(text: &str) -> usize {
     #[cfg(not(feature = "ansi"))]
     {
-        unicode_width::UnicodeWidthStr::width(text)
+        get_string_width(text)
     }
 
     #[cfg(feature = "ansi")]
@@ -44,7 +27,7 @@ pub fn get_line_width(text: &str) -> usize {
         ansitok::parse_ansi(text)
             .filter(|e| e.kind() == ansitok::ElementKind::Text)
             .map(|e| &text[e.start()..e.end()])
-            .map(unicode_width::UnicodeWidthStr::width)
+            .map(get_string_width)
             .sum()
     }
 }
@@ -53,10 +36,7 @@ pub fn get_line_width(text: &str) -> usize {
 pub fn get_text_width(text: &str) -> usize {
     #[cfg(not(feature = "ansi"))]
     {
-        text.lines()
-            .map(unicode_width::UnicodeWidthStr::width)
-            .max()
-            .unwrap_or(0)
+        text.lines().map(get_string_width).max().unwrap_or(0)
     }
 
     #[cfg(feature = "ansi")]
@@ -72,7 +52,7 @@ pub fn get_char_width(c: char) -> usize {
 
 /// Returns a string width (accouting all characters).
 pub fn get_string_width(text: &str) -> usize {
-    unicode_width::UnicodeWidthStr::width(text)
+    unicode_width::UnicodeWidthStr::width(text.replace(|c| c < ' ', "").as_str())
Owner

Though now it has become a heavy operation :(
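
To make the cost concrete, here is a minimal std-only sketch of what the replacement on the changed line does. The helper name `strip_control` is hypothetical, and the closure has an explicit `char` annotation that the diff does not use. It shows that every character below U+0020 is dropped (including tab and newline), and that `str::replace` returns a freshly allocated `String` even when the input was already clean.

```rust
// Hypothetical helper wrapping the same call that the diff adds.
fn strip_control(text: &str) -> String {
    // Drops every char below U+0020 (space).
    text.replace(|c: char| c < ' ', "")
}

fn main() {
    // Tab, newline, and ESC are all below U+0020, so they are removed.
    assert_eq!(strip_control("a\tb\nc\u{1b}d"), "abcd");

    // Clean input comes back unchanged, but as a new heap allocation.
    let clean = "Hello";
    let copy = strip_control(clean);
    assert_eq!(copy, clean);
    assert_ne!(copy.as_ptr(), clean.as_ptr());
}
```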

Owner

@zhiburt zhiburt Oct 21, 2024

I wonder 💭

Maybe we should rather 'enforce' this the way we did with
tabled::settings::formatting::Charset::clean?

Just add a preprocessing step with this replacement; those who are not sure about their sources would be "assumed" to use it.
But those who are pretty sure, or who did some processing themselves, wouldn't waste this allocation.

What I mean is the famous "why pay for what we are not using" (I don't remember exactly how the quote goes).

What do you think?

Owner

@zhiburt zhiburt Oct 21, 2024

I once had an idea for a generic that would describe the table content as Clean | Dirty, so we could do this out of the box.
Those who are sure about their input could then use the optimized version.
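
One way the Clean | Dirty idea could look, as a rough sketch only. Nothing here exists in papergrid or tabled: the `Content` trait, the `Clean` and `Dirty` markers, and the `Cell` type are all hypothetical, and the width is a placeholder that counts chars instead of calling unicode-width.

```rust
use std::borrow::Cow;
use std::marker::PhantomData;

/// Hypothetical marker trait: how cell text is prepared before measuring.
trait Content {
    fn prepare(text: &str) -> Cow<'_, str>;
}

/// The caller guarantees the text has no control characters.
struct Clean;
/// Text from an unknown source; control characters get stripped.
struct Dirty;

impl Content for Clean {
    fn prepare(text: &str) -> Cow<'_, str> {
        // Nothing to do, so nothing is allocated.
        Cow::Borrowed(text)
    }
}

impl Content for Dirty {
    fn prepare(text: &str) -> Cow<'_, str> {
        // Keep only chars at or above U+0020 (space).
        Cow::Owned(text.chars().filter(|&c| c >= ' ').collect())
    }
}

/// Stand-in for a table cell, tagged with its content state.
struct Cell<'a, C: Content> {
    text: &'a str,
    _state: PhantomData<C>,
}

impl<'a, C: Content> Cell<'a, C> {
    fn new(text: &'a str) -> Self {
        Cell { text, _state: PhantomData }
    }

    /// Placeholder width: counts chars after preparation.
    fn width(&self) -> usize {
        C::prepare(self.text).chars().count()
    }
}

fn main() {
    // Dirty content pays for the cleanup; the tab is not counted.
    assert_eq!(Cell::<Dirty>::new("a\tb").width(), 2);
    // Clean content skips the cleanup entirely.
    assert_eq!(Cell::<Clean>::new("ab").width(), 2);
}
```

The choice is made at compile time through the type parameter, so users who pick `Clean` pay nothing for the stripping step.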

Contributor Author

I don't think the search portion of the replace operation is important to avoid; it's just a single search over the string contents, and when rendering text for human consumption, a character search won't add any significant overhead. The memory allocation, though, is worth avoiding if possible.

Ideally, there'd be a version of str::replace that returns a Cow<str>, and only allocates if it does any replacements. There isn't one in the standard library, but you could add one in util::string. That would eliminate the allocation overhead, and the overhead of a search for control characters isn't worth avoiding.
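
A minimal sketch of such a helper, assuming it only needs to strip characters below U+0020 as the line under review does. The name `strip_control_chars` is hypothetical and not part of papergrid; it uses only the standard library.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of papergrid): removes chars below U+0020,
/// borrowing the input when there is nothing to remove.
fn strip_control_chars(text: &str) -> Cow<'_, str> {
    let is_control = |c: char| c < ' ';
    match text.find(is_control) {
        // No control characters: return the input as-is, no allocation.
        None => Cow::Borrowed(text),
        Some(first) => {
            // Copy the clean prefix, then filter the remainder.
            let mut out = String::with_capacity(text.len());
            out.push_str(&text[..first]);
            out.extend(text[first..].chars().filter(|&c| !is_control(c)));
            Cow::Owned(out)
        }
    }
}

fn main() {
    // Clean input is borrowed.
    assert!(matches!(strip_control_chars("Hello"), Cow::Borrowed("Hello")));
    // Input with control characters is copied once, without them.
    assert!(matches!(strip_control_chars("a\tb"), Cow::Owned(_)));
    assert_eq!(strip_control_chars("a\tb\u{1b}c"), "abc");
}
```

The search still runs on every call, but the allocation happens only when something is actually removed.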

Owner

I'll merge it, and do Cow<str>.
Will run benchmarks and look at it.

Though I quite favor moving it into its own setting 😄
*Although there is one downside: we would need to calculate it twice in most cases... but that's a detail.

Once again, thanks a lot.

}

/// Calculates a number of lines.
32 changes: 25 additions & 7 deletions papergrid/tests/grid/render.rs
@@ -207,14 +207,32 @@ test_table!(
     "+----+--+"
 );
 
+#[test]
+fn emoji_width_test() {
+    use papergrid::util::string::get_string_width;
+    assert_eq!(get_string_width("👩"), 2);
+    assert_eq!(get_string_width("🔬"), 2);
+    assert_eq!(get_string_width("👩\u{200D}🔬"), 2);
+}
+
+test_table!(
+    emoji_handling,
+    grid(2, 1).data([["👩👩👩👩👩👩"], ["Hello"]]).build(),
+    "+------------+"
+    "|👩👩👩👩👩👩|"
+    "+------------+"
+    "|Hello       |"
+    "+------------+"
+);
+
 test_table!(
-    hieroglyph_handling_2,
-    grid(2, 1).data([["জী._ডি._ব্লক_সল্টলেক_দূর্গা_পুজো_২০১৮.jpg"], ["Hello"]]).build(),
-    "+------------------------------------+"
-    "|জী._ডি._ব্লক_সল্টলেক_দূর্গা_পুজো_২০১৮.jpg|"
-    "+------------------------------------+"
-    "|Hello |"
-    "+------------------------------------+"
+    emoji_handling_2,
+    grid(2, 1).data([["👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬"], ["Hello"]]).build(),
+    "+------------+"
+    "|👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬👩\u{200D}🔬|"
+    "+------------+"
+    "|Hello       |"
+    "+------------+"
 );
 
 test_table!(

2 changes: 1 addition & 1 deletion testing_table/Cargo.toml
@@ -17,4 +17,4 @@ ansi = ["ansitok"]
 
 [dependencies]
 ansitok = { version = "0.2", optional = true }
-unicode-width = "=0.1.11"
+unicode-width = "0.2"