Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anagram add tests for graphemes #2445

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

meatball133
Copy link
Member

@BethanyG BethanyG self-requested a review May 18, 2024 16:39
Copy link
Member

@BethanyG BethanyG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instructions are explicit about ASCII characters only. We should have a discussion around weather or not we're internationalizing the Anagrams. Currently, they are all English only, and do not contain umlauts or other accents. I don't think this exercise is a good candidate for Unicode as currently specified.

@meatball133
Copy link
Member Author

meatball133 commented May 18, 2024

The tests already contain non-ASCII characters. I am pretty sure this "β" is Greek for beta (a non ascii letter). From my perspective, it could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first uncicode tests were added.

@meatball133
Copy link
Member Author

Here is the pr for reference: #2366

@BethanyG
Copy link
Member

BethanyG commented May 18, 2024

It could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first uncicode tests were added.

Just because we didn't have the discussion then doesn't mean we can't have it now. I don't think those test cases should have been added, given the instructions (I know I approved that, but I shouldn't have without looking at the docs). Even with the scenario flag.

I think if we go this route, we need to change/clarify the instructions, and we also need to make sure that any test cases form valid words in the target language. For ref, the Wiki article on Anagrams.

@senekor
Copy link
Contributor

senekor commented May 18, 2024

The reasoning at the time of the original PR was that these tests are added under the unicode scenario. (forum post) Languages are free to exclude these tests and add additional instructions if the choose to include them. Example from the Rust track:

# Instructions append

The Rust track extends the possible letters to be any unicode character, not just ASCII alphabetic ones.

@BethanyG
Copy link
Member

BethanyG commented May 18, 2024

Yes - I know. 🙂 But at the time, I didn't realize that the Anagram instructions directed students specifically at ASCII. I think that directive should be removed, and the instructions made more general, or there should be explicit mention of Unicode handling. For example, reverse-string doesn't mention ASCII at all, and micro-blog gets very specific about encodings.

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples. I also think that we should make sure that any test cases follow the rules of Anagram formation.

@ErikSchierboom
Copy link
Member

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples.

I think I would prefer to just not mention ASCII in the instructions, as we already have non-ascii characters.

I also think that we should make sure that any test cases follow the rules of Anagram formation.

What do you mean by this? Just that we double-check if the test cases match the instructions?

@BethanyG
Copy link
Member

What do you mean by this? Just that we double-check if the test cases match the instructions?

An Anagram needs to be a valid word in a given language (capitalization non withstanding), so as the instructions are written I think all candidates passed in test cases either need to be valid words, or obviously not valid, if that makes sense.

@senekor
Copy link
Contributor

senekor commented May 21, 2024

An Anagram needs to be a valid word in a given language

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dicitionary of the English language.

Wikipedia may have some definition of the word Anagram, but natural language processing is not the goal of this exerise, right?

@BethanyG
Copy link
Member

BethanyG commented May 21, 2024

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dicitionary of the English language.

..which is part of the point I am trying to make. In adding in Unicode characters (or any extended ASCII for that matter) we've taken this beyond English. With very few exceptions, English doesn't include any accented characters (nor Greek!)

But that's fine! It just means when we craft test cases, we need to make sure that the candidates are valid words -- whatever the language is that the candidates are written in. Conversely, I don't know any languages that use a Euro sign within words, so "€a" (last test case) is "obviously" wrong (and the start word is a single letter followed by a non-word symbol anyways), with I think is fine.

We might want to add "assume all candidates are valid words" to the instructions as well tho, just to be safe.

And for the test case in Greek, we have ["ΒΓΑ", "ΒΓΔ", "γβα", "αβγ"] with the expected result being ["ΒΓΑ", "γβα"]. But why not "αβγ"? Because its not a valid word - it is "ABC". But I had to go look that up in Google translate to figure that out. So I think that case needs to be changed.

And for this proposed test case, we need to pick a word that's valid, and craft candidates that are also valid words in the same language.

@senekor
Copy link
Contributor

senekor commented May 21, 2024

But why not "αβγ"?

Because it's the lowercase version of ΑΒΓ.

I don't see the impact of whether the words are valid in some natural langue on the user experience, but I'm fine with reimplementing the test cases with actual words that test the same thing.

@senekor
Copy link
Contributor

senekor commented May 21, 2024

Note that the test case using €a is extremely important for correct unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.

@MatthijsBlom
Copy link
Contributor

I don't think considering languages is feasible. The strings abba and baäb might be anagrams in one 'language', but not in another: various writing systems disagree on whether two characters are the same 'letter' or not.

If Unicode tests are to be included, I think it would be wise to have the instructions explicitly define what they mean by 'letter' for the purposes of this exercise, but not consider applicability to the World's writing systems.

@BethanyG
Copy link
Member

Because it's the lowercase version of ΑΒΓ.

D'OH 🤦🏽‍♀️ Nevermind, then.

Note that the test case using €a is extremely important for correct unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.

Probably. In any case, I don't have the interest to find any. 🙂

If we change the instructions to remove the reference to ASCII and add in a note that the student should assume all candidate words are valid, then I am fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants