syl-based content shelving and reinsertion #3

Closed
drupchen opened this issue Jul 6, 2020 · 2 comments · Fixed by #6
Assignees
Labels
enhancement New feature or request urgent

Comments

@drupchen
Contributor

drupchen commented Jul 6, 2020

No description provided.

@drupchen
Contributor Author

drupchen commented Jul 6, 2020

This is to handle tokens that are split across lines, such as "ཝ་ཡེ། བཀྲ་\nཤིས་ཡིན་པས།", where "བཀྲ་ཤིས་" should be counted as a single token.

Stripping the \n is a bad idea for large documents, and splitting the tokens in the output is also a bad idea for most use cases.
The default behaviour should be to shift the \n to the end of the current token, so that we get: "[ཝ་ཡེ] [།] [བཀྲ་ཤིས་] [\n] [ཡིན་] [པས] [།]"

import botok


def get_chunks(raw_string):
    chunker = botok.Chunks(raw_string)
    chunks = chunker.make_chunks()
    chunks = chunker.get_readable(chunks)
    return chunks


def shelve_info(chunks):
    """Separate the chunks to tokenize from the strings to reinsert later."""
    shelved = []       # [(syl_index, string_to_reinsert), ...]
    clean_chunks = []  # chunks that are passed on to the tokenizer

    syl_count = 0      # number of TEXT/PUNCT chunks seen so far
    for marker, text in chunks:
        is_bo = marker in ('TEXT', 'PUNCT')
        if is_bo:
            syl_count += 1

        # 2.a. extract transparent chars
        # TODO: adapt to also include \t as transparent char
        if '\n' in text:
            # shelve the transparent char and keep the chunk without it
            shelved.append((syl_count, '\n'))
            clean_chunks.append((marker, text.replace('\n', '')))

        # 2.b. extract any non-bo chunk
        elif not is_bo:
            shelved.append((syl_count, text))

        else:
            clean_chunks.append((marker, text))

    return clean_chunks, shelved





test = "བཀྲ་ཤིས་བདེ་ལེགས་\nཕུན་སུམ་ཚོགས། this is non-bo text རྟག་ཏུ་བདེ་\nབ་ཐོབ་པ\nར་ཤོག"

# 1. get chunks
chunks = get_chunks(test)

# 2. shelve needed info
chunks, shelved = shelve_info(chunks)

##############################################################################################
# 3. tokenize
str_for_botok = ''.join([c[1] for c in chunks])

tok = botok.WordTokenizer()
tokens = tok.tokenize(str_for_botok)

# extract (text, amount_of_syls) from token list
tokens = [(t.text, 1) if t.chunk_type == 'PUNCT' else (t.text, len(t.syls)) for t in tokens]
##############################################################################################

# 4. reinsert shelved tokens
# at this point, the only thing left is to merge shelved with tokens in accordance with the indices

Here is the content of the two lists at this point of execution:

shelved = [(4, '\n'), (8, 'this is non-bo text '), (11, '\n'), (14, '\n'), (15, '\n')]
format : [(syl_index, string_to_reinsert), ...]

tokens = [('བཀྲ་ཤིས་', 2), ('བདེ་ལེགས་', 2), ('ཕུན་སུམ་', 2), ('ཚོགས', 1), ('། ', 1), ('རྟག་', 1), ('ཏུ་', 1), ('བདེ་བ་', 2), ('ཐོབ་པ', 2), ('ར་', 1), ('ཤོག', 1)]
format : [(token_text, syl_amount), ...]
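
For step 4, one possible way to do the merge is sketched below. This is only a sketch under assumptions, not the implementation that ended up in the fix: `reinsert_shelved` is a hypothetical name, and it assumes that a `syl_index` of n means "this string was found after the n-th syllable or punctuation chunk", which is how `shelve_info` counts. Each shelved string is emitted after the first token whose last syllable reaches its index, so a `\n` that falls inside a token is shifted to the end of that token.

```python
from collections import deque


def reinsert_shelved(tokens, shelved):
    """Merge shelved strings back into the token stream (hypothetical helper).

    tokens:  [(token_text, syl_amount), ...]
    shelved: [(syl_index, string_to_reinsert), ...]
    Assumption: syl_index n means "found after the n-th syllable/punct chunk".
    """
    pending = deque(sorted(shelved, key=lambda item: item[0]))
    merged = []
    syls_seen = 0  # syllables covered by the tokens emitted so far

    def flush():
        # emit every shelved string whose index has been reached
        while pending and pending[0][0] <= syls_seen:
            merged.append(pending.popleft()[1])

    flush()  # strings shelved before the first syllable (syl_index 0)
    for text, syl_amount in tokens:
        merged.append(text)
        syls_seen += syl_amount
        # a string indexed inside this token is emitted only now,
        # i.e. it is shifted to the end of the token
        flush()

    # strings indexed past the last token go at the very end
    merged.extend(string for _, string in pending)
    return merged


shelved = [(4, '\n'), (8, 'this is non-bo text '), (11, '\n'), (14, '\n'), (15, '\n')]
tokens = [('བཀྲ་ཤིས་', 2), ('བདེ་ལེགས་', 2), ('ཕུན་སུམ་', 2), ('ཚོགས', 1), ('། ', 1), ('རྟག་', 1), ('ཏུ་', 1), ('བདེ་བ་', 2), ('ཐོབ་པ', 2), ('ར་', 1), ('ཤོག', 1)]

merged = reinsert_shelved(tokens, shelved)
print(merged)
```

With the two lists above, the `\n` shelved at index 11 falls inside "བདེ་བ་" (syllables 11 and 12) and comes out right after that token, while the non-bo text shelved at index 8 comes out right after "། ".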

@10zinten
Contributor

10zinten commented Jul 11, 2020

@drupchen Is there a way to pass shelved to pybo_mod in Text.custome_pipeline?

Let's discuss in #6

@10zinten 10zinten mentioned this issue Jul 11, 2020
10zinten added a commit that referenced this issue Jul 14, 2020