Transcriber notes containing brackets #195

Closed
TomazErjavec opened this issue Mar 23, 2022 · 7 comments

@TomazErjavec
Collaborator

In the source documents, transcriber notes are often enclosed in brackets or similar marks, which also serve to identify the notes. The (admittedly implicit) assumption in the Parla-CLARIN and ParlaMint recommendations is that these marks are not retained in the marked-up TEI document. This seems to make sense: they are mark-up baggage from the source document and only make the actual content of the notes more opaque.

However, many corpora, e.g. CZ, have retained these markers: <note type="comment">(otevřením?)</note>.
I propose that they be deleted. Rather than opening an issue for every corpus that has them, this could be one of the v2tov3 script functions and recorded in #183.
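
To make the proposal concrete, here is a minimal Python sketch of what "deleting the markers" could mean for a single note. It is not the v2tov3 script; the function name `strip_note_brackets` and the set of bracket pairs are assumptions made for illustration only.

```python
# Bracket pairs assumed to enclose transcriber notes (illustrative, not exhaustive).
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}


def strip_note_brackets(text: str) -> str:
    """Remove one pair of brackets that encloses the whole note, if present."""
    stripped = text.strip()
    if len(stripped) < 2:
        return stripped
    opener, closer = stripped[0], stripped[-1]
    if BRACKET_PAIRS.get(opener) != closer:
        return stripped
    # Make sure the first bracket is closed by the last character,
    # so that a note like "(a) and (b)" is left untouched.
    depth = 0
    for ch in stripped[:-1]:
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
        if depth == 0:
            return stripped
    return stripped[1:-1].strip()


print(strip_note_brackets("(otevřením?)"))  # prints: otevřením?
```

Only a pair that wraps the entire note is removed; brackets inside the note are kept.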

@matyaskopp, would you agree?

TomazErjavec added the bug (Something isn't working) label on Mar 23, 2022
@matyaskopp
Collaborator

When creating the CZ data, I thought about this. I did not pay much attention to the Parla-CLARIN recommendation on this point, and I decided to keep the brackets for this reason:

  • When you run the default XSLT transformation, you expect to get text that makes sense. But when you remove the brackets from comments, senseless text can be produced.

So I am not sure whether the brackets should be removed (or added when they are missing?).
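
The concern can be illustrated with a small Python sketch. It uses `xml.etree.ElementTree` as a stand-in for the default XSLT transformation, which it does not reproduce; the sample fragments are invented for illustration and omit the TEI namespace.

```python
import xml.etree.ElementTree as ET


def dump_text(xml_fragment: str) -> str:
    """Naive text dump: concatenate all text nodes and normalize whitespace."""
    root = ET.fromstring(xml_fragment)
    return " ".join("".join(root.itertext()).split())


# Invented sample data: the same segment with and without brackets in the note.
WITH_BRACKETS = '<seg>Thank you. <note type="comment">(applause)</note> We continue.</seg>'
WITHOUT_BRACKETS = '<seg>Thank you. <note type="comment">applause</note> We continue.</seg>'

print(dump_text(WITH_BRACKETS))     # prints: Thank you. (applause) We continue.
print(dump_text(WITHOUT_BRACKETS))  # prints: Thank you. applause We continue.
```

In the second dump the note can no longer be told apart from the speech around it.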

@TomazErjavec
Collaborator Author

When you run the default XSLT transformation, you expect to get text that makes sense. But when you remove the brackets from comments, senseless text can be produced.

OK, so we have two conflicting requirements:

  1. for those who want to analyze or make use of the comments, it is better if they contain only text
  2. for a straightforward text dump, it is better if the brackets are kept

I had a look at the similar case of the q element, and TEI is (of course!) agnostic on whether to keep the quotation marks, although it does advise that, if the quotation marks are not kept, their original form be recorded in the @rendition attribute.

Note also that if 2. is taken, we have, as you note, further choices:

  1. Leave the original brackets, whatever they are in the source. This has the disadvantage that, again, they can all be different; also, the ParlaMint I partners mostly got rid of them and would now have to change their scripts to retain them, which is a lot of work for a relatively small gain.
  2. Delete the original marks, but insert unified ones, probably square brackets. This has the disadvantage that we are changing the source (but so are we if we delete them), but it is nice because we then have a uniform encoding and the partners don't need to do anything (v2tov3 can remove the old ones and insert the standardised ones).
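
As a rough illustration of the second choice (remove whatever brackets the source used, then insert square brackets), here is a Python sketch over an element tree. It is not the v2tov3 implementation: the helper names are hypothetical, the TEI namespace is omitted, and notes are assumed to contain only text.

```python
import xml.etree.ElementTree as ET

# Bracket pairs assumed to occur in the sources (illustrative, not exhaustive).
PAIRS = {"(": ")", "[": "]", "{": "}"}


def strip_enclosing(text: str) -> str:
    """Drop one bracket pair only if it wraps the whole string."""
    text = text.strip()
    if len(text) < 2 or PAIRS.get(text[0]) != text[-1]:
        return text
    depth = 0
    for ch in text[:-1]:
        depth += (ch == text[0]) - (ch == text[-1])
        if depth == 0:  # first bracket closes early, e.g. "(a) and (b)"
            return text
    return text[1:-1].strip()


def unify_note_brackets(root: ET.Element) -> None:
    """Rewrite every non-empty note so that it is enclosed in square brackets."""
    for note in root.iter("note"):
        if note.text and note.text.strip():
            note.text = "[" + strip_enclosing(note.text) + "]"


# Invented sample: three notes with different source conventions.
root = ET.fromstring(
    '<u><note type="comment">(otevřením?)</note>'
    "<note>applause</note><note>[noise]</note></u>"
)
unify_note_brackets(root)
print([n.text for n in root.iter("note")])
```

Running the function twice gives the same result, since notes already in square brackets are stripped and re-wrapped unchanged.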

So, I would either delete them, or delete them and reintroduce some common brackets. The second might indeed be preferable, but it does mean that maybe the Parla-CLARIN and definitely the ParlaMint guidelines would need to be changed, and this implemented in v2tov3 and probably in the validation script.

@TomazErjavec
Collaborator Author

@matyaskopp, as we are closing issues, maybe we should now also decide how to treat these brackets. My suggestion (delete + re-introduce) is above. What do you think?

@matyaskopp
Collaborator

@TomazErjavec, I agree with delete + reintroduce, but I am not sure where it should be implemented:

@TomazErjavec
Collaborator Author

I would vote for finalization, as there we can also catch and correct the errors of the new partners.

@matyaskopp
Collaborator

I would vote for finalization, as there we can also catch and correct the errors of the new partners.

Ok, agree

@TomazErjavec
Collaborator Author

Transcriber notes are now, after finalization, bracketless, so closing.
