Inspect teiHeader elements that contain text content #215

Closed
matyaskopp opened this issue Apr 26, 2022 · 7 comments
Labels
🕮 Documentation Improvements or additions to documentation enhancement New feature or request

Comments

@matyaskopp
Collaborator

matyaskopp commented Apr 26, 2022

List of metadata elements that contain text content only.
(related to #183 , #205 (comment))

The final list is in this comment; the discussion follows in the comments below.

remove text content from

  • birth, death,
  • sex, affiliation

Note that the @xml:lang attribute from the above elements should also be removed.

preserve text content

  • p
  • ref
  • title, idno, head, label, resp,
  • language,
  • date,
  • name, orgName, persName, placeName, occupation, education, publisher,
  • measure (because it gives bi-lingual info, so we should probably keep it)
  • meeting (I think, because the text content cannot be regenerated from the attributes for some corpora; this can be discussed).
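
As a rough illustration of the "remove text content" rule above, here is a minimal Python sketch. It is not the project's actual processing code: the function name `strip_text_content`, the sample record, and the use of `xml.etree.ElementTree` are all assumptions made for this example. It deletes only the direct text of birth, death, sex and affiliation, drops their @xml:lang, and leaves child elements (such as a placeName inside birth) untouched.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Elements whose text content is to be removed, per the list above.
STRIP_TEXT = ("birth", "death", "sex", "affiliation")


def strip_text_content(root):
    """Remove direct text and @xml:lang from the listed elements.

    Child elements are kept. Returns the number of elements changed.
    (Hypothetical helper, written only for this illustration.)
    """
    changed = 0
    for name in STRIP_TEXT:
        for el in root.iter(f"{{{TEI_NS}}}{name}"):
            had_text = bool((el.text or "").strip()) or any(
                (child.tail or "").strip() for child in el
            )
            el.text = None
            for child in el:
                child.tail = None  # text between or after child elements
            had_lang = el.attrib.pop(XML_LANG, None) is not None
            if had_text or had_lang:
                changed += 1
    return changed


# Made-up sample record; in a real corpus the labels would be in the
# language of the parliament.
sample = f"""<person xmlns="{TEI_NS}">
  <sex value="F" xml:lang="en">female</sex>
  <birth when="1970-01-01" xml:lang="en">1 Jan 1970<placeName>Ljubljana</placeName></birth>
</person>"""

root = ET.fromstring(sample)
print(strip_text_content(root))
print(ET.tostring(root, encoding="unicode"))
```

Only `el.text` and the children's `tail` are cleared, so the attribute values and any subordinate elements survive the clean-up.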
@matyaskopp matyaskopp added enhancement New feature or request 🕮 Documentation Improvements or additions to documentation labels Apr 26, 2022
@matyaskopp matyaskopp changed the title Inspect teiHeader element that contain text content Inspect teiHeader elements that contain text content Apr 26, 2022
@matyaskopp
Collaborator Author

The <sex> element contains text content that labels gender in the language of the parliament:

<element name="sex">
  <attribute name="value">
    <choice>
      <value>M</value>
      <value>F</value>
      <value>U</value>
    </choice>
  </attribute>
  <text/>
</element>

By removing the text in such cases, we lose the information in the parliament's language. Do we want to keep that information?

@TomazErjavec
Collaborator

Well, if we are removing the text content in birth and death (which can be formatted as per object language), then we should also remove sex. So, yes, let us have a consistent policy: if metadata info is available in attribute value, then we do not have it in the text content.
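
A small checker can make this policy concrete. The sketch below is only an assumption about how one might look for violations; `find_violations` and the sample data are hypothetical and not part of the project's tooling. It reports every birth, death, sex or affiliation element that still carries direct text content.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Elements that should carry their information in attributes only.
TEXTLESS = ("birth", "death", "sex", "affiliation")


def find_violations(root):
    """Return (element name, offending text) pairs in document order.

    Only direct text is checked; text inside child elements is allowed.
    (Hypothetical helper, written only for this illustration.)
    """
    violations = []
    prefix = f"{{{TEI_NS}}}"
    for el in root.iter():
        if not isinstance(el.tag, str) or not el.tag.startswith(prefix):
            continue
        name = el.tag[len(prefix):]
        if name not in TEXTLESS:
            continue
        pieces = [el.text or ""] + [child.tail or "" for child in el]
        text = " ".join(p.strip() for p in pieces if p.strip())
        if text:
            violations.append((name, text))
    return violations


# Made-up sample: only <sex> breaks the policy here.
sample = f"""<person xmlns="{TEI_NS}">
  <sex value="M">male</sex>
  <birth when="1970-01-01"><placeName>Praha</placeName></birth>
  <death when="2020-01-01"/>
</person>"""

for name, text in find_violations(ET.fromstring(sample)):
    print(f"{name}: unexpected text content {text!r}")
```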

@TomazErjavec
Copy link
Collaborator

I think we now have a complete list, it is actually shorter than I thought. Here are some further comments:

  • we really need to remove pure text content from birth and death as they can have a subordinate element (i.e. the place where somebody was born or died)
  • by analogy, we should also remove the text content of date
  • sex could really stay with text content, although it is somewhat useless and repetitive, so I think we should delete it too
  • affiliation was already empty for most corpora, so we should remove its text as well.

@matyaskopp, do you agree? If so, can you implement the fixes in processing, and I will do the schema + ODD?

@matyaskopp
Collaborator Author

@TomazErjavec Agree.
You can do it now; I think it would be easier to implement the fixes with invalid samples at hand.

@TomazErjavec
Collaborator

> You can do it now; I think it would be easier to implement the fixes with invalid samples at hand.

Out of time now but will do it soon. Note that these are massive changes, so there will be lots of errors!

@TomazErjavec
Collaborator

> by analogy, we should also remove the text content of date

Thinking about this some more, I think we should allow text content for date as it appears in some contexts where this makes sense, e.g. publicationStmt/date. So, not changing it, and editing the list in the first comment.

@TomazErjavec
Collaborator

OK, did it, I think: Schema corrected in devel, TEI in description and description-desc. Also merged description into description-desc, was a nightmare, I hope I haven't messed things up. There are of course hundreds of errors now in validation...
Now have to fix Parla-CLARIN, so we are not in contradiction with empty birth, death, sex, affiliation.

matyaskopp added a commit that referenced this issue May 17, 2022