Update emendation rules to improve search engine for Latin, abbreviations, nicknames, etc. #2779

Open
FreeREGcomputer opened this issue Sep 25, 2024 · 4 comments
@FreeREGcomputer

Issue reported by cathyj at 2024-09-25 13:30:20 UTC
Time: 2024-09-25T13:20:00+00:00
Session ID: 215cefaf95fe9c16d2cf318cfe404cf7
Problem Page URL: /search_queries/new?locale=en
Previous Page URL: https://www.freereg.org.uk/cms/information-for-transcribers
Reported Issue:
Not all Latinized names are being returned from searches, e.g. Davidus, Agneta. For those that are being returned, forms with genitive, dative, or accusative endings are not always returned. Because we TWYS (type what you see) with endings too, there will be entries such as Henrici, Thome & Simonis (genitive) and Elizabetham (accusative), see http://beinecke1.library.yale.edu/info/bookcataloging/latinnames.htm.
At the very least, I think we should return all of the commonest Latin names if the English version is searched, see https://homepages.rootsweb.com/~oel/givennames2.html

@stsccfr
Collaborator

stsccfr commented Sep 25, 2024

For what it's worth, there is a list of forenames that the search engine can find in our Help pages, which may be hard to find so let me put the link here: Forename abbreviations

Addressing this problem in a general way may be fiendishly difficult given the variation in the endings of names in languages with noun declensions. We may have to settle for using wildcards, which can be used for both forenames and surnames, but only if one specifies a Place, which can be a serious limitation. Alternatively, we could extend the search engine's capabilities on a case-by-case basis. Now that we have Cathy's report for Davidus and Agneta, we could add special support for these. And the Rootsweb list that she provided gives us something specific to work with as well.
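To make the difficulty with endings concrete, here is a minimal sketch that generates inflected forms for two regular patterns only (names in -us and names in -a). The table `LATIN_ENDINGS` and the method `latin_variants` are hypothetical names invented for this illustration and are not part of the FreeREG code. Third-declension names such as Simon/Simonis and irregular ones such as Thomas/Thome are deliberately not covered, which is exactly why a general solution is hard.

```ruby
# Hypothetical table: nominative ending => endings to generate.
# -us names: nominative, genitive, dative/ablative, accusative.
# -a names: nominative, genitive/dative (classical -ae, medieval -e), accusative.
LATIN_ENDINGS = {
  'us' => %w[us i o um],
  'a'  => %w[a ae e am]
}.freeze

# Returns the inflected forms for a name in one of the two patterns,
# or just the name itself when no pattern applies.
def latin_variants(nominative)
  LATIN_ENDINGS.each do |ending, replacements|
    next unless nominative.end_with?(ending)

    stem = nominative[0...-ending.length]
    return replacements.map { |replacement| stem + replacement }
  end
  [nominative]
end

latin_variants('Henricus')   # includes "Henrici"
latin_variants('Elizabetha') # includes "Elizabetham"
```

Each generated form would need its own emendation rule, so even this small table multiplies the rule count by four per name.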

Another related problem is names that differ in an initial H sound, such as Hannah/Anna and Ellen/Helen. We treat these as distinct names nowadays, but in the past it wasn't so clear, especially when people spelt phonetically and were hearing something pronounced by someone who had an accent in which an initial H sound was added or removed. FamilySearch treats these pairs as equivalent, but we don't, and Soundex doesn't help because the initial letter is always preserved in a Soundex code. If I were searching for the baptism of Ellen Arrowsmith, I would definitely want the record for Helen Harrowsmith included in my search results.
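The Soundex point can be checked with a small sketch. The `soundex` method below is a plain implementation of standard American Soundex written for this illustration, not the code the search engine uses, and `h_insensitive_soundex` is a hypothetical variant that drops an initial H before a vowel so that the pairs mentioned above share a key.

```ruby
SOUNDEX_CODES = {
  'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
  'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
  'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
  'D' => '3', 'T' => '3',
  'L' => '4',
  'M' => '5', 'N' => '5',
  'R' => '6'
}.freeze

# Standard American Soundex: first letter kept, then up to three digits.
def soundex(name)
  letters = name.upcase.gsub(/[^A-Z]/, '')
  return '' if letters.empty?

  first = letters[0]
  digits = String.new
  previous = SOUNDEX_CODES[first]
  letters[1..-1].each_char do |ch|
    code = SOUNDEX_CODES[ch]
    if code
      digits << code unless code == previous
      previous = code
    elsif ch != 'H' && ch != 'W'
      previous = nil # a vowel separates repeated codes; H and W do not
    end
  end
  (first + digits)[0, 4].ljust(4, '0')
end

# Hypothetical variant: ignore an initial H that precedes a vowel.
def h_insensitive_soundex(name)
  soundex(name.sub(/\Ah(?=[aeiou])/i, ''))
end

soundex('Ellen')                     # "E450"
soundex('Helen')                     # "H450", so plain Soundex keeps them apart
h_insensitive_soundex('Helen')       # "E450"
h_insensitive_soundex('Harrowsmith') # "A625", same as soundex('Arrowsmith')
```

A key like this would also merge pairs that searchers may not want merged, so it adds false positives in the same way Soundex itself does.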

We also need to extend the list of names in which X is being used as an abbreviation for Chris or Christ. If one searches for Christopher one will find records containing Xtopherus, but not records containing just Xtopher or Xtofer. We also need Christian to find Xtian and Christiana to find Xtiana.
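One way to think about the X cases is as a single expansion rather than a list of pairs. The sketch below is only an illustration under that assumption: `expand_x_abbreviation` is a hypothetical helper, not something in the current code, and the real emendation rules are explicit name pairs.

```ruby
# Hypothetical helper: treat a leading X followed by T as "Chris",
# so that "Xt..." becomes "Christ...". Names such as Xavier are untouched.
def expand_x_abbreviation(name)
  name.sub(/\Ax(?=t)/i, 'Chris')
end

expand_x_abbreviation('Xtopher') # "Christopher"
expand_x_abbreviation('Xtian')   # "Christian"
expand_x_abbreviation('Xtiana')  # "Christiana"
expand_x_abbreviation('Xtofer')  # "Christofer"
```

Note that Xtofer expands to Christofer, not Christopher, so a spelling rule or Soundex would still be needed to connect it to the name being searched for.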

It should also be mentioned that anything we do to increase the number of desired search results may also increase the number of false positives (like Soundex does) and this in turn can increase the likelihood of exceeding the limit of 500 results in any one search, or even the limit of 100 seconds for search time.

@stsccfr
Collaborator

stsccfr commented Dec 5, 2024

Just to add some more information: abbreviations, nicknames, Latin names, etc. are all handled in the same way by the code that creates the search_record. They are referred to as 'emendations' and the rules to create them are in lib/tasks/load_emendations.rake. In each emendation rule, the name being searched for is paired with an alternative name, so that records containing that alternative name are included in the search results.
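As a rough picture of what such a pairing does when a search_record is built, here is a minimal sketch. It uses a plain hash and a hypothetical method `searchable_forenames`; the actual rule format in load_emendations.rake is not shown in this thread and may look quite different.

```ruby
# Hypothetical rule table: name as transcribed => name a searcher would type.
# These two pairs are illustrations, not copied from the rake file.
EMENDATIONS = {
  'wm'        => 'william',
  'xtopherus' => 'christopher'
}.freeze

# Returns every forename under which a record should be findable:
# the transcribed form itself plus its emended form, if a rule exists.
def searchable_forenames(transcribed)
  key = transcribed.downcase
  [key, EMENDATIONS[key]].compact.uniq
end

searchable_forenames('Xtopherus') # ["xtopherus", "christopher"]
searchable_forenames('Harriett')  # ["harriett"], no rule so no extra name
```

Because the lookup happens for every name in every record, the work is done at build time rather than at search time, which is where the processing cost discussed below comes from.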

There is a cost to applying these emendation rules, not so much in the time it takes to search the database, but in the time it takes to build the search_records from the records in the CSV files. The total processing time for a CSV file is currently limited to just 600 ms (see app/models/csvfile.rb) so we cannot have as many emendations as we like.

So how should we decide which emendations to use? We currently have just over 2400 emendations. That is 4 emendations per abbreviation, because any abbreviated name is given 4 rules: one for the name without any abbreviation mark at the end, and one for each of the three abbreviation marks (full stop, colon, and hyphen), for example Wm, Wm., Wm: and Wm-. I don't know how many emendations we can have given the current limit on processing time of 600 ms, but in any case what makes the most sense to me is to choose emendations on the basis of their frequency in the current database. To give an example: if you search for Christopher you will also get records in which that forename was rendered as Xtopherus. But if you search for Harriet you won't find any records for Harriett. There are currently only 86 instances of Xtopherus in the database, but there are 137,705 instances of the spelling Harriett, so to make the most of the limited processing time that we are given, I would say that the rule for Harriett is more important than the one for Xtopherus.
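The two ideas in this paragraph, four rules per abbreviation and choosing rules by frequency, can be sketched as follows. The method names and the constant `ABBREVIATION_MARKS` are hypothetical; the counts are the ones quoted in this comment.

```ruby
# The bare abbreviation plus the three marks: full stop, colon, hyphen.
ABBREVIATION_MARKS = ['', '.', ':', '-'].freeze

# Builds the four [variant, full_name] pairs for one abbreviation.
def abbreviation_rules(abbreviation, full_name)
  ABBREVIATION_MARKS.map { |mark| [abbreviation + mark, full_name] }
end

# Orders candidate variants so the most frequent come first.
def rank_by_frequency(counts)
  counts.sort_by { |_variant, count| -count }.map(&:first)
end

abbreviation_rules('Wm', 'William').map(&:first)
# ["Wm", "Wm.", "Wm:", "Wm-"]

# Instance counts as reported in this comment.
candidates = { 'Xtopherus' => 86, 'Harriett' => 137_705, 'Eizabeth' => 1_959 }
rank_by_frequency(candidates)
# ["Harriett", "Eizabeth", "Xtopherus"]
```

With a fixed budget of rules, one would take candidates from the front of the ranked list until the budget is used up; the size of that budget is the unknown mentioned above.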

One might argue that one should just rely on Soundex to find Harriett and other names that differ only in vowel sounds or doubled consonants. We will certainly have to rely on Soundex very often to handle much of the variation, but consider 'Eizabeth', which is surely a misspelling of Elizabeth. Soundex won't help you here, but there are 1959 instances of this misspelling in the database, which is too many for us to go back and fix by hand. If fixing them programmatically is deemed too risky, then we should consider creating emendation rules for common misspellings/mistranscriptions as well.

@stsccfr stsccfr transferred this issue from FreeUKGen/FreeUKRegProductIssues Dec 8, 2024
@stsccfr
Collaborator

stsccfr commented Dec 8, 2024

For additional info on the emendation rules that enable the search engine to find variant names, see #516 and #823. In #823, in his comment of 15 Jul 2017, Kirk mentions how to avoid a complete rebuild of the search_record_collection when emendations are changed or added: use a forename index to find only those records that need to be updated under the new rules.
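The idea of updating only the affected records can be sketched with an in-memory index. This is an illustration of the general technique only; the record layout, `build_forename_index`, and `records_needing_update` are assumptions made for the example and do not reflect the actual database schema or the details given in #823.

```ruby
# Builds a hypothetical index: lower-cased forename => ids of records using it.
def build_forename_index(records)
  index = Hash.new { |hash, key| hash[key] = [] }
  records.each { |record| index[record[:forename].downcase] << record[:id] }
  index
end

# Given new or changed rules (variant => searched name), returns only the
# ids of records whose forename is one of the affected variants.
def records_needing_update(index, changed_rules)
  changed_rules.keys.flat_map { |variant| index.fetch(variant.downcase, []) }.uniq
end

records = [
  { id: 1, forename: 'Harriett' },
  { id: 2, forename: 'Christopher' },
  { id: 3, forename: 'harriett' }
]
index = build_forename_index(records)
records_needing_update(index, 'Harriett' => 'Harriet') # [1, 3]
```

Only records 1 and 3 would be rebuilt here; record 2 is left alone because no changed rule mentions its forename.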

@stsccfr stsccfr changed the title from "307200620 Latinized names with their various endings (Catherine)" to "Update emendation rules to improve search engine for Latin, abbreviations, nicknames, etc." Dec 24, 2024
@stsccfr
Collaborator

stsccfr commented Jan 9, 2025

A new load_emendations.rake file has been sent to Vino for deployment on test3. It contains very nearly the same number of emendations as the old file (about 2500) so there should not be any significant change in the CSV processing time. The information on the Forename abbreviations Help Page should get updated automatically.
