Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(9586): implement freetext search in cht datasource #9625

Open
wants to merge 72 commits into
base: master
Choose a base branch
from

Conversation

sugat009
Copy link
Member

@sugat009 sugat009 commented Nov 7, 2024

Description

Closes: #9586

Code review checklist

  • Readable: Concise, well named, follows the style guide, documented if necessary.
  • Documented: Configuration and user documentation on cht-docs
  • Tested: Unit and/or e2e where appropriate
  • Internationalised: All user facing text
  • Backwards compatible: Works with existing data and configuration or includes a migration. Any breaking changes documented in the release notes.

Compose URLs

If Build CI hasn't passed, these may 404:

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

@sugat009 sugat009 linked an issue Nov 7, 2024 that may be closed by this pull request
@sugat009 sugat009 force-pushed the 9586-implement-freetext-search-in-cht-datasource branch from eba7aac to 43efbef Compare November 18, 2024 08:59
@sugat009
Copy link
Member Author

@jkuester the e2e tests related to cht-datasource are failing right now. I remember seeing something related to changes in how we deal with auth and its impact in cht-datasource. Do we have any workaround at the moment?

@jkuester
Copy link
Contributor

jkuester commented Jan 14, 2025

Oh wow. I did not realize that the end-result of the changes was to just delete the existing remote cht-datasource tests with no alternative. This is very disappointing.

We can do better here, though. The challenge is that in a browser context, the session cookie is automatically included when making the remote fetch calls. The cht-datasource remote context is only currently used in the admin app and in webapp, both instances running in the browser. This is why we never added any code to cht-datasource to handle auth on the remote fetch calls. I see two options for fixing our tests here:

1. Update cht-datasource to accept auth information when creating a remote DataContext

This would essentially be addressing #9701. We could pass username/password in the getRemoteDataContext call and then update the cht-datasource code to set those on the fetch calls.

As I noted on the ticket, though, I am reluctant to make changes to the implementation code just for testing purposes.

2. Re-introduce the MITM for global.fetch in the integration tests

Some test code like this should get the integration tests running again:

  const { USERNAME, PASSWORD } = require('@constants');
  const initialFetch = global.fetch;
  before(() => {
    const headers = new Headers();
    headers.set('Authorization', 'Basic ' + Buffer.from(`${USERNAME}:${PASSWORD}`).toString('base64'));
    global.fetch = (url, options) => initialFetch(url, { headers, ...options, });
  });
  after(() => {
    global.fetch = initialFetch;
  });

We could add this code to each of the files in tests/integration/shared-libs/cht-datasource or we could try to put together a hooks file specific to these tests where we could centralize this code. (Would like to still be able to run the tests from the integration-all-local npm script.)


At this point I am fine with either approach. #2 would be easier to work into an already huge PR. #1 will probably be needed eventually if we want to be able to use cht-datasource outside of cht-core (still probably a BIG "if").

@sugat009
Copy link
Member Author

I'm more inclined towards method #2 at the moment as the ticket to address the issue has already been created, whereas the majority of the scope of #9586 has been addressed through this PR.

@sugat009
Copy link
Member Author

@jkuester I did not go with the hook implementation as it would mean checking for filenames to run because this auth setup needs to run only for a few test suites, which would have resulted in a not-so-fruitful hook file with checks.

@sugat009 sugat009 requested a review from jkuester January 17, 2025 07:56
Copy link
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright! We are getting very close here.

shared-libs/cht-datasource/src/contact.ts Outdated Show resolved Hide resolved
shared-libs/cht-datasource/src/contact.ts Outdated Show resolved Hide resolved
freetext: Nullable<string> = null,
type: Nullable<string> = null
) => ctx.bind(Contact.v1.getIds)(Contact.v1.createQualifier(freetext, type)),
getUuids: (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: I guess this should be getUuidsByTypeFreetext. Then we also need:

  • getUuidsPageByType
  • getUuidsByType
  • getUuidsPageByFreetext
  • getUuidsByFreetext

Right? I am still open to other ideas here since this full permutation approach is not very scalable. 😞 But, I am not sure what else to do except for accepting Qualifier parameters in this API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other alternative I could think of is to make every parameter optional and process according to the non-empty parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, after further discussion on the squad call, we decided to go forwards with having specific functions for each permutation of search parameters that we want to support. This will result in a lot of functions, but the amount of duplicated code will be limited since it will all be just calling through to the same core api functions. Additionally, being able to have verbose names for each function will improve the readability/clarity of this api. This seems like an opportunity to find simplicity by expanding horizontally instead of vertically (by nesting a bunch of optional params into the same qualifier object).

shared-libs/cht-datasource/src/index.ts Outdated Show resolved Hide resolved
shared-libs/cht-datasource/src/qualifier.ts Outdated Show resolved Hide resolved
const allContacts = person ? [person, ...fetchedContacts] : fetchedContacts;
const contactsWithHydratedPrimaryContact = contacts.map(
hydratePrimaryContact(allContacts)
).filter(item => item ?? false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: why was this check included? I don't think we want to filter the null values here. That is part of the horrible jankyness here with the lineage where we might not actually find docs for all the levels of a contacts hierarchy. But, if a level is missing, we want to show that with null and not just drop the entry so it looks like there was no level there at all...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I was not aware of this. When implementing this for the first time in shared-libs/cht-datasource/src/local/contact.ts there was a case for person typed contact in which null values were being returned which was not happening in the corresponding shared-libs/cht-datasource/src/local/person.ts so, I thought null values were not required.

I'll remove the filter from here.

};

/** @internal */
export const getContactLineage = (medicDb: PouchDB.Database<Doc>) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I guess now the getPrimaryContactIds, hydratePrimaryContact, hydrateLineage functions do not need to be exported since they are only getting used in this file.

data: pagedDocs.data.map((doc) => doc._id),
cursor: pagedDocs.cursor
};
return await getPaginatedDocs(getDocsFn, limit, skip);
Copy link
Contributor

@jkuester jkuester Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: okay bad news 😞 I found an edge case that breaks this logic... We should never get null id values back from these freetext queries, but it is possible to get duplicate id values.

To test this, I created a contact with:

  "name": "Alberto O'Kon Rivera",
  "short_name": "River",

Then I did a search with freetext=river. The list of ids returned contained two instances of the same id value for this contact. This is because the getByStartsWithFreetext query will match the emissions for both River and Rivera (as intended). To make things even worse, even if we dupe-checked the ids returned for a given page, I don't think there is any reason that Couch could not give us more dupes of the same id later on different pages (because the view will return things ordered by key not by id).... 😬

@sugat009 definitely interested to hear your thoughts on the best way to proceed here. Pragmatically, my current inclination is to guarantee that each page will be free from duplicates, but then note in our documentation that different pages could contain the same ids. (I cannot come up with any feasible way to guarantee no dupes across pages).

The good news is that if we decide to just dupe-check on a page-by-page basis, then I think we can easily just reuse the fetch and filter logic here like this:

      const uuidSet = new Set<string>();
      const filterFn = (uuid: Nullable<string>): boolean => {
        if (!uuid) {
          return false;
        }
        const { size } = uuidSet;
        uuidSet.add(uuid);
        return uuidSet.size !== size;
      };

      return await fetchAndFilter(
        getDocsFn,
        filterFn,
        limit
      )(limit, skip);

(We just need to update the fetchAndFilter signature and change <T extends Doc> to just be <T>. I don't think we actually need the extends Doc for the current functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing is going to apply for the contact logic too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after a bunch more investigation, I still believe the best approach is to ensure there are no duplicate entries on each page, but allow duplicate entries to be returned across pages. This is slightly better than in the current implementation of shared-libs/search where, as far as I can tell, the default page limit was 50 and there was no dupe resolution at all for single-request search queries (so it would have been possible to get duplicate results back even on the same page).

For the record, here are the various approaches I considered to try and avoid returning duplicate entries across pages (and why each of them will not work):

  1. Edit the view code to only emit once for each contact.
    • This would just break freetext searching since it depends on mapping multiple search terms to the same contact.
  2. Use a reduce function on the view to combine duplicate results.
    • To properly support freetext searches, we need to emit multiple keys for the same contact. The "duplicate" results are produced at runtime and are caused by doing a "starts-with" search across a range of keys. I do not see any way to use a reduce to join results for a contact that would be useful for a "starts-with" type search.
  3. Emit the contact _id value as part of the key. Then the query results would be sorted by contact id.
    • You cannot use the _id as the first value in a key array (unless you wanted to only search for entries for a particular id). Otherwise you would need to match all for the first element in the array and that would prevent filtering by any subsequent elements in the array. The start_key/end_key boundaries would essentially just include everything.
    • You cannot emit the _id value as part of the key (concatenated to the actual search key). If you put the _id at the beginning of the key value, it would break the "starts-with" range search functionality. If you put the _id at the end of the key value, there still would be no way to guarantee no duplicates because the results would be sorted by key and the same id value might be later in the results after a slightly different search term.
  4. Just pull back all the ids and filter them for duplicates.
    • Technically this would work, but would totally defeat the purpose of having a paged API. It would also be much less performant than our current approach.

tests/integration/shared-libs/cht-datasource/auth.js Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement freetext search in cht-datasource
3 participants