Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research freetext view use to inform hosting TCO reduction #9293

Closed
1 of 4 tasks
mrjones-plip opened this issue Jul 31, 2024 · 19 comments
Closed
1 of 4 tasks

Research freetext view use to inform hosting TCO reduction #9293

mrjones-plip opened this issue Jul 31, 2024 · 19 comments
Assignees
Labels
Type: Technical issue Improve something that users won't notice

Comments

@mrjones-plip
Copy link
Contributor

mrjones-plip commented Jul 31, 2024

Describe the issue
it's been decided that storage is the highest cost for hosting total cost of ownership (TCO). It's suspected that CHT's use of CouchDB freetext views cause a large amount of disk space. But how much? How might the size of the view be reduced? Where does the UI surface the use of these views?

Describe the improvement you'd like
research and document use of freetext views so we know how to safely reduce the disk use caused by using them

Describe alternatives you've considered
na


Next Steps:

  • Check Couch/HAProxy logs for instances of the views being used
  • community research
  • Remove views and run e2e tests and see what happens
  • Add logging for the view and manually poke around the app

Open Questions:

  • Where are the views being used programmatically?
  • What fields are being searched for?
  • What are users actually searching for?

CouchDB views for freetext searching

These get queried by the @medic/search shared lib:

These generated queries are then executed one level higher in shared-libs/search/src/search.jsThis is then imported and ran by:


logs are VERY informative, even showing the keyword freetext !here's couch logs when searching for bbbbbbbbbbbbbbbbbbbbbbbbbb:

[notice] 2024-07-31T14:12:35.171290Z [email protected] <0.16948.0> 8d6a7d7e03 192-168-68-235.local-ip.medicmobile.org 172.20.0.7 medic GET /medic/_design/medic-client/_view/contacts_by_freetext?startkey=%5B%22bbbbbbbbbbbbbbbbbbbbbbbbbb%22%5D&endkey=%5B%22bbbbbbbbbbbbbbbbbbbbbbbbbb%EF%BF%B0%22%5D 200 ok 9

As well, it shows up in HA proxy for online users when searching for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:

<150>Jul 31 14:11:16 haproxy[12]: 172.20.0.7,couchdb,200,5,0,0,GET,/medic/_design/medic-client/_view/contacts_by_freetext?startkey=%5B%22aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%22%5D&endkey=%5B%22aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%EF%BF%B0%22%5D,-,medic,'-',276,5,-,'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0'
@garethbowen
Copy link
Member

An additional "next step" that I think would be informative is taking a representative database and deleting the freetext views from the ddoc, calling view_cleanup, and then comparing the on disk size of the db. If it's not significant then we're barking up the wrong btree.

@m5r m5r self-assigned this Aug 1, 2024
@m5r
Copy link
Member

m5r commented Aug 8, 2024

I've done some research over the last few days with a relatively large dataset to give us an idea of how much disk space we could save for large instances. I've generated 5.94M contacts and 840k reports with the test data generator that amounts to a 5.1GB medic database and an 18GB medic-client. For comparisons, one of our largest deployments has 29.3M documents in the medic database that amounts to 80GB and a 187GB medic-client.

NB: when measuring the medic-client ddoc, I'm looking at the data size on disk that CouchDB returns when querying GET /medic/_design/medic-client/_info.

mrjones sez:

You can see check this amount with curl and jq. I loaded the data up using the comment below and confirm (but I get 15GB instead of 18!?)

curl -qs http://admin:groot@localhost:5984/medic/_design/medic-client/_info | jq '(.view_index.sizes.file/1000/1000/10|round/100)'
15.19

In this PR I managed to shrink medic-client from 18GB to 10.2GB so roughly -43%. With similar proportions, that would save about 80GB of storage on that large deployment I mentioned earlier. This is very much napkin math to get an idea of what to expect but it doesn't look as big of a shrink as I initially thought 😅
In my opinion, it's still a win but before moving forward we should see how these views are queried in production and understand the intention behind those queries to figure out what are the expected results. With that piece of info we can better adjust what the view indexes and make sure we reduce the chance of potentially shipping a breaking change in the search behavior.

NB2: The PR doesn't pass CI because some integration and e2e tests are failing. They're few but they rely on the fulltext views to index every field of every documents, they query arbitrary fields and expect the values on those fields to get picked up by the index and have those documents turn up in the search. But basic searches work as expected and their respective tests pass 🟢

Out of curiosity I deleted the 3 _fulltext views and measured again, this time medic-client went down to 6.2GB. So those 3 fulltext views account for roughly 2/3rd of the size of medic-client. Triggering the view cleanup didn't make any visible difference.

What to do next:

  • look at production logs to see how it's used to adjust the view indexing with users' usage and expectations
  • decide if it's the right thing to pursue considering the amount of disk space estimated to save
  • reproduce the experiment on production data by:
    • cloning a large production instance
    • measuring disk space (on the machine with df for example, and via couch APIs)
    • applying desired changes to the _freetext views
    • ❗❗ measuring how long it takes for couch to re-index the views
    • and finally measuring how much disk space we saved
  • look for other approaches to reduce hosting costs:
    • reduce the size of other ddocs, for reference, on that large deployment I mentioned earlier, medic-scripts represents 21.9GB, medic-conflicts 32.7GB, medic-admin 6.3GB, and medic (the ddoc) 34.7GB
    • find a more storage-effective way to do fulltext search (I'm not very convinced we can achieve something meaningful here, client-side search fulltext search in general is not easy)
    • find any other low hanging fruit to reduce RAM/CPU requirements (upgrading couch, optimizing a costly algorithm, that kind of thing...)

@mrjones-plip
Copy link
Contributor Author

mrjones-plip commented Aug 9, 2024

If anyone needs to reproduce testing on a large data set, here's the steps from @m5r :

Download the archive (private google drive link)

compose.yml file:

services:
  couchdb:
    image: public.ecr.aws/medic/cht-couchdb:4.9.0
    volumes:
      - ${COUCHDB_DATA:-./srv}:/opt/couchdb/data
      - cht-credentials:/opt/couchdb/etc/local.d/
    environment:
      - "COUCHDB_USER=${COUCHDB_USER:-admin}"
      - "COUCHDB_PASSWORD=${COUCHDB_PASSWORD:?COUCHDB_PASSWORD must be set}"
      - "COUCHDB_SECRET=${COUCHDB_SECRET}"
      - "COUCHDB_UUID=${COUCHDB_UUID}"
      - "SVC_NAME=${SVC_NAME:-couchdb}"
      - "COUCHDB_LOG_LEVEL=${COUCHDB_LOG_LEVEL:-error}"
    restart: always
    logging:
      driver: "local"
      options:
        max-size: "${LOG_MAX_SIZE:-50m}"
        max-file: "${LOG_MAX_FILES:-20}"
    networks:
      cht-net:

volumes:
  cht-credentials:

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}

and the couchdb-override.yml

services:
    couchdb:
        ports:
          - "5984:5984"
          - "5986:5986"

run it with:

COUCHDB_USER=admin COUCHDB_PASSWORD=groot docker compose -f ./compose.yml -f ./couchdb-override.yml up -d

@garethbowen
Copy link
Member

@m5r Fantastic data, thanks!

I had a quick look at the PR and I think we can reduce disk space a little further by not emitting fields with <key>:<value>. That feature is poorly documented and I doubt anyone uses it in production. If that's the case it would be good to check if that makes a measurable additional improvement to disk usage.

measuring how long it takes for couch to re-index the views

This will also be important to keep an eye on - thanks for calling it out. I know the PR is in very draft state, but there are many improvements we can make to it to improve compile time with this new approach, for example, not iterating over all fields, but only checking those that are explicitly to be indexed. So all that is to say, it might be better to measure this once we've decided this is the right approach and have made the other improvements to the view.

It would also be good to check the total indexing time compared to the current freetext views because new documents will have to be indexed either way, and if this one is significantly faster then that will be another upside to consider.

@jkuester
Copy link
Contributor

jkuester commented Aug 23, 2024

Out of an abundance of caution, I have taken the time to drill down a bit further into the source code that is using shared-libs/search and can confirm that there does not seem to be any programatic usage of the freetext searching (this is very much what we expected). The freetext search queries always originate from the user entering the query text either into a select2 search box or into the search box on the reports/contacts tabs.

@jkuester
Copy link
Contributor

jkuester commented Aug 23, 2024

Based on the above findings that we do not have any programatic requirements around which fields are available in the freetext search, I would like to start a more detailed design discussion regarding exactly which fields we should index. @garethbowen I would appreciate your feedback on this proposal!

We should only index a field if:

  • It contains data that a user could be expected to know. (e.g. do not need index fields that contain UUID values, or other codified data that would not be visible to a user)
  • It is not overly generic such that there would be no value in searching. (e.g. it would be unhelpful to search for a contact by sex or for a report by form)
  • It is a "known" field that is likely to be used in most CHT deployments. (Not a custom field that is specific to a certain project.)

Given this criteria, here are the fields that I propose indexing:

Contact docs

  • patient_id
  • place_id
  • case_id
  • name
  • phone
  • date_of_birth
  • external_id? - No special CHT logic attached to this field, but maybe its usage is common enough in the ecosystem to make it useful to index.
  • house_number? - Used in Nepal by users to for an address

Report docs

  • fields.patient_id
  • fields.place_id
  • fields.case_id
  • fields.patient_name

Additionally, I propose that we index a new freetext_search_index search_keywords field on both contacts and reports. This field would allow partners to be able to include data from other fields into the search index. As long as everything is space-separated, they should be able to combine data from multiple actual fields into freetext_search_index.


Also, for completeness I just want to include the further recommendations Gareth made above. When we re-work the freetext view code, we should:

  • Avoid emitting the <key>:<value> entries
    • Except for case_id:${id} - this we need to keep for the special reports-page linked search functionality.
  • Directly check for the intended fields we want to index on instead of looping through all of Object.keys.

@mrjones-plip
Copy link
Contributor Author

Are all the *_id fields listed above UUIDs such that we're expecting folks to search for them or are they shorter IDs (eg 3-8 digits)? If the former, seems like we shouldn't index, yah?

@jkuester
Copy link
Contributor

Those are all "shortcodes" and not UUIDs.

@mrjones-plip
Copy link
Contributor Author

Perfect! thanks for the confirmation.

@garethbowen
Copy link
Member

there does not seem to be any programatic usage of the freetext searching

I don't think this is correct - there is a very special use case which searches by case_id by creating a link on the field value:

return { filter: `case_id:${id}` };

I don't know if this is even used, so we could investigate to see if we can drop it, but this is a big job. Alternatively keep case_id:<shortcode> as a special case in freetext search, which if it isn't used, won't take up any disk space. Or we could rework it to be a special kind of filter so it uses a custom view rather than the generic freetext view.

I propose indexing...

I'm not sure about indexing contact phone number. I assume we'd index the standardised format, but nobody is going to type that in, so we'll need to also standardise the input, which won't work if anyone uses space separation. At this point I would leave it out, and we can add it back in if someone misses it.

Likewise I don't understand indexing the contact dob. It's skipped in the current implementation but I don't know how you would format it and how users would work out how to type it in the correct format. Also would it need to be exact or would we fuzzy match, looking for someone born in "september 2023". Can you give some examples of how this would work, what use case this would serve, and how users would discover it?

I propose that we index a new freetext_search_index field on both contacts and reports.

This is a nice idea, and means there's some extensibility for use cases we haven't thought of. It feels like a different issue, but if we do it in two releases it'll mean two expensive view reindexes so I appreciate you bringing it up now - maybe it's cleanest as two commits in the same release. How would app developers populate this field? Would it actually be fields.freetext_search_index so it's built up by enketo? Or would there be some custom code to hoist the value up to the root? Also may I recommend it's called something more intuitive, like keywords or search_keywords? I think it warrants discussion with app devs - can you raise a new issue for this to get the conversation going? Then we can decide whether there's enough alignment to get it added in the same release.

@jkuester
Copy link
Contributor

jkuester commented Aug 26, 2024

there is a very special use case which searches by case_id by creating a link on the field value

😓 Thanks for pointing this out. I was even originally looking for the case_id usage, but just could not find the connection back to the search service. (Turns out the code path is: format-data-record.service > report-view-model-generator.service > reports-content.component HTML > reports-content.component TS. It was the re-direction through the HTML code to call the search function that I missed. 🤦 ).

IMHO, the "proper" way to handle this functionality would be to rework it as a filter with a more dedicated view. However, I doubt we want to invest that kind of effort into this functionality right now. The most straightforward thing would be to just leave a special case in our freetext view code for case_id:${id}. The only reason I can think of to not do that would be that if we ever want to fix this up in the future, it would mean re-building the freetext views.... I guess the current approach is already not working well with the redesigned search: #8419

On a related note, @garethbowen do you know what is up with the search on the reports tab via URL query param functionality? Diana seemed to think it was related to the case_id functionality, but it does not seem like that is how it works anymore (given the code flow I mapped above). Not sure if there is something else I am missing here... 😟

I'm not sure about indexing contact phone number

I agree with the formatting concerns here and do believe that the "freetext" search is not the best way to implement a "find-contact-by-phone" feature. That being said, my thinking was that the user would likely be able to see the formatted version of the phone number in the contact-summary. So, they would at least have a chance of knowing what to search for. I am fine not including phone in the index, particularly since if anyone misses it, it will give us an opportunity to perhaps give a proper solution for that workflow.

I don't understand indexing the contact dob. It's skipped in the current implementation

🤔 I don't think this is accurate. If you are thinking of this check for fields that end with date in the view map code, then this actually does not apply to the date_of_birth field.... 😅 (And anecdotally I tried it out and was able to find contacts by searching for the date of birth.)

but I don't know how you would format it and how users would work out how to type it in the correct format. Also would it need to be exact or would we fuzzy match, looking for someone born in "september 2023". Can you give some examples of how this would work, what use case this would serve, and how users would discover it?

As with the phone number, I 100% agree that the formatting here is the main issue. At the same time, I guess it is possible to display that DoB to the user in the standard yyyy-mm-dd format in the contact summary and then the user could search based on what they see. As with the phone, this is bad UX and I am find not included the DoB in the index (and if anyone asks for it, we can implement a proper search). A purely generic freetext search like this is just not the best way to provide this functionality.

Mostly, my consideration around all these index fields was simply for the kinds of things that folks usually use as reference for finding people. Name/phone/DoB are def the most common that I know of. After reading what you wrote and thinking about this more, I think our generic contact search is probably not super useful for anything but searching by name and shortcode. More custom handling (and UI) would be needed to make searching by phone/DoB really useful or intuitive...I n the meantime, if folks really want to be able to search by phone/DoB, they can always just add that data to their search_keywords field!

search_keywords? I think it warrants discussion with app devs - can you raise a new issue for this to get the conversation going?

Issue logged: #9378. Will raise a discussion on Slack. 👍

@garethbowen
Copy link
Member

On a related note, @garethbowen do you know what is up with the search on the reports tab via URL query param functionality?

Commented in Slack but repeating here for posterity and visibility. I've traced it back to this commit (which is exactly 10 years old this Sunday 🍰 ). This was added for this issue. This was done so you could link to the reports tab and show all reports about a given patient.

I don't understand indexing the contact dob. It's skipped in the current implementation

I don't think this is accurate

I was thinking of the type check but realising now that dob will either be a number or a string - thanks for refreshing my memory.

I think our generic contact search is probably not super useful for anything but searching by name and shortcode

Yes this is basically where I've landed too.

@m5r m5r removed their assignment Sep 10, 2024
@m5r m5r self-assigned this Sep 12, 2024
@m5r
Copy link
Member

m5r commented Sep 12, 2024

Rebasing the PR against 4.10.x and applying the changes mentioned in this thread brought the size of medic-client down to 9.8GB from 10.2GB 👀

@mrjones-plip
Copy link
Contributor Author

mrjones-plip commented Sep 12, 2024

can't wait for a passing build so that we can try this branch on the copy of production data!

in the meanwhile to facilitate local testing on dev instances, I've added everyone here's SSH key to the test box and I'm creating a massive tar.gz of production data. USE EXTREME CAUTION when copying this locally - it has full PII and PHI. You may not copy it to your local machine unless you have full disk encryption and a strong password on your device - full stop!

To download the 33GB file:

scp [email protected]:/opt/couchdb/moh-zanz-prod-data/storage/medic-core/prod.moh.zanz.couchdb.tar.gz .

Anyone who can SSH in to watchdog can copy this file. Here's the last ~50 chars of the SSH keys on it:

ssh [email protected] "egrep -o '.{50}$' .ssh/authorized_keys|cut -f1 -d'@'"
iEyXlktTxE0synSdvWE4CNOJPE mrjones-nubuntu-aws-pem
HDPBOLsajCEAxP/Cn+uGQwPdBz89q5D3 mrjones
rnP39+7Y3CrMOco8BFCVQitOLwnwFEgFOi mrjones
34cPwUIFXapstbch8XfLn3rfR mrjones
ELr321BrFqeGY1sBGOc+VUiDjcBwj//Q== jkuester
B/J7wvY/SGXqwHVXxWuLtVcj5RWq6ugXYCuCoAZ m5r
NeATBNbTkgE+zEHBoHyVorqJkNoPfMnWwZr eljhkrr
DOLkpvVM7ToIsHSXqKlmipkMC+ih6zwOPQ== yrimal
FuzFoexe0xSwHwdYwgAR6HXjw7DkgHR6jV++ yrimal
nA8Gt3c5K0gIsEAPMo2N+V81w7SZheSSVggr sugat
gbwQXdI1ZqFczd1NfIPOk1Qz0rf4tqqkrb/e sugat
uQLQUAp5NtQlce1jbGehAL3bd8rxqChyGrab sugat
pmEYTImgLznmwCOyp5iz3icMAha/CG/I3TiX sugat
8aZt9qIvh/Q0llDGZMw8yM3ZZWUczxlVKNVN hareet
Wuayc+etBS2QATSf4oVlMX5CytWEPxojMDai7 henok
FuzFoexe0xSwHwdYwgAR6HXjw7DkgHR6jV++ yuvraj

@mrjones-plip
Copy link
Contributor Author

cc @m5r @jkuester and @sugat009 ☝️

@m5r
Copy link
Member

m5r commented Sep 16, 2024

Thanks for the prod data and the words of caution @mrjones-plip

I pulled it down, made CouchDB index it and warm up the views, and then pushed the new freetext search views.
The medic-client ddoc went down from ~21.5GB to ~8.4GB (and ~14.9GB to ~5.2GB size on disk) 👀

image
image

@jkuester
Copy link
Contributor

jkuester commented Oct 2, 2024

Here are my notes on the "next steps" from the meeting today with @dianabarsan @m5r and @sugat009:

Context

We have collected a large amount of data regarding the impact it would have to scope the freetext views to just index a limited allow-list of fields. We know the improvements it offers in saved disk space and the costs it will require in terms of re-indexing during upgrades.

We also have collected a large amount of information about how users are currently using the search functionality and unfortunately have come to the conclusion that a non-trivial percentage of users are likely to perform searches that depend on custom fields on reports and contacts being indexed.

Because of this, it seems unwise to move forwards with this project as it is currently defined.

Proposed next steps

  • We should not make any major changes to the offline freetext search behavior. We simply do not have enough information about how this is being used and the information we do have indicates it is likely that users rely on searching for custom fields.
  • For online freetext searching, we need to do more research and analysis on how this functionality is currently being used (and evaluate the impact of having an allow-list of indexed fields). Additionally we should research the usage of the filtered export functionality (where a data export is performed using a freetext search query).
    • The freetext views on the Couch server are really only used for searches by online users and for filtered exports. It is possible that we could preserve the current search functionality for offline users (by creating a new offline-only design doc with new freetext views for offline users) and at the same time we could make the changes proposed above in this issue to the existing freetext views in medic-client and they would only impact online users and exports (but crucially would also have the intended effect of reducing disk usage on the server).
  • We also should quantify the impact of a more simple alternative approach. The current freetext views are making duplicate emissions for each indexed value because they emit once for key and once for field_name:key.
    • We have identified a couple isolated cases where this functionality seems necessary for specific programatic usage, but outside of that, it seems highly unlikely that users are leveraging this functionality (or even know about it).
    • We should evaluate the impact of simply removing these extra field_name emissions from the index (and compare it to the results we were seeing for the allow-list approach). If the disk space savings are high enough, it may make sense to keep things simple and just go after this low hanging fruit right now.

@jkuester
Copy link
Contributor

jkuester commented Oct 2, 2024

One more interesting data point potentially supporting a separation in implementation for offline and online search is that Couch has introduced a new Lucene-based freetext search for the server.

@mrjones-plip
Copy link
Contributor Author

Allies recently retested 3 production deployments by upgrading them all to 4.12, running compaction, upgrading them to the freetext lite branch and running compaction again. We recorded the output of du for the couchdb data directory at before the freetext lite branch, the max value while upgrading and the final value after upgrading. There's a detailed doc, but the tl;dr is:

Name Before Max During End Change (MB) Change (%)
Deployment 1 443 929 386 57 12.9%
Deployment 2 53,803 229,423 39,903 13,900 25.8%
Deployment 3 979 1,813 925 54 5.5%

We've concluded that while possible to get up 25% disk space back, the potential impact of either breaking work flows or only getting 5% disk space back. We're hoping that CouchDB Nouveau search will yield the results we're looking for

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Technical issue Improve something that users won't notice
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants