
Generate plain text corpus via export command #562

Merged: 8 commits into develop on Dec 15, 2023

Conversation

@quadrismegistus (Contributor) commented Nov 28, 2023

For #556

codecov bot commented Nov 28, 2023

Codecov Report

Merging #562 (1702b2b) into develop (59c82e5) will increase coverage by 2.77%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##           develop      #562      +/-   ##
============================================
+ Coverage    97.22%   100.00%   +2.77%     
============================================
  Files          119         5     -114     
  Lines         6279        78    -6201     
  Branches         8         8              
============================================
- Hits          6105        78    -6027     
+ Misses         174         0     -174     

@rlskoeser (Contributor) left a comment

It's really great to see a draft of this script. It looks like you were able to adapt some of Vineet's code; glad to see that. (Although I haven't compared them, so it's hard to tell what you've changed; I do remember his version was far more complicated than we needed!)

I'm having trouble understanding some of the transformation logic: both what you're doing and, more importantly, why you're doing it. You should be able to get what you need out of Solr fairly directly, with minimal transformation in the python code. If you need to rename the fields, you can do that when you limit which fields are returned; you can read about it in the Solr docs, or look at our solr queryset object, where we're doing something similar (we call it aliases there).
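For illustration, the renaming could look something like this. This is a rough sketch only, with invented field names, but Solr's fl parameter supports alias:field renaming, and parasolr's only() exposes it as keyword arguments:

from parasolr.django import SolrQuerySet

# hypothetical field names; each keyword argument becomes an
# "alias:field" entry in Solr's fl parameter, so the response
# comes back with the renamed keys and no python-side remapping
pages = SolrQuerySet().search(item_type="page").only(
    "id", text="content", page_num="order"
)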

I'm also a little surprised by the number of solr queries: is it one query to get the documents and then one query per document to get the pages? Maybe this is the same as what Vineet was already doing, I don't remember, but I think we could do it more efficiently. I'd be happy to talk through different approaches. I guess right now you're writing out one file of page content per document, so maybe this is reasonable, and we can just keep it in mind as a possibility for future refinement.
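One possible shape for a single-query approach, as a sketch; the group_id field name and the save_pages_file helper are hypothetical:

from itertools import groupby
from parasolr.django import SolrQuerySet

# one query for all page documents, sorted so pages of the same work
# are adjacent, then grouped in python; for very large corpora you
# would paginate with start/rows (or a cursor) instead of one big fetch
pages = (
    SolrQuerySet()
    .search(item_type="page")
    .order_by("group_id")
    .get_results(rows=100000)
)
for work_id, work_pages in groupby(pages, key=lambda p: p["group_id"]):
    save_pages_file(work_id, list(work_pages))  # hypothetical writer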

pre-commit
tqdm
Contributor

Is this optional or required? If it's needed for the script to run, it needs to go in the main requirements file; dev requirements aren't installed in production.

We already have progressbar2 as a dependency (which I know is old). Could you use that instead? Or how much effort would it be to switch to tqdm? Which of rich and tqdm is lighter weight?

Contributor

While you're at it: could you check whether we're still requiring gensim and comment it out? We should make it an optional install, since we're not planning to run the other version of this script.

Comment on lines +167 to +174
@contextmanager
def logging_disabled(highest_level=logging.CRITICAL):
    """Quick way to suppress solr logs as we iterate.
    Taken from https://gist.github.com/simon-weber/7853144"""
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)

Contributor

This should be handled by updating the django logging config rather than overriding here (we might need to turn logging on to debug something). What log level are you using, and how noisy is it?
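For reference, quieting a noisy logger via the django settings would look roughly like this; the parasolr logger name is a guess, so substitute whichever module is emitting the solr logs:

# in settings.py
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        # assumed logger name; raising the level drops INFO/DEBUG chatter
        "parasolr": {"level": "WARNING"},
    },
}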

fn,meta = self._save_doc(doc_id)

# if we saved, update progress bar desc
if fn: pbar.set_description(f'{pdesc}: {fn}')
Contributor

This doesn't look like standard formatting; do you have the pre-commit hooks installed?

Contributor Author

No. I'll do that next commit

@staticmethod
def _get_id(doc_id: str) -> str:
    """Method to make a file-safe version of a document ID"""
    return doc_id.replace('/', '|')
Contributor

Django may have an existing character-escaping utility method we could use that handles more characters.

Contributor Author

It's really only the / in ECCO and Hathi IDs that's raising problems; we don't need anything as strict as URL escaping, but I can look into it.
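If stricter escaping ever becomes necessary, the standard library already covers it; a possible alternative, not what the PR currently does:

from urllib.parse import quote

def file_safe_id(doc_id: str) -> str:
    # percent-encode every reserved character, including "/";
    # unlike replace("/", "|"), this is reversible via unquote()
    return quote(doc_id, safe="")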

self.path_metadata = os.path.join(self.path,'metadata.json')

# query to get initial results
results = SolrQuerySet().facet(self.DOC_ID_FIELD, limit=self.doc_limit)
Contributor

why use facets?

Contributor

I think a regular search would be better than a facet. You need to add a filter on suppressed/public here. It could be helpful to use our existing PPA solr queryset object.
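Something along these lines, where the suppressed field name is an assumption about the PPA schema:

# a regular search with a filter, instead of faceting on the id field
results = SolrQuerySet().search(item_type="work").filter(suppressed=False)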

Contributor Author

The solr querying code is unchanged from Vineet's script. I also found it strange that it queries once and then once per document. We really only need to stream all the solr results, perhaps with an initial filter on suppressed status.

docs = [
    doc
    for doc in result.get_results(rows=self.page_counts[doc_id])
    if doc[self.DOC_ID_FIELD] == doc_id
]
Contributor

is the Solr filter unreliable?

Contributor Author

Yes, I found that it was in one instance. I can try to recreate it.

Comment on lines +85 to +91
# find the metadata doc
metadata_docs = [d for d in docs if d["item_type"] == "work"]
assert len(metadata_docs) == 1
meta = {self.WORK_ID_FIELD: work_id, **metadata_docs[0]}

# find the page docs
page_docs = [d for d in docs if d["item_type"] == "page"]
Contributor

I'm having trouble following, why is this part needed?

Contributor Author

One of the docs is the "work"; the others are pages. This isn't the quickest way to separate them, but that's what it does.

Comment on lines +101 to +115
"""Reformat document dictionary"""

# get new dictionary
odoc={
key_new:doc.get(key_orig,'')
for key_new,key_orig in (
self.OUTPUT_DOC_FIELDS.items()
)
}

# return with page id
return {
self.PAGE_ID_FIELD:f'{meta[self.WORK_ID_FIELD]}_{odoc[self.PAGE_NUM_FIELD]}',
**odoc
}
Contributor

This is hard to understand. What's going on, and why is it needed? The docstring should provide more context and an example if that's needed to make this clear.

Contributor Author

We're just returning a new output dictionary with the field names changed and an additional field called page_id, which is equal to {work_id}_{page_orig}.
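Concretely, with invented field names, the transformation amounts to:

# assuming OUTPUT_DOC_FIELDS = {"text": "content", "page_num": "order"}
# and a work id of "A12345", an input page document such as
#   {"content": "PREFACE.", "order": 7, "item_type": "page"}
# would come back as
#   {"page_id": "A12345_7", "text": "PREFACE.", "page_num": 7}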

from contextlib import contextmanager
import logging

class SolrCorpus:
Contributor

I vaguely remember this class from Vineet's code. It's great to keep it if you find it a useful way to encapsulate the logic. (I would probably put the methods on the manage command, personally.)

If you keep it, you should write unit tests for each method of the class.

Contributor

🤔 maybe a custom solr queryset specific to this export would be a better approach

@quadrismegistus (Contributor Author) commented Dec 4, 2023

Happy to hear suggestions. We may only need to query for suppressed status; we want everything else, no?
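One way a custom queryset could be shaped, as a sketch only; the item_type and suppressed field names are assumptions:

from parasolr.django import SolrQuerySet

class CorpusExportQuerySet(SolrQuerySet):
    """Queryset preconfigured for the plain-text corpus export."""

    def works(self):
        # public, unsuppressed works only
        return self.search(item_type="work").filter(suppressed=False)

    def pages(self):
        return self.search(item_type="page")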

"generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit","one"
)

with pytest.raises(CommandError):
Contributor

FYI, pytest.raises has an optional match argument that checks the expected error text against the exception message.
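For example (the expected message text here is a guess):

import pytest
from django.core.management import call_command
from django.core.management.base import CommandError

# match= is a regex checked against the string form of the exception
with pytest.raises(CommandError, match="doc-limit"):
    call_command("generate_textcorpus", "--path", "/tmp/x", "--doc-limit", "one")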

@rlskoeser (Contributor)

@quadrismegistus have you used JSON Lines (JSONL) before, and have you considered using it for the PPA corpus?

@quadrismegistus (Contributor Author) commented Dec 4, 2023

@quadrismegistus have you used JSON Lines (JSONL) before, and have you considered using it for the PPA corpus?

Yes. What benefit do you see it having in this instance? Are you imagining a .jsonl file for the entire corpus, or changing the individual work JSONs to JSONL? The former seems unwieldy to me for average users, and the latter seems unnecessary: the JSON files are small and read in under 1 ms, a JSONL file would eventually read the same contents into memory anyway, and I'm not sure there'd be any speed improvement.

Edit: However, since we're shifting a lot of work over to ppa-nlp anyway, there may be an argument for just having this script export a giant pages.jsonl instead of a folder of individual json files, and then have ppa-nlp transform the data/files as needed. That way we wouldn't need to alter any IDs due to file naming constraints and we could stream-save directly from Solr.
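The stream-save itself would only be a few lines; a sketch, where page_results stands in for any iterable of page dicts coming back from Solr:

import json

with open("pages.jsonl", "w", encoding="utf-8") as outfile:
    for page in page_results:
        # one json object per line; ids never become filenames,
        # so no file-safe escaping is needed
        outfile.write(json.dumps(page, ensure_ascii=False) + "\n")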

@quadrismegistus (Contributor Author)

@rlskoeser I started a new branch/PR for a minimalist JSONL implementation of this, branching off this PR. It gets rid of a lot, but I like the simplicity now: #563

I'll request review there and if you like that approach I'll merge that PR into this one.

@quadrismegistus merged commit 1702b2b into develop on Dec 15, 2023
4 of 7 checks passed
@quadrismegistus deleted the textcorpusfromscript branch on December 15, 2023 at 16:22