
Generate plain text corpus via export command #562

Merged: 8 commits into develop on Dec 15, 2023

Conversation

@quadrismegistus (Contributor) commented Nov 28, 2023

For #556

codecov bot commented Nov 28, 2023

Codecov Report

Merging #562 (1702b2b) into develop (59c82e5) will increase coverage by 2.77%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##           develop      #562      +/-   ##
============================================
+ Coverage    97.22%   100.00%   +2.77%     
============================================
  Files          119         5     -114     
  Lines         6279        78    -6201     
  Branches         8         8              
============================================
- Hits          6105        78    -6027     
+ Misses         174         0     -174     

@rlskoeser (Contributor) left a comment

It's really great to see a draft of this script. It looks like you were able to adapt some of Vineet's code; glad to see that. (Although I haven't compared them, so it's hard to tell what you've changed; I do remember his version was far more complicated than we needed!)

I'm having trouble understanding some of the transformation logic: both what you're doing and, more importantly, why you're doing it. You should be able to get what you need out of Solr fairly directly, with minimal transformation in the python code. If you need to rename the fields, you can do that when you limit which fields are returned; you can read about it in the Solr docs, or look at our solr queryset object, where we're doing something similar (we call it aliases there).
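For illustration, the renaming could look something like this. This is a rough sketch only, with invented field names, but Solr's fl parameter supports alias:field renaming, and parasolr's only() exposes it as keyword arguments:

from parasolr.django import SolrQuerySet

# hypothetical field names; each keyword argument becomes an
# "alias:field" entry in Solr's fl parameter, so the response
# comes back with the renamed keys and no python-side remapping
pages = SolrQuerySet().search(item_type="page").only(
    "id", text="content", page_num="order"
)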

I'm also a little surprised by the number of solr queries: is it one query to get the documents and then one query per document to get the pages? Maybe this is the same as what Vineet was already doing, I don't remember, but I think we could do it more efficiently. I'd be happy to talk through different approaches. I guess right now you're writing out one file of page content per document, so maybe this is reasonable, and we can just keep it in mind as a possibility for future refinement.
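One possible shape for a single-query approach, as a sketch; the group_id field name and the save_pages_file helper are hypothetical:

from itertools import groupby
from parasolr.django import SolrQuerySet

# one query for all page documents, sorted so pages of the same work
# are adjacent, then grouped in python; for very large corpora you
# would paginate with start/rows (or a cursor) instead of one big fetch
pages = (
    SolrQuerySet()
    .search(item_type="page")
    .order_by("group_id")
    .get_results(rows=100000)
)
for work_id, work_pages in groupby(pages, key=lambda p: p["group_id"]):
    save_pages_file(work_id, list(work_pages))  # hypothetical writer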

pre-commit
tqdm
Contributor

Is this optional or required? If it's needed for the script to run, it needs to go in the main requirements file; dev requirements aren't installed in production.

We already have progressbar2 as a dependency (which I know is old). Could you use that instead? Or how much effort would it be to switch to tqdm? Which of rich and tqdm is lighter weight?

Contributor

While you're at it: could you check whether we're still requiring gensim and comment it out? We should make it an optional install, since we're not planning to run the other version of this script.

Comment on lines +167 to +174
@contextmanager
def logging_disabled(highest_level=logging.CRITICAL):
    """Quick way to suppress solr logs as we iterate.
    Taken from https://gist.github.com/simon-weber/7853144"""
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)

Contributor

This should be handled by updating the django logging config rather than overriding here (we might need to turn logging on to debug something). What log level are you using, and how noisy is it?
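For reference, quieting a noisy logger via the django settings would look roughly like this; the parasolr logger name is a guess, so substitute whichever module is emitting the solr logs:

# in settings.py
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        # assumed logger name; raising the level drops INFO/DEBUG chatter
        "parasolr": {"level": "WARNING"},
    },
}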

fn,meta = self._save_doc(doc_id)

# if we saved, update progress bar desc
if fn: pbar.set_description(f'{pdesc}: {fn}')
Contributor

This doesn't look like standard formatting; do you have the pre-commit hooks installed?

Contributor Author

No. I'll do that next commit

@staticmethod
def _get_id(doc_id: str) -> str:
    """Method to make a file-safe version of a document ID"""
    return doc_id.replace('/', '|')
Contributor

Django may have an existing character-escaping utility method we could use that handles more characters.

Contributor Author

It's really only the / in ECCO and Hathi IDs that's raising problems; we don't need anything as strict as URL escaping, but I can look into it.
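If stricter escaping ever becomes necessary, the standard library already covers it; a possible alternative, not what the PR currently does:

from urllib.parse import quote

def file_safe_id(doc_id: str) -> str:
    # percent-encode every reserved character, including "/";
    # unlike replace("/", "|"), this is reversible via unquote()
    return quote(doc_id, safe="")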

self.path_metadata = os.path.join(self.path,'metadata.json')

# query to get initial results
results = SolrQuerySet().facet(self.DOC_ID_FIELD, limit=self.doc_limit)
Contributor

why use facets?

Contributor

I think a regular search would be better than a facet. You need to add a filter on suppressed/public here. It could be helpful to use our existing PPA solr queryset object.
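Something along these lines, where the suppressed field name is an assumption about the PPA schema:

# a regular search with a filter, instead of faceting on the id field
results = SolrQuerySet().search(item_type="work").filter(suppressed=False)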

Contributor Author

The solr querying code is unchanged from Vineet's script. I also found it strange that it queries once and then once per document. We really only need to stream all the solr results, perhaps with an initial filter on suppressed status.

docs = [
    doc
    for doc in result.get_results(rows=self.page_counts[doc_id])
    if doc[self.DOC_ID_FIELD] == doc_id
]
Contributor

is the Solr filter unreliable?

Contributor Author

Yes, I found that it was in one instance. I can try to recreate it.

Comment on lines +85 to +91
# find the metadata doc
metadata_docs = [d for d in docs if d["item_type"] == "work"]
assert len(metadata_docs) == 1
meta = {self.WORK_ID_FIELD: work_id, **metadata_docs[0]}

# find the page docs
page_docs = [d for d in docs if d["item_type"] == "page"]
Contributor

I'm having trouble following, why is this part needed?

Contributor Author

One of the docs is the "work"; the others are pages. This isn't the quickest way to separate them, but that's what it does.

Comment on lines +101 to +115
"""Reformat document dictionary"""

# get new dictionary
odoc={
key_new:doc.get(key_orig,'')
for key_new,key_orig in (
self.OUTPUT_DOC_FIELDS.items()
)
}

# return with page id
return {
self.PAGE_ID_FIELD:f'{meta[self.WORK_ID_FIELD]}_{odoc[self.PAGE_NUM_FIELD]}',
**odoc
}
Contributor

This is hard to understand. What's going on, and why is it needed? The docstring should provide more context and an example if that's needed to make this clear.

Contributor Author

We're just returning a new output dictionary with the field names changed and an additional field called page_id, which is equal to {work_id}_{page_orig}.
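Concretely, with invented field names, the transformation amounts to:

# assuming OUTPUT_DOC_FIELDS = {"text": "content", "page_num": "order"}
# and a work id of "A12345", an input page document such as
#   {"content": "PREFACE.", "order": 7, "item_type": "page"}
# would come back as
#   {"page_id": "A12345_7", "text": "PREFACE.", "page_num": 7}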

from contextlib import contextmanager
import logging

class SolrCorpus:
Contributor

I vaguely remember this class from Vineet's code. It's great to keep it if you find it a useful way to encapsulate the logic. (I would probably put the methods on the manage command, personally.)

If you keep it, you should write unit tests for each method of the class.

Contributor

🤔 maybe a custom solr queryset specific to this export would be a better approach

@quadrismegistus (Contributor Author) commented Dec 4, 2023

Happy to hear suggestions. We may only need to query for suppressed status; we want everything else, no?
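One way a custom queryset could be shaped, as a sketch only; the item_type and suppressed field names are assumptions:

from parasolr.django import SolrQuerySet

class CorpusExportQuerySet(SolrQuerySet):
    """Queryset preconfigured for the plain-text corpus export."""

    def works(self):
        # public, unsuppressed works only
        return self.search(item_type="work").filter(suppressed=False)

    def pages(self):
        return self.search(item_type="page")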

"generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit","one"
)

with pytest.raises(CommandError):
Contributor

FYI, pytest.raises has an optional match argument that checks the expected error text against the exception message.
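For example (the expected message text here is a guess):

import pytest
from django.core.management import call_command
from django.core.management.base import CommandError

# match= is a regex checked against the string form of the exception
with pytest.raises(CommandError, match="doc-limit"):
    call_command("generate_textcorpus", "--path", "/tmp/x", "--doc-limit", "one")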

@rlskoeser (Contributor)

@quadrismegistus have you used JSON Lines (JSONL) before, and have you considered using it for the PPA corpus?

@quadrismegistus (Contributor Author) commented Dec 4, 2023

@quadrismegistus have you used JSON Lines (JSONL) before, and have you considered using it for the PPA corpus?

Yes. What benefit do you see it having in this instance? Are you imagining a .jsonl file for the entire corpus, or changing the individual work JSONs to JSONL? The former seems unwieldy to me for average users, and the latter seems unnecessary: the JSON files are small and read in under 1 ms, a JSONL file would eventually read the same contents into memory anyway, and I'm not sure there'd be any speed improvement.

Edit: However, since we're shifting a lot of work over to ppa-nlp anyway, there may be an argument for just having this script export a giant pages.jsonl instead of a folder of individual json files, and then have ppa-nlp transform the data/files as needed. That way we wouldn't need to alter any IDs due to file naming constraints and we could stream-save directly from Solr.
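The stream-save itself would only be a few lines; a sketch, where page_results stands in for any iterable of page dicts coming back from Solr:

import json

with open("pages.jsonl", "w", encoding="utf-8") as outfile:
    for page in page_results:
        # one json object per line; ids never become filenames,
        # so no file-safe escaping is needed
        outfile.write(json.dumps(page, ensure_ascii=False) + "\n")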

@quadrismegistus (Contributor Author)

@rlskoeser I started a new branch/PR for a minimalist JSONL implementation of this, branching off this PR. It gets rid of a lot, but I like the simplicity now: #563

I'll request review there and if you like that approach I'll merge that PR into this one.

@quadrismegistus merged commit 1702b2b into develop on Dec 15, 2023
4 of 7 checks passed
@quadrismegistus deleted the textcorpusfromscript branch on December 15, 2023 at 16:22