Generate plain text corpus via export command #562
Conversation
Codecov Report
Additional details and impacted files

@@             Coverage Diff              @@
##            develop      #562      +/-  ##
=============================================
+ Coverage     97.22%   100.00%     +2.77%
=============================================
  Files           119         5       -114
  Lines          6279        78      -6201
  Branches          8         8
=============================================
- Hits           6105        78      -6027
+ Misses          174         0       -174
Co-authored-by: Wouter Haverals <[email protected]>
It's really great to see a draft of this script. It looks like you were able to adapt some of Vineet's code; glad to see that (although I haven't compared, so it's hard to tell what you've changed; I do remember his version was far more complicated than we needed!).
I'm having trouble understanding some of the transformation logic: both what you're doing and, more importantly, why you're doing it. You should be able to get what you need out of Solr pretty directly, with minimal transformation in the python code. If you need to rename the fields, you can do that when you limit which fields are returned; you can read about it in the Solr docs, or look at our solr queryset object where we're doing something similar (we call it aliases there).
I'm also a little surprised about the number of solr queries - is it one query to get the documents and then one query per document to get the pages? Maybe this is the same as what Vineet was already doing, I don't remember, but I think we could do it more efficiently. I'd be happy to talk through different approaches. I guess right now you're writing out one file of page content per document so maybe this is reasonable and we can just keep it in mind as a possibility for future refinement.
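(A rough sketch of that one-query, minimal-transformation shape, purely for discussion: it assumes parasolr's SolrQuerySet with keyword aliases passed to only(), and the field names group_id_s, order, content and the write_pages_file helper are hypothetical, not taken from the PPA schema or this PR.)

from itertools import groupby

from parasolr.django import SolrQuerySet

# fetch every page in one query, renaming fields at query time via aliases,
# sorted by work so pages can be grouped as they stream back
pages = (
    SolrQuerySet()
    .filter(item_type="page")
    .only(work_id="group_id_s", page_num="order", text="content")
    .order_by("group_id_s")
    .get_results(rows=1_000_000)  # or page through with start/rows
)
for work_id, work_pages in groupby(pages, key=lambda d: d["work_id"]):
    write_pages_file(work_id, list(work_pages))  # hypothetical writer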
pre-commit
tqdm
Is this optional or required? If it's needed for the script to run, it needs to go in the main requirements file; dev requirements aren't installed in production.
We already have progressbar2 as a dependency (which I know is old). Could you use that instead? Or how much effort would it be to switch to tqdm? Which of rich and tqdm is lighter weight?
While you're at it: could you check if we're still requiring gensim and comment it out? We should make it an optional install since we're not planning to run the other version of this script.
@contextmanager
def logging_disabled(highest_level=logging.CRITICAL):
    """Quick way to suppress solr logs as we iterate. Taken from https://gist.github.com/simon-weber/7853144"""
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try: yield
    finally: logging.disable(previous_level)
This should be handled by updating the django logging config rather than overriding it here (we might need to turn logging on to debug something). What log level are you using and how noisy is it?
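(For reference, a hedged sketch of what that could look like: quieting the noisy logger via Django's LOGGING setting instead of disabling logging globally. The logger name "parasolr" is an assumption; it should be whatever logger is actually producing the per-query messages.)

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        # assumed logger name; replace with the one emitting the noise
        "parasolr": {"level": "WARNING"},
    },
}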
fn,meta = self._save_doc(doc_id)

# if we saved, update progress bar desc
if fn: pbar.set_description(f'{pdesc}: {fn}')
this doesn't look like standard formatting, do you have the pre-commit hooks installed?
No. I'll do that next commit
@staticmethod
def _get_id(doc_id:str) -> str:
    """Method to make a file-safe version of a document ID"""
    return doc_id.replace('/','|')
Django may have an existing escape-character utility method we can use that handles more characters.
It's really only the / that is raising problems from ECCO and Hathi IDs; we don't need anything as strict as URL escaping, but I can look into it.
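(For comparison, the Django utility likely meant here is django.utils.text.get_valid_filename; note that it strips disallowed characters rather than replacing them, so two IDs that differ only by a slash could collide. The example ID is made up.)

from django.utils.text import get_valid_filename

get_valid_filename("abc.123/456")   # -> 'abc.123456' (slash removed entirely)
"abc.123/456".replace("/", "|")     # -> 'abc.123|456' (current approach)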
self.path_metadata = os.path.join(self.path,'metadata.json')

# query to get initial results
results = SolrQuerySet().facet(self.DOC_ID_FIELD, limit=self.doc_limit)
why use facets?
I think a regular search would be better than a facet. You need to add a filter on suppressed/public here. It could be helpful to use our existing PPA solr queryset object.
The solr querying code is unchanged from Vineet's script. I also found it strange that it queries once and then again for each document. We really only need to stream all solr results, with perhaps an initial filter on suppressed==True.
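(A minimal sketch of the "regular search plus a suppression filter" idea; the field name and value used for the suppression filter are assumptions, not taken from PPA's actual Solr schema.)

works = (
    SolrQuerySet()
    .filter(item_type="work")
    .filter(suppressed="false")  # assumed field name/value for public works
    .get_results(rows=self.doc_limit)
)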
docs = [
    doc
    for doc in result.get_results(rows=self.page_counts[doc_id])
    if doc[self.DOC_ID_FIELD]==doc_id
]
is the Solr filter unreliable?
Yes, I found it was in one instance. I can try to recreate it.
# find the metadata doc
metadata_docs = [d for d in docs if d["item_type"] == "work"]
assert len(metadata_docs)==1
meta = {self.WORK_ID_FIELD:work_id, **metadata_docs[0]}

# find the pages docs
page_docs = [d for d in docs if d["item_type"] == "page"]
I'm having trouble following, why is this part needed?
One of the docs is "work", the others are pages. This isn't the quickest way to separate them, but that's what it does.
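(A single-pass alternative to the two comprehensions above, offered purely as a readability/efficiency sketch with the same intended behavior.)

metadata_docs, page_docs = [], []
for d in docs:
    (metadata_docs if d["item_type"] == "work" else page_docs).append(d)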
"""Reformat document dictionary""" | ||
|
||
# get new dictionary | ||
odoc={ | ||
key_new:doc.get(key_orig,'') | ||
for key_new,key_orig in ( | ||
self.OUTPUT_DOC_FIELDS.items() | ||
) | ||
} | ||
|
||
# return with page id | ||
return { | ||
self.PAGE_ID_FIELD:f'{meta[self.WORK_ID_FIELD]}_{odoc[self.PAGE_NUM_FIELD]}', | ||
**odoc | ||
} |
This is hard to understand. What's going on, and why is it needed? The docstring should provide more context and an example if that's needed to make this clear.
We're just returning a new output dictionary with the field names changed and an additional field called page_id, which is equal to {work_id}_{page_orig}.
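(A hypothetical before/after to illustrate that description; the field names and values are invented for the example.)

# with OUTPUT_DOC_FIELDS = {"page_num": "order", "text": "content"}
doc = {"order": 12, "content": "Page text..."}
meta = {"work_id": "abc.123"}
# the method would return:
# {"page_id": "abc.123_12", "page_num": 12, "text": "Page text..."}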
from contextlib import contextmanager
import logging


class SolrCorpus:
I vaguely remember this class from Vineet's code. It's great to keep it if you find it a useful way to encapsulate the logic. (I would probably put the methods on the manage command, personally.)
If you keep it, you should write unit tests for each method of the class.
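(For instance, a per-method test along these lines; the import path is an assumption.)

from ppa.archive.management.commands.generate_textcorpus import SolrCorpus  # assumed path


def test_get_id():
    # slashes in source IDs should become file-safe
    assert SolrCorpus._get_id("abc.123/456") == "abc.123|456"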
🤔 maybe a custom solr queryset specific to this export would be a better approach
Happy to hear suggestions. We may only need to query for suppressed status; we want everything else, no?
"generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit","one" | ||
) | ||
|
||
with pytest.raises(CommandError): |
fyi, pytest.raises has optional arguments that will match the expected error text in the exception.
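(The option in question is the match argument, which checks the exception message against a regex; the expected message text here is only a guess.)

with pytest.raises(CommandError, match="doc-limit"):
    call_command(
        "generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit", "one"
    )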
@quadrismegistus have you used json lines before (JSONL), and have you considered using it for the PPA corpus?
Yes. What benefit do you see it having in this instance? Are you imagining a .jsonl file for the entire corpus, or changing the individual work jsons to jsonls? The former seems unwieldy to me for average users, and the latter seems unnecessary given that the json files are small and read in <1ms; the jsonl file would eventually read the same file contents into memory, and I'm not sure there'd be any speed improvement. Edit: However, since we're shifting a lot of work over to ppa-nlp anyway, there may be an argument for just having this script export a giant .jsonl.
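(For reference, the JSONL shape under discussion is just one JSON object per line, which can be written and read as a stream; the file and variable names here are illustrative.)

import json

with open("corpus.jsonl", "w") as f:
    for page in pages:  # pages: any iterable of dicts
        f.write(json.dumps(page) + "\n")

with open("corpus.jsonl") as f:
    for line in f:
        page = json.loads(line)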
@rlskoeser I started a new branch/PR with a minimalist jsonl implementation of this, branching off this PR: #563. It gets rid of a lot, but I kind of like the simplicity now. I'll request review there, and if you like that approach I'll merge that PR into this one.
For #556