Generate plain text corpus via export command #562

Merged · 8 commits · Dec 15, 2023
3 changes: 2 additions & 1 deletion dev-requirements.txt
@@ -4,4 +4,5 @@ pytest-django>=4.5.2
pytest-cov
django-debug-toolbar
sphinx
pre-commit
Contributor: Is this optional or required? If it's needed for the script to run, it needs to go in the main requirements file. Dev requirements aren't installed in production.

We already have progressbar2 as a dependency (which I know is old). Could you use that instead? Or how much effort would it be to switch to tqdm? Which of rich and tqdm is lighter weight?

Contributor: While you're at it, could you check whether we're still requiring gensim and comment it out? We should make it an optional install, since we're not planning to run the other version of this script.

tqdm
200 changes: 200 additions & 0 deletions ppa/archive/management/commands/generate_textcorpus.py
@@ -0,0 +1,200 @@
"""
**generate_textcorpus** is a custom manage command to generate a plain
text corpus from Solr. It should be run *after* content has been indexed
into Solr via the **index** manage command.
"""

import json
import logging
import os
from collections import OrderedDict, defaultdict
from contextlib import contextmanager
from typing import Tuple

from django.core.management.base import BaseCommand
from parasolr.django import SolrQuerySet
from tqdm import tqdm


class SolrCorpus:
Contributor: I vaguely remember this class from Vineet's code. It's great to keep it if you find it a useful way to encapsulate the logic. (I would probably put the methods on the manage command, personally.)

If you keep it, you should write unit tests for each method of the class.

Contributor: 🤔 Maybe a custom Solr queryset specific to this export would be a better approach.

Contributor Author (@quadrismegistus, Dec 4, 2023): Happy to hear suggestions. We may only need to query for suppressed status; we want everything else, no?

"""Custom class to generate a text corpus from Solr"""

# Class attributes that rarely, if ever, need to change
DOC_ID_FIELD = "group_id_s" # Solr field name for document identifier (not source_id, which is same across excerpts)
PAGE_ORDER_FIELD = "order" # Solr field name for page ordering
OUTPUT_DOC_FIELDS = dict(
page_num_orig = 'label',
page_num_digi = 'order',
page_text = 'content',
)
PAGE_ID_FIELD = 'page_id'
WORK_ID_FIELD = 'work_id'
PAGE_NUM_FIELD = 'page_num_orig'
PAGE_SORT_FIELD = 'page_num_digi'

def __init__(self, path, doc_limit=-1):
"""
A class encapsulating a Solr Client specification which yields
metadata and page data for PPA documents.

:param path: A string to a path for the corpus output.
:param doc_limit: Max no. of documents to process. The default of -1
means we process ALL documents found.
"""
# root path for corpus
self.path = path

# limit docs queried
self.doc_limit = doc_limit

# subsequent paths
self.path_texts = os.path.join(self.path,'texts')
self.path_metadata = os.path.join(self.path,'metadata.json')

# query to get initial results
results = SolrQuerySet().facet(self.DOC_ID_FIELD, limit=self.doc_limit)
Contributor: Why use facets?

Contributor: I think a regular search would be better than a facet. You need to add a filter on suppressed/public here. It could be helpful to use our existing PPA Solr queryset object here.

Contributor Author: The Solr querying code is unchanged from Vineet's script. I also found it strange that it queries once and then again for each document. We really only need to stream all Solr results, with perhaps an initial query of suppressed==True.
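A sketch of the reviewer's suggested direction, replacing the facet with a filtered query; the suppressed-status field name below is an assumption, not a field confirmed by this PR:

from parasolr.django import SolrQuerySet

# hypothetical: filter works directly instead of faceting on group_id_s
results = (
    SolrQuerySet()
    .search(item_type="work")
    .filter(suppressed=False)  # assumed field name; PPA's actual schema may differ
    .order_by("group_id_s")
)
work_docs = results.get_results(rows=1000)  # row count chosen for illustration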

        # store page counts and doc ids
        self.page_counts = results.get_facets().facet_fields[self.DOC_ID_FIELD]
        self.doc_ids = self.page_counts.keys()
        self.doc_count = len(self.doc_ids)

    @staticmethod
    def _get_id(doc_id: str) -> str:
        """Method to make a file-safe version of a document ID"""
        return doc_id.replace("/", "|")
Contributor: Django may have an existing escape character utility method we can use that handles more characters.

Contributor Author: It's really only the / that is raising problems from ECCO and Hathi IDs; we don't need anything as strict as URL escaping, but I can look into it.
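If broader escaping is ever needed, the standard library already covers it; a sketch using urllib percent-encoding (my substitution, not a utility confirmed in this PR or in Django):

from urllib.parse import quote, unquote

def file_safe_id(doc_id: str) -> str:
    """Percent-encode every unsafe character, not just '/'."""
    return quote(doc_id, safe="")  # 'ab/cd' becomes 'ab%2Fcd'

# unquote() reverses the mapping if the original ID is ever needed again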


    def _get_meta_pages(self, doc_id: str) -> Tuple[dict, list]:
        """Get metadata (dictionary) and pages (list of dictionaries) for a given document"""

        # get file-safe work_id
        work_id = self._get_id(doc_id)

        # query
        result = (
            SolrQuerySet()
            .search(**{self.DOC_ID_FIELD: doc_id})
            .order_by(self.PAGE_ORDER_FIELD)
        )

        # populate the result cache with the number of rows specified
        docs = [
            doc
            for doc in result.get_results(rows=self.page_counts[doc_id])
            if doc[self.DOC_ID_FIELD] == doc_id
Contributor: Is the Solr filter unreliable?

Contributor Author: Yes, I found it was in one instance. I can try to recreate it.
        ]

        # find the metadata doc
        metadata_docs = [d for d in docs if d["item_type"] == "work"]
        assert len(metadata_docs) == 1
        meta = {self.WORK_ID_FIELD: work_id, **metadata_docs[0]}

        # find the page docs
        page_docs = [d for d in docs if d["item_type"] == "page"]
Comment on lines +85 to +91

Contributor: I'm having trouble following, why is this part needed?

Contributor Author: One of the docs is "work", the others are pages. This isn't the quickest way to separate them, but that's what it does.


        # transform into a new dictionary with keys in `self.PAGE_ID_FIELD` and `self.OUTPUT_DOC_FIELDS`
        pages = [self._transform_doc(doc, meta) for doc in page_docs]

        # make sure sorted by numeric page num (i.e. "digital")
        pages.sort(key=lambda page: page[self.PAGE_SORT_FIELD])
Contributor: Is the Solr sorting unreliable?

Contributor Author: I found it was in one instance, yes; I can try to recreate it.

        return meta, pages

    def _transform_doc(self, doc: dict, meta: dict) -> dict:
        """Reformat document dictionary"""

        # get new dictionary
        odoc = {
            key_new: doc.get(key_orig, "")
            for key_new, key_orig in self.OUTPUT_DOC_FIELDS.items()
        }

        # return with page id
        return {
            self.PAGE_ID_FIELD: f"{meta[self.WORK_ID_FIELD]}_{odoc[self.PAGE_NUM_FIELD]}",
            **odoc,
        }
Comment on lines +101 to +115

Contributor: This is hard to understand. What's going on, and why is it needed? The docstring should provide more context and an example if that's needed to make this clear.

Contributor Author: We're just returning a new output dictionary with field names changed and an additional field called page_id, which is equal to {work_id}_{page_orig}.
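To make that mapping concrete, a worked sketch of what _transform_doc yields for one page, using values shaped like the test fixtures later in this PR:

# input: one Solr page record plus the parent work's metadata
doc = {"item_type": "page", "order": 2, "label": "ii", "content": "conceived in Liberty..."}
meta = {"work_id": "doc_1"}

# output: OUTPUT_DOC_FIELDS renames label/order/content, and page_id
# is derived as {work_id}_{page_num_orig}
# {
#     "page_id": "doc_1_ii",
#     "page_num_orig": "ii",
#     "page_num_digi": 2,
#     "page_text": "conceived in Liberty...",
# }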


    def _save_doc(self, doc_id: str) -> Tuple[str, dict]:
        """Save document pages as json and return the filename along with the document's metadata"""

        # get metadata and pages for this doc
        meta, pages = self._get_meta_pages(doc_id)

        # if pages, save json
        if pages:
            filename = os.path.join(self.path_texts, meta[self.WORK_ID_FIELD] + ".json")
            os.makedirs(self.path_texts, exist_ok=True)
            with open(filename, "w") as of:
                json.dump(pages, of, indent=2)

        # otherwise, the returned filename is blank to indicate no file saved
        else:
            filename = ""

        return filename, meta


    def save(self):
        """Save the generated corpus text and metadata to files on disk"""

        # save docs and gather metadata
        metadata = []
        pdesc = "Saved text to"
        pbar = tqdm(total=self.doc_count, desc=f"{pdesc}: ...")
        for doc_id in self.doc_ids:
            # get saved filename and found metadata for this document
            fn, meta = self._save_doc(doc_id)

            # if we saved, update the progress bar desc
            if fn: pbar.set_description(f"{pdesc}: {fn}")
Contributor: This doesn't look like standard formatting, do you have the pre-commit hooks installed?

Contributor Author: No. I'll do that next commit.


            # tick
            pbar.update()

            # add this doc's meta to metadata
            metadata.append(meta)
        pbar.close()

        # save metadata json
        with open(self.path_metadata, "w") as of:
            json.dump(metadata, of, indent=2)
        print(f"Saved metadata to: {self.path_metadata}")


@contextmanager
def logging_disabled(highest_level=logging.CRITICAL):
    """Quick way to suppress Solr logs as we iterate. Taken from https://gist.github.com/simon-weber/7853144"""
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)

Comment on lines +167 to +174

Contributor: This should be handled by updating the Django logging config rather than overriding it here (we might need to turn logging on to debug something). What log level are you using, and how noisy is it?
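A sketch of the config-driven alternative the reviewer describes, for settings.py; the logger name "parasolr" is an assumption about where the per-query noise originates:

# raise the threshold for the chatty logger instead of disabling
# logging globally at runtime
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        "parasolr": {  # assumed logger name for the Solr client
            "level": "WARNING",  # drop INFO/DEBUG chatter during export
        },
    },
}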




class Command(BaseCommand):
    """Custom manage command to generate a text corpus from text indexed in Solr"""

    def add_arguments(self, parser):
        parser.add_argument(
            "--path", required=True, help="Directory path to save corpus file(s)."
        )

        parser.add_argument(
            "--doc-limit",
            type=int,
            default=-1,
            help="Limit on the number of documents for corpus generation. "
            "The default of -1 considers ALL documents.",
        )

    def handle(self, *args, **options):
        with logging_disabled():
            SolrCorpus(
                path=options["path"],
                doc_limit=options["doc_limit"],
            ).save()
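For reference, a minimal invocation sketch using Django's call_command; the output path is illustrative:

from django.core.management import call_command

# writes per-work page JSON under /tmp/ppa_corpus/texts/
# and document metadata to /tmp/ppa_corpus/metadata.json
call_command("generate_textcorpus", "--path", "/tmp/ppa_corpus")

# cap the number of documents for a quick smoke test
call_command("generate_textcorpus", "--path", "/tmp/ppa_corpus", "--doc-limit", "10")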
103 changes: 103 additions & 0 deletions ppa/archive/tests/test_generate_textcorpus.py
@@ -0,0 +1,103 @@
from unittest.mock import patch
import json
import os

import pytest
from django.core.management import call_command
from django.core.management.base import CommandError

# mock results for facet query used to get document IDs and page counts
mock_solr_facets = {"group_id_s": {"doc_1": 2, "doc_2": 1}}

# mock result for solr document data
mock_solr_docs = [
    # The first record has item_type='work' and contains metadata for the
    # document
    {"item_type": "work", "pub_year": 1863, "group_id_s": "doc_1"},
    {"item_type": "work", "pub_year": "unknown", "group_id_s": "doc_2"},
    # If multiple metadata rows are found, the first one (above) is used.
    # Subsequent records have item_type='page', page order specified by
    # 'order', with content in 'content'
    {
        "item_type": "page",
        "order": 1,
        "content": "Four score and seven years ago our fathers brought forth"
        " on this continent, a new nation, ",
        "group_id_s": "doc_1",
        "label": "i",
    },
    {
        "item_type": "page",
        "order": 2,
        "content": "conceived in Liberty, and dedicated to the proposition"
        " that all men are created equal.",
        "group_id_s": "doc_1",
        "label": "ii",
    },
    {
        "item_type": "page",
        "order": 3,
        "content": "!!!!!",
        "group_id_s": "doc_2",
        "label": "2",
    },
]


@pytest.fixture
def patched_solr_queryset(mock_solr_queryset):
    # local fixture that uses the parasolr queryset mock
    # and patches in test docs & facets
    mock_qs = mock_solr_queryset()
    with patch(
        "ppa.archive.management.commands.generate_textcorpus.SolrQuerySet", new=mock_qs
    ) as mock_queryset_cls:
        mock_qs = mock_queryset_cls.return_value
        mock_qs.get_results.return_value = mock_solr_docs
        mock_qs.get_facets.return_value.facet_fields = mock_solr_facets

        yield mock_qs


def test_save(tmpdir, patched_solr_queryset):
    call_command("generate_textcorpus", "--path", tmpdir.dirpath())
    metadata_file = tmpdir.dirpath("metadata.json")
    assert metadata_file.check()

    with open(metadata_file) as f:
        meta = json.load(f)
    assert len(meta) == 2

    tdir = tmpdir.dirpath("texts")
    fns = os.listdir(tdir)
    assert len(fns) == 2

    fn1 = os.path.join(tdir, fns[0])
    fn2 = os.path.join(tdir, fns[1])
    with open(fn1) as f:
        ld1 = json.load(f)
    with open(fn2) as f:
        ld2 = json.load(f)

    assert len(ld1) == 2
    assert len(ld2) == 1

    assert all(all(bool(v) for k, v in d.items()) for d in ld1)
    assert all(all(bool(v) for k, v in d.items()) for d in ld2)


def test_invalid_preprocess_flags(tmpdir, patched_solr_queryset):
    # Flags that are not supported
    with pytest.raises(CommandError):
        call_command(
            "generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit", "one"
        )

    with pytest.raises(CommandError):
Contributor: FYI, pytest.raises has optional arguments that will match the expected error text in the exception.

        call_command(
            "generate_textcorpus", "--woops", "huh"
        )
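Picking up the reviewer's suggestion, a sketch of the same assertions using the match argument; the expected message fragments are assumptions about what argparse emits here:

    # match= is a regex searched against the string form of the exception
    with pytest.raises(CommandError, match="invalid int value"):
        call_command(
            "generate_textcorpus", "--path", tmpdir.dirpath(), "--doc-limit", "one"
        )

    with pytest.raises(CommandError, match="unrecognized arguments"):
        call_command("generate_textcorpus", "--woops", "huh")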