V5.0.0
New Features
- Automatic population of metadata: PDF metadata is automatically retrieved from a variety of providers, including BibTeX entries, citation counts, journal quality assessments, and retraction status
- Full-text search: A major difference between our published work and this repo is the ability to search over all of the scientific literature. We've brought the OSS version closer by adding full-text keyword search via tantivy. You can now index and search many papers before embedding, which makes it feasible to ingest many papers.
- Unified settings management: You can now save and load settings, which makes it easier for us to distribute settings for various PaperQA2 tasks. Examples are writing Wikipedia articles, identifying contradictions, and obtaining structured data.
- CLI: We've made a CLI that uses persistent parsings/indexes and makes it much easier to just ask questions of a folder of PDFs
- LiteLLM: We've adopted LiteLLM as the LLM wrapper of choice. This means we now support many LLM APIs directly, with only the model string changing. It also means we now have "routers" that can do fallbacks, API rate limiting, and retries.
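
To make the "router" idea concrete, here is a minimal sketch of fallbacks plus retries in plain Python. It is not LiteLLM's implementation or API: `complete_with_fallbacks`, `flaky_primary`, `stable_backup`, and the model names are hypothetical stand-ins, and rate limiting is left out.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a provider call: takes a prompt, returns text.
Provider = Callable[[str], str]


def complete_with_fallbacks(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 2,
) -> tuple[str, str]:
    """Try each provider in order, retrying before falling back to the next.

    Returns (provider_name, completion_text).
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real router would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Simulates a rate-limited or unavailable API (assumption for the demo).
    raise TimeoutError("simulated outage")


def stable_backup(prompt: str) -> str:
    # Simulates a healthy provider by echoing the prompt.
    return f"echo: {prompt}"


name, text = complete_with_fallbacks(
    "What is PaperQA2?",
    [("primary-model", flaky_primary), ("backup-model", stable_backup)],
)
print(name, "->", text)
```

In this sketch the primary provider is tried twice, fails both times, and the call then falls through to the backup. Only the ordered list of providers changes between configurations, which mirrors the "only the model string changing" point above.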
Improvements
- More modern agent frameworks
- Reduction in dependencies
- Removed code duplicated by litellm
- Many improvements on code style and best practices
Regressions/Deprecation
We've removed the following features to keep our library focused:
- `doc_match` - we do not have enough data to support that this method actually helps for very large corpuses
- `LangchainVectorStore` - We no longer support more complex vector stores via Langchain like FAISS. Instead, we only support Numpy vector stores. We never found the paradigm of very large vector stores to be better than keyword search -> vector search -> LLM reranking and thus removed the code
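
The keyword search -> vector search -> LLM reranking flow can be sketched as three small stages. This is a toy illustration, not PaperQA2's code: the corpus is made up, `embed` is a bag-of-words stand-in for a learned embedding, and `stub_score` is a hypothetical placeholder for an LLM relevance call.

```python
import math
from collections import Counter
from typing import Callable

# Toy corpus; the document IDs and texts are invented for illustration.
DOCS = {
    "a": "retraction notices in scientific journals",
    "b": "keyword search over full text of scientific papers",
    "c": "cooking recipes for pasta",
    "d": "vector search and embeddings for scientific papers",
}


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def keyword_search(query: str, docs: dict[str, str]) -> list[str]:
    """Stage 1: keep documents sharing at least one token with the query."""
    q = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items() if q & set(tokenize(text))]


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector, not a learned model."""
    return Counter(tokenize(text))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vector_search(query: str, candidates: list[str], docs: dict[str, str], k: int) -> list[str]:
    """Stage 2: rank the keyword hits by cosine similarity and keep the top k."""
    q = embed(query)
    ranked = sorted(candidates, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[str], docs: dict[str, str],
           score: Callable[[str, str], float]) -> list[str]:
    """Stage 3: reorder with an external scorer (an LLM in the real setting)."""
    return sorted(candidates, key=lambda d: score(query, docs[d]), reverse=True)


def stub_score(query: str, text: str) -> float:
    # Hypothetical placeholder for an LLM judgment: prefers full-text passages.
    return 1.0 if "full text" in text else 0.0


query = "search scientific papers"
hits = keyword_search(query, DOCS)          # cheap filter over the whole corpus
top = vector_search(query, hits, DOCS, k=2)  # embed only the filtered hits
final = rerank(query, top, DOCS, stub_score)
print(hits, top, final)
```

The design point is that each stage is more expensive per document than the one before it, so each one only sees what the previous stage let through. Here only the keyword hits are embedded, and only the top two of those reach the scorer.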
Detailed Changes:
- typo by @oganm in #303
- Updated readme and models by @mskarlin in #305
- Add Client (external API) Module For Enhanced Metadata by @mskarlin in #306
- Agentic workflows, locally indexed search, and CLI by @mskarlin in #309
- Add new unpaywall provider by @mskarlin in #310
- Rollback search fields to `list` and dynamically compute md5 hash in tests by @mskarlin in #311
- Refactor to breakout config from rest of code by @whitead in #289
- Changed to rely on litellm for computing cost by @whitead in #321
- Fixing `LLMModel.axyz_iter` type hints by @jamesbraza in #324
- CLI Fixes by @whitead in #322
- `black`ened code to prevent IDE scrolling by @jamesbraza in #330
- Optimized import paths by @jamesbraza in #331
- Removed `pytest-mock` plugin by @jamesbraza in #328
- Adding `pytest-xdist` plugin by @jamesbraza in #329
- Passing `mypy` by @jamesbraza in #332
- Removing `make_chain` in favor of `run_prompt` by @jamesbraza in #325
- Readme updates by @mskarlin in #323
- Adding `refurb` tool, and `lint` CI by @jamesbraza in #333
- Fixing arg ordering after #325 by @jamesbraza in #334
- Fixing `parse_text` after #332 by @jamesbraza in #335
- Fixing union attr error by @jamesbraza in #338
- Check if a journal name starts with `the` by @geemi725 in #320
- Fixing two more tests by @jamesbraza in #340
- All Ruff `ANN` autofixes by @jamesbraza in #341
- Adding in `.mailmap` by @jamesbraza in #342
- Remove cassettes which aren't needed by @mskarlin in #339
- Add configs for contracrow + wikicrow by @mskarlin in #336
- Removed `LangchainVectorStore`, `llms` extra, and fixing up `README` by @jamesbraza in #343
- Dropping `requests` dependency by @jamesbraza in #346
- Removed `html2text` requirement by @jamesbraza in #347
- Requiring Python 3.11+ by @jamesbraza in #348
- Did one revision at README by @whitead in #344
- Renaming fitz to pymupdf by @mskarlin in #350
- Better control flow in `litellm_get_search_query` by @jamesbraza in #351
- Recurse into directories; catch empty documents by @sidnarayanan in #352
- Move configure_cli_logging such that it's not called twice by @mskarlin in #353
- Cleaning up dependencies by @jamesbraza in #354
- Fixed code in README by @whitead in #355
- Added citation and paper URL by @whitead in #357
- `aviary` and `ldp` for agents over `langchain` by @jamesbraza in #358
- Adds retraction status by @geemi725 in #314
- Adding `pylint` by @jamesbraza in #349
- Added account for cost info by @whitead in #360
New Contributors
- @oganm made their first contribution in #303
- @geemi725 made their first contribution in #320
- @sidnarayanan made their first contribution in #352
Full Changelog: v4.9.0...vnew