Marker v2 #116

Merged
merged 42 commits into from
May 10, 2024
51990c8
Remove pymupdf
VikParuchuri Apr 29, 2024
30da488
Add in surya OCR
VikParuchuri May 1, 2024
1dfbd0d
Wire in new layout model
VikParuchuri May 1, 2024
efa115c
Add in ordering model
VikParuchuri May 1, 2024
4660a3d
Initial working version
VikParuchuri May 1, 2024
0937baf
Fix equation bugs
VikParuchuri May 1, 2024
f8f595c
Fix hyphen issue
VikParuchuri May 1, 2024
d22c5a5
Remove libmagic dependency
VikParuchuri May 1, 2024
ba0df58
Small table fix
VikParuchuri May 1, 2024
5e10508
Fix up table formatting
VikParuchuri May 2, 2024
1c5ff32
Refactor equations
VikParuchuri May 2, 2024
07ab29f
Fix OCR heuristic bug
VikParuchuri May 2, 2024
c10c3a0
Output quality fixes
VikParuchuri May 2, 2024
b4ab382
Fix how headers are found
VikParuchuri May 3, 2024
9f043f1
Fix headings
VikParuchuri May 3, 2024
824c4c5
Merge pull request #107 from VikParuchuri/commercial
VikParuchuri May 3, 2024
4786f17
Patch
VikParuchuri May 3, 2024
2e31aef
Merge pull request #108 from VikParuchuri/commercial
VikParuchuri May 3, 2024
df6f8fc
Improve table recognition and equation insertion
VikParuchuri May 3, 2024
c22d32e
Sort character blocks for pdf text
VikParuchuri May 3, 2024
9086dd5
Improve table, markdown, and ocr
VikParuchuri May 3, 2024
2f93800
Merge pull request #109 from VikParuchuri/commercial
VikParuchuri May 4, 2024
6198478
Fix rotation issues
VikParuchuri May 6, 2024
77a99f3
Work on tables
VikParuchuri May 7, 2024
f7444f3
Improve sorting
VikParuchuri May 7, 2024
c8c1f06
Enable extracting and saving images
VikParuchuri May 7, 2024
fb738ef
Merge pull request #111 from VikParuchuri/commercial
VikParuchuri May 7, 2024
01c18b8
Fix issues with fixed thresholds
VikParuchuri May 7, 2024
7f18bb9
Fix bolding, update deps
VikParuchuri May 7, 2024
f7bb860
Merge pull request #112 from VikParuchuri/commercial
VikParuchuri May 7, 2024
287f546
Address a bunch of Github issues
VikParuchuri May 8, 2024
aaef442
Fix code block formatting
VikParuchuri May 8, 2024
2bff18f
Specify batch sizes properly
VikParuchuri May 8, 2024
9f19f8c
Update benchmark scoring
VikParuchuri May 9, 2024
6933022
Update benchmarks
VikParuchuri May 9, 2024
4966f7a
Flush CUDA memory after inference
VikParuchuri May 9, 2024
2d7cb00
Fix deployment bugs
VikParuchuri May 9, 2024
e6428cf
Get chunk conversion working
VikParuchuri May 9, 2024
9b481d3
Fix additional deployment issues
VikParuchuri May 9, 2024
2120555
More bug fixes
VikParuchuri May 9, 2024
0179337
Merge pull request #114 from VikParuchuri/commercial
VikParuchuri May 9, 2024
7edbece
Bump package version
VikParuchuri May 9, 2024
32 changes: 32 additions & 0 deletions .github/workflows/cla.yml
@@ -0,0 +1,32 @@
name: "Marker CLA Assistant"
on:
  issue_comment:
    types: [created]
  pull_request_target:
    types: [opened,closed,synchronize]

# explicitly configure permissions, in case your GITHUB_TOKEN workflow permissions are set to read-only in repository settings
permissions:
  actions: write
  contents: write
  pull-requests: write
  statuses: write

jobs:
  CLAAssistant:
    runs-on: ubuntu-latest
    steps:
      - name: "Marker CLA Assistant"
        if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
        uses: contributor-assistant/[email protected]
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # the below token should have repo scope and must be manually added by you in the repository's secret
          # This token is required only if you have configured to store the signatures in a remote repository/organization
          PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        with:
          path-to-signatures: 'signatures/version1/cla.json'
          path-to-document: 'https://github.com/VikParuchuri/marker/blob/master/CLA.md'
          # branch should not be protected
          branch: 'master'
          allowlist: VikParuchuri
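Reviewer note: the `if:` condition on the CLA step can be read as the small predicate below. This is only a sketch in Python, not part of the PR. The function name `should_run_cla_step` and the constant `SIGN_PHRASE` are hypothetical, and the case-insensitive comparison is an assumption based on how GitHub Actions expressions are documented to compare strings. Under that assumption, the lowercase "document" spelling used in CLA.md would also match.

```python
from typing import Optional

# Phrase copied from the workflow's `if:` expression.
SIGN_PHRASE = "I have read the CLA Document and I hereby sign the CLA"


def should_run_cla_step(event_name: str, comment_body: Optional[str] = None) -> bool:
    """Hypothetical mirror of the workflow's `if:` expression for the CLA step."""

    def same(value: Optional[str], expected: str) -> bool:
        # Assumption: Actions expressions ignore case when comparing strings.
        return value is not None and value.casefold() == expected.casefold()

    return (
        same(comment_body, "recheck")
        or same(comment_body, SIGN_PHRASE)
        or same(event_name, "pull_request_target")
    )
```

Because the workflow listens to every `issue_comment` event, the job starts for any new comment, but the step itself is skipped unless the comment is `recheck` or the signing phrase.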
2 changes: 0 additions & 2 deletions .github/workflows/publish.yml
@@ -16,8 +16,6 @@ jobs:
         run: |
           pip install poetry
           poetry install
-          poetry remove torch
-          poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Build package
         run: |
           poetry build
13 changes: 3 additions & 10 deletions .github/workflows/tests.yml
@@ -3,9 +3,8 @@ name: Integration test with benchmark
 on: [push]
 
 env:
-  TESSDATA_PREFIX: "/usr/share/tesseract-ocr/4.00/tessdata"
   TORCH_DEVICE: "cpu"
-  OCR_ENGINE: "tesseract" # So we don't have to install ghostscript, which takes a long time
+  OCR_ENGINE: "surya"
 
 jobs:
   build:
@@ -16,12 +15,6 @@ jobs:
         uses: actions/setup-python@v4
         with:
           python-version: 3.11
-      - name: Install system dependencies
-        run: |
-          sudo apt-get update
-          cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
-      - name: Show tessdata folders
-        run: ls /usr/share/tesseract-ocr/
       - name: Install python dependencies
         run: |
           pip install poetry
@@ -30,8 +23,8 @@ jobs:
           poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Download benchmark data
         run: |
-          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1ktVDYPEeyHlKLaF56FnHjI5VjVnYa1xL"
-          unzip benchmark_data.zip
+          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
+          unzip -o benchmark_data.zip
       - name: Run benchmark test
         run: |
           poetry run python benchmark.py benchmark_data/pdfs benchmark_data/references report.json
24 changes: 24 additions & 0 deletions CLA.md
@@ -0,0 +1,24 @@
Marker Contributor Agreement

This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Vikas Paruchuri. The term "you" shall mean the person or entity identified below.

If you agree to be bound by these terms, sign by writing "I have read the CLA document and I hereby sign the CLA" in response to the CLA bot GitHub comment. Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

1. The term 'contribution' or 'contributed materials' means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:
- you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements, including dual-license structures for commercial customers;
- you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
- you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
- you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
- you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:
- make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
- at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.
If you or your affiliates institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the contribution or any project it was submitted to constitutes direct or contributory patent infringement, then any patent licenses granted to you under this agreement for that contribution shall terminate as of the date such litigation is filed.
4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms. Any contribution we make available under any license will also be made available under a suitable FSF (Free Software Foundation) or OSI (Open Source Initiative) approved license.
5. You covenant, represent, warrant and agree that:
- each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this MCA;
- to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
- each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws.
You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Vikas Paruchuri may publicly disclose your participation in the project, including the fact that you have signed the MCA.
6. This MCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.