Feat/contain nltk assets in docker image #3853

Merged
merged 39 commits into from
Jan 8, 2025
Changes from 30 commits
Commits
3d9e013
feat:contain nltk data in docker image
christinestraub Dec 24, 2024
3c7985f
fix test dockerfile errors
christinestraub Dec 24, 2024
8ee6a2b
feat:using nltk assets in docker image
christinestraub Dec 26, 2024
2ed5675
feat:fix makefile error
christinestraub Dec 26, 2024
c482698
test:fix lint errors
christinestraub Dec 26, 2024
9d1a57d
test:fix test_dockerfile error
christinestraub Dec 27, 2024
27eea08
feat:integrate nltk data into docker image
christinestraub Jan 2, 2025
6b1ee55
test:fix version
christinestraub Jan 3, 2025
03a0adf
feat:fix nltk data path
christinestraub Jan 3, 2025
9c42660
feat:fix ingest test errors
christinestraub Jan 3, 2025
efe1167
feat:update dockerfile
christinestraub Jan 3, 2025
b961399
test: fix lint errors
christinestraub Jan 3, 2025
e7333e6
test:fix dockerfile errors
christinestraub Jan 3, 2025
dad9f87
feat:update ability to validate nltk assets
christinestraub Jan 3, 2025
f37a922
test:fix lint errors
christinestraub Jan 3, 2025
908c9e7
feat:fix nltk path error
christinestraub Jan 6, 2025
494b0e0
fix conflicts error
christinestraub Jan 6, 2025
30198a7
Merge branch 'main' into feat/contain-nltk-assets-in-docker-image
christinestraub Jan 6, 2025
39187d0
feat:add nltk model installation logic
christinestraub Jan 6, 2025
915b0ce
feat:revert ci.yml
christinestraub Jan 6, 2025
f4611a1
feat:revert Makefile
christinestraub Jan 6, 2025
2471633
feat:fix ingest test errors
christinestraub Jan 6, 2025
28262a4
feat:fix ingest test errors
christinestraub Jan 6, 2025
f778c0a
test:fix lint errors
christinestraub Jan 6, 2025
648c24a
test:fix lint errors
christinestraub Jan 6, 2025
a456d4e
feat: revert test function
christinestraub Jan 6, 2025
c2616cc
test:fix unit errors
christinestraub Jan 6, 2025
a4c4cbb
commented ondrive sh
christinestraub Jan 6, 2025
26b42d8
commented outlook sh
christinestraub Jan 6, 2025
82f16ec
commented outlook sh
christinestraub Jan 6, 2025
8b2f950
feat:update dockerfile and tokenize.py
christinestraub Jan 6, 2025
c7942ad
feat:update download_nltk_packages()
christinestraub Jan 6, 2025
8d57220
added changes as per the suggestion
christinestraub Jan 7, 2025
95358e0
updated code as per suggestion
christinestraub Jan 7, 2025
6e9931a
modified .gitignore
christinestraub Jan 7, 2025
7eeb411
removed if statement in tokenize
christinestraub Jan 7, 2025
26854e5
removed unused vars and function
christinestraub Jan 7, 2025
f340d65
remove download check function from test_tokenize
christinestraub Jan 7, 2025
2830248
Update CHANGELOG.md
christinestraub Jan 7, 2025
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,13 @@
## 0.16.13-dev0

### Enhancements

### Features

### Fixes

- **Fix NLTK Download** to use the NLTK assets bundled in the Docker image

## 0.16.12

### Enhancements
25 changes: 15 additions & 10 deletions Dockerfile
@@ -1,4 +1,4 @@
-FROM quay.io/unstructured-io/base-images:wolfi-base-latest as base
+FROM quay.io/unstructured-io/base-images:wolfi-base-latest AS base

 USER root

@@ -9,21 +9,26 @@ COPY unstructured unstructured
 COPY test_unstructured test_unstructured
 COPY example-docs example-docs

+# Copy the downloaded NLTK data folder to your local environment.
+COPY ./nltk_data /home/notebook-user/nltk_data
+
 RUN chown -R notebook-user:notebook-user /app && \
-  apk add font-ubuntu git && \
-  fc-cache -fv && \
-  if [ "$(readlink -f /usr/bin/python3)" != "/usr/bin/python3.11" ]; then \
-    ln -sf /usr/bin/python3.11 /usr/bin/python3; \
-  fi
+    apk add font-ubuntu git && \
+    fc-cache -fv && \
+    [ -e /usr/bin/python3 ] || ln -s /usr/bin/python3.11 /usr/bin/python3

 USER notebook-user

-RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
-  python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
-  python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
-  python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
+RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';'
+
+# Command to check if NLTK data has been copied correctly
+RUN python3.11 -c "import nltk; print(nltk.data.find('tokenizers/punkt_tab'))"
+
+RUN python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
+    python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"

 ENV PATH="${PATH}:/home/notebook-user/.local/bin"
 ENV TESSDATA_PREFIX=/usr/local/share/tessdata
+ENV NLTK_DATA=/home/notebook-user/nltk_data

 CMD ["/bin/bash"]
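The Dockerfile change above copies `nltk_data` into the image, checks it with `nltk.data.find('tokenizers/punkt_tab')`, and points `NLTK_DATA` at the copied folder. The same kind of check can be sketched with the standard library only, which may be handy in a test that should not import NLTK. This is a minimal sketch under stated assumptions: `missing_nltk_assets` and `REQUIRED_ASSETS` are hypothetical names (not part of `unstructured` or NLTK), and the only resource listed is the one the Dockerfile itself checks.

```python
from pathlib import Path

# Resource path checked by the Dockerfile; extend as needed (assumption).
REQUIRED_ASSETS = ("tokenizers/punkt_tab",)


def missing_nltk_assets(data_root, required=REQUIRED_ASSETS):
    """Return the required resource paths that are not directories under data_root."""
    root = Path(data_root)
    return [rel for rel in required if not (root / rel).is_dir()]
```

Inside the image, one might call `missing_nltk_assets(os.environ["NLTK_DATA"])` and fail the build or test when the returned list is non-empty.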
@@ -0,0 +1 @@
[".", "(", ")", ":", "''", "EX", "JJS", "WRB", "VBG", "VBP", "NN", "SYM", "VB", "UH", "NNPS", "NNP", "``", "$", "NNS", "JJR", "MD", "RP", "VBD", "DT", "POS", "RBR", ",", "VBZ", "PDT", "VBN", "WP$", "WDT", "WP", "PRP$", "CD", "IN", "#", "CC", "RB", "FW", "RBS", "PRP", "LS", "JJ", "TO"]

Large diffs are not rendered by default.

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions nltk_data/tokenizers/punkt_tab/README
@@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please confer the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages:

File                Language        Source                              Contents                      Size of training corpus (in tokens)   Model contributed by
=======================================================================================================================================================================
czech.pickle        Czech           Multilingual Corpus 1 (ECI)         Lidove Noviny                 ~345,000                              Jan Strunk / Tibor Kiss
                                                                        Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle       Danish          Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende            ~550,000                              Jan Strunk / Tibor Kiss
                                    (Berlingske Avisdata, Copenhagen)   Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle        Dutch           Multilingual Corpus 1 (ECI)         De Limburger                  ~340,000                              Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle      English         Penn Treebank (LDC)                 Wall Street Journal           ~469,000                              Jan Strunk / Tibor Kiss
                    (American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle     Estonian        University of Tartu, Estonia        Eesti Ekspress                ~359,000                              Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle      Finnish         Finnish Parole Corpus, Finnish      Books and major national      ~364,000                              Jan Strunk / Tibor Kiss
                                    Text Bank (Suomen Kielen            newspapers
                                    Tekstipankki)
                                    Finnish Center for IT Science
                                    (CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle       French          Multilingual Corpus 1 (ECI)         Le Monde                      ~370,000                              Jan Strunk / Tibor Kiss
                    (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle       German          Neue Zürcher Zeitung AG             Neue Zürcher Zeitung          ~847,000                              Jan Strunk / Tibor Kiss
                                    (Switzerland)                       CD-ROM
                                                                        (Uses "ss"
                                                                        instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle        Greek           Efstathios Stamatatos               To Vima (TO BHMA)             ~227,000                              Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle      Italian         Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino         ~312,000                              Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle    Norwegian       Centre for Humanities               Bergens Tidende               ~479,000                              Jan Strunk / Tibor Kiss
                    (Bokmål and     Information Technologies,
                    Nynorsk)        Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle       Polish          Polish National Corpus              Literature, newspapers, etc.  ~1,000,000                            Krzysztof Langner
                                    (http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle   Portuguese      CETENFolha Corpus                   Folha de São Paulo            ~321,000                              Jan Strunk / Tibor Kiss
                    (Brazilian)     (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle      Slovene         TRACTOR                             Delo                          ~354,000                              Jan Strunk / Tibor Kiss
                                    Slovene Academy for Arts
                                    and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle      Spanish         Multilingual Corpus 1 (ECI)         Sur                           ~353,000                              Jan Strunk / Tibor Kiss
                    (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle      Swedish         Multilingual Corpus 1 (ECI)         Dagens Nyheter                ~339,000                              Jan Strunk / Tibor Kiss
                                                                        (and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle      Turkish         METU Turkish Corpus                 Milliyet                      ~333,000                              Jan Strunk / Tibor Kiss
                                    (Türkçe Derlem Projesi)
                                    University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

---- Training Code ----

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene)
# Open in plain "r" mode; Python 3.11 removed the "U" mode flag
import codecs
text = codecs.open("slovene.plain","r","iso-8859-2").read()

# Train tokenizer
tokenizer.train(text)

# Dump pickled tokenizer (the with-block closes the file)
import pickle
with open("slovene.pickle","wb") as out:
    pickle.dump(tokenizer, out)

---------
118 changes: 118 additions & 0 deletions nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt
@@ -0,0 +1,118 @@
t
množ
např
j.h
man
ú
jug
dr
bl
ml
okr
st
uh
šp
judr
u.s.a
p
arg
žitě
st.celsia
etc
p.s
t.r
lok
mil
ict
n
tl
min
č
d
al
ravenně
mj
nar
plk
s.p
a.g
roč
b
zdi
r.s.c
přek
m
gen
csc
mudr
vic
š
sb
resp
tzn
iv
s.r.o
mar
w
čs
vi
tzv
ul
pen
zv
str
čp
org
rak
sv
pplk
u.s
prof
c.k
op
g
vii
kr
ing
j.o
drsc
m3
l
tr
ceo
ch
fuk
vl
viii
líp
hl.m
t.zv
phdr
o.k
tis
doc
kl
ard
čkd
pok
apod
r
a.s
j
jr
i.m
e
kupř
f
xvi
mir
atď
vr
r.i.v
hl
kv
t.j
y
q.p.r
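To illustrate what a list like `abbrev_types.txt` can be used for, here is a small sketch that loads such a list into a set and checks whether a token ending in a period is a listed abbreviation. It assumes entries are stored lowercase and without the trailing period, which matches entries such as `s.r.o` and `u.s.a` above. `load_abbrev_types` and `is_known_abbreviation` are hypothetical helpers, and the lookup is a simplification, not Punkt's actual decision procedure.

```python
def load_abbrev_types(lines):
    """Build a set of abbreviation types from an iterable of lines (one entry per line)."""
    return {line.strip() for line in lines if line.strip()}


def is_known_abbreviation(token, abbrev_types):
    """True if token ends with a period and its lowercased stem is a listed abbreviation."""
    if not token.endswith("."):
        return False
    return token[:-1].lower() in abbrev_types


# A few entries copied from the czech/abbrev_types.txt listing above.
sample = load_abbrev_types(["např", "tzv", "s.r.o", "prof"])
print(is_known_abbreviation("Např.", sample))   # True
print(is_known_abbreviation("Praha.", sample))  # False
```

A period after a listed abbreviation is a hint that the sentence may continue, which is why a sentence tokenizer would want this lookup to be cheap.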
96 changes: 96 additions & 0 deletions nltk_data/tokenizers/punkt_tab/czech/collocations.tab
@@ -0,0 +1,96 @@
i dejmala
##number## prosince
h steina
##number## listopadu
a dvořák
v klaus
i čnhl
##number## wladyslawowo
##number## letech
a jiráska
a dubček
##number## štrasburk
##number## juniorské
##number## století
##number## kola
##number## pád
##number## května
##number## týdne
v dlouhý
k design
##number## červenec
i ligy
##number## kolo
z svěrák
##number## mája
##number## šimková
a bělého
a bradáč
##number## ročníku
##number## dubna
a vivaldiho
v mečiara
c carrićre
##number## sjezd
##number## výroční
##number## kole
##number## narozenin
k maleevová
i čnfl
##number## pádě
##number## září
##number## výročí
a dvořáka
h g.
##number## ledna
a dvorský
h měsíc
##number## srpna
##number## tř.
a mozarta
##number## sudetoněmeckých
o sokolov
k škrach
v benda
##number## symfonie
##number## července
x šalda
c abrahama
a tichý
##number## místo
k bielecki
v havel
##number## etapu
a dubčeka
i liga
##number## světový
v klausem
##number## ženy
##number## létech
##number## minutě
##number## listopadem
##number## místě
o vlček
k peteraje
i sponzor
##number## června
##number## min.
##number## oprávněnou
##number## květnu
##number## aktu
##number## květnem
##number## října
i rynda
##number## února
i snfl
a mozart
z košler
a dvorskému
v marhoul
v mečiar
##number## ročník
##number## máje
v havla
k gott
s bacha
##number## ad
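The collocation entries above pair a first token with the token that follows it, and `##number##` stands in for numeric tokens. The sketch below parses such pairs and looks one up. It assumes the two columns are separated by a tab in the actual `.tab` file (this listing shows them separated by a space) and uses a simplified normalization. `parse_collocations`, `normalize_first`, and `is_collocation` are hypothetical names, not NLTK functions.

```python
NUMBER_PLACEHOLDER = "##number##"


def parse_collocations(lines):
    """Parse tab-separated 'first<TAB>second' lines into a set of pairs."""
    pairs = set()
    for line in lines:
        first, sep, second = line.rstrip("\r\n").partition("\t")
        if sep and first and second:
            pairs.add((first, second))
    return pairs


def normalize_first(token):
    """Drop a trailing period, map digit-only tokens to the placeholder, lowercase the rest."""
    stem = token[:-1] if token.endswith(".") else token
    return NUMBER_PLACEHOLDER if stem.isdigit() else stem.lower()


def is_collocation(first, second, pairs):
    """True if the normalized (first, second) pair is listed."""
    return (normalize_first(first), second.lower()) in pairs


# Two entries copied from the czech/collocations.tab listing above.
pairs = parse_collocations(["##number##\tprosince", "v\thavel"])
print(is_collocation("24.", "prosince", pairs))  # True
print(is_collocation("V.", "Havel", pairs))      # True
print(is_collocation("V.", "Praze", pairs))      # False
```

A listed pair such as `24. prosince` suggests that the period after the first token does not end the sentence.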