You're parsing an XML document using an HTML parser #15

Open
lucazav opened this issue Feb 8, 2023 · 1 comment

lucazav commented Feb 8, 2023

I'm running the demo code, referencing a specific grobid url:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('examples/futoma2017improved.pdf',
                                        grobid_url='https://<my-grobid-url>/')

I'm getting the following warning:

/anaconda/envs/scipdfparser/lib/python3.9/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
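
For context, here is a minimal sketch of the parser difference this warning points at, assuming bs4 and lxml are installed. The TEI-like snippet is made up for illustration and is not real GROBID output.

```python
from bs4 import BeautifulSoup

# Hypothetical TEI-like snippet, for illustration only.
tei = "<TEI><text><listBibl><biblStruct>ref</biblStruct></listBibl></text></TEI>"

# HTML parser: tag names are lowercased, as for any HTML document.
html_soup = BeautifulSoup(tei, "lxml")

# XML parser: tag names keep their original case.
xml_soup = BeautifulSoup(tei, "xml")

html_names = [tag.name for tag in html_soup.find_all(True)]
xml_names = [tag.name for tag in xml_soup.find_all(True)]

print(html_names)
print(xml_names)
```

Code that looks tags up by the lowercase names the HTML parser produces may not find them after switching to features="xml". Whether that is what happens inside scipdf is an assumption and is not confirmed in this thread.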

I'm running scipdf in a conda environment with Python 3.9.16. Here are the installed packages:

# packages in environment at /anaconda/envs/scipdfparser:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
asttokens                 2.0.5              pyhd3eb1b0_0  
backcall                  0.2.0              pyhd3eb1b0_0  
beautifulsoup4            4.11.2                   pypi_0    pypi
blas                      1.0                         mkl  
blis                      0.7.9                    pypi_0    pypi
ca-certificates           2023.01.10           h06a4308_0  
catalogue                 2.0.8                    pypi_0    pypi
certifi                   2022.12.7        py39h06a4308_0  
charset-normalizer        3.0.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
comm                      0.1.2            py39h06a4308_0  
confection                0.0.4                    pypi_0    pypi
cymem                     2.0.7                    pypi_0    pypi
debugpy                   1.5.1            py39h295c915_0  
decorator                 5.1.1              pyhd3eb1b0_0  
en-core-web-sm            3.5.0                    pypi_0    pypi
entrypoints               0.4              py39h06a4308_0  
executing                 0.8.3              pyhd3eb1b0_0  
idna                      3.4                      pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561  
ipykernel                 6.19.2           py39hb070fc8_0  
ipython                   8.8.0            py39h06a4308_0  
jedi                      0.18.1           py39h06a4308_1  
jinja2                    3.1.2                    pypi_0    pypi
jupyter_client            7.4.8            py39h06a4308_0  
jupyter_core              5.1.1            py39h06a4308_0  
langcodes                 3.3.0                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.2                h6a678d5_6  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libsodium                 1.0.18               h7b6447c_0  
libstdcxx-ng              11.2.0               h1234567_1  
lxml                      4.9.2                    pypi_0    pypi
markupsafe                2.1.2                    pypi_0    pypi
matplotlib-inline         0.1.6            py39h06a4308_0  
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py39h7f8727e_0  
mkl_fft                   1.3.1            py39hd3c417c_0  
mkl_random                1.2.2            py39h51133e4_0  
murmurhash                1.0.9                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.5.6            py39h06a4308_0  
numpy                     1.23.5           py39h14f4228_0  
numpy-base                1.23.5           py39h31eccc5_0  
openssl                   1.1.1s               h7f8727e_0  
packaging                 22.0             py39h06a4308_0  
pandas                    1.5.3                    pypi_0    pypi
parso                     0.8.3              pyhd3eb1b0_0  
pathy                     0.10.1                   pypi_0    pypi
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pip                       22.3.1           py39h06a4308_0  
platformdirs              2.5.2            py39h06a4308_0  
preshed                   3.0.8                    pypi_0    pypi
prompt-toolkit            3.0.36           py39h06a4308_0  
psutil                    5.9.0            py39h5eee18b_0  
ptyprocess                0.7.0              pyhd3eb1b0_2  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pydantic                  1.10.4                   pypi_0    pypi
pygments                  2.11.2             pyhd3eb1b0_0  
pyphen                    0.13.2                   pypi_0    pypi
python                    3.9.16               h7a1cb2a_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pytz                      2022.7.1                 pypi_0    pypi
pyzmq                     23.2.0           py39h6a678d5_0  
readline                  8.2                  h5eee18b_0  
requests                  2.28.2                   pypi_0    pypi
scipdf                    0.1.dev0                 pypi_0    pypi
setuptools                65.6.3           py39h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
smart-open                6.3.0                    pypi_0    pypi
soupsieve                 2.3.2.post1              pypi_0    pypi
spacy                     3.5.0                    pypi_0    pypi
spacy-legacy              3.0.12                   pypi_0    pypi
spacy-loggers             1.0.4                    pypi_0    pypi
sqlite                    3.40.1               h5082296_0  
srsly                     2.4.5                    pypi_0    pypi
stack_data                0.2.0              pyhd3eb1b0_0  
textstat                  0.7.3                    pypi_0    pypi
thinc                     8.1.7                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tornado                   6.2              py39h5eee18b_0  
tqdm                      4.64.1                   pypi_0    pypi
traitlets                 5.7.1            py39h06a4308_0  
typer                     0.7.0                    pypi_0    pypi
typing-extensions         4.4.0                    pypi_0    pypi
tzdata                    2022g                h04d1e81_0  
urllib3                   1.26.14                  pypi_0    pypi
wasabi                    1.1.1                    pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.10               h5eee18b_1  
zeromq                    4.3.4                h2531618_0  
zlib                      1.2.13               h5eee18b_0

huohuade-blog commented Feb 22, 2023

I have the same problem. I think it is caused by the updated bs4. It is only a warning, and I guess it does not affect the parsing itself. I tried changing features to "xml", but that throws errors. You can filter the warning like this:

import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)
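
The filter above applies to the whole process once it is set. If you would rather silence the warning only around a single call, a scoped variant could look like the sketch below. The helper name call_quietly is hypothetical (it is not part of scipdf or bs4), and the fallback class exists only so the sketch still runs where bs4 is missing or older than 4.11.

```python
import warnings

try:
    from bs4.builder import XMLParsedAsHTMLWarning
except ImportError:
    # Stand-in (assumption): lets the sketch run without a recent bs4.
    class XMLParsedAsHTMLWarning(UserWarning):
        pass


def call_quietly(func, *args, **kwargs):
    """Call func with XMLParsedAsHTMLWarning ignored, then restore the filters."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
        return func(*args, **kwargs)


# Usage with the call from the original report (not executed here):
# import scipdf
# article_dict = call_quietly(
#     scipdf.parse_pdf_to_dict,
#     'examples/futoma2017improved.pdf',
#     grobid_url='https://<my-grobid-url>/',
# )
```

Other warnings still come through, and the previous filter state is restored when the call returns.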
