The uparxive project aims to provide an LLM-friendly dataset for the whole arXiv .tex source. A similar dataset is unarXive, but uparxive uses a different tool chain.
The Uparxive dataset is stored in `.json` format, which can be seamlessly converted into Markdown (`.md`) format.
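As an illustration of the conversion, a record such as the following can be flattened into Markdown. Note that the field names below (`title`, `sections`, `heading`, `text`) are hypothetical stand-ins, not the actual uparxive schema; inspect 1512.03385.json for the authoritative layout.

```python
# Hypothetical record layout -- the real uparxive schema may differ;
# see 1512.03385.json for the authoritative field names.
record = {
    "title": "Deep Residual Learning for Image Recognition",
    "sections": [
        {"heading": "Introduction", "text": "Deep networks are hard to train."},
    ],
}

def record_to_markdown(rec):
    """Flatten a (hypothetical) JSON record into a Markdown string."""
    parts = ["# " + rec["title"], ""]
    for sec in rec["sections"]:
        parts.append("## " + sec["heading"])
        parts.append(sec["text"])
        parts.append("")
    return "\n".join(parts)

print(record_to_markdown(record))
```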
The Uparxive dataset adheres to the following rules:
- Tables and Figures: Elements enclosed within `\begin{table} ... \end{table}` and `\begin{figure} ... \end{figure}` tags are extracted and appended at the end of the document for clarity and organization.
- Citations and References: Citations (`\cite{}`) and references (`\ref{}`) are converted to more explicit forms to improve readability, depending on the usage context. Examples include:
  - Direct mentions: `(See [Ref. [1,2] of ArXiv.1512.03385])`
  - Contextual references: `in [Ref. [1,2] of ArXiv.1512.03385]`
  - Equation/Section/Figure/Table references: `in [Equation [1] of ArXiv.1512.03385]`
- Mathematical Notations:
  - In-line Math: Single dollar signs `$` are used for in-line mathematical expressions, e.g., `$\alpha$`.
  - Block Math: Double dollar signs `$$` denote block mathematical expressions, e.g., `$$\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+\mathbf{x}.$$`
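The citation-rewriting rule above can be sketched with a regular expression. This is an illustrative re-implementation, not the project's actual conversion code, and it assumes a pre-built mapping from BibTeX keys to the numbers in the paper's reference list:

```python
import re

# Hypothetical mapping from BibTeX keys to reference-list numbers --
# in a real pipeline this would come from parsing the paper's bibliography.
ref_numbers = {"he2015deep": 1, "simonyan2014very": 2}
arxiv_id = "ArXiv.1512.03385"

def explicit_cite(match):
    """Rewrite one \cite{...} command into the explicit bracketed form."""
    keys = [k.strip() for k in match.group(1).split(",")]
    nums = ",".join(str(ref_numbers[k]) for k in keys)
    return f"[Ref. [{nums}] of {arxiv_id}]"

tex = r"Residual learning \cite{he2015deep,simonyan2014very} helps."
converted = re.sub(r"\\cite\{([^}]*)\}", explicit_cite, tex)
print(converted)
# → Residual learning [Ref. [1,2] of ArXiv.1512.03385] helps.
```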
See 1512.03385.json and 1512.03385.md as examples.
- uparxive_metadata: huggingface
- uparxive: huggingface
- uparxive-reference: huggingface
- upar5iv (uparxive version) [Coming Soon][License check]
- unarxive (uparxive version) [Coming Soon][License check]
Note: The "full" version of uparxive was generated from all of arXiv.org, including non-permissively licensed papers. Make sure that your use of the data complies with each paper's licensing terms. (For information on papers' licenses, use arXiv's bulk metadata access.)
Note: papers under the CC BY-NC-ND license are not included in the dataset.
As of April 2024, there are around 2,450,893 papers in the arXiv source, of which the uparxive dataset covers about 1,700,000. The missing papers are mainly due to the lack of `.tex` source or to failures in the conversion process.
To effectively collect and process data from the arXiv source, follow the outlined tool chain and resources provided below:
- arXiv Bulk Data Access: Access and download bulk data directly from arXiv using the AWS S3 requester-pays dataset. Detailed instructions and access points can be found here: arXiv Bulk Data Access, or check this and this.
- arXiv API: For more specific data needs or metadata, use the arXiv API. Documentation and usage guidelines are available here: arXiv API.
- Important Note: When crawling arXiv source files, use `export.arxiv.org` instead of the official `arxiv.org` domain to avoid overloading the main site.
- To simplify API interactions, consider using the `arxiv.py` Python library. This wrapper facilitates easier querying and data retrieval from the arXiv API.
- Update as of 2025.04.30: Deyan Ginev (@Deyan) has published an HTML-packed arXiv dataset, known as ar5iv. This dataset provides HTML files that are more friendly for data extraction aimed at language model training or other NLP tasks.
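If you prefer to avoid the `arxiv.py` dependency, the points above can be combined with the standard library alone: build a query URL against the `export.arxiv.org` API endpoint and parse the returned Atom feed. The XML below is a trimmed, hand-written stand-in for a real API response, used only so the sketch is self-contained:

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Build an API query URL against export.arxiv.org (never arxiv.org).
params = {"search_query": "all:residual learning", "max_results": "3"}
url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)
print(url)

# The API answers with an Atom feed; this is a trimmed, hand-written sample.
ATOM = "http://www.w3.org/2005/Atom"
sample = f"""<feed xmlns="{ATOM}">
  <entry>
    <id>http://arxiv.org/abs/1512.03385v1</id>
    <title>Deep Residual Learning for Image Recognition</title>
  </entry>
</feed>"""

# Extract (id, title) pairs, qualifying tag names with the Atom namespace.
feed = ET.fromstring(sample)
entries = [
    (e.findtext(f"{{{ATOM}}}id"), e.findtext(f"{{{ATOM}}}title"))
    for e in feed.findall(f"{{{ATOM}}}entry")
]
print(entries)
```

Fetching `url` with any HTTP client and feeding the response body to `ET.fromstring` follows the same parsing path as the sample above.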
Obtain the `.json` format data:
- Compile the `.tex` file to a `.xml` file via LaTeXML:
  `python python_script/tex_to_xml.py --root [SourPath]`
- Convert the `.xml` file to a `.json` or `.md` file:
  `python python_script/xml_to_json.py --root [SourPath]`
- Update 2025.04.30: `html_to_dense_text.py` aims to convert the ar5iv dataset from `html` format to LLM-friendly data:
  `python python_script/html_to_json.py --root [SourPath]`
- Update 2025.05.22: Nougat provides a standard parser for the latexmlc HTML format: `from nougat.dataset.parser.html2md import html2md`. This solution is much more convenient than my custom implementation.
- Update 2025.06.17: we have now reached the V3 version of uparxive with the help of Nougat.
- Update 2025.06.24: we now release the aligned markdown-pdf subset; please see here for more details.
Turn to the Citation Retrieve section for more details.
- Resource: in order to retrieve the digital URL for each citation string, you need to collect the citation metadata from:
  - openalex snapshot: official page or the AWS S3 Open Data
  - crossref snapshot: Metadata Retrieval or the AWS S3 requester-pays dataset
  - arXiv metadata: arxiv_dataset
- Tool Chain:
- Citation Structure Tool:
- Citation Retreive Engine:
- Citation_Retreive_Script