About: 6000+ arXiv papers from the AI category in 2020. The dataset contains LaTeX source files and images, which makes it a good research dataset for multimodal learning.
- Dataset URL: https://pan.baidu.com/s/1DsLVmZno7JSWxNQ9CBbBJQ
- Dataset size: ~20 GB (compressed).
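
Because the dataset ships LaTeX sources alongside images, a first step for most of the projects below is pulling formulas, image references, and tables out of a `.tex` file. The following is a minimal sketch, not part of the dataset tooling: the function name `extract_latex_elements` and the sample text are hypothetical, and the regular expressions only cover the most common environments.

```python
import re

# Assumption: these simple patterns are enough for a first pass. Real papers use
# custom macros and nested environments that they will miss.
EQUATION_RE = re.compile(
    r"\\begin\{(equation|align)\*?\}(.*?)\\end\{\1\*?\}", re.DOTALL
)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")
TABLE_RE = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)


def extract_latex_elements(tex: str) -> dict:
    """Collect formulas, image paths, and tables from one LaTeX source string."""
    return {
        "formulas": [m.group(2).strip() for m in EQUATION_RE.finditer(tex)],
        "images": GRAPHICS_RE.findall(tex),
        "tables": [m.group(0) for m in TABLE_RE.finditer(tex)],
    }


# Hypothetical sample standing in for a paper's main .tex file.
sample = r"""
\section{Method}
We minimize the loss
\begin{equation}
L = \sum_i (y_i - f(x_i))^2
\end{equation}
\begin{figure}\includegraphics[width=\linewidth]{figs/model.png}\end{figure}
\begin{tabular}{ll} a & b \\ \end{tabular}
"""

elements = extract_latex_elements(sample)
print(elements["formulas"])
print(elements["images"])
```

Working on a string rather than a file path keeps the sketch independent of how the archive is laid out after decompression, which is not described here.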
About: Build a multimodal retrieval or recommendation system supporting text, images, formulas, and tables. Consider answering the following questions:
- Which image is most relevant to a given sentence/query?
- Which sentence/paragraph is most relevant to a given image?
- Which formulas are relevant to a given sentence/query?
- Which tables are relevant to a given sentence/query?
- What concepts are relevant to a given formula?
- ... other important questions ...
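
Each of the questions above reduces to ranking candidates by similarity to a query. The sketch below illustrates that ranking step only, under a strong simplifying assumption: images are represented by their caption text and compared with a bag-of-words cosine similarity. A real system would swap in learned text and image encoders. The file names and captions are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, candidates: dict) -> list:
    """Return (score, key) pairs sorted from most to least similar to the query."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), key) for key, text in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical captions standing in for image content.
captions = {
    "fig1.png": "Architecture of the transformer encoder",
    "fig2.png": "Training loss over epochs",
    "fig3.png": "Attention heatmap for a sample sentence",
}

ranking = rank("training loss curve", captions)
print(ranking[0][1])  # best match for the query
```

The same `rank` function covers the reverse direction (image to sentence) and the formula and table questions, as long as every item is mapped to some text or vector representation first.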
About: Build a fine-grained knowledge graph from the research papers in Arxiv6k. Consider answering the following questions:
- Which sentence is most similar to a given sentence?
- What concepts can be extracted from the corpus?
- Which concept is relevant to a given phrase/concept and in what manner?
- Which concepts are relevant to a given research problem?
- Which concepts are clustered together in one paragraph/section/paper?
- ... other important questions...
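
One simple way to start on the clustering question is a co-occurrence graph: two concepts get an edge whenever they appear in the same paragraph, weighted by how often that happens. This is only a sketch under stated assumptions: the concept vocabulary is given up front (extracting it is a separate problem), matching is by plain substring, and the paragraphs are invented examples.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraphs, concepts) -> Counter:
    """Count how often each pair of concepts appears in the same paragraph."""
    edges = Counter()
    for paragraph in paragraphs:
        text = paragraph.lower()
        # Assumption: substring matching is good enough here. It will also match
        # inside longer words, so a real pipeline should match on tokens.
        present = sorted(c for c in concepts if c in text)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical vocabulary and paragraphs for illustration.
concepts = {"attention", "transformer", "convolution"}
paragraphs = [
    "The transformer relies on self-attention instead of recurrence.",
    "Attention weights in the transformer are computed per head.",
    "Convolution layers extract local features.",
]

edges = cooccurrence_edges(paragraphs, concepts)
print(edges.most_common(1))
```

Sorting the concepts before pairing keeps each edge key in a canonical order, so `("attention", "transformer")` and its reverse are never counted separately. Swapping paragraphs for sections or whole papers changes the granularity of the graph without changing the code.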
About: Build an AI helper system for computer science.
- See Home for Researchers for reference.
About: Build your own dataset and develop some interesting models with it.