glhuilli/neurips_crawler

neurips_crawler

Get all NeurIPS papers for input years.

This code is inspired by @benhamner's NeurIPS 2015 crawler.

To use this script, first set up a virtual environment and run:

pip install -r requirements.txt

Then run the command below to crawl all PDFs from every NeurIPS conference from year from_year to year to_year. By default, the output is stored in the ./output folder and the execution logs are stored in crawler_log.txt.

python src/neurips_crawler.py --from_year=1998 --to_year=2018 --output=./output/ --log=./crawler_log.txt
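As an illustration of how the flags in the command above could be parsed, here is a minimal sketch using argparse. It is not taken from src/neurips_crawler.py: the function names build_parser and years_to_crawl are hypothetical, and treating --from_year and --to_year as required and as an inclusive range is an assumption. The --force flag is the retry option described further down in this README.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(
        description="Get all NeurIPS papers for input years.")
    # Assumption: both year flags are required integers.
    parser.add_argument("--from_year", type=int, required=True)
    parser.add_argument("--to_year", type=int, required=True)
    # Defaults taken from the README text.
    parser.add_argument("--output", default="./output")
    parser.add_argument("--log", default="crawler_log.txt")
    # Retry option described later in the README.
    parser.add_argument("--force", action="store_true")
    return parser


def years_to_crawl(argv: List[str]) -> List[int]:
    """Return the years to process, assuming the range is inclusive."""
    args = build_parser().parse_args(argv)
    return list(range(args.from_year, args.to_year + 1))
```

For example, years_to_crawl(["--from_year=2016", "--to_year=2018"]) would yield the three years 2016, 2017, and 2018 under the inclusive-range assumption.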

For each conference year, the script creates a folder inside --output where all papers are stored, together with a jsons file containing each paper's metadata collected from the website. If a folder for a given year already exists, the script assumes that year was already processed and skips to the next one. Logs are written to both the console and a log file.

Each entry in the jsons file looks like this:

{
  "id": "755b0ec5-8636-50cc-9c95-e9f7d11e3a47",
  "title": "Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization",
  "pdf_name": "7286-efficient-algorithms-for-non-convex-isotonic-regression-through-submodular-optimization.pdf",
  "abstract": "We consider the minimization of submodular functions subject to ordering constraints. We show that this potentially non-convex optimization problem can be cast as a convex optimization problem on a space of uni-dimensional measures, with ordering constraints corresponding to first-order stochastic dominance.  We propose new discretization schemes that lead to simple and efficient algorithms based on zero-th, first, or higher order oracles;  these algorithms also lead to improvements without isotonic constraints. Finally,   our experiments  show that non-convex loss functions can be much more robust to outliers for isotonic regression, while still being solvable in polynomial time.",
  "authors": [
    {
      "id": "francis-bach-6335",
      "name": "Francis Bach"
    }
  ]
}
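To work with the collected metadata, something like the following sketch could be used. It assumes the jsons file holds one JSON object per line, which this README does not confirm; if the file is laid out differently, the reading step needs to change. The function names iter_entries and papers_by_author are hypothetical, and only the fields shown in the sample entry are used.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one metadata dict per non-empty line.

    Assumption: the jsons file stores one JSON object per line.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def papers_by_author(entries: Iterable[dict]) -> Dict[str, List[str]]:
    """Group paper titles by author id, using the fields in the sample entry."""
    index: Dict[str, List[str]] = {}
    for entry in entries:
        for author in entry.get("authors", []):
            index.setdefault(author["id"], []).append(entry["title"])
    return index
```

An open file object can be passed directly to iter_entries, since iterating over it yields lines.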

If you want to retry a given year (e.g. because some papers failed to download), you can use the --force option.
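The skip-and-retry behavior described in this README can be sketched as below. This is an illustration, not the script's actual code: should_process_year is a hypothetical helper, and naming each year's folder after the year itself is an assumption.

```python
import os


def should_process_year(output_dir: str, year: int, force: bool = False) -> bool:
    """Decide whether a conference year should be crawled.

    Assumption: each year's papers live in <output_dir>/<year>.
    A year is skipped when its folder already exists, unless force is set.
    """
    year_dir = os.path.join(output_dir, str(year))
    return force or not os.path.isdir(year_dir)
```

With this rule, a partially downloaded year is treated as done on the next run, which is why a forced retry is needed to fetch any papers that failed.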

The script was formatted and checked using mypy, yapf, isort, and pylint.

Note that downloading the whole NeurIPS/NIPS period (1988 to 2018) produces a large set of files (about 8.8 GB). Also note that this code will only work as long as the NeurIPS organizers keep the website unchanged.
