Improve concurrency and space efficiency of parse_collection #403

Open · mpetri opened this issue Jun 2, 2020 · 22 comments
Labels: enhancement (New feature or request)

Comments

@mpetri commented Jun 2, 2020

Describe the solution you'd like
The parse_collection command seems to have several potential areas for improvement:

  1. "Remapping IDs" could be done in parallel.

  2. "Concatenating batches" and "Remapping IDs" could be combined into one step. Currently, remapping IDs rewrites the content on disk only for it to be written out again in the "Concatenating batches" step.

  3. The run() method could use an unordered_map instead of a map, or is there a reason for the ordered map? We are remapping later anyway.

  4. After remapping/concatenating a batch, the original file could be deleted right away instead of only at the end.

Some of these changes would prevent resuming the merging process, but they would reduce parsing time and halve the space usage (see the sketch below).
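A rough sketch of what items 2 and 4 could look like (all names here are illustrative, not PISA's actual API). It assumes each batch is a flat file of fixed-width 32-bit term IDs whose remapped size is known up front, so every batch's offset in the final file is just a prefix sum of the preceding batch sizes:

```cpp
// Sketch only: remap each batch in memory, write it once directly into the
// final file at its precomputed offset, then delete the batch file.
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <unistd.h>
#include <vector>

struct Batch {
    std::filesystem::path path;  // batch file produced by the parsing phase
    std::uint64_t size_bytes;    // size of the remapped batch (assumed known)
};

// Read one batch, rewrite its IDs through `mapping`, and write the result
// directly into the final file at byte `offset`.
void remap_into(Batch const& batch, std::vector<std::uint32_t> const& mapping,
                int out_fd, std::uint64_t offset)
{
    std::vector<std::uint32_t> ids(batch.size_bytes / sizeof(std::uint32_t));
    std::ifstream in(batch.path, std::ios::binary);
    in.read(reinterpret_cast<char*>(ids.data()),
            static_cast<std::streamsize>(batch.size_bytes));
    for (auto& id : ids) {
        id = mapping[id];  // items 1+2: remap in memory, write to disk only once
    }
    ::pwrite(out_fd, ids.data(), batch.size_bytes, static_cast<off_t>(offset));
}

void remap_and_concatenate(std::vector<Batch> const& batches,
                           std::vector<std::uint32_t> const& mapping,
                           int out_fd)
{
    std::uint64_t offset = 0;  // running prefix sum of batch sizes
    for (auto const& batch : batches) {
        remap_into(batch, mapping, out_fd, offset);
        std::filesystem::remove(batch.path);  // item 4: delete eagerly
        offset += batch.size_bytes;
    }
}
```

Since each iteration writes to a disjoint region of the output, the loop could also run in parallel (item 1), at the cost of more concurrent I/O.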

@mpetri added the enhancement label on Jun 2, 2020
@mpetri (Author) commented Jun 2, 2020

To put this into context:

I was creating a collection, and processing the documents (the part where it outputs [Batch 1234] Processed documents) took the same time as the "Remapping IDs" part of the parsing.

@mpetri (Author) commented Jun 2, 2020

Another idea: you could even concatenate in parallel, since you know where the data for the different batches will end up in the big file.
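A hedged sketch of that idea, assuming each batch's remapped size is known in advance so every thread writes into a disjoint region of the output (the load_remapped_batch helper is hypothetical):

```cpp
// Sketch: parallel concatenation with TBB. Offsets are an exclusive prefix
// sum of batch sizes, so writes from different threads never overlap.
#include <cstdint>
#include <numeric>
#include <tbb/parallel_for.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: loads batch `i` and returns its (already remapped) bytes.
std::vector<char> load_remapped_batch(std::size_t i);

void concatenate_in_parallel(std::vector<std::uint64_t> const& batch_sizes, int out_fd)
{
    std::vector<std::uint64_t> offsets(batch_sizes.size());
    std::exclusive_scan(
        batch_sizes.begin(), batch_sizes.end(), offsets.begin(), std::uint64_t{0});
    tbb::parallel_for(std::size_t{0}, batch_sizes.size(), [&](std::size_t i) {
        auto bytes = load_remapped_batch(i);
        ::pwrite(out_fd, bytes.data(), bytes.size(), static_cast<off_t>(offsets[i]));
    });
}
```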

@elshize (Member) commented Jun 2, 2020

I agree, there is much to be improved. I could only spare a moment to look at it for now, but from what I can tell:

  • I don't see a reason to use map either; maybe it's a remnant of an earlier version. unordered_map should do just fine. A little side note: we had a discussion (or rather started one) about using an actually efficient hash map. unordered_map is garbage, to put it simply. But we never arrived at a preferred replacement, so that's still an open question.
  • Remapping could probably be done directly into the final file. I didn't dive deep into it now, but I think we should be able to know where things go and do both things in parallel, as you mention in your last comment.
  • Speaking of parallel, we should think about how to deal with it. Technically, it's pretty simple: you just write wherever you need to in separate threads. I'm just not sure what would happen if we use too many threads at a time and completely overwhelm the I/O. We might not do ourselves any favors by doing so.
  • I think the slowness of remapping has to do with the fact that we're writing directly to memory-mapped files (not sure, though). I think some kind of smart buffering could help with that.

Question 1: How long did concatenation take? Since these are bigger chunks, I'm expecting it to be nowhere near as slow as remapping.

Now, regarding disk space: deleting things right after merging is easy, but I'm not sure it helps you much if we merge directly into one big file, because you'll peak at double the size of the index anyway; you'd just get rid of some of it sooner. That's also why I'm asking about the concatenation. If it's not too expensive, we might want to leave it be.

Question 2: How big is the collection you're trying to index?

Conclusion

There are definitely several things we can do here, some easier than others. Say, replacing the map should be easy, and the same goes for eager deletion of the batches (maybe we should add a flag to disable that?). Parallel remapping can be done easily enough but should be checked for performance when using large numbers of threads.

The more complex work that requires some thought is buffering in memory when remapping and flushing every now and then. I expect this to speed things up, but I'm not sure exactly how the OS flushes changes to the memory-mapped file, so I can't be certain.
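For illustration, the kind of buffering meant here might look roughly like this (a minimal sketch, not PISA's actual writer): collect values in memory and flush them in large sequential writes instead of storing each one directly into the mapped region.

```cpp
// Sketch of a buffered writer: appends values to an in-memory buffer and
// flushes it to the file in large sequential writes, instead of touching a
// memory-mapped region one element at a time.
#include <cstdint>
#include <cstdio>
#include <vector>

class BufferedWriter {
  public:
    BufferedWriter(std::FILE* out, std::size_t capacity) : m_out(out)
    {
        m_buffer.reserve(capacity);
    }
    ~BufferedWriter() { flush(); }

    void put(std::uint32_t value)
    {
        m_buffer.push_back(value);
        if (m_buffer.size() == m_buffer.capacity()) {
            flush();  // write one big chunk instead of many small stores
        }
    }

    void flush()
    {
        if (!m_buffer.empty()) {
            std::fwrite(m_buffer.data(), sizeof(std::uint32_t), m_buffer.size(), m_out);
            m_buffer.clear();
        }
    }

  private:
    std::FILE* m_out;
    std::vector<std::uint32_t> m_buffer;
};
```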

Lastly, we're stretched a little thin at the moment, so I'm not sure when someone can pick these up. However, we always welcome pull requests. We would also be grateful for any insights about what's slow, what's fast, and any other potentially pertinent things you encounter during the build.

@mpetri (Author) commented Jun 2, 2020

The collection was ~400M docs in plaintext format, processed using 60 threads. The rough times for the different steps are:

4.3 hours - Processed documents/records 
  2 mins  - Merging titles
  2 mins  - Creating document lexicon
  0 secs  - Merging URLs
1.5 hours - Collecting terms
  2 mins  - Writing terms
 10 mins  - Mapping terms
4.2 hours - Remapping IDs
4.6 hours - Concatenating batches

@elshize (Member) commented Jun 2, 2020

Okay, that's helpful, thanks. We definitely have to improve on the later stages.

@mpetri (Author) commented Jun 2, 2020

I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.

@mpetri (Author) commented Jun 2, 2020

The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for the most part the CPU was also barely used. That's why I opened this issue.

@mpetri (Author) commented Jun 2, 2020

As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?

@elshize (Member) commented Jun 2, 2020

> I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.

It's actually not as bad with TBB: you can do parallel loops, thread pools, etc. There are examples throughout the codebase.
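For reference, the two patterns look roughly like this (a toy illustration, not code taken from the PISA codebase):

```cpp
// Toy illustration of TBB parallel loops and task groups.
#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/task_group.h>
#include <vector>

void tbb_patterns(std::vector<int>& data)
{
    // Parallel loop over an index range; iterations must be independent.
    tbb::parallel_for(std::size_t{0}, data.size(), [&](std::size_t i) { data[i] *= 2; });

    // Task group: spawn independent tasks and wait for all of them to finish.
    tbb::task_group group;
    group.run([&] { /* e.g. remap one batch */ });
    group.run([&] { /* e.g. remap another batch */ });
    group.wait();
}
```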

> The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for the most part the CPU was also barely used. That's why I opened this issue.

Running the merging part in one thread will cause that for sure: only one thread is doing the work, and it spends a lot of time waiting for I/O. We could think about doing it asynchronously, or simply pass the computed data to a writer thread (a sketch follows below). But I think there's more we can do.
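A sketch of the writer-thread idea, assuming each remapped chunk carries its target offset in the final file (remap_batch is a hypothetical helper, not PISA's API):

```cpp
// Sketch: worker threads remap batches and hand the results to a single
// writer thread over a bounded queue, so only one thread performs the I/O.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <tbb/concurrent_queue.h>
#include <tbb/parallel_for.h>
#include <thread>
#include <unistd.h>
#include <vector>

struct Chunk {
    std::uint64_t offset;     // where this chunk goes in the final file
    std::vector<char> bytes;  // already remapped data
};

// Hypothetical helper: loads batch `i`, remaps its IDs, and returns the chunk.
Chunk remap_batch(std::size_t i);

void merge_with_writer_thread(std::size_t num_batches, int out_fd)
{
    tbb::concurrent_bounded_queue<std::optional<Chunk>> queue;
    queue.set_capacity(8);  // bounds memory use; workers block when the queue is full

    std::thread writer([&] {
        for (;;) {
            std::optional<Chunk> chunk;
            queue.pop(chunk);  // blocks until a chunk (or the sentinel) arrives
            if (!chunk) {
                break;
            }
            ::pwrite(out_fd, chunk->bytes.data(), chunk->bytes.size(),
                     static_cast<off_t>(chunk->offset));
        }
    });

    tbb::parallel_for(std::size_t{0}, num_batches, [&](std::size_t i) {
        queue.push(remap_batch(i));  // workers only compute; the writer does the I/O
    });

    queue.push(std::nullopt);  // sentinel: no more chunks
    writer.join();
}
```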

> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?

I don't think anybody has ever checked that; I certainly haven't. For one, I have no idea how fast Lucene is. I also don't have a good guess at how fast CIFF to PISA will be; maybe @JMMackenzie will have some guesses. The conversion is not multi-threaded, though it is written in Rust, so you might want to take a look. But I'm not sure whether protobuf allows random access or implements a parallel iterator or something like that. If not, then you'd have to read the data and serve it to workers that write it back in another format, so I'm not sure how much there is to gain.

@JMMackenzie (Member) commented Jun 2, 2020

> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?

> Maybe @JMMackenzie will have some guesses. […]

Oops, scrubbing what I had here, as I had measured "PISA to CIFF" and not the other way around. I am sure I had timings somewhere, but I can't find them right now.

Looking at Anserini/Lucene (this paper from Jimmy Lin: https://arxiv.org/pdf/1910.11028.pdf), it seems like Lucene can handle ClueWeb12B in around 2 hours (with full positions and collection storage); it's probably under an hour for a regular index.

I estimate the Lucene -> CIFF part takes around an hour, but unfortunately I can't remember.

@JMMackenzie (Member)

Jimmy Lin has kindly managed to dig up some information from his personal Mac Pro (SSD):

time target/appassembler/bin/ExportAnseriniLuceneIndex -output cw12b-complete-20200309.ciff.gz
 -index lucene-index-ciff.cw12b.20200309 -description "Anserini v0.7.2, ClueWeb12-B13 regression"
real	167m21.523s
user	283m30.397s
sys	4m52.060s

So that's the time taken to export Lucene to CIFF (it runs single-threaded).

@lintool commented Jun 3, 2020

Unless there are scientifically interesting questions you want to explore, I would advocate just giving up on indexing and letting Anserini/Lucene do it for you via CIFF. Why waste engineering cycles on this?

My entire argument in https://arxiv.org/pdf/1910.11028.pdf is that we've pushed indexing to basically its limits with the current in-memory inversion / flush-to-disk / merge approach. Unless we dramatically rethink the pipeline, we've reached a plateau. Just buy more SSDs to go faster.

@elshize (Member) commented Jun 3, 2020

Yep, I think this might be the best approach.

@lintool commented Jun 3, 2020

Also, we're one issue away from the entire Anserini indexing pipeline being pip-installable: castorini/pyserini#77

Something like:

$ pip install pyserini
...
$ python -m pyserini.index --collection cw12b --path /path/to/collection ...

Bam!

If you want, we can move the CIFF exporter into pyserini as well, and everything will be pip-able. You can script it easily from your end as well.

@elshize (Member) commented Jun 3, 2020

It would definitely be nice to be able to easily index and export from Python like that. We could then include it as an option in our documentation.

@mpetri (Author) commented Jun 3, 2020

I'm mostly looking for ease of use and speed. I will try pyserini -> CIFF -> PISA and report some statistics when I'm done.

Maybe the docs should reflect that there are other options for creating an index.

@elshize (Member) commented Jun 3, 2020

Yes, we should mention CIFF in the docs.

@JMMackenzie (Member)

Just to follow up briefly, I ran a few quick-n-dirty Gov2 experiments.

Indexing Gov2 via Anserini takes ~20 mins with 30 threads.
Taking this Lucene index to CIFF takes about ~45-60 mins.
Taking CIFF to PISA canonical takes ~15 mins.

End-to-end, going from the raw corpus to files that PISA can deal with (but not yet query over) takes ~2 hours at most. This was also on a loaded machine, so it's probably faster than that depending on your resources. The bottleneck is the single-threaded "Lucene to CIFF" part, which populates the protobuf from the Lucene index. There's not much you can do to speed that up, but as a one-off operation it's probably fine.

@elshize (Member) commented Jun 4, 2020

@JMMackenzie thanks for taking the time to check that out. Sounds like it's altogether faster than with PISA alone. One side note is that you won't end up with a full forward index, just the inverted index and the term and document lexicons, but that might not be important in the majority of cases.

@mpetri (Author) commented Jun 11, 2020

I just wanted to confirm that Lucene -> CIFF -> PISA is much faster than the parse_collection tool provided by PISA.

I think this issue can be closed, but maybe Lucene + CIFF should be the default way of creating a PISA index?

@JMMackenzie (Member)

Thanks Matthias. I guess we should at least document the tooling so users have a choice, including how to get queries formatted via Anserini (etc.).

@elshize (Member) commented Jun 11, 2020

Thanks for the confirmation. Let's keep this open for now. We definitely want to improve some things around parsing, especially since some of them shouldn't be too hard. The CIFF route is good to have, but native parsing capabilities have their own advantages, so we definitely want to see what can be improved.

That said, I don't think CIFF is mentioned anywhere in the docs, so we should definitely mention it.
