Skip to content

cgd8d/StandaloneRefitter

Repository files navigation

I've got a refit-apds module in (my copy of) EXOAnalysis, but it is severely time-consuming, with all of the time being taken up by either matrix-vector multiplication or matrix multiplication with skinny matrices.  I'm using BLAS, but gemm can't do much with skinny matrices, so the performance is still insufficient.

However, in fact the matrices we're multiplying by are reused many times -- so there should be much more opportunity for BLAS to help if I can "package" many skinny matrices into one fat one.  That requires packaging multiple EXOEventData events together and handling them in parallel, which EXOAnalysis can't accommodate.  Thus the need for a standalone version of the code.

This actually can also alleviate some of the pressure on me to re-implement clustering, grid corrections, gain corrections, and purity corrections -- after running the standalone program, we could just rerun the needed components of EXOAnalysis and accomplish these things.

Note that the whole matrix A is not identical event-by-event -- just the noise portions.  So, some care will need to be taken.  Additionally, I'll want to support events converging at different times.

Basic plan:  the class should have two matrices (column-major), one of which is designated the queue for vectors needing to be multiplied.  Each event needs to track where it put its own vectors in the matrix.  When it's time, matrix multiplication gets called, and results are put into a result matrix.  The queue matrix is cleared, ready to accept new requests.


Memory usage, for 40 simultaneous signals (fColumnLength ~ 3e5, max number of signals per event = 5):
fNoiseMulQueue:	3e5*40*8B = 			96 MB
fNoiseResultQueue:				96 MB
fWireDeposit: 512*10*8B =			negligible
fWireInduction:					negligible
fLightMaps: 80*40^3*8B =			41 MB
fGainMaps:
fNoiseDiag: 3e5*8B =				2.4 MB
fInvSqrtNoiseDiag:				2.4 MB
fNoiseCorrelations: 3e5*300*8B=			720 MB
EventHandler signal solvers: 6*40*3e5*8B =	576 MB
fWireModel: 40*2000*8B =			negligible
Workspace: 3e5*5*8B =				12 MB
fWFEvent: 226*2000*4B =				2 MB

So, I can only account for 1.6 GB; there's a mysterious 900 MB I can't locate...
Could it be libraries I'm linking with?  (Ie do they get loaded into memory?)

ToDO:

New high-priority tasks:
* Convert to use static linking.
	I'm attempting to rebuild my ROOT installation statically.
* Run from an executable on scratch, and with all files also on scratch.  (Use $GSCRATCH2)
* Understand why reading chunks from file crashes; add lots of assert statements to make sure I'm creating and reading the noise file properly.
O Test whether xrootd is currently working for raw root files at NERSC.
- I should overwrite fRawEnergy, for now anyway.


- Track down memory hogs -- where is all of the memory usage coming from exactly?
- Write up code (EXOAnalysis, EXOFitting, script) for getting a rotated resolution from a denoised file.
- Investigate whether I can reorganize preconditioners to be faster, since they're the non-noise bottleneck.
	I've added a watch to DoInvRPrecon, which should give valuable information.
* Verify that reasonable results are being produced by current code!!
- Improve noise matrix by exploiting symmetries.  (Are there any in DFT domain?)
- SLAC vs NERSC (it's looking like NERSC is necessary -- but it would be nice to give Tony a firm answer on this before asking he get xrootd working again).
- Make it possible to set a threshold on the command-line.
- Write up note in latex, explaining algorithm and implementation.

With the important goals of:
- Compare speeds at SLAC vs NERSC (various flavors at SLAC; don't bother with multithreaded NERSC MKL).
* Identify a proper threshold.
	As a secondary consideration, identify the preconditioned threshold and relative number of iterations.
* Verify whether I gain anything by including wires.
	If so, pin down exactly what is providing the gain -- better APD denoising, or wire denoising.

0 Mike is going to stress-test NERSC waveform access, to see what kind of combined waveform read speed is acheivable.
0 Mike will ask Tony to get a pipeline task set up at SLAC.


	The algorithms can be denoted by a box, where some plausible options are:

			Using APDs	Using Wires
	Denoise APDs	XXXXXXXXXX	XXXXXXXXXXX
	Denoise Wires	XXXXXXXXXX	XXXXXXXXXXX

			Using APDs	Using Wires
	Denoise APDs	XXXXXXXXXX
	Denoise Wires			XXXXXXXXXXX

			Using APDs	Using Wires
	Denoise APDs	XXXXXXXXXX	XXXXXXXXXXX
	Denoise Wires

			Using APDs	Using Wires
	Denoise APDs	XXXXXXXXXX
	Denoise Wires

	Each box filled in should give a strictly better result; the question is which boxes are negligible.




Multi-threaded code: in principle, it's ready to go.  I continue linking to the sequential MKL, and exploit instead the higher-level parallelism of the code.  For multiplying by the noise matrix, I distribute frequency ranges to the various cores; for the rest of BiCGSTAB, I distribute events.  The noise matrix and executable image only need to be stored once, which is the source of savings.  Note that because we're entirely within one process, it is easy to ensure the noise matrix doesn't change differently in the different streams.
To use, compile with
-DUSE_THREADS -DNUM_THREADS=6 (hopper)
-DUSE_THREADS -DNUM_THREADS=12 (edison)
The number of threads is chosen so we fill one NUMA per process.

The other possible way to save memory is with shared memory; we can place one noise matrix in shared memory and access it from 6/12 independent processes.  This is a simpler model to work with, but offers less significant potential for gain.  (Still need N executable images, which may be large; and joining the columns from all of the separate processes would be quite difficult.)  Still, should bear this in mind in case threads are difficult to make work, since this does offer safety and some insulation from ROOT peculiarities.


Running on hopper: Edison is down until Sunday; might as well focus on getting things running on Hopper.
Things that I can work on:
	Make sure I don't kill nodes anymore.
	Understanding what is a truly minimal installation. (Were any third-party libraries non-essential?)
	Installing a more current boost; or making the code work without lockfree.
	Converting code to optionally work with either ACML or MKL. (ACML documentation is terrible.)



ToDO: If fMin, fMax, or fChannels are changed mid-run, then there will be a race condition with the FinishEvent thread.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published