Performance enhancements of conditional logit #81
base: master
Conversation
This is to be used as an efficient implementation of `dh_dv`.
This means that `dh_dv.dot(design)` will be an instant calculation.
In the process, add a check for a negative `arg1` argument.
Hi @mathijsvdv, this is great! Wow:
Thanks also for your patience, as I've been much delayed in responding to this PR and to the issue that spawned it. Thanks again for your help!
Glad you like it, @timothyb0912! I really appreciate all the work you've done to make a flexible logit estimation suite, so I'm happy to help!
Great fixes; too bad they're still not merged into master.
In the past, I've used Pylogit (specifically the `MNL` model) on a large dataset of 200 million rows. I have noticed two bottlenecks:

1. `weights_per_obs` is not always kept sparse, causing a 200 million x 200 million dense numpy array to be created; see also issue Sparse to Dense #79.
2. `dh_dv` for a conditional logit represents an identity matrix but is coded as a `csr_matrix`. This causes the calculation `dh_dv.dot(design)` to be relatively slow even though its result is trivially `design`.

To remedy the first bottleneck, I used the same solution proposed in issue #79.
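For intuition on that first bottleneck, here is a generic sketch (not pylogit's actual code) of why keeping a diagonal weight matrix sparse matters; the size is scaled down from the 200 million rows mentioned above, and the variable name `weights_per_obs` simply mirrors the PR text:

```python
import numpy as np
from scipy.sparse import diags

n = 200_000  # scaled-down stand-in for 200 million observations
weights = np.ones(n)

# Sparse diagonal: stores only the n diagonal entries.
weights_per_obs = diags(weights)

# Dense diagonal: np.diag(weights) would allocate an n x n float64 array,
# i.e. n * n * 8 bytes = 320 GB here, and ~3.2e17 bytes at n = 200 million.
print(weights_per_obs.shape)        # (200000, 200000)
print(weights_per_obs.data.nbytes)  # 1600000 bytes for the stored diagonal
```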
For the second bottleneck, I made an efficient `identity_matrix` class (derived from scipy's `spmatrix`). When such an identity matrix `I` is multiplied with `A` using `I.dot(A)`, we get `A` again.
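The PR's actual class isn't reproduced here; below is a minimal sketch of the idea, implementing only the handful of methods exercised at the bottom (a production `spmatrix` subclass needs much more of the sparse interface). The negative-size check reflects the `arg1` note in the commit message:

```python
import numpy as np
from scipy.sparse import spmatrix


class identity_matrix(spmatrix):
    """An n x n identity matrix whose products are returned without any work."""

    def __init__(self, n):
        # Mirrors the commit note: reject a negative size argument.
        if n < 0:
            raise ValueError("arg1 must be non-negative")
        self.n = n

    @property
    def shape(self):
        return (self.n, self.n)

    def dot(self, other):
        # I.dot(A) is just A, so skip the multiplication entirely.
        return other

    def transpose(self, axes=None, copy=False):
        # The identity matrix is its own transpose.
        return self

    def toarray(self):
        return np.eye(self.n)


I = identity_matrix(3)
A = np.arange(12.0).reshape(3, 4)
assert I.dot(A) is A  # no copy, no multiplication performed
```

This is why `dh_dv.dot(design)` becomes an instant calculation: the product returns the existing `design` array instead of running a sparse-dense multiplication.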
I've run a benchmark by making a script that estimates an `MNL` on the usual Swiss-Metro dataset. I ran `line-profiler` on some of the critical functions, namely `calc_gradient` and `calc_fisher_info_matrix`. In summary, this change reduced the computation time of `calc_gradient` by 26% (from 0.080697s to 0.059372s) and that of `calc_fisher_info_matrix` by 99% (!) (from 0.906896s to 0.0062323s).

Profiling results are attached:
profile_before.txt
profile_after.txt
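For reference, per-line timings like those in the attached files can be collected with the `line_profiler` package; `toy_gradient` below is a hypothetical stand-in for a pylogit internal such as `calc_gradient`, which this sketch does not import:

```python
import numpy as np
from line_profiler import LineProfiler


def toy_gradient(design, residuals):
    # Hypothetical stand-in for a pylogit internal like calc_gradient.
    weighted = design * residuals[:, None]
    return weighted.sum(axis=0)


profiler = LineProfiler()
profiler.add_function(toy_gradient)

design = np.random.rand(10_000, 5)
residuals = np.random.rand(10_000)

# runcall executes the function once while recording per-line timings.
profiler.runcall(toy_gradient, design, residuals)
profiler.print_stats()
```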