Search by content of a column vs luau filter #2483
-
Currently I use `luau filter` to search by the content of another column. I can't find anything in the documentation about doing this with `search`. If it is not possible, would it be useful? If it were useful, would it be reasonably easy to implement?
Replies: 7 comments 3 replies
-
Have you tried the `--no-globals` option?
It should be faster, as it skips creating a Luau global variable for every column on each row. Also, can you give me some metrics? How many rows? How long does it take?
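As a rough intuition for why `--no-globals` helps (a Python analogy with hypothetical helper names, not qsv's actual internals): materializing a fresh variable environment for every row costs more than reading columns through a single lookup, even though both filters compute the same answer.

```python
# Python analogy for the --no-globals speedup; the helper names and the
# per-row dict copy are illustrative assumptions, not qsv internals.
row = {"City": "BROOKLYN", "Borough": "BROOKLYN"}

def filter_with_globals(row):
    # like the default mode: materialize every column as its own
    # variable (here, a fresh dict) before evaluating the expression
    env = dict(row)
    return env["City"] == env["Borough"]

def filter_no_globals(row):
    # like --no-globals: evaluate col["..."] lookups against the row directly
    return row["City"] == row["Borough"]

# Both filters agree on the result; the second just does less setup per row.
print(filter_with_globals(row), filter_no_globals(row))  # True True
```

The saving is per-row, so it compounds over large files, which is consistent with the timings reported further down the thread.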
-
You can also just do a direct comparison with `==` and skip using `string.match`.
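Note that the two filters are not strictly equivalent: `string.match` does pattern matching, so it also keeps rows where Borough merely occurs inside City, while `==` keeps only exact matches. A minimal Python sketch of the difference, using `re.search` as a stand-in for Luau's `string.match` and made-up sample rows:

```python
import re

# Hypothetical sample rows; column names follow the NYC 311 example below.
rows = [
    {"City": "BROOKLYN", "Borough": "BROOKLYN"},       # exact match
    {"City": "EAST BROOKLYN", "Borough": "BROOKLYN"},  # substring match only
    {"City": "QUEENS", "Borough": "BRONX"},            # no match
]

# Pattern-style filter (re.search stands in for Luau's string.match):
# keeps any row where Borough occurs anywhere inside City.
pattern_hits = [r for r in rows if re.search(r["Borough"], r["City"])]

# Direct equality filter: keeps only exact matches.
equality_hits = [r for r in rows if r["City"] == r["Borough"]]

print(len(pattern_hits))   # 2
print(len(equality_hits))  # 1
```

So the equality form is only a drop-in replacement when the two columns are expected to match exactly.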
-
Some metrics using the different approaches above, using the 1 million row NYC 311 sample we use in our benchmarks:

```shell
$ /usr/bin/time qsv luau filter 'string.match(City,Borough)' /tmp/NYC_311_SR_2010-2020-sample-1M.csv -o /tmp/t1.csv
        7.72 real         7.08 user         0.57 sys
$ /usr/bin/time qsv luau filter --no-globals 'string.match(col["City"],col["Borough"])' /tmp/NYC_311_SR_2010-2020-sample-1M.csv -o /tmp/t2.csv
        5.30 real         4.68 user         0.56 sys
$ /usr/bin/time qsv luau filter 'City==Borough' /tmp/NYC_311_SR_2010-2020-sample-1M.csv -o /tmp/t3.csv
        7.48 real         6.89 user         0.55 sys
$ /usr/bin/time qsv luau filter --no-globals 'col["City"]==col["Borough"]' /tmp/NYC_311_SR_2010-2020-sample-1M.csv -o /tmp/t4.csv
        5.10 real         4.53 user         0.55 sys
```
-
Thank you. The figures for the first example look about right. I'll try with an index first for reference, then with `--no-globals`, and I'll report back.
-
Auto-indexing actually slows it down, as numerous indexes are generated, but with a million records …
-
Hi @ondohotola - I moved the issue to a Discussion so other folks who have a similar question can find it.