Commit 22c1be0

Merge pull request #500 from gareth-nx/pr/quickselect
Selection algorithms
2 parents 3af3259 + 0f4e635 commit 22c1be0

9 files changed: +1122 -0 lines changed

doc/specs/stdlib_selection.md (+350 lines)

@@ -0,0 +1,350 @@
---
title: Selection Procedures
---

# The `stdlib_selection` module

[TOC]

## Overview of selection

Suppose you wish to find the value of the k-th smallest entry in an array of size N, or
the index of that value. While it could be done by sorting the whole array
using `[[stdlib_sorting(module):sort(interface)]]` or
`[[stdlib_sorting(module):sort_index(interface)]]` from
`[[stdlib_sorting(module)]]` and then finding the k-th entry, that would
require O(N log N) time. However, selection of a single entry can be done in
O(N) time, which is much faster for large arrays. This is useful, for example,
to quickly find the median of an array, or some other percentile.

The Fortran Standard Library therefore provides a module, `stdlib_selection`,
which implements selection algorithms.

## Overview of the module

The module `stdlib_selection` defines two generic subroutines:
* `select` is used to find the k-th smallest entry of an array. The input
  array is also modified in-place, and on return will be partially sorted
  such that `all(array(1:k) <= array(k))` and `all(array(k) <= array((k+1):size(array)))` are both true.
  The user can optionally specify `left` and `right` indices to constrain the search
  for the k-th smallest value. This can be useful if you have previously called `select`
  to find a smaller or larger rank (that will have led to partial sorting of
  `array`, thus implying some constraints on the location).

* `arg_select` is used to find the index of the k-th smallest entry of an array.
  In this case the input array is not modified, but the user must provide an
  input index array `indx` with the same size as `array`, having indices that are a permutation of
  `1:size(array)`, which is modified instead. On return the index array is modified
  such that `all(array(indx(1:k)) <= array(indx(k)))` and `all(array(indx(k)) <= array(indx(k+1:size(array))))`.
  The user can optionally specify `left` and `right` indices to constrain the search
  for the k-th smallest value. This can be useful if you have previously called `arg_select`
  to find a smaller or larger rank (that will have led to partial sorting of
  `indx`, thus implying some constraints on the location).
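
As a hedged illustration of how the optional `left` argument can exploit the partial sorting described above, the sketch below computes the median of an even-length array with two calls to `select`. The program name `demo_median_even` and the variables `lower` and `upper` are illustrative only and are not part of the module. It assumes that passing `left=k` is acceptable when searching for rank `k+1`, which follows from the stated guarantee that `array(k+1:)` holds no value smaller than `array(k)`.

```fortran
program demo_median_even
    ! Illustrative sketch (not part of stdlib): median of an even-length
    ! array obtained from two calls to select.
    use stdlib_selection, only: select
    implicit none

    real, allocatable :: array(:)
    real :: lower, upper
    integer :: k

    array = [3., 2., 7., 4., 5., 1., 4., -1.]
    k = size(array)/2

    ! k-th smallest value; on return array(1:k) <= array(k) <= array(k+1:)
    call select(array, k, lower)

    ! The (k+1)-th smallest value cannot lie to the left of index k,
    ! so the second search is restricted with left=k.
    call select(array, k + 1, upper, left=k)

    print *, 0.5*(lower + upper)   ! 3.5 for this data
end program demo_median_even
```

The second call only needs to inspect `array(k:)`, so it should be cheaper than an unconstrained search, although the saving is modest for an array this small.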

## `select` - find the k-th smallest value in an input array

### Status

Experimental

### Description

Returns the k-th smallest value of `array(:)`, and also partially sorts `array(:)`
such that `all(array(1:k) <= array(k))` and `all(array(k) <= array((k+1):size(array)))`.

### Syntax

`call [[stdlib_selection(module):select(interface)]]( array, k, kth_smallest [, left, right ] )`

### Class

Generic subroutine.

### Arguments

`array`: shall be a rank one array of any of the types:
`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`,
`real(sp)`, `real(dp)`, `real(xdp)`, `real(qp)`. It is an `intent(inout)` argument.

`k`: shall be a scalar with any of the types:
`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`. It
is an `intent(in)` argument. We search for the `k`-th smallest entry of `array(:)`.

`kth_smallest`: shall be a scalar with the same type as `array`. It is an
`intent(out)` argument. On return it contains the k-th smallest entry of
`array(:)`.

`left` (optional): shall be a scalar with the same type as `k`. It is an
`intent(in)` argument. If specified then we assume the k-th smallest value is
definitely contained in `array(left:size(array))`. If `left` is not present,
the default is 1. This is typically useful if multiple calls to `select` are
made, because the partial sorting of `array` implies constraints on where we
need to search.

`right` (optional): shall be a scalar with the same type as `k`. It is an
`intent(in)` argument. If specified then we assume the k-th smallest value is
definitely contained in `array(1:right)`. If `right` is not present, the
default is `size(array)`. This is typically useful if multiple calls to
`select` are made, because the partial sorting of `array` implies constraints
on where we need to search.

### Notes

Selection of a single value should have runtime of O(`size(array)`), so it is
asymptotically faster than sorting `array` entirely. The test program at the
end of this document shows that this is the case.

The code does not support `NaN` elements in `array`; it will run, but there is
no consistent interpretation given to the order of `NaN` entries of `array`
compared to other entries.
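
One possible way to work around the `NaN` limitation is to remove such entries before calling `select`. The following is only a sketch under that assumption: the program name `demo_select_skip_nan` and the array `non_nan` are hypothetical, and the filtering relies on the intrinsic `ieee_arithmetic` module rather than on anything provided by `stdlib_selection`.

```fortran
program demo_select_skip_nan
    ! Illustrative sketch (not part of stdlib): drop NaN entries, then select.
    use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
        ieee_quiet_nan
    use stdlib_selection, only: select
    implicit none

    real, allocatable :: array(:), non_nan(:)
    real :: kth_smallest
    integer :: k

    array = [3., 2., 7., 4.]
    array(2) = ieee_value(1.0, ieee_quiet_nan)   ! inject a NaN for the demo

    ! Keep only the entries that are not NaN; array itself is left untouched.
    non_nan = pack(array, .not. ieee_is_nan(array))

    k = 2
    if (k <= size(non_nan)) then
        call select(non_nan, k, kth_smallest)
        print *, kth_smallest   ! 4.0, the second smallest of [3., 7., 4.]
    end if
end program demo_select_skip_nan
```

Note that the rank `k` then refers to the filtered data, so the caller has to decide whether that matches the intended meaning of "k-th smallest".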
101+
102+
`select` was derived from code in the Coretran library by Leon Foks,
103+
https://github.com/leonfoks/coretran. Leon Foks has given permission for the
104+
code here to be released under stdlib's MIT license.
105+
106+
### Example
107+
108+
```fortran
109+
program demo_select
110+
use stdlib_selection, only: select
111+
implicit none
112+
113+
real, allocatable :: array(:)
114+
real :: kth_smallest
115+
integer :: k, left, right
116+
117+
array = [3., 2., 7., 4., 5., 1., 4., -1.]
118+
119+
k = 2
120+
call select(array, k, kth_smallest)
121+
print*, kth_smallest ! print 1.0
122+
123+
k = 7
124+
! Due to the previous call to select, we know for sure this is in an
125+
! index >= 2
126+
call select(array, k, kth_smallest, left=2)
127+
print*, kth_smallest ! print 5.0
128+
129+
k = 6
130+
! Due to the previous two calls to select, we know for sure this is in
131+
! an index >= 2 and <= 7
132+
call select(array, k, kth_smallest, left=2, right=7)
133+
print*, kth_smallest ! print 4.0
134+
135+
end program demo_select
136+
```

## `arg_select` - find the index of the k-th smallest value in an input array

### Status

Experimental

### Description

Returns the index of the k-th smallest value of `array(:)`, and also partially sorts
the index-array `indx(:)` such that `all(array(indx(1:k)) <= array(indx(k)))` and
`all(array(indx(k)) <= array(indx((k+1):size(array))))`.

### Syntax

`call [[stdlib_selection(module):arg_select(interface)]]( array, indx, k, kth_smallest [, left, right ] )`

### Class

Generic subroutine.

### Arguments

`array`: shall be a rank one array of any of the types:
`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`,
`real(sp)`, `real(dp)`, `real(xdp)`, `real(qp)`. It is an `intent(in)` argument. On input it is
the array in which we search for the k-th smallest entry.

`indx`: shall be a rank one array with the same size as `array`, containing all integers
from `1:size(array)` in any order. It is of any of the types:
`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`. It is an
`intent(inout)` argument. On return its elements will define a partial sorting of `array(:)` such that:
`all( array(indx(1:k-1)) <= array(indx(k)) )` and `all(array(indx(k)) <= array(indx(k+1:size(array))))`.

`k`: shall be a scalar with the same type as `indx`. It is an `intent(in)`
argument. We search for the `k`-th smallest entry of `array(:)`.

`kth_smallest`: shall be a scalar with the same type as `indx`. It is an `intent(out)` argument,
and on return it contains the index of the k-th smallest entry of `array(:)`.

`left` (optional): shall be a scalar with the same type as `k`. It is an `intent(in)`
argument. If specified then we assume the k-th smallest value is definitely contained
in `array(indx(left:size(array)))`. If `left` is not present, the default is 1.
This is typically useful if multiple calls to `arg_select` are made, because
the partial sorting of `indx` implies constraints on where we need to search.

`right` (optional): shall be a scalar with the same type as `k`. It is an `intent(in)`
argument. If specified then we assume the k-th smallest value is definitely contained
in `array(indx(1:right))`. If `right` is not present, the default is
`size(array)`. This is typically useful if multiple calls to `arg_select` are
made, because the reordering of `indx` implies constraints on where we need to
search.

### Notes

`arg_select` does not modify `array`, unlike `select`.

The partial sorting of `indx` is not stable, i.e., indices that map to equal
values of `array` may be reordered.

The code does not support `NaN` elements in `array`; it will run, but there is
no consistent interpretation given to the order of `NaN` entries of `array`
compared to other entries.

While it is essential that `indx` contains a permutation of the integers `1:size(array)`,
the code does not check for this. For example if `size(array) == 4`, then we could have
`indx = [4, 2, 1, 3]` or `indx = [1, 2, 3, 4]`, but not `indx = [2, 1, 2, 4]`. It is the user's
responsibility to avoid such errors.
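
Because the permutation requirement is not checked by the library, a caller who is unsure about `indx` could validate it first. The sketch below shows one way to do that; the function `is_permutation` and the program `demo_check_indx` are hypothetical helpers written for this illustration and are not provided by `stdlib_selection`.

```fortran
program demo_check_indx
    ! Illustrative sketch (not part of stdlib): validate indx before
    ! passing it to arg_select.
    implicit none

    print *, is_permutation([4, 2, 1, 3], 4)   ! T
    print *, is_permutation([1, 2, 3, 4], 4)   ! T
    print *, is_permutation([2, 1, 2, 4], 4)   ! F (2 repeated, 3 missing)

contains

    ! Returns .true. if indx holds each integer in 1:n exactly once.
    pure logical function is_permutation(indx, n) result(ok)
        integer, intent(in) :: indx(:)
        integer, intent(in) :: n
        logical :: seen(n)
        integer :: i

        ok = .false.
        if (size(indx) /= n) return
        seen = .false.
        do i = 1, n
            if (indx(i) < 1 .or. indx(i) > n) return   ! out of range
            if (seen(indx(i))) return                  ! duplicate entry
            seen(indx(i)) = .true.
        end do
        ok = .true.
    end function is_permutation

end program demo_check_indx
```

The check itself takes time proportional to `size(indx)`, comparable to the selection, so it is probably best reserved for debugging or for index arrays that come from untrusted input.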

Selection of a single value should have runtime of O(`size(array)`), so it is
asymptotically faster than sorting `array` entirely. The test program at the end of
this document confirms that this is the case.

`arg_select` was derived using code from the Coretran library by Leon Foks,
https://github.com/leonfoks/coretran. Leon Foks has given permission for the
code here to be released under stdlib's MIT license.

### Example

```fortran
program demo_arg_select
    use stdlib_selection, only: arg_select
    implicit none

    real, allocatable :: array(:)
    integer, allocatable :: indx(:)
    integer :: kth_smallest
    integer :: k

    array = [3., 2., 7., 4., 5., 1., 4., -1.]
    indx = [( k, k = 1, size(array) )]

    k = 2
    call arg_select(array, indx, k, kth_smallest)
    print*, array(kth_smallest) ! prints 1.0

    k = 7
    ! Due to the previous call to arg_select, we know for sure this is in an
    ! index >= 2
    call arg_select(array, indx, k, kth_smallest, left=2)
    print*, array(kth_smallest) ! prints 5.0

    k = 6
    ! Due to the previous two calls to arg_select, we know for sure this is in
    ! an index >= 2 and <= 7
    call arg_select(array, indx, k, kth_smallest, left=2, right=7)
    print*, array(kth_smallest) ! prints 4.0

end program demo_arg_select
```

## Comparison with using `sort`

The following program compares the timings of `select` and `arg_select` for
computing the median of an array, vs using `sort` from `stdlib`. In theory we
should see a speed improvement with the selection routines which grows like
log(`size(array)`).

```fortran
program selection_vs_sort
    use stdlib_kinds, only: int64
    use stdlib_selection, only: select, arg_select
    use stdlib_sorting, only: sort
    implicit none

    call compare_select_sort_for_median(1)
    call compare_select_sort_for_median(11)
    call compare_select_sort_for_median(101)
    call compare_select_sort_for_median(1001)
    call compare_select_sort_for_median(10001)
    call compare_select_sort_for_median(100001)

contains

    subroutine compare_select_sort_for_median(N)
        integer, intent(in) :: N

        integer :: i, k, result_arg_select, indx(N), indx_local(N)
        real :: random_vals(N), local_random_vals(N)
        integer, parameter :: test_reps = 100
        integer(int64) :: t0, t1
        real :: result_sort, result_select
        integer(int64) :: time_sort, time_select, time_arg_select
        logical :: select_test_passed, arg_select_test_passed

        ! Ensure N is odd, so that the median is a single array entry
        if (mod(N, 2) /= 1) error stop "N must be odd"

        time_sort = 0
        time_select = 0
        time_arg_select = 0

        select_test_passed = .true.
        arg_select_test_passed = .true.

        indx = [( i, i = 1, N )]

        k = (N+1)/2 ! Deliberate integer division

        do i = 1, test_reps
            call random_number(random_vals)

            ! Compute the median with sorting
            local_random_vals = random_vals
            call system_clock(t0)
            call sort(local_random_vals)
            result_sort = local_random_vals(k)
            call system_clock(t1)
            time_sort = time_sort + (t1 - t0)

            ! Compute the median with selection, assuming N is odd
            local_random_vals = random_vals
            call system_clock(t0)
            call select(local_random_vals, k, result_select)
            call system_clock(t1)
            time_select = time_select + (t1 - t0)

            ! Compute the median with arg_select, assuming N is odd
            local_random_vals = random_vals
            indx_local = indx
            call system_clock(t0)
            call arg_select(local_random_vals, indx_local, k, result_arg_select)
            call system_clock(t1)
            time_arg_select = time_arg_select + (t1 - t0)

            if (result_select /= result_sort) select_test_passed = .false.
            if (local_random_vals(result_arg_select) /= result_sort) arg_select_test_passed = .false.
        end do

        print*, "select ; N=", N, '; ', merge('PASS', 'FAIL', select_test_passed), &
            '; Relative-speedup-vs-sort:', (1.0*time_sort)/(1.0*time_select)
        print*, "arg_select; N=", N, '; ', merge('PASS', 'FAIL', arg_select_test_passed), &
            '; Relative-speedup-vs-sort:', (1.0*time_sort)/(1.0*time_arg_select)

    end subroutine compare_select_sort_for_median

end program selection_vs_sort
```

The results seem consistent with expectations when the `array` is large; the program prints:
```
select ; N= 1 ; PASS; Relative-speedup-vs-sort: 1.90928173
arg_select; N= 1 ; PASS; Relative-speedup-vs-sort: 1.76875830
select ; N= 11 ; PASS; Relative-speedup-vs-sort: 1.14835048
arg_select; N= 11 ; PASS; Relative-speedup-vs-sort: 1.00794709
select ; N= 101 ; PASS; Relative-speedup-vs-sort: 2.31012774
arg_select; N= 101 ; PASS; Relative-speedup-vs-sort: 1.92877376
select ; N= 1001 ; PASS; Relative-speedup-vs-sort: 4.24190664
arg_select; N= 1001 ; PASS; Relative-speedup-vs-sort: 3.54580402
select ; N= 10001 ; PASS; Relative-speedup-vs-sort: 5.61573362
arg_select; N= 10001 ; PASS; Relative-speedup-vs-sort: 4.79348087
select ; N= 100001 ; PASS; Relative-speedup-vs-sort: 7.28823519
arg_select; N= 100001 ; PASS; Relative-speedup-vs-sort: 6.03007460
```

src/CMakeLists.txt (+1 line)

@@ -12,6 +12,7 @@ set(fppFiles
 stdlib_linalg_diag.fypp
 stdlib_linalg_outer_product.fypp
 stdlib_optval.fypp
+stdlib_selection.fypp
 stdlib_sorting.fypp
 stdlib_sorting_ord_sort.fypp
 stdlib_sorting_sort.fypp

src/Makefile.manual (+3 lines)

@@ -13,6 +13,7 @@ SRCFYPP = \
 stdlib_quadrature.fypp \
 stdlib_quadrature_trapz.fypp \
 stdlib_quadrature_simps.fypp \
+stdlib_selection.fypp \
 stdlib_random.fypp \
 stdlib_sorting.fypp \
 stdlib_sorting_ord_sort.fypp \
@@ -105,6 +106,8 @@ stdlib_quadrature_trapz.o: \
 stdlib_quadrature.o \
 stdlib_error.o \
 stdlib_kinds.o
+stdlib_selection.o: \
+stdlib_kinds.o
 stdlib_sorting.o: \
 stdlib_kinds.o \
 stdlib_string_type.o
