|
| 1 | +--- |
| 2 | +title: Selection Procedures |
| 3 | +--- |
| 4 | + |
| 5 | +# The `stdlib_selection` module |
| 6 | + |
| 7 | +[TOC] |
| 8 | + |
| 9 | +## Overview of selection |
| 10 | + |
| 11 | +Suppose you wish to find the value of the k-th smallest entry in an array of size N, or |
| 12 | +the index of that value. While it could be done by sorting the whole array |
| 13 | +using `[[stdlib_sorting(module):sort(interface)]]` or |
| 14 | +`[[stdlib_sorting(module):sort_index(interface)]]` from |
| 15 | +`[[stdlib_sorting(module)]]` and then finding the k-th entry, that would |
| 16 | +require O(N log N) time. However, selection of a single entry can be done in |
| 17 | +O(N) time, which is much faster for large arrays. This is useful, for example, |
| 18 | +to quickly find the median of an array, or some other percentile. |
| 19 | + |
| 20 | +The Fortran Standard Library therefore provides a module, `stdlib_selection`, |
| 21 | +which implements selection algorithms. |
| 22 | + |
| 23 | +## Overview of the module |
| 24 | + |
| 25 | +The module `stdlib_selection` defines two generic subroutines: |
| 26 | +* `select` is used to find the k-th smallest entry of an array. The input |
| 27 | +array is also modified in-place, and on return will be partially sorted |
| 28 | +such that `all(array(1:k) <= array(k))` and `all(array(k) <= array((k+1):size(array)))` are both true. |
| 29 | +The user can optionally specify `left` and `right` indices to constrain the search |
| 30 | +for the k-th smallest value. This can be useful if you have previously called `select` |
| 31 | +to find a smaller or larger rank (that will have led to partial sorting of |
| 32 | +`array`, thus implying some constraints on the location). |
| 33 | + |
| 34 | +* `arg_select` is used to find the index of the k-th smallest entry of an array. |
| 35 | +In this case the input array is not modified, but the user must provide an |
| 36 | +input index array with the same size as `array`, having indices that are a permutation of |
| 37 | +`1:size(array)`, which is modified instead. On return the index array is modified |
| 38 | +such that `all(array(index(1:k)) <= array(index(k)))` and `all(array(index(k)) <= array(index(k+1:size(array))))`. |
| 39 | +The user can optionally specify `left` and `right` indices to constrain the search |
| 40 | +for the k-th smallest value. This can be useful if you have previously called `arg_select` |
| 41 | +to find a smaller or larger rank (that will have led to partial sorting of |
| 42 | +`index`, thus implying some constraints on the location). |
| 43 | + |
| 44 | + |
| 45 | +## `select` - find the k-th smallest value in an input array |
| 46 | + |
| 47 | +### Status |
| 48 | + |
| 49 | +Experimental |
| 50 | + |
| 51 | +### Description |
| 52 | + |
| 53 | +Returns the k-th smallest value of `array(:)`, and also partially sorts `array(:)` |
| 54 | +such that `all(array(1:k) <= array(k))` and `all(array(k) <= array((k+1):size(array)))`. |
| 55 | + |
| 56 | +### Syntax |
| 57 | + |
| 58 | +`call [[stdlib_selection(module):select(interface)]]( array, k, kth_smallest [, left, right ] )` |
| 59 | + |
| 60 | +### Class |
| 61 | + |
| 62 | +Generic subroutine. |
| 63 | + |
| 64 | +### Arguments |
| 65 | + |
| 66 | +`array` : shall be a rank one array of any of the types: |
| 67 | +`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`, |
| 68 | +`real(sp)`, `real(dp)`, `real(xdp)`, `real(qp)`. It is an `intent(inout)` argument. |
| 69 | + |
| 70 | +`k`: shall be a scalar with any of the types: |
| 71 | +`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`. It |
| 72 | +is an `intent(in)` argument. We search for the `k`-th smallest entry of `array(:)`. |
| 73 | + |
| 74 | +`kth_smallest`: shall be a scalar with the same type as `array`. It is an |
| 75 | +`intent(out)` argument. On return it contains the k-th smallest entry of |
| 76 | +`array(:)`. |
| 77 | + |
| 78 | +`left` (optional): shall be a scalar with the same type as `k`. It is an |
| 79 | +`intent(in)` argument. If specified then we assume the k-th smallest value is |
| 80 | +definitely contained in `array(left:size(array))`. If `left` is not present, |
| 81 | +the default is 1. This is typically useful if multiple calls to `select` are |
| 82 | +made, because the partial sorting of `array` implies constraints on where we |
| 83 | +need to search. |
| 84 | + |
| 85 | +`right` (optional): shall be a scalar with the same type as `k`. It is an |
| 86 | +`intent(in)` argument. If specified then we assume the k-th smallest value is |
| 87 | +definitely contained in `array(1:right)`. If `right` is not present, the |
| 88 | +default is `size(array)`. This is typically useful if multiple calls to |
| 89 | +`select` are made, because the partial sorting of `array` implies constraints |
| 90 | +on where we need to search. |
| 91 | + |
| 92 | +### Notes |
| 93 | + |
| 94 | +Selection of a single value should have runtime of O(`size(array)`), so it is |
| 95 | +asymptotically faster than sorting `array` entirely. The test program at the |
| 96 | +end of this document shows that is the case. |
| 97 | + |
| 98 | +The code does not support `NaN` elements in `array`; it will run, but there is |
| 99 | +no consistent interpretation given to the order of `NaN` entries of `array` |
| 100 | +compared to other entries. |
| 101 | + |
| 102 | +`select` was derived from code in the Coretran library by Leon Foks, |
| 103 | +https://github.com/leonfoks/coretran. Leon Foks has given permission for the |
| 104 | +code here to be released under stdlib's MIT license. |
| 105 | + |
| 106 | +### Example |
| 107 | + |
| 108 | +```fortran |
| 109 | +program demo_select |
| 110 | + use stdlib_selection, only: select |
| 111 | + implicit none |
| 112 | +
|
| 113 | + real, allocatable :: array(:) |
| 114 | + real :: kth_smallest |
| 115 | + integer :: k, left, right |
| 116 | +
|
| 117 | + array = [3., 2., 7., 4., 5., 1., 4., -1.] |
| 118 | +
|
| 119 | + k = 2 |
| 120 | + call select(array, k, kth_smallest) |
| 121 | + print*, kth_smallest ! print 1.0 |
| 122 | +
|
| 123 | + k = 7 |
| 124 | + ! Due to the previous call to select, we know for sure this is in an |
| 125 | + ! index >= 2 |
| 126 | + call select(array, k, kth_smallest, left=2) |
| 127 | + print*, kth_smallest ! print 5.0 |
| 128 | +
|
| 129 | + k = 6 |
| 130 | + ! Due to the previous two calls to select, we know for sure this is in |
| 131 | + ! an index >= 2 and <= 7 |
| 132 | + call select(array, k, kth_smallest, left=2, right=7) |
| 133 | + print*, kth_smallest ! print 4.0 |
| 134 | +
|
| 135 | +end program demo_select |
| 136 | +``` |
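
As a further illustration, the sketch below combines two calls to `select` to compute the median of an array with an even number of entries. It assumes the common convention that such a median is the mean of the two middle values; the program name `demo_median_even` and the variables `lower` and `upper` are hypothetical and chosen only for this example. The second call passes `left=n/2`, in the same conservative way as the example above, because the first call has already placed every smaller entry at a lower index.

```fortran
program demo_median_even
  use stdlib_selection, only: select
  implicit none

  real, allocatable :: array(:)
  real :: lower, upper   ! the two middle values (illustrative names)
  integer :: n

  array = [3., 2., 7., 4., 5., 1., 4., -1.]
  n = size(array)        ! n = 8, an even size

  ! Lower middle value: the (n/2)-th smallest entry
  call select(array, n/2, lower)

  ! Upper middle value: the (n/2 + 1)-th smallest entry. After the call
  ! above it cannot lie at an index below n/2, so the search is restricted.
  call select(array, n/2 + 1, upper, left=n/2)

  ! Assumed convention: median of an even-sized array is the mean of the
  ! two middle values (here 3.0 and 4.0, giving 3.5)
  print *, 0.5*(lower + upper)
end program demo_median_even
```

For an odd-sized array a single call with `k = (n+1)/2` is enough, as in the timing program at the end of this document.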
| 137 | + |
| 138 | +## `arg_select` - find the index of the k-th smallest value in an input array |
| 139 | + |
| 140 | +### Status |
| 141 | + |
| 142 | +Experimental |
| 143 | + |
| 144 | +### Description |
| 145 | + |
| 146 | +Returns the index of the k-th smallest value of `array(:)`, and also partially sorts |
| 147 | +the index-array `indx(:)` such that `all(array(indx(1:k)) <= array(indx(k)))` and |
| 148 | +`all(array(indx(k)) <= array(indx((k+1):size(array))))`. |
| 149 | + |
| 150 | +### Syntax |
| 151 | + |
| 152 | +`call [[stdlib_selection(module):arg_select(interface)]]( array, indx, k, kth_smallest [, left, right ] )` |
| 153 | + |
| 154 | +### Class |
| 155 | + |
| 156 | +Generic subroutine. |
| 157 | + |
| 158 | +### Arguments |
| 159 | + |
| 160 | +`array` : shall be a rank one array of any of the types: |
| 161 | +`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`, |
| 162 | +`real(sp)`, `real(dp)`, `real(xdp)`, `real(qp)`. It is an `intent(in)` argument. On input it is |
| 163 | +the array in which we search for the k-th smallest entry. |
| 164 | + |
| 165 | +`indx`: shall be a rank one array with the same size as `array`, containing all integers |
| 166 | +from `1:size(array)` in any order. It is of any of the types: |
| 167 | +`integer(int8)`, `integer(int16)`, `integer(int32)`, `integer(int64)`. It is an |
| 168 | +`intent(inout)` argument. On return its elements will define a partial sorting of `array(:)` such that: |
| 169 | + `all( array(indx(1:k-1)) <= array(indx(k)) )` and `all(array(indx(k)) <= array(indx(k+1:size(array))))`. |
| 170 | + |
| 171 | +`k`: shall be a scalar with the same type as `indx`. It is an `intent(in)` |
| 172 | +argument. We search for the `k`-th smallest entry of `array(:)`. |
| 173 | + |
| 174 | +`kth_smallest`: shall be a scalar with the same type as `indx`. It is an `intent(out)` argument, |
| 175 | +and on return it contains the index of the k-th smallest entry of `array(:)`. |
| 176 | + |
| 177 | +`left` (optional): shall be a scalar with the same type as `k`. It is an `intent(in)` |
| 178 | +argument. If specified then we assume the k-th smallest value is definitely contained |
| 179 | +in `array(indx(left:size(array)))`. If `left` is not present, the default is 1. |
| 180 | +This is typically useful if multiple calls to `arg_select` are made, because |
| 181 | +the partial sorting of `indx` implies constraints on where we need to search. |
| 182 | + |
| 183 | +`right` (optional): shall be a scalar with the same type as `k`. It is an `intent(in)` |
| 184 | +argument. If specified then we assume the k-th smallest value is definitely contained |
| 185 | +in `array(indx(1:right))`. If `right` is not present, the default is |
| 186 | +`size(array)`. This is typically useful if multiple calls to `arg_select` are |
| 187 | +made, because the reordering of `indx` implies constraints on where we need to |
| 188 | +search. |
| 189 | + |
| 190 | +### Notes |
| 191 | + |
| 192 | +`arg_select` does not modify `array`, unlike `select`. |
| 193 | + |
| 194 | +The partial sorting of `indx` is not stable, i.e., indices that map to equal |
| 195 | +values of `array` may be reordered. |
| 196 | + |
| 197 | +The code does not support `NaN` elements in `array`; it will run, but there is |
| 198 | +no consistent interpretation given to the order of `NaN` entries of `array` |
| 199 | +compared to other entries. |
| 200 | + |
| 201 | +While it is essential that `indx` contains a permutation of the integers `1:size(array)`, |
| 202 | +the code does not check for this. For example if `size(array) == 4`, then we could have |
| 203 | +`indx = [4, 2, 1, 3]` or `indx = [1, 2, 3, 4]`, but not `indx = [2, 1, 2, 4]`. It is the user's |
| 204 | +responsibility to avoid such errors. |
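
Since the check is left to the caller, one option is to validate `indx` before calling `arg_select`. The following is only a sketch: `is_permutation` is a hypothetical helper written for this note, not part of `stdlib_selection`, and it assumes default-kind integer indices. It marks each value as seen and rejects any entry that is out of range or repeated.

```fortran
program demo_check_indx
  implicit none

  print *, is_permutation([4, 2, 1, 3])   ! T: valid permutation of 1:4
  print *, is_permutation([2, 1, 2, 4])   ! F: 2 is repeated, 3 is missing

contains

  ! Hypothetical helper (not provided by stdlib): returns .true. only if
  ! indx contains every integer in 1:size(indx) exactly once.
  pure function is_permutation(indx) result(ok)
    integer, intent(in) :: indx(:)
    logical :: ok
    logical :: seen(size(indx))
    integer :: i

    seen = .false.
    ok = .true.
    do i = 1, size(indx)
      ! Range check first, so that seen() is never indexed out of bounds
      if (indx(i) < 1 .or. indx(i) > size(indx)) then
        ok = .false.
        return
      end if
      ! A value that was already seen means a duplicate
      if (seen(indx(i))) then
        ok = .false.
        return
      end if
      seen(indx(i)) = .true.
    end do
  end function is_permutation

end program demo_check_indx
```

The check costs O(`size(indx)`) time and memory, so it does not change the asymptotic cost of a selection, but it is probably best reserved for debugging or for inputs that come from outside the program.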
| 205 | + |
| 206 | +Selection of a single value should have runtime of O(`size(array)`), so it is |
| 207 | +asymptotically faster than sorting `array` entirely. The test program at the end of |
| 208 | +this document confirms that this is the case. |
| 209 | + |
| 210 | +`arg_select` was derived using code from the Coretran library by Leon Foks, |
| 211 | +https://github.com/leonfoks/coretran. Leon Foks has given permission for the |
| 212 | +code here to be released under stdlib's MIT license. |
| 213 | + |
| 214 | +### Example |
| 215 | + |
| 216 | + |
| 217 | +```fortran |
| 218 | +program demo_arg_select |
| 219 | + use stdlib_selection, only: arg_select |
| 220 | + implicit none |
| 221 | +
|
| 222 | + real, allocatable :: array(:) |
| 223 | + integer, allocatable :: indx(:) |
| 224 | + integer :: kth_smallest |
| 225 | + integer :: k, left, right |
| 226 | +
|
| 227 | + array = [3., 2., 7., 4., 5., 1., 4., -1.] |
| 228 | + indx = [( k, k = 1, size(array) )] |
| 229 | +
|
| 230 | + k = 2 |
| 231 | + call arg_select(array, indx, k, kth_smallest) |
| 232 | + print*, array(kth_smallest) ! print 1.0 |
| 233 | +
|
| 234 | + k = 7 |
| 235 | + ! Due to the previous call to arg_select, we know for sure this is in an |
| 236 | + ! index >= 2 |
| 237 | + call arg_select(array, indx, k, kth_smallest, left=2) |
| 238 | + print*, array(kth_smallest) ! print 5.0 |
| 239 | +
|
| 240 | + k = 6 |
| 241 | + ! Due to the previous two calls to arg_select, we know for sure this is in |
| 242 | + ! an index >= 2 and <= 7 |
| 243 | + call arg_select(array, indx, k, kth_smallest, left=2, right=7) |
| 244 | + print*, array(kth_smallest) ! print 4.0 |
| 245 | +
|
| 246 | +end program demo_arg_select |
| 247 | +``` |
| 248 | + |
| 249 | +## Comparison with using `sort` |
| 250 | + |
| 251 | +The following program compares the timings of `select` and `arg_select` for |
| 252 | +computing the median of an array, vs using `sort` from `stdlib`. In theory, the |
| 253 | +speedup of the selection routines relative to sorting should grow like |
| 254 | +log(size(`array`)). |
| 255 | + |
| 256 | +```fortran |
| 257 | +program selection_vs_sort |
| 258 | + use stdlib_kinds, only: dp, sp, int64 |
| 259 | + use stdlib_selection, only: select, arg_select |
| 260 | + use stdlib_sorting, only: sort |
| 261 | + implicit none |
| 262 | +
|
| 263 | + call compare_select_sort_for_median(1) |
| 264 | + call compare_select_sort_for_median(11) |
| 265 | + call compare_select_sort_for_median(101) |
| 266 | + call compare_select_sort_for_median(1001) |
| 267 | + call compare_select_sort_for_median(10001) |
| 268 | + call compare_select_sort_for_median(100001) |
| 269 | +
|
| 270 | + contains |
| 271 | + subroutine compare_select_sort_for_median(N) |
| 272 | + integer, intent(in) :: N |
| 273 | +
|
| 274 | + integer :: i, k, result_arg_select, indx(N), indx_local(N) |
| 275 | + real :: random_vals(N), local_random_vals(N) |
| 276 | + integer, parameter :: test_reps = 100 |
| 277 | + integer(int64) :: t0, t1 |
| 278 | + real :: result_sort, result_select |
| 279 | + integer(int64) :: time_sort, time_select, time_arg_select |
| 280 | + logical :: select_test_passed, arg_select_test_passed |
| 281 | +
|
| 282 | + ! Ensure N is odd |
| 283 | + if(mod(N, 2) /= 1) stop |
| 284 | +
|
| 285 | + time_sort = 0 |
| 286 | + time_select = 0 |
| 287 | + time_arg_select = 0 |
| 288 | +
|
| 289 | + select_test_passed = .true. |
| 290 | + arg_select_test_passed = .true. |
| 291 | +
|
| 292 | + indx = (/( i, i = 1, N) /) |
| 293 | +
|
| 294 | + k = (N+1)/2 ! Deliberate integer division |
| 295 | +
|
| 296 | + do i = 1, test_reps |
| 297 | + call random_number(random_vals) |
| 298 | +
|
| 299 | + ! Compute the median with sorting |
| 300 | + local_random_vals = random_vals |
| 301 | + call system_clock(t0) |
| 302 | + call sort(local_random_vals) |
| 303 | + result_sort = local_random_vals(k) |
| 304 | + call system_clock(t1) |
| 305 | + time_sort = time_sort + (t1 - t0) |
| 306 | +
|
| 307 | + ! Compute the median with selection, assuming N is odd |
| 308 | + local_random_vals = random_vals |
| 309 | + call system_clock(t0) |
| 310 | + call select(local_random_vals, k, result_select) |
| 311 | + call system_clock(t1) |
| 312 | + time_select = time_select + (t1 - t0) |
| 313 | +
|
| 314 | + ! Compute the median with arg_select, assuming N is odd |
| 315 | + local_random_vals = random_vals |
| 316 | + indx_local = indx |
| 317 | + call system_clock(t0) |
| 318 | + call arg_select(local_random_vals, indx_local, k, result_arg_select) |
| 319 | + call system_clock(t1) |
| 320 | + time_arg_select = time_arg_select + (t1 - t0) |
| 321 | +
|
| 322 | + if(result_select /= result_sort) select_test_passed = .FALSE. |
| 323 | + if(local_random_vals(result_arg_select) /= result_sort) arg_select_test_passed = .FALSE. |
| 324 | + end do |
| 325 | +
|
| 326 | + print*, "select ; N=", N, '; ', merge('PASS', 'FAIL', select_test_passed), & |
| 327 | + '; Relative-speedup-vs-sort:', (1.0*time_sort)/(1.0*time_select) |
| 328 | + print*, "arg_select; N=", N, '; ', merge('PASS', 'FAIL', arg_select_test_passed), & |
| 329 | + '; Relative-speedup-vs-sort:', (1.0*time_sort)/(1.0*time_arg_select) |
| 330 | +
|
| 331 | + end subroutine |
| 332 | +
|
| 333 | +end program |
| 334 | +``` |
| 335 | + |
| 336 | +The results seem consistent with expectations when the `array` is large; the program prints: |
| 337 | +``` |
| 338 | + select ; N= 1 ; PASS; Relative-speedup-vs-sort: 1.90928173 |
| 339 | + arg_select; N= 1 ; PASS; Relative-speedup-vs-sort: 1.76875830 |
| 340 | + select ; N= 11 ; PASS; Relative-speedup-vs-sort: 1.14835048 |
| 341 | + arg_select; N= 11 ; PASS; Relative-speedup-vs-sort: 1.00794709 |
| 342 | + select ; N= 101 ; PASS; Relative-speedup-vs-sort: 2.31012774 |
| 343 | + arg_select; N= 101 ; PASS; Relative-speedup-vs-sort: 1.92877376 |
| 344 | + select ; N= 1001 ; PASS; Relative-speedup-vs-sort: 4.24190664 |
| 345 | + arg_select; N= 1001 ; PASS; Relative-speedup-vs-sort: 3.54580402 |
| 346 | + select ; N= 10001 ; PASS; Relative-speedup-vs-sort: 5.61573362 |
| 347 | + arg_select; N= 10001 ; PASS; Relative-speedup-vs-sort: 4.79348087 |
| 348 | + select ; N= 100001 ; PASS; Relative-speedup-vs-sort: 7.28823519 |
| 349 | + arg_select; N= 100001 ; PASS; Relative-speedup-vs-sort: 6.03007460 |
| 350 | +``` |