-
Notifications
You must be signed in to change notification settings - Fork 71
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #489 from chrisarg/main
chrisarg article
- Loading branch information
Showing
1 changed file
with
393 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,393 @@ | ||
Author: Christos Argyropoulos | ||
Title: Merry Inline C(hristmas) | ||
Topic: Inline::C | ||
|
||
=encoding utf8 | ||
|
||
=head2 Dancing through the Cnow | ||
|
||
It is this time of the year, the time you have finally some time to catch up finishing the | ||
projects you were supposed to do the previous months. A blizzard of C code, like the arrows in | ||
the movie 300 will descent upon you if you do, but you will end up in the naughty list of | ||
researchers if you don't; for who will analyze your data if YOU drop the ball? | ||
But fear not, because L<Inline::C> will save the day and Perl will make the glaCiers melt away. | ||
|
||
For the last couple of years, I have been using L<Inline::C> to leverage a large amount of C code | ||
related to biological sequence (text) analysis for my research work. Our group has been using portable | ||
sequencing technologies by L<Oxford Nanopore|https://nanoporetech.com/products/sequence/flongle> to measure | ||
RNA molecules in real-time as markers of kidney disease progression. The problem we are facing is that | ||
understanding the data is not trivial, and many traditional bioinformatic workflows need considerable | ||
adaptation to work. Often, one has to combine exotic pieces of code that is available in C libraries into | ||
complex workflows that are non standard and require one to use the power of Perl to glue them together. | ||
Let's C how L<Inline::C> can help us with the mess. | ||
|
||
=head2 To A or not to A? | ||
|
||
The molecules I am interested at have a tail of A's at the end of the sequence, e.g. something line | ||
ACTGCCATCAAGAAAAAAAAAAAAA , but the A's are irrelevant to the analysis I want to do. Often there are | ||
errors in the text, so that the sequence is not perfect, e.g. one is looking at something like this | ||
ACTGCCATCAAGAAAAAAAAAAAAAAGTAACAAAA. The question is, how can I remove the A's at the end of the sequence | ||
knowing that the error exists? There is a Python program L<cutadapt|https://cutadapt.readthedocs.io/en/stable/>, | ||
for this task, but note why I don't want to use it for my tasks: | ||
|
||
=for :list | ||
* I will have to fire off another process, which is slow | ||
* I will have to use the hard disk and pipes for IPC (slow) | ||
* My downstream data analyses take place in Perl and C, so I will have to convert the data back and forth | ||
|
||
Let's C how we can filter the noisy A's in Perl and then how we can add performance with L<Inline::C>. | ||
|
||
=head3 Regex for noisy A's | ||
|
||
The general idea here is to use a regex that puts an upper limit to the proportion of errors in the A tails | ||
of the sequence. For example something like this: | ||
|
||
=begin perl | ||
|
||
my $polyA_min_25_pct_A = qr/ | ||
( ## match a poly A tail which is | ||
## delimited at its 5' by *at least* one A | ||
A{1,} | ||
## followed by the tail proper which has a | ||
(?: ## minimum composition of 25% A, i.e. | ||
## we are looking for snippets with | ||
(?: | ||
## up to 3 CTGs followed by at | ||
## least one A | ||
[CTG]{0,3}A{1,} | ||
) | ||
| ## OR | ||
(?: | ||
## at least one A followed by | ||
## up to 3 CTGs | ||
A{1,}[CTG]{0,3} | ||
) | ||
)+ ## extend as much as possible | ||
)\z/xp; | ||
|
||
# and then use it like this: | ||
my $s = "ACTGCCATCAAGAAAAAAAAAAAAAAGTAACAAAA"; | ||
$s =~ m/$$polyA_min_25_pct_A/; | ||
my $best_index = length $1; | ||
|
||
=end perl | ||
|
||
when one runs the code above, the C<$best_index> will be 29, i.e. the regex filtered out | ||
everything after: ACTGCC | ||
|
||
The cutadapt algorithm does not use a regex, but a simple scoring system to decide when to | ||
stop adding letters to the inferred tail. In particular, the algorithm considers all | ||
possible suffixes in the sequence of interest, and after filtering those that have more than | ||
20% non-A letters, returns the position of the suffix with the largest score as the beginning | ||
of the tail. The algorithm in Perl is shown below: | ||
|
||
=begin perl | ||
|
||
sub perl_cutadapt { | ||
my $s = shift; | ||
my $n = length $s; | ||
my $best_index = $n; | ||
my $best_score = my $score = 0; | ||
foreach my $i ( reverse( 0 .. $n - 1 ) ) { | ||
my $nuc = substr $s, $i, 1; | ||
$score += $nuc eq 'A' ? +1 : -2; | ||
if ( $score > $best_score ) { | ||
$best_index = $i; | ||
$best_score = $score; | ||
} | ||
} | ||
$best_index = $n - $best_index; | ||
if ( $best_score < 0.4 * ( $best_index + 1 ) ) { | ||
$best_index = $n; | ||
} | ||
return $best_index; | ||
} | ||
|
||
=end perl | ||
|
||
The A part is deemed to be 23 letters long and the non-A part is inferred to be ACTGCCATCAAG | ||
|
||
=head3 Inline::C for performance | ||
|
||
The algorithm as implemented is not very fast, and for large or many sequences it can be very | ||
slow. But as we will C, we can use L<Inline::C> to speed things up. The C code is shown below: | ||
|
||
=begin perl | ||
|
||
|
||
use Inline ( | ||
C => 'DATA', | ||
); | ||
say _cutadapt_in_C($s); | ||
|
||
__DATA__; | ||
__C__ | ||
|
||
|
||
#include <stdlib.h> | ||
#include <string.h> | ||
#include<stdio.h> | ||
#include <math.h> | ||
|
||
int _cutadapt_in_C(char *s) { | ||
int n = strlen(s); | ||
int best_index = n; | ||
int best_score = 0; | ||
int score = 0; | ||
for (int i = n - 1; i >= 0; i--) { | ||
char nuc = s[i]; | ||
if (nuc == 'A') { | ||
score += 1; | ||
} | ||
else { | ||
score -= 2; | ||
} | ||
if (score > best_score) { | ||
best_index = i; | ||
best_score = score; | ||
} | ||
} | ||
best_index = (best_score < -0.4 * (best_index + 1)) ? n : n - best_index; | ||
return best_index; | ||
|
||
} | ||
|
||
=end perl | ||
|
||
But how fast is fast? Let's run a benchmark using different sequence lengths | ||
and fixing the length of the A tail to be 20% of the sequence length. The | ||
results (mean and standard deviation in microseconds over 2000 repetitions | ||
for each length) are shown below (the benchmarking code may be found in the /scripts directory of | ||
L<Bio::SeqAlignment::Examples::TailingPolyester>): | ||
|
||
| Algorithm | Language | Target Sequence Length | | | | | ||
|--------------|----------|-----------------------|--------------------|-------------------|-----------------| | ||
| | | 100 | 1000 | 2000 | 10000 | | ||
|--------------|----------|-----------------------|--------------------|-------------------|-----------------| | ||
| cutadapt | Perl | 16.0±2.5 | 150.0±11.1 | 310.0±21.0 | 1500.0±88.0 | | ||
| regex | Perl | 26.0±10.0 | 310.0±26.0 | 620.0±42.0 | 3200.0±140.0 | | ||
| cutadaptC | Perl/C | 0.6±1.0 | 3.1±1.1 | 6.0±1.4 | 28.0±4.3 | | ||
|
||
A nice 30-50x speedup for the C code over the Perl code. The savings are real, considering that | ||
a typical long RNA-seq experiment may have 10^6 - 10^7 reads, and each read may have a length of 1000 bases. | ||
|
||
=head2 Making memories this C(hristmas) | ||
|
||
The C code is not very complex, but it is a good example of how one can use L<Inline::C> to speed up | ||
tasks. But the module can help with more than that. For example, one can use it to interface with | ||
other foreign code, by making and managing shared memory regions. Consider an example, | ||
in which we hijack the C<Newxz> and C<Safefree> functions from the Perl API to allocate and free | ||
memory areans to make C<$memory>. Such a variable is effectively a pointer to a memory arena, and | ||
we can use it to store and retrieve data from it. Suppose that one had a library that took such an | ||
arena as input and filled it with data. Then the arena could dance with any other library that | ||
expected a pointer to a memory arena. The library could be written in C, or Assembly. For example, | ||
this is how one can sum lots and lots of random numbers using either C or Assembly, under the | ||
loving embrace of Perl working with L<Inline::C>: | ||
|
||
=begin perl | ||
|
||
use Inline ( | ||
C => 'DATA', | ||
); | ||
use Inline ( | ||
C => 'DATA', | ||
); | ||
my $number_of_doubles = 2_000_000; | ||
my $memory = alloc_with_Newxz($number_of_doubles*8); | ||
generate_random_double_array($memory, $number_of_doubles); | ||
|
||
use Benchmark qw(:all); | ||
cmpthese( | ||
-1, | ||
{ | ||
'C' => sub { sum_array_C($memory, $number_of_doubles) }, | ||
'ASM' => sub { sum_array_doubles($memory, $number_of_doubles) }, | ||
'ASM_AVX' => sub { sum_array_doubles_AVX_unaligned($memory, $number_of_doubles) }, | ||
} | ||
); | ||
|
||
free_with_Safefree($memory); | ||
|
||
|
||
use Inline | ||
ASM => 'DATA', | ||
AS => 'nasm', | ||
ASFLAGS => '-f elf64', | ||
PROTO => { | ||
sum_array_doubles=> 'double(void *,size_t)', | ||
sum_array_doubles_AVX_unaligned => 'double(void *,size_t)', | ||
|
||
}; | ||
|
||
|
||
__DATA__; | ||
__C__ | ||
|
||
|
||
#include <stdlib.h> | ||
#include <string.h> | ||
#include<stdio.h> | ||
#include <math.h> | ||
|
||
|
||
int _cutadapt_in_C(char *s) { | ||
int n = strlen(s); | ||
int best_index = n; | ||
int best_score = 0; | ||
int score = 0; | ||
for (int i = n - 1; i >= 0; i--) { | ||
char nuc = s[i]; | ||
if (nuc == 'A') { | ||
score += 1; | ||
} | ||
else { | ||
score -= 2; | ||
} | ||
if (score > best_score) { | ||
best_index = i; | ||
best_score = score; | ||
} | ||
} | ||
best_index = (best_score < -0.4 * (best_index + 1)) ? n : n - best_index; | ||
return best_index; | ||
|
||
} | ||
|
||
#define IsSVValidPtr(sv) do { \ | ||
if (!SvOK((sv))) { \ | ||
croak("Pointer is not defined"); \ | ||
} \ | ||
if (!SvIOK((sv))) { \ | ||
croak("Pointer does not contain an integer"); \ | ||
} \ | ||
IV value = SvIV((sv)); \ | ||
if (value <= 0) { \ | ||
croak("Pointer is negative or zero"); \ | ||
} \ | ||
} while(0) | ||
|
||
#define SetTypedPtr(ptr,sv, type) type *ptr; \ | ||
ptr = (type *) SvIV((sv)) | ||
|
||
|
||
void generate_random_double_array(SV *sv, size_t num_elements) { | ||
IsSVValidPtr(sv); | ||
SetTypedPtr(array, sv, double); | ||
for (size_t i = 0; i < num_elements; ++i) { | ||
array[i] = ((double)rand() / RAND_MAX) * 10.0 - 5.0; | ||
} | ||
} | ||
|
||
double sum_array_C(SV *sv, size_t length) { | ||
IsSVValidPtr(sv); | ||
double sum = 0.0; | ||
SetTypedPtr(array, sv, double); | ||
for (size_t i = 0; i < length; i++) { | ||
sum += array[i]; | ||
} | ||
return sum; | ||
} | ||
|
||
|
||
|
||
// get a buffer | ||
SV* alloc_with_Newxz(size_t length) { | ||
char* array ; | ||
Newxz(array, length, char); | ||
return newSVuv(PTR2UV(array)); | ||
} | ||
|
||
void free_with_Safefree(size_t address) { | ||
void* buffer = (void*)address; | ||
Safefree(buffer); | ||
} | ||
|
||
|
||
__ASM__ | ||
NSE equ 4 ; number of SIMD double elements per iteration | ||
DOUBLE equ 8 ; number of bytes per double | ||
; Use RIP-relative memory addressing | ||
default rel | ||
|
||
; Mark stack as non-executable for Binutils 2.39+ | ||
section .note.GNU-stack noalloc noexec nowrite progbits | ||
|
||
SECTION .text | ||
|
||
global sum_array_doubles | ||
sum_array_doubles: ; based on Kusswurm listing 5-7c | ||
; Initialize | ||
vxorpd xmm0, xmm0, xmm0 ; sum = 0.0 | ||
sub rdi, DOUBLE ; rdi = &array[-1] | ||
|
||
Loop1: | ||
add rdi, DOUBLE | ||
vaddsd xmm0, xmm0, qword [rdi] | ||
sub rsi, 1 | ||
jnz Loop1 | ||
ret | ||
|
||
|
||
global sum_array_doubles_AVX_unaligned | ||
sum_array_doubles_AVX_unaligned: ; based on Kusswurm listing 9-4d | ||
vxorpd ymm0, ymm0, ymm0 ; sum = 0.0 | ||
|
||
; i = 0 in the comments of this block | ||
lea r10,[rdi - DOUBLE] ; r10 = &array[i-1] | ||
cmp rsi, NSE ; check if we have at least NSE elements | ||
jb Remainder_AVX ; if not, jump to remainder | ||
lea r10, [rdi-NSE * DOUBLE] ; r10 = &array[i-NSE] | ||
|
||
|
||
Loop1_AVX: | ||
add r10, DOUBLE * NSE ; r10 = &array[i] | ||
vaddpd ymm0, ymm0, [r10] ; sum += array[i] | ||
sub rsi, NSE ; decrement the counter | ||
cmp rsi, NSE ; check if we have at least NSE elements | ||
jae Loop1_AVX ; if so, loop again | ||
|
||
; Reduce packed sum using SIMD addition | ||
vextractf128 xmm1, ymm0, 1 ; extract the high 128 bits | ||
vaddpd xmm2, xmm1, xmm0 ; sum += high 128 bits | ||
vhaddpd xmm0, xmm2, xmm2 ; sum += low 128 bits | ||
test rsi, rsi ; check if we have any elements left | ||
jz End_AVX ; if not, jump to the end | ||
|
||
add r10, DOUBLE * NSE - DOUBLE ; r10 = &array[i-1] | ||
|
||
|
||
; Handle the remaining elements | ||
Remainder_AVX: | ||
add r10, DOUBLE | ||
vaddsd xmm0, xmm0, qword [r10] | ||
sub rsi, 1 | ||
jnz Remainder_AVX | ||
|
||
End_AVX: | ||
;vmovsd xmm0, xmm5 | ||
ret | ||
|
||
=end perl | ||
|
||
In this example we make 2 million of doubles, fill them up with random numbers | ||
and then benchmark their sum them up in either C or Assembly. For the latter | ||
we ccan use your grandfather's era Assembly or bring to the table a vectorized | ||
version that uses SIMD instructions (in this case AVX extensions). In my | ||
old Xeon, this is what I get: | ||
|
||
Rate ASM C ASM_AVX | ||
ASM 565/s -- -0% -68% | ||
C 565/s 0% -- -68% | ||
ASM_AVX 1778/s 215% 215% -- | ||
|
||
In my applications, I use this trick to interface with vectorized, hand optimized | ||
Assembly code, when intrinsics fail to deliver performance in C. But even in such cases, | ||
the memory management is done by Perl through L<Inline::C>. | ||
|
||
=head2 Conclusions | ||
|
||
Give your self a present this C(hristmas) and learn how to use L<Inline::C> to speed things up. | ||
And if making memories seems too much, fear not, the L<Task::MemManager> module that I wrote up, | ||
will cut you some slaCk. | ||
|
||
=cut |