Merge pull request #507 from perladvent/publish/2024-12-16
2024-12-16
oalders authored Dec 15, 2024
2 parents 720f41a + 770768f commit 8960904
Showing 1 changed file with 46 additions and 43 deletions.
@@ -4,30 +4,30 @@ Topic: Inline::C

=encoding utf8

=head2 Dancing through the Cnow

It is that time of the year, the time when you finally have some time to catch up on the
projects you were supposed to finish in the previous months. A blizzard of C code, like the arrows in
the movie 300, will descend upon you if you do, but you will end up on the naughty list of
researchers if you don't; for who will analyze your data if YOU drop the ball?
But fear not, because L<Inline::C> will save the day and Perl will make the glaCiers melt away.

For the last couple of years, I have been using L<Inline::C> to leverage a large amount of C code
related to biological sequence (text) analysis for my research work. Our group has been using portable
sequencing technologies by L<Oxford Nanopore|https://nanoporetech.com/products/sequence/flongle> to measure
RNA molecules in real-time as markers of kidney disease progression. The problem we are facing is that
understanding the data is not trivial, and many traditional bioinformatic workflows need considerable
adaptation to work. Often, one has to combine exotic pieces of code that are available in C libraries into
complex workflows that are non-standard and require one to use the power of Perl to glue them together.
Let's C how L<Inline::C> can help us with the mess.

=head2 To A or not to A?

The molecules I am interested in have a tail of A's at the end of the sequence, e.g. something like
ACTGCCATCAAGAAAAAAAAAAAAA, but the A's are irrelevant to the analysis I want to do. Often there are
errors in the text, so that the sequence is not perfect, e.g. one is looking at something like this:
ACTGCCATCAAGAAAAAAAAAAAAAAGTAACAAAA. The question is, how can I remove the A's at the end of the sequence
knowing that the error exists? There is a Python program, L<cutadapt|https://cutadapt.readthedocs.io/en/stable/>,
for this task, but here is why I don't want to use it for my work:

=for :list
@@ -45,20 +45,20 @@ of the sequence. For example something like this:
=begin perl

my $polyA_min_25_pct_A = qr/
  (               ## match a poly A tail which is
                  ## delimited at its 5' by *at least* one A
    A{1,}
                  ## followed by the tail proper which has a
    (?:           ## minimum composition of 25% A, i.e.
                  ## we are looking for snippets with
      (?:
                  ## up to 3 CTGs followed by at
                  ## least one A
        [CTG]{0,3}A{1,}
      )
      |           ## OR
      (?:
                  ## at least one A followed by
                  ## up to 3 CTGs
        A{1,}[CTG]{0,3}
      )
@@ -77,8 +77,8 @@ everything after: ACTGCC

The cutadapt algorithm does not use a regex, but a simple scoring system to decide when to
stop adding letters to the inferred tail. In particular, the algorithm considers all
possible suffixes in the sequence of interest, and after filtering out those that have more than
20% non-A letters, returns the position of the suffix with the largest score as the beginning
of the tail. The algorithm in Perl is shown below:

=begin perl
@@ -103,7 +103,7 @@ of the tail. The algorithm in Perl is shown below:
return $best_index;
}

=end perl

The A part is deemed to be 23 letters long and the non-A part is inferred to be ACTGCCATCAAG

@@ -114,7 +114,6 @@ slow. But as we will C, we can use L<Inline::C> to speed things up. The C code i

=begin perl


use Inline (
C => 'DATA',
);
@@ -124,10 +123,10 @@ slow. But as we will C, we can use L<Inline::C> to speed things up. The C code i
__C__


#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

int _cutadapt_in_C(char *s) {
int n = strlen(s);
@@ -154,21 +153,25 @@ slow. But as we will C, we can use L<Inline::C> to speed things up. The C code i

=end perl
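
The scoring loop described earlier is compact enough to sketch as a standalone C function. The version below is an illustration under stated assumptions, not the listing from this article: the weights (+1 for an A, -2 for any other letter) and the name C<polyA_start> are assumptions made for the example; only the 20% cap on non-A letters comes from the description above.

```c
#include <stddef.h>
#include <string.h>

/* Return the index where the inferred poly-A tail starts, or strlen(s)
 * if no suffix qualifies. Assumed scoring: +1 per 'A', -2 otherwise. */
size_t polyA_start(const char *s) {
    size_t n = strlen(s);
    size_t best_index = n;   /* n means "no tail found" */
    long best_score = 0;
    long score = 0;
    size_t errors = 0;       /* non-A letters seen in the current suffix */

    /* Walk the suffixes from the shortest (last letter) to the longest. */
    for (size_t i = n; i-- > 0; ) {
        if (s[i] == 'A') {
            score += 1;
        } else {
            score -= 2;
            errors += 1;
        }
        /* Accept the suffix only if at most 20% of it is non-A:
         * errors <= 0.2 * length, written in integers as 5 * errors <= length. */
        if (score > best_score && 5 * errors <= n - i) {
            best_index = i;
            best_score = score;
        }
    }
    return best_index;
}
```

Calling C<polyA_start("ACGTAAAA")> returns 4, so the part of the sequence to keep is its first four letters.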

But how fast is fast? Let's run a benchmark using different sequence lengths
and fixing the length of the A tail to be 20% of the sequence length. The
results (mean and standard deviation in microseconds over 2000 repetitions
for each length) are shown below (the benchmarking code may be found in the /scripts directory of
L<Bio::SeqAlignment::Examples::TailingPolyester>):

=begin code

| Algorithm | Language | Length 100 (µs) | Length 1000 (µs) | Length 2000 (µs) | Length 10000 (µs) |
|-----------|----------|-----------------|------------------|------------------|-------------------|
| cutadapt  | Perl     | 16.0±2.5        | 150.0±11.1       | 310.0±21.0       | 1500.0±88.0       |
| regex     | Perl     | 26.0±10.0       | 310.0±26.0       | 620.0±42.0       | 3200.0±140.0      |
| cutadaptC | Perl/C   | 0.6±1.0         | 3.1±1.1          | 6.0±1.4          | 28.0±4.3          |

=end code

A nice 30-50x speedup for the C code over the Perl code. The savings are real, considering that
a typical long RNA-seq experiment may have 10^6 - 10^7 reads, and each read may have a length of 1000 bases.
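
To put the table in perspective, the per-read timings can be scaled up to a whole experiment. The sketch below is plain arithmetic covering the tail search alone; the helper name C<total_seconds> is hypothetical, and the inputs one would feed it are the mean timings for sequences of length 1000 from the table above.

```c
/* Total time in seconds to process n_reads reads, given the mean time
 * per read in microseconds (hypothetical helper, for illustration only). */
double total_seconds(double us_per_read, double n_reads) {
    return us_per_read * n_reads / 1e6;
}
```

With 10^6 reads, the Perl cutadapt port at 150.0 microseconds per read needs about 150 seconds, while the C version at 3.1 microseconds per read needs about 3.1 seconds; at 10^7 reads both figures grow tenfold.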

=head2 Making memories this C(hristmas)
@@ -177,8 +180,8 @@ The C code is not very complex, but it is a good example of how one can use L<In
tasks. But the module can help with more than that. For example, one can use it to interface with
other foreign code, by making and managing shared memory regions. Consider an example,
in which we hijack the C<Newxz> and C<Safefree> functions from the Perl API to allocate and free
memory arenas to make C<$memory>. Such a variable is effectively a pointer to a memory arena, and
we can use it to store and retrieve data from it. Suppose that one had a library that took such an
arena as input and filled it with data. Then the arena could dance with any other library that
expected a pointer to a memory arena. The library could be written in C, or Assembly. For example,
this is how one can sum lots and lots of random numbers using either C or Assembly, under the
@@ -224,10 +227,10 @@ loving embrace of Perl working with L<Inline::C>:
__C__


#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <math.h>


int _cutadapt_in_C(char *s) {
@@ -330,7 +333,7 @@ loving embrace of Perl working with L<Inline::C>:

global sum_array_doubles_AVX_unaligned
sum_array_doubles_AVX_unaligned: ; based on Kusswurm listing 9-4d
vxorpd ymm0, ymm0, ymm0 ; sum = 0.0

; i = 0 in the comments of this block
lea r10,[rdi - DOUBLE] ; r10 = &array[i-1]
@@ -354,7 +357,7 @@ loving embrace of Perl working with L<Inline::C>:
jz End_AVX ; if not, jump to the end

add r10, DOUBLE * NSE - DOUBLE ; r10 = &array[i-1]


; Handle the remaining elements
Remainder_AVX:
@@ -371,7 +374,7 @@ loving embrace of Perl working with L<Inline::C>:

In this example we make 2 million doubles, fill them up with random numbers,
and then benchmark summing them up in either C or Assembly. For the latter
we can use your grandfather's era Assembly or bring to the table a vectorized
version that uses SIMD instructions (in this case AVX extensions). On my
old Xeon, this is what I get:

@@ -390,4 +393,4 @@ Give your self a present this C(hristmas) and learn how to use L<Inline::C> to s
And if making memories seems too much, fear not: the L<Task::MemManager> module that I wrote
will cut you some slaCk.
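
For readers who want to try the memory arena idea without any Perl headers, here is a minimal sketch in plain C. It uses C<calloc> and C<free> as stand-ins for C<Newxz> and C<Safefree>, and all function names are hypothetical; it is not the code benchmarked above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate a zero-filled arena of n doubles (stand-in for Newxz).
 * Returns NULL if the allocation fails. */
double *arena_new(size_t n) {
    return calloc(n, sizeof(double));
}

/* Fill the arena with pseudo-random numbers in [0, 1). */
void arena_fill_random(double *arena, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        arena[i] = rand() / ((double)RAND_MAX + 1.0);
    }
}

/* Scalar sum of the n doubles stored in the arena. */
double arena_sum(const double *arena, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += arena[i];
    }
    return sum;
}

/* Release the arena (stand-in for Safefree). */
void arena_free(double *arena) {
    free(arena);
}
```

Any routine that accepts a C<double *> and a length, whether written in C or Assembly, can then operate on the same arena.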

=cut
