TPRC 2024 in beautiful Las Vegas, Nevada! June 25-27th.
Wednesday June 26, 2024 9:30am - 10:20am PDT
Component-Based Software Engineering (CBSE) assembles pre-existing, reusable software components into new applications. CBSE targeting scientific applications (SCBSE) can add novel functionality to existing, extensively tested components while keeping development cycles short, which is particularly relevant for fast-moving, data-intensive fields such as bioinformatics.

Dynamically typed scripting languages have played a particularly crucial role in SCBSE, either by acting as glue that integrates and coordinates heterogeneous components, or by facilitating communication between components that are organized as filters in complex data flows. Such languages confer additional benefits, including rapid prototyping and support for multiple programming paradigms, which make it easy to explore several architectural alternatives before settling on the final design.

Perl has traditionally been the go-to dynamically typed scripting language in bioinformatics (“BioPerl”) due to its robust text manipulation capabilities, but its declining overall popularity has also affected its standing in this field. The recent under-utilization of Perl in bioinformatics represents a significant missed opportunity to enhance applications by leveraging three resources: the Comprehensive Perl Archive Network (CPAN), its versatile and rich choice of Object-Oriented (OO) modules, and a mature framework (Alien) for making external libraries and tools available to Perl for component-based application building. Such applications can also leverage Perl’s maturing framework for interacting with libraries in non-Perl languages using Foreign Function Interfaces (FFI) and the Perl Data Language (PDL). These provide memory-based options in addition to the traditional filter-based communication scheme that has been the mainstream approach in bioinformatics.

To the extent that these external applications and libraries support the OpenMP “fork-join” multi-threading paradigm, the resulting component-based Perl application gains both coarse-grained (process-based) and fine-grained (thread-based) parallelism, and can adapt to the properties of the hardware environment.

In this paper we illustrate the value of Perl for SCBSE in bioinformatics by combining OO Perl, Alien, PDL, FFI and OpenMP in order to enhance two bioinformatic applications: 1) the R-based RNA-sequencing simulator “polyester”, and 2) the biological sequence similarity database search tool “edlib”. We utilize lightweight OO schemas and PDL to enhance the first tool, endowing it with capabilities to simulate an additional process that operates during the generation of messenger RNA: tailing with a long poly-adenine (polyA) tail. We then use this application to develop a novel, native Perl approach to trimming these tails, based on regular expressions, and provide a fast alternative to the Python application “cutadapt”, which has been the gold standard in the field for years.
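The idea of trimming polyA tails with a regular expression can be sketched as follows. This is a minimal illustration in Python, not the authors' Perl implementation; the function name `trim_polya` and the minimum tail length `MIN_TAIL` are hypothetical choices, and the sketch assumes an error-free tail made only of `A` at the 3' end of the read.

```python
import re
from typing import Tuple

# Assumption for illustration: a run of at least MIN_TAIL adenines at the
# 3' end counts as a polyA tail. Real reads may need a lower threshold or
# tolerance for sequencing errors, which this sketch does not handle.
MIN_TAIL = 10

# Anchored at the end of the read: MIN_TAIL or more consecutive A's.
POLYA_TAIL = re.compile(r"A{%d,}$" % MIN_TAIL)


def trim_polya(read: str) -> Tuple[str, int]:
    """Return (trimmed read, tail length); the read is unchanged if no tail is found."""
    match = POLYA_TAIL.search(read)
    if match is None:
        return read, 0
    return read[:match.start()], len(match.group())


# Usage: a made-up transcript fragment followed by a 25-base polyA tail.
body = "GATTACAGGCTTCG"
trimmed, tail_len = trim_polya(body + "A" * 25)
print(trimmed, tail_len)
```

Note that an anchored pattern like this cannot tell a genuine transcript-final `A` from the first base of the tail, so a read whose body ends in `A` would lose that base as well.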

Our enhancement of “edlib”, which is a single-threaded command-line application and library, proceeds along an entirely different pathway: coarse-grained parallelism is introduced for the application through the Many Core Engine (MCE), and fine-grained parallelism is introduced for the underlying library through OpenMP, accessed via the FFI::Platypus framework. Used in combination, these approaches provide customizable levels of coarse- and fine-grained parallelism for the data- and compute-intensive task of sequence analysis, and provide proof-of-concept for the utility of the Bio::SeqAlignment framework currently under development in Perl.
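The fork-join shape of such a similarity search can be sketched as follows. This is an illustrative Python sketch only: it does not use edlib, MCE, OpenMP or FFI::Platypus, the edit-distance routine is a textbook dynamic program rather than edlib's implementation, and the names `edit_distance`, `best_hit` and `search_all` are hypothetical. A thread pool is used purely to show how queries are distributed and results collected; in CPython it will not speed up this pure-Python loop.

```python
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def best_hit(query, references):
    """Return (index, distance) of the closest reference; ties go to the lowest index."""
    distances = [edit_distance(query, ref) for ref in references]
    best = min(range(len(references)), key=distances.__getitem__)
    return best, distances[best]


def search_all(queries, references, workers=4):
    """Fork: hand each query to a worker. Join: collect results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: best_hit(q, references), queries))


# Usage with made-up sequences.
references = ["ACGTACGT", "TTGACCA", "GGGCCC"]
queries = ["ACGTACGA", "TTGACA", "GGGCC"]
hits = search_all(queries, references)
print(hits)
```

In the design described above, the outer distribution of queries would be handled by processes and the inner distance computation by threads inside the compiled library, so the two levels can be tuned independently.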
3: Apollo 1-2