Potential Thesis Topics
Master Thesis topics
Common Workflow Language: Mining user behavior for feature development
Common Workflow Language (CWL) is a standard for describing command line applications and workflows made from them, typically using (Docker) software containers. CWL is popular in bioinformatics and is gaining traction in other fields such as astrophysics. For flexibility, users are allowed to use JavaScript in their CWL tool descriptions to handle complicated command line interfaces or to generate dynamic configuration files. For maximal language simplicity and ease of parsing and implementation, it would be ideal if no JavaScript were allowed in CWL. Therefore the maintainers and contributors of the CWL standard try to monitor how users make use of the JavaScript feature and then create new language constructs that fulfill those needs. However, the current approach is very manual, infrequent, and non-comprehensive.
Quantitative analysis phase
Using online archives of CWL workflows, and by searching GitHub and GitLab: can we characterize, in an automated fashion, how users use the JavaScript expression feature of CWL? Techniques may include Abstract Syntax Trees, which would benefit from domain-specific enhancement.
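A first pass at such a characterization could be sketched as below. This is a minimal illustration, not the project's method: it scans the raw text of a CWL document for `$(...)` and `${...}` spans and separates simple parameter references (which need no JavaScript engine) from real JavaScript. The function names and the category labels are hypothetical, and the bracket matching is naive (it ignores brackets inside string literals), so a real analysis would hand each extracted body to a proper JavaScript parser to obtain an AST.

```python
import re
from collections import Counter

# Simple parameter references such as inputs.reads.basename or inputs.files[0].
# Anything in $(...) that does not fit this shape is treated as JavaScript.
PARAM_REF = re.compile(
    r"^\s*[A-Za-z_]\w*(\.\w+|\[\d+\]|\['[^']*'\]|\[\"[^\"]*\"\])*\s*$"
)

CLOSERS = {"(": ")", "{": "}"}


def extract_expressions(text):
    """Yield (opener, body) for each balanced $(...) or ${...} span in text."""
    i = 0
    while i < len(text) - 1:
        escaped = i > 0 and text[i - 1] == "\\"
        if text[i] == "$" and text[i + 1] in CLOSERS and not escaped:
            opener = text[i + 1]
            closer = CLOSERS[opener]
            depth = 0
            for j in range(i + 1, len(text)):
                if text[j] == opener:
                    depth += 1
                elif text[j] == closer:
                    depth -= 1
                    if depth == 0:
                        yield opener, text[i + 2:j]
                        i = j  # resume scanning after the closing bracket
                        break
        i += 1


def classify(opener, body):
    """Assign a coarse (hypothetical) category to one extracted span."""
    if opener == "{":
        return "js_function_body"
    if PARAM_REF.match(body):
        return "parameter_reference"
    return "js_expression"


def summarize(text):
    """Count how often each category occurs in one CWL document."""
    return Counter(classify(o, b) for o, b in extract_expressions(text))
```

Running `summarize` over every document in a corpus and summing the counters gives a rough picture of how much usage is plain parameter access versus genuine JavaScript, which is the population the AST-based motif mining would then focus on.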
Design phase
Propose new features for the discovered JS motifs and evaluate the burden of a new language feature versus the utility to users.
Notes:
All work must be done openly and under the Apache 2.0 license. There is a possibility of the results of the project being used for many years to come!
Prerequisites: Comfort with the Linux/Unix command line
May qualify for summer funding via Google's Summer of Code
Contact Michael Crusoe for further details
Common Workflow Language: distributed execution with data streaming
Command line scientific analysis tools often support streaming data into or out of the tool. (At the command line we use the Unix pipe “|” or named pipes to implement this.) Streaming speeds up the analysis by avoiding slow disk/storage I/O.
While the CWL standard supports this approach, no CWL-aware workflow system makes use of this optimization.
You would add this feature (automatic streaming of data into and out of scientific computing tools) to one of the CWL workflow engines, such as Toil (which is Python based).
The first iteration would stream into and out of object stores (Amazon S3, Google Cloud Storage, etc.). More advanced implementations may feature direct streaming between the tools, but this requires refactoring the job scheduling engine.
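As a point of reference for the pipe-based streaming described above, the following sketch connects two processes in Python without an intermediate file. It is only an illustration of the mechanism and does not reflect how Toil or any other CWL engine schedules jobs; the helper name `stream_between` is hypothetical.

```python
import subprocess


def stream_between(producer_cmd, consumer_cmd):
    """Run `producer | consumer` with no intermediate file.

    Returns the consumer's standard output as bytes.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe's read end so that the producer is
    # notified (broken pipe) if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError("pipeline failed")
    return output
```

The operating system moves the data between the two processes as it is produced, so the consumer can start work before the producer has finished. A workflow engine would need to make the equivalent connection between a tool and an object store, or between two scheduled jobs.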
Notes:
All work must be done openly and under the Apache 2.0 license. If successful, you will have contributed a major feature to a popular workflow engine!
Prerequisites: Python
May qualify for summer funding via Google's Summer of Code
Cross-architecture Single instruction, multiple data (SIMD) analysis
SIMD intrinsics like SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, and AVX-512 are popular with C/C++ programmers for speeding up analysis code in many research domains, including bioinformatics. Alas, they are architecture specific, and implementing fallbacks and multi-versioning is tedious. For example, SSE is not available on the Raspberry Pi, which is popular in education and hobbyist settings.
The SIMD Everywhere header-only C/C++ library reduces this burden by using a variety of methods: 1) GCC or clang extensions 2) OpenMP 3) Cilk Plus 4) pure source implementation and 5) cross-architecture SIMD (e.g. implementing SSE2 with NEON ARM intrinsics).
However, the performance of these implementations has not been quantified.
In this project you will benchmark the use of the SIMD Everywhere library and these different backends as used by real scientific computing codebases on a variety of hardware. You will gain experience in SIMD programming and you will improve the SIMD Everywhere Open Source project by adding new SIMD instructions and accelerated implementations.
Notes:
All work must be done openly and under the MIT license. If successful, your work will benefit many scientific and research applications!
Prerequisites: C or C++ experience
May qualify for summer funding via Google's Summer of Code.
Bioinformatic pipelines on Arm64: performance analysis on AWS Graviton and Apple Silicon
Intel's dominance of the scientific computing market may be weakened by the popularity of Arm64 architecture systems like AWS Graviton and Apple Silicon. While there are initial reports of cost and energy savings for industrial applications, this has not been analyzed for bioinformatic pipelines.
In this project you will compare the performance, on a cost, time, and energy basis, of multiple real-world bioinformatic pipelines on Arm64 systems like those from the AWS cloud and Apple Silicon. You will have the opportunity to assist in porting the parts of bioinformatic codebases that prevent them from running on Arm64 processors, in conjunction with the Debian Med project. You will also gain experience using software container (Docker) build and deployment technologies.
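One way to put the three comparison axes on a common footing is sketched below. It assumes that each pipeline run yields a wall-clock time, an hourly machine price, and a mean power draw; how those are measured is left open, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    platform: str          # label for the machine the pipeline ran on
    runtime_s: float       # wall-clock time of the run, in seconds
    price_per_hour: float  # hourly price of the machine (assumed input)
    mean_power_w: float    # mean power draw during the run, in watts

    @property
    def cost(self):
        """Price of the run: hours used times hourly price."""
        return self.runtime_s / 3600 * self.price_per_hour

    @property
    def energy_wh(self):
        """Energy used by the run, in watt-hours."""
        return self.mean_power_w * self.runtime_s / 3600


def relative_to(baseline, other):
    """Ratios of other to baseline; values below 1 favor `other`."""
    return {
        "time": other.runtime_s / baseline.runtime_s,
        "cost": other.cost / baseline.cost,
        "energy": other.energy_wh / baseline.energy_wh,
    }
```

Reporting ratios against a baseline platform makes it visible when a slower machine is still cheaper or more energy efficient per pipeline run.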
Notes:
All work must be done openly and under an approved Free/Open Source Software license. If successful, your work will benefit many scientific and research applications!
May qualify for summer funding via Google's Summer of Code.
Deep learning for decoy generation in protein identification from mass spectrometry
Prerequisites: Knowledge of bioinformatics and of TensorFlow
Notes:
All work must be done openly and under an approved Free/Open Source Software license. If successful, your work will benefit many scientific and research applications!
Antimicrobial resistant bacteria; predicting antimicrobial resistance; machine learning; sequence bioinformatics
Prerequisites: Some R or Python; some knowledge of machine learning and sequence bioinformatics would be beneficial
Notes:
All work must be done openly and under an approved Free/Open Source Software license. If successful, your work will benefit many scientific and research applications!
Contact Tjeerd Dijkstra and Thomas Hamm for further details
Bachelor Thesis topics
Improved feature linking in large-scale mass spectrometry experiments
Prerequisites:
Notes:
All work must be done openly and under an approved Free/Open Source Software license. If successful, your work will benefit many scientific and research applications!
Quality control for quantification in large-scale mass spectrometry experiments
Prerequisites:
Notes:
All work must be done openly and under an approved Free/Open Source Software license. If successful, your work will benefit many scientific and research applications!