A small piece of my journey through the Puerto Rico Summer Coding Challenge at RosalindPR.info.

Ever since I discovered my passion for bioinformatics about a year and a half ago, I have been actively looking for opportunities to learn the fundamentals and develop my programming skills in this field. About a month ago, I signed up for the Puerto Rico Summer Coding Challenge, a charity competition organized by Math is More Than Numbers founder Danilo T. Pérez-Rivera. Participants work to solve over 100 classical bioinformatics problems and algorithm challenges. The only requirements are a deep love for science and some ethnic connection to my beautiful Puerto Rico. I signed up here:

From the webpage: “The Puerto Rico Summer Coding Challenge is an opportunity to learn Biology, Coding and so much more, while potentially contributing to Scientific Communication. Join us, as we work through the main rosalind.info problem set, competing for a grand prize starting at $200. The race has already begun!”

Once I registered, I selected EcoExploratorio: Museo de Ciencias de Puerto Rico as my donation pledge. (Remember, it is a charity competition.)

Today I want to talk to you about one particular problem from the set, a really easy one. It requires a little knowledge of genetics, and it was in my Genetics class that the flame of bioinformatics was ignited in me by my dear professor, during my second year as an undergraduate student of Cell & Molecular Biology at Universidad Metropolitana (UMET) in San Juan, PR, now known as Universidad Ana G. Méndez (UAGM).

The Problem:

In genetics, we learn that “Mendel’s laws of segregation and independent assortment are excellent for the study of individual organisms and their progeny, but they say nothing about how alleles move through a population over time. Our first question is: When can we assume that the ratio of an allele in a population, called the allele frequency, is stable?”

The Hardy-Weinberg principle states that if a population is in genetic equilibrium for a given allele, then its frequency will remain constant and evenly distributed through the population. Unless the gene in question is important to survival or reproduction, Hardy-Weinberg usually offers a reasonable enough model of population genetics.

The above introduction to genetic equilibrium leaves us with a basic and yet very practical question regarding gene disorders: if we know the number of people who have a disease encoded by a recessive allele, can we predict the number of carriers in the population?

Let’s find out!

Figure 1. Problem: “Finding Disease Carriers” http://rosalind.info/problems/afrq/

Here we are looking at a probability exercise about the occurrence of alleles in a population that follows the Hardy-Weinberg principle. The population is assumed to be in genetic equilibrium for several genes. For each gene, we are given the frequency of homozygous recessive (aa) individuals and are asked to return the probability that a randomly selected individual carries at least one recessive allele.

We can classify the population into three groups:

A — Homozygous recessive (aa)
B — Heterozygous (Aa)
C — Homozygous dominant (AA)

We know that A + B + C = 1 (the sum of the probabilities of all outcomes equals 1). The probability (P) of picking an individual who is not homozygous dominant is P = 1 - C, and rearranging the equation above gives 1 - C = A + B, which is what we will be calculating.

Let q be the probability that a chromosome carries the recessive allele, and p the probability that it carries the dominant allele. Each individual carries two chromosomes, so following the Hardy-Weinberg principle (p² + 2pq + q² = 1), the frequency of homozygous recessive (aa) individuals is A = q², the frequency of heterozygous individuals is B = 2pq, and the frequency of homozygous dominant individuals is C = p².
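To make the three groups concrete, here is a small sketch that computes the genotype frequencies from a recessive allele frequency q. The function name `genotype_frequencies` and the value q = 0.3 are just assumptions for illustration; they are not part of the Rosalind problem.

```python
def genotype_frequencies(q):
    """Return Hardy-Weinberg genotype frequencies for recessive allele frequency q."""
    p = 1 - q  # dominant allele frequency, since p + q = 1
    return {
        "AA": p * p,      # C: homozygous dominant
        "Aa": 2 * p * q,  # B: heterozygous
        "aa": q * q,      # A: homozygous recessive
    }


# Illustrative value only: q = 0.3
freqs = genotype_frequencies(0.3)
print(freqs)
print(sum(freqs.values()))  # should be 1, up to floating-point rounding
```

Whatever value of q you try between 0 and 1, the three frequencies should add up to 1, which is just A + B + C = 1 again.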

Putting this together, we get P = A + B = q² + 2pq. We also know that p + q = 1, so p = 1 - q. Substituting gives us:

P = q² + 2(1 - q)q

Expanding, P = q² + 2q - 2q² = 2q - q². Since A = q², we have q = √A, so this simplifies to:

P = 2√A - A
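As a quick sanity check, here is a worked example with an illustrative value of A = 0.09 (my own choice, not taken from the problem). Adding up the two groups directly and using the simplified formula P = 2√A - A should give the same answer:

```latex
\begin{align*}
A &= 0.09 \\
q &= \sqrt{A} = 0.3, \qquad p = 1 - q = 0.7 \\
B &= 2pq = 2(0.7)(0.3) = 0.42 \\
P &= A + B = 0.09 + 0.42 = 0.51 \\
P &= 2\sqrt{A} - A = 2(0.3) - 0.09 = 0.51
\end{align*}
```

Both routes agree: about 51% of this hypothetical population carries at least one recessive allele.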

Now we have our formula, so all we need to do is write a program that evaluates it for each of the given frequencies.

Figure 2. My Pythonic approach to the problem. (But wait, there's more…)
Figure 3. Success! Results as expected.
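If you want to try it yourself, here is a minimal sketch of one way to code the formula. It is not necessarily the same code shown in Figure 2. The function name `carrier_probabilities` and the input values are assumptions for illustration, and I am assuming the input arrives as a single line of space-separated decimals.

```python
from math import sqrt


def carrier_probabilities(recessive_freqs):
    """For each homozygous recessive frequency A, return P = 2*sqrt(A) - A."""
    return [2 * sqrt(a) - a for a in recessive_freqs]


# Illustrative input: one line of space-separated frequencies
line = "0.1 0.25 0.5"
freqs = [float(x) for x in line.split()]

# Print the results rounded to three decimal places
print(" ".join(f"{p:.3f}" for p in carrier_probabilities(freqs)))
# prints: 0.532 0.750 0.914
```

To run it on a real dataset, you would replace `line` with the contents of the file you download from Rosalind.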

I told you it was easy! Right?

As you can see, Rosalind.info is a great platform in which we can combine the knowledge we acquire in our courses and apply it to achieve an interactive way of learning. If you like either bioinformatics, biology, computer science, or any other scientific field, I invite you to join us in this wonderful summer journey.
