Friday, March 14, 2014

Global Alignment Problem

The solution to the Manhattan problem described here only returned us the maximum score, but did not give instructions on how exactly we should walk through Manhattan to achieve this score. To lay out the path (and to use the solution in bioinformatics to align aminoacid sequences) we need an additional step: keep the history of our walk so that we could "backtrack" from the last intersection all the way back to the start. Essentially, it is simple: when we arrive to a new node, we always come from somewhere. This is the node we need to remember.

We should also consider that, unlike Manhattan graph, where we only had streets going "west" or "south", we now have "diagonal" streets if we want to adapt the problem to align aminoacid sequences. Consider the image below:

Alignment graph matches

The red diagonal arrows show the places where one sequence matches the other. We can give these edges a higher "score" so the path that goes through most of such edges is the highest scored. In this case we have three ways to reach any node s[i,j]: we can come from s[i - 1,j] by moving down, from s[i,j - 1] by moving right, or from s[i - 1, j - 1] by moving diagonally. In the simplest case, we'll calculate anything other than a match as a 0, and a match as 1. Then the score for any s[i,j] node will be calculated using the following formula:

Global alignment formula

In more complex scenarios, we may use scoring matrices, where each combination of two aminoacids is given a certain score, depending on how biologically reasonable is this combination.

Therefore, the following algorithm is used to calculate the score matrix s, and at the same time fill the backtracking matrix backtrack.

AlignmentMatrix(v, w)
 for i ← 0 to |v|
  si, 0 ← 0
 for j ← 0 to |w| 
  s0, j ← 0
 for i ← 1 to |v|
  for j ← 1 to |w|
   si, j = max{si-1, j, si,j-1, si-1, j-1 + 1 (if vi = wj)}
   backtracki, j = ↓ if si, j = si-1, j ; → if si, j = si, j-1 ; ↘ if si, j = si-1, j-1 + 1
 return s|v|, |w| , backtrack

After both the scoring matrices are filled, we will use the following algorithm to backtrack from the s[i,j] all the way back to the starting node. Along the way we will output two string. In places where there is a match between the strings, a symbol will be placed, and in other places we will fill in a dash. Following is the backtracking algorithm. We are filling two strings, v and w. If we came to the point i,j by a diagonal edge, we have a match, and output the same symbol into both strings. If we came by a vertical or horisontal edge, one of the strings receives a symbol, and the other a "mismatch" - a "-".

OUTPUTALIGNMENT(backtrack, v, w, i, j)
 if i = 0 or j = 0
  return
 if backtracki, j = ↓
  output vi = "-"
  outpuv wi
  OUTPUTLCS(backtrack, v, i - 1, j)
 else if backtracki, j = →
  output vi
  output wi = "-"
  OUTPUTLCS(backtrack, v, i, j - 1)
 else
  OUTPUTLCS(backtrack, v, i - 1, j - 1)
  output vi
  output wi

This is the C# implementation of the algorithms.

public static void GlobalAlignment(string s1, string s2)
{
 int s1l = s1.Length;
 int s2l = s2.Length;

 int[,] matrix = new int[s1l + 1, s2l + 1];
 int[,] backtrack = new int[s1l + 1, s2l + 1];

 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0;
 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0;

 for (int i = 0; i <= s1l; i++) matrix[i, 0] = 0;
 for (int j = 0; j <= s2l; j++) matrix[0, j] = 0;

 for (int i = 1; i <= s1l; i++)
 {
  for (int j = 1; j <= s2l; j++)
  {
   matrix[i, j] = Math.Max(Math.Max(matrix[i - 1, j], matrix[i, j - 1]), matrix[i - 1, j - 1] + 1);

   if (matrix[i, j] == matrix[i - 1, j]) backtrack[i, j] = -1;
   else if (matrix[i, j] == matrix[i, j - 1]) backtrack[i, j] = -2;
   else if (matrix[i, j] == matrix[i - 1, j - 1] + 1) backtrack[i, j] = 1;
  }
 }

 OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l);

 string aligned1 = ReverseString(Result);
 string aligned2 = ReverseString(Result2);
}

public static void OUTPUTLCSALIGN(int[,] backtrack, string v, string w, int i, int j)
{
 if (i == 0 || j == 0)
  return;
 if (backtrack[i, j] == -1)
 {
  Result = Result + "-";
  Result2 = Result2 + w[i - 1];
  OUTPUTLCSALIGN(backtrack, v, w, i - 1, j);
 }
 else if (backtrack[i, j] == -2)
 {
  Result = Result + v[j - 1];
  Result2 = Result2 + "-";
  OUTPUTLCSALIGN(backtrack, v, w, i, j - 1);
 }
 else
 {
  Result = Result + v[j - 1];
  Result2 = Result2 + w[i - 1];
  OUTPUTLCSALIGN(backtrack, v, w, i - 1, j - 1);
 }
}

public static string ReverseString(string input)
{
 char[] charArray = input.ToCharArray();
 Array.Reverse(charArray);
 return new string(charArray);
}

A slightly modified version of the function takes into account a penalty for an insertion / deletion (-5 in this case) and uses a BLOSUM62 scoring matrix

public static void GlobalAlignment(string s1, string s2)
{
 int indelPenalty = -5;

 int s1l = s1.Length;
 int s2l = s2.Length;

 int[,] matrix = new int[s1l + 1, s2l + 1];
 int[,] backtrack = new int[s1l + 1, s2l + 1];

 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) matrix[i, j] = 0;
 for (int i = 0; i <= s1l; i++) for (int j = 0; j <= s2l; j++) backtrack[i, j] = 0;

 for (int i = 0; i <= s1l; i++) matrix[i, 0] = indelPenalty * i;
 for (int j = 0; j <= s2l; j++) matrix[0, j] = indelPenalty * j;

 for (int i = 1; i <= s1l; i++)
 {
  for (int j = 1; j <= s2l; j++)
  {
   //deletion
   int delCoef = matrix[i - 1, j] + indelPenalty;
   //insertion
   int insCoef = matrix[i, j - 1] + indelPenalty;
   //match / mismatch
   int mCoef = matrix[i - 1, j - 1] + BlosumCoeff(s1[i - 1], s2[j - 1]);

   matrix[i, j] = Math.Max(Math.Max(delCoef, insCoef), mCoef);

   if (matrix[i, j] == delCoef) backtrack[i, j] = -1; //
   else if (matrix[i, j] == insCoef) backtrack[i, j] = -2; //
   else if (matrix[i, j] == mCoef) backtrack[i, j] = 1;
  }
 }

 OUTPUTLCSALIGN(backtrack, s2, s1, s1l, s2l);

 string aligned1 = ReverseString(Result);
 string aligned2 = ReverseString(Result2);
}

Given the following two strings

PLEASANTLY
MEANLY

The following alignment will be returned

PLEASANTLY
-MEA--N-LY

by . Also posted on my website

Tuesday, March 11, 2014

Manhattan Tourist problem

An introductory exercise to aligning amino acid sequences is the Manhattan tourist problem. Suppose a tourist starts his route in the top left corner of the map and wants to visit as many attractions on the way as possible, and finish in the down right corner. There is a restriction however - he can only move either to the right or down.

A Simplified Map of Manhattan

To describe the problem in numbers and data structures, the map is converted to the directed grid shown below. A tourist can move along the edges of the grid and the score of 1 is added if he passes an attraction, which are assigned to the bold edges. Of all possible paths we are interested in the one with the highest score.

Map of Manhattan as a Graph

The solution to this problem is equivalent to a more general problem. This time every "street", or the edge of the graph, is assigned a score. The goal is to find a path which gains the highest score.

Generic Manhattan Problem

The problem can be brute forced, of course, by calculating the score for all possible paths. This is not practical for the larger graphs of course. A better approach is to calculate the highest possible score at every intersection, starting from the top left corner. Here is how the logic goes: If the tourist goes only towards the left in the grid, or only down, there are no choices for him. So it is easy to calculate the highest scores for the top and left row of intersections. Now we have these scores calculated:

Generic Manhattan Problem - First Row and Column Solved

Now when that's done, we can proceed to the next row. How can the tourist get to intersection (1, 1)? He can either go to (0, 1) first, and gain 3 + 0 = 3 score, or go to (1, 0) and gain 1 + 3 = 4 score. Therefore, the maximum score he can gain at (1, 1) is 4.

Generic Manhattan Problem - Cell (1,1) Solved

The generic formula to use is quite intuitive: To reach an intersection, you have to arrive from one of two possible previous locations, and traverse a corresponding edge. You choose the path where the sum of the score of a previous location, plus the score of the edge, is higher.

Generic Manhattan Formula

We continue calculations along the column all the way down. After that, we can move on to the next row.

Generic Manhattan Problem - Second Column Solved

Following this logic, it is easy to compute the maximum score for each intersection and find out the maximum score that can be gained while traversing Manhattan.

Generic Manhattan Problem - Fully Solved

Along the way the algorithm also becomes clear: the maximum score at any intersection is the maximum of the following two values:

MANHATTANTOURIST(n, m, down, right)
 s0, 0 ← 0
 for i ← 1 to n
  si, 0 ← si-1, 0 + downi, 0
 for j ← 1 to m
  s0, j ← s0, j−1 + right0, j
 for i ← 1 to n
  for j ← 1 to m
   si, j ← max{si - 1, j + downi, j, si, j - 1 + righti, j}
 return sn, m

The following C# function implements the algorithm and returns the highest possible score, assuming we input all the scores for the edges towards the right and all the edges pointing down in the form of two-dimension arrays.

public static int ManhattanProblem(int[,] RightMatrix, int[,] DownMatrix)
{
 int n = RightMatrix.GetLength(0) + 1;
 int m = DownMatrix.GetLength(1) + 1;
 int[,] ManhattanMatrix = new int[n, m];

 ManhattanMatrix[0, 0] = 0;

 for (int i = 1; i <= n; i++)
 {
  ManhattanMatrix[i, 0] = ManhattanMatrix[i - 1, 0] + DownMatrix[i - 1, 0];
 }

 for (int j = 1; j <= m; j++)
 {
  ManhattanMatrix[0, j] = ManhattanMatrix[0, j - 1] + RightMatrix[0, j - 1];
 }

 for (int i = 1; i <= n; i++)
 {
  for (int j = 1; j <= m; j++)
  {
   ManhattanMatrix[i, j] =
    Math.Max(ManhattanMatrix[i - 1, j] + DownMatrix[i - 1, j], 
    ManhattanMatrix[i, j - 1] + RightMatrix[i, j - 1]);
  }
 }

 return ManhattanMatrix.Cast<int>().Max();
}

Understanding this problem turns out to be important to understanding sequence alignment in bioinformatics.

by . Also posted on my website