README FILE FOR THE PROGRAM MCMCALGN Mcmcalgn is a software package for joint sampling of alignments and mutation parameter values for given DNA-Sequences. Mcmcalgn was developed by Dirk Metzler e-mail: dmetzler@math.uni-frankfurt.de www: http://www.math.uni-frankfurt.de/~stoch/Leute/metzler/ The homepage of mcmcalgn is: www: http://www.math.uni-frankfurt.de/~stoch/software/mcmcalgn ........................................................................ WHAT YOU CAN DO WITH MCMCALGN In "Assessing variability by joint sampling of alignments and mutation rates" (D.Metzler, R.Fleissner, A.v.Haeseler & A.Wakolbinger (2000), to be submitted to J. Mol. Evol.) we describe the algorithm behind mcmcalgn and give application examples. You can cite this paper if you want to refer to the mcmcalgn software in a scientific publication. The following is a short introduction into the usage of mcmcalgn: Assume that two DNA-Sequences evolved from a common ancestor according to the Thorne, Kishino and Felsenstein Model, as described in their paper "An evolutionary model for maximum likelihood alignment of DNA sequences" (1991, J. Mol. Evol. 33). The authors show how maximum likelihood estimators for the evolution parameters (substitution, insertion and deletion rate) can be calculated from the sequences by efficient integration over all alignments. We address the following questions: - What is a confidence range for the evolution parameters? - What do typical alignments for the two sequences look like? (In general, optimal alignments are not typical!) The program mcmcalgn helps you to find approximative answers to these questions by printing out samples of parameters and alignments according to the common distribution of parameter and alignments conditioned on the sequences. The a priori probabilities for the alignments (given the parameters) are due to the TKF-Model; for the evolution parameters , i.e. the rates for substitutions, insertions, and deletions, we assume a slight exponential prior (that should not have too much influence on the results). Of course, the mutation rates depend on the time scaling. We use the evolutionary distance between the two sequences as time unit. This means that e.g. the substitution rate is the expected number of substitutions that affected some arbitrarily selected position in the two sequences since they split up from their most recent common ancestor (MRCA). Furthermore we assume that the probability distribution for the sequence of the MRCA is the equilibrium of the TKF evolution dynamic. (It is one of the remarkable properties of the TKF model that this equilibrium really exists!) Mcmcalgn uses a Gibbs-sampling strategy. As for many Monte Carlo Markov Chain (MCMC) sampling algorithms, the samples are not perfectly stochastically independent. However if the input sequences are not too long and the program is run with reasonable options (see below), the dependencies are supposed to be negligible. .......................................................................... INSTALLING THE PROGRAM The main thing you need is a C++ compiler and the C++ standard library. If you have the GNU compiler collection (see http://www.fsf.org/software/gcc/gcc.html ) and the make command on your system , all you need to do is to run the make command in the directory of the files Makefile, mcmcalgn.cc, dnalign.cc, dnalign.hh, eingabe.cc, and eingabe.hh. If you use a different C++ compiler you need to change the Makefile: Replace "g++" by the name of your compiler and set OPT to your favorite options. If everything works, you get an executable file named mcmcalgn. (The make command is usually available on UNIX systems. If you work on MS-Windows or MS-DOS, check out http://www.gnu.org/order/windows.html . A port of the GNU compiler and other tools for the MacIntosh can be found on ftp://ftp.cygnus.com/pub/mac .) The Tcl/Tk-program mcmcalgn.tcl provides a graphical user interface that makes the setting of options for mcmcalgn more comfortable, the Tcl/Tk-program showalgn.tcl can be used to visualize the sampling results in various ways. If you want to use the Tcl/Tk-programs, you need a Tcl/Tk interpreter. It is named wish, usually installed on UNIX systems, and contained in all common Linux Distributions. It is freely available for many platforms (including MS-Windows and Apple MacIntosh) on http://dev.scriptics.com/software/tcltk/downloadnow83.tml . ............................................................................. INPUT DATA FORMAT Each of the two input sequences must be stored in a separate file. The files must not contain anything else but the letters A,C,G, and T for the nucleotides, and underscores _ for gaps if the sequences are pre-aligned. The files should not contain anything like blanks, line ends (not even at the end of the file), comments, sequence names, .... ............................................................................. OUPUT FILE While running, the program writes the current alignment and parameters to the standard output. In addition to this, the sampled parameters and alignments are stored in an output file. The alignments are coded in sequences of the letters d (for pairings of bases), h (for a gap in the first sequence), and v (for a gap in the second sequence). (d, h, and v are short cuts for the directions "diagonal", "horizontal", and "vertical", referring to the graphical representations of alignments.) The first parameter-values/alignment combination in the output file is the initial alignment. It can be an alignment that was given to the program, see below. The parameter-values/alignment combination with the highest posterior probability of all sampling results is printed at the end of the output file once more (the same holds for the standard output). The name of the output file must be specified by the user at the beginning of the program run (see below). Each row contains three numbers that belong to one sample. The first one is the substitution rate, the second number is the insertion rate and the third number is the natural logarithm of the probability for the sampled alignment and the sequence data, given the sampled mutation rates. (The sampled deletion rate is not printed, because it is always very close to the insertion rate. This results from the equilibrium assumption for the MRCA sequence.) ............................................................................. RUNNING THE PROGRAM When you start mcmcalgn without any further argument, the program will ask for some options. Questions and typical answers at the start of the program could look like this: > mcmcalgn input file for sequence 1 : seq_1.dna input file for sequence 2 : seq_2.dna output file for parameter samples : samples.dat please choose number of initial alignment type : 0 => alignment given in input files 1 => sample alignment for parameters given below 2 => most probable alignment for parameters given below 0 substitution rate : 0.01 insertion rate : 0.001 no. of runs ... ... during burn-in : 10000 ... after burn-in : 100000 ... in each sampling interval : 100 size of resampling range : 30 . . At first you are asked for the names of the input files and the output file. Then you are asked to choose the initial alignment. If your input sequences are already aligned and the alignment is coded by underscores _ in the input files, you may choose this alignment by typing 0. If you want to start the MCMC sampling with the most probable alignment for the initial parameters, type 1. If you want to use an alignment that is sampled according to the posterior probability, given the sequences and the specified mutation rates, type 2. If your sequences are very long, your computer may run out of memory if you choose 1 or 2. In this case it is recommended that you first align the sequences with some other program and then use the result as initial alignment (typing 0). After the initial alignment you have to specify the initial substitution and insertion rates. (For simplicity, the initial deletion rate is just assumed to be 0.1% higher than the insertion rate.) Then you will be asked for the duration of the burn-in phase. The idea of the burn-in is the following: The samples should be quite independent of the initial rates and the initial alignment. If the initial values are in a very unlikely range, then it might take many steps the Monte Carlo Markov chain has moved into the typical area. The next numbers to be specified are the total number of runs after burn-in and the number of steps between two samples. Of course, the fraction of these two numbers will be the number of parameter samples you obtain. If you detect dependencies between successive parameter samples (e.g. successive samples have often similar values) you should run the program again with more steps in each sampling interval. The last option is the size of the realignment range. This is the length of the piece that is changed in the current alignment at a time. You should make it large enough so that the alignment can change sufficiently. On the other hand, the program becomes slow, since the runtime for realignment depends quadratically on the realignment range. ............................................................................ USING A FILE FOR THE OPTIONS As an alternative to the interactive input of options, you can give mcmcalgn a file from which it reads the options. The file must contain all options in the same order as in the interactive mode. For the above example, the file would look like this: seq_1.dna seq_2.dna samples.dat 0 0.01 0.001 10000 100000 100 30 To start mcmcalgn with an option file just give mcmcalgn the file name as argument. For example, if you have written the options to a file named mcmcalgn.opt, you can can start mcmcalgn from a UNIX-shell or from MS-DOS with the command "mcmcalgn mcmcalgn.opt". Of course, you can prepare your options file with any text editor, but it is more convenient to use the Tcl/Tk-program mcmcalgn.tcl, which is quite self-explaining. .............................................................................. VISUALIZATION OF RESULTS: SHOWALGN.TCL The Tcl/Tk-program showalgn.tcl reads the output file of mcmcalgn and presents them graphically. Start the program with the same option file you used to start mcmcalgn, e.g. "showalgn.tcl mcmcalgn.opt". Showalgn.tcl displays the results in three different window frames: The main window shows graphical representations of the sampled alignments and a coordinate system with dots representing the sampling results for the substitution rate and the insertion rate (which is supposed to be equal to the deletion rate up to a negligible amount). The initial alignment/parameter-value combination is shown in purple, the most probable of all sampled combinations in red. The thickness of the alignment paths refer to their frequency in the sample. If you click on a dot representing parameter values with the left or the right mouse button, the dot and the alignment path belonging to it will turn yellow or green, respectively. You can reset it by a click with the middle mouse button. A second frame shows the initial alignment and a barchart displaying how many of the sampled alignments differed from it in the single positions. If you click on the parameter dots in the first window, the same is shown for the alignment belonging to it. The third window shows the current alignment and contains a button to print it to standard output. The first two window frames contain buttons for saving the graphics in encapsulated Postscript (EPS) format. .............................................................................. OTHER THINGS TO KNOW Mcmcalgn uses the pseudo random generator that comes with your C++ compiler. As initial "random seed" it uses the time of the program start. If you want to set a different seed for some reason, you can give the program some positive integer as second argument after the name of the options file. Of course, if you start the program with the same seed twice, it produces the same "random" results in both runs. .............................................................................. LICENSE The software mcmcalgn_1.0 (including the Versions 1.0 of the programs mcmcalgn, mcmcalgn.tcl and showalgn.tcl) for joint sampling alignments and mutation parameter values for two given sequences is distributed freely under the following license terms: Copyright (C) 2000 Dirk Metzler This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. See the file LICENSE for details. If you use this software for scientific work, please cite the following article: D.Metzler, R.Fleissner, A.v.Haeseler & A.Wakolbinger (2000): Assessing variability by joint sampling of alignments and mutation rates. (to be submitted to J. Mol. Evol.) ........................................................................... BUG REPORTS If you find a bug in the program, please send me an e-mail. Dirk Metzler dmetzler@math.uni-frankfurt.de