About SPOTCLUST
[Please cite]

The ultimate goal of our project is to develop mathematical models for analyzing and merging information in heterogeneous
genotyping and epidemiological databases, and use these models to develop tools for control, understanding, prevention, and
treatment of infectious diseases.

In SPOTCLUST, we focus on*Mycobacterium tuberculosis* complex (MTC), which includes *M. tuberculosis*, *M.
bovis*, *M. africanum*, *M. canetti*, and *M. microti*, because of its great public health importance, data
availability, and our relevant expertise.

Tuberculosis (TB) represents a re-emerging serious health threat worldwide. TB infection continues to grow causing over 2 million deaths each year despite the fact that it is largely curable with proper treatment.

Advances in molecular methods, enriching traditional epidemiology, have contributed significantly to our understanding of the spread of TB. Differentiating between various patent isolates and using these data to identify epidemiological links are major molecular epidemiological applications for MTC genotyping.

Our efforts have been concentrated on developing mathematical models for spacer oligonucleotide typing (spoligotyping) data. Spoligotyping method exploits polymorphism in the direct repeat region of chromosome of the MTC bacteria. This method results in simple binary pattern for each TB patient and is widely used for MTC strain discrimination.

SPOTCLUST represents a novel approach to advance global studies of MTC genotyping data. It uses mixture models to identify families within MTC bacteria based on their spoligotyping patterns. SPOTCLUST incorporates biological information on spoligotype evolution without attempting to derive the full phylogeny of MTC.

We hope that creation of this web site will be followed by generation of comments and suggestions from the scientists with spoligotyping data of their own. Our goal is to promote active and mutually beneficial collaboration among TB controllers and biological and computational scientists.

In SPOTCLUST, we focus on

Tuberculosis (TB) represents a re-emerging serious health threat worldwide. TB infection continues to grow causing over 2 million deaths each year despite the fact that it is largely curable with proper treatment.

Advances in molecular methods, enriching traditional epidemiology, have contributed significantly to our understanding of the spread of TB. Differentiating between various patent isolates and using these data to identify epidemiological links are major molecular epidemiological applications for MTC genotyping.

Our efforts have been concentrated on developing mathematical models for spacer oligonucleotide typing (spoligotyping) data. Spoligotyping method exploits polymorphism in the direct repeat region of chromosome of the MTC bacteria. This method results in simple binary pattern for each TB patient and is widely used for MTC strain discrimination.

SPOTCLUST represents a novel approach to advance global studies of MTC genotyping data. It uses mixture models to identify families within MTC bacteria based on their spoligotyping patterns. SPOTCLUST incorporates biological information on spoligotype evolution without attempting to derive the full phylogeny of MTC.

We hope that creation of this web site will be followed by generation of comments and suggestions from the scientists with spoligotyping data of their own. Our goal is to promote active and mutually beneficial collaboration among TB controllers and biological and computational scientists.

Ideally, to access genetic variability between bacterial strains, we would sequence and then compare their whole genomes. This,
however, is a time-consuming, labor and cost-intensive process, which makes it impractical for TB control. To overcome this
difficulty, only specific genomic loci that bear enough dissimilarity among different strains are used to produce genotypes of
MTC isolates. These molecular genotyping methods exploit the polymorphism in the number and genomic location of repetitive
elements.

In general, to be used as a genetic marker, an element should be locus-specific, polymorphic and easily genotyped. Restriction fragment length polymorphism (RFLP) analysis with probes derived from the insertion element IS*6110*, introduced in 1993
(van Embden, 1993), is the "gold standard"
method for typing MTC strains (Mostrom, 2002).
IS*6110* is characterized by
good discriminatory power and high reproducibility. However, it is labor-intensive, requiring culturing of the slow-growing MTC
bacteria for several weeks, and is difficult to standardize between laboratories (Braden,
2002). Moreover, this method does not
provide sufficient strain discrimination when fewer than five (van Soolingen, 2001;
Spurgiesz, 2003) or too high number (Bifani, 2002) of IS*6110*-hybridizing bands is present.

Development of PCR-based genotyping methods has greatly improved the typing of MTC strains. PCR-based methods do not require culturing the bacteria and only small amounts of DNA, which can be obtained directly from the clinical specimen, are sufficient for analysis. The most widely used PCR-based method is the spoligotyping.

In general, to be used as a genetic marker, an element should be locus-specific, polymorphic and easily genotyped. Restriction fragment length polymorphism (RFLP) analysis with probes derived from the insertion element IS

Development of PCR-based genotyping methods has greatly improved the typing of MTC strains. PCR-based methods do not require culturing the bacteria and only small amounts of DNA, which can be obtained directly from the clinical specimen, are sufficient for analysis. The most widely used PCR-based method is the spoligotyping.

Spoligotyping is based on the polymorphism in the direct repeat (DR) locus of the mycobacterial chromosome (Kamerbeek, 1997).
The DR locus is one of the most well studied loci of the MTC genome showing considerable strain-to-strain polymorphism (Fang, 1998). The function
of the DR locus in MTC bacteria is presently unknown (van Embden, 2000).

The well-conserved 36-bp direct repeats are interspersed with unique spacer sequences varying from 35 to 41 bp in size. The order of the spacers was found to be well conserved (van Embden, 2000). The region comprised the DR plus the adjacent spacer has been termed the direct variable repeat (DVR) (Groenen, 1993). Currently, 94 different spacer sequences have been identified of which 43 are used for MTC strain differentiation (van Embden, 2000). Clinical isolates of MTC bacteria can be differentiated by the presence or absence of one or more spacers.

Each spoligotype can then be conveniently represented as a 43-dimensional binary vector; for example, the most common spoligotype pattern of*M. tuberculosis* Beijing strain is:
. Octal code
designations (
for *M. tuberculosis* Beijing) are also widely used (Dale,
2001).

Spoligotypes are believed to evolve by deletion of a single or multiple contiguous DVRs (Warren, 2002). Various genetic mechanisms, such as homologous recombination, transposition, DNA replication slippage, or point mutation can cause the deletion (Mostrom, 2002; Warren, 2002). The DR region is one of the hotspots for the IS*6110*
integration (Groenen, 1993; Legrand, 2001).

Please note that we do not attempt to derive MTC phylogeny based on spoligotyping. Most of the algorithms for phylogenetic trees are based on the assumption of markers independence. The fact that spoligotypes may evolve by a loss of several adjacent DVRs (which means that they are dependent on each other) complicates the use of these patterns for deriving MTC phylogeny (Warren, 2002).

The well-conserved 36-bp direct repeats are interspersed with unique spacer sequences varying from 35 to 41 bp in size. The order of the spacers was found to be well conserved (van Embden, 2000). The region comprised the DR plus the adjacent spacer has been termed the direct variable repeat (DVR) (Groenen, 1993). Currently, 94 different spacer sequences have been identified of which 43 are used for MTC strain differentiation (van Embden, 2000). Clinical isolates of MTC bacteria can be differentiated by the presence or absence of one or more spacers.

Each spoligotype can then be conveniently represented as a 43-dimensional binary vector; for example, the most common spoligotype pattern of

Spoligotypes are believed to evolve by deletion of a single or multiple contiguous DVRs (Warren, 2002). Various genetic mechanisms, such as homologous recombination, transposition, DNA replication slippage, or point mutation can cause the deletion (Mostrom, 2002; Warren, 2002). The DR region is one of the hotspots for the IS

Please note that we do not attempt to derive MTC phylogeny based on spoligotyping. Most of the algorithms for phylogenetic trees are based on the assumption of markers independence. The fact that spoligotypes may evolve by a loss of several adjacent DVRs (which means that they are dependent on each other) complicates the use of these patterns for deriving MTC phylogeny (Warren, 2002).

We applied our algorithm to 535 different spoligotype patterns identified among 7166 strains isolated between 1996 and 2004,
primarily from New York State TB patients.

We perform unsupervised classification (clustering) of spoligotyping patterns. Clustering (please do not confuse with finding
clusters of two isolates with matching genotypes) is a process of grouping the objects in the data by some similarity measure.
"Unsupervised" means that we assume no prior knowledge on what family each
spoligotype belongs to. All we start with is our
data, a collection of spoligotypes.

The clustering methods can be divided into two major categories: discriminative (distance-based methods) and generative (model-based) methods.

Distance-based methods require a metric of pairwise distance between data points. This distance measure is often difficult to define, especially when the data are complex, for example, biological sequences. Model-based clustering techniques, such as mixture models, have the advantages that they do not require the distance metric.

The model-based approach is the most appropriate for spoligotypes, because the distance measure between spoligotypes has not been determined yet. We cannot precisely determine how far is, say,*M. tuberculosis* Beijing from *M. bovis*, given
their spoligotyping patterns. The belief is that spoligotypes develop by the deletion of spacers, but it is not clear in what
order and how many spacers can be lost simultaneously; therefore, the distance can be probabilistically assessed as a
probability of a "child" spoligotype having been evolved from a "parent"
spoligotype by mostly losing but not gaining
spacers.

We assume that a multivariate Bernoulli mixture model (MBMM) generates the data and that there is a one-to-one correspondence between mixture model components and spoligotype families. Each model component satisfies the Naïve Bayes assumption. Naïve Bayes assumption allows treating all the features (in our case, a feature is a presence/absence of a spacer - 1 or 0) as independent of each other given the class.

Bernoulli distribution has only one parameter,*p*, the probability of success (spacer in our case) in a trial with two
possible outcomes. The probability of the absence of a spacer is (1-*p*). Thus each position in a spoligotype can be
modeled as a Bernoulli distribution. There are 43 positions; therefore, multivariate Bernoulli distribution (naturally, with 43
parameters, each of which is a probability of a spacer) is used. Finally, we know that there are families within the
spoligotyping data; therefore, some number of different multivariate Bernoulli distributions, each corresponding to a family,
should model the data. Thus we have a mixture of components each corresponding to a distribution.

Now let us formalize our probabilistic framework.

Let be a set of spoligotypes that we want to classify. Each spoligotype is a binary 43-dimensional vector: . Let be a mixture model, which consists of components: . Each mixture component is described by parameters , that are mixing weight of the component, , and 43 Bernoulli distributions. The mixing weights satisfy the constrains:

The probability of a spoligotype being generated by is

Thus, to generate a spoligotype, first a mixture component is chosen with a probability , then its parameters are used to produce a spoligotype sequence. Let us denote each spacer of as (either 0 or 1). Then each mixture component has parameters , where is a probability of a spacer being present and is a probability of a spacer being absent at a position of . The probability of a spoligotype given component is:

Given the Naïve Bayes assumption, the probability that component has generated spoligotype is given as follows:

for , . The parameters for finite mixture models are often estimated by the maximum likelihood (ML) approach. Expectation-maximization (EM) is the most commonly used algorithm for finding a local maximum of the likelihood of the observed data.

The clustering methods can be divided into two major categories: discriminative (distance-based methods) and generative (model-based) methods.

Distance-based methods require a metric of pairwise distance between data points. This distance measure is often difficult to define, especially when the data are complex, for example, biological sequences. Model-based clustering techniques, such as mixture models, have the advantages that they do not require the distance metric.

The model-based approach is the most appropriate for spoligotypes, because the distance measure between spoligotypes has not been determined yet. We cannot precisely determine how far is, say,

We assume that a multivariate Bernoulli mixture model (MBMM) generates the data and that there is a one-to-one correspondence between mixture model components and spoligotype families. Each model component satisfies the Naïve Bayes assumption. Naïve Bayes assumption allows treating all the features (in our case, a feature is a presence/absence of a spacer - 1 or 0) as independent of each other given the class.

Bernoulli distribution has only one parameter,

Now let us formalize our probabilistic framework.

Let be a set of spoligotypes that we want to classify. Each spoligotype is a binary 43-dimensional vector: . Let be a mixture model, which consists of components: . Each mixture component is described by parameters , that are mixing weight of the component, , and 43 Bernoulli distributions. The mixing weights satisfy the constrains:

and | (1) |

The probability of a spoligotype being generated by is

(2) |

Thus, to generate a spoligotype, first a mixture component is chosen with a probability , then its parameters are used to produce a spoligotype sequence. Let us denote each spacer of as (either 0 or 1). Then each mixture component has parameters , where is a probability of a spacer being present and is a probability of a spacer being absent at a position of . The probability of a spoligotype given component is:

(3) |

Given the Naïve Bayes assumption, the probability that component has generated spoligotype is given as follows:

(4) |

for , . The parameters for finite mixture models are often estimated by the maximum likelihood (ML) approach. Expectation-maximization (EM) is the most commonly used algorithm for finding a local maximum of the likelihood of the observed data.

EM is a class of iterative algorithms for ML estimation useful for a variety of problems with missing data (Dempster, 1977). On
each iteration of the algorithm , there are two steps, - the expectation (E-) step and the maximization (M-) step. In the
E-step, the expected values of the missing data, given the observed data and the current parameter estimates, are computed so as
to maximize the total log-likelihood. In the M-step, the expected values of the missing data computed in the E-step are used to
re-estimate the parameters and to update the total log-likelihood. The steps are iterated until the difference between current
and subsequent estimates is small.

In the case of our finite mixture model, the missing data is the class (family) labels for each spoligotype. EM iteratively refines an initial model to better fit the data.

Let be a collection of n spoligotypes. The number of components in a mixture model with parameters is . The total log-likelihood of , given , is

(5) |

Iterating between the E- and M-steps results in non-decreasing sequence of values for the total log-likelihood.

The EM algorithm for a mixture of multivariate Bernoulli distributions can be summarized as follows:

- Choose initial parameters settings,
set
, initialize for each .
- Let . Repeat until convergence.
- E-step:
- Let
where
.
- Let
for
,
,

where is parameter for the Bernoulli mixture, if spacer is present, and if spacer is absent.

- Let
where
.
- M-step:
- Let
for
.
- Let
for
.

The Laplacian prior is used to smooth the probability estimates (McCallum and Nigam, 1998). - Let
where
.
- Let

- Let
for
.

- E-step:

Initialization of EM requires a special attention, since the solution of the algorithm is highly dependent upon its starting points. For MBMM, the initialization includes choosing both the most appropriate parameter values and the number of components in the mixture.

Model initialization and validation are very important aspects of clustering. We initialized two models - SpolDB3-based and randomly initialized. As validation criteria, we used the log-likelihood and the stability (or average best match, see Hopcroft, 2004) of the model.

**Subsections**

To incorporate expert knowledge into our method, we used the published prototypes extracted from a publicly available global
spoligotyping database SpolDB3 ((Filliol, 2002). These prototypes reflected the expert
defined visual recognition rules
applied to the database.

We extracted seeds (starting parameters for EM) for 32 mixture components (the prototypes for *M. africanum* and *M.
tuberculosis* subfamily CAS were combined into seeds for two corresponding components; and the prototype for *M. canetti*
was excluded from the analysis). Based on visual inspection, we supplemented the 32 seeds with four additional prototypes for
spoligotypes that did not match any of the SpolDB3-based prototypes. These four prototypes (ordered for convenience) can
schematically be shown as in Figure 2.

Thus the model, which we called "SpolDB3-based", had 36 components. In other words, the model order was 36. The initial weights of the components of the mixture were all the same, .

The stability of the clusters produced by a model initialized with SpolDB3-based prototypes was assessed by comparing the 36 identified families with the families identified by 100 randomly initialized 36-order models (see Figure 3). The stability values for these families gave us an idea of how reproducible the families were when the model was initialized randomly.

Another model was initialized randomly. We employed Monte Carlo cross-validation (MCCV) approach (Smyth, 1996) to find , the number of components in the mixture. The schematic representation of MCCV is given in Figure 4.

MCCV divides the data randomly 100 (in our case) times into disjoint test and train partitions. The test subset is a fraction of the data set. We used = 0.3. For each of the 100 partitions, we vary from to , the values for which are chosen based on our knowledge of the data. We varied from 30 to 60.

First the EM algorithm finds the mixture parameters using the train partition. EM is randomly restarted 10 times and the highest
log-likelihood solution is used as a trained model. EM is initialized using the *k*-means algorithm that is itself
initialized randomly. At each run out of 10, EM iterates until the total log-likelihood change is less then or until
the change of the sum of components' weights is less than . Alternatively, it stops when the number of iterations
achieves 30. For the highest total log-likelihood model, EM is allowed to run 300 additional times or until convergence. Before
starting the 300 iterations, each prototype (where d is a spacer position and j is the mixture component) is modified,
by adding randomness (Juan, 2004):

(6) |

where , , is a random number, and measures the ``global randomness'' of . We used .

Please find the schema of the random initialization of EM below:

Each trained -order model is applied to the test set, and the test data log-likelihood, is calculated. The procedure is repeated 100 times and the average (cross-validated) test data log-likelihood, , is calculated for each . The plot of as a function of ( Figure 6) shows what is the most probable for the given data.

We have chosen 48 to be the optimal model order, since this point corresponds to a pick in average test log-likelihoods. Moreover, after this point, the curve levels off. This indicates that a further increase in the number of parameters will not improve much the log-likelihood.

After we had decided on , we generated, as described above (see Figure 5), 100 randomly initialized mixture models and calculated the total stabilities (over the resulting families) for each of them relative to the other 99 models. We chose a final mixture model based on the model stability and the total log-likelihood (see Figure 7).

We call the stability of a cluster (family) of spoligotypes the average best match between the cluster identified by a model and clusters identified using other models. If we define two clusters and and treat them as sets, the match (between 0 and ) will be (Hopcroft, 2004):

(7) |

High match values mean that the sets have many spoligotypes in common and are roughly of the same size.

Figure 8 contains a simple explanation of how the best match values were calculated. Here the best match values are found for each family identified by Model 1 with respect to Model 2.

Since we initialized the 100 models randomly, they each cluster the data into different families. For each cluster in each model, we calculate the best match with respect to all other models. To calculate the stability of a cluster, we average its best match values over all of the models, thus obtaining average cluster stability. To find the model stability, we take the average of stability of the clusters identified by this model.

As we know, it is assumed that DR locus evolves by losing one or multiple contiguous DVRs, and that spacer acquisition is a very rare event. After we have applied MBMM to our data, we could see that some members of the resulting families did not correspond to this assumption. This means that in some positions of "children" spoligotypes (i.e. evolved from a ''parent'' spoligotype, represented as a prototype for the family) spacers were gained relatively to their ''parent''.

We modified our MBMM by introducing into it Hidden Parent assumption, which helped identifying biologically more relevant families.

Figure 9 contains SpolDB3 prototype and logos (Schneider, 1990) of
We assume that each spoligotype family has an unobserved Hidden Parent represented by a MBMM, and that the children of the
Parent (the observed strains) may lose a spacer with small probability, but are extremely unlikely to gain one. If we observe
, then the Hidden Parent of the spoligotype should be generating a 0 with high probability and a 1 with some
non-negligible probability (the child can lose a spacer) at a position . Therefore, the probability of spoligotype **x**
given mixture component using the Hidden Parent assumption becomes:

(8) |

where the probability of losing a spacer is relatively low: and , and the probability of gaining a spacer is extremely low: , .

The maximum likelihood estimate for (probability of a spacer in position given model component ), calculated at the M-step of the EM algorithm, becomes:

(9) |

where and .

We used two MBMMs, SpolDB3-based and randomly initialized; therefore, two sets of families were identified. In this web site, we give the following information on each of the identified families: a logo (Schneider, 1990) for the spoligotypes from strains forming this family, stability of the family and the data on each spoligotype in the family. The spoligotype data include NY State number of each spoligotype pattern, number of occurrences of the pattern (i.e., the total number of patients infected with strains bearing this pattern), probability of the pattern to belong to the family, the pattern itself in binary format and octal format, label given by SpolDB3-based model (for families identified by the randomly initialized model), and the countries of origin of the strains. Families identified by SpolDB3-based model can be viewed in the original order (Filliol, 2002) or sorted by their stabilities.

Please go to families identified by the SpolDB3-based Model or to families identified by the Randomly Initialized Model.

Please click here if you want to sumbit your data to SPOTCLUST.

Please send us your questions, comments, and suggestions.

We would like to emphasize that SPOTCLUST identified spoligotyping families based on the 535 patterns form New York state patients. Some of the families will be redefined as more data become available. We also plan to expand SPOTCLUST by adding epidemiological data to it.

1

P.J. Bifani, B. Mathema, N.E. Kurepina, and B.N. Kreiswirth

Global dissemination of the*Mycobacterium tuberculosis* W-Beijing family strains

*Trends Microbiol*, 10:45-52, 2002.

Global dissemination of the

2

C.R. Braden, J. T. Crawford, and B.A. Schable

Assessment of*Mycobacterium tuberculosis* genotyping in a large
laboratory network

*Emerg Infect Dis*, 8:1210-1215, 2002.

Assessment of

3

J.W. Dale, D. Brittain, A.A. Cataldi, D. Cousins, J.T. Crawford, J. Driscoll et al.

Spacer oligonucleotide typing of bacteria of the*Mycobacterium tuberculosis* complex: recommendations for standard
nomenclature

*Int J Tuber Lung Dis*, 5:216-219, 2001.

Spacer oligonucleotide typing of bacteria of the

4

.P. Dempster, N.M. Laird, and D.B. Rubin

Maximum likelihood from incomplete data via the EM algorithm

*Journal of the Royal Statistical Society*, B(39):1-38, 1977.

Maximum likelihood from incomplete data via the EM algorithm

5

Z. Fang, N. Morrison, B. Watt, C. Doig, and K.J. Forbes

IS*6110* Transposition and Evolutionary Scenario of the Direct Repeat Locus in a Group of Closely Related
*Mycobacterium tuberculosis* Strains

*Int J Bacteriol*, 180:2102-2109, 1998.

IS

6

I. Filliol, J.R. Driscoll, D. van Soolingen, B.N. Kreiswirth, K.Kremer, G. Valétudie et al.

Global Distribution of*Mycobacterium tuberculosis* Spoligotypes

*Emerg Infect Dis*, 8:1347-1349, 2002.

Global Distribution of

7

P.M. Groenen, A.E. Bunschoten, D. van Soolingen, and J. D. van Embden

Nature of DNA polymorphism in the direct repeat cluster of*Mycobacterium tuberculosis*; application for strain
differentiation by a novel typing method

*Mol Microbiol*, 10:1057-1065, 1993.

Nature of DNA polymorphism in the direct repeat cluster of

8

A. Juan, J. García-Hernández, and E. Vidal

EM initialization for Bernoulli mixture learning

*SSPR/SPR*, 635-643, 2004.

EM initialization for Bernoulli mixture learning

9

J. Hopcroft, O. Khan, B. Kulis, and B. Selman

Tracking evolving communities in large linked networks

*Proc Natl Acad Sci USA*, 101:5249-5253, 2004.

Tracking evolving communities in large linked networks

10

J. Kamerbeek, L. Schouls, A. Kolk, M. van Agterveld, D. van Soolingen, S. Kuijper et al.

Simultaneous detection and strain differentiation of*Mycobacterium tuberculosis* for diagnosis and epidemiology

*J Clin Microbiol*, 35:5907-5914, 1997.

Simultaneous detection and strain differentiation of

11

E. Legrand, I. Filliol, C. Sola, and N. Rastogi

Use of Spoligotyping To Study the Evolution of the Direct Repeat Locus by IS*6110* Transposition in *Mycobacterium
tuberculosis*

*J Clin Microbiol*, 39:1595-1599, 2001.

Use of Spoligotyping To Study the Evolution of the Direct Repeat Locus by IS

12

P. Mostrom, M. Gordon, C. Sola, M. Ridell, and N. Rastogi

Methods used in the molecular epidemiology of tuberculosis

*Clin Microbiol Infect*, 8:694-704, 2002.

Methods used in the molecular epidemiology of tuberculosis

13

T.D. Schneider and R.M. Stephens

Sequence Logos: A New Way to Display Consensus Sequences

*Nucleic Acids Res*, 18:6097-6100,1990.

Sequence Logos: A New Way to Display Consensus Sequences

14

P. Smyth

Clustering using Monte Carlo cross-validation

*Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR*, 126-133,
1996.

Clustering using Monte Carlo cross-validation

15

R.S. Spurgiesz, T.N. Quitugua, K.L. Smith, J. Schupp, E.G.Palmer, R.A. Cox, and P. Keim

Molecular Typing of*Mycobacterium tuberculosis* by Using Nine
Novel Variable-Number Tandem Repeats across the Beijing Family and Low-Copy-Number
IS*6110* Isolates

*J Clin Microbiol*, 41:4224-4230, 2003.

Molecular Typing of

16

J.D.A. van Embden, M.D. Cave, J.T. Crawford and, J. Dale, K.D. Eisenach, B. Gicquel et al.

Strain identification of*Mycobacterium tuberculosis* by DNA
fingerprinting: recommendations for a standardized methodology

* J Clin Microbiol*, 31:406-409, 1993.

Strain identification of

17

J.D.A. van Embden, T. van Gorkom, K. Kremer, R. Jansen, B.A. van Der Zeijst, and L.M. Schouls

Genetic variation and evolutionary origin of the direct repeat locus of*Mycobacterium tuberculosis* complex bacteria

* J Bacteriol*, 9:2393-2401, 2000.

Genetic variation and evolutionary origin of the direct repeat locus of

18

D. van Soolingen, D.W. Hermans, P.E. de Haas, D.R. Soll, and J.D.A. van Embden

Occurrence and stability of insertion sequences in*Mycobacterium tuberculosis* complex strains: evaluation of an insertion
sequence-dependent DNA polymorphism as a tool in the epidemiology of tuberculosis

* J Clin Microbiol*, 29:2578-2586, 2001.

Occurrence and stability of insertion sequences in

19

R.M. Warren, E.M. Streicher, S.L. Sampson, G.D. van der Spuy, M. Richardson, D. Nguyen et al.

Microevolution of the Direct Repeat Region of*Mycobacterium tuberculosis*: Implications for Interpretation of
Spoligotyping Data

* J Clin Microbiol*, 40:4457-4465, 2002.

Microevolution of the Direct Repeat Region of

This Info section was generated using the
**LaTeX**2`HTML` translator Version 2002-2-1 (1.70)

Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.

Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.