Analysis of Genetic Structure and Dispersal Patterns in a Population of Sea Beet
 ^{*}Norwegian Institute for Nature Research, 7005 Trondheim, Norway
 ^{†}Institute of Terrestrial Ecology, Furzebrook Research Station, Wareham, Dorset UK BH20 5AS, United Kingdom
 ^{‡}Department of Mathematical Sciences, Lade Section, Norwegian University of Science and Technology, 7034 Trondheim, Norway
 Corresponding author: Jarle Tufto, Department of Mathematical Sciences, Lade Section, Norwegian University of Science and Technology, 7034 Trondheim, Norway. Email: jarlet{at}math.ntnu.no
Abstract
A model of the migration pattern in a metapopulation of sea beet (Beta vulgaris L. ssp. maritima), based on the continuous distributions of seed and pollen movements, is fitted to gene frequency data at 12 isozyme and RFLP loci by maximum likelihood by using an approximation of the simultaneous equilibrium distribution of the gene frequencies generated by the underlying multivariate stochastic process of genetic drift in the population. Several alternative restrictions of the general model are fitted to the data, including the island model, a model of complete isolation, and a model in which the seed and pollen dispersal variances are equal. Several likelihood ratio tests between these alternatives are performed, and median bias in the estimated parameters is corrected by using parametric bootstrapping. To assess the fit of the selected model, the predicted covariances are compared with covariances computed from the data directly. The dependency of estimated parameters on the ratio between effective and absolute subpopulation sizes, which is treated as a known parameter in the analysis, is also examined. Finally, we note that the data also appear to contain some information about this ratio.
LEVELS of gene flow can be estimated from different kinds of data, either directly by different forms of capturerecapture methods or indirectly from geographic genetic differentiation generated by local genetic drift. Even though all information in genetic data on which indirect methods are based lies in the variances and covariances of the gene frequencies only, indirect methods are potentially very useful because data at several loci represent independent realizations of the same underlying process. If appropriate assumptions are built into the analysis, indirect methods may also potentially produce gene flow estimates reflecting average levels several years into the past (Slatkin 1985), thus being more relevant in an evolutionary context.
Although some theoretical results that allow inferences to be drawn about the pattern of migration from observations of genetic geographic differentiation exist, these theoretical models are often highly idealized. In the island model of migration, for example, it is assumed that all migrations occur via some large outside world populations (Wright 1943). If we believe this assumption, the effective number of migrants that enter each subpopulation each generation, N_{e}m, can be calculated from the estimated amount of genetic differentiation between the subpopulations. In Kimura and Weiss' (1964) analysis of decrease in genetic correlation in infinite stepping stone models, migration also occurs between neighboring subpopulations. It is also assumed that the subpopulations constitute completely panmictic, discrete, regularly spaced units. In most natural populations, however, individuals are more or less continuously distributed across space in varying densities. In most plant species, dispersal of genes, through seed and pollen, also occur over distances of almost any length. The distributions of these seed and pollen displacements are often highly leptokurtic.
To obtain reliable and useful estimates of levels of gene flow, it seems necessary to incorporate more realistic assumptions into the analysis and to take into account the geographic structure of the population under study, variation in local subpopulation sizes, and the form of dispersal. In this article, we analyze a set of gene frequency data (Raybouldet al. 1996) from a metapopulation of sea beet (Beta vulgaris L. ssp. maritima) by using and further developing methods from Tufto et al. (1996). In this approach, the likelihood of different migration matrices, given the data, is computed using an approximation of the simultaneous stationary distribution of the gene frequencies generated by the underlying multivariate stochastic process of genetic drift in the subpopulations under consideration. After presenting the data and a brief summary of the basic approach, we introduce a general model of migration patterns in plant populations in which the migration matrix is related to the continuous dispersal probability distributions of seed and pollen movements. Two such dispersal distributions applying specifically to the data used are then given. The parameters of these distributions are then estimated from the gene frequency data by maximum likelihood.
Because only an approximate likelihood is computed and the data are limited, we can expect estimates of the different parameters to be biased. By making extensive use of bootstrapping methods, however, these biases can be corrected to a great extent. Using parametric bootstrapping from several restrictions of the general model of the migration pattern, likelihood ratio tests between some alternative restricted models, including the island model, are also performed. The fit of the selected model is then inspected by comparing predicted and observed covariances. Finally, we examine the dependency of the estimated parameters on the assumed ratio between effective and absolute subpopulation sizes. We also note that the data, which are purely nontemporal, appear to contain some information about local effective subpopulation sizes, possibly because time in the underlying process is discrete.
THE DATA
The data consist of allele frequencies of 12 isozyme and RFLP loci from 220 sea beet plants at Furzey Island, Poole Harbour, UK, that were previously analyzed in Raybould et al. (1996). The coordinates of all sampled plants in space and the subdivision used to construct subpopulations are shown in Figure 1.
Because almost all plants seen in the study area were sampled, the absolute subpopulation sizes, which are part of the model input, could be counted directly. This also means that the gene frequencies in subpopulations could be determined without any sampling error.
The model also involves the effective sizes of each of the subpopulations, which are generally much harder to determine. The effective size N_{e} of a population will generally equal the absolute population size N, multiplied by some factor that depends on demographic parameters. Sea beet is a selfincompatible, iteroparous perennial with a life span of at least 3 years in cultivation (Bruunet al. 1995; Van Dijket al. 1997). Personal observations suggest that plants in the Dorset populations live for at least 5 years. Although sea beet is gynodioecious in France (Boutin et al. 1988), the Dorset plants are all hermaphrodite. According to Nunney (1993), there is a general tendency for N_{e} to approach N/2 as generation length and generation overlap increases. We have therefore used this limiting value for the effective size of each subpopulation, although estimates of this ratio are known to vary widely from ~0.2 to 0.8 (Frankham 1995).
COMPUTING THE LIKELIHOOD
Tufto et al. (1996) present a general approach for estimating the matrix M of migration rates between a set of populations, i = 1,2,…,n, from geographic neutral genetic variation, in part using ideas in Courgeau (1974) and Felsenstein (1982). The model is based on the underlying multivariate stochastic process
The basic idea of the approach is to assume that the migration matrix is of a certain form, e.g., stepping stone like, then compute the covariance matrix from the migration matrix and the effective population sizes, and then use the multivariate normal distribution to repeatedly compute the probability of the observed gene frequencies for different parameter values until the likelihood is maximized. With gene frequency data on several independent loci, the total likelihood is equal to the product of the multinormal probabilities calculated for each loci.
Conditioning on
MIGRATION MATRICES IN PLANT POPULATIONS
General considerations: Following Bodmer and CavalliSforza (1968), each element m_{ij} of the matrix M in (1), the backward migration matrix, is defined as the conditional probability that a gene originates from population j, given that it dispersed to population i. The purpose of this section is to define a biologically realistic model for the migration pattern, which is essentially a specification of M as a function of the basic parameters involved.
The migration probabilities will generally depend on the relative number of individuals in each subpopulation because large subpopulations will contribute with a greater proportion of genes to subsequent generations. It must also be remembered that gene movements in plant populations occur in either one or two stages: through seed dispersal only or through both pollen and seed dispersal, possibly via some third population.
Let M^{(s)} and M^{(p)} be the backward migration matrices for seed and pollen dispersal, i.e.,
Let
For neighboring subpopulations and, at least, for dispersal within subpopulation, the expectation in (6) must be computed numerically. For the sea beet data used in this article, for which the positions of all individual plants were available, this is straightforward to do. When the distance between two populations is large, the expectation can be computed using the firstorder approximation
Seed and pollen dispersal distribution in sea beet: In sea beet, pollen is wind dispersed, and the resulting dispersal distributions are thus two dimensional. Tufto et al. (1997) discuss several plausible forms that
The populations of sea beet considered in this article were located along the shoreline, and seed dispersal was known to occur mostly by the help of tidal currents moving back and forth along the shoreline in approximately one dimension, w. If we assume that seeds are equally likely to be released into the water at any point in time, so that dispersals in both directions have equal probability, and also assume that the probability of deposition in a small length interval w, w + dw is λ_{s}dw, then the random displacement W of each seed will follow the Laplace distribution
Note that the seed and pollen dispersal distributions (7) and (8) differ greatly in their degree of kurtosis, i.e., to what extent probability is concentrated around the mean and in the tails of the distribution. This difference is in part a result of the assumptions that pollen dispersal occurs in two dimensions whereas seeds only disperse in approximately one dimension (along the shoreline).
We also assume that the backward probabilities of migration from the outside world are equal to u for all subpopulations. This may be a questionable assumption in part because the subpopulations differ in their density and in their distance from the outside world. However, if these backward migration rates are small, this assumption may not be very critical. The parameter u can also be thought of as a mutation rate, in which case, it would be the same for all subpopulations.
BOOTSTRAPPING METHODS
Simulation procedure: Several strategies can be used to simulate the distribution of the data. One strategy would be to reconstruct the distribution from the data itself by resampling gene frequency vectors with equal probability from each of the observed loci. This, however, would not take advantage of the knowledge we have about the process generating the data. A better strategy is to stimulate the process directly by iterating (1) for a sufficient number of generations until the equilibrium distribution is attained. It must be kept in mind, however, that it is the distribution conditional on the observed gene frequency means at each locus that is relevant. This conditional distribution can be generated approximately by first simulating, say, 500 generations, then keeping track of the gene frequency mean for each iteration of the process, and stopping the iterations just after (or before) the gene frequency mean passes the target mean. This simulation technique eliminates the extra variance in the parameter estimates that would arise if the unconditional distribution, which very likely would contain nonpolymorphic data, was used.
Bias correction: The sampling distribution of an estimator
This procedure, however, is not invariant with respect to transformations of the parameter, and in cases where the distribution of the estimator is very skewed, it can often make more sense to construct a socalled median biascorrected estimate
Likelihood ratio tests: Several tests between different hypothesis about the unknown migration pattern will be of interest. In general, in tests of a model H_{0} against some other model H, the ratio Λ of the maximum likelihood under H and H_{0} is usually used as a test statistic. If there is a small probability, given that H_{0} is true, that Λ is larger than the value of Λ computed from the observed data, Λ*, H_{0} can be rejected in favor of H. This probability, P(Λ > Λ*H_{0}), will also generally depend on the true value of the nuisance parameters θ_{0} under H_{0}. A reasonable choice of θ_{0} to simulate from is the value of θ_{0} that is in best agreement with the observed data, which in our case is the median biascorrected estimate
RESULTS
Model selection: Four different restrictions of the general model of the migration pattern were considered. We first fitted the model under the constraint that the variances of the seed and pollen dispersal displacements,
We then tested this model against the alternative extended model using both λ_{s} and λ_{p} as free parameters in addition to u:
The second hypothesis nested within (9) is defined by the constraints
Fitting (12) to the data gave an uncorrected estimate
of the seed and pollen dispersal standard deviation of
Observed and predicted covariances: To assess the fit ofthe selected model, we will compare the (conditional) standardized covariances predicted by the model with those computed from the data directly. These are shown in Figure 6. Some interesting points can be noted. First, there appears to be reasonably good agreement between the model predictions and the observations. The main level of isolation is between the two transects (see Figure 1). This can be seen in the predicted values as “steps” in the covariances between subpopulations 16 and 17. Within the less dense transect 2 (subpopulations 17–21), there is also a steeper decline in the covariances with increasing geographic distance than within transect 1, where there is only a slight effect of isolation by distance. There also appears to be reasonably good agreement between predicted and observed variances, in that relatively small and/or isolated subpopulations, e.g., subpopulations 15 and 16, have larger predicted and observed variances.
The estimated migration matrix: The migration matrix computed at the biascorrected estimate of σ is shown in Figure 7. On average, there is ~70% migration of genes from neighboring populations. Large, isolated, dense subpopulations, such as subpopulation number 20, receive a smaller proportion from neighboring subpopulations, whereas very small subpopulations (e.g., number 10 consisting of 4 individuals) receive a larger proportion of genes from its neighbors than from itself. It is also apparent in several subplots that large subpopulations (e.g., subpopulation number 4 consisting of 29 individuals) also make a large contribution to distant subpopulations because of their size.
Having estimated the variances in the seed and pollen dispersal displacements under the assumption that these are equal, the total variance of the gene displacements, measured along the shoreline, if we assume independence between X and W, becomes Var(W) + ½Var(X) (Crawford 1984), which gives a gene displacement standard deviation equal to 91.31 m for model (12).
Dependencies on N_{e}/N: Because the data are nontemporal, we do not expect much information about the ratio between the effective and absolute population size of each subpopulation, N_{e}/N, to be available in the data, and this ratio has, therefore, so far been treated as a known parameter in the analysis described above. Because our choice of N_{e}/N = ½ is clearly rather arbitrary, we will, however, look at how our estimate of σ under the selected model (12) depends on the choice of N_{e}/N. The maximum likelihood estimate of σ obtained by fitting the model for each of several different N_{e}/N values is shown in Figure 8.
First note that on a log–log scale, for values of N_{e}/N larger than ~0.2, there appears to be a linear relationship with
Also note that the estimate of σ appears to tend to infinity when N_{e}/N becomes smaller than ~0.15. The estimate of σ is then about the same order of magnitude as the size of the study area. Any further increase in σ then has little influence on the migration matrix, and hence the likelihood, which means that the maximum likelihood estimate of σ tends to infinity.
We expected the data to contain almost no information about N_{e}/N, making the maximum likelihood almost independent of this parameter. To verify this, we plotted the maximum likelihood obtained by fitting the model for each of the different N_{e}/N values. This is shown in Figure 9. Surprisingly, the likelihood is strongly dependent on N_{e}/N and has its maximum at ~0.43.
DISCUSSION
We have shown how a quite realistic model based on the underlying genetic multivariate drift process in a subdivided population can be fitted to gene frequency data by approximate maximum likelihood methods. The model uses information known about absolute and effective population sizes, as well as geographic distances within and between subpopulations. Assumptions about the underlying mechanisms of seed and pollen movement are also incorporated. A data set consisting of allele frequencies at 12 isozyme and RFLP loci in 21 subpopulations of sea beet was used, and inferences about at least one parameter, the seed and pollen displacement variance, could be made. We were also able to reject the island model in favor of our more general model. It could also be concluded that no significant evidence for migration from the outside world or for mutation is present in the data.
There appears to be quite good agreement between the observed and predicted covariances. However, the fact that the observed covariances are dependent, also given that the selected model is true, makes comparison with the covariances predicted by the fitted model dangerous. If we look at Figure 6, there appear to be some discrepancies between the observed and predicted correlations with subpopulation number 15, in that subpopulation 15 appears to be more similar to subpopulations 16–21 and less similar to subpopulations 1–14 than predicted by the fitted model. This may suggest that some isolating barrier is present between subpopulations 14 and 15. Because the correlations are dependent, however, this discrepancy may very well be an effect of chance, and it would be desirable to do some more formal test of goodness of fit. As suggested by Felsenstein (1982), this can be done by testing the selected model against a “full” model in which all the n(n − 1)/2 covariances of the n − 1 contrasts are used as free parameters. Such a test would require data on a large number of loci, however, partly because the number of parameters under the full model is very large. The observed covariance matrix will also be singular if the number of loci is less than n − 1, which excludes the possibility of doing this test in most cases.
If the degree of kurtosis of the gene displacements has an effect on the genetic structure, we could hope to be able to obtain separate estimates of the seed and pollen dispersal variances because these distributions, (7) and (8), differ greatly in their degree of kurtosis. However, the fact that the extended model (10) gave almost no increase in the likelihood suggests that the data contain very little information about this.
In a previous, more traditional analysis of the data than the one used here, pairwise genetic distances were regressed against geographic distances (Raybouldet al. 1996). These authors also found a “step” in the genetic distances between the two transects by including dummy variables representing transect membership in the regression, and they concluded that this suggested continuous extinction and recolonization of the subpopulations. However, the analysis used here shows that the “step” in the covariances between subpopulations 16 and 17 can be fully explained by the geographic location of the subpopulations only (Figure 1) and the assumed pattern of migration.
We still need to be concerned about the assumption that population sizes remain constant over time, however. If there are large autocorrelated fluctuations in local population sizes, this is likely to constantly keep the gene frequencies away from their equilibrium distribution if the number of generations necessary to reach equilibrium is large. It is, however, only relevant to consider the rate of convergence of the covariances of the contrast, and these will converge much faster than the covariance matrix of the entire distribution (on the order of only several generations; Harpending and Ward 1982, p. 245).
It is also very interesting that there appears to be information in the data about the ratio between effective and absolute subpopulation sizes. This is quite suprising when considering the fact that the data are purely nontemporal. It is well known that in continuous time diffusion approximations of, e.g., the island model, the rate of migration and the effective population size of each island are completely confounded. However, in our study population and in the model, generations are discrete. The effective subpopulation sizes are also quite small, between 1 and 14.5 individuals, and the rates of migration between neighboring populations are relatively high, which suggest that there will be quite large changes in the equilibrium covariances at different points in the life cycle. The magnitude of these changes depends on N_{e}/N. Even though the maximum likelihood estimate of N_{e}/N = 0.43 obtained seems plausible, it is not clear how reliable this estimate is, however. One potential problem is that the model has nonoverlapping generations, whereas generations overlap in the study population, possibly also with agedependent mortalities and fecundities. It is also not entirely clear if the point in time in the life cycle of the predicted covariances coincides with the point in time at which the actual sampling occurred.
There has been some concern in the literature that direct and indirect methods have produced very different estimates of gene flow. There is a general tendency for indirect estimates to be much larger than estimates obtained by direct methods (e.g., Koeniget al. 1996). According to these authors, the discrepancy results from systematic biases in direct estimates caused by finite study areas. We note here that our estimate of seed and pollen dispersal standard deviations of 74.56 m are comparable to estimates from direct studies of seed dispersal that range between 27 and 100 m (Slatkin 1985) and to estimates of pollen dispersal of 30 m (Nurminiemiet al. 1997).
Availability of software: The SPlus and C code used to do this analysis and accompanying documentation is available on the internet at http://www.math.ntnu.no/~jarlet/migration/
Acknowledgments
We thank Ralph Clarke for helpful discussions. This study was financially supported by the Norwegian Research Council and the Norwegian Institute for Nature Research A.F.R. also received financial support from the Department of Environment's Genetically Modified Organisms Research Program.
APPENDIX: APPROXIMATIONS OF THE CONDITIONAL COVARIANCE MATRIX
In Tufto et al. (1996), an approximation based on (2) directly was shown to break down as the fluctuations around q become large. On the basis of the idea that the amount of genetic drift in the neighborhood of the observed mean gene frequency is most important in determining
Tufto et al. (1996), however, did not examine the behavior of the approximation based on (A1) as betweenpopulation differentiation increases in any great detail. We have now found the solution of (18) in Tufto et al. (1996), in situations with large betweenpopulation differentiation, to sometimes result in matrices C having negative determinants; i.e., matrices obtained using the proposed method are not always proper covariance matrices. Even though the method was originally intended to work only in situations with limited differentiation, this deficiency of the method is clearly unsatisfactory, as it makes the log likelihood undefined in some parts of the parameter space. We have therefore developed a third, somewhat less complicated, approximation.
As noted in Tufto et al. (1996), it is the dynamics of the process in the neighborhood of the observed mean gene frequency that is important in determining the conditional covariances. The third approximation is constructed by assuming that the stochastic elements of the column vector e in (1) have constant variances, equal to
It might be mentioned that matrix equation (2) using (A2) can be solved straightforwardly in Splus after diagonalization of M and M^{T} and some rearrangement, with operations involving only (n × n) matrices, not suffering from the rapid increase with n in computing time required to solve (2) and (18) in Tufto et al. (1996).
Footnotes

Communicating editor: B. S. Weir
 Received August 27, 1997.
 Accepted March 31, 1998.
 Copyright © 1998 by the Genetics Society of America