Draft Standards for Post-Calculation Health Reports

From AlchemistryWiki
Jump to navigation Jump to search

Purpose

This page contains a discussion regarding a draft standard practice for analyses to be run following an alchemical free energy calculation to provide a "health report" indicating the quality of an alchemical free energy calculations, and identifying some common modes of failure generally known to occur in these types of calculations. The hope is to begin a community discussion on this topic so that multiple implementations of these standard checks will be available for different popular codes.

Division of simulation data into "equilibrated region" and "production analysis region"

While a Markov chain Monte Carlo simulation does not require that any initial "equilibration" region be identified and discarded to achieve asymptotically consistent results (see Charles Geyer's discussion Burn-In is Unnecessary), the initial starting configuration and box volumes for an alchemically free energy calculation are often so highly atypically samples from the equilibrium distribution that failing to discard some initial transients of relaxation to equilibrium will lead to highly biased free energy estimates. It is therefore recommended that all simulations properly identify a division between the initial "unequilibrated region", in which the system transiently relaxes to a more typical set of samples from equilibrium, and the "equilibrated production region" of data that is analyzed to produce estimates of the free energy. This also allows the analyst to ensure that the "equilibrated region" is sufficiently long as to produce a useful estimate to the practitioner.

This should be done for each "leg" of the alchemical free energy simulation thermodynamic cycle.

Manual inspection of key properties

If properties of the simulation are to be inspected manually (which is not recommended, but may be useful for graphical illustration), the practitioner is advised to examine the following properties:

Box volume

Many simulation preparation schemes for explicit solvent simulations utilize a scheme that replicates a small equilibrated water box (potentially generated with a different solvent or a different set of nonbonded simulation cutoffs or dispersion corrections), places the complex or ligand within this box, and then culls waters with significant overlap with the solute molecules. Counterions are added through one of several heuristic schemes. This process can lead to initial solvent densities that significantly different densities than the average equilibrium density, where density deviations of even 0.1% are highly significant and can result in enormous artificial pressure deviations.

Additionally, initiating a simulation containing an alchemically-modified solute from configurations typical of an unmodified solute will have incorrect initial solvent densities.

We recommend some quantity relating to the box volume be computed and examined over the course of the simulation, examining this for transients

Individual alchemical simulations can be inspected separately.

Replica-exchange simulations are more difficult to inspect because exchanges between replicas complicate the inspection of per-state traces, while per-replica traces hop among many alchemical states with different typical volumes. Therefore, it is recommended that the average volume across all replicas be inspected as a function of simulation time to inspect for initial transients.

TODO: Show examples

Reduced potential

Automated equilibration determination

It is highly recommended that an automated method instead be employed for dividing the simulation history into "unequilibrated" and "equilibrated" regions, though manual inspection of the properties suggested to verify the automated scheme has correctly divided the simulation is often a good visual check of simulation quality.

One concept useful in developing automated equilibration detection points is the notion that the simulation should rapidly sample many uncorrelated samples after reaction the "typical, equilibrated" region of configuration space. This allows us to formulate a scheme such as the following:

It is recommended that the production region of the simulation trajectory be defined as the interval that terminates at the end of the simulation and begins at the time which maximizes the number of uncorrelated samples in that interval. This scheme is a simpler form of one suggested by Wei Yang that requires statistical assumptions of normality. To implement such a scheme, several equally-spaced origins [math]\displaystyle{ t_0 }[/math] of the simulation interval are considered, and the number of uncorrelated configurations in the intervals [math]\displaystyle{ [t_0, T] }[/math] are computed, where [math]\displaystyle{ T }[/math] denotes the end of the simulation.

For each interval, the number of effectively uncorrelated configurations [math]\displaystyle{ N_{eff} }[/math] is given by

[math]\displaystyle{ N_{eff} = (T - t_0 + 1) / g }[/math]

where [math]\displaystyle{ g = 1 + 2 \tau }[/math] is the statistical inefficiency computed for the interval [math]\displaystyle{ [t_0, T] }[/math], with [math]\displaystyle{ \tau }[/math] denoting the integrated autocorrelation time computed over that interval.

An example of this for a Hamiltonian-exchange simulation can be seen here: Automated-equilibration-detection.png The maximum of [math]\displaystyle{ N_{eff} }[/math] represents the best choice of the the beginning of the production region [math]\displaystyle{ t_0 }[/math].


The integrated autocorrelation time [math]\displaystyle{ g }[/math] should be computed for an appropriate property of the simulation. If there are structural relaxation times known to be critical, these should be evaluated and the maximum statistical inefficiency over these considered as the effective statistical inefficiency of note.

If obvious structural relaxation times are not available, the appropriate time-dependent quantity to use for computation of the statistical inefficiency [math]\displaystyle{ g }[/math] is the negative logarithm of the unnormalized probability density. This quantity is given by

[math]\displaystyle{ u(x_t) = - \ln \pi(x_t) + c }[/math]

where [math]\displaystyle{ \pi(x) }[/math] represents the normalized probability density for the MCMC simulation, [math]\displaystyle{ x_t }[/math] is the sampled simulation state of the system at time index [math]\displaystyle{ t }[/math] (which may include positions, velocities, box vectors, etc.), and [math]\displaystyle{ c }[/math] is an arbitrary additive constant.

For example, if independent simulations of Hamiltonians [math]\displaystyle{ H_k }[/math] are carried out for [math]\displaystyle{ k = 1,\ldots,K }[/math], simulation index [math]\displaystyle{ k }[/math] would use the quantity

[math]\displaystyle{ u_k(x_t) = \beta_k H_k(x) }[/math]

where [math]\displaystyle{ \beta_k = (k_B T_k)^{-1} }[/math] is the inverse temperature of simulation index [math]\displaystyle{ k }[/math].

For Hamiltonian exchange (replica-exchange among Hamiltonians) simulations, the appropriate quantity is

[math]\displaystyle{ u(X_t) = \sum_{k=1}^K u_{s_k}(x_{kt}) = \sum_{k=1}^K \beta_{s_{kt}} H_{s_{kt}}(x_{kt}) }[/math]

where [math]\displaystyle{ X_t = \{x_{1t}, \ldots, x_{Kt}\} }[/math] denotes the set of sampled simulation states at simulation iteration or time index [math]\displaystyle{ t }[/math], and [math]\displaystyle{ s_{kt} }[/math] is the thermodynamic state index sampled by replica [math]\displaystyle{ k }[/math] at simulation iteration or time index [math]\displaystyle{ t }[/math]. This quantity looks like this for a typical simulation: Hamiltonian-exchange-reduced-potential.png

Computation of statistical inefficiencies

Verifying acceptable alchemical state overlap

Computation of statistical error

Determining whether the calculation has converged

Detection of infrequent events coupled to the alchemical free energy

Detection of artifacts in the simulation parameters

Cycle closure in relative free energy calculations