2014 Workshop Tuesday Afternoon Discussion

Tuesday Afternoon Discussion

Standard, Data Sets, and Evaluation

Discussion leaders: John Chodera and Paul Czodrowski

Standard sets with analytical or known computed results for testing/comparing codes

Huafeng Xu: Comparison of binding equilibrium and alchemical methods for weak binders

can different programs give the same answers on the same system, to within statistical/experimental error?

good to have simple calculations that can be quickly converged to small statistical error

AMBER/CHARMM compute things like kinetic energy differently, so there may be challenges in comparing them

maybe 1 kJ/mol is sufficient for agreement if experimental agreement is good?
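
A minimal sketch (Python assumed; not part of the discussion itself) of what such an agreement check could look like: given each code's estimate and statistical uncertainty for the same transformation, test whether the two results are statistically indistinguishable. All numbers are invented.

  import math

  def codes_agree(dg_a, err_a, dg_b, err_b, z_crit=1.96):
      """Two-sided z-test: are two free energy estimates statistically
      indistinguishable given their standard errors?"""
      combined_err = math.sqrt(err_a**2 + err_b**2)
      z = abs(dg_a - dg_b) / combined_err
      return z < z_crit

  # Hypothetical results for one transformation (kJ/mol):
  print(codes_agree(-12.3, 0.4, -11.8, 0.5))  # True: difference within combined error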

should have guidelines for validating/evaluating new free energy codes

hydration free energy calculations?

The MRS sidechain analogue results might be a good reference set

Assessment: How accurately do we predict binding affinities?

How can we quantify expected performance on real problems?
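
A minimal sketch (Python with numpy/scipy assumed; data invented) of the standard metrics one might report when quantifying performance on a real series: error magnitude and rank ordering of predicted versus measured binding free energies.

  import numpy as np
  from scipy.stats import pearsonr, kendalltau

  pred = np.array([-8.1, -9.4, -7.2, -10.0, -6.5])  # predicted dG, kcal/mol (invented)
  expt = np.array([-7.8, -9.9, -6.9, -9.1, -7.2])   # measured dG, kcal/mol (invented)

  rmse = np.sqrt(np.mean((pred - expt) ** 2))
  mue = np.mean(np.abs(pred - expt))
  r, _ = pearsonr(pred, expt)
  tau, _ = kendalltau(pred, expt)  # rank ordering is what lead optimization cares about
  print(f"RMSE={rmse:.2f}  MUE={mue:.2f}  R={r:.2f}  tau={tau:.2f}")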

take a lesson from high frequency trading by using historical data to “predict” more recent data?

how much do we have to worry about “overfitting” based on (re)using historical data?

can we compile 5-10 historical lead series data to do some of this evaluation?

can we accumulate historical data on typical alchemical transformations to estimate performance?

perhaps we need to get precise results before we address accuracy?

can we build a database of alchemical transformations, like there are databases of chemical transformations?

null models: MW correlation

It may be physically realistic that molecular weight correlates with affinity

  • Dispersion always makes a big contribution
  • Water is always a less dense fluid than the protein
  • So dispersion is enhanced in the pocket
  • So a tighter-fitting ligand will always bind better (tox etc. excluded); see the null-model sketch below
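
A minimal sketch of the molecular-weight null model (Python assumed; data invented): any predictive method should at least beat the correlation that molecular weight alone achieves against measured affinity.

  import numpy as np
  from scipy.stats import pearsonr

  mw = np.array([250.3, 312.4, 198.2, 401.5, 350.1])  # molecular weight, g/mol (invented)
  pic50 = np.array([5.1, 6.0, 4.4, 6.8, 6.2])         # measured affinities (invented)

  r, p = pearsonr(mw, pic50)
  print(f"MW null model: R={r:.2f} (p={p:.2f})")  # the baseline a method must beat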

Benchmarking: How rapidly do we get a useful result?

Is the time to a useful result of critical importance?

How much do calculations cost?

maybe <24h turnaround is not critical if we can examine hard compounds

early stage and late stage are each ~2 years; we have ~5 years to make an impact with FEP; focusing on a 24h timeframe is not as important as accuracy

different thresholds for different kinds of challenges; 24h, 1wk, 1mo

Parameterization: What data is needed to parameterize forcefields?

Do we need standard QM datasets? Experimental datasets? Some QM data sets (vapor) are stored by NIST.

What kinds of data are needed?

Predictive Challenges: How can we keep the community honest?

Are SAMPL-style predictive challenges helpful?

What kinds of datasets/challenges are most useful? What frequency?

How can we best organize them? Some standard set of tests to check that different groups are doing the same thing

  • Some debate over whether we should aim for agreement within experimental error or statistical error
    • We ideally want to replace the experiment, so we need to agree within experimental error
    • But we need to agree to within statistical error for consistency between codes
  • It's hard to converge these numbers
    • We can actually learn a lot by converging simple calculations
  • Between packages we really can't make direct comparisons, since certain aspects (e.g. temperature) are treated differently
    • But we really need to converge to the same answer that someone will use to make a prediction
  • We need to come up with some baseline number for what our convergence criteria should be
    • What is considered a good correlation?
    • Companies are throwing tons of money at a number they can trust
    • For most companies there is a limit to what they consider efficient; at some compute cost, experiments are cheaper
  • For people developing new code, there needs to be some kind of standard test set (see the sketch after this list)
    • We have regression tests which test the code, but not the methods
    • GROMACS, for example, has hundreds of regression tests, but these still aren't enough to catch many bugs
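
A minimal sketch of a method-level test with a known answer (Python assumed; not from the discussion): an alchemical change between two 1-D harmonic oscillators, where the exact free energy difference is analytical, so an estimator such as EXP can be validated to within statistical error.

  import numpy as np

  rng = np.random.default_rng(0)
  kT = 1.0
  k1, k2 = 1.0, 4.0  # spring constants of end states A and B (arbitrary)

  # Exact configurational free energy difference: dF = (kT/2) ln(k2/k1)
  dF_exact = 0.5 * kT * np.log(k2 / k1)

  # Zwanzig/EXP estimate from samples of state A
  x = rng.normal(0.0, np.sqrt(kT / k1), size=200_000)
  dU = 0.5 * (k2 - k1) * x**2
  dF_exp = -kT * np.log(np.mean(np.exp(-dU / kT)))

  print(f"exact={dF_exact:.4f}  EXP estimate={dF_exp:.4f}")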

We are not necessarily competing with experiments

Assays come out, chemists make the easiest compounds, and simulations can work on the rest; after 1 week, the computations give us ideas for more compounds to test

There really is not much agreement on what some kind of optimized test set should contain, or how we should test against it

  • We need numbers that are converged to be reliable
  • The MRS/Pande paper of sidechain analogue results is a good benchmark set
    • We should get the statistical error down to a very small number

How accurately do we predict binding affinities? The Wall Street approach:

  • A model is proposed, historical data are used, and we see if we can reproduce what actually happened (see the sketch after this list)
    • May have an issue of overfitting
    • This may be viable where we look at previously designed drugs and lead optimization procedures
    • We don't really have the data yet to do this, though
  • We are beginning to accrete a body of data that we did not have 2 years ago, so we can come up with the questions people would actually want to ask
    • e.g. How long would it take to converge a calculation if I introduce a new atom, or what happens if I change ring size?
    • Gathering the body of data will allow people to phrase their questions within the database of what we have done
    • As industry uses it, more practical questions can actually be raised
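
A minimal sketch of that retrospective evaluation (Python assumed; the lead series is invented): score predictions only on the compounds registered after a chosen decision point, as if the project had run the calculations at that time.

  import numpy as np
  from scipy.stats import kendalltau

  # (registration order, predicted dG, measured dG) for one hypothetical lead series
  series = np.array([
      (1, -7.0, -6.8), (2, -7.4, -7.5), (3, -8.1, -7.9),
      (4, -8.0, -8.4), (5, -9.2, -8.8), (6, -9.5, -9.6),
  ])
  later = series[series[:, 0] > 3]  # compounds made after the decision point

  tau, _ = kendalltau(later[:, 1], later[:, 2])
  print(f"rank agreement on held-out later compounds: tau={tau:.2f}")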

Going straight to lead optimization may be going too fast

  • Let's build on what we are confident in (test accuracy on it first)
  • Then ask how much more complicated we can get from there
    • Once we get precise, then we can work on accurate
  • We had a question earlier about FEP databases: has one ever been built, and where is it?
    • We have databases of chemical reactions; why can't we have them for free energy transformations? (a sketch of one possible record follows this list)
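
A minimal sketch (Python assumed; purely hypothetical, no such community schema appears in the discussion) of what a single record in a database of alchemical transformations might store, so that accumulated results could be queried the way reaction databases are.

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class AlchemicalTransformation:
      ligand_a: str                  # SMILES of the starting ligand
      ligand_b: str                  # SMILES of the end ligand
      target: str                    # protein target identifier
      method: str                    # e.g. "FEP", "TI", "BAR"
      forcefield: str                # force field / water model used
      dg_kcal: float                 # computed relative binding free energy
      dg_stderr: float               # statistical uncertainty
      sim_time_ns: float             # total sampling per leg
      expt_dg_kcal: Optional[float] = None  # experimental value, if known

  # An invented example record:
  record = AlchemicalTransformation(
      ligand_a="c1ccccc1", ligand_b="Cc1ccccc1", target="T4-lysozyme-L99A",
      method="FEP", forcefield="GAFF/TIP3P", dg_kcal=-0.5, dg_stderr=0.1,
      sim_time_ns=50.0, expt_dg_kcal=-0.4,
  )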

How fast do we get a useful result?

  • If used in tandem with experiment, it's okay to be a bit slow (~1 week)
  • Lead optimization can really be a 5-year process, so we have a bit of time
    • Mostly agree, but there are different kinds of problems in lead optimization
    • Lots of chemists design problems that are solvable in the time frame of the posed question
    • There are different thresholds to cross: a 24hr problem, a 1-week problem, a 1-month problem