
Upload: nelson-allison

Post on 16-Dec-2015


iPlant G-to-P test case for visualization: Model parameter estimation for QTL analysis

• Prepared by Jeff White, USDA ARS, ALARC, out of the Feb 16-17 working group meeting in Kansas City

• Basic data set is Maize NAM lines
– 27 populations x ~200 lines
– 11 environments (6 sites x 2 years except Puerto Rico)

• Parameters estimated for CSM-CERES-Maize
– P1: Was used as a surrogate for earliness per se, but is actually duration of juvenile phase
– P2: Determines degree of delay for daylengths longer than the critical short daylength

• Phenotypic data were provided by Maize NAM project on the understanding they would not be redistributed or published until their phenology paper comes out. Thus, a SAS program is available but we’d need to check with Ed Buckler and Jim Holland before I provide that data file.

About the graphs

• These are all pretty simple Y vs X type graphs

• A key feature that is not shown is the ability to drill down by identifying points or clouds of points (e.g., a single line or location)

• The first series of plots are all observed vs simulated.

• The second series deals with the coefficients and prediction error (as RMSE)

Observed vs predicted: locations

1. I added the 1:1 line

2. Ideally it might have the linear regression or even regressions by location

3. For any point or cluster, one would want the population, line, & year. For example, the chain of blue points…

4. This graph may be misleading because many points are overlain. One option would be to start with a density plot.
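One way to get at the density idea is to count points per grid cell before plotting, so overlain points show up as a count rather than a single dot. This is only a minimal sketch: the function name `density_grid`, the 5-day bin width, and the sample values are all made up for illustration and do not come from the NAM data.

```python
from collections import Counter

def density_grid(observed, simulated, bin_width=5.0):
    """Count points falling in each (observed, simulated) grid cell."""
    counts = Counter()
    for obs, sim in zip(observed, simulated):
        # Integer cell indices along each axis (bin width is an assumption)
        cell = (int(obs // bin_width), int(sim // bin_width))
        counts[cell] += 1
    return counts

# Made-up days to anthesis, for illustration only
observed = [62, 63, 64, 71, 72, 88]
simulated = [61, 62, 63, 70, 74, 80]
grid = density_grid(observed, simulated)

for cell, n in sorted(grid.items()):
    print(cell, n)
```

The counts could then drive color or symbol size in whatever plotting tool is used.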

What are these?

Observed vs predicted: environments

1. Here locations are subdivided by year.

2. Note the poor handling of the legend by SAS Gplot.

3. For any point or cluster, one would want the population, line, & year. For example, the chain of blue points…

Curious difference between Aurora in 2006 and 2007. Why?

Observed vs predicted: NY by two years

1. Just looking at NY datasets

2. Scale could have been resized, although conserving the scale helps in comparisons across plots.

3. Again, for any point or cluster, one would want the population, line, & year. For example, the chain of blue points…

4. This graph may be misleading because many points are overlain. One could use open and closed symbols, or allow toggling by year to highlight points.

What are these?

Observed vs predicted: populations

1. I added the 1:1 line

2. It looks like there is a major problem with population 26. Could it be that the trusty Dr. White forgot to calibrate this population? Or that GenCalc failed to converge…?

3. Again, there are interesting chains. The circled one is NY, 2006.

4. This graph is very ugly because many points are overlain. Could we toggle populations on and off with check boxes?

What are these?

Deviations of Simulated - Observed: populations

1. Deviation plots provide a different perspective

2. Again, it looks like there is a major problem with population 26, but the wider spread of points allows one to see possible problems with population 5 and 27.
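A quick numeric check to go with the deviation plot is the mean of simulated minus observed for each population. The sketch below assumes records of the form (population, observed, simulated); the function name, the 5-day flagging threshold, and the numbers are hypothetical and chosen only to show the idea.

```python
from collections import defaultdict
from statistics import mean

def mean_deviation_by_population(records):
    """Mean of (simulated - observed) per population.

    records: iterable of (population, observed, simulated) tuples.
    """
    deviations = defaultdict(list)
    for population, obs, sim in records:
        deviations[population].append(sim - obs)
    return {pop: mean(values) for pop, values in deviations.items()}

# Made-up values, for illustration only
records = [(1, 70, 71), (1, 75, 74), (26, 70, 85), (26, 72, 90)]
bias = mean_deviation_by_population(records)

# Flag populations whose mean deviation exceeds an arbitrary 5-day threshold
flagged = sorted(pop for pop, b in bias.items() if abs(b) > 5)
print(flagged)
```

A population that was never calibrated would be expected to stand out in a list like `flagged`, although the right threshold depends on the data.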

What are these?

From here onward, there is a change in datasets. The next series is based on the two fitted model coefficients and associated data

• Model parameters
– P1 = length of the juvenile phase
– P2 = photoperiod sensitivity

• Associated data
– RMSE = root mean square error of prediction
– No. observ = number of observations that the optimization program used. Maximum possible number is 11.
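For reference, RMSE and the number of observations for a single line can be computed as below. This is a sketch of the standard formula, not the optimization program's own code; the function name `rmse_and_count` and the use of `None` for a missing observation are assumptions.

```python
import math

def rmse_and_count(observed, simulated):
    """Return (RMSE, number of observations), skipping missing observations."""
    pairs = [(o, s) for o, s in zip(observed, simulated) if o is not None]
    if not pairs:
        raise ValueError("no observations available")
    n = len(pairs)
    mean_sq_error = sum((s - o) ** 2 for o, s in pairs) / n
    return math.sqrt(mean_sq_error), n

# Made-up days to anthesis for one line in four environments
observed = [70, None, 74, 80]
simulated = [72, 75, 74, 76]
value, n_obs = rmse_and_count(observed, simulated)
print(round(value, 2), n_obs)
```

With the 11 environments used here, `n_obs` could never exceed 11.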

P1 vs P2 - Populations

1. Note suspicious clumping of values

2. No clear trends of P1 in relation to P2, which is generally good.

3. It would be nice to highlight individual populations.

RMSE vs P1 : populations

1. Again note suspicious clumping of P1 values

2. Slight trend of increasing RMSE with P1, which makes sense.

3. It would be nice to highlight individual populations.

RMSE vs P2 : populations

1. Again note suspicious clumping of values.

2. No trend of increasing RMSE with P2, which makes sense.

3. It would be nice to highlight individual populations.

RMSE vs Number of observations: populations

1. Suggests goodness of fit declines with fewer than 7 observations but perhaps gets slightly better as observations increase.

2. It would be nice to highlight individual populations or pull up underlying data of individual points.

Why is this point so far off?

RMSE vs slope of observed vs simulated for each line: populations

1. Suspicious clumping of values.

2. It would be nice to highlight individual populations.
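The slope plotted here is taken to be the ordinary least-squares slope of observed (Y) on simulated (X) across the environments of one line; that reading of the axes is an assumption, as is the helper name `ols_slope`. A minimal sketch with made-up numbers:

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

# Made-up days to anthesis for one line in three environments
simulated = [60, 70, 80]
observed = [62, 70, 78]
slope = ols_slope(simulated, observed)
print(slope)
```

A slope near 1 means the model tracks the line's response across environments; values well away from 1 would be the ones to drill into.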

Slope of observed vs simulated for each line: first three populations

1. Clumping problem remains.

2. Limiting to three populations shows interesting differences.

3. But again there may be problems with data points that sit on top of each other.

Array for viewing very large sets of observed or simulated phenotypes – see next page

• Vertical axis is populations (1 to 27)
– Within each population the rows are ordered by location and year:
• 01 Aurora, NY 2006 Summer NY 06128
• 02 Aurora, NY 2007 Summer NY 07135
• 03 Clayton, NC 2006 Summer NC 06122
• 04 Clayton, NC 2007 Summer NC 07120
• 05 Columbia, MO 2006 Summer MO 06137
• 06 Columbia, MO 2007 Summer MO 07138
• 07 Urbana, IL 2006 Summer IL 06128
• 08 Urbana, IL 2007 Summer IL 07137
• 09 Homestead, FL 2006 Winter FL 06265
• 10 Homestead, FL 2007 Winter FL 07282
• 11 Puerto Rico 2006 Winter PR 06314

• The horizontal axis is ordered by mean time to anthesis across all environments (location x year)

• Each symbol is a binned value of observed days to anthesis. White spaces indicate missing values.
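The layout described above can be sketched in a few lines: order the lines by mean days to anthesis, then print one row per environment with a symbol per bin and a blank for missing values. Everything specific here is hypothetical, including the bin edges, the symbols, the environment labels, and the sample data; the original was produced in SAS.

```python
def bin_symbol(days, edges=(60, 70, 80, 90), symbols=".:*#@"):
    """Map days to anthesis to a symbol; missing values become a blank."""
    if days is None:
        return " "
    index = sum(days >= edge for edge in edges)  # number of edges reached
    return symbols[index]

def render_array(data, environments):
    """data: {line: {environment: days or None}}. Returns (ordered lines, rows)."""
    def mean_days(line):
        values = [v for v in data[line].values() if v is not None]
        # Lines with no data at all are pushed to the far right
        return sum(values) / len(values) if values else float("inf")

    ordered = sorted(data, key=mean_days)
    rows = ["".join(bin_symbol(data[line].get(env)) for line in ordered)
            for env in environments]
    return ordered, rows

# Made-up data for three lines in two environments
data = {
    "lineA": {"NY06": 85, "NC06": 75},
    "lineB": {"NY06": 65, "NC06": None},
    "lineC": {"NY06": 72, "NC06": 68},
}
ordered, rows = render_array(data, ["NY06", "NC06"])
for env, row in zip(["NY06", "NC06"], rows):
    print(env, "|" + row + "|")
```

For the full data set, the same rows would be stacked population by population, which is where better scaling and color coding of the bins would help.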


Concluding remarks

• The basic principle in the first two sets of examples is Y vs X with the ability to drill down to subsets of data, especially to identify specific populations, locations or lines (factors that describe the data).

• The third example is more speculative and its real value is unclear. The objective is to provide a quick overview of large arrays of data such as the maize NAM observed anthesis data. The patterns would be clearer with better scaling and color coding of the “bins”. GIS software is much better at this than SAS.