Iowa State University Brigham Young University University of Georgia

Homoeolog-specific Comparative Expression Profiling Platform

Develop a novel homoeolog-specific comparative expression profiling platform using a vastly enriched EST resource.

To bolster our ability to connect phenotype to genotype, we propose to enhance the custom microarrays that we pioneered, which distinguish gene expression of homoeologous loci in allotetraploid cotton. The design depends on a priori detection of SNPs that are diagnostic for the A and D homoeologs, which we infer in allopolyploid cotton after sequencing orthologous ESTs from diploid species bearing these respective genomes. At present, the EST resources for diploid and allopolyploid cotton have permitted the confident alignment of over 2000 pairs of homoeologous contigs, but this number is expected to rise rapidly with more EST data, particularly for the limiting diploid reference genomes (GenBank includes about 230,000 ESTs from allopolyploid cotton, 70,000 from the D-genome, and 40,000 from the A-genome). Our objective is to augment these numbers to the point where we can generate a new microarray capable of distinguishing homoeologs for more than 50% of the genes in the genome.
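As an illustration only, the idea of diagnostic SNPs can be sketched in a few lines of Python. The sketch assumes that an A-genome diploid sequence, a D-genome diploid sequence, and an allopolyploid read have already been aligned to the same coordinates; the function names `diagnostic_snps` and `assign_homoeolog` and the toy sequences are hypothetical and are not part of the project's actual pipeline.

```python
# Minimal sketch, assuming pre-aligned sequences of equal length.
# Function names and example sequences are hypothetical illustrations.

VALID_BASES = set("ACGT")


def diagnostic_snps(a_seq, d_seq):
    """Return {position: (A_base, D_base)} where the two diploid sequences differ.

    Positions with gaps or ambiguous bases in either sequence are skipped.
    """
    if len(a_seq) != len(d_seq):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i: (a, d)
        for i, (a, d) in enumerate(zip(a_seq.upper(), d_seq.upper()))
        if a in VALID_BASES and d in VALID_BASES and a != d
    }


def assign_homoeolog(read, snps):
    """Assign an aligned allopolyploid read to 'A', 'D', or 'ambiguous'.

    The read is scored by how many diagnostic positions match each diploid base.
    """
    read = read.upper()
    a_hits = sum(1 for i, (a, _) in snps.items() if i < len(read) and read[i] == a)
    d_hits = sum(1 for i, (_, d) in snps.items() if i < len(read) and read[i] == d)
    if a_hits > d_hits:
        return "A"
    if d_hits > a_hits:
        return "D"
    return "ambiguous"


# Toy example: the diploid sequences differ at positions 4 and 9.
snps = diagnostic_snps("ACGTACGTAC", "ACGTTCGTAA")
print(snps)
print(assign_homoeolog("ACGTTCGTAA", snps))
```

A real analysis would also need to handle sequencing errors, indels, and polymorphism that arose after polyploid formation, none of which this sketch attempts.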

Toward this end, we propose to use 3'UTR-anchored 454 sequencing to generate a minimum of 480 million bases (80 Mb per taxon = 400,000 reads x 200 bases) of expressed sequence from both diploid genomes and the four allotetraploid parents used to generate the NILs in G. hirsutum and G. barbadense (wild and cultivated forms of each). 454 sequencing (Roche) uses massively parallel pyrosequencing to generate approximately 80 Mb of sequence in a 4-hour sequencing run. This technology has been used to discover coding sequences in Arabidopsis, Medicago, and Zea. Eveland et al. used library-specific, three-base key codes to 'label' anchored-3'UTR transcripts from two maize ovule cDNA samples. Gossypium contains an estimated 45,000 to 53,000 genes. Assuming that half are represented at detectable levels in normalized cDNA samples (a conservative estimate), each gene would be sampled in a 3'UTR-anchored sequence at an average of 16x coverage in a single 454 sequencing run (400,000 reads x 200 bp = 80 Mb; divided by 25,000 genes x 200 bp, this gives 16x coverage). In addition to the coding sequence traditionally generated for ESTs, use of 3'UTR anchors will identify numerous potentially discriminating targets for the homoeologous microarray because purifying selection is weaker on non-coding bases. A typical 200 bp 454 cDNA read will cover both the UTR and a portion of the 3' coding sequence. We may exceed these targets with new 454 chemistry, whose read lengths are projected to approach 400-500 bp in 2008, but at 1.5x to 1.7x the cost of current estimates because longer sequencing-by-synthesis reactions require more reagents. Our intention is to stay alert to improvements in 454 (and similar) technology and to budget for longer 454 reads.
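The yield and coverage arithmetic above can be checked with a short Python sketch. All input figures come from the text (400,000 reads per run, 200 bp reads, 6 taxa, 45,000 to 53,000 genes with half assumed expressed); the function and variable names are hypothetical, and the model simply assumes reads are spread evenly over a 200 bp 3' target per expressed gene.

```python
# Minimal sketch of the sequencing yield and coverage estimate.
# Names are illustrative; the figures are those quoted in the text.

READS_PER_RUN = 400_000    # reads from one 454 run
READ_LENGTH_BP = 200       # typical read length
TAXA = 6                   # 2 diploids + 4 allotetraploid parents
EXPRESSED_GENES = 25_000   # about half of 45,000-53,000 genes


def run_yield_bases(reads, read_length):
    """Total bases produced by one sequencing run."""
    return reads * read_length


def mean_coverage(reads, read_length, expressed_genes, target_length):
    """Average depth if reads fall evenly on one target of target_length per gene."""
    return run_yield_bases(reads, read_length) / (expressed_genes * target_length)


per_taxon_bases = run_yield_bases(READS_PER_RUN, READ_LENGTH_BP)
total_bases = per_taxon_bases * TAXA
coverage = mean_coverage(READS_PER_RUN, READ_LENGTH_BP, EXPRESSED_GENES, READ_LENGTH_BP)

print(f"per taxon: {per_taxon_bases / 1e6:.0f} Mb")   # 80 Mb
print(f"all taxa:  {total_bases / 1e6:.0f} Mb")       # 480 Mb
print(f"coverage:  {coverage:.0f}x")                  # 16x

# Sensitivity to the gene-number estimate (half of 45,000 and half of 53,000).
for total_genes in (45_000, 53_000):
    depth = mean_coverage(READS_PER_RUN, READ_LENGTH_BP, total_genes // 2, READ_LENGTH_BP)
    print(f"{total_genes} genes -> {depth:.1f}x")
```

Across the quoted range of gene numbers the expected depth stays between roughly 15x and 18x, so the 16x figure is not sensitive to the exact gene count. Real coverage will be uneven because normalization never fully equalizes transcript abundance.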

To generate normalized libraries, RNAs will be isolated from plants growing in the same environment, using a diverse set of tissues, as in Hovav et al. To maximize gene discovery and comparability among samples, we will create for each taxon an equimolar mix of RNAs from (1) a bulk of non-fiber organs (e.g., whole seedlings, leaves, roots, floral organs) and (2) fibers (mixing stages from primary through early secondary wall synthesis). To enhance gene representation in each 454 run, cDNAs will be normalized prior to sequencing. We propose to run one plate of 454 sequencing for each of the 6 parental RNA mixes, generating ~480 Mb of ESTs (at current read lengths, which are expected to increase soon).

To facilitate homoeolog identification in allopolyploid cotton, we also propose to use Sanger sequencing to generate an additional 40,000 ESTs from each of the A- and D-genome diploids, both for bioinformatic diagnosis of homoeologs in allopolyploid cotton (the number of ESTs presently available from diploids is the primary limitation in this regard) and to provide additional assembly scaffolding for 454 sequences. GenBank ESTs are based on Sanger sequencing, and these long reads will enhance assembly of 454 reads into contigs. Toward this end, 20,000 clones per diploid will be bi-directionally sequenced (forward and 3'UTR-anchored reverse), yielding 40,000 Sanger ESTs from each of the A- and D-genome diploids. By using an aliquot of the cDNA prepared for the diploid 454 sequencing to create the cDNA libraries, we will maximize overlap between the Sanger and 454 reads.

We welcome your comments and suggestions.