Iowa State University Brigham Young University University of Georgia



EST analysis aids

Cotton assembly file download:   All of the ESTs were combined into a single, omnibus assembly of the Gossypium transcriptome, which we have named Cotton46a, where '46a' refers to the assembly iteration. Sanger and 454 ESTs were assembled using the CLCBio Genomics Workbench (v. 3.7.1) with the following parameters for all collections of input sequence: similarity = 0.95, length fraction = 0.5, insertion cost = 3, deletion cost = 3, mismatch cost = 2. Quality values were used for all 454 reads as well as the G. raimondii Sanger reads, though most other Sanger reads retrieved from GenBank lacked quality scores. ESTScan 3.0.3 was used to predict the coding sequence of each contig based on the codon preference matrix of A. thaliana.
Cotton32: Cotton46:
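As a rough illustration of how the similarity and length fraction settings interact, the sketch below checks whether a single read alignment would satisfy both thresholds. It assumes that length fraction is the minimum share of the read that must align and that similarity is the minimum identity within the aligned portion; the function name and its arguments are hypothetical and are not part of the CLC software.

```python
def passes_thresholds(read_length, aligned_length, matches,
                      similarity=0.95, length_fraction=0.5):
    """Return True if an alignment meets both thresholds.

    read_length    -- total number of bases in the read
    aligned_length -- number of read bases included in the alignment
    matches        -- number of aligned bases identical to the reference
    The defaults are the values quoted for the Cotton46a assembly.
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    # At least `length_fraction` of the read must take part in the alignment.
    long_enough = aligned_length / read_length >= length_fraction
    # The aligned part must be at least `similarity` identical.
    similar_enough = matches / aligned_length >= similarity
    return long_enough and similar_enough


if __name__ == "__main__":
    # 60 of 100 bases aligned, 58 of them identical: both thresholds are met.
    print(passes_thresholds(100, 60, 58))
```

Under these assumptions, a read with only 40 of 100 bases aligned is rejected even if every aligned base matches.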

Genome specific EST assembly file download:  
A-genome: D-genome:

Cotton EST file download:   Cotton fiber RNAs were extracted from G. arboreum (accession cv. AKA8401), G. barbadense (accessions K101 and cv. Pima S6), G. hirsutum (accessions cv. Maxxa and TX2094), and G. raimondii (accession unnamed) using a modified hot-borate protocol described in Hovav et al. [1]. RNA extractions were prepared for sequencing using the Illumina mRNA-Seq Prep Kit. Illumina sequencing (Illumina, Inc., San Diego, CA) was performed by the Iowa State DNA Facility. Quality-filtered reads were aligned to the 56,373 454/Sanger reference contigs using the bwtsw algorithm implemented by the BWA read-mapping software [2], leaving all parameters set to default.
  • 1) Hovav R, Udall JA, Chaudhary B, Rapp R, Flagel L, Wendel JF: Partitioned expression of duplicated genes during development of a single cell in a polyploid plant. Proc Natl Acad Sci USA 2008, 105:6191-6195.
  • 2) Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.
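The alignment step described above might be scripted roughly as follows. This is a sketch under several assumptions: that "bwtsw" refers to BWA's `-a bwtsw` index construction option, that the reads are single-end and aligned with `bwa aln` followed by `bwa samse`, and that the file names (all hypothetical) stand in for the reference contigs and one read file. No non-default alignment parameters are passed.

```python
import subprocess


def bwa_commands(reference, reads, prefix):
    """Build the BWA command lines for one single-end read file.

    Returns a list of (argv, stdout_path) pairs; stdout_path is None when
    the command writes its own output files.
    """
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        # Index the reference contigs with the bwtsw algorithm.
        (["bwa", "index", "-a", "bwtsw", reference], None),
        # Align reads with default parameters; output goes to stdout.
        (["bwa", "aln", reference, reads], sai),
        # Convert the alignment coordinates to SAM format.
        (["bwa", "samse", reference, sai, reads], sam),
    ]


def run_pipeline(reference, reads, prefix):
    """Run each command in order, redirecting stdout where needed."""
    for argv, out_path in bwa_commands(reference, reads, prefix):
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    for argv, out_path in bwa_commands("contigs.fasta", "reads.fastq", "sample"):
        print(" ".join(argv), ">" + out_path if out_path else "")
```

Separating command construction from execution makes it possible to inspect the exact command lines before running BWA.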
EST data are grouped by accession as listed below.
The reads generated by this project are listed on NCBI's SRA. All of the 454 and Illumina next-gen sequencing data can be downloaded in FASTQ format (sequence + quality values). These entries are all listed under the SRA Study number SRP001603.
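For readers unfamiliar with FASTQ, the sketch below reads records made of a header, a sequence, a separator line, and a quality string. It assumes the common four-line-per-record layout and a Phred offset of 33; both are assumptions about the downloaded files rather than statements about them, and the function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n"
    for name, seq, qual in read_fastq(io.StringIO(example)):
        print(name, seq, phred_scores(qual))
```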

The abbreviations of the different DNA/cDNA samples are as follows:
  • A2 = G. arboreum cv. AKA8401 (A-genome; cultivar of G. arboreum - Mideast cotton)
  • D5 = G. raimondii cv. GN33 (D-genome; an accession of G. raimondii growing in the Bessey Greenhouse at Iowa State University)
  • K101 = G. barbadense cv. K101 (AD2 genome; 'wild' type of G. barbadense - kidney bean shaped seed)
  • Maxxa = G. hirsutum cv. Acala Maxxa (AD1 genome; old cultivar with high fiber quality of the Acala type - California cotton)
  • PS6 = G. barbadense cv. Pima-S6 (AD2 genome; old cultivar of Egyptian/Sea Island/Pima cotton)
  • Tx2094 = G. hirsutum cv. Tx2094 (AD1 genome; wild type from the Yucatan Peninsula)
The labels 10 dpa and 20 dpa refer to fiber collected at 10 and 20 days post anthesis.
RNA was extracted from these fibers and cDNA was sequenced at the ISU DNA sequencing facility. The cDNA was NOT normalized, so the expression level of a sequenced gene may be estimated by counting the number of times an individual transcript was sequenced. In addition to access through the SRA, the Illumina sequence runs are available for download from this webpage for your convenience. These files are identical to the cDNA files available through our project on the SRA.
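Because the cDNA was not normalized, a simple read count per reference contig gives a first estimate of expression. The sketch below tallies mapped reads per contig from SAM-format lines; it assumes the reads have already been aligned to the reference contigs, and the function name and sample records are illustrative only. It applies no length or library-size normalization.

```python
from collections import Counter


def count_reads_per_contig(sam_lines):
    """Count primary, mapped alignments per reference contig."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        contig = fields[2]
        if flag & 0x4 or contig == "*":
            continue  # unmapped read
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        counts[contig] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "@SQ\tSN:contig1\tLN:1000",
        "r1\t0\tcontig1\t10\t37\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tcontig1\t50\t37\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t0\tcontig2\t5\t37\t4M\t*\t0\t0\tACGT\tIIII",
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for contig, n in sorted(count_reads_per_contig(sample).items()):
        print(contig, n)
```

Comparing such counts between the 10 dpa and 20 dpa libraries would additionally require accounting for differences in total read number per library.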

10 dpa:
20 dpa:

Download PolyCat pipeline
Read mapping is a fundamental part of next-generation genomic research and is complicated by the genome duplication found in many plants. PolyCat is a pipeline for mapping and categorizing next-generation sequence reads produced from allopolyploid organisms, including whole-genome shotgun, RNA-seq, and bisulfite-treated reads. It is written in C++ and Perl and requires the following programs and libraries to be installed:
  • Samtools
  • Bamtools
  • BioPerl
  • Bio::DB::SAM Perl module
We also recommend the use of the following (but PolyCat can handle any BAM file):
  • Sickle
  • GSNAP
The PolyCat package and the D-genome files required for Gossypium-related applications can be downloaded at the following links.

A PolyCat web portal is also available, to which evaluation sequence data sets may be submitted for mapping and categorization.
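To give a feel for what read categorization in an allopolyploid involves, the sketch below assigns a mapped read to the A or D genome using positions where the two genomes are known to differ. It is not PolyCat's code: the function name, the SNP table layout, the ungapped-alignment simplification, and the category labels (A, D, X for conflicting evidence, N for no informative position) are all assumptions made for illustration.

```python
def categorize_read(read_start, read_seq, homoeo_snps):
    """Assign a read to a genome from genome-diagnostic positions.

    read_start  -- 0-based reference position of the first read base
    read_seq    -- read bases, assumed aligned without gaps
    homoeo_snps -- dict mapping reference position -> (a_base, d_base)
    Returns "A", "D", "X" (bases support both genomes) or "N" (no evidence).
    """
    a_hits = 0
    d_hits = 0
    for offset, base in enumerate(read_seq):
        alleles = homoeo_snps.get(read_start + offset)
        if alleles is None:
            continue  # position does not distinguish the genomes
        a_base, d_base = alleles
        if base == a_base:
            a_hits += 1
        elif base == d_base:
            d_hits += 1
    if a_hits and d_hits:
        return "X"
    if a_hits:
        return "A"
    if d_hits:
        return "D"
    return "N"


if __name__ == "__main__":
    # Hypothetical diagnostic positions: position -> (A-genome base, D-genome base).
    snps = {2: ("A", "G"), 5: ("C", "T")}
    for seq in ("TTATTCTT", "TTGTTTTT", "TTATTTTT"):
        print(seq, categorize_read(0, seq, snps))
```

A real pipeline would also need to handle gapped alignments, base qualities, and, for bisulfite-treated reads, base conversions that can mimic or mask such differences.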

We welcome your comments and suggestions.