EST analysis aids
Cotton assembly file download:
All of the ESTs were combined to create a single, omnibus assembly of the Gossypium transcriptome that we have named
Cotton46a, where '46a' refers to the assembly iteration. Sanger and 454 ESTs were assembled using the CLCBio Genomics Workbench (v. 3.7.1)
with the following parameters set for all collections of input sequence: similarity = 0.95, length fraction = 0.5, insertion cost = 3,
deletion cost = 3, mismatch cost = 2. Quality values were used for all 454 reads as well as the G. raimondii Sanger reads,
though most other Sanger reads retrieved from GenBank lacked quality scores. ESTScan 3.0.3 was used to predict the coding sequences of
each contig based on the codon preference matrix of A. thaliana.
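To make the assembly thresholds concrete, here is a minimal Python sketch of how a single read alignment could be checked against the similarity and length fraction settings and scored with the listed costs. It assumes that length fraction is the minimum share of the read that must align and that similarity is the minimum identity within the aligned part. The class and function names and the match score of 1 are illustrative assumptions, not the CLC implementation.

```python
from dataclasses import dataclass

# Parameter values quoted in the text above.
SIMILARITY = 0.95
LENGTH_FRACTION = 0.5
MISMATCH_COST = 2
INSERTION_COST = 3
DELETION_COST = 3
MATCH_SCORE = 1  # assumption: a match score is not stated in the text


@dataclass
class Alignment:
    """Hypothetical summary of one read-to-contig alignment."""
    read_length: int  # total length of the read
    matches: int      # aligned columns where the bases agree
    mismatches: int   # aligned columns where the bases differ
    insertions: int   # read bases with no reference partner
    deletions: int    # reference bases with no read partner


def aligned_read_bases(aln):
    """Number of read bases that take part in the alignment."""
    return aln.matches + aln.mismatches + aln.insertions


def identity(aln):
    """Share of alignment columns that are exact matches."""
    columns = aln.matches + aln.mismatches + aln.insertions + aln.deletions
    return aln.matches / columns if columns else 0.0


def score(aln):
    """Alignment score: matches rewarded, differences charged their cost."""
    return (MATCH_SCORE * aln.matches
            - MISMATCH_COST * aln.mismatches
            - INSERTION_COST * aln.insertions
            - DELETION_COST * aln.deletions)


def passes_thresholds(aln):
    """True if the alignment meets both the length and similarity cutoffs."""
    long_enough = aligned_read_bases(aln) >= LENGTH_FRACTION * aln.read_length
    similar_enough = identity(aln) >= SIMILARITY
    return long_enough and similar_enough
```

Under these assumptions, a 100-base read with 58 matching and 2 mismatching aligned bases would be accepted, while a read with only 40 aligned bases would be rejected for falling below the length fraction.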
Genome-specific EST assembly file download:
Cotton EST file download:
Cotton fiber RNAs were extracted from G. arboreum (accession cv. AKA8401), G. barbadense (accessions K101 and cv. Pima S6),
G. hirsutum (accessions cv. Maxxa and TX2094), and G. raimondii (accession unnamed) using a modified hot-borate protocol
described in Hovav et al. (1). RNA extractions were prepared for sequencing using the Illumina mRNA-Seq Prep Kit. Illumina sequencing
(Illumina, Inc., San Diego, CA) was performed by the Iowa State DNA Facility. Quality-filtered reads were aligned to the 56,373 454/Sanger
reference contigs using the bwtsw algorithm implemented by the BWA read-mapping software (2), leaving all parameters set to default.
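The alignment step can be sketched as a short Python helper that assembles the BWA command lines without running them. This is an illustration only: it assumes single-end reads and the index / aln / samse workflow, which the text does not spell out, and the function name and file names are hypothetical placeholders.

```python
import shlex


def bwa_commands(reference_fasta, reads_fastq, prefix):
    """Return shell command strings for a default-parameter BWA run.

    Nothing is executed; the strings are only assembled.
    """
    sai = f"{prefix}.sai"
    sam = f"{prefix}.sam"
    steps = [
        # Build the reference index with the bwtsw construction algorithm.
        (["bwa", "index", "-a", "bwtsw", reference_fasta], None),
        # Align the reads with default parameters; output goes to stdout.
        (["bwa", "aln", reference_fasta, reads_fastq], sai),
        # Convert the alignment coordinates to single-end SAM records.
        (["bwa", "samse", reference_fasta, sai, reads_fastq], sam),
    ]
    commands = []
    for argv, stdout_file in steps:
        line = " ".join(shlex.quote(arg) for arg in argv)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        commands.append(line)
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in bwa_commands("contigs.fa", "reads.fq", "sample"):
        print(command)
```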
EST data are grouped by accession as listed below.
- 1) Hovav R, Udall JA, Chaudhary B, Rapp R, Flagel L, Wendel JF: Partitioned expression of
duplicated genes during development of a single cell in a polyploid plant.
Proc Natl Acad Sci USA 2008, 105:6191-6195.
- 2) Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 2009, 25:1754-1760.
The reads generated by this project are listed on NCBI's SRA. All of the 454 and Illumina
next-gen sequencing data can be downloaded in FASTQ format (sequence + quality values).
These entries are all listed under the SRA Study number:
The labels 10 and 20 dpa refer to fiber collected at 10 and 20 days post anthesis.
The abbreviations of the different DNA/cDNA samples are as follows:
- A2 = G. arboreum cv. AKA8401 (A-genome cultivar or G. arboreum - mideast cotton)
- D5 = G. raimondii cv. GN33 (D genome; an accession of G. raimondii growing in the Bessey Greenhouse at Iowa State University)
- K101 = G. barbadense cv. K101 (AD2 genome; 'wild' type of G. barbadense - kidney bean shaped seed)
- Maxxa = G. hirsutum cv. Acala Maxxa (AD1 genome; old cultivar with high fiber quality of the Acala type - California cotton)
- PS6 = G. barbadense cv. Pima-S6 (AD2 genome; old cultivar of Egyptian/Sea Island/Pima cotton)
- Tx2094 = G. hirsutum cv. Tx2094 (AD1 genome; wild type from the Yucatan Peninsula)
RNA was extracted from these fibers and cDNA was sequenced at the ISU DNA sequencing facility.
The cDNA was NOT normalized, so the expression level of a sequenced gene may be estimated by
counting the number of times an individual transcript was sequenced. In addition to access
through the SRA, the Illumina sequence runs are available for download from this webpage
for your convenience. These files are identical to the cDNA files available through our
project on the SRA.
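Estimating expression by counting how often each transcript was sequenced can be sketched in Python as follows. The sketch assumes the reads have already been aligned to the reference contigs and are available as SAM-formatted text lines. The function names and the reads-per-million scaling are illustrative choices, not part of the project's own tooling.

```python
from collections import Counter


def count_reads_per_contig(sam_lines):
    """Count mapped reads per reference contig from SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # SAM column 2: bitwise FLAG
        contig = fields[2]      # SAM column 3: reference sequence name
        if flag & 0x4 or contig == "*":
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary or supplementary record, not a new read
        counts[contig] += 1
    return counts


def reads_per_million(counts):
    """Scale raw counts so that libraries of different depth are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {contig: n * 1_000_000 / total for contig, n in counts.items()}
```

Raw counts are only comparable within one library; dividing by the total number of mapped reads, as in `reads_per_million`, is one simple way to compare samples sequenced to different depths.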
Download PolyCat pipeline
Read mapping is a fundamental part of next-generation genomic research and is complicated by genome duplication in many plants.
PolyCat is a pipeline for mapping and read categorization of next-generation sequence data produced from allopolyploid organisms,
including whole genome shotgun, RNA-seq, and bisulfite-treated reads. It is written in C++ and Perl and needs the following programs
and libraries installed:
- Samtools (samtools.sourceforge.net)
- Bamtools (github.com/pezmaster31/bamtools)
- BioPerl (bioperl.org)
- Bio::DB::SAM perl module
We also recommend the use of the following (but PolyCat can handle any BAM file):
- Sickle (github.com/najoshi/sickle)
- GSNAP (research-pub.gene.com/gmap)
The PolyCat package and D-genome files, required for Gossypium-related applications, can be downloaded at the following links.
A PolyCat web portal is also available, to which evaluation sequence
data sets may be submitted for mapping and categorizing.
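The idea of categorizing reads from an allopolyploid can be illustrated with a small Python sketch. It is not PolyCat's code. It assumes a precomputed table of reference positions where the A and D genomes are known to differ, a gap-free read alignment, and four output labels (A, D, X for conflicting evidence, N for no informative position). The function name, the table layout, and the voting rule are all hypothetical.

```python
def categorize_read(read_start, read_seq, snp_index):
    """Assign a gap-free aligned read to 'A', 'D', 'X' or 'N'.

    read_start: 0-based reference position of the first read base.
    snp_index:  dict mapping a 0-based reference position to a pair
                (a_base, d_base) of genome-diagnostic bases.
    """
    a_votes = 0
    d_votes = 0
    for offset, base in enumerate(read_seq):
        alleles = snp_index.get(read_start + offset)
        if alleles is None:
            continue  # position does not distinguish the genomes
        a_base, d_base = alleles
        if base == a_base:
            a_votes += 1
        elif base == d_base:
            d_votes += 1
        # any other base is treated as uninformative
    if a_votes and d_votes:
        return "X"  # the read carries evidence for both genomes
    if a_votes:
        return "A"
    if d_votes:
        return "D"
    return "N"  # the read overlaps no informative position
```

For example, with diagnostic positions at 2 and 5, a read matching the A-genome base at both positions is labeled A, while a read matching one base from each genome is labeled X.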
We welcome your comments and suggestions.