So far, EST terminus information has been largely ignored in the major public-domain EST databases. This is mainly because most sequence-processing packages and pipelines focus on "cleaning" or "trimming" spurious sequences in ESTs, including vector fragments, insert-flanking restriction endonuclease recognition sites, adapter sequences and/or polyA/T tails. However, our recent research showed that conventional bioinformatics pipelines, which do not inspect cDNA terminus structures, can fail when they process some raw EST trace files. As a result, they can produce unclean or under-trimmed EST sequences, which are likely to have cascading, deleterious effects on the many downstream EST applications that use these sequences. On the other hand, many sequences have been over-trimmed with regard to their terminus structures, and so have lost the directional and positional information of the mRNA 3'/5' ends.

    For C. reinhardtii ESTs, a total of 167,641 EST sequences were deposited in GenBank dbEST as of its 09-25-2007 database release. We downloaded these EST sequences and compared them with our final sequences. It is clear to us that a large number of GenBank EST sequences are either under-trimmed or over-trimmed. Using the following web interfaces, you can find out how many sequences are incorrectly trimmed. As we accumulate more evidence from other species, what we find in C. reinhardtii does not surprise us, because we believe this is a common problem for many species in NCBI dbEST.
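To make the under-trimmed / over-trimmed distinction concrete, here is a minimal Python sketch of one way such a comparison could be done. It is an illustration only, not the procedure we used: it assumes that the public sequence and the reference ("final") sequence come from the same read in the same orientation, and that trimming only removes bases from the ends, so one sequence is contained in the other. The function names `classify_trimming` and `summarize` are hypothetical.

```python
from collections import Counter


def classify_trimming(public_seq: str, reference_seq: str) -> str:
    """Classify a public EST relative to a reference-trimmed version of the same read.

    Assumption (illustrative only): both sequences derive from the same read,
    in the same orientation, and differ only by bases removed from the ends.
    """
    pub = public_seq.upper()
    ref = reference_seq.upper()
    if pub == ref:
        return "same"
    if ref in pub:
        # The public sequence still carries extra flanking bases
        # (for example vector, adapter, or tail residues).
        return "under-trimmed"
    if pub in ref:
        # The public sequence is missing bases that the reference retains.
        return "over-trimmed"
    # Neither contains the other; a real comparison would need alignment.
    return "different"


def summarize(pairs):
    """Count each category over an iterable of (public_seq, reference_seq) pairs."""
    return Counter(classify_trimming(pub, ref) for pub, ref in pairs)
```

Calling `summarize` on a list of sequence pairs returns a count for each category. Real data would normally call for sequence alignment rather than exact substring matching, since sequencing errors break exact containment.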