TXGP RNAseq assembly
From Marcotte Lab
Raw data used for TXGP RNA-seq assembly
Dataset | Contributor | Samples | Reads | Assembled Tx(raw) | X. laevis genes | X. tropicalis genes | H. sapiens genes |
---|---|---|---|---|---|---|---|
Amin201106_XENLA | Nirav Amin, Frank Conlon (UNC) | 2 (no rep) | ~30M/library (75bp, single), 61M total | ~591k | 13,523 | 10,225 | 11,540 |
Park201106_XENLA | Tae Joo Park, Richard Harland (UC Berkeley) | 5 (no rep) | ~100M/library (50bp, single), 500M total | ~1,480k | 14,890 | 12,648 | 13,328 |
TXGP201107_XENLA | TXGP | 2 (1x2 rep) | ~100M/library (100bp, paired), 400M total | ~1,677k | 14,441 | 12,482 | 12,986 |
Chung201110_XENLA | Meii Chung, John Wallingford (UT Austin) | 4 (2x2 rep) | 16~38M/library (50bp, paired), 222M total | ~600k | 11,198 | 7,871 | 9,134 |
Quigley201112_XENLA | Ian Quigley, Christopher R. Kintner (Salk Institute) | 9 (unknown rep) | 23~50M/library (50bp, single), 311M total | ~647k | 13,291 | 10,790 | 11,383 |
Jarikji201201_XENLA | Zeina Jarikji, Marko Horb (MBL) | 9 (3x3 rep) | 39~72M/library (100bp, paired), 932M total | ~3,254k | .. | .. | .. |
TeperekTkacz201202_XENLA | Marta Teperek-Tkacz, John Gurdon (Gurdon Institute) | 1 (no rep) | 94M/library (50bp, paired), 200M total | ~436k | 13,838 | 10,559 | 11,409 |
- Assembled Tx (raw) is the number of peptide query sequences used for the BLAST search.
Pre-processing
- Filter out reads that contain no-calls.
- Trim the 5' or 3' end of reads if necessary.
- For paired-end libraries, re-pair the reads, keeping only pairs in which neither read was filtered out.
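The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual script: it assumes a no-call is an `N` (or `.`) base, that trimming removes a fixed number of bases from each end, and that the two mate files list reads in the same order. The names `has_no_call`, `trim`, `filter_single`, `filter_pairs`, `trim5` and `trim3` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def has_no_call(seq: str) -> bool:
    """True if the read has a no-call base (assumed here to be 'N' or '.')."""
    return any(base in "Nn." for base in seq)


def trim(read: Read, trim5: int = 0, trim3: int = 0) -> Read:
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end."""
    name, seq, qual = read
    end = max(trim5, len(seq) - trim3)
    return name, seq[trim5:end], qual[trim5:end]


def filter_single(reads: Iterable[Read], trim5: int = 0, trim3: int = 0) -> Iterator[Read]:
    """Single-end library: drop reads with no-calls, then trim the rest."""
    for read in reads:
        if not has_no_call(read[1]):
            yield trim(read, trim5, trim3)


def filter_pairs(left: Iterable[Read], right: Iterable[Read],
                 trim5: int = 0, trim3: int = 0) -> Iterator[Tuple[Read, Read]]:
    """Paired-end library: keep a pair only if neither mate has a no-call.

    Assumes both inputs list mates in the same order.
    """
    for r1, r2 in zip(left, right):
        if has_no_call(r1[1]) or has_no_call(r2[1]):
            continue
        yield trim(r1, trim5, trim3), trim(r2, trim5, trim3)
```

Filtering is applied before trimming here to follow the order of the list above; a read whose only no-call lies in the trimmed region is therefore still discarded.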
Tx Assembly
- We currently use the Velvet + Oases pipeline with multiple k-mer sizes (25, 29, 33, 37, 41, 45).
- After the first-round assembly, run a second-round assembly on the contigs from each k-mer size, using k-mer size 33.
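As a rough sketch of the two-round scheme, the function below builds (but does not run) the command lines for one library. The k-mer sizes come from the text; the directory names, the `prefix` parameter and the exact options are assumptions modeled on the merge recipe in the Oases manual, and should be checked against the installed Velvet/Oases versions. Paired-end libraries would normally also need an insert-length option, which is omitted here.

```python
from typing import List

KMERS = [25, 29, 33, 37, 41, 45]  # first-round k-mer sizes (from the text)
MERGE_K = 33                      # second-round k-mer size (from the text)


def assembly_commands(reads: str, paired: bool, prefix: str = "asm") -> List[List[str]]:
    """Return the velveth/velvetg/oases argument lists for one library.

    `reads` is a FASTQ file (interleaved if paired); `prefix` is a
    hypothetical naming scheme for the output directories.
    """
    read_type = "-shortPaired" if paired else "-short"
    commands: List[List[str]] = []

    # Round 1: one independent assembly per k-mer size.
    for k in KMERS:
        out_dir = f"{prefix}_k{k}"
        commands.append(["velveth", out_dir, str(k), "-fastq", read_type, reads])
        commands.append(["velvetg", out_dir, "-read_trkg", "yes"])
        commands.append(["oases", out_dir])

    # Round 2: feed every round-1 transcript set back in as long sequences
    # and assemble them again with k = 33.
    merged_dir = f"{prefix}_k{MERGE_K}_merged"
    round1 = [f"{prefix}_k{k}/transcripts.fa" for k in KMERS]
    commands.append(["velveth", merged_dir, str(MERGE_K), "-long"] + round1)
    commands.append(["velvetg", merged_dir, "-read_trkg", "yes", "-conserveLong", "yes"])
    commands.append(["oases", merged_dir, "-merge", "yes"])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order once the options have been verified.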
Post-processing with orthology
- Translate the k33_merged assembled transcripts into peptides using the standard codon table, and take the longest peptide sequence from the six-frame translation of each transcript.
- BLAST the peptides against model organism protein sequences:
  - EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
  - XenBase: XENLA (2011-dec version)
- Filter the BLAST hits, keeping only those that meet all of the following conditions:
  - E-value < 0.01
  - Alignment fraction (aligned length / min(len(query_seq), len(target_seq))) > 0.50
  - len(query_seq) / len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
- Group the sequences by the model organism sequence they hit (putative ortho-group).
- Run a multiple sequence alignment on each ortho-group with MUSCLE.
- Based on the second-iteration tree generated by MUSCLE, select representative sequences for each cluster (up to 6 sequences per group).
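The translation, hit-filtering and grouping steps can be illustrated with the sketch below. It is not the pipeline's own code and rests on several assumptions: "longest peptide" is taken to mean the longest stop-free stretch in any of the six frames (no start codon required), the thresholds are read as conditions a hit must satisfy to be kept, and lengths are in amino acids. The `Hit` record and all function names are hypothetical. The MUSCLE alignment and representative selection are not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(seq: str) -> str:
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_peptide(transcript: str) -> str:
    """Longest stop-free peptide over all six frames (assumed definition)."""
    forward = transcript.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    best = ""
    for strand in (forward, reverse):
        for frame in range(3):
            for peptide in translate(strand[frame:]).split("*"):
                if len(peptide) > len(best):
                    best = peptide
    return best


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit; lengths are in amino acids."""
    query: str
    target: str
    evalue: float
    align_len: int
    query_len: int
    target_len: int


def passes_filters(hit: Hit) -> bool:
    """Apply the three thresholds listed above."""
    shorter = min(hit.query_len, hit.target_len)
    return (hit.evalue < 0.01
            and hit.align_len / shorter > 0.50
            and hit.query_len / hit.target_len > 0.50)


def group_by_target(hits: Iterable[Hit]) -> Dict[str, List[str]]:
    """Putative ortho-groups: model-organism sequence -> assembled peptides."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for hit in hits:
        if passes_filters(hit) and hit.query not in groups[hit.target]:
            groups[hit.target].append(hit.query)
    return dict(groups)
```

Each value in the dictionary returned by `group_by_target` corresponds to one set of sequences that would be passed to MUSCLE together with its model organism sequence.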
(Further steps are under development.)