Raw data used for TXGP RNA-seq assembly

Dataset	Contributor	Samples	Reads	Assembled Tx(raw)	X. laevis genes	X. tropicalis genes	H. sapiens genes
Amin201106_XENLA	Nirav Amin, Frank Conlon (UNC)	2 (no rep)	~30M/library (75bp, single) 61M total	~ 591k	13,523	10,225	11,540
Park201106_XENLA	Tae Joo Park, Richard Harland (UC Berkeley)	5 (no rep)	~100M/library (50bp, single) 500M total	~ 1,480k	14,890	12,648	13,328
TXGP201107_XENLA	TXGP	2 (1x2 rep)	~100M/library (100bp, paired) 400M total	~ 1,677k	14,441	12,482	12,986
Chung201110_XENLA	Meii Chung, John Wallingford (UT Austin)	4 (2x2 rep)	16~38M/library (50bp, paired) 222M total	~ 600k	11,198	7,871	9,134
Quigley201112_XENLA	Ian Quigley, Christopher R. Kintner (Salk Institute)	9 (unknown rep)	23~50M/library (50bp, single) 311M total	~ 647k	13,291	10,790	11,383
Jarikji201201_XENLA	Zeina Jarikji, Marko Horb (MBL)	9 (3x3 rep)	39~72M/library (100bp, paired) 932M total	~ 3,254k	..	..	..
TeperekTkacz201202_XENLA	Marta Teperek-Tkacz, John Gurdon (Gurdon Institute)	1 (no rep)	94M/library (50bp, paired) 200M total	~ 436k	13,838	10,559	11,409

Pre-processing

Filter out reads that contain no-calls (N).
Trim the 5' or 3' end if necessary.
For paired-end libraries, compile the paired reads, excluding any pair in which either read was filtered out.
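
The pre-processing steps above can be sketched in Python as follows. This is only an illustration, not the actual TXGP script: the function names, the `(name, sequence, quality)` tuple layout, and the trim lengths are hypothetical, and the sketch assumes that a pair is dropped when either mate contains a no-call.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical in-memory read record: (name, sequence, quality string).
Read = Tuple[str, str, str]


def has_no_call(seq: str) -> bool:
    """Return True if the sequence contains a no-call ('N' or '.')."""
    return "N" in seq.upper() or "." in seq


def trim(read: Read, left: int = 0, right: int = 0) -> Read:
    """Remove `left` bases from the 5' end and `right` bases from the 3' end."""
    name, seq, qual = read
    end = len(seq) - right
    return name, seq[left:end], qual[left:end]


def filter_single(reads: Iterable[Read]) -> Iterator[Read]:
    """Keep single-end reads without any no-call."""
    return (r for r in reads if not has_no_call(r[1]))


def filter_pairs(pairs: Iterable[Tuple[Read, Read]]) -> Iterator[Tuple[Read, Read]]:
    """Keep a pair only if neither mate contains a no-call (assumed rule)."""
    for r1, r2 in pairs:
        if not has_no_call(r1[1]) and not has_no_call(r2[1]):
            yield r1, r2
```

Filtering pairs as a unit keeps the two mate files in sync, which paired-end assemblers generally expect.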

We currently use the Velvet + Oases pipeline with several k-mer sizes (25, 29, 33, 37, 41, 45).
After the first-round assembly, run a second-round assembly on the contigs from every k-mer size, using k-mer 33.
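
As a rough sketch of the two assembly rounds, the following Python code only builds the command lines; it does not run them. The directory names (`asm_k25`, `asm_k33_merged`) and function names are hypothetical. The flags follow the multi-k merge procedure described in the Oases manual, and it is an assumption that the second round here uses that merge mode; check the options against the installed Velvet and Oases versions. Insert-length options for paired libraries are omitted.

```python
from typing import List, Sequence

KMERS = [25, 29, 33, 37, 41, 45]  # k-mer sizes listed above
MERGE_K = 33                      # k-mer size of the second round


def first_round(reads: str, k: int, paired: bool) -> List[List[str]]:
    """Commands for one single-k Velvet + Oases assembly."""
    out = f"asm_k{k}"  # hypothetical output directory name
    read_type = "-shortPaired" if paired else "-short"
    return [
        ["velveth", out, str(k), "-fastq", read_type, reads],
        ["velvetg", out, "-read_trkg", "yes"],
        ["oases", out],
    ]


def second_round(kmers: Sequence[int], merge_k: int = MERGE_K) -> List[List[str]]:
    """Commands that merge the per-k transcripts as long reads."""
    out = f"asm_k{merge_k}_merged"  # hypothetical output directory name
    contigs = [f"asm_k{k}/transcripts.fa" for k in kmers]
    return [
        ["velveth", out, str(merge_k), "-long", *contigs],
        ["velvetg", out, "-read_trkg", "yes", "-conserveLong", "yes"],
        ["oases", out, "-merge", "yes"],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` once the paths have been adapted.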

Translate the k33_merged assembled transcripts into peptides with the standard codon table. Take the longest peptide sequence from the 6-frame translation.
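
A minimal sketch of the translation step in plain Python is shown below. It assumes that "longest peptide" means the longest stop-free stretch in any of the six frames, without requiring a start methionine; the pipeline's actual rule may differ. Function names are hypothetical, and codons containing ambiguous bases are translated as `X`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate from position 0; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_peptide(transcript: str) -> str:
    """Longest stop-free peptide over all six reading frames."""
    tx = transcript.upper()
    best = ""
    for strand in (tx, revcomp(tx)):
        for frame in range(3):
            for pep in translate(strand[frame:]).split("*"):
                if len(pep) > len(best):
                    best = pep
    return best
```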
BLAST the peptides against model organism protein sequences:
- EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
- XenBase: XENLA (2011-dec version)
Filter the BLAST hits, keeping only those that meet all of the following conditions:
- E-value < 0.01
- Alignment fraction (aligned length / min(len(query_seq), len(target_seq))) > 0.50
- len(query_seq)/len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
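
The three thresholds can be expressed as a single predicate, sketched below. The `Hit` record and its field names are hypothetical; they would need to be filled from whatever BLAST output format is in use. The sketch assumes that a hit is kept only when all three conditions hold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit."""
    query_id: str
    target_id: str
    evalue: float
    aligned_len: int  # aligned length, in residues
    query_len: int    # length of the assembled peptide
    target_len: int   # length of the model organism protein


def passes(hit: Hit, max_evalue: float = 0.01,
           min_aligned: float = 0.50, min_len_ratio: float = 0.50) -> bool:
    """Return True if the hit satisfies all three conditions listed above."""
    aligned_frac = hit.aligned_len / min(hit.query_len, hit.target_len)
    len_ratio = hit.query_len / hit.target_len
    return (hit.evalue < max_evalue
            and aligned_frac > min_aligned
            and len_ratio > min_len_ratio)
```

For example, a 120-residue peptide aligned over 100 residues to a 200-residue target passes (alignment fraction 0.83, length ratio 0.60), while an 80-residue peptide against the same target fails the length-ratio condition.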
Group the sequences by the model organism sequence they hit (putative ortho-groups).
Run a multiple sequence alignment of each ortho-group with MUSCLE.
Based on the second-iteration tree generated by MUSCLE, select representative sequences for each cluster (up to 6 sequences per group).
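
The grouping step that precedes the alignment could be sketched as follows, assuming the filtered hits are available as `(query_id, target_id)` pairs. The function name and input layout are hypothetical. Tree-based selection of representatives is not shown because the selection rule is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_ortho_groups(hits: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each model organism sequence to the assembled peptides that hit it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for query_id, target_id in hits:
        # Keep each assembled peptide once per group, in first-seen order.
        if query_id not in groups[target_id]:
            groups[target_id].append(query_id)
    return dict(groups)
```

Each group, together with its model organism sequence, would then be written to a FASTA file and passed to MUSCLE.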

(Further steps are under development.)