Latest revision as of 01:17, 13 April 2012

Raw data used for TXGP RNA-seq assembly

Dataset	Contributor	Samples	Reads	Assembled Tx (raw)	X. laevis genes	X. tropicalis genes	H. sapiens genes
Amin201106_XENLA	Nirav Amin, Frank Conlon (UNC)	2 (no rep)	~30M/library (75bp, single); 61M total	~591k	13,523	10,225	11,540
Park201106_XENLA	Tae Joo Park, Richard Harland (UC Berkeley)	5 (no rep)	~100M/library (50bp, single); 500M total	~1,480k	14,890	12,648	13,328
TXGP201107_XENLA	TXGP	2 (1x2 rep)	~100M/library (100bp, paired); 400M total	~1,677k	14,441	12,482	12,986
Chung201110_XENLA	Meii Chung, John Wallingford (UT Austin)	4 (2x2 rep)	16-38M/library (50bp, paired); 222M total	~600k	11,198	7,871	9,134
Quigley201112_XENLA	Ian Quigley, Christopher R. Kintner (Salk Institute)	9 (unknown rep)	23-50M/library (50bp, single); 311M total	~647k	13,291	10,790	11,383
Jarikji201201_XENLA	Zeina Jarikji, Marko Horb (MBL)	9 (3x3 rep)	39-72M/library (100bp, paired); 932M total	~3,254k	14,613	12,342	13,218
TeperekTkacz201202_XENLA	Marta Teperek-Tkacz, John Gurdon (Gurdon Institute)	1 (no rep)	94M/library (50bp, paired); 200M total	~436k	13,838	10,559	11,409

"Assembled Tx (raw)" is the number of peptide query sequences used for the BLAST search.

Pre-processing

Filter out reads that contain no-calls (uncalled bases).
Trim the 5' or 3' end if necessary.
For paired-end libraries, compile the paired reads, keeping only pairs in which neither read was filtered out.
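
The pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the TXGP scripts: the function names, the (name, sequence, quality) read layout, and the rule that one bad mate discards the whole pair are all assumptions made for the example.

```python
def has_no_call(seq):
    """Return True if the read contains an uncalled base ('N' or '.')."""
    return "N" in seq.upper() or "." in seq


def trim_read(read, trim5=0, trim3=0):
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end."""
    name, seq, qual = read
    end = len(seq) - trim3
    return (name, seq[trim5:end], qual[trim5:end])


def compile_pairs(pairs, trim5=0, trim3=0):
    """Keep a pair only if neither mate has a no-call, then trim both mates."""
    kept = []
    for mate1, mate2 in pairs:
        if has_no_call(mate1[1]) or has_no_call(mate2[1]):
            continue  # assumption: one bad mate discards the whole pair
        kept.append((trim_read(mate1, trim5, trim3),
                     trim_read(mate2, trim5, trim3)))
    return kept
```

Single-end libraries would only need has_no_call and trim_read; the trim lengths are left as parameters because the page does not say how many bases were trimmed.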

Tx Assembly

We currently use the velvet+oases pipeline with several k-mer sizes (25, 29, 33, 37, 41, 45).
After the first-round assembly, run a second-round assembly with k-mer 33, using the contigs from every first-round k-mer as input.
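
As a rough sketch of the two rounds, the Python below only builds the command lines; it does not run them. The page does not list the options that were actually used, so the flags here follow the multi-k merge recipe in the Oases manual as best I recall it, and the directory naming (asm_k25, ..., asm_k33_merged) and the interleaved FASTQ input are assumptions for illustration.

```python
KMERS = [25, 29, 33, 37, 41, 45]  # first-round k-mer sizes from the text
MERGE_K = 33                      # k-mer size of the second round


def first_round_commands(reads, paired, prefix="asm"):
    """One velveth/velvetg/oases run per k-mer size."""
    read_flag = "-shortPaired" if paired else "-short"
    cmds = []
    for k in KMERS:
        out = f"{prefix}_k{k}"
        cmds.append(["velveth", out, str(k), "-fastq", read_flag, reads])
        cmds.append(["velvetg", out, "-read_trkg", "yes"])
        cmds.append(["oases", out])
    return cmds


def second_round_commands(prefix="asm"):
    """Re-assemble the transcripts of every first-round run with k = 33."""
    merged = f"{prefix}_k{MERGE_K}_merged"
    contigs = [f"{prefix}_k{k}/transcripts.fa" for k in KMERS]
    return [
        ["velveth", merged, str(MERGE_K), "-long", *contigs],
        ["velvetg", merged, "-read_trkg", "yes", "-conserveLong", "yes"],
        ["oases", merged, "-merge", "yes"],
    ]
```

Each returned list could be passed to subprocess.run(cmd, check=True) once the paths and options have been checked against the installed velvet and oases versions.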

Post-processing with orthology

Translate the k33_merged assembled transcripts into peptides using the standard codon table, and take the longest peptide sequence from the 6-frame translation of each transcript.
BLAST the peptides against model organism protein sequences:
- EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
- XenBase: XENLA (2011-dec version)
Keep only BLAST hits that satisfy the following conditions:
- E-value < 0.01
- Alignment percentage (aligned length / min(len(query_seq), len(target_seq))) > 0.50
- len(query_seq) / len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
Group the sequences by the model organism sequence they hit (putative ortho-group).
Run a multiple sequence alignment of each ortho-group with MUSCLE.
Based on the second-iteration tree generated by MUSCLE, select representative sequences per cluster (up to 6 sequences per group).
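
The translation and hit-filtering steps above can be illustrated with a short Python sketch. It is not the TXGP code. Two points are assumptions: a "peptide" is taken to be any stop-to-stop stretch (the page does not say whether a start codon is required), and the three thresholds are treated as conditions a hit must all pass to be kept. The function names and the hit tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard codon table, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_peptide(tx):
    """Longest stop-free stretch over all six reading frames."""
    tx = tx.upper()
    best = ""
    for strand in (tx, revcomp(tx)):
        for frame in range(3):
            for pep in translate(strand[frame:]).split("*"):
                if len(pep) > len(best):
                    best = pep
    return best


def keep_hit(evalue, aln_len, q_len, t_len):
    """Apply the three thresholds listed above to one BLAST hit."""
    coverage = aln_len / min(q_len, t_len)
    length_ratio = q_len / t_len
    return evalue < 0.01 and coverage > 0.50 and length_ratio > 0.50


def group_by_target(hits):
    """hits: (query_id, target_id, evalue, aln_len, q_len, t_len) tuples."""
    groups = defaultdict(list)
    for query_id, target_id, evalue, aln_len, q_len, t_len in hits:
        if keep_hit(evalue, aln_len, q_len, t_len):
            groups[target_id].append(query_id)
    return dict(groups)
```

Each value in the dictionary returned by group_by_target is one putative ortho-group, which would then be written to a FASTA file together with its model organism sequence and aligned with MUSCLE.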

(Further steps are under development.)

@@ Line 1: / Line 1: @@
-= Dataset for RNA-seq assembly =
+= Raw data used for TXGP RNA-seq assembly =
 {| class="wikitable" style="text-align: center;"
@@ Line 6: / Line 6: @@
 !Samples
 !Reads
-!Assembled transcripts(raw)
+!Assembled Tx(raw)
 !''X. laevis'' genes
 !''X. tropicalis'' genes
@@ Line 15: / Line 15: @@
 |Nirav Amin, Frank Conlon (UNC)
 |2 <br/>(no rep)
-|28~33M <br/>(75bp, single)
+|~30M/library<br/> (75bp, single)<br/> 61M total
-|591,321
+|~ 591k
 |13,523
 |10,225
@@ Line 26: / Line 26: @@
 |Tae Joo Park, Richard Harland (UC Berkeley)
 |5 <br/>(no rep)
-|..
+|~100M/library<br/> (50bp, single)<br/> 500M total
-|..
+|~ 1,480k
-|..
+|14,890
-|..
+|12,648
-|..
+|13,328
 |-
@@ Line 37: / Line 37: @@
 |TXGP
 |2 <br/>(1x2 rep)
-|..
+|~100M/library<br/> (100bp, paired)<br/> 400M total
-|..
+|~ 1,677k
-|..
+|14,441
-|..
+|12,482
-|..
+|12,986
 |-
@@ Line 48: / Line 48: @@
 |Meii Chung, John Wallingford (UT Austin)
 |4 <br/>(2x2 rep)
-|..
+|16~38M/library<br/> (50bp, paired)<br/> 222M total
-|..
+|~ 600k
-|..
+|11,198
-|..
+|7,871
-|..
+|9,134
 |-
@@ Line 59: / Line 59: @@
 |Ian Quigley, Christopher R. Kintner (Salk Institute)
 |9<br/>(unknown rep)
-|..
+|23~50M/library<br/>(50bp,single)<br/>311M total
-|..
+|~ 647k
-|..
+|13,291
-|..
+|10,790
-|..
+|11,383
 |-
@@ Line 69: / Line 69: @@
 |Jarikji201201_XENLA
 |Zeina Jarikji, Marko Horb (MBL)
-|15<br/>(5x3 rep)
+|9<br/>(3x3 rep)
-|..
+|39~72M/library<br/>(100bp,paired)<br/>932M total
-|..
+|~ 3,254k<br/>
-|..
+|14,613
-|..
+|12,342
-|..
+|13,218
 |-
@@ Line 81: / Line 81: @@
 |Marta Teperek-Tkacz, John Gurdon (Gurdon Institute)
 |1<br/>(no rep)
-|..
+|94M/library<br/>(50bp,paired)<br/>200M total
-|..
+|~436k
-|..
+|13,838
-|..
+|10,559
-|..
+|11,409
 |-
 |}
+* assembled Tx == number of peptide query sequences for BLAST search.
+= Pre-processing =
+* Filter out reads with no-call.
+* Trim 5' or 3' end if necessary.
+* For paired-end library, compile paired reads (without filter-out reads at both side of pair).
+= Tx Assembly =
+* We currently use velvet+oases pipeline, with different k-mer (25,29,33,37,41,45).
+* After first-round assembly, do the second round assembly with contigs of each k-mer, with k-mer 33.
+= Post-processing with orthology =
+* Translate k33_merged assembled transcripts into peptides with standard codon table. Take longest peptide sequences from 6-frame translation.
+* Do BLAST to model organism protein sequences
+** EnsEMBL-63: HUMAN, MOUSE, DANRE(zebrafish), XENTR(X. tropicalis)
+** XenBase: XENLA (2011-dec version)
+* Filter out BLAST hits with following conditions.
+** E-value < 0.01
+** Alignment percentage (aligned length/min(query_seq,target_seq)) > 0.50
+** len(query_seq)/len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
+* Make a group of sequences per each model organism sequence (putative ortho-group).
+* Do multiple sequence alignment of ortho-groups with MUSCLE.
+* Based on second-iteration tree (generated by MUSCLE), select representative sequences per clusters (up to 6 sequences per group).
+(under development for further steps)
 ----
 [[Category:XenopusGenome]]

Difference between revisions of "TXGP RNAseq assembly"

