TXGP RNAseq assembly
From Marcotte Lab
Latest revision as of 01:17, 13 April 2012
Raw data used for TXGP RNA-seq assembly
Dataset | Contributor | Samples | Reads | Assembled Tx (raw) | X. laevis genes | X. tropicalis genes | H. sapiens genes |
---|---|---|---|---|---|---|---|
Amin201106_XENLA | Nirav Amin, Frank Conlon (UNC) | 2 (no rep) | ~30M/library (75bp, single); 61M total | ~591k | 13,523 | 10,225 | 11,540 |
Park201106_XENLA | Tae Joo Park, Richard Harland (UC Berkeley) | 5 (no rep) | ~100M/library (50bp, single); 500M total | ~1,480k | 14,890 | 12,648 | 13,328 |
TXGP201107_XENLA | TXGP | 2 (1x2 rep) | ~100M/library (100bp, paired); 400M total | ~1,677k | 14,441 | 12,482 | 12,986 |
Chung201110_XENLA | Meii Chung, John Wallingford (UT Austin) | 4 (2x2 rep) | 16-38M/library (50bp, paired); 222M total | ~600k | 11,198 | 7,871 | 9,134 |
Quigley201112_XENLA | Ian Quigley, Christopher R. Kintner (Salk Institute) | 9 (unknown rep) | 23-50M/library (50bp, single); 311M total | ~647k | 13,291 | 10,790 | 11,383 |
Jarikji201201_XENLA | Zeina Jarikji, Marko Horb (MBL) | 9 (3x3 rep) | 39-72M/library (100bp, paired); 932M total | ~3,254k | 14,613 | 12,342 | 13,218 |
TeperekTkacz201202_XENLA | Marta Teperek-Tkacz, John Gurdon (Gurdon Institute) | 1 (no rep) | 94M/library (50bp, paired); 200M total | ~436k | 13,838 | 10,559 | 11,409 |
- Assembled Tx (raw) is the number of peptide query sequences used for the BLAST search.
Pre-processing
- Filter out reads that contain a no-call.
- Trim the 5' or 3' end if necessary.
- For paired-end libraries, compile the paired reads, keeping only pairs in which neither read was filtered out.
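The filtering rules above can be sketched in Python as follows. This is a minimal illustration, not the lab's actual script: it assumes a no-call is encoded as `N` (or `.` in older Illumina output), that reads are held as `(name, sequence, quality)` tuples, and that the two mate lists are in the same order. It also takes one reading of the paired-end rule, namely that a pair is kept only when both mates pass. The function names are hypothetical.

```python
def has_no_call(seq):
    """Return True if the read contains a no-call base (assumed 'N' or '.')."""
    return "N" in seq.upper() or "." in seq


def filter_single(reads):
    """Keep single-end reads without any no-call base.

    reads: list of (name, sequence, quality) tuples.
    """
    return [r for r in reads if not has_no_call(r[1])]


def filter_pairs(reads1, reads2):
    """Keep only pairs in which neither mate contains a no-call base.

    reads1 and reads2 are assumed to list mates in the same order.
    """
    return [
        (a, b)
        for a, b in zip(reads1, reads2)
        if not has_no_call(a[1]) and not has_no_call(b[1])
    ]
```

Dropping the whole pair when either mate fails keeps the two files in sync, which paired-end assemblers generally expect.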
Tx Assembly
- We currently use the velvet+oases pipeline with several k-mer sizes (25, 29, 33, 37, 41, 45).
- After the first-round assembly, run a second-round assembly on the contigs from each k-mer size, using k-mer 33.
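The two rounds can be laid out as command lines. The sketch below only builds the commands and does not run them. The directory names (`k25`, ..., `k33_merged`), the input file name, and the function names are assumptions for illustration, and the flags follow the multi-k merge recipe described in the Oases manual rather than the exact invocation used for these datasets, so check them against the installed Velvet and Oases versions. For paired-end input, `-shortPaired` assumes an interleaved FASTQ file.

```python
KMERS = [25, 29, 33, 37, 41, 45]  # first-round k-mer sizes from the text
MERGE_K = 33                      # k-mer size for the second-round merge


def first_round_commands(reads_fastq, paired=False):
    """Build velveth/velvetg/oases commands for each first-round k-mer size."""
    read_flag = "-shortPaired" if paired else "-short"
    cmds = []
    for k in KMERS:
        out = f"k{k}"  # hypothetical output directory name
        cmds.append(["velveth", out, str(k), "-fastq", read_flag, reads_fastq])
        cmds.append(["velvetg", out, "-read_trkg", "yes"])
        cmds.append(["oases", out])
    return cmds


def merge_commands():
    """Build the second-round commands that merge the per-k transcripts."""
    out = f"k{MERGE_K}_merged"
    inputs = [f"k{k}/transcripts.fa" for k in KMERS]
    return [
        ["velveth", out, str(MERGE_K), "-long", *inputs],
        ["velvetg", out, "-read_trkg", "yes", "-conserveLong", "yes"],
        ["oases", out, "-merge", "yes"],
    ]


if __name__ == "__main__":
    for cmd in first_round_commands("reads.fastq") + merge_commands():
        print(" ".join(cmd))
```

Each command is a list of arguments, so it could be passed to `subprocess.run(cmd, check=True)` once the paths are confirmed.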
Post-processing with orthology
- Translate the k33_merged assembled transcripts into peptides using the standard codon table, and take the longest peptide from the 6-frame translation of each transcript.
- BLAST the peptides against model organism protein sequences:
  - EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
  - XenBase: XENLA (2011-dec version)
- Keep only BLAST hits that meet all of the following conditions:
  - E-value < 0.01
  - Alignment percentage (aligned length / min(len(query_seq), len(target_seq))) > 0.50
  - len(query_seq) / len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
- Group the sequences by model organism sequence (putative ortho-groups).
- Run multiple sequence alignment on each ortho-group with MUSCLE.
- Based on the second-iteration tree generated by MUSCLE, select representative sequences per cluster (up to 6 sequences per group).
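The translation step and the hit thresholds above can be written out as a short Python sketch. It is an illustration under assumptions, not the pipeline's own code: the function names are hypothetical, "longest peptide" is read as the longest stop-free stretch in any of the six frames, codons containing ambiguous bases are translated as `X`, and the thresholds are read as conditions a hit must satisfy to be kept.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate one frame; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_peptide(transcript):
    """Return the longest stop-free peptide over all six reading frames."""
    seq = transcript.upper()
    strands = (seq, seq.translate(COMPLEMENT)[::-1])  # forward, reverse complement
    best = ""
    for strand in strands:
        for offset in range(3):
            for pep in translate(strand[offset:]).split("*"):
                if len(pep) > len(best):
                    best = pep
    return best


def passes_filter(evalue, aligned_len, query_len, target_len):
    """Apply the three thresholds listed above to one BLAST hit."""
    align_frac = aligned_len / min(query_len, target_len)
    length_ratio = query_len / target_len
    return evalue < 0.01 and align_frac > 0.50 and length_ratio > 0.50
```

For example, a hit with E-value 1e-5 that aligns 120 residues between a 200-residue query and a 300-residue target has an alignment fraction of 0.60 and a length ratio of about 0.67, so it passes; the same hit with a 100-residue query fails the length-ratio test.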
(Further steps are under development.)