Appending information from 2nd file into 1st based on intervals

 



Diya123
New User

Apr 11, 2013, 4:34 PM

Post #1 of 2 (1776 views)

To describe it in detail: in file 2, each row is a gene. Most genes have one 5'UTR start, one 5'UTR stop, one 3'UTR start, and one 3'UTR stop. A gene can have n exons, each with a start and a stop, and n-1 introns, each with an intron start and an intron stop.

Attached is the image

So a gene can have comma-separated lists (with trailing commas) for exons and introns, but not for UTRs.

If the position (4th column) of file 1 falls between a 5'UTR start and a 5'UTR stop, add an additional column to that row of file 1 and fill it with 5UTR_intron.
If the position (4th column) of file 1 falls between a 5'UTR start and a 5'UTR stop and also between an exon start and stop, fill that row with 5UTR_exon.
If the position falls between an exon start and stop, label the row Exon.
If the position falls between an intron start and stop, label the row Intron.
If the position (4th column) of file 1 falls between a 3'UTR start and a 3'UTR stop, add an additional column to that row of file 1 and fill it with 3UTR_intron.
If the position (4th column) of file 1 falls between a 3'UTR start and a 3'UTR stop and also between an exon start and stop, fill that row with 3UTR_exon.
If the position falls in the exons of 2 genes, label the row ambiguous.
If the position does not fall in any of the above categories, label the row intergenic.
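The rules above can be sketched as a small classifier. This is only a minimal Python sketch under several assumptions, not a tested solution for the real files: intervals are treated as inclusive at both ends, a UTR interval may be given in either orientation (the minus-strand sample rows list the larger coordinate first), the first matching gene wins when a position hits more than one gene outside the two-exon case, and the dictionary keys (`exons`, `introns`, `utr5`, `utr3`) are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # (start, stop), both ends treated as inclusive


def in_any(pos: int, intervals: List[Interval]) -> bool:
    """True if pos lies inside at least one interval (either orientation)."""
    return any(min(a, b) <= pos <= max(a, b) for a, b in intervals)


def classify_in_gene(pos: int, gene: Dict) -> Optional[str]:
    """Label pos against a single gene, or return None if it misses the gene."""
    in_exon = in_any(pos, gene["exons"])
    for key, prefix in (("utr5", "5UTR"), ("utr3", "3UTR")):
        utr = gene.get(key)  # None when the gene has no such UTR
        if utr is not None and in_any(pos, [utr]):
            return f"{prefix}_exon" if in_exon else f"{prefix}_intron"
    if in_exon:
        return "Exon"
    if in_any(pos, gene["introns"]):
        return "Intron"
    return None


def classify(pos: int, genes: List[Dict]) -> str:
    """Apply the rules across all genes on the read's chromosome."""
    exon_hits = sum(1 for g in genes if in_any(pos, g["exons"]))
    if exon_hits >= 2:
        return "ambiguous"
    for g in genes:
        label = classify_in_gene(pos, g)
        if label is not None:
            return label  # assumption: the first matching gene wins
    return "intergenic"


# Gene uc002zav.3 (chr21) from the sample rows; the intron starts are assumed.
gene = {
    "exons": [(43731776, 43732379), (43733594, 43733741), (43735402, 43735706)],
    "introns": [(43732380, 43733593), (43733742, 43735401)],
    "utr5": (43735705, 43735527),
    "utr3": (43732364, 43731777),
}
print(classify(43735600, [gene]))  # position inside the 5'UTR and the last exon
print(classify(9827004, [gene]))   # position outside the gene
```

The UTR checks come before the plain Exon and Intron checks because the UTR labels are the more specific ones; whether that ordering is what is wanted for every case is a judgment call.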

Column 13: Exon starts
Column 14: Exon stops
Column 15: 5UTR start
Column 16: 5UTR stop
Column 17: 3UTR start
Column 18: 3UTR stop
Column 19: Intron start
Column 20: Intron stop

Note: Some genes do not have UTRs and some do not have any introns.
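For reference, one way to read a gene row into usable intervals is sketched below in Python. It rests on assumptions that should be checked against the real file: the column numbers listed above are treated as zero-based indices into the whitespace-split row (which matches the sample rows), missing values appear as `NA`, and, because the sample rows carry only one intron list, introns are derived from the gaps between consecutive exons rather than read from columns 19 and 20. The function names are hypothetical, and the header row has to be skipped before calling `parse_gene_row`.

```python
from typing import Dict, List, Optional, Tuple


def parse_list(field: str) -> List[int]:
    """Split '1,2,3,' into [1, 2, 3]; 'NA' or an empty field gives []."""
    return [int(x) for x in field.split(",") if x and x != "NA"]


def parse_pair(start: str, stop: str) -> Optional[Tuple[int, int]]:
    """Return (start, stop) for a UTR, or None when either value is NA."""
    if start == "NA" or stop == "NA":
        return None
    return int(start), int(stop)


def parse_gene_row(line: str) -> Dict:
    """Turn one whitespace-separated gene row into a dict of intervals."""
    fields = line.split()
    exons = list(zip(parse_list(fields[13]), parse_list(fields[14])))
    # Assumption: an intron spans the gap between two consecutive exons.
    introns = [(prev_stop + 1, next_start - 1)
               for (_, prev_stop), (next_start, _) in zip(exons, exons[1:])]
    return {
        "chrom": fields[0],
        "exons": exons,
        "introns": introns,
        "utr5": parse_pair(fields[15], fields[16]),
        "utr3": parse_pair(fields[17], fields[18]),
    }
```

The trailing comma in the exon lists produces an empty last element after `split(",")`, which `parse_list` drops.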

File 1 (10 million rows)



HWUSI-EAS000_29:2:112:15026:1079#0/1 + chr21 9827004

HWUSI-EAS000_29:2:112:1096:1083#0/1 + chr21 46529599

HWUSI-EAS000_29:2:112:6116:1092#0/1 + chr21 9827328

HWUSI-EAS000_29:2:112:7436:1103#0/1 - chr21 38597405

HWUSI-EAS000_29:2:112:3168:1114#0/1 - chr21 44836222

HWUSI-EAS000_29:2:112:12481:1110#0/1 + chr21 45089410

HWUSI-EAS000_29:2:112:16829:1109#0/1 - chr21 11087783

HWUSI-EAS000_29:2:112:6005:1121#0/1 + chr21 11180428

HWUSI-EAS000_29:2:112:12016:1128#0/1 - chr21 38187834

HWUSI-EAS000_29:2:112:4252:1140#0/1 + chr21 46534847

HWUSI-EAS000_29:2:112:14645:1133#0/1 + chr21 46493472

HWUSI-EAS000_29:2:112:16002:1130#0/1 - chr21 47700601

HWUSI-EAS000_29:2:112:13823:1144#0/1 - chr21 46189143

HWUSI-EAS000_29:2:112:16154:1152#0/1 + chr21 9827328

HWUSI-EAS000_29:2:112:9792:1159#0/1 + chr21 9827404

HWUSI-EAS000_29:2:112:1333:1168#0/1 - chr21 46269533

HWUSI-EAS000_29:2:112:6703:1175#0/1 + chr21 46517134



File 2 (gene position file)


Code
hg19.knownCanonical.chrom    Condition_testing    hg19.knownCanonical.chromStart    hg19.knownCanonical.chromEnd    hg19.knownCanonical.transcript    hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.strand    hg19.knownGene.txStart    hg19.knownGene.txEnd    hg19.knownGene.cdsStart    hg19.knownGene.cdsEnd    hg19.knownGene.exonCount    hg19.knownGene.exonStarts    hg19.knownGene.exonEnds    5'UTR_start    5'UTR_stop    3'UTR_start    3'UTR_stop    intron_stop    intron_start  

chr1 1 367658 368597 uc010nxu.2 uc010nxu.2 chr1 + 367658 368597 367658 368597 1 367658, 368597, NA NA NA NA

chr1 1 1266725 1269844 uc010nyk.2 uc010nyk.2 chr1 + 1266725 1269844 1266725 1269844 6 1266725,1267017,1267403,1268300,1268638,1268885, 1266916,1267318,1268186,1268504,1268759,1269844, NA NA NA NA 1267016,1267402,1268299,1268637,1268884

chr1 0 229761980 229795946 uc001hts.1 uc001hts.1 chr1 + 229761980 229795946 229763380 229795044 10 229761980,229763367,229768015,229770663,229779279,229781605,229783256,229786981,229789995,229794846, 229762103,229763506,229768192,229773994,229779440,229781716,229783499,229787069,229790135,229795946, 229761981 229763379 229795045 229795945 229763366,229768014,229770662,229779278,229781604,229783255,229786980,229789994,229794845

chr1 0 206940947 206945839 uc001hen.1 uc001hen.1 chr1 - 206940947 206945839 206941980 206945780 5 206940947,206943173,206944251,206944700,206945615, 206942073,206943239,206944404,206944760,206945839, 206945838 206945781 206941979 206940948 206943172,206944250,206944699,206945614

chr21 0 43731776 43735706 uc002zav.3 uc002zav.3 chr21 - 43731776 43735706 43732365 43735526 3 43731776,43733594,43735402, 43732379,43733741,43735706, 43735705 43735527 43732364 43731777 43733593,43735401

chr21 0 43766466 43771208 uc002zaw.3 uc002zaw.3 chr21 - 43766466 43771208 43766641 43771066 4 43766466,43767594,43769989,43770987, 43766655,43767741,43770139,43771208, 43771207 43771067 43766640 43766467 43767593,43769988,43770986





Output:


Code
HWUSI-EAS000_29:2:112:1096:1083#0/1    +    chr21    46529599 Exon  

HWUSI-EAS000_29:2:112:6116:1092#0/1 + chr21 9827328 3'UTR

HWUSI-EAS000_29:2:112:7436:1103#0/1 - chr21 38597405 5'UTR_Intron

HWUSI-EAS000_29:2:112:3168:1114#0/1 - chr21 44836222 intergenic





The appended values in my output are just for illustration.



Note: The chromosome must also match between the two files. For example, if a row in the read file is on chr21, then we only need to look at the chr21 rows in the gene position file.
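The chromosome matching can be sketched by grouping the genes by chromosome once and then looking up only that group for each read. This is a minimal Python sketch with assumptions: each gene is a dict with a `chrom` key, the read rows are whitespace-separated with the chromosome in the 3rd column and the position in the 4th, and `classify` is a hypothetical callable that takes a position and a list of genes and returns the label to append.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List


def index_by_chrom(genes: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Group gene records by chromosome so each read scans one group only."""
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene["chrom"]].append(gene)
    return by_chrom


def annotate_reads(read_lines: Iterable[str],
                   by_chrom: Dict[str, List[Dict]],
                   classify: Callable[[int, List[Dict]], str]) -> Iterator[str]:
    """Yield each read row with the label appended as an extra column."""
    for line in read_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, pos = fields[2], int(fields[3])
        # .get() avoids creating empty entries for chromosomes with no genes
        label = classify(pos, by_chrom.get(chrom, []))
        yield "\t".join(fields + [label])
```

With 10 million read rows, a linear scan over every gene on the chromosome may be slow; sorting the intervals and using a binary search is one possible refinement, but that is beyond this sketch.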



Thanks,


Kenosis
User

Apr 11, 2013, 6:14 PM

Post #2 of 2 (1770 views)
Re: [Diya123] Appending information from 2nd file into 1st based on intervals

Did you have a question? Do you have code to share that's giving you problems?


(This post was edited by Kenosis on Apr 16, 2013, 3:16 PM)

 
 

