
Diya123
New User
Apr 11, 2013, 4:34 PM
Post #1 of 2
Appending information from 2nd file into 1st based on intervals
To describe it in detail: in file 2 (the gene file), each row is a gene. Most genes have one 5'UTR start, one 5'UTR stop, one 3'UTR start and one 3'UTR stop. A gene can have n exons, each with a start and a stop, and n-1 introns, each with an intron start and an intron stop (an image is attached). So a gene can have comma-separated lists with trailing commas for exons and introns, but not for UTRs.

For each row of file 1 (the reads file), add an additional column and fill it in according to where the position (4th column) falls:

If the position falls between a 5'UTR start and 5'UTR stop, append 5UTR_intron.
If the position falls between a 5'UTR start and 5'UTR stop and also between an exon start and stop, append 5UTR_exon.
If the position falls between an exon start and stop, append Exon.
If the position falls between an intron start and stop, append Intron.
If the position falls between a 3'UTR start and 3'UTR stop, append 3UTR_intron.
If the position falls between a 3'UTR start and 3'UTR stop and also between an exon start and stop, append 3UTR_exon.
If the position falls in the exons of 2 genes, append ambiguous.
If the position does not fall in any of the above categories, append intergenic.

Columns in the gene file:

Column 13: Exon starts
Column 14: Exon stops
Column 15: 5'UTR start
Column 16: 5'UTR stop
Column 17: 3'UTR start
Column 18: 3'UTR stop
Column 19: Intron start
Column 20: Intron stop

Note: some genes do not have UTRs and some do not have any introns.

File 1 (reads, 10 million rows):
<c>
HWUSI-EAS000_29:2:112:15026:1079#0/1 + chr21 9827004
HWUSI-EAS000_29:2:112:1096:1083#0/1 + chr21 46529599
HWUSI-EAS000_29:2:112:6116:1092#0/1 + chr21 9827328
HWUSI-EAS000_29:2:112:7436:1103#0/1 - chr21 38597405
HWUSI-EAS000_29:2:112:3168:1114#0/1 - chr21 44836222
HWUSI-EAS000_29:2:112:12481:1110#0/1 + chr21 45089410
HWUSI-EAS000_29:2:112:16829:1109#0/1 - chr21 11087783
HWUSI-EAS000_29:2:112:6005:1121#0/1 + chr21 11180428
HWUSI-EAS000_29:2:112:12016:1128#0/1 - chr21 38187834
HWUSI-EAS000_29:2:112:4252:1140#0/1 + chr21 46534847
HWUSI-EAS000_29:2:112:14645:1133#0/1 + chr21 46493472
HWUSI-EAS000_29:2:112:16002:1130#0/1 - chr21 47700601
HWUSI-EAS000_29:2:112:13823:1144#0/1 - chr21 46189143
HWUSI-EAS000_29:2:112:16154:1152#0/1 + chr21 9827328
HWUSI-EAS000_29:2:112:9792:1159#0/1 + chr21 9827404
HWUSI-EAS000_29:2:112:1333:1168#0/1 - chr21 46269533
HWUSI-EAS000_29:2:112:6703:1175#0/1 + chr21 46517134
</c>

File 2 (gene position file):
<c>
hg19.knownCanonical.chrom Condition_testing hg19.knownCanonical.chromStart hg19.knownCanonical.chromEnd hg19.knownCanonical.transcript hg19.knownGene.name hg19.knownGene.chrom hg19.knownGene.strand hg19.knownGene.txStart hg19.knownGene.txEnd hg19.knownGene.cdsStart hg19.knownGene.cdsEnd hg19.knownGene.exonCount hg19.knownGene.exonStarts hg19.knownGene.exonEnds 5'UTR_start 5'UTR_stop 3'UTR_start 3'UTR_stop intron_stop intron_start
chr1 1 367658 368597 uc010nxu.2 uc010nxu.2 chr1 + 367658 368597 367658 368597 1 367658, 368597, NA NA NA NA
chr1 1 1266725 1269844 uc010nyk.2 uc010nyk.2 chr1 + 1266725 1269844 1266725 1269844 6 1266725,1267017,1267403,1268300,1268638,1268885, 1266916,1267318,1268186,1268504,1268759,1269844, NA NA NA NA 1267016,1267402,1268299,1268637,1268884
chr1 0 229761980 229795946 uc001hts.1 uc001hts.1 chr1 + 229761980 229795946 229763380 229795044 10 229761980,229763367,229768015,229770663,229779279,229781605,229783256,229786981,229789995,229794846, 229762103,229763506,229768192,229773994,229779440,229781716,229783499,229787069,229790135,229795946, 229761981 229763379 229795045 229795945 229763366,229768014,229770662,229779278,229781604,229783255,229786980,229789994,229794845
chr1 0 206940947 206945839 uc001hen.1 uc001hen.1 chr1 - 206940947 206945839 206941980 206945780 5 206940947,206943173,206944251,206944700,206945615, 206942073,206943239,206944404,206944760,206945839, 206945838 206945781 206941979 206940948 206943172,206944250,206944699,206945614
chr21 0 43731776 43735706 uc002zav.3 uc002zav.3 chr21 - 43731776 43735706 43732365 43735526 3 43731776,43733594,43735402, 43732379,43733741,43735706, 43735705 43735527 43732364 43731777 43733593,43735401
chr21 0 43766466 43771208 uc002zaw.3 uc002zaw.3 chr21 - 43766466 43771208 43766641 43771066 4 43766466,43767594,43769989,43770987, 43766655,43767741,43770139,43771208, 43771207 43771067 43766640 43766467 43767593,43769988,43770986
</c>

Output:
<c>
HWUSI-EAS000_29:2:112:1096:1083#0/1 + chr21 46529599 Exon
HWUSI-EAS000_29:2:112:6116:1092#0/1 + chr21 9827328 3'UTR
HWUSI-EAS000_29:2:112:7436:1103#0/1 - chr21 38597405 5'UTR_Intron
HWUSI-EAS000_29:2:112:3168:1114#0/1 - chr21 44836222 intergenic
</c>

The appended values in this output are just for reference.

Note: the chromosome must also match between the two files. For example, if a row in file 1 is on chr21, then only the chr21 genes in file 2 need to be checked.

Thanks,
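
To make the rules above concrete, here is a minimal Python sketch of the classification step. It is only an illustration under several assumptions of mine: the gene file is whitespace-separated, the column numbers listed above are 0-based field indices (that is how they line up with the sample rows), interval bounds are inclusive, UTR start/stop may be given in either order on the minus strand, and introns are derived as the gaps between consecutive exons instead of being read from the intron columns. The names parse_ints, parse_gene, label_for_gene and classify are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

def parse_ints(field):
    """Turn '1,2,3,' into [1, 2, 3]; 'NA' gives an empty list."""
    return [int(x) for x in field.split(",") if x and x != "NA"]

def parse_gene(line):
    """Parse one whitespace-separated gene row (0-based column indices assumed)."""
    f = line.split()
    exons = sorted(zip(parse_ints(f[13]), parse_ints(f[14])))
    # Assumption: an intron is the gap between two consecutive exons.
    introns = [(a_stop + 1, b_start - 1)
               for (_, a_stop), (b_start, _) in zip(exons, exons[1:])]

    def utr(i, j):
        # Start may be larger than stop on the minus strand, so normalise.
        vals = parse_ints(f[i]) + parse_ints(f[j])
        return (min(vals), max(vals)) if len(vals) == 2 else None

    return {"chrom": f[0], "name": f[5], "exons": exons, "introns": introns,
            "utr5": utr(15, 16), "utr3": utr(17, 18)}

def inside(pos, intervals):
    """True if pos lies in any (low, high) interval, bounds inclusive."""
    return any(lo <= pos <= hi for lo, hi in intervals)

def label_for_gene(pos, gene):
    """Return (label or None, whether pos is in an exon) for a single gene."""
    in_exon = inside(pos, gene["exons"])
    for key, tag in (("utr5", "5UTR"), ("utr3", "3UTR")):
        if gene[key] and inside(pos, [gene[key]]):
            return tag + ("_exon" if in_exon else "_intron"), in_exon
    if in_exon:
        return "Exon", True
    if inside(pos, gene["introns"]):
        return "Intron", False
    return None, False

def classify(chrom, pos, genes_by_chrom):
    """Label one read position using only the genes on the same chromosome."""
    results = [label_for_gene(pos, g) for g in genes_by_chrom.get(chrom, [])]
    if sum(1 for _, in_exon in results if in_exon) >= 2:
        return "ambiguous"
    labels = [label for label, _ in results if label]
    return labels[0] if labels else "intergenic"

# One gene row copied from the sample gene file in the post.
GENE_LINE = ("chr21 0 43731776 43735706 uc002zav.3 uc002zav.3 chr21 - "
             "43731776 43735706 43732365 43735526 3 "
             "43731776,43733594,43735402, 43732379,43733741,43735706, "
             "43735705 43735527 43732364 43731777 43733593,43735401")

genes_by_chrom = defaultdict(list)
gene = parse_gene(GENE_LINE)
genes_by_chrom[gene["chrom"]].append(gene)

print(classify("chr21", 44836222, genes_by_chrom))  # intergenic
```

With 10 million reads, a linear scan over every gene on the chromosome for each read will probably be too slow. Sorting the intervals per chromosome and using a binary search or an interval tree would be the natural next step, but that is beyond this sketch.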