Foro - Perl en Español

por **raquel** » 2007-12-29 08:07 @380

Problema: Tengo un fichero en formato GenBank que almacena 20 secuencias de ADN. A partir de ahí he de extraer la parte codificante (FEATURES-CDS). Supongo que tengo que utilizar las expresiones regulares pero es que no sé ni cómo empezar.

#!/usr/bin/perl -w
# Author: Damien Mattei C.N.R.S / U.N.S.A - UMR 6549
# example: ./idfetch.pl AP001266

use Bio::DB::GenBank;

$gb=new Bio::DB::GenBank();

# this returns a Seq object :
$seq1=$gb->get_Seq_by_acc($ARGV[0]);
print$seq1->display_id()."\n";

foreach$feat($seq1->all_SeqFeatures()){

#print $feat->primary_tag . " " . $feat->source_tag() . "\n" ;

print"Feature from ",$feat->start," to ",
$feat->end," Primary tag  ",$feat->primary_tag,
", produced by ",$feat->source_tag(),"\n";

if($feat->strand== 0 ){
print"Feature applicable to either strand\n";
}else{
print"Feature on strand ",$feat->strand,"\n";# -1,1
}

foreach$tag($feat->all_tags()){
print"Feature has tag ",$tag," with values, ",
join(' ',$feat->each_tag_value($tag)),"\n";
}

print"new feature\n"if$feat->has_tag('new');

}

exit;

__END__

It will display something like that:

[dmattei@pclgmch2 gmap]$ ./idfetch.pl AP001266

AP001266

Feature from 1 to 168978 Primary tag  source, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag chromosome with values, 11

Feature has tag map with values, 11q13

Feature has tag clone with values, RP11-770G2

Feature has tag organism with values, Homo sapiens

Feature has tag db_xref with values, taxon:9606

Feature from 1 to 31550 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 31651 to 48510 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 48611 to 64044 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 64145 to 78208 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 78309 to 89008 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 89109 to 99704 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 99805 to 107965 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 108066 to 116032 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 116133 to 124010 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 124111 to 130494 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 130595 to 136072 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 136173 to 139649 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 139750 to 144590 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 144691 to 148482 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 148583 to 152279 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 152380 to 153632 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment clone_end:T7 

vector_side:left

Feature from 153733 to 155746 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 155847 to 156405 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment clone_end:SP6 

vector_side:right

Feature from 156506 to 158398 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 158499 to 161333 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 161434 to 163304 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 163405 to 164604 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 164705 to 166693 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragment

Feature from 166794 to 168978 Primary tag  misc_feature, produced by 

EMBL/GenBank/SwissProt

Feature on strand 1

Feature has tag note with values, assembly_fragmentColoreado en 0.006 segundos,  usando GeSHi 1.0.8.4

Como ves, el paquete BioPerl sirve para todo lo relacionado con el estudio del ADN. Es MUY recomendable que te lo mires.

Naturalmente, se puede hacer de la forma tradicional, pero con más líneas y tiempo. Si tienes dudas, ya sabes: aquí.

La parte de expresiones regulares la puedes mirar en perlre.

Actualización: Me parece que estas no son las features de las que hablas. Prueba a buscar en Google: features CDS BioPerl.

Actualización: Creo que ya lo encontré: al final de la sección "Getting Sequences" del HOWTO:Feature-Annotation.

por **raquel** » 2007-12-30 05:35 @274

Creo que eso no es lo que me piden. Tengo que extraer la parte codificante de las 20 secuencias. Entonces tengo que hacer un bucle que extraiga la parte comprendida entre ORIGIN y // y después la parte codificante que está indicada en el CDS del tipo:
CDS 56..78
Así pues tengo que emplear un bucle de forma que se quede con la línea del CDS y despues mediante un substr extraer la parte codificante que es la que indican los números.

por **explorer** » 2007-12-30 08:49 @409

A ver... creo que estamos hablando de lo mismo.

La diferencia es que tu quieres tratar el fichero de forma directa mientras que yo te he dado la solución profesional (usar el paquete BioPerl).

Con el siguiente programa se obtienen todas las secuencias de nucleótidos y nombres de los genes, contenidos en un fichero GenBank:

Sintáxis: [ Descargar ] [ Ocultar ]

Using perl Syntax Highlighting

use Bio::SeqIO;

my$seqio_object= Bio::SeqIO->new(-file =>$gb_file);
my$seq_object=$seqio_object->next_seq;

formy$feat_object($seq_object->get_SeqFeatures){
if($feat_object->primary_tageq"CDS"){
print$feat_object->spliced_seq->seq,"\n";
# e.g. 'ATTATTTTCGCTCGCTTCTCGCGCTTTTTGAGATAAGGTCGCGT...'
if($feat_object->has_tag('gene')){
formy$val($feat_object->get_tag_values('gene')){
print"gene: ",$val,"\n";
# e.g. 'NDP', from a line like '/gene="NDP"'
}
}
}
}Coloreado en 0.002 segundos,  usando GeSHi 1.0.8.4

Ahora bien, como tu quieres extraer la información de forma directa, desde el fichero, lo ideal es que publicaras aquí uno de esos ficheros, para que veamos el aspecto que tiene. Puedes abreviar algunas partes (por ejemplo la secuencia de nucleótidos, si es muy larga) mientras se vea la estructura del fichero. A partir de ahí podremos ayudarte más.

por **raquel** » 2007-12-30 11:58 @540

Código: Seleccionar todo: LOCUS NM_130850 1748 bp mRNA linear PRI 18-DEC-2007 DEFINITION Homo sapiens bone morphogenetic protein 4 (BMP4), transcript variant 2, mRNA. ACCESSION NM_130850 VERSION NM_130850.2 GI:157276594 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1748) AUTHORS Goldstein,M., Meller,I. and Orr-Urtreger,A. TITLE FGFR1 over-expression in primary rhabdomyosarcoma tumors is associated with hypomethylation of a 5' CpG island and abnormal expression of the AKT1, NOG, and BMP4 genes JOURNAL Genes Chromosomes Cancer 46 (11), 1028-1038 (2007) PUBMED 17696196 REMARK GeneRIF: Expression analysis of additional genes, AKT1, NOG and its antagonist BMP4, which interact downstream to FGFR1, demonstrated expression differences between primary rhabdomyosarcoma tumors and normal skeletal muscles REFERENCE 2 (bases 1 to 1748) AUTHORS Deng,H., Ravikumar,T.S. and Yang,W.L. TITLE Bone morphogenetic protein-4 inhibits heat-induced apoptosis by modulating MAPK pathways in human colon cancer HCT116 cells JOURNAL Cancer Lett. 256 (2), 207-217 (2007) PUBMED 17640799 REMARK GeneRIF: BMP-4 can protect colon cancer cells from heat-induced apoptosis through enhancing the activation of ERK as well as inhibiting the activation of JNK. REFERENCE 3 (bases 1 to 1748) AUTHORS Milet,J., Dehais,V., Bourgain,C., Jouanolle,A.M., Mosser,A., Perrin,M., Morcet,J., Brissot,P., David,V., Deugnier,Y. and Mosser,J. TITLE Common variants in the BMP2, BMP4, and HJV genes of the hepcidin regulation pathway modulate HFE hemochromatosis penetrance JOURNAL Am. J. Hum. Genet. 81 (4), 799-807 (2007) PUBMED 17847004 REMARK GeneRIF: interactive effect on serum ferritin level of rs235756 in BMP2 and a SNP in HJV, with a small additive effect of a SNP in BMP4 REFERENCE 4 (bases 1 to 1748) AUTHORS Chang,K., Weiss,D., Suo,J., Vega,J.D., Giddens,D., Taylor,W.R. and Jo,H. TITLE Bone morphogenic protein antagonists are coexpressed with bone morphogenic protein 4 in endothelial cells exposed to unstable flow in vitro in mouse aortas and in human coronary arteries: role of bone morphogenic protein antagonists in inflammation and atherosclerosis JOURNAL Circulation 116 (11), 1258-1266 (2007) PUBMED 17785623 REMARK GeneRIF: Endothelial cells coexpress BMP4 antagonists along with BMP4 to minimize the inflammatory response to oscillatory shear stress. REFERENCE 5 (bases 1 to 1748) AUTHORS Milano,F., van Baal,J.W., Buttar,N.S., Rygiel,A.M., de Kort,F., DeMars,C.J., Rosmolen,W.D., Bergman,J.J., VAn Marle,J., Wang,K.K., Peppelenbosch,M.P. and Krishnadath,K.K. TITLE Bone morphogenetic protein 4 expressed in esophagitis induces a columnar phenotype in esophageal squamous cells JOURNAL Gastroenterology 132 (7), 2412-2421 (2007) PUBMED 17570215 REMARK GeneRIF: BMP pathway could play a role in the transformation of normal esophageal squamous cells into columnar cells. REFERENCE 6 (bases 1 to 1748) AUTHORS Nohno,T., Ishikawa,T., Saito,T., Hosokawa,K., Noji,S., Wolsing,D.H. and Rosenbaum,J.S. TITLE Identification of a human type II receptor for bone morphogenetic protein-4 that forms differential heteromeric complexes with bone morphogenetic protein type I receptors JOURNAL J. Biol. Chem. 270 (38), 22522-22526 (1995) PUBMED 7673243 REFERENCE 7 (bases 1 to 1748) AUTHORS Rosenzweig,B.L., Imamura,T., Okadome,T., Cox,G.N., Yamashita,H., ten Dijke,P., Heldin,C.H. and Miyazono,K. TITLE Cloning and characterization of a human type II receptor for bone morphogenetic proteins JOURNAL Proc. Natl. Acad. Sci. U.S.A. 92 (17), 7632-7636 (1995) PUBMED 7644468 REFERENCE 8 (bases 1 to 1748) AUTHORS van den Wijngaard,A., Weghuis,D.O., Boersma,C.J., van Zoelen,E.J., Geurts van Kessel,A. and Olijve,W. TITLE Fine mapping of the human bone morphogenetic protein-4 gene (BMP4) to chromosome 14q22-q23 by in situ hybridization JOURNAL Genomics 27 (3), 559-560 (1995) PUBMED 7558046 REFERENCE 9 (bases 1 to 1748) AUTHORS Oida,S., Iimura,T., Maruoka,Y., Takeda,K. and Sasaki,S. TITLE Cloning and sequence of bone morphogenetic protein 4 (BMP-4) from a human placental cDNA library JOURNAL DNA Seq. 5 (5), 273-275 (1995) PUBMED 7579580 REFERENCE 10 (bases 1 to 1748) AUTHORS Wozney,J.M., Rosen,V., Celeste,A.J., Mitsock,L.M., Whitters,M.J., Kriz,R.W., Hewick,R.M. and Wang,E.A. TITLE Novel regulators of bone formation: molecular clones and activities JOURNAL Science 242 (4885), 1528-1534 (1988) PUBMED 3201241 COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from BQ083000.1, BC020546.2, BX161385.1 and BU683974.1. On Sep 14, 2007 this sequence version replaced gi:19528649. Summary: The protein encoded by this gene is a member of the bone morphogenetic protein family which is part of the transforming growth factor-beta superfamily. The superfamily includes large families of growth and differentiation factors. Bone morphogenetic proteins were originally identified by an ability of demineralized bone extract to induce endochondral osteogenesis in vivo in an extraskeletal site. This particular family member plays an important role in the onset of endochondral bone formation in humans, and a reduction in expression has been associated with a variety of bone diseases, including the heritable disorder Fibrodysplasia Ossificans Progressiva. Alternative splicing in the 5' untranslated region of this gene has been described and three variants are described, all encoding an identical protein. Transcript Variant: This variant (2) utilizes exons 1, 3, 4 and 5 but uses an alternate splice acceptor site on exon 1, resulting in a shorter transcript compared to variant 1. Variants 1, 2 and 3 encode the same protein. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Entrez Gene record to access additional publications. COMPLETENESS: full length. PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP 1-25 BQ083000.1 27-51 26-664 BC020546.2 17-655 665-1728 BX161385.1 642-1705 1729-1748 BU683974.1 1-20 c FEATURES Location/Qualifiers source 1..1748 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="14" /map="14q22-q23" gene 1..1748 /gene="BMP4" /note="bone morphogenetic protein 4; synonyms: ZYME, BMP2B, BMP2B1" /db_xref="GeneID:652" /db_xref="HGNC:1071" /db_xref="HPRD:HPRD_00207" /db_xref="MIM:112262" exon 1..78 /gene="BMP4" /inference="alignment:Splign" /number=1a exon 79..203 /gene="BMP4" /inference="alignment:Splign" /number=3 STS 127..236 /gene="BMP4" /standard_name="MARC_5761-5762:996690803:2" /db_xref="UniSTS:269563" exon 204..580 /gene="BMP4" /inference="alignment:Splign" /number=4 CDS 211..1437 /gene="BMP4" /GO_component="extracellular space" /GO_function="cytokine activity [PMID 14749725]; growth factor activity; signal transducer activity [PMID 9804553]" /GO_process="cartilage development; cell differentiation; growth; negative regulation of myoblast differentiation [PMID 14749725]; negative regulation of striated muscle development [PMID 14749725]; ossification; positive regulation of bone mineralization [PMID 14749725]; positive regulation of osteoblast differentiation [PMID 14749725]; positive regulation of protein amino acid phosphorylation [PMID 14749725]; ureteric bud development [PMID 12141440]" /note="bone morphogenetic protein 2B" /codon_start=1 /product="bone morphogenetic protein 4 preproprotein" /protein_id="NP_570911.2" /db_xref="GI:157276595" /db_xref="CCDS:CCDS9715.1" /db_xref="GeneID:652" /db_xref="HGNC:1071" /db_xref="HPRD:HPRD_00207" /db_xref="MIM:112262" /translation="MIPGNRMLMVVLLCQVLLGGASHASLIPETGKKKVAEIQGHAGG RRSGQSHELLRDFEATLLQMFGLRRRPQPSKSAVIPDYMRDLYRLQSGEEEEEQIHST GLEYPERPASRANTVRSFHHEEHLENIPGTSENSAFRFLFNLSSIPENEVISSAELRL FREQVDQGPDWERGFHRINIYEVMKPPAEVVPGHLITRLLDTRLVHHNVTRWETFDVS PAVLRWTREKQPNYGLAIEVTHLHQTRTHQGQHVRISRSLPQGSGNWAQLRPLLVTFG HDGRGHALTRRRRAKRSPKHHSQRARKKNKNCRRHSLYVDFSDVGWNDWIVAPPGYQA FYCHGDCPFPLADHLNSTNHAIVQTLVNSVNSSIPKACCVPTELSAISMLYLDEYDKV VLKNYQEMVVEGCGCR" exon 581..1733 /gene="BMP4" /inference="alignment:Splign" /number=5 STS 748..1628 /gene="BMP4" /standard_name="BMP4_296" /db_xref="UniSTS:277036" STS 1354..1522 /gene="BMP4" /standard_name="RH66828" /db_xref="UniSTS:66257" STS 1407..1548 /gene="BMP4" /standard_name="RH47295" /db_xref="UniSTS:14466" STS 1440..1545 /gene="BMP4" /standard_name="G15892" /db_xref="UniSTS:65077" polyA_signal 1704..1709 /gene="BMP4" polyA_site 1733 /gene="BMP4" ORIGIN 1 aagaggagga aggaagatgc gagaaggcag aggaggaggg agggagggaa ggagcgcgga 61 gcccggcccg gaagctagga gccattccgt agtgccatcc cgagcaacgc actgctgcag 121 cttccctgag cctttccagc aagtttgttc aagattggct gtcaagaatc atggactgtt 181 attatatgcc ttgttttctg tcaagacacc atgattcctg gtaaccgaat gctgatggtc 241 gttttattat gccaagtcct gctaggaggc gcgagccatg ctagtttgat acctgagacg 301 gggaagaaaa aagtcgccga gattcagggc cacgcgggag gacgccgctc agggcagagc 361 catgagctcc tgcgggactt cgaggcgaca cttctgcaga tgtttgggct gcgccgccgc 421 ccgcagccta gcaagagtgc cgtcattccg gactacatgc gggatcttta ccggcttcag 481 tctggggagg aggaggaaga gcagatccac agcactggtc ttgagtatcc tgagcgcccg 541 gccagccggg ccaacaccgt gaggagcttc caccacgaag aacatctgga gaacatccca 601 gggaccagtg aaaactctgc ttttcgtttc ctctttaacc tcagcagcat ccctgagaac 661 gaggtgatct cctctgcaga gcttcggctc ttccgggagc aggtggacca gggccctgat 721 tgggaaaggg gcttccaccg tataaacatt tatgaggtta tgaagccccc agcagaagtg 781 gtgcctgggc acctcatcac acgactactg gacacgagac tggtccacca caatgtgaca 841 cggtgggaaa cttttgatgt gagccctgcg gtccttcgct ggacccggga gaagcagcca 901 aactatgggc tagccattga ggtgactcac ctccatcaga ctcggaccca ccagggccag 961 catgtcagga ttagccgatc gttacctcaa gggagtggga attgggccca gctccggccc 1021 ctcctggtca cctttggcca tgatggccgg ggccatgcct tgacccgacg ccggagggcc 1081 aagcgtagcc ctaagcatca ctcacagcgg gccaggaaga agaataagaa ctgccggcgc 1141 cactcgctct atgtggactt cagcgatgtg ggctggaatg actggattgt ggccccacca 1201 ggctaccagg ccttctactg ccatggggac tgcccctttc cactggctga ccacctcaac 1261 tcaaccaacc atgccattgt gcagaccctg gtcaattctg tcaattccag tatccccaaa 1321 gcctgttgtg tgcccactga actgagtgcc atctccatgc tgtacctgga tgagtatgat 1381 aaggtggtac tgaaaaatta tcaggagatg gtagtagagg gatgtgggtg ccgctgagat 1441 caggcagtcc ttgaggatag acagatatac acaccacaca cacacaccac atacaccaca 1501 cacacacgtt cccatccact cacccacaca ctacacagac tgcttcctta tagctggact 1561 tttatttaaa aaaaaaaaaa aaaaaggaaa aaatccctaa acattcacct tgaccttatt 1621 tatgacttta cgtgcaaatg ttttgaccat attgatcata tattttgaca aaatatattt 1681 ataactacgt attaaaagaa aaaaataaaa tgagtcatta ttttaaaggt aaaaaaaaaa 1741 aaaaaaaa

Esta es una secuencia. En mi fichero tengo 20 más. Tengo que extraer la parte codificante indicada en el CDS y la parte comprendida entre el ORIGIN y //.

¿Era eso lo que quería que enviara?

por **explorer** » 2007-12-30 12:39 @569

Sí. Una pregunta. ¿Cómo empieza la siguiente secuencia? ¿En el mismo fichero y con una entrada FEATURES otra vez?

por **raquel** » 2007-12-30 12:51 @577

Sí, de nuevo empieza por

Código: Seleccionar todo: LOCUS... DEFINITION... ACCESION... ... FEATURES... ... CDS... ....

y va a continuación dentro del mismo fichero.
El bucle que debe extraerme las partes que he comentado anteriormente debe estar dentro de una subrutina. Por cierto, no tengo mucho nivel porque solo he cursado esta asignatura durante tres meses... Así que a ser posible me gustaría que la sintaxis fuera mucho más básica...

He usado lo que me has mandado pero solo consigo que me abra el fichero, no que me extraiga ambas partes...

por **explorer** » 2007-12-30 21:12 @925

Sintáxis: [ Descargar ] [ Ocultar ]

Using perl Syntax Highlighting

#!/usr/bin/perl
#
# Joaquín Ferrero. Diciembre 2007.
#
# Extracción de secuencias de nucleótidos de un fichero GenBank.
#
# El fichero a tratar se pasa como primer argumento
#

use strict;
use warnings;

# Si no hay argumentos, informamos al usuario
@ARGVordie"Uso: $0 <fichero genbank a usar>\n";

my$fichero=shift@ARGV;

# Abrimos y leemos el fichero
openmy$gbh,'<',$fichero
ordie"ERROR: No pude abrir el fichero $fichero: $!\n";

my@genbank=<$gbh>;

close$gbh;

# Tratamiento del contenido

imprime_secuencias_del(\@genbank);

# Subrutina que imprime las secuencias del genbank
sub imprime_secuencias_del {
my$genbank_ref=shift;

# Tratamiento del fichero: por líneas
# Vamos leyendo línea a línea hasta que encontramos
# un final de sección ORIGIN o fin de fichero, y
# procesamos esa parte
my$parte;
foreach(@{$genbank_ref}){
if(m{^//}){# Si es final de ORIGIN

            procesa($parte);
$parte='';
}
else{# Si no lo es, la guardamos
$parte.=$_;
}
}

# Fin de fichero. Vemos si nos queda algo más que procesar
if($parte){ procesa($parte)}
}

sub procesa {
my$parte=shift;
my@parte=split"\n",$parte;

# Procesamiento de un fichero GenBank.
# Tenemos que leer las partes CDS para saber los rangos,
# y el ORIGIN, con las secuencias de nucleótidos

my@features_cds;# Aquí guardaremos las features
my$origin='';# y aquí, el ORIGIN

for(my$i=0;$i<@parte;$i++){# para todas las líneas

my$linea=$parte[$i];# leemos la línea

# Encontrada una sección de Coding Sequence
if($linea=~/^     CDS/){
my%secuencia_codificante;# almacena una secuencia

# Debemos interpretar las siguientes líneas
# hasta la siguiente sección
do{
if($linea=~m/(\d+)\.\.(\d+)/){# rango de la secuencia
$secuencia_codificante{ parte  }=[$1,$2];
}
elsif($linea=~m{/gene="(.+)"}){# nombre de la secuencia
$secuencia_codificante{ nombre }=$1;
}

$linea=$parte[++$i];# leemos siguiente línea

}while($linea=~/^ {21}/);# mientras estemos dentro del CDS

# Almacenamos la secuencia encontrada (en un array de hash)
push@features_cds,\%secuencia_codificante;

redo;# Volvemos a chequear la misma
# línea (reiniciamos el bucle
# sin mirar la condición del
# for, no sea que venga otra
# sección CDS a continuación)
}

# Encontrada sección ORIGIN
elsif($linea=~/^ORIGIN/){

# leemos la primera línea de nucleótidos
$i++;
$linea=$parte[$i];

# para todas las líneas hasta el final
for(;$i<@parte;$linea=$parte[++$i]){

# División de la línea en dos partes
my($offset,$nucleotidos)=unpack"A9 A*",$linea;

# Tratamiento de la línea
$offset= 0+$offset;# Lo pasamos a número
$nucleotidos=~s/ //g;# fuera espacios

# Ver si tiene un offset numérico
if($offset){
# colocamos los nucleótidos en el lugar indicado por el offset
substr($origin,$offset-1 )=$nucleotidos;
}
# sino, vamos añadiendo los nucleotidos al final
else{
$origin.=$nucleotidos;
}
}
}
}

# Extracción de las secuencias codificantes
foreachmy$secuencia_codificante(@features_cds){

my$nombre_de_la_secuencia=(defined$secuencia_codificante->{nombre})
?$secuencia_codificante->{nombre}
:''
;

my$rango_inicio=$secuencia_codificante->{parte}->[0];
my$rango_fin=$secuencia_codificante->{parte}->[1];

printqq("$nombre_de_la_secuencia"($rango_inicio..$rango_fin):);

printsubstr$origin,$rango_inicio-1,$rango_fin-$rango_inicio+1;

print"\n";
}
}

__END__Coloreado en 0.006 segundos,  usando GeSHi 1.0.8.4

Con este programa, dado el ejemplo anterior, sale:

Código: Seleccionar todo: "BMP4"(211..1437): atgattcctggtaaccgaatgctgatggtcgttttattatgccaagtcctgctaggaggcgcgagccatgctagttt gatacctgagacggggaagaaaaaagtcgccgagattcagggccacgcgggaggacgccgctcagggcagagccatgagctcctgcgggacttcga ggcgacacttctgcagatgtttgggctgcgccgccgcccgcagcctagcaagagtgccgtcattccggactacatgcgggatctttaccggcttca gtctggggaggaggaggaagagcagatccacagcactggtcttgagtatcctgagcgcccggccagccgggccaacaccgtgaggagcttccacca cgaagaacatctggagaacatcccagggaccagtgaaaactctgcttttcgtttcctctttaacctcagcagcatccctgagaacgaggtgatctc ctctgcagagcttcggctcttccgggagcaggtggaccagggccctgattgggaaaggggcttccaccgtataaacatttatgaggttatgaagcc cccagcagaagtggtgcctgggcacctcatcacacgactactggacacgagactggtccaccacaatgtgacacggtgggaaacttttgatgtgag ccctgcggtccttcgctggacccgggagaagcagccaaactatgggctagccattgaggtgactcacctccatcagactcggacccaccagggcca gcatgtcaggattagccgatcgttacctcaagggagtgggaattgggcccagctccggcccctcctggtcacctttggccatgatggccggggcca tgccttgacccgacgccggagggccaagcgtagccctaagcatcactcacagcgggccaggaagaagaataagaactgccggcgccactcgctcta tgtggacttcagcgatgtgggctggaatgactggattgtggccccaccaggctaccaggccttctactgccatggggactgcccctttccactggc tgaccacctcaactcaaccaaccatgccattgtgcagaccctggtcaattctgtcaattccagtatccccaaagcctgttgtgtgcccactgaact gagtgccatctccatgctgtacctggatgagtatgataaggtggtactgaaaaattatcaggagatggtagtagagggatgtgggtgccgctg

Desde luego se puede hacer más corto, pero la sintaxis te será más "complicada". Y desde luego... escribir 143 líneas y dos horas de trabajo cuando ya está resuelto con el BioPerl... creo que te merece la pena aprender a manejarlo...

Actualización: Añadido un +1 en el momento de la extracción de la secuencia, para poder extraer un nucleótido más.

por **raquel** » 2008-01-01 12:54 @579

No he podido mirar hasta hoy el foro... Ahora probaré lo que me has mandado y lo mas importante, intentaré comprenderlo... aunque dudo que sea capaz. De todas formas muchas gracias; ya te contaré cómo me ha ido y si soy capaz de hacer este paso del ejercicio. Te mandaré más dudas...

por **raquel** » 2008-01-02 12:42 @571

Funciona perfectamente pero soy incapaz de seguir y entender el procedimiento... ¿No hay algo más sencillo?

Es que es un trabajo y tengo que explicar todos los pasos que voy siguiendo...

A continuación tengo que identificar los codones con su aminoácido correspondiente y para ello tengo que dividir la secuencia codificante de tres en tres letras. Supongo que será mediante la función substr pero tampoco tengo ni idea de hacerlo...

Foro - Perl en Español

Fichero GenBank

Fichero GenBank

Publicidad

¿Quién está conectado?