Foro - Perl en Español

Subroutine for parsing GenBank annotations

by **leire_12** » 2009-07-31 08:49 @409

Hello!

I need a subroutine to parse the annotations of a GenBank file. It has to take as its argument a variable holding the annotations (which I already have) and return a hash whose keys are the titles of the different sections (LOCUS, FEATURES...) and whose values are the data for each section. How do I do it?

Raquel gave you an example GenBank file. What I have to do is open that file and build a hash with the keys LOCUS, DEFINITION, ACCESSION, etc., and as values:

LOCUS the information

DEFINITION the information

.
.
.

I have the code written (I took it from the book "Beginning Perl for Bioinformatics"), but it does not work. Could you take a look at it? The get_file_data() subroutine splits the DNA from the annotations for me. But I think what fails is the parse_annotation() subroutine, because it returns nothing.

#!usr/bin/perl
use strict;
use warnings;
 
 
my $fh;
my $record;
my $dna='';
my $annotation='';
my $offset;
my %fields = ();
my @features;
my @CDS;
my @codones_key;
my %codones;
my $codon;
my %Aminoacidos;
my $CDS;
my $anotaciones;
my $secuencia_valida;
my $fichero='Escherichiacoli.gb';
my $fichero2= 'Arabidopsisthaliana.gb';
 
 
$fh = open_file($fichero2);
 
$record = get_next_record($fh);
 
($annotation, $dna) = get_annotation_and_dna($record);
 
 
 
 
%fields = parse_annotation ($annotation);
 
 
foreach my $key (keys %fields) {
        print "$key \n";
        print $fields{$key};
}
 
##################################################################
 
sub open_file{
        
        my ($fichero)=@_;
        my $fh;
        unless(open($fh,$fichero))
        {
                print "Error a la hora de abrir el fichero";
                exit;
        }
        
        return $fh;
        
}
 
sub get_next_record {
        #given the filehandle, get the record
     #(we can get the offset by first calling "tell")
    my($fh) = @_;
 
    my($offset);
    my($record) = '';
    my($save_input_separator) = $/;
    $/ = "//\n";
    $record = <$fh>;
    $/ = $save_input_separator;
   
    return $record;
}     
sub get_annotation_and_dna
{
        my($record)=@_;
        
        my($annotation)= '';
        my($dna)= '';
        my($dnaAux) = '';
        my($annotation_aux)='';
        ($annotation, $dna) = ($record=~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
        $annotation_aux = $annotation;
        $dnaAux=$dna;
        #$dna=~ s/[\s\/]//g;
        
        return ($annotation_aux, $dnaAux);
        
}
 
sub parse_annotation {
        my($annotation)= @_;
        my (%results) = ();
        
        while ($annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm) {
                my $value = $&;
                (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
                $results {$key} = $value;}
        
        
        
        return %results;
}

by **explorer** » 2009-07-31 10:27 @477

With this input file:

LOCUS       X72459                   375 bp    mRNA    linear   PRI 14-NOV-2006
DEFINITION  H.sapiens mRNA for rearranged Ig kappa light chain variable region
            (I.169).
ACCESSION   X72459
VERSION     X72459.1  GI:441386
KEYWORDS    immunoglobulin; J-segment; kappa light chain; V-region.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 375)
  AUTHORS   Klein,R., Jaenichen,R. and Zachau,H.G.
  TITLE     Expressed human immunoglobulin kappa genes and their hypermutation
  JOURNAL   Eur. J. Immunol. 23 (12), 3248-3262 (1993)
   PUBMED   8258341
REFERENCE   2  (bases 1 to 375)
  AUTHORS   Zachau,H.G.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-APR-1993) H.G. Zachau, Institut fuer Physiologische
            Chemie, der Universitaet Muenchen, Schillerstr 44, 8000 Muenchen 2,
            FRG
FEATURES             Location/Qualifiers
     source          1..375
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /isolate="M.L."
                     /db_xref="taxon:9606"
                     /chromosome="2"
                     /clone="I.169"
                     /tissue_type="spleen"
                     /clone_lib="lambda zap II phage library"
     CDS             1..375
                     /note="Protein sequence is in conflict with the conceptual
                     translation"
                     /codon_start=1
                     /product="Ig kappa light chain (VJ)"
                     /protein_id="CAA51127.1"
                     /db_xref="GI:441387"
                     /translation="PAQLLGLLLLWLPGARCAIQLTQSPSSLSASVGDRVTITCRASQ
                     GISSALAWYQQKPGKAPKLLIYDASSLESGVPSRFSGSGSGTDFTLTISSLQPEDFAT
                     YYCQQFNTYPLTFGGGTKVEIKR"
     V_region        1..375
                     /product="Ig kappa light chain (VJ)"
     V_segment       53..336
     J_segment       337..375
                     /note="J-Kappa 4"
ORIGIN
        1 cccgctcagc tcctggggct tctgctgctc tggctcccag gtgccagatg tgccatccag
       61 ttgacccagt ctccatcctc cctgtctgca tctgtaggag acagagtcac catcacttgc
      121 cgggcaagtc agggcataag cagtgcttta gcctggtatc agcagaaacc agggaaagct
      181 cctaagctcc tgatctatga tgcctccagt ttggaaagtg gggtcccatc aaggttcagc
      241 ggcagtggat ctgggacaga tttcactctc accatcagca gcctgcagcc tgaagatttt
      301 gcaacttatt actgtcaaca gtttaatact tacccgctca ctttcggcgg agggaccaag
      361 gtggagatca aacga

And this program:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Data::Dumper;

## Read the whole file at once
open my $fh, '<', 'adn.txt' or die "Cannot open adn.txt: $!";
my $adn = join '', <$fh>;
close $fh;

my %genbank;

## Extract the sections: a key in column one, the rest of that line,
## and any continuation lines that start with whitespace
while ($adn =~ /^([A-Z]+)\s*(.*\n(?:^\s.*\n)*)/gm ) {
    my ($clave,$valor) = ($1,$2);
    chomp $valor;
    $genbank{$clave} = $valor;   # a repeated key (e.g. REFERENCE) overwrites the earlier one
}

## Show the resulting hash
print Dumper \%genbank;

__END__

It prints:

$VAR1 = {
          'ACCESSION' => 'X72459',
          'DEFINITION' => 'H.sapiens mRNA for rearranged Ig kappa light chain variable region
            (I.169).',
          'ORIGIN' => '1 cccgctcagc tcctggggct tctgctgctc tggctcccag gtgccagatg tgccatccag
       61 ttgacccagt ctccatcctc cctgtctgca tctgtaggag acagagtcac catcacttgc
      121 cgggcaagtc agggcataag cagtgcttta gcctggtatc agcagaaacc agggaaagct
      181 cctaagctcc tgatctatga tgcctccagt ttggaaagtg gggtcccatc aaggttcagc
      241 ggcagtggat ctgggacaga tttcactctc accatcagca gcctgcagcc tgaagatttt
      301 gcaacttatt actgtcaaca gtttaatact tacccgctca ctttcggcgg agggaccaag
      361 gtggagatca aacga',
          'LOCUS' => 'X72459                   375 bp    mRNA    linear   PRI 14-NOV-2006',
          'REFERENCE' => '2  (bases 1 to 375)
  AUTHORS   Zachau,H.G.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-APR-1993) H.G. Zachau, Institut fuer Physiologische
            Chemie, der Universitaet Muenchen, Schillerstr 44, 8000 Muenchen 2,
            FRG',
          'FEATURES' => 'Location/Qualifiers
     source          1..375
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /isolate="M.L."
                     /db_xref="taxon:9606"
                     /chromosome="2"
                     /clone="I.169"
                     /tissue_type="spleen"
                     /clone_lib="lambda zap II phage library"
     CDS             1..375
                     /note="Protein sequence is in conflict with the conceptual
                     translation"
                     /codon_start=1
                     /product="Ig kappa light chain (VJ)"
                     /protein_id="CAA51127.1"
                     /db_xref="GI:441387"
                     /translation="PAQLLGLLLLWLPGARCAIQLTQSPSSLSASVGDRVTITCRASQ
                     GISSALAWYQQKPGKAPKLLIYDASSLESGVPSRFSGSGSGTDFTLTISSLQPEDFAT
                     YYCQQFNTYPLTFGGGTKVEIKR"
     V_region        1..375
                     /product="Ig kappa light chain (VJ)"
     V_segment       53..336
     J_segment       337..375
                     /note="J-Kappa 4"',
          'KEYWORDS' => 'immunoglobulin; J-segment; kappa light chain; V-region.',
          'VERSION' => 'X72459.1  GI:441386',
          'SOURCE' => 'Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.'
        };

The regular expression looks for sections made up of a word that starts in the first column of the file, followed by zero or more lines that begin with whitespace.
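
To make that description easier to follow, here is a minimal sketch of the same pattern written with the /x modifier so each piece can carry a comment. The name split_sections is made up for this illustration, and storing each section in an array reference is my own variation (an assumption, not part of the program above): it keeps repeated sections, whereas in the output above only REFERENCE 2 survives.

```perl
use strict;
use warnings;

# Hypothetical helper: same pattern as the one-liner, spelled out with /x.
sub split_sections {
    my ($text) = @_;
    my %sections;
    while ($text =~ /
            ^([A-Z]+)        # section name starting in column one
            \s*              # padding between the name and its data
            (.*\n            # rest of the first line
             (?:^\s.*\n)*    # zero or more continuation lines
            )
        /gmx) {
        my ($name, $data) = ($1, $2);
        chomp $data;
        push @{ $sections{$name} }, $data;   # keep every occurrence
    }
    return %sections;
}

# Small made-up excerpt in the layout of the example file.
my $sample = <<'END';
LOCUS       X72459                   375 bp
REFERENCE   1  (bases 1 to 375)
  AUTHORS   Klein,R.
REFERENCE   2  (bases 1 to 375)
  AUTHORS   Zachau,H.G.
END

my %s = split_sections($sample);
print scalar @{ $s{REFERENCE} }, " REFERENCE sections\n";
```

With this layout each value is a list, so a single-valued section such as LOCUS is read as $s{LOCUS}[0].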

by **leire_12** » 2009-07-31 13:52 @619

The problem is that I have to use the code I posted above, because it is the one we covered in class. I cannot see any error in it and I do not know why it does not work.

by **explorer** » 2009-07-31 14:43 @654

I tried running your program with the example file I posted, and found that the problem is the regular expression on line 80 (the match in get_annotation_and_dna): it requires the GenBank file to end with a '//' line.

You have two options: either make sure the files you read have that marker at the end, adding it when it is missing, or change the regular expression on line 80.
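
A rough sketch of the first option, assuming the record is already in a scalar. The helper name ensure_terminator is hypothetical; the match at the end is the unchanged expression from line 80.

```perl
use strict;
use warnings;

# Hypothetical helper: return the record with a closing "//" line,
# adding one only when it is missing.
sub ensure_terminator {
    my ($record) = @_;
    $record .= "\n"   unless $record =~ /\n\z/;        # finish the last line
    $record .= "//\n" unless $record =~ m{^//\s*\z}m;  # append the marker
    return $record;
}

# Made-up record without the final "//" line.
my $record = "LOCUS       X72459\nORIGIN\n        1 cccgctcagc\n";

# The original expression now finds its terminator.
my ($annotation, $dna) =
    ensure_terminator($record) =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s;
print defined($annotation) ? "matched\n" : "no match\n";
```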

The expression is written that way because, as the user wampaier points out in the thread 'Contar ocurrencias', those two characters are used as a separator, within a single file, to tell the different GenBank records apart.
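
As an illustration of that separator role, this sketch reads every record of a multi-record file by setting the input record separator to "//\n", which is the same idea get_next_record uses for a single record. The helper name read_records and the two tiny records are invented for the example, and the in-memory filehandle is there only so the snippet runs without a file on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from an open filehandle,
# using the "//" line as the input record separator.
sub read_records {
    my ($fh) = @_;
    local $/ = "//\n";                 # restored automatically on return
    my @records;
    while (my $record = <$fh>) {
        next unless $record =~ /\S/;   # skip a trailing blank chunk
        push @records, $record;
    }
    return @records;
}

# Demo on an in-memory file holding two made-up records.
my $data = "LOCUS       AAA\nORIGIN\n        1 acgt\n//\n"
         . "LOCUS       BBB\nORIGIN\n        1 ttga\n//\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
print scalar(@records), " records\n";
```

Using local instead of saving and restoring $/ by hand means the separator goes back to its previous value even if the subroutine exits early.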

If you are only going to process GenBank files made up of a single record, then I think it could be changed to this: /^(LOCUS.*ORIGIN\s*\n)(.*)\n/s That is, drop the '//' separator.
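
If you would rather not choose between the two cases, one possibility is to make the terminator optional. This is only a sketch: split_record is a hypothetical stand-in for get_annotation_and_dna, and the pattern is my own variant, not the one from the book.

```perl
use strict;
use warnings;

# Hypothetical variant of get_annotation_and_dna: the "//" line is optional.
sub split_record {
    my ($record) = @_;
    my ($annotation, $dna) =
        $record =~ m{\A(LOCUS.*ORIGIN\s*\n)(.*?)(?://\s*)?\z}s;
    return ($annotation, $dna);   # both undef when the record does not match
}

# The same made-up record, with and without the final "//" line.
for my $record ("LOCUS       X72459\nORIGIN\n        1 acgt\n//\n",
                "LOCUS       X72459\nORIGIN\n        1 acgt\n") {
    my ($annotation, $dna) = split_record($record);
    $dna =~ s/[\s\d]//g;          # drop spacing and position numbers
    print "$dna\n";
}
```

The sequence part is matched lazily so that the optional '//' at the very end is left out of it when present.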

Another detail, on line 1:
#!usr/bin/perl
If you are on Windows that is not a problem, but on Unix/Linux it is.
The correct form is:
#!/usr/bin/perl

Now I get:

ORIGIN
ORIGIN
ACCESSION
ACCESSION   X72459
LOCUS
LOCUS       X72459                   375 bp    mRNA    linear   PRI 14-NOV-2006
FEATURES
FEATURES             Location/Qualifiers
     source          1..375
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /isolate="M.L."
                     /db_xref="taxon:9606"
                     /chromosome="2"
                     /clone="I.169"
                     /tissue_type="spleen"
                     /clone_lib="lambda zap II phage library"
     CDS             1..375
                     /note="Protein sequence is in conflict with the conceptual
                     translation"
                     /codon_start=1
                     /product="Ig kappa light chain (VJ)"
                     /protein_id="CAA51127.1"
                     /db_xref="GI:441387"
                     /translation="PAQLLGLLLLWLPGARCAIQLTQSPSSLSASVGDRVTITCRASQ
                     GISSALAWYQQKPGKAPKLLIYDASSLESGVPSRFSGSGSGTDFTLTISSLQPEDFAT
                     YYCQQFNTYPLTFGGGTKVEIKR"
     V_region        1..375
                     /product="Ig kappa light chain (VJ)"
     V_segment       53..336
     J_segment       337..375
                     /note="J-Kappa 4"
REFERENCE
REFERENCE   2  (bases 1 to 375)
  AUTHORS   Zachau,H.G.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-APR-1993) H.G. Zachau, Institut fuer Physiologische
            Chemie, der Universitaet Muenchen, Schillerstr 44, 8000 Muenchen 2,
            FRG
KEYWORDS
KEYWORDS    immunoglobulin; J-segment; kappa light chain; V-region.
VERSION
VERSION     X72459.1  GI:441386
SOURCE
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
DEFINITION
DEFINITION  H.sapiens mRNA for rearranged Ig kappa light chain variable region
            (I.169).

by **leire_12** » 2009-08-13 05:01 @251

Thank you very much! It is solved now.
