Forum - Perl en Español

by **Alfumao** » 2011-01-04 06:31 @313

Good morning,

I have been going round in circles trying to find a regular expression that lets me extract the lines with the following structure from a results report.

Sample line:

 2.4e-24  82.8 0.2  4.1e-24  82.1  0.1  1.4  1  STRM_2308 putative Histi

I think the main problem is the regular expression used to detect the number in "e" (exponent) notation (2.4e-24). I tried the following, and none of them worked:

/^\s+\d+\s+/
/^\s+\d+e\-\d+/
/^\s+\d+\e\-\d+/

Any suggestions?

Thank you very much

Query:       HisKA_3  [M=70]

Accession:   PF07730.3

Description: Histidine kinase

Scores for complete sequences (score includes all domains):

   --- full sequence ---   --- best 1 domain ---    -#dom-

    E-value  score  bias    E-value  score  bias    exp  N  Sequence                  Description

    ------- ------ -----    ------- ------ -----   ---- --  --------                  -----------

    2.4e-24   82.8   0.2    4.1e-24   82.1   0.1    1.4  1  NP_722264.1               putative histidine kinase [Strepto

    1.4e-22   77.2   0.3    2.6e-22   76.3   0.2    1.4  1  NP_721890.1               putative histidine kinase [Strepto

    4.3e-22   75.5   3.4      9e-22   74.5   2.4    1.6  1  NP_720926.1               putative histidine kinase [Strepto

  ------ inclusion threshold ------

        2.7    4.8   9.6       0.24    8.2   0.9    3.1  4  24378851|ref|NP_720806.1| hypothetical protein SMU.354 [Stre

Domain annotation for each sequence (and alignments):

>> 24380309|ref|NP_722264.1|  putative histidine kinase [Streptococcus mutans UA159]

   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc

 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----

   1 !   82.1   0.1   8.4e-27   4.1e-24       1      70 []     245     312 ..     245     312 .. 0.99

  Alignments for each domain:

  == domain 1    score: 82.1 bits;  conditional E-value: 8.4e-27

                    HisKA_3   1 ERaRIARELHDsvgQsLsaiklqlelarrlldsdkdpeearealdeirelarealaevRrllgdLRpaal 70 

                                ER+RIARE+HD++g++L++i+  ++++  l+  d dp  a+ +l+++ + +re++++vRr l  +Rp+al

  24380309|ref|NP_722264.1| 245 ERKRIAREIHDTLGHALTGISAGIDAVTVLV--DFDPNHAKSQLKNVSDVVREGIQDVRRSLEKMRPGAL 312

                                9******************************..**********************************987 PP

>> 24379935|ref|NP_721890.1|  putative histidine kinase [Streptococcus mutans UA159]

   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc

 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----

   1 !   76.3   0.2   5.3e-25   2.6e-22       1      66 [.     147     211 ..     147     215 .. 0.97

  Alignments for each domain:

  == domain 1    score: 76.3 bits;  conditional E-value: 5.3e-25

                    HisKA_3   1 ERaRIARELHDsvgQsLsaiklqlelarrlldsdkdpe.earealdeirelarealaevRrllgdLR 66 

                                ER+RI R+LHD++g+++++++l++ela++++  +k    +++++l+e+ +++++++ evR+l+ +L+

  24379935|ref|NP_721890.1| 147 ERNRIGRDLHDTLGHTFAMMSLKTELALKQM--KKGRYeAVQKQLEELNQISHDSMHEVRELVNHLK 211

                                9******************************..999999**************************97 PP

>> 24378971|ref|NP_720926.1|  putative histidine kinase [Streptococcus mutans UA159]

   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc

 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----

   1 !   74.5   2.4   1.8e-24     9e-22       1      70 []     132     200 ..     132     200 .. 0.94

  Alignments for each domain:

  == domain 1    score: 74.5 bits;  conditional E-value: 1.8e-24

                    HisKA_3   1 ERaRIARELHDsvgQsLsaiklqlelarrlldsdkdpeearealdeirelarealaevRrllgdLRpaal 70 

                                ER+RIAR+LHD+v+Q L+a ++ l+++ ++ld + +  +++++l +i+++++ a++++R ll +LRp++l

  24378971|ref|NP_720926.1| 132 ERKRIARDLHDTVSQELFASSMILSGVSHNLD-QLEKKQLQTQLLAIEDMLNNAQNDLRVLLLHLRPTEL 200

                                9******************************6.44444*****************************987 PP

>> 24378851|ref|NP_720806.1|  hypothetical protein SMU.354 [Streptococcus mutans UA159]

   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc

 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----

   1 ?    8.2   0.9   0.00048      0.24      23      64 ..      48      84 ..      45      85 .. 0.81

   2 ?    1.4   0.1     0.061        30      30      59 ..      85     112 ..      82     114 .. 0.86

   3 ?   -2.6   0.1      0.98   4.8e+02      55      62 ..     119     126 ..     116     131 .. 0.67

   4 ?   -2.2   0.0      0.79   3.9e+02      49      63 ..     150     164 ..     143     166 .. 0.67

  Alignments for each domain:

  == domain 1    score: 8.2 bits;  conditional E-value: 0.00048

                    HisKA_3 23 qlelarrlldsdkdpeearealdeirelarealaevRrllgd 64

                               qle+a      + +  ++ ++l++++  +++ l+++R++l++

  24378851|ref|NP_720806.1| 48 QLEVAN-----KNQLLAINQQLTRLQNDLSQQLTDLREVLHQ 84

                               666665.....44444889*********************95 PP

  == domain 2    score: 1.4 bits;  conditional E-value: 0.061

                    HisKA_3  30 lldsdkdpeearealdeirelarealaevR 59 

                                +l   +  ++++++l++i   ++++ +e+ 

  24378851|ref|NP_720806.1|  85 NL--NDSRDRSDKRLEQINLQLNQSVKEMQ 112

                                66..778889*****************995 PP

  == domain 3    score: -2.6 bits;  conditional E-value: 0.98

                    HisKA_3  55 laevRrll 62 

                                l+e+R+++

  24378851|ref|NP_720806.1| 119 LEEMRQTV 126

                                45666665 PP

  == domain 4    score: -2.2 bits;  conditional E-value: 0.79

                    HisKA_3  49 elarealaevRrllg 63 

                                e ++++l e+ ++++

  24378851|ref|NP_720806.1| 150 ENVNQGLGEMKNMAR 164

                                456677777766665 PP

Internal pipeline statistics summary:

-------------------------------------

Query model(s):                            1  (70 nodes)

Target sequences:                       1960  (579731 residues)

Passed MSV filter:                        66  (0.0336735); expected 39.2 (0.02)

Passed bias filter:                       66  (0.0336735); expected 39.2 (0.02)

Passed Vit filter:                        16  (0.00816327); expected 2.0 (0.001)

Passed Fwd filter:                         5  (0.00255102); expected 0.0 (1e-05)

Initial search space (Z):               1960  [actual number of targets]

Domain search space  (domZ):               4  [number of targets reported over threshold]

# CPU time: 0.05u 0.00s 00:00:00.05 Elapsed: 00:00:00.04

# Mc/sec: 1014.53

//

The lines I want to extract are the ones in the Scores for complete sequences section.

by **explorer** » 2011-01-04 10:28 @477

If we assume that the part we are interested in lies between the lines Scores for complete sequences and inclusion threshold, we can use those two lines as markers to delimit the region we want to analyze:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
 
use autodie;
 
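open my $fh, '<', 'kk.txt';    # three-argument open; autodie dies on failure
 
while (<$fh>) {
    if (/^Scores for complete sequences/ .. /inclusion threshold/) {    # range (flip-flop) operator
        print if /^\s*\d[\d.e+-]*/;    # keep only lines that start with a digit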
    }
}
 
close $fh;
 
__END__
    2.4e-24   82.8   0.2    4.1e-24   82.1   0.1    1.4  1  NP_722264.1 putative histidine kinase [Strepto
    1.4e-22   77.2   0.3    2.6e-22   76.3   0.2    1.4  1  NP_721890.1 putative histidine kinase [Strepto
    4.3e-22   75.5   3.4      9e-22   74.5   2.4    1.6  1  NP_720926.1 putative histidine kinase [Strepto
 

There is one change in the regular expression: it has an extra \d at the beginning. Without it, the lines that begin with hyphens would be printed as well. By putting the \d first, we force the line to begin with a digit (after any leading whitespace).
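As a small illustration of the point about the leading \d, the sketch below runs the three patterns from the original question, the pattern from the script above, and the same pattern without its leading \d against two lines copied from the report: a data line and the dashed separator. The helper name `matches` and the hash keys are hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample lines copied from the report: a data line and the dashed separator.
my $data   = '    2.4e-24   82.8   0.2    4.1e-24   82.1   0.1    1.4  1  NP_722264.1';
my $dashes = '    ------- ------ -----    ------- ------ -----   ---- --  --------';

# The three attempts from the question, the pattern used in the script,
# and that same pattern with the leading \d removed.
my %pattern = (
    attempt_1 => qr/^\s+\d+\s+/,
    attempt_2 => qr/^\s+\d+e\-\d+/,
    attempt_3 => qr/^\s+\d+\e\-\d+/,
    with_d    => qr/^\s*\d[\d.e+-]*/,
    without_d => qr/^\s*[\d.e+-]*/,
);

# Hypothetical helper: returns 1 if the line matches the pattern, 0 otherwise.
sub matches {
    my ($line, $re) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $name (sort keys %pattern) {
    printf "%-10s data=%d dashes=%d\n",
        $name,
        matches($data,   $pattern{$name}),
        matches($dashes, $pattern{$name});
}
```

The three attempts fail on the data line because \d+ cannot get past the decimal point in 2.4e-24 (and, in the third one, \e stands for the escape character rather than a literal "e"). Without the leading \d, the pattern can match the empty string, so it accepts the dashed line too.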

by **Alfumao** » 2011-01-11 14:47 @657

Thank you very much, explorer, but there is a problem. When reading the report, I need each line to be recognized individually so that I can then separate its elements (via split). Once the line is divided into parts and each part is assigned to its own variable, I can parse the line and write an output file containing only the data I am interested in (I hope that makes sense).

In other words, if I select the whole block of results at once, as you propose, I cannot extract the data I am interested in the way I described. That is why I need the regular expression to recognize that kind of line individually...

Thank you very much for your help, as always. :wink:

by **explorer** » 2011-01-11 14:58 @665

I am not printing a block.

If you look at line 12 of the script, I am only printing one line at a time. That is where the regular expression for the interesting lines is.

You only need to modify that line so that it processes each line with split().

if (/^\s*\d[\d.e+-]*/) {
    # ... process the line
}
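One possible way to fill in the body of that if block is sketched below. It assumes the ten whitespace-separated columns shown in the report header (E-value, score, bias, E-value, score, bias, exp, N, Sequence, Description). The function name `parse_hit` and the hash keys are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: splits one hit line into named fields.
# Returns a hash reference, or nothing if the line does not start with a digit.
sub parse_hit {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^\s*\d[\d.e+-]*/;

    # split ' ' skips leading whitespace and splits on runs of whitespace;
    # the limit of 10 keeps the free-text description in one piece.
    my @f = split ' ', $line, 10;

    return {
        evalue      => $f[0],
        score       => $f[1],
        bias        => $f[2],
        sequence    => $f[8],
        description => $f[9],
    };
}

my $sample = '    2.4e-24   82.8   0.2    4.1e-24   82.1   0.1    1.4  1  NP_722264.1               putative histidine kinase [Strepto';

if (my $hit = parse_hit($sample)) {
    # Tab-separated output with only the columns of interest.
    print join("\t", @{$hit}{qw(sequence evalue score)}), "\n";
}
```

Inside the while loop of the script, the same helper could be called as parse_hit($_) for every line that falls within the range, and the print could be directed to an output filehandle instead of standard output.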

by **Alfumao** » 2011-01-11 15:17 @678

You are absolutely right. I made the substitution incorrectly in my program and was selecting everything as one block. I will fix it and let you know.

Thanks again.

Regular expression to identify a line