Forum - Perl en Español

by **dandy** » 2014-03-18 11:11 @507

Hi, I am trying to process several files that I downloaded from PubChem. Here is an example of what the files look like:

Using xml Syntax Highlighting

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20060131//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060131/esummary-v1.dtd">
<eSummaryResult>
<DocSum>
        <Id>1811</Id>
        <Item Name="AssayName" Type="String">Experimentally measured binding affinity data derived from PDB</Item>
        <Item Name="AssayDescription" Type="String">This data entry provides a collection of experimentally measured binding affinity data (Kd, Ki, and IC50), which are exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB). All of the binding affinity data compiled in this data entry are cited from original references. This work is contributed by the PDBbind database.</Item>
        <Item Name="ReadoutCount" Type="Integer">0</Item>
        <Item Name="SourceNameList" Type="List">
                <Item Name="string" Type="String">Shanghai Institute of Organic Chemistry</Item>
        </Item>
        <Item Name="ActiveSidCount" Type="Integer">3073</Item>
        <Item Name="ActivityOutcomeMethod" Type="String">other</Item>
        <Item Name="InactiveSidCount" Type="Integer">0</Item>
        <Item Name="InconclusiveSidCount" Type="Integer">0</Item>
        <Item Name="TotalSidCount" Type="Integer">3073</Item>
        <Item Name="XRefDburlList" Type="List">
                <Item Name="string" Type="String">http://www.sioc.ac.cn/esioc/1.htm</Item>
        </Item>
        <Item Name="XRefAsurlList" Type="List">
                <Item Name="string" Type="String">http://www.pdbbind.org.cn</Item>
                <Item Name="string" Type="String">http://www.pdbbind.org.cn</Item>
        </Item>
        <Item Name="ModifyDate" Type="Date">2010/07/01 00:00</Item>
        <Item Name="DepositDate" Type="Date">2009/06/08 00:00</Item>
        <Item Name="HoldUntilDate" Type="Date">1/01/01 00:00</Item>
        <Item Name="AID" Type="Integer">1811</Item>
        <Item Name="TotalCidCount" Type="Integer">2438</Item>
        <Item Name="ActiveCidCount" Type="Integer">2438</Item>
        <Item Name="ProteinTargetList" Type="List">
                <Item Name="ProteinTarget" Type="Structure">
                        <Item Name="Name" Type="String">Chain 1, Crystal Structure Of Gamma-Chymotrypsin In Complex With 7- Hydroxycoumarin</Item>
                        <Item Name="GI" Type="Integer">17943055</Item>
<DocSum>
 
<DocSum>
        <Id>648328</Id>
        <Item Name="AssayName" Type="String">Cytotoxicity against human A549 cells at 10 to 100 uM after 48 hrs by MTT assay</Item>
        <Item Name="AssayDescription" Type="String">Title: New antitumor compounds from Carya cathayensis. Abstract: A new lignan (7R,8S,8'R)-4,4',9-trihydroxy-7,9'-epoxy-8,8'-lignan, and three new phenolics, carayensin-A, carayensin-B, and carayensin-C, together with 13 known compounds were isolated from the shells of Carya cathayensis. Their chemical structures were established mainly by 1D and 2D NMR techniques and mass spectrometry. All the compounds were evaluated for cytotoxicity against several human tumor types including human colorectal cancer cell lines (HCT-116, HT-29), human lung cancer cell line (A549), and human breast cancer cell line (MCF-7). The compounds 1, 5, 6, and 16 are considered to be potential as antitumor agents, which could significantly inhibit the cancer cell growth in a dose-dependent manner.</Item>
        <Item Name="ReadoutCount" Type="Integer">5</Item>
        <Item Name="SourceNameList" Type="List">
                <Item Name="string" Type="String">ChEMBL</Item>
        </Item>
        <Item Name="ActiveSidCount" Type="Integer">8</Item>
        <Item Name="ActivityOutcomeMethod" Type="String"></Item>
        <Item Name="InactiveSidCount" Type="Integer">0</Item>
        <Item Name="InconclusiveSidCount" Type="Integer">0</Item>
        <Item Name="TotalSidCount" Type="Integer">8</Item>
        <Item Name="XRefDburlList" Type="List">
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/</Item>
        </Item>
        <Item Name="XRefAsurlList" Type="List">
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/index.php/assay/inspect/805776</Item>
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/index.php/assay/inspect/805776</Item>
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/index.php/assay/inspect/805776</Item>
        </Item>
        <Item Name="ModifyDate" Type="Date">2013/07/07 00:00</Item>
        <Item Name="DepositDate" Type="Date">2012/09/09 00:00</Item>
        <Item Name="HoldUntilDate" Type="Date">1/01/01 00:00</Item>
        <Item Name="AID" Type="Integer">648328</Item>
        <Item Name="TotalCidCount" Type="Integer">8</Item>
        <Item Name="ActiveCidCount" Type="Integer">8</Item>
        <Item Name="ProteinTargetList" Type="List"></Item>
</DocSum>
 
<DocSum>
        <Id>399372</Id>
        <Item Name="AssayName" Type="String">Deterrent activity against Perknaster fuscus assessed as induction of sustained retractions of tube-feet by sea-star deterrent assay</Item>
        <Item Name="AssayDescription" Type="String">Title: Purine and nucleoside metabolites from the Antarctic sponge Isodictya erinacea. Abstract: The bright yellow sponge Isodictya erinacea is one of several chemically defended sponges found on the benthos of McMurdo Sound, Antarctica. An investigation of the metabolites from this sponge has resulted in the isolation of purine and nucleoside metabolites, including the previously unreported erinacean (1) and p-hydroxybenzaldehyde. The latter metabolite has been demonstrated to cause a feeding deterrence behavior in Perknaster fuscus, the major predator of antarctic sponges.</Item>
        <Item Name="ReadoutCount" Type="Integer">5</Item>
        <Item Name="SourceNameList" Type="List">
                <Item Name="string" Type="String">ChEMBL</Item>
        </Item>
        <Item Name="ActiveSidCount" Type="Integer">1</Item>
        <Item Name="ActivityOutcomeMethod" Type="String"></Item>
        <Item Name="InactiveSidCount" Type="Integer">6</Item>
        <Item Name="InconclusiveSidCount" Type="Integer">0</Item>
        <Item Name="TotalSidCount" Type="Integer">7</Item>
        <Item Name="XRefDburlList" Type="List">
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/</Item>
        </Item>
        <Item Name="XRefAsurlList" Type="List">
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/index.php/assay/inspect/547401</Item>
                <Item Name="string" Type="String">http://www.ebi.ac.uk/chembldb/index.php/assay/inspect/547401</Item>
        </Item>
        <Item Name="ModifyDate" Type="Date">2013/05/08 00:00</Item>
        <Item Name="DepositDate" Type="Date">2010/05/26 00:00</Item>
        <Item Name="HoldUntilDate" Type="Date">1/01/01 00:00</Item>
        <Item Name="AID" Type="Integer">399372</Item>
        <Item Name="TotalCidCount" Type="Integer">7</Item>
        <Item Name="ActiveCidCount" Type="Integer">1</Item>
        <Item Name="ProteinTargetList" Type="List"></Item>
</DocSum>
 
</eSummaryResult>

Everything between <DocSum> and </DocSum> corresponds to one entry in the database. I am interested in extracting the lines that contain:

<Item Name="GI" Type="Integer">

That part is simple. My problem is that, before collecting those lines, I want to skip every entry that contains the following line:

<Item Name="AssayName" Type="String">Experimentally measured binding affinity data derived from PDB</Item>

To do that, I wrote the following code:

Using perl Syntax Highlighting

#! /usr/bin/perl
 
my $file=$ARGV[0];
 
open FILE, $file;
 
my @array=<FILE>;
chomp @array;
close FILE;
my $skip_pdb='<Item Name="AssayName" Type="String">Experimentally measured binding affinity data derived from PDB</Item>';
my $gi_id='<Item Name="GI" Type="Integer">';
 
for ($i=0;$i<scalar(@array); $i++){
 
      if ($array[$i]=~ '<DocSum>'){
     
      my $j=$i+2;
      my $k=$i+29; 
          if ($array[$j]=~ $skip_pdb){
          
          next;
          }
          
          
          print "$array[$k]\n";
          
          
      }
}

The code works: it extracts the desired lines (the ones at index $k). The problem appears when I try to use it on other files, because the structure is not fixed; that is, the desired line is not always at "$i + 29". Any suggestions?

Regards, and thanks!

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0];
open my $fh, '<', $file or die "Cannot open $file: $!";

while (<$fh>) {                                 # for every line in the file
    if (/<Item Name="GI" Type="Integer">/) {    # if it contains what we are looking for
        print;                                  # print the whole line
    }
}

close $fh;

We can make one improvement: print only the numeric value found on the line.

Using perl Syntax Highlighting

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0];
open my $fh, '<', $file or die "Cannot open $file: $!";

while (<$fh>) {                                         # for every line in the file
    if (/<Item Name="GI" Type="Integer">\s*(\d+)/) {    # if it contains what we are looking for
        print "$1\n";                                   # print the numeric value captured by the parentheses
    }
}

close $fh;

By the way, regular-expression patterns are usually enclosed in delimiters (such as the '/' I used here).
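As a small illustration of that remark, here is a minimal sketch comparing a quoted string used as a pattern (as in the `$array[$i] =~ '<DocSum>'` test from the question) with an explicitly delimited pattern. The sub name `extract_gi` and the sample line are made up for this example; they are not part of anyone's script in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line copied from the XML shown earlier in the thread.
my $line = '        <Item Name="GI" Type="Integer">17943055</Item>';

# Hypothetical helper: returns the GI number found in a line,
# or undef when the line does not contain one.
sub extract_gi {
    my ($text) = @_;
    # m{...} avoids having to escape the '/' that appears in closing tags.
    return $text =~ m{<Item Name="GI" Type="Integer">\s*(\d+)</Item>} ? $1 : undef;
}

# A plain string on the right-hand side of =~ is also treated as a pattern,
# which is why the test in the question works without delimiters.
my $as_string = $line =~ '<Item Name="GI"' ? 1 : 0;

print extract_gi($line), "\n";   # prints 17943055
print "$as_string\n";            # prints 1
```

Delimiters such as `m{...}` make it obvious that the text is a pattern, and they are convenient for tags like `</DocSum>` that contain a slash.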

by **dandy** » 2014-03-18 12:00 @542

Hi, thanks for your reply, but as I mentioned, extracting only the lines with <Item Name="GI" Type="Integer"> is easy, just as your code shows.

The catch is that I do not want to extract all of them.

I want to skip the <Item Name="GI" Type="Integer"> lines that sit between the two <DocSum> tags (opening and closing) of the entry that contains <Item Name="AssayName" Type="String">Experimentally measured binding affinity data derived from PDB</Item>.

In other words, all the <Item Name="GI" Type="Integer"> lines associated with the line Type="String">Experimentally measured binding affinity data derived from PDB</Item> form a single group (delimited by the two <DocSum> tags), and I do not want to include them (this holds for every file).

That is what my code tries to do (lines 15 to 22): skip the "Experimentally measured binding affinity data derived from PDB" group together with all of its associated Item Name="GI" lines, and print only the ones from the other groups (line 18 of the code).

The problem is that in other files the Item Name="GI" line I want is not always at "$i + 29". I hope I have not confused you.

The part of the file shown below is the group I want to skip.

Using xml Syntax Highlighting

<DocSum>
        <Id>1811</Id>
        <Item Name="AssayName" Type="String">Experimentally measured binding affinity data derived from PDB</Item>
        <Item Name="AssayDescription" Type="String">This data entry provides a collection of experimentally measured binding affinity data (Kd, Ki, and IC50), which are exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB). All of the binding affinity data compiled in this data entry are cited from original references. This work is contributed by the PDBbind database.</Item>
        <Item Name="ReadoutCount" Type="Integer">0</Item>
        <Item Name="SourceNameList" Type="List">
                <Item Name="string" Type="String">Shanghai Institute of Organic Chemistry</Item>
        </Item>
        <Item Name="ActiveSidCount" Type="Integer">3073</Item>
        <Item Name="ActivityOutcomeMethod" Type="String">other</Item>
        <Item Name="InactiveSidCount" Type="Integer">0</Item>
        <Item Name="InconclusiveSidCount" Type="Integer">0</Item>
        <Item Name="TotalSidCount" Type="Integer">3073</Item>
        <Item Name="XRefDburlList" Type="List">
                <Item Name="string" Type="String">http://www.sioc.ac.cn/esioc/1.htm</Item>
        </Item>
        <Item Name="XRefAsurlList" Type="List">
                <Item Name="string" Type="String">http://www.pdbbind.org.cn</Item>
                <Item Name="string" Type="String">http://www.pdbbind.org.cn</Item>
        </Item>
        <Item Name="ModifyDate" Type="Date">2010/07/01 00:00</Item>
        <Item Name="DepositDate" Type="Date">2009/06/08 00:00</Item>
        <Item Name="HoldUntilDate" Type="Date">1/01/01 00:00</Item>
        <Item Name="AID" Type="Integer">1811</Item>
        <Item Name="TotalCidCount" Type="Integer">2438</Item>
        <Item Name="ActiveCidCount" Type="Integer">2438</Item>
        <Item Name="ProteinTargetList" Type="List">
                <Item Name="ProteinTarget" Type="Structure">
                        <Item Name="Name" Type="String">Chain 1, Crystal Structure Of Gamma-Chymotrypsin In Complex With 7- Hydroxycoumarin</Item>
                        <Item Name="GI" Type="Integer">17943055</Item>
<DocSum>

by **explorer** » 2014-03-18 14:45 @656

OK, understood. I also see that the tags are not balanced: line 34 of the first XML listing is another opening <DocSum> tag.

The following program gives you what you want:

Using perl Syntax Highlighting

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0];
open my $fh, '<', $file or die "Cannot open $file: $!";

my $dentro_de_DocSum = 0;                               # flag that records whether we are in a wanted block

while (<$fh>) {                                         # for every line in the file

    if (my $pos = m{<DocSum>} ... m{</?DocSum>}) {      # we are inside <DocSum>

        if ($pos == 1) {                                # if this is the first line of the block
            $dentro_de_DocSum = 1;                      # then we are inside a block

        }

        if (m{Experimentally measured binding affinity data derived from PDB}) {        # but we run into this
            $dentro_de_DocSum = 0;                                                      # so this block is of no interest
        }

        if ($dentro_de_DocSum and m{<Item Name="GI" Type="Integer">\s*(\d+)}) {         # if we find a value
            print "$1\n";                                                               # print it
        }
    }
}

close $fh;

It uses the range operator (...), with a "trick": if line 34 of the first XML listing is correct (that is, there are several <DocSum> elements nested one inside another), then we tell the range operator that finding another opening <DocSum> tag also counts as an end marker.

If, on the other hand, the tag on line 34 is really a closing one (</DocSum>), then it is best to remove the '?' from the second regular expression of the range operator.
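To make the role of `$pos` in that program more concrete, here is a minimal sketch of what the range operator returns in scalar context. The sample lines are invented for the illustration, and the strict closing pattern (without the '?') is used, which assumes every block ends with </DocSum>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines: one well-formed block with text before and after it.
my @lines = ('before', '<DocSum>', '<Id>1</Id>', '<Id>2</Id>', '</DocSum>', 'after');

my @seq;
for (@lines) {
    # In scalar context, ... is the flip-flop operator; both patterns match against $_.
    my $pos = m{<DocSum>} ... m{</DocSum>};
    push @seq, $pos eq '' ? '-' : $pos;   # '-' marks lines outside the block
}

print join(' ', @seq), "\n";   # prints: - 1 2 3 4E0 -
```

Outside a block the operator returns the empty string; inside it returns a sequence number that restarts at 1 for each block, which is why the test `$pos == 1` identifies the opening line. The last line of a block gets "E0" appended, and since "4E0" is still numerically 4, the numeric comparison stays quiet under `use warnings`.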

by **dandy** » 2014-03-18 15:31 @688

It works perfectly!

Thanks.

Processing a PubChem file