Thursday, May 5, 2011

How can I get multiple memories from a Perl regex match?

The purpose of the regex search is to find all template class instances in C++ header files. The class instances can be formatted like this:

CMyClass<int> myClassInstance;

CMyClass2<
int,
int
> myClass2Instacen;

The search is performed by loading the entire file into a string:

open(FILE, $file);
$string = join('',<FILE>);
close(FILE);

And the following regex is used to find the class instances, even if a class instance spans more than one line in the string:

$search_string = "\s*\w[^typename].*<(\s*\w\s*,?\n?)*)>\s*\w+.*";
$string =~ m/$search_string/;

The problem is that the search returns one hit only even though more class instances exist in the files.

Is it possible to get all hits with this approach, for example from one of the regex backreference variables?

From stackoverflow
  • What you require is the \G assertion, used together with the /g modifier. \G anchors the next match at the position in your string where the last match ended.

    Here is the documentation from Perl Doc (SO is having trouble with the link, so you'll have to copy and paste):

    http://perldoc.perl.org/perlfaq6.html#What-good-is-'%5cG'-in-a-regular-expression%3f

    Chas. Owens : Direct link to section referred to: http://perldoc.perl.org/perlfaq6.html#What-good-is-%27\G%27-in-a-regular-expression%3f
    Gavin Miller : Thanks Chas :)
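
To make the \G idea concrete, here is a minimal sketch. The sample string and the helper name leading_numbers are made up for this illustration and do not come from the question. In scalar context each successful /g match remembers where it ended, and \G pins the next attempt to exactly that position:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the run of space-separated integers at the
# start of a string, stopping at the first token that is not an integer.
sub leading_numbers {
    my ($input) = @_;
    my @numbers;

    # \G anchors each attempt at the end of the previous match, so the
    # loop cannot skip over a non-matching token to find digits behind it.
    # /c keeps the match position intact when the match finally fails.
    while ( $input =~ /\G\s*(\d+)/gc ) {
        push @numbers, $1;
    }
    return @numbers;
}

print join( ',', leading_numbers('1 2 3 x 4') ), "\n";
```

Without \G the same loop would also pick up the trailing 4, because a plain /g match is free to search forward from the last position.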
  • First, if you are going to slurp files, you should use File::Slurp. Then you can do:

    my $contents = read_file $file;
    

    read_file will croak on error.

    Second, [^typename] is a negated character class: it matches a single character that is not one of t, y, p, e, n, a or m. So it does not exclude the string 'typename'; it rejects any of those letters at that position. Other than that, it is not obvious to me that the pattern you use will consistently match the things you want it to match, but I can't comment on that right now.
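
A small sketch of the difference, assuming the intent was "skip the keyword typename". The helper names are hypothetical, and the negative lookahead is one possible replacement, not something taken from the question:

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this sketch.

# [^typename] is a character class: it matches ONE character that is
# none of t, y, p, e, n, a, m.
sub first_char_outside_class {
    my ($word) = @_;
    return $word =~ /^[^typename]/ ? 1 : 0;
}

# A negative lookahead rejects the whole word "typename" and nothing else.
sub is_not_typename {
    my ($word) = @_;
    return $word =~ /^(?!typename\b)\w+/ ? 1 : 0;
}

for my $word (qw(typename template map CMyClass)) {
    printf "%-9s class=%d lookahead=%d\n",
        $word, first_char_outside_class($word), is_not_typename($word);
}
```

Of the four sample words, only CMyClass gets past the character class, because the others start with one of the listed letters. The lookahead version rejects typename alone.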

    Finally, to get all the matches in the file one by one, use the g modifier in a loop:

    my $source = '3 5 7';
    
    while ( $source =~ /([0-9])/g ) {
        print "$1\n";
    }
    

    Now that I have had a chance to look at your pattern, I am still not sure of what to make of [^typename], but here is an example program that captures the part between the angle brackets (as that seems to be the only thing you are capturing above):

    use strict;
    use warnings;
    
    use File::Slurp;
    
    my $pattern = qr{
        ^
        \w+                    
        <\s*((?:\w+(?:,\s*)?)+)\s*> 
        \s*
        \w+\s*;
    }mx;
    
    my $source = read_file \*DATA;
    
    while ( $source =~ /$pattern/g ) {
        my $match = $1;
        $match =~ s/\s+/ /g;
        print "$match\n";
    }
    
    __DATA__
    CMyClass<int> myClassInstance;
    
    CMyClass2<
    int,
    int
    > myClass2Instacen;
    
    C:\Temp> t.pl
    int
    int, int
    

    Now, I suspect you would prefer the following, however:

    my $pattern = qr{
        ^
        (
          \w+                    
          <\s*(?:\w+(?:,\s*)?)+\s*> 
          \s*
          \w+
        )
        \s*;
    }mx;
    

    which yields:

    C:\Temp> t.pl
    CMyClass<int> myClassInstance
    CMyClass2< int, int > myClass2Instacen
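
If more than one "memory" per hit is wanted, for example the class name and the instance name separately, a match in list context with /g returns every capture group of every match as one flat list. The following is only a sketch under simplifying assumptions: the helper name declaration_pairs and the short instance names are invented here, and the pattern does not handle nested template arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: return [class, instance] pairs for simple
# declarations such as "CMyClass<int> myClassInstance;".
sub declaration_pairs {
    my ($source) = @_;

    # In list context with /g, each match contributes all of its capture
    # groups, so the result is flat: class, instance, class, instance, ...
    # [^<>]* also matches newlines, which covers multi-line declarations.
    my @flat = $source =~ /^(\w+)<[^<>]*>\s*(\w+)\s*;/mg;

    # Regroup the flat list two elements at a time.
    my @pairs;
    while ( my ( $class, $instance ) = splice @flat, 0, 2 ) {
        push @pairs, [ $class, $instance ];
    }
    return @pairs;
}

my $sample = "CMyClass<int> a;\nCMyClass2<\nint,\nint\n> b;\n";
print "$_->[0] => $_->[1]\n" for declaration_pairs($sample);
```

The while loop form shown earlier works just as well here: inside the loop, $1 and $2 hold the two captures of the current match.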
    
  • I'd do something like this,

    
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    
    local(*F);
    open(F,$ARGV[0]);
    my $text = do{local($/); <F>};
    my (@hits) = $text =~ m/([a-z]{3})/gsi;
    
    print "@hits\n";
    
    assuming you've got some text file like,
    /home/user$ more a.txt
    a bb dkl jidij lksj lai suj ldifk kjdfkj bb
    bb kdjfkal idjksdj fbb kjd fkjd fbb  kadfjl bbb
    bb bb bbd i
    

    this will print out all the hits from the regex:

    
    /home/user$ ./a.pl a.txt
    dkl jid lks lai suj ldi kjd fkj kdj fka idj ksd fbb kjd fkj fbb kad fjl bbb bbd
    



    and a specific solution for your problem, using the same approach, might look like,

    
    #!/usr/bin/perl -w                                                                                                           
    use strict;
    use warnings;
    
    my $text = <<ENDTEXT;
     CMyClass<int> myClassInstance;
    
    CMyClass2<
    int,
    int
    > myClass2Instacen;
    
    
    CMyClass35<
    int,
    int
        > myClass35Instacen;
    
    ENDTEXT
    
    my $basename = "MyClass";
    my (@instances) = $text =~ m/\s*(${basename}[0-9]*\s*\<.*?                                                                
                                (?=\>\s*${basename})                                                                          
                                \>\s*${basename}.*?;)/xgsi;
    
    for(my $i=0; $i<@instances; $i++){
        print $i."\t".$instances[$i]."\n\n";
    }
    
    

    of course you'll probably need to tweak the regex a bit more to fit all the edge cases in your data but that should be a pretty good start.

    Alexandr Ciornii : open my $fh, $ARGV[0] is better than local(*F); open(F,$ARGV[0]); use Perl::Critic on your examples.
    blackkettle : i tried Perl::Critic on my examples (bit of a hassle to install) but it doesn't give any comments/warnings/errors for my example. also, i noted that the pre and code block are not properly escaping my left-right angle brackets...
