Posted on 2004-4-23 18:06:51
I'm not trying to pick a fight here either. The text below is excerpted from the Perl Cookbook; once you've read it you'll see why I said what I said.
20.6. Extracting or Removing HTML Tags
Problem
You want to remove HTML tags from a string, leaving just plain text.
Solution
The following oft-cited solution is simple but wrong on all but the most trivial HTML:
[PHP]
($plain_text = $html_text) =~ s/<[^>]*>//gs; #WRONG
[/PHP]
A correct but slower and slightly more complicated way is to use the CPAN modules:
[PHP]
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
[/PHP]
Discussion
As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that's simple enough that a trivial command line call will work:
[PHP]
% perl -pe 's/<[^>]*>//g' file
[/PHP]
However, this will break on files whose tags cross line boundaries, like this:
- <IMG SRC = "foo.gif"
- ALT = "Flurp!">
So, you'll see people doing this instead:
[PHP]
% perl -0777 -pe 's/<[^>]*>//gs' file
[/PHP]
or its scripted equivalent:
[PHP]
{
local $/; # temporary whole-file input mode
$html = <FILE>;
$html =~ s/<[^>]*>//gs;
}
[/PHP]
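To see the difference between the two invocations, here is a minimal sketch that applies the same substitution once per line and once to the whole string. The sample input and the variable names are made up for illustration. Note that the /s modifier only changes what . matches, so it has no effect on [^>], which already matches newlines; slurping the input is what makes the difference.

```perl
use strict;
use warnings;

# Hypothetical sample input: a single tag split across two lines.
my $html = qq{<IMG SRC = "foo.gif"\n     ALT = "Flurp!">text\n};

# Line-at-a-time, as with plain `perl -pe`: no single line holds a complete
# tag, so the substitution never matches.
my $by_line = join '',
    map { (my $line = $_) =~ s/<[^>]*>//g; $line } split /^/, $html;

# Whole string at once, as with `perl -0777` or `local $/`.
(my $slurped = $html) =~ s/<[^>]*>//g;

print "by line: $by_line";
print "slurped: $slurped";
```

In line mode the input comes back unchanged, while the slurped version is reduced to the trailing text.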
But even that isn't good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):
- <IMG SRC = "foo.gif" ALT = "A > B">
- <!-- <A comment> -->
- <script>if (a<b && a>c)</script>
- <# Just data #>
- <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
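A quick way to check these failures is to run the naive substitution over a few of the samples. The following is only a sketch; strip_tags_naively is a hypothetical helper name, not something from the recipe.

```perl
use strict;
use warnings;

# The naive substitution from the Solution, wrapped in a hypothetical helper.
sub strip_tags_naively {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//gs;
    return $text;
}

# Three of the problem cases listed above.
my @samples = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',
    '<!-- <A comment> -->',
    '<script>if (a<b && a>c)</script>',
);

for my $html (@samples) {
    printf "%-40s => [%s]\n", $html, strip_tags_naively($html);
}
```

Each case goes wrong in its own way: the quoted > ends the IMG tag early and leaves ` B">` behind, the comment leaves a stray ` -->`, and the script body loses `<b && a>` because it looks like a tag, so only `if (ac)` remains.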
If HTML comments include other tags, those solutions would also break on text like this:
- <!-- This section commented out.
- <B>You can't see me!</B>
- -->
The only solution that works well here is to use the HTML parsing routines from CPAN. The second code snippet shown above in the Solution demonstrates this better technique.
For more flexible parsing, subclass the HTML::Parser class and only record the text elements you see:
[PHP]
package MyParser;
use HTML::Parser;
use HTML::Entities qw(decode_entities);
@ISA = qw(HTML::Parser);
sub text {
my($self, $text) = @_;
print decode_entities($text);
}
package main;
MyParser->new->parse_file(*F);
[/PHP]
If you're only interested in simple tags that don't contain others nested inside, you can often make do with an approach like the following, which extracts the title from a non-tricky HTML document:
[PHP]
($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
[/PHP]
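As a rough illustration of where this pattern works and where it does not, the sketch below wraps it in a hypothetical helper, title_by_regex, and tries it on three made-up inputs: a plain title, a title tag carrying an attribute, and a document with a commented-out title.

```perl
use strict;
use warnings;

# The title pattern from above, wrapped in a hypothetical helper.
sub title_by_regex {
    my ($html) = @_;
    my ($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
    return $title;    # undef when the pattern does not match
}

# Made-up inputs for illustration.
my @docs = (
    "<html><head><title>\n  Home Page\n</title></head></html>",
    '<title lang="en">Home</title>',
    '<!-- <title>Old</title> --><title>New</title>',
);

for my $doc (@docs) {
    my $title = title_by_regex($doc);
    print defined $title ? "title: $title\n" : "no match\n";
}
```

The first input yields "Home Page" with the surrounding whitespace trimmed. The second does not match at all, because the pattern expects a bare <TITLE> with no attributes. The third returns "Old", the title inside the comment, which is the kind of mistake a real parser avoids.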
Again, the regex approach has its flaws, so a more complete solution using LWP to process the HTML is shown in Example 20.4.
Example 20.4: htitle
[PHP]
#!/usr/bin/perl
# htitle - get html title from URL
die "usage: $0 url ...\n" unless @ARGV;
require LWP;
foreach $url (@ARGV) {
$ua = LWP::UserAgent->new();
$res = $ua->request(HTTP::Request->new(GET => $url));
print "$url: " if @ARGV > 1;
if ($res->is_success) {
print $res->title, "\n";
} else {
print $res->status_line, "\n";
}
}
[/PHP]
Here's an example of the output:
- % htitle http://www.ora.com
- www.oreilly.com -- Welcome to O'Reilly & Associates!
- % htitle http://www.perl.com/ http://www.perl.com/nullvoid
- http://www.perl.com/: The www.perl.com Home Page
- http://www.perl.com/nullvoid: 404 File Not Found
See Also
The documentation for the CPAN modules HTML::TreeBuilder, HTML::Parser, HTML::Entities, and LWP::UserAgent; Recipe 20.5