Extracting data

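redspider · posted on 2004-4-22 18:57:58

I want to extract data from one file and write it to another file.
For example, given a table in HTML format, I need to pull out the data
inside the table and write it to another file. In other words, search each
line for the content between > and </td> and write it to another file.
How should I go about this?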

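libinary · posted on 2004-4-22 21:31:14

The extraction itself is easy; the troublesome part is the file format. For example, is each <td>...</td> in your source file on a single line? And what format should the target file have? Surely you don't want everything run together.
If you want something more general and more capable, it's best to use an existing HTML or XML parsing library.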

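redspider · posted on 2004-4-22 22:23:22

The <td>...</td> pairs in the source file are not on a single line.
It is a table with several columns and several rows.
I want to extract the data in every cell.
The target file is plain text.
Could a moderator give me a pointer or two? Thanks.

Inserting separators into the generated output is not the hard part.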

BBDD · posted on 2004-4-22 23:01:54

The laziest way:
lynx -dump

redspider · posted on 2004-4-23 12:29:47

That's a good approach,

but how would I do it in Perl?

BBDD · posted on 2004-4-23 16:02:29

If it has to be done in Perl, then like this:

$result = `lynx -dump $html_file`;

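Heh.
This problem is harder than you think. CPAN has a module called HTML::FormatText that converts HTML to plain text, and another called HTML::TableContentParser that is specifically for tables.
You can give them a try, but the results will most likely not be as good as `lynx -dump`.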

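redspider · posted on 2004-4-23 17:48:31

But what if it's not just HTML files, and I need to extract data from files in other formats?
Heh, I'm not trying to be contrary, I'm just learning Perl.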

BBDD · posted on 2004-4-23 18:06:51

I'm not trying to be contrary either. The following is excerpted from the Perl Cookbook; once you've read it you'll see why I said that.

20.6. Extracting or Removing HTML Tags

Problem

You want to remove HTML tags from a string, leaving just plain text.

Solution

The following oft-cited solution is simple but wrong on all but the most trivial HTML:
[PHP]
($plain_text = $html_text) =~ s/<[^>]*>//gs; #WRONG
[/PHP]
A correct but slower and slightly more complicated way is to use the CPAN modules:

[PHP]
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
[/PHP]

Discussion

As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that's simple enough that a trivial command line call will work:

[PHP]
% perl -pe 's/<[^>]*>//g' file
[/PHP]
However, this will break on files whose tags cross line boundaries, like this:

<IMG SRC = "foo.gif"
ALT = "Flurp!">

So, you'll see people doing this instead:
[PHP]
% perl -0777 -pe 's/<[^>]*>//gs' file
[/PHP]
or its scripted equivalent:
[PHP]
{
    local $/;          # temporary whole-file input mode
    $html = <FILE>;
    $html =~ s/<[^>]*>//gs;
}
[/PHP]
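As a rough illustration of why whole-file mode matters, the sketch below applies the same substitution line by line and to the whole text at once, using the two-line IMG tag from above. The sub names `strip_by_line` and `strip_whole` are made up for this example, and an in-memory string stands in for the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip tags one line at a time, as `perl -pe` would.
sub strip_by_line {
    my ($html) = @_;
    my $out = '';
    for my $line (split /^/m, $html) {    # each piece keeps its newline
        (my $copy = $line) =~ s/<[^>]*>//g;
        $out .= $copy;
    }
    return $out;
}

# Hypothetical helper: strip tags from the whole text at once,
# as `perl -0777 -pe` would.
sub strip_whole {
    my ($html) = @_;
    (my $copy = $html) =~ s/<[^>]*>//g;
    return $copy;
}

my $html = qq{before <IMG SRC = "foo.gif"\nALT = "Flurp!"> after\n};

# The tag survives here: neither line contains a complete <...> pair.
print "line by line: ", strip_by_line($html);

# The tag is removed here: [^>] also matches the newline inside the tag.
print "whole text:   ", strip_whole($html);
```

Because `[^>]` is a negated character class, it matches newlines without any extra modifier; what the one-liner lacks is simply a view of both lines at the same time.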
But even that isn't good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):

<IMG SRC = "foo.gif" ALT = "A > B">
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
<B>You can't see me!</B>
-->

The only solution that works well here is to use the HTML parsing routines from CPAN. The second code snippet shown above in the Solution demonstrates this better technique.
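To see the failures concretely, here is a small sketch that runs the regex on three of the inputs listed above and prints whatever is left over. The helper name `strip_tags_naive` is made up for illustration; only core Perl is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the "oft-cited but wrong" substitution.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//gs;
    return $text;
}

my @cases = (
    # The ">" inside the quoted attribute ends the match too early.
    q{<IMG SRC = "foo.gif" ALT = "A > B">},
    # The "<" and ">" operators inside the script are mistaken for a tag.
    q{<script>if (a<b && a>c)</script>},
    # The first ">" inside the comment is taken as the end of the comment.
    qq{<!-- This section commented out.\n<B>You can't see me!</B>\n-->},
);

for my $html (@cases) {
    my $left = strip_tags_naive($html);
    $left =~ s/\n/\\n/g;    # show newlines visibly in the report
    print "input:    $html\n";
    print "leftover: [$left]\n\n";
}
```

In each case the leftover is non-empty garbage rather than clean text, which is the point the excerpt is making.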

For more flexible parsing, subclass the HTML::Parser class and only record the text elements you see:

[PHP]
package MyParser;
use HTML::Parser;
use HTML::Entities qw(decode_entities);

@ISA = qw(HTML::Parser);

sub text {
    my($self, $text) = @_;
    print decode_entities($text);
}

package main;
MyParser->new->parse_file(*F);
[/PHP]
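Applied to the table question that started this thread, the same idea might look like the sketch below. It is not from the Cookbook: it uses the handler-style interface of HTML::Parser version 3 instead of subclassing, the sub name `extract_cells` is made up, and it assumes a simple table with explicit closing </td> and </tr> tags and no nested tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical helper: returns a list of rows, each an array ref of cell texts.
sub extract_cells {
    my ($html) = @_;
    my (@rows, $row, $cell);

    my $p = HTML::Parser->new(
        api_version => 3,
        # Opening tags: start a new row or a new cell.
        start_h => [ sub {
            my ($tag) = @_;
            if    ($tag eq 'tr')                 { $row  = [] }
            elsif ($tag eq 'td' || $tag eq 'th') { $cell = '' }
        }, 'tagname' ],
        # Text: collect it only while inside a cell ("dtext" decodes entities).
        text_h => [ sub {
            my ($text) = @_;
            $cell .= $text if defined $cell;
        }, 'dtext' ],
        # Closing tags: finish the cell or the row.
        end_h => [ sub {
            my ($tag) = @_;
            if (($tag eq 'td' || $tag eq 'th') && defined $cell) {
                $cell =~ s/\s+/ /g;     # collapse whitespace, including newlines
                $cell =~ s/^ | $//g;    # trim both ends
                push @$row, $cell if $row;
                undef $cell;
            }
            elsif ($tag eq 'tr' && $row) {
                push @rows, $row;
                undef $row;
            }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @rows;
}

my $html = <<'HTML';
<table>
<tr><td>apple</td><td>
  1 &amp; 2
</td></tr>
<tr><td>pear</td><td>3</td></tr>
</table>
HTML

# One line per table row, cells separated by tabs.
print join("\t", @$_), "\n" for extract_cells($html);
```

To write the result to another file rather than to the screen, open an output filehandle and print to it instead; the tab separator is an arbitrary choice.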

If you're only interested in simple tags that don't contain others nested inside, you can often make do with an approach like the following, which extracts the title from a non-tricky HTML document:

[PHP]
($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
[/PHP]
Again, the regex approach has its flaws, so a more complete solution using LWP to process the HTML is shown in Example 20.4.

Example 20.4: htitle
[PHP]
#!/usr/bin/perl
# htitle - get html title from URL

die "usage: $0 url ...\n" unless @ARGV;
require LWP;

foreach $url (@ARGV) {
    $ua = LWP::UserAgent->new();
    $res = $ua->request(HTTP::Request->new(GET => $url));
    print "$url: " if @ARGV > 1;
    if ($res->is_success) {
        print $res->title, "\n";
    } else {
        print $res->status_line, "\n";
    }
}
[/PHP]

Here's an example of the output:

% htitle http://www.ora.com
www.oreilly.com -- Welcome to O'Reilly & Associates!
% htitle http://www.perl.com/ http://www.perl.com/nullvoid
http://www.perl.com/: The www.perl.com Home Page
http://www.perl.com/nullvoid: 404 File Not Found

See Also
The documentation for the CPAN modules HTML::TreeBuilder, HTML::Parser, HTML::Entities, and LWP::UserAgent; Recipe 20.5

KornLee · posted on 2004-4-23 18:15:10

Heh, Perl really is deep and unfathomable. Take your time learning it ~_~

redspider · posted on 2004-4-23 18:25:59

My head is starting to spin.
