Can a human do it?
farsidebag
far sidebag
farside bag
far side bag
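Not only do you need a dictionary, you will probably also need a statistical approach to work out which split is most likely (or, God forbid, an actual HMM for your human language of choice...).
For statistics that might be helpful here, I refer you to Dr. Peter Norvig, who solves a different but related problem, spell checking, in 21 lines of code: http://norvig.com/spell-correct.html
(He does cheat a bit by folding every for loop into a single line.)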
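To make the ambiguity concrete, here is a minimal sketch that enumerates every dictionary split of "farsidebag". The five-word dictionary and the `splits` function are hypothetical, chosen only to reproduce the example above; a real run would load a full word list.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Tiny hypothetical dictionary; a real one would come from a word list.
my %dict = map { $_ => 1 } qw(far side bag farside sidebag);

# Return every way to split $string into dictionary words,
# each split rendered as a space-joined string.
sub splits {
    my ($string) = @_;
    return ('') if $string eq '';    # empty remainder: one (empty) split

    my @results;
    for my $i ( 1 .. length($string) ) {
        my $prefix = substr( $string, 0, $i );
        next unless $dict{$prefix};

        # Combine this prefix with every split of the remainder.
        for my $rest ( splits( substr( $string, $i ) ) ) {
            push @results, $rest eq '' ? $prefix : "$prefix $rest";
        }
    }
    return @results;
}

print "$_\n" for splits('farsidebag');
```

With this dictionary the sketch finds three splits ("far side bag", "far sidebag" and "farside bag"), and nothing in the dictionary alone says which one is intended. That is the gap a frequency-based score is meant to fill.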
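Update: This got stuck in my head, so I had to get it out today. This code does a split similar to the one described by Robert Gamble, but then it orders the results by word frequency in the provided dictionary file (which is now expected to be some text representative of your domain, or of English in general; I used big.txt from Norvig, linked above, and appended a dictionary to it to cover missing words).
A combination of two words will usually beat a combination of three words, unless the difference in frequency is enormous.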
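The reason fewer words tend to win is that each extra word multiplies the score by another probability well below 1. A small worked sketch, using invented counts (the `%count` values, `$total` and the `joint_probability` name are illustrative assumptions, not figures from big.txt):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical counts, invented purely for illustration.
my %count = ( or => 5000, core => 50, co => 20, re => 30 );
my $total = 1_000_000;

# Joint probability under the (unrealistic) independence assumption:
# the product of count/total over all words in the sequence.
sub joint_probability {
    my @words = @_;
    my $p = 1;
    $p *= $count{$_} / $total for @words;
    return $p;
}

printf "%-10s %.3e\n", 'or core',  joint_probability(qw(or core));
printf "%-10s %.3e\n", 'or co re', joint_probability(qw(or co re));
```

With these numbers "or core" scores about 2.5e-7 while "or co re" scores about 3e-12. Even though "co" and "re" are only slightly rarer than "core" here, the third factor costs roughly five orders of magnitude, so the three-word split would need far more frequent words to come out on top.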
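I posted this code, with some minor changes, on my blog:
http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ and I also wrote a little about the underflow bug in this code. I was tempted to just quietly fix it, but figured this could help some people who haven't seen the log trick before: http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/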
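For readers who have not seen the log trick, here is a minimal sketch of the idea; it is not the code from the blog post. Instead of multiplying many small probabilities, which can underflow to 0 in floating point, it adds their logarithms. The function name `log_word_seq_score` and the demo counts are assumptions; `$dict` and `$total` play the roles of `%DICT` and `$TOTAL` in the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Score a word sequence as a sum of log-probabilities instead of a
# product of probabilities. $dict maps word => count; $total is the
# total number of tokens counted.
sub log_word_seq_score {
    my ( $dict, $total, @words ) = @_;
    my $score = 0;
    foreach my $word (@words) {
        # An unseen word has probability 0, i.e. a log of -infinity.
        my $count = $dict->{$word} or return -9**9**9;
        $score += log( $count / $total );
    }
    return $score;
}

# Demonstration with invented counts: a long sequence of rare words.
my %dict  = ( rare => 1 );
my $total = 1_000_000;
my @words = ('rare') x 60;

# The plain product is (1e-6)**60, far below the smallest double.
my $product = 1;
$product *= $dict{$_} / $total for @words;

print "product: $product\n";
print "log sum: ", log_word_seq_score( \%dict, $total, @words ), "\n";
```

Because the logarithm is monotonic, sorting candidate splits by the log sum gives the same order as sorting by the product, for as long as the product is still representable. In the demo the product collapses to 0 while the log sum stays a usable finite number (about -829).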
Output on your words, plus a few of my own. Notice what happens with "orcore":
perl splitwords.pl big.txt words
answerveal: 2 possibilities
- answer veal
- answer ve al
wickedweather: 4 possibilities
- wicked weather
- wicked we at her
- wick ed weather
- wick ed we at her
liquidweather: 6 possibilities
- liquid weather
- liquid we at her
- li quid weather
- li quid we at her
- li qu id weather
- li qu id we at her
driveourtrucks: 1 possibilities
- drive our trucks
gocompact: 1 possibilities
- go compact
slimprojector: 2 possibilities
- slim projector
- slim project or
orcore: 3 possibilities
- or core
- or co re
- orc ore
Code:
#!/usr/bin/env perl
use strict;
use warnings;

# Forward declarations, so the prototypes apply to calls made further down.
sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our( %DICT, $TOTAL );    # token => count, and total number of tokens read

{
    my( $dict_file, $word_file ) = @ARGV;
    ($dict_file && $word_file) or die(Usage);

    {
        my $DICT;
        ($DICT, $TOTAL) = get_word_stats($dict_file);
        %DICT = %$DICT;
    }

    {
        open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

        foreach my $word (<$WORDS>) {
            chomp $word;
            my $arr = find_matches($word);

            local $_;
            # Schwartzian Transform: score each parse once, sort by score
            # (highest first), then strip the scores off again.
            my @sorted_arr =
                map  { $_->[0] }
                sort { $b->[1] <=> $a->[1] }
                map  {
                    [ $_, find_word_seq_score(@$_) ]
                }
                @$arr;

            print_results( $word, @sorted_arr );
        }

        close $WORDS;
    }
}

# Return every way to split $string into dictionary words:
# a list in list context, an array reference in scalar context.
sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    # The ($\@\@) prototype passes both arrays by reference.
    find_matches_rec( $string, @words, @found_parses );

    return @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    # Nothing left to split: the words collected so far form a complete parse.
    unless( $length ){
        push @$found_parses, $words_sofar;

        return @$found_parses if wantarray;
        return $found_parses;
    }

    # Prefixes start at length 2, so one-letter words ("a", "I") never match.
    foreach my $i ( 2..$length ){
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);

        if( exists $DICT{$prefix} ){
            my @words = ( @$words_sofar, $prefix );
            find_matches_rec( $suffix, @words, @$found_parses );
        }
    }

    return @$found_parses if wantarray;
    return $found_parses;
}

## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
## NOTE: multiplying many small probabilities can underflow to 0 for long
## sequences; this is the underflow bug mentioned above.
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}
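
# Read the dictionary text and count how often each token occurs.
# Returns ( \%counts, $total_tokens ).
sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/ = undef;    # slurp mode: read the whole file in one go
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
        # Splitting on \b also yields whitespace and punctuation tokens;
        # they are counted in $total too, and tokens stay case-sensitive.
        foreach ( split(/\b/, $_) ) {
            $dict{$_} += 1;
            $total++;
        }
    }

    close $DICT;

    return (\%dict, $total);
}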

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word, @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
        print ' - ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    # The trailing newline keeps die() from appending "at ... line N."
    return "Usage: $0 /path/to/dictionary /path/to/your_words\n";
}
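I have an array of about 1000 entries; examples are below:
I would like to be able to split these into their respective words, as:
I was hoping a regular expression would do the trick. But since there is no boundary to stop on, and no capitalization I could key on either, I am thinking some sort of reference to a dictionary might be necessary?
I suppose it could be done by hand, but why, when it can be done with code! =) But this has stumped me. Any ideas?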