agrep: return only the best match(es)

Question

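I am using R's agrep function, which returns a vector of matches. I need a function similar to agrep that returns only the best match, or all of the best matches if there is a tie. At the moment I do this by calling sdists() from the cba package on each element of the resulting vector, but this seems very redundant.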

Edit: here is the function I am currently using. Computing the distance twice seems redundant, so I would like to speed it up.

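    library(cba)

    Word <- 'test'
    words <- c('Teest', 'teeeest', 'New York City', 'yeast', 'text', 'Test')

    ClosestMatch <- function(string, StringVector) {
      # Candidates that agrep accepts as approximate matches
      matches <- agrep(string, StringVector, value = TRUE)
      # Distance from the query to each candidate, computed a second time by sdists
      distance <- sdists(string, matches, method = "ow", weight = c(1, 0, 2))
      matches <- data.frame(matches, distance = as.numeric(distance))
      # Keep only the rows at the minimum distance (ties are all kept)
      matches <- subset(matches, distance == min(distance))
      as.character(matches$matches)
    }

    ClosestMatch(Word, words)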

Alexander Sigachov · Accepted Answer

The RecordLinkage package had been removed from CRAN when this answer was written. Use stringdist instead.

    library(stringdist)

    ClosestMatch2 = function(string, stringVector) {
      # amatch returns the index of the closest element; maxDist = Inf means no cutoff
      stringVector[amatch(string, stringVector, maxDist = Inf)]
    }
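Note that amatch returns one position per query string, so a tie is not reported in full. If every tied match is wanted, as the question asks, a minimal sketch could look like the following. It assumes the stringdist package is installed; ClosestMatchTies is a hypothetical name, and method = "lv" (plain Levenshtein) is an assumed choice rather than anything the answer specifies.

```r
library(stringdist)

# Hypothetical helper: return every element of stringVector that lies at the
# minimum distance from string, so ties are all kept.
ClosestMatchTies <- function(string, stringVector, method = "lv") {
  d <- stringdist(string, stringVector, method = method)
  stringVector[d == min(d)]
}

words <- c('Teest', 'teeeest', 'New York City', 'yeast', 'text', 'Test')
ClosestMatchTies('test', words)
# 'text' and 'Test' tie at distance 1, so both are returned
```

The comparison is case-sensitive, which is why 'Test' is at distance 1 rather than 0. Lower-casing both arguments with tolower() before the call would change that.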

Ramnath · Answer

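The agrep function matches strings using Levenshtein distance. The RecordLinkage package has C functions for computing Levenshtein distance, and you can call them directly to speed up the computation. Here is a reworked ClosestMatch function that is about 10 times faster.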

    library(RecordLinkage)

    ClosestMatch2 = function(string, stringVector) {
      # levenshteinSim returns a similarity, so the best match has the largest value
      distance = levenshteinSim(string, stringVector)
      stringVector[distance == max(distance)]
    }
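If RecordLinkage is not available, the same idea can be sketched with base R's adist(), which computes a generalized Levenshtein distance and needs no extra package. This is an assumed alternative, not the code that was timed above: ClosestMatchBase is a hypothetical name, and it minimizes the raw edit distance instead of maximizing a length-normalized similarity, so the two can rank candidates differently when their lengths vary.

```r
# Hypothetical base-R variant: no extra packages required.
ClosestMatchBase <- function(string, stringVector, ignore.case = FALSE) {
  # adist returns a 1 x length(stringVector) matrix of edit distances
  d <- as.vector(adist(string, stringVector, ignore.case = ignore.case))
  # Keep every candidate at the minimum distance, so ties are all returned
  stringVector[d == min(d)]
}

words <- c('Teest', 'teeeest', 'New York City', 'yeast', 'text', 'Test')
ClosestMatchBase('test', words)                      # 'text' and 'Test' tie
ClosestMatchBase('test', words, ignore.case = TRUE)  # only 'Test', at distance 0
```

No timing claim is made for this version; it is meant only to show the min-distance selection in a single pass.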