How do I find the minimum distance between zip codes in R?

data-mining r distance

2022-03-01 18:34:46

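I have a data set that lists every zip code in the United States along with its type (standard, PO box, university, etc.). I want to replace each PO box and university zip code with the nearest standard zip code. I split the data set by state so that R does not have to do as many calculations. In theory, I want the standard zip codes in the first column, the zip codes that need to be replaced in the first row, and the distance between the two as the value at each intersection.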

For example,

                REP 1   REP 2   REP 3   REP 4
STD 1           0.215   0.152   0.025   0.124   
STD 2           0.365   0.410   0.074   0.234
STD 3           0.234   0.201   1.322   0.683
STD 4           0.543   0.282   0.483   0.094 
MINS            STD 1   STD 1   STD 1   STD 4

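where STD 1 is a standard zip code with its own latitude and longitude, and REP 1 is a zip code that needs to be replaced (a university/PO box zip code), also with its own longitude and latitude. I only have about 5 weeks of experience with R, so please bear with me if some things do not make sense to me right away. I tried to do this in Excel, but every attempt to compute all the distances crashed, because a sheet of nearly 10,000 columns by 40,000 rows involves too many calculations.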

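I feel like an apply() or mapply() function is needed here. I want to compute the distances with dist() or the geosphere package, using a formula that accounts for the curvature of the Earth (rather than plain Euclidean distance), to keep the results accurate and reproducible.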

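If there is anything else I can add here, let me know and I will upload it as soon as possible. Here is my R code for Alaska, the first state in alphabetical order by abbreviation.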

AK<-subset(db,STAABBRV.x=="AK")
AKPO<-subset(AK,ZipCodeType!="STANDARD",select=c("ZIP_CODE","ZipCodeType","Long","Lat"))
AKPO<-within(AKPO,{IS_PO=ifelse(ZipCodeType!="STANDARD",1,0)})
AKSTANDARD<-subset(AK,ZipCodeType=="STANDARD",select=c("ZIP_CODE","ZipCodeType","Long","Lat"))
AKSTANDARD<-within(AKSTANDARD,{IS_PO=ifelse(ZipCodeType!="STANDARD",1,0)})
table<-rbind(AKSTANDARD,AKPO)
table$ZipCodeType<-NULL
rm(AK,AKPO,AKSTANDARD)

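This sets up a table with the columns "ZIP_CODE", "Long", "Lat" and "IS_PO". "IS_PO" is a numeric indicator of whether a zip code is standard or PO box/university: 1 means a PO box/university zip code, 0 means a standard zip code. I did this because some functions require all the data in a data set to be of the same type (numeric).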

Below are some of my failed attempts at writing code to calculate the minimum distance.

 lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
  merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = T)[
    , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                      matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
    ][dist <= 5 # necessary distance threshold
      ][, dum := NULL]
}) %>% rbindlist -> zip_nearby_dt

DOESITWORK<-apply(db, 1, function(x) spDistsN1(matrix(x[3:4], nrow=1),
                                    x[5:6],
                                    longlat=TRUE)) 



mins<-apply(Lat,1,function(x)return(array(which.min(x))))
mins<-data.frame(row=names(mins),col=mins) 
Lat$mins<-apply(mins,1,FUN=function(x)return(paste(x["row"],colnames(Lat[as.numeric(x["col"])]),Lat[x["row"],as.numeric(x["col"])],sep="/")))

1 Answer

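I think I have read your question correctly, and it looks like you need a nearest-neighbor implementation. If you are not familiar with the concept, the Wikipedia article is here: https://en.wikipedia.org/wiki/Nearest_neighbor_search.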

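I went ahead and wrote a sample implementation that you can use as a guide. Note that this is a brute-force approach and will not scale to large data sets. Once you are comfortable with the material, I suggest looking at libraries such as RANN, which have "real" implementations.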
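As an aside on the RANN suggestion: k-d tree libraries generally search by Euclidean distance, so raw latitude/longitude should not be fed to them directly. One common workaround, sketched here under that assumption, is to map each point to a 3-D unit vector, where straight-line (chord) distance ranks neighbors in the same order as great-circle distance. The data frames and the helper toUnitSphere are hypothetical illustrations, and the RANN::nn2 call only runs if the package happens to be installed.

```r
# Convert latitude/longitude in degrees to 3-D unit vectors, so that Euclidean
# (chord) distance orders points the same way as great-circle distance.
toUnitSphere = function(lat, long) {
  latRad = lat * pi / 180
  longRad = long * pi / 180
  cbind(x = cos(latRad) * cos(longRad),
        y = cos(latRad) * sin(longRad),
        z = sin(latRad))
}

# Hypothetical toy coordinates (not taken from the question's data set)
standard = data.frame(zip = c("S1", "S2", "S3"),
                      lat = c(61.2, 64.8, 58.3),
                      long = c(-149.9, -147.7, -134.4),
                      stringsAsFactors = FALSE)
toReplace = data.frame(zip = c("R1", "R2"),
                       lat = c(61.6, 57.1),
                       long = c(-149.1, -135.3),
                       stringsAsFactors = FALSE)

stdXYZ = toUnitSphere(standard$lat, standard$long)
repXYZ = toUnitSphere(toReplace$lat, toReplace$long)

# Base-R reference: index of the nearest standard row for each row to replace
nearestBase = apply(repXYZ, 1, function(p) {
  which.min(colSums((t(stdXYZ) - p)^2))
})

# Optional: the same query through RANN's k-d tree, if the package is available
if (requireNamespace("RANN", quietly = TRUE)) {
  nearestTree = RANN::nn2(data = stdXYZ, query = repXYZ, k = 1)$nn.idx[, 1]
  stopifnot(all(nearestTree == nearestBase))
}

standard$zip[nearestBase]
```

The base-R part is still brute force; the point of the transformation is only that the resulting matrix is in a form a Euclidean nearest-neighbor library can work with.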

Read in some random test data and clean it up. For this test, let's assume we want to find the nearest US city for each location.

coord_data = read.csv("~/Downloads/SalesJan2009.csv", stringsAsFactors = F)
coord_data$id = c(1:nrow(coord_data))
coord_data$is_usa = ifelse(coord_data$Country == "United States", 1, 0)
coord_data = coord_data[ , c("id", "Latitude", "Longitude", "is_usa")]
names(coord_data) = tolower(names(coord_data))

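Define your distance function. Here we have geographic coordinates that are far apart, so Euclidean distance won't do. I am using the spherical law of cosines to compute the great-circle distance, but depending on your needs you should consider the haversine and Vincenty formulas. To learn more, start here: https://en.wikipedia.org/wiki/Great-circle_distance.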

greatCircleDistance = function(latAlpha, longAlpha, latBeta, longBeta, radius = 6371) {
  ## Function taken directly from Wikipedia
  ## Earth radius in km is default (6371)
  ## Long/Lats are in degrees so need helper function to convert to radians
  degreeToRadian = function(degree) (degree * pi / 180)
  deltaLong = degreeToRadian(longBeta) - degreeToRadian(longAlpha)
  sinLat = sin(degreeToRadian(latAlpha)) * sin(degreeToRadian(latBeta))
  cosLat = cos(degreeToRadian(latAlpha)) * cos(degreeToRadian(latBeta))
  ## acos is finicky with precision so we will assume if NA is thrown
  ## the argument was very close to 1 and therefore will return 0
  ## acos(1) == 0
  acosRaw = suppressWarnings(acos(sinLat + cosLat * cos(deltaLong)))
  acosSafe = ifelse(is.na(acosRaw), 0, acosRaw)
  acosSafe * radius
}

Distance between Basildon, UK and Parkville, USA:

greatCircleDistance(coord_data$latitude[1],
coord_data$longitude[1],
coord_data$latitude[2],
coord_data$longitude[2])

Returns [1] 6929.351 (km).

It matches Google's calculation, so we're good!
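For comparison with the law-of-cosines function above, here is a minimal sketch of the haversine formula mentioned earlier. It is usually preferred for short distances because it avoids taking acos of a value very close to 1. The name haversineDistance is a hypothetical helper introduced for illustration; it keeps the same argument order and the same default radius in km.

```r
haversineDistance = function(latAlpha, longAlpha, latBeta, longBeta, radius = 6371) {
  ## Inputs are in degrees, so convert differences and latitudes to radians
  degreeToRadian = function(degree) degree * pi / 180
  deltaLat = degreeToRadian(latBeta - latAlpha)
  deltaLong = degreeToRadian(longBeta - longAlpha)
  a = sin(deltaLat / 2)^2 +
    cos(degreeToRadian(latAlpha)) * cos(degreeToRadian(latBeta)) * sin(deltaLong / 2)^2
  ## pmin guards against a creeping just above 1 through rounding
  2 * radius * asin(sqrt(pmin(1, a)))
}

## A quarter of the equator: equals 6371 * pi / 2
haversineDistance(0, 0, 0, 90)
```

Like the original function, this one is vectorized, so it accepts whole columns of latitudes and longitudes at once.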

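Brute-force example: as you noticed with your Excel sheet, this will blow up quickly as the data set grows. There are more efficient ways to implement the search. One idea is to start from a spatial data structure and write an R-tree, but I will leave that to you.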

bruteForceNearestNeighbor = function(geoData) {
  makeCoordinate = function(idx) {
    c("id" = idx, "latitude" = geoData$latitude[idx], "longitude" = geoData$longitude[idx])
  }
  singleCoordMinDistance = function(coordinate, locations) {
    ## Candidates: US locations other than the point itself
    ## (the indicator column created above is is_usa)
    locationsUS = locations[locations$is_usa == 1 & locations$id != coordinate["id"], ]
    distances = mapply(greatCircleDistance,
                       latAlpha = coordinate["latitude"],
                       longAlpha = coordinate["longitude"],
                       latBeta = locationsUS$latitude,
                       longBeta = locationsUS$longitude)
    ## which.min returns a single index into locationsUS (the first one on ties),
    ## so the id must be looked up in locationsUS, not in the full data set
    locationsUS[which.min(distances), "id"]
  }
  nearestNeighbors = vector("numeric", nrow(geoData))
  for (i in seq_len(nrow(geoData))) {
    coord = makeCoordinate(i)
    nearestNeighbors[i] = singleCoordMinDistance(coord, geoData)
  }
  nearestNeighbors
}

coord_data$nearest_neighbor = bruteForceNearestNeighbor(coord_data)
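To connect this back to the table in the question, the sketch below builds the standard-by-replacement distance matrix in one vectorized step with outer() and then takes the minimum of each column. It is only an illustration under stated assumptions: the toy rows are made up, the column names follow the question (ZIP_CODE, Lat, Long, IS_PO), and haversineKm is a hypothetical helper. The matrix still holds one number per pair, so it would most likely need to be built one state at a time, as the question already does.

```r
## Haversine distance in km; vectorized, so it can be used inside outer()
haversineKm = function(lat1, long1, lat2, long2, radius = 6371) {
  toRad = pi / 180
  a = sin((lat2 - lat1) * toRad / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin((long2 - long1) * toRad / 2)^2
  2 * radius * asin(sqrt(pmin(1, a)))
}

## Hypothetical toy data using the question's column names
zips = data.frame(ZIP_CODE = c("S1", "S2", "S3", "R1", "R2"),
                  Lat = c(61.2, 64.8, 58.3, 61.6, 57.1),
                  Long = c(-149.9, -147.7, -134.4, -149.1, -135.3),
                  IS_PO = c(0, 0, 0, 1, 1),
                  stringsAsFactors = FALSE)

std = zips[zips$IS_PO == 0, ]
po = zips[zips$IS_PO == 1, ]

## Rows are standard zip codes, columns are zip codes to replace
distKm = outer(seq_len(nrow(std)), seq_len(nrow(po)),
               function(i, j) haversineKm(std$Lat[i], std$Long[i], po$Lat[j], po$Long[j]))
dimnames(distKm) = list(std$ZIP_CODE, po$ZIP_CODE)

## For each column, pick the row holding the smallest distance
po$NEAREST_STANDARD = std$ZIP_CODE[apply(distKm, 2, which.min)]
po[, c("ZIP_CODE", "NEAREST_STANDARD")]
```

The last line corresponds to the MINS row of the question's table: each zip code to replace is paired with its closest standard zip code.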
