Compute the Distance Matrix of a Set of Sites from Their Coordinates in Python | by Carlos J. Uribe

Compute the Distance Matrix of a Set of Sites from Their Coordinates in Python | by Carlos J. Uribe

[ad_1]

To build a distance matrix, we need to obtain the distance between any pair of locations. Sounds simple, but “distance” really depends on the context. Do we consider the number reported by mapping applications, like Google Maps, that take into account the streets network, bridges, parks, etc.? If so, do we take the distance that a pedestrian would walk, or that a car would drive? Or maybe just the good old length of a straight line connecting the two points? Clearly, we have many possible distances to choose from, with varying degrees of accuracy. The first question we have to answer is: how should we define “distance” in the particular context of our problem, and at this stage?

3.1. Should I go the extra mile to gain an extra yard?

It’s natural to feel tempted to use accurate data. In the end, we all know that accuracy is intrinsically valuable, and hence we are inclined to pursue accurate data, the more, the better. But we must also remember that more accurate data entails more complex code and dependencies, and thus more development time and maintenance. As we’re following an agile approach, we don’t let the best be the enemy of the good, so we will start as simple as we can, and then add complexity gradually, only if it is justified.

At this point of having to find distances between locations, we could do as many do, and jump straight to third-party API-based solutions that require app keys, credentials, or even credit card numbers for cloud providers. That approach is fine, but often times it is inefficient, as we can forget that accurate information brings added value, but also comes with added costs.

👁️ There ain’t no such thing as “free accuracy”

Remembering that in general we always “pay a price” for accessing accurate data (which is closely related to the concept of Value of Information) is another reason why taking an agile approach to the problem is a leaner course of action. By starting with simple assumptions on the “required level of accuracy”, and verifying their validity on our own problem data, we are ensuring that, if we eventually need to increase the accuracy of our data, we will be “paying a price” that is worth the (expected) improved results.

So let’s start very simple. We have coordinates. First idea: these coordinates are spread over parcels of the Earth very small compared to the radius of the Earth, so we could treat the latitudes as Y coordinates and the longitudes as X coordinates on a 2D plane, and then just compute the Euclidean distance (fancy term for the usual “straight line”).

  • Pros: a simple formula for distance, no new dependencies or data, spatial relationships between locations are conserved.
  • Cons: latitudes and longitudes are dimensionless numbers, so the numbers we’d get when solving the problem would not be actual distances. This means that some info we care about, like total distance traveled, would not be available, even if we can obtain the optimal tour.

The cons trump the pros, so we need a more complex approach (but still simple). Second idea: treat the coordinates as what they are, points on the Earth, but approximate the Earth as a sphere. A sphere does not have the familiar Euclidean geometry, so we will need a non-trivial formula that considers this spherical geometry when calculating the “straight line” distance between two points. So now it’s just a matter of implementing that formula using the radius of the Earth. We could do that, but we’ll instead rely on a famous library that already does that, and even better.

3.2. Geolocation utilities with geopy

If this article series were especially focused on geospatial data science, it would be valuable to take the time to explain and implement the formula for the great-circle distance, a nice baseline option to compute “straight-line” distances between points on a sphere. However, this article series is about the creation of an optimization-based tourism planning system, so instead of crafting our own formulas for geospatial utilities, we will rely on Geopy to do the heavy lifting for us. That way, we maintain focus on reaching a solution quickly.

Install it by running in an Anaconda prompt (or inside the conda environment we created in the first article, if you created it) the following:

conda install -y -c conda-forge geopy=2.3.0

Now, let’s do a demonstration with geopy for just two locations.

3.3. Getting to the points

Given the coordinates of two points, the geodesic function of geopy computes the distance of the geodesic connecting them across the Earth’s surface. In Geometry, the geodesic is the path of minimal distance between points on a given metric space. In our familiar Euclidean space, straight lines are the geodesics. In a spherical space, great-circles are. The underlying “space” that Geopy’s geodesic function considers is an accurate ellipsoid model of the Earth.

👁 A great-circle is great, but an ellipse is even greater

Earlier I said we would consider the Earth to be a sphere, because it was the simplest workable approximation. In reality, the Earth isn’t a sphere, but an ellipsoid, a solid with a more complex geometry. Now that geopy will spare us from coding our own functions for non-Euclidean geometries, we can upgrade our approximation of the Earth and employ the more accurate ellipsoidal distance between two points, instead of the great-circle distance. A better Earth model for the same lines of code. This indeed is free accuracy, so why not take it?

Here’s a function that computes the ellipsoidal distance between point 1 and point 2, in meters:

from geopy.distance import geodesic

def ellipsoidal_distance(p1, p2) -> float:
""" Calculate distance (in meters) between p1 and p2, where
each point is represented as a tuple (lat, lon) """
return geodesic(p1, p2).meters

What is the distance between the Eiffel Tour and the Louvre?

p1 = df_sites.loc['Tour Eiffel']
p2 = df_sites.loc['Louvre']

ellipsoidal_distance(p1, p2) # output: 3173.119635531859

3173 meters, around 3.2 km. Google Maps says it’s 3.5 km. The computed distance is 8.6 % lower than the “real” distance. Our legs only care about absolute errors in distance, though, which in this case amounts to just 330 extra meters to walk, compared to the estimated distance. Doesn’t seem like a significant error for a tourist who expects to be walking around all day in a big city.

And between the Eiffel Tour and Port de Suffren?

ellipsoidal_distance(
df_sites.loc['Tour Eiffel'],
df_sites.loc['Port de Suffren']
) # output: 328.3147101635456

328 meters, this time 6% lower (just 22 meters shorter) than the 350 meters Google Maps provides. Not that bad for applying a formula. As we would expect, the closer the points are, the less chance there is for streets to zigzag and turns to appear, and hence the lower the error incurred by the ellipsoid model. Looks good enough for our present purposes.

Now we must apply this function to all pairs of locations, thus getting the distance matrix the TSP model needs.

3.4. From coordinates to distance matrix

This is the easy part, where we just have to loop over all the sites twice and compute and store the distance between each pair. The below function does that. Note that the distance metric is passed as an optional argument, being the ellipsoidal distance we used before the default. We leave the door open to better distance metrics to be passed in the future.

def compute_distance_matrix(df_sites, dist_metric=ellipsoidal_distance):
""" Creates an N x N distance matrix from a dataframe of N locations
with a latitute column and a longitude column """
df_dist_matrix = pd.DataFrame(index=df_sites.index,
columns=df_sites.index)

for orig, orig_loc in df_sites.iterrows(): # for each origin
for dest, dest_loc in df_sites.iterrows(): # for each destination
df_dist_matrix.at[orig, dest] = dist_metric(orig_loc, dest_loc)
return df_dist_matrix

df_distances = compute_distance_matrix(df_sites)

display(df_distances)

Figure 3. Distance matrix resulting from using the ellipsoidal model of the Earth. (Image by author)

And there we have it! As expected, the diagonal of the matrix is zero, and the matrix is symmetric. The index and columns of the output dataframe contain the names of the input sites.

Functionality demonstrated. Now we can do better to facilitate the use of this function. Let’s wrap up this functionality inside a class in a convenient manner, for easy re-use, and more importantly, for easier integration with the optimization model of the TSP we built in the previous sprint.

[ad_2]
Source link

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *