Representing a rail network as a graph is nothing new, its the most obvious way to do it. Nodes are the stations and edges are the lines between them. But what happens when you apply algorithms like PageSort to the graph? Will it be able to pick out the stations a human would intuitively pick out as important? Lets find out
Imports first, I'll use NetworkX as the graph library. It seems to be the easiest and most full featured library around.
%matplotlib inline
import colorsys
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from bokeh.plotting import figure, show
from bokeh.resources import CDN
from bokeh.io import output_notebook
output_notebook( resources=CDN )
pd.set_option('max_colwidth', 200)
I found it suprisingly hard to find a nicely structured dataset of stations and connections between them. Luckily this library had some CSVs buried in it with just what I was looking for. So i yanked them out and opened them in pandas
lines = pd.read_csv('london.lines.csv', index_col=0)
stations = pd.read_csv('london.stations.csv', index_col=0)
connections = pd.read_csv('london.connections.csv')
Look at the nice data! Some of it is probably out of date, but not in any major way.
lines.head(3)
stations.head(3)
#drop station display_name for the future, its an ugly column
stations.drop('display_name', axis=1, inplace=True)
connections.head(3)
We can create a naive graph super easily from this.
graph = nx.Graph()
for connection_id, connection in connections.iterrows():
station1_name = stations.ix[connection['station1']]['name']
station2_name = stations.ix[connection['station2']]['name']
graph.add_edge(station1_name, station2_name, time = connection['time'])
#add the connection between Bank and Monument manually
graph.add_edge('Bank', 'Monument', time = 1)
Already we can do some kind of interesting stuff, like get a reasonable path between Oxford Circus and Canary Wharf
nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')
And run PageRank on the network!
pagerank = nx.pagerank_numpy(graph)
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)
stations.sort_values('pagerank', ascending=False).head(10)
Those results look incredibly good considering how little work we've put in! With the exception of Turnham Green, this seems like a very reasonable list of the most important tube stations in London. And this is all without taking into account the Overground network, or looking at any traffic stats!
NetworkX also implements the HITS algorithm. It was originally designed to differenciate between web pages which acted as hubs of information and those which acted as authoritive sourves of information. It does this by looking at incoming and outgoing edges from each node. In an undirected graph (like we're using), incoming and outgoing edges are the same, but the results I found when I applied it to the Tube graph were quite interesting!
hits = nx.hits_scipy(graph, max_iter=1000)[0]
hits = pd.DataFrame(hits.items(), columns=['name', 'hits'])
stations = pd.merge(stations, hits, on='name', right_index=True)
stations.sort_values('hits', ascending=False).head(10)
Where PageRank finds the important stations, the HITS algorithm seems to be pretty good at finding the busy stations, still without any traffic data! Neat!
Lets visualise the importance of stations as defined by pagerank. Less important stations will be colored green, and more important stations will be colored red.
def pseudocolor(val):
h = (1.0 - val) * 120 / 360
r, g, b = colorsys.hsv_to_rgb(h, 1., 1.)
return r * 255, g * 255, b * 255
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))
p = figure(
x_range = (.4,.7),
y_range = (.2,.5),
height= 700,
width= 900,
)
for edge in graph.edges():
p.line(
x= [locations[pt][0] for pt in edge],
y= [locations[pt][1] for pt in edge],
)
for node in graph.nodes():
x = [locations[node][0]]
y = [locations[node][1]]
p.circle(
x, y,
radius = .01 * pageranks[node],
fill_color = pseudocolor(pageranks[node]),
line_alpha=0)
p.text(
x, y,
text = {'value':node},
text_font_size = str(min(pageranks[node] * 12, 10)) + "pt",
text_alpha = pageranks[node],
text_align='center',
text_font_style='bold')
show(p)
The above data looks good, but the graph is terribly simple. It assumed that there is no inherent cost to switching trains at a station. This is definitely not true. A train change at a station takes at least as long as going one average stop.
To model this, we need to put edges between the lines at each station. For example, instead of having a single 'Oxford Circus' node in our graph, we will now have the nodes 'Oxford Circus' (to represent entrance and exit of the main station), 'Oxford Circus - Central Line', 'Oxford Circus - Victoria Line', etc. Think of it like each line has its own layer in a 3d tube map, and when 2 lines share a station there is an edge between the layers at that point.
This graph still isn't quite accurate to the real world. The nodes should really be platforms instead of stops on a line. For example, the District and Circle lines share a lot of platforms. But using this approach, we can easily model the time it takes to switch between lines.
stations.drop(['pagerank', 'hits'], axis=1, inplace=True)
graph = nx.Graph()
for connection_id, connection in connections.iterrows():
line_name = lines.ix[connection.line]['name']
station1_name = stations.ix[connection.station1]['name']
station2_name = stations.ix[connection.station2]['name']
#Oxford Circus - Central Line
node1_name = "%s - %s" % (station1_name, line_name)
node2_name = "%s - %s" % (station2_name, line_name)
#"Bond Street - Central Line" to "Oxford Circus - Central Line"
graph.add_edge(node1_name, node2_name, time = connection['time'])
#"Oxford Circus - Central Line" to "Oxford Circus"
graph.add_edge(node1_name, station1_name, time = 1)
graph.add_edge(node2_name, station2_name, time = 1)
#"Oxford Circus - Central Line" to "Oxford Circus - Victoria Line"
for neighbor in graph.neighbors(station1_name):
graph.add_edge(node1_name, neighbor, time = 0.5)
for neighbor in graph.neighbors(station2_name):
graph.add_edge(node2_name, neighbor, time = 0.5)
graph.add_edge('Bank', 'Monument', time = 1)
We attach a weight to each edge, which is just the inverse of the time
for node1, neighbors in graph.edge.iteritems():
for node2, edge in neighbors.iteritems():
graph[node1][node2]['weight'] = 1.0 / edge['time']
To get an idea of what the graph looks like, we can inspect it a little
graph.neighbors('Oxford Circus')
graph.neighbors('Oxford Circus - Central Line')
nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')
Lets see what PageRank thinks of our more realistic graph
pagerank = nx.pagerank_numpy(graph, weight='weight')
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)
A stations pagerank score, should probably be the sum of its own score, plus that of its platforms
def node_name_to_pagerank(node_name):
return pagerank[pagerank['name'] == node_name]['pagerank'].iloc[0]
for station_id, station in stations.iterrows():
for neighbor in graph.neighbors(station['name']):
stations.ix[station_id, 'pagerank'] += node_name_to_pagerank(neighbor)
stations.sort_values('pagerank', ascending=False).head(10)
The results seem just slightly more intuitive than our previous list. Notable differences with our earlier run of PageRank are that this list ranks Moorgate, Hammersmith, and Notting Hill Gate higher, but Earl's Court is no longer in the top 10. Its difficult to compare the lists quantifiably, and objectively say one is better than the other. I guess you would have to survey London commuters and ask them to rank their most important stations.
Lets plot these results again.
def node_name_to_station(node_name):
return node_name.split(' - ')[0]
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))
p = figure(
x_range = (.4,.7),
y_range = (.2,.5),
height= 700,
width= 900,
)
for edge in graph.edges():
x1 = locations[node_name_to_station(edge[0])][0]
x2 = locations[node_name_to_station(edge[1])][0]
y1 = locations[node_name_to_station(edge[0])][1]
y2 = locations[node_name_to_station(edge[1])][1]
if not (x1 == x2 and y1 == y2):
p.line(
x= [x1, x2],
y= [y1, y2],
)
for node, location in locations.iteritems():
x = [locations[node][0]]
y = [locations[node][1]]
p.circle(
x, y,
radius = .01 * pageranks[node],
fill_color = pseudocolor(pageranks[node]),
line_alpha=0)
p.text(
x, y,
text = {'value':node},
text_font_size = str(min(max(1, pageranks[node] * 12), 10)) + "pt",
text_alpha = pageranks[node],
text_align='center',
text_font_style='bold')
show(p)
Now that we have a more complex representation of the tube graph, we can do some more interesting stuff. Wouldn't it be fun to see which connections, as opposed to end points, are the most important to the system? Intuitively, I reckon that the pathway between Bank and Monument should be important.
But of course, PageRank only works on nodes, not edges. So we transform our graph to its line graph representation. Each edge in the original graph becomes a node in the line graph. Edges in the line graph exist if there was a common node between two edges in the original.
Original, I tried to perform PageRank on the line graph of the original, but the results from this were terrible, no better than nonsense. Then I tried again with the HITS algorithm. The results were better, but still not very good.
linegraph = nx.line_graph(graph)
edgerank = nx.hits(linegraph)
edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in edgerank[0].items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()
The connections between the nodes in the most important stations dominate the top of the list, generally clustering by which station they belong to. Not terribly interesting to look at.
With the failure of the linegraph method, I wanted to find a new method to rank edges. Two obvious heuristics come to mind.
def node_to_station_pagerank(node_name):
station_name = node_name.split(' - ')[0]
return stations[stations['name'] == station_name]['pagerank'].iloc[0]
edgerank = edgerank[edgerank['node1'] != edgerank['node2']]
for edge_id, edge in edgerank.iterrows():
score = node_name_to_pagerank(edge['node1']) + node_name_to_pagerank(edge['node2'])
edgerank.set_value(edge_id, 'edgerank', score)
edgerank.sort_values('edgerank', ascending=False).head()
Hmm, not very promising. Lets try another way
NetworkX provides a very simple way of calculating all possible shortest paths around your graph. It uses the Floyd-Warshall algorithm to generate a list shortest paths from every possible origin to every possible destination in just a few minutes.
all_shortest_paths = nx.floyd_warshall_predecessor_and_distance(graph, weight='time')[0]
This gives us a dictionary of dictionaries. The first maps origins to the inner dictionaries. The inner dictionaries map a destination to the preceeding node in the shortest path to that destination. We can backtrack along this path and count up every time our path crossed an edge. The idea is that the more times we cross an edge, the more important it is.
counter = Counter()
for station in stations['name']:
for destination, pred in all_shortest_paths[station].iteritems():
if any(stations['name'] == destination):
while not pred == station:
counter[tuple(sorted([destination, pred]))] += 1
destination, pred = pred, all_shortest_paths[station][pred]
edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in counter.items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()
Its not what I was expecting, but it seems like a pretty sound list. Although the Bank to Monument passage is used much less than I would have thought. This could be just some bias from me, since I know I use it quite a bit.
edgerank[(edgerank['node1'] == 'Bank') & (edgerank['node2'] == 'Monument')]
What are the most used intra-station connections?
node_name_to_station = np.vectorize(node_name_to_station)
intra_station_edgeranks = edgerank[node_name_to_station(edgerank['node1']) == node_name_to_station(edgerank['node2'])]
intra_station_edgeranks.sort_values('edgerank', ascending=False).head(10)
This looks like a very sensible list. These pathways are usually pretty busy. Next time you're passing through one of those passage ways, see if its optimized!
Now lets put everything together and visualise both the important stations and connections that we were able to pick out. Similarly to stations, green connectons are less important than red ones.
edgerank_normed = edgerank['edgerank']
edgerank_normed = edgerank_normed - edgerank_normed.min()
edgerank_normed = edgerank_normed / edgerank_normed.max()
edgerank_normed = dict(zip(map(tuple, edgerank[['node1', 'node2']].values), edgerank_normed))
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))
p = figure(
x_range = (.4,.7),
y_range = (.2,.5),
height= 700,
width= 900,
)
for index, (node1, node2, edgerank_score) in edgerank.iterrows():
x1 = locations[str(node_name_to_station(node1))][0]
x2 = locations[str(node_name_to_station(node2))][0]
y1 = locations[str(node_name_to_station(node1))][1]
y2 = locations[str(node_name_to_station(node2))][1]
edgerank_score = edgerank_normed[(node1, node2)]
if not (x1 == x2 and y1 == y2):
p.line(
x= [x1, x2],
y= [y1, y2],
line_color = pseudocolor(edgerank_score),
line_width = max(0.5, 3 * edgerank_score)
)
for node, location in locations.iteritems():
x = [locations[node][0]]
y = [locations[node][1]]
p.circle(
x, y,
radius = .01 * pageranks[node],
fill_color = pseudocolor(pageranks[node]),
line_alpha=0)
p.text(
x, y,
text = {'value':node},
text_font_size = str(min(max(2, pageranks[node] * 12), 10)) + "pt",
text_alpha = pageranks[node],
text_align='center',
text_font_style='bold')
show(p)
Who would have thought that PageRank and some simple heuristics could produce such intuitive results about the London Tube? I know I was blown away when I quickly tested the idea with the simple graph, I wasn't expecting it to work that nicely at all.
At one point during this write up, I had the idea that this information could be useful to TFL, and the people designing the network. But then I realised this I've probably backwards, and that it's not quite true. The fact that PageRank picked out a lot of big, important stations is no mistake. The underground network has been designed this way. These algorithms and heuristics do not inform the network designers, it is probably the designers that inform the algorithms. They have structured their network in a logic way, and all we have done is verify that. Our algorithms worked only as well as the network was designed.