The London Tube as a Graph

Representing a rail network as a graph is nothing new, its the most obvious way to do it. Nodes are the stations and edges are the lines between them. But what happens when you apply algorithms like PageSort to the graph? Will it be able to pick out the stations a human would intuitively pick out as important? Lets find out

Imports first, I'll use NetworkX as the graph library. It seems to be the easiest and most full featured library around.

In [1]:
%matplotlib inline

import colorsys
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from bokeh.plotting import figure, show
from bokeh.resources import CDN
from bokeh.io import output_notebook
output_notebook( resources=CDN )

pd.set_option('max_colwidth', 200)
BokehJS successfully loaded.

I found it suprisingly hard to find a nicely structured dataset of stations and connections between them. Luckily this library had some CSVs buried in it with just what I was looking for. So i yanked them out and opened them in pandas

In [2]:
lines       = pd.read_csv('london.lines.csv', index_col=0)
stations    = pd.read_csv('london.stations.csv', index_col=0)
connections = pd.read_csv('london.connections.csv')

Look at the nice data! Some of it is probably out of date, but not in any major way.

In [3]:
lines.head(3)
Out[3]:
name colour stripe
line
1 Bakerloo Line AE6017 NaN
3 Circle Line FFE02B NaN
6 Hammersmith & City Line F491A8 NaN
In [4]:
stations.head(3)
Out[4]:
latitude longitude name display_name zone total_lines rail
id
1 51.5028 -0.2801 Acton Town Acton<br />Town 3 2 0
2 51.5143 -0.0755 Aldgate NaN 1 2 0
3 51.5154 -0.0726 Aldgate East Aldgate<br />East 1 2 0
In [5]:
#drop station display_name for the future, its an ugly column
stations.drop('display_name', axis=1, inplace=True)
In [6]:
connections.head(3)
Out[6]:
station1 station2 line time
0 11 163 1 1
1 11 212 1 2
2 49 87 1 1

We can create a naive graph super easily from this.

A simplified graph

In [7]:
graph = nx.Graph()

for connection_id, connection in connections.iterrows():
    station1_name = stations.ix[connection['station1']]['name']
    station2_name = stations.ix[connection['station2']]['name']
    graph.add_edge(station1_name, station2_name, time = connection['time'])
    
#add the connection between Bank and Monument manually
graph.add_edge('Bank', 'Monument', time = 1)

Already we can do some kind of interesting stuff, like get a reasonable path between Oxford Circus and Canary Wharf

In [8]:
nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')
Out[8]:
['Oxford Circus',
 'Tottenham Court Road',
 'Holborn',
 'Chancery Lane',
 "St. Paul's",
 'Bank',
 'Shadwell',
 'Limehouse',
 'Westferry',
 'West India Quay',
 'Canary Wharf']

And run PageRank on the network!

In [9]:
pagerank = nx.pagerank_numpy(graph)
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)
In [10]:
stations.sort_values('pagerank', ascending=False).head(10)
Out[10]:
latitude longitude name zone total_lines rail pagerank
id
145 51.5308 -0.1238 King's Cross St. Pancras 1.0 6 1 0.007915
11 51.5226 -0.1571 Baker Street 1.0 5 0 0.007613
13 51.5133 -0.0886 Bank 1.0 4 0 0.007140
74 51.4920 -0.1973 Earl's Court 1.5 2 0 0.007047
193 51.5154 -0.1755 Paddington 1.0 4 1 0.006178
265 51.4951 -0.2547 Turnham Green 2.5 2 0 0.006108
279 51.5036 -0.1143 Waterloo 1.0 4 1 0.006082
107 51.5067 -0.1428 Green Park 1.0 3 0 0.005852
225 51.5117 -0.0560 Shadwell 2.0 2 0 0.005845
192 51.5150 -0.1415 Oxford Circus 1.0 3 0 0.005757

Those results look incredibly good considering how little work we've put in! With the exception of Turnham Green, this seems like a very reasonable list of the most important tube stations in London. And this is all without taking into account the Overground network, or looking at any traffic stats!

NetworkX also implements the HITS algorithm. It was originally designed to differenciate between web pages which acted as hubs of information and those which acted as authoritive sourves of information. It does this by looking at incoming and outgoing edges from each node. In an undirected graph (like we're using), incoming and outgoing edges are the same, but the results I found when I applied it to the Tube graph were quite interesting!

In [11]:
hits = nx.hits_scipy(graph, max_iter=1000)[0]
hits = pd.DataFrame(hits.items(), columns=['name', 'hits'])
stations = pd.merge(stations, hits, on='name', right_index=True)
In [12]:
stations.sort_values('hits', ascending=False).head(10)
Out[12]:
latitude longitude name zone total_lines rail pagerank hits
id
192 51.5150 -0.1415 Oxford Circus 1 3 0 0.005757 0.059742
107 51.5067 -0.1428 Green Park 1 3 0 0.005852 0.059523
197 51.5098 -0.1342 Picadilly Circus 1 2 0 0.003862 0.047062
28 51.5142 -0.1494 Bond Street 1 2 0 0.004143 0.042528
285 51.5010 -0.1254 Westminster 1 3 0 0.004058 0.035861
279 51.5036 -0.1143 Waterloo 1 4 1 0.006082 0.033215
259 51.5165 -0.1310 Tottenham Court Road 1 2 0 0.004067 0.031668
151 51.5113 -0.1281 Leicester Square 1 2 0 0.004019 0.031330
11 51.5226 -0.1571 Baker Street 1 5 0 0.007613 0.029949
13 51.5133 -0.0886 Bank 1 4 0 0.007140 0.028016

Where PageRank finds the important stations, the HITS algorithm seems to be pretty good at finding the busy stations, still without any traffic data! Neat!

Lets visualise the importance of stations as defined by pagerank. Less important stations will be colored green, and more important stations will be colored red.

In [13]:
def pseudocolor(val):
    h = (1.0 - val) * 120 / 360
    r, g, b = colorsys.hsv_to_rgb(h, 1., 1.)
    return r * 255, g * 255, b * 255
In [14]:
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for edge in graph.edges():
    p.line( 
        x= [locations[pt][0] for pt in edge],
        y= [locations[pt][1] for pt in edge],
    )

for node in graph.nodes():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(pageranks[node] * 12, 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

A realistic graph

The above data looks good, but the graph is terribly simple. It assumed that there is no inherent cost to switching trains at a station. This is definitely not true. A train change at a station takes at least as long as going one average stop.

To model this, we need to put edges between the lines at each station. For example, instead of having a single 'Oxford Circus' node in our graph, we will now have the nodes 'Oxford Circus' (to represent entrance and exit of the main station), 'Oxford Circus - Central Line', 'Oxford Circus - Victoria Line', etc. Think of it like each line has its own layer in a 3d tube map, and when 2 lines share a station there is an edge between the layers at that point.

This graph still isn't quite accurate to the real world. The nodes should really be platforms instead of stops on a line. For example, the District and Circle lines share a lot of platforms. But using this approach, we can easily model the time it takes to switch between lines.

In [15]:
stations.drop(['pagerank', 'hits'], axis=1, inplace=True)
In [16]:
graph = nx.Graph()

for connection_id, connection in connections.iterrows():
    line_name = lines.ix[connection.line]['name']
    station1_name = stations.ix[connection.station1]['name']
    station2_name = stations.ix[connection.station2]['name']

    #Oxford Circus - Central Line
    node1_name = "%s - %s" % (station1_name, line_name)
    node2_name = "%s - %s" % (station2_name, line_name)
    
    #"Bond Street - Central Line" to "Oxford Circus - Central Line"
    graph.add_edge(node1_name, node2_name,    time = connection['time'])
    
    #"Oxford Circus - Central Line" to "Oxford Circus"
    graph.add_edge(node1_name, station1_name, time = 1)
    graph.add_edge(node2_name, station2_name, time = 1)
    
    #"Oxford Circus - Central Line" to "Oxford Circus - Victoria Line"
    for neighbor in graph.neighbors(station1_name):
        graph.add_edge(node1_name, neighbor, time = 0.5)        
    for neighbor in graph.neighbors(station2_name):
        graph.add_edge(node2_name, neighbor, time = 0.5)
        
graph.add_edge('Bank', 'Monument', time = 1)

We attach a weight to each edge, which is just the inverse of the time

In [17]:
for node1, neighbors in graph.edge.iteritems():
    for node2, edge in neighbors.iteritems():
        graph[node1][node2]['weight'] = 1.0 / edge['time']

To get an idea of what the graph looks like, we can inspect it a little

In [18]:
graph.neighbors('Oxford Circus')
Out[18]:
['Oxford Circus - Bakerloo Line',
 'Oxford Circus - Victoria Line',
 'Oxford Circus - Central Line']
In [19]:
graph.neighbors('Oxford Circus - Central Line')
Out[19]:
['Oxford Circus',
 'Tottenham Court Road - Central Line',
 'Oxford Circus - Bakerloo Line',
 'Oxford Circus - Central Line',
 'Oxford Circus - Victoria Line',
 'Bond Street - Central Line']
In [20]:
nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')
Out[20]:
['Oxford Circus',
 'Oxford Circus - Victoria Line',
 'Green Park - Victoria Line',
 'Green Park - Jubilee Line',
 'Westminster - Jubilee Line',
 'Waterloo - Jubilee Line',
 'Southwark - Jubilee Line',
 'London Bridge - Jubilee Line',
 'Bermondsey - Jubilee Line',
 'Canada Water - Jubilee Line',
 'Canary Wharf - Jubilee Line',
 'Canary Wharf']

Lets see what PageRank thinks of our more realistic graph

In [21]:
pagerank = nx.pagerank_numpy(graph, weight='weight')
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)

A stations pagerank score, should probably be the sum of its own score, plus that of its platforms

In [22]:
def node_name_to_pagerank(node_name):
    return pagerank[pagerank['name'] == node_name]['pagerank'].iloc[0]
In [23]:
for station_id, station in stations.iterrows():
    for neighbor in graph.neighbors(station['name']):
        stations.ix[station_id, 'pagerank'] += node_name_to_pagerank(neighbor)
In [24]:
stations.sort_values('pagerank', ascending=False).head(10)
Out[24]:
latitude longitude name zone total_lines rail pagerank
id
145 51.5308 -0.1238 King's Cross St. Pancras 1.0 6 1 0.013119
11 51.5226 -0.1571 Baker Street 1.0 5 0 0.010972
13 51.5133 -0.0886 Bank 1.0 4 0 0.009416
193 51.5154 -0.1755 Paddington 1.0 4 1 0.009205
279 51.5036 -0.1143 Waterloo 1.0 4 1 0.008860
87 51.5074 -0.1223 Embankment 1.0 4 0 0.008329
167 51.5186 -0.0886 Moorgate 1.0 4 1 0.008037
156 51.5178 -0.0823 Liverpool Street 1.0 4 1 0.008005
110 51.4936 -0.2251 Hammersmith 2.0 3 0 0.007002
186 51.5094 -0.1967 Notting Hill Gate 1.5 3 0 0.006799

The results seem just slightly more intuitive than our previous list. Notable differences with our earlier run of PageRank are that this list ranks Moorgate, Hammersmith, and Notting Hill Gate higher, but Earl's Court is no longer in the top 10. Its difficult to compare the lists quantifiably, and objectively say one is better than the other. I guess you would have to survey London commuters and ask them to rank their most important stations.

Lets plot these results again.

In [25]:
def node_name_to_station(node_name):
    return node_name.split(' - ')[0]
In [26]:
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for edge in graph.edges():
    x1 = locations[node_name_to_station(edge[0])][0]
    x2 = locations[node_name_to_station(edge[1])][0]
    y1 = locations[node_name_to_station(edge[0])][1]
    y2 = locations[node_name_to_station(edge[1])][1]
    
    if not (x1 == x2 and y1 == y2):   
        p.line( 
            x= [x1, x2],
            y= [y1, y2],
        )

for node, location in locations.iteritems():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(max(1, pageranks[node] * 12), 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

Edge Rank

Now that we have a more complex representation of the tube graph, we can do some more interesting stuff. Wouldn't it be fun to see which connections, as opposed to end points, are the most important to the system? Intuitively, I reckon that the pathway between Bank and Monument should be important.

But of course, PageRank only works on nodes, not edges. So we transform our graph to its line graph representation. Each edge in the original graph becomes a node in the line graph. Edges in the line graph exist if there was a common node between two edges in the original.

Original, I tried to perform PageRank on the line graph of the original, but the results from this were terrible, no better than nonsense. Then I tried again with the HITS algorithm. The results were better, but still not very good.

In [27]:
linegraph = nx.line_graph(graph)
edgerank = nx.hits(linegraph)
edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in edgerank[0].items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()
Out[27]:
node1 node2 edgerank
141 King's Cross St. Pancras - Circle Line King's Cross St. Pancras - Hammersmith & City Line 0.028982
722 King's Cross St. Pancras - Circle Line King's Cross St. Pancras - Metropolitan Line 0.028982
224 King's Cross St. Pancras - Hammersmith & City Line King's Cross St. Pancras - Metropolitan Line 0.028982
395 King's Cross St. Pancras - Circle Line King's Cross St. Pancras - Northern Line 0.028845
118 King's Cross St. Pancras - Hammersmith & City Line King's Cross St. Pancras - Northern Line 0.028845

The connections between the nodes in the most important stations dominate the top of the list, generally clustering by which station they belong to. Not terribly interesting to look at.

Edge Rank V2

With the failure of the linegraph method, I wanted to find a new method to rank edges. Two obvious heuristics come to mind.

  1. Rank edges by the sum of the PageRank scores of the nodes they connect
  2. Rank edges by the number of times they appear in paths from one node to another

Method 1

In [28]:
def node_to_station_pagerank(node_name):
    station_name = node_name.split(' - ')[0]
    return stations[stations['name'] == station_name]['pagerank'].iloc[0]
In [29]:
edgerank = edgerank[edgerank['node1'] != edgerank['node2']]
for edge_id, edge in edgerank.iterrows():
    score = node_name_to_pagerank(edge['node1']) + node_name_to_pagerank(edge['node2'])
    edgerank.set_value(edge_id, 'edgerank', score)
In [30]:
edgerank.sort_values('edgerank', ascending=False).head()
Out[30]:
node1 node2 edgerank
1146 Stockwell - Northern Line Stockwell - Victoria Line 0.004502
418 West Finchley - Northern Line Woodside Park - Northern Line 0.004494
242 Ruislip Gardens - Central Line South Ruislip - Central Line 0.004475
880 Balham - Northern Line Tooting Bec - Northern Line 0.004467
998 Poplar - Docklands Light Railway Westferry - Docklands Light Railway 0.004464

Hmm, not very promising. Lets try another way

Method 2

NetworkX provides a very simple way of calculating all possible shortest paths around your graph. It uses the Floyd-Warshall algorithm to generate a list shortest paths from every possible origin to every possible destination in just a few minutes.

In [31]:
all_shortest_paths = nx.floyd_warshall_predecessor_and_distance(graph, weight='time')[0]

This gives us a dictionary of dictionaries. The first maps origins to the inner dictionaries. The inner dictionaries map a destination to the preceeding node in the shortest path to that destination. We can backtrack along this path and count up every time our path crossed an edge. The idea is that the more times we cross an edge, the more important it is.

In [32]:
counter = Counter()
for station in stations['name']:
    for destination, pred in all_shortest_paths[station].iteritems(): 
        if any(stations['name'] == destination):
            while not pred == station:
                counter[tuple(sorted([destination, pred]))] += 1
                destination, pred = pred, all_shortest_paths[station][pred]
In [33]:
edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in counter.items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()
Out[33]:
node1 node2 edgerank
272 Baker Street - Jubilee Line Bond Street - Jubilee Line 19766
764 Bethnal Green - Central Line Liverpool Street - Central Line 18274
241 Bethnal Green - Central Line Mile End - Central Line 17896
797 Bond Street - Central Line Oxford Circus - Central Line 16914
446 Baker Street - Metropolitan Line Finchley Road - Metropolitan Line 15990

Its not what I was expecting, but it seems like a pretty sound list. Although the Bank to Monument passage is used much less than I would have thought. This could be just some bias from me, since I know I use it quite a bit.

In [34]:
edgerank[(edgerank['node1'] == 'Bank') & (edgerank['node2'] == 'Monument')]
Out[34]:
node1 node2 edgerank
108 Bank Monument 1442

What are the most used intra-station connections?

In [35]:
node_name_to_station = np.vectorize(node_name_to_station)
In [36]:
intra_station_edgeranks = edgerank[node_name_to_station(edgerank['node1']) == node_name_to_station(edgerank['node2'])]
intra_station_edgeranks.sort_values('edgerank', ascending=False).head(10)
Out[36]:
node1 node2 edgerank
138 Bond Street - Central Line Bond Street - Jubilee Line 14134
844 South Kensington - Circle Line South Kensington - Piccadilly Line 13400
383 Baker Street - Jubilee Line Baker Street - Metropolitan Line 12164
115 Westminster - Circle Line Westminster - Jubilee Line 8682
114 Mile End - Central Line Mile End - District Line 7852
728 Euston - Northern Line Euston - Victoria Line 6952
558 Victoria - Circle Line Victoria - Victoria Line 6652
660 Waterloo - Jubilee Line Waterloo - Waterloo & City Line 5976
745 Canary Wharf - Docklands Light Railway Canary Wharf - Jubilee Line 5738
293 Oxford Circus - Central Line Oxford Circus - Victoria Line 4980

This looks like a very sensible list. These pathways are usually pretty busy. Next time you're passing through one of those passage ways, see if its optimized!

Now lets put everything together and visualise both the important stations and connections that we were able to pick out. Similarly to stations, green connectons are less important than red ones.

In [37]:
edgerank_normed = edgerank['edgerank']
edgerank_normed = edgerank_normed - edgerank_normed.min()
edgerank_normed = edgerank_normed / edgerank_normed.max()
edgerank_normed = dict(zip(map(tuple, edgerank[['node1', 'node2']].values), edgerank_normed))

normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for index, (node1, node2, edgerank_score) in edgerank.iterrows():
    x1 = locations[str(node_name_to_station(node1))][0]
    x2 = locations[str(node_name_to_station(node2))][0]
    y1 = locations[str(node_name_to_station(node1))][1]
    y2 = locations[str(node_name_to_station(node2))][1]
    edgerank_score = edgerank_normed[(node1, node2)]
    
    if not (x1 == x2 and y1 == y2):   
        p.line( 
            x= [x1, x2],
            y= [y1, y2],
            line_color = pseudocolor(edgerank_score),
            line_width = max(0.5, 3 * edgerank_score)
        )

for node, location in locations.iteritems():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(max(2, pageranks[node] * 12), 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

Conclusion

Who would have thought that PageRank and some simple heuristics could produce such intuitive results about the London Tube? I know I was blown away when I quickly tested the idea with the simple graph, I wasn't expecting it to work that nicely at all.

At one point during this write up, I had the idea that this information could be useful to TFL, and the people designing the network. But then I realised this I've probably backwards, and that it's not quite true. The fact that PageRank picked out a lot of big, important stations is no mistake. The underground network has been designed this way. These algorithms and heuristics do not inform the network designers, it is probably the designers that inform the algorithms. They have structured their network in a logic way, and all we have done is verify that. Our algorithms worked only as well as the network was designed.

Download IPython Notebook

Download london.lines.csv

Download london.stations.csv

Download london.connections.csv