The London Tube as a Graph¶

Representing a rail network as a graph is nothing new, its the most obvious way to do it. Nodes are the stations and edges are the lines between them. But what happens when you apply algorithms like PageSort to the graph? Will it be able to pick out the stations a human would intuitively pick out as important? Lets find out

Imports first, I'll use NetworkX as the graph library. It seems to be the easiest and most full featured library around.

%matplotlib inline

import colorsys
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from bokeh.plotting import figure, show
from bokeh.resources import CDN
from bokeh.io import output_notebook
output_notebook( resources=CDN )

pd.set_option('max_colwidth', 200)

I found it suprisingly hard to find a nicely structured dataset of stations and connections between them. Luckily this library had some CSVs buried in it with just what I was looking for. So i yanked them out and opened them in pandas

lines       = pd.read_csv('london.lines.csv', index_col=0)
stations    = pd.read_csv('london.stations.csv', index_col=0)
connections = pd.read_csv('london.connections.csv')

Look at the nice data! Some of it is probably out of date, but not in any major way.

lines.head(3)

stations.head(3)

#drop station display_name for the future, its an ugly column
stations.drop('display_name', axis=1, inplace=True)

connections.head(3)

We can create a naive graph super easily from this.

A simplified graph¶

graph = nx.Graph()

for connection_id, connection in connections.iterrows():
    station1_name = stations.ix[connection['station1']]['name']
    station2_name = stations.ix[connection['station2']]['name']
    graph.add_edge(station1_name, station2_name, time = connection['time'])
    
#add the connection between Bank and Monument manually
graph.add_edge('Bank', 'Monument', time = 1)

Already we can do some kind of interesting stuff, like get a reasonable path between Oxford Circus and Canary Wharf

nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')

['Oxford Circus',
 'Tottenham Court Road',
 'Holborn',
 'Chancery Lane',
 "St. Paul's",
 'Bank',
 'Shadwell',
 'Limehouse',
 'Westferry',
 'West India Quay',
 'Canary Wharf']

And run PageRank on the network!

pagerank = nx.pagerank_numpy(graph)
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)

stations.sort_values('pagerank', ascending=False).head(10)

Those results look incredibly good considering how little work we've put in! With the exception of Turnham Green, this seems like a very reasonable list of the most important tube stations in London. And this is all without taking into account the Overground network, or looking at any traffic stats!

NetworkX also implements the HITS algorithm. It was originally designed to differenciate between web pages which acted as hubs of information and those which acted as authoritive sourves of information. It does this by looking at incoming and outgoing edges from each node. In an undirected graph (like we're using), incoming and outgoing edges are the same, but the results I found when I applied it to the Tube graph were quite interesting!

hits = nx.hits_scipy(graph, max_iter=1000)[0]
hits = pd.DataFrame(hits.items(), columns=['name', 'hits'])
stations = pd.merge(stations, hits, on='name', right_index=True)

stations.sort_values('hits', ascending=False).head(10)

Where PageRank finds the important stations, the HITS algorithm seems to be pretty good at finding the busy stations, still without any traffic data! Neat!

Lets visualise the importance of stations as defined by pagerank. Less important stations will be colored green, and more important stations will be colored red.

def pseudocolor(val):
    h = (1.0 - val) * 120 / 360
    r, g, b = colorsys.hsv_to_rgb(h, 1., 1.)
    return r * 255, g * 255, b * 255

normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for edge in graph.edges():
    p.line( 
        x= [locations[pt][0] for pt in edge],
        y= [locations[pt][1] for pt in edge],
    )

for node in graph.nodes():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(pageranks[node] * 12, 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

A realistic graph¶

The above data looks good, but the graph is terribly simple. It assumed that there is no inherent cost to switching trains at a station. This is definitely not true. A train change at a station takes at least as long as going one average stop.

To model this, we need to put edges between the lines at each station. For example, instead of having a single 'Oxford Circus' node in our graph, we will now have the nodes 'Oxford Circus' (to represent entrance and exit of the main station), 'Oxford Circus - Central Line', 'Oxford Circus - Victoria Line', etc. Think of it like each line has its own layer in a 3d tube map, and when 2 lines share a station there is an edge between the layers at that point.

This graph still isn't quite accurate to the real world. The nodes should really be platforms instead of stops on a line. For example, the District and Circle lines share a lot of platforms. But using this approach, we can easily model the time it takes to switch between lines.

stations.drop(['pagerank', 'hits'], axis=1, inplace=True)

graph = nx.Graph()

for connection_id, connection in connections.iterrows():
    line_name = lines.ix[connection.line]['name']
    station1_name = stations.ix[connection.station1]['name']
    station2_name = stations.ix[connection.station2]['name']

    #Oxford Circus - Central Line
    node1_name = "%s - %s" % (station1_name, line_name)
    node2_name = "%s - %s" % (station2_name, line_name)
    
    #"Bond Street - Central Line" to "Oxford Circus - Central Line"
    graph.add_edge(node1_name, node2_name,    time = connection['time'])
    
    #"Oxford Circus - Central Line" to "Oxford Circus"
    graph.add_edge(node1_name, station1_name, time = 1)
    graph.add_edge(node2_name, station2_name, time = 1)
    
    #"Oxford Circus - Central Line" to "Oxford Circus - Victoria Line"
    for neighbor in graph.neighbors(station1_name):
        graph.add_edge(node1_name, neighbor, time = 0.5)        
    for neighbor in graph.neighbors(station2_name):
        graph.add_edge(node2_name, neighbor, time = 0.5)
        
graph.add_edge('Bank', 'Monument', time = 1)

We attach a weight to each edge, which is just the inverse of the time

for node1, neighbors in graph.edge.iteritems():
    for node2, edge in neighbors.iteritems():
        graph[node1][node2]['weight'] = 1.0 / edge['time']

To get an idea of what the graph looks like, we can inspect it a little

graph.neighbors('Oxford Circus')

['Oxford Circus - Bakerloo Line',
 'Oxford Circus - Victoria Line',
 'Oxford Circus - Central Line']

graph.neighbors('Oxford Circus - Central Line')

['Oxford Circus',
 'Tottenham Court Road - Central Line',
 'Oxford Circus - Bakerloo Line',
 'Oxford Circus - Central Line',
 'Oxford Circus - Victoria Line',
 'Bond Street - Central Line']

nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')

['Oxford Circus',
 'Oxford Circus - Victoria Line',
 'Green Park - Victoria Line',
 'Green Park - Jubilee Line',
 'Westminster - Jubilee Line',
 'Waterloo - Jubilee Line',
 'Southwark - Jubilee Line',
 'London Bridge - Jubilee Line',
 'Bermondsey - Jubilee Line',
 'Canada Water - Jubilee Line',
 'Canary Wharf - Jubilee Line',
 'Canary Wharf']

Lets see what PageRank thinks of our more realistic graph

pagerank = nx.pagerank_numpy(graph, weight='weight')
pagerank = pd.DataFrame(pagerank.items(), columns=['name', 'pagerank'])
stations = pd.merge(stations, pagerank, on='name', right_index=True)

A stations pagerank score, should probably be the sum of its own score, plus that of its platforms

def node_name_to_pagerank(node_name):
    return pagerank[pagerank['name'] == node_name]['pagerank'].iloc[0]

for station_id, station in stations.iterrows():
    for neighbor in graph.neighbors(station['name']):
        stations.ix[station_id, 'pagerank'] += node_name_to_pagerank(neighbor)

stations.sort_values('pagerank', ascending=False).head(10)

The results seem just slightly more intuitive than our previous list. Notable differences with our earlier run of PageRank are that this list ranks Moorgate, Hammersmith, and Notting Hill Gate higher, but Earl's Court is no longer in the top 10. Its difficult to compare the lists quantifiably, and objectively say one is better than the other. I guess you would have to survey London commuters and ask them to rank their most important stations.

Lets plot these results again.

def node_name_to_station(node_name):
    return node_name.split(' - ')[0]

normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for edge in graph.edges():
    x1 = locations[node_name_to_station(edge[0])][0]
    x2 = locations[node_name_to_station(edge[1])][0]
    y1 = locations[node_name_to_station(edge[0])][1]
    y2 = locations[node_name_to_station(edge[1])][1]
    
    if not (x1 == x2 and y1 == y2):   
        p.line( 
            x= [x1, x2],
            y= [y1, y2],
        )

for node, location in locations.iteritems():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(max(1, pageranks[node] * 12), 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

Edge Rank¶

Now that we have a more complex representation of the tube graph, we can do some more interesting stuff. Wouldn't it be fun to see which connections, as opposed to end points, are the most important to the system? Intuitively, I reckon that the pathway between Bank and Monument should be important.

But of course, PageRank only works on nodes, not edges. So we transform our graph to its line graph representation. Each edge in the original graph becomes a node in the line graph. Edges in the line graph exist if there was a common node between two edges in the original.

Original, I tried to perform PageRank on the line graph of the original, but the results from this were terrible, no better than nonsense. Then I tried again with the HITS algorithm. The results were better, but still not very good.

linegraph = nx.line_graph(graph)
edgerank = nx.hits(linegraph)
edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in edgerank[0].items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()

The connections between the nodes in the most important stations dominate the top of the list, generally clustering by which station they belong to. Not terribly interesting to look at.

Edge Rank V2¶

With the failure of the linegraph method, I wanted to find a new method to rank edges. Two obvious heuristics come to mind.

Rank edges by the sum of the PageRank scores of the nodes they connect
Rank edges by the number of times they appear in paths from one node to another

Method 1¶

def node_to_station_pagerank(node_name):
    station_name = node_name.split(' - ')[0]
    return stations[stations['name'] == station_name]['pagerank'].iloc[0]

edgerank = edgerank[edgerank['node1'] != edgerank['node2']]
for edge_id, edge in edgerank.iterrows():
    score = node_name_to_pagerank(edge['node1']) + node_name_to_pagerank(edge['node2'])
    edgerank.set_value(edge_id, 'edgerank', score)

edgerank.sort_values('edgerank', ascending=False).head()

Hmm, not very promising. Lets try another way

Method 2¶

NetworkX provides a very simple way of calculating all possible shortest paths around your graph. It uses the Floyd-Warshall algorithm to generate a list shortest paths from every possible origin to every possible destination in just a few minutes.

all_shortest_paths = nx.floyd_warshall_predecessor_and_distance(graph, weight='time')[0]

This gives us a dictionary of dictionaries. The first maps origins to the inner dictionaries. The inner dictionaries map a destination to the preceeding node in the shortest path to that destination. We can backtrack along this path and count up every time our path crossed an edge. The idea is that the more times we cross an edge, the more important it is.

counter = Counter()
for station in stations['name']:
    for destination, pred in all_shortest_paths[station].iteritems(): 
        if any(stations['name'] == destination):
            while not pred == station:
                counter[tuple(sorted([destination, pred]))] += 1
                destination, pred = pred, all_shortest_paths[station][pred]

edgerank = pd.DataFrame([(k[0], k[1], v) for k, v in counter.items()], columns=['node1', 'node2', 'edgerank'])
edgerank.sort_values('edgerank', ascending=False).head()

Its not what I was expecting, but it seems like a pretty sound list. Although the Bank to Monument passage is used much less than I would have thought. This could be just some bias from me, since I know I use it quite a bit.

edgerank[(edgerank['node1'] == 'Bank') & (edgerank['node2'] == 'Monument')]

What are the most used intra-station connections?

node_name_to_station = np.vectorize(node_name_to_station)

intra_station_edgeranks = edgerank[node_name_to_station(edgerank['node1']) == node_name_to_station(edgerank['node2'])]
intra_station_edgeranks.sort_values('edgerank', ascending=False).head(10)

This looks like a very sensible list. These pathways are usually pretty busy. Next time you're passing through one of those passage ways, see if its optimized!

Now lets put everything together and visualise both the important stations and connections that we were able to pick out. Similarly to stations, green connectons are less important than red ones.

edgerank_normed = edgerank['edgerank']
edgerank_normed = edgerank_normed - edgerank_normed.min()
edgerank_normed = edgerank_normed / edgerank_normed.max()
edgerank_normed = dict(zip(map(tuple, edgerank[['node1', 'node2']].values), edgerank_normed))

normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range = (.4,.7),
    y_range = (.2,.5),
    height= 700,
    width= 900,
)
for index, (node1, node2, edgerank_score) in edgerank.iterrows():
    x1 = locations[str(node_name_to_station(node1))][0]
    x2 = locations[str(node_name_to_station(node2))][0]
    y1 = locations[str(node_name_to_station(node1))][1]
    y2 = locations[str(node_name_to_station(node2))][1]
    edgerank_score = edgerank_normed[(node1, node2)]
    
    if not (x1 == x2 and y1 == y2):   
        p.line( 
            x= [x1, x2],
            y= [y1, y2],
            line_color = pseudocolor(edgerank_score),
            line_width = max(0.5, 3 * edgerank_score)
        )

for node, location in locations.iteritems():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y, 
        radius = .01 * pageranks[node], 
        fill_color = pseudocolor(pageranks[node]), 
        line_alpha=0)
    p.text(
        x, y, 
        text = {'value':node}, 
        text_font_size = str(min(max(2, pageranks[node] * 12), 10)) + "pt", 
        text_alpha = pageranks[node],
        text_align='center',
        text_font_style='bold')
    
show(p)

Conclusion¶

Who would have thought that PageRank and some simple heuristics could produce such intuitive results about the London Tube? I know I was blown away when I quickly tested the idea with the simple graph, I wasn't expecting it to work that nicely at all.

At one point during this write up, I had the idea that this information could be useful to TFL, and the people designing the network. But then I realised this I've probably backwards, and that it's not quite true. The fact that PageRank picked out a lot of big, important stations is no mistake. The underground network has been designed this way. These algorithms and heuristics do not inform the network designers, it is probably the designers that inform the algorithms. They have structured their network in a logic way, and all we have done is verify that. Our algorithms worked only as well as the network was designed.

Download IPython Notebook

Download london.lines.csv

Download london.stations.csv

Download london.connections.csv

	name	colour	stripe
line
1	Bakerloo Line	AE6017	NaN
3	Circle Line	FFE02B	NaN
6	Hammersmith & City Line	F491A8	NaN

	latitude	longitude	name	display_name	zone	total_lines	rail
id
1	51.5028	-0.2801	Acton Town	Acton<br />Town	3	2	0
2	51.5143	-0.0755	Aldgate	NaN	1	2	0
3	51.5154	-0.0726	Aldgate East	Aldgate<br />East	1	2	0

	station1	station2	line	time
0	11	163	1	1
1	11	212	1	2
2	49	87	1	1

	latitude	longitude	name	zone	total_lines	rail	pagerank
id
145	51.5308	-0.1238	King's Cross St. Pancras	1.0	6	1	0.007915
11	51.5226	-0.1571	Baker Street	1.0	5	0	0.007613
13	51.5133	-0.0886	Bank	1.0	4	0	0.007140
74	51.4920	-0.1973	Earl's Court	1.5	2	0	0.007047
193	51.5154	-0.1755	Paddington	1.0	4	1	0.006178
265	51.4951	-0.2547	Turnham Green	2.5	2	0	0.006108
279	51.5036	-0.1143	Waterloo	1.0	4	1	0.006082
107	51.5067	-0.1428	Green Park	1.0	3	0	0.005852
225	51.5117	-0.0560	Shadwell	2.0	2	0	0.005845
192	51.5150	-0.1415	Oxford Circus	1.0	3	0	0.005757

	latitude	longitude	name	zone	total_lines	rail	pagerank	hits
id
192	51.5150	-0.1415	Oxford Circus	1	3	0	0.005757	0.059742
107	51.5067	-0.1428	Green Park	1	3	0	0.005852	0.059523
197	51.5098	-0.1342	Picadilly Circus	1	2	0	0.003862	0.047062
28	51.5142	-0.1494	Bond Street	1	2	0	0.004143	0.042528
285	51.5010	-0.1254	Westminster	1	3	0	0.004058	0.035861
279	51.5036	-0.1143	Waterloo	1	4	1	0.006082	0.033215
259	51.5165	-0.1310	Tottenham Court Road	1	2	0	0.004067	0.031668
151	51.5113	-0.1281	Leicester Square	1	2	0	0.004019	0.031330
11	51.5226	-0.1571	Baker Street	1	5	0	0.007613	0.029949
13	51.5133	-0.0886	Bank	1	4	0	0.007140	0.028016

	node1	node2	edgerank
141	King's Cross St. Pancras - Circle Line	King's Cross St. Pancras - Hammersmith & City Line	0.028982
722	King's Cross St. Pancras - Circle Line	King's Cross St. Pancras - Metropolitan Line	0.028982
224	King's Cross St. Pancras - Hammersmith & City Line	King's Cross St. Pancras - Metropolitan Line	0.028982
395	King's Cross St. Pancras - Circle Line	King's Cross St. Pancras - Northern Line	0.028845
118	King's Cross St. Pancras - Hammersmith & City Line	King's Cross St. Pancras - Northern Line	0.028845

	node1	node2	edgerank
1146	Stockwell - Northern Line	Stockwell - Victoria Line	0.004502
418	West Finchley - Northern Line	Woodside Park - Northern Line	0.004494
242	Ruislip Gardens - Central Line	South Ruislip - Central Line	0.004475
880	Balham - Northern Line	Tooting Bec - Northern Line	0.004467
998	Poplar - Docklands Light Railway	Westferry - Docklands Light Railway	0.004464

	node1	node2	edgerank
272	Baker Street - Jubilee Line	Bond Street - Jubilee Line	19766
764	Bethnal Green - Central Line	Liverpool Street - Central Line	18274
241	Bethnal Green - Central Line	Mile End - Central Line	17896
797	Bond Street - Central Line	Oxford Circus - Central Line	16914
446	Baker Street - Metropolitan Line	Finchley Road - Metropolitan Line	15990

	node1	node2	edgerank
138	Bond Street - Central Line	Bond Street - Jubilee Line	14134
844	South Kensington - Circle Line	South Kensington - Piccadilly Line	13400
383	Baker Street - Jubilee Line	Baker Street - Metropolitan Line	12164
115	Westminster - Circle Line	Westminster - Jubilee Line	8682
114	Mile End - Central Line	Mile End - District Line	7852
728	Euston - Northern Line	Euston - Victoria Line	6952
558	Victoria - Circle Line	Victoria - Victoria Line	6652
660	Waterloo - Jubilee Line	Waterloo - Waterloo & City Line	5976
745	Canary Wharf - Docklands Light Railway	Canary Wharf - Jubilee Line	5738
293	Oxford Circus - Central Line	Oxford Circus - Victoria Line	4980