Saturday, February 23, 2013

The Shape of Things to Come

Or perhaps a more fitting classic science fiction reference would be The Outer Limits.  For this posting, I direct your attention to this BioFabric network:

BioFabric Stanford Web Network
Click on image to enlarge
It's another example from the Stanford Large Network Dataset Collection.  This time, it's the Stanford Web Graph.  It's pretty much what you would expect: to quote the source, "nodes represent pages from Stanford University (stanford.edu) and directed edges represent hyperlinks between them."

But the interesting thing about the network above is that it contains 281,903 nodes and 2,312,497 edges.

And it kind of blows my mind that I feel that by looking at this, I can actually start to get some inklings about what is going on (YMMV).  At a minimum, I certainly know what I want to zoom in on and start to explore; there are all sorts of interesting structures in there to poke around and look at.  And the sequential nature of the BioFabric approach means I can easily do this in a methodical fashion. With hairballs containing 2.3 million edges, it seems kinda hopeless (for me, at least) without first hacking most of the network away.

Clicking on the image above to get a "larger version" is kind of a cruel joke.  Over at the Gallery, you can grab an 11,075 x 1,350 1.6 MB PNG with insanely inadequate resolution, or a much larger 31,173 x 3,800 13.0 MB PNG that is just ludicrously, stupendously inadequate.  They both look pretty bad... there is lots of room for improvement in rendering when links are this dense.  Of course, if you were trying to print this network out on paper, with one line per millimeter, the paper would be 2.3 kilometers long and 282 meters high.
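For the curious, the arithmetic behind those numbers is easy to check. Here is a minimal sketch, assuming (as in a BioFabric layout) that each edge gets its own vertical line and each node its own horizontal line; the function names are just made up for illustration and are not part of BioFabric:

```python
# Node and edge counts quoted above for the Stanford Web Graph.
NODES = 281_903
EDGES = 2_312_497


def paper_size_m(nodes, edges, mm_per_line=1.0):
    """Return (width_m, height_m) when every edge takes one column and
    every node takes one row, each mm_per_line millimeters wide."""
    return edges * mm_per_line / 1000.0, nodes * mm_per_line / 1000.0


def lines_per_pixel(nodes, edges, width_px, height_px):
    """Return (edges per pixel column, nodes per pixel row) for an image."""
    return edges / width_px, nodes / height_px


width_m, height_m = paper_size_m(NODES, EDGES)
print(f"paper: {width_m / 1000:.1f} km x {height_m:.0f} m")

# The two PNG sizes mentioned above.
for w, h in [(11_075, 1_350), (31_173, 3_800)]:
    e_px, n_px = lines_per_pixel(NODES, EDGES, w, h)
    print(f"{w} x {h}: about {e_px:.0f} edges per pixel column, "
          f"{n_px:.0f} nodes per pixel row")
```

Even the bigger PNG is cramming dozens of edges into every single pixel column, which is why "inadequate" is putting it mildly.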

This is, of course, where the interactive BioFabric tool is supposed to come in! You can scroll around, zoom in and out, and explore a network in great detail.  After all, nobody thinks twice anymore about zooming in to look at cars and houses all over the entire world in Google Maps, right?  But, I am embarrassed to admit, this is why I referred to The Outer Limits at the start of the post... I can't do that yet at this scale. The nodes-as-lines technique scales wonderfully, but my implementation in software is not there yet. Version 1.0.0 is, after all, built as a proof-of-concept. Using the 4GB large memory version available from the web site, I was able to get the network loaded, laid out, and exported to PNG, but the program was so sluggish as to be unusable as an interactive tool. Navigation was almost impossible. And when I tried to lay out the network using my similarity algorithm, the 4GB was not up to the task: I got an "out of memory" error after it ran for a few hours. So now I have a test case to use for working on the program's scalability.  But by using Jedi navigation tricks (hint: you can get a maximum zoom to the location under the mouse by pressing Ctrl-1 [that's one, not L], then zoom back out a bit), I did get a screenshot to show a detail:

BioFabric Stanford Web Network: Detail Screenshot
Click on image to enlarge

So these problems are the reason why I did not post a .bif file to the Gallery; it would be too painful.  Additionally, it would be really big: since that format is XML (more lack of scalability!), the file is about 750MB uncompressed, and 70MB in a ZIP.  If you insist, you can go get the original file from the source and create your own .sif import by deleting a few lines at the start and using awk to format it. You can get it to import if you let it run overnight, but it's just too embarrassing right now for me to make it too easy to do this.
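If you'd rather not fight with awk, the same conversion can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: that the header lines at the start of the source file begin with "#", that each remaining line holds a whitespace-separated pair of node IDs, and that a tab-separated "source, relation, target" line is what the .sif import wants. The relation label "pp" is an arbitrary placeholder, and the function name is made up for this example.

```python
def edge_list_to_sif(lines, relation="pp"):
    """Yield one SIF line (source <tab> relation <tab> target) per edge.

    Assumes header/comment lines start with '#' and every data line is a
    whitespace-separated pair of node IDs; anything else is skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header at the start
        parts = line.split()
        if len(parts) != 2:
            continue  # skip anything that is not a simple node pair
        source, target = parts
        yield f"{source}\t{relation}\t{target}"


# Tiny made-up sample in the assumed layout (not real data from the file).
sample = [
    "# header line",
    "# FromNodeId\tToNodeId",
    "1\t2",
    "1\t3",
]
for sif_line in edge_list_to_sif(sample):
    print(sif_line)
```

For the real thing, you would pass an open file handle instead of the sample list and write each yielded line to a .sif file. Since it works one line at a time, it never needs to hold all 2.3 million edges in memory.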

One last point: I really found it frustrating that the data set is anonymized!  When you can actually see all sorts of interesting patterns in the network (why are there 21 pages with what looks to be essentially identical sets of links, as shown above?) you kinda want to try to figure out why that is.

So, truth be told, this network mostly serves to give an idea of the challenges ahead for BioFabric. But I'm looking forward to the day when I can easily explore this network (and larger) in depth: the picture at the top is The Shape of Things to Come!
