Sunday, July 14, 2013

I Guess Caltech Students Do Have Social Connections

OK, just kidding. But this post involves Caltech dorms, and I feel I have to take part in some old-fashioned school rivalry. There is, in fact, only one college dorm worth talking about.

Anyway, a few years ago, a paper came out that was studying the structure of Facebook social networks on some college campuses:
Traud, A., Kelsic, E., Mucha, P., and Porter, M. Comparing Community Structure to Characteristics in Online Collegiate Social Networks. SIAM Review, 2011, Vol. 53, No. 3: pp. 526-543
In 2011, this Facebook dataset was used in a visualization competition that Conrad Lee  described in a post on his blog Sociograph. You can see some of the results in the post; perhaps not surprisingly, the visualizations were hitting the hairball ceiling.  

In another posting on Sociograph in late 2012, Conrad used the Caltech portion of that Facebook dataset to illustrate how to visualize adjacency matrices in Python. With that dataset, I've visualized the Caltech Facebook network using BioFabric. With 769 nodes and 16656* edges (33312 link columns with shadow links displayed), the network has a very high aspect ratio (43:1). I have a 15,000 pixel-wide version of the network that you can scroll back and forth through here on my blog. In my initial iteration of this post, I embedded the file directly on this page, but at about 3.8 MB, it's a little hefty to download when you first visit the blog. So I've included just a detail snapshot below, and you should go to the special scroll page to view the full version:

Caltech Facebook Network Detail
Click on this caption to view the 15,000 pixel-wide version

(Note the students' names in the data were anonymized to numbers.)

Since this network needed to be preprocessed a bit to get the final layout, and since it uses a feature I have not yet talked about (link groups), I'll spend this post talking a little bit about how I built it.

In addition to providing the links, the data set also indicates the House affiliation (i.e. dorm) of each student (there are eight dorms), and this turns out to be an important aspect of this network. So let's use that data and have BioFabric show the clusters. As I have pointed out before, automatic clustered layouts are not yet a built-in BioFabric 1.0 feature (though I am working on it!), so some basic scripting is needed. I'm not going to get into the low-level detail of showing the scripts, but will just give a high-level description of the steps involved.

First, using the dorm assignments, we identify which edges are in-dorm (Facebook friends in the same dorm), and which edges are between-dorm (Facebook friends in different dorms). Then, using just the in-dorm links, we create eight separate SIF files of in-dorm links, one per dorm. Separately loading each into BioFabric, we can get eight per-dorm BioFabric default layouts (i.e. we are going to use BioFabric to handle the default layout step, instead of scripting it as well). The resulting node orders, which we will use to create a single global ordering file, can be simply exported; just select File->Export->Export Node Order:

BioFabric Export Node Order
Click on image to enlarge

(As a side note, the Export Link Order option in that menu is the best route to seeing how to create the edge attribute files you need to explicitly lay out edges.)

Since we want to order the dorms from biggest to smallest, number the eight dorm node order files in that fashion, e.g. dorm1.noa (biggest) to dorm8.noa (smallest). You'll also need to chop off the first line of each of these files, using e.g.:

tail -n +2 < dorm1.noa > dormr1.noa

Then, to create the single global node ordering file, just do this on the Unix command line:

cat dormr*.noa | awk '{print $1 " = " NR-1}' | sed '1 i Node Row' > globalOrdering.noa
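
If you would rather script this step than chain shell tools, here is a minimal Python sketch of the same idea. It assumes each dorm's exported node order has already been read into a list of node names with the header line removed; the function name and the sample node names are made up for illustration.

```python
def global_node_order(per_dorm_orders):
    """Concatenate per-dorm node orders (biggest dorm first) into the
    lines of a single node-order attribute file.

    per_dorm_orders: list of lists of node names, one list per dorm,
    each already in that dorm's exported row order.
    """
    # Biggest dorm first; sorted() is stable, so equal-sized dorms keep
    # their original relative order.
    dorms = sorted(per_dorm_orders, key=len, reverse=True)

    lines = ["Node Row"]  # header line, as added by the sed step
    row = 0               # rows are numbered from 0, like NR-1 in awk
    for dorm in dorms:
        for name in dorm:
            lines.append(f"{name} = {row}")
            row += 1
    return lines


# Tiny made-up example: three "dorms" with anonymized numeric names.
example = [["7", "3"], ["12", "5", "9"], ["1"]]
print("\n".join(global_node_order(example)))
```

Unlike the shell version, this sketch sorts the dorms by size itself, so the input files do not need to be numbered from biggest to smallest by hand.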

That takes care of specifying the node ordering we will need. At the same time, we want to create the single full-network SIF file where each link is tagged with a suffix indicating whether it is in-dorm (tagged -ic, for in-cluster), or between-dorm (tagged -bc, for between-cluster). We were figuring that out above when we created the eight separate dorm-only networks, so also use that information to tag the links when writing out the final single SIF input file.
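
As a starting point for that scripting, here is a rough Python sketch of the split-and-tag step. It is only an illustration: it assumes the edges are available as pairs of node names, that the dorm assignments are in a dictionary, and that the suffix goes on the relation column of a tab-separated SIF line. The relation name "fb" and the function name are hypothetical.

```python
def tag_links(edges, dorm_of, relation="fb"):
    """Split undirected edges into in-dorm and between-dorm links.

    edges:   iterable of (node_a, node_b) pairs; reverse duplicates allowed
    dorm_of: dict mapping node name -> dorm label
    Returns (per_dorm, tagged): per_dorm maps each dorm to the SIF lines of
    its in-dorm links; tagged holds SIF lines for the whole network.
    """
    seen = set()
    per_dorm = {}
    tagged = []
    for a, b in edges:
        key = frozenset((a, b))
        if key in seen:          # skip the reverse copy of an undirected edge
            continue
        seen.add(key)
        if dorm_of[a] == dorm_of[b]:
            # In-dorm: goes into that dorm's own SIF and gets the -ic tag.
            per_dorm.setdefault(dorm_of[a], []).append(f"{a}\t{relation}\t{b}")
            tagged.append(f"{a}\t{relation}-ic\t{b}")
        else:
            # Between-dorm: only appears in the full network, tagged -bc.
            tagged.append(f"{a}\t{relation}-bc\t{b}")
    return per_dorm, tagged


# Made-up example: students 1 and 2 share a dorm, student 3 does not.
dorm_of = {"1": "A", "2": "A", "3": "B"}
edges = [("1", "2"), ("2", "1"), ("1", "3")]
per_dorm, tagged = tag_links(edges, dorm_of)
print("\n".join(tagged))
```

Each list in per_dorm can be written out as one of the eight per-dorm SIF files, and tagged becomes the single global SIF file.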

Then, import the global SIF file, and after the network is loaded, re-layout the whole network by specifying node order. Just select Layout->Layout Using Node Attributes..., use the globalOrdering.noa file you generated, and the network now has the eight separate dorms broken out.

When a network gets long and thin like this, I'm quick to turn on shadow links to get a better idea of what's going on. Just select Edit->Set Display Options... and check the Display Shadow Links box. At the same time, I like to shade the node zones, so also check Node Zone Shading before clicking OK. This allows you to see all the Facebook connections for a student by just looking at the node zone for that student.

There is now one more step. As it currently sits, each student has a single edge wedge for all his/her Facebook friends.  The tiny subnetwork of three students shown below illustrates that. Although the links going to the node lines right above and below these students correspond to the in-dorm links, that distinction is completely hidden:

BioFabric Submodel of Caltech Network: No Link Groups
Click on image to enlarge

So we want to separate the links into the in-dorm (-ic) and between-dorm (-bc) groups so we can see separate edge wedges for these two sets. Since we tagged the links in the SIF input with suffixes, we can easily use that information to create the two distinct edge wedges. Just go to Layout->Specify Link Groups...:

BioFabric Specify Link Groups
Click on image to enlarge

In the dialog, click Add New Entry... twice and enter in the two groups, -ic and -bc:

BioFabric Specify Link Groups Dialog
Click on image to enlarge

Click OK, and the network is laid out. Because of the link grouping, we can now easily visualize the two -ic and -bc classes of links for each student. Compare this version below with the one above. The first diagonal for each student holds the in-dorm -ic links, as that was the first link group we specified. As expected, those links end at the node rows near these students, i.e. the other nodes in the same dorm. The following single edge wedge of -bc links over on the right side of each node zone tends to look like two separate wedges, since they are connecting to both the dorms above and below this dorm:

BioFabric Submodel of Caltech Network: With Link Groups
Click on image to enlarge
Note how the grouping lets us instantly see which students are mostly in-dorm focused with their Facebook connections (e.g. 590), and which have more connections outside the dorm (e.g. 20).
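
The same comparison can be put into numbers with a few lines of scripting. This is a minimal Python sketch, assuming the tagged links are available as (source, relation, target) triples; the function names and the node names in the example are made up and are not taken from the Caltech data.

```python
from collections import Counter


def link_group_counts(tagged_links):
    """Count -ic and -bc links per node.

    tagged_links: iterable of (source, relation, target) triples whose
    relation ends in "-ic" or "-bc".
    Returns {node: Counter({"ic": n, "bc": m})}.
    """
    counts = {}
    for source, relation, target in tagged_links:
        if relation.endswith("-ic"):
            group = "ic"
        elif relation.endswith("-bc"):
            group = "bc"
        else:
            continue  # untagged link: ignore it
        for node in (source, target):  # undirected: credit both endpoints
            counts.setdefault(node, Counter())[group] += 1
    return counts


def in_dorm_fraction(counts, node):
    """Fraction of a node's tagged links that stay inside its dorm."""
    c = counts[node]
    return c["ic"] / (c["ic"] + c["bc"])


# Made-up example: "a" has two in-dorm friends and one outside friend.
links = [("a", "fb-ic", "b"), ("a", "fb-ic", "c"),
         ("a", "fb-bc", "x"), ("b", "fb-bc", "x")]
counts = link_group_counts(links)
print(round(in_dorm_fraction(counts, "a"), 2))
```

A fraction near 1 corresponds to a node zone dominated by the first (-ic) wedge, and a fraction near 0 to one dominated by the -bc wedge.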

That's it for details on building the network. So go back and have a look at the whole network in the 15,000 pixel-wide version up at the top of this post. You can see the eight separate runs of dorms, compare the two different types of connections, and get an idea of how the students interact.

My next post or two will cover a couple of interesting aspects of this network, but it will be a while, as I'll be on the road next week at the ISMB/ECCB 2013 conference. If you happen to be there, come say hi at my BioFabric Birds of a Feather session on Monday, July 22nd!

Correction: The original edge number of 33312 that I gave did not account for the equivalent reverse edges in the SIF file of the undirected graph getting thrown out on import to BioFabric. The view is actually showing 33312 edges, though, since shadow links are turned on.

Monday, July 8, 2013

Big Data, Big Documents: The 100-Foot-Wide PDF

RBioFabric Version 0.3

I've just committed RBioFabric Version 0.3 on GitHub. You can go take a look at it:

You can install it directly from GitHub using the following command sequence and then start working with it:
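# You need 'devtools':
install.packages('devtools')
# load it:
library(devtools)
# install 'RBioFabric' from GitHub
# (recent devtools releases expect install_github('wjrl/RBioFabric') instead):
install_github('RBioFabric',  username='wjrl')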
And here's a screen shot of RBioFabric in action inside RStudio:

RBioFabric running in RStudio
RBioFabric in action!

This new version has added a couple of necessary features to the very bare-bones first version:
  • You can specify a node order via an ordered list of node names, or a supplied reordering function.
  • You can display shadow links.

There's still lots to do, but RBioFabric actually provides some neat features that are not available in the Java version:
  • You can easily read in a variety of graph formats, since RBioFabric operates on the graphs provided by the igraph package. 
  • You can have BioFabric do a default layout that starts at a user-specified node, instead of the highest-degree node. This is shown in the example documentation for the defaultNodeOrder function. 
  • You can create PDF files of your BioFabric network.
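
The second item above, starting the default layout at a chosen node, can be pictured with a small sketch. It is written in Python rather than R, and it assumes the default layout is a breadth-first walk that visits higher-degree neighbors first. The function name and the tiny test graph are made up for illustration, and this is not the code behind defaultNodeOrder.

```python
from collections import deque


def default_node_order(adjacency, start=None):
    """Breadth-first node order, visiting higher-degree neighbors first.

    adjacency: dict mapping node -> set of neighbor nodes (undirected).
    start:     node to begin from; defaults to the highest-degree node.
    """
    def by_degree(nodes):
        # Highest degree first; ties broken by name for repeatable output.
        return sorted(nodes, key=lambda n: (-len(adjacency[n]), str(n)))

    roots = by_degree(adjacency)
    if start is not None:
        roots.remove(start)
        roots.insert(0, start)

    order, seen = [], set()
    for root in roots:               # later roots cover disconnected pieces
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in by_degree(adjacency[node] - seen):
                seen.add(nbr)
                queue.append(nbr)
    return order


# Made-up graph: "a" is the hub, "e" is isolated.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"},
         "d": {"a"}, "e": set()}
print(default_node_order(graph))             # starts at the hub
print(default_node_order(graph, start="d"))  # starts at a chosen node
```

The returned list is the top-to-bottom row order, so changing the start node simply changes which node ends up on the first row.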

RBioFabric and PDF

It's true that RBioFabric is, for the moment, the only way to create a BioFabric PDF output. This is because the current Java BioFabric version can only directly export to PNG. While it is possible to print a network to a PDF target with Java BioFabric, I have found that the results for large networks are unacceptable, apparently due to precision issues. For example, the one time I tried it, the endpoint glyphs did not coincide with their corresponding link ends! So if the shortcomings of the current RBioFabric are not an issue (e.g. you cannot mix directed and undirected link types, there is no explicit edge ordering, etc.), you can use it for getting a PDF of your network.

But there are some caveats to be aware of when doing PDF outputs. First, some PDF viewers are better than others. Specifically, PDF viewers (or PNG viewers, for that matter) that cannot do antialiasing of line art are a terrible choice for viewing BioFabric networks. The closely spaced parallel lines of a BioFabric plot MUST be antialiased to get acceptable results when you are zoomed out to view the whole network. It's also useful for the viewer to have a decent maximum zoom level and a "Hand Tool" to be able to navigate by dragging the cursor over the image. I've tested a few viewers, and here is what I found. Note that all my computers are pretty old, so newer versions of these tools may do a better job:
  • Evince on Linux (Document Viewer 2.30.3 tested): Antialiasing is always on, and the visuals are good. But there are a few problems. First, very tiny text below some size threshold explodes to a huge size.  Second, you cannot zoom above 400 percent, which is simply insufficient to explore your network. Finally, there is no hand tool to navigate by mouse dragging, which is essential. As a side note, for Postscript output, Evince does not antialias the image, giving very poor results.  
  • Preview on Mac (Version 4.2 tested): Be sure that Anti-alias text and line art is checked on the PDF tab in Preferences, which gives adequate visuals (I feel they are way too dark at the full-network level). It has a very good maximum zoom level, and the Move cursor provides convenient mouse-drag navigation. You can also Select a rectangle and then zoom to it using command-* (i.e. command-shift-8).
  • Adobe Reader on Mac (Version 9.5.5 tested): Be sure that Smooth line art is checked, and (VERY IMPORTANT) Enhance thin lines is NOT checked, on the Page Display Preferences. The Hand tool is available via Tools->Select & Zoom->Hand Tool, and the Marquee Zoom from the same Select & Zoom menu allows you to quickly zoom to a selected rectangle. The maximum 6400 percent zoom level is very good for exploration.

Sizing the PDF Document

It is important to make sure that your PDF document is large enough! If you don't set your PDF document height and width to a large enough value, the small text labels will not appear. My experiments show that both Adobe Reader and Mac Preview will no longer display the smallest node labels when the document gets smaller than about .0145 inches per link, which is about 69 links per inch. To get labels that are correctly proportioned, it appears to be best to actually stay above .0175 inches per link, i.e. 57 links per inch. So, for the yeastHighQuality.sif network displayed on the home page, which contains 6888 links, you need to make a PDF file about 100 inches wide (i.e. 8 feet, 4 inches) to be just able to view it, and 120 inches wide (10 feet) to really do a decent job.
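
The arithmetic behind those numbers is just links times spacing. Here is a small Python sketch of it, using the two thresholds quoted above; the function and constant names are made up for illustration.

```python
MIN_VISIBLE = 0.0145   # inches per link; below this the smallest labels vanish
COMFORTABLE = 0.0175   # inches per link; labels look correctly proportioned


def pdf_width_inches(n_links, inches_per_link=COMFORTABLE):
    """Page width needed so each link column gets inches_per_link."""
    return n_links * inches_per_link


links = 6888  # yeastHighQuality.sif
for label, spacing in (("just viewable", MIN_VISIBLE),
                       ("comfortable", COMFORTABLE)):
    width = pdf_width_inches(links, spacing)
    print(f"{label}: {width:.1f} inches ({width / 12:.1f} feet)")
```

Plugging in your own link count gives the minimum width to ask for when creating the PDF.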

You read that last dimension correctly. To be able to explore a 6,888 link BioFabric network right down to the smallest detail, you need to make your PDF 10 feet wide! The implication is that a network with 69,000 links would need to be 100 feet wide. And that starts to limit what the PDF viewer tools can handle. I tested them using the network from the Cytoscape HumanInteractomeMay.sif file, which has 61,263 links. Using the resolution guidelines I gave above, I made the PDF document 1200 inches wide by 240 inches high; that's 100 feet wide by 20 feet tall. The Adobe Reader just would not load it; it seems that it hits a limit at documents that are 200 inches square. The Mac Preview tool was able to load it, and you could zoom all the way in to view the labels. Unfortunately, the full-network view looks really bad.

The bottom line is that if you need to make a poster-sized (e.g. 48 inch-wide) image of a network with maybe 2,750 links or fewer, PDF can (in theory) provide a scalable, completely readable image. I say in theory, since your plotter driver rendering the PDF may have its own issues with tiny text and hairlines. If you have more links than that, a PNG file of 300 to 600 dpi will provide a decent image, though it will not have per-link resolution. I've been successful with the PNG route for the posters I have made so far.
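
That rule of thumb can be turned around to ask how many links fit on a poster of a given width. The following Python sketch does this under the 0.0175 inches-per-link guideline from the sizing section; the function names are hypothetical, and the PDF-versus-PNG choice simply mirrors the advice in the paragraph above.

```python
import math

COMFORTABLE_SPACING = 0.0175  # inches per link, from the sizing experiments


def max_links_for_width(width_inches, inches_per_link=COMFORTABLE_SPACING):
    """Largest link count that still gets inches_per_link on this width."""
    return math.floor(width_inches / inches_per_link)


def pick_format(n_links, poster_width_inches):
    """Suggest "PDF" when every link fits at readable spacing, else "PNG"."""
    if max_links_for_width(poster_width_inches) >= n_links:
        return "PDF"
    return "PNG"


print(max_links_for_width(48))   # link budget for a 48 inch-wide poster
print(pick_format(6888, 48))     # the yeastHighQuality.sif network
```

For a 48 inch-wide poster the budget comes out a little under the "maybe 2,750" figure, which is why that number is only a rough guide.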

So RBioFabric now provides a route to creating huge PDF documents for your network. But by far the best way to interactively explore a large network is to still use the BioFabric Java application, since the tool is designed to view large networks using the built-in interactive search, magnifier, touring, mouseover, submodeling, and view-tiling features.

Be sure to keep watching this blog for more announcements of future RBioFabric improvements! 

Monday, July 1, 2013

R you Ready for BioFabRic?

OK, this is going to be a very short post. After a marathon weekend hacking binge, I've created an initial version of BioFabric that is implemented in R.  The code for Version 0.01 is hosted on GitHub at:

At the moment, it's nothing fancy. For example, it only does the default layout, cannot do shadow links, and so on. Right now, it just takes a graph in the R igraph format and creates a static plot. I have done no benchmarking on how big you can go before everything starts to fall apart, but it seems to handle node and link counts in the hundreds without too much sweat. Here's an example of a network generated with an igraph call using the arguments m=6, directed=FALSE:

Click on image to enlarge
It's very rough at the moment, and is just a couple of source files to start. But if you are the adventurous type, download bioFabric.R and then have a look at the TestHarness.R file to see how it can be used. I'll have more blog posts coming soon.

Happy R hacking!