Monday, December 9, 2013

Also Sprach Zarathustra

OK, before we get started with this post, you first have to view this video clip to set the appropriate mood:


And with that memorable introduction, I present the BioFabric version of Brendan Griffen's Graph of Influential Thinkers, with 7239 nodes and 14560 edges:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers
Click on picture to enlarge
And what's that got to do with the opening credits of 2001: A Space Odyssey? Well, that memorable piece of music, Einleitung, oder Sonnenaufgang (Introduction, or Sunrise), is the famous opening section of Richard Strauss's tone poem Also Sprach Zarathustra. And who was the author of the book Also sprach Zarathustra: Ein Buch für Alle und Keinen that inspired Strauss? Friedrich Nietzsche, who happens to hold the premier, top-left, row #1 position in the BioFabric version of the network:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Nietzsche
Click on picture to enlarge
The graph was built using the "influenced" and "influenced by" links that appear in the sidebar of many Wikipedia articles about historical and current figures. Go and visit Dr. Griffen's blog post to learn about the creation of his network, and to see his beautiful Gephi-based renderings!

I'll be spending the next couple of blog posts discussing this network, which will give me a chance to discuss BioFabric's "similar connectivity" algorithm, since it was used to lay out the network instead of the default method. But to get started in this post, I've just included some screen shots of BioFabric showing some of the same thinkers as were depicted in the original blog post. First some artists, with Pablo Picasso as the most visible node:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Artists
Click on picture to enlarge
Some authors, where Stephen King and H.P. Lovecraft are prominent:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Authors
Click on picture to enlarge

The comedians include George Carlin and Richard Pryor:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Comedians
Click on picture to enlarge

More philosophers, who are placed a little further over than Nietzsche:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Philosophers
Click on picture to enlarge

And some more writers, with Beat poets and other Beat Generation writers showing prominently on the left:

BioFabric Network Visualization of Brendan Griffen's Graph of Thinkers: Beat Writers
Click on picture to enlarge
Of course, the best way to explore the network is to view it in BioFabric. Head on over to the BioFabric Gallery to pick up the .bif file (in a compressed gzip archive file) and have fun! Thanks to Brendan Griffen for providing the data, and keep an eye out for my next blog posts on the network.

Thursday, December 5, 2013

I Was Lost, But Now I'm Found...

In my last posting, I showed a variety of node ordering schemes that could be applied to the combined glucose/oleate network. Of course, the only way to actually do these different layouts is to create a .noa node attribute file that specifies a node ordering and then install it using the Layout->Layout Using Node Attributes... feature. To make that whole process more understandable, I've posted the code for the little standalone Java program I used to create the files in my BioFabric GitHub repository at:


It's a quick-and-dirty implementation that is totally hardwired to this specific example, but taking a look at that code can give you an idea of how to extend it to your particular situation.


Friday, November 29, 2013

I View Yeast to the Breadth and Height BioFabric Can Reach

It's time to pick up where I left off last month with the yeast glucose versus oleate network. In that first post, I introduced the network, and then I showed how the target node rows can be logically arranged in an order so that targets with the same combination of inputs are grouped together.

I wrapped up that introductory post after showing the two different experimental conditions as two separate networks. But by using the link grouping feature, we can create a single combined network that allows us to directly compare the two conditions side-by-side, and that's the topic of this posting.

First, you should go and review how to set up link groups; I covered that in my Caltech dorm post, and so I won't cover that ground again in detail. In this case, I simply created a single .sif file by combining the results from the two different conditions. For the glucose condition links, I added a "-g" suffix to the link tag, while I added an "-o" suffix to the oleate condition links. I then used these two tags to create the link groups. By putting the glucose tag first in the list of groups, I ensured that the edge wedges for the glucose condition would always show up to the left of the oleate condition edge wedge.
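As an illustration only (this is not the code used to build the figures), here is a minimal Python sketch of the combining step. The function names, the "pd" link tag, and the target names are hypothetical placeholders; only the "-g" and "-o" suffixes and the tab-delimited SIF layout come from the description above.

```python
def tag_links(links, suffix):
    """Append a condition suffix to the link tag of every (source, tag, target) triple."""
    return [(src, tag + suffix, tgt) for src, tag, tgt in links]


def to_sif(links):
    """Render triples as tab-delimited SIF lines: source, link tag, target."""
    return "\n".join("\t".join(link) for link in links) + "\n"


# Hypothetical toy data: "pd" and the Target names are placeholders.
glucose = [("Oaf1p", "pd", "Target1"), ("Adr1p", "pd", "Target2")]
oleate = [("Oaf1p", "pd", "Target1"), ("Pip2p", "pd", "Target2")]

# One combined network; the suffix records which condition a link came from.
combined = tag_links(glucose, "-g") + tag_links(oleate, "-o")
sif_text = to_sif(combined)
```

The two suffixed tags are then the natural candidates for the link group list, glucose first.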

I also created a node attribute file that I used to order the nodes, just like I did in my two original networks. Since there are four transcription factors, each either present or absent as an input, there are 2^4 - 1 = 15 possible non-zero combinations of the inputs. If you recollect from my first post, I showed how we could consider these 15 different input combinations simply as binary numbers, and then order the targets by just sorting those numbers. This put the target nodes with all four inputs in the topmost rows, and the nodes with only an Adr1p (A) input in the lowest rows. I did exactly the same thing in this case, though in this case I have (2^4 x 2^4) - 1 = 2^8 - 1 = 255 possible non-zero combinations across the four inputs for the two different experimental conditions. Here's the result:

BioFabric network visualization of combined glucose oleate network
Click on picture to enlarge
Let's get oriented here. The leftmost edge wedge contains the links for the targets of Oaf1p (O) under the glucose condition; the next wedge to its immediate right contains links for Oaf1p targets under the oleate condition. Most, but not all, of the O-glucose targets are O-oleate targets too, and there are a whole bunch more new O-oleate targets as well. The pattern of glucose wedge followed by the oleate wedge for each of the four transcription factors is the direct result of our using link grouping to organize the two different experimental conditions. Thus, the same pattern of glucose followed by oleate links repeats across the remaining three source nodes Oaf3p (Y), Pip2p (P), and Adr1p (A).

The crucial point here is to note how we can now directly compare the networks for the two separate conditions. For example, as I just alluded to above, looking at the O targets for glucose versus oleate, we see that maybe 20% of the O targets under glucose are not O targets in the oleate condition. For any combination of inputs and conditions, we can quickly scan the network to find such patterns.
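If you would rather check an eyeballed estimate like that with a script, the comparison is a set difference. The sketch below is only an illustration: the function names, the "pd" tag, and the toy targets T1 to T6 are hypothetical, and the data is constructed so that one of five glucose targets is lost.

```python
def targets_of(links, source, suffix):
    """Set of targets that `source` regulates on links whose tag ends with `suffix`."""
    return {tgt for src, tag, tgt in links if src == source and tag.endswith(suffix)}


def fraction_lost(links, source, base="-g", other="-o"):
    """Fraction of `source`'s targets under `base` that are absent under `other`."""
    base_targets = targets_of(links, source, base)
    if not base_targets:
        return 0.0
    lost = base_targets - targets_of(links, source, other)
    return len(lost) / len(base_targets)


# Hypothetical toy data: T5 is an O target in glucose only, T6 in oleate only.
links = [("Oaf1p", "pd-g", t) for t in ("T1", "T2", "T3", "T4", "T5")]
links += [("Oaf1p", "pd-o", t) for t in ("T1", "T2", "T3", "T4", "T6")]

print(fraction_lost(links, "Oaf1p"))  # one of five glucose targets is lost
```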

But is the arrangement of node rows that I chose above really the best one for doing these comparisons? I don't think so, and we have complete freedom to arrange the node rows in whatever way works best. The node row ordering I used above simply matched the one I introduced in my last posting. You can see that pattern, starting with the leftmost edge wedge (O-glucose), which has two bands of rows: the band of nodes with edges from O, sitting above the band of nodes without edges from O. That second band of no-edge nodes might require a little bit of imagination to spot, since it's just the empty space below the wedge, but there it is if you think about it a bit! For shorthand, I'll call this banding arrangement (1,0). Then, the next wedge to the right (O-oleate) has four bands (1,0,1,0), the third (Y-glucose) has eight bands (1,0,1,0,1,0,1,0), and so forth. With this scheme, the rightmost (A-oleate) wedge is the most fragmented, with 256 possible bands (1,0,1,...,0) though there are fewer than that because not all possible combinations are present. Another way to view this arrangement is like a car odometer: the rightmost column is always changing, while the leftmost column almost never changes.

So let's try different node row orders. First, compare the following arrangement with the first. Here, we make the four glucose wedges the most coherent, with the fewest bands, and the oleate wedges are more fragmented:

BioFabric network visualization of combined glucose oleate network
Click on picture to enlarge 

Compare this to the glucose-only network I presented in my introductory post, which I have reproduced here:
BioFabric network visualization of glucose network
Click on picture to enlarge
See how the original pattern of the edge wedges is retained? The glucose-only version reappears with this arrangement, it's just interspersed with the oleate edge wedges.

So I like to view the above arrangement as being "glucose condition centric". You can think of it as perhaps the best organization to use if you want to view and think about the changes between the two conditions where the first, glucose condition, serves as the starting point, or baseline.

But perhaps you want to view the two conditions the other way around, where the oleate condition edge wedges are the most coherent:

BioFabric network visualization of combined glucose oleate network
Click on picture to enlarge
Comparing that version to the oleate-only network, which I am again showing here, you can see how the original oleate edge wedges are the ones to retain their shapes in the combined version shown above:

BioFabric network visualization of oleate network
Click on image to enlarge
So again, this version of the combined network is perhaps the best organization to use if you want to understand the changes across the conditions with the oleate condition serving as the baseline.

It's important to remember that the above visual changes are being made on exactly the same network file, just with different node attribute files used to lay out the network with different node row orders. Furthermore, those differences are created simply by changing the sorting order used, specifying which edge wedges vary the fastest versus the slowest.
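To make the "fastest versus slowest" idea concrete, here is a small Python sketch of one way to express the three orderings as a choice of bit significance. It is an illustration under my own assumptions: the names INTERLEAVED, GLUCOSE_FIRST, OLEATE_FIRST, sort_key and order_targets are hypothetical, and each target's inputs are modeled as a set of (factor, condition) pairs.

```python
FACTORS = ["O", "Y", "P", "A"]

# Each list names the eight wedge columns from most to least significant bit.
INTERLEAVED = [(f, c) for f in FACTORS for c in ("g", "o")]  # odometer order
GLUCOSE_FIRST = [(f, "g") for f in FACTORS] + [(f, "o") for f in FACTORS]
OLEATE_FIRST = [(f, "o") for f in FACTORS] + [(f, "g") for f in FACTORS]


def sort_key(inputs, column_order):
    """Binary number for a set of (factor, condition) inputs.

    The first column in `column_order` becomes the most significant bit,
    so it changes the slowest down the sorted rows.
    """
    value = 0
    for column in column_order:
        value = (value << 1) | (column in inputs)
    return value


def order_targets(target_inputs, column_order):
    """Target names by decreasing key; ties are broken alphabetically."""
    return sorted(
        target_inputs,
        key=lambda name: (-sort_key(target_inputs[name], column_order), name),
    )


# Hypothetical toy targets.
target_inputs = {
    "T1": {("A", "g"), ("O", "o")},
    "T2": {("O", "g")},
    "T3": {("A", "g"), ("A", "o")},
}
glucose_rows = order_targets(target_inputs, GLUCOSE_FIRST)
oleate_rows = order_targets(target_inputs, OLEATE_FIRST)
```

Only the column list changes between layouts; the network data and the sort itself stay the same.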

To wrap things up, let's look at an example of how we can use the glucose baseline version to visualize the network changes going from glucose to oleate. Consider the set of nodes that only have inputs from Adr1p (A) in glucose; the thick circle in the following figure highlights that group of nodes. In the oleate condition, most of these nodes now become targets of O, P and/or Y as well, in various combinations. The other four thin circles highlight where to look to see these changes:

BioFabric network visualization of combined glucose oleate network
Click on image to enlarge
So, for example, about half of these glucose A-only targets become O targets as well in oleate; look at the leftmost red circle to see this. And though it is challenging with the limited resolution of these images, we can also spot two targets that go from A-only in glucose to having all four inputs in oleate. Given the node ordering, they are the two uppermost nodes in the band. You can pick them out at the very top of the P-target set (the third red highlight circle from the left). 

So if you have two (or more) networks you want to compare, combine them all into one while using unique link suffix tags to tell them apart. Then use the link group feature to represent each network as separate edge wedges. Finally, change the node row ordering as needed, using the node attribute layout feature, to visualize your data from different perspectives.

Sunday, November 24, 2013

Everybody Wants to be a Node!


My apologies for another long stretch of no postings this fall! First, I was helping to teach the Gene Regulatory Networks in Development course at the MBL in Woods Hole, MA during a good portion of October. Then I got very busy taking a class through Coursera for the last one and a half months, and that ate up my evenings. So the blog fell behind. But I'll now be back at it again, and anticipate that my next post will follow up with the second installment of my last post, which is talking about using link groups to visualize the differences in a network under different experimental conditions.

But before that, I have an example of a BioFabric network in action. Last month was Leroy Hood's 75th birthday celebration, held at the Institute for Systems Biology. As part of the celebration, we assembled some visualizations of Lee's "influence network". One of these networks was based on information from a questionnaire that was sent to Lee's colleagues, and was depicted as a 10 foot long BioFabric network posted on the wall. The pictures here, courtesy of ISB Senior Research Scientist Gustavo Glusman, were taken during the set-up for the party:

Photo by Gustavo Glusman

The network had 330 nodes and 1400 edges; there were nodes for people, places, and research interests. Since the node lines for the people Lee knew were organized in chronological order from when they met him, the viewer could easily spot Lee's professional development, his evolving research interests, and his Caltech to UW to ISB path over the last 40+ years. What was interesting is that people would walk up to the giant poster, find their own node line, and trace their finger along their node to see their associations:

Photo by Gustavo Glusman

Which is exactly what I had hoped they would do, and that's why I think that BioFabric not only enables, but actively invites exploration of very large networks. You can start by seeing the whole structure at once, and subsequently drilling down to the smallest detail does not require you to prune away anything before you can clearly see any relationship you want. Just trace across a node to see how it fits into the whole picture. Let your fingers do the walking! (Does anybody under the age of 30 even know what that means anymore?)


Friday, October 4, 2013

How do I View Yeast? Let me Count the Ways

In my previous posts on the Caltech dorm network, I used link groups to separate intra- and inter- dorm links into distinct sets. I now want to shift away from social networks for a bit and look at biological ones. My goal will be to show how link groups can also be used to compare multiple experimental results in a single network view. That will take a couple posts; this first one will set the stage, and the follow-on will show how link groups can be applied. To do this, I will use network data from this paper from the Institute for Systems Biology:
Smith, J.J., Ramsey, S.A., Marelli, M., Marzolf, B., Hwang, D., Saleem, R.A., Rachubinski, R.A., and Aitchison, J.D. Transcriptional responses to fatty acid are coordinated by combinatorial control. Molecular Systems Biology, 2007, 3:115
The experimental data in the paper consists of two networks, which detail the targets of four different transcriptional regulators: Oaf1p, Pip2p, Oaf3p and Adr1p. They are labeled as O, P, Y, and A respectively. One of the networks is obtained under a yeast growth condition with low (0.1%) glucose, and the other is from a time point five hours after the sole carbon source has been switched to oleate (a fatty acid).

The original networks (Figure 1) show how there are relatively few targets that are under combined control in the glucose condition, and more complex control in the oleate condition. Go take a look at those networks, and then have a look at the BioFabric versions of these two networks. First, the glucose condition:


BioFabric Network Visualization of The Targets of Four Yeast Regulators
Click on picture for larger version
Note how the node ordering of the BioFabric network has been set so that the regulators appear in the top four node rows. (In both these examples, the node ordering was specified in a file using the Layout Using Node Attributes function.) All the target nodes are then arranged in the rows below these four regulators, in a very specific order, so that target nodes with the same input combinations will appear together in distinct, contiguous horizontal bands. You can think of it this way: with four inputs, there are 16 distinct input combinations (2^4). But since we are not showing any targets that are not regulated at all by these four, the 0000 state is omitted, leaving 15.

The 15 different input combinations can be represented by binary numbers, with a 1 indicating a target is regulated by one of the four inputs, 0 if not. With this scheme, we can sort the nodes using this number. Nodes with all four inputs are assigned binary number 1111 (= 15), and these are assigned to the top node rows of the fabric. At the bottom of the fabric, we assign the targets that are only regulated by A with the binary number 0001 (= 1). (When nodes have the same inputs, we sort them alphabetically by name.) Symbolically, the stack of sorted binary numbers looks like this:

OYPA
1111
1110
1101
1100
1011
1010
1001
1000
0111
0110
0101
0100
0011
0010
0001


If you now compare this pattern produced by a decreasing sort of the binary numbers 15 down to 1 with the above BioFabric glucose network, you will see the same pattern. You can check and see that there are no targets with all four inputs (it would be in the top row just under the four regulators). It is also obvious that the vast majority of targets have only one input, with O (Oaf1p) being the clear winner. You can also spot pretty quickly that there are only four targets with three inputs, just like in the original network diagram in the paper.
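For readers who want to reproduce the ordering, here is a minimal Python sketch of the sort described above. The function names and the gene names are hypothetical; the OYPA column order, the decreasing sort, and the alphabetical tie-break are taken from the text.

```python
REGULATORS = "OYPA"  # column order of the stack above: O is the leftmost bit


def input_code(inputs):
    """4-bit number for a set of regulator letters, e.g. {"O", "A"} -> 0b1001."""
    width = len(REGULATORS)
    return sum(1 << (width - 1 - REGULATORS.index(r)) for r in inputs)


def row_order(target_inputs):
    """Targets sorted by decreasing input code, then alphabetically by name."""
    return sorted(target_inputs, key=lambda t: (-input_code(target_inputs[t]), t))


# The symbolic stack: 15 down to 1 as four binary digits.
stack = [format(n, "04b") for n in range(15, 0, -1)]

# Hypothetical toy targets.
targets = {"geneB": {"A"}, "geneA": {"A"}, "geneC": {"O", "P"}, "geneD": {"O"}}
print(row_order(targets))
```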

The next picture is the BioFabric version of the network under the oleate condition:

BioFabric Network Visualization of The Targets of Four Yeast Regulators
Click on picture for larger version


This diagram uses the same ordering scheme as the first. You can clearly see from the top node rows that there are now many targets with all four inputs (you can count 28). In fact, now all 15 of the different input states are represented. Finally, not only are there many more targets compared to the glucose condition, but the fraction of targets under the control of more than one regulator has increased.

So that's an introduction to the data and to the basic approach I'm using for ordering the node rows. In the next post, I'll discuss how to use link groups to take these two separate networks and combine them into one.

Tuesday, September 24, 2013

With no Direction Home... Like a Complete Unknown

BioFabric does not like it when the direction home is a complete unknown! (Or something like that; apologies to Bob Dylan).

This posting will be a quick side trip into an undocumented BioFabric feature that can be useful. Whenever you import a SIF file, each link needs to be tagged with a relationship identifier, per the SIF format:

node1_ID [tab] linkTag [tab] node2_ID

BioFabric displays that link tag whenever you mouse over the link, as well as in the Network Magnifier and Network Tour displays. It also insists on knowing whether the relationship indicated by the tag is directed or not. So after a SIF file has been read in, you are confronted with a dialog box that insists that you identify whether each link tag identifies a directed or an undirected edge in the graph. For example, for this tiny little SIF file: 

foo UNDIR bar
foo DIR   baz

You are presented with this dialog box when you import it:

BioFabric Specify Directional Relationships Dialog
Click on picture to enlarge

In this case, since we wish the link tagged DIR to be directed, we would check the box on the right side of the row labeled DIR and then hit the OK button to finish the import. When the number of link tags is small, it's not too onerous, and the benefit is that you can create a graph with an explicit mixture of directed and undirected edges.

However, things can start to get painful when the number of link tags starts to grow. The worst case is when you are tagging links with real numbers with a large number of significant digits, since the table in the above dialog will create a row for each one of those values. For that reason, it is best to truncate real-valued link tags to <= 2 digits to keep this from getting out of hand.
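A preprocessing pass along these lines could do the trimming before import. This is a sketch under two assumptions of mine: "2 digits" is read as two decimal places, and values are rounded rather than chopped. The function names are hypothetical.

```python
def round_tag(tag, digits=2):
    """Round a numeric link tag to `digits` decimal places; leave other tags alone."""
    try:
        return f"{float(tag):.{digits}f}"
    except ValueError:
        return tag  # not a number, e.g. "DIR"


def round_sif_tags(sif_text, digits=2):
    """Apply round_tag to the middle column of every three-column SIF line."""
    out = []
    for line in sif_text.splitlines():
        fields = line.split("\t")
        if len(fields) == 3:
            fields[1] = round_tag(fields[1], digits)
        out.append("\t".join(fields))
    return "\n".join(out)


# Two distinct tags collapse into one, so the dialog needs one row, not two.
print(round_sif_tags("a\t0.98765\tb\nc\t0.98712\td"))
```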

If you do have a lot of tags to deal with, you will note there are two buttons on the lower left that give you useful shortcuts. You can make every link either directed or undirected by using those buttons. But what do you do if there is a mixture?

On that count, there is good news and bad news. I'll give the good news first: the Load From File... button allows you to specify whether link tags represent directed or undirected edges using an input file. Unfortunately, it has not actually been documented anywhere what the file format is... until now! It needs to be an attribute file that has a format similar to the node attribute file used to specify node layout order. The file suffix can be whatever you want, but the file chooser dialog will highlight files with an .rda suffix. (I guess I was thinking that it would stand for relation directed attribute?) The file contents have a required column header line, followed by one and only one row for each and every link tag. Provide true for directed links, and false for undirected links. An example for the given SIF file is:


Relation Directed
UNDIR = false
DIR = true

So, if you load in the above sample file, your SIF import will look like the picture below, with BioFabric using an arrowhead to show the direction of an edge.  But, at the moment, that's all it does to treat directed edges in a special fashion. Most notably, the layout algorithms completely ignore directionality in the current version.

A very tiny BioFabric network visualization
Click on picture to enlarge (but why would you?)
That was the good news. The bad news is that there is a bug in the implementation. If you note, I wrote the two link tags out in all uppercase in this example. That's because the Load From File... option is stuck at only recognizing all uppercase tags. If your SIF file is using lower- or mixed-case tags, the program will complain and reject the .rda file. That bug has just been added to the GitHub BioFabric Issues Page! That's the beauty of open source: you got no secrets to conceal (OK, more apologies to Bob Dylan).
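When there are many tags, the .rda file is easier to generate than to type. The Python sketch below builds the file contents in the format shown in the example above; the function name is hypothetical, the exact whitespace in the header line is copied from that example rather than from any specification, and the uppercase check simply mirrors the bug just described.

```python
def rda_text(directed_by_tag):
    """Build .rda contents: a header line, then one `TAG = true|false` row per tag."""
    lines = ["Relation Directed"]
    for tag, directed in directed_by_tag.items():
        if tag != tag.upper():
            # Mixed- or lower-case tags are reportedly rejected on load.
            raise ValueError(f"link tag {tag!r} is not all uppercase")
        lines.append(f"{tag} = {'true' if directed else 'false'}")
    return "\n".join(lines) + "\n"


# Reproduces the sample file for the tiny two-link SIF example.
print(rda_text({"UNDIR": False, "DIR": True}), end="")
```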


Friday, September 20, 2013

Using Heads and Tails to Make Heads or Tails of Caltech Houses



I have previously introduced and discussed my BioFabric version of the Caltech Houses (i.e. dorms) network that I based on the data from Traud et al. 2011, and I am going to talk about it a little further here. If you want to view the whole network, and you don't mind the 3.8 MB download, take a look at the scrollable version.

In this post I will discuss the structure of the network within one of the dorms. Again, as with previous posts, I'm just going to describe what features I am seeing by eyeballing the BioFabric visualization. I'm not going to back up these claims by using network tools to analyze the structure; I'll leave that as an exercise for the reader. My goal here is to help you to build up your visual "network fabric intuition".

Recall that this Caltech dorm network was drawn by grouping the students using the provided information about which dorm each student lives in. So there are eight horizontal bands in the fabric that correspond to these eight dorms. Each dorm was separately laid out using the default BioFabric layout algorithm on just the intra-dorm links before they were combined into the full network. This means that the head (left end) of each dorm starts with the most popular student in that dorm (considering the dorm in isolation), followed immediately by that most popular student's Facebook friends. Meanwhile, the tail (right end) of the dorm tends to show students with fewer friends in the dorm and/or more indirect connections to the most popular student. Keep in mind that this tendency towards low-degree students in the tail is broken by those students in the tail who do have many in-dorm friends, but who are several degrees of separation away from the "in crowd" at the head. That's the consequence of the breadth-first search used by the default layout.
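To see why a well-connected student can still land in the tail, here is a rough Python sketch of a breadth-first node ordering. It is not BioFabric's actual code: the function name is hypothetical, and visiting friends in order of decreasing degree (ties broken by name) is an assumption made for the illustration.

```python
from collections import deque


def bfs_node_order(edges):
    """Order nodes breadth-first, starting from the highest-degree node."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    def by_popularity(nodes):
        # Highest degree first; ties broken by name so the result is repeatable.
        return sorted(nodes, key=lambda n: (-len(adjacency[n]), str(n)))

    order, seen = [], set()
    for start in by_popularity(adjacency):  # also covers disconnected pieces
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for friend in by_popularity(adjacency[node] - seen):
                seen.add(friend)
                queue.append(friend)
    return order


# Hypothetical toy dorm: "far1" has three friends but sits far from "hub".
edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "e"),
    ("b", "c"), ("b", "d"), ("a", "far1"),
    ("far1", "far2"), ("far1", "far3"), ("far2", "far3"),
]
print(bfs_node_order(edges))
```

In this toy case "far1" ties for the second-highest degree, yet it is placed after every student within two hops of "hub", which is the tail effect described above.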

Note, by the way, that the presentation I am using here is different from the one considered in the original paper, which showed how well the dorm assignment corresponded to the clustering they detected in the Caltech network via clustering algorithms. Instead, my approach here is to look at what we may be able to observe about the network given that it has been explicitly grouped using those dorm assignments; this presentation provides no insights into larger social groupings that cross dorm boundaries. 

Below are two figures showing portions of Dorm 4, which appears to be a pretty typical example of the dorms. Though you can pick out each dorm easily enough as you scan the network just by following the shape of the diagonal, I have added the red horizontal lines in these figures to clearly show the extent of Dorm 4 in these extracted segments. The first figure shows the head end, i.e. the popular students:


BioFabric Version of Caltech Social Network: Head End
Click on picture to enlarge

So, the first thing to notice is that 623, on the far left (and the most popular student in the dorm), is pretty well connected within the dorm. His/her in-dorm edge wedge covers about 75% of the students in the dorm. (Remember, due to link grouping, the in-dorm edge wedge appears to the left of the out-of-dorm edge wedge for each student. Furthermore, the out-of-dorm wedge appears as two wedges here, since it is split into above-node and below-node pieces.) Then, 623's most popular friends do a pretty good job of matching 623's friends in the dorm, since we can see that their in-dorm wedges very roughly approximate 623's. Those friends also do a good job of bringing in some more students, such that by the time we get to the ninth student on the right side of the head, it appears that well over 80% of the students in the dorm have been linked to.

It's also interesting to note that 623, while the most popular student in the dorm, has a majority (maybe greater than 67%?) of his/her friends outside the dorm. It appears that 633, the next in line, is almost as popular as 623 in-dorm, but is also much more inwardly focused on Dorm 4! 

Now look at the tail end of Dorm 4, again with the red lines to show the extent of the dorm:

BioFabric Version of Caltech Social Network: Tail End
Click on picture to enlarge
Note first that this short tail stretch actually shows the links for just over 50% of the students in the dorm (241, on the left, is not quite below the half-way point between the red lines). We can also see here what is going on with the 10% or so of the students who are not directly connected to the popular core; they start around the prominently labeled student 734 near the far right. Even at this scale, you can spot sort of a "phase change" in the edge wedge pattern as we get just to the left of 734: the in-dorm edge wedges stop connecting to the popular students at the top of the Dorm 4 band. I'll discuss this group a little more below.

But turning back to the "typical" tail-end students in Dorm 4, we see that they are all connected to that central core at the head of the dorm, since they have links going to the top of the dorm band. Perhaps not surprisingly, most of the tail students are connected to social groups centered on the top core 50%, but not so much amongst each other. We can see that because the in-dorm edge wedges here typically show few edges below the diagonal (226 and 606 are notable exceptions). Another feature worth noting is that even these students with relatively few in-dorm connections almost all have at least a few out-of-dorm connections as well.

Finally, the isolated tail group starting around 734 is shown below in detail as an extracted submodel, again with red lines to indicate the bounds of Dorm 4. Note how 734, 722, 728, 738, and (to a lesser extent) 744 form a somewhat cohesive social unit, with many common social connections focused outside of the dorm:
BioFabric Version of Caltech Social Network: Detail
Click on picture to enlarge
So that's my attempt to make "heads or tails" of one of the dorms in the network just by visual inspection of the fabric. I expect one more posting on this network: stay tuned!

Saturday, September 7, 2013

New Kids in Town?


My attempts to keep the blog fresh and current this summer? EPIC FAIL! But I have a backlog of post topics, so I hope that this post marks the end of the dry spell.

In my last post, so very long ago, I introduced my BioFabric version of the Caltech Dorms Facebook Network, where nodes are students, edges are Facebook friend relationships, and the students have been grouped by dorm. Additionally, the edges for each student are grouped into two separate edge wedges: the first (left) one is for friend connections within a dorm, and the second (right) one is for connections between dorms. For a better view of that network, go to the scrollable version, but be warned that it is kinda big: 3.8 MB. 

There are a few interesting things in that network, so I'll be spending a couple of blog posts covering them. The first one is pretty simple, and you can spot it easily while scrolling across the network. It's at the tail end of the Dorm 5 cluster, and it shows up in the following figure. The figure shows two separate pieces of the network, divided by the vertical blue line, that are aligned so the nodes match up:

BioFabric Caltech Dorm Network Example
Click on picture to enlarge
Take a look at circled students 144 and 85, who are in Dorm 5; they are in the right half of the figure. What's interesting about them is that they look more like members of Dorm 4 than Dorm 5. For comparison, the left half of the figure shows some Dorm 4 students, and the horizontal red lines show the extent of the Dorm 4 cluster. Clearly, 144 and 85 have most of their friends in Dorm 4. And as the following detail shows, they don't know too many people at all in Dorm 5 (and they do know each other):

BioFabric Caltech Dorm Network Detail
Click on picture to enlarge

So perhaps we can hazard a guess that 144 and 85 are recent arrivals to Dorm 5, both coming from Dorm 4?
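A quick tally of friends per dorm is one way to back up that kind of visual hunch. The Python sketch below is illustrative only: the function name is hypothetical, and apart from the IDs 144 and 85, the student numbers, dorm assignments, and edges are made-up toy data, not values from the Caltech dataset.

```python
from collections import Counter


def friends_by_dorm(student, edges, dorm_of):
    """Count a student's friends in each dorm from undirected (a, b) edges."""
    counts = Counter()
    for a, b in edges:
        if a == student:
            counts[dorm_of[b]] += 1
        elif b == student:
            counts[dorm_of[a]] += 1
    return counts


# Toy data: students 10, 11, 12 and 20 and all edges are invented.
dorm_of = {144: 5, 85: 5, 20: 5, 10: 4, 11: 4, 12: 4}
edges = [(144, 85), (144, 10), (144, 11), (144, 12), (85, 10), (85, 11), (20, 85)]

print(friends_by_dorm(144, edges, dorm_of))
```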

Of course, this sort of visual analysis can also be done using adjacency matrices that have been ordered to show the dorm groups on the diagonal. However, I will argue that the visual cues provided by the two-dimensional edge wedges of BioFabric make such students stand out better than a one-dimensional column of the matrix does. This is particularly true when the resolution of the adjacency matrix falls below the threshold of one pixel per student, as we would expect in larger networks. Furthermore, at such resolutions, I think it would be very difficult to spot that a single column has a set of pixels in one set of rows (Dorm 4) while simultaneously missing pixels in another set of rows (Dorm 5).

Now let's see if I can manage to get my rate of blog posts back up to speed...

Sunday, July 14, 2013

I Guess Caltech Students Do Have Social Connections

OK, just kidding. But this post involves Caltech dorms, and I feel I have to take part in some old-fashioned school rivalry. There is, in fact, only one college dorm worth talking about.

Anyway, a few years ago, a paper came out that was studying the structure of Facebook social networks on some college campuses:
Traud, A., Kelsic, E., Mucha, P., and Porter, M. Comparing Community Structure to Characteristics in Online Collegiate Social Networks. SIAM Review, 2011, Vol. 53, No. 3: pp. 526-543
In 2011, this Facebook dataset was used in a visualization competition that Conrad Lee described in a post on his blog Sociograph. You can see some of the results in the post; perhaps not surprisingly, the visualizations were hitting the hairball ceiling.

In another posting on Sociograph in late 2012, Conrad used the Caltech portion of that Facebook dataset to illustrate how to visualize adjacency matrices in Python. With that dataset, I've visualized the Caltech Facebook network using BioFabric. With 769 nodes and 16656* edges (I originally wrote 33312; see the correction at the end of this post), the network has a very high aspect ratio (43:1). I have a 15,000 pixel-wide version of the network that you can scroll back and forth through here on my blog, and in my initial iteration of this post, I embedded the file directly on this page. But at about 3.8 MB, it's a little hefty to have to download when you first visit the blog. So I've included just a detail snapshot below, and you should go to the special scroll page to view it now:


Caltech Facebook Network Detail
Click on this caption to view the 15,000 pixel-wide version


(Note the students' names in the data were anonymized to numbers.)

Since this network needed to be preprocessed a bit to get the final layout, and since it uses a feature I have not yet talked about (link groups), I'll spend this post talking a little bit about how I built it.

In addition to providing the links, the data set also indicates the House affiliation (i.e. dorm) of each student (there are eight dorms), and this turns out to be an important aspect of this network. So let's use that data and have BioFabric show the clusters. As I have pointed out before, automatically doing clustered layouts is not yet a built-in BioFabric 1.0 feature (though I am working on it!), so some basic scripting is needed. I'm not going to get into the low-level detail of showing the scripts, but will just give a high-level description of the steps involved.

First, using the dorm assignments, we identify which edges are in-dorm (Facebook friends in the same dorm), and which edges are between-dorm (Facebook friends in different dorms). Then, using just the in-dorm links, we create eight separate SIF files of in-dorm links, one per dorm. Separately loading each into BioFabric, we can get eight per-dorm BioFabric default layouts (i.e. we are going to use BioFabric to handle the default layout step, instead of scripting it as well). The resulting node orders, which we will use to create a single global ordering file, can be simply exported; just select File->Export->Export Node Order:


BioFabric Export Node Order
Click on image to enlarge



(As a side note, the Export Link Order option in that menu is the best route to seeing how to create the edge attribute files you need to explicitly lay out edges.)
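The post does not show the scripts used for the edge-splitting step, so here is only a minimal sketch of how the per-dorm SIF files might be produced. The input file layouts (`dorms.txt`, `edges.txt`), the output file names, and the relation name `fb` are all assumptions made for illustration; they are not taken from the original data set.

```shell
# Sketch only. Assumed (hypothetical) input files:
#   dorms.txt : "<student> <dorm>" on each line
#   edges.txt : "<studentA> <studentB>" on each line
# Writes <outdir>/dorm_<dorm>.sif containing only the in-dorm links.
split_in_dorm_links() {
  dorms_file=$1; edges_file=$2; outdir=$3
  awk -v outdir="$outdir" '
    NR == FNR { dorm[$1] = $2; next }   # first file: remember each dorm
    ($1 in dorm) && ($2 in dorm) && dorm[$1] == dorm[$2] {
      out = outdir "/dorm_" dorm[$1] ".sif"
      print $1 "\tfb\t" $2 > out        # "fb" is a made-up relation name
    }
  ' "$dorms_file" "$edges_file"
}
```

Calling `split_in_dorm_links dorms.txt edges.txt .` would then leave one SIF file per dorm in the current directory, each ready to be loaded into BioFabric on its own.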

Since we want to order the dorms from biggest to smallest, number the eight dorm node order files in that fashion, e.g. dorm1.noa (biggest) to dorm8.noa (smallest). You'll also need to chop off the first line of each of these files, using e.g.:

tail -n +2 < dorm1.noa > dormr1.noa

Then, to create the single global node ordering file, just do this on the Unix command line:

cat dormr*.noa | awk '{print $1 " = " NR-1}' | sed '1 i Node Row' > globalOrdering.noa
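The one-line `sed '1 i Node Row'` form is, as far as I know, a GNU sed extension and may fail on BSD or macOS sed. As a hedged alternative, the sketch below does the header stripping, renumbering, and header insertion in a single awk pass. The function name `build_global_order` is made up here, and the sketch assumes each exported file starts with exactly one header line and that node names contain no spaces.

```shell
# Sketch only: combine exported node order files into one global ordering.
# Reads the ORIGINAL exported files (header line still present), in the
# order given on the command line, and writes the result to stdout.
build_global_order() {
  awk '
    BEGIN { print "Node Row"; row = 0 }  # single header for the combined file
    FNR == 1 { next }                    # skip the header line of each input
    { print $1 " = " row; row++ }        # renumber rows consecutively from 0
  ' "$@"
}
```

For example, `build_global_order dorm1.noa dorm2.noa dorm3.noa dorm4.noa dorm5.noa dorm6.noa dorm7.noa dorm8.noa > globalOrdering.noa` would write the same kind of file as the pipeline above, without the separate `tail` step.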


That takes care of specifying the node ordering we will need. At the same time, we want to create the single full-network SIF file where each link is tagged with a suffix indicating whether it is in-dorm (tagged -ic, for in-cluster), or between-dorm (tagged -bc, for between-cluster). We were figuring that out above when we created the eight separate dorm-only networks, so use that same information to tag the links when writing out the final single SIF input file.
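As a rough sketch of the tagging step (again, not the script actually used for this post), the following assumes the same hypothetical `dorms.txt` and `edges.txt` layouts of one "student dorm" pair and one "studentA studentB" pair per line, and a made-up base relation name `fb`. A link with an endpoint whose dorm is unknown is treated as between-dorm here, which is a choice of this sketch.

```shell
# Sketch only: write the full-network SIF with -ic / -bc relation suffixes.
# Usage: tag_links dorms.txt edges.txt > caltechTagged.sif
tag_links() {
  awk '
    NR == FNR { dorm[$1] = $2; next }   # first file: student -> dorm
    {
      same = ($1 in dorm) && ($2 in dorm) && dorm[$1] == dorm[$2]
      tag = same ? "-ic" : "-bc"        # in-cluster vs. between-cluster
      print $1 "\tfb" tag "\t" $2       # "fb" is a made-up relation name
    }
  ' "$1" "$2"
}
```

The suffixes are what matter for the link grouping later on; the part of the relation name in front of them can be anything.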

Then, import the global SIF file, and after the network is loaded, re-layout the whole network by specifying node order. Just select Layout->Layout Using Node Attributes..., use the globalOrdering.noa file you generated, and the network now has the eight separate dorms broken out.

When a network gets long and thin like this, I'm quick to turn on shadow links to get a better idea of what's going on. Just select Edit->Set Display Options... and check the Display Shadow Links box. At the same time, I like to shade the node zones, so also check Node Zone Shading before clicking OK. This allows you to see all the Facebook connections for a student by just looking at the node zone for that student.

There is now one more step. As it currently sits, each student has a single edge wedge for all his/her Facebook friends.  The tiny subnetwork of three students shown below illustrates that. Although the links going to the node lines right above and below these students correspond to the in-dorm links, that distinction is completely hidden:


BioFabric Submodel of Caltech Network: No Link Groups
Click on image to enlarge

So we want to separate the links into the in-dorm (-ic) and between-dorm (-bc) groups so we can see separate edge wedges for these two sets. Since we tagged the links in the SIF input with suffixes, we can easily use that information to create the two distinct edge wedges. Just go to Layout->Specify Link Groups...:


BioFabric Specify Link Groups
Click on image to enlarge



In the dialog, click Add New Entry... twice and enter in the two groups, -ic and -bc:



BioFabric Specify Link Groups Dialog
Click on image to enlarge




Click OK, and the network is laid out. Because of the link grouping, we can now easily visualize the two -ic and -bc classes of links for each student. Compare this version below with the one above. The first diagonal for each student consists of the in-dorm -ic links, as that was the first link group we specified. As expected, those links end at the node rows near these students, i.e. the other nodes in the same dorm. The following single edge wedge of -bc links over on the right side of each node zone tends to look like two separate wedges, since they are connecting to both the dorms above and below this dorm:

BioFabric Submodel of Caltech Network: With Link Groups
Click on image to enlarge
Note how the grouping lets us instantly see which students are mostly in-dorm focused with their Facebook connections (e.g. 590), and which have more connections outside the dorm (e.g. 20).
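The same in-dorm versus between-dorm balance can be tabulated from a tagged SIF file as a cross-check on what the picture shows. This is only a sketch: it assumes a three-column SIF whose relation names end in -ic or -bc, with each undirected link listed once, and the function name `in_dorm_share` is made up.

```shell
# Sketch only: per-student share of in-dorm links from a tagged SIF file.
# Output columns: student, in-dorm links, total links, in-dorm fraction,
# sorted so the most in-dorm-focused students come first.
in_dorm_share() {
  awk '
    {
      total[$1]++; total[$3]++           # each link counts for both endpoints
      if ($2 ~ /-ic$/) { inside[$1]++; inside[$3]++ }
    }
    END {
      for (s in total)
        printf "%s\t%d\t%d\t%.2f\n", s, inside[s], total[s], inside[s] / total[s]
    }
  ' "$1" | sort -k4,4nr
}
```

Students near the top of that list should be the ones whose first edge wedge dominates their node zone.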

That's it for details on building the network. So go back and have a look at the whole network in the 15,000 pixel-wide version up at the top of this post. You can see the eight separate runs of dorms, compare the two different types of connections, and get an idea of how the students interact.

My next post or two will cover a couple of interesting aspects of this network, but it will be a while, as I'll be on the road next week to the ISMB/ECCB 2013 conference. If you happen to be there, come say hi at my BioFabric Birds of a Feather session on Monday, July 26th!

Correction: The original edge number of 33312 that I gave did not account for the equivalent reverse edges in the SIF file of the undirected graph getting thrown out on import to BioFabric. The view does actually show 33312 edges, though, since shadow links are turned on.
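A quick way to sanity-check an edge count like this before importing is to count each undirected link once, treating "A rel B" and "B rel A" as the same edge. The sketch below is an assumption about what counts as an equivalent reverse edge, not a description of BioFabric's import code, and the function name is made up.

```shell
# Sketch only: count undirected edges in a three-column SIF file,
# treating "A rel B" and "B rel A" as one edge.
count_undirected_edges() {
  awk '
    {
      # Put the two endpoints in a consistent order to build the key
      if ($1 < $3) { a = $1; b = $3 } else { a = $3; b = $1 }
      key = a SUBSEP $2 SUBSEP b
      if (!(key in seen)) { seen[key] = 1; n++ }
    }
    END { print n + 0 }
  ' "$1"
}
```

On a SIF file that lists every friendship in both directions, the result should be half the number of lines in the file.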