Combing the Hairball: September 2013

Tuesday, September 24, 2013

With no Direction Home... Like a Complete Unknown

BioFabric does not like it when the direction home is a complete unknown! (Or something like that; apologies to Bob Dylan).

This posting will be a quick side trip into an undocumented BioFabric feature that can be useful. Whenever you import a SIF file, each link needs to be tagged with a relationship identifier, per the SIF format:

node1_ID [tab] linkTag [tab] node2_ID

BioFabric displays that link tag whenever you mouse over the link, as well as in the Network Magnifier and Network Tour displays. It also insists on knowing whether the relationship indicated by the tag is directed or not. So after a SIF file has been read in, you are confronted with a dialog box that asks you to say whether each link tag identifies a directed or an undirected edge in the graph. For example, for this tiny little SIF file:

foo UNDIR bar

foo DIR baz

You are presented with this dialog box when you import it:

BioFabric Specify Directional Relationships Dialog

Click on picture to enlarge

In this case, since we wish the link tagged DIR to be directed, we would check the box on the right side of the row labeled DIR and then hit the OK button to finish the import. When the number of link tags is small, it's not too onerous, and the benefit is that you can create a graph with an explicit mixture of directed and undirected edges.

However, things can start to get painful when the number of link tags starts to grow. The worst case is when you are tagging links with real numbers that have a large number of significant digits, since the table in the above dialog will create a row for each distinct value. For that reason, it is best to truncate real-valued link tags to two or fewer digits to keep this from getting out of hand.
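As a minimal sketch of that truncation step, the Python below rewrites the link-tag column of a tab-delimited SIF file so that numeric tags keep only two decimal places. The function name `round_sif_tags` and the sample values are hypothetical, and reading "two digits" as two decimal places (with rounding rather than strict truncation) is an assumption on my part:

```python
def round_sif_tags(lines, digits=2):
    """Return SIF lines with numeric link tags rounded to `digits` decimals.

    Assumes the three-column, tab-delimited form:
        node1_ID <tab> linkTag <tab> node2_ID
    Non-numeric tags and lines with another shape pass through unchanged.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            try:
                fields[1] = f"{float(fields[1]):.{digits}f}"
            except ValueError:
                pass  # tag is not a number, so leave it alone
        out.append("\t".join(fields))
    return out


# Made-up sample data: two real-valued tags and one ordinary tag.
sample = ["a\t0.123456\tb", "a\t0.129\tc", "b\tbinds\tc"]
rounded = round_sif_tags(sample)
distinct_tags = sorted({line.split("\t")[1] for line in rounded})
print(distinct_tags)
```

After a pass like this, the dialog would need one row per distinct rounded value instead of one row per raw value.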

If you do have a lot of tags to deal with, you will note there are two buttons on the lower left that give you useful shortcuts. You can make every link either directed or undirected by using those buttons. But what do you do if there is a mixture?

On that count, there is good news and bad news. I'll give the good news first: the Load From File... button allows you to specify whether link tags represent directed or undirected edges using an input file. Unfortunately, the file format has not actually been documented anywhere... until now! It needs to be an attribute file that has a format similar to the node attribute file used to specify node layout order. The file suffix can be whatever you want, but the file chooser dialog will highlight files with an .rda suffix. (I guess I was thinking that it would stand for relation directed attribute?) The file contents have a required column header line, followed by one and only one row for each and every link tag. Provide true for directed links, and false for undirected links. An example for the given SIF file is:

Relation Directed

UNDIR = false

DIR = true

So, if you load in the above sample file, your SIF import will look like the picture below, with BioFabric using an arrowhead to show the direction of an edge. But, at the moment, that's all it does to treat directed edges in a special fashion. Most notably, the layout algorithms completely ignore directionality in the current version.

A very tiny BioFabric network visualization

Click on picture to enlarge (but why would you?)

That was the good news. The bad news is that there is a bug in the implementation. As you may have noticed, I wrote the two link tags out in all uppercase in this example. That's because the Load From File... option is stuck at only recognizing all-uppercase tags. If your SIF file is using lower- or mixed-case tags, the program will complain and reject the .rda file. That bug has just been added to the GitHub BioFabric Issues Page! That's the beauty of open source: you got no secrets to conceal (OK, more apologies to Bob Dylan).
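When there are many tags, the .rda file can be generated rather than typed. The sketch below collects the distinct link tags from a tab-delimited SIF file and writes the header line plus one `TAG = true|false` row per tag, copying the layout of the sample file shown earlier. All function names are hypothetical. The `uppercase_tags` step is only a possible workaround for the uppercase bug, on the assumption that converting the tags in the SIF file itself is acceptable; it has not been checked against BioFabric, and it would merge tags that differ only in case:

```python
def uppercase_tags(sif_lines):
    """Rewrite tab-delimited SIF lines with upper-case link tags."""
    out = []
    for line in sif_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            fields[1] = fields[1].upper()
        out.append("\t".join(fields))
    return out


def link_tags(sif_lines):
    """Return the distinct link tags in first-seen order."""
    seen = {}
    for line in sif_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            seen.setdefault(fields[1], None)
    return list(seen)


def rda_text(tags, directed):
    """Build the .rda contents: a header line, then one row per tag."""
    rows = ["Relation Directed"]
    for tag in tags:
        rows.append(f"{tag} = {'true' if tag in directed else 'false'}")
    return "\n".join(rows) + "\n"


# The tiny SIF file from this post, written with lower-case tags.
sif = ["foo\tundir\tbar", "foo\tdir\tbaz"]
upper_sif = uppercase_tags(sif)
rda = rda_text(link_tags(upper_sif), directed={"DIR"})
print(rda)
```

The set passed as `directed` is the only thing that has to be maintained by hand; every tag not in it is written out as undirected.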

Friday, September 20, 2013

Using Heads and Tails to Make Heads or Tails of Caltech Houses

I have previously introduced and discussed my BioFabric version of the Caltech Houses (i.e. dorms) network that I based on the data from Traud et al. 2011, and I am going to talk about it a little further here. If you want to view the whole network, and you don't mind the 3.8 MB download, take a look at the scrollable version.

In this post I will discuss the structure of the network within one of the dorms. Again, as with previous posts, I'm just going to describe what features I am seeing by eyeballing the BioFabric visualization. I'm not going to back up these claims by using network tools to analyze the structure; I'll leave that as an exercise for the reader. My goal here is to help you to build up your visual "network fabric intuition".

Recall that this Caltech dorm network was drawn by grouping the students using the provided information about which dorm each student lives in. So there are eight horizontal bands in the fabric that correspond to these eight dorms. Each dorm was separately laid out using the default BioFabric layout algorithm on just the intra-dorm links before they were combined into the full network. This means that the head (left end) of each dorm starts with the most popular student in that dorm (considering the dorm in isolation), followed immediately by that most popular student's Facebook friends. Meanwhile, the tail (right end) of the dorm tends to show students with fewer friends in the dorm and/or more indirect connections to the most popular student. Keep in mind that this tendency towards low-degree students in the tail is broken by those students in the tail who do have many in-dorm friends, but who are several degrees of separation away from the "in crowd" at the head. That's the consequence of the breadth-first search used by the default layout.
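To make the head-and-tail behavior concrete, here is a minimal sketch of a breadth-first node ordering that starts from the highest-degree node. It is an illustration under stated assumptions, not BioFabric's actual code: visiting each node's unseen neighbors in order of decreasing degree, the alphabetical tie-breaking, and the name `bfs_node_order` are all my own choices.

```python
from collections import deque


def bfs_node_order(edges):
    """Order nodes breadth-first, starting from the highest-degree node."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    def by_degree(nodes):
        # Highest degree first; ties broken by name for repeatability.
        return sorted(nodes, key=lambda n: (-len(adj[n]), str(n)))

    order, seen = [], set()
    for start in by_degree(adj):  # restart for any disconnected component
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in by_degree(adj[node] - seen):
                seen.add(nbr)
                queue.append(nbr)
    return order


# Toy graph: A is the hub (degree 5); G has degree 4 but is two hops from A.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
         ("B", "C"), ("F", "G"), ("G", "H"), ("G", "I"), ("G", "J")]
print(bfs_node_order(edges))
```

In this toy graph, G lands behind the degree-1 nodes D and E even though it has four links, simply because it is two hops away from the hub. That is the same effect as the well-connected students who end up in a dorm's tail.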

Note, by the way, that the presentation I am using here is different from the one considered in the original paper, which showed how well the dorm assignment corresponded to the clustering they detected in the Caltech network via clustering algorithms. Instead, my approach here is to look at what we may be able to observe about the network given that it has been explicitly grouped using those dorm assignments; this presentation provides no insights into larger social groupings that cross dorm boundaries.

Below are two figures showing portions of Dorm 4, which appears to be a pretty typical example of the dorms. Though you can pick out each dorm easily enough as you scan the network just by following the shape of the diagonal, I have added the red horizontal lines in these figures to clearly show the extent of Dorm 4 in these extracted segments. The first figure shows the head end, i.e. the popular students:

BioFabric Version of Caltech Social Network: Head End

Click on picture to enlarge

So, the first thing to notice is that 623, on the far left (and the most popular student in the dorm), is pretty well connected within the dorm. His/her in-dorm edge wedge covers about 75% of the students in the dorm. (Remember, due to link grouping, the in-dorm edge wedge appears to the left of the out-of-dorm edge wedge for each student. Furthermore, the out-of-dorm wedge appears as two wedges here, since it is split into above-node and below-node pieces.) Then, 623's most popular friends do a pretty good job of matching 623's friends in the dorm, since we can see that their in-dorm wedges very roughly approximate 623's. Those friends also do a good job of bringing in some more students, such that by the time we get to the ninth student on the right side of the head, it appears that well over 80% of the students in the dorm have been linked to.

It's also interesting to note that 623, while the most popular student in the dorm, has a majority (maybe greater than 67%?) of his/her friends outside the dorm. It appears that 633, the next in line, is almost as popular as 623 in-dorm, but is also much more inwardly focused on Dorm 4!
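For readers who want to check this kind of eyeball estimate against the data, a minimal sketch of the in-dorm versus out-of-dorm count is given here. The function name `dorm_split`, the student IDs, and the edge list are all made up for illustration, and the code assumes each undirected friendship appears once in the edge list:

```python
def dorm_split(edges, dorm_of):
    """Map each student to (in_dorm_friends, out_of_dorm_friends)."""
    counts = {student: [0, 0] for student in dorm_of}
    for a, b in edges:
        slot = 0 if dorm_of[a] == dorm_of[b] else 1
        counts[a][slot] += 1
        counts[b][slot] += 1
    return {student: tuple(c) for student, c in counts.items()}


# Toy data only: three students in dorm 4, three in other dorms.
dorm_of = {"s1": 4, "s2": 4, "s3": 4, "x1": 5, "x2": 5, "x3": 6}
edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s3"),
         ("s1", "x1"), ("s1", "x2"), ("s1", "x3")]

split = dorm_split(edges, dorm_of)
in_dorm, out_dorm = split["s1"]
print(split["s1"], out_dorm / (in_dorm + out_dorm))
```

Here the toy student `s1` has more friends outside the dorm than inside it, while `s2` has the same in-dorm count but no outside friends at all, which mirrors the contrast between 623 and 633 described above.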

Now look at the tail end of Dorm 4, again with the red lines to show the extent of the dorm:

BioFabric Version of Caltech Social Network: Tail End

Click on picture to enlarge

Note first that this short tail stretch actually shows the links for just over 50% of the students in the dorm (241, on the left, is not quite below the half-way point between the red lines). We can also see here what is going on with the 10% or so of the students who are not directly connected to the popular core; they start around the prominently labeled student 734 near the far right. Even at this scale, you can spot sort of a "phase change" in the edge wedge pattern just to the left of 734: the in-dorm edge wedges stop connecting to the popular students at the top of the Dorm 4 band. I'll discuss this group a little more below.
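One way to look for such a detached group in the data, sketched under assumptions: take the layout order for a dorm, treat the first few students as the "core", and list everyone with no direct link into it. The function name `detached_from_core`, the `core_size` cutoff, and the toy data are all hypothetical; where the core actually ends in Dorm 4 is a judgment call this sketch does not settle.

```python
def detached_from_core(order, edges, core_size):
    """List students (in layout order) with no direct link to the core.

    `order` is the left-to-right node order within one dorm, `edges` holds
    the intra-dorm links, and the first `core_size` students form the core.
    """
    core = set(order[:core_size])
    touches_core = set(core)
    for a, b in edges:
        if a in core:
            touches_core.add(b)
        if b in core:
            touches_core.add(a)
    return [student for student in order if student not in touches_core]


# Toy dorm: a and b are the core; e and f only reach it through d.
order = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e"), ("e", "f")]
print(detached_from_core(order, edges, core_size=2))
```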

But turning back to the "typical" tail-end students in Dorm 4, we see that they are all connected to that central core at the head of the dorm, since they have links going to the top of the dorm band. Perhaps not surprisingly, most of the tail students are connected to social groups centered on the top core 50%, but not so much amongst each other. We can see that because the in-dorm edge wedges here typically show few edges below the diagonal (226 and 606 are notable exceptions). Another feature worth noting is that even these students with relatively few in-dorm connections almost all have at least a few out-of-dorm connections as well.

Finally, the isolated tail group starting around 734 is shown below in detail as an extracted submodel, again with red lines to indicate the bounds of Dorm 4. Note how 734, 722, 728, 738, and (to a lesser extent) 744 form a somewhat cohesive social unit, with many common social connections focused outside of the dorm:

BioFabric Version of Caltech Social Network: Detail

Click on picture to enlarge

So that's my attempt to make "heads or tails" of one of the dorms in the network just by visual inspection of the fabric. I expect one more posting on this network: stay tuned!

Saturday, September 7, 2013

New Kids in Town?

My attempts to keep the blog fresh and current this summer? EPIC FAIL! But I have a backlog of post topics, so I hope that this post marks the end of the dry spell.

In my last post, so very long ago, I introduced my BioFabric version of the Caltech Dorms Facebook Network, where nodes are students, edges are Facebook friend relationships, and the students have been grouped by dorm. Additionally, the edges for each student are grouped into two separate edge wedges: the first (left) one is for friend connections within a dorm, and the second (right) one is for connections between dorms. For a better view of that network, go to the scrollable version, but be warned that it is kinda big: 3.8 MB.

There are a few interesting things in that network, so I'll be spending a couple of blog posts covering them. The first one is pretty simple, and you can spot it easily while scrolling across the network. It's at the tail end of the Dorm 5 cluster, and it shows up in the following figure. The figure shows two separate pieces of the network, divided by the vertical blue line, that are aligned so the nodes match up:

Click on picture to enlarge

Take a look at circled students 144 and 85, who are in Dorm 5; they are in the right half of the figure. What's interesting about them is that they look more like members of Dorm 4 than Dorm 5. For comparison, the left half of the figure shows some Dorm 4 students, and the horizontal red lines show the extent of the Dorm 4 cluster. Clearly, 144 and 85 have most of their friends in Dorm 4. And as the following detail shows, they don't know too many people at all in Dorm 5 (and they do know each other):

Click on picture to enlarge

So perhaps we can hazard a guess that 144 and 85 are recent arrivals to Dorm 5, both coming from Dorm 4?
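The same pattern could also be searched for programmatically. The sketch below flags students who have strictly more friends in some other dorm than in their own. The name `friends_mostly_elsewhere`, the student IDs, and the edges are made up, and a flagged student is only a candidate worth looking at, not evidence of an actual move:

```python
from collections import Counter


def friends_mostly_elsewhere(edges, dorm_of):
    """Map each flagged student to the other dorm holding most friends."""
    friends_by_dorm = {student: Counter() for student in dorm_of}
    for a, b in edges:
        friends_by_dorm[a][dorm_of[b]] += 1
        friends_by_dorm[b][dorm_of[a]] += 1

    flagged = {}
    for student, counts in friends_by_dorm.items():
        if not counts:
            continue  # no friends recorded at all
        top_dorm, top = counts.most_common(1)[0]
        own = counts[dorm_of[student]]
        # Flag only a strict majority elsewhere; ties stay unflagged.
        if top_dorm != dorm_of[student] and top > own:
            flagged[student] = top_dorm
    return flagged


# Toy data: u and v live in dorm 5 but mostly know dorm 4 students.
dorm_of = {"p": 4, "q": 4, "r": 4, "u": 5, "v": 5, "w": 5}
edges = [("p", "q"), ("q", "r"), ("p", "r"),
         ("u", "p"), ("u", "q"), ("u", "r"), ("u", "v"),
         ("v", "p"), ("v", "r"), ("w", "u")]
print(friends_mostly_elsewhere(edges, dorm_of))
```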

Of course, this sort of visual analysis can also be done using adjacency matrices that have been ordered to show the dorm groups on the diagonal. However, I will argue that the visual cues provided by the two-dimensional edge wedges of BioFabric make such students stand out better than a one-dimensional column of the matrix does. This is particularly true when the resolution of the adjacency matrix falls below the threshold of one pixel per student, as we would expect in larger networks. Furthermore, at such resolutions, I think it would be very difficult to spot that a single column has a set of pixels in one set of rows (Dorm 4) while simultaneously missing pixels in another set of rows (Dorm 5).

Now let's see if I can manage to get my rate of blog posts back up to speed...