This was originally posted 1/9/2009 on my old blog.
Due to popular demand (well, 3 requests :) ), this is a commentary on, and additional information for, my conference paper and presentation:
Pikas, C. K. (2008). Detecting Communities in Science Blogs. Paper presented at eScience '08. IEEE Fourth International Conference on eScience, 2008. Indianapolis. 95-102. doi:10.1109/eScience.2008.30 (available in IEEE Xplore to institutional subscribers) [also self-archived - free!- here]
The presentation is embedded in another blog post, and is available online at SlideShare. The video of me talking about it [was?] available on the conference site, but I haven't gotten it to load.
Context:
I'm interested in scholarly communication in science, engineering, and math. Specifically, informal scholarly communication and how information and communication technologies, in particular social computing technologies, can/do/might impact informal scholarly communication in science/math/engineering. I'm also interested in knowledge production and public communication of science, two sub-areas of STS (this acronym has several expansions - the most common is probably science and technology studies).
As a blogger, a 2-time (soon to be 3) attendee of what was the NC Science Blogging Conference, and a reader of science blogs, I became curious about how and why scientists use blogs and whether their use is: a) similar to how non-scientists use blogs, b) for informal scholarly communication (to other scientists about their work), c) for public communication of science, d) for personal information management, or e) maybe for team collaboration(?)... The first way I looked at this was a study using content analysis and interviews of chemists and physicists (this hasn't been published yet, but maybe someday - these things aren't as perishable as writings in other fields, I hope). The second study swings all the way to a structural analysis of the science blogosphere - and that's what was reported in this paper.
In social network analysis (SNA), you look at the link structure, not the attributes of the actors or nodes. The idea is that links show evidence of potential information flows or influence. You can pick out prestigious or central actors, and groups which are more tightly connected to each other than to the rest of the network.
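(A quick aside for anyone who hasn't played with SNA tools: here's a minimal sketch of the kinds of measures I mean, in Python with the networkx package. The blogs and links are made up for illustration - this is not my data or my actual analysis code.)

```python
# Toy sketch of link-based SNA measures (hypothetical blogs and links).
import networkx as nx

# Directed links: an edge A -> B means blog A links to blog B.
G = nx.DiGraph()
G.add_edges_from([
    ("astro_blog_1", "astro_blog_2"),
    ("astro_blog_2", "astro_blog_1"),
    ("chem_blog",    "astro_blog_1"),
    ("bio_blog",     "astro_blog_1"),
    ("bio_blog",     "chem_blog"),
])

# Prestige: who gets linked *to* a lot.
print(nx.in_degree_centrality(G))

# Brokerage: who sits on the shortest paths between others.
print(nx.betweenness_centrality(G))
```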
The first major problem was locating science blogs - and even drawing any sort of boundary as to what a science blog was or wasn't. Given that I'm interested in how these things contribute to science, I drew the line thusly:
- Blogs maintained by scientists that deal with any aspect of being a scientist
- Blogs about scientific topics by non-scientists
Omitted were:
- Blogs that are primarily political speech
- Blogs maintained by corporations
- Non-English-language blogs
(you could definitely draw the line somewhere else, but this is what I did!)
Also, given that I'm a great searcher but barely a coder at all, I did this by search, snowball sampling, and any hook or crook to get as big a set as possible. I went to each of these blogs and copied off the URLs from the blogrolls (to answer a question from a Scibling: if you had a rotating list that showed up in JavaScript on the page source, I probably got it; if you have a second page with a list of 300 blogs (cough - Bora - cough), I probably got it; likewise if it was generated by something like Google Reader)... so this was incredibly tedious, and I probably missed a few, but it's probably pretty accurate. That was the first network.
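(If you did want to automate the blogroll harvesting instead of doing it by hand like I did, something along these lines would handle the simple cases - a hypothetical sketch, not anything I ran, and it would miss the JavaScript-generated lists mentioned above.)

```python
# Hypothetical sketch of pulling blogroll links off a blog's front page.
# Requires the requests and beautifulsoup4 packages.
import requests
from bs4 import BeautifulSoup

def blogroll_links(blog_url):
    html = requests.get(blog_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Crude filter: keep off-site links only; a real version would also
        # need to decide which of these are actually blogs.
        if href.startswith("http") and not href.startswith(blog_url):
            links.add(href)
    return links

# print(blogroll_links("http://example-science-blog.example/"))
```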
The second network - and I originally had a much grander scheme - took the "most interesting" (most central by common measures) blogs from the first network, and then used Perl scripts (core script developed by Jen Golbeck, which I then customized to work for non-WordPress blogs and blogs where people changed their templates a lot - you all really could have made this easier, lol) to pull all of the commenter links off the last 10 posts (this was done in like April).
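(The real scripts were Perl; here's roughly the same idea sketched in Python, with made-up CSS selectors - in practice every template needed its own tweaks, which is exactly why the customization was such a pain.)

```python
# Rough sketch of pulling signed-commenter links off a single post.
# The selectors below are hypothetical; real blog templates vary a lot.
import requests
from bs4 import BeautifulSoup

def commenter_links(post_url):
    html = requests.get(post_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # Many templates wrap the commenter's name/URL in something like this.
    for a in soup.select(".comment-author a[href], .comments a.url"):
        urls.append(a["href"])
    return urls
```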
Blogs have links between them a) in the content, b) in the blogroll, and c) in signed comments... Other studies have used basically any link on the page, but the fact is that linking within a post isn't really saying much (a little link love, but not a real endorsement). Blogrolls are typically some sort of endorsement, and signing a comment means *something*.
So then I ran all the typical SNA measures across it to look at central actors and to find cohesive subgroups. As far as centrality - no real surprises. As far as cohesive subgroups - a bit more tricky. Basically one large component - and not terribly clumpy, with the exception of the astro bloggers, who are pretty tight. Most of the community detection techniques use a binary split - or start with binary splits - and none of these were at all effective in dividing up the hairball. Spin glass, OTOH, worked beautifully and returned 7 clusters. So then I went back, looked at the blogs, and figured out the commonality for each of the clusters (yes, I could have used some NLP to extract terms and automatically label the clusters, but there were only 7, so...).
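(For the curious: the spin glass method is available in igraph. A minimal sketch of running it - on a toy edge list, not my network - looks something like this. Note that it only runs on a connected graph, hence the giant-component step, and it's stochastic, so cluster numbering can change between runs.)

```python
# Toy sketch of spin glass community detection with python-igraph.
import igraph as ig

edges = [
    ("astro1", "astro2"), ("astro2", "astro3"), ("astro1", "astro3"),
    ("bio1", "bio2"), ("bio2", "bio3"), ("bio1", "bio3"),
    ("astro1", "bio1"),    # a bridge between the two clumps
    ("loner1", "loner2"),  # a tiny separate component
]
g = ig.Graph.TupleList(edges, directed=False)

giant = g.clusters().giant()               # spin glass needs one component
communities = giant.community_spinglass(spins=25)
for i, members in enumerate(communities):
    print(i, [giant.vs[v]["name"] for v in members])
```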
The single component isn't too surprising, because we know from diffusion of innovations for ICTs that we would expect people to pick this up from other people and then probably link back. The power law degree distribution is also very typical when you're talking about the activities of people (whether Lotka, Zipf, Pareto, Bradford... whatever law). The clusters were related to subject areas - very broad subject areas. One question in my mind was how much people would read/comment outside of their home discipline... based on this network, certainly outside of their particular specialty, but still in the neighborhood, with the exception of a few "a-list" science bloggers whom everyone reads.
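(A quick way to eyeball that kind of heavy-tailed degree distribution is to plot degree counts on log-log axes and look for a roughly straight line. The sketch below uses a stand-in random graph, not my data.)

```python
# Sketch: eyeball a power-law-ish degree distribution on log-log axes.
from collections import Counter
import networkx as nx
import matplotlib.pyplot as plt

G = nx.barabasi_albert_graph(500, 2)      # stand-in graph with a heavy tail
counts = Counter(d for _, d in G.degree())

degrees, freqs = zip(*sorted(counts.items()))
plt.loglog(degrees, freqs, "o")
plt.xlabel("degree")
plt.ylabel("number of nodes")
plt.show()
```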
What was interesting - and most definitely worthy of further investigation - is this cluster of blogs written mostly by women, discussing the scientific life, etc. The degree distribution was much closer to uniform within the cluster, and there were many comment links between all of the nodes. This, to me, indicates other uses for the blogs and perhaps a real community (or Blanchard's virtual settlement).
Also, I picked out the troll very easily using the commenter network - so this method could be used to automate troll identification. (In the first study I talked about this guy with a physicist, and the physicist basically only reins the troll in when he's so out of bounds as to be gross... so ID-ing a troll doesn't necessarily mean banning.)
I'm quickly running out of steam in this blog post - but this might end up being a pilot for my dissertation, so I'm definitely more than happy to talk about it either in the comments here, or on SlideShare, or on FriendFeed... or Twitter or... just look for cpikas :)