I collect my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected, by navigating the graph of tags, people and bookmarks.
Nice as CiteULike is, it is fairly difficult to get an overall picture of one’s own collection. It is especially difficult to see quickly whether there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it takes a lot of clicks to find.
My usual solution is to export the information, massage it into Graphviz format and use graph segmentation and layout algorithms to get a better overview. I have talked about Graphviz a number of times on this blog before. This is yet another time it proved useful.
I started by exporting my library from CiteULike. I found the EndNote export format to be more structured and therefore easier to parse. I then ran it through a custom Python program that basically spat out a graph with titles pointing at authors. That produced a very large graph that was not particularly useful.
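A program along these lines would do the job. This is a minimal sketch rather than the original code, and it assumes the export is in the tagged EndNote style: one `%X value` field per line, `%T` for the title, `%A` for each author, and a blank line between records. The function names are made up for illustration.

```python
from collections import defaultdict


def parse_endnote(text):
    """Split a tagged export into records: a list of dicts, tag -> list of values."""
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current record.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        # Assumption: fields look like "%A Smith, John" (tag, space, value).
        if line.startswith("%") and len(line) > 3 and line[2] == " ":
            current[line[:2]].append(line[3:].strip())
    if current:
        records.append(dict(current))
    return records


def quote(label):
    """Wrap a string in double quotes for DOT, escaping embedded quotes."""
    return '"' + label.replace('"', '\\"') + '"'


def to_dot(records):
    """Emit a directed graph with one edge from each title to each of its authors."""
    lines = ["digraph library {"]
    for record in records:
        titles = record.get("%T", [])
        if not titles:
            continue  # skip records without a title
        title = quote(titles[0])
        lines.append(f"  {title} [shape=box];")
        for author in record.get("%A", []):
            lines.append(f"  {title} -> {quote(author)};")
    lines.append("}")
    return "\n".join(lines)
```

Writing the result of `to_dot(parse_endnote(export_text))` to a file such as `output.dot` gives Graphviz something to work with.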
The next step was to discover disjoint clusters of titles/authors. I used ccomps with the -v and -x flags (e.g. ccomps.exe -v -x -o comp.dot output.dot).
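To make it concrete what "disjoint clusters" means here, the same partitioning can be sketched in a few lines of Python. This is only an illustration of the idea (connected components via union-find), not how ccomps itself is implemented, and the function name is hypothetical.

```python
def connected_components(edges):
    """Group nodes into connected components, largest first.

    `edges` is an iterable of (title, author) pairs.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)
```

Two papers end up in the same cluster as soon as they share a single author, which is why one stray name can glue otherwise unrelated clusters together.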
ccomps gave me the partitioned graphs as well as statistics on the number of nodes and edges in each one. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through neato with overlap=scale and splines=true (e.g. neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot).
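If there are many components, typing these commands by hand gets old quickly. A small wrapper like the following could drive both tools. It is a sketch that assumes `ccomps` and `neato` are on the PATH (without the `.exe` suffix) and uses only the flags shown above; the helper names are invented.

```python
import subprocess


def ccomps_command(source, out_base):
    """Command line that splits `source` into per-component graphs."""
    return ["ccomps", "-v", "-x", "-o", out_base, source]


def neato_command(component, image):
    """Command line that lays out one component as a GIF."""
    return ["neato", "-Tgif", "-o", image,
            "-Goverlap=scale", "-Gsplines=true", component]


def run(command):
    """Run a command and hand back its exit status instead of raising.

    The ccomps man page documents a non-zero status when the input has
    more than one component, so that status is not treated as a failure.
    """
    return subprocess.run(command).returncode


# Example use, mirroring the commands in the text:
#   run(ccomps_command("output.dot", "comp.dot"))
#   run(neato_command("comp_1.dot", "neato_1.gif"))
```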
The resulting graph was still not perfect, but it was a good start. I also tried fdp instead of neato, but that seemed to produce giraffe versions of the graph, with overly long edges.
I also ran into some problems that would either cause partitions to merge or produce duplicate nodes and edges.
The first problem was that sometimes a person was an author and sometimes an editor. I was interested in both, so I collapsed those fields together. That caused some non-people to show up on the graph and connect clusters in unexpected ways. For my library the specific value was ‘European’, so I filtered it out in the code.
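The merge-and-filter step might look something like this. It is a sketch under assumptions: each record is a dict mapping a field tag to a list of values, authors live under `%A` and editors under `%E`, and the offending value matches exactly. The names `NOT_PEOPLE` and `people` are hypothetical.

```python
# Values that appear in author/editor fields but are not people.
# 'European' is the one from my library; other libraries will need their own.
NOT_PEOPLE = {"European"}


def people(record, ignore=NOT_PEOPLE):
    """Return authors and editors of one record as a single de-duplicated list."""
    result = []
    for tag in ("%A", "%E"):  # assumed tags for author and editor
        for name in record.get(tag, []):
            if name not in ignore and name not in result:
                result.append(name)
    return result
```

De-duplicating within a record also avoids double edges when the same person is listed as both author and editor of the same item.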
The second problem had to do with CiteULike’s parsing. Sometimes it would split a first and last name into separate names, probably due to incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entry. Probably a good thing to do anyway.
The third problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first initial. I worked around that by normalizing all first names to initials. Obviously, this could collapse entries belonging to multiple real people into one.
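The normalization itself is simple. Here is a minimal sketch, assuming names arrive as "Last, First Middle"; the function name is made up, and anything without a comma is passed through untouched.

```python
def to_initials(name):
    """Reduce given names to initials: 'Smith, John Q.' -> 'Smith, J. Q.'."""
    if "," not in name:
        return name.strip()  # not in "Last, First" form; leave it alone
    last, given = name.split(",", 1)
    # Treat dots and hyphens as separators so "J.-P." and "Jean-Pierre" agree.
    parts = given.replace(".", " ").replace("-", " ").split()
    initials = " ".join(part[0].upper() + "." for part in parts)
    return f"{last.strip()}, {initials}" if initials else last.strip()
```

The downside mentioned above is easy to see: "Smith, John" and "Smith, Jane" both come out as "Smith, J." and become one node.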
Further on name problems: in cases of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and not display or export them correctly. Additionally, sometimes characters such as ñ would be entered as plain n. Those also needed to be corrected manually.
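I made these corrections by hand, but finding the candidates could be automated. The sketch below is not something I ran for this project: it strips diacritics with the standard `unicodedata` module and reports names that differ only in accents or case, so they can be reviewed manually. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold(name):
    """Return a comparison key with diacritics removed and case folded."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def diacritic_variants(names):
    """Group distinct spellings that collapse to the same folded key."""
    groups = defaultdict(set)
    for name in names:
        groups[fold(name)].add(name)
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]
```

This only flags suspects. Deciding which spelling is right, and sorting out multi-part surnames, still needs a human.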
The project only took a couple of hours, including writing code and cleanup. It is already useful to me, as I found a new person who was in an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.
There is, of course, a lot more that could be done: automatic co-reference of misspelt names, layout hints based on the number of times authors appeared together, color coding of tags. These are just some of the easy ideas.
There might even be a small project or paper in doing co-reference resolution and cleaning up CiteULike data. After all, similar projects were done for Wikipedia. I don’t think CiteULike currently makes a full export available, but they do have some exports, so they might be amenable to exporting a special set for research purposes.