DCW Volume 1 Issue 15 – eXtending the Library & Roadtrippin’ with HathiTrust

by Korey Jackson on September 14, 2012

Libraries as Platforms
Zach Coble

In a recent article on libraries as platforms, David Weinberger envisions that, much as Facebook opened its API to developers in 2007, libraries would provide access to everything they have – books and online content, metadata, and conversations about that content – in an effort to develop knowledge and community. Such an idea is not radical and is essentially at the core of libraries: any citizen can check out Huck Finn to read for fun, to write a research article on Mark Twain, or to use as a text for a discussion group. In fact, many are currently working on technical solutions to the ideas Weinberger outlines, such as linked open data and the University of Rochester’s eXtensible Catalog. Of course, part of the reason we’re not yet there is that libraries don’t own (in the sense of copyright) all of their content and thus can’t do as they please with it, and most libraries don’t have the resources for a research and development unit (a library lab or a skunkworks team) dedicated to finding innovative solutions.

The idea of a library platform with individual libraries as nodes within a larger network is similar to the DPLA’s move toward hubs. Each DPLA hub will contribute metadata under a CC0 license (i.e., no restrictions whatsoever), preferably for “Green Light” content that resolves to already accessible digital material. After all, there’s no need to duplicate what the Web as a platform has already accomplished. Similarly, one idea recently put forth is for the DPLA to work with institutional repositories to aggregate and spotlight existing collections. As libraries move toward a more networked platform model and leverage the wealth of existing collections, it seems – in terms of the DIKW (data, information, knowledge, wisdom) hierarchy – that the library becomes less preoccupied with collecting data and information and more interested in mechanisms that facilitate the creation of knowledge and wisdom.
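The hub model described above can be sketched in a few lines of code. This is a toy illustration only: the record shapes, field names, and hub names below are invented for the example and do not reflect the DPLA’s actual schema or API.

```python
# Toy sketch of hub-style metadata aggregation: each "hub" contributes
# records under CC0, and the aggregator normalizes them into one index.
# All field names and hub names here are hypothetical illustrations.

def normalize(record, hub_name):
    """Map a hub's local record to a shared schema, tagging its provenance."""
    return {
        "title": record.get("title", "").strip(),
        "rights": "CC0",              # hub metadata is contributed under CC0
        "url": record.get("url"),     # "Green Light" content resolves to
        "provider": hub_name,         #   already accessible digital material
    }

def aggregate(hubs):
    """Merge every hub's records into a single searchable index."""
    index = []
    for hub_name, records in hubs.items():
        index.extend(normalize(r, hub_name) for r in records)
    return index

hubs = {
    "State Hub A": [{"title": " Huck Finn ", "url": "http://example.org/1"}],
    "Repository B": [{"title": "Twain Letters", "url": "http://example.org/2"}],
}
index = aggregate(hubs)
```

The design point is simply that the network, not any single library, holds the aggregate view, while provenance travels with each record.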

A Day and a Half of “Data to Insight”: The HathiTrust Research Center Uncamp
Aaron McCollough

This week the world (or 80 or so academics, at least) witnessed the first HathiTrust Research Center Uncamp. I went. I’m tempted to conduct the rest of this entry in the gonzo journalistic style of Hunter S. Thompson, but I won’t. Really, the only resemblance between this trip and Fear and Loathing in Las Vegas is that there’s a trip in that novel, too.

I cruised down to Bloomington on Sunday with Justin Joque, the University of Michigan Library’s resident spatial and numeric data librarian. We took his Saturn VUE. We had some iffy Thai food in Fort Wayne on the way. On Monday, we settled in to IU’s brand new CyberInfrastructure Building, which is terrifically fancy. By Tuesday evening, we were back on our way to Ann Arbor in the VUE. We got burritos in Fort Wayne.

So, what (you may be asking yourself) is the HathiTrust Research Center? “I know HathiTrust is the world’s largest digital library, with over 10 million volumes digitized, Aaron,” you might be saying. “And,” you might also be saying, “I know the HathiTrust gathers volumes digitized by Google, the Internet Archive, and many of the partner libraries, but… Aaron… what is the Research Center?”

Well, friend, I’m here to tell you about it, because I just saw the light. Er, I just saw the Research Center anyway…

Here’s how the Center describes itself: “The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.” (http://www.hathitrust.org/htrc).

The HTRC Uncamp was an opportunity for scholars, librarians, programmers, and their ilk to learn more about the current state of the Research Center’s mission and to get a crack at some hands-on demos with the corpus. This was also an opportunity to offer feedback and to discuss possible new directions for working with the corpus.

To my mind, the two primary themes of the Uncamp were sounded in the first 15 minutes by Brad Wheeler and Beth Plale. Wheeler recalled the research center’s founding priority to make connections between the HathiTrust repository and the researchers who would benefit from the new potential such a large digital corpus affords. He stressed the importance of the “Technology Acceptance Model” (http://www.istheory.yorku.ca/Technologyacceptancemodel.htm) for enticing researchers. Plale stressed the importance of computation moving TO the data rather than the other way around. In other words, the HTRC is designed to be flexible and to accept algorithms and analysis routines users bring to the corpus (rather than imposing algorithms and routines on users). The implicit message was that the research center has learned a great deal from STM computing, but that it isn’t interested in shoehorning humanities research into STM routines.

We were treated to the philosophical view behind Hathi as articulated by John Wilkin (Executive Director of HathiTrust), an overview of the collection and data currently accessible with research center tools by Jeremy York and Stacy Kowalczyk, and an overview of the architecture designed to enable “non-consumptive reading.”

The notion of “non-consumptive reading” will be familiar to most DH-interested readers due, in part, to its perversity as an English phrase. The point, as an HTRC poster succinctly put it, is this: “The ‘non-consumptive research’ model has been developed to provide secure analytic access to large corpora of copyrighted materials that otherwise would be off limits to researchers when rights holders do not want to expose their collections to the public.” Put another (somewhat tortured) way by Plale, “your eyes are not human-consuming the materials.” Instead, you are using a variety of workflows to query the repository’s index, to access volumes and build collections using the metadata retrieved from the index, and to process those collections against an array of analytic tools.
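The workflow just described can be caricatured in miniature. In this sketch, a user-supplied routine runs against the corpus inside the system and only aggregate results come back out; the full text never leaves. The corpus contents and function names are invented for illustration, and the real HTRC architecture is of course far more involved.

```python
# Toy model of "non-consumptive" analysis: computation moves TO the data,
# and only derived results (here, word counts) are returned to the user.
# CORPUS contents and all names below are hypothetical illustrations.
from collections import Counter

CORPUS = {  # volume id -> (copyrighted) full text, held inside the system
    "vol1": "the whale the sea the whale",
    "vol2": "call me ishmael",
}

def run_routine(volume_ids, routine):
    """Apply a user-supplied routine to each volume; return only its outputs,
    never the underlying text."""
    return {vid: routine(CORPUS[vid]) for vid in volume_ids}

# The user brings the algorithm to the data: here, simple term frequencies.
def term_frequencies(text):
    return Counter(text.split())

results = run_routine(["vol1", "vol2"], term_frequencies)
# results holds word counts per volume, not the volumes themselves
```

The point of the design is the boundary: `run_routine` is the only door out of the system, so whatever the routine computes is all a researcher ever “reads.”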

Concrete research examples were offered by Colin Allen and Ted Underwood. The former stressed the need for significant computing power to run, store, and make replicable all of the experiments required by his research into homologies between academic argumentation in humanistic and scientific discourses. The HathiTrust Research Center promises this kind of scale. Underwood focused his remarks on the potential for cleaning HathiTrust data within the Research Center’s architecture.

During the hands-on sessions with SEASR analytics, we also got an immediate glimpse of the potential Allen and Underwood had described. Although the system is still in its infancy, it’s easy to imagine it bringing DH to the doorstep of many scholars who presently regard the field as obscure. Those who are proficient in Java and/or Python, or (more likely) those who have proficient students/assistants/librarians at their disposal, will be able to customize routines, iterate those routines extensively, preserve the routines along with their results, and then leave the routines behind for others to use in turn.

Overall, the HTRC Uncamp was a success, and the prevailing attitude was one of powerful potential just around the corner. As the HathiTrust partnership continues to grow, it is encouraging to see institutions developing their own brands of expertise (i.e., Indiana and Illinois tackling the Research Center). It bodes well for further developments at other schools. I look forward to future HTRC events as they emerge, and I hope this one will inspire other Hathi communities to flourish.
