Cohen on Data Mining

Cohen argues that computational methods for analyzing, manipulating and retrieving data from large corpuses will provide new tools for academic research, including in the humanities. He provides two examples, both projects he worked on. Syllabus Finder, a document classification tool for aggregating and searching course syllabi, finds and collects documents that show similar patterns in their use of words. It can also differentiate documents that share similar keywords by analyzing their use of other words. The other example is H-Bot, a question answering tool that takes in queries in natural language (instead of code), transforms the query using predetermined rules and conducts a web search before outputting the answer the tool decides is relevant.
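To make "similar patterns in their use of words" a bit more concrete, here is a minimal sketch in Python. It is not Cohen's actual method, which this post does not describe in detail; it uses a common stand-in technique (cosine similarity over word counts), and the three sample texts are made up for illustration. Note that both candidates contain the keyword "course", but only one of them uses the other words a syllabus tends to use.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample texts (assumptions, not real data).
known_syllabus = "Course syllabus: required readings, weekly assignments, grading, office hours"
candidate_a = "Syllabus for the course, with readings, assignments and grading policy"
candidate_b = "Course of the river: a reading of flood history"

reference = word_counts(known_syllabus)
score_a = cosine_similarity(reference, word_counts(candidate_a))
score_b = cosine_similarity(reference, word_counts(candidate_b))

print(f"candidate_a: {score_a:.2f}")
print(f"candidate_b: {score_b:.2f}")
```

The syllabus-like candidate scores clearly higher than the one that merely shares a keyword. A real tool would presumably need much more than this (larger training sets, weighting of common words and so on), so please read it as an illustration of the idea only.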

Lessons that Cohen learned while building these tools:

  • APIs are good
    • they offer the possibilities for combining various resources (which facilitates the use of less rigorous but more accessible corpuses)
    • third-party development can lead to unexpected and positive results
  • open resources are better than restricted ones (access makes up for quality)
  • large quantity can make up for quality

Just in case: an API is a way for our software to get data from another piece of software (usually on another computer, like a web server) without our having to do it manually. The following is one of the more concise and less technical explanations I found online:

Also, I feel that The Lexicon of DH workshop slides provide a good overview of the coming week’s theme.

So indeed, the use of APIs has become more common outside of the IT field since 2006. The New York Public Library, the Cooper-Hewitt Museum and the New York Times, among many others, provide APIs that allow access to their digital collections through software. MoMA provides its collection data on GitHub.
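For anyone who has not used an API before, here is a rough sketch in Python of what "access through software" tends to look like: the program builds a URL containing the query, the server sends back structured data (often JSON), and the program picks out the fields it needs. Everything specific here is an assumption for illustration: the endpoint `api.example.org`, the parameter names, the `DEMO_KEY` and the sample response are hypothetical and do not correspond to any of the institutions mentioned above. The response is hard-coded so that the example runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, not a real institution's API.
BASE_URL = "https://api.example.org/v1/collection/search"


def build_request_url(query, api_key, per_page=5):
    """Assemble the URL a program would request from the API."""
    params = {"q": query, "per_page": per_page, "api_key": api_key}
    return BASE_URL + "?" + urlencode(params)


def extract_titles(response_text):
    """Parse a JSON response and return the title of each result."""
    data = json.loads(response_text)
    return [item["title"] for item in data.get("results", [])]


# Made-up response standing in for what a server might return.
SAMPLE_RESPONSE = (
    '{"results": ['
    '{"title": "Map of Manhattan", "year": 1865}, '
    '{"title": "Brooklyn Bridge photograph", "year": 1883}'
    "]}"
)

url = build_request_url("new york maps", api_key="DEMO_KEY")
titles = extract_titles(SAMPLE_RESPONSE)

print(url)
print(titles)
```

With a real API, the hard-coded sample would be replaced by an actual HTTP request to the URL, and the parameter and field names would come from that API's documentation.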

The technologies used for document searching and question answering, the two examples that Cohen provides, have developed into something arguably more reliable, faster and easier to use. For example, we don’t even need to build a tool in order to be able to ask some questions in natural language:

We'll remember you, H-Bot.

Relating back to the discussion of previous weeks, what do you think are the impacts or implications that the growth of digital collections and APIs, along with developments in data collection and analysis technologies, have on teaching (or on broader aspects of life and research)? How does this fit together with more traditional modes of teaching, like textbooks?

Another question I have relates to the fact that both examples mentioned in the article are no longer functioning. The latest update on Syllabus Finder that I could find explains that a system change in the Google search API effectively deprecated the tool; it also provides a download link to the database of syllabi—but only a small part of it. H-Bot is online, but sadly doesn’t seem able to answer me:

Oh, H-Bot.

I can easily imagine the difficulties of maintaining such a digital project. I am also under the impression that eventually becoming outdated is the fate of many digital projects. They require a different type of effort than, say, putting out journal articles. Maintenance requires manpower, and manpower requires funds. I also have the ambivalent feeling that it may not necessarily be a bad thing that some projects finish their life cycle, though it would be great if those projects were archived somehow (in a functioning state). I guess I feel more personally involved since I will probably build something or another during my time here. I would love to hear your thoughts on this matter.

Some more or less related links:

4 thoughts on “Cohen on Data Mining”

  1. Sara Vogel, PhD. (she/her)

    Achim, in reading the article you linked to about the last update on Syllabus Finder, there is something very troubling about the fact that Google deprecated its own API to protect its IP, and won’t provide support for the old version anymore. I agree with what both you and Jojo are writing here, that maybe some projects deserve to fade away into digital project heaven (or the Wayback Machine) when funding dries up. At the same time, when third parties develop on top of Google’s API, it seems to me like Google actually has a great deal of power (linking to Alexis’ question). I probably interact with and depend on dozens of apps over the course of the week that use the Google Maps API. It is a mutually beneficial relationship between Google and the third parties… until it’s not…

  2. Alexis Larsson

    Thanks, Achim. I think I gleaned the same new thoughts about APIs as you did from the article, but I wish Cohen had gone further into what kinds of questions data mining can or could answer. I’m thinking that if the structures, materials and resources available shape human activity, then this would also be the case with resources like APIs. So, how would that change the shape of research trends over time? Also, Cohen gave light treatment to the commercial interests backing private API applications. This is worthy of further investigation. Lastly, about archiving digital projects: what kinds of resources are involved in that? What’s the environmental impact? I imagine someone is doing research on this and I’m curious about their comparisons of digital vs. conventional box-and-paper methods (I find the term “brick and mortar” gross).

  3. Achim Koh Post author

    Thanks for the input, Jojo! I will need to take some time to consider your last question. I feel it connects to the question of how far the choice of a certain medium determines the characteristics of an activity performed using it. Or, to be blunt, how to deprecate McLuhan.

    Meanwhile, some more links related to hypertext and preserving digital projects from

  4. Jojo Karlin (she/her/hers)

    Thanks for the provocation! As I write and concoct notions of what new digital scholarship should be, I keep coming back to this question of longevity. The question about sustaining digital projects is a constant consideration — and I agree that some need not live forever. Just as certain papers might fall into disuse, perhaps, digital works that do not demand reconsideration may be useful for only a short while. The push to have work accessible should not discount the natural fade to obscurity of even great ideas. I am struck by the, in my estimation laudable, projects aimed at making papers more visible and impactful. At the same time, I am interested in seeing what ways digital projects can live as scaffolding in the way that stages of writing do. Is there a way that we can make our digital output matter less in order to make the overall goals matter more?
