Cohen on Data Mining

http://www.dlib.org/dlib/march06/cohen/03cohen.html

Cohen argues that computational methods for analyzing, manipulating, and retrieving data from large corpora will provide new tools for academic research, including in the humanities. He gives two examples from projects he worked on. Syllabus Finder, a document classification tool for aggregating and searching course syllabi, finds and collects documents that show similar patterns in their use of words; it can also differentiate documents that share the same keywords by analyzing the other words they use. His second example is H-Bot, a question-answering tool that takes queries in natural language (instead of code), transforms each query using predetermined rules, and runs a web search before outputting the answer it judges most relevant.
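
To make the Syllabus Finder idea concrete, here is a minimal sketch of comparing documents by their overall word-use patterns. This is not Cohen’s actual implementation; the sample documents and the choice of scikit-learn’s TF-IDF tools are my own assumptions for illustration.

```python
# Illustration of the idea behind Syllabus Finder: documents that use
# words in similar proportions end up close together in vector space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Course syllabus: required readings, grading policy, weekly schedule",
    "Syllabus for HIST 101: readings, assignments, and grading",
    "Recipe: mix the flour, sugar, and butter, then bake for an hour",
]

# TF-IDF weights words by how distinctive they are across the corpus,
# so telling words like "syllabus" count more than common filler.
vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity: the two syllabi score much closer to
# each other than either does to the recipe.
print(cosine_similarity(vectors).round(2))
```

The same vector-space view explains the second point as well: two documents that share a keyword can still land far apart if the rest of their vocabulary differs.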

Lessons that Cohen learned while building these tools:

  • APIs are good
    • they offer possibilities for combining various resources (which facilitates the use of less rigorous but more accessible corpora)
    • third-party development can lead to unexpected and positive results
  • open resources are better than restricted ones (access makes up for quality)
  • large quantity can make up for quality

Just in case: an API is a way of making it easy for our software to get data from another piece of software (usually running on another computer, like a web server), instead of retrieving it manually. The following is one of the more concise and less technically detailed explanations I found online: https://www.quora.com/What-is-an-API
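
As a quick illustration, this is roughly what “software getting data from other software” looks like in practice. A minimal sketch: it assumes the third-party requests library is installed and uses GitHub’s public API, which answers simple unauthenticated read requests with JSON.

```python
# A minimal sketch of calling a web API: the server replies with
# structured JSON meant for programs, not an HTML page meant for humans.
import requests

# GitHub's public endpoint for repository metadata.
url = "https://api.github.com/repos/MuseumofModernArt/collection"
response = requests.get(url)
response.raise_for_status()  # stop here if the request failed

data = response.json()       # parse the JSON reply into a Python dict
print(data["description"])   # a human-readable description of the repo
print(data["updated_at"])    # when the repository was last updated
```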

Also, I feel that The Lexicon of DH workshop slides provide a good overview of the coming week’s theme.

So indeed, the use of APIs has become more common outside the IT field since 2006. The New York Public Library, the Cooper-Hewitt Museum, and the New York Times, among many others, provide APIs that allow access to their digital collections through software. MoMA publishes its collection data on GitHub.
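
That GitHub repository means the collection can be pulled straight into an analysis environment. A sketch, assuming pandas is installed; the file path, branch name, and column names are taken from the public repository and may change over time:

```python
# A sketch of loading MoMA's openly published collection data
# directly from GitHub and asking a quick aggregate question.
import pandas as pd

url = ("https://raw.githubusercontent.com/"
       "MuseumofModernArt/collection/main/Artworks.csv")
artworks = pd.read_csv(url, low_memory=False)

# Which classifications are most common in the collection?
print(artworks["Classification"].value_counts().head(10))
```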

The technology used for document searching and question answering, the two tasks Cohen’s examples address, has developed into something arguably more reliable, faster, and easier to use. For example, we don’t even need to build a tool to ask some questions in natural language:

We'll remember you, H-Bot.

Relating back to the discussions of previous weeks, what do you think are the impacts or implications that the growth of digital collections and APIs, along with developments in data collection and analysis technologies, have on teaching (or on broader aspects of life and research)? How does this fit together with more traditional modes of teaching, like textbooks?

Another question I have relates to the fact that both tools mentioned in the article are no longer functioning. The latest update on Syllabus Finder that I could find explains that a change in the Google search API effectively deprecated the tool; it also provides a download link to the database of syllabi, though only to a small part of it. H-Bot is still online, but sadly doesn’t seem able to answer me:

Oh, H-Bot.

I can easily imagine the difficulties of maintaining a digital project like these, and I am under the impression that eventually becoming outdated is the fate of many digital projects. They require a different kind of sustained effort than, say, publishing journal articles: maintenance requires manpower, and manpower requires funds. At the same time, I have the ambivalent feeling that it may not necessarily be a bad thing for some projects to finish their life cycle, though it would be great if they were somehow archived in a functioning state. I suppose I feel personally involved since I will probably build something or other during my time here, so I would love to hear your thoughts on this matter.

Some more or less related links: