Tag Archives: Wikipedia

Cohen on Data Mining

http://www.dlib.org/dlib/march06/cohen/03cohen.html

Cohen argues that computational methods for analyzing, manipulating and retrieving data from large corpora will provide new tools for academic research, including in the humanities. He offers two examples from projects he worked on. Syllabus Finder, a document classification tool for aggregating and searching course syllabi, finds and collects documents that show similar patterns in their use of words. It can also differentiate documents that share keywords by analyzing their use of other words. The other example is H-Bot, a question answering tool that takes queries in natural language (instead of code), transforms the query using predetermined rules, and conducts a web search before outputting the answer the tool decides is relevant.
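To make "similar patterns in their use of words" a bit more concrete, here is a minimal sketch of one common way to compare documents by their vocabulary: cosine similarity over word counts. This is my own illustration and not necessarily how Syllabus Finder worked; the function names and the sample texts are made up.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return a score between 0 and 1; higher means more similar word use."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    # Only words that appear in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


# Made-up sample texts, for illustration only.
SYLLABUS_A = "course syllabus readings assignments week one readings"
SYLLABUS_B = "syllabus for the course weekly readings and assignments"
RECIPE = "recipe for apple pie with sugar and butter"

if __name__ == "__main__":
    print(cosine_similarity(SYLLABUS_A, SYLLABUS_B))
    print(cosine_similarity(SYLLABUS_A, RECIPE))
```

The two syllabus-like texts score higher against each other than either does against the recipe, which is the basic intuition behind grouping documents by word use.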

Lessons that Cohen learned while building these tools:

  • APIs are good
    • they make it possible to combine various resources (which facilitates the use of less rigorous but more accessible corpora)
    • third-party development can lead to unexpected and positive results
  • open resources are better than restricted ones (access makes up for quality)
  • large quantity can make up for quality

Just in case: an API is a way for our software to get data from other software (usually on another computer, like a web server) instead of us fetching it manually. The following is one of the more concise and less technical explanations I found online: https://www.quora.com/What-is-an-API
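For those who prefer to see it in code, here is a minimal sketch of what "our software asking other software for data" tends to look like. The endpoint (api.example.org), the parameter names and the response format are all hypothetical, made up for illustration; a real API documents its own. The sketch only builds the request URL and reads a canned response, so it runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; not a real service.
BASE_URL = "https://api.example.org/collection/search"

# A made-up response standing in for what such a server might send back.
SAMPLE_RESPONSE = '{"items": [{"title": "Map of New York"}, {"title": "Subway poster"}]}'


def build_request_url(keyword, limit=5):
    """Return the URL our program would request from the (hypothetical) API."""
    return BASE_URL + "?" + urlencode({"q": keyword, "limit": limit})


def extract_titles(response_text):
    """Pull item titles out of a JSON response shaped like SAMPLE_RESPONSE."""
    data = json.loads(response_text)
    return [item["title"] for item in data["items"]]


if __name__ == "__main__":
    print(build_request_url("new york"))
    print(extract_titles(SAMPLE_RESPONSE))
```

In a real program, the URL would be sent to the server and the server's reply would take the place of SAMPLE_RESPONSE. The point is that both the question and the answer are structured so that a program, rather than a person, can handle them.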

Also, I feel that The Lexicon of DH workshop slides provide a good overview of the coming week’s theme.

So indeed, the use of APIs has become more common outside of the IT field since 2006. The New York Public Library, the Cooper-Hewitt Museum and the New York Times, among many others, provide APIs that allow access to their digital collections through software. MoMA provides its collection data on GitHub.

The technology used for document searching and question answering, the two examples that Cohen provides, has developed into something arguably more reliable, faster and easier to use. For example, we don’t even need to build a tool in order to ask some questions in natural language:

We'll remember you, H-Bot.

Relating back to the discussion of previous weeks, what do you think are the impacts or implications of the increase in digital collections and APIs, along with developments in data collection and analysis technologies, for teaching (or for broader aspects of life and research)? How does this fit together with more traditional modes of teaching, like textbooks?

Another question I have relates to the fact that both examples mentioned in the article are no longer functioning. The latest update on Syllabus Finder that I could find explains that a system change in the Google search API effectively deprecated the tool; it also provides a download link to the database of syllabi—but only a small part of it. H-Bot is online, but sadly doesn’t seem able to answer me:

Oh, H-Bot.

I can easily imagine the difficulties of maintaining such a digital project. I am also under the impression that eventually becoming outdated is the fate of many digital projects. They require a different type of effort than, say, putting out journal articles. Maintenance requires manpower, and manpower requires funds. I also have the ambivalent feeling that it is not necessarily a bad thing for some projects to finish their life cycle, though it would be great if those projects were archived somehow (in a functioning state). I guess I feel more personally involved since I will probably build something or other during my time here. I would love to hear your thoughts on this matter.

Some more or less related links:

As We May Think

Quick note:
The tech terminology at first confused me, and if you are like me the following list might help you. Please feel free to correct me if I’m wrong:

  • Photocells are light sensors. Advanced versions of these are in your smartphone and digital cameras, behind the lens.
  • Thermionic tubes = vacuum tubes. They are relatives of the incandescent light bulb (a heated filament inside an evacuated glass envelope). Along with relays, these were among the essential components of electronic circuits until transistors became popular.
  • Cathode ray tubes = CRT (old fat screens)

The article was published at a time when the industrialization we discussed regarding last week’s readings was well into its adult phase; “the humble typewriter, or the movie camera, or the automobile” are, rather than new innovations, things that “perform reliably.” In a United States that was winning the war, not without the help of mobilized scientists, Bush proposes a vision of using technology to deal with the problem of ever-increasing human knowledge that “extended far beyond our present ability to make real use of the record.” He is anticipating the use of computers in the information age that Thomas P. Hughes (2005) describes (p. 97). His picture of the “memex”, which is essentially a microfilm browser with editing and sharing functions, seems a bit different from what computers actually became, but it was 1945, and digital computing was not really a thing. What is impressive are his insights into how information should be dealt with.

Using the example of Mendel’s work not reaching potentially significant contemporary readers, Bush defines the problem faced by humans as the inability of our actual use of (scientific) knowledge to keep pace with the speed at which its records expand. According to him, three aspects of scientific records could use some improvement, and will get it: creation of new ones, storage and retrieval.

Instead of trying to guess what the next new technology will be, Bush describes in detail how the current technology could develop and be used for the above goals. Storage will be faster, easier, cheaper and smaller. Note that he emphasizes that “[c]ompression is important … when it comes to costs.” With a little stretch, his idea that smaller size will lead to massive reproduction is in a way analogous to how the shrinking of space in the railway era gave access to a much larger geographical area.

Creating new records could also become easier, through such developments as speech recognition and automated input. The automation of repetitive processes that are currently limited to arithmetic equations would extend to higher-level symbolic logics and advanced data analysis. And the access to specific data, which Bush calls selection, also could be much faster if we applied the selection process of, say, the telephone switching system and improved it using electronics. So the storage, input and retrieval of knowledge would all become faster and allow for a much larger quantity.

Then Bush pictures a device, the “memex”, that embodies the above improvements along with an additional crucial idea: association. Unlike the current indexing systems, which mostly categorize alphabetically or numerically, a new system would enable the direct connection of two or more different pieces of information. This allows for association between thoughts, which is how the human mind works, hence the title of the article. His example of the Turkish bow researcher describes pieces of knowledge that can be interconnected via a code space separate from the content, and that allow long-term storage, commenting/editing/creating on the user’s part, browsing, copying and sharing. This idea is viewed as the initial concept of hypertext, one of the main structures of the internet (Landow, 2006, p. 11).
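For the programmers among us, the idea of links that live in a space separate from the content can be sketched as a small data structure. This is only my loose reading of the memex, not anything Bush specified; the class name, method names and sample items below are all made up.

```python
from collections import defaultdict


class Memex:
    """A toy model: items, links kept apart from item content, and named trails."""

    def __init__(self):
        self.items = {}                # item id -> content
        self.links = defaultdict(set)  # item id -> ids of associated items
        self.trails = {}               # trail name -> ordered list of item ids

    def add_item(self, item_id, content):
        self.items[item_id] = content

    def link(self, id_a, id_b):
        """Associate two items directly, in both directions."""
        self.links[id_a].add(id_b)
        self.links[id_b].add(id_a)

    def add_to_trail(self, trail_name, item_id):
        """Append an item to a trail, linking it to the previous item on the trail."""
        trail = self.trails.setdefault(trail_name, [])
        if trail:
            self.link(trail[-1], item_id)
        trail.append(item_id)

    def follow(self, trail_name):
        """Return the contents along a trail, in the order they were joined."""
        return [self.items[item_id] for item_id in self.trails[trail_name]]


if __name__ == "__main__":
    memex = Memex()
    # Made-up items, loosely in the spirit of the bow researcher example.
    memex.add_item("a1", "encyclopedia article on the bow")
    memex.add_item("a2", "history book passage")
    memex.add_item("n1", "the researcher's own note")
    for item_id in ("a1", "a2", "n1"):
        memex.add_to_trail("bows", item_id)
    print(memex.follow("bows"))
```

Because the links and trails are stored apart from the items, the same item can sit on many trails without being copied or altered, which is roughly what a hyperlink does for a web page.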

His “new forms of encyclopedias” filled with “a mesh of associative trails” seem, incredibly, to be describing Wikipedia. He expresses the hope that humankind would be able to stand on the shoulders of giants and go beyond its application of control over the environment and war against each other, in order to “grow in the wisdom of race experience.”

The article portrays several ideas that we can associate with current things: the hypertext and links, of course, but also here and there we find mentions of potential Google Glass, Siri and big data analysis. Just before the end of the article, we can also peep at Bush’s version of cyborg future, where information could be transmitted to and from the brain directly using electric signals rather than being translated to sensory phenomena; this sounds like his sci-fi imagination, which he has been suppressing throughout the article, finally going off… But in a sense this also has been realized: not exactly (well, not yet) by connecting wires to the nervous system, but by the vast network of computers and the digitization of all information.

  • It is worth noting his limited use of female words, only associated with certain jobs: stenotypist, typist, “simple key board punches” operators, and (not specified but probably) file clerk. This seems almost like a repeat of something that happened in Marx’s era: as machines enter the labor space, so do women, but not on equal terms. A further interesting point is that, as Wendy Chun (2004) points out, “computers” in the early 20th century referred to human operators of the machines, mostly young women; “they were also considered to be better, more conscientious computers, presumably because they were better at repetitious, clerical tasks” (p. 33).
  • Bush was administrator for the wartime U.S. military R&D, which I can’t imagine had no influence on his ideas regarding inefficiently increasing knowledge. The initial version of the internet was funded by the U.S. Department of Defense. The steam engine came out of an industrial need, like a lot of innovations happening in the tech industry today. Although this might be a rough statement, I don’t think I am too far off target in arguing that a lot of the initiative for technological change comes from either the military or industry. What are the implications here? How relevant are the sources of technological change?
  • Are we better off with the internet? I mean, I love the internet. But more globally, what would be the implications? While there are claims that hypertext, as a system that allows for easier participation in creative activities and dissolves the boundaries between author and reader, “has the potential … to be a democratic or multicentered system” (Landow, 2006, p. 343), last class we also talked about how some corporations are exploiting the type of collaborative and/or voluntary work that the medium enables us to do. Is capital intercepting the ‘revolutionary potential’ and using it for profit?
  • Are we any wiser? To be more specific, has our ability to process information caught up with the ever-increasing rate of knowledge production? Or are we being disoriented by the influx of information? In the case of the latter, is it a transitional thing, just as the coach travelers were disoriented by the speed of the railway?

Also, this video featuring Douglas Engelbart, who was inspired by Bush, might be interesting to watch alongside the article. It is perhaps most famous for the use of a computer mouse, but it also introduces important features of computers that now seem so natural, including the hypertext.

https://en.wikipedia.org/wiki/The_Mother_of_All_Demos

Citations:

Wikipedia Assignment

Collaboration and Wikipedia: Collaboratively write one well-cited paragraph of literature review that traces the reception of one of the texts from the first few weeks of class. Work only on wiki, communicating via talk pages. Draft your paragraph in the sandbox, and add it in only once completed.

  • Do Androids Dream of Electric Sheep?
  • Blade Runner
  • Cyborg Manifesto
  • How We Became Posthuman:
    • https://en.wikipedia.org/wiki/Posthumanism
    • https://en.wikipedia.org/wiki/N._Katherine_Hayles
  • https://en.wikipedia.org/wiki/Lisa_Nakamura

Groups of two or three will be assigned September 21st. The improved article is due October 13th (this is the day after a holiday). Turn your assignment in by posting a comment on both of our talk pages.