Hamerman
Pirate Libraries and the Fight for Open Information
2015

SEPTEMBER 11TH, 2015 | A BI-WEEKLY WEBPAPER | ISSUE 61

PIRATE LIBRARIES and the fight for open information
/ by _Sarah Hamerman_

In a digital era that destabilizes traditional notions of intellectual
property, cultural producers must rethink information access.

Over the last several years, a number of _pirate libraries_ have done just
that. Collaboratively run digital libraries such as
[_Aaaaaarg_](http://aaaaarg.fail/),
_[Monoskop](http://www.monoskop.org/Monoskop)_ , _[Public
Library](https://www.memoryoftheworld.org/)_ , and
_[UbuWeb](http://www.ubuweb.tv/)_ have emerged, offering access to humanities
texts and audiovisual resources that are technically ‘pirated’ and often hard
to find elsewhere.

Though these sites differ somewhat in content, architecture, and ideological
bent, all of them disavow intellectual copyright law to varying degrees,
offering up pirated books and media with the aim of advancing information
access.

“Information wants to be free” has served as a catchphrase of recent internet
activism: a call for information democracy led by media, library and
information advocates.

As online information access is increasingly embedded within the networks of
capital, the digital text-sharing underground actualizes the Internet’s
potential to build a true information commons.

With such projects, the archive becomes a record of collective power, not
corporate or state power; the digital book becomes unlocked, linkable, and
shareable.

Still, these sites comprise but a small subset of the networks of peer-to-peer
file sharing. Many legal battles have been waged over the explosion of
audiovisual file sharing through p2p services such as Napster, BitTorrent and
MediaFire. At its peak, Napster boasted over 80 million users; the p2p
music-sharing service was shut down after a high-profile lawsuit by the RIAA
in 2001.

The US Department of Justice brought charges against open access activist
_[Aaron Swartz](http://www.fvckthemedia.com/issue51/editorial)_ in 2011 for
his large-scale unauthorized downloading of files from the JSTOR academic
database. Swartz, who sadly committed suicide before his trial, was an
organizer for Demand Progress, a campaign against the Stop Online Piracy Act,
which was defeated in 2012. Swartz’s actions and the fight around SOPA
represent a benchmark in the struggle for open-access and anti-copyright
practices surrounding the digital book.

Aaaaaarg, Monoskop, UbuWeb and Public Library are representative cases of the
pirate library because of their explicit engagement with archival form, their
embrace of ideas of the _[digital commons](https://en.wikipedia.org/wiki/Digital_Commons)_ within current left-leaning thought, and their like-minded focus on critical theory and the arts.

All of these projects lend themselves to being considered _as libraries_ ,
retooled for open digital networks.

_Aaaaaarg.org_ , started by Los Angeles-based artist Sean Dockray, hosts
full-text PDFs of over 50,000 books and articles. The library is connected to an
alternative education project called the Public School, which serves as a
platform for self-organizing lectures, workshops and projects in cities across
the globe. _Aaaaaarg_ ’s catalog is viewable by the public, but
upload/download privileges are restricted through an invite system, thus
circumventing copyright law.

![](http://i.imgur.com/rbdvPIG.png)

The site is divided into a “Library,” in which users can search for texts by
author; “Collections,” or user-generated groupings of texts designed for
reading groups or research interests; and “Discussions,” a message board where
participants can request texts and volunteer for working groups. Most
recently, _Aaaaaarg_ has introduced a “compiler” tool that allows readers to
select excerpts from longer texts and assemble them into new PDFs, and a
reading tool that allows readers to save reference points and insert comments
into texts. Though the library is easily searchable, it doesn’t maintain
high-quality _[metadata](https://en.wikipedia.org/wiki/Metadata)_. Dockray and
other organizers intend to preserve a certain subjective and informal quality,
focusing more on discussion and collaboration than correct preservation and
classification practice.

_Aaaaaarg_ has been threatened with takedowns a few times, but has survived by
creating mirrored sites and reconstituting itself by varying the number of A’s
in the URL. Its shifts in location, organization, and capabilities reflect
both the decentralized, ad-hoc nature of its maintenance and the organizers’
attempts to elude copyright regulations. Text-sharing sites such as _Aaaaaarg_
have also been referred to as _[shadow
libraries](http://supercommunity.e-flux.com/texts/sharing-instinct/)_ ,
reflecting their quasi-covert status and their efforts to evade shutdown.

Monoskop.org, a project founded by media artist _[Dušan
Barok](http://monoskop.org/Du%C5%A1an_Barok)_ , is a wiki for collaborative
studies of art, media and the humanities that was born in 2004 out of Barok’s
study of media art and related cultural practices. Its significant holdings -
about 3,000 full-length texts and many more excerpts, links and citations -
include avant-garde and modernist magazines, writings on sound art, scanned
illustrations, and media theory texts.

Because Monoskop is a wiki, any user can edit any article or upload content,
and see their changes reflected immediately. Monoskop comprises two sister
sites: the Monoskop wiki and Monoskop Log, the accompanying text repository.
Monoskop Log is structured as a WordPress site with links hosted on third-party
sites, much
like the rare-music download blogs that became popular in the mid-2000s.
Though this architecture is relatively unstable, links are fixed on-demand and
site mirroring and redundancy balance out some of the instability.

Monoskop makes clear that it is offering content under the fair-use doctrine
and that this content is for personal and scholarly use, not commercial use.
Barok notes that though there have been a small number of takedowns, people
generally appreciate unrestricted access to the types of materials in Monoskop
Log, whether they are authors or publishers.

_Public Library_ , a somewhat newer pirate library founded by Croatian
Internet activist and researcher Marcell Mars and his collaborators, currently
offers a collection of about 6,300 texts. The project frames itself through a
utopian philosophy of building a truly universal library, radically extending
enlightenment-era conceptions of democracy. Through democratizing the _tools
of librarianship_ – book scanning, classification systems, cataloging,
information – it promises a broader, de-institutionalized public library.

In _[Public Library: An
Essay](https://www.memoryoftheworld.org/blog/2014/10/27/public-library-an-essay/#sdendnote19sym)_ , Public Library’s organizers frame p2p libraries as
“fragile knowledge infrastructures built and maintained by brave librarians
practicing civil disobedience which the world of researchers in the humanities
rely on.” This civil disobedience is a politically motivated refutation of
intellectual property law and the orientation of information networks toward
venture capital and advertising. While the pirate libraries fulfill this
dissident function as a kind of experimental provocation, their content is
audience-specific rather than universal.

_[UbuWeb](http://www.ubuweb.com/resources/index.html)_ , founded in 1996 by
conceptual artist/writer Kenneth Goldsmith, is the largest online archive of
avant-garde art resources. Its holdings include sound, video and text-based
works dating from the historical avant-garde era to today. While many of the
sites in the “pirate library” continuum source their content through
community-based or peer-to-peer models, UbuWeb focuses on making available
out-of-print, obscure or difficult-to-access artistic media, stating that
uploading such historical artifacts doesn’t detract from the physical value of
the work; rather, it enhances it. The website’s philosophy blends the utopian
ideals of avant-garde concrete poetry with the ideals of the digital gift
economy, and it has specifically refused to accept corporate or foundation
funding or adopt a more market-oriented business model.

![](http://i.imgur.com/pHdiL9S.png)

**Pirate Libraries vs. “The Sharing Economy”**

In pirate libraries, information users become archive builders by uploading
often-copyrighted content to shared networks.

Within the so-called “ _[sharing
economy](https://en.wikipedia.org/wiki/Sharing_economy)_ ,” users essentially
lease e-book content from information corporations such as Amazon, which
markets the Kindle as a platform. This centralization of intellectual
property has dire impacts on the openness of the digital book as a
collaborative knowledge-sharing device.

In contrast, the pirate library actualizes a gift economy based on qualitative
and communal rather than monetized exchange. As McKenzie Wark writes in _A
Hacker Manifesto_ (2004), “The gift is marginal, but nevertheless plays a
vital role in cementing reciprocal and communal relations among people who
otherwise can only confront each other as buyers and sellers of commodities.”

From theorizing new media art to building solidarity against repressive
regimes, such communal information networks can crucially articulate shared
bodies of political and aesthetic desire and meaning. According to author
Matthew Stadler, literature is by nature communal. “Literature is not owned,”
he writes. “It is, by definition, a space of mutually negotiated meanings that
never closes or concludes, a space that thrives on — indeed requires — open
access and sharing.”

In a roundtable discussion published in _New Formations_ , _Aaaaaarg_ founder
Sean Dockray remarks that the site “actively explored and exploited the
affordances of asynchronous, networked communication,” functioning upon the
logic of the hack. Dockray continues: “But all of this is rather commonplace
for what’s called ‘piracy,’ isn’t it?” Pirate librarianship can be thought of
as a practice of civil disobedience within the stringent information
environment of today.

These projects promise both the realization and destruction of the public
library. They promote information democracy while calling the _professional_
institution of the Library into question, allowing amateurs to upload,
catalog, lend and maintain collections. In _Public Library: An Essay_ , Public
Library’s organizers _[write](https://www.memoryoftheworld.org/blog/2014/10/27/public-library-an-essay/)_ : “With the emergence of the internet…
librarianship has been given an opportunity… to include thousands of amateur
librarians who will, together with the experts, build a distributed peer-to-peer network to care for the catalog of available knowledge.”

Public Library frames amateur librarianship as a free, collaboratively
maintained and democratic activity, drawing upon the language of the French
Revolution and extending it for the 21st century. While these practices are
democratic in form, they are not necessarily democratic in the populist sense;
rather, they focus on bringing high theoretical discourses to people outside
the academy. Accordingly, they attract a modest but engaged audience of
critics, artists, designers, activists, and scholars.

The activities of Aaaaaarg and Public Library may fall closer to ‘ _[peer
preservation](http://computationalculture.net/article/book-piracy-as-peer-preservation)_ ’
than ‘peer production,’ as the desires to share information
widely and to preserve these collections against shutdown often come into
conflict. In a _[recent piece](http://supercommunity.e-flux.com/texts/sharing-instinct/)_ for e-flux coauthored with Lawrence Liang, Dockray accordingly
laments “the unfortunate fact that digital shadow libraries have to operate
somewhat below the radar: it introduces a precariousness that doesn’t allow
imagination to really expand, as it becomes stuck on techniques of evasion,
distribution, and redundancy.”

![](http://i.imgur.com/KFe3chu.png)

UbuWeb and Monoskop, which digitize rare, out-of-print art texts and media
rather than in-print titles, can be said to fulfill the aims of preservation
and access. UbuWeb and Monoskop are openly used and discussed as classroom
resources and in online arts journalism more frequently than the more
aggressively anti-copyright sources; more on-the-record and mainstream
visibility likely, though not necessarily, equates to wider usage.

**From Alternative Space to Alternative Media**

Aaaaaarg _[locates itself as a
‘scaffolding’](http://chtodelat.org/b9-texts-2/vilensky/materialities-of-independent-publishing-a-conversation-with-aaaaarg-chto-delat-i-cite-mute-and-neural/)_ between institutions, a platform that unfolds between institutional
gaps and fills them in, rather than directly opposing them. Over ten years
after it was founded, it continues to provide a community for “niche”
varieties of political critique.

Drawing upon different strains of ‘alternative networking,’ the digital
text-sharing underground gives a voice to those quieted by the mechanisms of
institutional archives, publishing, and galleries. On the one hand, pirate
libraries extend the logic of alternative art spaces/artist-run spaces that
challenge the “white cube” and the art market; instead, they showcase ways of
making that are often ephemeral, performative, and anti-commercial.

Lawrence Liang refers to projects such as Aaaaaarg as “ _[ludic
libraries](http://supercommunity.e-flux.com/texts/sharing-instinct/)_ ,” as
they encourage a sense of intellectual play that deviates from
well-established norms of utility, seriousness, purpose, and property.

Just as alternative, community-oriented art spaces promote “fringe” art forms,
the pirate libraries build upon open digital architectures to promote “fringe”
scholarship, art, technological and archival practices. Though the comparison
between physical architecture and virtual architecture is a metaphor here, the
impact upon creative communities runs parallel.

At the same time, the digital text-sharing underground builds upon Robert W.
McChesney’s calls in _Digital Disconnect_ for a democratic media system that
promotes the expansion of public, student and community journalism. A truly
heterogeneous media system, for McChesney, would promote a multiplicity of
opinions, supplementing for-profit mass media with a substantial and varied
portion of nonprofit and independent media.

In order to create a political system – and a media system – that reflects
multiple interests, rather than the supposedly neutral status quo, we must
support truly free, not-for-profit alternatives to corporate journalism and
“clickbait” media designed to lure traffic for advertisers. We must support
creative platforms that encourage blending high-academic language with pop
culture; quantitative analysis with art-making; appropriation with
authenticity: the pirate libraries serve just these purposes.

Pirate libraries help bring about what Gary Hall calls the “unbound book” as
text-form; as he writes, we can perceive such a digital book “as liquid and
living, open to being continually updated and collaboratively written, edited,
annotated, critiqued, updated, shared, supplemented, revised, re-ordered,
reiterated and reimagined.” These projects allow us to re-imagine both
archival practices and the digital book for social networks based on the gift.

Aaaaaarg, Monoskop, UbuWeb, and Public Library build a record of critical and
artistic discourse that is held in common, user-responsive and networkable.
Amateur librarians sustain these projects through technological ‘hacks’ that
innovate upon present archival tools and push digital preservation practices
forward.

Pirate libraries critique the ivory tower’s monopoly over the digital book.
They posit a space where alternative communities can flourish.

Between the cracks of the new information capital, the digital text-sharing
underground fosters the coming-into-being of another kind of information
society, one in which the historical record is the democratically-shared basis
for new forms of knowledge.

From this we should take away the understanding that _piracy is normal_ and
the public domain it builds is abundant. While these practices will continue
just beneath the official surface of the information economy, it is high time
for us to demand that our legal structures catch up.

Murtaugh
A bag but is language nothing of words
2016

## A bag but is language nothing of words

### From Mondotheque

##### (language is nothing but a bag of words)

[Michael Murtaugh](/wiki/index.php?title=Michael_Murtaugh "Michael Murtaugh")

In text indexing and other machine reading applications the term "bag of
words" is frequently used to underscore how processing algorithms often
represent text using a data structure (word histograms or weighted vectors)
where the original order of the words in sentence form is stripped away. While
"bag of words" might well serve as a cautionary reminder to programmers of the
essential violence perpetrated to a text and a call to critically question the
efficacy of methods based on subsequent transformations, the expression's use
seems in practice more like a badge of pride or a schoolyard taunt that would
go: Hey language: you're nothin' but a big BAG-OF-WORDS.

## Bag of words

In information retrieval and other so-called _machine-reading_ applications
(such as text indexing for web search engines) the term "bag of words" is used
to underscore how in the course of processing a text the original order of the
words in sentence form is stripped away. The resulting representation is then
a collection of each unique word used in the text, typically weighted by the
number of times the word occurs.
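
As a minimal sketch of the transformation just described, the following Python reduces a text to a word histogram. The function name `bag_of_words` and the crude tokenizing rule (lowercase runs of letters and apostrophes) are assumptions made for illustration; real indexing pipelines tokenize in many different ways.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a histogram mapping each unique word to the number of times it occurs."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    # The position of each word in the sentence is discarded at this point.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("A rose is a rose is a rose")
print(bag)
```

What remains is only which words were used and how often; nothing in `bag` records which word followed which.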

Bags of words, also known as word histograms or weighted term vectors, are a
standard part of the data engineer's toolkit. But why such a drastic
transformation? The utility of "bag of words" is in how it makes text amenable
to code, first in that it's very straightforward to implement the translation
from a text document to a bag of words representation. More significantly,
this transformation then opens up a wide collection of tools and techniques
for further transformation and analysis purposes. For instance, a number of
libraries available in the booming field of "data sciences" work with "high
dimension" vectors; bag of words is a way to transform a written document into
a mathematical vector where each "dimension" corresponds to the (relative)
quantity of each unique word. While physically unimaginable and abstract
(imagine each of Shakespeare's works as points in a 14 million dimensional
space), from a formal mathematical perspective, it's quite a comfortable idea,
and many complementary techniques (such as principal component analysis) exist
to reduce the resulting complexity.
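
To make the "vector" idea concrete, here is a small sketch in which two short texts become points in a space with one dimension per unique word. The helper names `to_vector` and `cosine`, the sample texts, and the choice of cosine similarity as the comparison are illustrative assumptions, not something the techniques named above prescribe.

```python
import math
from collections import Counter


def to_vector(bag, vocabulary):
    """Lay a bag of words out as a list of counts, one slot per vocabulary word."""
    return [bag.get(word, 0) for word in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two toy "documents" (hypothetical examples).
bags = {
    "d1": Counter("to be or not to be".split()),
    "d2": Counter("not to be outdone".split()),
}

# The shared vocabulary fixes the dimensions: here five words, five dimensions.
vocabulary = sorted(set().union(*bags.values()))
vectors = {name: to_vector(bag, vocabulary) for name, bag in bags.items()}

print(vocabulary)
print(vectors)
print(cosine(vectors["d1"], vectors["d2"]))
```

With a vocabulary of millions of distinct words the lists simply become millions of entries long (and mostly zeros), which is why dimension-reducing techniques matter in practice.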

What's striking about a bag of words representation, given its centrality in so
many text retrieval applications, is its irreversibility. Producing the original
text from a bag of words representation would require in essence the "brain" of
a writer to recompose sentences, working with the patience of a devoted
cryptogram puzzler to draw from the precise stock of available words. While "bag
of words" might well serve as a cautionary reminder to programmers of the
essential violence perpetrated to a text and a call to critically question the
efficacy of methods based on subsequent transformations, the expression's use
seems in practice more like a badge of pride or a schoolyard taunt that would
go: Hey language: you're nothing but a big BAG-OF-WORDS. Following this spirit
of the term, "bag of words" celebrates a perfunctory step of "breaking" a text
into a purer form amenable to computation, of stripping language of its silly
redundant repetitions and foolishly contrived stylistic phrasings to reveal a
purer inner essence.
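
The irreversibility can be seen with this essay's own title. In the sketch below (the helper names `bag_of_words` and `possible_orderings` are assumptions for illustration), the scrambled title and the phrase it scrambles yield exactly the same bag, and a count of the distinct word orders that share that bag shows how much the representation leaves undecided.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word histogram of a text, ignoring case and word order."""
    return Counter(text.lower().split())


def possible_orderings(bag):
    """Number of distinct word sequences that would produce this same bag."""
    # Multiset permutations: n! divided by the factorial of each word's count.
    total = math.factorial(sum(bag.values()))
    for count in bag.values():
        total //= math.factorial(count)
    return total


title = "A bag but is language nothing of words"
phrase = "language is nothing but a bag of words"

print(bag_of_words(title) == bag_of_words(phrase))   # True
print(possible_orderings(bag_of_words(phrase)))      # 40320
```

Eight distinct words already allow 40,320 orderings; picking the one that was actually written is the part that needs a reader.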

## Book of words

Lieber's Standard Telegraphic Code, first published in 1896 and republished in
various updated editions through the early 1900s, is an example of one of
several competing systems of telegraph code books. The idea was for both
senders and receivers of telegraph messages to use the books to translate
their messages into a sequence of code words which could then be sent for less
money, as telegraph messages were charged by the word. In the front of the book, a
list of examples gives a sampling of how messages like: "Have bought for your
account 400 bales of cotton, March delivery, at 8.34" can be conveyed by a
telegram with the message "Ciotola, Delaboravi". In each case the reduction of
number of transmitted words is highlighted to underscore the efficacy of the
method. Like a dictionary or thesaurus, the book is primarily organized around
key words, such as _act_ , _advice_ , _affairs_ , _bags_ , _bail_ , and
_bales_ , under which exhaustive lists of useful phrases involving the
corresponding word are provided in the main pages of the volume. [1]
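
The economy of the code book can be sketched as a simple lookup table. The message and the two code words below are the example quoted above; how the message divides between "Ciotola" and "Delaboravi" is an assumption made purely for illustration (the source gives only the pair as a whole), and counting whitespace-separated words only approximates how a telegraph company would have billed.

```python
# Toy code book. HYPOTHETICAL: the split of the message between the two
# code words is invented for this sketch, not taken from Lieber's book.
CODEBOOK = {
    "Ciotola": "Have bought for your account 400 bales of cotton",
    "Delaboravi": "March delivery, at 8.34",
}


def decode(telegram):
    """Expand a comma-separated sequence of code words into plain phrases."""
    code_words = [word.strip() for word in telegram.split(",")]
    return ", ".join(CODEBOOK[word] for word in code_words)


def billed_words(message):
    """Approximate the billable length as the number of space-separated words."""
    return len(message.split())


telegram = "Ciotola, Delaboravi"
plain = decode(telegram)

print(plain)
print(billed_words(plain), "words reduced to", billed_words(telegram))
```

Both parties need the same edition of the book for `decode` to work, which is part of why competing code systems mattered commercially.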

[![Liebers
P1016847.JPG](/wiki/images/4/41/Liebers_P1016847.JPG)](/wiki/index.php?title=File:Liebers_P1016847.JPG)

[![Liebers
P1016859.JPG](/wiki/images/3/35/Liebers_P1016859.JPG)](/wiki/index.php?title=File:Liebers_P1016859.JPG)

[![Liebers
P1016861.JPG](/wiki/images/3/34/Liebers_P1016861.JPG)](/wiki/index.php?title=File:Liebers_P1016861.JPG)

[![Liebers
P1016869.JPG](/wiki/images/f/fd/Liebers_P1016869.JPG)](/wiki/index.php?title=File:Liebers_P1016869.JPG)

> [...] my focus in this chapter is on the inscription technology that grew
parasitically alongside the monopolistic pricing strategies of telegraph
companies: telegraph code books. Constructed under the bywords “economy,”
“secrecy,” and “simplicity,” telegraph code books matched phrases and words
with code letters or numbers. The idea was to use a single code word instead
of an entire phrase, thus saving money by serving as an information
compression technology. Generally economy won out over secrecy, but in
specialized cases, secrecy was also important.[2]

In Katherine Hayles' chapter devoted to telegraph code books she observes how:

> The interaction between code and language shows a steady movement away from
a human-centric view of code toward a machine-centric view, thus anticipating
the development of full-fledged machine codes with the digital computer. [3]

[![Liebers
P1016851.JPG](/wiki/images/1/13/Liebers_P1016851.JPG)](/wiki/index.php?title=File:Liebers_P1016851.JPG)
Aspects of this transitional moment are apparent in a notice prominently
inserted in Lieber's code book:

> After July, 1904, all combinations of letters that do not exceed ten will
pass as one cipher word, provided that it is pronounceable, or that it is
taken from the following languages: English, French, German, Dutch, Spanish,
Portuguese or Latin -- International Telegraphic Conference, July 1903 [4]

Conforming to international conventions regulating telegraph communication at
that time, the stipulation that code words be actual words drawn from a
variety of European languages (many of Lieber's code words are indeed
arbitrary Dutch, German, and Spanish words) underscores this particular moment
of transition as reference to the human body in the form of "pronounceable"
speech from representative languages begins to yield to the inherent potential
for arbitrariness in digital representation.

What telegraph code books do is remind us of the relation of language in
general to economy. Whether the economies are those of memory, attention, costs
paid to a telecommunications company, or computer processing time
or storage space, encoding language or knowledge in any form of writing is a
form of shorthand and always involves an interplay with what one expects to
perform or "get out" of the resulting encoding.

> Along with the invention of telegraphic codes comes a paradox that John
Guillory has noted: code can be used both to clarify and occlude. Among the
sedimented structures in the technological unconscious is the dream of a
universal language. Uniting the world in networks of communication that
flashed faster than ever before, telegraphy was particularly suited to the
idea that intercultural communication could become almost effortless. In this
utopian vision, the effects of continuous reciprocal causality expand to
global proportions capable of radically transforming the conditions of human
life. That these dreams were never realized seems, in retrospect, inevitable.
[5]

[![Liebers
P1016884.JPG](/wiki/images/9/9c/Liebers_P1016884.JPG)](/wiki/index.php?title=File:Liebers_P1016884.JPG)

[![Liebers
P1016852.JPG](/wiki/images/7/74/Liebers_P1016852.JPG)](/wiki/index.php?title=File:Liebers_P1016852.JPG)

[![Liebers
P1016880.JPG](/wiki/images/1/11/Liebers_P1016880.JPG)](/wiki/index.php?title=File:Liebers_P1016880.JPG)

Far from providing a universal system of encoding messages in the English
language, Lieber's code is quite clearly designed for the particular needs and
conditions of its use. In addition to the phrases ordered by keywords, the
book includes a number of tables of terms for specialized use. One table lists
a set of words used to describe all possible permutations of numeric grades of
coffee (Choliam = 3,4, Choliambos = 3,4,5, Choliba = 4,5, etc.); another table
lists pairs of code words to express the respective daily rise or fall of the
price of coffee at the port of Le Havre in increments of a quarter of a Franc
per 50 kilos ("Chirriado = prices have advanced 1 1/4 francs"). From an
archaeological perspective, Lieber's code book reveals a cross section of
the needs and desires of early 20th century business communication between the
United States and its trading partners.

The advertisements lining Lieber's code book further situate its use and
that of commercial telegraphy. Among the many advertisements for banking and
law services, office equipment, and alcohol are several ads for gunpowder and
explosives, drilling equipment and metallurgic services, all with specific
applications to mining. Building on telegraphy's formative role in
ship-to-shore and ship-to-ship communication for reasons of safety, commercial
telegraphy extended this network of communication to include those parties
coordinating the "raw materials" being mined, grown, or otherwise extracted
from overseas sources and shipped back for sale.

## "Raw data now!"

From [La ville intelligente - Ville de la connaissance](/wiki/index.php?title=La_ville_intelligente_-_Ville_de_la_connaissance "La ville intelligente - Ville de la connaissance"):

Étant donné que les nouvelles formes modernistes et l'utilisation de matériaux
propageaient l'abondance d'éléments décoratifs, Paul Otlet croyait en la
possibilité du langage comme modèle de « [données
brutes](/wiki/index.php?title=Bag_of_words "Bag of words") », le réduisant aux
informations essentielles et aux faits sans ambiguïté, tout en se débarrassant
de tous les éléments inefficaces et subjectifs.

From [The Smart City - City of Knowledge](/wiki/index.php?title=The_Smart_City_-_City_of_Knowledge "The Smart City - City of Knowledge"):

As new modernist forms and use of materials propagated the abundance of
decorative elements, Otlet believed in the possibility of language as a model
of '[raw data](/wiki/index.php?title=Bag_of_words "Bag of words")', reducing
it to essential information and unambiguous facts, while removing all
inefficient assets of ambiguity or subjectivity.

> Tim Berners-Lee: [...] Make a beautiful website, but first give us the
unadulterated data, we want the data. We want unadulterated data. OK, we have
to ask for raw data now. And I'm going to ask you to practice that, OK? Can
you say "raw"?

>

> Audience: Raw.

>

> Tim Berners-Lee: Can you say "data"?

>

> Audience: Data.

>

> TBL: Can you say "now"?

>

> Audience: Now!

>

> TBL: Alright, "raw data now"!

>

> [...]

>

> So, we're at the stage now where we have to do this -- the people who think
it's a great idea. And all the people -- and I think there's a lot of people
at TED who do things because -- even though there's not an immediate return on
the investment because it will only really pay off when everybody else has
done it -- they'll do it because they're the sort of person who just does
things which would be good if everybody else did them. OK, so it's called
linked data. I want you to make it. I want you to demand it. [6]

## Un/Structured

As graduate students at Stanford, Sergey Brin and Lawrence (Larry) Page had an
early interest in producing "structured data" from the "unstructured" web. [7]

> The World Wide Web provides a vast source of information of almost all
types, ranging from DNA databases to resumes to lists of favorite restaurants.
However, this information is often scattered among many web servers and hosts,
using many different formats. If these chunks of information could be
extracted from the World Wide Web and integrated into a structured form, they
would form an unprecedented source of information. It would include the
largest international directory of people, the largest and most diverse
databases of products, the greatest bibliography of academic works, and many
other useful resources. [...]

>

> **2.1 The Problem**
> Here we define our problem more formally:
> Let D be a large database of unstructured information such as the World
Wide Web [...] [8]

In a paper titled _Dynamic Data Mining_ Brin and Page describe their research
as a search for _rules_ (statistical correlations) between words used in web
pages. The "baskets" they mention stem from "market basket" techniques
developed to find correlations between the items recorded in the purchase
receipts of supermarket customers. In their case, they deal with web pages
rather than shopping baskets, and words instead of purchases. In transitioning
to the much larger scale of the web, they describe the usefulness of their
research in terms of its computational economy, that is, the ability to tackle
the scale of the web and still complete the task in a reasonably short amount
of time using contemporary computing power.

> A traditional algorithm could not compute the large itemsets in the lifetime
of the universe. [...] Yet many data sets are difficult to mine because they
have many frequently occurring items, complex relationships between the items,
and a large number of items per basket. In this paper we experiment with word
usage in documents on the World Wide Web (see Section 4.2 for details about
this data set). This data set is fundamentally different from a supermarket
data set. Each document has roughly 150 distinct words on average, as compared
to roughly 10 items for cash register transactions. We restrict ourselves to a
subset of about 24 million documents from the web. This set of documents
contains over 14 million distinct words, with tens of thousands of them
occurring above a reasonable support threshold. Very many sets of these words
are highly correlated and occur often. [9]
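
To illustrate what treating pages as "baskets" of words means, the sketch below counts how often pairs of words appear together and keeps those at or above a support threshold. This is the naive, brute-force counting that market basket analysis starts from, not the algorithm Brin and Page propose; the function name `frequent_pairs`, the three toy pages, and the threshold are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count co-occurring item pairs and keep those seen in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # A basket is a set: duplicates and order within a page are ignored.
        items = sorted(set(basket))
        pair_counts.update(combinations(items, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Each "basket" is the list of words on one (hypothetical) web page.
pages = [
    "raw data now".split(),
    "linked data now".split(),
    "raw linked data".split(),
]

pairs = frequent_pairs(pages, min_support=2)
print(pairs)
```

A page with 150 distinct words contributes over 11,000 pairs on its own, and larger itemsets grow far faster, which gives a sense of why the quoted passage treats scale as the central difficulty.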

## Un/Ordered

In programming, I've encountered a recurring "problem" that's quite
symptomatic. It goes something like this: you (the programmer) have managed to
cobble out a lovely "content management system" (either from scratch, or using
any number of helpful frameworks) where your user can enter some "items" into
a database, for instance to store bookmarks. After this ordered items are
automatically presented in list form (say on a web page). The author: It's
great, except... could this bookmark come before that one? The problem stems
from the fact that the database ordering (a core functionality provided by any
database) somehow applies a sorting logic that's almost but not quite right. A
typical example is the sorting of names where details (where to place a name
that starts with a Norwegian "Ø" for instance), are language-specific, and
when a mixture of languages occurs, no single ordering is necessarily
"correct". The (often) exasperated programmer might hastily add an additional
database field so that each item can also have an "order" (perhaps in the form
of a date or some other kind of (alpha)numerical "sorting" value) to be used
to correctly order the resulting list. Now the author has a means, awkward and
indirect but workable, to control the order of the presented data on the start
page. But one might well ask, why not just edit the resulting listing as a
document? Not possible! Contemporary content management systems are based on a
data flow from a "pure" source of a database, through controlling code and
templates to produce a document as a result. The document isn't the data, it's
the end result of an irreversible process. This problem, in this and many
variants, is widespread and reveals an essential backwardness in a
particular "computer scientist" mindset about what constitutes "data", and
in particular its relationship to order, one that turns what might be a
straightforward question of editing a document into an over-engineered
database.
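
The workaround described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Bookmark` class, its `order` field and the `listing` function are hypothetical names, and the "database ordering" is stood in for by Python's default string comparison, which sorts by Unicode code point and therefore places "Ø" after "Z".

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    title: str
    # Hypothetical extra field, added only so the author can override the sort.
    order: int = 0


def listing(bookmarks):
    """Produce the presented list: manual order first, then default string order."""
    ranked = sorted(bookmarks, key=lambda b: (b.order, b.title))
    return [b.title for b in ranked]


bookmarks = [Bookmark("Zebra"), Bookmark("Ørsted"), Bookmark("Olsen")]

# Default ordering by code point: "Ø" lands after "Z".
default_listing = listing(bookmarks)

# The author wants "Ørsted" first. The listing itself cannot be edited;
# the only handle is the indirect "order" value stored with the item.
bookmarks[1].order = -1
edited_listing = listing(bookmarks)
```

The list the author sees is regenerated from the stored values each time, so a change of position has to be expressed as a change to the data rather than to the document.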

While I was recently working with Nikolaos Vogiatzis, whose research explores
playful and radically subjective alternatives to the list, Vogiatzis was
struck by how the earliest specifications of HTML (still valid today) have
separate elements (OL and UL) for "ordered" and "unordered" lists.

> The representation of the list is not defined here, but a bulleted list for
unordered lists, and a sequence of numbered paragraphs for an ordered list
would be quite appropriate. Other possibilities for interactive display
include embedded scrollable browse panels. [10]

Vogiatzis' surprise lay in the idea of a list ever being considered
"unordered" (or in opposition to the language used in the specification, for
order to ever be considered "insignificant"). Indeed in its suggested
representation, still followed by modern web browsers, the only difference
between the two visually is that UL items are preceded by a bullet symbol,
while OL items are numbered.

The idea of ordering runs deep in programming practice where essentially
different data structures are employed depending on whether order is to be
maintained. The indexes of a "hash" table, for instance (also known as an
associative array), are ordered in an unpredictable way governed by a
representation's particular implementation. This data structure, extremely
prevalent in contemporary programming practice, sacrifices order to offer
other kinds of efficiency (fast text-based retrieval for instance).
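
A small Python sketch may make the trade-off concrete. The word lists are invented for illustration, and Python's `set` is used as the hash-based structure (Python's `dict` has preserved insertion order since version 3.7, so it would not show the point as plainly).

```python
# Hypothetical word sequences: the same words in two different orders.
as_written = ["a", "bag", "of", "words"]
rearranged = ["words", "of", "a", "bag"]

# Lists maintain order, so the two sequences are different objects of study.
lists_equal = as_written == rearranged

# Sets are hash tables: order is not recorded, so the two become identical.
sets_equal = set(as_written) == set(rearranged)

# What is gained in exchange: direct retrieval by text value,
# without scanning through the items one by one.
vocabulary = set(as_written)
found = "bag" in vocabulary
```

Here `lists_equal` is `False` while `sets_equal` is `True`: once the words are placed in the hash-based structure, the sequence in which they were written is no longer recoverable from it.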

## Data mining

In announcing Google's impending data center in Mons, Belgian prime minister
Di Rupo invoked the link between the history of the mining industry in the
region and the present and future interest in "data mining" as practiced by IT
companies such as Google.

Whether speaking of bales of cotton, barrels of oil, or bags of words, what
links these subjects is the way in which the notion of "raw material" obscures
the labor and power structures employed to secure them. "Raw" is always
relative: "purity" depends on processes of "refinement" that typically carry
social/ecological impact.

Stripping language of order is an act of "disembodiment", detaching it from
the acts of writing and reading. The shift from (human) reading to machine
reading involves a shift of responsibility from the individual human body to
the obscured responsibilities and seemingly inevitable forces of the
"machine", be it the machine of a market or the machine of an algorithm.

From [X = Y](/wiki/index.php?title=X_%3D_Y "X = Y"):

> Still, it is reassuring to know that the products hold traces of the work,
that even with the progressive removal of human signs in automated processes,
the workers' presence never disappears completely. This presence is proof of
the materiality of information production, and becomes a sign of the economies
and paradigms of efficiency and profitability that are involved.

The computer scientists' view of textual content as "unstructured", be it in a
webpage or the OCR-scanned pages of a book, reflects a neglect of the
processes and labor of writing, editing, design, layout, typesetting, and
eventually publishing, collecting and cataloging [11].

"Unstructured", to the computer scientist, means non-conformant to particular
forms of machine reading. "Structuring" then is a social process by which
particular (additional) conventions are agreed upon and employed. Computer
scientists often view text through the eyes of their particular reading
algorithm, and in the process (voluntarily) blind themselves to the work
practices which have produced and maintain these "resources".

Berners-Lee, in chastising his audience of web publishers to not only publish
online but to release "unadulterated" data, betrays a lack of imagination in
considering how language is itself structured, and a blindness to the need for
more than additional technical standards to connect to existing publishing
practices.

Last Revision: 2*08*2016

1. Benjamin Franklin Lieber, Lieber's Standard Telegraphic Code, 1896, New York.
2. Katherine Hayles, "Technogenesis in Action: Telegraph Code Books and the Place of the Human", How We Think: Digital Media and Contemporary Technogenesis, 2006.
3. Hayles.
4. Lieber's.
5. Hayles.
6. Tim Berners-Lee, The next web, TED Talk, February 2009.
7. "Research on the Web seems to be fashionable these days and I guess I'm no exception." From Brin's [Stanford webpage](http://infolab.stanford.edu/~sergey/).
8. Sergey Brin, Extracting Patterns and Relations from the World Wide Web, Proceedings of the WebDB Workshop at EDBT 1998.
9. Sergey Brin and Lawrence Page, Dynamic Data Mining: Exploring Large Rule Spaces by Sampling, 1998, p. 2.
10. Tim Berners-Lee and Daniel Connolly, Hypertext Markup Language (HTML): "Internet Draft", June 1993.
11.

Retrieved from

[https://www.mondotheque.be/wiki/index.php?title=A_bag_but_is_language_nothing_of_words&oldid=8480](https://www.mondotheque.be/wiki/index.php?title=A_bag_but_is_language_nothing_of_words&oldid=8480)
