Murtaugh
A bag but is language nothing of words
2016


## A bag but is language nothing of words

### From Mondotheque


(language is nothing but a bag of words)

[Michael Murtaugh](/wiki/index.php?title=Michael_Murtaugh "Michael Murtaugh")

In text indexing and other machine reading applications the term "bag of
words" is frequently used to underscore how processing algorithms often
represent text using a data structure (word histograms or weighted vectors)
where the original order of the words in sentence form is stripped away. While
"bag of words" might well serve as a cautionary reminder to programmers of the
essential violence perpetrated on a text and a call to critically question the
efficacy of methods based on subsequent transformations, the expression's use
seems in practice more like a badge of pride or a schoolyard taunt that would
go: Hey language: you're nothin' but a big BAG-OF-WORDS.

## Bag of words

In information retrieval and other so-called _machine-reading_ applications
(such as text indexing for web search engines) the term "bag of words" is used
to underscore how in the course of processing a text the original order of the
words in sentence form is stripped away. The resulting representation is then
a collection of each unique word used in the text, typically weighted by the
number of times the word occurs.

Bags of words, also known as word histograms or weighted term vectors, are a
standard part of the data engineer's toolkit. But why such a drastic
transformation? The utility of "bag of words" is in how it makes text amenable
to code, first in that it's very straightforward to implement the translation
from a text document to a bag of words representation. More significantly,
this transformation then opens up a wide collection of tools and techniques
for further transformation and analysis purposes. For instance, a number of
libraries available in the booming field of "data sciences" work with "high
dimension" vectors; bag of words is a way to transform a written document into
a mathematical vector where each "dimension" corresponds to the (relative)
quantity of each unique word. While physically unimaginable and abstract
(imagine each of Shakespeare's works as points in a 14 million dimensional
space), from a formal mathematical perspective, it's quite a comfortable idea,
and many complementary techniques (such as principal component analysis) exist
to reduce the resulting complexity.
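The translation from text to bag can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's method: the sample sentence, the tokenizing rule (lowercase runs of letters and apostrophes), and the function names are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Assumed tokenizer: runs of letters and apostrophes; real systems differ.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def to_vector(bag, vocabulary):
    """Project a bag onto a fixed vocabulary: one dimension per unique word."""
    return [bag[word] for word in vocabulary]


text = "the cat sat on the mat"      # hypothetical sample text
bag = bag_of_words(text)             # word order is discarded here
vocabulary = sorted(bag)             # one "dimension" per unique word
vector = to_vector(bag, vocabulary)  # the document as a point in that space
```

With a vocabulary drawn from a whole collection rather than one sentence, the same `to_vector` step yields the very long, mostly zero vectors described above.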

What's striking about a bag of words representation, given its centrality in so
many text retrieval applications, is its irreversibility. Producing the
original text from a bag of words representation would require in essence the
"brain" of a writer to recompose sentences, working with the patience of a
devoted cryptogram puzzler to draw from the precise stock of available words.
While "bag of words" might well serve as a cautionary reminder to programmers
of the essential violence perpetrated on a text and a call to critically
question the efficacy of methods based on subsequent transformations, the
expression's use seems in practice more like a badge of pride or a schoolyard
taunt that would go: Hey language: you're nothing but a big BAG-OF-WORDS.
Following this spirit of the term, "bag of words" celebrates a perfunctory
step of "breaking" a text into a form amenable to computation, stripping
language of its silly redundant repetitions and foolishly contrived stylistic
phrasings to reveal a purer inner essence.
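The irreversibility can be seen with the title of this essay and its parenthetical gloss: two different word orders collapse into one and the same bag. The sketch below assumes the simplest possible tokenizer (lowercasing and splitting on whitespace); the function name is chosen for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated words, discarding their order."""
    return Counter(text.lower().split())


title = "A bag but is language nothing of words"
gloss = "language is nothing but a bag of words"

same_bag = bag_of_words(title) == bag_of_words(gloss)  # the bags are identical
same_text = title.lower() == gloss.lower()             # the sentences are not
```

Given only the bag, nothing in the data says which of the two sentences (or which of the many other orderings of the same eight words) was written.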

## Book of words

Lieber's Standard Telegraphic Code, first published in 1896 and republished in
various updated editions through the early 1900s, is an example of one of
several competing systems of telegraph code books. The idea was for both
senders and receivers of telegraph messages to use the books to translate
their messages into a sequence of code words, which could then be sent for less
money, as telegraph messages were charged by the word. In the front of the book, a
list of examples gives a sampling of how messages like: "Have bought for your
account 400 bales of cotton, March delivery, at 8.34" can be conveyed by a
telegram with the message "Ciotola, Delaboravi". In each case the reduction of
number of transmitted words is highlighted to underscore the efficacy of the
method. Like a dictionary or thesaurus, the book is primarily organized around
key words, such as _act_ , _advice_ , _affairs_ , _bags_ , _bail_ , and
_bales_ , under which exhaustive lists of useful phrases involving the
corresponding word are provided in the main pages of the volume. [1]

[![Liebers
P1016847.JPG](/wiki/images/4/41/Liebers_P1016847.JPG)](/wiki/index.php?title=File:Liebers_P1016847.JPG)

[![Liebers
P1016859.JPG](/wiki/images/3/35/Liebers_P1016859.JPG)](/wiki/index.php?title=File:Liebers_P1016859.JPG)

[![Liebers
P1016861.JPG](/wiki/images/3/34/Liebers_P1016861.JPG)](/wiki/index.php?title=File:Liebers_P1016861.JPG)

[![Liebers
P1016869.JPG](/wiki/images/f/fd/Liebers_P1016869.JPG)](/wiki/index.php?title=File:Liebers_P1016869.JPG)

> [...] my focus in this chapter is on the inscription technology that grew
parasitically alongside the monopolistic pricing strategies of telegraph
companies: telegraph code books. Constructed under the bywords “economy,”
“secrecy,” and “simplicity,” telegraph code books matched phrases and words
with code letters or numbers. The idea was to use a single code word instead
of an entire phrase, thus saving money by serving as an information
compression technology. Generally economy won out over secrecy, but in
specialized cases, secrecy was also important.[2]

In Katherine Hayles' chapter devoted to telegraph code books she observes how:

> The interaction between code and language shows a steady movement away from
a human-centric view of code toward a machine-centric view, thus anticipating
the development of full-fledged machine codes with the digital computer. [3]

[![Liebers
P1016851.JPG](/wiki/images/1/13/Liebers_P1016851.JPG)](/wiki/index.php?title=File:Liebers_P1016851.JPG)

Aspects of this transitional moment are apparent in a notice prominently
inserted in Lieber's code book:

> After July, 1904, all combinations of letters that do not exceed ten will
pass as one cipher word, provided that it is pronounceable, or that it is
taken from the following languages: English, French, German, Dutch, Spanish,
Portuguese or Latin -- International Telegraphic Conference, July 1903 [4]

Conforming to international conventions regulating telegraph communication at
that time, the stipulation that code words be actual words drawn from a
variety of European languages (many of Lieber's code words are indeed
arbitrary Dutch, German, and Spanish words) underscores this particular moment
of transition as reference to the human body in the form of "pronounceable"
speech from representative languages begins to yield to the inherent potential
for arbitrariness in digital representation.

What telegraph code books do is remind us of the relation of language in
general to economy. Whether the economies in question are those of memory,
attention, costs paid to a telecommunications company, or computer processing
time or storage space, encoding language or knowledge in any form of writing is
a form of shorthand and always involves an interplay with what one expects to
perform or "get out" of the resulting encoding.

> Along with the invention of telegraphic codes comes a paradox that John
Guillory has noted: code can be used both to clarify and occlude. Among the
sedimented structures in the technological unconscious is the dream of a
universal language. Uniting the world in networks of communication that
flashed faster than ever before, telegraphy was particularly suited to the
idea that intercultural communication could become almost effortless. In this
utopian vision, the effects of continuous reciprocal causality expand to
global proportions capable of radically transforming the conditions of human
life. That these dreams were never realized seems, in retrospect, inevitable.
[5]

[![Liebers
P1016884.JPG](/wiki/images/9/9c/Liebers_P1016884.JPG)](/wiki/index.php?title=File:Liebers_P1016884.JPG)

[![Liebers
P1016852.JPG](/wiki/images/7/74/Liebers_P1016852.JPG)](/wiki/index.php?title=File:Liebers_P1016852.JPG)

[![Liebers
P1016880.JPG](/wiki/images/1/11/Liebers_P1016880.JPG)](/wiki/index.php?title=File:Liebers_P1016880.JPG)

Far from providing a universal system of encoding messages in the English
language, Lieber's code is quite clearly designed for the particular needs and
conditions of its use. In addition to the phrases ordered by keywords, the
book includes a number of tables of terms for specialized use. One table lists
a set of words used to describe all possible permutations of numeric grades of
coffee (Choliam = 3,4, Choliambos = 3,4,5, Choliba = 4,5, etc.); another table
lists pairs of code words to express the respective daily rise or fall of the
price of coffee at the port of Le Havre in increments of a quarter of a Franc
per 50 kilos ("Chirriado = prices have advanced 1 1/4 francs"). From an
archaeological perspective, Lieber's code book reveals a cross section of
the needs and desires of early 20th century business communication between the
United States and its trading partners.
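Read as a data structure, such a table is a lookup from one transmitted word to a longer meaning. The sketch below uses only the four code words quoted above; the dictionary layout, the function names, and the wording "coffee grades" are assumptions made for illustration, not the book's own arrangement.

```python
# Code words and meanings as quoted in the text; the tables in the book
# itself are far larger. The dictionary layout is an assumption.
COFFEE_GRADES = {
    "Choliam": (3, 4),
    "Choliambos": (3, 4, 5),
    "Choliba": (4, 5),
}
PRICE_MOVEMENTS = {
    "Chirriado": "prices have advanced 1 1/4 francs",
}


def decode(code_word):
    """Expand one code word into the phrase it stands for."""
    if code_word in COFFEE_GRADES:
        grades = ", ".join(str(grade) for grade in COFFEE_GRADES[code_word])
        return "coffee grades " + grades
    if code_word in PRICE_MOVEMENTS:
        return PRICE_MOVEMENTS[code_word]
    raise KeyError(f"{code_word!r} is not in this excerpt of the code book")


def words_saved(code_word):
    """Words in the expanded phrase minus the single word actually sent."""
    return len(decode(code_word).split()) - 1


message = decode("Chirriado")
saved = words_saved("Chirriado")
```

The economy is visible in `words_saved`: the sender pays for one word while the receiver, holding the same book, recovers six.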

The advertisements lining Lieber's code book further situate its use and
that of commercial telegraphy. Among the many advertisements for banking and
law services, office equipment, and alcohol are several ads for gunpowder and
explosives, drilling equipment and metallurgic services, all with specific
applications to mining. Building on telegraphy's formative role in
ship-to-shore and ship-to-ship communication for reasons of safety, commercial
telegraphy extended this network of communication to include those parties
coordinating the "raw materials" being mined, grown, or otherwise extracted
from overseas sources and shipped back for sale.

## "Raw data now!"


From [The Smart City - City of Knowledge](/wiki/index.php?title=The_Smart_City_-_City_of_Knowledge "The Smart City - City of Knowledge"):

As new modernist forms and use of materials propagated the abundance of
decorative elements, Otlet believed in the possibility of language as a model
of '[raw data](/wiki/index.php?title=Bag_of_words "Bag of words")', reducing
it to essential information and unambiguous facts, while removing all
inefficient assets of ambiguity or subjectivity.


> Tim Berners-Lee: [...] Make a beautiful website, but first give us the
unadulterated data, we want the data. We want unadulterated data. OK, we have
to ask for raw data now. And I'm going to ask you to practice that, OK? Can
you say "raw"?

>

> Audience: Raw.

>

> Tim Berners-Lee: Can you say "data"?

>

> Audience: Data.

>

> TBL: Can you say "now"?

>

> Audience: Now!

>

> TBL: Alright, "raw data now"!

>

> [...]

>

> So, we're at the stage now where we have to do this -- the people who think
it's a great idea. And all the people -- and I think there's a lot of people
at TED who do things because -- even though there's not an immediate return on
the investment because it will only really pay off when everybody else has
done it -- they'll do it because they're the sort of person who just does
things which would be good if everybody else did them. OK, so it's called
linked data. I want you to make it. I want you to demand it. [6]

## Un/Structured

As graduate students at Stanford, Sergey Brin and Lawrence (Larry) Page had an
early interest in producing "structured data" from the "unstructured" web. [7]

> The World Wide Web provides a vast source of information of almost all
types, ranging from DNA databases to resumes to lists of favorite restaurants.
However, this information is often scattered among many web servers and hosts,
using many different formats. If these chunks of information could be
extracted from the World Wide Web and integrated into a structured form, they
would form an unprecedented source of information. It would include the
largest international directory of people, the largest and most diverse
databases of products, the greatest bibliography of academic works, and many
other useful resources. [...]

>

> **2.1 The Problem**
> Here we define our problem more formally:
> Let D be a large database of unstructured information such as the World
Wide Web [...] [8]

In a paper titled _Dynamic Data Mining_ Brin and Page situate their research
as a search for _rules_ (statistical correlations) between words used in web
pages. The "baskets" they mention stem from the origins of "market basket"
techniques developed to find correlations between the items recorded in the
purchase receipts of supermarket customers. In their case, they deal with web
pages rather than shopping baskets, and words instead of purchases. In
transitioning to the much larger scale of the web, they describe the
usefulness of their research in terms of its computational economy, that is,
the ability to tackle the scale of the web and still complete the task in a
reasonably short amount of time using contemporary computing power.

> A traditional algorithm could not compute the large itemsets in the lifetime
of the universe. [...] Yet many data sets are difficult to mine because they
have many frequently occurring items, complex relationships between the items,
and a large number of items per basket. In this paper we experiment with word
usage in documents on the World Wide Web (see Section 4.2 for details about
this data set). This data set is fundamentally different from a supermarket
data set. Each document has roughly 150 distinct words on average, as compared
to roughly 10 items for cash register transactions. We restrict ourselves to a
subset of about 24 million documents from the web. This set of documents
contains over 14 million distinct words, with tens of thousands of them
occurring above a reasonable support threshold. Very many sets of these words
are highly correlated and occur often. [9]
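To make the "basket" framing concrete, here is a naive sketch that treats each document as a basket of words and counts how many baskets contain each pair. It is the exhaustive approach, not the sampling method Brin and Page propose; the three sample documents, the function name, and the threshold value are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count the baskets containing each unordered pair of items and
    keep only the pairs found in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Each distinct item counts once per basket; pairs are sorted tuples.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


documents = [            # hypothetical miniature "web"
    "raw data now",
    "we want raw data",
    "linked data now",
]
baskets = [doc.split() for doc in documents]
pairs = frequent_pairs(baskets, min_support=2)
```

A basket of 10 items yields 45 pairs, one of 150 distinct words yields 11,175, and larger itemsets grow faster still, which gives a sense of why the quoted passage treats exhaustive counting as out of reach at web scale.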

## Un/Ordered

In programming, I've encountered a recurring "problem" that's quite
symptomatic. It goes something like this: you (the programmer) have managed to
cobble together a lovely "content management system" (either from scratch, or
using any number of helpful frameworks) where your user can enter some "items"
into a database, for instance to store bookmarks. Afterwards, these items are
automatically ordered and presented in list form (say on a web page). The
author: It's great, except... could this bookmark come before that one? The
problem stems from the fact that the database ordering (a core functionality
provided by any database) somehow applies a sorting logic that's almost but not
quite right. A typical example is the sorting of names, where details (where to
place a name that starts with a Norwegian "Ø", for instance) are
language-specific, and when a mixture of languages occurs, no single ordering
is necessarily "correct". The (often) exasperated programmer might hastily add
an additional database field so that each item can also have an "order"
(perhaps in the form of a date or some other kind of (alpha)numerical "sorting"
value) to be used to correctly order the resulting list. Now the author has a
means, awkward and indirect but workable, to control the order of the presented
data on the start page. But one might well ask, why not just edit the resulting
listing as a document? Not possible! Contemporary content management systems
are based on a data flow from a "pure" source of a database, through
controlling code and templates to produce a document as a result. The document
isn't the data, it's the end result of an irreversible process. This problem,
in this and many variants, is widespread and reveals an essential backwardness
in a particular "computer scientist" mindset about what constitutes "data",
and in particular its relationship to order, that turns what might be a
straightforward question of editing a document into an over-engineered
database.
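The scenario can be sketched in Python. The names and the "order" values are hypothetical, and plain `sorted` (which compares Unicode code points) stands in for whatever collation a given database applies.

```python
# Hypothetical bookmark titles; "Ø" is code point U+00D8, above "Z" (U+005A).
names = ["Zeller", "Østby", "Olsen"]

# Default sorting compares code points, so "Østby" lands after "Zeller".
# That resembles a Norwegian alphabet but not a reader who files Ø near O.
by_code_point = sorted(names)

# The workaround from the text: an extra "order" field set by hand.
bookmarks = [
    {"title": "Zeller", "order": 3},
    {"title": "Østby", "order": 2},
    {"title": "Olsen", "order": 1},
]
by_hand = [b["title"] for b in sorted(bookmarks, key=lambda b: b["order"])]
```

The second listing is the one the author wanted, but it exists only as the output of the sort: changing it means changing the numbers in the "order" field, not the list itself.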

While I was recently working with Nikolaos Vogiatzis, whose research explores
playful and radically subjective alternatives to the list, he was struck by how
HTML, from its earliest specifications (still valid today), has separate
elements (OL and UL) for "ordered" and "unordered" lists.

> The representation of the list is not defined here, but a bulleted list for
unordered lists, and a sequence of numbered paragraphs for an ordered list
would be quite appropriate. Other possibilities for interactive display
include embedded scrollable browse panels. [10]

Vogiatzis' surprise lay in the idea of a list ever being considered
"unordered" (or in opposition to the language used in the specification, for
order to ever be considered "insignificant"). Indeed in its suggested
representation, still followed by modern web browsers, the only difference
between the two visually is that UL items are preceded by a bullet symbol,
while OL items are numbered.

The idea of ordering runs deep in programming practice, where essentially
different data structures are employed depending on whether order is to be
maintained. The indexes of a "hash" table (also known as an associative
array), for instance, are ordered in an unpredictable way governed by the
particular implementation of the representation. This data structure, extremely
prevalent in contemporary programming practice, sacrifices order to offer other
kinds of efficiency (fast text-based retrieval, for instance).
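A small Python sketch of the distinction, using the built-in list (ordered), set (hash-based, no guaranteed iteration order) and dict (hash-based lookup by key). The variable names are illustrative. Note that Python's dict happens to preserve insertion order since version 3.7, which is itself a decision of that implementation, in the sense described above.

```python
ordered_a = ["bag", "of", "words"]
ordered_b = ["words", "of", "bag"]

# Lists keep order, so the same items in another order are a different list.
lists_equal = ordered_a == ordered_b

# Sets are hash-based and carry no order, so the two collapse into one.
sets_equal = set(ordered_a) == set(ordered_b)

# What the hash table offers in exchange: fast retrieval by text key.
position_of = {word: position for position, word in enumerate(ordered_a)}
found = "words" in position_of
```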

## Data mining

In announcing Google's impending data center in Mons, Belgian prime minister
Di Rupo invoked the link between the history of the mining industry in the
region and the present and future interest in "data mining" as practiced by IT
companies such as Google.

Whether speaking of bales of cotton, barrels of oil, or bags of words, what
links these subjects is the way in which the notion of "raw material" obscures
the labor and power structures employed to secure them. "Raw" is always
relative: "purity" depends on processes of "refinement" that typically carry
social/ecological impact.

Stripping language of order is an act of "disembodiment", detaching it from
the acts of writing and reading. The shift from (human) reading to machine
reading involves a shift of responsibility from the individual human body to
the obscured responsibilities and seemingly inevitable forces of the
"machine", be it the machine of a market or the machine of an algorithm.

From [X = Y](/wiki/index.php?title=X_%3D_Y "X = Y"):

Still, it is reassuring to know that the products hold traces of the work,
that even with the progressive removal of human signs in automated processes,
the workers' presence never disappears completely. This presence is proof of
the materiality of information production, and becomes a sign of the economies
and paradigms of efficiency and profitability that are involved.


The computer scientists' view of textual content as "unstructured", be it in a
webpage or the OCR scanned pages of a book, reflects a neglect of the
processes and labor of writing, editing, design, layout, typesetting, and
eventually publishing, collecting and cataloging [11].

"Unstructured", to the computer scientist, means non-conformant to particular
forms of machine reading. "Structuring" then is a social process by which
particular (additional) conventions are agreed upon and employed. Computer
scientists often view text through the eyes of their particular reading
algorithm, and in the process (voluntarily) blind themselves to the work
practices which have produced and maintain these "resources".

Berners-Lee, in exhorting his audience of web publishers not only to publish
online but to release "unadulterated" data, betrays a lack of imagination in
considering how language is itself structured, and a blindness to the need for
more than additional technical standards to connect to existing publishing
practices.

Last Revision: 2*08*2016

1. Benjamin Franklin Lieber, Lieber's Standard Telegraphic Code, 1896, New York
2. Katherine Hayles, "Technogenesis in Action: Telegraph Code Books and the Place of the Human", How We Think: Digital Media and Contemporary Technogenesis, 2012
3. Hayles
4. Lieber's
5. Hayles
6. Tim Berners-Lee: The next web, TED Talk, February 2009
7. "Research on the Web seems to be fashionable these days and I guess I'm no exception." from Brin's [Stanford webpage](http://infolab.stanford.edu/~sergey/)
8. Extracting Patterns and Relations from the World Wide Web, Sergey Brin, Proceedings of the WebDB Workshop at EDBT 1998
9. Dynamic Data Mining: Exploring Large Rule Spaces by Sampling; Sergey Brin and Lawrence Page, 1998; p. 2
10. Hypertext Markup Language (HTML): "Internet Draft", Tim Berners-Lee and Daniel Connolly, June 1993
11.


Dockray, Forster & Public Office
README.md
2018


## Introduction

How might we ensure the survival and availability of community libraries,
individual collections and other precarious archives? If these libraries,
archives and collections are unwanted by official institutions or, worse,
buried beneath good intentions and bureaucracy, then what tools and platforms
and institutions might we develop instead?

While trying to both formulate and respond to these questions, we began making
Dat Library and HyperReadings:

**Dat Library** distributes libraries across many computers so that many
people can provide disk space and bandwidth, sharing in the labour and
responsibility of the archival infrastructure.

**HyperReadings** implements ‘reading lists’ or a structured set of pointers
(a list, a syllabus, a bibliography, etc.) into one or more libraries,
_activating_ the archives.

## Installation

The easiest way to get started is to install [Dat Library as a desktop
app](http://dat-dat-dat-library.hashbase.io), but there is also a programme
called ‘[datcat](http://github.com/sdockray/dat-cardcat)’, which can be run on
the command line or included in other NodeJS projects.

## Accidents of the Archive

The 1996 UNESCO publication [Lost Memory: Libraries and Archives Destroyed in
the Twentieth Century](http://www.stephenmclaughlin.net/ph-library/texts/UNESCO%201996%20-%20Lost%20Memory_%20Libraries%20and%20Archives%20Destroyed%20in%20the%20Twentieth%20Century.pdf)
makes the fragility of historical repositories startlingly clear. “[A]cidified
paper that crumbles to dust, leather, parchment, film and magnetic tape
attacked by light, heat, humidity or dust” all assault archives. “Floods,
fires, hurricanes, storms, earthquakes” and, of course, “acts of war,
bombardment and fire, whether deliberate or accidental” wiped out significant
portions of many hundreds of major research libraries worldwide. When
expanding the scope to consider public, private, and community libraries, that
number becomes uncountable.

Published during the early days of the World Wide Web, the report acknowledges
the emerging role of digitization (“online databases, CD-ROM etc.”), but today
we might reflect on the last twenty years, which have also introduced new forms
of loss.

Digital archives and libraries are subject to a number of potential hazards:
technical accidents like disk failures, accidental deletions, misplaced data
and imperfect data migrations, as well as political-economic accidents like
defunding of the hosting institution, deaccessioning parts of the collection
and sudden restrictions of access rights. Immediately after library.nu was
shut down on the grounds of copyright infringement in 2012, [Lawrence Liang
wrote](https://kafila.online/2012/02/19/library-nu-r-i-p/) of feeling “first
and foremost a visceral experience of loss.”

Whatever its legal status, the abrupt absence of a collection of 400,000 books
appears to follow a particularly contemporary pattern. In 2008, Aaron Swartz
moved millions of US federal court documents out from behind a paywall,
resulting in a trial and an FBI investigation. Three years later he was
arrested and indicted for a similar gesture, systematically downloading
academic journal articles from JSTOR. That year, Kazakhstani scientist
Alexandra Elbakyan began [Sci-Hub](https://en.wikipedia.org/wiki/Sci-Hub) in
response to scientific journal articles that were prohibitively expensive for
scholars based outside of Western academic institutions. (See
for further analysis and an alternative
approach to the same issues: “When everyone is librarian, library is
everywhere.”) The repository, growing to more than 60 million papers, was
sued in 2015 by Elsevier for $15 million, resulting in a permanent injunction.
Library Genesis, another library of comparable scale, finds itself in a
similar legal predicament.

Arguably one of the largest digital archives of the “avant-garde” (loosely
defined), UbuWeb is transparent about this fragility. In 2011, its founder
[Kenneth Goldsmith wrote](http://www.ubu.com/resources/): “by the time you
read this, UbuWeb may be gone. […] Never meant to be a permanent archive, Ubu
could vanish for any number of reasons: our ISP pulls the plug, our university
support dries up, or we simply grow tired of it.” Even the banality of
exhaustion is a real risk to these libraries.

The simple fact is that some of these libraries are among the largest in the
world yet are subject to sudden disappearance. We can only begin to guess at
what the contours of “Lost Memory: Libraries and Archives Destroyed in the
Twenty-First Century” will be when it is written ninety years from now.

## Non-profit, non-state archives

Cultural and social movements have produced histories which are only partly
represented in state libraries and archives. Often they are deemed too small
or insignificant or, in some cases, dangerous. Most frequently, they are not
deemed to be anything at all — they are simply neglected. While the market,
eager for new resources to exploit, might occasionally fill in the gaps, it is
ultimately motivated by profit and not by responsibility to communities or
archives. (We should not forget the moment [Amazon silently erased legally
purchased copies of George Orwell’s
1984](http://www.nytimes.com/2009/07/18/technology/companies/18amazon.html)
from readers’ Kindle devices because of a change in the commercial agreement
with the publisher.)

So, what happens to these minor libraries? They are innumerable, but for the
sake of illustration let’s say that each could be represented by a single
book. Gathered together, these books would form a great library (in terms of
both importance and scale). But to extend the metaphor, the current reality
could be pictured as these books flying off their shelves to the furthest
reaches of the world, their covers flinging open and the pages themselves
scattering into bookshelves and basements, into the caring hands of relatives
or small institutions devoted to passing these words on to future generations.

While the massive digital archives listed above (library.nu, Library Genesis,
Sci-Hub, etc.) could play the role of the library of libraries, they tend to
be defined more as sites for [biblioleaks](https://www.jmir.org/2014/4/e112/).
Furthermore, given the vulnerability of these archives, we ought to look for
alternative approaches that do not rule out using their resources, but which
also do not _depend_ on them.

Dat Library takes the concept of “a library of libraries” not to manifest it
in a single, universal library, but to realise it progressively and partially
with different individuals, groups and institutions.

## Archival properties

So far, the emphasis of this README has been on _durability_ , and the
“accidents of the archive” have been instances of destruction and loss. The
persistence of an archive is, however, no guarantee of its _accessibility_ , a
common reality in digital libraries where access management is ubiquitous.
Official institutions police access to their archives vigilantly for the
ostensible purpose of preservation, but ultimately create a rarefied
relationship between the archives and their publics. Disregarding this
precious tendency toward preciousness, we also introduce _adaptability_ as a
fundamental consideration in the making of the projects Dat Library and
HyperReadings.

To adapt is to fit something for a new purpose. It emphasises that the archive
is not a dead object of research but a set of possible tools waiting to be
activated in new circumstances. This is always a possibility of an archive,
but we want to treat this possibility as desirable, as the horizon towards
which these projects move. We know how infrastructures can attenuate desire
and simply make things difficult. We want to actively encourage radical reuse.

In the following section, we don’t define these properties but rather discuss
how we implement (or fail to implement) them in software, while highlighting
some of the potential difficulties introduced.

### Durability

In 1964, in the midst of the “loss” of the twentieth century, Paul Baran’s
RAND Corporation publication [On Distributed
Communications](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2006/RM3420.pdf)
examined “redundancy as one means of building … highly survivable and reliable
communications systems”, thus midwifing the military foundations of the
digital networks that we operate within today. While the underlying framework
of the Internet generally follows distributed principles, the
client–server/request–response model of HTTP is highly centralised in practice
and is only as durable as the server.

Capitalism places a high value on originality and novelty, as exemplified in
art where the ultimate insult would be the label “redundant”. Worse than
being derivative or merely unoriginal, being redundant means having no reason
to exist — a uselessness that art can’t tolerate. It means wasting a perfectly
good opportunity to be creative or innovative. In a relational network, on the
other hand, redundancy is a mode of support. It doesn’t stimulate competition
to capture its effects, but rather it is a product of cooperation. While this
attitude of redundancy arose within a Western military context, one can’t help
but notice that the shared resources, mutual support, and common
infrastructure seem fundamentally communist in nature. Computer networks are
not fundamentally exploitative or equitable, but they are used in specific
ways and they operate within particular economies. A redundant network of
interrelated, mutually supporting computers running mostly open-source
software can be the guts of an advanced capitalist engine, like Facebook. So,
could it be possible to organise our networked devices, embedded as they are
in a capitalist economy, in an anti-capitalist way?

Dat Library is built on the [Dat
Protocol](https://github.com/datproject/docs/blob/master/papers/dat-paper.md),
a peer-to-peer protocol for syncing folders of data. It is not the first
distributed protocol ([BitTorrent](https://en.wikipedia.org/wiki/BitTorrent)
is the best known and is noted as an inspiration for Dat), nor is it the only
new one being developed today ([IPFS](https://ipfs.io) or the Inter-Planetary
File System is often referenced in comparison), but it is unique in its
foundational goals of preserving scientific knowledge as a public good. Dat’s
provocation is that by creating custom infrastructure it will be possible to
overcome the accidents that restrict access to scientific knowledge. We would
specifically acknowledge here the role that the Dat community — or any
community around a protocol, for that matter — has in the formation of the
world that is built on top of that protocol. (For a sense of the Dat
community’s values, see its [code of
conduct](https://github.com/datproject/Code-of-Conduct/blob/master/CODE_OF_CONDUCT.md).)

When running Dat Library, a person sees their list of libraries. These can be
thought of as similar to a
[torrent](https://en.wikipedia.org/wiki/Torrent_file), where items are stored
across many computers. This means that many people will share in the provision
of disk space and bandwidth for a particular library, so that when someone
loses electricity or drops their computer, the library will not also break.
Although this is a technical claim — one that has been made in relation to
many projects, from Baran to BitTorrent — it is more importantly a social
claim: the users and lovers of a library will share the library. More than
that, they will share in the work of ensuring that it will continue to be
shared.

This is not dissimilar to the process of reading generally, where knowledge is
distributed and maintained through readers sharing and referencing the books
important to them. As [Peter Sloterdijk
describes](https://rekveld.home.xs4all.nl/tech/Sloterdijk_RulesForTheHumanZoo.pdf),
written philosophy is “reinscribed like a chain letter through the
generations, and despite all the errors of reproduction — indeed, perhaps
because of such errors — it has recruited its copyists and interpreters into
the ranks of brotherhood (sic)”. Or its sisterhood — but, the point remains
clear that the reading / writing / sharing of texts binds us together, even in
disagreement.

### Accessibility

In the world of the web, durability is synonymous with accessibility — if
something can’t be accessed, it doesn’t exist. Here, we disentangle the two in
order to consider _access_ independent from questions of resilience.

##### Technically Accessible

When you create a new library in Dat, a unique 64-digit “key” will
automatically be generated for it. An example key is
`6f963e59e9948d14f5d2eccd5b5ac8e157ca34d70d724b41cb0f565bc01162bf`, which
points to a library of texts. In order for someone else to see the library you
have created, you must provide to them your library’s unique key (by email,
chat, on paper or you could publish it on your website). In short, _you_
manage access to the library by copying that key, and then every key holder
also manages access _ad infinitum_.
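Because the key is the only thing that needs to travel, it will be pasted into emails, chats and web pages, sometimes with stray whitespace or capitals. The following is a minimal sketch, not part of Dat Library itself, of how a received string could be checked before use. It assumes a key is exactly 64 hexadecimal characters, as in the example above, and that it may arrive wrapped in a `dat://` link; the function name `normalise_key` is hypothetical.

```python
import re

# Assumption: a key is exactly 64 hexadecimal characters, as in the example above.
KEY_PATTERN = re.compile(r"[0-9a-f]{64}")


def normalise_key(text: str) -> str:
    """Return the bare lowercase key, or raise ValueError if it does not look like one."""
    candidate = text.strip().lower()
    # Assumption: keys are sometimes shared as links of the form dat://<key>/
    if candidate.startswith("dat://"):
        candidate = candidate[len("dat://"):]
    candidate = candidate.rstrip("/")
    if not KEY_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a 64-digit hexadecimal key: {text!r}")
    return candidate
```

A check like this only confirms the shape of the key. Whether the key points to a library that is still being shared depends on the people holding it.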

At the moment this has its limitations. A Dat is only writable by a single
creator. If you want to collaboratively develop a library or reading list, you
need to have a single administrator managing its contents. This will change in
the near future with the integration of
[hyperdb](https://github.com/mafintosh/hyperdb) into Dat’s core. At that
point, the platform will enable multiple contributors and the management of
permissions, and our single key will become a key chain.

How is this key any different from knowing the domain name of a website? If a
site isn’t indexed by Google and has a suitably unguessable domain name, then
isn’t that effectively the same degree of privacy? Yes, and this is precisely
why the metaphor of the key is so apt (with whom do you share the key to your
apartment?) but also why it is limited. With the key, one not only has the
ability to _enter_ the library, but also to completely _reproduce_ the
library.

##### Consenting Accessibility

When we say “accessibility”, some hear “information wants to be free” — but
our idea of accessibility is not about indiscriminate open access to
everything. While we do support, in many instances, the desire to increase
access to knowledge where it has been restricted by monopoly property
ownership, or the urge to increase transparency in delegated decision-making
and representative government, we also recognise that Indigenous knowledge
traditions often depend on ownership, control, consent, and secrecy in the
hands of the traditions’ people. [see [“Managing Indigenous Knowledge and
Indigenous Cultural and Intellectual
Property”](https://epress.lib.uts.edu.au/system/files_force/Aus%20Indigenous%20Knowledge%20and%20Libraries.pdf?download=1),
pg 83] Accessibility understood in merely quantitative terms isn’t able to
reconcile these positions, which is why we refuse to limit “access” to a
question of technology.

While “digital rights management” technologies have been developed almost
exclusively for protecting the commercial interests of capitalist property
owners within Western intellectual property regimes, many of the assumptions
and technological implementations are inadequate for the protection of
Indigenous knowledge. Rather than describing access in terms of commodities
and ownership of copyright, it might be defined by membership, status or role
within a community, and the rules of access would not be managed by a
generalised legal system but by the rules and traditions of the people and
their knowledge. [[“The Role of Information Technologies in Indigenous
Knowledge
Management”](https://epress.lib.uts.edu.au/system/files_force/Aus%20Indigenous%20Knowledge%20and%20Libraries.pdf?download=1),
101-102] These rights would not expire, nor would they be bought and sold,
because they are shared, i.e., held in common.

It is important, while imagining the possibilities of a technological
protocol, to also consider how different _cultural protocols_ might be
implemented and protected through the life of a project like Dat Library.
Certain aspects of this might be accomplished through library metadata, but
ultimately it is through people hosting their own archives and libraries
(rather than, for example, having them hosted by a state institution) that
cultural protocols can be translated and reproduced. Perhaps we should flip
the typical question of how might a culture exist within digital networks to
instead ask how should digital networks operate within cultural protocols?

### Adaptability (ability to use/modify as one’s own)

Durability and accessibility are the foundations of adoptability. Many would
say that this is a contradiction, that adoption is about use and
transformation and those qualities operate against the preservationist grain
of durability, that one must always be at the expense of the other. We say:
perhaps that is true, but it is a risk we’re willing to take because we don’t
want to be making monuments and cemeteries that people approach with reverence
or fear. We want tools and stories that we use and adapt and are always making
new again. But we also say: it is through use that something becomes
invaluable, which may change or distort but will not destroy — this is the
practical definition of durability. S.R. Ranganathan’s very first Law of
Library Science was [“BOOKS ARE FOR
USE”](https://babel.hathitrust.org/cgi/pt?id=uc1.$b99721;view=1up;seq=37),
which we would extend to the library itself, such that when he arrives at his
final law, [“THE LIBRARY IS A LIVING
ORGANISM”](https://babel.hathitrust.org/cgi/pt?id=uc1.$b99721;view=1up;seq=432),
we note that to live means not only to change, but also to live _in the
world_.

To borrow and gently distort another of Ranganathan’s concepts, namely
that of ‘[Infinite
Hospitality](http://www.dextersinister.org/MEDIA/PDF/InfiniteHospitality.pdf)’,
it could be said that we are interested in ways to construct a form of
infrastructure that is infinitely hospitable. By this we mean, infrastructure
that accommodates the needs and desires of new users/audiences/communities and
allows them to enter and contort the technology to their own uses. We really
don’t see infrastructure as aimed at a single specific group, but rather that
it should generate spaces that people can inhabit as they wish. The poet Jean
Paul once wrote that books are thick letters to friends. Books as
infrastructure enable authors to find their friends. This is how we ideally
see Dat Library and HyperReadings working.

## Use cases

We began work on Dat Library and HyperReadings with a range of exemplary use
cases, real-world circumstances in which these projects might intervene. Not
only would the use cases make demands on the software we were and still are
beginning to write, but they would also give us demands to make on the Dat
protocol, which is itself still in the formative stages of development. And,
crucially, in an iterative feedback loop, this process of design produces
transformative effects on those situations described in the use cases
themselves, resulting in further new circumstances and new demands.

### Thorunka

Wendy Bacon and Chris Nash made us aware of Thorunka and Thor.

_Thorunka_ and _Thor_ were two underground papers in the early 1970s that
spewed out from a censorship controversy surrounding the University of New
South Wales student newspaper _Tharunka_. Between 1971 and 1973, the student
magazine was under focused attack from the NSW state police, with several
arrests made on charges of obscenity and indecency. Rather than capitulating,
Sydney activists, writers, lawyers, students and others mounted a large and
sustained political protest, to which _Thorunka_ and _Thor_ were central.

> “The campaign contested the idea of obscenity and the legitimacy of the
legal system itself. The newspapers campaigned on the war in Vietnam,
Aboriginal land rights, women’s and gay liberation, and the violence of the
criminal justice system. By 1973 the censorship regime in Australia was
broken. Nearly all the charges were dropped.” – [Quotation from the 107
Projects Event](http://107.org.au/event/tharunka-thor-journalism-politics-art-1970-1973/).

Although the collection of issues of _Tharunka_ is largely accessible [via
Trove](http://trove.nla.gov.au/newspaper/page/24773115), the subsequent issues
of _Thorunka_, and later _Thor_, are not. For us, this demonstrates clearly
how collections themselves can encourage modes of reading. If you focus on
_Tharunka_ as a singular and long-standing periodical, this significant
political moment is rendered almost invisible. On the other hand, if the
issues are presented together, with commentary and surrounding publications,
the political environment becomes palpable. Wendy and Chris have kindly
allowed us to make their personal collection available via Dat Library (the
key is: 73fd26846e009e1f7b7c5b580e15eb0b2423f9bea33fe2a5f41fac0ddb22cbdc), so
you can discover this for yourself.

### Academia.edu alternative

Academia.edu, started in 2008, has raised tens of millions of dollars as a
social network for academics to share their publications. As a for-profit
venture, it is rife with metrics and it attempts to capitalise on the innate
competition and self-promotion of precarious knowledge workers in the academy.
It is simultaneously popular and despised: popular because it fills an obvious
desire to share the fruits of one’s intellectual work, but despised for the
neoliberal atmosphere that pervades every design decision and automated
correspondence. It is, however, just trying to provide a return on investment.

[Gary Hall has
written](http://www.garyhall.info/journal/2015/10/18/does-academiaedu-mean-open-access-is-becoming-irrelevant.html)
that “its financial rationale rests … on the ability of the angel-investor and
venture-capital-funded professional entrepreneurs who run Academia.edu to
exploit the data flows generated by the academics who use the platform as an
intermediary for sharing and discovering research”. Moreover, he emphasises
that in the open-access world (outside of the exploitative practice of
for-profit publishers like Elsevier, who charge a premium for subscriptions),
the privileged position is to be the one “_who gate-keeps the data generated
around the use of that content_”. This lucrative position has been produced by
recent “[recentralising
tendencies](http://commonstransition.org/the-revolution-will-not-be-decentralised-blockchains/)”
of the internet, which in Academia’s case captures various, scattered open
access repositories, personal web pages, and other archives.

Is it possible to redecentralise? Can we break free of the subjectivities that
Academia.edu is crafting for us as we are interpellated by its infrastructure?
It is incredibly easy for any scholar running Dat Library to make a library of
their own publications and post the key to their faculty web page, Facebook
profile or business card. The tricky — and interesting — thing would be to
develop platforms that aggregate thousands of these libraries in direct
competition with Academia.edu. This way, individuals would maintain control
over their own work; their peer groups would assist in mirroring it; and no
one would be capitalising on the sale of data related to their performance and
popularity.

We note that Academia.edu is a typically centripetal platform: it provides no
tools for exporting one’s own content, so an alternative would necessarily be
a kind of centrifuge.

This alternative is becoming increasingly realistic. With open-access journals
already paving the way, there has more recently been a [call for free and open
access to citation
data](https://www.insidehighered.com/news/2017/12/06/scholars-push-free-access-online-citation-data-saying-they-need-and-deserve-access).
[The Initiative for Open Citations (I4OC)](https://i4oc.org) is
mobilising against the privatisation of data and working towards the
unrestricted availability of scholarly citation data. We see their new
database of citations as making this centrifugal force a possibility.

### Publication format

In writing this README, we have strung together several references. This
writing might be published in a book and the references will be listed as
words at the bottom of the page or at the end of the text. But the writing
might just as well be published as a HyperReadings object, providing the
reader with an archive of all the things we referred to and an editable
version of this text.

A new text editor could be created for this new publication format, not to
mention a new form of publication, which bundles together a set of
HyperReadings texts, producing a universe of texts and references. Each
HyperReadings text might reference others, of course, generating something
that begins to feel like a serverless World Wide Web.

It’s not even necessary to develop a new publication format, as any book might
be considered as a reading list (usually found in the footnotes and
bibliography) with a very detailed description of the relationship between the
consulted texts. What if the history of published works were considered in
this way, such that we might always be able to follow a reference from one
book directly into the pages of another, and so on?

### Syllabus

The syllabus is the manifesto of the twenty-first century. From [Your
Baltimore
“Syllabus”](https://apis4blacklives.wordpress.com/2015/05/01/your-baltimore-syllabus/),
to
[#StandingRockSyllabus](https://nycstandswithstandingrock.wordpress.com/standingrocksyllabus/),
to [Women and gender non-conforming people writing about
tech](https://docs.google.com/document/d/1Qx8JDqfuXoHwk4_1PZYWrZu3mmCsV_05Fe09AtJ9ozw/edit),
syllabi are being produced as provocations, or as instructions for
reprogramming imaginaries. They do not announce a new world but they point out
a way to get there. As a programme, the syllabus shifts the burden of action
onto the readers, who will either execute the programme on their own fleshy
operating system — or not. A text that by its nature points to other texts,
the syllabus is already a relational document acknowledging its own position
within a living field of knowledge. It is decidedly not self-contained;
however, it often circulates as if it were.

If a syllabus circulated as a HyperReadings document, then it could point
directly to the texts and other media that it aggregates. But just as easily
as it circulates, a HyperReadings syllabus could be forked into new versions:
the syllabus is changed because there is a new essay out, or because of a
political disagreement, or because following the syllabus produced new
suggestions. These forks become a family tree where one can follow branches
and trace epistemological mutations.
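To make the image of the family tree concrete, here is a small illustrative sketch of forked syllabi as a data structure. It is not the HyperReadings format or API: the class `SyllabusVersion`, its fields and its methods are hypothetical names invented for this example. Each fork keeps a pointer to its parent, so the reasons for each change can be traced back to the original.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass(eq=False)
class SyllabusVersion:
    """One version of a syllabus; hypothetical structure for illustration only."""
    title: str
    readings: List[str]
    note: str = ""  # why this version was forked
    parent: Optional["SyllabusVersion"] = field(default=None, repr=False)
    forks: List["SyllabusVersion"] = field(default_factory=list, repr=False)

    def fork(self, note: str, add: Iterable[str] = (), remove: Iterable[str] = ()) -> "SyllabusVersion":
        """Create a child version with some readings added or removed."""
        removed = set(remove)
        readings = [r for r in self.readings if r not in removed] + list(add)
        child = SyllabusVersion(self.title, readings, note, parent=self)
        self.forks.append(child)
        return child

    def lineage(self) -> List[str]:
        """Return the fork notes from the original version down to this one."""
        notes = []
        node: Optional["SyllabusVersion"] = self
        while node is not None:
            notes.append(node.note or "original")
            node = node.parent
        return list(reversed(notes))
```

For example, forking an original twice, once because "a new essay is out" and once because of "a political disagreement", gives a third version whose `lineage()` lists both reasons in order, while the original's reading list is left untouched.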

## Proposition (or Presuppositions)

While the software that we have started to write is a proposition in and of
itself, there is no guarantee as to _how_ it will be used. But when writing,
we _are_ imagining exactly that: we are making intuitive and hopeful
presuppositions about how it will be used, presuppositions that amount to a
set of social propositions.

### The role of individuals in the age of distribution

Different people have different technical resources and capabilities, but
everyone can contribute to an archive. By simply running the Dat Library
software and adding an archive to it, a person is sharing their disk space and
internet bandwidth in the service of that archive. At first, it is only the
archive’s index (a list of the contents) that is hosted, but if the person
downloads the contents (or even just a small portion of the contents) then
they are sharing in the hosting of the contents as well. Individuals, as
supporters of an archive or members of a community, can organise together to
guarantee the durability and accessibility of an archive, saving a future
UbuWeb from ever having to worry about their ISP ‘pulling the plug’. As
supporters of many archives, as members of many communities, individuals can
use Dat Library to perform this function many times over.
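The arithmetic of this shared hosting can be sketched with a toy model. The code below is not how the Dat protocol replicates data; it only assumes, as described above, that every supporter holds the index and that each holds whatever portion of the contents they have downloaded. The function names and the example file names are hypothetical.

```python
from typing import Dict, Set


def available_items(index: Set[str], holdings: Dict[str, Set[str]], online: Set[str]) -> Set[str]:
    """Items from the archive's index that at least one online supporter has downloaded."""
    reachable: Set[str] = set()
    for peer in online:
        reachable |= holdings.get(peer, set())
    return index & reachable


def replication(index: Set[str], holdings: Dict[str, Set[str]]) -> Dict[str, int]:
    """How many supporters hold a copy of each item listed in the index."""
    return {item: sum(item in held for held in holdings.values()) for item in sorted(index)}
```

In this model, an item held by only one supporter disappears whenever that person is offline, while an item held by two survives the loss of either. A community could read the output of `replication` as a to-do list: the items with the lowest counts are the ones most in need of another host.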

On the Web, individuals are usually users or browsers — they use browsers. In
spite of the ostensible interactivity of the medium, users are kept at a
distance from the actual code, the infrastructure of a website, which is run
on a server. With a distributed protocol like Dat, applications such as
[Beaker Browser](https://beakerbrowser.com) or Dat Library eliminate the
central server, not by destroying it, but by distributing it across all of the
users. Individuals are then not _just_ users, but also hosts. What kind of
subject is this user-host, especially as compared to the user of the server?
Michel Serres writes in _The Parasite_:

> “It is raining; a passer-by comes in. Here is the interrupted meal once
more. Stopped for only a moment, since the traveller is asked to join the
diners. His host does not have to ask him twice. He accepts the invitation and
sits down in front of his bowl. The host is the satyr, dining at home; he is
the donor. He calls to the passer-by, saying to him, be our guest. The guest
is the stranger, the interrupter, the one who receives the soup, agrees to the
meal. The host, the guest: the same word; he gives and receives, offers and
accepts, invites and is invited, master and passer-by… An invariable term
through the transfer of the gift. It might be dangerous not to decide who is
the host and who is the guest, who gives and who receives, who is the parasite
and who is the table d’hote, who has the gift and who has the loss, and where
hospitality begins with hospitality.” — Michel Serres, The Parasite (Baltimore
and London: The Johns Hopkins University Press), 15–16.

Serres notes that _guest_ and _host_ are the same word in French; we might say
the same for _client_ and _server_ in a distributed protocol. And we will
embrace this multiplying hospitality, giving and taking without measure.

### The role of institutions in the age of distribution

David Cameron launched a doomed initiative in 2010 called the Big Society,
which paired large-scale cuts in public programmes with a call for local
communities to voluntarily self-organise to provide these essential services
for themselves. This is not the political future that we should be working
toward: since 2010, austerity policies have resulted in [120,000 excess deaths
in England](http://bmjopen.bmj.com/content/7/11/e017722). In other words,
while it might seem as though _institutions_ might be comparable to _servers_,
inasmuch as both are centralised infrastructures, we should not give them up
or allow them to be dismantled under the assumption that those infrastructures
can simply be distributed and self-organised. On the contrary, institutions
should be defended and organised in order to support the distributed protocols
we are discussing.

One simple way for a larger, more established institution to help ensure the
durability and accessibility of diverse archives is through the provision of
hardware, network capability and some basic technical support. It can back up
the archives of smaller institutions and groups within its own community while
also giving access to its own archives so that those collections might be put
to new uses. A network of smaller institutions, separated by great distances,
might mirror each other’s archives, both as an expression of solidarity and
positive redundancy and also as a means of circulating their archives,
histories and struggles amongst each of the others.

It was the simultaneous recognition that some documents are too important to
be privatised or lost to the threats of neglect, fire, mould, insects, etc.,
that prompted the development of national and state archives (See page 39 in
[Beredo, B. C., Import of the archive: American colonial bureaucracy in the
Philippines, 1898-1916](http://hdl.handle.net/10125/101724)). As public
institutions they were, and still are, tasked with often competing efforts to
house and preserve while simultaneously also ensuring access to public
documents. Fire and unstable weather understandably have given rise to large
fire-proof and climate-controlled buildings as centralised repositories,
accompanied by highly regulated protocols for access. But in light of new
technologies and their new risks, as discussed above, it is compelling to
argue now that, in order to fulfil their public duty, public archives should
be distributing their collections where possible and providing their resources
to smaller institutions and community groups.

Through the provision of disk space, office space, grants, technical support
and employment, larger institutions can materially support smaller
organisations, individuals and their archival afterlives. They can provide
physical space and outreach for dispersed collectors, gathering and piecing
together a fragmented archive.

But what happens as more people and collections are brought in? As more
institutional archives are allowed to circulate outside of institutional
walls? As storage is cut loose from its dependency on the corporate cloud and
into forms of interdependency, such as mutual support networks? Could this
open up spaces for new forms of not-quite-organisations and
queer-institutions? These would be almost-organisations that uncomfortably exist
somewhere between the common categorical markings of the individual and the
institution. In our thinking, it’s not important what these future forms
exactly look like. Rather, as discussed above, what is important to us is that
in writing software we open up spaces for the unknown, and allow others agency
to build the forms that work for them. It is only in such an atmosphere of
infinite hospitality that we see the future of community libraries, individual
collections and other precarious archives.

## A note on this text

This README was, and still is being, collaboratively written in a
[Git](https://en.wikipedia.org/wiki/Git)
[repository](https://en.wikipedia.org/wiki/Repository_\(version_control\)).
Git is a free and open-source tool for version control used in software
development. All the code for HyperReadings, Dat Library and their numerous
associated modules is managed openly using Git and hosted on GitHub under
open source licenses. In a real way, Git’s specification formally binds our
collaboration as well as the open invitation for others to participate. As
such, the form of this README reflects its content. Like this text, these
projects are, by design, works in progress that are malleable to circumstances
and open to contributions, for example by opening a pull request on this
document or raising an issue on our GitHub repositories.

 
