Elbakyan
Why Science is Better with Communism The Case of Sci-Hub transcript and translation
2016


# Transcript and translation of Sci-Hub presentation

_The University of North Texas's [Open Access Symposium
2016](/symposium/2016/) included [a presentation via Skype by Alexandra
Elbakyan](/symposium/2016/why-science-better-communism-case-sci-hub), the
founder of Sci-Hub. [Elbakyan's
slides](http://digital.library.unt.edu/ark:/67531/metadc850001/) (and those of
other presenters) have been archived in the UNT Digital Library, and [video of
this presentation](https://youtu.be/hr7v5FF5c8M) (and others) is now available
on YouTube and soon in the UNT Digital Library._

_The presentation was entitled "Why Science is Better with Communism? The Case
of Sci-Hub." Below is an edited transcript of the presentation produced by
Regina Anikina and Kevin Hawkins, with a translation by Kevin Hawkins and Anna
Pechenina._

**Martin Halbert** : We have a recent addition to our lineup of speakers that
we'll start off the day with: Alexandra Elbakyan. As many of you know,
Alexandra is a Kazakhstani graduate student, computer programmer, and the
creator of the controversial Sci-Hub site. The New York Times has compared her
to Edward Snowden for leaking information and because she avoids American law,
but Ars Technica has compared her to Aaron Swartz--so a controversial figure.
We thought it was very important to include her in the dialog about open
access because we want, in this symposium series, to include all the different
perspectives on copyright, intellectual property, open access, and access to
scholarly information. So I'm delighted that we're actually able to have her
here via Skype to present.

---

**Alexandra Elbakyan** : First of all, thank you for inviting me to share my
views. My name is Alexandra. As you might have guessed, I represent the site
Sci-Hub. It was founded in 2011 and immediately became popular among the local
community. Almost immediately it began providing access to about 40 articles an
hour, and it now provides more than 200,000.

It has to be said that over the course of the site's development it was
strongly supported by donations, and when for various reasons we had to
suspend the service, there were many displeased users who clamored for the
project to return so that the work in their laboratory could continue.

This is the case not just in poor countries; I can say that in rich countries
the public also doesn't have access to scholarly articles. And not all
universities have subscriptions to those resources that are required for
research.

A few of our users insisted that we start charging users, for example, by
allowing one or two articles to be downloaded for free but charging for more,
so that the service would be supported by those who really need it. But I
didn't end up doing that because the goal of the resource is knowledge for
all.

Certain open-access advocates criticize the site, saying that what we really
need is for articles to be in open access from the start, by changing the
business models of publishers. I can respond by saying that the goal of the
project is first and foremost the dissemination of scholarly knowledge in
society, and we have to work in the conditions we find ourselves in. Of
course, if scholarly publishers had a different business model, then perhaps
this project wouldn't be necessary. We can also imagine that if humans had
wings, we wouldn't need airplanes. But in any case we need to fly, so we make
airplanes.

Scholarly publishers quickly dubbed the work of Sci-Hub as piracy. Admittedly
Sci-Hub violates the laws of copyright, but copyright is related to the rights
of intellectual property. That is, scholarly articles are the property of
publishers, and reading them for free turns out to be something like theft
according to the current law.

The concept of intellectual property itself is not new, although it can seem
otherwise. The history of copyright goes back to around the 18th century,
although the first mentions of something similar can be found in the Talmud.
It's just that recently copyright has been at the center of passionate
debate, since some are trying to forbid the free distribution of information on
the internet.

However, the central focus of the debate is on censorship and privacy. The
defense of intellectual property on the internet requires censorship of
websites, and that is consequently a violation of freedom of speech. This also
raises a question of interference in private life - that is, when the
government in some way monitors users who violate copyright. In principle this
is also an intrusion in communication.

However, the very essence of copyright - that is, the concept of intellectual
property - is almost never questioned. That is, whether knowledge can be
someone's property is rarely discussed.

However, our ancestors were even more daring. They did not just question
intellectual property but property in general. That is, there are works in
which we can find the appearance of the idea of communism. There's Thomas
More's _Utopia_ from the 16th century, but actually such works arose much
earlier, even in ancient Greece, where these questions were already being
discussed in 391 BCE.

If we look at the slogans of communism, we see that one of the core concepts
is the struggle against inequality, the revolt of the suppressed classes,
whose members don't have any power against those who have concentrated basic
resources and power in their hands, with the goal of redistributing these
resources.

We can see that even today there is a certain informational inequality, when,
for example, only students and employees of the most wealthy universities have
full access to scholarly information, while access can be completely lacking
for institutions at the next lower tier and for the general public.

An idea arises: if there isn't private property, then there's no basis for
unequal distribution of wealth. In our case as well: if there's no private
intellectual property and all scholarly publications are nationalized, then
all people will have equal access to knowledge.

However, a question arises: if there is no private property, then what can
stimulate a person to work? One of the ideas is that under communism, rather
than greed or aspiration for wealth being a stimulus for work, a person would
aspire to self-development and learning for the betterment of the world.

Even if such values can't be applied to society as a whole, they at least work
in the world of scholarship. Therefore in the Soviet Union there was a true
cult of science - statues were even erected to the glory of science - and
perhaps thanks to this our country was one of the first to go into space.

However, it's one thing to have a revolution, when there's a mass
redistribution of property in society, but an act of theft is another thing.
This, of course, is not yet a revolution, but it's a small protest against the
property rights and the unequal distribution of wealth. Theft as protest has
always been welcomed and approved of in all eras of society. For example, we
all know about Robin Hood, but there have actually been quite a few noble
bandits in history. I've listed just a few of them.

I think that if the state works well, then accordingly it has a working tax
system and a certain system of redistribution of wealth, and then,
accordingly, there's no cause for revolution, for example. But if for some
reason the state works poorly, then people begin to solve the problem for
themselves. In this way, Sci-Hub is an appropriate response to the inequality
that has arisen due to lack of access to information.

Pictured is Aldar Köse, a Kazakh folk hero who used his cunning to deceive
wealthy beys and take possession of their property. It's interesting to note
that beys are always depicted as greedy and stupid. And if you look at what's
written in the blogosphere today about scholarly publishers, you can find
these same characteristics.

There's also the interesting figure of the ancient Greek god Hermes, the
patron of thieves. That is, theft was a sufficiently respected activity that
it had its own god.

There's a researcher named Norman Brown who wrote an academic work called
_Hermes the Thief: The Evolution of a Myth_. It turns out that this myth is
related to a certain revolution in ancient Greek society, when the lower
classes, which lacked property, began to rise up.

For example, the poet Theognis of Megara wrote that "those who were nothing
became everything" and vice versa. This is essentially one of the best-known
communist slogans.

For the ancient Greeks this was related, again as Brown says, to the
appearance of trade. Trade was identified with theft. There was no clear
distinction between the exchange of legal and illegal goods - that is, trade
was just as much considered theft as what we call piracy today.

Why did it turn out this way? Because Hermes was originally a god of
boundaries and transitions. Therefore, we can think that property is related
to keeping something within boundaries. At the same time, the things that
Hermes protected - theft, trade and communication - are related to boundary-
crossing.

If we think about scholarly journals, then any journal is first of all a means
of communication, and therefore it's apparent that keeping journals in closed
access contradicts the essence of what they were intended for.

This is, of course, not even the most interesting thing.

Hermes actually evolved - that is, while he was once an intellectual deity, he
later came to be interpreted as the same as Thoth, the Egyptian god of
knowledge, and further came to oversee such things as astrology, alchemy, and
magic - that is, the things from which, you might say, contemporary sciences
arose. So we can say that contemporary science arose from theft.

Of course, someone can object, saying that contemporary science is very
different from esoterica, such as astrology and alchemy, but if we look at the
history of science, we see that contemporary science differs from the ancient
arts in being more open.

That is, when the movement towards greater openness appeared, contemporary
science also appeared. Once again this is not an argument in support of
scholarly publishers.

Indeed, in the cultural consciousness science and the process of learning have
always been closely associated with theft, beginning with the legend of Adam
and Eve and the forbidden tree, which is called simply "the tree of
knowledge." And it's interesting that Elsevier's logo depicts some kind of
tree, which, accordingly, raises associations with this tree in the Garden of
Eden - the tree of knowledge - from which it was forbidden to eat the fruit.

Likewise we can recall the well-known legend of Prometheus, a part of our
cultural consciousness, who stole some knowledge and brought it to humans.
Once again we see the connection between science and theft.

Nowadays, many scholars have described science as the knowledge of secrets.
However, if we look closely, we have to ask: what is a secret? A secret is
something private, in essence private property. Accordingly, the disclosure of
the secret signifies that it ceases to be property. Once again we see the
contradiction between scholarship and property rights.

We can recall Robert Merton, who studied research institutes and revealed four
basic ethical norms that in his opinion are important for their successful
functioning. One of them is communism - that is, knowledge is shared.

Accordingly, if we look at certain traditional communities, then we find that
those communities that function within a caste system (dividing people by
occupation) usually turn out to have certain castes of people with
intellectual occupations, and if you look at the ethical norms of such castes,
you find that they are also communistic. You can find this, for example, in
Plato. Or even if you look at India, you find the accumulation of wealth is
usually the occupation of another caste.

To sum up, we have the following take-aways. Science, as a part of culture, is
in conflict with private property. Accordingly, scholarly communication is a
dual conflict. What open access is doing is returning science to its essential
roots.

**Audience question** : I'm a former university press director. I'd just like
to point out also that "property is theft" is the watchword of French
anarchism, a famous phrase from Pierre-Joseph Proudhon, so perhaps anarchism
and science are also inseparable. But my main question really has to do with a
challenge that a librarian named Rick Anderson posted on the Scholarly Kitchen
blog two days ago, and that has to do with the fact that evidently Sci-Hub
relies a lot on the access codes that faculty have given to Sci-Hub in one way
or another so that Sci-Hub can gain access to the electronic materials that it
then uses to post on its own site. What Anderson does is point out that if
that information falls into the wrong hands, there are all sorts of terrible
things that can be done because those access codes provide access to personal
information, to student data, to all sorts of other things that could be badly
misused, so my question to you is: what assurances can you give us that that
kind of information will not fall into the wrong hands?

**Elbakyan** : Well, first of all I doubt that it's possible to gain access to
all the information that is listed in the post on the Scholarly Kitchen. As a
rule, these logins and passwords can only be used for access to the proxy
server through which you can download articles, whereas for access to other
things, such as email, the login and password won't work. [ _Audience reacts
with skepticism._ ]

**Audience question** : Earlier this week a number of us participated in a
panel presentation on scholarly publishing and social justice, and one of the
primary points that came out of that was that the people who create the
published product - not necessarily the scientist but the people who actually
do the work that results in the published product - deserve to be paid for
their labor, and there is definitely labor involved. So if you're replacing
the market for these publications and eliminating these people's opportunities
to make money, where is the appropriate distribution of wealth?

**Elbakyan** : First of all, we shouldn't confuse the compensation that a
person receives for their labor with the excessive profits that publishers
wring out by limiting access to information. For example, Sci-Hub also does a
fair amount of work and has high expenses, but these expenses are for some
reason covered by donations - that is, there's no need to close access to
information - that is, it's a red herring to say that if articles are
distributed for free, people won't have anything to eat. One does not follow
from the other. In my opinion, though, an optimal system for funding would
consist of grants, donations, and membership fees.

**Audience question** : You've spoken so far exclusively about Sci-Hub. I
wonder if you could comment just briefly on LibGen and whether you see the two
models as identical or whether there are any material differences between
LibGen and Sci-Hub.

**Elbakyan** : Well, LibGen is primarily a repository. It doesn't download
new articles but is more aimed at preserving that which has already been
downloaded.



Murtaugh
A bag but is language nothing of words
2016


## A bag but is language nothing of words

### From Mondotheque

(language is nothing but a bag of words)

[Michael Murtaugh](/wiki/index.php?title=Michael_Murtaugh "Michael Murtaugh")

In text indexing and other machine reading applications the term "bag of
words" is frequently used to underscore how processing algorithms often
represent text using a data structure (word histograms or weighted vectors)
where the original order of the words in sentence form is stripped away. While
"bag of words" might well serve as a cautionary reminder to programmers of the
essential violence perpetrated on a text and a call to critically question the
efficacy of methods based on subsequent transformations, the expression's use
seems in practice more like a badge of pride or a schoolyard taunt that would
go: Hey language: you're nothin' but a big BAG-OF-WORDS.

## Bag of words

In information retrieval and other so-called _machine-reading_ applications
(such as text indexing for web search engines) the term "bag of words" is used
to underscore how in the course of processing a text the original order of the
words in sentence form is stripped away. The resulting representation is then
a collection of each unique word used in the text, typically weighted by the
number of times the word occurs.
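
The transformation described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `bag_of_words` and the crude tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for the example, not a description of any particular search engine's method.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Return each unique lowercase word mapped to the number of times it occurs."""
    # Assumed tokenization: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("the cat sat on the mat")
print(bag["the"])  # 2
print(len(bag))    # 5 unique words
```

Note that nothing in the resulting `Counter` records where in the sentence each word appeared.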

Bags of words, also known as word histograms or weighted term vectors, are a
standard part of the data engineer's toolkit. But why such a drastic
transformation? The utility of "bag of words" is in how it makes text amenable
to code, first in that it's very straightforward to implement the translation
from a text document to a bag of words representation. More significantly,
this transformation then opens up a wide collection of tools and techniques
for further transformation and analysis purposes. For instance, a number of
libraries available in the booming field of "data sciences" work with "high
dimension" vectors; bag of words is a way to transform a written document into
a mathematical vector where each "dimension" corresponds to the (relative)
quantity of each unique word. While physically unimaginable and abstract
(imagine each of Shakespeare's works as points in a 14 million dimensional
space), from a formal mathematical perspective, it's quite a comfortable idea,
and many complementary techniques (such as principal component analysis) exist
to reduce the resulting complexity.
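
As a rough sketch of what "one dimension per unique word" means in practice, the following Python turns two short texts into vectors of relative word frequencies over a shared vocabulary. The helper name `to_vector` and the whitespace tokenization are assumptions for illustration; real toolkits differ in how they tokenize and weight.

```python
from collections import Counter


def to_vector(text: str, vocabulary: list[str]) -> list[float]:
    """Map a non-empty text to relative word frequencies, one per vocabulary word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return [counts[word] / total for word in vocabulary]


documents = ["to be or not to be", "to sleep perchance to dream"]

# The shared, sorted vocabulary fixes which dimension belongs to which word.
vocabulary = sorted({word for doc in documents for word in doc.split()})
vectors = [to_vector(doc, vocabulary) for doc in documents]

print(len(vocabulary))  # 7 dimensions for these two texts
```

Every text, however long, becomes a point in the same space, which is what makes techniques that operate on vectors applicable.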

What's striking about a bag of words representation, given its centrality in so
many text retrieval applications, is its irreversibility. Given a bag of words
representation of a text, the task of producing the original
text would require in essence the "brain" of a writer to recompose sentences,
working with the patience of a devoted cryptogram puzzler to draw from the
precise stock of available words. While "bag of words" might well serve as a
cautionary reminder to programmers of the essential violence perpetrated on a
text and a call to critically question the efficacy of methods based on
subsequent transformations, the expression's use seems in practice more like a
badge of pride or a schoolyard taunt that would go: Hey language: you're
nothing but a big BAG-OF-WORDS. Following this spirit of the term, "bag of
words" celebrates a perfunctory step of "breaking" a text into a purer form
amenable to computation, to stripping language of its silly redundant
repetitions and foolishly contrived stylistic phrasings to reveal a purer
inner essence.
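
The irreversibility can be made concrete with a small Python sketch (the helper `bag` and the example sentences are invented for illustration): two sentences with opposite meanings collapse into the same bag, so the bag alone cannot tell us which one was written.

```python
from collections import Counter


def bag(text: str) -> Counter:
    """Count the words of a text, discarding their order."""
    return Counter(text.lower().split())


first = "the dog bit the man"
second = "the man bit the dog"

print(first == second)            # False: the sentences differ
print(bag(first) == bag(second))  # True: their bags are identical
```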

## Book of words

Lieber's Standard Telegraphic Code, first published in 1896 and republished in
various updated editions through the early 1900s, is an example of one of
several competing systems of telegraph code books. The idea was for both
senders and receivers of telegraph messages to use the books to translate
their messages into a sequence of code words which can then be sent for less
money as telegraph messages were paid by the word. In the front of the book, a
list of examples gives a sampling of how messages like: "Have bought for your
account 400 bales of cotton, March delivery, at 8.34" can be conveyed by a
telegram with the message "Ciotola, Delaboravi". In each case the reduction of
number of transmitted words is highlighted to underscore the efficacy of the
method. Like a dictionary or thesaurus, the book is primarily organized around
key words, such as _act_ , _advice_ , _affairs_ , _bags_ , _bail_ , and
_bales_ , under which exhaustive lists of useful phrases involving the
corresponding word are provided in the main pages of the volume. [1]

[![Liebers
P1016847.JPG](/wiki/images/4/41/Liebers_P1016847.JPG)](/wiki/index.php?title=File:Liebers_P1016847.JPG)

[![Liebers
P1016859.JPG](/wiki/images/3/35/Liebers_P1016859.JPG)](/wiki/index.php?title=File:Liebers_P1016859.JPG)

[![Liebers
P1016861.JPG](/wiki/images/3/34/Liebers_P1016861.JPG)](/wiki/index.php?title=File:Liebers_P1016861.JPG)

[![Liebers
P1016869.JPG](/wiki/images/f/fd/Liebers_P1016869.JPG)](/wiki/index.php?title=File:Liebers_P1016869.JPG)

> [...] my focus in this chapter is on the inscription technology that grew
parasitically alongside the monopolistic pricing strategies of telegraph
companies: telegraph code books. Constructed under the bywords “economy,”
“secrecy,” and “simplicity,” telegraph code books matched phrases and words
with code letters or numbers. The idea was to use a single code word instead
of an entire phrase, thus saving money by serving as an information
compression technology. Generally economy won out over secrecy, but in
specialized cases, secrecy was also important.[2]

In Katherine Hayles' chapter devoted to telegraph code books she observes how:

> The interaction between code and language shows a steady movement away from
a human-centric view of code toward a machine-centric view, thus anticipating
the development of full-fledged machine codes with the digital computer. [3]

[![Liebers
P1016851.JPG](/wiki/images/1/13/Liebers_P1016851.JPG)](/wiki/index.php?title=File:Liebers_P1016851.JPG)
Aspects of this transitional moment are apparent in a notice
prominently inserted in Lieber's code book:

> After July, 1904, all combinations of letters that do not exceed ten will
pass as one cipher word, provided that it is pronounceable, or that it is
taken from the following languages: English, French, German, Dutch, Spanish,
Portuguese or Latin -- International Telegraphic Conference, July 1903 [4]

Conforming to international conventions regulating telegraph communication at
that time, the stipulation that code words be actual words drawn from a
variety of European languages (many of Lieber's code words are indeed
arbitrary Dutch, German, and Spanish words) underscores this particular moment
of transition as reference to the human body in the form of "pronounceable"
speech from representative languages begins to yield to the inherent potential
for arbitrariness in digital representation.

What telegraph code books do is remind us of the relation of language in
general to economy. Whether these are economies of memory, attention, costs
paid to a telecommunications company, or computer processing time
or storage space, encoding language or knowledge in any form of writing is a
form of shorthand and always involves an interplay with what one expects to
perform or "get out" of the resulting encoding.

> Along with the invention of telegraphic codes comes a paradox that John
Guillory has noted: code can be used both to clarify and occlude. Among the
sedimented structures in the technological unconscious is the dream of a
universal language. Uniting the world in networks of communication that
flashed faster than ever before, telegraphy was particularly suited to the
idea that intercultural communication could become almost effortless. In this
utopian vision, the effects of continuous reciprocal causality expand to
global proportions capable of radically transforming the conditions of human
life. That these dreams were never realized seems, in retrospect, inevitable.
[5]

[![Liebers
P1016884.JPG](/wiki/images/9/9c/Liebers_P1016884.JPG)](/wiki/index.php?title=File:Liebers_P1016884.JPG)

[![Liebers
P1016852.JPG](/wiki/images/7/74/Liebers_P1016852.JPG)](/wiki/index.php?title=File:Liebers_P1016852.JPG)

[![Liebers
P1016880.JPG](/wiki/images/1/11/Liebers_P1016880.JPG)](/wiki/index.php?title=File:Liebers_P1016880.JPG)

Far from providing a universal system of encoding messages in the English
language, Lieber's code is quite clearly designed for the particular needs and
conditions of its use. In addition to the phrases ordered by keywords, the
book includes a number of tables of terms for specialized use. One table lists
a set of words used to describe all possible permutations of numeric grades of
coffee (Choliam = 3,4, Choliambos = 3,4,5, Choliba = 4,5, etc.); another table
lists pairs of code words to express the respective daily rise or fall of the
price of coffee at the port of Le Havre in increments of a quarter of a Franc
per 50 kilos ("Chirriado = prices have advanced 1 1/4 francs"). From an
archaeological perspective, Lieber's code book reveals a cross section of
the needs and desires of early 20th century business communication between the
United States and its trading partners.
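
Seen from a programmer's perspective, such a table is a lookup structure shared by sender and receiver. The Python sketch below uses only the three coffee-grade entries quoted above; the dictionary layout and the function names `encode` and `decode` are assumptions made for illustration and say nothing about how the book itself is organized.

```python
# The three entries quoted in the text: code word -> numeric coffee grades.
COFFEE_GRADES = {
    "Choliam": (3, 4),
    "Choliambos": (3, 4, 5),
    "Choliba": (4, 5),
}

# The sender needs the reverse direction: grades -> code word.
CODE_WORDS = {grades: word for word, grades in COFFEE_GRADES.items()}


def encode(grades) -> str:
    """Replace a combination of grades with its single code word."""
    return CODE_WORDS[tuple(grades)]


def decode(word: str) -> tuple:
    """Recover the combination of grades from a received code word."""
    return COFFEE_GRADES[word]


print(encode([3, 4, 5]))  # Choliambos
print(decode("Choliba"))  # (4, 5)
```

Unlike a bag of words, this encoding is reversible, but only for the combinations the book's compilers anticipated.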

The advertisements lining Lieber's code book further situate its use and
that of commercial telegraphy. Among the many advertisements for banking and
law services, office equipment, and alcohol are several ads for gunpowder and
explosives, drilling equipment, and metallurgic services, all with specific
applications to mining. Extending telegraphy's formative role for ship-to-
shore and ship-to-ship communication for reasons of safety, commercial
telegraphy extended this network of communication to include those parties
coordinating the "raw materials" being mined, grown, or otherwise extracted
from overseas sources and shipped back for sale.

## "Raw data now!"

From [La ville intelligente - Ville de la connaissance](/wiki/index.php?title=La_ville_intelligente_-_Ville_de_la_connaissance "La ville intelligente - Ville de la connaissance"):

Étant donné que les nouvelles formes modernistes et l'utilisation de matériaux
propageaient l'abondance d'éléments décoratifs, Paul Otlet croyait en la
possibilité du langage comme modèle de « [données
brutes](/wiki/index.php?title=Bag_of_words "Bag of words") », le réduisant aux
informations essentielles et aux faits sans ambiguïté, tout en se débarrassant
de tous les éléments inefficaces et subjectifs.


From [The Smart City - City of Knowledge](/wiki/index.php?title=The_Smart_City_-_City_of_Knowledge "The Smart City - City of Knowledge"):

As new modernist forms and use of materials propagated the abundance of
decorative elements, Otlet believed in the possibility of language as a model
of '[raw data](/wiki/index.php?title=Bag_of_words "Bag of words")', reducing
it to essential information and unambiguous facts, while removing all
inefficient assets of ambiguity or subjectivity.


> Tim Berners-Lee: [...] Make a beautiful website, but first give us the
unadulterated data, we want the data. We want unadulterated data. OK, we have
to ask for raw data now. And I'm going to ask you to practice that, OK? Can
you say "raw"?

>

> Audience: Raw.

>

> Tim Berners-Lee: Can you say "data"?

>

> Audience: Data.

>

> TBL: Can you say "now"?

>

> Audience: Now!

>

> TBL: Alright, "raw data now"!

>

> [...]

>

> So, we're at the stage now where we have to do this -- the people who think
it's a great idea. And all the people -- and I think there's a lot of people
at TED who do things because -- even though there's not an immediate return on
the investment because it will only really pay off when everybody else has
done it -- they'll do it because they're the sort of person who just does
things which would be good if everybody else did them. OK, so it's called
linked data. I want you to make it. I want you to demand it. [6]

## Un/Structured

As graduate students at Stanford, Sergey Brin and Lawrence (Larry) Page had an
early interest in producing "structured data" from the "unstructured" web. [7]

> The World Wide Web provides a vast source of information of almost all
types, ranging from DNA databases to resumes to lists of favorite restaurants.
However, this information is often scattered among many web servers and hosts,
using many different formats. If these chunks of information could be
extracted from the World Wide Web and integrated into a structured form, they
would form an unprecedented source of information. It would include the
largest international directory of people, the largest and most diverse
databases of products, the greatest bibliography of academic works, and many
other useful resources. [...]

>

> **2.1 The Problem**
> Here we define our problem more formally:
> Let D be a large database of unstructured information such as the World
Wide Web [...] [8]

In a paper titled _Dynamic Data Mining_, Brin and Page situate their research
as a search for _rules_ (statistical correlations) between words used in web
pages. The "baskets" they mention stem from the origins of "market basket"
techniques developed to find correlations between the items recorded in the
purchase receipts of supermarket customers. In their case, they deal with web
pages rather than shopping baskets, and words instead of purchases. In
transitioning to the much larger scale of the web, they describe the
usefulness of their research in terms of its computational economy, that is,
the ability to tackle the scale of the web and still complete the task in a
reasonably short amount of time using contemporary computing power.

> A traditional algorithm could not compute the large itemsets in the lifetime
of the universe. [...] Yet many data sets are difficult to mine because they
have many frequently occurring items, complex relationships between the items,
and a large number of items per basket. In this paper we experiment with word
usage in documents on the World Wide Web (see Section 4.2 for details about
this data set). This data set is fundamentally different from a supermarket
data set. Each document has roughly 150 distinct words on average, as compared
to roughly 10 items for cash register transactions. We restrict ourselves to a
subset of about 24 million documents from the web. This set of documents
contains over 14 million distinct words, with tens of thousands of them
occurring above a reasonable support threshold. Very many sets of these words
are highly correlated and occur often. [9]
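
To give a sense of what treating documents as "baskets" involves, here is a deliberately naive Python sketch that counts how often pairs of words occur together and keeps those above a support threshold. It is the brute-force approach, not Brin and Page's method; the function name `frequent_pairs` and the toy documents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return word pairs that co-occur in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Each unordered pair of distinct words in a basket is counted once.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


documents = ["raw data now", "we want raw data", "linked data now"]
baskets = [doc.split() for doc in documents]

print(frequent_pairs(baskets, min_support=2))
# {('data', 'now'): 2, ('data', 'raw'): 2}
```

The scaling problem is visible even here: a basket of 10 items contains 45 pairs, while a document of 150 distinct words contains 11,175, and that is before considering larger sets of words.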

## Un/Ordered

In programming, I've encountered a recurring "problem" that's quite
symptomatic. It goes something like this: you (the programmer) have managed to
cobble together a lovely "content management system" (either from scratch, or
using any number of helpful frameworks) where your user can enter some "items"
into a database, for instance to store bookmarks. Afterwards, these items are
automatically presented, in order, as a list (say on a web page). The author's
response: It's great, except... could this bookmark come before that one? The
problem stems from the fact that the database ordering (a core functionality
provided by any database) applies a sorting logic that's almost but not quite
right. A typical example is the sorting of names, where the details (where to
place a name that starts with a Norwegian "Ø", for instance) are
language-specific, and when a mixture of languages occurs, no single ordering
is necessarily "correct". The (often) exasperated programmer might hastily add
an additional database field so that each item can also have an "order"
(perhaps in the form of a date or some other kind of (alpha)numerical
"sorting" value) to be used to correctly order the resulting list. Now the
author has a means, awkward and indirect but workable, to control the order of
the presented data on the start page. But one might well ask, why not just
edit the resulting listing as a document? Not possible! Contemporary content
management systems are based on a data flow from the "pure" source of a
database, through controlling code and templates, to produce a document as a
result. The document isn't the data, it's the end result of an irreversible
process. This problem, in this and many variants, is widespread and reveals an
essential backwardness in a particular "computer scientist" mindset about what
constitutes "data", and in particular its relationship to order, a mindset
that turns what might be a straightforward question of editing a document into
an over-engineered database.
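The scenario can be made concrete with a small sketch. Everything in it is an assumption for illustration: the bookmark names, the `order` field and the `norwegian_key` helper are hypothetical, and a real system would more likely rely on the collation support of its database or locale. It shows three orderings of the same items: the default code point order, an order following the Norwegian alphabet (which ends Z, Æ, Ø, Å), and the hand-maintained order the author actually wanted.

```python
import unicodedata

# Hypothetical bookmark records. The "order" field is the hand-maintained
# sorting value described above, not part of any real CMS schema.
bookmarks = [
    {"name": "Zimmer", "order": 3},
    {"name": "Østby", "order": 1},
    {"name": "Ågren", "order": 2},
    {"name": "Olsen", "order": 4},
]


def nfc(text):
    # Normalize so that "Å" is one code point rather than "A" plus a combining ring.
    return unicodedata.normalize("NFC", text)


# 1. Default string comparison: plain Unicode code point order,
#    which places Å (U+00C5) before Ø (U+00D8).
by_codepoint = sorted(nfc(b["name"]) for b in bookmarks)

# 2. A language-specific rule: the Norwegian alphabet ends ... z, æ, ø, å.
NORWEGIAN = "abcdefghijklmnopqrstuvwxyzæøå"


def norwegian_key(name):
    # Map each letter to its alphabet position; unknown characters map to -1.
    return [NORWEGIAN.find(ch) for ch in nfc(name).lower()]


by_norwegian = sorted((b["name"] for b in bookmarks), key=norwegian_key)

# 3. The workaround: ignore the names and sort on the extra "order" field.
by_hand = [b["name"] for b in sorted(bookmarks, key=lambda b: b["order"])]

print(by_codepoint)
print(by_norwegian)
print(by_hand)
```

None of the three lists is the "correct" one. Each encodes a different decision about order, and only the third records what the author asked for, at the cost of an extra field maintained by hand.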

Recently, while working with Nikolaos Vogiatzis, whose research explores
playful and radically subjective alternatives to the list, I noticed how
struck he was by the fact that HTML, from its earliest specifications (still
valid today), has separate elements (OL and UL) for "ordered" and "unordered"
lists.

> The representation of the list is not defined here, but a bulleted list for
unordered lists, and a sequence of numbered paragraphs for an ordered list
would be quite appropriate. Other possibilities for interactive display
include embedded scrollable browse panels. [10]

Vogiatzis' surprise lay in the idea of a list ever being considered
"unordered" (or in opposition to the language used in the specification, for
order to ever be considered "insignificant"). Indeed in its suggested
representation, still followed by modern web browsers, the only difference
between the two visually is that UL items are preceded by a bullet symbol,
while OL items are numbered.

The idea of ordering runs deep in programming practice, where essentially
different data structures are employed depending on whether order is to be
maintained. The keys (or indexes) of a "hash" table, for instance (also known
as an associative array), are ordered in an unpredictable way governed by the
particular implementation of the representation. This data structure,
extremely prevalent in contemporary programming practice, sacrifices order to
offer other kinds of efficiency (fast text-based retrieval, for instance).
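A toy hash table makes this trade-off visible. The sketch below is a deliberately naive illustration, not how any production language implements the structure: the class name `ToyHashTable`, the fixed bucket count and the sum-of-code-points hash function are all assumptions chosen for brevity. (Python's own built-in `dict` has preserved insertion order since version 3.7, which is why a hand-made table is used here.)

```python
class ToyHashTable:
    """A naive hash table with a fixed number of buckets (illustrative only)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Toy hash function (an assumption): sum of code points modulo table size.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # Retrieval only inspects one bucket, which is what makes lookup fast.
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        raise KeyError(key)

    def keys(self):
        # Iteration walks the buckets in turn, so the resulting order is a
        # by-product of the hash function, not of the order of insertion.
        return [key for bucket in self.buckets for key, _ in bucket]


inserted = ["language", "is", "a", "bag", "of", "words"]
table = ToyHashTable()
for word in inserted:
    table.put(word, len(word))

print(inserted)
print(table.keys())
print(table.get("bag"))
```

The same words come back out, and each can be retrieved directly by name, but the sequence in which they were entered is not recorded anywhere in the structure.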

## Data mining

In announcing Google's impending data center in Mons, Belgian prime minister
Di Rupo invoked the link between the history of the mining industry in the
region and the present and future interest in "data mining" as practiced by IT
companies such as Google.

Whether speaking of bales of cotton, barrels of oil, or bags of words, what
links these subjects is the way in which the notion of "raw material" obscures
the labor and power structures employed to secure them. "Raw" is always
relative: "purity" depends on processes of "refinement" that typically carry
social/ecological impact.

Stripping language of order is an act of "disembodiment", detaching it from
the acts of writing and reading. The shift from (human) reading to machine
reading involves a shift of responsibility from the individual human body to
the obscured responsibilities and seemingly inevitable forces of the
"machine", be it the machine of a market or the machine of an algorithm.

From [X = Y](/wiki/index.php?title=X_%3D_Y "X = Y"):

> Still, it is reassuring to know that the products hold traces of the work,
> that even with the progressive removal of human signs in automated processes,
> the workers' presence never disappears completely. This presence is proof of
> the materiality of information production, and becomes a sign of the economies
> and paradigms of efficiency and profitability that are involved.


The computer scientists' view of textual content as "unstructured", be it in a
webpage or the OCR-scanned pages of a book, reflects a neglect of the
processes and labor of writing, editing, design, layout, typesetting, and
eventually publishing, collecting and cataloging [11].

"Unstructured", to the computer scientist, means non-conformant to particular
forms of machine reading. "Structuring", then, is a social process by which
particular (additional) conventions are agreed upon and employed. Computer
scientists often view text through the eyes of their particular reading
algorithm, and in the process (voluntarily) blind themselves to the work
practices which have produced and maintain these "resources".

Berners-Lee, in chastising his audience of web publishers to not only publish
online but to release "unadulterated" data, betrays a lack of imagination in
considering how language is itself structured, and a blindness to the need for
more than additional technical standards to connect to existing publishing
practices.

Last Revision: 2*08*2016

1. Benjamin Franklin Lieber, Lieber's Standard Telegraphic Code, 1896, New York
2. Katherine Hayles, "Technogenesis in Action: Telegraph Code Books and the Place of the Human", How We Think: Digital Media and Contemporary Technogenesis, 2006
3. Hayles
4. Lieber's
5. Hayles
6. Tim Berners-Lee: The next web, TED Talk, February 2009
7. "Research on the Web seems to be fashionable these days and I guess I'm no exception." from Brin's [Stanford webpage](http://infolab.stanford.edu/~sergey/)
8. Extracting Patterns and Relations from the World Wide Web, Sergey Brin, Proceedings of the WebDB Workshop at EDBT 1998
9. Dynamic Data Mining: Exploring Large Rule Spaces by Sampling; Sergey Brin and Lawrence Page, 1998; p. 2
10. Hypertext Markup Language (HTML): "Internet Draft", Tim Berners-Lee and Daniel Connolly, June 1993
11.

Retrieved from
[https://www.mondotheque.be/wiki/index.php?title=A_bag_but_is_language_nothing_of_words&oldid=8480](https://www.mondotheque.be/wiki/index.php?title=A_bag_but_is_language_nothing_of_words&oldid=8480)

 
