USDC
Complaint: Elsevier v. SciHub and LibGen
2015


Case 1:15-cv-04282-RWS Document 1 Filed 06/03/15 Page 1 of 16

UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

Index No. 15-cv-4282 (RWS)
COMPLAINT

ELSEVIER INC., ELSEVIER B.V., ELSEVIER LTD.
Plaintiffs,

v.

SCI-HUB d/b/a WWW.SCI-HUB.ORG, THE LIBRARY GENESIS PROJECT d/b/a LIBGEN.ORG, ALEXANDRA ELBAKYAN, JOHN DOES 1-99,
Defendants.

Plaintiffs Elsevier Inc., Elsevier B.V., and Elsevier Ltd. (collectively “Elsevier”),
by their attorneys DeVore & DeMarco LLP, for their complaint against www.sci-hub.org,
www.libgen.org, Alexandra Elbakyan, and John Does 1-99 (collectively the “Defendants”),
allege as follows:

NATURE OF THE ACTION

1. This is a civil action seeking damages and injunctive relief for: (1) copyright infringement under the copyright laws of the United States (17 U.S.C. § 101 et seq.); and (2) violations of the Computer Fraud and Abuse Act, 18 U.S.C. § 1030, based upon Defendants’ unlawful access to, use, reproduction, and distribution of Elsevier’s copyrighted works. Defendants’ actions in this regard have caused and continue to cause irreparable injury to Elsevier and its publishing partners (including scholarly societies) for which it publishes certain journals.

PARTIES

2. Plaintiff Elsevier Inc. is a corporation organized under the laws of Delaware, with its principal place of business at 360 Park Avenue South, New York, New York 10010.

3. Plaintiff Elsevier B.V. is a corporation organized under the laws of the Netherlands, with its principal place of business at Radarweg 29, Amsterdam, 1043 NX, Netherlands.

4. Plaintiff Elsevier Ltd. is a corporation organized under the laws of the United Kingdom, with its principal place of business at 125 London Wall, EC2Y 5AS United Kingdom.

5. Upon information and belief, Defendant Sci-Hub is an individual or organization engaged in the operation of the website accessible at the URL “www.sci-hub.org,” and related subdomains, including but not limited to the subdomain “www.sciencedirect.com.sci-hub.org,”
“www.elsevier.com.sci-hub.org,” “store.elsevier.com.sci-hub.org,” and various subdomains
incorporating the company and product names of other major global publishers (collectively with www.sci-hub.org the “Sci-Hub Website”). The sci-hub.org domain name is registered by
“Fundacion Private Whois,” located in Panama City, Panama, to an unknown registrant. As of
the date of this filing, the Sci-Hub Website is assigned the IP address 31.184.194.81. This IP address is part of a range of IP addresses assigned to Petersburg Internet Network Ltd., a webhosting company located in Saint Petersburg, Russia.

6. Upon information and belief, Defendant Library Genesis Project is an organization which operates an online repository of copyrighted materials accessible through the website located at the URL “libgen.org” as well as a number of other “mirror” websites
(collectively the “Libgen Domains”). The libgen.org domain is registered by “Whois Privacy
Corp.,” located at Ocean Centre, Montagu Foreshore, East Bay Street, Nassau, New Providence, Bahamas, to an unknown registrant. As of the date of this filing, libgen.org is assigned the IP address 93.174.95.71. This IP address is part of a range of IP addresses assigned to Ecatel Ltd., a web-hosting company located in Amsterdam, the Netherlands.

7. The Libgen Domains include “elibgen.org,” “libgen.info,” “lib.estrorecollege.org,” and “bookfi.org.”

8. Upon information and belief, Defendant Alexandra Elbakyan is the principal owner and/or operator of Sci-Hub. Upon information and belief, Elbakyan is a resident of Almaty, Kazakhstan.

9. Elsevier is unaware of the true names and capacities of the individuals named as Does 1-99 in this Complaint (together with Alexandra Elbakyan, the “Individual Defendants”),
and their residence and citizenship are also unknown. Elsevier will amend its Complaint to allege the names, capacities, residence and citizenship of the Doe Defendants when their identities are learned.

10. Upon information and belief, the Individual Defendants are the owners and operators of numerous websites, including Sci-Hub and the websites located at the various
Libgen Domains, and a number of e-mail addresses and accounts at issue in this case.

11. The Individual Defendants have participated, exercised control over, and benefited from the infringing conduct described herein, which has resulted in substantial harm to
the Plaintiffs.

JURISDICTION AND VENUE

12. This is a civil action arising from the Defendants’ violations of the copyright laws of the United States (17 U.S.C. § 101 et seq.) and the Computer Fraud and Abuse Act (“CFAA”), 18 U.S.C. § 1030. Therefore, the Court has subject matter jurisdiction over this action pursuant to 28 U.S.C. § 1331.

13. Upon information and belief, the Individual Defendants own and operate computers and Internet websites and engage in conduct that injures Plaintiff in this district, while
also utilizing instrumentalities located in the Southern District of New York to carry out the acts complained of herein.

14. Defendants have affirmatively directed actions at the Southern District of New York by utilizing computer servers located in the District without authorization and by
unlawfully obtaining access credentials belonging to individuals and entities located in the
District, in order to unlawfully access, copy, and distribute Elsevier's copyrighted materials
which are stored on Elsevier’s ScienceDirect platform.

15. Defendants have committed the acts complained of herein through unauthorized access to Plaintiffs’ copyrighted materials which are stored and maintained on computer servers located in the Southern District of New York.

16. Defendants have undertaken the acts complained of herein with knowledge that such acts would cause harm to Plaintiffs and their customers in both the Southern District of New York and elsewhere. Defendants have caused the Plaintiff injury while deriving revenue from interstate or international commerce by committing the acts complained of herein. Therefore, this Court has personal jurisdiction over Defendants.

17. Venue in this District is proper under 28 U.S.C. § 1391(b) because a substantial part of the events giving rise to Plaintiffs’ claims occurred in this District and because the property that is the subject of Plaintiffs’ claims is situated in this District.

FACTUAL ALLEGATIONS

Elsevier’s Copyrights in Publications on ScienceDirect

18. Elsevier is a world leading provider of professional information solutions in the Science, Medical, and Health sectors. Elsevier publishes, markets, sells, and licenses academic textbooks, journals, and examinations in the fields of science, medicine, and health. The majority of Elsevier’s institutional customers are universities, governmental entities, educational institutions, and hospitals that purchase physical and electronic copies of Elsevier’s products and access to Elsevier’s digital libraries. Elsevier distributes its scientific journal articles and book chapters electronically via its proprietary subscription database “ScienceDirect” (www.sciencedirect.com). In most cases, Elsevier holds the copyright and/or exclusive distribution rights to the works available through ScienceDirect. In addition, Elsevier holds trademark rights in “Elsevier,” “ScienceDirect,” and several other related trade names.

19. The ScienceDirect database is home to almost one-quarter of the world’s peer-reviewed, full-text scientific, technical and medical content. The ScienceDirect service features sophisticated search and retrieval tools for students and professionals which facilitate access to over 10 million copyrighted publications. More than 15 million researchers, health care professionals, teachers, students, and information professionals around the globe rely on ScienceDirect as a trusted source of nearly 2,500 journals and more than 26,000 book titles.

20. Authorized users are provided access to the ScienceDirect platform by way of non-exclusive, non-transferable subscriptions between Elsevier and its institutional customers. According to the terms and conditions of these subscriptions, authorized users of ScienceDirect must be users affiliated with the subscriber (e.g., full-time and part-time students, faculty, staff and researchers of subscriber universities and individuals using computer terminals within the library facilities at the subscriber for personal research, education or other non-corporate use).

21. A substantial portion of American research universities maintain active subscriptions to ScienceDirect. These subscriptions, under license, allow the universities to provide their faculty and students access to the copyrighted works within the ScienceDirect database.

22. Elsevier stores and maintains the copyrighted material available in ScienceDirect on servers owned and operated by a third party whose servers are located in the Southern District of New York and elsewhere. In order to optimize performance, these third-party servers collectively operate as a distributed network which serves cached copies of Elsevier’s copyrighted materials by way of particular servers that are geographically close to the user. For example, a user that accesses ScienceDirect from a University located in the Southern District of New York will likely be served that content from a server physically located in the District.

Authentication of Authorized University ScienceDirect Users

23. Elsevier maintains the integrity and security of the copyrighted works accessible on ScienceDirect by allowing only authenticated users access to the platform. Elsevier authenticates educational users who access ScienceDirect through their affiliated university’s subscription by verifying that they are able to access ScienceDirect from a computer system or network previously identified as belonging to a subscribing university.

24. Elsevier does not track individual educational users’ access to ScienceDirect. Instead, Elsevier verifies only that the user has authenticated access to a subscribing university.

25. Once an educational user authenticates his computer with ScienceDirect on a university network, that computer is permitted access to ScienceDirect for a limited amount of time without re-authenticating. For example, a student could access ScienceDirect from their laptop while sitting in a university library, then continue to access ScienceDirect using that laptop from their dorm room later that day. After a specified period of time has passed, however, a user will have to re-authenticate his or her computer’s access to ScienceDirect by connecting to the platform through a university network.

26. As a matter of practice, educational users access university networks, and thereby authenticate their computers with ScienceDirect, primarily through one of two methods. First, the user may be physically connected to a university network, for example by taking their computer to the university’s library. Second, the user may connect remotely to the university’s network using a proxy connection. Universities offer proxy connections to their students and faculty so that those users may access university computing resources – including access to research databases such as ScienceDirect – from remote locations which are unaffiliated with the university. This practice facilitates the use of ScienceDirect by students and faculty while they are at home, travelling, or otherwise off-campus.

Defendants’ Unauthorized Access to University Proxy Networks to Facilitate Copyright Infringement

27. Upon information and belief, Defendants are reproducing and distributing unauthorized copies of Elsevier’s copyrighted materials, unlawfully obtained from ScienceDirect, through Sci-Hub and through various websites affiliated with the Library Genesis Project. Specifically, Defendants utilize their websites located at sci-hub.org and at the Libgen Domains to operate an international network of piracy and copyright infringement by circumventing legal and authorized means of access to the ScienceDirect database. Defendants’ piracy is supported by the persistent intrusion and unauthorized access to the computer networks of Elsevier and its institutional subscribers, including universities located in the Southern District of New York.

28. Upon information and belief, Defendants have unlawfully obtained and continue to unlawfully obtain student or faculty access credentials which permit proxy connections to universities which subscribe to ScienceDirect, and use these credentials to gain unauthorized access to ScienceDirect.

29. Upon information and belief, Defendants have used and continue to use such access credentials to authenticate access to ScienceDirect and, subsequently, to obtain copyrighted scientific journal articles therefrom without valid authorization.

30. The Sci-Hub website requires user interaction in order to facilitate its illegal copyright infringement scheme. Specifically, before a Sci-Hub user can obtain access to copyrighted scholarly journals, articles, and books that are maintained by ScienceDirect, he must first perform a search on the Sci-Hub page. A Sci-Hub user may search for content using either (a) a general keyword-based search, or (b) a journal, article or book identifier (such as a Digital Object Identifier, PubMed Identifier, or the source URL).

31. When a user performs a keyword search on Sci-Hub, the website returns a proxied version of search results from the Google Scholar search database.[1] When a user selects one of the search results, if the requested content is not available from the Library Genesis Project, Sci-Hub unlawfully retrieves the content from ScienceDirect using the access previously obtained. Sci-Hub then provides a copy of that article to the requesting user, typically in PDF format. If, however, the requested content can be found in the Library Genesis Project repository, upon information and belief, Sci-Hub obtains the content from the Library Genesis Project repository and provides that content to the user.

[1] Google Scholar provides its users the capability to search for scholarly literature, but does not provide the full text of copyrighted scientific journal articles accessible through paid subscription services such as ScienceDirect. Instead, Google Scholar provides bibliographic information concerning such articles along with a link to the platform through which the article may be purchased or accessed by a subscriber.

32. When a user searches on Sci-Hub for an article available on ScienceDirect using a journal or article identifier, the user is redirected to a proxied version of the ScienceDirect page where the user can download the requested article at no cost. Upon information and belief, Sci-Hub facilitates this infringing conduct by using unlawfully-obtained access credentials to university proxy servers to establish remote access to ScienceDirect through those proxy servers. If, however, the requested content can be found in the Library Genesis Project repository, upon information and belief, Sci-Hub obtains the content from it and provides it to the user.

33. Upon information and belief, Sci-Hub engages in no activity other than the illegal reproduction and distribution of digital copies of Elsevier’s copyrighted works and the copyrighted works of other publishers, and the encouragement, inducement, and material contribution to the infringement of the copyrights of those works by third parties – i.e., the users of the Sci-Hub website.

34. Upon information and belief, in addition to the blatant and rampant infringement of Elsevier’s copyrights as described above, the Defendants have also used the Sci-Hub website to earn revenue from the piracy of copyrighted materials from ScienceDirect. Sci-Hub has at various times accepted funds through a variety of payment processors, including PayPal, Yandex, WebMoney, QiQi, and Bitcoin.

Sci-Hub’s Use of the Library Genesis Project as a Repository for Unlawfully-Obtained Scientific Journal Articles and Books

35. Upon information and belief, when Sci-Hub pirates and downloads an article from ScienceDirect in response to a user request, in addition to providing a copy of that article to that user, Sci-Hub also provides a duplicate copy to the Library Genesis Project, which stores the article in a database accessible through the Internet. Upon information and belief, the Library Genesis Project is designed to be a permanent repository of this and other illegally obtained content.

36. Upon information and belief, in the event that a Sci-Hub user requests an article which has already been provided to the Library Genesis Project, Sci-Hub may provide that user access to a copy provided by the Library Genesis Project rather than re-download an additional copy of the article from ScienceDirect. As a result, Defendants Sci-Hub and Library Genesis Project act in concert to engage in a scheme designed to facilitate the unauthorized access to and wholesale distribution of Elsevier’s copyrighted works legitimately available on the ScienceDirect platform.

The Library Genesis Project’s Unlawful Distribution of Plaintiff’s Copyrighted Works

37. Access to the Library Genesis Project’s repository is facilitated by the website “libgen.org,” which provides its users the ability to search, download content from, and upload content to, the repository. The main page of libgen.org allows its users to perform searches in various categories, including “LibGen (Sci-Tech),” and “Scientific articles.” In addition to searching by keyword, users may also search for specific content by various other fields, including title, author, periodical, publisher, or ISBN or DOI number.

38. The libgen.org website indicates that the Library Genesis Project repository contains approximately 1 million “Sci-Tech” documents and 40 million scientific articles. Upon information and belief, the large majority of these works is subject to copyright protection and is being distributed through the Library Genesis Project without the permission of the applicable rights-holder. Upon information and belief, the Library Genesis Project serves primarily, if not exclusively, as a scheme to violate the intellectual property rights of the owners of millions of copyrighted works.

39. Upon information and belief, Elsevier owns the copyrights in a substantial number of copyrighted materials made available for distribution through the Library Genesis Project. Elsevier has not authorized the Library Genesis Project or any of the Defendants to copy, display, or distribute through any of the complained of websites any of the content stored on ScienceDirect to which it holds the copyright. Among the works infringed by the Library Genesis Project are the “Guyton and Hall Textbook of Medical Physiology,” and the article “The Varus Ankle and Instability” (published in Elsevier’s journal “Foot and Ankle Clinics of North America”), each of which is protected by Elsevier’s federally-registered copyrights.

40. In addition to the Library Genesis Project website accessible at libgen.org, users may access the Library Genesis Project repository through a number of “mirror” sites accessible through other URLs. These mirror sites are similar, if not identical, in functionality to libgen.org. Specifically, the mirror sites allow their users to search and download materials from the Library Genesis Project repository.

FIRST CLAIM FOR RELIEF
(Direct Infringement of Copyright)

41. Elsevier incorporates by reference the allegations contained in paragraphs 1-40 above.

42. Elsevier’s copyright rights and exclusive distribution rights to the works available on ScienceDirect (the “Works”) are valid and enforceable.

43. Defendants have infringed on Elsevier’s copyright rights to these Works by knowingly and intentionally reproducing and distributing these Works without authorization.

44. The acts of infringement described herein have been willful, intentional, and purposeful, in disregard of and indifferent to Plaintiffs’ rights.

45. Without authorization from Elsevier, or right under law, Defendants are directly liable for infringing Elsevier’s copyrighted Works pursuant to 17 U.S.C. §§ 106(1) and/or (3).

46. As a direct result of Defendants’ actions, Elsevier has suffered and continues to suffer irreparable harm for which Elsevier has no adequate remedy at law, and which will continue unless Defendants’ actions are enjoined.

47. Elsevier seeks injunctive relief and costs and damages in an amount to be proven at trial.

SECOND CLAIM FOR RELIEF
(Secondary Infringement of Copyright)

48. Elsevier incorporates by reference the allegations contained in paragraphs 1-40 above.

49. Elsevier’s copyright rights and exclusive distribution rights to the works available on ScienceDirect (the “Works”) are valid and enforceable.

50. Defendants have infringed on Elsevier’s copyright rights to these Works by knowingly and intentionally reproducing and distributing these Works without license or other authorization.

51. Upon information and belief, Defendants intentionally induced, encouraged, and materially contributed to the reproduction and distribution of these Works by third party users of websites operated by Defendants.

52. The acts of infringement described herein have been willful, intentional, and purposeful, in disregard of and indifferent to Elsevier’s rights.

53. Without authorization from Elsevier, or right under law, Defendants are directly liable for third parties’ infringement of Elsevier’s copyrighted Works pursuant to 17 U.S.C. §§ 106(1) and/or (3).

54. Upon information and belief, Defendants profited from third parties’ direct infringement of Elsevier’s Works.

55. Defendants had the right and the ability to supervise and control their websites and the third party infringing activities described herein.

56. As a direct result of Defendants’ actions, Elsevier has suffered and continues to suffer irreparable harm for which Elsevier has no adequate remedy at law, and which will continue unless Defendants’ actions are enjoined.

57. Elsevier seeks injunctive relief and costs and damages in an amount to be proven at trial.

THIRD CLAIM FOR RELIEF
(Violation of the Computer Fraud & Abuse Act)

58. Elsevier incorporates by reference the allegations contained in paragraphs 1-40 above.

59. Elsevier’s computers and servers, the third-party computers and servers which store and maintain Elsevier’s copyrighted works for ScienceDirect, and Elsevier’s customers’ computers and servers which facilitate access to Elsevier’s copyrighted works on ScienceDirect, are all “protected computers” under the Computer Fraud and Abuse Act (“CFAA”).

60. Defendants (a) knowingly and intentionally accessed such protected computers without authorization and thereby obtained information from the protected computers in a transaction involving an interstate or foreign communication (18 U.S.C. § 1030(a)(2)(C)); and (b) knowingly and with an intent to defraud accessed such protected computers without authorization and obtained information from such computers, which Defendants used to further the fraud and obtain something of value (18 U.S.C. § 1030(a)(4)).

61. Defendants’ conduct has caused, and continues to cause, significant and irreparable damages and loss to Elsevier.

62. Defendants’ conduct has caused a loss to Elsevier during a one-year period aggregating at least $5,000.

63. As a direct result of Defendants’ actions, Elsevier has suffered and continues to suffer irreparable harm for which Elsevier has no adequate remedy at law, and which will continue unless Defendants’ actions are enjoined.

64. Elsevier seeks injunctive relief, as well as costs and damages in an amount to be proven at trial.

PRAYER FOR RELIEF

WHEREFORE, Elsevier respectfully requests that the Court:

A. Enter preliminary and permanent injunctions, enjoining and prohibiting Defendants, their officers, directors, principals, agents, servants, employees, successors and assigns, and all persons and entities in active concert or participation with them, from engaging in any of the activity complained of herein or from causing any of the injury complained of herein and from assisting, aiding, or abetting any other person or business entity in engaging in or performing any of the activity complained of herein or from causing any of the injury complained of herein;
B. Enter an order that, upon Elsevier’s request, those in privity with Defendants and those with notice of the injunction, including any Internet search engines, Web Hosting and Internet Service Providers, domain-name registrars, and domain name registries or their administrators that are provided with notice of the injunction, cease facilitating access to any or all domain names and websites through which Defendants engage in any of the activity complained of herein;
C. Enter an order that, upon Elsevier’s request, those organizations which have registered Defendants’ domain names on behalf of Defendants shall disclose immediately to Plaintiffs all information in their possession concerning the identity of the operator or registrant of such domain names and of any bank accounts or financial accounts owned or used by such operator or registrant;
D. Enter an order that, upon Elsevier’s request, the TLD Registries for the Defendants’ websites, or their administrators, shall place the domain names on registryHold/serverHold as well as serverUpdate, serverDelete, and serverTransfer prohibited statuses, for the remainder of the registration period for any such website;
E. Enter an order canceling or deleting, or, at Elsevier’s election, transferring the domain name registrations used by Defendants to engage in the activity complained of herein to Elsevier’s control so that they may no longer be used for illegal purposes;
F. Enter an order awarding Elsevier its actual damages incurred as a result of Defendants’ infringement of Elsevier’s copyright rights in the Works and all profits Defendants realized as a result of their acts of infringement, in amounts to be determined at trial; or in the alternative, awarding Elsevier, pursuant to 17 U.S.C. § 504, statutory damages for the acts of infringement committed by Defendants, enhanced to reflect the willful nature of the Defendants’ infringement;
G. Enter an order disgorging Defendants’ profits;

Liang
Shadow Libraries
2012


Journal #37 - September 2012

# Shadow Libraries

Over the last few monsoons I lived with the dread that the rain would
eventually find its way through my leaky terrace roof and destroy my books.
Last August my fears came true when I woke up in the middle of the night to
see my room flooded and water leaking from the roof and through the walls.
Much of the night was spent rescuing the books and shifting them to a dry
room. While timing and speed were essential to the task at hand, they were also
the key hazards when navigating a slippery floor with books stacked up to one’s
neck. At the end of the rescue mission, I sat alone, exhausted amongst a
mountain of books assessing the damage that had been done, but also having
found books I had forgotten or had not seen in years; books which I had
thought had been permanently borrowed by others or misplaced found their way
back as I set many aside in a kind of ritual of renewed commitment.

Sorting the badly damaged from the mildly wet, I could not help but think
about the fragile histories of books from the library of Alexandria to the
great Florence flood of 1966. It may have seemed presumptuous to move from the
precarity of one’s small library and collection to these larger events, but is
there any other way in which one experiences earth-shattering events if not
via a microcosmic filtering through one’s own experiences? I sent a distressed
email to a friend, Sandeep, a committed bibliophile and book collector with a
fantastic personal library, who had also been responsible for many of my new
acquisitions. He wrote back on August 17, and I quote an extract of the email:

> Dear Lawrence

>

> I hope your books are fine. I feel for you very deeply, since my nightmares
about the future all contain as a key image my books rotting away under a
steady drip of grey water. Where was this leak, in the old house or in the
new? I spent some time looking at the books themselves: many of them I greeted
like old friends. I see you have Lewis Hyde’s _Trickster Makes the World_ and
Edward Rice’s _Captain Sir Richard Francis Burton_ in the pile: both top-class
books. (Burton is a bit of an obsession with me. The man did and saw
everything there was to do and see, and thought about it all, and wrote it all
down in a massive pile of notes and manuscripts. He squirrelled a fraction of
his scholarship into the tremendous footnotes to the Thousand and One Nights,
but most of it he could not publish without scandalising the Victorians, and
then he died, and his widow made a bonfire in the backyard, and burnt
everything because she disapproved of these products of a lifetime’s labors,
and of a lifetime such as few have ever had, and no one can ever have again. I
almost hope there is a special hell for Isabel Burton to burn in.)

Moving from one’s personal pile to the burning of the work of one of the
greatest autodidacts of the nineteenth century and back, it was strangely
comforting to be reminded that libraries—the greatest of time machines
invented—were testimonies to both the grandeur and the fragility of
civilizations. Whenever I enter huge libraries it is with a tingling sense of
excitement normally reserved for horror movies, but at the same time this same
sense of awe is often accompanied by an almost debilitating sense of what it
means to encounter finitude as it is dwarfed by centuries of words and
scholarship. Yet strangely, when I think of libraries it is rarely the New York
Public Library that comes to mind even as I wish that we could have similar
institutions in India. I think instead of much smaller collections—sometimes
of institutions but often just those of friends and acquaintances. I enjoy
browsing through people’s bookshelves, not just to discern their reading
preferences or to discover for myself unknown treasures, but also to take
delight in the local logic of their library, their spatial preferences and to
understand the order of things not as a global knowledge project but as a
personal, often quirky rationale.

Machine room for book transportation at the Library of Congress, early 20th
century.

Like romantic love, bibliophilia is perhaps shaped by one’s first love. The
first library that I knew intimately was a little six-by-eight-foot shop
hidden in a by-lane off one of the busiest roads in Bangalore, Commercial
Street. From its name to what it contained, Mecca Stores could well have been
transported out of an Arabian Nights tale. One side of the store was lined
with plastic ware and kitchen utensils of every shape and size while the other
wall was piled with books, comics, and magazines. From my eight-year-old
perspective it seemed large enough to contain all the knowledge of the world.
I earned a weekly stipend packing noodles for an hour every day after school
in the home shop that my parents ran, and I used it to either borrow or buy
secondhand books from the store. I was usually done with them by Sunday and
would have them reread by Wednesday. The real anguish came in waiting from
Wednesday to Friday for the next set. After finally acquiring a small
collection of books and comics myself I decided—spurred on by a fatal
combination of entrepreneurial enthusiasm and a pedantic desire to educate
others—to start a small library myself. Packing my books into a small aluminum
case and armed with a makeshift ledger, I went from house to house convincing
children in the neighborhood to forgo twenty-five paisa in exchange for a book
or comic with an additional caveat that they were not to share them with any
of their friends. While the enterprise got off to a reasonable start it soon
met its end when I realized that despite my instructions, my friends were
generously sharing the comics after they were done with them, which thereby
ended my biblioempire ambitions.

Over the past few years the explosion of ebook readers and consequent rise in
the availability of pirated books have opened new worlds to my booklust.
[Library.nu](library.nu), which began as gigapedia, suddenly made the idea of
the universal library seem like reality. By the time it shut down in February
2012 the library had close to a million books and over half a million active
users. Bibliophiles across the world were distraught when the site was shut
down, and if it were ever possible to experience what the burning of the
library of Alexandria must have felt like, it was in that collective ache of
seeing the closure of [library.nu](library.nu).

What brings together something as monumental as the New York Public Library, a
collective enterprise like [library.nu](library.nu), and Mecca Stores if not
the word library? As spaces they may have little in common but as virtual
spaces they speak as equals even if the scale of their imagination may differ.
All of them partake of their share in the world of logotopias. In an
exhibition designed to celebrate the place of the library in art, architecture
and imagination the curator Sascha Hastings coined the term logotopia to
designate “word places”—a happy coincidence of architecture and language.

There is however a risk of flattening the differences between these spaces by
classifying them all under a single utopian ideal of the library. Imagination
after all has a geography and physiology and requires our alertness to these
distinctions. Let’s think instead of an entire pantheon (both of spaces as well
as practices) that we can designate as shadow libraries (or shadow logotopias
if you like) which exist in the shadows cast by the long history of monumental
libraries. While they are often dwarfed by the idea of the library, like the
shadows cast by our bodies, sometimes these shadows surge ahead of the body.

The London Library after the Blitz, c. 1940.

At the heart of all libraries lies a myth—that of the burning of the library
of Alexandria. No one knows what the library of Alexandria looked like or
possesses an accurate list of its contents. What we have long known though is
a sense of loss. But a loss of what? Of all the forms of knowledge in the
world in a particular time. Because that was precisely what the library of
Alexandria sought to collect under its roofs. It is believed that in order to
succeed in assembling a universal library, King Ptolemy I wrote “to all the
sovereigns and governors on earth” begging them to send to him every kind of
book by every kind of author, “poets and prose-writers, rhetoricians and
sophists, doctors and soothsayers, historians, and all others too.” The king’s
scholars had calculated that five hundred thousand scrolls would be required
if they were to collect in Alexandria “all the books of all the peoples of the
world.”1

What was special about the Library of Alexandria was the fact that until then
the libraries of the ancient world were either private collections of an
individual or government storehouses where legal and literary documents were
kept for official reference. By imagining a space where the public could have
access to all the knowledge of the world, the library also expressed a new
idea of the human itself. While the library of Alexandria is rightfully
celebrated, what is often forgotten in the mourning of its demise is another
library—one that existed in the shadows of the grand library but whose
whereabouts ensured that it survived Caesar’s papyrus-destroying flames.

According to the Sicilian historian Diodorus Siculus, writing in the first
century BC, Alexandria boasted a second library, the so-called daughter
library, intended for the use of scholars not affiliated with the Museion. It
was situated in the south-western neighborhood of Alexandria, close to the
temple of Serapis, and was stocked with duplicate copies of the Museion
library’s holdings. This shadow library survived the fire that destroyed the
primary library of Alexandria but has since been eclipsed by the latter’s
myth.

Alberto Manguel says that if the library of Alexandria stood tall as an
expression of universal ambitions, there is another structure that haunts our
imagination: the tower of Babel. If the library attempted to conquer time, the
tower sought to vanquish space. He says “The Tower of Babel in space and the
Library of Alexandria in time are the twin symbols of these ambitions. In
their shadow, my small library is a reminder of both impossible yearnings—the
desire to contain all the tongues of Babel and the longing to possess all the
volumes of Alexandria.”2 Writing about the two failed projects Manguel adds
that when seen within the limiting frame of the real, the one exists only as
nebulous reality and the other as an unsuccessful if ambitious real estate
enterprise. But seen as myths, and in the imagination at night, the solidity
of both buildings for him is unimpeachable.3

The utopian ideal of the universal library was more than a question of built
up form or space or even the possibility of storing all of the knowledge of
the world; its real aspiration was in the illusion of order that it could
impose on a chaotic world where the lines drawn by a fine hairbrush
distinguished the world of animals from men, fairies from ghosts, science from
magic, and Europe from Japan. In some cases even after the physical structure
that housed the books had crumbled and the books had been reduced to dust the
ideal remained in the form of the order imagined for the library. One such
residual evidence comes to us by way of the _Pandectae_ —a comprehensive
bibliography created by Conrad Gesner in 1545 when he feared that the Ottoman
conquerors would destroy all the books in Europe. He created a bibliography
from which the library could be built again—an all embracing index which
contained a systematic organization of twenty principal groups with a matrix
like structure that contained 30,000 concepts.4

It is not surprising that Alberto Manguel would attempt to write a literary,
historical and personal history of the library. As a seventeen-year-old in
Buenos Aires, Manguel read for the blind seer Jorge Luis Borges who once
imagined, in his appropriately named story “The Library of Babel,” paradise as a
kind of library. Modifying his mentor’s statement in what can be understood as
a gesture to the inevitable demands of the real and yet acknowledging the
possible pleasures of living in shadows, Manguel asserts that sometimes
paradise must adapt itself to suit circumstantial requirements. Similarly,
Jacques Rancière, writing about the libraries of the working class in the
nineteenth century, tells us about Gauny, a joiner and a boy in love with
vagrancy and botany, who decides to build a library for himself. For the sons
of the poor proletarians living in the Saint Marcel district, libraries were built
only a page at a time. He learnt to read by tracing the pages on which his
mother bought her lentils and would be disappointed whenever he came to the
end of a page and the next page was not available, even though he urged his
mother to buy her lentils from the same grocer. 5

Dominique Gonzalez-Foerster, _Chronotopes & Dioramas_, 2009. Diorama
installation at The Hispanic Society of America, New York.

Is the utopian ideal of the universal library as exemplified by the library of
Alexandria or modernist pedagogic institutions of the twentieth century
adequate to the task of describing the space of the shadow library, or do we
need a different account of these other spaces? In an era of the ebook reader
where the line between a book and a library is blurred, the very idea of a
library is up for grabs. It has taken me well over two decades to build a
collection of a few thousand books while around two hundred thousand books
exist as bits and bytes on my computer. Admittedly hard drives crash and data
is lost, but is that the same threat as those of rain or fire? Which then is
my library and which its shadow? Or in the spirit of logotopias would it be
more appropriate to ask the spatial question: where is the library?

If the possibility of having 200,000 books on one’s computer feels staggering,
here is an even more startling statistic. The Library of Congress, the
largest library in the world, holds approximately thirty million books, which
would cover 364 kilometers if they were piled on the floor. Yet all of them
could potentially fit onto an SD card. It is estimated that by 2030 an
ordinary SD card will be able to store up to 64 TB; assuming each book were
digitized at an average size of 1 MB, it would technically be possible to fit
two Libraries of Congress in one’s pocket.
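
The arithmetic behind this claim can be checked in a few lines. The sketch below uses only the figures given above (a 64 TB card, 1 MB per book, thirty million books) and assumes decimal units, that is, 1 TB = 1,000,000 MB; the variable names are purely illustrative.

```python
# Figures taken from the essay; decimal storage units are an assumption.
CARD_CAPACITY_TB = 64
BOOK_SIZE_MB = 1
LIBRARY_OF_CONGRESS_BOOKS = 30_000_000
MB_PER_TB = 1_000_000

# How many 1 MB books fit on the card, and how many Libraries of Congress
# that number of books amounts to.
books_per_card = CARD_CAPACITY_TB * MB_PER_TB // BOOK_SIZE_MB
libraries_per_card = books_per_card / LIBRARY_OF_CONGRESS_BOOKS

print(books_per_card)
print(round(libraries_per_card, 2))
```

Under these assumptions the card holds sixty-four million books, a little more than twice the thirty million cited for the Library of Congress.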

It sounds like science fiction, but isn’t it the case that much of the science
fiction of a decade ago finds itself comfortably within the weave of everyday
life? How do we make sense of the future of the library? While it may be
tempting to throw our hands up in boggled perplexity about what it means to be
able to have thirty million books, let’s face it: the point of libraries has
never been that you will finish what’s there. Anyone with even a modest book
collection will testify to the impossibility of ever finishing their library
and if anything at all the library stands precisely at the cusp of our
finitude and our infinity. Perhaps that is what Borges—the consummate mixer of
time and space—meant when he described paradise as a library, not as a spatial
idea but a temporal one: that it was only within the confines of infinity that
one could imagine finishing reading one’s library. It would therefore be more
interesting to think of the shadow library as a way of thinking about what it
means to dwell in knowledge. While all our aspirations for a habitat should
have a utopian element to them, let’s face it, utopias have always been
difficult spaces to live in.

In contrast to the idea of utopia is heterotopia—a term with its origins in
medicine (referring to an organ of the body that had been dislodged from its
usual space) and popularized by Michel Foucault both in terms of language as
well as a spatial metaphor. If utopia exists as a nowhere or imaginary space
with no connection to any existing social spaces, then heterotopias in
contrast are realities that exist and are even foundational, but in which all
other spaces are potentially inverted and contested. A mirror for instance is
simultaneously a utopia (placeless place) even as it exists in reality. But
from the standpoint of the mirror you discover your absence as well. Foucault
remarks, “The mirror functions as a heterotopia in this respect: it makes this
place that I occupy at the moment when I look at myself in the glass at once
absolutely real, connected with all the space that surrounds it, and
absolutely unreal, since in order to be perceived it has to pass through this
virtual point which is over there.”6

In _The Order of Things_ Foucault sought to investigate the conceptual space
which makes the order of knowledge possible; in his famed reading of Borges’s
Chinese encyclopedia he argues that the impossibility involved in the
encyclopedia consists less in the fantastical status of the animals and their
coexistence with real animals such as (d) sucking pigs and (e) sirens, but in
where they coexist and what “transgresses the boundaries of all imagination,
of all possible thought, is simply that alphabetical series (a, b, c, d) which
links each of those categories to all the others.” 7 Heterotopias destabilize
the ground from which we build order and in doing so reframe the very
epistemic basis of how we know.

Foucault later developed a greater spatial understanding of heterotopias in
which he uses specific examples such as the cemetery (at once the space of the
familiar since everyone has someone in the cemetery and at the heart of the
city but also over a period of time the other city, where each family
possesses its dark resting place).8 Indeed, the paradox of heterotopias is
that they are both separate from yet connected to all other spaces. This
connectedness is precisely what builds contestation into heterotopias.
Imaginary spaces such as utopias exist completely outside of order.
Heterotopias by virtue of their connectedness become sites in which epistemes
collide and overlap. They bring together heterogeneous collections of unusual
things without allowing them a unity or order established through resemblance.
Instead, their ordering is derived from a process of similitude that produces,
in an almost magical, uncertain space, monstrous combinations that unsettle
the flow of discourse.

If the utopian ideal of the library was to bring together everything that we
know of the world then the length of its bookshelves was coterminous with the
breadth of the world. But like its predecessors in Alexandria and Babel the
project is destined to be incomplete, haunted by what it necessarily leaves out
and misses. The library as heterotopia reveals itself only through the
interstices and lays bare the fiction of any possibility of a coherent ground
on which a knowledge project can be built. Finally there is the question of
where we stand once the ground that we stand on has itself been dislodged.
The answer from my first foray into the tiny six-by-eight-foot Mecca store to
the innumerable hours spent on [library.nu](library.nu) remains the same:
the heterotopic pleasure of our finite selves in infinity.

This essay is a part of a work I am doing for an exhibition curated by Raqs
Media Collective, Sarai Reader 09. The show began on August 19, 2012, with a
deceptively empty space containing only the proposal, with ideas for the
artworks to come over a period of nine months. See
.

**Lawrence Liang** is a researcher and writer based at the Alternative Law
Forum, Bangalore. His work lies at the intersection of law and cultural
politics, and has in recent years been looking at the question of media piracy.
He is currently finishing a book on law and justice in Hindi cinema.

© 2012 e-flux and the author

Journal # 37

Notes - Shadow Libraries

1. Esther Shipman and Sascha Hastings, eds., _Logotopia: The Library in
Architecture, Art and the Imagination_ (Cambridge Galleries: ABC Art Books
Canada, 2008).

2. Alberto Manguel, “My Library,” in Hastings and Shipman, eds., _Logotopia:
The Library in Architecture, Art and the Imagination_ (Cambridge Galleries:
ABC Art Books Canada, 2008).

3. Alberto Manguel, _The Library at Night_ (Yale University Press, 2009).

4. Ray Hastings and Esther Shipman, eds., _Logotopia: The Library in
Architecture, Art and the Imagination_ (Cambridge Galleries: ABC Art Books
Canada, 2008).

5. Jacques Rancière, _The Nights of Labour: The Workers’ Dream in Nineteenth
Century France_ (Philadelphia: Temple University Press, 1991).

6. Michel Foucault, “Different Spaces,” in _Aesthetics, Method, Epistemology_,
ed. James D. Faubion (New York: The New Press, 1998), 179. For Foucault on
language and heterotopias see _The Order of Things: An Archaeology of the
Human Sciences_ (New York: Pantheon, 1970).

7. Ibid., xv.

8. In Foucault, “Different Spaces,” which was presented as a lecture to the
_Architecture Studies Circle_ in 1967, a few years after the writing of _The
Order of Things_.


Barok
Poetics of Research
2014


_An unedited version of a talk given at the conference [Public
Library](http://www.wkv-stuttgart.de/en/program/2014/events/public-library/)
held at Württembergischer Kunstverein Stuttgart, 1 November 2014._

_Bracketed sequences are to be reformulated._

Poetics of Research

In this talk I'm going to attempt to identify [particular] cultural
algorithms, i.e. processes in which cultural practices and software meet. With
them a sphere is implied in which algorithms gather to form bodies of
practices and in which cultures gather around algorithms. I'm going to
approach them through the perspective of my practice as a cultural worker,
editor and artist, considering practice in the same rank as theory and
poetics, and where theorization of practice can also lead to the
identification of poetical devices.

The primary motivation for this talk is an attempt to figure out where we
stand as operators, users [and communities] gathering around infrastructures
containing a massive body of text (among other things) and what sort of things
might be considered to make a difference [or to keep making difference].

The talk mainly [considers] the role of text and the word in research, by way
of several figures.

A

A reference, list, scheme, table, index; those things that intervene in the
flow of narrative, illustrating the point, perhaps in a more economic way than
the linear text would do. Yet they don't function as pictures, they are
primarily texts, arranged in figures. Their forms have been
standardised [normalised] over centuries and have withstood the transition to
the digital without any significant change, remaining completely intuitive to
the modern reader. Compared to the body of text they are secondary and run
parallel to it. Their function is however different from that of punctuation. They
are there neither to shape the narrative nor to aid structuring the argument
into logical blocks. Nor is their function spatial, like in visual poems.
Their positions within a document are determined according to the sequential
order of the text, [standing as attachments] and are there to clarify the
nature of relations among elements of the subject-matter, or to establish
relations with other documents. The [premise] of my talk is that these
_textual figures_ also came to serve as the abstract[relational] models
determining possible relations among documents as such, and in consequence [to
structure conditions [of research]].

B

It can be said that research, as inquiry into a subject-matter, consists of
discrete queries. A query, such as a question about what something is, what
kinds, parts and properties it has, and so on, can be consulted in
existing documents or generate new documents based on collection of data [in]
the field and through experiment, before proceeding to reasoning [arguments
and deductions]. Formulation of a query is determined by protocols providing
access to documents, which means that there is a difference between collecting
data outside the archive (the undocumented, i.e. in the field and through
experiment), consulting with a person--an archivist (expert, librarian,
documentalist), and consulting with a database storing documents. Phenomena
such as [deepening] of specialization and thorough digitization
[have given] privilege to the database as [a|the] [fundamental] means for
research. Obviously, this is a very recent [phenomenon]. Queries were once
formulated in natural language; now, given the fact that databases are queried
[using] SQL, their interfaces are mere extensions of it and
researchers pose their questions by manipulating dropdowns, checkboxes and
input boxes mashed together on a flat screen, run by software that in
turn translates them into a long line of conditioned _SELECTs_ and _JOINs_
performed on tables of data.
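
To make that translation concrete, here is a minimal sketch of how a search form might be turned into a conditioned _SELECT_ with a _JOIN_. Everything in it is an assumption for illustration: the table and column names, the form fields (`author`, `year_from`, `digitized_only`) and the three sample rows are invented, and a real catalogue interface would be far more elaborate.

```python
import sqlite3

def build_query(form):
    """Translate form selections (hypothetical field names) into SQL."""
    sql = ("SELECT documents.title FROM documents "
           "JOIN authors ON authors.id = documents.author_id")
    clauses, params = [], []
    if form.get("author"):            # value picked from a dropdown
        clauses.append("authors.name = ?")
        params.append(form["author"])
    if form.get("year_from"):         # value typed into an input box
        clauses.append("documents.year >= ?")
        params.append(form["year_from"])
    if form.get("digitized_only"):    # a ticked checkbox
        clauses.append("documents.digitized = 1")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY documents.title", params

# Invented in-memory sample data, only there so the query has something to run on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT,
                            year INTEGER, digitized INTEGER, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Foucault'), (2, 'Manguel');
    INSERT INTO documents VALUES
        (1, 'The Order of Things', 1970, 1, 1),
        (2, 'The Library at Night', 2009, 0, 2),
        (3, 'Aesthetics, Method, Epistemology', 1998, 1, 1);
""")

form = {"author": "Foucault", "year_from": 1990, "digitized_only": True}
sql, params = build_query(form)
titles = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(titles)
```

The researcher never sees the generated statement; each dropdown, input box and checkbox silently adds one condition to the `WHERE` clause.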

Specialization, digitization and networking have changed the language of
questioning. Inquiry, once attached to the flesh and paper has been
[entrusted] to the digital and networked. Researchers are querying the black
box.

C

Searching in a collection of [amassed/assembled] [tangible] documents (i.e.
a bookshelf) is different from searching in a systematically structured
repository (library) and even more so from searching in a digital repository
(digital library). Not that they are mutually exclusive. One can devise
structures and algorithms to search through a printed text, or read books in a
library one by one. They are rather [models] [embodying] various [processes]
associated with the query. These properties of the query might be called [the
sequence], the structure and the index. If they are present in the ways of
querying documents, and we will return to this issue, are they persistent
within the inquiry as such? [wait]
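
One possible reading of the three models, sketched in code: the same question asked of a sequence, a structure and an index. The three-document collection, its subject classes and the function names are all invented for illustration and are not part of the talk.

```python
from collections import defaultdict

# Invented sample collection: (title, subject class, text).
documents = [
    ("A", "philosophy", "the order of things and words"),
    ("B", "history", "the library of alexandria"),
    ("C", "philosophy", "words in the company of other words"),
]

def search_sequence(word):
    """Bookshelf: read every document in order, one by one."""
    return [title for title, _, text in documents if word in text.split()]

def search_structure(word, subject):
    """Library: go to one class of the catalogue, then read only that shelf."""
    shelves = defaultdict(list)
    for title, cls, text in documents:
        shelves[cls].append((title, text))
    return [title for title, text in shelves[subject] if word in text.split()]

def build_index():
    """Digital library: map every word to the documents that contain it."""
    index = defaultdict(set)
    for title, _, text in documents:
        for token in text.split():
            index[token].add(title)
    return index

index = build_index()
print(search_sequence("words"))
print(search_structure("words", "philosophy"))
print(sorted(index["words"]))
```

All three return the same documents here; what differs is how much has to be read, and how much ordering has to exist beforehand, for the answer to appear.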

D

This question itself is a rupture in the sequence. It makes a demand to depart
from one narrative [a continuous flow of words] to another, to figure out,
while remaining bound to it [it would be even more as a so-called rhetorical
question]. So there has been one sequence, or line, of the inquiry--about the
kinds of the query and its properties. That sequence itself is a digression,
from within the sequence about what is research and describing its parts
(queries). We are thus returning to it and continuing with the question of
whether the properties of the inquiry are the same as the properties of the query.

E

But isn't it true that every single utterance occurring in a sequence yields a
query as well? Let's consider the word _utterance_. [wait] It can produce a
number of associations, for example with how Foucault employs the notion of
_énoncé_ in his _Archaeology of Knowledge_, giving a hard time to his English
translators wondering whether _utterance_ or _statement_ is more appropriate,
or whether they are interchangeable, and what impact each choice would have on
his reception in the Anglophone world. Limiting ourselves to textual forms for
now (and not translating his work but pursuing a different inquiry), let us say
the utterance is a word [or a phrase or an idiom] in a sequence such as a
sentence, a paragraph, or a document.

## (F) The structure

This distinction is as old as recorded Western thought since both Plato and
Aristotle differentiate between a word on its own ("the said", a thing said)
and words in the company of other words. For example, Aristotle's _Categories_
[lay] on the [notion] of words on their own, and they are made the subject-
matter of that inquiry. [For him], the ambiguity of connotation words
[produce] lies in their synonymity, understood differently from the moderns--
not as more words denoting a similar thing but rather one word denoting
various things. Categories were outlined as a device to differentiate among
words according to kinds of these things. Every word as such belonged to not
less and not more than one of ten categories.

So it happens to the word _utterance_ , as to any other word uttered in a
sequence, that it poses a question, a query about what share of the spectrum
of possibly denoted things might yield as the most appropriate in a given
context. The more context the more precise share comes to the fore. When taken
out of the context ambiguity prevails as the spectrum unveils in its variety.

Thus single words [as any other utterances] are questions, queries,
themselves, and by occurring in statements, in context, their [means] are being
singled out.

This process is _conditioned_ by what has been formalized as the techniques of
_regulating_ definitions of words.

### (G) The structure: words as words

* [![](/images/thumb/c/c8/Philitas_in_P.Oxy.XX_2260_i.jpg/144px-Philitas_in_P.Oxy.XX_2260_i.jpg)](/File:Philitas_in_P.Oxy.XX_2260_i.jpg)

P.Oxy.XX 2260 i: Oxyrhynchus papyrus XX, 2260, column i, with quotation from
Philitas, early 2nd c. CE.
[1](http://163.1.169.40/cgi-bin/library?e=q-000-00---0POxy--00-0-0--0prompt-10---4------0-1l--1-en-50---20-about-2260--00031-001-0-0utfZz-8-00&a=d&c=POxy&cl=search&d=HASH13af60895d5e9b50907367)
[2](http://en.wikipedia.org/wiki/File:POxy.XX.2260.i-Philitas-highlight.jpeg)

* [![](/images/thumb/9/9e/Cyclopaedia_1728_page_210_Dictionary_entry.jpg/88px-Cyclopaedia_1728_page_210_Dictionary_entry.jpg)](/File:Cyclopaedia_1728_page_210_Dictionary_entry.jpg)

Ephraim Chambers, _Cyclopaedia, or an Universal Dictionary of Arts and
Sciences_, 1728, p. 210.
[3](http://digicoll.library.wisc.edu/cgi-bin/HistSciTech/HistSciTech-idx?type=turn&entity=HistSciTech.Cyclopaedia01.p0576&id=HistSciTech.Cyclopaedia01&isize=L)

* [![](/images/thumb/b/b8/Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg/160px-Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg)](/File:Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg)

Detail from the Liddell-Scott Greek-English Lexicon, c1843.

Dictionaries have had a long life. The ancient Greek scholar and poet Philitas
of Cos living in the 4th c. BCE wrote a vocabulary explaining the meanings of
rare Homeric and other literary words, words from local dialects, and
technical terms. The vocabulary, called _Disorderly Words_ (Átaktoi glôssai),
has been lost, save for a few fragments quoted by later authors. One example is
that the word πέλλα (pélla) meant "wine cup" in the ancient Greek region of
Boeotia, in contrast to the same word meaning "milk pail" in Homer's _Iliad_.

Not much has changed in the way dictionaries constitute order. Selected
archives of statements are queried to yield occurrences of particular words,
various _criteria[indicators]_ are applied to filter and sort them, and
in turn the spectrum of [denoted] things allocated in this way is structured
into groups and subgroups which are then given, according to another set of
rules, shorter or longer names. These constitute facets of [potential]
meanings of a word.

So there are at least _four_ sets of conditions [structuring] dictionaries.
One is required to delimit an archive[corpus of texts], one to select and give
preference[weights] to occurrences of a word, another to cluster them, and yet
another to abstract[generalize] the subject-matter of each of these clusters.
Needless to say, this is a craft of a few and these criteria are rarely
disclosed, despite their impact on research and, more generally, their
influence as conditions for the production[making] of a so-called _common sense_.

It doesn't take that much to reimagine what a dictionary is and what it could
be, especially having large specialized corpora of texts at hand. These can
also serve as aids in production of new words and new meanings.

### (H) The structure: words as knowledge and the world

* [![](/images/thumb/0/02/Boethius_Porphyrys_Isagoge.jpg/120px-Boethius_Porphyrys_Isagoge.jpg)](/File:Boethius_Porphyrys_Isagoge.jpg)

Boethius's rendering of a classification tree described in Porphyry's Isagoge
(3rd c.), [6th c.] 10th c.
[4](http://www.e-codices.unifr.ch/en/sbe/0315/53/medium)

* [![](/images/thumb/d/d0/Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg/94px-Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg)](/File:Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg)

Ephraim Chambers, _Cyclopaedia, or an Universal Dictionary of Arts and
Sciences_, London, 1728, p. II.
[5](http://digicoll.library.wisc.edu/cgi-bin/HistSciTech/HistSciTech-idx?type=turn&entity=HistSciTech.Cyclopaedia01.p0015&id=HistSciTech.Cyclopaedia01&isize=L)

* [![](/images/thumb/d/d6/Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg/116px-Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg)](/File:Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg)

Système figuré des connaissances humaines, _Encyclopédie ou Dictionnaire
raisonné des sciences, des arts et des métiers_, 1751.
[6](http://encyclopedie.uchicago.edu/content/syst%C3%A8me-figur%C3%A9-des-connaissances-humaines)

* [![](/images/thumb/9/96/Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg/96px-Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg)](/File:Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg)

Haeckel - Darwin's tree.

Another _formalized_ and [internalized] process at play when figuring
out a word is its [containment]. A word is structured not only by way of the
things it potentially denotes but also by the words it is potentially part of
and those it contains.

The fuss about the categorization of knowledge _and_ the world in Western
thought can be traced back to Porphyry, if not further. In his introduction to
Aristotle's _Categories_ this 3rd-century AD Neoplatonist began expanding the
notions of genus and species into their hypothetical consequences. Aristotle's
brief work outlines ten categories of 'things that are said' (legomena,
λεγόμενα), namely substance (or substantive, {not the same as matter!},
οὐσία), quantity (ποσόν), qualification (ποιόν), a relation (πρός), where
(ποῦ), when (πότε), being-in-a-position (κεῖσθαι), having (or state,
condition, ἔχειν), doing (ποιεῖν), and being-affected (πάσχειν). In a
different work, _Topics_, Aristotle outlines four kinds of subjects/materials
indicated in propositions/problems from which arguments/deductions start.
These are a definition (ὅρος), a genus (γένος), a property (ἴδιος), and an
accident (συμβεβηκός). Porphyry does not explicitly refer to _Topics_, and says
he omits speaking "about genera and species, as to whether they subsist (in
the nature of things) or in mere conceptions only"
[8](http://www.ccel.org/ccel/pearse/morefathers/files/porphyry_isagogue_02_translation.htm#C1),
which means he avoids explicating whether he talks about kinds of concepts or
kinds of things in the sensible world. However, the work sparked confusion, as
the following passage [suggests]:

> "[I]n each category there are certain things most generic, and again, others
most special, and between the most generic and the most special, others which
are alike called both genera and species, but the most generic is that above
which there cannot be another superior genus, and the most special that below
which there cannot be another inferior species. Between the most generic and
the most special, there are others which are alike both genera and species,
referred, nevertheless, to different things, but what is stated may become
clear in one category. Substance indeed, is itself genus, under this is body,
under body animated body, under which is animal, under animal rational animal,
under which is man, under man Socrates, Plato, and men particularly." (Owen
1853,
[9](http://www.ccel.org/ccel/pearse/morefathers/files/porphyry_isagogue_02_translation.htm#C2))

Porphyry took one of Aristotle's ten categories of the word, substance, and
dissected it using one of his four rhetorical devices, genus. Employing
Aristotle's categories, genera and species as means for logical operations,
for dialectic, Porphyry's interpretation resulted in having more resemblance
to the perceived _structures_ of the world. So they began to bloom.

There were earlier examples, but Porphyry was the most influential in
injecting the _universalist_ version of classification [implying] the figure
of a tree into the [locus] of Aristotle's thought. Knowledge became
monotheistic.

Classification schemes [growing from one point] play a major role in
untangling the format of the modern encyclopedia from that of the dictionary
governed by the alphabet. Two of the most influential encyclopedias of the 18th
century are cases in point. Although still keeping 'dictionary' in their
titles, they are conceived not to represent words but knowledge. The
[uppermost] genus of the body was set as the body of knowledge. The English
_Cyclopaedia, or an Universal Dictionary of Arts and Sciences_ (1728) splits
into two main branches: "natural and scientifical" and "artificial and
technical"; these further split down to 47 classes in total, each carrying a
structured list (on the following pages) of thematic articles, serving as a
table of contents. The French _Encyclopedia: or a Systematic Dictionary of the
Sciences, Arts, and Crafts_ (1751) [unwinds] from understanding (_entendement_)
and branches into memory as history, reason as philosophy, and imagination as
poetry. The logic of containers was employed as an aid not only to deal with
the enormous task of naming and not omitting anything from what is known, but
also for the management of the labour of hundreds of writers and researchers,
to create a mechanism for delegating work and distributing
responsibilities. Flesh was also more present, in the field research, with
researchers attending workshops and sites of everyday life to annotate it.

The world came forward to outshine the word in other schemes. Darwin's tree of
evolution and some of the modern document classification systems such as
Charles A. Cutter's _Expansive Classification_ (1882) set out to classify the
world itself and set the field for what has come to be known as authority
lists structuring metadata in today's computing.

### The structure (summary)

Facetization of meaning and branching of knowledge are both the domain of the
unit of utterance.

While lexicographers[dictionarists] structure thought through multi-layered
processes of abstraction of the written record, knowledge growers dissect it
into hierarchies of [mutually] contained notions.

One seeks to describe the word as a faceted list of small worlds, the other to
describe the world as structured lists of words. One plays out primarily in the
domain of epistemology, in what is known, controlling the vocabulary, the other
in the domain of ontology, in what is, controlling reality.

Every [word] has its given things, every thing has its place, closer or
further from a single word.

The schism between classifying words and classifying the world implies it is
not possible to construct a universal classification scheme[system]. On top of
that, any classification system of words is bound to a corpus of texts it is
operating upon and any classification system of the world again operates with
words which are bound to a vocabulary[lexicon] which is again bound to a
corpus [of texts]. It doesn't mean it would prevent people from trying.
Classifications function as descriptors of and 'inscriptors' upon the world,
imprinting their authority. They operate from [a locus of] their
corpus[context]-specificity. The larger the corpus, the more power it has in
shaping the world, as far as the word shapes it (yes, I do imply Google here,
for which it is a domain to be potentially exploited).

## (J) The sequence

The structure-yielding query [of] the single word [shrinks][narrows, becomes
more precise] with preceding and following words. Inquiry proceeds in a flow
that establishes another kind[mode] of relationality, chaining words into a
sequence. While the structuring property of the query sets words apart from
each other, its sequential property establishes continuity and brings these
units into an ordered set.

This is what is responsible for attaching the textual figures mentioned earlier
(lists, schemes, tables) to the body of the text. Associations can also be
stated explicitly, by indexing tables and then referring to them from a
particular point in the text. The same goes for explicit associations made
between blocks of the text by means of indexed paragraphs, chapters or pages.

From this it follows that all utterances point to the following utterance by
the nature of sequential order, and indexing provides means for pointing
elsewhere in the document as well.

A lot can be said about references to other texts. Here, to spare time, I
would refer you to a talk I gave a few months ago, which is online
[10](http://monoskop.org/Talks/Communing_Texts).

This is still the realm of print. What happens to a document when it is
digitized?

Digitization breaks a document into units of which each is assigned a numbered
position in the sequence of the document. From this perspective digitization
can be viewed as a total indexation of the document. It is converted into
units rendered for machine operations. This sequentiality is made explicit, by
means of an underlying index.

Sequences and chains are orders of one dimension. Their one-dimensional
ordering allows addressability of each element and [random] access. [Jumps]
between [random] addresses are still sequential, processing elements one at a
time.
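The total indexation described here can be sketched on the command line. This is only a minimal illustration, assuming that the unit is the whitespace-separated word (digitization may just as well address characters or bytes); the function names `index_units` and `unit_at` are hypothetical and introduced for this example.

```shell
# index_units: read text on stdin and print every word together with
# its numbered position in the sequence (position<TAB>word).
index_units() {
  tr -s '[:space:]' '\n' | awk 'NF { printf "%d\t%s\n", ++n, $0 }'
}

# unit_at POSITION: read such an index on stdin and print the unit
# stored at the given address.
unit_at() {
  awk -F'\t' -v pos="$1" '$1 == pos { print $2; exit }'
}

text='every identifier is a unique address'
printf '%s\n' "$text" | index_units               # the document as an index
printf '%s\n' "$text" | index_units | unit_at 5   # look up address 5
```

Note that `unit_at` still walks through the index one line at a time until it reaches the requested address, which echoes the point that jumps between addresses remain sequential.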

## (K) The index

* [![](/images/thumb/2/27/Summa_confessorum.1310.jpg/103px-Summa_confessorum.1310.jpg)](/File:Summa_confessorum.1310.jpg)

Summa confessorum [1297-98], 1310.
[7](http://www.bl.uk/onlinegallery/onlineex/illmanus/roymanucoll/j/011roy000008g11u00002000.html)

[The] sequencing not only weaves words into statements but activates other
temporalities, and _presents occurrences of words from past statements_. As
now, when I am saying the word _utterance_, each time there surface contexts
in which I have used it earlier.

A long quote from Frederick G. Kilgour, _The Evolution of the Book_, 1998, pp.
76-77:

> "A century of invention of various types of indexes and reference tools
preceded the advent of the first subject index to a specific book, which
occurred in the last years of the thirteenth century. The first subject
indexes were "distinctions," collections of "various figurative or symbolic
meanings of a noun found in the scriptures" that "are the earliest of all
alphabetical tools aside from dictionaries." (Richard and Mary Rouse supply an
example: "Horse = Preacher. Job 39: 'Hast thou given the horse strength, or
encircled his neck with whinnying?'")

>

> [Concordance] By the end of the third decade of the thirteenth century Hugh
de Saint-Cher had produced the first word concordance. It was a simple word
index of the Bible, with every location of each word listed by [its position
in the Bible specified by book, chapter, and letter indicating part of the
chapter]. Hugh organized several dozen men, assigning to each man an initial
letter to search; for example, the man assigned M was to go through the entire
Bible, list each word beginning with M and give its location. As it was soon
perceived that this original reference work would be even more useful if words
were cited in context, a second concordance was produced, with each word in
lengthy context, but it proved to be unwieldy. [Soon] a third version was
produced, with words in contexts of four to seven words, the model for
biblical concordances ever since.

>

> [Subject index] The subject index, also an innovation of the thirteenth
century, evolved over the same period as did the concordance. Most of the
early topical indexes were designed for writing sermons; some were organized,
while others were apparently sequential without any arrangement. By midcentury
the entries were in alphabetical order, except for a few in some classified
arrangement. Until the end of the century these alphabetical reference works
indexed a small group of books. Finally John of Freiburg added an alphabetical
subject index to his own book, _Summa Confessorum_ (1297—1298). As the Rouses
have put it, 'By the end of the [13]th century the practical utility of the
subject index is taken for granted by the literate West, no longer solely as
an aid for preachers, but also in the disciplines of theology, philosophy, and
both kinds of law.'"

In one sense neither the subject-index nor the concordance is an index; they
are words or groups of words selected according to given criteria from the body
of the text, each accompanied by a list of identifiers. These identifiers are
elements of an index, whether they represent a page, chapter, column, or other
[kind of] block of text. Every identifier is a unique _address_.
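A concordance of this kind, a word listed with the addresses of its occurrences and a few words of context, can be sketched as follows. This is a rough illustration only: the function name `kwic` is hypothetical, the address is simply the running word number rather than book, chapter and letter, and punctuation is discarded before matching.

```shell
# kwic WORD [WIDTH]: read text on stdin; for every occurrence of WORD print
# its position and up to WIDTH words of context on either side (default 3).
kwic() {
  tr -d '[:punct:]' | tr -s '[:space:]' '\n' |
  awk -v target="$1" -v width="${2:-3}" '
    NF { w[++n] = $0 }                 # store each word at its address
    END {
      for (i = 1; i <= n; i++) {
        if (tolower(w[i]) != tolower(target)) continue
        line = ""
        for (j = i - width; j <= i + width; j++) {
          if (j < 1 || j > n) continue  # stay inside the text
          sep = (line == "") ? "" : " "
          line = line sep w[j]
        }
        printf "%d\t%s\n", i, line
      }
    }'
}

printf '%s\n' 'The index is an ordering of a sequence, and the index is an address.' |
  kwic index 2
```

For the sample sentence this lists two occurrences, at positions 2 and 11, each with up to two words of context on either side.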

The index is thus an ordering of a sequence by means of associating its
elements with a set of symbols, where each element is given a unique
combination of symbols. Different sizes of sets yield different numbers of
variations. Symbol sets such as an alphabet, Arabic numerals, Roman numerals,
and binary digits have different proportions between the length of a string of
symbols and the number of possible variations it can contain. Thus two symbols
of the English alphabet can store 26^2 distinct values, of Arabic numerals
10^2, of Roman numerals 7^2 (counting the seven letters I, V, X, L, C, D, M)
and of binary digits 2^2.
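The arithmetic behind these proportions can be checked with a small sketch. The function names `variations` and `address_length` are hypothetical, and the figure of one million units is an arbitrary assumption chosen for illustration.

```shell
# variations BASE LENGTH: how many distinct strings of LENGTH symbols can be
# formed from a set of BASE symbols (BASE to the power of LENGTH).
variations() {
  result=1
  count=0
  while [ "$count" -lt "$2" ]; do
    result=$((result * $1))
    count=$((count + 1))
  done
  echo "$result"
}

# address_length BASE N: shortest string length that gives each of N elements
# a unique address (assumes BASE >= 2 and N >= 1).
address_length() {
  length=1
  capacity=$1
  while [ "$capacity" -lt "$2" ]; do
    capacity=$((capacity * $1))
    length=$((length + 1))
  done
  echo "$length"
}

variations 26 2            # two letters of the English alphabet
variations 10 2            # two Arabic numerals
variations 2 2             # two binary digits
address_length 2 1000000   # binary digits needed to address a million units
address_length 26 1000000  # letters needed for the same million units
```

The smaller the symbol set, the longer the address: a million units need twenty binary digits but only five letters.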

Indexation is segmentation, a breaking into segments. From as early as the
13th century an index, such as that of sections, has served as an enabler of
search. The more [detailed] the indexation, the more precise the search results
it enables.

The subject-index and concordance are tables of search results. There is a
direct lineage from the 13th-century biblical concordances to the birth of
computational linguistic analysis: both were initiated and realised by
priests.

During World War II, Jesuit Father Roberto Busa began to look for machines
for the automation of the linguistic analysis of the 11 million-word Latin
corpus of Thomas Aquinas and related authors.

Working on his Ph.D. thesis on the concept of _praesens_ in Aquinas he
realised two things:

> "I realized first that a philological and lexicographical inquiry into the
verbal system of an author has to precede and prepare for a doctrinal
interpretation of his works. Each writer expresses his conceptual system in
and through his verbal system, with the consequence that the reader who
masters this verbal system, using his own conceptual system, has to get an
insight into the writer's conceptual system. The reader should not simply
attach to the words he reads the significance they have in his mind, but
should try to find out what significance they had in the writer's mind.
Second, I realized that all functional or grammatical words (which in my mind
are not 'empty' at all but philosophically rich) manifest the deepest logic of
being which generates the basic structures of human discourse. It is this
basic logic that allows the transfer from what the words mean today to what
they meant to the writer.

>

> In the works of every philosopher there are two philosophies: the one which
he consciously intends to express and the one he actually uses to express it.
The structure of each sentence implies in itself some philosophical
assumptions and truths. In this light, one can legitimately criticize a
philosopher only when these two philosophies are in contradiction."
[11](http://www.alice.id.tue.nl/references/busa-1980.pdf)

Busa collaborated with IBM in New York from 1949, and the work, a concordance
of all the words of Thomas Aquinas, was finally published in the 1970s in 56
printed volumes (a version has been online since 2005
[12](http://www.corpusthomisticum.org/it/index.age)). Besides that, an
electronic lexicon for automatic lemmatization of Latin words, containing
150,000 forms, was created by a team of ten priests over the course of two
years (in two phases: grouping all the forms of an inflected word under their
lemma, and coding the morphological categories of each form and lemma)
[13](http://www.alice.id.tue.nl/references/busa-1980.pdf#page=4). Father
Busa has been dubbed the father of humanities computing and, more recently,
also of digital humanities.

The subject-index has a crucial role in the printed book. It is the only means
for search the book offers. Subjects composing an index can be selected
according to a classification scheme (specific to a field of an inquiry), for
example as elements of a certain degree (with a given minimum number of
subclasses).

Its role seemingly vanishes in the digital text. But it can be easily
transformed. Besides serving as a table of pre-searched results, the
subject-index also gives a distinct idea of the content of the book. Two
patterns give us a clue: numbers of occurrences of selected words give subjects
weights, while words that seem specific to the book outweigh others even if
they don't occur very often. A selection of these words then serves as a
descriptor of the whole text, and can be thought of as a specific kind of
'tags'.

This process was formalized as a mathematical function in the 1970s, thanks to
a formula by Karen Spärck Jones which she entitled 'inverse document
frequency' (IDF), or in other words, "term specificity". It is measured as the
logarithm of the ratio of the total number of texts in the corpus to the number
of texts where the word appears at least once. When multiplied by the frequency
of the word _in_ the text (divided by the maximum frequency of any word in the
text), we get _term frequency-inverse document frequency_ (tf-idf). In this way
we can get an automated list of subjects which are particular to the text when
compared to a group of texts.
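As a worked illustration, one common formulation of the measure can be written down as follows. The symbols are assumptions introduced here: $f_{t,d}$ is the number of occurrences of word $t$ in text $d$, $N$ the number of texts in the corpus, and $n_t$ the number of texts in which $t$ appears at least once. Variants differ in the base of the logarithm and in smoothing terms; the command-line example at the end of this text, for instance, uses $0.5 + 0.5\,\mathrm{tf}$ and $1 + \ln(N/n_t)$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\max_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

For instance, a word occurring 3 times in a text whose most frequent word occurs 12 times, and found in only 1 of 5 texts, gets tf = 0.25, idf = ln 5 ≈ 1.609 and tf-idf ≈ 0.402. A word found in all 5 texts gets idf = ln 1 = 0, however often it occurs.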

We came to learn it through the practice of searching the web. It is a
mechanism not dissimilar to the thought process involved in retrieving
particular information online. And search engines have it built into their
indexing algorithms as well.

There is a paper proposing to attach words generated by tf-idf to hyperlinks
when referring to websites
[14](http://bscit.berkeley.edu/cgi-bin/pl_dochome?query_src=&format=html&collection=Wilensky_papers&id=3&show_doc=yes).
This would enable finding the referred content even after the link is dead.
Hyperlinks in the references of the paper use this feature and it can be easily
tested:
[15](http://www.cs.berkeley.edu/~phelps/papers/dissertation-abstract.html?lexical-signature=notemarks+multivalent+semantically+franca+stylized).
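How such a signed link might be assembled from a list of weighted words can be sketched as below. Only the query parameter name `lexical-signature` is taken from the link quoted above; the function name `sign_link`, the input format (one "word weight" pair per line), the cut-off of a few top words and the `example.org` address are assumptions made for illustration, and the paper's own selection procedure is not reproduced here.

```shell
# sign_link URL [K]: read "word weight" pairs on stdin and append the K
# highest-weighted words (default 5) to URL as a lexical signature.
# Assumes URL has no query string of its own.
sign_link() {
  signature=$(LC_ALL=C sort -k2,2nr | head -n "${2:-5}" |
    awk '{ print $1 }' | paste -s -d+ -)
  echo "$1?lexical-signature=$signature"
}

printf '%s\n' 'notemarks 0.91' 'the 0.02' 'multivalent 0.85' 'stylized 0.40' |
  sign_link 'http://example.org/paper.html' 3
```

A reader or a program that finds the link dead could then feed the signature words to a search engine in order to locate the text elsewhere.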

There is another measure, cosine similarity, which takes tf-idf further and
can be applied to clustering texts according to similarities in their
specificity. This might be interesting as a feature for digital libraries, or
even as a way of organising a library bottom-up into novel categories, from
which new discourses could emerge. It could also aid researchers in sorting
through texts, or editors in producing interesting anthologies.
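Cosine similarity between two texts can be sketched with a few lines of awk. This is a minimal illustration, assuming each text is already described by a file of "word weight" pairs (for example tf-idf weights); the function name `cosine` and the sample weights are hypothetical.

```shell
# cosine FILE_A FILE_B: each file lists "word weight" pairs, one per line.
# Prints the cosine of the angle between the two weight vectors
# (1 for identical direction, 0 for no shared words).
cosine() {
  awk '
    FNR == 1 { file++ }                              # track which file is read
    file == 1 { a[$1] = $2; na += $2 * $2; next }    # first text: store weights
    { nb += $2 * $2; if ($1 in a) dot += a[$1] * $2 } # second text: dot product
    END {
      if (na == 0 || nb == 0) { print "0.0000"; exit }
      printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
    }' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'index 0.8\nword 0.6\n' > "$tmp/a.txt"
printf 'index 0.6\ntree 0.8\n' > "$tmp/b.txt"
cosine "$tmp/a.txt" "$tmp/a.txt"   # a text compared with itself
cosine "$tmp/a.txt" "$tmp/b.txt"   # two texts sharing one weighted word
rm -r "$tmp"
```

Computing this value for every pair of texts in a library gives a table of distances on which a clustering procedure of one's choice could then operate.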

## Final remarks

1

New disciplines emerge all the time - most recently, for example, cultural
techniques, software studies, or media archaeology. It takes years, even
decades, before they gain dedicated shelves in libraries or a category in
interlibrary digital repositories. Not that it matters that much. They are not
only sites of academic opportunities but, firstly, frameworks of new
perspectives of looking at the world, new domains of knowledge. From the
perspective of a researcher, partaking in a discipline involves negotiating
its vocabulary, classifications, corpus, reference field, and specific
terms[subjects]. Creating new fields involves all that, and more. Even when
one goes against all disciplines.

2

Google can still surprise us.

3

Knowledge has been in the making for millennia. There have been (abstract)
mechanisms established that govern its conditions. We now possess specialized
corpora of texts which are interesting enough to serve as a ground to discuss
and experiment with dictionaries, classifications, indexes, and tools for
reference retrieval. These all belong to the poetic devices of
knowledge-making.

4

Command-line example of tf-idf and concordance in 3 steps.

* 1\. Process the files text.1-5.txt and produce freq.1-5.txt with lists of (nonlemmatized) words (in respective texts), ordered by frequency:

> for i in {1..5}; do tr '[A-Z]' '[a-z]' < text.$i.txt | tr -c '[a-z]'
'[\012*]' | tr -d '[:punct:]' | sort | uniq -c | sort -k 1nr | sed '1,1d' >
temp.txt; max=$(awk -vvar=1 -F" " 'NR==1 {print $var}' temp.txt); awk
-vmaxx=$max -F' ' '{printf "%-7.7f %s\n", $1=0.5+($1/(maxx*2)), $2}' temp.txt > freq.$i.txt; done && rm temp.txt

* 2\. Process the files freq.1-5.txt and produce tfidf.1-5.txt containing a list of words (out of 500 most frequent in respective lists), ordered by weight (specificity for each text):

> for j in {1..5}; do rm -f freq.$j.txt.temp; lines=$(wc -l freq.$j.txt) && for i
in {1..500}; do word=$(awk -vline="$i" -vfield=2 -F" " 'NR==line {print
$field}' freq.$j.txt); tf=$(awk -vline="$i" -vfield=1 -F" " 'NR==line {print
$field}' freq.$j.txt); count=$(egrep -lw "$word" freq.?.txt | wc -l); idf=$(echo
"1+l(5/$count)" | bc -l); tfidf=$(echo "$tf*$idf" | bc); echo "$word" "$tfidf" >>
freq.$j.txt.temp; done; sort -k 2nr < freq.$j.txt.temp > tfidf.$j.txt; done

* 3\. Process the files tfidf.1-5.txt and their source texts, text.1-5.txt, and produce occ.txt with concordance of top 3 words from each of them:

> rm -f occ.txt; for j in {1..5}; do echo "$j" >> occ.txt; ptx -f -w 150
text.$j.txt > occ.$j.txt; for i in {1..3}; do word=$(awk -vline="$i" -vfield=1
-F" " 'NR==line {print $field}' tfidf.$j.txt); egrep -i
"[[:alpha:]] $word" occ.$j.txt >> occ.txt; done; done

Dušan Barok

_Written 23 October - 1 November 2014 in Bratislava and Stuttgart._


 
