Fuller & Dockray
In the Paradise of Too Many Books An Interview with Sean Dockray
2011

# In the Paradise of Too Many Books: An Interview with Sean Dockray

By Matthew Fuller, 4 May 2011


If the appetite to read comes with reading, then open text archive Aaaaarg.org
is a great place to stimulate and sate your hunger. Here, Matthew Fuller talks
to long-term observer Sean Dockray about the behaviour of text and
bibliophiles in a text-circulation network.

Sean Dockray is an artist and a member of the organising group for the LA
branch of The Public School, a geographically distributed and online platform
for the self-organisation of learning.1 Since its initiation by Telic Arts, an
organisation which Sean directs, The Public School has also been taken up as a
model in a number of cities in the USA and Europe.2

We met to discuss the growing phenomenon of text-sharing. Aaaaarg.org has
developed over the last few years as a crucial site for the sharing and
discussion of texts drawn from cultural theory, politics, philosophy, art and
related areas. Part of this discussion is about the circulation of texts,
scanned and uploaded to other sites that it provides links to. Since
participants in The Public School often draw from the uploads to form readers
or anthologies for specific classes or events series, this project provides a
useful perspective from which to talk about the nature of text in the present
era.

**Sean Dockray:** People usually talk about three key actors in
discussions about publishing, which all play fairly understandable roles:
readers; publishers; and authors.

**Matthew Fuller:** Perhaps it could be said that Aaaaarg.org suggests some
other actors that are necessary for a real culture of text; firstly that books
also have some specific kind of activity to themselves, even if in many cases
it is only a latent quality, of storage, of lying in wait and, secondly, that
within the site, there is also this other kind of work done, that of the
public reception and digestion, the response to the texts, their milieu, which
involves other texts, but also systems and organisations, and platforms, such
as Aaaaarg.

![](/sites/www.metamute.org/files/u73/Roland_Barthes_web.jpg)

Image: A young Roland Barthes, with space on his bookshelf

**SD:** Where even the three actors aren't stable! The people that are using
the site are fulfilling some role that usually the publisher has been doing or
ought to be doing, like marketing or circulation.

**MF:** Well it needn't be seen as promotion necessarily. There's also this
kind of secondary work with critics, reviewers and so on - which we can say is
also taken on by universities, for instance, and reading groups, magazines,
reviews - that gives an additional life to the text or brings it particular
kinds of attention, a certain kind of readerliness.

**SD:** Situates it within certain discourses, makes it intelligible in a way,
in a different way.

**MF:** Yes, exactly, there's this other category of life to the book, which
is that of the kind of milieu or the organisational structure in which it
circulates and the different kind of networks of reference that it implies and
generates. Then there's also the book itself, which has some kind of agency,
or at least resilience and salience, when you think about how certain books
have different life cycles of appearance and disappearance.

**SD:** Well, in a contemporary sense, you have something like _Nights of
Labour_, by Rancière - which is probably going to be republished or
reprinted imminently - but has been sort of invisible, out of print, until, by
surprise, it becomes much more visible within the art world or something.

**MF:** And it's also been interesting to see how the art world plays a role
in the reverberations of text which isn't the same as that in cultural theory
or philosophy. Certainly _Nights of Labour_ is something that is very close to
the role that cultural studies plays in the UK, but which (cultural studies)
has no real equivalent in France, so then, geographically and linguistically,
and therefore also in a certain sense conceptually, the life of a book
exhibits these weird delays and lags and accelerations, so that's a good
example. I'm interested in what role Aaaaarg plays in that kind of
proliferation, the kind of things that books do, where they go and how they
become manifest. So I think one of the things Aaaaarg does is to make books
active in different ways, to bring out a different kind of potential in
publishing.

**SD:** Yes, the debate has tended so far to get stuck in those three actors
because people tend to end up picking a pair and placing them in opposition to
one another, especially around intellectual property. The discussion is very
simplistic and ends up in that way, where it's the authors against readers, or
authors against their publishers, with the publishers often introducing
scarcity, where the authors don't want it to be - that's a common argument.
There's this situation where the record industry is suing its own audience.
That's typically the field now.

**MF:** So within that kind of discourse of these three figures, have there
been cases where you think it's valid that there needs to be some form of
scarcity in order for a publishing project to exist?

**SD:** It's obviously not for me to say that there does or doesn't need to be
scarcity but the scarcity that I think we're talking about functions in a
really specific way: it's usually within academic publishing, the book or
journal is being distributed to a few libraries and maybe 500 copies of it are
being printed, and then the price is something anywhere from $60 to $500, and
there's just sort of an assumption that the audience is very well defined and
stable and able to cope with that.

**MF:** Yeah, which recognises that the audiences may be stable as an
institutional form, but not that over time the individual parts of say that
library user population change in their relationship to the institution. If
you're a student for a few years and then you no longer have access, you lose
contact with that intellectual community...

**SD:** Then people just kind of have to cling to that intellectual community.
So when scarcity functions like that, I can't think of any reason why that
_needs_ to happen. Obviously it needs to happen in the sense that there's a
relatively stable balance that wants to perpetuate itself, but what you're
asking is something else.

**MF:** Well there are contexts where the publisher isn't within that academic
system of very high costs, sustained by volunteer labour by academics, the
classic peer review system, but if you think of more of a trade publisher like
a left or a movement or underground publisher, whose books are being
circulated on Aaaaarg...

**SD:** They're in a much more precarious position obviously than a university
press whose economics are quite different, and with the volunteer labour or
the authors are being subsidised by salary - you have to look at the entire
system rather than just the publication. But in a situation where the
publisher is much more precarious and relying on sales and a swing in one
direction or another makes them unable to pay the rent on a storage facility,
one can definitely see why some sort of predictability is helpful and
necessary.

**MF:** So that leads me to wonder whether there are models of publishing that
are emerging that work with online distribution, or with the kind of thing
that Aaaaarg does specifically. Are there particular kinds of publishing
initiatives that really work well in this kind of context where free digital
circulation is understood as an a priori, or is it always in this kind of
parasitic or cyclical relationship?

**SD:** I have no idea how well they work actually; I don't know how well,
say, the Australian publisher re.press works, for example.3 I like a lot of what
they publish, it's given visibility when re.press distributes it and that's a
lot of what a publisher's role seems to be (and what Aaaaarg does as well).
But are you asking how well it works in terms of economics?

**MF:** Well, just whether there are new forms of publishing emerging that work
well in this context that cut out some of the problems?

**SD:** Well, there's also the blog. Certain academic discourses, philosophy
being one, that are carried out on blogs really work to a certain extent, in
that there is an immediacy to ideas, their reception and response. But there's
other problems, such as the way in which, over time, the posts quickly get
forgotten. In this sense, a publication, a book, is kind of nice. It
crystallises and stays around.

**MF:** That's what I'm thinking, that the book is a particular kind of thing
which has its own quality as a form of media. I also wonder whether there
might be intermediate texts, unfinished texts, draft texts that might
circulate via Aaaaarg for instance or other systems. That, at least to me,
would be kind of unsatisfactory but might have some other kind of life and
readership to it. You know, as you say, the blog is a collection of relatively
occasional texts, or texts that are a work in progress, but something like
Aaaaarg perhaps depends upon texts that are finished, that are absolutely the
crystallisation of a particular thought.

![](/sites/www.metamute.org/files/u73/tree_of_knowledge_web.jpg)

Image: The Tree of Knowledge as imagined by Hans Sebald Beham in his 1543
engraving _Adam and Eve_

**SD:** Aaaaarg is definitely not a futuristic model. I mean, it occurs at a
specific time, which is while we're living in a situation where books exist
effectively as a limited edition. They can travel the world and reach certain
places, and yet the readership is greatly outpacing the spread and
availability of the books themselves. So there's a disjunction there, and
that's obviously why Aaaaarg is so popular. Because often there are maybe no
copies of a certain book within 400 miles of a person that's looking for it,
but then they can find it on that website, so while we're in that situation it
works.

**MF:** So it's partly based on a kind of asymmetry, that's spatial, that's
about the territories of publishers and distributors, and also a kind of
asymmetry of economics?

**SD:** Yeah, yeah. But others too. I remember when I was affiliated with a
university and I had JSTOR access and all these things and then I left my job
and then at some point not too long after that my proxy access expired and I
no longer had access to those articles which now would cost $30 a pop just to
even preview. That's obviously another asymmetry, even though, geographically
speaking, I'm in an identical position, just that my subject position has
shifted from affiliated to unaffiliated.

**MF:** There's also this interesting way in which Aaaaarg has gained
different constituencies globally, you can see the kind of shift in the texts
being put up. It seems to me anyway there are more texts coming from non-
western authors. This kind of asymmetry generates a flux. We're getting new
alliances between texts and you can see new bibliographies emerge.

**SD:** Yeah, the original community was very American and European and
gradually people were signing up at other places in order to have access to a
lot of these texts that didn't reach their libraries or their book stores or
whatever. But then there is a danger of US and European thought becoming
central. A globalisation where a certain mode of thought ends up just erasing
what's going on already in the cities where people are signing up, that's a
horrible possible future.

**MF:** But that's already something that's _not_ happening in some ways?

**SD:** Exactly, that's what seems to be happening now. It goes on to
translations that are being put up and then texts that are coming from outside
of the set of US and western authors and so, in a way, it flows back in the
other direction. This hasn't always been so visible, maybe it will begin to
happen some more. But think of the way people can list different texts
together as ‘issues’ - a way that you can make arbitrary groupings - and
they're very subjective, you can make an issue named anything and just lump a
bunch of texts in there. But because, with each text, you can see what other
issues people have also put it in, it creates a trace of its use. You can see
that sometimes the issues are named after the reading groups, people are using
the issues format as a collecting tool, they might gather all Portuguese
translations, or The Public School uses them for classes. At other times it's
just one person organising their dissertation research but you see the wildly
different ways that one individual text can be used.

**MF:** So the issue creates a new form of paratext to the text, acting as a
kind of meta-index, they're a new form of publication themselves. To publish a
bibliography that actively links to the text itself is pretty cool. That also
makes me think within the structures of Aaaaarg it seems that certain parts of
the library are almost at breaking point - for instance the alphabetical
structure.

**SD:** Which is funny because it hasn't always been that alphabetical
structure either, it used to just be everything on one page, and then at some
point it was just taking too long for the page to load up A-Z. And today A is
as long as the entire index used to be, so yeah these questions of density and
scale are there but they've always been dealt with in a very ad hoc kind of
way, dealing with problems as they come. I'm sure that will happen. There
hasn't always been a search and, in a way, the issues, along with
alphabetising, became ways of creating more manageable lists, but even now the
list of issues is gigantic. These are problems of scale.

**MF:** So I guess there's also this kind of question that emerges in the
debate on reading habits and reading practices, this question of the breadth
of reading that people are engaging in. Do you see anything emerging in
Aaaaarg that suggests a new consistency of handling reading material? Is there
a specific quality, say, of the issues? For instance, some of them seem quite
focused, and others are very broad. They may provide insights into how new
forms of relationships to intellectual material may be emerging that we don't
quite yet know how to handle or recognise. This may be related to the lament
for the classic disciplinary road of deep reading of specific materials with a
relatively focused footprint whereas, it is argued, the net is encouraging a
much wider kind of sampling of materials with not necessarily so much depth.

**SD:** It's partially driven by people simply being in the system, in the
same way that the library structures our relationship to text, the net does it
in another way. One comment I've heard is that there's too much stuff on
Aaaaarg, which wasn't always the case. It used to be that I read every single
thing that was posted because it was slow enough and the things were short
enough that my response was, ‘Oh something new, great!’ and I would read it.
But now, obviously that is totally impossible, there's too much; but in a way
that's just the state of things. It does seem like certain tactics of making
sense of things, of keeping things away and letting things in and queuing
things for reading later become just a necessary part of even navigating. It's
just the terrain at the moment, but this is only one instance. Even when I was
at the university and going to libraries, I ended up with huge stacks of books
and I'd just buy books that I was never going to read just to have them
available in my library, so I don't think feeling overwhelmed by books is
particularly new, just maybe the scale of it is. In terms of how people
actually conduct themselves and deal with that reality, it's difficult to say.
I think the issues are one of the few places where you would see any sort of
visible answers on Aaaaarg, otherwise it's totally anecdotal. At The Public
School we have organised classes in relationship to some of the issues, and
then we use the classes to also figure out what texts we are going to be
reading in the future, to make new issues and new classes. So it becomes an
organising group, reading and working its way through subject matter and
material, then revisiting that library and seeing what needs to be there.

**MF:** I want to follow that kind of strand of habits of accumulation,
sorting, deferring and so on. I wonder, what is a kind of characteristic or
unusual reading behaviour? For instance are there people who download the
entire list? Or do you see people being relatively selective? How does the
mania of the net, with this constant churning of data, map over to forms of
bibliomania?

**SD:** Well, in Aaaaarg it's again very specific. Anecdotally again, I have
heard from people how much they download and sometimes they're very selective,
they just see something that's interesting and download it, other times they
download everything and occasionally I hear about this mania of mirroring the
whole site. What I mean about being specific to Aaaaarg is that a lot of the
mania isn't driven by just the need to have everything; it's driven by the
acknowledgement that the source is going to disappear at some point. That
sense of impending disappearance is always there, so I think that drives a lot
of people to download everything because, you know, it's happened a couple
times where it's just gone down or moved or something like that.

**MF:** It's true, it feels like something that is there even for a few weeks
or a few months. By a sheer fluke it could last another year, who knows.

**SD:** It's a different kind of mania, and usually we get lost in this
thinking that people need to possess everything but there is this weird
preservation instinct that people have, which is slightly different. The
dominant sensibility of Aaaaarg at the beginning was the highly partial and
subjective nature to the contents and that is something I would want to
preserve, which is why I never thought it to be particularly exciting to have
lots of high quality metadata - it doesn't have the publication date, it
doesn't have all the great metadata that say Amazon might provide. The system
is pretty dismal in that way, but I don't mind that so much. I read something
on the Internet which said it was like being in the porn section of a video
store with all black text on white labels, it was an absolutely beautiful way
of describing it. Originally Aaaaarg was about trading just those particular
moments in a text that really struck you as important, that you wanted other
people to read so it would be very short, definitely partial, it wasn't a
completist project, although some people maybe treat it in that way now. They
treat it as a thing that wants to devour everything. That's definitely not the
way that I have seen it.

**MF:** And it's so idiosyncratic. I mean, you know, it's certainly possible
that it could be read in a canonical mode, you can see that there's that
tendency there, of the core of Adorno or Agamben, to take the a's for
instance. But of the more contemporary stuff it's very varied, that's what's
nice about it as well. Alongside all the stuff that has a very long-term
existence, like historical books that may be over a hundred years old, what
turns up there is often unexpected, but certainly not random or
uninterpretable.

![](/sites/www.metamute.org/files/u1/malraux_web3_0.jpg)

Image: French art historian André Malraux lays out his _Musée Imaginaire_,
1947

**SD:** It's interesting to think a little bit about what people choose to
upload, because it's not easy to upload something. It takes a good deal of
time to scan a book. I mean obviously some things are uploaded which are, have
always been, digital. (I wrote something about this recently about the scan
and the export - the scan being something that comes out of a labour in
relationship to an object, to the book, and the export is something where the
whole life of the text has sort of been digital from production to circulation
and reception). I happen to think of Aaaaarg in the realm of the scan and the
bootleg. When someone actually scans something they're potentially spending
hours because they're doing the work on the book, they're doing something with
software, they're uploading.

**MF:** Aaaaarg hasn't introduced file quality thresholds either.

**SD:** No, definitely not. Where would that go?

**MF:** You could say with PDFs they have to be searchable texts?

**SD:** I'm sure a lot of people would prefer that. Even I would prefer it a
lot of the time. But again there is the idiosyncratic nature of what appears,
and there is also the idiosyncratic nature of the technical quality and
sometimes it's clear that the person that uploads something just has no real
experience of scanning anything. It's kind of an inevitable outcome. There are
movie sharing sites that are really good about quality control both in the
metadata and what gets up; but I think that if you follow that to the end,
then basically you arrive at the exported version being the Platonic text, the
impossible, perfect, clear, searchable, small - totally eliminating any trace
of what is interesting, the hand of reading and scanning, and this is what you
see with a lot of the texts on Aaaaarg. You see the hand of the person who's
read that book in the past, you see the hand of the person who scanned it.
Literally, their hand is in the scan. This attention to the labour of both
reading and redistributing, it's important to still have that.

**MF:** You could also find that in different ways, for instance with a PDF: a
PDF that was bought directly as an ebook that's digitally watermarked will
have traces of the purchaser coded in there. So then there's also this work of
stripping out that data which will become a new kind of labour. So it doesn't
have this kind of humanistic refrain, the actual hand, the touch of the
labour. This is perhaps more interesting, the work of the code that strips it
out, so it's also kind of recognising that code as part of the milieu.

**SD:** Yeah, that is a good point, although I don't know that it's more
interesting labour.

**MF:** On a related note, The Public School as a model is interesting in that
it's kind of a convention, it has a set of rules, an infrastructure, a
website, it has a very modular being. Participants operate with a simple
organisational grammar which allows them to say ‘I want to learn this’ or ‘I
want to teach this’ and to draw in others on that basis. There's lots of
proposals for classes, some of them don't get taken up, but it's a process and
a set of resources which allow this aggregation of interest to occur. I just
wonder how you saw that kind of ethos of modularity in a way, as a set of
minimum rules or set of minimum capacities that allow a particular set of
things to occur?

**SD:** This may not respond directly to what you were just talking about, but
there's various points of entry to the school and also having something that
people feel they can take on as their own and I think the minimal structure
invites quite a lot of projection as to what that means and what's possible
with it. If it's not doing what you want it to do or you think, ‘I'm not sure
what it is’, there's the sense that you can somehow redirect it.

**MF:** It's also interesting that projection itself can become a technical
feature so in a way the work of the imagination is done also through this kind
of tuning of the software structure. The governance that was handled by the
technical infrastructure actually elicits this kind of projection, elicits the
imagination in an interesting way.

**SD:** Yeah, yeah, I totally agree and, not to put too much emphasis on the
software, although I think that there's good reason to look at both the
software and the conceptual diagram of the school itself, but really in a way
it would grind to a halt if it weren't for the very traditional labour of
people - like an organising committee. In LA there's usually around eight of
us (now Jordan Biren, Solomon Bothwell, Vladada Gallegos, Liz Glynn, Naoko
Miyano, Caleb Waldorf, and me) who are deeply involved in making that
translation of these wishes - thrown onto the website that somehow attract the
other people - into actual classes.

**MF:** What does the committee do?

**SD:** Even that's hard to describe and that's what makes it hard to set up.
It's always very particular to even a single idea, to a single class proposal.
In general it'd be things like scheduling, finding an instructor if an
instructor is what's required for that class. Sometimes it's more about
finding someone who will facilitate, other times it's rounding up materials.
But it could be helping an open proposal take some specific form. Sometimes
it's scanning things and putting them on Aaaaarg. Sometimes, there will be a
proposal - I proposed a class in the very, very beginning on messianic time, I
wanted to take a class on it - and it didn't happen until more than a year and
a half later.

**MF:** Well that's messianic time for you.

**SD:** That and the internet. But other times it will be only a week later.
You know we did one on the Egyptian revolution and its historical context,
something which demanded a very quick turnaround. Sometimes the committee is
going to classes and there will be a new conflict that arises within a class,
that they then redirect into the website for a future proposal, which becomes
another class: a point of friction where it's not just like next, and next,
and next, but rather it's a knot that people can't quite untie, something that
you want to spend more time with, but you may want to move on to other things
immediately, so instead you postpone that to the next class. A lot of The
Public School works like that: it's finding momentum then following it. A lot
of our classes are quite short, but we try and string them together. The
committee are the ones that orchestrate that. In terms of governance, it is
run collectively, although with the committee, every few months people drop
off and new people come on. There are some people who've been on for years.
Other people who stay on just for that point of time that feels right for
them. Usually, people come on to the committee because they come to a lot of
classes, they start to take an interest in the project and before they know it
they're administering it.

**Matthew Fuller's <[m.fuller@gold.ac.uk](mailto:m.fuller@gold.ac.uk)> most
recent book, _Elephant and Castle_, is forthcoming from Autonomedia.**

**He is collated at**

**Footnotes**

1

2 [http://telic.info/](http://telic.info/)

3

1 {print $var}' temp.txt); awk
-vmaxx=$max -F' ' '{printf "%-7.7f %s\n", $1=0.5+($1/(maxx*2)), $2}' > freq.$i.txt; done && rm temp.txt

* 2\. Process the files freq.1-5.txt and produce tfidf.1-5.txt containing a list of words (out of 500 most frequent in respective lists), ordered by weight (specificity for each text):

> for j in {1..5}; do rm -f freq.$j.txt.temp; lines=$(wc -l freq.$j.txt) && for i
in {1..500}; do word=$(awk -vline="$i" -vfield=2 -F" " 'NR==line {print
$field}' freq.$j.txt); tf=$(awk -vline="$i" -vfield=1 -F" " 'NR==line {print
$field}' freq.$j.txt); count=$(egrep -lw "$word" freq.?.txt | wc -l); idf=$(echo
"1+l(5/$count)" | bc -l); tfidf=$(echo "$tf*$idf" | bc); echo $word $tfidf >>
freq.$j.txt.temp; done; sort -k 2nr < freq.$j.txt.temp > tfidf.$j.txt; done

* 3\. Process the files tfidf.1-5.txt and their source text, text.txt, and produce occ.txt with concordance of top 3 words from each of them:

> rm occ.txt && for j in {1..5}; do echo "$j" >> occ.txt; ptx -f -w 150
text.txt.$j > occ.$j.txt; for i in {1..3}; do word=$(awk -vline="$i" -vfield=1
-F" " 'NR

Barok
Poetics of Research
2014

_An unedited version of a talk given at the conference [Public
Library](http://www.wkv-stuttgart.de/en/program/2014/events/public-library/)
held at Württembergischer Kunstverein Stuttgart, 1 November 2014._

_Bracketed sequences are to be reformulated._

Poetics of Research

In this talk I'm going to attempt to identify [particular] cultural
algorithms, i.e. processes in which cultural practices and software meet. With
them a sphere is implied in which algorithms gather to form bodies of
practices and in which cultures gather around algorithms. I'm going to
approach them through the perspective of my practice as a cultural worker,
editor and artist, considering practice in the same rank as theory and
poetics, and where theorization of practice can also lead to the
identification of poetical devices.

The primary motivation for this talk is an attempt to figure out where we
stand as operators, users [and communities] gathering around infrastructures
containing a massive body of text (among other things) and what sort of things
might be considered to make a difference [or to keep making difference].

The talk mainly [considers] the role of text and the word in research, by way
of several figures.

A

A reference, list, scheme, table, index; those things that intervene in the
flow of narrative, illustrating the point, perhaps in a more economic way than
the linear text would do. Yet they don't function as pictures, they are
primarily texts, arranged in figures. Their forms have been
standardised [normalised] over centuries and have withstood the transition to the
digital without any significant change, being completely intuitive to the
modern reader. Compared to the body of text they are secondary, run parallel
to it. Their function is however different to that of punctuation. They
are there neither to shape the narrative nor to aid structuring the argument
into logical blocks. Nor is their function spatial, like in visual poems.
Their positions within a document are determined according to the sequential
order of the text, [standing as attachments] and are there to clarify the
nature of relations among elements of the subject-matter, or to establish
relations with other documents. The [premise] of my talk is that these
_textual figures_ also came to serve as the abstract[relational] models
determining possible relations among documents as such, and in consequence [to
structure conditions [of research]].

B

It can be said that research, as inquiry into a subject-matter, consists of
discrete queries. A query, such as a question about what something is, what
kinds, parts and properties it has, and so on, can be consulted in
existing documents or generate new documents based on collection of data [in]
the field and through experiment, before proceeding to reasoning [arguments
and deductions]. Formulation of a query is determined by protocols providing
access to documents, which means that there is a difference between collecting
data outside the archive (the undocumented, i.e. in the field and through
experiment), consulting with a person--an archivist (expert, librarian,
documentalist), and consulting with a database storing documents. The
phenomena such as [deepening] of specialization and thorough digitization
[have given] privilege to the database as [a|the] [fundamental] means for
research. Obviously, this is a very recent [phenomenon]. Queries were once
formulated in natural language; now, given the fact that databases are queried
[using] SQL language, their interfaces are mere extensions of it and
researchers pose their questions by manipulating dropdowns, checkboxes and
input boxes mashed together on a flat screen being run by software that in
turn translates them into a long line of conditioned _SELECTs_ and _JOINs_
performed on tables of data.
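The translation described above, from interface widgets to a line of conditioned _SELECTs_ and _JOINs_, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `documents` and `authors` tables, their columns, the form field names and the sample rows are all hypothetical, and the sketch uses Python's built-in `sqlite3` module rather than any particular library system.

```python
import sqlite3


def build_query(form):
    """Translate form widgets into one parameterised SELECT with a JOIN.

    The form keys (author, language, full_text_only) are hypothetical
    stand-ins for an input box, a dropdown and a checkbox.
    """
    sql = ("SELECT documents.title FROM documents "
           "JOIN authors ON authors.id = documents.author_id")
    conditions, params = [], []
    if form.get("author"):            # text input box
        conditions.append("authors.name LIKE ?")
        params.append(f"%{form['author']}%")
    if form.get("language"):          # dropdown
        conditions.append("documents.language = ?")
        params.append(form["language"])
    if form.get("full_text_only"):    # checkbox
        conditions.append("documents.has_full_text = 1")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql + " ORDER BY documents.title", params


def demo():
    """Run the generated query against a tiny, made-up in-memory catalogue."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author_id INTEGER REFERENCES authors(id),
            language TEXT,
            has_full_text INTEGER
        );
        -- Illustrative rows only, not real catalogue data.
        INSERT INTO authors VALUES (1, 'Michel Foucault'), (2, 'Jacques Ranciere');
        INSERT INTO documents VALUES
            (1, 'Archaeology of Knowledge', 1, 'en', 1),
            (2, 'Archaeology of Knowledge', 1, 'fr', 0),
            (3, 'Nights of Labour', 2, 'en', 1);
    """)
    form = {"author": "Foucault", "language": "en", "full_text_only": True}
    sql, params = build_query(form)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, rows


if __name__ == "__main__":
    generated_sql, result = demo()
    print(generated_sql)
    print(result)
```

The researcher never sees the generated statement; each widget contributes at most one condition, which is one way of reading the claim that interfaces are extensions of the query language.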

Specialization, digitization and networking have changed the language of
questioning. Inquiry, once attached to the flesh and paper, has been
[entrusted] to the digital and networked. Researchers are querying the black
box.

C

Searching in a collection of [amassed/assembled] [tangible] documents (i.e. a
bookshelf) is different from searching in a systematically structured
repository (a library) and even more so from searching in a digital repository
(a digital library). Not that they are mutually exclusive. One can devise
structures and algorithms to search through a printed text, or read books in a
library one by one. They are rather [models] [embodying] various [processes]
associated with the query. These properties of the query might be called [the
sequence], the structure and the index. If they are present in the ways of
querying documents, and we will return to this issue, are they persistent
within the inquiry as such? [wait]

D

This question itself is a rupture in the sequence. It makes a demand to depart
from one narrative [a continuous flow of words] to another, to figure out,
while remaining bound to it [it would be even more as a so-called rhetorical
question]. So there has been one sequence, or line, of the inquiry--about the
kinds of the query and its properties. That sequence itself is a digression,
from within the sequence about what is research and describing its parts
(queries). We are thus returning to it and continuing with the question of
whether the properties of the inquiry are the same as the properties of the
query.

E

But isn't it true that every single utterance occurring in a sequence yields a
query as well? Let's consider the word _utterance_. [wait] It can produce a
number of associations, for example with how Foucault employs the notion of
_énoncé_ in his _Archaeology of Knowledge_, giving a hard time to his English
translators wondering whether _utterance_ or _statement_ is more appropriate,
or whether they are interchangeable, and what impact each choice would have on
his reception in the Anglophone world. Limiting ourselves to textual forms for
now (and not translating his work but pursuing a different inquiry), let us say
the utterance is a word [or a phrase or an idiom] in a sequence such as a
sentence, a paragraph, or a document.

## (F) The structure

This distinction is as old as recorded Western thought, since both Plato and
Aristotle differentiate between a word on its own ("the said", a thing said)
and words in the company of other words. For example, Aristotle's _Categories_
[lay] on the [notion] of words on their own, and they are made the
subject-matter of that inquiry. [For him], the ambiguity of connotation words
[produce] lies in their synonymity, understood differently from the moderns--
not as more words denoting a similar thing but rather one word denoting
various things. Categories were outlined as a device to differentiate among
words according to kinds of these things. Every word as such belonged to no
fewer and no more than one of ten categories.

So it happens to the word _utterance_, as to any other word uttered in a
sequence, that it poses a question, a query about what share of the spectrum
of possibly denoted things might prove the most appropriate in a given
context. The more context, the more precise the share that comes to the fore.
When taken out of context, ambiguity prevails as the spectrum unveils in its
variety.

Thus single words [as any other utterances] are questions, queries,
themselves, and by occurring in statements, in context, their [means] are being
singled out.

This process is _conditioned_ by what has been formalized as the techniques of
_regulating_ definitions of words.

### (G) The structure: words as words

* [![](/images/thumb/c/c8/Philitas_in_P.Oxy.XX_2260_i.jpg/144px-Philitas_in_P.Oxy.XX_2260_i.jpg)](/File:Philitas_in_P.Oxy.XX_2260_i.jpg)

P.Oxy.XX 2260 i: Oxyrhynchus papyrus XX, 2260, column i, with quotation from
Philitas, early 2nd c. CE. ¹(http://163.1.169.40/cgi-
bin/library?e=q-000-00---0POxy--00-0-0--0prompt-10---4------0-1l--1-en-50---
20-about-2260--
00031-001-0-0utfZz-8-00&a=d&c=POxy&cl=search&d=HASH13af60895d5e9b50907367)
²(http://en.wikipedia.org/wiki/File:POxy.XX.2260.i-Philitas-
highlight.jpeg)

* [![](/images/thumb/9/9e/Cyclopaedia_1728_page_210_Dictionary_entry.jpg/88px-Cyclopaedia_1728_page_210_Dictionary_entry.jpg)](/File:Cyclopaedia_1728_page_210_Dictionary_entry.jpg)

Ephraim Chambers, _Cyclopaedia, or an Universal Dictionary of Arts and
Sciences_ , 1728, p. 210. ³(http://digicoll.library.wisc.edu/cgi-
bin/HistSciTech/HistSciTech-
idx?type=turn&entity=HistSciTech.Cyclopaedia01.p0576&id=HistSciTech.Cyclopaedia01&isize=L)

* [![](/images/thumb/b/b8/Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg/160px-Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg)](/File:Detail_from_the_Liddell-Scott_Greek-English_Lexicon_c1843.jpg)

Detail from the Liddell-Scott Greek-English Lexicon, c1843.

Dictionaries have had a long life. The ancient Greek scholar and poet Philitas
of Cos, who lived in the 4th c. BCE, wrote a vocabulary explaining the meanings
of rare Homeric and other literary words, words from local dialects, and
technical terms. The vocabulary, called _Disorderly Words_ (Átaktoi glôssai),
has been lost, apart from a few fragments quoted by later authors. One example
is that the word πέλλα (pélla) meant "wine cup" in the ancient Greek region of
Boeotia, in contrast to the same word meaning "milk pail" in Homer's _Iliad_.

Not much has changed in the way dictionaries constitute order. Selected
archives of statements are queried to yield occurrences of particular words,
various _criteria[indicators]_ are applied to filtering and sorting them, and
in turn the spectrum of [denoted] things allocated in this way is structured
into groups and subgroups which are then given, according to another set of
rules, shorter or longer names. These constitute facets of [potential]
meanings of a word.

So there are at least _four_ sets of conditions [structuring] dictionaries.
One is required to delimit an archive[corpus of texts], one to select and give
preference[weights] to occurrences of a word, another to cluster them, and yet
another to abstract[generalize] the subject-matter of each of these clusters.
Needless to say, this is a craft of a few and these criteria are rarely
disclosed, despite their impact on research, and more generally, their
influence as conditions for production[making] of a so-called _common sense_.

It doesn't take that much to reimagine what a dictionary is and what it could
be, especially having large specialized corpora of texts at hand. These can
also serve as aids in production of new words and new meanings.

### (H) The structure: words as knowledge and the world

* [![](/images/thumb/0/02/Boethius_Porphyrys_Isagoge.jpg/120px-Boethius_Porphyrys_Isagoge.jpg)](/File:Boethius_Porphyrys_Isagoge.jpg)

Boethius's rendering of a classification tree described in Porphyry's Isagoge
(3rd c.), [6th c.] 10th c.
⁴(http://www.e-codices.unifr.ch/en/sbe/0315/53/medium)

* [![](/images/thumb/d/d0/Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg/94px-Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg)](/File:Cyclopaedia_1728_page_ii_Division_of_Knowledge.jpg)

Ephraim Chambers, _Cyclopaedia, or an Universal Dictionary of Arts and
Sciences_ , London, 1728, p. II. ⁵(http://digicoll.library.wisc.edu/cgi-
bin/HistSciTech/HistSciTech-
idx?type=turn&entity=HistSciTech.Cyclopaedia01.p0015&id=HistSciTech.Cyclopaedia01&isize=L)

* [![](/images/thumb/d/d6/Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg/116px-Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg)](/File:Encyclopedie_1751_Systeme_figure_des_connaissances_humaines.jpg)

Système figuré des connaissances humaines, _Encyclopédie ou Dictionnaire
raisonné des sciences, des arts et des métiers_ , 1751.
⁶(http://encyclopedie.uchicago.edu/content/syst%C3%A8me-figur%C3%A9-des-
connaissances-humaines)

* [![](/images/thumb/9/96/Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg/96px-Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg)](/File:Haeckel_Ernst_1874_Stammbaum_des_Menschen.jpg)

Haeckel - Darwin's tree.

Another _formalized_ and [internalized] process at play when figuring
out a word is its [containment]. A word is structured not only by way of the
things it potentially denotes but also by the words it is potentially part of
and those it contains.

The fuss around the categorization of knowledge _and_ the world in Western
thought can be traced back to Porphyry, if not further. In his introduction to
Aristotle's _Categories_ this 3rd century AD Neoplatonist began expanding the
notions of genus and species into their hypothetical consequences. Aristotle's
brief work outlines ten categories of 'things that are said' (legomena,
λεγόμενα), namely substance (or substantive, {not the same as matter!},
οὐσία), quantity (ποσόν), qualification (ποιόν), a relation (πρός), where
(ποῦ), when (πότε), being-in-a-position (κεῖσθαι), having (or state,
condition, ἔχειν), doing (ποιεῖν), and being-affected (πάσχειν). In a
different work, _Topics_, Aristotle outlines four kinds of subjects/materials
indicated in propositions/problems from which arguments/deductions start.
These are a definition (όρος), a genus (γένος), a property (ἴδιος), and an
accident (συμβεβηϰόϛ). Porphyry does not explicitly refer to _Topics_, and says
he omits speaking "about genera and species, as to whether they subsist (in
the nature of things) or in mere conceptions only"
⁸(http://www.ccel.org/ccel/pearse/morefathers/files/porphyry_isagogue_02_translation.htm#C1),
which means he avoids explicating whether he talks about kinds of concepts or
kinds of things in the sensible world. However, the work sparked confusion, as
the following passage [suggests]:

> "[I]n each category there are certain things most generic, and again, others
most special, and between the most generic and the most special, others which
are alike called both genera and species, but the most generic is that above
which there cannot be another superior genus, and the most special that below
which there cannot be another inferior species. Between the most generic and
the most special, there are others which are alike both genera and species,
referred, nevertheless, to different things, but what is stated may become
clear in one category. Substance indeed, is itself genus, under this is body,
under body animated body, under which is animal, under animal rational animal,
under which is man, under man Socrates, Plato, and men particularly." (Owen
1853,
⁹(http://www.ccel.org/ccel/pearse/morefathers/files/porphyry_isagogue_02_translation.htm#C2))

Porphyry took one of Aristotle's ten categories of the word, substance, and
dissected it using one of his four rhetorical devices, genus. By employing
Aristotle's categories, genera and species as means for logical operations,
for dialectic, Porphyry's interpretation came to bear more resemblance to the
perceived _structures_ of the world. So they began to bloom.

There were earlier examples, but Porphyry was the most influential in
injecting the _universalist_ version of classification [implying] the figure
of a tree into the [locus] of Aristotle's thought. Knowledge became
monotheistic.

Classification schemes [growing from one point] play a major role in
untangling the format of the modern encyclopedia from that of the dictionary
governed by the alphabet. Two of the most influential encyclopedias of the 18th
century are cases in point. Although still keeping 'dictionary' in their
titles, they are conceived not to represent words but knowledge. The
[upper-most] genus of the body was set as the body of knowledge. The English
_Cyclopaedia, or an Universal Dictionary of Arts and Sciences_ (1728) splits
into two main branches: "natural and scientifical" and "artificial and
technical"; these further split down to 47 classes in total, each carrying a
structured list (on the following pages) of thematic articles, serving as
a table of contents. The French _Encyclopedia: or a Systematic Dictionary of
the Sciences, Arts, and Crafts_ (1751) [unwinds] from understanding
(_entendement_), branches into memory as history, reason as philosophy, and
imagination as poetry. The logic of containers was employed as an aid not only
to deal with the enormous task of naming and not omitting anything from what is
known, but also for the management of the labour of hundreds of writers and
researchers, to create a mechanism for delegating work and distributing
responsibilities. Flesh was also more present, in field research, with
researchers attending workshops and sites of everyday life to annotate it.

The world came forward to outshine the word in other schemes. Darwin's tree of
evolution and some of the modern document classification systems such as
Charles A. Cutter's _Expansive Classification_ (1882) set out to classify the
world itself and set the field for what has come to be known as authority
lists structuring metadata in today's computing.

### The structure (summary)

Facetization of meaning and branching of knowledge are both the domain of the
unit of utterance.

While lexicographers[dictionarists] structure thought through multi-layered
processes of abstraction of the written record, knowledge growers dissect it
into hierarchies of [mutually] contained notions.

One seeks to describe the word as a faceted list of small worlds, another to
describe the world as a structured list of words. One plays a prime role in the
domain of epistemology, in what is known, controlling the vocabulary, another
in the domain of ontology, in what is, controlling reality.

Every [word] has its given things, every thing has its place, closer or
further from a single word.

The schism between classifying words and classifying the world implies it is
not possible to construct a universal classification scheme[system]. On top of
that, any classification system of words is bound to a corpus of texts it is
operating upon and any classification system of the world again operates with
words which are bound to a vocabulary[lexicon] which is again bound to a
corpus [of texts]. This does not prevent people from trying.
Classifications function as descriptors of and 'inscriptors' upon the world,
imprinting their authority. They operate from [a locus of] their
corpus[context]-specificity. The larger the corpus, the more power it has in
shaping the world, as far as the word shapes it (yes, I do imply Google here,
for which it is a domain to be potentially exploited).

## (J) The sequence

The structure-yielding query [of] the single word [shrinks][narrows, becomes
more precise] with preceding and following words. Inquiry proceeds in the flow
that establishes another kind[mode] of relationality, chaining words into the
sequence. While the structuring property of the query brings words apart from
each other, its sequential property establishes continuity and brings these
units into an ordered set.

This is what is responsible for attaching the textual figures mentioned earlier
(lists, schemes, tables) to the body of the text. Associations can also be
stated explicitly, by indexing tables and then referring to them from a
particular point in the text. The same goes for explicit associations made
between blocks of the text by means of indexed paragraphs, chapters or pages.

From this it follows that all utterances point to the following utterance by
the nature of sequential order, and indexing provides means for pointing
elsewhere in the document as well.

A lot can be said about references to other texts. Here, to spare time, I
would refer you to a talk I gave a few months ago and which is online
¹⁰(http://monoskop.org/Talks/Communing_Texts).

This is still the realm of print. What happens to a document when it is
digitized?

Digitization breaks a document into units, each of which is assigned a
numbered position in the sequence of the document. From this perspective
digitization can be viewed as a total indexation of the document. It is
converted into units rendered for machine operations. This sequentiality is
made explicit by means of an underlying index.

Sequences and chains are orders of one dimension. Their one-dimensional
ordering allows addressability of each element and [random] access. [Jumps]
between [random] addresses are still sequential, processing elements one at a
time.
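
A minimal sketch of this total indexation follows, under the assumption that the unit is the word (a character or a byte would serve equally well). The function names `index_words` and `word_at` are made up for the example.

```shell
# Split text on standard input into words and number each one:
# the number is the word's address in the sequence.
index_words() {
    tr -cs '[:alnum:]' '\n' | awk 'NF { n++; print n, $0 }'
}

# Random access: print the word stored at a given address.
word_at() {
    index_words | awk -v addr="$1" '$1 == addr { print $2 }'
}

text="Inquiry proceeds in the flow that chains words into the sequence"
printf '%s\n' "$text" | index_words
printf '%s\n' "$text" | word_at 5
```

Note that `word_at` still walks through the numbered sequence one element at a time until it reaches the requested address.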

## (K) The index

* [![](/images/thumb/2/27/Summa_confessorum.1310.jpg/103px-Summa_confessorum.1310.jpg)](/File:Summa_confessorum.1310.jpg)

Summa confessorum [1297-98], 1310.
⁷(http://www.bl.uk/onlinegallery/onlineex/illmanus/roymanucoll/j/011roy000008g11u00002000.html)

[The] sequencing not only weaves words into statements but activates other
temporalities, and _presents occurrences of words from past statements_. As
now, when I am saying the word _utterance_, each time there surface contexts
in which I have used it earlier.

A long quote from Frederick G. Kilgour, _The Evolution of the Book_, 1998, pp.
76-77:

> "A century of invention of various types of indexes and reference tools
preceded the advent of the first subject index to a specific book, which
occurred in the last years of the thirteenth century. The first subject
indexes were "distinctions," collections of "various figurative or symbolic
meanings of a noun found in the scriptures" that "are the earliest of all
alphabetical tools aside from dictionaries." (Richard and Mary Rouse supply an
example: "Horse = Preacher. Job 39: 'Hast thou given the horse strength, or
encircled his neck with whinning?')

>

> [Concordance] By the end of the third decade of the thirteenth century Hugh
de Saint-Cher had produced the first word concordance. It was a simple word
index of the Bible, with every location of each word listed by [its position
in the Bible specified by book, chapter, and letter indicating part of the
chapter]. Hugh organized several dozen men, assigning to each man an initial
letter to search; for example, the man assigned M was to go through the entire
Bible, list each word beginning with M and give its location. As it was soon
perceived that this original reference work would be even more useful if words
were cited in context, a second concordance was produced, with each word in
lengthy context, but it proved to be unwieldy. [Soon] a third version was
produced, with words in contexts of four to seven words, the model for
biblical concordances ever since.

>

> [Subject index] The subject index, also an innovation of the thirteenth
century, evolved over the same period as did the concordance. Most of the
early topical indexes were designed for writing sermons; some were organized,
while others were apparently sequential without any arrangement. By midcentury
the entries were in alphabetical order, except for a few in some classified
arrangement. Until the end of the century these alphabetical reference works
indexed a small group of books. Finally John of Freiburg added an alphabetical
subject index to his own book, _Summa Confessorum_ (1297—1298). As the Rouses
have put it, 'By the end of the [13]th century the practical utility of the
subject index is taken for granted by the literate West, no longer solely as
an aid for preachers, but also in the disciplines of theology, philosophy, and
both kinds of law.'"

In one sense neither the subject-index nor the concordance is an index; they
are words or groups of words selected according to given criteria from the
body of the text, each accompanied by a list of identifiers. These identifiers
are elements of an index, whether they represent a page, chapter, column, or
other [kind of] block of text. Every identifier is a unique _address_.
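
The relation between a selected word and its list of addresses can be sketched as a small keyword-in-context listing. This is an illustration only: the function `concordance`, its `WIDTH` argument and the choice of the word's position in the sequence as the identifier are assumptions made here.

```shell
# concordance WORD [WIDTH]: read text on standard input and print every
# occurrence of WORD, preceded by its position in the word sequence and
# shown with WIDTH words of context on either side (default 3).
concordance() {
    tr -cs '[:alnum:]' '\n' | awk -v key="$1" -v width="${2:-3}" '
        NF { words[++n] = $0 }
        END {
            for (i = 1; i <= n; i++) {
                if (tolower(words[i]) != tolower(key)) continue
                line = i ":"
                for (j = i - width; j <= i + width; j++)
                    if (j >= 1 && j <= n) line = line " " words[j]
                print line
            }
        }'
}

printf '%s\n' "the word is a query and every word in a sequence is a query" |
    concordance word 2
```

Each output line pairs an address with the word in its context, which is what a concordance entry amounts to.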

The index is thus an ordering of a sequence by means of associating its
elements with a set of symbols, where each element is given a unique
combination of symbols. Different sizes of sets yield different numbers of
variations. Symbol sets such as an alphabet, Arabic numerals, Roman numerals,
and binary digits have different proportions between the length of a string of
symbols and the number of possible variations it can contain. Thus a string of
two symbols of the English alphabet can take 26^2 different values, of Arabic
numerals 10^2, of Roman numerals 7^2 (counting the seven symbols I, V, X, L, C,
D and M) and of binary digits 2^2.
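
The arithmetic can be checked with a short sketch. The two helper functions, `variations` and `length_needed`, are hypothetical names introduced here; the second one asks the reverse question of how long a string must be to address a given number of elements.

```shell
# variations SIZE LENGTH: number of distinct strings of LENGTH symbols
# drawn from a set of SIZE symbols (SIZE to the power LENGTH).
variations() {
    awk -v size="$1" -v len="$2" 'BEGIN { printf "%d\n", size ^ len }'
}

# length_needed SIZE COUNT: shortest string length that yields at least
# COUNT distinct addresses with SIZE symbols.
length_needed() {
    awk -v size="$1" -v count="$2" 'BEGIN {
        len = 0; reach = 1
        while (reach < count) { reach *= size; len++ }
        print len
    }'
}

variations 26 2        # two letters of the English alphabet
variations 10 2        # two Arabic numerals
variations 2 2         # two binary digits
length_needed 2 676    # binary digits needed to match two letters
```

The smaller the symbol set, the longer the string needed to address the same number of elements.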

Indexation is segmentation, a breaking into segments. From as early as the
13th century an index, such as that of sections, has served as an enabler of
search. The more [detailed] the indexation, the more precise the search results
it enables.

The subject-index and concordance are tables of search results. There is a
direct lineage between the 13th-century biblical concordances and the birth of
computational linguistic analysis: both were initiated and realised by
priests.

During World War II, the Jesuit Father Roberto Busa began to look for machines
for the automation of the linguistic analysis of the 11 million-word Latin
corpus of Thomas Aquinas and related authors.

Working on his Ph.D. thesis on the concept of _praesens_ in Aquinas he
realised two things:

> "I realized first that a philological and lexicographical inquiry into the
verbal system of an author has to precede and prepare for a doctrinal
interpretation of his works. Each writer expresses his conceptual system in
and through his verbal system, with the consequence that the reader who
masters this verbal system, using his own conceptual system, has to get an
insight into the writer's conceptual system. The reader should not simply
attach to the words he reads the significance they have in his mind, but
should try to find out what significance they had in the writer's mind.
Second, I realized that all functional or grammatical words (which in my mind
are not 'empty' at all but philosophically rich) manifest the deepest logic of
being which generates the basic structures of human discourse. It is this
basic logic that allows the transfer from what the words mean today to what
they meant to the writer.

>

> In the works of every philosopher there are two philosophies: the one which
he consciously intends to express and the one he actually uses to express it.
The structure of each sentence implies in itself some philosophical
assumptions and truths. In this light, one can legitimately criticize a
philosopher only when these two philosophies are in contradiction."
¹¹(http://www.alice.id.tue.nl/references/busa-1980.pdf)

Carried out in collaboration with IBM in New York from 1949, the work, a
concordance of all the words of Thomas Aquinas, was finally published in the
1970s in 56 printed volumes (a version has been online since 2005
¹²(http://www.corpusthomisticum.org/it/index.age)). Besides that, an
electronic lexicon for automatic lemmatization of Latin words was created by a
team of ten priests over the course of two years (in two phases: grouping all
the forms of an inflected word under their lemma, and coding the morphological
categories of each form and lemma), containing 150,000 forms
¹³(http://www.alice.id.tue.nl/references/busa-1980.pdf#page=4). Father
Busa has been dubbed the father of humanities computing and recently also of
digital humanities.

The subject-index has a crucial role in the printed book. It is the only means
for search the book offers. Subjects composing an index can be selected
according to a classification scheme (specific to a field of an inquiry), for
example as elements of a certain degree (with a given minimum number of
subclasses).

Its role seemingly vanishes in the digital text. But it can be easily
transformed. Besides serving as a table of pre-searched results, the
subject-index also gives a distinct idea of the content of the book. Two
patterns give us a clue: the numbers of occurrences of selected words give
subjects weights, while words that seem specific to the book outweigh others
even if they don't occur very often. A selection of these words then serves as
a descriptor of the whole text, and can be thought of as a specific kind of
'tags'.

This process was formalized in a mathematical function in the 1970s, thanks to
a measure proposed by Karen Spärck Jones as "term specificity" and later known
as 'inverse document frequency' (IDF). It is measured as the logarithm of the
ratio of the total number of texts in the corpus to the number of texts in
which the word appears at least once. When multiplied by the frequency of the
word _in_ the text (divided by the maximum frequency of any word in the text),
we get _term frequency-inverse document frequency_ (tf-idf). In this way we
can get an automated list of subjects which are particular to the text when
compared to a group of texts.
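
A small sketch of the weighting follows. It assumes the common logarithmic form of IDF, ln(N / n), where N is the number of texts and n the number of texts containing the word, and the term frequency normalised by the most frequent word, as described above. The function name `tfidf`, the sample sentences and the restriction to the letters a-z are choices made for this illustration.

```shell
# tfidf TARGET FILE...: print "weight word" for each word of TARGET,
# heaviest first. FILE... is the whole corpus and should include TARGET.
#   tf  = occurrences in TARGET / occurrences of the most frequent word
#   idf = ln(number of texts / number of texts containing the word)
# Only the letters a-z count as word characters (an assumption).
tfidf() {
    target=$1; shift
    awk -v target="$target" '
        BEGIN { ntexts = ARGC - 1 }
        {
            line = tolower($0)
            gsub(/[^a-z]+/, " ", line)
            n = split(line, w, " ")
            for (i = 1; i <= n; i++) {
                if (!((FILENAME, w[i]) in seen)) {   # first time in this text
                    seen[FILENAME, w[i]] = 1
                    df[w[i]]++
                }
                if (FILENAME == target) {
                    tf[w[i]]++
                    if (tf[w[i]] > max) max = tf[w[i]]
                }
            }
        }
        END {
            for (word in tf)
                printf "%.4f %s\n", tf[word] / max * log(ntexts / df[word]), word
        }' "$@" | LC_ALL=C sort -k1,1nr -k2,2
}

dir=$(mktemp -d)
printf 'the index is an index of the text\n' > "$dir/a.txt"
printf 'the query is a question\n' > "$dir/b.txt"
printf 'the sequence of the text\n' > "$dir/c.txt"
tfidf "$dir/a.txt" "$dir/a.txt" "$dir/b.txt" "$dir/c.txt"
rm -r "$dir"
```

In this toy corpus "index" comes out heaviest for the first text, while "the", which occurs in every text, receives a weight of zero however often it is repeated.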

We came to learn it through the practice of searching the web. It is a
mechanism not dissimilar to the thought process involved in retrieving
particular information online. And search engines have it built into their
indexing algorithms as well.

There is a paper that proposes attaching words generated by tf-idf to
hyperlinks that refer to websites ¹⁴(http://bscit.berkeley.edu/cgi-
bin/pl_dochome?query_src=&format=html&collection=Wilensky_papers&id=3&show_doc=yes).
This would enable finding the referred content even after the link is dead.
Hyperlinks in references in the paper use this feature and it can be easily
tested: ¹⁵(http://www.cs.berkeley.edu/~phelps/papers/dissertation-
abstract.html?lexical-
signature=notemarks+multivalent+semantically+franca+stylized).
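
The shape of such a link can be sketched as follows. The parameter name `lexical-signature` and the five words are taken from the URL above; the function `robust_link` and the address `http://example.org/paper.html` are placeholders invented for the example, and the sketch assumes the words are plain letters that need no percent-encoding.

```shell
# robust_link URL WORD...: append the given words to URL as a
# "lexical-signature" query parameter, joined by "+".
robust_link() {
    url=$1; shift
    signature=$(printf '%s+' "$@")      # "w1+w2+...+wn+"
    signature=${signature%+}            # drop the trailing "+"
    case $url in
        *\?*) sep='&' ;;                # URL already has a query string
        *)    sep='?' ;;
    esac
    printf '%s%slexical-signature=%s\n' "$url" "$sep" "$signature"
}

robust_link "http://example.org/paper.html" \
    notemarks multivalent semantically franca stylized
```

If the address stops resolving, the words carried in the signature can still be handed to a search engine to look for the same text elsewhere.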

There is another measure, cosine similarity, which takes tf-idf further and
can be applied to clustering texts according to similarities in their
specificity. This might be interesting as a feature for digital libraries, or
even as a way of organising a library bottom-up into novel categories from
which new discourses could emerge. Or as an aid for researchers to sort through
texts, or even for editors as an aid in producing interesting anthologies.
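
As a sketch of the measure itself: each text is treated as a vector of word weights, and the similarity is the dot product of two such vectors divided by the product of their lengths. The function name `cosine`, the two-column "weight word" file format and the sample weights are assumptions made for this example.

```shell
# cosine FILE1 FILE2: cosine similarity of two texts, each given as a
# file of "weight word" lines (for example tf-idf weights).
cosine() {
    awk '
        FNR == NR { a[$2] = $1; na += $1 * $1; next }       # first file
        { nb += $1 * $1; if ($2 in a) dot += a[$2] * $1 }   # second file
        END {
            if (na == 0 || nb == 0) { print "0.0000"; exit }
            printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
        }' "$1" "$2"
}

dir=$(mktemp -d)
printf '1 index\n2 text\n' > "$dir/x.txt"
printf '2 index\n4 text\n' > "$dir/y.txt"
printf '3 query\n' > "$dir/z.txt"
cosine "$dir/x.txt" "$dir/y.txt"    # same words in the same proportions
cosine "$dir/x.txt" "$dir/z.txt"    # no words in common
rm -r "$dir"
```

Texts whose specific words overlap score close to 1 and texts with nothing in common score 0, so grouping texts by this score is one possible basis for the clustering mentioned above.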

## Final remarks

1

New disciplines emerge all the time - most recently, for example, cultural
techniques, software studies, or media archaeology. It takes years, even
decades, before they gain dedicated shelves in libraries or a category in
interlibrary digital repositories. Not that it matters that much. They are not
only sites of academic opportunities but, firstly, frameworks of new
perspectives of looking at the world, new domains of knowledge. From the
perspective of the researcher, partaking in a discipline involves negotiating
its vocabulary, classifications, corpus, reference field, and specific
terms[subjects]. Creating new fields involves all that, and more. Even when
one goes against all disciplines.

2

Google can still surprise us.

3

Knowledge has been in the making for millennia. There have been (abstract)
mechanisms established that govern its conditions. We now possess specialized
corpora of texts which are interesting enough to serve as a ground to discuss
and experiment with dictionaries, classifications, indexes, and tools for
reference retrieval. These all belong to the poetic devices of
knowledge-making.

4

Command-line example of tf-idf and concordance in 3 steps.

* 1\. Process the files text.1-5.txt and produce freq.1-5.txt with lists of (nonlemmatized) words (in respective texts), ordered by frequency:

> for i in {1..5}; do tr 'A-Z' 'a-z' < text.$i.txt | tr -cs 'a-z' '\n' |
sed '/^$/d' | sort | uniq -c | sort -k 1nr > temp.txt;
max=$(awk 'NR==1 {print $1}' temp.txt);
awk -v maxx="$max" '{printf "%-7.7f %s\n", 0.5+($1/(maxx*2)), $2}' temp.txt > freq.$i.txt;
done && rm temp.txt

* 2\. Process the files freq.1-5.txt and produce tfidf.1-5.txt containing a list of words (out of 500 most frequent in respective lists), ordered by weight (specificity for each text):

> for j in {1..5}; do rm -f freq.$j.txt.temp;
for i in {1..500}; do
word=$(awk -v line="$i" 'NR==line {print $2}' freq.$j.txt);
[ -n "$word" ] || break;
tf=$(awk -v line="$i" 'NR==line {print $1}' freq.$j.txt);
count=$(grep -lw "$word" freq.?.txt | wc -l);
idf=$(echo "1+l(5/$count)" | bc -l);
tfidf=$(echo "$tf*$idf" | bc -l);
echo "$word $tfidf" >> freq.$j.txt.temp;
done;
sort -k 2nr < freq.$j.txt.temp > tfidf.$j.txt;
done

* 3\. Process the files tfidf.1-5.txt and their source texts, text.1-5.txt, and produce occ.txt with a concordance of the top 3 words from each of them:

> rm -f occ.txt; for j in {1..5}; do echo "$j" >> occ.txt;
ptx -f -w 150 text.$j.txt > occ.$j.txt;
for i in {1..3}; do
word=$(awk -v line="$i" 'NR==line {print $1}' tfidf.$j.txt);
grep -Ei "[[:alpha:]] {2,}$word" occ.$j.txt >> occ.txt;
done; done

Dušan Barok

_Written 23 October - 1 November 2014 in Bratislava and Stuttgart._
