Medak, Sekulic & Mertens
Book Scanning and Post-Processing Manual Based on Public Library Overhead Scanner v1.2
2014


PUBLIC LIBRARY
&
MULTIMEDIA INSTITUTE

BOOK SCANNING & POST-PROCESSING MANUAL
BASED ON PUBLIC LIBRARY OVERHEAD SCANNER

Written by:
Tomislav Medak
Dubravka Sekulić
With help of:
An Mertens

Creative Commons Attribution - Share-Alike 3.0 Germany

TABLE OF CONTENTS

Introduction
I. Photographing a printed book
II. Getting the image files ready for post-processing
III. Transformation of source images into .tiffs
IV. Optical character recognition
V. Creating a finalized e-book file
VI. Cataloging and sharing the e-book
Quick workflow reference for scanning and post-processing
References

INTRODUCTION:
BOOK SCANNING - FROM PAPER BOOK TO E-BOOK
Initial considerations when deciding on a scanning setup
Book scanning tends to be a fragile and demanding process. Many things can go wrong or produce
results of varying quality from book to book or page to page, and resolving the issues that occur
requires experience or technical skill. Cameras can fail to trigger, components can fail to
communicate, files can get corrupted in transfer, storage cards may not get purged, focus can fail
to lock, and lighting conditions can change. There is a trade-off between automation, which is
prone to instability, and robustness, which is prone to becoming time-consuming.
Your initial choice of book scanning setup will have to take this trade-off into consideration. If
your scanning community is confined to your hacklab, you won't be risking much if technological
sophistication and integration fail to function smoothly. But if you're aiming at a broad community
of users, with varying levels of technological skill and patience, you want to create as much
time-saving automation as possible while keeping maximum stability. Furthermore, if the time that
individual members of your scanning community can contribute is limited, you might also want to
divide some of the tasks between users according to their skill levels.
This manual breaks down the process of digitization into a general description of the steps in the
workflow leading from the printed book to a digital e-book. In a concrete situation, each step can
be addressed in various ways, depending on the scanning equipment, software, hacking skills and
user skill level available to your book scanning project. Several of those steps can be handled by
a single piece of equipment or software, or you might need to use a number of them - your mileage
will vary. Therefore, the manual will try to indicate the design choices you have in planning your
workflow and should help you decide which design is best for your situation.
Introducing book scanner designs
Book scanning starts with capturing digital image files on the scanning equipment. There
are three principal types of book scanner design:
- flatbed scanner
- single camera overhead scanner
- dual camera overhead scanner
Conventional flatbed scanners are widely available. However, they require the book to be
spread wide open and pressed down onto the platen in order to break the resistance of the book
binding and sufficiently expose the inner margin of the text. This makes flatbed scanning the most
destructive approach for the book, as well as an imprecise and slow one.
Therefore, book scanning projects across the globe have taken to custom-designing improvised
setups or scanner rigs that are less destructive and better suited to fast turning and capturing of
pages. Designs abound. Most include:
- one or two digital photo cameras of lower or higher quality to capture the pages,
- a transparent V-shaped glass or Plexiglas platen to press the open book against a V-shaped
cradle, and
- a light source.

The go-to web resource to help you make an informed decision is the DIY book scanning
community at http://diybookscanner.org. A good place to start is their intro
(http://wiki.diybookscanner.org/) and scanner build list
(http://wiki.diybookscanner.org/scanner-build-list).
Book scanners with a single camera are substantially cheaper, but they come with the added
difficulty of de-warping the page images, which are distorted by the angle at which the pages are
photographed. This distortion can sometimes be difficult to correct in post-processing. Hence, in
this introductory chapter we'll focus on two-camera designs where the camera lens stands relatively
parallel to the page. However, with a bit of adaptation these instructions can be used to work with
any other setup.
The Public Library scanner
The focus of this manual is the scanner built for the Public Library project, designed by Voja
Antonić (see Illustration 1). The Public Library scanner was built with immediate use by a wide
community of users in mind. Hence, the principal consideration in designing the Public Library
scanner was less sophistication and more robustness, ease of use and a distributed process of
editing.
The board designs can be found here:
http://www.memoryoftheworld.org/blog/2012/10/28/our-beloved-bookscanner. The current iterations
use two Canon 1100 D cameras with the kit lens Canon EF-S 18-55mm 1:3.5-5.6 IS. Cameras are
auto-charging.

Illustration 1: Public Library Scanner
The scanner operates by automatically lowering the Plexiglas platen, illuminating the page and then
triggering the camera shutters. The turning of pages and the adjustments of the V-shaped cradle
holding the book are manual.
The scanner is operated by a two-button controller (see Illustration 2). The upper, smaller button
breaks the capture process into two steps: the first click lowers the platen, increases the light
level and allows you to adjust the book or the cradle; the second click triggers the cameras and
lifts the platen.
The lower button has two modes. A quick click will execute the whole capture process in one go.
But if you hold it pressed longer, it will lower the platen, allowing you to adjust the book and
the cradle, and lift it without triggering the cameras when you press again.

Illustration 2: A two-button controller

More on this manual: steps in the book scanning process
The book scanning process in general can be broken down into six steps, each of which will be dealt
with in a separate chapter of this manual:
I. Photographing a printed book
II. Getting the image files ready for post-processing
III. Transformation of source images into .tiffs
IV. Optical character recognition
V. Creating a finalized e-book file
VI. Cataloging and sharing the e-book
A step by step manual for Public Library scanner
This manual is primarily meant to provide a detailed description and step-by-step instructions for an
actual book scanning setup -- based on Voja Antonić's scanner design described above. This is a
two-camera overhead scanner, currently equipped with two Canon 1100 D cameras with the EF-S
18-55mm 1:3.5-5.6 IS kit lens. It can scan books of up to A4 page size.
The post-processing in this setup is based on a semi-automated transfer of files to a GNU/Linux
personal computer and on the use of free software for image editing, optical character recognition
and finalization of an e-book file. It was initially developed for the HAIP festival in Ljubljana in
2011 and perfected later at MaMa in Zagreb and Leuphana University in Lüneburg.
The Public Library scanner is characterized by a somewhat less automated yet distributed scanning
process compared with the highly automated and sophisticated scanner hacks developed at various
hacklabs. A brief overview of one such scanner, developed at the Hacker Space Bruxelles, is also
included in this manual.
The Public Library scanning process proceeds in the following discrete steps:

1. creating digital images of the pages of a book,
2. manual transfer of image files to the computer for post-processing,
3. automated renaming of files, ordering of even and odd pages, rotation of images and upload to a
cloud storage,
4. manual transformation of source images into .tiff files in ScanTailor,
5. manual optical character recognition and creation of PDF files in gscan2pdf.
The detailed description of the Public Library scanning process follows below.
The Bruxelles hacklab scanning process
For purposes of comparison, here we'll briefly reference the scanner built by the Bruxelles hacklab
(http://hackerspace.be/ScanBot). It is a dual camera design too. With some differences in hardware functionality
(Bruxelles scanner has automatic turning of pages, whereas Public Library scanner has manual turning of pages), the
fundamental difference between the two is in the post-processing - the level of automation in the transfer of images
from the cameras and their transformation into PDF or DjVu e-book format.
The Bruxelles scanning process is different in so far as the cameras are operated by a computer and the images are
automatically transferred, ordered and made ready for further post-processing. The scanner is home-brew, but the
process is for advanced DIY'ers. If you want to know more on the design of the scanner, contact Michael Korntheuer at
contact@hackerspace.be.
The scanning and post-processing is automated by a single Python script that does all the work:
http://git.constantvzw.org/?p=algolit.git;a=tree;f=scanbot_brussel;h=81facf5cb106a8e4c2a76c048694a3043b158d62;hb=HEAD
The scanner uses two Canon point and shoot cameras. Both cameras are connected to the PC with USB. They both run
PTP/CHDK (Canon Hack Development Kit). The scanning sequence is the following:
1. Script sends CHDK command line instructions to the cameras
2. Script sorts out the incoming files. This part is tricky. There is no reliable way to make a distinction between the left
and right camera, only between which camera was recognized by USB first. So the protocol is to always power up the
left camera first. See the instructions with the source code.
3. Collect images in a PDF file
4. Run script to OCR a .PDF file to a plain .TXT file:
http://git.constantvzw.org/?p=algolit.git;a=blob;f=scanbot_brussel/ocr_pdf.sh;h=2c1f24f9afcce03520304215951c65f58c0b880c;hb=HEAD

I. PHOTOGRAPHING A PRINTED BOOK
Technologically the most demanding part of the scanning process is creating digital images of the
pages of a printed book. It's a process that differs greatly from scanner design to scanner design
and from camera to camera. Therefore, here we will focus strictly on the process with the Public
Library scanner.
Operating the Public Library scanner
0. Before you start:
Better and more consistent photographs lead to a more optimized and faster post-processing and a
higher quality of the resulting digital e-book. In order to guarantee the quality of images, before you
start it is necessary to set up the cameras properly and prepare the printed book for scanning.
a) Loosening the book
Depending on the type and quality of binding, some books tend to be too resistant to opening fully
to reveal the inner margin under the pressure of the scanner platen. It is thus necessary to “break in”
the book before starting in order to loosen the binding. The best way is to open it as wide as
possible in multiple places in the book. This can be done against the table edge if the book is more
rigid than usual. (Warning – “breaking in” might create irreversible creasing of the spine or lead to
some pages breaking loose.)
b) Switch on the scanner
You start the scanner by pressing the main switch or plugging the power cable into the scanner.
This will also turn on the overhead LED lights.

c) Setting up the cameras
Place the cameras onto tripods. You need to move the lever on the tripod's head to allow the tripod
plate screwed to the bottom of the camera to slide into its place. Secure the lock by turning the lever
all the way back.
If automatic chargers for the cameras are provided, open the battery lid on the bottom of the
camera and plug in the automatic charger. Close the lid.
Switch on each camera using the lever on the top right side of the camera's body and set it to
aperture priority (Av) mode on the mode dial above the lever (see Illustration 3). Use the main dial
just above the shutter button on the front side of the camera to set the aperture value to F8.0.

Illustration 3: Mode and main dial, focus mode switch, zoom
and focus ring
On the lens, turn the focus mode switch to manual (MF) and turn the large zoom ring to set the
focal length exactly midway between 24 and 35 mm (see Illustration 3). Try to set both cameras the
same.
To focus each camera, open a book on the cradle, lower the platen by holding the big button on the
controller, and turn on the live view on the camera LCD by pressing the live view switch (see
Illustration 4). Now press the magnification button twice and use the focus ring on the front of the
lens to get a clear image view.

Illustration 4: Live view switch and magnification button

d) Connecting the cameras
Now connect the cameras to the remote shutter trigger cables that can be found lying on each side
of the scanner. They need to be plugged into a small round port hidden behind a protective rubber
cover on the left side of the cameras.
e) Placing the book into the cradle and double-checking the cameras
Open the book in the middle and place it on the cradle. Hold down the large button on the
controller to lower the Plexiglas platen without triggering the cameras. Move the cradle so that
the tip of the platen fits into the middle of the book.
Turn on the live view on the cameras' LCD to see if the pages fit into the image and if the
cameras are positioned parallel to the page.
f) Double-check storage cards and batteries
It is important that the storage cards in both cameras are empty before you start scanning, so as
not to mess up the page sequence when merging photos from the left and the right camera in
post-processing. To double-check, press the play button on each camera. If there are photos left
from the previous scan, erase them by pressing the menu button, selecting the fifth menu from
the left and then selecting 'Erase Images' -> 'All images on card' -> 'OK'.
If no automatic chargers are provided, double-check on the information screen that batteries are
charged. They should be fully charged before starting with the scanning of a new book.

g) Turn off the light in the room
Lighting conditions during scanning should be as constant as possible. To reduce glare and achieve
maximum quality, remove any source of light that might reflect off the Plexiglas platen. Preferably
turn off the light in the room or isolate the scanner with the black cloth provided.

1. Photographing a book
Now you are ready to start scanning. Place the closed book in the cradle and lower the platen by
holding down the large button on the controller (see Illustration 2). Adjust the position of the
cradle and lift the platen by pressing the large button again.
To scan, you can either use the small button on the controller to lower the platen, adjust the
book, and then press it again to trigger the cameras and lift the platen, or you can make a short
press on the large button to do it all in one go.
ATTENTION: When the cameras are triggered, the shutter sound has to be heard coming
from both cameras. If one camera is not working, it's best to reconnect both cameras (see
Section 0), make sure the batteries are charged or adapters are connected, erase all images
and restart.
A mistake made in the photographing requires a lot of work in the post-processing, so it's
much quicker to repeat the photographing process.
If you make a mistake while flipping pages, or any other mistake, go back and scan from the page
you missed or incorrectly scanned. Note down the page where the error occurred; the redundant
images will be removed in post-processing.
ADVICE: The scanner has a digital counter. By turning the dial forward and backward, you
can set it to tell you what page you should be scanning next. This should help you avoid
missing a page due to a distraction.
While scanning, move the cradle a bit to the left from time to time, making sure that the tip of
the V-shaped platen is aligned with the center of the book and the inner margin is exposed enough.

II. GETTING THE IMAGE FILES READY FOR POST-PROCESSING
Once the book pages have been photographed, the images have to be transferred to the computer and
prepared for post-processing. With two-camera scanners, the capturing process will result in two
separate sets of images -- odd and even pages -- coming from the left and right cameras
respectively. You will need to rename and reorder them accordingly, rotate them into a vertical
position and collate them into a single sequence of files.
a) Transferring image files
For the transfer of files, your principal design choices are either to remove the memory cards
from the cameras and copy the files to the computer via a card reader, or to transfer them via a
USB cable. The latter process can be automated by operating your cameras remotely from a computer;
however, this can be done only with certain Canon cameras (http://bit.ly/16xhJ6b) that can be
hacked to run the open Canon Hack Development Kit firmware (http://chdk.wikia.com).
After transferring the files, erase all the image files on the camera memory card so that they do
not end up messing up the scan of the next book.
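The card-reader variant of this transfer can be sketched in a few lines of Python. This is only an illustration, not the script used with the Public Library scanner: the function name copy_card, the mount point and the DCIM sub-folder layout are assumptions. It copies rather than moves the files, so the card can be erased after the copies have been checked.

```python
import shutil
from pathlib import Path


def copy_card(card_dir, dest_dir, suffix=".jpg"):
    """Copy every image found on a mounted memory card into dest_dir.

    Returns the copied file names in sorted order. Nothing is deleted
    from the card; erase it only after checking the copies.
    """
    card = Path(card_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # rglob also walks sub-folders such as DCIM/100CANON (assumed layout).
    for src in sorted(card.rglob("*")):
        if src.is_file() and src.suffix.lower() == suffix:
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return copied
```

Both cameras may produce identical file names, so it is safer to copy each card into its own folder (for example one called right and one called left) and merge them only after renaming.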
b) Renaming image files
As the left and right cameras are typically operated in sync, the photographing process results in
two separate sets of images, with even and odd pages respectively, that have completely different
file names and potentially the same time stamps. So before you collate the page images in the order
in which they appear in the book, you want to rename the files so that the first image comes from
the right camera, the second from the left camera, the third again from the right camera and so on.
You probably want to do a batch renaming, where your right camera files start with n and are offset
by an increment of 2 (e.g. page_0000.jpg, page_0002.jpg,...) and your left camera files start with
n+1 and are also offset by an increment of 2 (e.g. page_0001.jpg, page_0003.jpg,...).
Batch renaming can be completed either from your file manager, in command line or with a number
of GUI applications (e.g. GPrename, rename, cuteRenamer on GNU/Linux).
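As an illustration of the interleaved numbering described above, here is a minimal Python sketch. The function names and the two-folder layout are hypothetical, and it assumes that sorting the camera file names alphabetically reproduces the shooting order.

```python
import shutil
from pathlib import Path


def interleaved_names(right_files, left_files, start=0):
    """Pair each camera file name with its interleaved page name.

    Right-camera files get start, start+2, ...; left-camera files get
    start+1, start+3, ... Two separate lists are returned because both
    cameras may use the same original file names.
    """
    right = [(name, f"page_{start + 2 * i:04d}.jpg")
             for i, name in enumerate(sorted(right_files))]
    left = [(name, f"page_{start + 1 + 2 * i:04d}.jpg")
            for i, name in enumerate(sorted(left_files))]
    return right, left


def move_renamed(src_dir, dest_dir, pairs):
    """Move each (old, new) pair from src_dir into dest_dir under its new name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for old, new in pairs:
        shutil.move(str(Path(src_dir) / old), str(dest / new))
```

Calling interleaved_names with the file lists of the two cameras and then move_renamed once per camera into the same destination folder produces page_0000.jpg, page_0001.jpg, page_0002.jpg and so on in book order.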
c) Rotating image files
Before you collate the renamed files, you might want to rotate them. This is a step that can be done
also later in the post-processing (see below), but if you are automating or scripting your steps this is
a practical place to do it. The images leaving your cameras will be positioned horizontally. In order
to position them vertically, the images from the camera on the right will have to be rotated by 90
degrees counter-clockwise, the images from the camera on the left will have to be rotated by 90
degrees clockwise.
Batch rotating can be completed in a number of photo-processing tools, in the command line or in
dedicated applications (e.g. Fstop, ImageMagick, Nautilus Image Converter on GNU/Linux).
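If you script this step, one option is to call ImageMagick from Python. The sketch below only builds the command for each file; the helper names are hypothetical. It assumes the page_xxxx.jpg naming from the previous step with an even starting number, so that even-numbered files come from the right camera. ImageMagick's -rotate option takes degrees clockwise, so a negative value turns the image counter-clockwise.

```python
import subprocess
from pathlib import Path


def camera_for(name):
    """Guess the camera from the page number: even = right, odd = left.

    Only valid if the renaming started from an even number (assumption).
    """
    number = int(Path(name).stem.split("_")[1])
    return "right" if number % 2 == 0 else "left"


def rotate_command(path, camera):
    """Build an ImageMagick command that rotates one image in place."""
    # Right camera: 90 degrees counter-clockwise; left camera: 90 clockwise.
    degrees = {"right": "-90", "left": "90"}[camera]
    return ["mogrify", "-rotate", degrees, str(path)]


def rotate_all(folder):
    """Rotate every page_*.jpg in folder (requires ImageMagick installed)."""
    for path in sorted(Path(folder).glob("page_*.jpg")):
        subprocess.run(rotate_command(path, camera_for(path.name)), check=True)
```

Because mogrify overwrites the file it is given, it is worth trying this on a copy of a few images first.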
d) Collating images into a single batch
Once you're done with the renaming and rotating of the files, you want to collate them into the same
folder for easier manipulation later.
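Once the files sit in one folder, a quick check for gaps in the numbering can catch a card that was copied incompletely. The following Python sketch is an illustration only; the function name is hypothetical and it assumes the page_xxxx.jpg naming convention suggested above.

```python
import re

# Matches names such as page_0042.jpg (naming convention assumed from above).
PAGE_RE = re.compile(r"^page_(\d{4})\.jpg$")


def missing_pages(names):
    """Return the page numbers absent between the lowest and highest found."""
    numbers = sorted(int(m.group(1)) for m in map(PAGE_RE.match, names) if m)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
```

For a folder on disk, the names can be collected with os.listdir(folder). An empty result means the sequence is contiguous; it does not prove that the pages are in the right order.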

Getting the image files ready for post-processing on the Public Library scanner
In the case of Public Library scanner, a custom C++ script was written by Mislav Stublić to
facilitate the transfer, renaming, rotating and collating of the images from the two cameras.
The script prompts the user to place the memory card from the right camera into the card reader
first, gives a preview of the first and last four images and provides an entry field to create a
sub-folder in a local cloud storage folder (path: /home/user/Copy).
It transfers, renames and rotates the files, deletes them from the card and prompts the user to
replace the card with the one from the left camera in order to transfer the files from there and
place them in the same folder. The script was created for GNU/Linux systems and it can be
downloaded, together with its source code, from: https://copy.com/nLSzflBnjoEB
If your cameras are not Canon, you can edit line 387 of the source file to match the naming
convention of your cameras, and recompile by running the following command in your
terminal: "gcc scanflow.c -o scanflow -ludev `pkg-config --cflags --libs gtk+-2.0`"
In the case of the Hacker Space Bruxelles scanner, this is handled by the same script that operates
the cameras, which can be downloaded from:
http://git.constantvzw.org/?p=algolit.git;a=tree;f=scanbot_brussel;h=81facf5cb106a8e4c2a76c048694a3043b158d62;hb=HEAD

III. TRANSFORMATION OF SOURCE IMAGES INTO .TIFFS
Images transferred from the cameras are high-definition full-color images. You want your cameras
to shoot at the largest possible .jpg resolution so that the resulting files have at least 300 dpi
(A4 at 300 dpi requires a 9.5 megapixel image). In post-processing the size of the image files
needs to be reduced radically, so that several hundred images can be merged into an e-book file of
a tolerable size.
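The resolution requirement can be checked with a little arithmetic, sketched here in Python. The function names are hypothetical and the 3:2 frame shape is an assumption about the camera sensor. An A4 page (210 x 297 mm) at 300 dpi comes to roughly 8.7 megapixels for the page itself and roughly 9.2 megapixels for a 3:2 frame that fully contains it, so the figure quoted above leaves a small margin around the page.

```python
MM_PER_INCH = 25.4


def pixels_needed(width_mm, height_mm, dpi):
    """Pixel dimensions needed to cover a page at the given resolution."""
    return width_mm / MM_PER_INCH * dpi, height_mm / MM_PER_INCH * dpi


def page_megapixels(width_mm, height_mm, dpi):
    """Megapixels covering exactly the page, with no margin."""
    w, h = pixels_needed(width_mm, height_mm, dpi)
    return w * h / 1e6


def frame_megapixels(width_mm, height_mm, dpi, aspect=1.5):
    """Megapixels of the smallest frame of the given aspect ratio
    (long side / short side, 3:2 assumed) that contains the whole page."""
    short, long_ = sorted(pixels_needed(width_mm, height_mm, dpi))
    frame_short = max(short, long_ / aspect)
    return frame_short * frame_short * aspect / 1e6
```

The same functions can be used to estimate what resolution a given camera delivers for smaller page formats.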
Hence, the first step in the post-processing is to crop the images from cameras only to the content of
the pages. The surroundings around the book that were captured in the photograph and the white
margins of the page will be cropped away, while the printed text will be transformed into black
letters on white background. The illustrations, however, will need to be preserved in their color or
grayscale form, and mixed with the black and white text. What were initially large .jpg files will
now become relatively small .tiff files that are ready for optical character recognition process
(OCR).
These tasks can be completed by a number of software applications. Our manual will focus on one
that can be used across all major operating systems -- ScanTailor. ScanTailor can be downloaded
from: http://scantailor.sourceforge.net/. A more detailed video tutorial of ScanTailor can be found
here: http://vimeo.com/12524529.
ScanTailor: from a photograph of a page to a graphic file ready for OCR
Once you have transferred all the photos from cameras to the computer, renamed and rotated them,
they are ready to be processed in the ScanTailor.
1) Importing photographs to ScanTailor
- start ScanTailor and open ‘new project’
- for ‘input directory’ choose the folder where you stored the transferred and renamed photo images
- you can leave ‘output directory’ as it is, it will place your resulting .tiffs in an 'out' folder inside
the folder where your .jpg images are
- select all files (if you followed the naming convention above, they will be named
‘page_xxxx.jpg’) in the folder where you stored the transferred photo images, and click 'OK'
- in the dialog box ‘Fix DPI’ click on All Pages, and for DPI choose preferably '600x600', click
'Apply', and then 'OK'
2) Editing pages
2.1 Rotating photos/pages
If you've rotated the photo images in the previous step using the scanflow script, skip this step.
- Rotate the first photo counter-clockwise, click Apply and for scope select ‘Every other page’
followed by 'OK'
- Rotate the following photo clockwise, applying the same procedure as in the previous step
2.2 Deleting redundant photographs/pages
- Remove redundant pages (photographs of the empty cradle at the beginning and the end of the
book scanning sequence; book cover pages if you don’t want them in the final scan; duplicate pages
etc.) by right-clicking on a thumbnail of that page in the preview column on the right side, selecting
‘Remove from project’ and confirming by clicking on ‘Remove’.

Note: If you accidentally remove the wrong page, you can re-insert it by right-clicking on the page
before or after the missing page in the sequence, selecting 'insert after/before' (depending on
which page you selected) and choosing the file from the list. Before you finish adding, it is
necessary to go through the procedure of fixing DPI and rotating again.
2.3 Adding missing pages
- If you notice that some pages are missing, you can recapture them with the camera and insert them
manually at this point using the procedure described above under 2.2.
3) Split pages and deskew
Steps ‘Split pages’ and ‘Deskew’ should work automatically. Run them by clicking the ‘Play’ button
under the 'Select content' function. This will do the three steps automatically: splitting of pages,
deskewing and selection of content. After this you can manually re-adjust splitting of pages and deskewing.
4) Selecting content
Step ‘Select content’ works automatically as well, but it is important to revise the resulting selection
manually page by page to make sure the entire content is selected on each page (including the
header and page number). Where necessary, use your pointer device to adjust the content selection.
If the inner margin is cut, go back to 'Split pages' view and manually adjust the selected split area. If
the page is skewed, go back to 'Deskew' and adjust the skew of the page. After this go back to
'Select content' and readjust the selection if necessary.
This is the step where you do visual control of each page. Make sure all pages are there and
selections are as equal in size as possible.
At the bottom of the thumbnail column there is a sort option that can automatically arrange pages
by the height and width of the selected content, making the process of manual selection easier.
Extreme differences in height should be avoided: try to make the selected areas as equal as
possible, particularly in height, across all pages. The exception is the cover and back pages,
where we advise selecting the full page.
5) Adjusting margins
For best results, select the full content of the cover and back pages in the previous step. Now go
to the 'Margins' step, set Top, Bottom, Left and Right in the Margins section to 0.0 and do
'Apply to...' → 'All pages'.
In Alignment section leave 'Match size with other pages' ticked, choose the central positioning of
the page and do 'Apply to...' → 'All pages'.
6) Outputting the .tiffs
Now go to the 'Output' step. Ignore the 'Output Resolution' section.
Next review two consecutive pages from the middle of the book to see if the scanned text is too
faint or too dark. If it is, use the 'Thinner – Thicker' slider to adjust. Do
'Apply to' → 'All pages'.
Next go to the cover page, select 'Color / Grayscale' under Mode and tick 'White Margins'.
Do the same for the back page.
If there are any pages with illustrations, you can choose the 'Mixed' mode for those pages and then
adjust the zones of the illustrations under the 'Picture Zones' tab.
Now you are ready to output the files. Just press 'Play' button under 'Output'. Once the computer is
finished processing the images, just do 'File' → 'Save as' and save the project.

IV. OPTICAL CHARACTER RECOGNITION
Before the edited-down graphic files are finalized as an e-book, we want to transform the image of
the text into actual text that can be searched, highlighted, copied and transformed. That
functionality is provided by Optical Character Recognition. This is a technically difficult task -
dependent on language, script, typeface and quality of print - and there aren't that many OCR tools
that are good at it. There is, however, a relatively good free software solution - Tesseract
(http://code.google.com/p/tesseract-ocr/) - that has solid performance, good language data and can
be trained for even better performance, although it has its problems. Proprietary solutions (e.g.
ABBYY FineReader) sometimes provide superior results.
Tesseract supports as input format primarily .tiff files. It produces a plain text file that can be, with
the help of other tools, embedded as a separate layer under the original graphic image of the text in
a PDF file.
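For those who prefer to script the plain-text step, Tesseract can be called from the command line as "tesseract <image> <output base> -l <language>", which writes <output base>.txt. The Python sketch below builds that command for each file; the helper names and the 'out' folder location are assumptions, and the language code must match a language pack installed on your system.

```python
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, lang="eng"):
    """Build the Tesseract call for one image.

    The output base is the image path without its extension, so
    page_0001.tif produces page_0001.txt next to it.
    """
    tiff = Path(tiff_path)
    return ["tesseract", str(tiff), str(tiff.with_suffix("")), "-l", lang]


def ocr_folder(folder, lang="eng"):
    """Run Tesseract over every .tif/.tiff file in folder (must be installed)."""
    for tiff in sorted(Path(folder).glob("*.tif*")):
        subprocess.run(tesseract_command(tiff, lang), check=True)
```

This produces one text file per page; embedding the text as a layer in a PDF still requires another tool.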
With the help of other tools, OCR can also be performed against other input files, such as
graphic-only PDF files. This produces inferior results, depending again on the quality of the
graphic files and the reproduction of text in them. One such tool is a Bash script to OCR a PDF
file that can be found
here: https://github.com/andrecastro0o/ocr/blob/master/ocr.sh
As mentioned in the 'before scanning' section, the quality of the original book will influence the
quality of the scan and thus the quality of the OCR. For a comparison, have a look here:
http://www.paramoulipist.be/?p=1303
Once you have your .txt file, there is still some work to be done. Because OCR has difficulty
interpreting particular elements of the layout and fonts, the .txt file comes with a lot of errors.
Recurrent problems are:
- combinations of specific letters in some fonts (it can mistake 'm' for 'n' or 'I' for 'i' etc.);
- headers become part of body text;
- footnotes are placed inside the body text;
- page numbers are not recognized as such.
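Some of this clean-up can be partly automated. As one example, the Python sketch below drops lines that consist of nothing but digits, which is how unrecognized page numbers often end up in the text. This is a crude heuristic under that assumption, with a hypothetical function name: it will also delete any legitimate line that contains only a number, so the result should still be proofread.

```python
import re

# A line holding only digits and optional surrounding whitespace.
PAGE_NUMBER_LINE = re.compile(r"^\s*\d+\s*$")


def drop_page_number_lines(text):
    """Remove lines that contain nothing but a number."""
    kept = [line for line in text.splitlines()
            if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)
```

Headers and footnotes are harder to detect reliably and usually need manual correction.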

V. CREATING A FINALIZED E-BOOK FILE
After the optical character recognition has been completed, the resulting text can be merged with
the images of pages and output into an e-book format. While increasingly the proper e-book file
formats such as ePub have been gaining ground, PDFs still remain popular because many people
tend to read on their computers, and they retain the original layout of the book on paper including
the absolute pagination needed for referencing in citations. DjVu is also an option, as an alternative
to PDF, used because of its purported superiority, but it is far less popular.
The export to PDF can again be done with a number of tools. In our case we'll complete the optical
character recognition and PDF export in gscan2pdf. Again, the proprietary ABBYY FineReader will
produce somewhat smaller PDFs.
If you prefer to use an e-book format that works better with e-book readers, you will have
to remove some of the elements that appear in the book - headers, footers, footnotes and pagination.
This can be done earlier, in the process of cropping down the original .jpg image files (see under
III), or later, by transforming the PDF files. The latter can be done in Calibre
(http://calibre-ebook.com) by converting the PDF into an ePub, which can be further tweaked to
better accommodate or remove the headers, footers, footnotes and pagination.
Optical character recognition and PDF export in Public Library workflow
Optical character recognition with the Tesseract engine can be performed on GNU/Linux by a
number of command line and GUI tools. Many of those tools also exist for other operating systems.
For the users of the Public Library workflow, we recommend using the gscan2pdf application both for
the optical character recognition and the PDF or DjVu export.
To do so, start gscan2pdf and open your .tiff files. To OCR them, go to 'Tools' and select 'OCR'. In
the dialog box select the Tesseract engine and your language, then click 'Start OCR'. Once the OCR
is finished, export the graphic files and the OCR text to PDF by selecting 'Save as'.
However, given that proprietary solutions sometimes produce better results, these tasks can also
be done, for instance, in ABBYY FineReader running on a Windows operating system inside
VirtualBox. The prerequisite is that you have copies of both Windows and ABBYY FineReader that you
can install in VirtualBox. Once you've got both installed, designate a shared folder in VirtualBox
and transfer the output files from your 'out' folder there. Then start VirtualBox, start the
Windows image and start ABBYY FineReader in Windows. Open the files and let ABBYY FineReader read
them. Once it's done, output the result into a PDF.

VI. CATALOGING AND SHARING THE E-BOOK
Your road from a book on paper to an e-book is complete. If you want to maintain your library you
can use Calibre, a free software tool for e-book library management. You can add the metadata to
your book using the existing catalogues or you can enter metadata manually.
Now you may want to distribute your book. If the work you've digitized is in the public domain
(https://en.wikipedia.org/wiki/Public_domain), you might consider contributing it to Project
Gutenberg
(http://www.gutenberg.org/wiki/Gutenberg:Volunteers'_FAQ#V.1._How_do_I_get_started_as_a_Project_Gutenberg_volunteer.3F),
Wikibooks (https://en.wikibooks.org/wiki/Help:Contributing) or Archive.org.
If the work is still under copyright, you might explore a number of different options for sharing.

QUICK WORKFLOW REFERENCE FOR SCANNING AND
POST-PROCESSING ON PUBLIC LIBRARY SCANNER
I. PHOTOGRAPHING A PRINTED BOOK
0. Before you start:
- loosen the book binding by opening it wide in several places
- switch on the scanner
- set up the cameras:
  - place the cameras on tripods and fit them tightly
  - plug the automatic chargers into the battery slots and close the battery lids
  - switch on the cameras
  - switch the lens to Manual Focus mode
  - switch the cameras to Av mode and set the aperture to 8.0
  - turn the zoom ring to set the focal length exactly midway between 24mm and 35mm
  - focus by turning on the live view, pressing the magnification button twice and adjusting the
    focus to get a clear view of the text
  - connect the cameras to the scanner by plugging the remote trigger cable into the port behind the
    protective rubber cover on the left side of each camera
- place the book into the cradle
- double-check storage cards and batteries:
  - press the play button on the back of each camera to check whether there are images on the
    camera - if there are, delete all the images from the camera menu
  - if using batteries, double-check that the batteries are fully charged
- switch off any light in the room that could reflect off the platen and cover the scanner with the
  black cloth
1. Photographing
- now you can start scanning, either by pressing the smaller button on the controller once to
lower the platen and adjust the book, and then pressing it again to increase the light intensity,
trigger the cameras and lift the platen; or by pressing the large button, which completes the entire
sequence in one go;
- ATTENTION: Shutter sound should be coming from both cameras - if one camera is not
working, it's best to reconnect both cameras, make sure the batteries are charged or adapters
are connected, erase all images and restart.
- ADVICE: The scanner has a digital counter. By turning the dial forward and backward,
you can set it to tell you what page you should be scanning next. This should help you to
avoid missing a page due to a distraction.

II. Getting the image files ready for post-processing
- after finishing with scanning a book, transfer the files to the post-processing computer
and purge the memory cards
- if transferring the files manually:
- create two separate folders,
- transfer the files from the folders with image files on cards, using a batch
renaming software rename the files from the right camera following the convention
page_0001.jpg, page_0003.jpg, page_0005.jpg... -- and the files from the left camera
following the convention page_0002.jpg, page_0004.jpg, page_0006.jpg...
- collate image files into a single folder
- before ejecting each card, delete all the photo files on the card
- if using the scanflow script:
- start the script on the computer
- place the card from the right camera into the card reader
- enter the name of the destination folder following the convention
"Name_Surname_Title_of_the_Book" and transfer the files
- repeat with the other card
- script will automatically transfer the files, rename, rotate, collate them in proper
order and delete them from the card
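
The manual renaming convention above can be sketched in a few lines of Python. This is not the scanflow script, only an illustration under stated assumptions: the folder arguments and the function name are hypothetical, the cameras are assumed to write .jpg files whose names sort in shooting order, and both cameras are assumed to have fired the same number of times. Files are copied rather than moved, so purging the cards remains a separate, deliberate step.

```python
import shutil
from pathlib import Path


def collate(right_dir: Path, left_dir: Path, out_dir: Path) -> list[str]:
    """Copy right-camera photos to odd and left-camera photos to even page numbers."""
    cameras = []
    for camera_dir in (right_dir, left_dir):
        # Sorting by name assumes the camera numbers its files in shooting order.
        photos = sorted(p for p in camera_dir.iterdir() if p.suffix.lower() == ".jpg")
        cameras.append(photos)
    if len(cameras[0]) != len(cameras[1]):
        # A mismatch suggests that one camera failed to trigger at some point.
        raise ValueError("cameras captured different numbers of photos")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for first_page, photos in zip((1, 2), cameras):
        for index, photo in enumerate(photos):
            # Right camera: 1, 3, 5, ...; left camera: 2, 4, 6, ...
            target = out_dir / f"page_{first_page + 2 * index:04d}.jpg"
            shutil.copy2(photo, target)
            written.append(target.name)
    return sorted(written)
```

The photos are not rotated here; that can still be done in ScanTailor as described in step 2.1.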
III. Transformation of source images into .tiffs
ScanTailor: from a photograph of page to a graphic file ready for OCR
1) Importing photographs to ScanTailor
- start ScanTailor and open ‘new project’
- for ‘input directory’ choose the folder where you stored the transferred photo images
- you can leave ‘output directory’ as it is, it will place your resulting .tiffs in an 'out' folder
inside the folder where your .jpg images are
- select all files (if you followed the naming convention above, they will be named
‘page_xxxx.jpg’) in the folder where you stored the transferred photo images, and click
'OK'
- in the dialog box ‘Fix DPI’ click on All Pages, and for DPI choose preferably '600x600',
click 'Apply', and then 'OK'
2) Editing pages
2.1 Rotating photos/pages
If you've rotated the photo images in the previous step using the scanflow script, skip this step.
- rotate the first photo counter-clockwise, click Apply and for scope select ‘Every other
page’ followed by 'OK'
- rotate the following photo clockwise, applying the same procedure as in the previous
step

2.2 Deleting redundant photographs/pages
- remove redundant pages (photographs of the empty cradle at the beginning and the end;
book cover pages if you don’t want them in the final scan; duplicate pages etc.) by
right-clicking on a thumbnail of that page in the preview column on the right, selecting ‘Remove
from project’ and confirming by clicking on ‘Remove’.
# If you accidentally remove the wrong page, you can re-insert it by right-clicking on a page
before/after the missing page in the sequence, selecting 'insert after/before' and choosing the file
from the list. Before you finish adding, it is necessary to go through the procedure of fixing DPI and
rotating again.
2.3 Adding missing pages
- If you notice that some pages are missing, you can recapture them with the camera and
insert them manually at this point using the procedure described above under 2.2.
3) Split pages and deskew
- Functions ‘Split Pages’ and ‘Deskew’ should work automatically. Run them by
clicking the ‘Play’ button under the 'Select content' step. This will do the three steps
automatically: splitting of pages, deskewing and selection of content. After this you can
manually re-adjust splitting of pages and de-skewing.

4) Selecting content and adjusting margins
- Step ‘Select content’ works automatically as well, but it is important to revise the
resulting selection manually page by page to make sure the entire content is selected on
each page (including the header and page number). Where necessary use your pointer device
to adjust the content selection.
- If the inner margin is cut, go back to 'Split pages' view and manually adjust the selected
split area. If the page is skewed, go back to 'Deskew' and adjust the skew of the page. After
this go back to 'Select content' and readjust the selection if necessary.
- This is the step where you do visual control of each page. Make sure all pages are there
and selections are as equal in size as possible.
- At the bottom of the thumbnail column there is a sort option that can automatically arrange
pages by the height and width of the selected content, making the process of manual
selection easier. Extreme differences in height should be avoided: try to make the
selected areas as equal as possible, particularly in height, across all pages. The
exception is the front and back cover pages, where we advise selecting the full page.

5) Adjusting margins
- Now go to the 'Margins' step, set Top, Bottom, Left and Right in the Margins section
to 0.0 and do 'Apply to...' → 'All pages'.
- In the Alignment section leave 'Match size with other pages' ticked, choose the central
positioning of the page and do 'Apply to...' → 'All pages'.
6) Outputting the .tiffs
- Now go to the 'Output' step.
- Review two consecutive pages from the middle of the book to see if the scanned text is
too faint or too dark. If the text seems too faint or too dark, use slider Thinner – Thicker to
adjust. Do 'Apply to' → 'All pages'.
- Next go to the cover page, select 'Color / Grayscale' under Mode and tick 'White
Margins'. Do the same for the back page.
- If there are any pages with illustrations, you can choose the 'Mixed' mode for those
pages and then under the 'Picture Zones' tab adjust the zones of the illustrations.
- To output the files press 'Play' button under 'Output'. Save the project.
IV. Optical character recognition & V. Creating a finalized e-book file
If using all free software:
1) open gscan2pdf (if not already installed on your machine, install gscan2pdf from the
repositories, Tesseract and data for your language from https://code.google.com/p/tesseract-ocr/)
- point gscan2pdf to open your .tiff files
- for Optical Character Recognition, select 'OCR' under the drop down menu 'Tools',
select the Tesseract engine and your language, start the process
- once OCR is finished, output to a PDF: go to 'File' and select 'Save', edit the
metadata, select the format and save
If using non-free software:
2) open Abbyy FineReader in VirtualBox (note: only Abbyy FineReader 10 installs and works - with some limitations - under GNU/Linux)
- transfer files in the 'out' folder to the folder shared with the VirtualBox
- point it to the readied .tiff files and it will complete the OCR
- save the file

REFERENCES
For more information on the book scanning process in general and making your own book scanner
please visit:
DIY Book Scanner: http://diybookscanner.org
Hacker Space Bruxelles scanner: http://hackerspace.be/ScanBot
Public Library scanner: http://www.memoryoftheworld.org/blog/2012/10/28/our-beloved-bookscanner/
Other scanner builds: http://wiki.diybookscanner.org/scanner-build-list
For more information on automation:
Konrad Voelkel's post-processing script (From Scan to PDF/A):
http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
Johannes Baiter's automation of scanning to PDF process: http://spreads.readthedocs.org
For more information on applications and tools:
Calibre e-book library management application: http://calibre-ebook.com/
ScanTailor: http://scantailor.sourceforge.net/
gscan2pdf: http://sourceforge.net/projects/gscan2pdf/
Canon Hack Development Kit firmware: http://chdk.wikia.com
Tesseract: http://code.google.com/p/tesseract-ocr/
Python script of Hacker Space Bruxelles scanner:
http://git.constantvzw.org/?p=algolit.git;a=tree;f=scanbot_brussel;h=81facf5cb106a8e4c2a76c048694a3043b158d62;hb=HEAD


Bodo
A Short History of the Russian Digital Shadow Libraries
2014


Draft Manuscript, 11/4/2014, DO NOT CITE!

A short history of the Russian digital shadow libraries
Balazs Bodo, Institute for Information Law, University of Amsterdam

“What I see as a consequence of the free educational book distribution: in decades generations of people
everywhere in the World will grow with the access to the best explained scientific texts of all times.
[…]The quality and accessibility of education to poors will drastically grow too. Frankly, I'm seeing this as
the only way to naturally improve mankind: by breeding people with all the information given to them at
any time.” – Anonymous admin of Aleph, explaining the raison d’être of the site

Abstract
RuNet, the Russian segment of the internet, is now the home of the most comprehensive scientific pirate
libraries on the net. These sites offer free access to hundreds of thousands of books and millions of
journal articles. In this contribution we try to understand the factors that led to the development of
these sites, and the sociocultural and legal conditions that enable them to operate under hostile legal
and political conditions. Through the reconstruction of the micro-histories of peer produced online text
collections that played a central role in the history of RuNet, we are able to link the formal and informal
support for these sites to the specific conditions developed under the Soviet and post-Soviet times.

(pirate) libraries on the net
The digitization and collection of texts was one of the very first activities enabled by computers. Project
Gutenberg, the first in a line of digital libraries, was established as early as 1971. By the early nineties, a
number of online electronic text archives had emerged, all hoping to finally realize the dream that has been
chased by humans ever since the first library: the collection of everything (Battles, 2004), the Memex
(Bush, 1945), the Mundaneum (Rieusset-Lemarié, 1997), the Library of Babel (Borges, 1998). It did not
take long to realize that the dream was still beyond reach: the information storage and retrieval
technology might have been ready, but copyright law, for the foreseeable future, was not. Copyright
protection and enforcement slowly became one of the most crucial issues around digital technologies.

Electronic copy available at: http://ssrn.com/abstract=2616631

And as that happened, the texts that had been archived without authorization were purged from the
budding digital collections. Those that survived complete deletion were moved into the dark, locked
down sections of digital libraries that sometimes still lurk behind the law-abiding public façades. Hopes
that a universal digital library could be built were lost in just a few short years as those who tried it (such as
Google or Hathitrust) got bogged down in endless court battles.
There are unauthorized text collections circulating on channels less susceptible to enforcement, such as
DVDs, torrents, or IRC channels. But the technical conditions of these distribution channels do not enable
the development of a library. Two of the most essential attributes of any proper library, the catalog
and the community, are hard to provide on such channels. The catalog doesn’t just organize the
knowledge stored in the collection; it is not just a tool of searching and browsing. It is a critical
component in the organization of the community of “librarians” who preserve and nourish the
collection. The catalog is what distinguishes an unstructured heap of computer files from a
well-maintained library, but it is the same catalog that makes shadow libraries, as unauthorized text
collections, an easy target of law enforcement. Those few digital online libraries that dare to provide
unauthorized access to texts in an organized manner, such as textz.org, a*.org, monoskop or
Gigapedia/library.nu, have all had their bad experiences with law enforcement and rights holder dismay.
Of these pirate libraries, Gigapedia (later called Library.nu) was the largest at the turn of the 2010s. At
its peak, it was several orders of magnitude bigger than its peers, offering access to nearly a million
English language documents. It was not just size that made Gigapedia unique. Unlike most sites, it
moved beyond its initial specialization in scientific texts to incorporate a wide range of academic
disciplines. Compared to its peers, it also had a highly developed central metadata database, which
contained bibliographic details on the collection and also, significantly, on gaps in the collection, which
underpinned a process of actively solicited contributions from users. With the ubiquitous
scanner/copiers, the production of book scans was as easy as copying them, thus the collection grew
rapidly.
Gigapedia’s massive catalog made the site popular, which in turn made it a target. In early 2012, a group
of 17 publishers was granted an injunction against the site (by then called Library.nu) and against iFile.it,
the hosting site that stored most of Library.nu’s content. While the record and movie companies
had collaborated on dozens of lawsuits over the past decade, the Library.nu injunction and lawsuit
were the first coordinated publisher actions against a major file-sharing site, and the first to involve
major university publishers in particular. Under the injunction, the Library.nu administrators closed the
site. The collection disappeared and the community around it dispersed (Liang, 2012).
Gigapedia’s collection was integrated into Aleph’s predominantly Russian language collection before the
shutdown, making Aleph the natural successor of Gigapedia/library.nu.

Libraries in the RuNet

The search soon zeroed in on a number of sites with strong hints to their Russian origins. Sites like Aleph,
[sc], [fi], [os] are open, completely free to use, and each offers access to a catalog comparable to the late
Gigapedia’s.
The similarity of these seemingly distinct services is no coincidence. These sites constitute a tightly knit
network, in which Aleph occupies the central position. Aleph, as its name suggests, is the source library:
it aims to be the seed of all scientific digital libraries on the net. Its mission is simple and straightforward. It
collects free-floating scientific texts and other collections from the Internet and consolidates them (both
content and metadata) into a single, open database. Though ordinary users can search the catalog and
retrieve the texts, its main focus is the distribution of the catalog and the collection to anyone who
wants to build services upon them. Aleph has regularly updated links that point to its own, neatly packed
source code, its database dump, and to the terabytes worth of collection. It is a knowledge infrastructure
that can be freely accessed, used and built upon by anyone. This radical openness enables a number of
other pirate libraries to offer Aleph’s catalogue along with books coming from other sources. By
mirroring Aleph they take over tasks that the administrators of Aleph are unprepared or unwilling to do.
Handling much of the actual download traffic they relieve Aleph from the unavoidable investment in
servers and bandwidth, which, in turn puts less pressure on Aleph to engage in commercial activities to
finance its operation. While Aleph stays in the background, the mirrors in the network compete for
attention, users and advertising revenue as their design, business model and technical sophistication are
fine-tuned to the profile of their intended target audience.
This strategy of creating an open infrastructure serves Aleph well. It ensures the widespread distribution
of books while it minimizes (legal) exposure. By relinquishing control, Aleph also ensures its own
long-term survival, as it is copied again and again. In fact, openness is the core element in the philosophy of
Aleph, which was summed up by one of its administrators as to:
“- collect valuable science/technology/math/medical/humanities academic literature. That is,
collect humanity's valuable knowledge in digital form. Avoid junky books. Ignore "bestsellers".
- build a community of people who share knowledge, improve quality of books, find good and
valuable books, and correct errors.
- share the files freely, spreading the knowledge altruistically, not trying to make money, not
charging money for knowledge. Here people paid money for many books that they considered
valuable and then shared here on [Aleph], for free. […]
This is the true spirit of the [Aleph] project.”

Reading, publishing, censorship and libraries in Soviet-Russia
“[T]he library of the Big Lubyanka was unique. In all probability it had been assembled out of confiscated
private libraries. The bibliophiles who had collected those books had already rendered up their souls to
God. But the main thing was that while State Security had been busy censoring and emasculating all the
libraries of the nation for decades, it forgot to dig in its own bosom. Here, in its very den, one could read
Zamyatin, Pilnyak, Panteleimon Romanov, and any volume at all of the complete works of Merezhkovsky.
(Some people wisecracked that they allowed us to read forbidden books because they already regarded
us as dead. But I myself think that the Lubyanka librarians hadn't the faintest concept of what they were
giving us—they were simply lazy and ignorant.)”
(Solzhenitsyn, 1974)
In order to properly understand the factors that shaped Russian pirate librarians’ and their wider
environments’ attitudes towards bottom-up, collaborative, copyright infringing open source digital
librarianship, we need to go back nearly a century and take a close look at the specific social and political
conditions of the Soviet times that shaped the contemporary Russian intelligentsia’s attitudes towards
knowledge.

The communist ideal of a reading nation
Russian culture has always had a reverence for the printed word, and the Soviet state, with its Leninist
program of mass education, further stressed the idea of the educated, reading public. As Stelmach (1993)
put it:
Reading almost transplanted religion as a sacred activity: in the secularized socialist state, where the
churches were closed, the free press stifled and schools and universities politicized, literature became the
unique source of moral truth for the population. Writers were considered teachers and prophets.
The Soviet Union was a reading culture: in the last days of the USSR, a quarter of the adult population
were considered active readers, and almost everyone else categorized as an occasional reader. Book
prices were low, alternative forms of entertainment were scarce, and people were poor, making reading
one of the most attractive leisure activities.
The communist approach towards intellectual property protection reflected the idea of the reading
nation. The Soviet Union inherited a lax and isolationist copyright system from tsarist Russia. Neither
the tsarist Russian state nor the Soviet state adhered to international copyright treaties, nor did they
enter into bilateral treaties. Tsarist Russia’s refusal to grant protection to foreign authors and
translations had primarily an economic rationale. The Soviet regime added a strong ideological claim:
granting exclusive ownership to authors was against the interests of the reading public, and “the cultural
development of the masses,” and only served the private interests of authors and heirs.
“If copyright had an economic function, that was only as a right of remuneration for his contribution to
the extension of the socialist art heritage. If copyright had a social role, this was not to protect the author
from the economically stronger exploiter, but was one of the instruments to get the author involved in
the great communist educational project.” (Elst, 2005, p 658)
The Soviet copyright system, even in its post-revolutionary phase, maintained two persistent features
that served as important instruments of knowledge dissemination. First, the statutorily granted
“freedom of translation” meant that translation was treated as an exception to copyright, which did not
require rights holder authorization. This measure dismantled a significant barrier to access in a
multicultural and multilingual empire. By the same token, the denial of protection to foreign authors and
rights holders eased the imports of foreign texts (after, of course, the appropriate censorship review).
Due to these instruments:
“[s]oon after its founding, the Soviet Union became as well the world's leading literary pirate, not only
publishing in translation the creations of its own citizens but also publishing large numbers of copies of
the works of Western authors both in translation and in the original language.” (Newcity, 1980, p 6.)
Looking simply at the aggregate numbers of published books, the USSR had an impressive publishing
industry on a scale appropriate to a reading nation. Between 1946 and 1970 more than 1 billion copies of
over 26 thousand different works were published, all by foreign authors (Newcity, 1978). In 1976 alone,
more than 1.7 billion copies of 84,304 books were printed. (Friedberg, Watanabe, & Nakamoto, 1984, fn
4.)
Of course these impressive numbers reflected neither a healthy public sphere, nor a well-functioning
print ecology. The book-based public sphere was both heavily censored and plagued by the peculiar
economic conditions of the Soviet, and later the post-Soviet era.

Censorship
The totalitarian Soviet state had many instruments to control the circulation of literary and scientific
works.1 Some texts never entered official circulation in the first place: “A particularly harsh
prepublication censorship [affected] foreign literature, primarily in the humanities and socioeconomic
disciplines. Books on politics, international relations, sociology, philosophy, cybernetics, semiotics,
linguistics, and so on were hardly ever published.” (Stelmakh, 2001, p 145.)
Many ‘problematic’ texts were only put into severely limited circulation. Books were released in small
print runs or as in-house publications, or they were only circulated among the trustworthy few. As the
resolution of the Central Committee of the Communist Party of June 4, 1959, stated: “Writings by
bourgeois authors in the fields of philosophy, history, economics, diplomacy, and law […] are to be
published in limited quantities after the excision from them of passages of no scholarly or practical
interest. They are to be supplied with extensive introductions and detailed annotations.” (quoted in
Friedberg et al., 1984)

1 We share Helen Freshwater’s (2003) approach that censorship is a more complex phenomenon than the state just
blocking the circulation of certain texts. Censorship manifested itself in more than one way and its dominant
modus operandi, institutions, extent, focus, reach, effectiveness showed extreme variations over time. This short
chapter however cannot go into the intricate details of the incredibly rich history of censorship in the Soviet Union.
Instead, through much simplification we try to demonstrate that censorship did not only affect literary works, but
extended deep into scholarly publishing, including natural science disciplines.

Truncation and mutilation of texts were also frequent. Literary works and texts from humanities and
social sciences were obvious subjects of censorship, but natural sciences and technical fields did not
escape:
“In our film studios we received an American technical journal, something like Cinema, Radio and
Television. I saw it on the chief engineer's desk and noticed that it had been reprinted in Moscow.
Everything undesirable, including advertisements, had been removed, and only those technical articles
with which the engineer could be trusted were retained. Everything else, even whole pages, was missing.
This was done by a photo copying process, but the finished product appeared to be printed.” (Dewhirst &
Farrell, 1973, p. 127)
Mass cultural genres were also subject to censorship and control. Women's fiction, melodrama, comics,
detective stories, and science fiction were completely missing or heavily underrepresented in the mass
market. Instead, “a small group of officially approved authors […] were published in massive editions
every year, [and] blocked readers' access to other literature. […]Soviet literature did not fit the formula
of mass culture and was simply bad literature, but it was issued in huge print-runs.” (Stelmakh, 2001, p.
150)
Libraries were also important instruments of censorship. When not destroyed altogether, censored
works ended up in the spetskhrans, limited access special collections established in libraries to contain
censored works. Besides obvious candidates such as anti-Soviet works and western ‘bourgeois’
publications, many scientific works from the fields of biology, nuclear physics, psychology, sociology,
cybernetics, and genetics ended up in these closed collections (Ryzhak, 2005). Access to the spetskhrans
was limited to those with special permits issued by their employers. “Only university educated readers
were enrolled and only those holding positions of at least junior scientific workers were allowed to read
the publications kept by the spetskhran” (Ryzhak, 2005). In the last years of the USSR, the spetskhran of
the Russian State Library—the largest of them with more than 1 million items in the collection—had 43
seats for its roughly 4500 authorized readers. Yearly circulation was around 200,000 items, a figure that
included “the history and literature of other countries, international relations, science of law, technical
sciences and others.” (Ryzhak, 2005)
Librarians thus played a central role in the censorship machinery. They did more than guard the contents
of limited-access collections and purge the freely accessible stocks according to the latest Party
directives. As the intermediaries between the readers and the closed stacks, their task was to carefully
guide readers’ interests:
“In the 1970s, among the staff members of the service department of the Lenin State Library of the
U.S.S.R., there were specially appointed persons-"politcontrollers"-who, apart from their regular
professional functions, had to perform additional control over the literature lent from the general stocks
(not from the restricted access collections), thus exercising censorship over the percolation of avant-garde
aesthetics to the reader, the aesthetics that introduced new ways of thinking and a new outlook on life
and social behavior.” (Stelmakh, 2001)
Librarians also used library cards and lending histories to collect and report information on readers and
suspicious reading habits.
Soviet economic dysfunction also severely limited access to printed works. Acute and chronic shortages
of even censor-approved texts were common, both on the market and in libraries. When the USSR
joined the first international copyright treaty in its history in 1973 (the UNESCO-backed Universal
Copyright Convention), which granted protection to foreign authors and denied “freedom of
translation,” the access problems only got worse. Soviet concern that granting protection to foreign
authors would result in significant royalty payments to western rightsholders proved valid. By 1976, the
yearly USSR trade deficit in publishing reached a million rubles (~5.5 million current USD) (Levin, 1983, p.
157). This imbalance not only affected the number of publications that were imported into the
cash-poor country, but also raised the price of translated works to double that of Russian-authored books
(Levin, 1983, p. 158).

The literary and scientific underground in Soviet times
Various practices and informal institutions evolved to address the problems of access. Book black
markets flourished: “In the 1970s and 1980s the black market was an active part of society. Buying books
directly from other people was how 35 percent of Soviet adults acquired books for their own homes, and
68 percent of families living in major cities bought books only on the black market.” (Stelmakh, 2001, p
146). Book copying and hoarding were practiced to compensate for the shortages:
“People hoarded books: complete works of Pushkin, Tolstoy or Chekhov. You could not buy such things.
So you had the idea that it is very important to hoard books. High-quality literary fiction, high quality
science textbooks and monographs, even biographies of famous people (writers, scientists, composers,
etc.) were difficult to buy. You could not, as far as I remember, just go to a bookstore and buy complete
works of Chekhov. It was published once and sold out and that's it. Dostoyevsky used to be prohibited in
the USSR, so that was even rarer. Lots of writers were prohibited, like Nabokov. Eventually Dostoyevsky
was printed in the USSR, but in very small numbers.
And also there were scientists who wanted scientific books and also could not get them. Mathematics
books, physics - only very few books were published every year, you can't compare this with the market in
the U.S. Russian translations of classical monographs in mathematics were difficult to find.
So, in the USSR, everyone who had a good education shared the idea that hoarding books is very, very
important, and did just that. If someone had free access to a Xerox machine, they were Xeroxing
everything in sight. A friend of mine had entire room full of Xeroxed books.”2
From the 1960s onwards, the ever-growing Samizdat networks tried to counterbalance the effects of
censorship and provide access to both censored classics and information on the current state of Soviet

2 Anonymous source #1

Draft Manuscript, 11/4/2014, DO NOT CITE!
society. Reaching a readership of around 200,000, these networks operated in a distributed, bottom-up
manner. Each node in the chain of distribution copied the texts it received, and distributed the copies.
The nodes also carried information backwards, towards the authors of the samizdat publications.
In the immediate post-Soviet political turmoil and economic calamity, access to print culture did not get
any easier. Censorship officially ended, but so too did much of the funding for the state-funded
publishing sector. Mass unemployment, falling wages, and the resulting loss of discretionary income did
not facilitate the shift toward market-based publishing models. The funding of libraries also dwindled,
limiting new acquisitions (Elst, 2005, p. 299-300). Economic constraints took the place of political ones.
But in the absence of political repression, self-organizing efforts to address these constraints acquired
greater scope of action. Slowly, the informal sphere began to deliver alternative modes of access to
otherwise hard-to-get literary and scientific works.
Russian pirate libraries emerged from these enmeshed contexts: communist ideologies of the reading
nation and mass education; the censorship of texts; the abused library system; economic hardships and
dysfunctional markets, and, most importantly, the informal practices that ensured the survival of
scholarship and literary traditions under hostile political and economic conditions. The prominent place
of Russian pirate libraries in the larger informal media economy—and of Russian piracy of music, film,
and other copyrighted work more generally—cannot be understood outside this history.

The emergence of DIY digital libraries in RuNet
The copying of censored and uncensored works (by hand, by typewriters, by photocopying or by
computers), the hoarding of copied texts, the buying and selling of books on the black market, and the
informal, peer-to-peer distribution of samizdat material were integral parts of the everyday experience
of many educated Soviet and post-Soviet readers. The building and maintenance of individual
collections and the participation in the informal networks of exchange offered a sense of political,
economic and cultural agency—especially as the public institutions that supported the core professions
of the intelligentsia fell into sustained economic crisis.
Digital technologies were embraced by these practices as soon as they appeared:
“From late 1970s, when first computers became used in the USSR and printers became available,
people started to print forbidden books, or just books that were difficult to find, not necessarily
forbidden. I have seen myself a print-out on a mainframe computer of a science fiction novel,
printed in all caps! Samizdat was printed on typewriters, xeroxed, printed abroad and xeroxed, or
printed on computers. Only paper circulated, files could not circulate until people started to have
PCs at home. As late as 1992 most people did not have a PC at home. So the only reason to type
a big text into a computer was to print it on paper many times.”3
People who worked in academic and research institutions were well positioned in this process: they had
access to computers, and many had access to the materials locked up in the spetskhrans. Many also had
3 Anonymous source #1

the time and professional motivations to collect and share otherwise inaccessible texts. The core of
current digital collections was created in this late-Soviet/early post-Soviet period by such professionals.
Their home academic and scientific institutions continued to play an important role in the development
of digital text collections well into the era of home computing and the internet.
Digitized texts first circulated in printouts and later on optical/magnetic storage media. With the
emergence of digital networking these texts quickly found their way to the early Internet as well. The
first platform for digital text sharing was the Russian Fidonet, a network of BBS systems similar to
Usenet, which enabled the mass distribution of plain text files. The BBS boards, such as the Holy Spirit
BBS’ “SU.SF & F.FANDOM” group whose main focus was Soviet-Russian science fiction and fantasy
literature, connected fans around emerging collections of shared texts. As an anonymous interviewee
described his experience in the early 1990s:
“Fidonet collected a large number of plaintext files in literature / fiction, mostly in Russian, of course.
Fidonet was almost all typed in by hand. […] Maybe several thousand of the most important books,
novels that "everyone must read" and such stuff. People typed in poetry, smaller prose pieces. I have
myself read a sci-fi novel printed on a mainframe, which was obviously typed in. This novel was by
Strugatski brothers. It was not prohibited or dissident, but just impossible to buy in the stores. These
were culturally important, cult novels, so people typed them in. […] At this point it became clear that
there was a lot of value in having a plaintext file with some novels, and the most popular novels were first
digitized in this way.”
The next stage of text digitization started around 1994. By that time, growing numbers of people had
computers, scanning peripherals, and OCR software. Russian internet and PC penetration, while extremely
low overall in the 1990s (0.1% of the population having internet access in 1994, growing to 8.3% by
2003), began to make inroads in educational and scientific institutions and among Moscow and
St. Petersburg elites, who were often the critical players in these networks. As access to technologies
increased, a much wider array of people began to digitize their favorite texts, and these collections began
to circulate, first via CD-ROMs, later via the internet.
One such collection belonged to Maxim Moshkov, who published his library under the name lib.ru in
1994. Moshkov was a graduate of the Moscow State University Department of Mechanics and
Mathematics, which played a large role in the digitization of scientific works. After graduation, he started
to work for the Scientific Research Institute of System Development, a computer science institute
associated with the Russian Academy of Sciences. He describes the early days of his collection as follows:
“I began to collect electronic texts in 1990, on a desktop computer. When I got on the Internet in 1994, I
found lots of sites with texts. It was like a dream came true: there they were, all the desired books. But
these collections were in a dreadful state! Incompatible formats, different encodings, missing content. I
had to spend hours scouring the different sites and directories to find something.
As a result, I decided to convert all the different file-formats into a single one, index the titles of the books
and put them in thematic directories. I organized the files on my work computer. I was the main user of
my collection. I perfected its structure, made a simple, fast and convenient search interface and
developed many other useful functions and put it all on the Internet. Soon, people got into the habit of
visiting the site. […]
For about 2 years I have scoured the internet: I sought out and pulled texts from the network, which were
lying there freely accessible. Slowly the library grew, and the audience increased with it. People started
to send books to me, because they were easier to read in my collection. And the time came when I
stopped surfing the internet for books: regular readers are now sending me the books. Day after day I get
about 100 emails, and 10-30 of them contain books. So many books were sent in, that I did not have time
to process them. Authors, translators and publishers also started to send texts. They all needed the
library.” (Мошков, 1999)

In the second half of the 1990s, the Russian Internet (RuNet) was awash in book digitization projects.
With the advent of scanners, OCR technology, and the Internet, the work of digitization eased
considerably. Texts migrated from print to digital and sometimes back to print again. They circulated
through different collections, which, in turn, merged, fell apart, and re-formed. Digital libraries with the
mission to collect and consolidate these free-floating texts sprang up by the dozens.
Such digital librarianship was the antithesis of official Soviet book culture: it was free, bottom-up,
democratic, and uncensored. It also offered a partial remedy to problems created by the post-Soviet
collapse of the economy: the impoverishment of libraries, readers, and publishers. In this context, book
digitization and collecting also offered a sense of political, economic and cultural agency, with parallels
to the copying and distribution of texts in Soviet times. The capacity to scale up these practices coincided
with the moment when anti-totalitarian social sentiments were the strongest, and economic needs the
direst.
The unprecedented bloom of digital librarianship was the result of the superimposition of multiple waves
of distinct transformations: technological, political, economic and social. “Maksim Moshkov's Library”
was ground zero for this convergence and soon became a central point of exchange for the community
engaged in text digitization and collection:
[At the outset] there were just a couple of people who started scanning books in large quantities. Literally
hundreds of books. Others started proofreading, etc. There was a huge hole in the market for books.
Science fiction, adventure, crime fiction, all of this was hugely in demand by the public. So lib.ru was to a
large part the response, and was filled by those books that people most desired and most valued.
For years, lib.ru integrated as much as it could of the different digital libraries flourishing in the RuNet. By
doing so, it preserved the collections of the many short-lived libraries.
This process of collection slowed in the early 2000s. By that time, lib.ru had all of the classics, resulting
in a decrease in the flow of new digitized material. By the same token, the Russian book market was
finally starting to offer works aimed at the popular mainstream, and was flooded by cheap romances,
astrology, crime fiction, and other genres. Such texts started to appear in, and would soon flood, lib.ru.
Many contributors, including Moshkov, were concerned that such ephemera would dilute the original
library. And so they began to disaggregate the collection. Self-published literature, “user generated
content,” and fan fiction were separated into the aptly named samizdat.lib.ru, which housed original texts
submitted by readers. Popular fiction (“low-brow literature”) was copied from the relevant subsections
of lib.ru and split off. Sites specializing in those genres quickly formed their own ecosystem. [L], the first
of its kind, now charges a monthly fee to provide access to the collection. The [f] community split off
from [L] the same way that [L] split off from lib.ru, to provide free and unrestricted access to a
fundamentally similar collection. Finally, some in the community felt the need to focus their efforts on a
separate collection of scientific works. This became the Kolhoz collection.

The genesis of a million book scientific library
A Kolhoz (Russian: колхо́з) was one of the types of collective farm that emerged in the early Soviet
period. In the early days, it was a self-governing, community-owned collaborative enterprise, with many
of the features of a commons. For the Russian digital librarians, these historical resonances were
intentional.
“The kolhoz group was initially a community that scanned and processed scientific materials: books and,
occasionally, articles. The ethos was free sharing. Academic institutes in Russia were in dire need of
scientific texts; they xeroxed and scanned whatever they could. Usually, the files were then stored on the
institute's ftp site and could be downloaded freely. There were at least three major research institutes
that did this, back in early 2000s, unconnected to each other in any way, located in various faraway parts
of Russia. Most of these scans were appropriated by the kolhoz group and processed into DJVU4.
The sources of files for kolhoz were, initially, several collections from academic institutes (downloaded
whenever the ftp servers were open for anonymous access; in one case, from one of the institutes of the
Chinese academy of sciences, but mostly from Russian academic institutes). At that time (around 2002),
there were also several commercialized collections of scanned books on sale in Russia (mostly, these were
college-level textbooks on math and physics); these files were also all copied to kolhoz and processed into
DJVU. The focus was on collecting the most important science textbooks and monographs of all time, in
all fields of natural science.
There was never any commercial support. The kolhoz group never had a web site with a database, like
most projects today. They had an ftp server with files, and the access to ftp was given by PM in a forum.
This ftp server was privately supported by one of the members (who was an academic researcher, like
most kolhoz members). The files were distributed directly by burning files on writable DVDs and giving the

4 DJVU is a file format that revolutionized online book distribution the way MP3 revolutionized online music
distribution. For books that contain graphs, images and mathematical formulae, scanning is the only digitization
option. However, the large number of resulting image files is difficult to handle. The DJVU file format allows the
images of scanned book pages to be stored in very small files, which makes it well suited to
the distribution of scanned e-books.

DVDs away. Later, the ftp access was closed to the public, and only a temporary file-swapping ftp server
remained. Today the kolhoz DVD releases are mostly spread via torrents.”5
Kolhoz amassed around fifty thousand documents; the mexmat collection of the Moscow State
University Department of Mechanics and Mathematics (Moshkov’s alma mater) was around the same
size; the “world of books” collection (mirknig) had around thirty thousand files; and there were around a
dozen smaller archives, each with approximately ten thousand files.
The Kolhoz group dominated the science-minded ebook community in Russia well into the late 2000s.
Kolhoz, however, suffered from the same problems as the early Fidonet-based text collections. Since it
was distributed on DVDs, via ftp servers and through torrents, it was hard to search, it lacked a proper catalog
and it was prone to fragmentation. Parallel solutions soon emerged: around 2006-7, an existing book site
called Gigapedia copied the English books from Kolhoz, set up a catalog, and soon became the most
influential pirate library on the English-speaking internet.
Similar cataloguing efforts soon emerged elsewhere. In 2007, someone on rutracker.ru, a Russian BBS
focusing on file sharing, posted torrent links to 91 DVDs containing science and technology titles
aggregated from various other Russian sources, including Kolhoz. This massive collection had no
categorization or particular order. But it soon attracted an archivist: a user of the forum started the
laborious task of organizing the texts into a usable, searchable format: filtering out duplicates and
organizing the existing metadata, first in an Excel spreadsheet and later in a more open, web-based database operating under the name Aleph.
Aleph inherited more than just books from Kolhoz and Moshkov’s lib.ru. It inherited their elitism with
regard to canonical texts, and their understanding of librarianship as a community effort. Like the earlier
sites, Aleph’s collections are complemented by a stream of user submissions. As on the other sites, the
number of submissions grew rapidly as the site’s visibility, reputation and trustworthiness were
established, and, as on the others, it later fell as more and more of what was perceived as canonical
literature was uploaded:
“The number of mankind’s useful books is about what we already have. So growth is defined by newly
scanned or issued books. Also, the quality of the collection is represented not by the number of books but
by the amount of knowledge it contains. [ALEPH] does not need to grow more and I am not the only one
among us who thinks so. […]
We have absolutely no idea who sends books in. It is practically impossible to know, because there are a
million books. We gather huge collections which eliminate any traces of the original uploaders.
My expectation is that new arrivals will dry up. Not completely, as I described above, some books will
always be scanned or rescanned (it nowadays happens quite surprisingly often) and the overall process of
digitization cannot and should not be stopped. It is also hard to say when the slowdown will occur: I
expected it about a year ago, but then library.nu got shut down and things changed dramatically in many
respects. Now we are "in charge" (we had been the largest anyways, just now everyone thinks we are in
5 Anonymous source #1

charge) and there has been a temporary rise in the book inflow. At the moment, relatively small or
previously unseen collections are being integrated into [ALEPH]. Perhaps in a year it will saturate.
However, intuition is not a good guide. There are dynamic processes responsible for eBook availability. If
publishers massively digitize old books, they'll obviously be harvested and that will change the whole
picture.” 6
Aleph’s ambitions to create a universal library are limited, at least in terms of scope. It does not want to
have everything, or just anything. What it wants is what the community considers relevant, as
measured by the act of actively digitizing and sharing books. But it has devised a strategy
for establishing a library that is universal in terms of its reach. The administrators of Aleph understand that
Gigapedia’s downfall was due to its visibility and they wish to avoid that trap:
“Well, our policy, which I control as strictly as I can, is to avoid fame. Gigapedia's policy was to gain as
much fame as possible. Books should be available to you, if you need them. But let the rest of the world
stay in its equilibrium. We are taking great care to hide ourselves and it pays off.”7
They have solved the dilemma of providing access without jeopardizing their mission by open sourcing
the collection and thus allowing others to create widely publicized services that interface with the
public. They let others run the risk of getting famous.

Mirrors and communities
Aleph serves as a source archive for around a half-dozen freely accessible pirate libraries on the net. The
catalog database is downloadable, the content is downloadable, even the server code is downloadable.
No passwords are required to download and there are no gatekeepers. There is no obstacle to setting
up a similar library with a wider catalog, with an improved user interface and better services, with a
different audience or, in fact, a different business model.
This arrangement creates a two-layered community. The core group of the Aleph admins maintains the
current service, while a loose and ever-changing network of ‘mirror sites’ builds on the Aleph
infrastructure.
“The unspoken agreement is that the mirrors support our ideas. Otherwise we simply do not interact with
them. If the mirrors do support this, they appear in the discussions, on the Web etc. in a positive context.
This is again about building a reputation: if they are reliable, we help with what we can, otherwise they
should prove the World they are good on their own. We do not request anything from them. They are free
to do anything they like. But if they do what we do not agree with, it'll be taken into account in future
relations. If you think for a while, there is no other democratic way of regulation: everyone expresses his
own views and if they conform with ours, we support them. If the ideology does not match, it breaks
down.”8

6 Anonymous source #1
7 Anonymous source #2
8 Anonymous source #1

The core Aleph team claims to exclusively control only two critical resources: the BBS that is the home of
the community, and the book-uploading interface. That claim is, however, not entirely accurate. For the
time being, the academically minded e-book community does indeed gather on the BBS managed by Aleph, and
although there is little incentive to move on, technically nothing stands in the way of alternatives springing
up. As for the centralization of the book collection: many of the mirrors have their own upload pages
where one can contribute to a mirror’s collection, and it is not clear how or whether books that land at
one of the mirrors find their way back to the central database. Aleph also offers a desktop library
management tool, which enables dedicated librarians to see the latest Aleph database on their desktop
and integrate their local collections with the central database via this application. Nevertheless, it seems
that nothing really stands in the way of the fragmentation of the collection, apart from the willingness of
uploaders to contribute directly to Aleph rather than to one of its mirrors (or other sites).
Funding for Aleph comes from the administrators’ personal resources as well as occasional donations
when there is a need to buy or rent equipment or services:
“[W]e've been asking and getting support for this purpose for years. […] All our mirrors are supported
primarily from private pockets and inefficient donation schemes: they bring nothing unless a whole
campaign is arranged. I asked the community for donations 3 or 4 times, for a specific purpose only and
with all the budget spoken for. And after getting the requested amount of money we shut down the
donations.”9
Mirrors, however, do not need to be non-commercial to enjoy the support of the core Aleph community;
they just have to provide free access. Ad-supported business models that do not charge for individual
access are still acceptable to the community, but there has been a serious falling-out with another site, which
used the Aleph stock to seed its own library, but decided to follow a “collaborative piracy” business
approach.
“To make it utmost clear: we collaborate with anyone who shares the ideology of free knowledge
distribution. No conditions. [But] we can't suddenly start supporting projects that earn money. […]
Moreover, we've been tricked by commercial projects in the past when they used the support of our
community for their own benefit.”10
The site in question, [e], is based on a simple idea: If a user cannot find a book in its collection, the
administrators offer to purchase a digital or print copy, rip it, and sell it to the user for a fraction of the
original price, typically under $1. Payments are to be made in Amazon gift cards, which make the
purchases easy but the de-anonymization of users difficult. [e] recoups its investment, in principle,
through resale. While clearly illegal, the logic is not that different from that of private subscription
libraries, which purchase a resource and distribute the costs and benefits among club members.

9 BBS comment posted on Jan 15, 2013
10 BBS comment posted on Jan 15, 2013

Although from the rights holders’ perspective there is little difference between the two approaches,
many participants in the free access community draw a sharp line between the two, viewing the sales
model as a violation of community norms.
“[e] is a scam. They were banned in our forum. Yes, most of the books in [e] came from [ALEPH], because
[ALEPH] is open, but we have nothing to do with them... If you wish to buy a book, do it from legal
sources. Otherwise it must be free.[…]
What [e] wants:
- make money on ebook downloads, no matter what kind of ebooks.
- get books from all the easy sources - spend as little effort as possible on books - maximize profit.
- no need to build a community, no need to improve quality, no need to correct any errors - just put all
files in a big pile - maximize profit.
- files are kept in secret, never given away, there is no listing of files, there is no information about what
books are really there or what is being done.
There are very few similarities in common between [e] and [ALEPH], and these similarities are too
superficial to serve as a common ground for communication. […]
They run an illegal business, making a profit.”11
Aleph administrators describe a set of values that differentiates possible site models. They prioritize the
curatorial mission and the provision of long term free access to the collection with all the costs such a
position implies, such as open sourcing the collection, ignoring takedown requests, keeping a low profile,
refraining from commercial activities, and as a result, operating on a reduced budget. [e] prioritizes the
expansion of its catalogue on demand but that implies a commercial operation, a larger budget and the
associated high legal risk. Sites carrying Aleph’s catalogue prioritize public visibility, carry ads to cover
costs but respond to takedown requests to avoid as much trouble as they can. From the perspective of
expanding access, these are not easy or straightforward tradeoffs. In Aleph’s case, the strong
commitment to the mission of providing free access comes with significant sacrifices, the most important
of which is relinquishing control over its most valuable asset: its collection of 1.2 million scientific books.
But they believe that these costs are justified by the promise that, this way, the fate of free access is not
tied to the fate of Aleph.
The fact that piratical file sharing communities are willing to make substantial sacrifices (in terms of self-restraint) to ensure their long term survival has been documented in a number of different cases (Bodó,
2013). Aleph is unique, however, in its radical open source approach. No other piratical community has
given up control over itself so completely. This approach is rooted in the way it regards the legal
status of its subject matter, which is first and foremost scholarly publications. While norms of openness in the
field of scientific knowledge production were first formed in the Enlightenment period, Aleph’s
11 BBS comments posted on Jul 02, 2013, and Aug 25, 2013

copynorms are as much shaped by the specificities of the post-Soviet era as by the age-old realization that in
science we can see further if we are allowed to stand “on the shoulders of giants”.

Copyright and copynorms around Russian pirate libraries
The struggle to re-establish rightsholders’ control over digitized copyrighted works has defined the
copyright policy arena since Napster emerged in 1999. Russia brought a unique history to this conflict. In
Russia, digital libraries emerged in a period of double transformation: the post-Soviet copyright
system had to adopt global norms, while the global norms struggled to adapt to the emergence of digital
copying.
The first post-Soviet decade produced new copyright laws that conformed with some of the international
norms advocated by Western rightsholders, but little legal clarity or enforceability (Sezneva & Karaganis,
2011). Under such conditions, informally negotiated copynorms stepped in to fill the void of non-existent,
unreasonable, or unenforceable laws. The pirate libraries in the RuNet are as much regulated by such
norms as by the actual laws themselves.
During most of the 1990s, user-driven digitization and archiving were legal, or to be more exact, weren’t
illegal. The first Russian copyright law, enacted in 1993, did not cover “internet rights” until a 2006
amendment (Budylin & Osipova, 2007; Elst, 2005, p. 425). As a result, many (including the
Moscow prosecutor’s office) argued that the distribution of copyrighted works via the internet was not
copyright infringement. Authors and publishers who saw their works appear in digital form and
circulate via CD-ROMs and the internet had to rely on informal norms, still in development, to establish
control over their texts vis-à-vis enthusiastic collectors and for-profit entrepreneurs.
The HARRYFAN CD was one of the early examples of a digital text collection in circulation before internet
access was widespread. The CD contained around ten thousand texts, mostly Russian science fiction. It
was compiled in 1997 by Igor Zagumenov, a book enthusiast, from the texts that circulated on the Holy
Spirit BBS. The CD was a non-profit project, with a planned print run of around 1,000 copies for sale.
Zagumenov did get in touch with some of the authors and publishers, and got permission to release
some of their texts, but the CD also included many other works that were uploaded to the BBS without
authorization. The CD included the following copyright notice, alongside the name and contact of
Zagumenov and those who granted permission:
Texts on this CD are distributed in electronic format with the consent of the copyright holders or their
literary agent. The disk is aimed at authors, editors, translators and fans SF & F as a compact reference
and information library. Copying or reproduction of this disc is not allowed. For the commercial use of
texts please refer directly to the copyright owners at the following addresses.
The authors whose texts and unpublished manuscripts appeared in the collection without authorization
started to complain to those whose contact details were in the copyright notice. Some complained
about the material damage the collection may have caused to them, but most complaints focused on
moral rights: unauthorized publication of a manuscript, the mutilation of published works, lack of
attribution, or the removal of original copyright and contact notices. Some authors had no problem
appearing in non-commercially distributed collections but objected to the fact that the CDs were sold
(and later overproduced in spite of Zagumenov’s intentions).
The debate, which took place in the book-related fora of Fidonet, raised some important points.
Participants again drew a significant distinction between free access provided first by Fidonet (and later
by lib.ru, which integrated some parts of the collection) and what was perceived as Zagumenov’s for-profit enterprise, despite the fact that the price of the CD covered only printing costs. The debate also
drew authors’ and publishers’ attention to the digital book communities’ actions, which many saw as
beneficial as long as they respected the wishes of the authors. Some authors did not want to appear online
at all, others wanted only their published works to be circulated.
Lib.ru, of course, integrated parts of the HARRYFAN CD into its collection. Moshkov’s policy towards
authors’ rights was to ask for permission, if he could contact the author or publisher. He also honored
takedown requests sent to him. In 1999 he wrote on copyright issues as follows:
“The author’s interests must be protected on the Internet: the opportunity to find the original copy, the
right of attribution, protection from distorting the work. Anyone who wants to protect his/her rights,
should be ready to address these problems, ranging from the ability to identify the offending party, to the
possibility of proving infringement.[…]
Meanwhile, it has become a stressing question how to protect authors-netizens' rights regarding their
work published on the Internet. It is known that there are a number of periodicals that reprint material
from the Internet without the permission of the author, without payment of a fee, without prior
arrangement. Such offenders need to be shamed via public outreach. The "Wall of shame" website is one
of the positive examples of effective instruments established by the networked public to protect their
rights. It manages to do the job without bringing legal action - polite warnings, an indication of potential
trouble and shaming of the infringer.
Do we need any laws for digital libraries? Probably we do, but until then we have to do without. Yes, of
course, it would be nice to have their status established as “cultural objects” and have the same rights as
a "real library" to collect information, but that might be in the distant future. It would also be nice to
have the e-library "legal deposits" of publications in electronic form, but when even Leninka [the Russian
State Library] cannot always afford that, what we really need are enthusiastic networkers. […]
The policy of the library is to take everything they give, otherwise they cease to send books. It is also to
listen to the authors and strictly comply with their requirements. And it is to grow and prosper. […] I
simply want the books to find their readers because I am afraid to live in a world where no one reads
books. This is already the case in America, and it is speeding up with us. I don’t just want to derail this
process, I would like to turn it around.

Moshkov played a crucial role in consolidating copynorms in the Russian digital publishing domain. His
reputation and place in the Russian literary domain are marked by a number of prizes12 and by the library’s
continued existence. This place was secured by a number of closely intertwined factors:







• Framing and anchoring the digitization and distribution practice in the library tradition.
• The non-profit status of the enterprise.
• Respecting the wishes of the rights holders even if he was not legally obliged to do so.
• Maintaining active communication with the different stakeholders in the community,
  including authors and readers.
• Responding to a clear gap in affordable, legal access.
• Conservatism with regard to the book, anchored in the argument that digital texts are not
  substitutes for printed matter.

Many other digital libraries tried to follow Moshkov’s formula, but the times were changing. Internet and
computer access left the sub-cultural niches and became mainstream; commercialization became a
viable option and thus an issue for both the community and rightsholders; and the legal environment
was about to change.

Formalization of the IP regime in the 2000s
As soon as the 1993 copyright law passed, the US resumed pressure on the Russian government for
further reform. Throughout the period—and indeed to the present day—US Trade Representative
Special 301 reports cited inadequate protections and lack of enforcement of copyright. Russia’s plans to
join the WTO, over which the US had effective veto power, also became leverage to bring the Russian
copyright regime into compliance with US norms.
Book piracy was regularly mentioned in Special 301 reports in the 2000s, but the details, alleged losses,
and analysis changed little from year to year. The estimated $40 million in annual losses throughout this
period were dwarfed by claims from the studios and software vendors, and book piracy was clearly not among the
top priorities of the USTR. For most of the decade, the electronic availability of bestsellers and academic
textbooks was seen in the context of print substitution, rather than damage to the non-existent
electronic market. And though there is little direct indication, the Special 301 reports name sites which
(unlike lib.ru) were serving audiences beyond the RuNet, suggesting that the focus of enforcement was
not to protect US interests in the Russian market, but to prevent sites based in Russia from catering to
demand in the high-value Western European and US markets.
A 1998 amendment to the 1993 copyright law extended the legal framework to encompass digital rights,
though in a fashion that continued to produce controversy. After 1998, digital services had to license
content from collecting societies, but those societies needed no permission from rightsholders provided
they paid royalties. The result was a proliferation of collective management organizations, competing to
license the material to digital services (Sezneva and Karaganis, 2011), which under this arrangement
were compliant with Russian law, but were regarded as illegal by Western rights holders who claimed
that the Russian collecting societies were not representing them.

12 ROTOR, the International Union of Internet Professionals in Russia, voted lib.ru the “literary site of the year” in
1999, 2001, and 2003, “electronic library of the year” in 2004, 2006, 2008, 2009, and 2010, “programmer of the year”
in 1999, and “man of the year” in 2004 and 2005.

The best-known dispute from this time concerned the legality of AllofMP3.com, a site that
sold music from western record labels at prices far below those of iTunes or other officially licensed
vendors. AllofMP3.com claimed that it was licensed by ROMS, the Russian Society for Multimedia and
Internet (Российское общество по мультимедиа и цифровым сетям (НП РОМС)), but despite that it
became the focal point of US (and behind them, major label) pressure, leading to an unsuccessful
criminal prosecution of the site owner and eventual closure of the site in 2007. Although Lib.ru had
some direct agreements with authors, it also licensed much of its collection from ROMS, and thus was in
the same legal situation as AllofMP3.com.
Lib.ru avoided the attention of foreign rightholders and Russian state pressure and even benefited from
state support during the period, receiving a $30,000 grant from the Federal Agency for Press and
Mass Communications to digitize the most important works from the 1930s. But the chaotic licensing
environment that governed their legal status also came back to haunt them. In 2005, a lawsuit was
brought against Moshkov by KM Online (KMO), an online vendor that sold digital texts for a small fee.
Although the KMO collection—like every other collection—had been assembled from a wide range of
sources on the Internet, KMO claimed to pay a 20% royalty on its income to authors. In 2004 KMO
requested that lib.ru take down works by several authors with whom (or with whose heirs) KMO claimed
to be in exclusive contract to distribute their texts online. KMO’s claims turned out to be only partly true.
KMO had arranged contracts with a number of the heirs to classics of the Soviet period, who hoped to
benefit from an obscure provision in the 1993 Russian copyright law that granted copyrights to the heirs
of politically prosecuted and later rehabilitated Soviet-era authors. Moshkov, in turn, claimed that he
had written or oral agreements with many of the same authors and heirs, in addition to his agreement
with ROMS.
The lawsuit was a true public event. It generated thousands of news items both online and in the
mainstream press. Authors, members of the publishing industry, legal professionals, librarians, and internet
professionals publicly supported Moshkov, while KMO was seen as a rogue operator that would lie to
make easy money on freely available digital resources.
Eventually, the court ruled that KMO indeed had one exclusive contract with Eduard Gevorgyan, and that
the publication of his texts by Moshkov infringed the moral (but not the economic) rights of the author.
Moshkov was ordered to pay 3000 Rubles (approximately $100) in compensation.
The lawsuit was a sign of a slow but significant transformation in the Russian print ecosystem. The idea
of a viable market for electronic books began to find a foothold. Electronic versions of texts began to be
regarded as potential substitutes for the printed versions, not advertisements for them or supplements
to them. More and more commercial services emerged, which regarded the well-entrenched free digital
libraries as competitors. As Russia continued to bring its laws into closer conformance with WTO
requirements, ahead of Russia’s admission in 2012, western rightsholders gained enough power to
demand enforcement against RuNet pirate sites. The kinds of selective enforcement for political or
business purposes, which had marked the Russian IP regime throughout the decade (Sezneva &
Karaganis, 2011), slowly gave way to more uniform enforcement.

Closure of the Legal Regime
The legal, economic, and cultural conditions under which Aleph and its mirrors operate today are very
different from those of two decades earlier. The major legal loopholes are now closed, though Russian
authorities have shown little inclination to pursue Aleph so far:
I can't say whether it's the Russian copyright enforcement or the Western one that's most dangerous for
Aleph; I'd say that Russian enforcement is still likely to tolerate most of the things that Western
publishers won't allow. For example, lib.ru and [L] and other unofficial Russian e-libraries are tolerated
even though far from compliant with the law. These kinds of e-libraries could not survive at all in western
countries.13
Western publishers have been slow to join record, film, and software companies in their aggressive
online enforcement campaigns, and academic publishers even more so. But such efforts are slowly
increasing, as the market for digital texts grows and as publishers benefit from the enforcement
precedents set or won by the more aggressive rightsholder groups. The domain name of [os], one of the
sites mirroring the Aleph collection, was seized, apparently due to legal action taken by a US
rightholder, and the site also started to respond to DMCA notices, removing links to books reported to be
infringing. Aleph responds to this with a number of tactical moves:
We want books to be available, but only for those who need them. We do not want [ALEPH] to be visible.
If one knows where to get books, there are here for him or her. In this way we stay relatively invisible (in
search engines, e.g.), but all the relevant communities in the academy know about us. Actually, if you
question people at universities, the percentage of them is quite low. But what's important is that the
news about [ALEPH] is spread mostly by face-to-face communication, where most of the unnecessary
people do not know about it. (Unnecessary are those who aim profit)14
The policy of invisibility is radically different from Moshkov’s policy of maximum visibility. Aleph hopes
that it can recede into the shadows where it will be protected by the omerta of academics sharing the
sharing ethos:
In Russian academia, [Aleph] is tacitly or actively supported. There are people that do not want to be
included, but it is hard to say who they are in most cases. Since there are DMCA complaints, of course
there are people who do not want stuff to appear here. But in our experience the complainers are only
from the non-scientific fellows. […] I haven't seen a single complaint from the authors who should
constitute our major problem: professors etc. No, they don't complain. Who complains are either of such
type I have mentioned or the ever-hungry publishers.15

13 Anonymous source #1
14 Anonymous source #1
15 Anonymous source #1

The protection the academic community has to offer may not be enough to fend off the publishers’
enforcement actions. Receding further into the darknets and hiding behind the veil of privacy
technologies is one option the Aleph site has: the first mirror on I2P, an anonymizing network designed
to hide the whereabouts and identity of web services, is already operational. But
[i]f people are physically served court invitations, they will have to close the site. The idea is, however,
that the entire collection is copied throughout the world many times over, the database is open, the code
for the site is open, so other people can continue.16

On methodology
We tried to reconstruct the story behind Aleph by conducting interviews and browsing through the BBS
of the community. Access to the site and community members was given under a strict condition of
anonymity. We thus removed any reference to the names and URLs of the services in question.
At one point we shared an early draft of this paper with interested members and asked for their
feedback. Beyond access and feedback, community members helped with the writing of this article by
providing translations of some Russian originals, as well as reviewing the translations made by the
author. In return, we provided financial contributions to the community, to the value of 100 USD.
We reproduced forum entries without any edits to the language; we did, however, edit interviews
conducted via IM services to reflect basic writing standards.

16 Anonymous source #1

References

Abelson, H., Diamond, P. A., Grosso, A., & Pfeiffer, D. W. (2013). Report to the President: MIT and the
Prosecution of Aaron Swartz. Cambridge, MA. Retrieved from http://swartzreport.mit.edu/docs/report-to-the-president.pdf
Alekseeva, L., Pearce, C., & Glad, J. (1985). Soviet dissent: Contemporary movements for national,
religious, and human rights. Wesleyan University Press.
Bodó, B. (2013). Set the fox to watch the geese: voluntary IP regimes in piratical file-sharing
communities. In M. Fredriksson & J. Arvanitakis (Eds.), Piracy: Leakages from Modernity.
Sacramento, CA: Litwin Books.
Borges, J. L. (1998). The library of Babel. In Collected fictions. New York: Penguin.
Bowers, S. L. (2006). Privacy and Library Records. The Journal of Academic Librarianship, 32(4), 377–383.
doi:10.1016/j.acalib.2006.03.005
Budylin, S., & Osipova, Y. (2007). Is AllOfMP3 Legal? Non-Contractual Licensing Under Russian Copyright
Law. Journal Of High Technology Law, 7(1).
Bush, V. (1945). As We May Think. Atlantic Monthly.
Dewhirst, M., & Farrell, R. (Eds.). (1973). The Soviet Censorship. Metuchen, NJ: The Scarecrow Press.
Elst, M. (2005). Copyright, freedom of speech, and cultural policy in the Russian Federation.
Leiden/Boston: Martinus Nijhoff.
Ermolaev, H. (1997). Censorship in Soviet Literature: 1917-1991. Rowman & Littlefield.
Foerstel, H. N. (1991). Surveillance in the stacks: The FBI’s library awareness program. New York:
Greenwood Press.
Friedberg, M., Watanabe, M., & Nakamoto, N. (1984). The Soviet Book Market: Supply and Demand.
Acta Slavica Iaponica, 2, 177–192. Retrieved from
http://eprints.lib.hokudai.ac.jp/dspace/bitstream/2115/7941/1/KJ00000034083.pdf
Interview with Dusan Barok. (2013). Neural, 10–11.
Interview with Marcell Mars. (2013). Neural, 6–8.
Komaromi, A. (2004). The Material Existence of Soviet Samizdat. Slavic Review, 63(3), 597–618.
doi:10.2307/1520346

Lessig, L. (2013). Aaron’s Laws - Law and Justice in a Digital Age. Cambridge, MA: Harvard Law School.
Retrieved from http://www.youtube.com/watch?v=9HAw1i4gOU4
Levin, M. B. (1983). Soviet International Copyright: Dream or Nightmare. Journal of the Copyright Society
of the U.S.A., 31, 127.
Liang, L. (2012). Shadow Libraries. e-flux. Retrieved from http://www.e-flux.com/journal/shadow-libraries/
Newcity, M. A. (1978). Copyright law in the Soviet Union. Praeger.
Newcity, M. A. (1980). The Universal Copyright Convention as an Instrument of Repression: The Soviet
Experiment. In Copyright L. Symp. (Vol. 24, p. 1). HeinOnline.
Patry, W. F. (2009). Moral panics and the copyright wars. New York: Oxford University Press.
Post, R. (1998). Censorship and Silencing: Practices of Cultural Regulation. Getty Research Institute for
the History of Art and the Humanities.
Rieusset-Lemarié, I. (1997). P. Otlet’s mundaneum and the international perspective in the history of
documentation and information science. Journal of the American Society for Information Science,
48(4), 301–309.
Ryzhak, N. (2005). Censorship in the USSR and the Russian State Library. IFLA/FAIFE Satellite meeting:
Documenting censorship – libraries linking past and present, and preparing for the future.
Sezneva, O., & Karaganis, J. (2011). Chapter 4: Russia. In J. Karaganis (Ed.), Media Piracy in Emerging
Economies. New York: Social Science Research Council.
Skilling, H. G. (1989). Samizdat and an Independent Society in Central and Eastern Europe. Palgrave
Macmillan.
Solzhenitsyn, A. I. (1974). The Gulag Archipelago 1918-1956: An Experiment in Literary Investigation,
Parts I-II. Harper & Row.
Stelmach, V. D. (1993). Reading in Russia: findings of the sociology of reading and librarianship section of
the Russian state library. The International Information & Library Review, 25(4), 273–279.
Stelmakh, V. D. (2001). Reading in the Context of Censorship in the Soviet Union. Libraries & Culture,
36(1), 143–151. doi:10.2307/25548897
Suber, P. (2013). Open Access (Vol. 1). Cambridge, MA: The MIT Press.
doi:10.1109/ACCESS.2012.2226094
UHF. (2005). Где-где - на борде! Хакер, 86–90.

Гроер, И. (1926). Авторское право. In Большая Советская Энциклопедия. Retrieved from
http://ru.gse1.wikia.com/wiki/Авторское_право
