Historical Newspaper Conversions
Over the last few years, digitization has gradually moved from an experimental, temporary activity toward one that is structural and continuous, and mass digitization projects have increased in number. Newspapers appeal to a large audience, but in many cases they are inaccessible. It is therefore no surprise that many institutions are now deciding to digitize their newspaper collections. Digitization and Web delivery make these collections available to a worldwide audience.
Contentra Technologies is one of the leading service providers in the digitization of archived newspapers. Contentra also specializes in digitizing books, journals, manuscripts, and similar materials. We partner with newspaper publishers, national libraries, and university and state libraries, creating customized solutions to suit each of their specific needs. Contentra plays a significant, active role in the archival digitization movement.
In order to make a newspaper available for searching on the Internet, the following processes take place:
- The microfilm copy or paper original is scanned. Master and Web image files are generated.
- Images are de-speckled, de-skewed, and cropped. Metadata is assigned to each issue, page, and article to improve the searchability of the newspaper.
- OCR software is run over high-resolution images to create searchable full text. OCR text, images, and metadata are imported into a digital library software program.
METS & ALTO Conversions & Feeds
The Metadata Encoding and Transmission Standard (METS) is a data encoding and transmission specification in XML format. It provides the means to convey the metadata necessary both for managing digital objects within a repository and for exchanging such objects among repositories or between repositories and their users. This common object format was designed to allow institutions to share the effort of developing information management tools and services, and to facilitate the interoperable exchange of digital materials among institutions, including vendors. The METS XML schema was created in 2001 under the sponsorship of the Digital Library Federation (DLF), is supported by the Library of Congress as its maintenance agency, and is governed by the METS Editorial Board.
Purpose of METS
- Maintaining the metadata of the digital objects for the long term
- Recording the names and locations of the files that comprise those objects
- Creating XML documents that express the hierarchical structure of those objects
When a repository intends to share metadata about a digital object, or the object itself, with another repository or with a tool meant to render the object, a common data transfer syntax greatly improves the ease and efficiency of the exchange. METS was designed to provide a relatively easy format for these activities throughout the life cycle of the digital object.
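To make the purposes above concrete, the following is a minimal sketch of a METS document for one newspaper issue, built with Python's standard library. It is an illustration only, not Contentra's production workflow: the function name `build_mets`, the file IDs, the paths, and the `USE="ocr"` grouping are assumptions chosen for the example, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(issue_label, page_files):
    """Build a minimal METS tree for one issue (hypothetical helper).

    page_files: list of (file_id, href) tuples, one per page, in page order.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": issue_label})

    # File section: records the names and locations of the files.
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "ocr"})
    for file_id, href in page_files:
        file_el = ET.SubElement(
            file_grp, f"{{{METS_NS}}}file", {"ID": file_id, "MIMETYPE": "text/xml"}
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )

    # Structural map: expresses the issue > page hierarchy.
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "PHYSICAL"})
    issue_div = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", {"TYPE": "issue", "LABEL": issue_label}
    )
    for order, (file_id, _href) in enumerate(page_files, start=1):
        page_div = ET.SubElement(
            issue_div, f"{{{METS_NS}}}div", {"TYPE": "page", "ORDER": str(order)}
        )
        # Each page div points back to its file in the file section.
        ET.SubElement(page_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


if __name__ == "__main__":
    tree = build_mets(
        "Example Gazette, 1900-01-01",
        [("ALTO0001", "alto/0001.xml"), ("ALTO0002", "alto/0002.xml")],
    )
    print(ET.tostring(tree, encoding="unicode"))
```

A real METS package would normally also carry descriptive and administrative metadata sections; they are left out here to keep the file section and structural map easy to see.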
ALTO (Analyzed Layout and Text Object) is an XML schema that defines technical metadata for describing the layout and OCR-recognized text of resources such as the pages of a book or a newspaper. It is used as an extension schema to METS: METS provides the metadata and structural information, while ALTO holds the content and physical layout information.
- An ALTO file contains a styles section, where the different styles are listed, and a layout section, which describes what is on the page.
- A page is divided into several regions (print space, left margin, right margin, top margin, bottom margin). For each region, all objects detected inside it are listed.
- Measurements in ALTO XML files can be given in pixels, in 1/10 mm, or in 1/1200 inch. Coordinates given in a physical unit must be converted to pixels at the resolution of the target image before they can be used with it.
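The unit conversion described in the last point is simple arithmetic: one inch is 254 tenths of a millimetre, or 1200 units of 1/1200 inch. A minimal sketch follows, assuming the unit names `pixel`, `mm10`, and `inch1200` as they appear in an ALTO file's MeasurementUnit element; the function name `alto_to_pixels` is made up for this example.

```python
def alto_to_pixels(value, unit, dpi):
    """Convert one ALTO coordinate to pixels at the given image resolution.

    value: the number read from HPOS, VPOS, WIDTH, or HEIGHT
    unit:  "pixel", "mm10" (1/10 mm), or "inch1200" (1/1200 inch)
    dpi:   resolution of the image the coordinates will be applied to
    """
    if unit == "pixel":
        return round(value)
    if unit == "mm10":
        # 254 tenths of a millimetre per inch
        return round(value / 254 * dpi)
    if unit == "inch1200":
        # 1200 units per inch
        return round(value / 1200 * dpi)
    raise ValueError(f"unknown measurement unit: {unit!r}")


if __name__ == "__main__":
    # One inch expressed in each physical unit, mapped to a 300 dpi image.
    print(alto_to_pixels(254, "mm10", 300))
    print(alto_to_pixels(1200, "inch1200", 300))
```

The same function can be applied to every HPOS, VPOS, WIDTH, and HEIGHT value when coordinates are overlaid on an image whose resolution differs from the scan master.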
Why METS/ALTO Conversion?
METS and ALTO have been in use for a number of years. Libraries, universities, newspaper publishers, and newspaper aggregators are familiar with these standards.
METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library. Although METS is excellent at describing the structure of a digital object, it cannot describe the content and layout of each piece of that object. An extension to METS, ALTO (Analyzed Layout and Text Object), is used for this purpose. The combination of METS and ALTO was originally developed by the METAe project and was later adopted by the Library of Congress for its large-scale National Digital Newspaper Program (NDNP). Since then, METS/ALTO has been used in many newspaper digitization projects, both large and small, as well as in a number of projects digitizing books and journals.
A typical METS/ALTO object encodes the complete logical and physical structure of a document (chapters, sections, articles, pages, and so on, with their associated metadata), the full-text content of each section of the document, and the physical coordinates of every word in the document.
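As an illustration of the word-level coordinates mentioned above, the sketch below reads the words and their bounding boxes from a small ALTO fragment using Python's standard library. The sample XML is hand-written and heavily simplified for this example (it is not schema-valid output from any OCR engine), and the helper names `local_name` and `extract_words` are assumptions. Tags are matched by local name so that the same code works regardless of which ALTO namespace version a file declares.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written ALTO fragment (illustrative only).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description><MeasurementUnit>pixel</MeasurementUnit></Description>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Daily" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP/>
            <String CONTENT="News" HPOS="300" VPOS="200" WIDTH="160" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return every String element as a dict with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            words.append(
                {
                    "text": el.get("CONTENT"),
                    "x": float(el.get("HPOS")),
                    "y": float(el.get("VPOS")),
                    "width": float(el.get("WIDTH")),
                    "height": float(el.get("HEIGHT")),
                }
            )
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word)
```

Coordinates like these are what allow a viewer to highlight a search hit on the page image.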
Contemporary Newspaper Conversions & Feeds
Newspaper content offers a rich and valuable resource to any organization, whether for monitoring press coverage or simply for gathering intelligence.
Once online in digital form, newspapers become accessible to a much wider global audience. Users can then search for specific content across titles with a simple search tool.
Over the years, Contentra Technologies has specialized in providing time-driven services to media monitoring agencies, newspaper publishers, content aggregators, and licensing agencies across the UK, Europe, US, and Asia Pacific regions. Contentra utilizes state-of-the-art third-party and in-house software to execute and meet the specific requirements of each of our clients.
Contentra receives some 188 tabloid newspapers (6,000 pages) and 240 contemporary newspaper titles 24/7 as PDF-Normal files via FTP. All these files are converted to NITF-compliant XML using customized XSLT and a user interface. Contentra also provides Kindle- and iPad-compatible outputs. The workflow consists of the following steps:
- Downloading the PDF files
- Allocating the files
- Extracting the data from the PDF files
- Tagging, formatting, and on-screen proofreading
- Validating and parsing against the DTD
- Quality checking
- Uploading the output
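For readers unfamiliar with NITF, the sketch below shows roughly what the tagged output of such a workflow looks like for a single article. It is only an illustration: it builds the XML with Python's standard library instead of the customized XSLT described above, the function name `build_nitf` and the sample article are invented, and it checks only that the result is well-formed. Validating against the NITF DTD would need a validating parser, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET


def build_nitf(headline, byline, paragraphs):
    """Return a minimal NITF-style document as a string (hypothetical helper)."""
    nitf = ET.Element("nitf")

    # <head> carries document-level metadata such as the title.
    head = ET.SubElement(nitf, "head")
    ET.SubElement(head, "title").text = headline

    # <body.head> holds the headline and byline; <body.content> holds the text.
    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    hedline = ET.SubElement(body_head, "hedline")
    ET.SubElement(hedline, "hl1").text = headline
    ET.SubElement(body_head, "byline").text = byline

    content = ET.SubElement(body, "body.content")
    for text in paragraphs:
        ET.SubElement(content, "p").text = text

    return ET.tostring(nitf, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_nitf(
        "Council approves budget",
        "By A. Reporter",
        ["First paragraph of the article.", "Second paragraph."],
    )
    # Re-parsing confirms well-formedness only, not DTD validity.
    ET.fromstring(xml_text)
    print(xml_text)
```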