Chymistry of Isaac Newton

From digitalHPS

DATA FORMATS OF THE CHYMISTRY OF ISAAC NEWTON PROJECT

Organizational Background

The Chymistry of Isaac Newton project is based at Indiana University, Bloomington (IUB) and led by Bill Newman of the Department of the History and Philosophy of Science (HPSC). At IUB, the project team also includes individuals from the School of Library and Information Science, the Digital Library Program (DLP), and the Institute for Digital Arts and Humanities. Other team members come from the Chemical Heritage Foundation and Johns Hopkins. The DLP serves as our repository and provides hosting and long-term storage for our project site.

Our goal is to create a digital edition of Newton’s alchemical manuscripts that presents full transcriptions and manuscript images in a fully searchable web interface. We have transcribed 118 manuscripts and are reviewing and releasing those documents to our public site at www.chymistry.org. Nine documents, including three of the largest in the collection, have already been published on the site. We are planning an official release of at least a dozen more documents in April 2010.

Brief Description of Data Formats

The manuscript transcriptions are recorded in XML documents according to the Text Encoding Initiative (TEI) Guidelines, version P4, and the Unicode Consortium standards. The TEI P4 Guidelines are well documented online at http://www.tei-c.org/index.xml and the Unicode standard is well documented at http://unicode.org/. We have been careful to adhere to both standards, but encoding Newton’s work has raised complications and, to handle those, we have contributed to the modification and expansion of both the TEI and Unicode guidelines.

Any TEI P4 XML document is contained in the document-level <TEI.2> element. Our <TEI.2> element has an @id attribute that identifies the document according to its number in the Sussex catalogue of the manuscripts of Isaac Newton (which is documented at http://www.newtonproject.sussex.ac.uk/prism.php?id=82). This @id always looks like “ALCH00012”, consisting of the ‘ALCH’ prefix for alchemy and a five-digit catalogue number. The document filename is based on the same catalogue number, like ALCH00012.xml.
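The identifier convention is simple enough to check mechanically. The following Python sketch is only an illustration of the pattern described above: the function names make_id and filename_for are hypothetical and not part of the project's code, and the range check assumes the catalogue number never exceeds five digits.

```python
import re

# Pattern taken from the description above: 'ALCH' plus exactly five digits.
ALCH_ID = re.compile(r"ALCH\d{5}")


def make_id(catalogue_number: int) -> str:
    """Build an identifier such as 'ALCH00012' (hypothetical helper)."""
    if not 0 <= catalogue_number <= 99999:
        raise ValueError("catalogue number must fit in five digits")
    return f"ALCH{catalogue_number:05d}"


def filename_for(doc_id: str) -> str:
    """Return the XML filename for a well-formed identifier (hypothetical helper)."""
    if ALCH_ID.fullmatch(doc_id) is None:
        raise ValueError(f"not a valid document id: {doc_id!r}")
    return doc_id + ".xml"
```

For example, make_id(12) gives the identifier quoted above, and filename_for rejects anything that does not match the pattern, such as "ALCH12".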

TEI P4 XML documents are divided into two major elements: the <teiHeader>, containing metadata, and <text>, containing the edited manuscript transcription. We will provide detailed listings and descriptions of the entire data structure as this investigation of interoperability progresses, but some initial detail about a few of the elements is worthwhile at this stage.


Metadata

Document metadata is stored in the <teiHeader> in four elements: <fileDesc>, <encodingDesc>, <profileDesc>, and <revisionDesc>. The <profileDesc> identifies the languages used in the document. The <encodingDesc> and <revisionDesc> record information about the construction of the transcription and may not be interesting beyond our own group, but who knows?

Our <fileDesc> element, on the other hand, includes a title and a responsibility statement identifying the transcribers, and it includes a publication statement that is uniform across the collection. More importantly, perhaps, it has a descendant element, <msDescription>, that lets us include information about the provenance, current repository, and physical condition of each manuscript. TEI P4 did not officially include this element, but P5, which has recently superseded P4, does, and Chymistry project team members participated in the development of the new guidelines. We have incorporated the P5 <msDescription> element into our P4 documents.

As we evolve a shared API, we would expect to be asked to provide services that list all available metadata elements (i.e., all that we have used), and provide listings of the metadata by element or element set, and by document or document set. Because the project already routinely uses XML and XSL we shouldn’t have any major back-end problems in creating web services.
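To make the two kinds of listing concrete, here is a minimal Python sketch of what such a service might compute. It is not the project's implementation (which would more likely use XSL): the sample documents are invented, far smaller than a real header, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented, heavily abbreviated sample documents (not real project data).
SAMPLE_DOCS = {
    "ALCH00001": """<TEI.2 id="ALCH00001"><teiHeader>
        <fileDesc><titleStmt><title>First sample</title></titleStmt></fileDesc>
        <profileDesc><langUsage><language>Latin</language></langUsage></profileDesc>
      </teiHeader><text><body><p/></body></text></TEI.2>""",
    "ALCH00002": """<TEI.2 id="ALCH00002"><teiHeader>
        <fileDesc><titleStmt><title>Second sample</title></titleStmt></fileDesc>
      </teiHeader><text><body><p/></body></text></TEI.2>""",
}


def metadata_paths(xml_text):
    """Map each leaf element path under <teiHeader> to its text values."""
    header = ET.fromstring(xml_text).find("teiHeader")
    found = defaultdict(list)

    def walk(elem, path):
        children = list(elem)
        if not children:  # a leaf: record its (trimmed) text content
            found[path].append((elem.text or "").strip())
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(header, "teiHeader")
    return dict(found)


def elements_in_use(docs):
    """List every metadata path used anywhere in the document set."""
    paths = set()
    for xml_text in docs.values():
        paths.update(metadata_paths(xml_text))
    return sorted(paths)


def values_for(docs, path):
    """List the values of one metadata path, document by document."""
    return {doc_id: metadata_paths(x).get(path, []) for doc_id, x in docs.items()}
```

elements_in_use corresponds to "all that we have used", and values_for to a listing by element across a document set. A document that lacks the element simply reports an empty list.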


Text

In theory, the <text> element can contain <front>, <body>, <group>, and <back> elements but all of our documents have only the <body> element. Inside that <body> element, however, every document is different.

The principal top-level elements are most commonly <div> and <list>, usually containing structural elements <p> and <bibl> (to identify bibliographical information supplied by Newton himself) and <item>, in almost any order. Our separate computational project already extracts full text and partial texts and listings of document contents and we can probably devise an interesting menu of web service options for the API.

We have succeeded in preserving Newton’s own line breaks with <lb/> and <p> elements and his own page break apparatus, including catchwords, in <fw @type="catch"> elements. We use <milestone> and <pb> (page break) elements to mark every folio face, so it is possible to include page numbering in the stream and to report the page location of passages or elements that interest a user.
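Reporting a page location amounts to walking the transcription in reading order and remembering the most recent <pb>. The Python sketch below illustrates the idea under two assumptions that the text above does not confirm: that each <pb> carries the folio label in an @n attribute, and that labels look like "1r" and "1v". The sample passage and function names are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample; the n="1r" style folio labels are an assumption.
SAMPLE = (
    '<body><div><pb n="1r"/><p>Of the vegetation of metals<lb/>'
    'and their growth <pb n="1v"/>under the earth</p>'
    "<p>The green lion</p></div></body>"
)


def text_with_pages(elem, state):
    """Yield (page, text) pairs in reading order, updating the page at each <pb>."""
    if elem.tag == "pb":
        state["page"] = elem.get("n")
    if elem.text:
        yield state["page"], elem.text
    for child in elem:
        yield from text_with_pages(child, state)
        if child.tail:  # text that follows the child, still inside elem
            yield state["page"], child.tail


def pages_containing(xml_text, term):
    """Return the folio labels of the pages on which term occurs."""
    root = ET.fromstring(xml_text)
    pages = []
    for page, chunk in text_with_pages(root, {"page": None}):
        if term.lower() in chunk.lower() and page not in pages:
            pages.append(page)
    return pages
```

The walk is recursive rather than a flat iteration so that text following a <pb> in the middle of a paragraph is attributed to the new page. One limitation of this sketch is that a phrase split across an <lb/> or <pb> will not be matched, because each text chunk is searched separately.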

We have been using <name> and <persName> elements to mark proper names and those could be delivered to users in an index-like form.
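An index-like listing could be produced along the following lines. This Python sketch is only an illustration: the sample passage is invented, the function name is hypothetical, and it groups names by their transcribed spelling without attempting to reconcile variant spellings of the same person.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample passage (not real project data).
SAMPLE = (
    "<body><p><persName>Sendivogius</persName> and "
    "<persName>Philalethes</persName> agree; see also "
    "<name>Sendivogius</name>.</p></body>"
)


def name_index(xml_text):
    """Return (name, occurrences) pairs sorted alphabetically."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        if elem.tag in ("name", "persName"):
            # Join nested text and collapse line breaks and runs of spaces.
            label = " ".join("".join(elem.itertext()).split())
            if label:
                counts[label] += 1
    return sorted(counts.items(), key=lambda pair: pair[0].casefold())
```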

Newton wrote in English, Latin, French, Hebrew, and Greek. Each document identifies the preponderant language and marks passages and fragments in <foreign> elements, so it’s possible to note code-switching in the stream or to extract data according to language.
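Extraction by language might be sketched as follows in Python. The sketch assumes that each <foreign> element names its language in a @lang attribute, as TEI P4 allows, and the language codes and sample passage are invented for illustration. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample; the language codes "lat" and "fra" are assumptions.
SAMPLE = (
    '<body><p>The matter is called <foreign lang="lat">prima materia</foreign> '
    'or <foreign lang="fra">eau de vie</foreign>, also '
    '<foreign lang="lat">chaos</foreign>.</p></body>'
)


def passages_by_language(xml_text):
    """Group the text of <foreign> passages by their language code."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for elem in root.iter("foreign"):
        lang = elem.get("lang", "unknown")  # fall back if no language is given
        grouped[lang].append(" ".join("".join(elem.itertext()).split()))
    return dict(grouped)
```

Text outside any <foreign> element belongs to the document's preponderant language, which this sketch does not look up.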

Our documents are constructed to provide both a diplomatic version and a normalized version of the text from a single XML file through the use of XSLT style sheets. Three elements are central to this organization: the <orig @reg> element, which records the original form used by Newton together with a regularized version that we use to normalize spelling and fix words that were broken over line and page breaks; the <abbr @expan> element, which handles abbreviations; and the <sic @corr> element, which allows us to exercise the editorial sic with its correction.
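The project does this with XSLT, but the selection logic can be illustrated in a few lines of Python. In the sketch below the sample sentence is invented and the function name is hypothetical. The only facts taken from the description above are the three element and attribute pairs.

```python
import xml.etree.ElementTree as ET

# Attribute that holds the normalized reading for each element.
NORMALIZING_ATTR = {"orig": "reg", "abbr": "expan", "sic": "corr"}

# Invented sample sentence (not a real transcription).
SAMPLE = (
    '<p>Take <abbr expan="the">ye</abbr> <orig reg="mercury">mercurie</orig> '
    'and <sic corr="dissolve">disolve</sic> it.</p>'
)


def render(elem, normalized):
    """Return the diplomatic or the normalized text of an element."""
    attr = NORMALIZING_ATTR.get(elem.tag)
    if normalized and attr and elem.get(attr) is not None:
        return elem.get(attr)  # use the editors' reading instead of the content
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, normalized))
        parts.append(child.tail or "")
    return "".join(parts)
```

With the sample above, render(ET.fromstring(SAMPLE), normalized=False) keeps Newton's forms ("Take ye mercurie and disolve it."), while normalized=True substitutes the attribute values ("Take the mercury and dissolve it.").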

We have also marked Newton’s own deletions with <del> elements and his additions with <add>. Deleted text appears in the diplomatic version with overstrikes but is removed from the normalized version. Newton usually added text between lines or in the margin. These additions appear in our diplomatic version as superscripted text with carets, as in the originals, but are simply rendered inline in the normalized version. Web-service streams could pass or suppress <del> text, and mark <add> text on request.
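The pass, suppress, and mark options could be exposed as two switches on a plain-text stream. The following Python sketch shows one way that might work. The parameter names, the caret markers around added text, and the sample sentence are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence (not a real transcription).
SAMPLE = "<p>Dissolve the <del>salt</del> <add>vitriol</add> in water.</p>"


def _render(elem, keep_deleted, mark_added):
    if elem.tag == "del" and not keep_deleted:
        return ""  # suppress the deleted text; its tail is added by the caller
    parts = [elem.text or ""]
    for child in elem:
        parts.append(_render(child, keep_deleted, mark_added))
        parts.append(child.tail or "")
    body = "".join(parts)
    if elem.tag == "add" and mark_added:
        return "^" + body + "^"  # hypothetical marker for added text
    return body


def render_stream(xml_text, keep_deleted=True, mark_added=False):
    """Return plain text with <del> passed or suppressed and <add> optionally marked."""
    root = ET.fromstring(xml_text)
    # Collapse whitespace so that suppressed deletions leave no double spaces.
    return " ".join(_render(root, keep_deleted, mark_added).split())
```

Calling render_stream(SAMPLE, keep_deleted=False) approximates the normalized behaviour described above, and render_stream(SAMPLE, mark_added=True) keeps the deletion while flagging the addition. Collapsing whitespace also discards line structure, which a real stream would probably want to preserve.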

We will probably need or want to provide the user/requestor with some control over which stream, diplomatic or normalized, they want to receive and provide other options.


Alchemical Symbols

Newton used a large number of alchemical symbols, most of which were known and used by a larger community of alchemical authors and practitioners. He also used contemporary abbreviature in English and Latin. We have used the machinery of XML entities and entity files to encode those symbols. The encoders use the entities in working versions of the XML documents, and our XSLT style sheets transform those into TEI <c> elements in the finalized XML versions that are used on the website. The symbols and abbreviature could be marked in a web-service stream either by entity or by <c> element. (We’re already doing something similar in our separate computational project.)
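As a rough picture of the entity-to-<c> step, the Python sketch below rewrites entity references in the raw text of a working document. It is a stand-in for the project's XSLT, not a description of it: the entity names (mercury, sol), the shape of the generated <c> element with @type and @n attributes, and the function name are all assumptions.

```python
import re

# XML's five predefined entities must be left untouched.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*);")


def entities_to_c(raw_xml, known_symbols):
    """Replace known symbol entities with <c> elements (assumed output shape)."""

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in known_symbols:
            return match.group(0)  # leave unknown or predefined entities as-is
        return f'<c type="symbol" n="{name}"/>'

    return ENTITY_REF.sub(replace, raw_xml)
```

For instance, entities_to_c("<p>Take &mercury; &amp; &sol;</p>", {"mercury", "sol"}) rewrites the two symbol entities and leaves &amp; alone. A stream that marks symbols "by entity" would simply skip this step.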

On our websites, we serve the symbols as GIFs, or as font characters if the user has a copy of our Newton OTF font. We create our own open-source fonts for the symbols, and our development site is capable of detecting the presence of the main Newton font and serving characters instead of GIFs, which makes for better kerning and placement. We definitely plan to share those printer-ready OTF and TTF fonts with interested users, but for the immediate future an API could use the store of GIF images without much difficulty. We will probably need to discuss the details of the symbol-rendering process as we move forward.

Unicode doesn’t include code points for the majority of the alchemical symbols, and we have been working with the Unicode Consortium for more than two years to develop a Unicode block devoted to those symbols. We are hoping that they will be included in Unicode 6.0, but at this time we aren’t certain that they will.
