Technologies

From digitalHPS

The Embryo Project Setup

Project Management

Text Markup

  • TextMate text editor
  • SVN versioning, with the Versions app GUI on Mac OS X
  • vim CLI editor

Server Components

Embryo Architecture Diagram.png

Apache

Embryo Project Webapp

  • SVN
  • wiki
  • website


Tomcat

Fedora Repository

XML Catalog Setup

To speed up Fedora's validation of XML documents (and reduce HTTP requests, avoid 503 errors from w3.org, etc.) it helps to set up a local DTD catalog. Fedora doesn't have an easy way to configure this, but we can sneak in the functionality we need by creating a custom DocumentBuilderFactory based on the Xerces DocumentBuilderFactoryImpl:

package edu.asu.lib.fedora.xml;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.xerces.jaxp.DocumentBuilderFactoryImpl;
import org.apache.xml.resolver.tools.CatalogResolver;

/**
 * DocumentBuilderFactory whose builders resolve DTDs and other external
 * entities through an XML catalog.
 */
public class CatalogResolverDocumentBuilderFactory extends
		DocumentBuilderFactoryImpl {

	@Override
	public DocumentBuilder newDocumentBuilder() throws ParserConfigurationException {
		DocumentBuilder db = super.newDocumentBuilder();
		// CatalogResolver reads its settings from CatalogManager.properties on the classpath.
		db.setEntityResolver(new CatalogResolver());
		return db;
	}
}

Compile this class, package it as a JAR, and put it on Fedora's classpath along with resolver.jar from the Apache XML Commons project. Then, in the Tomcat startup script, add the following to JAVA_OPTS:

JAVA_OPTS="$JAVA_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=edu.asu.lib.fedora.xml.CatalogResolverDocumentBuilderFactory"

Finally, create a CatalogManager.properties file (the CatalogResolver class will use this), and put it in Fedora's classpath:

# Add semicolon-delimited catalog files here (see http://en.wikipedia.org/wiki/XML_Catalog)
catalogs=/etc/xml/catalog
relative-catalogs=false
static-catalog=yes
catalog-class=org.apache.xml.resolver.Resolver
verbosity=1
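
To see what the entity resolver hook does without installing resolver.jar, the following Java sketch may help. It does not use the Apache CatalogResolver; it substitutes a hand-written EntityResolver that answers one public identifier from memory, which is the same mechanism the catalog setup relies on. The identifiers, the DTD text, and the class name are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class LocalResolverCheck {

    // Hypothetical identifiers, used only for this illustration.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Note 1.0//EN";
    static final String SYSTEM_ID = "http://example.invalid/dtd/note.dtd";

    // Stand-in for a DTD that a real catalog would map to a local file.
    static final String LOCAL_DTD = "<!ELEMENT note (#PCDATA)>";

    /** Parses xml, records each public ID answered locally, and returns the root element name. */
    static String parseWithLocalDtd(String xml, List<String> resolved) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.setEntityResolver((publicId, systemId) -> {
                if (PUBLIC_ID.equals(publicId)) {
                    resolved.add(publicId);
                    return new InputSource(new StringReader(LOCAL_DTD));
                }
                return null; // anything else falls back to the parser's default lookup
            });
            Document doc = db.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE note PUBLIC \"" + PUBLIC_ID + "\" \"" + SYSTEM_ID + "\">"
                + "<note>hello</note>";
        List<String> resolved = new ArrayList<>();
        String root = parseWithLocalDtd(xml, resolved);
        System.out.println("root=" + root + ", resolved locally=" + resolved);
    }
}
```

If the resolver is never consulted for a DOCTYPE you expect to be covered, the parser is fetching that DTD itself, which is the situation the catalog is meant to prevent.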


Webapp Components

Libraries

  • nusoap:
    • Used for communicating with Fedora SOAP services
  • fedora-client:
    • Built on the nusoap library, it contains objects that facilitate access to Fedora's SOAP APIs, risearch (via SPARQL), and the Solr index. Config.php specifies the configuration of these endpoints.

Query Results Page

(public_html/results.php)

Builds a Solr query based on the keywords entered (adding some boost terms and additional feature requests, like snippet highlighting) and retrieves the results. It then renders a page of results (including pagination) using the results.tpl/resultitem.tpl/results-grid.tpl template files. Each rendered result can be used to access the view page of the result object.
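
The query-building step can be sketched roughly as follows. This is a Java illustration and not the code in results.php. It assumes the standard Solr request parameters q, defType, qf, hl, hl.fl, start, and rows; the field names (title, text), the boost value, and the class name are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class SolrQuerySketch {

    /** Builds a Solr select URL for the given keywords and zero-based page number. */
    static String buildSelectUrl(String solrBase, String keywords, int page, int pageSize) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", keywords);
        params.put("defType", "edismax");   // query parser that accepts per-field boosts
        params.put("qf", "title^4 text");   // hypothetical fields and boost factor
        params.put("hl", "true");           // request highlighted snippets
        params.put("hl.fl", "text");        // hypothetical field to highlight
        params.put("start", String.valueOf(page * pageSize)); // pagination offset
        params.put("rows", String.valueOf(pageSize));

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return solrBase + "/select?" + query;
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrl("http://localhost:8983/solr", "stem cells", 2, 10));
    }
}
```

Keeping the parameters in an ordered map makes it easy to add or drop a feature request (such as highlighting) without touching the URL assembly.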

View Page

(public_html/view)

Renders a full-page view of an object in the repository, including metadata fields and content. The view page takes an object PID (specified in the URL as /view/<PID>) and queries Fedora to determine the object's content type. It then delegates the rendering of the page's main content to a view class (see the includes directory). Below is a basic description of each view class:


Content Model: info:fedora/embryo:Video-CModel
View Class: includes/VideoView.php
Description:
  • Looks up the appropriate streaming video endpoint in the object's VideoStreams datastream
  • Writes a script and target div to the page to hold a JWPlayer SWFObject to play the video
  • Performs a DC to HTML transform and writes the result to the page

Content Model: info:fedora/embryo:Image-CModel
View Class: includes/ImageView.php
Description:
  • Looks up a thumbnail version of the image and adds it to the page, with a link to the Djatoka view of the full image
  • Adds Creative Commons license display elements to the page

Content Model: info:fedora/embryo:Article-CModel
View Class: includes/ArticleView.php
Description:
  • Performs an NLM to HTML transform on the object's Article datastream, writing the results to the page
  • Retrieves and passes a list of active PIDs in to the transform (links to Inactive 'stub' objects in the article are not rendered as links)

Content Model: info:fedora/embryo:Citation-CModel
View Class: includes/CitationView.php
Description:
  • Reads title and type from the object's DC datastream and renders a header
  • Performs a DC to HTML transform and writes the result to the page
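
The delegation from content model to view class amounts to a lookup. The Java sketch below illustrates the idea using the content model URIs and file names listed above; it is not the PHP code of the view page, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ViewDispatchSketch {

    // Content model URI -> view class file, as listed above.
    static final Map<String, String> VIEW_BY_CMODEL = new LinkedHashMap<>();
    static {
        VIEW_BY_CMODEL.put("info:fedora/embryo:Video-CModel", "includes/VideoView.php");
        VIEW_BY_CMODEL.put("info:fedora/embryo:Image-CModel", "includes/ImageView.php");
        VIEW_BY_CMODEL.put("info:fedora/embryo:Article-CModel", "includes/ArticleView.php");
        VIEW_BY_CMODEL.put("info:fedora/embryo:Citation-CModel", "includes/CitationView.php");
    }

    /** Extracts the PID from a request path of the form /view/<PID>. */
    static Optional<String> pidFromPath(String path) {
        String prefix = "/view/";
        if (path == null || !path.startsWith(prefix) || path.length() == prefix.length()) {
            return Optional.empty();
        }
        return Optional.of(path.substring(prefix.length()));
    }

    /** Returns the view class for a content model, or empty if the model is unknown. */
    static Optional<String> viewFor(String contentModelUri) {
        return Optional.ofNullable(VIEW_BY_CMODEL.get(contentModelUri));
    }

    public static void main(String[] args) {
        System.out.println(pidFromPath("/view/embryo:1234"));          // hypothetical PID
        System.out.println(viewFor("info:fedora/embryo:Image-CModel"));
    }
}
```

Returning an empty Optional for an unknown content model leaves it to the caller to decide between an error page and a generic fallback view.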


Ingest scripts

Both ingest scripts have a similar command-line interface and internal operation. I fixed a couple of dependency issues, and the ingest scripts are ready to go. Run them from anywhere on the new machine, as 'ingest_articles' and 'ingest_images' respectively. Add the '-h' flag to view usage information. I added the default Fedora connection info to the scripts, so you shouldn't need to specify it each time you run them.


Basically, each script can either take a single fileset or a directory of filesets (conforming to the expectations that we've set out). Here are some examples:

Ingest a single article: > ingest_articles /my/articles/article-01.xhtml

Ingest all articles in a directory: > ingest_articles /my/articles

Ingest a single image (will pick up image-01.jpf, image-01_dc.xml, and image-01_mods.xml, for example): > ingest_images /my/images/image-01

Ingest all images in a directory: > ingest_images /my/images



Each script takes a list of files and/or directories as arguments and processes them in three phases:

Gather files:
  • Creates a list of the files or filesets that will be used as source data for creating an object and its datastreams, and interactively reports any errors afterwards.
Generate datastreams:
  • Using a temp directory, generates any files that will be uploaded directly as datastreams, and interactively reports any errors afterwards.
Ingest:
  • Takes the datastreams generated in the previous phase and performs inserts or updates on the appropriate (new or existing) Fedora object.

Articles

Dependent upon a libxml2 system catalog (in /etc/xml/catalog, for example) for resolving XHTML and NLM DTDs
Gather files:
  • Gets a list of XHTML files (*.xhtml) from the specified directory/file arguments
Generate datastreams:
  • Parses dc.* properties from meta tags in XHTML header and generates DC
  • Creates RELS-EXT from links in XHTML document body.
  • Transforms XHTML to NLM using an XSL transform
Ingest:
  • Expects an existing stub object, corresponding to the PID in the XHTML 'fedora.PID' meta tag
  • Uploads the generated datastreams to that object
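
The DC generation step for articles can be pictured with the Java sketch below. It is an illustration under assumptions, not the ingest script itself: it assumes that a meta tag named dc.title maps to a dc:title element (and likewise for other dc.* names), it keeps only the last value when a name repeats, and the class and method names are hypothetical. The sample input omits the DOCTYPE so that no DTD lookup is needed.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DcFromXhtmlSketch {

    /** Collects the dc.* and fedora.PID meta tags of an XHTML document as name -> content. */
    static Map<String, String> readMeta(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xhtml)));
            NodeList metas = doc.getElementsByTagNameNS("*", "meta");
            Map<String, String> out = new LinkedHashMap<>();
            for (int i = 0; i < metas.getLength(); i++) {
                Element meta = (Element) metas.item(i);
                String name = meta.getAttribute("name");
                if (name.startsWith("dc.") || name.equals("fedora.PID")) {
                    out.put(name, meta.getAttribute("content")); // a repeated name overwrites
                }
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Serializes the dc.* entries as an oai_dc document; other entries are skipped. */
    static String toDublinCore(Map<String, String> meta) {
        StringBuilder sb = new StringBuilder();
        sb.append("<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\"")
          .append(" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">");
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (!e.getKey().startsWith("dc.")) continue;
            String element = e.getKey().substring("dc.".length());
            sb.append("<dc:").append(element).append(">")
              .append(escape(e.getValue()))
              .append("</dc:").append(element).append(">");
        }
        return sb.append("</oai_dc:dc>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title>"
                + "<meta name=\"dc.title\" content=\"Gastrulation\"/>"
                + "<meta name=\"fedora.PID\" content=\"embryo:1\"/>" // hypothetical PID
                + "</head><body/></html>";
        Map<String, String> meta = readMeta(xhtml);
        System.out.println("target PID: " + meta.get("fedora.PID"));
        System.out.println(toDublinCore(meta));
    }
}
```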

Images

Dependent upon the Kakadu image manipulation libraries and the appropriate paths to those libraries being set in the ingest script.
Dependent upon ImageMagick for thumbnail generation.
Gather files:
  • Expects a MODS metadata file (XXX_mods.xml), a DC metadata file (XXX_dc.xml) and an image file (XXX.(tif|tiff|jpg|jpeg|jp2|jpf|jpx)). If an incomplete file set is provided, or if an unrecognized suffix is encountered, the script will display a warning.
  • If a directory argument is provided, the script expects all files contained in that directory to be part of a fileset (as described above)
  • If a non-directory argument is provided, the script expects it to be the prefix of the files in a fileset (as described above)
Example:
ingest_images /tmp/foo -> (foo_dc.xml, foo_mods.xml, foo.jp2) in directory /tmp
Generate datastreams:
  • Generates a 150x150 thumbnail version of the source image (using Kakadu and/or ImageMagick)
Ingest:
  • Creates a new object each time (don't ingest multiple times without cleanup!)
  • Uploads the generated datastreams to a new object
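
The fileset rules for images can be expressed as two filename patterns. The Java sketch below groups a list of file names by prefix and flags anything that fits neither pattern. It is an illustration of the naming convention described above, not the ingest script; it works on names only, assumes lowercase suffixes, and uses hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageFilesetSketch {

    static final Pattern METADATA = Pattern.compile("(.+)_(dc|mods)\\.xml");
    static final Pattern IMAGE = Pattern.compile("(.+)\\.(tif|tiff|jpg|jpeg|jp2|jpf|jpx)");

    /** One fileset: the three files that share a prefix. */
    static final class Fileset {
        String dc;
        String mods;
        String image;

        boolean isComplete() {
            return dc != null && mods != null && image != null;
        }
    }

    /** Groups file names by prefix; names matching neither pattern are added to unrecognized. */
    static Map<String, Fileset> group(List<String> fileNames, List<String> unrecognized) {
        Map<String, Fileset> sets = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = METADATA.matcher(name);
            if (m.matches()) {
                Fileset fs = sets.computeIfAbsent(m.group(1), k -> new Fileset());
                if (m.group(2).equals("dc")) {
                    fs.dc = name;
                } else {
                    fs.mods = name;
                }
                continue;
            }
            m = IMAGE.matcher(name);
            if (m.matches()) {
                sets.computeIfAbsent(m.group(1), k -> new Fileset()).image = name;
                continue;
            }
            unrecognized.add(name); // candidate for the "unrecognized suffix" warning
        }
        return sets;
    }

    public static void main(String[] args) {
        List<String> unrecognized = new ArrayList<>();
        Map<String, Fileset> sets = group(
                Arrays.asList("foo_dc.xml", "foo_mods.xml", "foo.jp2", "bar.tif", "notes.txt"),
                unrecognized);
        sets.forEach((prefix, fs) ->
                System.out.println(prefix + " complete=" + fs.isComplete()));
        System.out.println("unrecognized=" + unrecognized);
    }
}
```

An incomplete fileset (here, a lone bar.tif) and an unrecognized name (notes.txt) correspond to the two warning cases mentioned in the gather step.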

Vogon

Vogon (http://gobtan.sourceforge.net/) is a desktop application for annotating texts with triplets. The triplets contain what the researcher annotating the text interprets as relevant information. For example, the sentence "Biology is the study of life" could be annotated with the triplet < Biology - is - study of life >. Each part of such a triplet is connected to an ontology so that a system can draw conclusions from the triplets. Triplets are uploaded to a common repository so that they can be searched and analyzed. Currently, Vogon reads PDF files and texts in MEDLINE format (tagged field format). It is implemented in Java and is based on the Eclipse framework. Vogon is released under the Eclipse Public License.

Vogon also includes a component to support the Embryo Project. This component lets users load Word files in DOCX format that are marked up according to the Embryo Project Workflow (e.g. a text consists of title, text, author, and references; links are highlighted; etc.) into Vogon. Vogon then offers the following features:

  • Highlighted terms are recognized as links. For these links, URLs, relationships, and PIDs (Embryo Project specific) can be specified.
  • Lists of relationships can be loaded into Vogon from an Excel-file. These lists can then be used to automatically find relationships that are specified for the links in a text.
  • Lists of PIDs with corresponding keywords can be loaded into Vogon from an Excel-file. Vogon uses these lists to automatically find PIDs for links.
  • Texts can be exported as XHTML. A configurable export format is planned for the future.
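
The automatic PID lookup can be pictured as a keyword table. The Java sketch below is only an illustration of that idea and makes no claim about how Vogon implements it: it assumes exact matching after trimming, whitespace collapsing, and lowercasing, and the class name, keyword, and PID are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class PidLookupSketch {

    private final Map<String, String> pidByKeyword = new LinkedHashMap<>();

    /** Registers one row of a keyword/PID list. */
    void add(String keyword, String pid) {
        pidByKeyword.put(normalize(keyword), pid);
    }

    /** Returns the PID whose keyword equals the link text after normalization, if any. */
    Optional<String> find(String linkText) {
        return Optional.ofNullable(pidByKeyword.get(normalize(linkText)));
    }

    // Assumed matching rule: ignore case and surrounding or repeated whitespace.
    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        PidLookupSketch lookup = new PidLookupSketch();
        lookup.add("Hans Spemann", "embryo:0001"); // hypothetical keyword and PID
        System.out.println(lookup.find("  hans   spemann "));
        System.out.println(lookup.find("unknown term"));
    }
}
```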

A tutorial about how to use Vogon for Embryo Project article editing can be found here.

Max Planck Institute for the History of Science

Web Services and Applications

  • Language Technologies: Donatus and Pollux are currently web application services that provide morphology and dictionary support for many foreign languages.
    • Donatus: Donatus consists of a high-quality morphological database with over 3 million word forms, together with their lemmas and grammatical information, in languages including Arabic, German, English, French, Greek, Italian, Latin, and Dutch. eXist uses Donatus during morphological indexing and querying of documents, so that morphological information can be presented in the query results of the eXist web interface.
    • Pollux: Pollux consists of a dictionary database with over 460,000 entries in 9 dictionaries, in languages including Arabic, English, Greek, Italian, and Latin. On the ECHO site, for example, one can view a text in four modes: (1) Text, (2) Text/Pollux, (3) Image, and (4) XML. The Pollux view provides structural/morphological support for the language of interest, giving in-app definitions and morphological analyses of the text.
  • Digilib (http://digilib.berlios.de/): Digilib is a stateless, web-based client-server application for interactive viewing and manipulation of images. It consists of two parts: the image server component proper, called “Scaler”, and a client-side part that runs in the user's web browser. The user's browser sends an HTTP request for a certain (zoomed, scaled, rotated) image to the Scaler server, and the server returns the image data as an HTTP response. The client-side part consists of HTML and JavaScript code that is itself requested and loaded from a front-end web server into the user's browser:

Digilib.png

  • Arboreal: Arboreal is a Java client application for viewing and comparing XML texts. With the help of Donatus, it also offers special morphological searches, which can be saved for later use.
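
For Digilib, the request that the browser sends to the Scaler can be sketched as a URL builder. This Java sketch is an assumption-laden illustration: the parameter names (fn for the image path, pn for the page number, dw and dh for the target size, wx, wy, ww, wh for the relative zoom area, rot for rotation) follow the Scaler documentation as far as we know and should be checked against the installed digilib version. The host, servlet path, image path, and class name are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DigilibUrlSketch {

    /**
     * Builds a Scaler request for one page of an image directory.
     * wx, wy, ww, wh describe the visible area as fractions of the full image (0..1).
     */
    static String scalerUrl(String scalerBase, String fn, int pn, int dw, int dh,
                            double wx, double wy, double ww, double wh, int rot) {
        return scalerBase
                + "?fn=" + URLEncoder.encode(fn, StandardCharsets.UTF_8)
                + "&pn=" + pn
                + "&dw=" + dw
                + "&dh=" + dh
                + "&wx=" + wx
                + "&wy=" + wy
                + "&ww=" + ww
                + "&wh=" + wh
                + "&rot=" + rot;
    }

    public static void main(String[] args) {
        // Hypothetical server and image path: centre half of page 3, rotated by 90 degrees.
        System.out.println(scalerUrl("http://example.org/digilib/servlet/Scaler",
                "books/euclid", 3, 800, 600, 0.25, 0.25, 0.5, 0.5, 90));
    }
}
```

Because every view is fully described by its URL, the server does not need to keep any per-user state between requests, which is what makes the application stateless.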

Virtual Spaces

Virtual Spaces 2008 (VSpace, http://virtualspaces.sourceforge.net/) is an application for structuring information into "virtual spaces." VSpace enables users to arrange texts, images and videos in 2D graphs and create virtual tours from them. These virtual tours are generated as HTML, PDF or RTF files. VSpace is a Java application that is based on the Eclipse framework. It is released under the Eclipse Public License.


Middleware

  • eXist (http://exist-db.org/) is an open-source XML database that allows efficient and fast indexing and searching via XQuery. eXist supports many (web) technology standards such as XQuery 1.0, XPath 2.0, XSLT 2.0, REST, WebDAV, SOAP and XMLRPC. In its newest release, eXist integrates the Lucene full-text indexing and query system. eXist applications are usually developed as pure XQuery/XSL applications, but they can also be extended with custom Java programs, e.g. for performance reasons. The MPIWG's workflow from scanned books to transcriptions eventually ends in an XML-encoded version of the original source, so it needs an efficient and easily searchable database to house these XML documents. In the context of the architecture of the MPIWG, eXist currently serves as a database repository and document index for its XML documents, which are then queried via web applications for specific information. A future setup will use the eXist database as a middle layer, more akin to a cache, which draws its XML collection from the underlying Fedora database (via eSciDoc).
Architecture eXist

Repositories

  • eSciDoc (https://www.escidoc.org/): eSciDoc is a platform comprising a set of software and services that together make the process of doing "e-research" easier. According to the eSciDoc site, "typical scenarios include storing, manipulating, enriching, disseminating, and publishing not only of the final results of the research process, but of all intermediate steps as well, such as pre-research documents, primary and experimental data, pre-prints, and learning materials."

In the context of the MPIWG, eSciDoc is attractive as a repository solution that can store the digital elements of relevant projects. Because of the services built on top of the Fedora repository, ingesting objects and managing them becomes an easier proposition. Rather than creating, for example, a number of content models for various digital types, eSciDoc takes care of this process; similarly, a single XML document obviates the management of multiple XML files for all the data streams of an object. Other services similarly make the management of PIDs, validation of documents, and storing/retrieving of Dublin Core metadata an easier proposition than simply running a Fedora repository.

Problems, Needs, and Future Directions

Problems

  1. Data mining/text extraction. What is available to do the kinds of data mining that the MPIWG wants to do? The MPIWG would like to do keyword extraction and full-text mining to check similarities between documents and see what has been written (or not written) about a certain topic.
  2. eSciDoc. The MPIWG is waiting on a stable version of eSciDoc that is easy to install and implement. The current version does not provide an easy ingest process, has changed XML namespaces without support for previous versions, and is still not easy enough to install. Once a finalized and easy-to-use build of eSciDoc is available, the MPIWG wants to re-ingest content (from a previous test version of eSciDoc) and start working with the platform.
  3. Migration of Filemaker-centric projects to newer databases.
  4. Providing a repository of shared bibliographies.

Questions

  1. What kinds of services does the MPIWG want to provide for doing "out of the box" digitalHPS?
    1. need to think about storing data, storing triples, programming help to convert extant data into triples (or other data formats), etc.
    2. centralized identifier-providing system, i.e. handles, URIs, DOIs, etc.?
    3. how much programming/development help can these services expect to provide? -- up for negotiation...
  2. For triples
    1. attribution and context matter for many of the historical works that the MPIWG is interested in. how do triples handle these kinds of metadata...RDFS?
    2. so, for every triple, when was it created, by whom, and in what context?
  3. VOGON - what is the potential utility for text mining of MPIWG documents?
  4. Virtual Spaces - how can this and other technologies (e.g., ISEE) be utilized to provide better virtual experiences?


Future Directions

  • Digilib - make newest version with more functionalities available to eSciDoc
  • Searching - with the many data models and formats, want to improve searching so it can effectively and quickly query the different kinds of databases.
  • Indexing for digital books. Within the context of ontology development, creating indexes for digital books is not as straightforward as one might think. Perhaps it is a limitation of Protege and how classes and instances are implemented.
  • Viewer for ECHO
    • add specific text modes
    • add highlighting of searched text in the image scan like Google Books does
    • provide PDF downloads of texts
    • integrate geographical information (GIS)
  • Language Technology
    • build up a language technology competence center especially for old and special languages
    • support more foreign languages
    • support the standard web interface REST
    • integrate encyclopedias such as Wikipedia
    • build up knowledge bases for special areas such as "Albert Einstein" and offer sophisticated knowledge-based query facilities
    • enrich language technology data with multimedia data (images, videos etc.)
    • in viewer: improve browsing and navigation of language technology data (morphologies, dictionaries, encyclopedias, knowledge bases)