Max Planck Institute for the History of Science

Tools and Services

Research at the MPIWG (http://www.mpiwg-berlin.mpg.de/en/index.html) investigates how new categories of thought, proof, and experience have emerged in the centuries-long interaction between the sciences and their ambient cultures. Of especial interest to the digital humanities is the Virtual Laboratory, which is a "platform where historians publish and discuss their research on experimentation in the life sciences, art, and technology." The Virtual Lab consists of essays and images of the experiments, people, technologies, and sites related to scientific studies. Additionally, the MPIWG is a partner in the multi-institutional European Cultural Heritage Online (ECHO, http://echo.mpiwg-berlin.mpg.de/home), an open access digital infrastructure for storing, documenting, and sharing the culture of Europe.

As a major contributor to several projects, the MPIWG develops technologies, tools, and a cyberinfrastructure to promote the history, philosophy, and social studies of science.

Web Services and Applications

  • Language Technologies Donatus and Pollux are currently web application services that provide morphology and dictionary support for many foreign languages.
    • Donatus Donatus consists of a morphological database of over 3 million word forms, with their lemmas and grammatical information, in languages such as Arabic, German, English, French, Greek, Italian, Latin, and Dutch. eXist uses Donatus during morphological indexing and morphological querying of documents, so that morphological information can be presented in the query results of the eXist web interface.
    • Pollux Pollux consists of a dictionary database with over 460,000 entries in 9 dictionaries, covering languages such as Arabic, English, Greek, Italian, and Latin. On the ECHO site, for example, one can view a text in four modes: (1) Text, (2) Text/Pollux, (3) Image, and (4) XML. The Pollux view provides structural/morphological support for the language of interest, offering in-app definitions and morphological analyses of the text.
  • Digilib (http://digilib.berlios.de/) Digilib is a stateless, web-based client-server application for interactive viewing and manipulation of images. It consists of two parts: the image server component proper, called "Scaler", and a client-side part that runs in the user's web browser. The user's browser sends an HTTP request for a certain (zoomed, scaled, rotated) image to the Scaler server, and the server returns the image data as an HTTP response. The client-side part consists of HTML and JavaScript code that has itself been requested and loaded from a front-end web server into the user's browser:

[Figure: Digilib.png]

  • Arboreal Arboreal is a Java client application for viewing and comparing XML texts. It also offers special morphological searches with the help of Donatus, which can be saved for later use.
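
The Digilib request cycle described above can be sketched as a URL builder: because the application is stateless, every request has to carry the full description of the desired view. The following is a minimal sketch, not the Digilib client itself. The server address and image path are hypothetical, and the parameter names (fn, pn, dw, dh, wx, wy, ww, wh, rot) reflect our understanding of the Scaler interface and should be checked against the Digilib documentation.

```python
from urllib.parse import urlencode

# Hypothetical server address; replace with a real digilib installation.
SCALER_BASE = "http://digilib.example.org/digilib/servlet/Scaler"


def scaler_url(path, page=1, width=None, height=None, region=None, rotation=0):
    """Build a Scaler request URL that fully describes one image view.

    region is (x, y, w, h), given as fractions of the full image (0.0 to 1.0).
    Parameter names are assumed from the Scaler interface, not verified here.
    """
    params = {"fn": path, "pn": page}
    if width is not None:
        params["dw"] = width       # target width in pixels
    if height is not None:
        params["dh"] = height      # target height in pixels
    if region is not None:
        x, y, w, h = region
        params.update({"wx": x, "wy": y, "ww": w, "wh": h})
    if rotation:
        params["rot"] = rotation   # rotation in degrees
    return SCALER_BASE + "?" + urlencode(params)


# Page 3 of a (hypothetical) book, centre region, 800 px wide, rotated 90 degrees.
url = scaler_url("/books/example", page=3, width=800,
                 region=(0.25, 0.25, 0.5, 0.5), rotation=90)
print(url)
```

Since no view state is kept on the server, any such URL can be bookmarked or shared and will reproduce the same zoomed, scaled, or rotated image.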

Middleware

  • eXist (http://exist-db.org/) is an open-source XML database that allows efficient and fast indexing and searching via XQuery. eXist supports many web technology standards such as XQuery 1.0, XPath 2.0, XSLT 2.0, REST, WebDAV, SOAP, and XML-RPC. In its newest release, eXist integrates the full-text query and indexing system Lucene. eXist applications are usually developed as pure XQuery/XSL applications, but they can also be extended with custom Java code, e.g. for performance reasons. The MPIWG's workflow from scanned books to transcriptions eventually ends in an XML-encoded version of the original source, which calls for an efficient and easily searchable database to house these XML documents. In the context of the architecture of the MPIWG, eXist currently serves as a database repository and document index for its XML documents, which are then queried via web applications for specific information. A future setup will use the eXist database as a middle layer, more akin to a cache, which draws its XML collection from the underlying Fedora database (via eSciDoc).
[Figure: eXist architecture]
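
As an illustration of how a web application might query eXist over REST, the sketch below builds a request URL that carries an XQuery expression. It is a minimal sketch under stated assumptions: the host, the collection /db/texts, and the element name s are hypothetical, and the _query, _start, and _howmany parameters are those of the eXist REST interface as we understand it.

```python
from urllib.parse import urlencode, quote

# Assumed endpoint of a local eXist installation; adjust to the real server.
EXIST_REST = "http://localhost:8080/exist/rest"


def exist_query_url(collection, xquery, start=1, howmany=10):
    """Build a REST GET URL that runs an XQuery against one collection.

    start and howmany page through the result sequence.
    """
    params = {"_query": xquery, "_start": start, "_howmany": howmany}
    return EXIST_REST + quote(collection) + "?" + urlencode(params)


# Hypothetical query: the first five <s> elements that contain the word "motus".
url = exist_query_url("/db/texts", "//s[contains(., 'motus')]", howmany=5)
print(url)
```

A front-end application would fetch this URL and render the returned XML. In the planned setup, the same query would run against the eXist layer acting as a cache in front of Fedora.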

Repositories

  • eSciDoc (https://www.escidoc.org/) eSciDoc is a platform comprising a set of software and services that together make "e-research" easier. According to the eSciDoc site, "typical scenarios include storing, manipulating, enriching, disseminating, and publishing not only of the final results of the research process, but of all intermediate steps as well, such as pre-research documents, primary and experimental data, pre-prints, and learning materials."

In the context of the MPIWG, eSciDoc is attractive as a repository solution that can store the digital elements of relevant projects. The services built on top of the Fedora repository make ingesting and managing objects easier. Rather than requiring the creation of, for example, a number of content models for various digital types, eSciDoc takes care of this process; similarly, a single XML document obviates the management of multiple XML files for all the data streams of an object. Other services likewise make managing PIDs, validating documents, and storing and retrieving Dublin Core metadata easier than with a plain Fedora repository.

Problems, Needs, and Future Directions

Problems

  1. Data mining/text extraction. What is available to do the kinds of data mining that the MPIWG wants to do? The MPIWG would like to do keyword extraction and full-text mining to check the similarity of documents and see what has (or has not) been written about a certain topic.
  2. eSciDoc. The MPIWG is waiting on a stable version of eSciDoc that is easy to install and deploy. The current version does not provide an easy ingest process, has changed XML namespaces without support for previous versions, and is still not easy enough to install. Once a finalized and easy-to-use build of eSciDoc is available, the MPIWG wants to re-ingest content (from a previous test version of eSciDoc) and start working with the platform.
  3. Migration of FileMaker-centric projects to newer databases.
  4. Providing a repository of shared bibliographies.
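
To make the first problem concrete, the kind of keyword extraction and document comparison meant there can be sketched with nothing but the Python standard library. This is an illustrative baseline (word counts and cosine similarity), not a tool the MPIWG uses. The stopword list and sample texts are invented for the example, and real historical texts would need language-specific lemmatization of the kind Donatus provides.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be language-specific.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "on", "to", "is"})


def term_counts(text):
    """Lowercase the text, split it into runs of letters, and count non-stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # Unicode letters only
    return Counter(w for w in words if w not in STOPWORDS)


def top_keywords(text, n=3):
    """Return the n most frequent non-stopword terms."""
    return [term for term, _ in term_counts(text).most_common(n)]


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Invented sample texts.
doc_a = "The motion of the pendulum and the motion of falling bodies"
doc_b = "Falling bodies and the pendulum"
doc_c = "Dictionaries of Greek and Latin"

print(top_keywords(doc_a, 1))
print(cosine_similarity(doc_a, doc_b))  # shared vocabulary, so greater than zero
print(cosine_similarity(doc_a, doc_c))  # no shared terms
```

Topics that have not been written about would show up as keywords with low similarity to every document in the collection.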

Questions

  1. What kinds of services does the MPIWG want to provide for doing "out of the box" digitalHPS?
    1. Need to think about storing data, storing triples, programming help to convert extant data into triples (or other data formats), etc.
    2. A centralized identifier-providing system, i.e. handles, URIs, DOIs, etc.?
    3. How much programming/development help can these services expect to provide? (Up for negotiation.)
  2. For triples:
    1. Attribution and context matter for much of the historical work that the MPIWG is interested in. How do triples handle these kinds of metadata? RDFS?
    2. So, for every triple: when was it created, by whom, and in what context?
  3. VOGON - what is the potential utility for text mining of MPIWG documents?
  4. Virtual Spaces - how can this and other technologies (e.g., ISEE) be utilized to provide better virtual experiences?
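
The triple question above (when, by whom, and in what context) can be made concrete with a small data-structure sketch: each statement carries its provenance alongside the subject, predicate, and object. This is only one possible shape, offered as an assumption rather than a recommendation. In RDF terms it is closer to what named graphs or reification are used for than to RDFS, which describes vocabularies. All identifiers and names below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    """A triple plus the provenance the questions above ask for."""
    subject: str
    predicate: str
    obj: str
    created_by: str    # who asserted the triple
    created_on: date   # when it was asserted
    context: str       # in what context (e.g. a source or project)


def triples_from(statements, created_by=None, context=None):
    """Return plain (s, p, o) triples, optionally filtered by provenance."""
    return [
        (s.subject, s.predicate, s.obj)
        for s in statements
        if (created_by is None or s.created_by == created_by)
        and (context is None or s.context == context)
    ]


# Two conflicting, hypothetical attributions of the same letter can coexist,
# because each one records who made it and where.
statements = [
    Statement("ex:letter1", "ex:writtenBy", "ex:personA",
              "editor1", date(2010, 1, 15), "ex:catalogueA"),
    Statement("ex:letter1", "ex:writtenBy", "ex:personB",
              "editor2", date(2010, 3, 2), "ex:catalogueB"),
]

print(triples_from(statements, created_by="editor1"))
```

Keeping provenance on every statement lets competing historical claims sit side by side instead of overwriting one another.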


Future Directions

  • Digilib - make the newest version, with more functionality, available to eSciDoc
  • Searching - with the many data models and formats in use, the goal is to improve searching so that it can query the different kinds of databases effectively and quickly.
  • Indexing for digital books. Within the context of ontology development, creating indexes for digital books is not as straightforward as one might think. This may be a limitation of Protégé and of how it implements classes and instances.
  • Viewer for ECHO
    • add specific text modes
    • add highlighting of searched text in the image scan like Google Books does
    • provide PDF downloads of texts
    • integrate geographical information (GIS)
  • Language Technology
    • build up a language technology competence center especially for old and special languages
    • support more foreign languages
    • support a standard REST web interface
    • integrate encyclopedias such as Wikipedia
    • build up knowledge bases for special areas such as "Albert Einstein" and offer sophisticated knowledge-based query facilities
    • enrich language technology data by multimedia data (images, videos etc.)
    • in viewer: improve browsing and navigation of language technology data (morphologies, dictionaries, encyclopedias, knowledge bases)