Embryo Project
The Embryo Project as an Example of the Benefits and Challenges of Applying Informatics Approaches to HPS
Further Reading
- Maienschein, J. and M. Laubichler. 2010. "The Embryo Project: An Integrated Approach to History, Practices, and Social Contexts of Embryo Research" Journal of the History of Biology 43: 1-16. DOI: 10.1007/s10739-009-9204-1.
- Laubichler, M.D., J. Maienschein, and G. Yamashita. 2007. "The Embryo Project and the Emergence of a Digital Infrastructure for History and Philosophy of Science" Annals of the History and Philosophy of Biology 12: 79-96. For a copy of this paper, please contact Manfred Laubichler.
The Embryo Project and Cyberinfrastructure
The Embryo Project is one such project in the area of science studies that has adopted these principles of biomedical informatics and has taken an open access approach: it aims to make available, via an online encyclopedia, all information related to embryo research in its historical, legal, social, technological, and ethical contexts. To accomplish these goals, the Embryo Project has combined scholarly work in HPS with current digital technology. The ASU library, a large research library with a commitment to sustainable digital infrastructure, has been instrumental in setting the parameters for the Project's digital infrastructure. The library's use of the open-source Fedora Commons (http://www.fedora-commons.org/; Fedora Development Team 2005; Lagoze et al 2006) digital repository platform brought the Embryo Project into the larger community of scholarly projects that have adopted Fedora for their work. The Embryo Project has an active team of developers at the library who are committed to developing applications and standards-based workflows that not only benefit the Embryo Project and other library projects, but also feed back into sustaining the growth and development of the Fedora Commons community as a whole.
Yamashita has worked closely with the software development and cyberinfrastructure team at the ASU library and has participated in the weekly Fedora meetings, where the applications, workflows, and technologies that will best meet the needs of the library's diverse projects are planned and developed. These include projects in archaeology, music and theatre, polar research, the history of Arizona, and the Embryo Project, among others. Under the library's direction, the Embryo Project has adopted protocols and developed tools that ensure common standards and practices are followed, and has adopted technologies and platforms that make the Project open, free, and easy to access. All content is open access and under a non-commercial, no-derivative works Creative Commons license (http://creativecommons.org/). All articles are marked up under the National Library of Medicine’s journal article DTD, which ensures long-term sustainability and exchangeability with a broad range of collaborators. Moreover, citations in the Embryo Project are tagged with MODS metadata (http://www.loc.gov/standards/mods/), a schema developed by the Library of Congress for bibliographic entities. Likewise, images are collected and stored in the JPEG2000 format, which is fast becoming a standard for archiving digital images.
The Embryo Project has adopted a set of practices that rely on W3C standards for creating content and storing relationship information among objects, and has taken great care to annotate textual articles with these relationships in mind. Presently, all articles are marked up in the XHTML format and hand-coded with relationship information that is relevant to the various aspects of embryo research. This relationship information is currently drawn from an informal ontology of categories (People, Places, Organization, Contexts, Awards, Concepts, Law, Ethics, Religion, Technology, Experiments, Organisms, and Literature) and is stored as a RELS-EXT datastream in the Fedora repository. Articles (and soon, images and videos) are put into Fedora via a web uploader that the library has developed, which gives control of the ingestion process to the content creators. The information contained within the XHTML document is parsed and stored as separate datastreams in the Fedora repository. The relevant datastreams are then displayed in the encyclopedia via a web browser (see Figure 1).
Figure 1 - Embryo Project workflow for creating, ingesting, and managing objects. Objects are uploaded to the repository via a web interface (currently only articles, but will soon be enabled for images and videos). XHTML transformations take the text and extract and create XML representations that are then stored as datastreams in Fedora, which are disseminated via the website.
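The parsing step in this workflow can be illustrated with a minimal sketch. It assumes, purely for illustration, that each hand-coded annotation is a span element whose class attribute names an ontology category and whose title attribute identifies the related object; the Project's actual markup conventions, identifiers, and uploader are not described here, so every name in the example is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical article markup (an assumption, not the Project's real schema):
# "class" holds the ontology category, "title" holds the related object's id.
ARTICLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body id="article:example">
    <p><span class="People" title="person:example-embryologist">An embryologist</span>
       studied <span class="Concepts" title="concept:gastrulation">gastrulation</span>
       in <span class="Organisms" title="organism:sea-urchin">sea urchin</span> embryos.</p>
  </body>
</html>"""


def extract_relationships(xhtml_text):
    """Return (article_id, category, target_id) for every annotated span."""
    root = ET.fromstring(xhtml_text)
    body = root.find(f"{{{XHTML_NS}}}body")
    article_id = body.get("id")
    relationships = []
    for span in body.iter(f"{{{XHTML_NS}}}span"):
        category = span.get("class")
        target = span.get("title")
        # Skip spans that carry no annotation.
        if category and target:
            relationships.append((article_id, category, target))
    return relationships


relationships = extract_relationships(ARTICLE)
for relationship in relationships:
    print(relationship)
```

Each extracted tuple corresponds to one relationship that could then be written to a separate datastream for the article.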
The Embryo Project and the Semantic Web
Until now, Embryo Project members have devoted most of their time and resources to developing workflows and best practices that address core questions about how to create unique content, how to edit and publish this content, and how to manage this content effectively. Phase One of the Embryo Project, then, has been about creating and managing its own content. This, however, is only the tip of the iceberg. The next stage of the Embryo Project involves the development of an expansive and robust repository of embryo-related information that accesses other repositories and incorporates the vast sources of already-published materials on embryo research. To do this, the Embryo Project requires an informatics approach to mining, extracting, and analyzing content from diverse repositories. Phase Two, the semantic web phase, must therefore meet the following challenges, which all other digital projects in these areas of scholarship also face:
- Ontology development. A formal ontology is critical to any digital project, to organize and structure the relevant terms and their relationships to one another. This ontology will be a necessary component of using semantic web technologies to do text mining and natural language processing of text documents in both the Embryo Project repository and other repositories. We have begun the difficult process of ontology development utilizing common tools in the informatics community (OWL, Protégé), but because ontology development is an ongoing part of any digital project, we require more technical expertise in this area to develop it further (Noy and McGuinness 2001; Horridge et al 2007).
- Text mining and natural language processing (NLP). In order to extract relevant information from text, one needs tools that not only recognize exact terms matched against an ontology, but also perform natural language processing by recognizing words based on structures such as parts of speech (Riloff 1999). This will enable us to access large amounts of textual information and analyze its content computationally (see Hamed and Sarkar forthcoming).
- Working with other repositories and databases. In order to increase both the size and range of objects in our repository, we need the capability of accessing other databases to cull relevant information and populate our repository with either the content or references to the content. In the larger HPS community, there is no federated database comparable to PubMed, which aggregates text sources in the biomedical field. Because there are so many separate databases, many of which are specific to a journal or publisher, many different strategies for accessing the content are required.
- RDF creation, storage, and use. As part of the Embryo Project workflow of writing original articles, we are careful to include relationship information within the source of that article. These relationships are stored as a RELS-EXT datastream that contains the RDF metadata of all the objects and their relationships to one another (Brickley and Guha 2004). We have not yet determined how to utilize this rich source of information, but its use is critical if we are to leverage the power of the semantic web. If the sharing of data, rather than documents, is to be accomplished, we need to develop standards among HPS digital projects so that RDF representations are captured and made available to other projects (Allemang and Hendler 2006). Tools that facilitate the exchange of RDF information as well as its analysis within the scope of the project must also be developed.
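To make the last item more concrete, the following minimal sketch serializes an object's relationships as RDF/XML using only the Python standard library. The relation vocabulary (REL_NS), the predicate names, and the example.org URIs are assumptions made for illustration; the output has the general shape of an RDF description of one object and is not claimed to match the Project's actual RELS-EXT datastreams.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "http://example.org/embryo/relations#"  # hypothetical vocabulary

# Use readable prefixes in the serialized output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rel", REL_NS)


def to_rdf_xml(subject_uri, relations):
    """Serialize (predicate, object_uri) pairs about one subject as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject_uri}
    )
    for predicate, object_uri in relations:
        # One child element per statement: subject --predicate--> object.
        ET.SubElement(
            description,
            f"{{{REL_NS}}}{predicate}",
            {f"{{{RDF_NS}}}resource": object_uri},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical identifiers for an article and the objects it mentions.
rdf_xml = to_rdf_xml(
    "http://example.org/embryo/article/1",
    [
        ("mentionsPerson", "http://example.org/embryo/person/1"),
        ("mentionsOrganism", "http://example.org/embryo/organism/1"),
    ],
)
print(rdf_xml)
```

Because each child element is one subject-predicate-object statement, a representation of this kind could be exchanged with other projects that agree on the vocabulary.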
None of the above problems is unique to the Embryo Project; each is a larger problem in all fields of research. HPS research currently does not have the ability to leverage the informatics tools being used in the life sciences to solve these problems. Moreover, the Embryo Project grant does not cover this kind of work. Still, we have taken initial steps to address each of the issues and to understand the kinds of technologies and expertise that would be necessary to further develop those areas. In general, two main problems need to be addressed: (1) how to computationally analyze large amounts of textual information and (2) how to share the resulting data with others.
Currently, all our articles are marked up and annotated by hand. While this has the advantage of being accurate, it does not scale when hundreds or thousands of articles need to be annotated. NLP tools, like OBO-Annotator (Hamed and Sarkar, forthcoming), are able to analyze text and, when run against an ontology, extract the terms in the text that appear in that ontology. These kinds of NLP tools not only extract ontology terms, but also perform NLP tasks such as sentence tokenization and part-of-speech tagging. The extracted information is stored in a relational database that can be searched, and it can be transformed into RDF triples that are then deposited into a triple store, which is a large repository of statements with the subject-predicate-object schema. Alternatively, these triples can be extracted directly from the text mining process. This triple store is the repository of all data about the relevant content in the articles that were analyzed. It can be queried utilizing languages such as Prolog and SPARQL, and a large, federated triple store can also be created by incorporating triples that have been extracted in other projects. The Encyclopedia of Life, for example, is creating this kind of federated triple store that will contain little bits of information about all species on earth. Researchers on other projects are encouraged to provide their triples to the Encyclopedia of Life, thus increasing both the amount and variety of information on species culled from projects with different interests. In this way, data about a species’ habitat and range can “live” alongside data about a species’ embryology. This promises not only a single pool of all information on species, which is the main goal of the Encyclopedia of Life, but also holds the added promise of creating new knowledge, because queries to the triple store will reveal relationships that were not detectable by any single project or database.
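The annotation step can be sketched in a few lines. This is not OBO-Annotator: it is a simplified dictionary lookup with regex-based sentence splitting and tokenization and no part-of-speech tagging. The mini-ontology, the sample text, and the table layout are all hypothetical.

```python
import re
import sqlite3

# Hypothetical mini-ontology: surface term -> (category, identifier).
ONTOLOGY = {
    "gastrulation": ("Concepts", "concept:gastrulation"),
    "sea urchin": ("Organisms", "organism:sea-urchin"),
    "microscope": ("Technology", "technology:microscope"),
}


def split_sentences(text):
    """Naive sentence tokenization on terminal punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def annotate(text, ontology, max_len=3):
    """Return (sentence_index, term, category, identifier) per ontology hit."""
    hits = []
    for index, sentence in enumerate(split_sentences(text)):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            # Prefer the longest phrase starting at token i.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + n])
                if phrase in ontology:
                    category, identifier = ontology[phrase]
                    hits.append((index, phrase, category, identifier))
                    i += n
                    break
            else:
                i += 1
    return hits


TEXT = ("Gastrulation was observed in the sea urchin. "
        "A microscope was used to record each stage.")
hits = annotate(TEXT, ONTOLOGY)

# Store the annotations in a searchable relational table (in memory here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation "
    "(sentence INTEGER, term TEXT, category TEXT, identifier TEXT)"
)
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", hits)
organisms = conn.execute(
    "SELECT term FROM annotation WHERE category = ?", ("Organisms",)
).fetchall()
print(hits)
print(organisms)
```

A production tool would add part-of-speech tagging and handle inflected forms and synonyms, which a plain dictionary lookup like this one misses.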
The potential of sharing data among projects is now realized, with the further goal of inferring new knowledge from this pool of collective data (Allemang and Hendler 2006). Figure 2 outlines a simplified view of how this informatics approach might work.
Figure 2 - A simplified informatics workflow to analyze text computationally, extract RDF triples, and share data between projects.
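The sharing step of this workflow can be illustrated with a minimal in-memory sketch. It assumes two projects contribute triples under shared identifiers; the triples, predicate names, and the match helper are hypothetical, and a real triple store would be queried with a language such as SPARQL instead of Python pattern matching.

```python
def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Hypothetical triples contributed by two independent projects.
ecology_project = {
    ("species:sea-urchin", "hasHabitat", "habitat:rocky-shore"),
    ("species:newt", "hasHabitat", "habitat:pond"),
}
embryology_project = {
    ("species:sea-urchin", "hasDescribedStage", "stage:gastrula"),
}

# Federation is a set union once both projects share identifiers.
federated = ecology_project | embryology_project

# Everything known about one species, regardless of the source project.
about_sea_urchin = match(federated, subject="species:sea-urchin")

# A cross-project question neither source could answer alone:
# which habitats belong to species with a described embryonic stage?
staged = {s for s, _, _ in match(federated, predicate="hasDescribedStage")}
habitats = sorted(
    o for s, _, o in match(federated, predicate="hasHabitat") if s in staged
)
print(about_sea_urchin)
print(habitats)
```

The second query only works if both projects use the same identifier for the species, which is why shared standards for RDF representations matter.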
The Embryo Project and digitalHPS
Although the Embryo Project's partnership with the ASU library has been fruitful and productive (and has involved a substantial amount of in-kind contributions to the NSF-funded Embryo Project), it has also become clear that in this digital age no single institution, let alone a single library or even two libraries working together, will be able to solve every problem. Technological needs are often very specific and thus require specific sets of skills. Moreover, limited person-power and funds within those service institutions further limit how much time can be spent on any single project.
Therefore the development of a cyberinfrastructure for the science studies community will have to follow the same distributed approach it will ultimately enable. The only way to make substantial and sustainable progress is through an organized community effort, something the Embryo Project team has been organizing with the support of NSF. This approach, however, also requires that we train a small number of interdisciplinary experts who not only understand the opportunities and constraints from both a scholarly and an informatics perspective, but who can also organize and lead interdisciplinary teams of informatics experts and scholars working on these challenges. We therefore propose that, as part of this professional development grant, Yamashita spend an academic year training with two informatics teams, at the Marine Biological Laboratory (MBL) in Woods Hole, Massachusetts and at the Max Planck Institute for the History of Science (MPI) in Berlin, Germany, in order to become this kind of skilled expert and leader in the development of a cyberinfrastructure for the science studies communities.

