White Paper

From digitalHPS

Introduction

Over the last two decades the life sciences have experienced unprecedented growth and developed a whole new explanatory paradigm of systems biology. These conceptual developments have been based on a shift away from simple causal explanations based on clearly identifiable factors, such as individual genes, and toward an emphasis on multiple factors and their interactions. Technological advances, especially in the area of bioinformatics, have played a central and crucial part in these conceptual changes (NCBI). As a consequence, the life sciences are now to a large degree information-based with the relevant information stored in both centralized and distributed databases. Sophisticated search algorithms and queries based on conceptual models and ontologies (in the computer science sense of the word), standardized annotation practices and new generations of relational databases connecting different kinds of data are the foundation of these new forms of life science research (Ouzounis and Valencia 2003).

Scholars in the science studies community (history, philosophy, and sociology of science) have always emphasized complex explanations of historical events, which are mostly presented as historical and richly contextual narratives and have traditionally been the end result of years of individual scholarship. The science studies community has not yet embraced the enormous benefits of the informatics revolution that has transformed the life sciences with respect to the organization of multiple forms of complex data, shared access to these data, searches in distributed relational databases that are organized around standardized practices of database management and the possibilities of digital workbenches for collaborative and distributed research. All these developments have also contributed to robust cyber-infrastructure, which has changed the ways biologists go about their research (Ouzounis 2002).

In other words, the science studies community is missing out on new ways to conduct and organize research and to store, distribute, and analyze information. One of the main consequences of the bioinformatics revolution has been the possibility of large-scale and comparative analyses of data and the integration of detailed experimental research with readily available points of comparison. This strategy has facilitated a bottom-up approach that allows biologists to find patterns of increasing generality. Insofar as one goal of the science studies community is to better understand both individual sciences and science at large in its various contexts (technological, theoretical, historical, social, or political), it too will have to move beyond the particular and focus on general patterns wherever these exist, a goal greatly facilitated by the tools of the informatics revolution.

Informatics in biology and medicine works partly because all participants adhere to a strict set of rules about data representation and publication, including standards for making primary data available for others to use. This is what makes large-scale genomics studies that span multiple species possible. Standardizing the ways information is captured and stored is integral to the effective management and sharing of knowledge. The World Wide Web is itself a standard that allows people to exchange information by creating websites that contain all kinds of content. Similarly, Microsoft’s document format has become a de facto standard in many academic and non-academic circles for creating and sharing textual information.

If one wants to be involved with the larger community that exchanges information in this way, one must make sure that information is captured and represented in the correct format. Meaningful collaborations that share knowledge effectively build upon standardized sets of tools and methodologies. The term “cyber-infrastructure” describes the sets of technologies and methodologies that enable data to be acquired, managed, and analyzed effectively so that disparate projects can share information. For example, in the biomedical field there are libraries of ontologies that describe entities and their relations to one another within specific research domains (Smith et al. 2007). The mission statement of the OBO Foundry expresses the hope that “ontologies will be fully interoperable, by virtue of a common design philosophy and implementation, thereby enabling scientists and their instruments to communicate with minimum ambiguity. In this way the data generated in the course of biomedical research will form a single, consistent, cumulatively expanding, and algorithmically tractable whole” (www.obofoundry.org).

The Embryo Project (http://embryo.asu.edu) is one project in the area of science studies that has adopted these principles of biomedical informatics. It takes an open access approach, aiming to make available via an online encyclopedia all information related to embryo research in historical, legal, social, technological, and ethical contexts. To accomplish these goals, the Embryo Project has combined scholarly work in HPS with cutting-edge digital technology. One example of the Embryo Project's recent technological advances is Test-Case I: Medline queries with a set of specific ontologies and obo-annotator.

Projects

All these projects have come together and identified the following challenges that need to be met before a community-owned and developed cyberinfrastructure can transform the field of science studies:

More specifically, the group also identified the following areas as short-term goals that need to be developed immediately and in a coordinated fashion, as they are foundational for any future cyberinfrastructure. Group members have each committed current resources to work on these areas:


  1. Ontology development. (Embryo Project) A formal ontology is critical to any digital project because it organizes and structures the relevant terms and their relationships to one another. This ontology will be a necessary component of using semantic web technologies for text mining and natural language processing of text documents in both the Embryo Project repository and other repositories. We have begun the difficult process of ontology development using common tools in the informatics community (OWL, Protégé), but because ontology development is a never-ending part of any digital project, we require more technical expertise in this area to develop it further (Noy and McGuinness 2001; Horridge et al. 2007).
  2. Text mining and natural language processing (NLP). (Embryo Project, Archimedes, MPI, Newton, MBL, Stanford Encyclopedia) In order to extract relevant information from text, one must develop tools that not only recognize exact terms matched against an ontology but also perform natural language processing, recognizing words based on structures such as parts of speech (Riloff 1999). This will enable us to access large amounts of textual information and analyze its content computationally (see Hamed and Sarkar forthcoming).
  3. Working with other repositories and databases. (Embryo Project, MPI, MBL) In order to increase both the size and range of objects in our repository, we need the capability of accessing other databases to cull relevant information and populate our repository with either the content or references to the content. In the larger HPS community, there is no federated database comparable to PubMed, which aggregates text sources in the biomedical field. Because there are so many separate databases, many of which are specific to a journal or publisher, many different strategies for accessing the content are required.
  4. RDF creation, storage, and use. (Embryo Project, ASU libraries). As part of the Embryo Project workflow of writing original articles, we are careful to include relationship information within the source of that article. These relationships are stored as a RELS-EXT datastream that contains the RDF metadata of all the objects and their relationships to one another (Brickley and Guha 2004). We have not yet determined how to utilize this rich source of information, but its use is critical if we are to leverage the power of the semantic web. If the sharing of data, rather than documents, is to be accomplished, we need to develop standards among HPS digital projects so that RDF representations are captured and made available to other projects (Allemang and Hendler 2006). Tools that facilitate the exchange of RDF information as well as its analysis within the scope of the project must also be developed.
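
To illustrate how the relationship information described in item 4 might be made usable by other tools, the following is a minimal sketch, not the Embryo Project's actual implementation. It reads a small RDF/XML document of the kind a RELS-EXT datastream could contain and flattens it into subject-predicate-object triples that can then be queried. The sample document, the `http://example.org/` identifiers, the `rel:` vocabulary, and the function names are all assumptions made for illustration. The parser handles only the simple case of `rdf:Description` elements whose properties point to other resources; a full RDF library would be needed for general RDF/XML.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RELS-EXT-style content; identifiers and the "rel" vocabulary
# are invented for this example.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rel="http://example.org/relations#">
  <rdf:Description rdf:about="http://example.org/article/1">
    <rel:isAbout rdf:resource="http://example.org/person/42"/>
    <rel:cites rdf:resource="http://example.org/article/7"/>
  </rdf:Description>
</rdf:RDF>"""

Triple = Tuple[str, str, str]


def extract_triples(rdf_xml: str) -> List[Triple]:
    """Flatten simple RDF/XML into (subject, predicate, object) triples."""
    root = ET.fromstring(rdf_xml)
    triples: List[Triple] = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if subject is None or obj is None:
                continue  # this sketch skips literals and blank nodes
            # ElementTree tags look like "{namespace}local"; the predicate
            # IRI is the namespace followed directly by the local name.
            predicate = prop.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, obj))
    return triples


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Once relationships are available as plain triples, projects can exchange them or load them into a shared triple store without first agreeing on a common repository platform; they would, however, still need to agree on the vocabulary used for the predicates.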