The structure of taxonomic data
Taxonomic papers are generally a synthesis of a limited set of elements, including text descriptions, scientific names and nomenclatural acts, literature references, images, specimen occurrence records, and increasingly DNA sequences. The role of the author is to link specimens (with their associated occurrence records) to nomenclature, express observations and hypotheses as text, and document observations with images and quantities. In traditional taxonomic publishing, all of these elements are merged together into a single document. But what if these elements could simultaneously be released and maintained as discrete data tied to the publication? This would have effects both within and beyond the taxonomic community. With access to data elements in electronic form, data consumers (for example, taxonomists, ecologists, conservationists, molecular biologists) could use the publication in more flexible ways. This could enhance the taxonomic utility of the work, facilitate the recognition of new discoveries, and increase the testability (and hence the scientific quality) of the work. The key to unlocking this potential is semantic tagging.
Semantic tagging is a method of assigning markers, or tags, to a text string so that the meaning of that string is discoverable and readable by computers [5]. Data elements organized and tagged according to accepted standards are trivial to combine. By contrast, reconciling the content of multiple traditional publications on the same group can require dedicated study and effort. Once in parsed form, data are available for recombination and repurposing. Associating data elements with a publication adds a measure of credibility based on the reputation of the authors, the review process of the publication venue, and the date of publication [6]. This contrasts with, for example, the museum-collections-based data aggregation model that currently dominates GBIF (Global Biodiversity Information Facility, http://data.gbif.org), where data credibility rests with the originating institution. To the extent that the corpus of taxonomic literature can be parsed and aggregated by various cybertaxonomic repositories, those repositories become powerful sources of fundamental information about the state of biodiversity knowledge, and powerful tools for meta-analysis and data reuse in a wide range of applications, including public outreach [1, 3, 4, 7, 8].
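As a rough illustration of why tagged text is easy for software to use, the following Python sketch parses a small tagged fragment and pulls out the names and specimen records. The element and attribute names (treatment, taxonName, specimen, catalogNumber, and so on) are hypothetical and are not drawn from any published schema; "Aus bus" and the specimen values are placeholders, not real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag and attribute names are illustrative only,
# and every value below is a placeholder rather than real data.
TAGGED = """
<treatment>
  <taxonName rank="species">Aus bus</taxonName>
  <description>Body length 4.2 mm; elytra black.</description>
  <specimen institution="EXAMPLE" catalogNumber="0001">
    <locality>Example locality</locality>
  </specimen>
</treatment>
"""


def extract_elements(xml_text):
    """Return the tagged names and specimen records found in one document."""
    root = ET.fromstring(xml_text)
    return {
        # Each name is returned with the rank recorded in its tag.
        "names": [(e.get("rank"), e.text) for e in root.iter("taxonName")],
        # Each specimen becomes a small record built from its attributes
        # and its nested locality element.
        "specimens": [
            {
                "institution": s.get("institution"),
                "catalogNumber": s.get("catalogNumber"),
                "locality": s.findtext("locality"),
            }
            for s in root.iter("specimen")
        ],
    }


elements = extract_elements(TAGGED)
```

Because the tags state what each string means, no natural-language interpretation is needed to separate a name from a locality. The same extraction from untagged prose would require a far more elaborate and error-prone parser.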
Once we agree that it would be desirable to have semantically tagged taxonomic data elements digitally interlinked and associated with publications, there are several ways to approach content development. The strategy we focus on is the journal-centered approach, considering both retrospective and prospective content. The retrospective portion involves converting legacy publications from their current formats (for example, print, PDFs) into parsed and digitally distributed content. The prospective portion involves embedding semantic tagging in the editorial production process. It seems unrealistic to expect the large number of individual taxonomists, each publishing a handful of taxonomic papers per year, to keep current with the technology. Publishers of journals with an emphasis on taxonomy, however, are better positioned to develop and maintain efficient and current processes.
The infrastructure is already in place for cybertaxonomic resources to recognize and aggregate content appropriate to their focus from documents semantically tagged according to XML standards [5]. We see a future where major taxonomic journals routinely expose new and legacy content in ways that can be discovered, aggregated, and distributed by a community of cybertaxonomic repositories (Figure 1). The source XML documents might be hosted by individual journals or kept in a common repository such as Plazi (http://plazi.org/). To facilitate the integration of biological information across diverse sources, each digitized publication should be distinguished by a globally unique identifier (GUID), such as a registered DOI (digital object identifier) or LSID (Life Science Identifier), linking the data elements back to the original source, author, and journal [7].
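The role of the identifier in aggregation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record layout, the aggregate function, and the DOI-style strings are all invented for the example and do not describe how any existing repository works.

```python
from collections import defaultdict


def aggregate(publications):
    """Group data elements by class, stamping each with its source GUID."""
    by_class = defaultdict(list)
    for pub in publications:
        for kind, value in pub["elements"]:
            # The GUID travels with every element, so a repository that
            # hosts only one data class can still cite the source paper.
            by_class[kind].append({"value": value, "source": pub["guid"]})
    return dict(by_class)


# Placeholder publications: the identifiers and contents are made up.
publications = [
    {
        "guid": "doi:10.0000/example.1",
        "elements": [("name", "Aus bus"), ("specimen", "EXAMPLE-0001")],
    },
    {
        "guid": "doi:10.0000/example.2",
        "elements": [("name", "Aus cus"), ("image", "figure-1.png")],
    },
]

merged = aggregate(publications)
```

In this sketch a name-oriented repository would read only merged["name"], and an image repository only merged["image"], yet each record still points back to the publication it came from.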
Figure 1. Schematic diagram of data elements found in taxonomic publications and exemplar cybertaxonomic resources appropriate to hosting each data class. Semantic tagging of text elements, images, taxonomic nomenclature, and specimen data can be applied retrospectively to legacy publications using tools such as GoldenGATE (http://plazi.org/?q=GoldenGATE). Tagging can also be part of the prospective production process for new taxonomic manuscripts. Some electronic data elements in new papers (for example, DNA sequences) are currently deposited in online repositories by authors. A registered GUID (globally unique identifier) included in the metadata of all electronic datasets links derivative datasets back to the original source publication. BOLD, Barcode of Life Database (http://www.barcodinglife.com/); DOI, Digital Object Identifier (http://www.doi.org/); Dryad (http://datadryad.org/); EOL, Encyclopedia of Life (http://www.eol.org/); GBIF, Global Biodiversity Information Facility (http://data.gbif.org); GenBank (http://www.ncbi.nlm.nih.gov/genbank/); Global Names Architecture (http://globalnames.org/); IPNI, International Plant Names Index (http://www.ipni.org/); LSID, Life Science Identifier; Morphbank (http://www.morphbank.net); Plazi (http://plazi.org/); XML, Extensible Markup Language (http://www.w3.org/XML/); ZooBank (http://www.zoobank.org/).