This aspect of the Next Generation Phenomics for the Tree of Life grant focuses on extracting phenomic information that has already been published in journal articles and books, i.e. legacy taxonomic literature. To access this information and compile it into a taxon-character matrix format that can be used for phylogenetic analyses, we are using Natural Language Processing (NLP) approaches to develop programs that work across a diverse range of organisms (taxa), including plants, algae, sponges and microorganisms.
We have developed or are developing a set of tools to generate taxon-character matrices from text descriptions: CharaParser, the ETC toolkit, MicroPIE, and Matrix Converter. Two ontologies, PORO and MicrO, support these tools.
- CharaParser was a desktop application and has now become part of the ETC (Explorer of Taxon Concepts) toolkit.
- The ETC toolkit provides software tools to create input files, to parse semi-structured morphological descriptions of any taxon group (powered by CharaParser), and to generate, review, and edit raw matrices.
- MicroPIE is designed to extract physiological characters from microbial taxonomic descriptions and output a raw matrix with up to 30 characters, including minimum, maximum, and optimal growth temperature. MicroPIE is still under development. Users can try it out at the MicroPIE Demo Site.
- Matrix Converter is a desktop application that converts raw matrices to discrete taxon-character matrices to be used in PAUP, TNT, MrBayes, RAxML, and Mesquite.
- PORO: In collaboration with the NSF-supported Phenotype Research Coordination Network, we created an ontology to describe the anatomy of sponges (PORO: the Porifera anatomy ontology, http://www.jbiomedsem.com/content/5/1/39). Among other uses, this ontology is now used by ETC/Text Capture to transform taxonomic descriptions of sponges from text monographs into character matrices suitable for phylogenetic analyses.
- MicrO: We are constructing MicrO, a microbial ontology, i.e. a hierarchical, logical network of microbial phenomic terms. It will be an essential resource for identifying the terms and term synonyms extracted using MicroPIE. MicrO is available at https://github.com/carrineblank/MicrO.
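To make the raw-to-discrete conversion step more concrete, here is a minimal Python sketch of the general idea. It is not Matrix Converter's actual algorithm: the function names `discretize` and `to_nexus`, the rule of assigning state codes by sorted value order, and the taxon names and character values are all assumptions made for illustration. The output is a basic NEXUS DATA block; individual phylogenetics programs may need adjustments to the FORMAT line or a different file format altogether.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw matrix: taxon -> {character -> raw text value, or None if missing}.
RawMatrix = Dict[str, Dict[str, Optional[str]]]


def discretize(raw: RawMatrix) -> Tuple[List[str], Dict[str, str]]:
    """Assign each distinct raw value of a character a state code 0, 1, 2, ...

    Assumption: states are numbered in sorted order of the raw values.
    A real conversion would let a curator define and merge states.
    """
    characters = sorted({c for row in raw.values() for c in row})
    codes: Dict[str, Dict[str, str]] = {}
    for c in characters:
        values = sorted({row[c] for row in raw.values() if row.get(c) is not None})
        if len(values) > 10:
            raise ValueError(f"more than 10 states for character {c!r}")
        codes[c] = {value: str(i) for i, value in enumerate(values)}
    # Missing or unreported values become '?'.
    rows = {
        taxon: "".join(codes[c].get(row.get(c), "?") for c in characters)
        for taxon, row in raw.items()
    }
    return characters, rows


def to_nexus(characters: List[str], rows: Dict[str, str]) -> str:
    """Write the discrete matrix as a minimal NEXUS DATA block."""
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(rows)} NCHAR={len(characters)};",
        '  FORMAT DATATYPE=STANDARD MISSING=? SYMBOLS="0123456789";',
        "  MATRIX",
    ]
    for taxon, states in rows.items():
        # NEXUS taxon names cannot contain unquoted spaces.
        lines.append(f"    {taxon.replace(' ', '_')} {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Made-up example data, not taken from any real description.
raw = {
    "Taxon A": {"cell shape": "rod", "motility": "motile"},
    "Taxon B": {"cell shape": "coccus", "motility": None},
    "Taxon C": {"cell shape": "rod", "motility": "non-motile"},
}
characters, rows = discretize(raw)
print(to_nexus(characters, rows))
```

In this sketch, continuous characters such as growth temperature would first have to be binned into ranges before being passed to `discretize`; how to choose those bins is a curation decision that the code does not attempt to make.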
We are also drawing on the crowd-sourcing power of students in microbiology classes to help test MicroPIE and, ultimately, to use MicroPIE to aid in compiling character matrices for the extensive diversity of bacterial and archaeal taxa. We have initiated the Microbial Phenomics Project (https://microbewiki.kenyon.edu/index.php/Microbial_Phenomics_Project) as a collaborative effort with MicrobeWiki (http://microbewiki.kenyon.edu/index.php/MicrobeWiki), a student-edited microbiology resource that provides descriptions of microbes for the public. In Spring 2014, we ran a “friendly competition” between humans and MicroPIE as a way to evaluate the initial MicroPIE program and make improvements. In Spring 2015, we conducted another experiment.
Collaborative work with other AVAToL-sponsored projects
In addition, we are developing NLP methods to extract host-symbiont relationships and other traits related to species interactions from databases such as GenBank. Our goal is to combine our efforts with those of the AVAToL-sponsored Arbor (www.arborworkflows.com) and OpenTree (http://opentree.wikispaces.com) projects to map traits onto phylogenetic trees derived from DNA sequence data. These combined datasets will allow us to test quantitatively, for example, whether specific traits or interactions have co-evolved, or whether such traits are correlated with increased diversification rates throughout the evolutionary history of Bacteria and Archaea.