WP1: Integrating HPO and ORDO
WP leader : INSERM – Participants : Charité, SickKids, EBI, Garvan Institute
The two most broadly used catalogues of rare diseases are the Orphanet Rare Disease Ontology (ORDO) and the Online Mendelian Inheritance in Man database (OMIM, a catalogue of genes and genetic entities developed by Johns Hopkins University, USA; www.omim.org). Entries in OMIM are genetically defined, whereas entries in ORDO are clinically defined. Typical age of onset, age of death, and the frequency of occurrence of a phenotype feature are not systematically captured by OMIM, whereas they are annotated in the Orphanet database and in its ontological representation, ORDO. Furthermore, Orphanet terms are aligned to OMIM by means of semantic relationships that record whether the match is exact or partial (i.e. from a narrower term to a broader term or the opposite). Orphanet terms are also aligned to other medical terminologies and classifications in use in health records, such as ICD10 and SNOMED CT. The Orphanet identifier, the ORPHA number, is being progressively integrated into health information systems in European countries following the Commission Expert Group for Rare Diseases (CEGRD) recommendation to allow for better identification of RD patients in health systems.
Rare disease frequency information has been shown to improve phenotypic matching. The ontological structure of the Human Phenotype Ontology (HPO) is superior to other methods of computational phenotype analysis; however, the overall performance can be further improved by better and deeper annotation.
The aim of this work package is to formally align the data models used by Orphanet and the HPO team, to extend and refine annotations, and to include them in the Orphanet Rare Disease Ontology (ORDO). Rare disorders in Orphanet, as well as the scientific information attached to them, are available in 7 languages, and more translations are expected in the future since the Orphanet network extends over 40 countries. Exhaustive annotation of ORDO entries with HPO will allow for translation of HPO into other languages, supporting streamlined integration into EHRs and the connection of patient registries across language boundaries. Furthermore, it will make it possible to connect genetic information embedded in, for instance, mutation databases using OMIM to clinical diagnoses of RD.
Orphanet provides phenotypic annotations of the rare diseases in the Orphanet nomenclature using the Human Phenotype Ontology (HPO). HOOM is a module that qualifies each annotation between a clinical entity and a phenotypic abnormality with a frequency and with the notion of diagnostic criterion. In ORDO, a clinical entity is either a group of rare disorders, a rare disorder, or a subtype of a disorder. The "phenomes" branch of ORDO has been refactored as a logical import of HPO, and the HPO-ORDO phenotype-disease annotations have been provided as a series of triples in OBAN format, in which associations, frequency and provenance are modeled.
HOOM is provided as an OWL (Web Ontology Language) file, using the OBAN, Orphanet Rare Disease Ontology (ORDO), and HPO ontological models.
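As a rough illustration of what such an annotation carries, the sketch below models one association (clinical entity, phenotypic abnormality, frequency, diagnostic criterion flag, provenance) and flattens it into subject-predicate-object triples. The class, the `ex:` predicate names, and all identifiers are placeholders invented for this example; they are not the actual IRIs used in HOOM, OBAN, ORDO, or HPO.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Association:
    """One disease-phenotype annotation, loosely following the OBAN idea
    of treating the association itself as a node with its own properties."""
    subject: str                # clinical entity (placeholder identifier)
    obj: str                    # phenotypic abnormality (placeholder identifier)
    frequency: str              # frequency qualifier (hypothetical label)
    diagnostic_criterion: bool  # whether the phenotype is a diagnostic criterion
    provenance: str             # where the annotation comes from


def to_triples(assoc_id: str, a: Association) -> List[Triple]:
    """Flatten an association into triples; predicate names are illustrative."""
    return [
        (assoc_id, "ex:has_subject", a.subject),
        (assoc_id, "ex:has_object", a.obj),
        (assoc_id, "ex:has_frequency", a.frequency),
        (assoc_id, "ex:is_diagnostic_criterion", str(a.diagnostic_criterion).lower()),
        (assoc_id, "ex:has_provenance", a.provenance),
    ]


example = Association(
    subject="ORDO:EXAMPLE_DISORDER",
    obj="HP:EXAMPLE_PHENOTYPE",
    frequency="frequent",
    diagnostic_criterion=False,
    provenance="example-curation-source",
)
triples = to_triples("ex:association_1", example)
for t in triples:
    print(t)
```

Keeping the association as its own node is what lets frequency and provenance be attached to the link rather than to the disease or the phenotype.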
HOOM provides extra possibilities for researchers, pharmaceutical companies and others wishing to co-analyse rare and common disease phenotype associations, or to re-use the integrated ontologies in genomic variant repositories or match-making tools.
HOOM is available here: http://www.orphadata.org/cgi-bin/inc/hoom_orphanet.inc.php
CONTACT Subscribe to the HOOM list : ordo-users.orphanet
WP2: Enriching rare disease knowledge via automated concept recognition
WP leaders : Charité, Garvan Institute ; INSERM Orphanet
Over the course of the last decade, the Bio-Named Entity Recognition field has flourished, in particular in areas like gene/protein mention tagging or gene normalization. Currently, the state-of-the-art F1 scores are in the range of 86%-87% and come closer to 90% if post-processing steps are used. Compared with these areas, recognizing phenotype descriptions poses a series of additional domain-specific challenges, such as:
- use of metaphorical expressions – e.g., bell-shaped thorax, hitchhiker thumb
- use of hedging and various forms of qualifiers – e.g., subtle flattening and squaring of the metacarpal heads, segmentation defects appear to affect L4-S1
- term coordination – e.g., short and broad metacarpals, or short and wide ribs with metaphyseal cupping
- complex intrinsic structure – the lexical structure of phenotype descriptions may take several forms. They may have a canonical form, i.e., a conjunction of well-defined quality-entity pairs (Q: bell-shaped, E: thorax), or a non-canonical form, in which entities and qualities are associated via verbs (e.g., vertebral-segmentation defects are most severe in the cervical region)
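To make the term coordination challenge concrete, here is a deliberately naive sketch that splits a phrase of the form "modifier and modifier head" into two candidate phrases. The function name and the single pattern it handles are assumptions made for illustration; a real concept recognition system needs syntactic analysis rather than token positions.

```python
from typing import List


def expand_coordination(phrase: str) -> List[str]:
    """Split '<mod1> and <mod2> <head...>' into '<mod1> <head...>' and
    '<mod2> <head...>'. Anything that does not fit this pattern is
    returned unchanged as a single-item list."""
    words = phrase.split()
    if "and" not in words:
        return [phrase]
    i = words.index("and")
    # Need one modifier before "and", one after it, and at least one head word.
    if i == 0 or i + 2 >= len(words):
        return [phrase]
    prefix = words[: i - 1]
    left_modifier = words[i - 1]
    right_modifier = words[i + 1]
    head = words[i + 2:]
    return [
        " ".join(prefix + [left_modifier] + head),
        " ".join(prefix + [right_modifier] + head),
    ]


print(expand_coordination("short and broad metacarpals"))
print(expand_coordination("bell-shaped thorax"))
```

The first call yields "short metacarpals" and "broad metacarpals", each of which can then be looked up separately. Phrases with a more complex head, such as "short and wide ribs with metaphyseal cupping", show quickly where this token-based approach stops being adequate.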
Furthermore, previous research on context- or goal-driven mining of phenotype-disorder associations is almost nonexistent. Current approaches are not able to go beyond context-independent co-location. This makes it impossible to identify comparative associations, as found in notes describing differential diagnoses, or causal chains, such as phenotypes causing additional phenotypic manifestations, which in turn may cause other abnormalities.
Peter Robinson and Tudor Groza have developed both the resources required to validate phenotype concept recognition (CR) approaches and an end-to-end CR package tailored to address some of the challenges listed above. We have released the first gold standard HPO corpus, consisting of 228 annotated abstracts and a total of 1,993 annotations. Additionally, we recognised the need for a more structured approach to performing error analysis, and hence proposed and released the HPO test suite package comprising 32 types of test cases corresponding to 2,164 HPO concepts (i.e., around 25% of the entire HPO). Finally, we have also developed the Bio-LarK CR system, an HPO-focused CR approach that aims to achieve high accuracy independently of the underlying textual format. Bio-LarK goes beyond standard CR by successfully detecting and decomposing coordinated terms or non-canonical phenotypes. From an evaluation perspective, the system has outperformed other HPO CR approaches both on the HPO gold standard and on the HPO test suite package.
The complexity of this recognition task is high due to the increased variability in the representation and lexical expression of phenotypes. Furthermore, depending on the underlying source (e.g., scientific literature, clinical reports or EHRs), additional challenges may emerge, since surface forms denoting clinical symptoms may differ. Within this WP, we propose to develop a staged pipeline for the automatic recognition of phenotype descriptions, rare disease names and phenotype-disorder associations. The pipeline will enable a direct and continuous integration channel between scientific literature and the community-driven curation of the Orphanet knowledge base.
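For readers unfamiliar with how a CR system is scored against a gold standard corpus, the following minimal sketch computes precision, recall and F1 under exact matching. The annotation tuple layout (document id, start offset, end offset, concept id) and all values are assumptions for illustration; they do not reproduce the format of the released corpus.

```python
from typing import Set, Tuple

Annotation = Tuple[str, int, int, str]  # (doc_id, start, end, concept_id)


def precision_recall_f1(gold: Set[Annotation], predicted: Set[Annotation]):
    """Exact-match scoring: a prediction counts only if document, span
    and concept all agree with a gold annotation."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder annotations: three of four predictions match the gold set.
gold = {
    ("doc1", 0, 10, "HP:EXAMPLE_A"),
    ("doc1", 20, 35, "HP:EXAMPLE_B"),
    ("doc2", 5, 18, "HP:EXAMPLE_C"),
    ("doc2", 40, 52, "HP:EXAMPLE_D"),
}
predicted = {
    ("doc1", 0, 10, "HP:EXAMPLE_A"),
    ("doc1", 20, 35, "HP:EXAMPLE_B"),
    ("doc2", 5, 18, "HP:EXAMPLE_C"),
    ("doc2", 60, 70, "HP:EXAMPLE_E"),
}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; evaluations may also credit overlapping spans or ontologically related concepts, which would change the numbers.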
WP3: Crowdsourcing Rare Disease knowledge
WP leader : Hospital for Sick Children; Participants: Charité, Orphanet-INSERM, Garvan Institute
While the ontological entities for connecting ORDO and HPO have been established, the available connections are incomplete. Fewer than 500 of the more than 110,000 HPO rare-disease annotations have a frequency qualifier, and the age of onset of each phenotype within the disorder (i.e. when the symptoms appear) is also poorly annotated. This information exists in Orphanet for many diseases and for some disease/phenotype associations (using the Orphanet terminology of phenotypes). It is critical for automated diagnosis assistance and for the identification of causative variants in a patient genome. Collecting this information, however, is problematic due to the non-scalable cost of curating the relevant data for the thousands of ORDO nosological entities from the medical literature. Within WP3 we will work to further improve the ORDO/HPO linkages, and each ontology independently, through the efforts of clinicians who use PhenoTips, as well as citizen scientists, RD patients, and medical students. The popularity of PhenoTips (developed by the SickKids team, with extensive advice from the Charité), which is used in dozens of rare disease studies and clinical centers around the world, offers the opportunity to significantly improve the HPO, ORDO, and the links between the two through motivated end-users who, by contributing to the ontologies, will in turn be able to use the improved entities and links in their clinical work and research. Similarly, we will develop infrastructure that will enable other motivated individuals to contribute to the ORDO/HPO ontologies, as well as mechanisms to evaluate and appropriately weight the quality of their contributions.
Phenotate: an educational crowdsourcing instance is available at http://phenotate.org. The tool allows any user to join and annotate ORDO nosological entities. To allow the largest number of such individuals to participate in the annotation of disorders, a “class” functionality has been created, which allows instructors to assign a group of disorders to a list of students. Each student will be assigned a disorder that has already been annotated by an expert as well as several that have not (without being told which is which). Based on the student's entry for the expert-annotated disease, it will be possible to assign the student a grade (to be used either in an actual classroom, or just to evaluate the quality of that particular annotator), while the other disease annotations will create a larger set of crowdsourced data. While each individual record may be inaccurate, information contained in several records is likely to be correct.
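One way the grading and aggregation idea could work is sketched below: each student is graded by the overlap between their annotation of the control disease and the expert annotation, and those grades then weight their votes on a disease with no expert annotation. The Jaccard measure, the 0.5 threshold, the function names and all identifiers are assumptions chosen for illustration, not a description of how Phenotate is implemented.

```python
from collections import defaultdict
from typing import Dict, Set


def grade(student_terms: Set[str], expert_terms: Set[str]) -> float:
    """Jaccard overlap between a student's and the expert's term sets."""
    union = student_terms | expert_terms
    return len(student_terms & expert_terms) / len(union) if union else 0.0


def aggregate(annotations: Dict[str, Set[str]], grades: Dict[str, float],
              threshold: float = 0.5) -> Set[str]:
    """Keep a term if the grade-weighted share of students proposing it
    reaches the threshold."""
    total_weight = sum(grades[s] for s in annotations)
    if total_weight == 0:
        return set()
    support: Dict[str, float] = defaultdict(float)
    for student, terms in annotations.items():
        for term in terms:
            support[term] += grades[student]
    return {t for t, w in support.items() if w / total_weight >= threshold}


# Control disease with an expert annotation (placeholder term names).
expert = {"term_A", "term_B", "term_C"}
control = {
    "student_1": {"term_A", "term_B", "term_C"},
    "student_2": {"term_A", "term_B"},
    "student_3": {"term_D"},
}
grades = {s: grade(terms, expert) for s, terms in control.items()}

# The same students annotate a disease that has no expert annotation.
unknown = {
    "student_1": {"term_X", "term_Y"},
    "student_2": {"term_X"},
    "student_3": {"term_Z"},
}
consensus = aggregate(unknown, grades)
print(grades)
print(sorted(consensus))
```

In this toy run the third student's grade is zero, so their sole suggestion is dropped, while terms proposed by the better-graded students are kept.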
WP4: Improved algorithms for phenotype-driven rare disease differential diagnostics
WP leader : Charité; Participants: Hospital for Sick Children, Garvan Institute
WP4 will utilize HPO, ORDO, and patient datasets (e.g. PhenomeCentral) for improved diagnostics (Exomiser [https://www.sanger.ac.uk/resources/software/exomiser/] and PhenIX [http://compbio.charite.de/PhenIX/]). We have previously developed BOQA, a Bayesian ontology query algorithm that is designed to exploit information about the frequency of each phenotypic abnormality amongst all patients with a given disease. BOQA uses this information to weight the phenotypic abnormalities of the query. In the original BOQA project, we showed, on the small set of diseases for which sufficient frequency information was available at the time, that information on the frequency at which a phenotype occurs within a disease population can increase the accuracy of the differential diagnosis. However, the amount of frequency data we had for the project was extremely limited (several hundred datapoints).
- Orphamizer: a beta prototype is available online: http://compbio.charite.de/phenomizer_orphanet/
- The BOQA algorithm has been incorporated into the Exomiser (https://github.com/exomiser/Exomiser/pull/229), which is a suite of Java code that can be extended for this purpose.
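The intuition behind frequency weighting can be shown with a much simpler model than BOQA. The sketch below is not BOQA (which uses a Bayesian network over the ontology); it is a plain log-likelihood score in which each observed phenotype contributes the log of its frequency in the candidate disease, or the log of a small assumed false-positive rate `alpha` if the disease is not annotated with it. Disease names, phenotype names, frequencies and `alpha` are all made up for the example.

```python
import math
from typing import Dict, List


def score(query: List[str], phenotype_freqs: Dict[str, float],
          alpha: float = 0.01) -> float:
    """Log-likelihood of the observed phenotypes under one disease.
    Unannotated or zero-frequency phenotypes fall back to alpha."""
    return sum(math.log(max(phenotype_freqs.get(term, 0.0), alpha))
               for term in query)


def rank(query: List[str], diseases: Dict[str, Dict[str, float]],
         alpha: float = 0.01) -> List[str]:
    """Order candidate diseases from best to worst score."""
    return sorted(diseases,
                  key=lambda d: score(query, diseases[d], alpha),
                  reverse=True)


# Both diseases are annotated with both phenotypes, so a method that
# ignores frequency cannot separate them; the frequencies can.
diseases = {
    "disease_A": {"phenotype_1": 0.9, "phenotype_2": 0.8},
    "disease_B": {"phenotype_1": 0.9, "phenotype_2": 0.05},
}
query = ["phenotype_1", "phenotype_2"]
ranking = rank(query, diseases)
print(ranking)
```

This also illustrates why sparse frequency data limits such methods: without the per-disease frequencies, the two candidates above would tie.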
WP5: Project coordination and management
WP leader : INSERM – Participants : Charité, Toronto, Garvan, EBI
The main objective of this WP is to manage the project and to make sure that it is implemented as planned. It will also provide day-to-day administrative support to the partners. The coordination team will also ensure the dissemination of the project achievements through Orphanews (17,000 subscribers), papers and conferences.