WP1: Integrating HPO and ORDO
WP leader : INSERM – Participants : Charité, SickKids, EBI, Garvan Institute
The two most widely used catalogues of rare diseases are the Orphanet Rare Disease Ontology (ORDO) and the Online Mendelian Inheritance in Man database (OMIM, www.omim.org), a catalogue of genes and genetic entities developed by Johns Hopkins University, USA. Entries in OMIM are genetically defined, whereas entries in ORDO are clinically defined. Typical age of onset, age of death, and the frequency of occurrence of a phenotypic feature are not systematically captured by OMIM, whereas they are annotated in the Orphanet database and in its ontological representation, ORDO. Furthermore, Orphanet terms are aligned to OMIM by means of semantic relationships that annotate whether the match is exact or partial (i.e. from a narrower term to a broader term, or the opposite). Orphanet terms are also aligned to other medical terminologies and classifications used in health records, such as ICD10 and SNOMED CT. The Orphanet identifier, the ORPHA number, is being progressively integrated into health information systems in European countries, following the Commission Expert Group for Rare Diseases (CEGRD) recommendation, to allow for better identification of RD patients in health systems.
Rare disease frequency information has been shown to improve phenotypic matching. The ontological structure of the Human Phenotype Ontology (HPO) is superior to other methods of computational phenotype analysis; however, overall performance can be further improved by better and deeper annotation.
The aim of this Work Package is to formally align the data models used by Orphanet and the HPO team in order to extend and refine annotations, and to include them in the Orphanet Rare Disease Ontology (ORDO). Rare disorders in Orphanet, as well as the scientific information attached to them, are available in 7 languages, and more translations are expected in the future since the Orphanet network extends over 40 countries. Exhaustive annotation of ORDO entries with HPO will allow for translation of HPO into other languages, supporting streamlined integration into EHRs and the connection of patient registries across language boundaries. Furthermore, it will make it possible to connect genetic information embedded in, for instance, mutation databases that use OMIM to clinical diagnoses of RD.
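As an illustration of the kind of alignment described above, the following minimal Python sketch models a cross-reference from an Orphanet entry to an external terminology, qualified by whether the match is exact or partial. The class names, field names and identifiers are hypothetical placeholders chosen for this example; they do not reflect the actual ORDO data model or real ORPHA or OMIM numbers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class MatchType(Enum):
    """Qualifier of an alignment (hypothetical labels, not ORDO's own)."""
    EXACT = "exact"
    NARROWER_TO_BROADER = "narrower_to_broader"  # source term is narrower than target
    BROADER_TO_NARROWER = "broader_to_narrower"  # source term is broader than target


@dataclass(frozen=True)
class CrossReference:
    orpha_id: str       # placeholder Orphanet identifier
    target_source: str  # e.g. "OMIM", "ICD10", "SNOMED CT"
    target_id: str      # placeholder identifier in the target terminology
    match: MatchType


def exact_targets(xrefs: Iterable[CrossReference], orpha_id: str, source: str) -> List[str]:
    """Return the identifiers in `source` that are exact matches for `orpha_id`."""
    return sorted(
        x.target_id
        for x in xrefs
        if x.orpha_id == orpha_id
        and x.target_source == source
        and x.match is MatchType.EXACT
    )


# Made-up example data: one exact and one partial alignment to OMIM.
example_xrefs = [
    CrossReference("ORPHA:EXAMPLE-1", "OMIM", "OMIM:EXAMPLE-A", MatchType.EXACT),
    CrossReference("ORPHA:EXAMPLE-1", "OMIM", "OMIM:EXAMPLE-B", MatchType.BROADER_TO_NARROWER),
    CrossReference("ORPHA:EXAMPLE-1", "ICD10", "ICD10:EXAMPLE-C", MatchType.NARROWER_TO_BROADER),
]

exact_omim = exact_targets(example_xrefs, "ORPHA:EXAMPLE-1", "OMIM")
```

Keeping the match qualifier on each cross-reference lets downstream tools decide whether a partial alignment is acceptable for a given use, for example when linking a mutation database keyed on OMIM to a clinical diagnosis.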
WP2: Enriching rare disease knowledge via automated concept recognition
WP leaders : Charité, Garvan Institute ; INSERM Orphanet
Over the course of the last decade, the Bio-Named Entity Recognition field has flourished, in particular in areas such as gene/protein mention tagging and gene normalization. Currently, state-of-the-art F1 scores are in the range of 86%-87% and come closer to 90% if post-processing steps are used. In contrast to these areas, recognizing phenotype descriptions poses a series of additional domain-specific challenges (some of which are shown in Fig. 2 below), such as:
- use of metaphorical expressions – e.g., bell-shaped thorax, hitchhiker thumb
- use of hedging and various forms of qualifiers – e.g., subtle flattening and squaring of the metacarpal heads, segmentation defects appear to affect L4-S1
- term coordination – e.g., short and broad metacarpals, or short and wide ribs with metaphyseal cupping
- complex intrinsic structure – the lexical structure of phenotype descriptions may take several forms. They may have a canonical form, i.e., a conjunction of well-defined quality-entity pairs (Q:bell-shaped – E:thorax), or a non-canonical form, in which entities and qualities are associated via verbs (e.g., vertebral-segmentation defects are most severe in the cervical region)
Fig. 2. Example of phenotype–disorder associations in a differential diagnostic description. P denotes phenotypes; D denotes diseases.
Furthermore, previous research on context- or goal-driven mining of phenotype-disorder associations is almost nonexistent. Current approaches are not able to go beyond context-independent co-location. This makes it impossible to identify comparative associations, as found in notes describing differential diagnoses, or causal chains, such as phenotypes causing additional phenotypic manifestations, which in turn may cause other abnormalities.
Peter Robinson and Tudor Groza have developed both the resources required to validate phenotype concept recognition (CR) approaches and an end-to-end CR package tailored to address some of the challenges listed above. We have released the first gold standard HPO corpus, consisting of 228 annotated abstracts and a total of 1,993 annotations. Additionally, we recognised the need for a more structured approach to error analysis, and hence proposed and released the HPO test suite package, comprising 32 types of test cases corresponding to 2,164 HPO concepts (i.e., around 25% of the entire HPO). Finally, we have also developed the Bio-LarK CR system, an HPO-focused CR approach that aims to achieve high accuracy independently of the underlying textual format. Bio-LarK goes beyond standard CR by successfully detecting and decomposing coordinated terms and non-canonical phenotypes. In evaluation, the system has outperformed other HPO CR approaches both on the HPO gold standard and on the HPO test suite package.
The complexity of this recognition task is high because of the increased variability in the representation and lexical expression of phenotypes. Furthermore, depending on the underlying source (e.g., scientific literature, clinical reports or EHRs), additional challenges may emerge, since the surface forms denoting clinical symptoms may differ. Within this WP, we propose to develop a staged pipeline for automatic recognition of phenotype descriptions, rare disease names and phenotype-disorder associations. The pipeline will enable a direct and continuous integration channel between scientific literature and the community-driven curation of the Orphanet knowledge base.
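To make the idea of decomposing coordinated terms concrete, the following Python sketch splits a phrase of the form "quality and quality entity" into one quality-entity pair per quality. It is a deliberately naive, pattern-based illustration written for this text, not the method used by Bio-LarK; the function name and the regular expression are assumptions made for the example.

```python
import re
from typing import List

# Matches "<quality> and|or <quality> <entity...>", e.g. "short and broad metacarpals".
# Naive assumption: each quality is a single (possibly hyphenated) word.
_COORDINATION = re.compile(
    r"(?P<q1>[\w-]+)\s+(?:and|or)\s+(?P<q2>[\w-]+)\s+(?P<entity>.+)",
    re.IGNORECASE,
)


def decompose_coordination(phrase: str) -> List[str]:
    """Expand a coordinated phenotype phrase into separate quality-entity phrases.

    Phrases that do not follow the simple coordinated pattern are
    returned unchanged as a single-element list.
    """
    text = phrase.strip()
    match = _COORDINATION.fullmatch(text)
    if match is None:
        return [text]
    entity = match.group("entity")
    return [f"{match.group('q1')} {entity}", f"{match.group('q2')} {entity}"]


coordinated = decompose_coordination("short and broad metacarpals")
uncoordinated = decompose_coordination("bell-shaped thorax")
```

A pattern of this kind fails on most real clinical text (multi-word qualities, nested coordination, trailing modifiers), which is one reason a dedicated concept recognition system is needed.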
WP3: Crowdsourcing Rare Disease knowledge
WP leader : Hospital for Sick Children; Participants: Charité, Orphanet-INSERM, Garvan Institute
While the ontological entities for connecting ORDO and HPO have been established, the available connections are incomplete. Fewer than 500 of the more than 110,000 HPO rare-disease annotations have a qualifier for frequency, and the age of onset of each phenotype within the disorder (i.e. when the symptoms appear) is also poorly annotated. This information does exist in Orphanet for many diseases and for some disease/phenotype associations (using the Orphanet terminology of phenotypes). It is critical for automated diagnosis assistance and for the identification of causative variants in a patient genome. Collecting this information, however, is problematic because the cost of curating the relevant data from the medical literature for the thousands of ORDO nosological entities does not scale. Within WP3 we will work to further improve the ORDO/HPO linkages, and each ontology independently, through the efforts of clinicians who use PhenoTips, as well as citizen scientists, RD patients, and medical students. The popularity of PhenoTips (developed by the SickKids team, with extensive advice from the Charité), which is used in dozens of rare disease studies and clinical centers around the world, offers the opportunity to significantly improve the HPO, ORDO, and the links between the two through motivated end-users who, by contributing to the ontologies, will in turn be able to use the improved entities and links in their clinical work and research. Similarly, we will develop infrastructure that will enable other motivated individuals to contribute to the ORDO/HPO ontologies, as well as mechanisms to evaluate and appropriately weight the quality of their contributions.
WP4: Improved algorithms for phenotype-driven rare disease differential diagnostics
WP leader : Charité; Participants: Hospital for Sick Children, Garvan Institute
WP4 will utilize HPO, ORDO, and patient datasets (e.g. PhenomeCentral) for improved diagnostics (Exomiser [https://www.sanger.ac.uk/resources/software/exomiser/] and PhenIX [http://compbio.charite.de/PhenIX/]). We have previously developed BOQA, a Bayesian ontology query algorithm designed to exploit information about the frequency of each phenotypic abnormality amongst all patients with a given disease. BOQA uses this information to weight the phenotypic abnormalities of the query, and we showed, using the small set of diseases for which sufficient frequency information was available at the time, that this improves search results. The original BOQA project thus demonstrated that information on the frequency at which a phenotype occurs within a disease population can increase the accuracy of the differential diagnosis. However, the amount of frequency data available for that project was extremely limited (several hundred data points).
WP5: Project coordination and management
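The effect of frequency information on ranking can be illustrated with a small Python sketch. It scores each disease by a simple log-likelihood in which every annotated phenotype is treated as independently present with its annotated frequency. This is a simplified stand-in written for this text, not the BOQA algorithm, which additionally propagates information through the ontology structure; the function names, the noise parameter and all identifiers and frequencies below are hypothetical.

```python
import math
from typing import Dict, List, Set


def frequency_weighted_score(query: Set[str],
                             disease_freqs: Dict[str, float],
                             noise: float = 0.01) -> float:
    """Log-likelihood of the observed query phenotypes under one disease.

    disease_freqs maps each annotated phenotype to the fraction of patients
    showing it. Frequencies are clamped to [noise, 1 - noise] to avoid log(0).
    """
    score = 0.0
    for term, freq in disease_freqs.items():
        p = min(max(freq, noise), 1.0 - noise)
        # Present phenotypes contribute log(p); absent ones contribute log(1 - p).
        score += math.log(p) if term in query else math.log(1.0 - p)
    # Query phenotypes not annotated to the disease are treated as noise.
    unexplained = len(set(query) - set(disease_freqs))
    score += unexplained * math.log(noise)
    return score


def rank_diseases(query: Set[str],
                  annotations: Dict[str, Dict[str, float]]) -> List[str]:
    """Return disease identifiers ordered from best to worst score."""
    return sorted(
        annotations,
        key=lambda d: frequency_weighted_score(query, annotations[d]),
        reverse=True,
    )


# Made-up annotations: both diseases share the same two phenotypes
# and differ only in how frequently each one occurs.
example_annotations = {
    "DISEASE_A": {"PHENO_1": 0.9, "PHENO_2": 0.1},
    "DISEASE_B": {"PHENO_1": 0.1, "PHENO_2": 0.9},
}

ranking = rank_diseases({"PHENO_1"}, example_annotations)
```

In this toy example the two diseases are indistinguishable without frequencies, since they are annotated with the same phenotypes; the frequency values alone determine the ranking, which is why richer frequency annotation matters for differential diagnosis.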
WP leader : INSERM – Participants : Charité, Toronto, Garvan, EBI
The main objective of this WP is to manage the project and to make sure that it is implemented as planned. It will also provide day-to-day administrative support to the partners. The coordination team will also ensure the dissemination of the project achievements through Orphanews (17,000 subscribers), papers and conferences.