Semantic Similarity in Ontologies
A primary feature of the biomedical term service is semantic similarity querying within ontologies. This feature powers the subject-level query in Cafe Variome 3: a subject can be returned based on how similar it is to the search query, which mitigates the problem of patients being annotated with different terms for the same condition.
Semantic Similarity Model
The semantic similarity algorithm used in the system is the Relevance method, proposed by Schlicker et al. in 2006 (Schlicker, A., Domingues, F.S., Rahnenführer, J. et al. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302 (2006). https://doi.org/10.1186/1471-2105-7-302). Small adjustments have been made to accelerate the calculation, but the core concept remains the same.
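By default, the Resnik Information Content (IC) is used, calculated as:
$$IC(t) = -\log p(t)$$
Where $p(t)$ is the probability of encountering term $t$ in the annotation corpus:
$$p(t) = \frac{freq(t)}{N}$$
Where $freq(t)$ is the number of annotations on $t$ and on its descendants, and $N$ is the total annotation count.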
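In the example diagram, the number on each node represents the number of annotations on that node. Counting every annotation once, the total annotation count would be the sum of these numbers. However, to speed up calculation and to reflect the multiple-inheritance nature of the ontology, the annotation count of a term is propagated along every parent link, so the total used by the system can be larger than this sum. This is because D descends from both B and C, so its annotation count is added to both B and C.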
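The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the four-term ontology, the annotation counts, and the helper names `freq` and `ic` are all hypothetical, and it assumes that a term's count is passed on to every parent recursively, so that D is counted once through B and once through C.

```python
import math
from functools import lru_cache

# Hypothetical toy ontology (parent -> children); A is the root and
# D has two parents, B and C.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Hypothetical number of direct annotations on each term.
annotations = {"A": 1, "B": 2, "C": 3, "D": 4}


@lru_cache(maxsize=None)
def freq(term):
    """Own annotations plus the propagated counts of every child.

    Assumption: counts are summed along every parent link, so a term
    with two parents contributes to both of them.
    """
    return annotations[term] + sum(freq(child) for child in children[term])


def ic(term, root="A"):
    """Resnik information content: -log of the term's relative frequency."""
    return -math.log(freq(term) / freq(root))


for term in children:
    print(term, freq(term), round(ic(term), 3))
```

With these made-up counts the direct annotations sum to 10, while the propagated total at the root is 14, because the 4 annotations on D reach the root through both B and C. Summing child totals needs only one pass over the graph and no set operations on descendants, which is presumably where the speed-up comes from.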
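Based on the IC calculation, Lin's similarity between two terms $t_1$ and $t_2$ is:
$$sim_{Lin}(t_1, t_2) = \frac{2 \cdot IC(t_{MICA})}{IC(t_1) + IC(t_2)}$$
Where $t_{MICA}$ is the most informative common ancestor of $t_1$ and $t_2$, that is, the common ancestor with the highest IC.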
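As a rough illustration, the sketch below computes Lin's similarity and then the Relevance similarity on a toy ontology. In the formulation of Schlicker et al., Relevance weights Lin's similarity by $1 - p(c)$ for a common ancestor $c$ and takes the maximum over all common ancestors. The sketch evaluates only the most informative common ancestor, which is a simplifying assumption and not necessarily the adjustment made in the system. The ontology, the frequencies, and all function names are hypothetical.

```python
import math

# Hypothetical toy ontology (child -> parents); A is the root.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["B"]}

# Hypothetical propagated annotation frequencies; freq["A"] is the total.
freq = {"A": 20, "B": 10, "C": 7, "D": 4, "E": 4}


def p(term):
    """Probability of a term: its frequency relative to the root."""
    return freq[term] / freq["A"]


def ic(term):
    """Resnik information content."""
    return -math.log(p(term))


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(t1, t2):
    """Most informative common ancestor: the shared ancestor with highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def lin(t1, t2):
    """Lin's similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denominator = ic(t1) + ic(t2)
    if denominator == 0:
        # Both terms are the root; returning 1.0 is a convention chosen here.
        return 1.0
    return 2 * ic(mica(t1, t2)) / denominator


def relevance(t1, t2):
    """Lin's similarity weighted by 1 - p(MICA), evaluated at the MICA only."""
    return lin(t1, t2) * (1 - p(mica(t1, t2)))


print(mica("D", "E"), round(lin("D", "E"), 3), round(relevance("D", "E"), 3))
```

The weighting factor lowers the score of term pairs whose only shared ancestor is a very general term, since $p$ is close to 1 for terms near the root.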
This value is pre-calculated and stored in Neo4j for fast retrieval. Because most similarity scores are close to 0, scores below 0.2 are not stored; a query therefore never returns term pairs whose similarity is below 0.2. It is recommended to set the similarity threshold above 0.6, otherwise a large number of terms will be returned.
Corpus Usage and Intrinsic IC
Because the ontologies loaded into the system have more than one annotating corpus, we selected the following annotation for each ontology:
HPO: Gene-Phenotype annotation from the HPO consortium
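For ontologies that lack an annotation or mapping of sufficient quality for similarity calculation, intrinsic IC is used. Intrinsic IC is derived solely from the topological structure of the ontology, without any external annotation corpus. The intrinsic IC used in the system was proposed by Sánchez et al. in 2011 (Sánchez D, Batet M, Isern D. Ontology-based information content computation. Knowledge-Based Syst 2011;24:297–303.), calculated as:
$$IC(t) = -\log\left(\frac{\frac{|leaves(t)|}{|subsumers(t)|} + 1}{max\_leaves + 1}\right)$$
Where $leaves(t)$ is the set of leaf terms that descend from $t$, $subsumers(t)$ is the set of ancestors of $t$ including $t$ itself, and $max\_leaves$ is the total number of leaf terms in the ontology.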
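The intrinsic IC of Sánchez et al. takes the ratio of the number of leaf terms below a term to the number of its subsumers (its ancestors plus the term itself), adds 1, normalises by the total number of leaves plus 1, and takes the negative logarithm. The following Python sketch shows the calculation on a small example. The ontology and all function names are hypothetical, and the code is not taken from the system.

```python
import math

# Hypothetical toy ontology (parent -> children); A is the root.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["D"], "D": [], "E": []}

# Inverted map (child -> parents), derived from the map above.
parents = {t: [p for p, cs in children.items() if t in cs] for t in children}


def reachable(term, edges):
    """Return every term reachable from `term` via `edges`, excluding `term`."""
    found = set()
    stack = list(edges[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(edges[node])
    return found


def leaf_count(term):
    """Number of leaf terms among the strict descendants of `term`."""
    return sum(1 for d in reachable(term, children) if not children[d])


def subsumer_count(term):
    """Number of ancestors of `term`, counting the term itself."""
    return len(reachable(term, parents)) + 1


# Total number of leaf terms in the ontology.
MAX_LEAVES = sum(1 for t in children if not children[t])


def intrinsic_ic(term):
    """Intrinsic IC following Sánchez et al. (2011)."""
    ratio = leaf_count(term) / subsumer_count(term)
    return -math.log((ratio + 1) / (MAX_LEAVES + 1))


for term in children:
    print(term, round(intrinsic_ic(term), 3))
```

Under this definition the root has an IC of 0, and every leaf term receives the maximum value, $\log(max\_leaves + 1)$, because a leaf has no leaf descendants.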