Biomedical Term Service Help

Semantic Similarity in Ontologies

A primary feature of the biomedical term service is semantic similarity querying within ontologies. This feature powers the subject-level query in Cafe Variome 3: a subject can be returned based on how similar it is to the search query, which mitigates the problem of patients being annotated with different terms for the same condition.

Semantic Similarity Model

The semantic similarity algorithm used in the system is the Relevance method, proposed by Schlicker et al. in 2006 (Schlicker, A., Domingues, F.S., Rahnenführer, J. et al. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302 (2006). https://doi.org/10.1186/1471-2105-7-302). Small adjustments have been made to accelerate the calculation, but the core concept remains the same.

By default, the Resnik Information Content ($IC$) of a term $t$ is defined as:

$$IC(t) = -\log p(t)$$

where $p(t)$ is the probability of term $t$ appearing in the corpus. The corpus is another term set used to annotate the terms, and the probability is calculated as:

$$p(t) = \frac{n_t}{N}$$

where $n_t$ is the number of annotations of term $t$, and $N$ is the total number of annotations in the corpus. However, in this implementation, $N$ is not the total number of annotations, but the sum of annotations on the nodes, propagated from the leaves to the root. Consider the following DAG:

[Figure: Sample DAG for IC]

Where the number on each node represents the number of annotations on that node. The total annotation count in the diagram is:

However, to speed up the calculation and to reflect the multiple-inheritance nature of the ontology, $N$ is calculated as:

This is because D has 2 ancestors, so its annotation count is added to both B and C.
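The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the DAG shape (A is the root, B and C are its children, D is a child of both B and C) and the annotation counts are hypothetical and are not the values from the figure. The sketch also assumes that the propagated count is used for both $n_t$ and $N$.

```python
import math
from functools import lru_cache

# Hypothetical DAG, stored as parent -> children (assumed shape, not the figure's).
CHILDREN = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Hypothetical direct annotation counts per node.
DIRECT = {"A": 1, "B": 2, "C": 3, "D": 4}


@lru_cache(maxsize=None)
def propagated(term):
    """Own annotations plus the propagated counts of all children."""
    return DIRECT[term] + sum(propagated(child) for child in CHILDREN[term])


def information_content(term, root="A"):
    """IC(t) = -log(n_t / N), with N taken as the propagated count at the root."""
    return -math.log(propagated(term) / propagated(root))


plain_total = sum(DIRECT.values())   # every annotation counted once
propagated_total = propagated("A")   # D is counted via both B and C
```

With these hypothetical counts, `plain_total` is 10 while `propagated_total` is 14, because D's 4 annotations reach the root through both B and C. The root has an IC of 0, and IC increases toward the leaves.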

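Based on the IC calculation, Lin's similarity between two terms $t_1$ and $t_2$ is defined as:

$$\mathrm{sim}_{Lin}(t_1, t_2) = \frac{2 \cdot IC(t_{MICA})}{IC(t_1) + IC(t_2)}$$

where $t_{MICA}$ is the most informative common ancestor of $t_1$ and $t_2$. The Relevance method has one extra relevance factor:

$$\mathrm{sim}_{Rel}(t_1, t_2) = \mathrm{sim}_{Lin}(t_1, t_2) \cdot \left(1 - p(t_{MICA})\right)$$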
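As an illustration, the sketch below computes Lin's similarity and the Relevance similarity for a pair of terms, taking the Relevance score as Lin's score multiplied by one minus the probability of the most informative common ancestor. The term probabilities and ancestor sets are hypothetical, and the function names are chosen for this example only; the service itself reads pre-calculated values rather than computing them this way.

```python
import math

# Hypothetical term probabilities p(t); the root A has p = 1.
P = {"A": 1.0, "B": 0.4, "C": 0.5, "D": 0.2, "E": 0.1}
# Hypothetical ancestor sets; each set includes the term itself.
ANCESTORS = {
    "A": {"A"},
    "B": {"A", "B"},
    "C": {"A", "C"},
    "D": {"A", "B", "C", "D"},
    "E": {"A", "B", "E"},
}


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(P[term])


def mica(t1, t2):
    """Most informative common ancestor: the shared ancestor with the highest IC."""
    return max(ANCESTORS[t1] & ANCESTORS[t2], key=ic)


def lin_similarity(t1, t2):
    denominator = ic(t1) + ic(t2)
    if denominator == 0:
        # Both terms carry no information (e.g. root vs. root); treated as 0 in this sketch.
        return 0.0
    return 2 * ic(mica(t1, t2)) / denominator


def relevance_similarity(t1, t2):
    """Lin's similarity weighted by the relevance factor 1 - p(MICA)."""
    return lin_similarity(t1, t2) * (1 - P[mica(t1, t2)])
```

For the hypothetical pair D and E, the most informative common ancestor is B, so the Relevance score is Lin's score scaled by 1 - 0.4. The factor penalises pairs whose only shared ancestor is a very general (high-probability) term.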

This value is pre-calculated and stored in Neo4j for fast retrieval. However, most similarity scores are close to 0, and similarity scores below 0.2 are not stored. Therefore, if a query involves a term pair with a similarity score below 0.2, the system will not return those terms. It is recommended to set the similarity threshold above 0.6; otherwise, a large number of terms will be returned.
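The effect of the two thresholds can be sketched as follows. This is an in-memory illustration only: the score table, term identifiers, and function name are hypothetical, and it does not reflect how the service queries Neo4j.

```python
STORAGE_FLOOR = 0.2    # scores below this are not stored
RECOMMENDED_MIN = 0.6  # recommended lower bound for queries

# Hypothetical pre-calculated scores between one query term and other terms.
STORED_SCORES = {"T1": 0.95, "T2": 0.72, "T3": 0.41, "T4": 0.23}


def similar_terms(stored, min_similarity=RECOMMENDED_MIN):
    """Return (term, score) pairs at or above the threshold, best match first."""
    # Asking for less than the storage floor cannot return more terms,
    # because those pairs were never stored.
    effective = max(min_similarity, STORAGE_FLOOR)
    hits = [(term, score) for term, score in stored.items() if score >= effective]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Calling `similar_terms(STORED_SCORES)` keeps only T1 and T2, while a threshold of 0.0 behaves exactly like 0.2 and returns all four stored terms.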

Corpus Usage and Intrinsic IC

Because the ontologies loaded into the system have more than one annotating corpus, we selected the following annotation for each ontology:

  • HPO: Gene-Phenotype annotation from the HPO consortium

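For ontologies that do not have an annotation or mapping of sufficient quality for similarity calculation, intrinsic IC is used. The intrinsic IC considers only the topological structure of the ontology itself. The intrinsic IC used in the system was proposed by Sánchez et al. in 2011 (Sánchez D, Batet M, Isern D. Ontology-based information content computation. Knowledge-Based Syst 2011;24:297–303.), calculated as:

$$IC(t) = -\log\left(\frac{\frac{|\mathrm{leaves}(t)|}{|\mathrm{subsumers}(t)|} + 1}{\mathrm{max\_leaves} + 1}\right)$$

where $\mathrm{leaves}(t)$ is the set of leaf terms that are descendants of $t$, $\mathrm{subsumers}(t)$ is the set of ancestors of $t$ including $t$ itself, and $\mathrm{max\_leaves}$ is the total number of leaf terms in the ontology.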
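A minimal sketch of this intrinsic IC follows, assuming the published form IC(t) = -log((|leaves(t)| / |subsumers(t)| + 1) / (max_leaves + 1)), where leaves(t) are the leaf descendants of t, subsumers(t) are the ancestors of t including t itself, and max_leaves is the number of leaves in the whole ontology. The small DAG and all function names are hypothetical, chosen for illustration only.

```python
import math

# Hypothetical ontology, stored as term -> direct parents.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["B"]}


def subsumers(term):
    """All ancestors of a term, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


SUBSUMERS = {term: subsumers(term) for term in PARENTS}
# A leaf is a term that is not the parent of any other term.
LEAVES = {t for t in PARENTS if not any(t in ps for ps in PARENTS.values())}


def leaves_under(term):
    """Leaf terms that are strict descendants of the given term."""
    return {leaf for leaf in LEAVES if leaf != term and term in SUBSUMERS[leaf]}


def intrinsic_ic(term):
    ratio = len(leaves_under(term)) / len(SUBSUMERS[term])
    return -math.log((ratio + 1) / (len(LEAVES) + 1))
```

In this hypothetical DAG the root A has an intrinsic IC of 0, and the leaves D and E reach the maximum value of log(3). No annotation corpus is needed, since only the graph structure is used.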

Last modified: 28 March 2025