Semantic similarity

From Wikipedia, the free encyclopedia
Revision as of 15:03, 9 August 2013

Semantic similarity or semantic relatedness is a concept whereby a set of documents, or terms within term lists, is assigned a metric based on the likeness of their meaning or semantic content.

Concretely, this can be achieved, for instance, by defining a topological similarity, using ontologies to define a distance between words (a naive metric for terms arranged as nodes in a directed acyclic graph, such as a hierarchy, would be the minimal distance, counted in separating edges, between the two term nodes), or by using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus (co-occurrence).
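The naive edge-counting metric described above can be sketched as a breadth-first search over the hierarchy. This is only an illustration under stated assumptions: the toy hierarchy `PARENTS` and the function name `edge_distance` are hypothetical, and edges are treated as undirected when counting.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> list of parents); not taken from any real ontology.
PARENTS = {
    "cat": ["mammal"],
    "dog": ["mammal"],
    "mammal": ["animal"],
    "sparrow": ["bird"],
    "bird": ["animal"],
    "animal": [],
}


def edge_distance(parents, a, b):
    """Return the minimal number of edges separating a and b, or None if unconnected."""
    # Build an undirected adjacency map from the child -> parents relation.
    neighbours = {term: set() for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            neighbours[child].add(parent)
            neighbours.setdefault(parent, set()).add(child)

    # Breadth-first search visits nodes in order of increasing edge count,
    # so the first time b is reached the distance is minimal.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    print(edge_distance(PARENTS, "cat", "dog"))
    print(edge_distance(PARENTS, "cat", "sparrow"))
```

A smaller distance indicates closer terms; turning the distance into a similarity score requires a further, application-specific choice that this sketch leaves open.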

Taxonomy

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not.[1] However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all ask, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity/relatedness, and 0 signifies little to none.
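One measure that produces scores in exactly these ranges is the cosine of the angle between two term vectors in a vector space model: it lies between -1 and 1 in general, and between 0 and 1 when all vector components are non-negative (as with raw co-occurrence counts). The sketch below is illustrative only; the function name and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical co-occurrence counts of two terms with three context words.
    term_a = [4, 0, 1]
    term_b = [3, 1, 0]
    print(cosine_similarity(term_a, term_b))
```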

Visualisation

An intuitive way of visualising the semantic similarity of terms is to group closely related terms together and to space more distantly related ones further apart. This is also common, if sometimes subconscious, practice in mind maps and concept maps.

Applications

Biomedical Informatics

Semantic similarity measures have been applied and developed in biomedical ontologies,[2][3][4] namely the Gene Ontology (GO).[5][6][7][8] They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds,[9] anatomical entities[10] and diseases.[11]

These comparisons can be done using tools freely available on the web:

  • ProteInOn can be used to find interacting proteins, find assigned GO terms, and calculate the functional semantic similarity of UniProt proteins, as well as to get the information content and calculate the functional semantic similarity of GO terms.[12]
  • CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.[13]
  • CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.[14]

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:[15]

  • The SIM-DL similarity server[16] can be used to compute similarities between concepts stored in geographic feature type ontologies.
  • The Similarity Calculator can be used to compute how closely related two geographic concepts are in the Geo-Net-PT ontology.[17][18]
  • The OSM Semantic Network can be used to compute the semantic similarity of tags in OpenStreetMap.[19]

Linguistics

Several metrics use WordNet: (+) manually constructed; (−) manually constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary.[1][20]

WWW

Given one information resource on the WWW, it is often of immediate interest to find similar resources. The Semantic Web provides semantic extensions for finding similar data by content and not just by arbitrary descriptors.[21][22][23][24][25][26][27][28][29]

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

  • Edge-based: approaches that use the edges and their types as the data source;
  • Node-based: approaches in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

  • Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent;
  • Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent.

Some examples:

Edge-based

  • Pekar et al.[30]
  • Cheng and Cline[31]
  • Wu et al.[32]
  • Del Pozo et al.[33]
  • IntelliGO: Benabderrahmane et al.[34]

Node-based

  • Resnik[35]
    • based on the notion of information content. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
    • considers only the information content of the lowest common subsumer (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) that has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
  • Lin[36]
    • based on Resnik's similarity.
    • considers the information content of the lowest common subsumer (lcs) and of the two compared concepts.
  • Jiang and Conrath[37]
    • based on Resnik's similarity.
    • considers the information content of the lowest common subsumer (lcs) and of the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
  • DiShIn (Disjunctive Shared Information between Ontology Concepts)[38]
    • an alternative: GraSM (Graph-based Similarity Measure)[39]
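The three information-content measures listed above (Resnik, Lin, and Jiang and Conrath) can be sketched as follows, using their commonly cited formulas. The taxonomy `PARENT` and the probabilities `PROB` are hypothetical toy values, and the sketch assumes a single-parent tree, so that walking upward from one concept finds the lowest common subsumer directly.

```python
import math

# Hypothetical single-parent taxonomy (child -> parent) and corpus probabilities.
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal", "animal": None}
PROB = {"animal": 1.0, "mammal": 0.5, "cat": 0.125, "dog": 0.25}


def information_content(concept):
    """IC(c) = -log2 p(c): rarer concepts carry more information."""
    return -math.log2(PROB[concept])


def subsumers(concept):
    """The concept itself followed by its ancestors, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(a, b):
    """First ancestor of a (walking upward) that also subsumes b."""
    subsumers_of_b = set(subsumers(b))
    for concept in subsumers(a):
        if concept in subsumers_of_b:
            return concept
    return None


def resnik_similarity(a, b):
    """Resnik: IC of the lowest common subsumer only."""
    return information_content(lowest_common_subsumer(a, b))


def lin_similarity(a, b):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denominator = information_content(a) + information_content(b)
    if denominator == 0:
        return 1.0  # both concepts are the root; treated as identical in this sketch
    return 2 * resnik_similarity(a, b) / denominator


def jiang_conrath_distance(a, b):
    """Jiang and Conrath: IC(a) + IC(b) - 2 * IC(lcs), a distance rather than a similarity."""
    return information_content(a) + information_content(b) - 2 * resnik_similarity(a, b)


if __name__ == "__main__":
    print(lowest_common_subsumer("cat", "dog"))
    print(resnik_similarity("cat", "dog"))
    print(lin_similarity("cat", "dog"))
    print(jiang_conrath_distance("cat", "dog"))
```

How the Jiang and Conrath distance is converted into a similarity score varies between implementations, so that step is deliberately left out here.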

Pairwise

  • maximum of the pairwise similarities
  • composite average in which only the best-matching pairs are considered (best-match average)
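The two pairwise combination strategies above can be sketched as below. The term names and the similarity table `TOY_SIM` are hypothetical, and the best-match average shown (the mean of the two directional averages) is only one of several variants in use.

```python
# Hypothetical term-to-term similarities between the annotations of two instances.
TOY_SIM = {
    ("t1", "t3"): 0.9,
    ("t1", "t4"): 0.2,
    ("t2", "t3"): 0.1,
    ("t2", "t4"): 0.6,
}


def toy_sim(a, b):
    """Look up the similarity of two terms in the toy table."""
    return TOY_SIM[(a, b)]


def max_pairwise(terms_a, terms_b, sim):
    """Maximum over all pairwise term similarities."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def best_match_average(terms_a, terms_b, sim):
    """Average only the best-matching pairs, taken in both directions."""
    best_for_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_for_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2


if __name__ == "__main__":
    instance_a = ["t1", "t2"]
    instance_b = ["t3", "t4"]
    print(max_pairwise(instance_a, instance_b, toy_sim))
    print(best_match_average(instance_a, instance_b, toy_sim))
```

In practice `sim` would be one of the concept-level measures (edge-based or node-based) rather than a lookup table.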

Groupwise

Statistical similarity

  • LSA (Latent semantic analysis)[41][42] (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
  • PMI (Pointwise mutual information) (+) large vocabulary, because it can use any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
  • SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sorts lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
  • GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
  • ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
  • NGD (Normalized Google distance) (+) large vocabulary, because it can use any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007).[43]
  • ESA (Explicit Semantic Analysis), based on Wikipedia and the ODP
  • SSA (Salient Semantic Analysis), which indexes terms using salient concepts found in their immediate context.
  • n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
  • VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocabulary, can compare multi-word terms; (−) performance depends on choosing specific dimensions
  • BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high-dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation
  • SimRank
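Two of the count-based measures in the list above, PMI and NGD, can be computed directly from occurrence counts such as search-engine hit counts. The sketch below uses their standard formulas; the function names and all counts are hypothetical, and no real search engine is queried.

```python
import math


def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    if count_xy == 0:
        return float("-inf")  # the terms never co-occur
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log2(p_xy / (p_x * p_y))


def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance from hit counts f(x), f(y), f(x, y) and index size n."""
    if f_xy == 0:
        return float("inf")  # the terms never co-occur
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


if __name__ == "__main__":
    # Hypothetical counts: each term occurs on 100 pages, both together on 10, out of 10000.
    print(pmi(100, 100, 10, 10000))
    print(ngd(100, 100, 10, 10000))
```

PMI grows with relatedness (0 means the terms co-occur as often as chance predicts), whereas NGD is a distance, so 0 means the terms always occur together.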

Semantics-based similarity

  • GCS-based Semantic Similarity Measure [44]
  • Comment on application of semantics-based similarity to biomedical ontologies [45]

See also

References

  1. Budanitsky, Alexander; Hirst, Graeme (2001). (Document). Pittsburgh.
  2. Pesquita, Catia; Faria, Daniel; Falcão, André O.; Lord, Phillip; Couto, Francisco M. (2009). Bourne, Philip E. (ed.). "Semantic Similarity in Biomedical Ontologies". PLoS Computational Biology. 5 (7): e1000443. doi:10.1371/journal.pcbi.1000443. PMC 2712090. PMID 19649320.
  3. Guzzi, Pietro Hiram; Mina, Marco; Cannataro, Mario; Guerra, Concettina (2012). "Semantic similarity analysis of protein data: assessment with biological features and issues". Briefings in Bioinformatics. 13 (5): 569–585. doi:10.1093/bib/bbr066.
  4. Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". Biomed Central. 11: 588. doi:10.1186/1471-2105-11-588. PMC 3098105. PMID 21122125.
  5. Couto, F., Silva, M., & Coutinho, P. (2003). Implementation of a functional semantic similarity measure between gene-products. DI/FCUL TR 03–29, University of Lisbon
  6. Pesquita, C., Faria, D., Falcão, A., Lord, P., & Couto, F. (2009). Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5:e1000443
  7. Couto, F., Silva, M., & Coutinho, P. (2005). "Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors". Proc. of the ACM Conference in Information and Knowledge Management (CIKM).
  8. Couto, F., Silva, M., & Coutinho, P. (2007). "Measuring semantic similarity between Gene Ontology terms". Data and Knowledge Engineering. 61: 137–152.
  9. Ferreira, João D.; Couto, Francisco M. (2010). Mitchell, John B. O. (ed.). "Semantic Similarity for Automatic Classification of Chemical Compounds". PLoS Computational Biology. 6 (9): e1000937. doi:10.1371/journal.pcbi.1000937. PMC 2944781. PMID 20885779.
  10. Ferreira, João D.; Couto, Francisco M. (2011). "Generic semantic relatedness measure for biomedical ontologies" (PDF). ICBO 2011 Proceedings.
  11. Köhler, S; Schulz, MH; Krawitz, P; Bauer, S; Dolken, S; Ott, CE; Mundlos, C; Horn, D; Mundlos, S (2009). "Clinical diagnostics in human genetics with semantic similarity searches in ontologies". American Journal of Human Genetics. 85 (4): 457–64. doi:10.1016/j.ajhg.2009.09.003. PMC 2756558. PMID 19800049.
  12. "ProteInOn".
  13. "CMPSim".
  14. "CESSM".
  15. Janowicz, K., Raubal, M. and Kuhn, W. (2011). "The semantics of similarity in geographic information retrieval". Journal of Spatial Information Science. 2: 29–57.
  16. "SIM-DL similarity server". CiteSeerX 10.1.1.172.5544.
  17. "Geo-Net-PT Similarity Calculator".
  18. "Geo-Net-PT".
  19. "OSM Semantic Network".
  20. Kaur, I. & Hornof, A.J. (2005). "A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior". Proceedings of the Conference on Human Factors in Computing, CHI 2005: 51–60. doi:10.1145/1054972.1054980.
  21. Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)
  22. Gracia, J. and Mena, E. (2008). "Web-Based Measure of Semantic Relatedness" (PDF). Proceedings of the 9th international conference on Web Information Systems Engineering (WISE '08). Springer-Verlag, Berlin, Heidelberg: 136–150.
  23. Raveendranathan, P. (2005). Identifying Sets of Related Words from the World Wide Web. Master of Science Thesis, University of Minnesota Duluth.
  24. Wubben, S. (2008). Using free link structure to calculate semantic relatedness. In ILK Research Group Technical Report Series, nr. 08-01, 2008.
  25. Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Tx: The Cognitive Science Society, Inc.
  26. Navigli, R., Lapata, M. (2007). Graph Connectivity Measures for Unsupervised Word Sense Disambiguation, Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, January 6-12th, 2007, pp. 1683–1688.
  27. Pirolli, P. (2005). "Rational analyses of information foraging on the Web". Cognitive Science. 29 (3): 343–373. doi:10.1207/s15516709cog0000_20.
  28. Pirolli, P., & Fu, W.-T. (2003). "SNIF-ACT: A model of information foraging on the World Wide Web". Lecture Notes in Computer Science. Vol. 2702. pp. 45–54. doi:10.1007/3-540-44963-9_8.
  29. Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.
  30. Pekar, Viktor; Staab, Steffen (2002). "Taxonomy learning". 1: 1. doi:10.3115/1072228.1072318.
  31. Cheng, J; Cline, M; Martin, J; Finkelstein, D; Awad, T; Kulp, D; Siani-Rose, MA (2004). "A knowledge-based clustering algorithm driven by Gene Ontology". Journal of Biopharmaceutical Statistics. 14 (3): 687–700. doi:10.1081/BIP-200025659. PMID 15468759.
  32. Wu, H; Su, Z; Mao, F; Olman, V; Xu, Y (2005). "Prediction of functional modules based on comparative genome analysis and Gene Ontology application". Nucleic Acids Research. 33 (9): 2822–37. doi:10.1093/nar/gki573. PMC 1130488. PMID 15901854.
  33. Del Pozo, Angela; Pazos, Florencio; Valencia, Alfonso (2008). "Defining functional distances over Gene Ontology". BMC Bioinformatics. 9: 50. doi:10.1186/1471-2105-9-50. PMC 2375122. PMID 18221506.
  34. Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". Biomed Central. 11: 588. doi:10.1186/1471-2105-11-588. PMC 3098105. PMID 21122125.
  35. Philip Resnik (1995). Chris S. Mellish (ed.). "Using information content to evaluate semantic similarity in a taxonomy". Proceedings of the 14th international joint conference on Artificial intelligence (IJCAI'95). 1. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA: 448–453. CiteSeerX 10.1.1.41.6956.
  36. Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296–304
  37. J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997
  38. Couto, F. & Silva, M. (2011), Disjunctive Shared Information between Ontology Concepts: application to Gene Ontology. Journal of Biomedical Semantics, 2:5
  39. Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152
  40. Catia Pesquita, Daniel Faria, Hugo Bastos, António Ferreira, Andre O Falcao, Francisco Couto 2008: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics Suppl 5(9), S4
  41. Landauer, T. K.; Dumais, S. T. (1997). "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge". Psychological Review. 104 (2): 211–240.
  42. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). "Introduction to Latent Semantic Analysis" (PDF). Discourse Processes. 25: 259–284.
  43. "Google Similarity Distance".
  44. C. d'Amato, S. Staab, and N. Fanizzi. On the influence of description logics ontologies on conceptual similarity. Knowledge Engineering: Practice and Patterns, pages 48–63, 2008
  45. F. Couto and H. Pinto, The next generation of similarity measures that fully explore the semantics in biomedical ontologies, Journal of Bioinformatics and Computational Biology, vol. in press, 2013. preprint

