Author name disambiguation: Difference between revisions

From Wikipedia, the free encyclopedia
'''Author name disambiguation''', also called '''personal disambiguation''' or '''disambiguation of people's names''', is a type of [[disambiguation]] and [[record linkage]] applied to the names of individual people. The process could, for example, distinguish individuals with the name "[[John Smith]]".

The process may be applied to scholarly documents, where the goal is to find all mentions of the same author and cluster them together. Authors of scholarly documents often share names, which makes it hard to distinguish each author's work. Hence, author name disambiguation aims to find all publications that belong to a given author and distinguish them from publications of other authors who share the same name.

==Methods==
Typical approaches to author name disambiguation rely on information about the authors, such as their affiliations, email addresses, years of publication, co-authors, and topic information, to distinguish between authors. This information can be used to train a [[machine learning]] classifier to decide whether two author mentions refer to the same author.<ref>{{Cite conference
| author1-first = Pucktada | author1-last = Treeratpituk | author2-first = C. Lee | author2-last = Giles
| citeseerx = 10.1.1.147.3500
| doi = 10.1145/1555400.1555408
}}</ref> Other approaches use heuristics to distinguish between authors.

Researchers use various algorithms for disambiguation.<ref name="MannYarowsky2003">{{cite journal|last1=Mann|first1=Gideon S.|last2=Yarowsky|first2=David|title=Unsupervised personal name disambiguation|volume=4|year=2003|pages=33–40|doi=10.3115/1119176.1119181}}</ref><ref>{{cite journal|title=Two supervised learning approaches for name disambiguation in author citations|journal=IEEE Xplore|date=27 September 2004|doi=10.1109/JCDL.2004.240051}}</ref><ref name="HuangErtekin2006">{{cite journal|last1=Huang|first1=Jian|last2=Ertekin|first2=Seyda|last3=Giles|first3=C. Lee|title=Efficient Name Disambiguation for Large-Scale Databases|volume=4213|year=2006|pages=536–544|issn=0302-9743|doi=10.1007/11871637_53}}</ref><ref name="KhabsaTreeratpituk2015">{{cite journal|last1=Khabsa|first1=Madian|last2=Treeratpituk|first2=Pucktada|last3=Giles|first3=C. Lee|title=Online Person Name Disambiguation with Constraints|year=2015|pages=37–46|doi=10.1145/2756406.2756915}}</ref>

==Applications==
Author names can be ambiguous for several reasons: individuals may publish under multiple names because of spelling variants, misspellings, name changes due to marriage, or the use of middle names and initials.<ref>{{cite journal
| last1 = Smalheiser | first1 = Neil R. | last2 = Torvik | first2 = Vetle I.
| title = Author name disambiguation
| journal = [[Annual Review of Information Science and Technology]]
| url = http://onlinelibrary.wiley.com/doi/10.1002/aris.2009.1440430113/full
| accessdate = 2015-04-20
| doi = 10.1002/aris.2009.1440430113
}}</ref>


Motivations for disambiguating individuals include identifying inventors from patents.<ref>{{cite journal|last1=Morrison|first1=Greg|last2=Riccaboni|first2=Massimo|last3=Pammolli|first3=Fabio|title=Disambiguation of patent inventors and assignees using high-resolution geolocation data|journal=Scientific Data|date=16 May 2017|volume=4|pages=170064|doi=10.1038/sdata.2017.64}}</ref>

==Similar issues==
Author name disambiguation is only one record linkage problem in the scholarly data domain. Closely related, and potentially mutually beneficial problems include: organisation (affiliation) disambiguation<ref>{{Cite conference
| author1-first = Ziqi | author1-last = Zhang | author2-first = Andrea | author2-last = Nuzzolese | first3 = Anna Lisa | last3 = Gentile


==References==
{{reflist}}

[[Category:Library cataloging and classification]]
[[Category:Metadata]]
[[Category:Data management]]
[[Category:Word-sense disambiguation]]

Revision as of 17:31, 3 May 2018

Author name disambiguation, also called personal disambiguation or disambiguation of people's names, is a type of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".

The process may be applied to scholarly documents, where the goal is to find all mentions of the same author and cluster them together. Authors of scholarly documents often share names, which makes it hard to distinguish each author's work. Hence, author name disambiguation aims to find all publications that belong to a given author and distinguish them from publications of other authors who share the same name.

Methods

Typical approaches to author name disambiguation rely on information about the authors, such as their affiliations, email addresses, years of publication, co-authors, and topic information, to distinguish between authors. This information can be used to train a machine learning classifier to decide whether two author mentions refer to the same author.[1] Other approaches use heuristics to distinguish between authors.
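
As an illustration only, the pairwise decision described above can be sketched in Python. Everything named here is an assumption made for the example and not taken from the cited work: the Mention record, the five features, the hand-set WEIGHTS and THRESHOLD (which stand in for a trained classifier), and the union-find step that merges matched pairs into clusters.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One author mention on one paper (hypothetical record layout)."""
    paper_id: str
    name: str
    affiliation: str = ""
    email: str = ""
    year: int = 0
    coauthors: frozenset = frozenset()
    topics: frozenset = frozenset()


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(m1, m2):
    """Similarity features for a pair of mentions."""
    return {
        "same_email": float(bool(m1.email) and m1.email == m2.email),
        "same_affiliation": float(
            bool(m1.affiliation) and m1.affiliation == m2.affiliation
        ),
        "coauthor_overlap": jaccard(m1.coauthors, m2.coauthors),
        "topic_overlap": jaccard(m1.topics, m2.topics),
        "year_closeness": 1.0 / (1.0 + abs(m1.year - m2.year)),
    }


# Illustrative hand-set values; a real system would learn a classifier
# from labelled pairs instead of using these numbers.
WEIGHTS = {
    "same_email": 3.0,
    "same_affiliation": 1.0,
    "coauthor_overlap": 2.0,
    "topic_overlap": 1.0,
    "year_closeness": 0.5,
}
THRESHOLD = 1.5


def same_author(m1, m2):
    """Decide whether two mentions refer to the same author."""
    feats = pair_features(m1, m2)
    score = sum(WEIGHTS[name] * value for name, value in feats.items())
    return score >= THRESHOLD


def cluster(mentions):
    """Merge mentions linked by same_author() using union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if same_author(mentions[i], mentions[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention.paper_id)
    return sorted(sorted(group) for group in groups.values())
```

Given two mentions that share an email address, an affiliation and a co-author, and a third mention with the same surname but a different affiliation and topic, cluster() would be expected to place the first two papers together and leave the third on its own. Comparing every pair is quadratic in the number of mentions, so this sketch only suits small candidate sets.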

Researchers use various algorithms for disambiguation.[2][3][4][5]

Applications

Author names can be ambiguous for several reasons: individuals may publish under multiple names because of spelling variants, misspellings, name changes due to marriage, or the use of middle names and initials.[6]
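
One common way to cope with initials and formatting variants is to map each name to a coarse key before any detailed comparison. The Python sketch below is a hedged illustration of that idea; the functions blocking_key and block and the "surname plus first initial" rule are assumptions chosen for the example, not a method described in the cited sources.

```python
import unicodedata
from collections import defaultdict


def blocking_key(name):
    """Reduce a personal name to 'surname_initial' (illustrative rule).

    Handles 'Given Middle Surname' and 'Surname, Given' forms and
    strips accents. It does not correct misspellings or follow
    surname changes; those need other evidence.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    if "," in ascii_name:
        last, _, given = ascii_name.partition(",")
    else:
        parts = ascii_name.split()
        if not parts:
            return ""
        last, given = parts[-1], " ".join(parts[:-1])
    last = "".join(ch for ch in last.lower() if ch.isalpha())
    given = given.strip().lower()
    initial = given[0] if given else ""
    return f"{last}_{initial}"


def block(names):
    """Group name strings that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return dict(blocks)
```

Under this rule "John A. Smith", "J. Smith" and "Smith, J." fall into the same block, while "Mary Smith" does not. Names such as "Jane Smith" would also land in the "J. Smith" block, so a key of this kind only proposes candidates; deciding whether two candidates are the same person still requires further evidence.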

Motivations for disambiguating individuals include identifying inventors from patents.[7]

Similar issues

Author name disambiguation is only one record linkage problem in the scholarly data domain. Closely related and potentially mutually beneficial problems include organisation (affiliation) disambiguation[8] and conference or publication venue disambiguation, since data publishers often use different names or aliases for these entities.

References

  1. ^ Treeratpituk, Pucktada; Giles, C. Lee (2009). Disambiguating authors in academic publications using random forests (PDF). Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM. pp. 39–48. CiteSeerX 10.1.1.147.3500. doi:10.1145/1555400.1555408.
  2. ^ Mann, Gideon S.; Yarowsky, David (2003). "Unsupervised personal name disambiguation". 4: 33–40. doi:10.3115/1119176.1119181.
  3. ^ "Two supervised learning approaches for name disambiguation in author citations". IEEE Xplore. 27 September 2004. doi:10.1109/JCDL.2004.240051.
  4. ^ Huang, Jian; Ertekin, Seyda; Giles, C. Lee (2006). "Efficient Name Disambiguation for Large-Scale Databases". 4213: 536–544. doi:10.1007/11871637_53. ISSN 0302-9743.
  5. ^ Khabsa, Madian; Treeratpituk, Pucktada; Giles, C. Lee (2015). "Online Person Name Disambiguation with Constraints": 37–46. doi:10.1145/2756406.2756915.
  6. ^ Smalheiser, Neil R.; Torvik, Vetle I. "Author name disambiguation". Annual Review of Information Science and Technology. doi:10.1002/aris.2009.1440430113. Retrieved 2015-04-20.
  7. ^ Morrison, Greg; Riccaboni, Massimo; Pammolli, Fabio (16 May 2017). "Disambiguation of patent inventors and assignees using high-resolution geolocation data". Scientific Data. 4: 170064. doi:10.1038/sdata.2017.64.
  8. ^ Zhang, Ziqi; Nuzzolese, Andrea; Gentile, Anna Lisa (2017). Entity Deduplication on ScholarlyData. Proceedings of the Extended Semantic Web Conference. Springer-Verlag. pp. 85–100. doi:10.1007/978-3-319-58068-5_6.