Models of collaborative tagging

It has been argued that social tagging or collaborative tagging systems can provide navigational cues or "way-finders"[1][2] for other users to explore information. The notion is that given that social tags are labels users create to represent topics extracted from Web documents, interpretation of these tags should allow other users to predict contents of different documents efficiently. Social tags are arguably more important in exploratory search, in which the users may engage in iterative cycles of goal refinement and exploration of new information (as opposed to simple fact-retrievals), and interpretation of information contents by others will provide useful cues for people to discover topics that are relevant.

One significant challenge that arises in social tagging systems is the rapid increase in the number and diversity of tags. As opposed to structured annotation systems, tags provide users with an unstructured, open-ended mechanism to annotate and organize web content. Since users are free to create any tag to describe any resource, this leads to what is referred to as the vocabulary problem.[3] Because users may use different words to describe the same document, or extract different topics from the same document based on their own background knowledge, the lack of any top-down mediation may lead to an increase in the use of incoherent tags to represent the information resources in the system. In other words, the inherent "unstructuredness" of social tags may hinder their potential as navigational cues for searchers, because the diversity of users and their motivations may lead to diminishing tag-topic relations as the system grows. However, a number of studies have shown that structures do emerge at the semantic level,[4] indicating that there are cohesive forces driving the emergent structures in a social tagging system.

The distinction between descriptive and predictive models

As with any social phenomenon, behavioral patterns in social tagging systems can be characterized by either a descriptive or a predictive model. While descriptive models ask the question of "what", predictive models go deeper and also ask the question of "why" by attempting to explain the aggregate behavioral patterns.[5] While there may be no general agreement on what an acceptable explanation should look like, many believe that a good explanation should have a certain level of predictive accuracy. Descriptive models of social tagging are typically not concerned with explaining the actions of single individuals, but with describing the patterns that emerge when individual behavior is aggregated in a large social information system.

Predictive models, by contrast, attempt to explain aggregate patterns by analyzing how individuals interact and link to each other in ways that bring about similar or different emergent patterns of social behavior. In particular, a mechanism-based predictive model assumes a certain set of rules by which individuals interact with each other, and seeks to understand how these interactions could produce the aggregate patterns observed and characterized by descriptive models. Predictive models can therefore explain why different system characteristics may lead to different aggregate patterns, and can potentially inform how systems should be designed to achieve different social purposes.

Descriptive models of social tagging

Information theory models

For most tagging systems, the total number of tags in the collective vocabulary is much smaller than the total number of objects being tagged. Given this multiplicity of tags to documents, a question remains: how effective are the tags at isolating any single document? Naively, specifying a single tag in such a system would not uniquely identify a document but would instead match many documents, so the answer to the question is "not very well!". However, this reasoning carries a faulty assumption: not every document is equal. Some documents are more popular and important than others, and this importance is conveyed by the number of bookmarks per document. Thus, the question can be reformulated as: how much information about the distribution of the documents does the mapping of tags to documents retain? Information theory provides a natural framework for understanding the amount of information shared between two random variables. The conditional entropy measures the amount of entropy remaining in one random variable when the value of a second random variable is known.
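
For reference, the conditional entropy of documents given tags can be written out as follows. The notation is an assumption of this sketch rather than something taken from the cited work: p(t) is the probability that a bookmark uses tag t, and p(d | t) is the probability that a bookmark with tag t points to document d.

```latex
H(D \mid T) = -\sum_{t \in T} p(t) \sum_{d \in D} p(d \mid t) \, \log_2 p(d \mid t)
```

Under this definition, H(D|T) = 0 when every tag points to exactly one document, and H(D|T) = H(D) when knowing the tag says nothing about which document was bookmarked.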

Work by Chi and Mytkowicz[6] shows that the entropy of documents conditional on tags, H(D|T), is increasing rapidly over time. Conditional entropy asks the question: "Given that I know a set of tags, how much uncertainty remains regarding the document set that I was referencing with those tags?" An increasing H(D|T) therefore means that, even when the value of a tag is completely known, the remaining uncertainty about the documents keeps growing. The fact that this curve is strictly increasing suggests that the specificity of any given tag is decreasing. That is to say, as a navigation aid, tags are becoming harder and harder to use. The system is moving closer and closer to the proverbial "needle in a haystack" situation, in which any single tag references too many documents to be considered useful.

Another way to look at the data is through mutual information, which measures the dependence between the two variables. Full independence is reached when I(D;T) = 0. Chi and Mytkowicz's research on Delicious social tagging data shows that, with mutual information as a measure of the usefulness of the tags and their encoding, there is a worsening trend in the ability of users to specify and find tags and documents when they are engaged in simple fact retrieval. This suggests a need for search and recommendation systems that help users sift through resources in social tagging systems, especially when users are engaged in more than the simple fact retrieval characterized by the information-theoretic analysis. Indeed, although the number of documents associated with any given tag is increasing, there are many ways in which contextual information can help users look for relevant information. This is one of the major weaknesses of the simple information-theoretic account of the usefulness of tags: it ignores the fact that humans can extract meaning from a set of tags assigned to a document, and this semantic extraction process is exactly why humans are able to communicate efficiently even though the size of our vocabulary has been increasing ever since language was developed. For example, the work by Cattuto et al. (2007),[7] published in PNAS, shows that while the number of tags is increasing, the general growth pattern is scale-free: the distribution of tag-tag co-occurrences follows a power law.
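
The two quantities discussed above can be estimated from raw bookmark data. The following is a minimal sketch, not the procedure used by Chi and Mytkowicz: it assumes the data is available as a list of (tag, document) pairs, estimates probabilities by simple counting, and uses the standard identities H(D|T) = H(D,T) − H(T) and I(D;T) = H(D) − H(D|T). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def tag_document_information(bookmarks):
    """Return (H(D), H(D|T), I(D;T)) for a list of (tag, document) pairs."""
    joint_counts = Counter(bookmarks)
    tag_counts = Counter(tag for tag, _ in bookmarks)
    doc_counts = Counter(doc for _, doc in bookmarks)

    h_docs = entropy(doc_counts.values())
    h_tags = entropy(tag_counts.values())
    h_joint = entropy(joint_counts.values())

    h_docs_given_tags = h_joint - h_tags          # chain rule
    mutual_information = h_docs - h_docs_given_tags
    return h_docs, h_docs_given_tags, mutual_information


# Toy data (made up): "ajax" pins down one document, "python" is ambiguous.
bookmarks = [
    ("python", "doc1"), ("python", "doc2"),
    ("ajax", "doc3"), ("ajax", "doc3"),
]
h_d, h_d_given_t, i_dt = tag_document_information(bookmarks)
print(h_d, h_d_given_t, i_dt)
```

In this toy data set, knowing the tag "ajax" removes all uncertainty about the document while "python" leaves one bit, so half a bit of uncertainty remains on average. Tracking these values over successive snapshots of a tagging system would show the kind of trend the paragraph above describes.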

Cattuto et al. also find that the characteristics of this scale-free distribution depend on the semantics of the tag: tags that are semantically general (e.g., blogs) tend to co-occur with many tags, while semantically narrow tags (e.g., Ajax) tend to co-occur with few tags across a wide set of documents in a social tagging system. This means that the assumption of the information theory approach is too simple: when the semantics of the set of tags assigned to documents are taken into account, the predictive value of tags for the contents of documents is relatively stable. This finding is important for the development of recommender systems, because discovering these higher-level semantic patterns helps people find relevant information (see also the semantic imitation model below).

Tag convergence

Despite this potential vocabulary problem, recent research has found that at the aggregate level, tagging behavior seemed relatively stable and that the tag choice proportions seemed to be converging rather than diverging. While these observations provided evidence against the proposed vocabulary problem, they also triggered a series of studies investigating how and why tag proportions tended to converge over time.

One explanation for the stability was that there was an inherent propensity for users to "imitate" the word use of others as they create tags. This propensity may act as a form of social cohesion that fosters the coherence of tag-topic relations in the system,[8] and leads to stability in the system. Golder and Huberman showed that the stochastic urn model by Eggenberger and Pólya[9] was useful in explaining how simple imitation behavior at the individual level could produce the converging usage patterns of tags. Specifically, convergence of tag choices was simulated by a process in which a colored ball was randomly selected from an urn and was replaced in the urn along with an additional ball of the same color, simulating the probabilistic nature of tag reuse. The simple model, however, does not explain why certain tags would be "imitated" more often than others, and therefore cannot provide a realistic mechanism for tag choices or for how social tags could be utilized as navigational cues during exploratory search, not to mention its obviously over-simplified representation of individual users as balls in an urn.
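
The urn process described above can be sketched in a few lines. This is an illustrative simulation only, not code from Golder and Huberman: the function name, the starting tags, and the number of steps are hypothetical choices, and each ball color stands for one tag.

```python
import random
from collections import Counter


def polya_urn(initial_tags, steps, seed=0):
    """Simulate an Eggenberger-Polya urn in which each ball is a tag.

    At each step one ball is drawn uniformly at random, returned to the
    urn, and joined by one additional ball of the same color (tag).
    """
    rng = random.Random(seed)
    urn = list(initial_tags)
    for _ in range(steps):
        drawn = rng.choice(urn)
        # The drawn ball stays in the urn; appending adds the extra copy.
        urn.append(drawn)
    return Counter(urn)


counts = polya_urn(["web", "design", "css"], steps=1000)
total = sum(counts.values())
proportions = {tag: n / total for tag, n in counts.items()}
print(proportions)
```

Running the simulation with different seeds gives different final proportions, but within a single run the proportions settle down as the urn grows, which mirrors the stabilizing tag proportions that the model was used to explain.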

Complex systems dynamics and emergent vocabularies

Other research, using data from a social bookmarking website, has shown that collaborative tagging systems exhibit a form of complex systems (or self-organizing) dynamics.[10] Furthermore, although there is no central controlled vocabulary to constrain the actions of individual users, the distributions of tags that describe different resources have been shown to converge over time to stable power-law distributions.[10] Once such stable distributions form, the correlations between different tags can be used to construct simple folksonomy graphs, which can be efficiently partitioned to obtain a form of community or shared vocabularies.[11] Such vocabularies can be seen as emerging from the decentralised actions of many users, as a form of crowdsourcing.
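
As a rough illustration of the graph construction mentioned above, the sketch below builds a tag graph from co-occurrence counts and splits it into groups. It makes two simplifying assumptions that are not taken from the cited papers: raw co-occurrence counts stand in for the tag correlations, and connected components of the thresholded graph stand in for a proper graph-partitioning method. All names and the sample posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(posts, min_weight=1):
    """Build weighted edges {(tag_a, tag_b): count} from sets of tags."""
    weights = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    # Keep only edges that occur at least min_weight times.
    return {edge: w for edge, w in weights.items() if w >= min_weight}


def connected_components(edges):
    """Group tags that are linked by at least one chain of edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


posts = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"recipe", "cooking"},
    {"recipe", "cooking", "tutorial"},
]
graph = cooccurrence_graph(posts, min_weight=2)
vocabularies = connected_components(graph)
print(vocabularies)
```

With the threshold set to 2, the weak links through the general tag "tutorial" are dropped, leaving one group of programming tags and one group of cooking tags.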

Tag choice by stochastic process

The memory-based Yule-Simon (MBYS) model of Cattuto et al.[7] attempted to explain tag choices by a stochastic process. The authors found that the temporal order of tag assignment influences users' tag choices. Similar to the stochastic urn model, the MBYS model assumed that at each time step a tag would be randomly sampled: with probability p the sampled tag was new, and with probability 1-p the sampled tag was copied from existing tags. When copying, the probability of selecting a tag was assumed to decay with time, and this decay function was found to follow a power law. Thus, tags that were recently used had a higher probability of being reused than those used further in the past. One major finding by Cattuto et al. was that semantically general tags (e.g., "blog") tended to co-occur more frequently with other tags than semantically narrower tags (e.g., "ajax"), and this difference could be captured by the decay function of tag reuse in their model. Specifically, they found that a slower decay (so that the tag is reused more often) could explain the phenomenon that semantically general tags tended to co-occur with a larger set of tags. In other words, they argued that the "semantic breadth" of a tag could be modeled by a memory decay function, which could lead to different emergent behavioral patterns in a tagging system.
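
The process described above can be sketched as a short simulation. This is a simplified illustration rather than the authors' implementation: the function name, the default values of p and tau, and the specific decay weight 1 / (age + tau) are assumptions made here to give a concrete power-law-like memory kernel.

```python
import random
from collections import Counter


def memory_yule_simon(steps, p=0.1, tau=10.0, seed=0):
    """Generate a stream of tags with a Yule-Simon process with memory.

    With probability p a brand-new tag is appended; otherwise an earlier
    tag is copied, with recent positions in the stream weighted more
    heavily than old ones.
    """
    rng = random.Random(seed)
    stream = ["tag0"]
    next_id = 1
    for _ in range(steps):
        if rng.random() < p:
            stream.append(f"tag{next_id}")
            next_id += 1
        else:
            n = len(stream)
            # Position i has age n - i (the latest entry has age 1).
            # Assumed kernel: weight decays as 1 / (age + tau).
            weights = [1.0 / (n - i + tau) for i in range(n)]
            stream.append(rng.choices(stream, weights=weights, k=1)[0])
    return stream


stream = memory_yule_simon(steps=2000)
print(Counter(stream).most_common(5))
```

A larger tau flattens the weights, so older tags remain almost as likely to be copied as recent ones; this corresponds to the slower decay that the model associates with semantically general tags.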

Predictive models of social tagging

Semantic imitation model of social tag choices

The descriptive models mentioned above were based on analyses of word-word relations as revealed by the various statistical structures in the organization of tags (e.g., how likely one tag would co-occur with other tags or how likely each tag was reused over time). These models are therefore descriptive models at the aggregate level, and have little to offer about predictions at the level of interface interactions and the cognitive processes of individuals.

Rather than imitating other users at the word level, one possible explanation for this kind of social cohesion could be grounded in the natural tendency for people to process tags at the semantic level, with most imitation occurring at this level of processing. This explanation was supported by research in the area of reading comprehension,[12] which showed that people tended to be influenced by the meanings of words, rather than the words themselves, during comprehension. Assuming that the background knowledge of people in the same culture tends to have shared structures (e.g., using similar vocabularies and their corresponding meanings in order to conform and communicate with each other), users of the same social tagging system may also share similar semantic representations of words and concepts, even when the use of tags may vary across individuals at the word level. In other words, it has been argued that part of the reason for the stability of social tagging systems can be attributed to the shared semantic representations among the users, such that users may have relatively stable and coherent interpretations of information contents and tags as they interact with the system. Based on this assumption, the semantic imitation model[13][14] predicts how different semantic representations may lead to differences in individual tag choices and eventually different emergent properties at the aggregate behavioral level. The model also predicts that the folksonomies (i.e., knowledge structures) in the system reflect the shared semantic representations of the users.

Semantic imitation has important implications for the general vocabulary problem in information retrieval and human-computer interaction (see work by, e.g., Susan Dumais), that is, the creation of a large number of diverse tags to describe the same set of information resources. The finding that semantic imitation occurs implies that the unit of communication among users is more likely at the semantic level, not at the word level. Thus, although there may not be strong coherence in the choice of words in describing a resource, at the semantic level there seems to be a stronger cohesive force that guides the convergence of descriptive indices. This is in sharp contrast to conclusions derived from a purely information-theoretical approach, which assumes that humans search and evaluate information at the word level. Instead, the process of semantic imitation in social tagging implies that the information-theoretic approach is at most incomplete, as it does not take into account the basic unit of human information processing. Just as human communication occurs at the semantic level, the fact that people may use different words or syntax does not affect the effectiveness of communication, so long as the underlying "common ground" between the two persons is the same.[15]

In the social tagging case, so long as users share a similar understanding of the contents of the information resources, the fact that the information value of tag-document relations decreases does not imply that it will always be harder to find relevant information (similarly, the fact that there are more words in our languages does not mean that our communication becomes less effective). However, it does point to the notion that these semantic structures need to be presented effectively in the information system so that people can interpret the semantics of the tagged documents. Intelligent techniques based on statistical models of language, such as latent semantic analysis and probabilistic topic models, are promising approaches that may help overcome this vocabulary problem.

References


  1. Kang, R., Fu, W.-T., & Kannampallil, T. (2010). Exploiting Knowledge-in-the-head and Knowledge-in-the-social-web: Effects of Domain Expertise on Exploratory Search in Individual and Social Search Environments. In Proceedings of the ACM Conference on Computer-Human Interaction, Atlanta, GA.
  2. Furnas, G. W., Fake, C., Von Ahn, L., Schachter, J., Golder, S., Fox, K., Davis, M., Marlow, C., and Naaman, M. Why Do Tagging Systems Work? in CHI '06 Extended Abstracts on Human Factors in Computing Systems. (2006). Montréal, Québec, Canada.
  3. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication," Communications of the ACM, vol. 30, no. 11, pp. 964-971, 1987.
  4. Fu, Wai-Tat (2010), "Semantic imitation in social tagging", ACM Transactions on Computer-Human Interaction, doi:10.1145/1460563.1460600
  5. Hedstrom, Peter (2005). Dissecting the social. On the principle of analytic sociology. Cambridge, UK.
  6. Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of ACM Conference on Hypertext 2008. (to appear). ACM Press, 2008. Pittsburgh, PA.
  7. Cattuto, C., Loreto, V., and Pietronero, L., Semiotic Dynamics and Collaborative Tagging. Proceedings of National Academy of Sciences, (2007), 104, 1461-1464.
  8. Golder, Scott; Huberman, Bernardo A. (2006), "Usage Patterns of Collaborative Tagging Systems", Journal of Information Science, 32 (2): 198–208, doi:10.1177/0165551506062337
  9. F. Eggenberger and G. Pólya, Über die Statistik verketteter Vorgänge. Zeit. Angew. Math. Mech., (1923), 1, 279-289.
  10. Harry Halpin, Valentin Robu, Hana Shepherd, The Complex Dynamics of Collaborative Tagging, Proceedings of the 16th International Conference on World Wide Web (WWW'07), Banff, Canada, pp. 211-220, ACM Press, 2007.
  11. Valentin Robu, Harry Halpin, Hana Shepherd, Emergence of consensus and shared vocabularies in collaborative tagging systems, ACM Transactions on the Web (TWEB), Vol. 3(4), article 14, ACM Press, September 2009.
  12. Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163–182.
  13. Fu, Wai-Tat (April 2008), "The Microstructures of Social Tagging: A Rational Model", Proceedings of the ACM 2008 conference on Computer Supported Cooperative Work: 66–72, doi:10.1145/1460563.1460600
  14. Fu, Wai-Tat (Aug 2009), "A Semantic Imitation Model of Social Tagging" (PDF), Proceedings of the IEEE conference on Social Computing: 66–72, archived from the original (PDF) on 2009-12-29
  15. Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.