17 Nov 2014

Discovering Emerging Topics in Social Streams



Discovering Emerging Topics in Social Streams via Link-Anomaly Detection




ABSTRACT
Detection of emerging topics is now receiving renewed interest motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged in social network posts includes not only text but also images, URLs, and videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus on mentions of users: links between users that are generated dynamically (intentionally or unintentionally) through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of a social network user, and propose to detect the emergence of a new topic from the anomalies measured through the model. By aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate our technique on several real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier when the topic is poorly identified by the textual contents of posts.

Keywords: Topic detection, anomaly detection, social networks, sequentially discounted normalized maximum-likelihood coding, burst detection









INTRODUCTION

Communication over social networks, such as Facebook and Twitter, is gaining importance in our daily life. Since the information exchanged over social networks includes not only text but also URLs, images, and videos, these networks are challenging test beds for the study of data mining. In particular, we are interested in the problem of detecting emerging topics from social streams, which can be used to create automated “breaking news” or to discover hidden market needs or underground political movements. Compared to conventional media, social media are able to capture the earliest, unedited voice of ordinary people. Therefore, the challenge is to detect the emergence of a topic as early as possible with a moderate number of false positives.

Another difference that makes social media social is the existence of mentions. By mentions, we mean links to other users of the same social network in the form of message-to, reply-to, retweet-of, or explicit references in the text. One post may contain a number of mentions. Some users rarely include mentions in their posts; others mention their friends all the time. Some users (such as celebrities) may receive mentions every minute; for others, being mentioned might be a rare occasion. In this sense, mentions are like a language whose number of words equals the number of users in the social network.









Literature Survey

1. Goal of Project:

The main aims of this paper are:
- To detect emerging topics as early as the keyword-based methods.
- To evaluate whether the proposed approach can detect the emergence of the topics recognized and collected by people.



2. Analysis on Existing Networks:
Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words.




Fig. 1 shows an example of the emergence of a topic through posts on social networks. The first post by Bob contains mentions of Alice and John, who are both probably friends of Bob, so there is nothing unusual here.

The second post by John is a reply to Bob, but it is also visible to many friends of John who are not direct friends of Bob. Then in the third post, Dave, one of John’s friends, forwards (retweets, in Twitter terms) the information further down to his own friends. It is worth mentioning that the topic of this conversation is not clear from the textual information, because the users are talking about something (a new gadget, car, or jewelry) that is shown as a link in the text.


Disadvantages

A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly nontextual. On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
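To illustrate how little preprocessing mention "words" need, here is a minimal sketch that pulls mentions out of raw post text. It assumes Twitter-style handles (an "@" followed by 1 to 15 letters, digits, or underscores); the function name `extract_mentions` is hypothetical. On a real platform the reply-to and retweet-of links usually arrive as separate metadata fields, so even this step may be unnecessary.

```python
import re

# Assumption: Twitter-style handles of 1-15 letters, digits, or underscores.
# The lookbehind skips an "@" that sits inside an e-mail address (x@y.com);
# the lookahead rejects strings longer than 15 handle characters.
MENTION_RE = re.compile(
    r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])"
)

def extract_mentions(text):
    """Return the mentioned handles in order of appearance, lowercased."""
    return [handle.lower() for handle in MENTION_RE.findall(text)]

if __name__ == "__main__":
    post = "@Bob look at this http://example.com/x cc @alice_1, mail x@y.com"
    print(extract_mentions(post))
```

Handles are lowercased because they are compared case-insensitively on Twitter; no tokenization, stemming, or language detection is involved.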

3. Idea on Proposed System:

We propose a probability model that can capture the normal mentioning behavior of a user, which consists of both the number of mentions per post and the frequency of users occurring in the mentions. Then this model is used to measure the anomaly of future user behavior. Using the proposed probability model, we can quantitatively measure the novelty or possible impact of a post reflected in the mentioning behavior of the user.


We aggregate the anomaly scores obtained in this way over hundreds of users and apply a recently proposed change-point detection technique based on the sequentially discounting normalized maximum-likelihood (SDNML) coding [3]. This technique can detect a change in the statistical dependence structure in the time series of aggregated anomaly scores, and pinpoint when the topic emerges; see Fig. 2. The effectiveness of the proposed approach is demonstrated on four data sets we have collected from Twitter. We show that our mention-anomaly-based approaches can detect the emergence of a new topic at least as fast as their text-anomaly-based counterparts. Furthermore, we show that in three out of four data sets, the proposed mention-anomaly-based methods can detect the emergence of topics much earlier than the text-anomaly-based methods, which can be explained by the keyword ambiguity we mentioned above.
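The aggregation step can be sketched as follows, assuming the per-post anomaly scores of all users are pooled and averaged over fixed-length time windows. The window length and the function name `aggregate_scores` are hypothetical, and the SDNML change-point step itself is not shown here.

```python
from collections import defaultdict

def aggregate_scores(events, window_seconds):
    """Turn (timestamp_seconds, score) pairs from many users into a series.

    Each output value is the sum of the scores that fall in one window,
    divided by the window length. Windows with no posts yield 0.0.
    """
    sums = defaultdict(float)
    for timestamp, score in events:
        sums[int(timestamp // window_seconds)] += score
    if not sums:
        return []
    first, last = min(sums), max(sums)
    return [sums.get(j, 0.0) / window_seconds for j in range(first, last + 1)]

if __name__ == "__main__":
    events = [(0, 2.0), (30, 4.0), (130, 6.0)]  # (seconds, anomaly score)
    print(aggregate_scores(events, window_seconds=60))
```

The resulting evenly spaced series is the kind of input a change-point or burst detector would consume.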



Results 


Fig. 3a shows that the proposed approach, combined with SDNML-based change-point analysis and dynamic threshold optimization (DTO), correctly identifies the change point at 9:00, January 16, for the “Synthetic100” data set. We can clearly see that the proposed link-based anomaly score (green curve in Fig. 3a) is low in the period Jan 11-Jan 15 and high in the period Jan 16-Jan 20. The SDNML-based change-point score (the blue curve in Fig. 3a) rises sharply at the change point and quickly goes back down to zero. DTO converts the rise in the change-point score into a binary sequence of alarms (the red curve in Fig. 3a). The first detection time of SDNML+DTO was 9:00, Jan 16, ignoring the initial instability around Jan 11. Fig. 3b further demonstrates that the proposed link-based anomaly score can be combined with burst analysis. The two-state burst model correctly identifies the low state of Jan 11-Jan 15 and the high state of Jan 16-Jan 20. The first detection time of the burst approach was 9:01, Jan 16.
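The step from a change-point score to binary alarms can be illustrated with a much simpler stand-in than DTO, which is not described in this post. The sketch below is not DTO: it raises an alarm whenever a score exceeds the mean of all earlier scores by `k` standard deviations, and it stays silent for the first `warmup` points, loosely mirroring the way the initial instability is ignored above. All names and default values are assumptions.

```python
import statistics

def alarms_from_scores(scores, warmup=5, k=3.0):
    """Simplified stand-in for DTO: flag scores far above the running history.

    scores[t] raises an alarm if it exceeds mean + k * std of scores[:t].
    The first `warmup` points never raise an alarm.
    """
    alarms = []
    for t, score in enumerate(scores):
        if t < warmup:
            alarms.append(False)
            continue
        past = scores[:t]
        threshold = statistics.mean(past) + k * statistics.pstdev(past)
        alarms.append(score > threshold)
    return alarms

if __name__ == "__main__":
    scores = [0.1, 0.2, 0.1, 0.2, 0.1, 0.15, 5.0, 0.1]
    alarms = alarms_from_scores(scores)
    print(alarms.index(True))  # position of the first alarm
```

The first `True` in the output plays the role of the first detection time reported in the text.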




Figs. 4a and 4b show the same plots for the “Synthetic20” data set. Although the change in the link-based anomaly score on Jan 16 was smaller because fewer users reacted to the topic, the proposed SDNML+DTO successfully raised an alarm at 10:30, January 16, ignoring the initial instability around January 11. The burst-detection approach raised an alarm at 9:13, January 16, which was earlier than the SDNML-based approach. The above results show that the proposed approach can detect changes in the communication patterns of users even in a realistic setting in which only some of the users react to the emerging topic.

Discussion

Within the four data sets we have analyzed above, the proposed link-anomaly-based methods compared favorably against the text-anomaly-based methods on the “Youtube”, “NASA”, and “BBC” data sets. On the other hand, the text-anomaly-based methods detected the topics earlier on the “Job hunting” data set. The proposed link-anomaly-based methods performed even better than the keyword-based methods on the “NASA” and “BBC” data sets.
The above results support our hypothesis that the emergence of a new topic is reflected in the anomaly of the way people communicate with each other, and also that such emergence shows up earlier and more reliably in the anomaly of the mentioning behavior than in the anomaly of the textual contents. This is probably because the textual words suffer from variations caused by rephrasing.

Regarding the comparison with the keyword-based methods, the above observation is natural, because for the “Job hunting” and “Youtube” data sets, the keywords seemed to have been unambiguously defined from the beginning of the emergence of the topics, whereas for the “NASA” and “BBC” data sets, the keywords are more ambiguous. In particular, in the case of the “NASA” data set, people had been mentioning the “arsenic”-eating organism before NASA’s official release, but only rarely (see Fig. 7f). Thus, the keyword-frequency-based methods could not detect the keyword as an emerging topic, although the keyword “arsenic” appeared earlier than the official release. For the “BBC” data set, the proposed link-anomaly-based burst model detects two bursty areas (Fig. 8b). Interestingly, the link-anomaly-based change-point analysis only finds the first area (Fig. 8a), whereas the text-anomaly-based methods (Figs. 8c and 8d) and the keyword-frequency-based methods only find the second area (Figs. 8e and 8f). This is probably because there was an initial stage where people reacted individually using different words, and later there was another stage in which the keywords were more unified.
