Predicting Information Flow
through Blogspace




This webpage is a summary of the research that I have conducted over the past year. It was advised by Dan Weld and funded by a Mary Gates research scholarship.

Introduction

There now exist millions of blogs (or web-logs) where users can easily post their thoughts and opinions, as well as links to websites of interest. Bloggers also post links to the various blogs that they read themselves. This interconnected set of blogs is referred to as blogspace and forms a complex social network.

Motivation

When a blogger posts a piece of information, each blogger that reads their blog is exposed to it. Occasionally one of those readers will repost that same piece of information on their own blog. This is referred to as information flow, and it becomes extremely interesting as it begins to recur.

A motivation for this work is found in viral marketing. Viral marketing is the idea that rather than marketing uniformly to all potential customers, a company might offer free samples of a product to an influential subset of users. This is done with the hope that these users will post positive reviews on their blogs which will influence others to try the product and also post positive reviews.

Blog Research

Previous research has identified popular pieces of information flowing through blogspace and inferred when a user is copying a piece of information from another user. There has also been work on identifying an influential subset of users for viral marketing algorithms. However, these algorithms rely on accurate models of influence and of how information flows through blogspace.

This work investigates models of influence and determines if it is possible to predict information flow through blogspace. It is hypothesized that blog postings can be predicted using the models that are presented next.
Animation of actual information flow.

Models

There are two major models in research literature that are thought to predict influence and information flow. They are the 'Linear Threshold Model' and the 'Independent Cascade Model'.

Linear Threshold Model
In the linear threshold model, each link between blogs is assigned a weight. For a given user, if the sum of the weights on links from the blogs they read that have posted on a topic exceeds a threshold, then the user will repost on that topic. This is equivalent to a single layer neural network.
An example of linear thresholds.
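
The threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the blog names, the weights and the threshold value are all hypothetical and are not taken from the project's code or data.

```python
def will_repost(in_weights, posted, threshold):
    """Return True if the summed weight of read blogs that posted exceeds the threshold.

    in_weights: dict mapping each blog the user reads to the weight on that link.
    posted: set of blogs that have already posted on the topic.
    """
    total = sum(w for blog, w in in_weights.items() if blog in posted)
    return total > threshold


# Hypothetical weights on the links from the blogs that one user reads.
weights = {"alice": 0.5, "bob": 0.25, "carol": 0.5}

print(will_repost(weights, {"alice"}, threshold=0.6))           # 0.5 does not exceed 0.6
print(will_repost(weights, {"alice", "carol"}, threshold=0.6))  # 1.0 exceeds 0.6
```

A single exposure from "alice" is not enough here, but exposures from both "alice" and "carol" push the sum over the threshold.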

Independent Cascade Model
In the independent cascade model, each link between blogs is assigned a probability. Each time a blog posts on a topic, each user who reads that blog will repost with the probability that has been assigned to the link between them. This model is referred to as the 'Independent Cascade Model' because these probabilities are assumed to be independent of any other users that have already posted on the topic.
An example of independent cascades.
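
A cascade of this kind can be simulated with a short Python sketch. It assumes the common formulation in which each newly posting blog gets one chance to influence each of its readers; the function name, the graph and the probabilities are hypothetical and chosen only so the outcome is easy to follow.

```python
import random


def simulate_cascade(readers, prob, seeds, rng):
    """Simulate one independent cascade and return the set of blogs that posted.

    readers: dict mapping a blog to the list of blogs that read it.
    prob: dict mapping (blog, reader) to the repost probability on that link.
    seeds: blogs that post on the topic initially.
    rng: a random.Random instance (passed in so runs can be reproduced).
    """
    posted = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for blog in frontier:
            for reader in readers.get(blog, []):
                # Each link is tried independently of who else has posted.
                if reader not in posted and rng.random() < prob[(blog, reader)]:
                    posted.add(reader)
                    next_frontier.append(reader)
        frontier = next_frontier
    return posted


# Hypothetical graph with probabilities of 0 or 1 so the result is deterministic.
readers = {"a": ["b", "c"], "b": ["d"]}
prob = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "d"): 1.0}
result = simulate_cascade(readers, prob, seeds=["a"], rng=random.Random(0))
print(sorted(result))
```

With these values the topic spreads from "a" to "b" and then to "d", while "c" never reposts.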

Dataset and learning

The data for this project was gathered from the popular blog hosting site LiveJournal.com. This site hosts millions of blogs and receives over 200,000 new posts per day. The information tracked for this experiment consisted of all posted hyperlinks as well as the 800 most popular terms. Data was gathered for around 3 months, and in that time over 1,700,000 instances of a user reposting information were observed.

The 'Linear Threshold Model' and the 'Independent Cascade Model' are both graph algorithms. To use these models, graphs must be constructed from the gathered data. This is done by making each blog into a node and each instance where one blogger reads another blog into a link between nodes. Thus links in the graph are directional.
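
As a rough sketch of this construction, the following Python builds a directed graph from (reader, blog read) pairs. The input format, the function name and the choice to point each link from the blog being read toward its reader (the direction information would flow) are assumptions made for illustration.

```python
def build_graph(reads):
    """Build a directed graph mapping each blog to the set of blogs that read it.

    reads: iterable of (reader, blog_read) pairs.
    """
    graph = {}
    for reader, blog in reads:
        graph.setdefault(blog, set()).add(reader)  # link: blog -> reader
        graph.setdefault(reader, set())            # every blog is a node
    return graph


# Hypothetical reading relationships.
reads = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
graph = build_graph(reads)
print(graph)
```

Note that "bob" reading "alice" and "alice" reading "bob" produce two separate links, which is why the graph is directional.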

As the 'Linear Threshold Model' mimics a single layer neural net, the model is easily learned with back propagation. The 'Independent Cascade Model' is learned by likelihood maximization. Many hours are taken to pre-process all of the data, but the models can be learned in under an hour on a modern PC. Besides determining when users copied information from blogs they read, it is also necessary to determine which topics are popular and should be tracked.
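
To give a feel for how the link weights of a single layer network might be learned, here is a minimal Python sketch that trains one logistic unit by gradient descent (the single-layer case of back propagation). It is not the project's training code: the sigmoid activation, the learning rate, the epoch count, the bias term standing in for the negated threshold, and the toy examples are all assumptions.

```python
import math


def train_unit(examples, n_links, epochs=200, lr=0.5):
    """Learn one weight per incoming link, plus a bias, for a single user.

    examples: list of (x, y) where x[i] is 1 if the i-th blog the user reads
    posted on a topic (else 0) and y is 1 if the user reposted (else 0).
    """
    weights = [0.0] * n_links
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted repost probability
            err = y - p                     # gradient of the log-likelihood w.r.t. z
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


# Toy data: the user reposts exactly when the first blog has posted.
examples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]
weights, bias = train_unit(examples, n_links=2)
print(weights, bias)
```

On this toy data the first link ends up with a much larger weight than the second, reflecting that only the first blog influences the user.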

Results

The methods are analyzed in terms of precision and recall. Precision is the fraction of the predictions made that turn out to be correct. Recall is the fraction of actual reposts that the model predicts. The two trade off: a model can usually achieve higher precision by making fewer predictions, at the cost of lower recall.
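
Under the standard definitions, precision is the fraction of predicted reposts that actually occurred, and recall is the fraction of actual reposts that were predicted. The short Python sketch below computes both for sets of repost events; the function name and the integer event IDs are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted and actual repost events."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical event IDs: 4 predictions, 6 actual reposts, 3 in common.
predicted = {1, 2, 3, 4}
actual = {1, 2, 3, 5, 6, 7}
print(precision_recall(predicted, actual))  # (0.75, 0.5)
```

Here three of the four predictions are correct (precision 0.75), but only three of the six actual reposts were found (recall 0.5).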

As shown below, the 'Linear Threshold Model' has higher precision at all levels of recall. This shows it to be the better model for predicting information flow on this dataset. Both models, however, are able to predict information flow, with the 'Independent Cascade Model' having a precision of at least 68% and the 'Linear Threshold Model' a precision of at least 78%.
Animation of inferred information flow.




Secondary Models

Three other models were investigated in addition to the 'Linear Threshold' and 'Independent Cascade' models. Two are modifications of the 'Linear Threshold Model': 'Global Topic' and 'Global User', while the third is a different strategy in general: 'Number of Exposures'.

Global Topic
This model incorporates additional information into the 'Linear Threshold Model'. Besides learning a weight for each link in blogspace, there are additional weights for each topic. Intuitively this makes it possible for the model to contain information about some topics being more interesting or infectious among users. It also aids the algorithm in predicting reposts for users where data is sparse (which is often the case).

Global User
Unlike the 'Linear Threshold Model' where weights are learned for links between blogs, in this model weights are learned for users globally. Thus information is shared in general for how influential different bloggers are on a global scale. Again, this aids the model in predicting reposts for users where data is sparse.

Number of Exposures
This model assumes that a user reposts simply after a given number of exposures to the topic, independent of the topic or which users have posted the topic.
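
A minimal Python sketch of this rule is shown below. The function names and the cutoff `k` are hypothetical; the source does not state what number of exposures was used.

```python
def count_exposures(blogs_read, posters):
    """Count how many of the blogs a user reads have posted on the topic."""
    return sum(1 for blog in blogs_read if blog in posters)


def predicts_repost(blogs_read, posters, k):
    """Predict a repost once the user has seen the topic at least k times."""
    return count_exposures(blogs_read, posters) >= k


# Hypothetical user who reads three blogs, two of which posted on the topic.
blogs_read = ["alice", "bob", "carol"]
posters = {"alice", "carol", "dave"}
print(count_exposures(blogs_read, posters))       # 2
print(predicts_repost(blogs_read, posters, k=2))  # True
```

Unlike the threshold sketch, every exposure counts equally here, regardless of the topic or of who posted it.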

Secondary Results

As can be seen on the graph below, both the 'Global Topic' (accidentally referred to as 'Global Topic') and the 'Global User' models are better able to predict reposts than the 'Linear Threshold Model'. The 'Number of Exposures' model is surprisingly poor at predicting blog postings. This is because the number of posts a blogger was exposed to is highly correlated between the instances when they reposted and the instances when they did not, so the count alone does little to separate the two.




Conclusions

It is concluded that blog postings are predictable! The 'Linear Threshold Model' retains an accuracy of at least 78% for all levels of recall. The 'Global Topic' model has the highest precision of all, with over 82% accuracy for all recall levels. Thus with over 82% accuracy, it can be predicted when a blogger will find a topic interesting enough to repost!

Future Work

None of the models presented truly models user interests in any way, even though user interests should have a large impact on predicting reposts. Incorporating this information into a model would require decomposing all topics into distributions over interests. Some sort of interest modeling, or perhaps user clustering, should greatly aid in this problem.

It would also be of interest to expand the set of blogs being watched beyond LiveJournal, or to track more types of information. This research only tracked hyperlinks and the 800 most popular terms. Clearly this could have been done more extensively by classifying documents according to topics, or simply interests in general (again, clustering).


Last updated: May 22, 2006 Edited for typos: October 23, 2006