Data Mining (91 page)

Read Data Mining Online

Authors: Mehmed Kantardzic

BOOK: Data Mining
9.87Mb size Format: txt, pdf, ePub

In general, these sequences, obtained from large log files, correspond to a frequently accessed pattern in an information-providing service.

The problem of finding LRS is very similar to that of finding frequent itemsets (occurring in a sufficient number of transactions) in association-rule mining. However, they are different from each other in that a reference sequence in the mining-traversal patterns has to be references in a given order, whereas a large itemset in mining association rules is just a combination of items in a transaction. The corresponding algorithms are different because they perform operations on different data structures: lists in the first case, sets in the second. As the popularity of Internet applications explodes, it is expected that one of the most important data-mining issues for years to come will be the problem of effectively discovering knowledge on the Web.

11.5 PAGERANK ALGORITHM

PageRank was originally published by Sergey Brin and Larry Page, the co-creators of Google. It likely contributed to the early success of Google. PageRank provides a global ranking of nodes in a graph. For search engines it provides a query-independent, authority ranking of all Web pages. PageRank has similar goals of finding authoritative Web pages to that of the HITS algorithm. The main assumption behind the PageRank algorithm is that every link from page
a
to page
b
is a vote by page
a
for page
b
. Not all votes are equal. Votes are weighted by the PageRank score of the originating node.

PageRank is based on the random surfer model. If a surfer were to randomly select a starting Web page, and at each time step the surfer were to randomly select a link on the current Web page, then PageRank could be seen as the probability that this random surfer is on any given page. Some Web pages do not contain any hyperlinks. In this model it is assumed that the random surfer selects a random Web page when exiting pages with no hyperlinks. Additionally, there is some chance that the random surfer will stop following links and restart the process.

Computationally the PageRank (
Pr
) of page u can be computed as follows:

Here
d
is a dampening factor
usually set to 0.85.
N
refers to the total number of nodes in the graph. The function
In
(
u
) returns the set of nodes with edges pointing into node u. |
Out
(
v
)| returns the number of nodes with edges pointing from
v
to that node. For example, if the Web-page connections are those given in Figure
11.4
, and the current node under consideration were node B, then the following values would hold through all iterations: N = 3, In(B) = {A,C}, |Out(A)| = |{B,C}| = 2, and |Out(C)| = |{B}| = 1. The values for Pr(A) and Pr(C) would vary depending on the calculations from the previous iterations. The result is a recursive definition of PageRank. To calculate the PageRank of a given node, one must calculate the PageRank of all nodes with edges pointing into that given node.

Figure 11.4.
First example used to demonstrate PageRank.

Often PageRank is calculated using an iterative approach where all nodes are given an initial value for Pr of 1/N. Then during a single iteration we calculate what the PageRank of each node would be according to the current values of all nodes linking to that node. This process is repeated until the change between iterations is below some predetermined threshold or the maximum number of iterations is achieved. Let us consider an example graph with three nodes as follows:

Initially the Pr values are as follows:

The first iteration is as follows:

The second iteration then shows the passing of these values through the graph:

Other books

The Flesh Cartel by Rachel Haimowitz, Heidi Belleau
Corrupt Cravings by Salaiz, Jennifer
Genius of Place by Justin Martin
The Duke Dilemma by Shirley Marks
Moon of Aphrodite by Sara Craven
Mr. Louie Is Screwy! by Dan Gutman
But Inside I'm Screaming by Flock, Elizabeth