Publications catalog - books



Advances in Web Mining and Web Usage Analysis: 8th International Workshop on Knowledge Discovery on the Web, WebKDD 2006, Philadelphia, PA, USA, August 20, 2006, Revised Papers

Olfa Nasraoui; Myra Spiliopoulou; Jaideep Srivastava; Bamshad Mobasher; Brij Masand (eds.)

Presented at: 8th International Workshop on Knowledge Discovery on the Web (WebKDD), Philadelphia, PA, USA, August 20, 2006

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Artificial Intelligence (incl. Robotics); Computer Communication Networks; Database Management; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Computers and Society

Availability

Detected institution: none. Year of publication: 2007. Browse: SpringerLink.

Information

Resource type:

books

Print ISBN

978-3-540-77484-6

Electronic ISBN

978-3-540-77485-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Adaptive Website Design Using Caching Algorithms

Justin Brickell; Inderjit S. Dhillon; Dharmendra S. Modha

Visitors enter a website through a variety of means, including web searches, links from other sites, and personal bookmarks. In some cases the first page loaded satisfies the visitor's needs and no additional navigation is necessary. In other cases, however, the visitor is better served by content located elsewhere on the site, reached by navigating links. If the path between a user's current location and his eventual goal is circuitous, then the user may never reach that goal or will have to exert considerable effort to reach it. By mining site access logs, we can draw conclusions of the form "users who load page p are likely to later load page q." If there is no direct link from p to q, then it is advantageous to provide one. The process of providing links to users' eventual goals while skipping over the in-between pages is called shortcutting. Existing algorithms for shortcutting require substantial offline training, which makes them unable to adapt when access patterns change between training sessions. We present improved online algorithms for shortcut link selection that are based on a novel analogy drawn between shortcutting and caching. In the same way that cache algorithms predict which memory pages will be accessed in the future, our algorithms predict which web pages will be accessed in the future. Our algorithms are very efficient and are able to consider accesses over a long period of time, while giving extra weight to recent accesses. Our experiments show significant improvement in the utility of shortcut links selected by our algorithm compared to those selected by existing algorithms.

Pp. 1-20
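
The caching analogy lends itself to a small illustration. The sketch below is not the chapter's algorithm; it is a minimal Python rendition, assuming an LRU-style cache per source page whose contents double as that page's candidate shortcut links, so that eviction order naturally favors recent accesses. All page names are invented.

```python
from collections import OrderedDict

class ShortcutCache:
    """Per-page LRU cache: for each source page, remember the pages that
    visitors tend to reach later in the same session. The cache contents
    double as the candidate shortcut links for that page."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.caches = {}  # source page -> OrderedDict of later-visited pages

    def observe_session(self, pages):
        """Update the caches from one session's ordered page accesses."""
        for i, src in enumerate(pages):
            cache = self.caches.setdefault(src, OrderedDict())
            for dst in pages[i + 1:]:
                if dst == src:
                    continue
                cache.pop(dst, None)           # refresh recency
                cache[dst] = True
                if len(cache) > self.capacity:
                    cache.popitem(last=False)  # evict the least recent entry

    def shortcuts(self, page):
        """Suggested shortcut targets, most recently reinforced first."""
        return list(reversed(self.caches.get(page, OrderedDict())))

sc = ShortcutCache(capacity=2)
sc.observe_session(["home", "products", "specs", "order"])
sc.observe_session(["home", "support", "order"])
print(sc.shortcuts("home"))  # ['order', 'support']
```

Swapping the eviction policy (LFU, or LRU over decayed counts) changes how strongly recent accesses are favored, which mirrors the trade-off the abstract describes between long access histories and recent behavior.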

Incorporating Usage Information into Average-Clicks Algorithm

Kalyan Beemanapalli; Ramya Rangarajan; Jaideep Srivastava

A number of methods exist that measure the distance between two web pages. Average-Clicks is a new measure of distance between web pages that fits users' intuition of distance better than the traditional count of clicks between two pages. Average-Clicks, however, assumes that the probability of the user following any link on a web page is the same, and it therefore gives equal weight to each of the outgoing links. In our method, "Usage Aware Average-Clicks", we take the user's browsing behavior into account and assign different weights to the links on a page based on how frequently users follow them. Usage Aware Average-Clicks is thus an extension of the Average-Clicks algorithm in which the static web link structure graph is combined with the dynamic usage graph (built using the information available from the web logs) to assign different weights to the links on a web page, and hence captures the user's intuition of distance more accurately. A new distance metric has been designed using this methodology and used to improve the efficiency of a web recommendation engine.

Pp. 21-35
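
As a rough sketch of the distinction, the code below compares a uniform link weighting, in the spirit of Average-Clicks where each of a page's n out-links costs -log(1/n), with a usage-weighted variant whose link probabilities come from click counts in the logs. The graph, the click counts, and the smoothing are invented for the example and are not the chapter's exact metric.

```python
import heapq
from math import log

def link_costs(outlinks, usage=None):
    """Cost of following each out-link: -log(p). With no usage data,
    p is uniform (Average-Clicks style); with usage counts, p is the
    observed click fraction (usage-aware variant)."""
    if usage is None:
        p = {dst: 1.0 / len(outlinks) for dst in outlinks}
    else:
        total = sum(usage.get(dst, 1) for dst in outlinks)  # unseen links default to count 1
        p = {dst: usage.get(dst, 1) / total for dst in outlinks}
    return {dst: -log(p[dst]) for dst in outlinks}

def distance(graph, src, goal, usage=None):
    """Dijkstra over -log(p) edge costs: the shortest path is the most probable one."""
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue
        outlinks = graph.get(node, [])
        if not outlinks:
            continue
        for nxt, c in link_costs(outlinks, (usage or {}).get(node)).items():
            if d + c < dist.get(nxt, float("inf")):
                dist[nxt] = d + c
                heapq.heappush(heap, (d + c, nxt))
    return float("inf")

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
clicks = {"a": {"b": 9, "c": 1}}           # users overwhelmingly click a->b
print(distance(graph, "a", "d"))           # uniform weighting
print(distance(graph, "a", "d", clicks))   # usage-aware: a->b->d is cheaper
```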

Nearest-Biclusters Collaborative Filtering with Constant Values

Panagiotis Symeonidis; Alexandros Nanopoulos; Apostolos Papadopoulos; Yannis Manolopoulos

Collaborative Filtering (CF) systems have been studied extensively for more than a decade to confront the "information overload" problem. Nearest-neighbor CF forms the user's neighborhood based on either user similarities or item similarities. The effectiveness of these approaches would be augmented if we could combine them. In this paper, we use biclustering to disclose this duality between users and items by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters algorithm, which uses a new similarity measure that achieves partial matching of users' preferences. We apply nearest-biclusters in combination with Bimax, a biclustering algorithm for constant values. An extensive performance evaluation on two real data sets is provided, showing that the proposed method improves the performance of the CF process substantially: we attain more than 30% and 10% improvements in terms of precision and recall, respectively.

Pp. 36-55
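
A toy rendition of the nearest-biclusters idea, under invented data: biclusters arrive as (user set, item set) pairs with an implicit constant rating, the neighborhood of a target user is the k biclusters whose item sets best partially match her preferences, and unseen items are scored by the similarity of the biclusters containing them. The overlap-fraction similarity here is a stand-in for the paper's measure.

```python
def bicluster_similarity(user_items, bicluster_items):
    """Partial matching: the fraction of the bicluster's items the user likes."""
    return len(user_items & bicluster_items) / len(bicluster_items)

def recommend(user_items, biclusters, k=2, top_n=3):
    """Score unseen items by summing similarities over the k nearest biclusters."""
    ranked = sorted(biclusters,
                    key=lambda b: bicluster_similarity(user_items, b[1]),
                    reverse=True)[:k]
    scores = {}
    for users, items in ranked:
        sim = bicluster_similarity(user_items, items)
        for item in items - user_items:   # only recommend items the user lacks
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Biclusters as (user set, item set); the constant value (e.g. rating 5) is implicit.
biclusters = [
    ({"u1", "u2"}, {"i1", "i2", "i3"}),
    ({"u3", "u4"}, {"i3", "i4"}),
    ({"u5"}, {"i5", "i6"}),
]
print(recommend({"i1", "i3"}, biclusters))  # ['i2', 'i4']
```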

Fast Categorization of Web Documents Represented by Graphs

Alex Markov; Mark Last; Abraham Kandel

Most text categorization methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the mark-up information that is available from web document HTML tags.

A recently developed graph-based representation of web documents can preserve this structural information. The new document model was shown to outperform the traditional vector representation when using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this chapter, three new hybrid approaches to web document categorization are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using two model-based classifiers (the C4.5 decision-tree algorithm and the probabilistic Naïve Bayes classifier) and several benchmark web document collections. The results demonstrate that the hybrid methods outperform existing approaches in terms of classification accuracy in most cases and, in addition, achieve a significant increase in categorization speed.

Pp. 56-71
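
One way to picture the hybrid approach: keep a graph representation in which nodes are words and edges encode word order, but turn each document into a boolean feature vector over a dictionary of frequent subgraphs, so that an eager classifier such as Naïve Bayes or C4.5 can consume it. The sketch below uses single edges as the "subgraphs" and a plain document-frequency cutoff; the chapter's extraction is more elaborate.

```python
from collections import Counter

def doc_to_edges(text):
    """Word-order graph of a document: one edge per adjacent word pair."""
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}

def build_feature_edges(docs, min_support=2):
    """Keep edges that occur in at least min_support documents."""
    counts = Counter(e for d in docs for e in doc_to_edges(d))
    return sorted(e for e, c in counts.items() if c >= min_support)

def vectorize(doc, feature_edges):
    """Boolean vector: which frequent edges appear in this document's graph."""
    edges = doc_to_edges(doc)
    return [int(e in edges) for e in feature_edges]

docs = ["web usage mining", "web usage analysis", "data mining methods"]
features = build_feature_edges(docs)
print(features)                                  # [('web', 'usage')]
print([vectorize(d, features) for d in docs])    # [[1], [1], [0]] -> feed to any eager classifier
```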

Leveraging Structural Knowledge for Hierarchically-Informed Keyword Weight Propagation in the Web

Jong Wook Kim; K. Selçuk Candan

Although web navigation hierarchies enable effective browsing, their individual nodes cannot be indexed for search independently. This is because the contents of the individual nodes in a hierarchy are related to the contents of their neighbors, ancestors, and descendants in the structure. In this paper, we show that significant improvements in precision can be obtained by leveraging knowledge about the structure of hierarchical web content. In particular, we propose a novel keyword weight propagation technique to properly enrich the data nodes in web hierarchies. Our approach relies on the context provided by neighboring entries in a given structure, and we leverage this information to develop context-preserving keyword propagation schemes. We compare the results obtained through the proposed hierarchically-informed keyword weight (pre-)propagation schemes to existing state-of-the-art score and keyword propagation techniques and show that our approach significantly improves precision.

Pp. 72-91
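
The propagation idea can be sketched on a toy tree: each node's keyword weights are enriched by a damped share of the weights of its parent and children, so a node inherits context from its neighborhood in the hierarchy. The single one-hop pass and the damping factor of 0.5 are simplifying assumptions of this sketch, not the chapter's actual scheme.

```python
def propagate(tree, weights, damping=0.5):
    """One propagation pass: each node's keyword weights are enriched by
    damped contributions from its parent and children in the hierarchy."""
    parents = {c: p for p, kids in tree.items() for c in kids}
    enriched = {n: dict(w) for n, w in weights.items()}
    for node in weights:
        neighbors = list(tree.get(node, []))     # children
        if node in parents:
            neighbors.append(parents[node])      # parent
        for nb in neighbors:
            for kw, w in weights.get(nb, {}).items():
                enriched[node][kw] = enriched[node].get(kw, 0.0) + damping * w
    return enriched

tree = {"electronics": ["cameras", "phones"]}    # parent -> children
weights = {
    "electronics": {"electronics": 1.0},
    "cameras": {"camera": 1.0, "lens": 0.4},
    "phones": {"phone": 1.0},
}
print(propagate(tree, weights)["cameras"])
# {'camera': 1.0, 'lens': 0.4, 'electronics': 0.5} -- the node inherits its parent's context
```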

How to Define Searching Sessions on Web Search Engines

Bernard J. Jansen; Amanda Spink; Vinish Kathuria

In this research, we investigate three techniques for defining user sessions on Web search engines. We analyze 2,465,145 interactions from 534,507 Web searchers. We compare three methods for defining sessions: 1) Internet Protocol address and cookie; 2) Internet Protocol address, cookie, and a temporal limit on intra-session interactions; and 3) Internet Protocol address, cookie, and query reformulation patterns. Results show that defining sessions by query reformulation provides the best measure of session identification, with nearly 95% accuracy. This method also yields an 82% increase in the number of sessions compared to Internet Protocol address and cookie alone. Regardless of the method, mean session length was fewer than three queries and mean session duration was less than 30 minutes. The implication is that unique sessions may be a better indicator than the common industry metric of unique visitors for measuring search traffic. The results of this research may lead to tools that better support Web searching.

Pp. 92-109
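
A minimal sketch of the third method: within a single (IP address, cookie) stream, a new query starts a new session unless it shares terms with the previous query, i.e. looks like a reformulation. The bare term-overlap test stands in for the fuller reformulation patterns the paper analyzes.

```python
def segment_sessions(queries):
    """Split one searcher's query stream into sessions: consecutive
    queries that share at least one term are treated as reformulations
    and kept in the same session."""
    sessions, current = [], []
    prev_terms = set()
    for q in queries:
        terms = set(q.lower().split())
        if current and not (terms & prev_terms):
            sessions.append(current)   # no shared terms: a new session begins
            current = []
        current.append(q)
        prev_terms = terms
    if current:
        sessions.append(current)
    return sessions

stream = ["cheap flights", "cheap flights paris", "paris hotels", "python tutorial"]
print(segment_sessions(stream))
# [['cheap flights', 'cheap flights paris', 'paris hotels'], ['python tutorial']]
```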

Incorporating Concept Hierarchies into Usage Mining Based Recommendations

Amit Bose; Kalyan Beemanapalli; Jaideep Srivastava; Sigal Sahar

Recent studies have shown that the conceptual and structural characteristics of a website can play an important role in the quality of recommendations provided by a recommendation system. Resources like Google Directory, Yahoo! Directory, and web-content management systems attempt to organize content conceptually. Most recommendation models are limited in their ability to use this domain knowledge. We propose a novel technique to incorporate the conceptual characteristics of a website into a usage-based recommendation model, using a framework based on biological sequence alignment. Similarity scores play a crucial role in such a construction, and we introduce a scoring system that is generated from the website's concept hierarchy. These scores fit seamlessly with the other quantities used in similarity calculation, such as browsing order and time spent on a page; additionally, they demonstrate a simple, extensible system for assimilating more domain knowledge. We provide experimental results to illustrate the benefits of using a concept hierarchy.

Pp. 110-126
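
The scoring system can be approximated as follows: the similarity of two concepts is derived from how deep their lowest common ancestor sits in the site's concept hierarchy, and such scores can then feed a standard sequence alignment of two browsing sessions. The hierarchy, the normalization by the deeper concept's depth, and all names below are invented for illustration.

```python
def depth(hierarchy, node):
    """Distance from a node to the root, following child -> parent pointers."""
    d = 0
    while node in hierarchy:
        node, d = hierarchy[node], d + 1
    return d

def lca_score(hierarchy, a, b):
    """Similarity of two concepts: depth of their lowest common ancestor,
    normalized by the deeper concept's depth (1.0 for identical concepts)."""
    ancestors = set()
    n = a
    while True:                      # collect a's ancestors, including a itself
        ancestors.add(n)
        if n not in hierarchy:
            break
        n = hierarchy[n]
    n = b
    while n not in ancestors:        # walk up from b until the paths meet
        n = hierarchy[n]
    return depth(hierarchy, n) / max(depth(hierarchy, a), depth(hierarchy, b), 1)

# child -> parent; the root has no entry
hierarchy = {"slr": "cameras", "compact": "cameras", "cameras": "root", "phones": "root"}
print(lca_score(hierarchy, "slr", "compact"))  # 0.5: LCA 'cameras' at depth 1 of 2
print(lca_score(hierarchy, "slr", "phones"))   # 0.0: LCA is the root
```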

A Random-Walk Based Scoring Algorithm Applied to Recommender Engines

Augusto Pucci; Marco Gori; Marco Maggini

Recommender systems are an emerging technology that helps consumers find interesting products and useful resources. A recommender system makes personalized product suggestions by extracting knowledge from previous users' interactions. In this paper, we present "ItemRank", a random-walk based scoring algorithm that can be used to rank products according to expected user preferences, in order to recommend top-ranked items to potentially interested users. We tested our algorithm on a standard database, the MovieLens data set, which contains data collected from a popular recommender system for movies and which has been widely exploited as a benchmark for evaluating recently proposed approaches to recommender systems (e.g. [1,2]). We compared ItemRank with other state-of-the-art ranking techniques on this task. Our experiments show that ItemRank performs better than the other algorithms we compared it to while also being less complex in terms of memory usage and computational cost. The presentation of the method is accompanied by an analysis of the main properties of the MovieLens data set.

Pp. 127-146
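
The random-walk scoring can be sketched as a PageRank-style iteration over an item correlation graph, biased toward the items the target user has rated: IR = alpha * C * IR + (1 - alpha) * d_u, with C the column-normalized correlation matrix and d_u the user's normalized rating vector. The update rule follows that general scheme, but the matrix, the ratings, and alpha below are invented for the example.

```python
import numpy as np

def item_rank(corr, user_ratings, alpha=0.85, iters=100):
    """PageRank-style scoring biased toward the user's rated items:
    IR = alpha * C @ IR + (1 - alpha) * d_u."""
    C = corr / corr.sum(axis=0, keepdims=True)   # column-normalize the correlation graph
    d = user_ratings / user_ratings.sum()        # user preference (damping) vector
    ir = np.full(len(d), 1.0 / len(d))           # uniform start
    for _ in range(iters):
        ir = alpha * C @ ir + (1 - alpha) * d
    return ir

# Co-occurrence counts among 4 items (invented numbers).
corr = np.array([[0, 3, 1, 0],
                 [3, 0, 2, 1],
                 [1, 2, 0, 2],
                 [0, 1, 2, 0]], dtype=float)
ratings = np.array([5.0, 0.0, 4.0, 0.0])   # the user rated items 0 and 2
scores = item_rank(corr, ratings)
print(np.argsort(scores)[::-1])             # items ranked for this user
```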

Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering

Al Mamunur Rashid; Shyong K. Lam; Adam LaPitz; George Karypis; John Riedl

Collaborative Filtering (CF)-based recommender systems bring mutual benefits to both users and the operators of sites with too much information. Users benefit as they are able to find items of interest from an unmanageable number of available items. On the other hand, e-commerce sites that employ recommender systems can increase sales revenue in at least two ways: a) by drawing customers' attention to items that they are likely to buy, and b) by cross-selling items. However, the sheer number of customers and items typical in e-commerce systems demands specially designed CF algorithms that can gracefully cope with the vast size of the data. Many algorithms proposed thus far, where the principal concern is recommendation quality, may be too expensive to operate in a large-scale system. We propose ClustKnn, a simple and intuitive algorithm that is well suited for large data sets. The method first compresses data tremendously by building a straightforward but efficient clustering model. Recommendations are then generated quickly using a simple nearest-neighbor-based approach. We demonstrate the feasibility of ClustKnn both analytically and empirically. By comparing it with a number of other popular CF algorithms, we also show that, apart from being highly scalable and intuitive, ClustKnn provides very good recommendation accuracy as well.

Pp. 147-166
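
A compact rendition of the two-stage design, using numpy and scikit-learn for brevity (a library choice of this sketch, not the chapter's): the model-building phase compresses the user-item matrix into k cluster centroids that act as surrogate users, and the prediction phase runs nearest-neighbor CF over those few centroids instead of over all users. The toy matrix and the 0-means-unrated encoding are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_model(ratings, k=2):
    """Model-building phase: compress all users into k centroid
    'surrogate users' (0 means unrated in this toy encoding)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(ratings).cluster_centers_

def predict(user, centroids, n_neighbors=2):
    """CF phase: kNN over the handful of centroids instead of all users;
    predict each item as a similarity-weighted average of centroid values."""
    sims = np.array([1.0 / (1.0 + np.linalg.norm(user - c)) for c in centroids])
    nearest = np.argsort(sims)[::-1][:n_neighbors]
    w = sims[nearest]
    return (w @ centroids[nearest]) / w.sum()

ratings = np.array([[5, 4, 0, 1],    # toy user-item matrix
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)
centroids = build_model(ratings, k=2)
print(predict(ratings[0], centroids))  # predicted ratings for user 0
```

The scalability argument falls out of the structure: neighbor search is over k centroids rather than all users, so prediction cost no longer grows with the customer base.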

Detecting Profile Injection Attacks in Collaborative Filtering: A Classification-Based Approach

Chad A. Williams; Bamshad Mobasher; Robin Burke; Runa Bhaumik

Collaborative recommender systems have been shown to be vulnerable to profile injection attacks. By injecting a large number of biased profiles into a system, attackers can manipulate the predictions for targeted items. To decrease this risk, researchers have begun to study mechanisms for detecting and preventing profile injection attacks. In prior work, we proposed several attributes for attack detection and showed that a classifier built with them can be highly successful at identifying attack profiles. In this paper, we extend our work through a more detailed analysis of the information gain associated with these attributes across the dimensions of attack type and profile size. We then evaluate their combined effectiveness at improving the robustness of user-based recommender systems.

Pp. 167-186
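
The classification-based approach reduces to: compute detection attributes per profile, label known attack profiles, and train an off-the-shelf classifier. The sketch below invents two crude attributes, mean absolute deviation from item average ratings and profile size, as stand-ins for the richer attribute set the paper studies, and uses scikit-learn's decision tree as the classifier (another assumption of this sketch).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def profile_features(profile, item_means):
    """Two toy detection attributes per profile: mean absolute deviation
    from the item average ratings, and number of rated items."""
    rated = [(i, r) for i, r in enumerate(profile) if r > 0]
    dev = np.mean([abs(r - item_means[i]) for i, r in rated])
    return [dev, len(rated)]

# Toy profiles over 4 items (0 = unrated); injected profiles push item 3 with max ratings.
profiles = np.array([[4, 3, 0, 2], [5, 4, 2, 0], [3, 0, 4, 2],   # genuine
                     [5, 5, 5, 5], [5, 0, 5, 5]], dtype=float)   # injected
labels = [0, 0, 0, 1, 1]

# Per-item mean rating, ignoring unrated cells.
item_means = np.nanmean(np.where(profiles > 0, profiles, np.nan), axis=0)

X = [profile_features(p, item_means) for p in profiles]
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict([profile_features(np.array([5, 5, 5, 5.0]), item_means)]))  # [1]: flagged
```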