Publications catalog - books


Advances in Web Mining and Web Usage Analysis: 8th International Workshop on Knowledge Discovery on the Web, WebKDD 2006 Philadelphia, USA, August 20, 2006 Revised Papers

Olfa Nasraoui ; Myra Spiliopoulou ; Jaideep Srivastava ; Bamshad Mobasher ; Brij Masand (eds.)

Conference: 8th International Workshop on Knowledge Discovery on the Web (WebKDD). Philadelphia, PA, USA. August 20, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Artificial Intelligence (incl. Robotics); Computer Communication Networks; Database Management; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Computers and Society

Availability
Detected institution Publication year Browse Download Request
Not detected 2007 SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-77484-6

Electronic ISBN

978-3-540-77485-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Predicting the Political Sentiment of Web Log Posts Using Supervised Machine Learning Techniques Coupled with Feature Selection

Kathleen T. Durant; Michael D. Smith

As the number of web logs grows dramatically, readers are turning to them as an important source of information. Automatic techniques that identify the political sentiment of web log posts will help bloggers categorize and filter this exploding information source. In this paper we illustrate the effectiveness of supervised learning for sentiment classification on web log posts. We show that a Naïve Bayes classifier coupled with a forward feature selection technique can on average correctly predict a posting’s sentiment 89.77% of the time with a standard deviation of 3.01. It significantly outperforms Support Vector Machines at the 95% confidence level with a confidence interval of [1.5, 2.7]. The feature selection technique provides on average an 11.84% and a 12.18% increase for Naïve Bayes and Support Vector Machines results, respectively. Previous sentiment classification research achieved 81% accuracy using Naïve Bayes and 82.9% using SVMs on a movie domain corpus.

Pp. 187-206

Analysis of Web Search Engine Query Session and Clicked Documents

David Nettleton; Liliana Calderón-Benavides; Ricardo Baeza-Yates

Identifying a user’s intention or interest by analyzing the queries submitted to a search engine and the documents selected as answers to these queries can be very useful for offering more adequate results to that user. In this chapter we present the analysis of a Web search engine query log from two different perspectives: the query session and the clicked document. In the first perspective, that of the query session, we process and analyze web search engine query and click data for the query session (query + clicked results) conducted by the user. We initially state some hypotheses for possible user types and quality profiles for the user session, based on descriptive variables of the session. In the second perspective, that of the clicked document, we repeat the process from the perspective of the documents (URLs) selected. We also initially define possible document categories and select descriptive variables to define the documents.

We apply a systematic data mining process to click data, contrasting unsupervised (Kohonen) and supervised (C4.5) methods to cluster and model the data. From the user session perspective, the goal is to identify profiles and rules that relate to theoretical user behavior and user session “quality”; from the document perspective, it is to identify document profiles that relate to theoretical user behavior and document (URL) organization.

Pp. 207-226

Understanding Content Reuse on the Web: Static and Dynamic Analyses

Ricardo Baeza-Yates; Álvaro Pereira; Nivio Ziviani

In this paper we present static and dynamic studies of duplicate and near-duplicate documents on the Web. The static study analyzes similar content among pages within a given snapshot of the Web, and the dynamic study analyzes how pages in an old snapshot are reused to compose new documents in a more recent snapshot. We ran a series of experiments using four snapshots of the Chilean Web. In the static study, we identify duplicates in both parts of the Web graph – reachable (connected by links) and unreachable components (unconnected) – aiming to identify where duplicates occur more frequently. We show that the number of duplicates on the Web seems to be much higher than previously reported (about 50% higher) and that, in our data, the number of duplicates in the unreachable Web is 74.6% higher than the number of duplicates in the reachable component of the Web graph. In the dynamic study, we show that some of the old content is used to compose new pages. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. We state the hypothesis that people use search engines to find pages and republish their content as a new document. We present evidence that this happens for part of the pages that have parents. In this case, part of the Web content is biased by the ranking function of search engines.

Pp. 227-246