Catálogo de publicaciones - libros

Compartir en
redes sociales


Advances in Web-Age Information Management: 7th International Conference, WAIM 2006, Hong Kong, China, June 17-19, 2006, Proceedings

Jeffrey Xu Yu ; Masaru Kitsuregawa ; Hong Va Leong (eds.)

En conferencia: 7º International Conference on Web-Age Information Management (WAIM) . Hong Kong, China . June 17, 2006 - June 19, 2006

Resumen/Descripción – provisto por la editorial

No disponible.

Palabras clave – provistas por la editorial

No disponibles.

Disponibilidad
Institución detectada Año de publicación Navegá Descargá Solicitá
No detectada 2006 SpringerLink

Información

Tipo de recurso:

libros

ISBN impreso

978-3-540-35225-9

ISBN electrónico

978-3-540-35226-6

Editor responsable

Springer Nature

País de edición

Reino Unido

Fecha de publicación

Información sobre derechos de publicación

© Springer-Verlag Berlin Heidelberg 2006

Tabla de contenidos

DGCL: An Efficient Density and Grid Based Clustering Algorithm for Large Spatial Database

Ho Seok Kim; Song Gao; Ying Xia; Gyoung Bae Kim; Hae Young Bae

Spatial clustering, which groups similar objects based on their distance, connectivity, or their relative density in space, is an important component of spatial data mining. Clustering large data sets has always been a serious challenge for clustering algorithms, because huge data set makes the clustering process extremely costly. In this paper, we propose DGCL, an enhanced Density-Grid based Clustering algorithm for Large spatial database. The characteristics of dense area can be enhanced by considering the affection of the surrounding area. Dense areas are analytically identified as clusters by removing sparse area or outliers with the help of a density threshold. Synthetic datasets are used for testing and the result shows the superiority of our approach.

- Clustering | Pp. 362-371

Scalable Clustering Using Graphics Processors

Feng Cao; Anthony K. H. Tung; Aoying Zhou

We present new algorithms for scalable clustering using graphics processors. Our basic approach is based on k-means. By changing the order of determining object labels, and exploiting the high computational power and pipeline of graphics processing units (GPUs) for distance computing and comparison, we speed up the k-means algorithm substantially. We introduce two strategies for retrieving data from the GPU, taking into account the low bandwidth from the GPU back to the main memory. We also extend our GPU-based approach to data stream clustering. We implement our algorithms in a PC with a Pentium IV 3.4G CPU and a NVIDIA GeForce 6800 GT graphics card. Our comprehensive performance study shows that the common GPU in desktop computers could be an efficient co-processor of CPU in traditional and data stream clustering.

- Clustering | Pp. 372-384

TreeCluster: Clustering Results of Keyword Search over Databases

Zhaohui Peng; Jun Zhang; Shan Wang; Lu Qin

A critical challenge in keyword search over relational data- bases (KSORD) is to improve its result presentation to facilitate users’ quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering method is not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our method’s effectiveness and efficiency.

- Clustering | Pp. 385-396

A New Method for Finding Approximate Repetitions in DNA Sequences

Di Wang; Guoren Wang; Qingquan Wu; Baichen Chen; Yi Zhao

Searching for approximate repetitions in a DNA sequence has been an important topic in gene analysis. One of the problems in the study is that because of the varying lengths of patterns, the similarity between patterns cannot be judged accurately if we use only the concept of ED ( Edit Distance ). In this paper we shall make effort to define a new function to compute similarity, which considers both the difference and sameness between patterns at the same time. Seeing the computational complexity, we shall also propose two new filter methods based on frequency distance and Pearson correlation, with which we can sort out candidate set of approximate repetitions efficiently. We use SUA instead of sliding window to get the fragments in a DNA sequence, so that the patterns of an approximate repetition have no limitation on length. The results show that with our technique we are able to find a bigger number of approximate repetitions than that of those found with tandem repeat finder.

- Clustering and Classification | Pp. 397-409

Dynamic Incremental Data Summarization for Hierarchical Clustering

Bing Liu; Yuliang Shi; Zhihui Wang; Wei Wang; Baile Shi

In many real world applications, with the databases frequent insertions and deletions, the ability of a data mining technique to detect and react quickly to dynamic changes in the data distribution and clustering over time is highly desired. Data summarizations (e.g., data bubbles) have been proposed to compress large databases into representative points suitable for subsequent hierarchical cluster analysis. In this paper, we thoroughly investigate the quality measure (data summarization index) of incremental data bubbles. When updating databases, we show which factors could affect the mean and standard deviation of data summarization index or not. Based on these statements, a fully dynamic scheme to maintain data bubbles incrementally is proposed. An extensive experimental evaluation confirms our statements and shows that the fully dynamic incremental data bubbles are effective in preserving the quality of the data summarization for hierarchical clustering.

- Clustering and Classification | Pp. 410-421

Classifying E-Mails Via Support Vector Machine

Lidan Shou; Bin Cui; Gang Chen; Jinxiang Dong

For addressing the growing problem of junk E-mail on the Internet, this paper proposes an effective E-mail classifying technique. Our work handles E-mail messages as semi-structured documents consisting of a set of fields with predefined semantics and a number of variable length free-text contents. The main contributions of this paper include the following: First, we present a Support Vector Machine (SVM) based model that incorporates the Principal Component Analysis (PCA) technique to reduce the data in terms of size and dimensionality of the input feature space. As a result, the input data become classifiable with fewer features, and the training process has faster convergence speed. Second, we build the classification model using both the -support vector machine and -support vector machine algorithms. Various control parameters for performance tuning are studied in an extensive set of experiments. The results of our performance evaluation indicate that the proposed technique is effective in E-mail classification.

- Clustering and Classification | Pp. 422-434

A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information

Wenyuan Dai; Yong Yu; Cong-Le Zhang; Jie Han; Gui-Rong Xue

Most existing web page classification algorithms, including content-based, link-based, or query-log analysis methods, treat the pages as smallest units. However, web pages usually contain some noisy or biased information which could affect the performance of classification. In this paper, we propose a Block Propagation Categorization (BPC) algorithm which deep mines web structure and views blocks as basic semantic units. Moreover, with query log information, BPC propagates only suitable information () among web pages to emphasize their topics. We also optimize the BPC algorithm to significantly speed up the block propagation process, without losing any precision. Our experiments on ODP and MSN search engine log show that BPC achieves a great improvement over traditional approaches.

- Data Mining | Pp. 435-446

Counting Graph Matches with Adaptive Statistics Collection

Jianhua Feng; Qian Qian; Yuguo Liao; Lizhu Zhou

High performance of query processing in large scale graph-structured data poses a pressing demand for high-quality statistics collection and selectivity estimation. Precise and succinct statistics collection about graph-structured data plays a crucial role for graph query selectivity estimation. In this paper, we propose the approach SMT, Succinct Markov Table, which achieves high precision in selectivity estimation with low memory space consumed. Four core notions of SMT are constructing, refining, compressing and estimating. The efficient algorithm SMTBuilder provides facility to build adaptive statistics model in the form of SMT. Versatile optimization rules, which investigate local bi-directional reachability, are introduced in SMT refining. During compressing, affective SMT grouping techniques are introduced. Statistical methods are used for selectivity estimations of various graph queries basing on SMT, especially for twig queries. By a thorough experimental study, we demonstrate SMT’s advantages in accuracy and space by comparing with previously known alternative, as well as the preferred optimization rules and compressing technique that would favor different real-life data.

- Data Mining | Pp. 447-459

Tight Bounds on the Estimation Distance Using Wavelet

Bing Liu; Zhihui Wang; Jingtao Li; Wei Wang; Baile Shi

Time series similarity search is of growing importance in many applications. Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. This paper proposes the tight upper and lower bounds on the estimation distance using wavelet transform, and we show that the traditional distance estimation is only part of our lower bound. According to the lower bound, we can exclude more dissimilar time series than traditional method. And according to the upper bound, we can directly judge whether two time series are similar, and further reduce the number of time series to process in original time domain. The experiments have shown that using the upper and lower tight bounds can significantly improve filter efficiency and reduce running time than traditional method.

- Data Mining | Pp. 460-471

Load Shedding for Window Joins over Streams

Donghong Han; Chuan Xiao; Rui Zhou; Guoren Wang; Huan Huo; Xiaoyun Hui

We present a novel load shedding technique over sliding window joins. We first construct a dual window architectural model including join-windows and aux-windows. With the statistics built on aux-windows, an effective load shedding strategy is developed to produce maximum subset join outputs. For the streams with high arrival rates, we propose an approach incorporating front-shedding and rear-shedding, and then address the problem of how to cooperate these two shedding processes through a series of calculations. Based on extensive experimentation with synthetic data and real life data, we show that our load shedding strategy delivers superb join output performance, and dominates the existing strategies.

- Data Stream Processing | Pp. 472-483