Catálogo de publicaciones - libros
Advances in Web-Age Information Management: 7th International Conference, WAIM 2006, Hong Kong, China, June 17-19, 2006, Proceedings
Jeffrey Xu Yu ; Masaru Kitsuregawa ; Hong Va Leong (eds.)
En conferencia: 7º International Conference on Web-Age Information Management (WAIM) . Hong Kong, China . June 17, 2006 - June 19, 2006
Resumen/Descripción – provisto por la editorial
No disponible.
Palabras clave – provistas por la editorial
No disponibles.
Disponibilidad
Institución detectada | Año de publicación | Navegá | Descargá | Solicitá |
---|---|---|---|---|
No detectada | 2006 | SpringerLink |
Información
Tipo de recurso:
libros
ISBN impreso
978-3-540-35225-9
ISBN electrónico
978-3-540-35226-6
Editor responsable
Springer Nature
País de edición
Reino Unido
Fecha de publicación
2006
Información sobre derechos de publicación
© Springer-Verlag Berlin Heidelberg 2006
Cobertura temática
Tabla de contenidos
doi: 10.1007/11775300_31
DGCL: An Efficient Density and Grid Based Clustering Algorithm for Large Spatial Database
Ho Seok Kim; Song Gao; Ying Xia; Gyoung Bae Kim; Hae Young Bae
Spatial clustering, which groups similar objects based on their distance, connectivity, or their relative density in space, is an important component of spatial data mining. Clustering large data sets has always been a serious challenge for clustering algorithms, because huge data set makes the clustering process extremely costly. In this paper, we propose DGCL, an enhanced Density-Grid based Clustering algorithm for Large spatial database. The characteristics of dense area can be enhanced by considering the affection of the surrounding area. Dense areas are analytically identified as clusters by removing sparse area or outliers with the help of a density threshold. Synthetic datasets are used for testing and the result shows the superiority of our approach.
- Clustering | Pp. 362-371
doi: 10.1007/11775300_32
Scalable Clustering Using Graphics Processors
Feng Cao; Anthony K. H. Tung; Aoying Zhou
We present new algorithms for scalable clustering using graphics processors. Our basic approach is based on k-means. By changing the order of determining object labels, and exploiting the high computational power and pipeline of graphics processing units (GPUs) for distance computing and comparison, we speed up the k-means algorithm substantially. We introduce two strategies for retrieving data from the GPU, taking into account the low bandwidth from the GPU back to the main memory. We also extend our GPU-based approach to data stream clustering. We implement our algorithms in a PC with a Pentium IV 3.4G CPU and a NVIDIA GeForce 6800 GT graphics card. Our comprehensive performance study shows that the common GPU in desktop computers could be an efficient co-processor of CPU in traditional and data stream clustering.
- Clustering | Pp. 372-384
doi: 10.1007/11775300_33
TreeCluster: Clustering Results of Keyword Search over Databases
Zhaohui Peng; Jun Zhang; Shan Wang; Lu Qin
A critical challenge in keyword search over relational data- bases (KSORD) is to improve its result presentation to facilitate users’ quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering method is not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our method’s effectiveness and efficiency.
- Clustering | Pp. 385-396
doi: 10.1007/11775300_34
A New Method for Finding Approximate Repetitions in DNA Sequences
Di Wang; Guoren Wang; Qingquan Wu; Baichen Chen; Yi Zhao
Searching for approximate repetitions in a DNA sequence has been an important topic in gene analysis. One of the problems in the study is that because of the varying lengths of patterns, the similarity between patterns cannot be judged accurately if we use only the concept of ED ( Edit Distance ). In this paper we shall make effort to define a new function to compute similarity, which considers both the difference and sameness between patterns at the same time. Seeing the computational complexity, we shall also propose two new filter methods based on frequency distance and Pearson correlation, with which we can sort out candidate set of approximate repetitions efficiently. We use SUA instead of sliding window to get the fragments in a DNA sequence, so that the patterns of an approximate repetition have no limitation on length. The results show that with our technique we are able to find a bigger number of approximate repetitions than that of those found with tandem repeat finder.
- Clustering and Classification | Pp. 397-409
doi: 10.1007/11775300_35
Dynamic Incremental Data Summarization for Hierarchical Clustering
Bing Liu; Yuliang Shi; Zhihui Wang; Wei Wang; Baile Shi
In many real world applications, with the databases frequent insertions and deletions, the ability of a data mining technique to detect and react quickly to dynamic changes in the data distribution and clustering over time is highly desired. Data summarizations (e.g., data bubbles) have been proposed to compress large databases into representative points suitable for subsequent hierarchical cluster analysis. In this paper, we thoroughly investigate the quality measure (data summarization index) of incremental data bubbles. When updating databases, we show which factors could affect the mean and standard deviation of data summarization index or not. Based on these statements, a fully dynamic scheme to maintain data bubbles incrementally is proposed. An extensive experimental evaluation confirms our statements and shows that the fully dynamic incremental data bubbles are effective in preserving the quality of the data summarization for hierarchical clustering.
- Clustering and Classification | Pp. 410-421
doi: 10.1007/11775300_36
Classifying E-Mails Via Support Vector Machine
Lidan Shou; Bin Cui; Gang Chen; Jinxiang Dong
For addressing the growing problem of junk E-mail on the Internet, this paper proposes an effective E-mail classifying technique. Our work handles E-mail messages as semi-structured documents consisting of a set of fields with predefined semantics and a number of variable length free-text contents. The main contributions of this paper include the following: First, we present a Support Vector Machine (SVM) based model that incorporates the Principal Component Analysis (PCA) technique to reduce the data in terms of size and dimensionality of the input feature space. As a result, the input data become classifiable with fewer features, and the training process has faster convergence speed. Second, we build the classification model using both the -support vector machine and -support vector machine algorithms. Various control parameters for performance tuning are studied in an extensive set of experiments. The results of our performance evaluation indicate that the proposed technique is effective in E-mail classification.
- Clustering and Classification | Pp. 422-434
doi: 10.1007/11775300_37
A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information
Wenyuan Dai; Yong Yu; Cong-Le Zhang; Jie Han; Gui-Rong Xue
Most existing web page classification algorithms, including content-based, link-based, or query-log analysis methods, treat the pages as smallest units. However, web pages usually contain some noisy or biased information which could affect the performance of classification. In this paper, we propose a Block Propagation Categorization (BPC) algorithm which deep mines web structure and views blocks as basic semantic units. Moreover, with query log information, BPC propagates only suitable information () among web pages to emphasize their topics. We also optimize the BPC algorithm to significantly speed up the block propagation process, without losing any precision. Our experiments on ODP and MSN search engine log show that BPC achieves a great improvement over traditional approaches.
- Data Mining | Pp. 435-446
doi: 10.1007/11775300_38
Counting Graph Matches with Adaptive Statistics Collection
Jianhua Feng; Qian Qian; Yuguo Liao; Lizhu Zhou
High performance of query processing in large scale graph-structured data poses a pressing demand for high-quality statistics collection and selectivity estimation. Precise and succinct statistics collection about graph-structured data plays a crucial role for graph query selectivity estimation. In this paper, we propose the approach SMT, Succinct Markov Table, which achieves high precision in selectivity estimation with low memory space consumed. Four core notions of SMT are constructing, refining, compressing and estimating. The efficient algorithm SMTBuilder provides facility to build adaptive statistics model in the form of SMT. Versatile optimization rules, which investigate local bi-directional reachability, are introduced in SMT refining. During compressing, affective SMT grouping techniques are introduced. Statistical methods are used for selectivity estimations of various graph queries basing on SMT, especially for twig queries. By a thorough experimental study, we demonstrate SMT’s advantages in accuracy and space by comparing with previously known alternative, as well as the preferred optimization rules and compressing technique that would favor different real-life data.
- Data Mining | Pp. 447-459
doi: 10.1007/11775300_39
Tight Bounds on the Estimation Distance Using Wavelet
Bing Liu; Zhihui Wang; Jingtao Li; Wei Wang; Baile Shi
Time series similarity search is of growing importance in many applications. Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. This paper proposes the tight upper and lower bounds on the estimation distance using wavelet transform, and we show that the traditional distance estimation is only part of our lower bound. According to the lower bound, we can exclude more dissimilar time series than traditional method. And according to the upper bound, we can directly judge whether two time series are similar, and further reduce the number of time series to process in original time domain. The experiments have shown that using the upper and lower tight bounds can significantly improve filter efficiency and reduce running time than traditional method.
- Data Mining | Pp. 460-471
doi: 10.1007/11775300_40
Load Shedding for Window Joins over Streams
Donghong Han; Chuan Xiao; Rui Zhou; Guoren Wang; Huan Huo; Xiaoyun Hui
We present a novel load shedding technique over sliding window joins. We first construct a dual window architectural model including join-windows and aux-windows. With the statistics built on aux-windows, an effective load shedding strategy is developed to produce maximum subset join outputs. For the streams with high arrival rates, we propose an approach incorporating front-shedding and rear-shedding, and then address the problem of how to cooperate these two shedding processes through a series of calculations. Based on extensive experimentation with synthetic data and real life data, we show that our load shedding strategy delivers superb join output performance, and dominates the existing strategies.
- Data Stream Processing | Pp. 472-483