Publications catalog - books

Document Analysis Systems VII: 7th International Workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006, Proceedings

Horst Bunke; A. Lawrence Spitz (eds.)

Conference: 7th International Workshop on Document Analysis Systems (DAS). Nelson, New Zealand. February 13-15, 2006

Abstract/description – provided by the publisher

Not available.

Keywords – provided by the publisher

Database Management; Pattern Recognition; Information Storage and Retrieval; Image Processing and Computer Vision; Simulation and Modeling; Computer Appl. in Administrative Data Processing

Availability
Institution detected | Year of publication | Browse | Download | Request
Not detected | 2006 | SpringerLink | – | –

Information

Resource type:

books

Print ISBN

978-3-540-32140-8

Electronic ISBN

978-3-540-32157-6

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2006

Publication rights information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

The Effects of OCR Error on the Extraction of Private Information

Kazem Taghva; Russell Beckley; Jeffrey Coombs

OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies, however, have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually corrected versions of the same documents. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.

Keywords: Hidden Markov Model; Private Information; Information Extraction; Text Categorization; Document Image.

- Session 10: Retrieval and Segmentation | Pp. 348-357

Combining Multiple Classifiers for Faster Optical Character Recognition

Kumar Chellapilla; Michael Shilman; Patrice Simard

Traditional approaches to combining classifiers attempt to improve classification accuracy at the cost of increased processing. They may be viewed as providing an accuracy-speed trade-off: higher accuracy for lower speed. In this paper we present a novel approach to combining multiple classifiers to solve the inverse problem of significantly improving classification speed at the cost of slightly reduced classification accuracy. We propose a cascade architecture for combining classifiers and cast the process of building such a cascade as a search and optimization problem. We present two algorithms, based on steepest descent and dynamic programming, for quickly producing approximate solutions. We also present a simulated annealing algorithm and a depth-first-search algorithm for finding optimal solutions. Results on handwritten optical character recognition indicate that a) a speedup of 4-9 times is possible with no increase in error and b) speedups of up to 15 times are possible when twice as many errors can be tolerated.

Keywords: Simulated Annealing; Hidden Node; Simulated Annealing Algorithm; Multiple Classifier; Convolutional Neural Network.

- Session 10: Retrieval and Segmentation | Pp. 358-367
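
The cascade idea above lends itself to a compact illustration. The following Python sketch shows a generic confidence-thresholded cascade: a fast classifier answers when it is confident, otherwise the input falls through to a slower, more accurate one. The stage classifiers and thresholds here are hypothetical stand-ins; the paper's actual contribution, searching and optimizing the cascade structure itself, is not reproduced.

```python
# Minimal sketch of a confidence-thresholded classifier cascade.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])

@dataclass
class CascadeStage:
    classify: Callable[[object], Prediction]  # returns (label, confidence)
    threshold: float                          # accept if confidence >= threshold

def cascade_predict(stages: List[CascadeStage], x: object) -> str:
    """Run stages cheapest-first; stop at the first confident answer."""
    label, conf = None, 0.0
    for stage in stages:
        label, conf = stage.classify(x)
        if conf >= stage.threshold:
            return label          # early exit: no later (slower) stage runs
    return label                  # fall back to the last stage's answer

# Usage with toy stand-in classifiers:
fast = CascadeStage(lambda x: ("A", 0.95), threshold=0.9)   # cheap stage
slow = CascadeStage(lambda x: ("B", 0.99), threshold=0.0)   # expensive stage
print(cascade_predict([fast, slow], x="glyph"))             # -> "A"
```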

Performance Comparison of Six Algorithms for Page Segmentation

Faisal Shafait; Daniel Keysers; Thomas M. Breuel

This paper presents a quantitative comparison of six algorithms for page segmentation: X-Y cut, smearing, whitespace analysis, constrained text-line finding, Docstrum, and Voronoi-diagram-based. The evaluation is performed using a subset of the UW-III collection commonly used for evaluation, with a separate training set for parameter optimization. We compare the results using both default parameters and optimized parameters. In the course of the evaluation, the strengths and weaknesses of each algorithm are analyzed, and it is shown that no single algorithm outperforms all other algorithms. However, we observe that the three best-performing algorithms are those based on constrained text-line finding, Docstrum, and the Voronoi-diagram.

Keywords: Segmentation Algorithm; Document Image; Optical Character Recognition; Text Region; Text Block.

- Session 10: Retrieval and Segmentation | Pp. 368-379
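
Of the six algorithms compared, the X-Y cut is the simplest to state, and a short sketch conveys the flavor of projection-based page segmentation. The version below recursively splits a binarized page at the widest interior whitespace valley of its row or column projection profile; the min_gap threshold is an illustrative default, not one of the paper's optimized parameters.

```python
# Sketch of recursive X-Y cut page segmentation on a binary image.
import numpy as np

def _widest_gap(profile, min_gap):
    """Return (start, end) of the widest run of zeros strictly inside
    the profile, or None if it is narrower than min_gap."""
    best, start = None, None
    for i, v in enumerate(profile):
        if v == 0:
            if start is None:
                start = i
        else:
            if start is not None and start > 0:
                if best is None or (i - start) > (best[1] - best[0]):
                    best = (start, i)
            start = None
    return best if best and (best[1] - best[0]) >= min_gap else None

def xy_cut(img, x0=0, y0=0, min_gap=10, out=None):
    """img: 2-D array, nonzero = ink. Splits recursively at the widest
    whitespace valley; collects (x, y, w, h) leaf regions in `out`."""
    if out is None:
        out = []
    h, w = img.shape
    row_gap = _widest_gap(img.sum(axis=1), min_gap)  # horizontal cut
    col_gap = _widest_gap(img.sum(axis=0), min_gap)  # vertical cut
    if row_gap:
        a, b = row_gap
        xy_cut(img[:a, :], x0, y0, min_gap, out)
        xy_cut(img[b:, :], x0, y0 + b, min_gap, out)
    elif col_gap:
        a, b = col_gap
        xy_cut(img[:, :a], x0, y0, min_gap, out)
        xy_cut(img[:, b:], x0 + b, y0, min_gap, out)
    else:
        out.append((x0, y0, w, h))
    return out
```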

HVS Inspired System for Script Identification in Indian Multi-script Documents

Peeta Basa Pati; A. G. Ramakrishnan

Identification of the script of the text present in multi-script documents is one of the important first steps in the design of an OCR system. Much work has been reported relating to Roman, Arabic, Chinese, Korean and Japanese scripts. Though some work involving Indian scripts has been reported, it is still in its nascent stage. For example, most of the work assumes that the script changes only at the level of the line, which is rarely an acceptable assumption in the Indian scenario. In this work, we report a script identification algorithm which takes into account the fact that the script changes at the word level in most Indian bilingual or multilingual documents. Initially, we deal with the identification of the script of words, using Gabor filters, in a bi-script scenario. Later, we extend this to tri-script and then five-script scenarios. The combination of Gabor features with a nearest neighbor classifier shows promising results. Words of different font styles and sizes are used. We have shown that our identification scheme, inspired by the Human Visual System (HVS) and utilizing the same feature and classifier combination, works consistently well for any of the script combinations tested.

Keywords: Gabor filter; script identification; prototype selection.

- Posters | Pp. 380-389
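
A rough sketch of the feature/classifier pairing the paper builds on: mean magnitudes of responses to a small Gabor filter bank as the word-image feature, classified by a nearest neighbor rule. The filter-bank parameters (orientations, frequencies, kernel size) are illustrative assumptions, not the paper's tuned bank.

```python
# Sketch: Gabor filter-bank features + nearest neighbor script labeling.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, freq, sigma=4.0, size=15):
    """Real Gabor kernel at orientation theta (radians), spatial freq."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def gabor_features(word_img, thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4),
                   freqs=(0.1, 0.2)):
    """Feature vector = mean absolute response per (orientation, freq)."""
    return np.asarray(
        [np.abs(fftconvolve(word_img, gabor_kernel(t, f), mode="same")).mean()
         for t in thetas for f in freqs])

def nearest_neighbour(query, prototypes):
    """prototypes: list of (feature_vector, script_label); returns label."""
    return min(prototypes, key=lambda p: np.linalg.norm(query - p[0]))[1]

# Usage (images are hypothetical):
# protos = [(gabor_features(tamil_img), "Tamil"),
#           (gabor_features(roman_img), "Roman")]
# script = nearest_neighbour(gabor_features(word_img), protos)
```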

A Shared Fragments Analysis System for Large Collections of Web Pages

Junchang Ma; Zhimin Gu

Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. However, the lack of good methods for analyzing interesting fragments in large collections of web pages is preventing existing large web sites from using fragment-based techniques. Fragments are considered interesting if they are completely or structurally shared among multiple web pages. This paper first gives a formal description of the problem and then presents our system for shared fragments analysis. We propose a carefully designed data structure for representing web pages and develop an efficient algorithm that makes use of database techniques. Our system is unique in its shared fragments analysis for large collections of web pages. The system has been built and successfully applied to several large collections of web pages, which has demonstrated its effectiveness and usefulness; it may serve as a core building block in many applications.

Keywords: Large Collection; Fragment Analysis; Document Object Model; Document Object Model Tree; Shared Fragment.

- Posters | Pp. 390-401
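
One simple way to realize "completely shared" fragment detection is to hash a canonical serialization of every DOM subtree and report keys that recur across pages. The sketch below does exactly that on pre-parsed (tag, text, children) trees; the paper's data structure and database-backed algorithm, including its handling of structurally shared fragments, are richer than this toy.

```python
# Sketch: find DOM subtrees shared verbatim across multiple pages.
from collections import defaultdict

# A page is a pre-parsed DOM-like tree: (tag, text, [children]).
def subtree_keys(node, page_id, index):
    """Post-order walk; record a canonical key for every subtree."""
    tag, text, children = node
    child_keys = tuple(subtree_keys(c, page_id, index) for c in children)
    key = (tag, text, child_keys)                 # canonical form
    index[key].add(page_id)
    return key

def shared_fragments(pages):
    """pages: dict page_id -> root node. Returns fragments on >= 2 pages."""
    index = defaultdict(set)
    for pid, root in pages.items():
        subtree_keys(root, pid, index)
    return {k: ids for k, ids in index.items() if len(ids) >= 2}

# Toy usage: a sidebar shared by two pages.
sidebar = ("div", "news", [])
pages = {"a.html": ("body", "", [sidebar, ("p", "hello", [])]),
         "b.html": ("body", "", [sidebar, ("p", "world", [])])}
print(list(shared_fragments(pages).values()))  # [{'a.html', 'b.html'}]
```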

Offline Handwritten Arabic Character Segmentation with Probabilistic Model

Pingping Xiu; Liangrui Peng; Xiaoqing Ding; Hua Wang

Research on offline handwritten Arabic character recognition has received more and more attention in recent years because of the increasing need for Arabic document digitization. The variation in Arabic handwriting makes character segmentation and recognition very difficult; e.g., the sub-parts (diacritics) of an Arabic character may shift away from the main part. In this paper, a new probabilistic segmentation model is proposed. First, a contour-based over-segmentation method is applied, cutting the word image into graphemes. The graphemes are sorted into three queues: character main parts, sub-parts (diacritics) above main parts, and sub-parts below main parts. The confidence for each character is calculated by the probabilistic model, taking into account both the recognizer output and the geometric confidence, together with logical constraints. Then, a global optimization is conducted to find the optimal cutting path, taking the weighted average of character confidences as the objective function. Experiments on handwritten Arabic documents with various writing styles show that the proposed method is effective.

Keywords: Optical Character Recognition; Logical Rule; Integral Character; Arabic Text; Stroke Width.

- Posters | Pp. 402-412
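
The optimal-cutting-path search can be illustrated with a small dynamic program: group consecutive over-segmented graphemes into characters so that the summed character confidence is maximal. The conf(i, j) scorer stands in for the paper's combined recognizer/geometric/logical confidence, and the plain sum replaces the paper's weighted-average objective; both simplifications are assumptions of this sketch.

```python
# Sketch: dynamic programming over grapheme cutting paths.
from functools import lru_cache

def best_cutting_path(n, conf, max_span=4):
    """n graphemes; conf(i, j) scores graphemes [i, j) as one character.
    Returns (best_score, tuple of (i, j) character spans)."""
    @lru_cache(maxsize=None)
    def solve(i):
        if i == n:
            return 0.0, ()
        best = (float("-inf"), ())
        for j in range(i + 1, min(i + max_span, n) + 1):
            score, spans = solve(j)          # best completion after j
            cand = (conf(i, j) + score, ((i, j),) + spans)
            if cand[0] > best[0]:
                best = cand
        return best
    return solve(0)

# Toy usage: 3 graphemes where merging the last two scores best.
table = {(0, 1): 0.9, (1, 2): 0.3, (2, 3): 0.4, (1, 3): 0.8,
         (0, 2): 0.2, (0, 3): 0.1}
print(best_cutting_path(3, lambda i, j: table.get((i, j), 0.0)))
# -> (1.7, ((0, 1), (1, 3)))
```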

Automatic Keyword Extraction from Historical Document Images

Kengo Terasawa; Takeshi Nagasaki; Toshio Kawashima

This paper presents an automatic keyword extraction method for historical document images. The proposed method is language independent because it is purely appearance based: neither lexical information nor any other statistical language model is required. Moreover, since it does not need word segmentation, it can be applied to Eastern languages that do not put clear spacing between words. The first half of the paper describes the algorithm for retrieving document image regions whose appearance is similar to a given query image. The algorithm was evaluated in recall-precision terms and achieved 80–90% average precision. The second half of the paper describes the keyword extraction method, which works even if no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system can successfully extract keywords from relatively large documents.

Keywords: Query Image; Document Image; Word Segmentation; Matching Cost; Query Word.

- Posters | Pp. 413-424
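
The retrieval half of the method, matching a query word image against document regions with pruning, can be sketched as follows. Descriptors here are flattened fixed-length vectors (length divisible by 8), the matching cost is a mean squared difference, and the coarse-bin lower bound used for pruning is an illustrative choice, not the paper's.

```python
# Sketch: appearance-based word spotting with a cheap pruning bound.
import numpy as np

def full_cost(q, c):
    """Exact matching cost: mean squared difference of descriptors."""
    return float(np.mean((q - c) ** 2))

def lower_bound(q, c):
    """Cheap bound from coarse 8-bin averages; by Jensen's inequality
    it never exceeds full_cost, so it is safe for pruning."""
    qb = q.reshape(8, -1).mean(axis=1)
    cb = c.reshape(8, -1).mean(axis=1)
    return float(np.mean((qb - cb) ** 2))

def spot(query, candidates, threshold):
    """candidates: list of (region_id, descriptor). Returns matches
    sorted by cost; the bound discards many losers cheaply."""
    hits = []
    for rid, desc in candidates:
        if lower_bound(query, desc) > threshold:
            continue                      # pruned: cannot beat threshold
        cost = full_cost(query, desc)
        if cost <= threshold:
            hits.append((cost, rid))
    return sorted(hits)

# Usage: plant a near-duplicate among random distractors.
rng = np.random.default_rng(0)
q = rng.random(64)
cands = [("r%d" % i, rng.random(64)) for i in range(100)] + [("hit", q + 0.01)]
print(spot(q, cands, threshold=0.01)[:1])   # the planted near-duplicate wins
```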

Digitizing a Million Books: Challenges for Document Analysis

K. Pramod Sankar; Vamshi Ambati; Lakshmi Pratha; C. V. Jawahar

This paper describes the challenges the document image analysis community faces in building large digital libraries with diverse document categories. The challenges are identified from the experience of ongoing activities toward digitizing and archiving one million books. A smooth workflow has been established for archiving large quantities of books, with the help of efficient image processing algorithms. However, much more research is needed to address the challenges arising from the diversity of content in digital libraries.

Keywords: Digital Library; Document Image; Digital Content; Digitization Process; Indian Language.

- Posters | Pp. 425-436

Toward File Consolidation by Document Categorization

Abdel Belaïd; André Alusse

An efficient, adaptive document classification and categorization approach is proposed for building personal files corresponding to a user's specific needs and profile. Such an approach is needed because search engines are often too general to offer a precise answer to the user's request. As we cannot act directly on the search engines' methodology, we propose instead to act on the retrieved documents by classifying and ranking them properly. A classifier combination approach is considered. The classifiers are chosen to be highly complementary in order to treat all aspects of the query and, in the end, to present the user with a readable and comprehensible result. The application concerns law articles from the European Union database. Law texts are always entangled with cross-references and accompanied by updating files (for application dates, new terms and formulations). Our approach finds a real application here, offering the specialist (jurist, lawyer, etc.) a synthetic view of the law related to the requested topic.

Keywords: Vector Space Model; Agglomerative Hierarchical Cluster; Document Retrieval; Document Categorization; Automatic Cluster.

- Posters | Pp. 437-448
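
As a hedged illustration of re-ranking retrieved documents by combining complementary rankers, the sketch below fuses two rankings with reciprocal rank fusion, a standard combination rule; the paper's own combination scheme is not spelled out in the abstract and is not reproduced here.

```python
# Sketch: combine complementary rankers with reciprocal rank fusion.
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of ordered doc-id lists (best first).
    Fused score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two complementary rankers disagree; fusion promotes the doc both like.
print(rrf([["d1", "d2", "d3"], ["d2", "d3", "d1"]]))  # ['d2', 'd1', 'd3']
```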

Finding Hidden Semantics of Text Tables

Saleh A. Alrashed

Combining data from different sources for further automatic processing is often hindered by differences in the underlying semantics and representation. Therefore, when linking information presented in documents in tabular form with data held in databases, it is important to determine as much information as possible about the table and its content. Important information about the table data is often given in the text surrounding the table. A table's creators cannot clarify all the semantics in the table itself, so they use the table's context, the text around it, to give further information. These semantics are very useful when integrating and using the data, but are often difficult to detect automatically. We propose a solution to part of this problem based on a domain ontology. The input to our system is a document that contains tabular data, and the system aims to find semantics in the document that are related to that data. The output is a set of detected semantics linked to the corresponding table. The system uses elements of semantic detection, semantic representation, and data integration. In this paper, we discuss the experiment used to evaluate the prototype system and the different types of tests it performs. After running the system on the test data and gathering the results, we present the significant findings of our experiment.

- Posters | Pp. 449-461
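
The core idea, harvesting table semantics from nearby text with a domain ontology, can be illustrated in a few lines: scan the paragraphs around the table's position for ontology terms and attach the matching annotations. The tiny ontology, the oil-domain example, and the one-paragraph window are all illustrative assumptions, not the paper's actual system.

```python
# Sketch: attach ontology annotations found in text around a table.
import re

ONTOLOGY = {                      # term -> semantic annotation (toy)
    "barrels per day": {"unit": "bbl/d", "dimension": "flow rate"},
    "crude oil": {"concept": "Petroleum"},
}

def table_semantics(paragraphs, table_index, window=1):
    """Scan the paragraphs within `window` of the table's position and
    return the ontology annotations whose terms occur there."""
    lo = max(0, table_index - window)
    context = " ".join(paragraphs[lo:table_index + window + 1]).lower()
    return {t: a for t, a in ONTOLOGY.items()
            if re.search(r"\b" + re.escape(t) + r"\b", context)}

paras = ["Production figures for crude oil follow.",
         "<TABLE 1 HERE>",
         "All quantities are in barrels per day."]
print(table_semantics(paras, table_index=1))  # both terms are detected
```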