A Tool for Thematic Cartography of Corpora

Thibault Roy and Pierre Beust
GREYC, Team ISLanD, Computer Science Laboratory, University of Caen, F-14032 Caen Cedex, France
{troy, beust}@info.unicaen.fr

1 Introduction

Our research in Computer Science, and particularly in Natural Language Processing (NLP), concerns the electronic management of textual documents. For many information extraction and retrieval tasks, discovering the themes of a set of documents is an important and difficult analysis. The ProxiDocs tool [1] (see footnote 1) supports such tasks by letting users visualize graphical representations (called here “thematic maps”) of document sets, built from themes chosen and specified by the users themselves. These maps let users discover the thematic differences and similarities between the documents of the analyzed set. By “themes” we mean the main subjects dealt with in a corpus of documents [2]. Section 2 gives an overview of tools that use visualization methods to give access to the information in document sets. Section 3 describes the mechanism of ProxiDocs and the thematic maps it returns.

2 Textual data visualization

The quantity of information exchanged every day on the Web keeps increasing. Generalist search engines, like Google or Yahoo, return hundreds of links in response to a user's request. These links are ordered in lists, but such lists do not give users a global view of the returned results. Such a view would, however, reveal the principal categories of information in the returned Web pages. To give users a global view of document sets, graphical representations can be useful [3]. Many NLP tools represent sets of documents in a 2- or 3-dimensional space (see, for example, [4], [5], [6], [7]), each using its own visualization method. The visualization method used in ProxiDocs is called “cartography”. A map of a set of documents reveals proximities and links between textual entities (words, documents, etc.); this can be compared with road maps, which reveal proximities and routes between towns. Since 2001, the two metasearch engines KartOO [8] and MapStan [9] have been available on the Web (see footnote 2). These tools return a map built from the user's request; this map represents the Web pages related to the request. In these systems, similar pages are located in the same zone of the map, which also makes it possible to distinguish the principal categories of information related to the user's request.

1 ProxiDocs is available at: http://www.info.unicaen.fr/~troy/proxidocs.
2 KartOO and MapStan are respectively available at: http://www.kartoo.com and http://www.mapstan.net.

3 ProxiDocs, an NLP tool for thematic cartography

ProxiDocs returns thematic maps from a set of documents (for example, a corpus of newspaper articles or a set of Web pages returned by a search engine) and from themes specified by the user. A theme is defined by a set of words chosen by the user. For example, the theme “Sport” could be specified by the terms “winner”, “loser”, “tennis”, “football”, “Zinedine Zidane”, etc. For this theme construction task, we recommend the ThemeEditor tool [10] (see footnote 3), developed in our research team.

The first stage consists in counting the words of each theme in each document. A list of numbers is thus associated with each document, and each list constitutes an N-dimensional vector (N being the number of themes specified by the user). For instance, if a document contains 3 words of the theme “Economy”, 2 words of the theme “Sport” and 7 words of the theme “Education”, the vector associated with this document is (3, 2, 7).

The next stage consists in projecting these N-dimensional vectors onto a 2-dimensional space. To perform this projection, we use a method called Principal Component Analysis [11]. Each document is then represented by a point in a 2-dimensional space, so that first maps can be constructed. On these maps, the proximity between points informs the user about thematic similarities between documents. To highlight subsets of documents with similar themes, we use a clustering method called Ascendant Hierarchical Clustering [11] (agglomerative hierarchical clustering). Thematic maps thus reveal clusters of documents, each cluster containing documents dealing with similar subjects.

An experiment with ProxiDocs on a corpus of around 800 articles from the 1989 issues of the French newspaper “Le Monde” (around 700,000 words), with a generalist set of themes, produces the two kinds of maps shown in Fig. 1.

3 ThemeEditor is available at: http://www.info.unicaen.fr/~beust/ThemeEd/.

Fig. 1. Each point on the left map represents a document of the analyzed corpus; each disc on the right map represents a subset of documents. The height of a disc is proportional to the number of documents included in the subset. A color related to the principal theme of the document or subset of documents is attributed to each point and disc. Each point is a hyperlink to the document it represents; each disc is a hyperlink to a document outlining the themes of the documents in the represented subset.
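The three stages described in Section 3 (theme counting, projection, clustering) can be illustrated by the following minimal Python sketch. It is not the ProxiDocs implementation: the function names, the case-insensitive whole-word matching of theme terms, the power-iteration computation of the principal axes, and the centroid linkage used for the agglomerative clustering are all assumptions made for illustration, since the paper does not specify these details.

```python
import math
import random
import re


def theme_vector(text, themes):
    """Count occurrences of each theme's terms in a text (one count per theme).
    Assumption: case-insensitive, whole-word matching of each term."""
    lowered = text.lower()
    vector = []
    for terms in themes.values():
        count = 0
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            count += len(re.findall(pattern, lowered))
        vector.append(count)
    return vector


def principal_axis(matrix, rng, iterations=200):
    """Power iteration: return (eigenvalue, unit eigenvector) of the dominant axis."""
    dim = len(matrix)
    v = [rng.random() for _ in range(dim)]  # random start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left along any axis
            return 0.0, [0.0] * dim
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j]
                     for i in range(dim) for j in range(dim))
    return eigenvalue, v


def pca_2d(vectors, seed=0):
    """Project N-dimensional vectors onto their first two principal axes."""
    rng = random.Random(seed)
    n, dim = len(vectors), len(vectors[0])
    means = [sum(col) / n for col in zip(*vectors)]
    centered = [[x - m for x, m in zip(row, means)] for row in vectors]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    val1, axis1 = principal_axis(cov, rng)
    # Deflation: remove the first axis so the next run finds the second one.
    deflated = [[cov[i][j] - val1 * axis1[i] * axis1[j] for j in range(dim)]
                for i in range(dim)]
    _, axis2 = principal_axis(deflated, rng)
    return [[sum(x * a for x, a in zip(row, axis)) for axis in (axis1, axis2)]
            for row in centered]


def agglomerate(points, k):
    """Bottom-up clustering: merge the two closest clusters until k remain.
    Assumption: distance between clusters is the distance between centroids."""
    clusters = [[i] for i in range(len(points))]

    def centroid(cluster):
        return [sum(points[i][d] for i in cluster) / len(cluster)
                for d in range(len(points[0]))]

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters


if __name__ == "__main__":
    themes = {  # hypothetical themes and terms
        "Economy": ["market", "bank", "inflation"],
        "Sport": ["tennis", "football", "winner"],
        "Education": ["school", "teacher", "exam"],
    }
    corpus = [  # hypothetical documents
        "The bank expects inflation to unsettle the market.",
        "Inflation worries every bank on the market.",
        "The tennis winner also loves football.",
        "A football winner visited the tennis club.",
        "The teacher prepared the school exam.",
        "Every school needs a teacher before the exam.",
    ]
    vectors = [theme_vector(doc, themes) for doc in corpus]
    points = pca_2d(vectors)  # one (x, y) point per document
    for cluster in agglomerate(points, 3):
        print(cluster)  # indices of the documents grouped together
```

Each returned point corresponds to one document on a map such as the left one in Fig. 1, and each cluster to one disc on the right one. In practice, a numerical library would normally replace the hand-written projection and clustering; the pure-Python version is used here only to keep the sketch self-contained.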

4 Conclusion

ProxiDocs allows its users to visually identify the main thematic tendencies of a set of documents. We are now working on applying thematic cartography to search engine results, in order to distinguish the principal thematic categories of information contained in the returned Web pages. In a technology watch context, our tool could also reveal how the themes of a set of documents change over time. To improve our tool, we have to measure the “quality” of the thematic maps returned to users. This assessment will be difficult, because it requires a protocol that takes into account each user's point of view on the themes used to construct the thematic maps.

References

1. Roy T., Beust P., ProxiDocs, un outil de cartographie et de catégorisation thématique de corpus, Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data, pp. 978-987, 2004.
2. Pichon R., Sébillot P., Différencier les sens des mots à l'aide du thème et du contexte de leurs occurrences : une expérience, Proceedings of Natural Language Processing, pp. 279-299, 1999.
3. Shneiderman B., The eyes have it: a task by data type taxonomy for information visualization, Proceedings of Visual Languages, Boulder, 1996.
4. Salton G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Massachusetts, 1989.
5. Hearst M. A., Tilebars: Visualization of term distribution information in full text information access, Proceedings of ACM SIGCHI, pp. 59-66, 1995.
6. Robertson G.G., Mackinlay J.D., Card S.K., Cone Trees: Animated 3D Visualizations of Hierarchical Information, Proceedings of ACM SIGCHI, pp. 189-194, 1991.
7. Lamping J., A focus+context technique based on hyperbolic geometry for viewing large hierarchies, Proceedings of ACM SIGCHI, 1995.
8. Chung W., Chen H., Nunamaker J.F. Jr., Business Intelligence Explorer: A Knowledge Map Framework for Discovering Business Intelligence on the Web, Proceedings of the 36th Hawaii International Conference on System Sciences, 2002.
9. Spinat E., Pourquoi intégrer des outils de cartographie au sein des systèmes d’information de l’entreprise ?, Colloquium of Information Cartography, 2002.
10. Beust P., Un outil de coloriage de corpus pour la représentation de thèmes, Proceedings of the 6th International Conference on the Statistical Analysis of Textual Data, pp. 161-172, 2002.
11. Bouroche J.M., Saporta G., L’analyse des données, Collection Que sais-je ?, PUF, 1980.