Performance and Trends in Recent Opinion Retrieval Techniques
Sylvester Olubolu Orimaye, Saadat M. Alhashmi and Siew Eu-Gene
Faculty of Information Technology, Monash University
email: {sylvester.orimaye, alhashmi, siew.eu-gene}@monash.edu
The Knowledge Engineering Review (in press), Cambridge University Press.
This paper presents trends and performance of opinion retrieval techniques proposed within the last eight years. We identify major techniques in opinion retrieval and group them into four popular categories. We describe the state-of-the-art techniques for each category and emphasize their performance and limitations. We then summarize with a performance comparison table for the techniques on different datasets. Finally, we highlight possible future research directions that can help solve existing challenges in opinion retrieval.
Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm
With Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan, Mohamad Mohsin. Published in World Academy of Science, Engineering and Technology, vol. 48, pp. 190-197
Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web
Gregory Grefenstette: Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web. CICLing 2007: 35-49
Dictionaries only contain some of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the decline in price of memory storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete language model for a language using all the text that is found on the Web for this language. This article describes our current project to do just that.
Expanding lexicons by inducing paradigms and validating attested forms
Gregory Grefenstette, Yan Qu, and D. A. Evans, Expanding lexicons by inducing paradigms and validating attested forms. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain, 2002.
One of the bottlenecks in Natural Language Processing for a given language is creating a lexicon that covers the language. The morphological lexicon provides two important pieces of information for NLP applications: 1) the normalization of a word, its lemmatization, which allows the application to recognize two variants of the same word; and 2) the part-of-speech roles that the word can play, which allows the application to parse the text, creating relations between the words in a text. Many NLP applications, e.g. Information Retrieval, Classification, Terminology Extraction, etc., depend upon the normalization and parsing information found in lexicons. When words are not present in these lexicons, it is difficult to predict what their proper lemmatizations and parts-of-speech are. In this paper we present a technique for updating a lexicon given an unknown word via induction of paradigms from an existing, but incomplete, lexicon and validation of the paradigm using corpus evidence.
Disputed sentence suggestion towards credibility-oriented Web search
Proceedings of the 14th Asia-Pacific Web Conference (APWeb 2012)
We propose a novel type of query suggestion to support credibility-oriented Web search. When users issue queries to search for Web pages, our system collects disputed sentences about the queries from the Web. Then, the system measures how typical and relevant each of the collected disputed sentences is to the given queries. Finally, the system suggests some of the most typical and relevant disputed sentences to the users. Conventional query suggestion techniques focus only on making it easy for users to search for Web pages matching their intent. Therefore, when users search for Web pages to check the credibility of specific opinions or statements, queries suggested by conventional techniques are not always useful in searching for evidence for credibility judgments. In addition, if users are not careful about the credibility of information in the Web search process, it is difficult for them to become aware of suspicious Web information through conventional query suggestions. Our disputed sentence suggestion enhances users’ awareness of suspicious statements so that they can search for Web pages with careful attention to them.
Estimating the Number of Concepts
Grefenstette, Gregory. "Estimating the Number of Concepts", chapter 8 in A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, Gilles-Maurice de Schryver (editor) (Ghent University and University of the Western Cape), Kampala: Menha Publishers, 2010, vii+375 pp; ISBN 978-9970-10-101-6
Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as the corpus.
[Web Usage Mining] Determining the success factors of a website with a logistic regression model
Web Usage Mining is the application of data mining techniques to web usage data in order to discover the real expectations of a website's users. A website is increasingly seen as a source of substantial revenue for the advertiser and therefore for the site owner, who has wide scope for making the site profitable while aligning it with the behavior observed among its visitors.
This work aims to provide the webmaster with a statistical model that highlights the variables that would best achieve an objective set by the site owner. The goal is to guide the webmaster to act correctly, rather than arbitrarily, to reach that objective, which is often a commercial one.
From a given sample of visits to the website under study, we built a logistic regression model that aims to model the probability of reaching the desired objective as a function of variables describing the visit, the visitor, and the site itself.
First, the model was chosen by backward stepwise selection. Then, a cross-validation algorithm was applied to assess the model's predictive power. Reading the results led to several recommendations for the webmaster.
Knowledge Retrieval from Dynamically Generated Web Pages
International Journal of Software Engineering 2(1) pp:2-6
Improved knowledge management on the modern World Wide Web is one of the major issues in retrieving accurate and complete data. The hidden Web, also known as the invisible Web or deep Web, has given rise to a new issue in Web mining research. Most documents in the hidden Web, including pages hidden behind search forms, specialized databases, and dynamically generated Web pages, are not accessible to general Web mining applications. In this paper, a system is designed that can access these hidden Web pages using Web structure mining techniques for better knowledge management. Modern Web pages use dynamic content generation, and user forms collect information from a particular user and store it in a database. The link structure lying in these forms cannot be accessed by conventional mining procedures. The accuracy ratio of Web page hierarchical structures can be improved by including these hidden Web pages in the process of Web structure mining. The designed system is robust enough to process dynamic Web pages along with static ones.
Web Layout Mining (WLM): A New Paradigm for Intelligent Web Layout Design
IEEE 4th International Conference on Information and Communication Technology (ICICT 2006) pp:639-650
The problem in designing modern Website projects is to produce content according to the latest trends and styles. Common Website editors only help to draw the intended layout; the difficulty is to design an accurate Web layout that meets demand and follows the latest trends and styles. This approach is useful when the user already has a specific layout in mind and is familiar with Web page layout principles, that is, with what kinds of layouts are possible. It is intrinsically difficult, particularly for those with limited artistic and creative abilities, to design from scratch a good layout that is acceptable in every respect. An automated system is required that can mine the layouts of the desired type of Website. The designed system for "Web layout mining (WLM)" helps to mine the most popular Web layouts from the Internet database and to design a Web layout that is close to acceptable and has the marks and features of modern requirements. The designed system is based on a rule-based algorithm that helps the user find samples related to his Website category; the user then chooses a desired Web layout and designs his own, with variations according to his own requirements.
Opinion mining in social media: modeling, simulating, and visualizing political opinion formation in the web
Affordable and ubiquitous online communications (social media) provide the means for flows of ideas and opinions and play an increasing role in the transformation and cohesion of society, yet little is understood about how online opinions emerge, diffuse, and gain momentum. To address this problem, an opinion formation framework based on content analysis of social media and sociophysical system modeling is proposed. Based on prior research and our own projects, three building blocks of online opinion tracking and simulation are described: (1) automated topic and opinion detection in real time, (2) topic and opinion modeling and agent-based simulation, and (3) visualizations of topic and opinion networks. Finally, two application scenarios are presented to illustrate the framework and motivate further research.
Scientific LogAnalyzer: A Web-based tool for analyses of server log files in psychological research.
Co-authored by Stieger, S.; published 2004 in Behavior Research Methods, Instruments, & Computers, 36, 304-311.
Scientific LogAnalyzer is a platform-independent interactive Web service for the analysis of log files. Scientific LogAnalyzer offers several features not available in other log file analysis tools, for example, organizational criteria and computational algorithms suited to aid behavioral and social scientists. Scientific LogAnalyzer is highly flexible on the input side (unlimited types of log file formats), while strictly keeping a scientific output format. Features include (1) free definition of log file format, (2) searching and marking dependent on any combination of strings (necessary for identifying conditions in experiment data), (3) computation of response times, (4) detection of multiple sessions, (5) speedy analysis of large log files, (6) output in HTML and/or tab-delimited form, suitable for import into statistics software, and (7) a module for analyzing and visualizing drop-out. Several methodological features specifically needed in the analysis of data collected in Internet-based experiments have been implemented in the Web-based tool and are described in this article. A regression analysis with data from 44 log file analyses shows that the size of the log file and the domain name lookup are the two main factors determining the duration of an analysis. It is less than a minute for a standard experimental study with a 2 X 2 design, a dozen Web pages, and 48 participants (ca. 800 lines, including data from drop-outs). The current version of Scientific LogAnalyzer is freely available for small log files. Its Web address is h
Web Layout Mining (WLM): A New Paradigm for Intelligent Web Layout Design
Egyptian Computer Science Journal, May 2007, 29(2) pp:54-63
The problem in designing modern website projects is to produce content according to the latest trends and styles. Common website editors only help to draw the intended layout; the difficulty is to design an accurate web layout that meets demand and follows the latest trends and styles. This approach is useful when the user already has a specific layout in mind and is familiar with web page layout principles, that is, with what kinds of layouts are possible. It is intrinsically difficult, particularly for those with limited artistic and creative abilities, to design from scratch a good layout that is acceptable in every respect. An automated system is required that can mine the layouts of the desired type of website. The designed system for “Web Layout Mining (WLM)” helps to mine the most popular web layouts from the internet database and to design a web layout that is close to acceptable and has the marks and features of modern requirements. The designed system is based on a rule-based algorithm that helps the user find samples related to his website category; the user then chooses a desired web layout and designs his own, with variations according to his own requirements.
Web Information Mining Framework using XML Based Knowledge Representation Engine
International Conference on Software Engineering (ISE’06) pp:18-23
Information or knowledge representation is one of the principal elements of artificial intelligence based applications. Conventionally, predicate logic is used in various languages to represent the required knowledge. The recent fashion in knowledge representation (KR) languages is to use XML as the low-level syntax. XML is the standard representation of multi-facet data. This generic representation tends to make the output of these KR languages easy for machines to parse. On the other hand, the modern web repository is a huge collection of multilingual information content. Information extraction is a vexing problem due to this multilingual base of the web data collection. The issue of semantic-based information searching has reached an impasse due to the variance and discrepancies in the characteristics of the data. In the conducted research, a dynamic Web information framework is proposed that uses an XML-based knowledge representation engine. The proposed framework can store data directly in XML format in the knowledge representation engine, avoiding the time and other communication constraints involved in using an XML layer for communication between the conspired mining agents and the underlying knowledge base representation engine.
The WWW as a Resource for Lexicography
Grefenstette, Gregory. 2002. The WWW as a Resource for Lexicography. In: Marie-Hélène Corréard (ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins, Euralex, pp. 199-215
Until the appearance of the Brown Corpus with its 1 million words in the 1960s and then, on a larger scale, the British National Corpus (the BNC) with its 100 million words, the lexicographer had to rely pretty much on his or her intuition (and amassed scraps of papers) to describe how words were used. Since the task of a lexicographer was to summarize the senses and usages of a word, that person was called upon to be very well read, with a good memory, and a great sensitivity to nuance. These qualities are still and always will be needed when one must condense the description of a great variety of phenomena into a fixed amount of space. But what if this last constraint, a fixed amount of space, disappears? One can then imagine fuller descriptions of how words are used. Taking this imaginative step, the FrameNet project has begun collecting new, fuller descriptions into a new type of lexicographical resource in which each entry will in principle provide an exhaustive account of the semantic and syntactic combinatorial properties of one "lexical unit" (i.e., one word in one of its uses).' (Fillmore & Atkins 1998) This ambition to provide an exhaustive accounting of these properties implies access to a large number of examples of words in use. Though the Brown Corpus and the British National Corpus can provide a certain number of these, the World Wide Web (WWW) presents a vastly larger collection of examples of language use. The WWW is a new resource for lexicographers in their task of describing word patterns and their meanings. In this chapter, we look at the WWW as a corpus, and see how this will change how lexicographers model word meaning.
Object/Background Scene Joint Classification in Photographs Using Linguistic Statistics from the Web
Bertrand Delezoide, Guillaume Pitel, Hervé Le Borgne, Gregory Grefenstette, Pierre-Alain Moëllic, Christophe Millet, "Object/Background Scene Joint Classification in Photographs Using Linguistic Statistics from the Web", OntoImage 2008 Workshop, LREC 2008, Marrakesh, pp. 22-30, 2008
Object and scene recognition is widely recognized as a difficult problem in computer vision. We present here an approach to this problem that merges recognition of an object and its background. Relying on the assumption that given objects are strongly linked to given background scenes (a deer is more likely to appear in a forest than on an iceberg), we learn object classifiers using joint estimations of object and scene. Such an approach would normally require a large quantity of training images labelled with object/background scene associations. To circumvent costly manual training set labelling, we propose a cross-modal approach, learning and incorporating contextual information via automatic text analysis from the Web, to generate the conditional probabilities of an object given a background scene. This method allows us to strictly distinguish the object classifier from the background scene classifier, and then merge them using estimated conditional probabilities through a learned Bayesian network. The key contribution of this paper is a framework that provides a unified, multimodal approach to learning and using contextual information for improving image processing using statistics obtained from processing Web text.
The Color of Things: Towards the Automatic Acquisition of Information for a Descriptive Dictionary
Grefenstette, G. (2005) The Color of Things: Towards the Automatic Acquisition of Information for a Descriptive Dictionary, in Revue Française de Linguistique Appliquée (RFLA), December, pp. 83-94
Physical objects are often described in dictionaries by visual features. But the information needed by computer applications for image analysis is not always found in dictionaries, nor in a complete form in any other publicly available information source. This article describes some first steps in finding more complete visual information about objects that could be used to enhance computer usable dictionaries and other knowledge repositories.
We show that some information about the common colors of objects can be extracted automatically from text found on the Web.
Social Media Driven Image Retrieval
Adrian Popescu, Gregory Grefenstette. Social Media Driven Image Retrieval, ICMR 2011, April 17-20, Trento, Italy
People often try to find an image using a short query and images are usually indexed using short annotations. Matching the query vocabulary with the indexing vocabulary is a difficult problem when little text is available. Textual user generated content in Web 2.0 platforms contains a wealth of data that can help solve this problem. Here we describe how to use Wikipedia and Flickr content to improve this match. The initial query is launched in Flickr and we create a query model based on co-occurring terms. We also calculate nearby concepts using Wikipedia and use these to expand the query. The final results are obtained by ranking the results for the expanded query using the similarity between their annotation and the Flickr model. Evaluation of these expansion and ranking techniques, over the ImageCLEF 2010 Wikipedia Collection containing 237,434 images and their textual annotations, shows that a consistent improvement is obtained compared to existing methods.
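The ranking step in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the query model is a plain count of terms that co-occur with the query in tag lists, and it swaps in cosine similarity as the similarity measure, which the abstract does not specify. The function names (`build_query_model`, `rerank`) and the toy data are hypothetical, and the Wikipedia-based expansion step is not modeled.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query_model(query: str, tagged_items: list[list[str]]) -> Counter:
    """Count terms co-occurring with all query terms in the same tag list.

    Assumption: simple co-occurrence counts stand in for the query model.
    """
    query_terms = set(query.lower().split())
    model = Counter()
    for tags in tagged_items:
        tags_lower = {t.lower() for t in tags}
        if query_terms <= tags_lower:
            model.update(tags_lower - query_terms)
    return model


def rerank(candidates: dict[str, list[str]], model: Counter) -> list[str]:
    """Order candidate ids by similarity of their annotation to the model."""
    scores = {
        img: cosine(Counter(t.lower() for t in tags), model)
        for img, tags in candidates.items()
    }
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda img: (-scores[img], img))


# Toy data (hypothetical): tag lists standing in for social-media annotations.
tagged = [["jaguar", "cat", "jungle"], ["jaguar", "cat", "zoo"],
          ["jaguar", "car"], ["dog"]]
model = build_query_model("jaguar", tagged)
ranking = rerank({"a": ["cat", "jungle"], "b": ["car", "road"], "c": ["boat"]},
                 model)
```

In this toy run the animal sense dominates the co-occurrence counts, so the candidate annotated with "cat" and "jungle" ranks above the one annotated with "car".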
