Research Papers & Patents

Understanding the importance of basic research, we consistently participate in academic conferences, review for journals, and conduct our own original research.


Books

Human language technology systems have typically focused on the factual aspect of content analysis. Other aspects, including pragmatics, point of view, and style, have received much less attention. However, to achieve an adequate understanding of a text, these aspects cannot be ignored. In this symposium, we address computer-based analysis of point of view. Our goal is to bring together people from academia, government, and industry to explore annotation, modeling, mining, and classification of opinion, subjectivity, attitude, and affect in text, across a range of text management applications. The symposium therefore addresses a rather wide range of issues, from theoretical questions and models, through annotation standards and methods, to algorithms for recognizing, clustering, characterizing, and displaying attitudes and affect in text. Despite growing interest in this area, with papers recently published in major conferences and new corpora developed, there has never been a workshop or symposium that targets a wide audience of researchers and practitioners on these topics.

Knowledge discovery is an area of computer science that attempts to uncover interesting and useful patterns in data that permit a computer to perform a task autonomously or assist a human in performing a task more efficiently. Soft Computing for Knowledge Discovery provides a self-contained and systematic exposition of the key theory and algorithms that form the core of knowledge discovery from a soft computing perspective. It focuses on knowledge representation, machine learning, and the key methodologies that make up the fabric of soft computing - fuzzy set theory, fuzzy logic, evolutionary computing, and various theories of probability (e.g. naive Bayes and Bayesian networks, Dempster-Shafer theory, mass assignment theory, and others). In addition to describing many state-of-the-art soft computing approaches to knowledge discovery, the author introduces Cartesian granule features and their corresponding learning algorithms as an intuitive approach to knowledge discovery. This new approach embraces the synergistic spirit of soft computing and exploits uncertainty in order to achieve tractability, transparency and generalization. Parallels are drawn between this approach and other well known approaches (such as naive Bayes and decision trees) leading to equivalences under certain conditions. The approaches presented are further illustrated in a battery of both artificial and real-world problems. Knowledge discovery in real-world problems, such as object recognition in outdoor scenes, medical diagnosis and control, is described in detail. These case studies provide further examples of how to apply the presented concepts and algorithms to practical problems. The author provides web page access to an online bibliography, datasets, source codes for several algorithms described in the book, and other information. Soft Computing for Knowledge Discovery is for advanced undergraduates, professionals and researchers in computer science, engineering and business information systems who work or have an interest in the dynamic fields of knowledge discovery and soft computing.

Digital Advertising

Over the past 18 years, online advertising has grown into a $70 billion-a-year industry worldwide. Despite this impressive growth, online advertising faces many (and some would say traditional) challenges, including how to measure efficiency or the potential loss of sales caused by the inefficient use of advertising dollars. Consequently, it is vital to measure, maximize, and benchmark the efficiency of advertising media expenditures. This tutorial introduces the field of econometrics as a means of measuring the effectiveness of digital marketing. Econometrics is a field that extends and applies statistical methods to the analysis of economic phenomena. In that vein, econometrics goes beyond traditional statistics and explicitly recognizes the complexities of human behavior. Consider, for example, the impact of deep discounts on the survival of restaurants. Struggling businesses are more likely to offer these deep discounts and eventually fail. A naive application of statistical techniques will overestimate the impact of deep discounts on business survival. In this case, the discounts are an endogenous variable rather than an exogenous variable. This type of specification error highlights why we need a deeper look at the variables that go into statistical models. Econometrics addresses these and other issues in a formal and rigorous manner.
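As a concrete illustration of the endogeneity issue in the restaurant example, the sketch below (purely illustrative, not part of the tutorial, written in Python with NumPy) simulates data in which an unobserved "struggling" factor drives both deep discounting and failure; a naive comparison then wrongly attributes the failure risk to the discount itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Unobserved confounder: how much the business is struggling.
struggling = rng.normal(size=n)

# Struggling businesses are more likely to offer deep discounts...
discount = (struggling + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

# ...and more likely to fail, independent of the discount itself
# (the true causal effect of the discount is set to zero here).
fail_prob = 1 / (1 + np.exp(-(1.5 * struggling - 1.0)))
failed = rng.binomial(1, fail_prob).astype(float)

# Naive comparison: failure rate with vs. without a discount.
naive_effect = failed[discount == 1].mean() - failed[discount == 0].mean()
print(f"naive estimate of discount effect on failure: {naive_effect:+.3f}")

# Conditioning on the confounder (possible only in simulation) removes the bias.
X = np.column_stack([np.ones(n), discount, struggling])
beta, *_ = np.linalg.lstsq(X, failed, rcond=None)
print(f"discount coefficient after controlling for confounder: {beta[1]:+.3f}")
```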

Active learning is a form of supervised machine learning in which the learning algorithm is able to interactively query the teacher to obtain a label for new data points. There are situations in which unlabeled data is abundant but labeling data is expensive. In such a scenario the learning algorithm can actively query the user/teacher for labels. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning.
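A minimal sketch of the pool-based, uncertainty-sampling flavour of active learning described above, assuming scikit-learn and a synthetic dataset; the loop queries the "teacher" only for the pool example the current model is least certain about.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

labeled = list(rng.choice(len(X), size=10, replace=False))   # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                        # 20 query rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Uncertainty sampling: pick the pool example closest to the decision boundary.
    uncertainty = 1 - proba.max(axis=1)
    query = pool.pop(int(np.argmax(uncertainty)))
    labeled.append(query)                  # the "teacher" reveals y[query]

model.fit(X[labeled], y[labeled])
print("labeled examples used:", len(labeled))
print("accuracy on remaining pool:", model.score(X[pool], y[pool]))
```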

Digital online advertising is a form of promotion that uses the Internet and Web for the express purpose of delivering marketing messages to attract customers. Examples of online advertising include text ads that appear on search engine results pages, banner ads, in-text ads, or Rich Media ads that appear on regular web pages, portals, or applications. Over the past 15 years online advertising, a $65 billion industry worldwide in 2009, has been pivotal to the success of the Web. That being said, the field of advertising has been equally revolutionized by the Internet, the Web and, more recently, by the emergence of the social web and mobile devices. This success has arisen largely from the transformation of the advertising industry from a low-tech, human-intensive, "Mad Men" way of doing work to highly optimized, quantitative, mathematical, computer- and data-centric processes that enable highly targeted, personalized, performance-based advertising. This chapter provides a clear and detailed overview of the technologies and business models that are transforming the field of online advertising, primarily from statistical machine learning and information science perspectives.

Digital online advertising is a form of promotion that uses the Internet and World Wide Web for the express purpose of delivering marketing messages to attract customers. Frequency capping is a term in digital advertising that means restricting (or capping) the number of times (frequency) a specific visitor to a website or group of websites (in the case of ad networks) is shown a particular advertisement. It is a feature of ad serving that allows the advertiser or ad network to limit the maximum number of impressions/views of a specific ad that a visitor can see within a period of time; the advertiser or advertising network specifies a limit on the number of impressions it will allow per day, per week, or per month for an individual user. Frequency capping is often viewed as a key means of preventing banner burnout (the point where visitors are being overexposed and response drops) and of maintaining a competitive quality score (a core component of expected CPM-based ranking). Generally, the frequency capping policy for an ad is set heuristically by the advertiser or is determined heuristically by the ad network where the ad runs, and is optimized for short-term gain. In this paper we propose a data-driven and principled approach that optimizes the lifetime value of site visitors. We propose to set frequency capping policies for different online marketing segments using Markov decision processes (MDPs). Managing targeted marketing (customer relationship management) campaigns in this way can lead to substantial improvement in several business metrics such as click-through rates and revenue. Though the proposed approach lacks evaluation at the time of submission, we hope to complete a study using this approach and present the results at the workshop.
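To make the MDP formulation concrete, here is a toy value-iteration sketch (all numbers and the burnout curve are invented for illustration, not taken from the paper) in which the state is a visitor's exposure count, the action is whether to show the ad or serve an alternative, and the resulting policy implies a frequency cap.

```python
import numpy as np

# Toy frequency-capping MDP (all numbers are illustrative, not from the paper).
# State: number of impressions this visitor has already seen (0..MAX_FREQ).
# Actions: show this ad, or skip it and serve some alternative ad instead.
MAX_FREQ = 10
GAMMA = 0.95            # discount factor over the visitor's lifetime
CLICK_VALUE = 1.0       # value of a click on this ad
ALT_VALUE = 0.01        # expected value of showing an alternative ad instead

def click_prob(freq):
    """Click-through rate decays with exposures (banner burnout)."""
    return 0.03 * np.exp(-0.4 * freq)

def q_values(s, V):
    s_next = min(s + 1, MAX_FREQ)
    q_skip = ALT_VALUE + GAMMA * V[s]                       # exposure count unchanged
    q_show = CLICK_VALUE * click_prob(s) + GAMMA * V[s_next]
    return q_skip, q_show

V = np.zeros(MAX_FREQ + 1)
for _ in range(500):                                        # value iteration
    V = np.array([max(q_values(s, V)) for s in range(MAX_FREQ + 1)])

policy = ["show" if q_values(s, V)[1] > q_values(s, V)[0] else "skip"
          for s in range(MAX_FREQ + 1)]
print(policy)   # the implied frequency cap is where "show" flips to "skip"
```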

Advertising is a multi-billion dollar industry that has become a significant component of the Web browsing experience. Online advertising systems incorporate many information retrieval techniques by combining content analysis, user interaction models, and commercial constraints. Advances in online advertising have come from integrating several core research areas: information retrieval, data mining, machine learning, and user modeling. The workshop will serve as an open forum for discussion of new ideas and current research related to information retrieval topics relevant to online advertising. The outcome will be a set of full and short papers covering a variety of topics. The short paper format will allow researchers new to the area to actively participate and explore novel themes. It will also enable researchers without access to extensive empirical data to propose ideas and experiments. We also expect the workshop to help develop a community of researchers interested in this area, and yield future collaboration and exchanges. Despite its commercial significance, advertising is a rather young field of research. This workshop will help the emerging research community better organize and develop a common perspective. The workshop will serve as a forum for researchers and industry participants to exchange latest ideas and best practices while encouraging future breakthroughs. It will also aid in fostering collaboration between industry and academia.

Information Retrieval

To investigate what kind of snippets are better suited for structured search on mobile devices, we built an experimental mobile search application and conducted a task-oriented interactive user study with 36 participants. Four different versions of a search engine result page (SERP) were compared by varying the snippet type (query-biased vs. non-redundant) and the snippet length (two vs. four lines per result). We adopted a within-subjects experiment design and made each participant do four realistic search tasks using different versions of the application. During the study sessions, we collected search logs, "think-aloud" comments, and post-task surveys. Each session was finalized with an interview. We found that with non-redundant snippets the participants were able to complete the tasks faster and find more relevant results. Most participants preferred non-redundant snippets and wanted to see more information about each result on the SERP for any snippet type. Yet, the participants felt that the version with query-biased snippets was easier to use. We conclude with a set of practical design recommendations.

To help users cope with the scale and influx of new information, professional social networks (PSNs) provide a search functionality. However, most of the search engines within PSNs today only support keyword queries and basic faceted search capabilities, overlooking serendipitous network exploration and search for relationships between entities. This results in siloed information and a limited search space. My thesis is that we must redesign all major elements of a search user interface, namely the input, control, and informational elements, to enable more effective search interactions within PSNs. I will introduce new insights and algorithms supporting the thesis.

Domain expertise is regarded as one of the key factors impacting search success: experts are known to write more effective queries, to select the right results on the result page, and to find answers satisfying their information needs. Search transaction logs play a crucial role in result ranking. Yet despite the variety in users' expertise levels, all prior interactions are treated alike, suggesting that factoring in expertise could improve ranking for informational tasks. The main aim of this paper is to investigate the impact of high levels of technical domain expertise on both search behavior and task outcome. We conduct an online user study with searchers proficient in programming languages. We focus on Java and Javascript, yet we believe that our study and results are applicable to other expertise-sensitive search tasks. The main findings are three-fold: First, we constructed expertise tests that effectively measure technical domain expertise and correlate well with self-reported expertise. Second, we showed that there is a clear position bias, but technical domain experts were less affected by it. Third, we found that general expertise helped in finding the correct answers, but the domain experts were more successful as they managed to detect better answers. Our work uses explicit tests to determine user expertise levels, which is an important step toward fully automatic detection of expertise levels based on interaction behavior. A deeper understanding of the impact of expertise on search behavior and task outcome can enable more effective use of expert behavior in search logs, essentially making everyone search like an expert.

Sorting tuples by an attribute value is a common search scenario, and many search engines support such capabilities, e.g. price-based sorting in e-commerce or time-based sorting on a job or social media website. However, sorting purely by the attribute value might lead to poor user experience because relevance is not taken into account; hence, at the top of the list users might see irrelevant results. In this paper we take a different approach: rather than just returning the entire list of results sorted by the attribute value, we additionally perform relevance-aware (post-)filtering of the search results. Following this approach, we develop a new algorithm based on dynamic programming that directly optimizes a given search quality metric. It can be seamlessly integrated as the final step of a query processing pipeline and provides a theoretical guarantee of optimality. We conduct a comprehensive evaluation of our algorithm on synthetic data and real learning-to-rank datasets. Based on the experimental results, we conclude that the proposed algorithm is superior to typically used heuristics and has clear practical value for search and related applications.
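A simplified sketch of the kind of dynamic program described above, assuming per-result relevance gains and DCG@k as the quality metric: given results fixed in attribute order, it chooses which results to drop so that the metric over the remaining, still attribute-sorted list is maximised. The gains and k are made up for illustration.

```python
import math

def dcg_post_filter(gains, k):
    """Select which results to drop (keeping the rest in their attribute-sorted
    order) so that DCG@k of the filtered list is maximised.  gains[i] is the
    predicted relevance gain of result i; returns (best DCG@k, kept indices)."""
    n = len(gains)
    NEG = float("-inf")
    # dp[i][j]: best DCG@k over the first i results when j of them are kept (j <= k).
    dp = [[NEG] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(0, min(i, k) + 1):
            dp[i][j] = dp[i - 1][j]                      # drop result i-1
            if j > 0 and dp[i - 1][j - 1] != NEG:        # or keep it at position j
                cand = dp[i - 1][j - 1] + gains[i - 1] / math.log2(j + 1)
                dp[i][j] = max(dp[i][j], cand)
    best_j = max(range(k + 1), key=lambda j: dp[n][j])
    # Backtrack to recover which results occupy the top-k positions.
    kept, i, j = [], n, best_j
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j]:
            i -= 1                                       # result i-1 was dropped
        else:
            kept.append(i - 1); i -= 1; j -= 1           # result i-1 was kept
    return dp[n][best_j], sorted(kept)

# Toy example: results already sorted by the attribute (e.g. price).
gains = [0.1, 3.0, 0.1, 2.0, 0.0]
print(dcg_post_filter(gains, k=2))   # drops the low-relevance results before the good ones
```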

Popular online social networks (OSNs) generate hundreds of terabytes of new data per day and connect millions of users. To help users cope with the immense scale and influx of new information, OSNs provide a search functionality. However, most of the search engines in OSNs today only support keyword queries and provide basic faceted search capabilities, overlooking serendipitous network exploration and search for relationships between OSN entities. This results in siloed information and a limited search space. In 2013 Facebook introduced its innovative Graph Search product with the goal of taking the OSN search experience to the next level and facilitating exploration of the Facebook Graph beyond the first degree. In this paper we explore people search on Facebook by analyzing an anonymized social graph, anonymized user profiles, and large-scale anonymized query logs generated by users of Facebook Graph Search. We uncover numerous insights about people search across several demographics. We find that named entity and structured queries complement each other across one's duration on Facebook, that females search for people proportionately more than males, and that users submit more queries as they gain more friends. We introduce the concept of a lift predicate and highlight how graph distance varies with the search goal. Based on these insights, we present a set of design implications to guide the future research and development of OSN search.

The pervasive nature of the internet has caused a significant transformation in the field of genealogical research. This has impacted not only how research is conducted, but has also dramatically increased the number of people discovering their family history. Recent market research (Maritz Marketing 2000, Harris Interactive 2009) indicates that general interest in the United States has increased from 45% in 1996 to 60% in 2000 and 87% in 2009. This increased popularity has created a dramatic need for improvements in algorithms for extracting, accessing, and processing genealogical data used in building family trees. This paper presents one approach to algorithmic improvement in the family history domain, where we infer the familial relationships of households found in human-transcribed United States census data. By applying advances made in natural language processing, exploiting the sequential nature of the census, and using state-of-the-art machine learning algorithms, we were able to decrease the error by 35% over a hand-coded baseline system. The resulting system is immediately applicable to hundreds of millions of other genealogical records where families are represented but the familial relationships are missing.

Examples are very important in design, but existing tools for design example search still do not cover many cases. For instance, long tail queries containing subtle and subjective design concepts, like "calm and quiet", "elegant", "dark background with a hint of color to make it less boring", are poorly supported. This is mainly due to the inherent complexity of the task, which so far has been tackled only algorithmically using general image search techniques. We propose a powerful new approach based on crowdsourcing, which complements existing algorithmic approaches and addresses their shortcomings. Out of many explored crowdsourcing configurations we found that (1) a design need should be represented via several query images and (2) AMT crowd workers should assess a query-specific relevance of a candidate example from a pre-built design collection. To test the utility of our approach, we compared it with Google Images in a query-by-example mode. Based on feedback from expert designers, the crowd selects more relevant design examples.

Aiming to improve the user experience for a job search engine, in this paper we propose switching from the query-biased snippets used by most web search engines to rich structured snippets associated with the main sections of a job posting page, which are better suited to job search given users' specific needs and the structure of job pages. We present a very simple yet actionable approach to generating such snippets in an unsupervised way. The advantages of the proposed approach are two-fold: it does not require manual annotation and can therefore be easily deployed for many languages, a desirable property for a job search engine operating internationally; and it fits naturally with the trend towards the Mobile Web, where content needs to be optimized for informativeness on small-screen devices.

Local search engines allow users to search for entities such as businesses in a particular geographic location. To improve the geographic relevance of search, user feedback data such as logged click locations are traditionally used. In this paper, we use anonymized mobile call log data as an alternate source of data and investigate its relevance to local search. Such data consists of records of anonymized mobile calls made to local businesses along with the locations of celltowers that handled the calls. We model the probability of calls made to particular categories of businesses as a function of distance, using a generalized linear model framework. We provide a detailed comparison between a click log and a mobile call log, showing its relevance to local search. We describe our probabilistic models and apply them to anonymized mobile call logs for New York City and Los Angeles restaurants.
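The distance modelling can be sketched with a binomial GLM (logistic link) as below; the data here are synthetic stand-ins for an anonymized call log, and statsmodels is an assumed dependency rather than the toolkit used in the paper.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic stand-in for a mobile call log: distance (km) between the cell tower
# that handled a call and a business, and whether the called business belongs
# to a given category (e.g. restaurants).  The true relationship is logistic.
distance_km = rng.uniform(0, 30, size=5000)
p_true = 1 / (1 + np.exp(-(1.0 - 0.25 * distance_km)))
called_category = rng.binomial(1, p_true)

# Binomial GLM (logistic link): P(call to category) as a function of distance.
X = sm.add_constant(distance_km)
model = sm.GLM(called_category, X, family=sm.families.Binomial()).fit()
print(model.params)          # intercept and distance coefficient
print(model.predict(sm.add_constant(np.array([1.0, 5.0, 20.0]))))
```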

Local search is a specialization of web search that allows users to submit geographically constrained queries. However, one of the challenges for local search engines is to uniquely understand and locate the geographical intent of the query. Geographical constraints (or location references) in a local search are often incomplete and thereby suffer from the referent ambiguity problem, where the same location name can refer to several different possibilities. For instance, the term "Springfield" by itself can refer to 30 different cities in the USA. Previous approaches to location disambiguation have generally been hand-compiled heuristic models. In this paper, we examine a data-driven, machine learning approach to location disambiguation. Specifically, we train a Gradient Boosted Decision Tree (GBDT) model separately on thousands of desktop-based and mobile-based local searches and compare its performance to our previous heuristic-based location disambiguation system (HLDS). The GBDT-based approach shows promising results, with statistically significant improvements over the HLDS approach: the error rate reduction is about 9% and 22% for desktop-based and mobile-based local searches respectively. Additionally, we examine the relative influence of the various geographic and non-geographic features that help with the location disambiguation task. It is interesting to note that while the distance between the user and the intended location has been considered an important variable, its relative influence is secondary to the popularity of the location in the learnt GBDT models.
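A hedged sketch of the GBDT approach on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for the implementation in the paper; the features (distance, log population, explicit state mention) are illustrative, and the relative-influence printout mirrors the kind of analysis reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Synthetic candidate referents for an ambiguous place name ("Springfield"):
# one row per (query, candidate city) pair.  Features are illustrative only.
dist_to_user = rng.exponential(scale=300, size=n)      # km from user to candidate
log_population = rng.normal(loc=10, scale=2, size=n)   # log of candidate population
state_mentioned = rng.binomial(1, 0.2, size=n)         # query names the candidate's state

# Ground truth: popularity dominates, distance matters less, explicit state wins.
score = 0.8 * log_population - 0.002 * dist_to_user + 3.0 * state_mentioned
p = 1 / (1 + np.exp(-(score - np.median(score))))
is_intended = rng.binomial(1, p)

X = np.column_stack([dist_to_user, log_population, state_mentioned])
X_tr, X_te, y_tr, y_te = train_test_split(X, is_intended, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", round(gbdt.score(X_te, y_te), 3))
for name, imp in zip(["distance", "log_population", "state_mentioned"],
                     gbdt.feature_importances_):
    print(f"relative influence of {name}: {imp:.2f}")
```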

Parsing unstructured local web queries is often tackled using simple syntactic rules that tend to be limited and brittle. Here we present a data-driven approach to learning a query parser for local-search (geographical) queries. The learnt model uses class-level n-gram language model-based features; these n-gram language models, harvested from structured query logs, insulate the model from surface-level tokens. The proposed approach is compared with a finite state model. Experiments show significant improvements for parsing geographical web queries using these learnt models.
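The idea of class-level n-gram language-model features can be illustrated with a small sketch: two add-one-smoothed bigram models, one per field class, are "harvested" here from a handful of made-up structured log entries and then used to score candidate split points of an unstructured query. The real system uses such scores as features inside a learned parser; this standalone scorer is only a simplification.

```python
import math
from collections import Counter

def bigram_lm(phrases):
    """Very small add-one-smoothed bigram language model over a list of phrases."""
    unigrams, bigrams = Counter(), Counter()
    for phrase in phrases:
        tokens = ["<s>"] + phrase.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def logprob(phrase):
        tokens = ["<s>"] + phrase.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(tokens, tokens[1:]))
    return logprob

# Class-level n-gram LMs harvested from (here, made-up) structured query logs.
what_lm = bigram_lm(["pizza", "thai restaurant", "coffee shop", "car repair",
                     "pizza delivery", "sushi restaurant"])
where_lm = bigram_lm(["new york ny", "san francisco ca", "springfield il",
                      "new york", "los angeles ca", "springfield"])

def parse(query):
    """Split an unstructured local query into (what, where) at the best-scoring point."""
    tokens = query.split()
    splits = [(what_lm(" ".join(tokens[:i])) + where_lm(" ".join(tokens[i:])), i)
              for i in range(1, len(tokens))]
    _, best = max(splits)
    return " ".join(tokens[:best]), " ".join(tokens[best:])

print(parse("pizza delivery new york"))       # expected split: ('pizza delivery', 'new york')
print(parse("thai restaurant springfield il"))
```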

This paper discusses why local search relevance is important, how to crowdsource local search relevance tasks, and what factors influence the precision of these tasks.

The idea behind the semantic web is that documents will contain additional markup that makes explicit the information content of unstructured media. We present here the Document Souls system, which allows documents to become animate, actively searching the wider world for more information about their own contents and attaching relevant information to themselves as additional markup. A Document Soul is a set of information requests that are attached to a document as annotation. A demon within the system polls these requests and activates search agents that asynchronously respond to unsatisfied requests. To control search and relevance, collections of information requests are packaged as personalities which filter out unwanted information. In this paper, we present the structure of the Document Souls architecture and the function of personalities for performing contextualized search.

When people read or write documents, they spontaneously generate new information needs: for example, to understand the text they are reading; to find additional information related to the points they are making in their drafts. Simultaneously, each Information Object (IO) (i.e., word, entity, term, concept, phrase, proposition, sentence, paragraph, section, document, collection, etc.) someone reads or writes also creates context for the other IOs in the same discourse. We present a conceptual model of Agentized, Contextualized Filters (ACFs) - agents that identify an appropriate context for an information object and then actively fetch and filter relevant information concerning the information object in other information sources the user has access to. We illustrate the use of ACFs in a prototype knowledge management system called ViviDocs.

The Clairvoyance team participated in the Filtering Track, submitting two runs in the Batch Filtering category. While we have been exploring the question of both topic modeling and ensemble filter construction (as in our previous TREC filtering experiments [5]), we had one distinct objective this year, to explore the viability of monolithic filters in classification-like tasks. This is appropriate to our work, in part, because monolithic filters are a crucial starting point for ensemble filtering, and it is possible for them to contribute substantially in the ensemble approach. Our primary goal in experiments this year, thus, was to explore two issues in monolithic filter construction: (1) term count selection and (2) filter threshold optimization.

Current learning approaches to computer vision have mainly focussed on low-level image processing and object recognition, while tending to ignore high-level processing such as understanding. Here we propose an approach to object recognition that facilitates the transition from recognition to understanding. The proposed approach embraces the synergistic spirit of soft computing, exploiting the global search powers of genetic programming to determine fuzzy probabilistic models. It begins by segmenting the images into regions using standard image processing approaches, which are subsequently classified using a discovered fuzzy Cartesian granule feature classifier. Understanding is made possible through the transparent and succinct nature of the discovered models. The recognition of roads in images is taken as an illustrative problem in the vision domain. The discovered fuzzy models, while providing high levels of accuracy (97%), also provide understanding of the problem domain through the transparency of the learnt models. The learning step in the proposed approach is compared with other techniques such as decision trees, naive Bayes and neural networks using a variety of performance criteria such as accuracy, understandability and efficiency.

We are concerned with object recognition in the framework of perceptual organization. The approach presented incorporates a number of concepts from human visual analysis especially the Gestalt laws of organization. Fuzzy techniques are used for the definition and evaluation of the grouping/non-grouping properties as well as for the construction of structures from grouped input tokens. This method takes as input the initially fitted line segments (tokens) and then recursively groups these tokens into higher level structures (tokens) such as lines, u-structures, quadrilaterals, etc. The output high-level structures can then be used to compare with object models and thus lead to object recognition. In this paper inference (grouping) of line segments, line symmetry, junctions, closed regions and strands is presented. The approach is supported by experimental results on 2D images of an office scene environment.

Current learning approaches to computer vision have mainly focussed on low-level image processing and object recognition, while tending to ignore higher-level processing for understanding. We propose an approach to scene analysis that facilitates the transition from recognition to understanding. It begins by segmenting the image into regions using standard approaches, which are then classified using a discovered fuzzy Cartesian granule feature classifier. Understanding is made possible through the transparent and succinct nature of the discovered models. The recognition of roads in images is taken as an illustrative problem. The discovered fuzzy models, while providing high levels of accuracy (97%), also provide understanding of the problem domain through the transparency of the learnt models. The learning step in the proposed approach is compared with other techniques such as decision trees, naive Bayes and neural networks using a variety of performance criteria such as accuracy, understandability and efficiency.

We present a new approach to representing and acquiring controllers based upon Cartesian granule features - multidimensional features formed over the cross product of words drawn from the linguistic partitions of the constituent input features - incorporated into additive models. Controllers expressed in terms of Cartesian granule features enable the paradigm "controlling with words" by translating process data into words that are subsequently used to interrogate a rule base, which ultimately results in a control action. The system identification of good, parsimonious additive Cartesian granule feature models is an exponential search problem. In this paper we present the GDACG constructive induction algorithm as a means of automatically identifying additive Cartesian granule feature models from example data. GDACG combines the powerful optimisation capabilities of genetic programming with a novel and cheap fitness function, which relies on the semantic separation of concepts expressed in terms of Cartesian granule fuzzy sets, in identifying these additive models. We illustrate the approach on a variety of problems including the modelling of a dynamical process and a chemical plant controller.
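A small sketch of what a Cartesian granule feature looks like computationally, under assumed triangular fuzzy partitions and the product t-norm: each granule is a word from the cross product of the base-feature partitions, and a data point's membership in a granule combines its memberships in the constituent words. The feature names, labels, and ranges below are hypothetical, not taken from the paper.

```python
import itertools
import numpy as np

def triangular_partition(low, high, labels=("low", "medium", "high")):
    """Return {word: membership function} forming a fuzzy partition of [low, high]."""
    centres = np.linspace(low, high, len(labels))
    width = centres[1] - centres[0]
    def make_mu(c):
        return lambda x: max(0.0, 1.0 - abs(x - c) / width)
    return {word: make_mu(c) for word, c in zip(labels, centres)}

# Linguistic partitions of two (hypothetical) base features of a controller.
partitions = {
    "temperature": triangular_partition(0, 100),
    "pressure": triangular_partition(0, 10),
}

def cartesian_granule_memberships(values):
    """Membership of a data point in each Cartesian granule, i.e. each word in the
    cross product of the base-feature partitions (product used as the t-norm)."""
    per_feature = [
        [(word, mu(values[feat])) for word, mu in partitions[feat].items()]
        for feat in partitions
    ]
    granules = {}
    for combo in itertools.product(*per_feature):
        words = ".".join(word for word, _ in combo)
        granules[words] = float(np.prod([m for _, m in combo]))
    return granules

point = {"temperature": 72.0, "pressure": 3.5}
for granule, mu in sorted(cartesian_granule_memberships(point).items(),
                          key=lambda kv: -kv[1]):
    if mu > 0:
        print(f"{granule}: {mu:.3f}")
```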

Current approaches to knowledge discovery can be differentiated based on the discovered models using the following criteria: effectiveness, understandability (to a user or expert in the domain) and evolvability (the ability to adapt over time to a changing environment). Most current approaches satisfy understandability or effectiveness, but not simultaneously while tending to ignore knowledge evolution. We show how knowledge representation based upon Cartesian granule features and a corresponding induction algorithm can effectively address these knowledge discovery criteria (in this paper, the discussion is limited to understandability and effectiveness) across a wide variety of problem domains, including control, image understanding and medical diagnosis.

Variables defined over Cartesian granule feature universes can be viewed as multidimensional linguistic variables. These variable universes are formed over the cross product of words drawn from the fuzzy partitions of the constituent base features. Here we present a constructive induction algorithm, which identifies not only the Cartesian granule feature model but also the concepts/variables in which the model is expressed. The presented constructive induction algorithm combines the genetic programming search paradigm with a rather novel and cheap fitness function, which is based upon semantic discrimination analysis. Parsimony is promoted in this model discovery process, thereby leading to models with better generalisation power and transparency. The approach is demonstrated on an image understanding problem, an area that has traditionally been dominated by quantitative and black box modelling techniques. Overall, the discovered Cartesian granule feature models, when demonstrated on a large test set of outdoor images, provide highly accurate image interpretation using four input features, with over 78% of the image area labelled correctly.

This paper addresses the problem of structure inference in an image, in the framework of perceptual organization. It describes work in progress which builds on a subset of the authors' previous work on fuzzy perceptual grouping. More precisely, the authors are concerned with obtaining a fuzzy system which can achieve grouping of line segments. The data are obtained from images and consist of the results of edge extraction to which a line segment fitting algorithm has been applied. For each collection of similar and collinear segments used as input, a representative segment is used to summarize this collection. In the training stage, both the input collection and the output segment can either be indicated by a human user or obtained by overlaying the segment and real images.

Machine Learning

Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications), much easier to program in due to its rich APIs in Python, Java, Scala (and shortly R), and its core data abstraction, the distributed data frame, and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing. This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.
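For readers new to Spark, here is a minimal PySpark sketch of the two core abstractions mentioned above, a DataFrame aggregation and an RDD word count, assuming a local pyspark installation; the data are made up.

```python
from pyspark.sql import SparkSession, functions as F

# Local Spark session; in a cluster deployment the master URL would differ.
spark = SparkSession.builder.appName("spark-intro").master("local[*]").getOrCreate()

# DataFrame API: schema-aware and optimised, close to the "distributed data frame"
# abstraction mentioned above.
clicks = spark.createDataFrame(
    [("sports", 3), ("news", 7), ("sports", 5), ("finance", 2), ("news", 1)],
    ["section", "clicks"],
)
clicks.groupBy("section").agg(F.sum("clicks").alias("total_clicks")).show()

# RDD API: the classic word count, still useful for unstructured data.
lines = spark.sparkContext.parallelize(["spark extends mapreduce",
                                        "spark keeps mapreduce scalability"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

spark.stop()
```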

Search engines have become the de facto place to start information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has gained a lot of interest both from academia and industry. In this paper we present a systematic review of web spam detection techniques with a focus on algorithms and underlying principles. We categorize all existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. In turn, we subcategorize the link-based category into five groups based on the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied for web spam detection.

Over the last decade learning to rank (L2R) has gained a lot of attention and many algorithms have been proposed. One of the most successful approaches is to build an algorithm following the ensemble principle, of which boosting is the key representative. However, even boosting is not effective when used to increase the performance of individually strong algorithms, a scenario that arises when we want to blend already successful L2R algorithms in order to gain an additional benefit. To address this problem we propose a novel algorithm, based on the theory of nonlinear monotonic ensembles, which is able to blend strong base rankers effectively. Specifically, we introduce the concept of the defect of a set of algorithms, which allows us to derive the popular pairwise approach in strict mathematical terms. Using the concept of defect, we formulate an optimization problem and propose a sound method for its solution. Finally, we conduct experiments with real data which show the effectiveness of the proposed approach.

The Conference on Information and Knowledge Management (CIKM) provides an international forum for presentation and discussion of research on information and knowledge management, as well as recent advances on data and knowledge bases. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems, and to shape future directions of research by soliciting and reviewing high quality, applied and theoretical research findings. An important part of the conference is the Workshops program which focuses on timely research challenges and initiatives.

In several organizations, it has become increasingly popular to document and log the steps that make up a typical business process. In some situations, a normative workflow model of such processes is developed, and it becomes important to know whether such a model is actually being followed by analyzing the available activity logs. In other scenarios, no model is available and, with the purpose of evaluating cases or creating new production policies, one is interested in learning a workflow representation of such activities. In either case, machine learning tools that can mine workflow models are of great interest and still relatively unexplored. We present here a probabilistic workflow model and a corresponding learning algorithm that runs in polynomial time. We illustrate the algorithm on example data derived from a real-world workflow.

The Association for the Advancement of Artificial Intelligence, in cooperation with Stanford University's Department of Computer Science, presented the 2004 Spring Symposium Series, Monday through Wednesday, March 22-24, at Stanford University. The titles of the eight symposia were (1) Accessible Hands-on Artificial Intelligence and Robotics Education; (2) Architectures for Modeling Emotion: Cross-Disciplinary Foundations; (3) Bridging the Multiagent and Multirobotic Research Gap; (4) Exploring Attitude and Affect in Text: Theories and Applications; (5) Interaction between Humans and Autonomous Systems over Extended Operation; (6) Knowledge Representation and Ontologies for Autonomous Systems; (7) Language Learning: An Interdisciplinary Perspective; and (8) Semantic Web Services. Each symposium had limited attendance. Most symposia chairs elected to create AAAI technical reports of their symposium, which are available as paperbound reports or (for AAAI members) are downloadable on the AAAI members-only Web site. This report includes summaries of the eight symposia, written by the symposia chairs.

Cartesian granule features were originally introduced to address some of the shortcomings of existing forms of knowledge representation, such as decomposition error and transparency, and also to enable the paradigm of modelling with words through related learning algorithms. This chapter presents a detailed analysis of the impact of granularity on Cartesian granule feature models that are learned from example data in the context of classification problems. This analysis provides insights on how to effectively model problems with Cartesian granule features using various levels of granulation, granule characterizations, granule dimensionalities and granule generation techniques. Other modelling with words approaches such as the data browser [1, 2] and fuzzy probabilistic decision trees [3] are also examined and compared. In addition, this chapter provides a useful platform for understanding many other learning algorithms that may or may not explicitly manipulate fuzzy events. For example, it is shown how a naive Bayes classifier is equivalent to crisp Cartesian granule feature classifiers under certain conditions.

Text Mining

Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is confronted not only with an information flood but also with an increase in "information volatility", by which we understand the fact that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among others, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering, both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is based solely on the initial test sample and can be computed without resampling.
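A rough sketch of the estimation procedure, assuming scikit-learn and a synthetic multi-mode dataset: the test sample is partitioned by clustering, then repeatedly resampled with random subclass (cluster) mixes, and the resulting distribution of F1 scores yields an expected effectiveness and an empirical lower bound. This is a simplification of the paper's procedure, not a reproduction of it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a text classification task whose classes contain
# several latent subclasses (modes).
X, y = make_classification(n_samples=4000, n_features=30, n_informative=10,
                           n_clusters_per_class=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Partition the test sample into modes by clustering.
k = 6
modes = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_te)
by_mode = [np.where(modes == m)[0] for m in range(k)]

# Repeatedly resample the test set with different mode distributions and
# record the resulting effectiveness.
scores = []
for _ in range(200):
    weights = rng.dirichlet(np.ones(k))                 # a random subclass mix
    idx = np.concatenate([
        rng.choice(ix, size=max(1, int(w * len(X_te))), replace=True)
        for ix, w in zip(by_mode, weights)])
    scores.append(f1_score(y_te[idx], clf.predict(X_te[idx])))

scores = np.array(scores)
print(f"expected F1 under subclass shift: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"empirical 5% lower bound: {np.quantile(scores, 0.05):.3f}")
```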

Human Language Technology (HLT) and Natural Language Processing (NLP) systems have typically focused on the "factual" aspect of content analysis. Other aspects, including pragmatics, opinion, and style, have received much less attention. However, to achieve an adequate understanding of a text, these aspects cannot be ignored. The chapters in this book address the aspect of subjective opinion, which includes identifying different points of view, identifying different emotive dimensions, and classifying text by opinion. Various conceptual models and computational methods are presented. The models explored in this book include the following: distinguishing attitudes from simple factual assertions; distinguishing the author's reports from reports of other people's opinions; and distinguishing between explicitly and implicitly stated attitudes. In addition, many applications are described that promise to benefit from the ability to understand attitudes and affect, including indexing and retrieval of documents by opinion; automatic question answering about opinions; analysis of sentiment in the media and in discussion groups about consumer products, political issues, etc.; brand and reputation management; discovering and predicting consumer and voting trends; analyzing client discourse in therapy and counseling; determining relations between scientific texts by finding reasons for citations; generating more appropriate texts and making agents more believable; and creating writers' aids. The studies reported here are carried out on different languages such as English, French, Japanese, and Portuguese. Difficult challenges remain, however. It can be argued that analyzing attitude and affect in text is an "NLP"-complete problem. The interpretation of attitude and affect depends on audience, context, and world knowledge. In addition, there is much yet to learn about the psychological and biological relationships between emotion and language. To continue to progress in this area in NLP, more comprehensive theories of emotion, attitude and opinion are needed, as are lexicons of affective terms and knowledge of how such terms are used in context, and annotated corpora for training and evaluation. This book is a first foray into this area; it grew out of a symposium on this topic that took place at Stanford University in March 2004, with support from the American Association for Artificial Intelligence (AAAI). Several of the presentations were extended into the chapters that appear here. The chapters in this collection reflect the major themes of the symposium, corresponding to a balance among conceptual models, computational methods, and applications. The chapters in this book are organized along these themes into three broad, overlapping parts.

In addition to factual content, many texts contain an emotional dimension. This emotive, or affect, dimension has not received a great amount of attention in computational linguistics until recently. However, now that messages (including spam) have become more prevalent than edited texts (such as newswire), recognizing this emotive dimension of written text is becoming more important. One resource needed for identifying affect in text is a lexicon of words with emotion-conveying potential. Starting from an existing affect lexicon and lexical patterns that invoke affect, we gathered a large quantity of text to measure the coverage of our existing lexicon. This chapter reports on our methods for identifying new candidate affect words and on our evaluation of our current affect lexicons. We describe how our affect lexicon can be extended based on results from these experiments.

The first workshop on stylistic analysis of text for information access was held on the day following the 2005 SIGIR conference. This workshop addressed the automatic analysis and extraction of stylistic aspects of natural language texts. Style, roughly defined as the 'manner' in which something is expressed, as opposed to the 'content' of a message, is usually disregarded by information access applications as having no bearing on the target notion of relevance: systems have typically focused on the "factual" aspect of content analysis.

Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13-19, 2005, Salvador, Bahia, Brazil

Newspapers generally attempt to present the news objectively. But textual affect analysis shows that many words carry positive or negative emotional charge. In this article, we show that coupling niche browsing technology and affect analysis technology allows us to create a new application that measures the slant in opinion given to public figures in the popular press.

Modelling with Words is an emerging modelling methodology closely related to the paradigm of Computing with Words introduced by Lotfi Zadeh. This book is an authoritative collection of key contributions to the new concept of Modelling with Words. A wide range of issues in systems modelling and analysis is presented, extending from conceptual graphs and fuzzy quantifiers to humanist computing and self-organizing maps. Among the core issues investigated are: balancing predictive accuracy and high level transparency in learning, scaling linguistic algorithms to high-dimensional data problems, integrating linguistic expert knowledge with knowledge derived from data, identifying sound and useful inference rules, integrating fuzzy and probabilistic uncertainty in data modelling.

The Clairvoyance team participated in the High Accuracy Retrieval from Documents (HARD) Track of TREC 2004, submitting three runs. The principal hypothesis we have been pursuing is that small numbers of documents in clusters can provide a better basis for relevance feedback than ranked lists or, alternatively, than top-N pseudo-relevance feedback (PRF). Clustering of a query response can yield one or more groups of documents in which there are "significant" numbers (greater than 30%) of relevant documents; we expect the best results when such groups are selected for feedback. Following up on work we began in our TREC-2003 HARD-Track experiments [Shanahan et al. 2004], therefore, we continued to explore approaches to clustering query response sets to concentrate relevant documents, with the goal of (a) providing users (assessors) with better sets of documents to judge and (b) making the choices among sets easier to evaluate. Our experiments, thus, focused primarily on exploiting assessor feedback through clarification forms for query expansion and largely ignored other features of the documents or metadata.

Today, much product feedback is provided online by customers and critics through websites, discussion boards, mailing lists, and blogs. People trying to make strategic decisions (e.g., a product launch, a purchase) will find that a web search returns many useful but heterogeneous and, increasingly, multilingual opinions on a product. Generally, the user will find it very difficult and time consuming to assimilate all available information and make an informed decision. To date, most work on automating this process has focused on monolingual texts and users. This extended abstract describes our preliminary work on mining product ratings in a multilingual setting. The proposed approaches are automatic, using a combination of techniques from classification and translation, thereby alleviating the human-intensive construction and maintenance of linguistic resources.

Support vector machine (SVM) learning algorithms focus on finding the hyperplane that maximizes the margin (the distance from the separating hyperplane to the nearest examples) since this criterion provides a good upper bound of the generalization error. When applied to text classification, these learning algorithms lead to SVMs with excellent precision but poor recall. Various relaxation approaches have been proposed to counter this problem including: asymmetric SVM learning algorithms (soft SVMs with asymmetric misclassification costs); uneven margin based learning; and thresholding. A review of these approaches is presented here. In addition, in this paper, we describe a new threshold relaxation algorithm. This approach builds on previous thresholding work based upon the beta-gamma algorithm. The proposed thresholding strategy is parameter free, relying on a process of retrofitting and cross validation to set algorithm parameters empirically, whereas our previous approach required the specification of two parameters (beta and gamma). The proposed approach is more efficient, does not require the specification of any parameters, and similarly to the parameter-based approach, boosts the performance of baseline SVMs by at least 20% for standard information retrieval measures.
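The flavour of threshold relaxation can be illustrated with a scikit-learn sketch that is deliberately simpler than the beta-gamma-based algorithm described above: a relaxed threshold is chosen on out-of-fold decision scores by cross-validation and then retrofitted to an SVM trained on all the training data. The dataset and parameters are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC

# Imbalanced synthetic task: the default SVM threshold tends to favour precision.
X, y = make_classification(n_samples=4000, n_features=50, weights=[0.93, 0.07],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pick a relaxed threshold on out-of-fold decision scores (cross-validation),
# then retrofit it to a model trained on all of the training data.
oof_scores, oof_labels = [], []
for tr_ix, va_ix in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X_tr, y_tr):
    svm = LinearSVC(C=1.0).fit(X_tr[tr_ix], y_tr[tr_ix])
    oof_scores.append(svm.decision_function(X_tr[va_ix]))
    oof_labels.append(y_tr[va_ix])
oof_scores, oof_labels = np.concatenate(oof_scores), np.concatenate(oof_labels)

candidates = np.quantile(oof_scores, np.linspace(0.01, 0.99, 99))
threshold = max(candidates, key=lambda t: f1_score(oof_labels, oof_scores >= t))

final = LinearSVC(C=1.0).fit(X_tr, y_tr)
for name, preds in [("default threshold", final.predict(X_te)),
                    ("relaxed threshold", final.decision_function(X_te) >= threshold)]:
    print(f"{name}: P={precision_score(y_te, preds):.2f} "
          f"R={recall_score(y_te, preds):.2f} F1={f1_score(y_te, preds):.2f}")
```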

In general, support vector machines (SVMs), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. We describe an automatic process for adjusting the thresholds of generic SVMs which incorporates a user utility model, an integral part of an information management system. By using thresholds based on utility models and the ranking properties of classifiers, it is possible to overcome the precision bias of SVMs and ensure robust performance in recall across a wide variety of topics, even when training data are sparse. Evaluations on TREC data show that our proposed threshold adjusting algorithm boosts the performance of baseline SVMs by at least 20% for standard information retrieval measures.

The Clairvoyance team participated in the HARD Track, submitting fifteen runs. Our experiments focused primarily on exploiting user feedback through clarification forms for query expansion. We made limited use of the genre and related text metadata. Within the clarification form feedback framework, we explored the cluster hypothesis in the context of relevance feedback. The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [Van Rijsbergen, 1979]. With this in mind we investigated the impact on performance of exploiting user feedback on groups of documents (i.e., organizing the top retrieved documents for a query into intuitive groups through agglomerative clustering or document-centric clustering), as an alternative to a ranked list of titles. This forms the basis for a new blind feedback mechanism (used to expand queries) based upon clusters of documents, as an alternative to blind feedback based upon taking the top N ranked documents, an approach that is commonly used.

In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.

Traditionally, fuzzy set-based approaches have performed excellently in modeling small to medium scale problem domains. This paper examines the scalability of fuzzy systems to a large-scale problem domain that is inherently vague: text categorization. The paper presents two fuzzy probabilistic approaches to text classification and the corresponding machine learning algorithms to learn such systems from example data. The first approach follows the traditional fuzzy set paradigm, while the second approach fits within the modeling with words paradigm, using granule features to represent the text problem domain.

The Clairvoyance team participated in the Filtering Track, submitting the maximum number of runs in each of the filtering categories: Adaptive, Batch, and Routing. We had two distinct goals this year: (1) to establish the generalizability of our approach to adaptive filtering and (2) to experiment with relatively more "radical" approaches to batch filtering using ensembles of filters. Our routing runs served principally to establish an internal basis for comparisons in performance to adaptive and batch efforts and are not discussed in this report.

Patents

Systems and methods use machine learning techniques to resolve location ambiguity in search queries. In one aspect, a dataset generator generates a training dataset using query logs of a search engine. A training engine applies a machine learning technique to the training dataset to generate a location disambiguation model. A location disambiguation engine uses the location disambiguation model to resolve location ambiguity in subsequent search queries.

A system and method is provided which may comprise parsing an unstructured geographic web-search query into a field-based format by utilizing conditional random fields, learned by semi-supervised automated learning, to parse structured information from the unstructured geographic web-search query. The system and method may also comprise establishing semi-supervised conditional random fields utilizing one of a rule-based finite state machine model and a statistics-based conditional random field model. Systematic geographic parsing may be used with the one of the rule-based finite state machine model and the statistics-based conditional random field model. Parsing an unstructured local geographical web-based query in the local domain may be done by applying a learned model parser to the query, using at least one class-based query log from a form-based query system. The learned model parser may comprise at least one class-level n-gram language model-based feature harvested from a structured query log.

A system includes a meta-document, i.e., a document including content information which has a set of document service requests associated with it. A document service is a process which uses a portion of the document content as a starting point to obtain other information pertaining to that content. A scheduler selects a document service request from the set, then initiates and manages communication with a service provider to satisfy the selected document service. Any results received from the selected document service are integrated into the document.

A method, system and article of manufacture therefor, are disclosed for automatically generating a query from document content.

In a document enrichment system, device buttons can be programmed to associate a number of personalities with the buttons on the device. These associations may be stored locally on the device. When a button with a programmed personality is activated on the device, the document produced is sent to a document enrichment system. Once the document is enriched, the document enrichment system notifies the user of the availability of the document in its new form.

A method and processing system for generating a workflow graph from empirical data of a process are described. Data for multiple instances of a process are obtained, the data including information about task ordering. The processing system analyzes occurrences of tasks to identify order constraints. A set of nodes representing tasks is partitioned into a series of subsets, where no node of a given subset is constrained to precede any other node of the given subset unless said pair of nodes are conditionally independent given one or more nodes in an immediately preceding subset, and such that no node of a following subset is constrained to precede any node of the given subset. Nodes of each subset are connected to nodes of each adjacent subset with edges based upon the order constraints and based upon conditional independence tests applied to subsets of nodes, thereby providing a workflow graph.
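A much-simplified sketch of the underlying idea (not the patented partitioning procedure itself): derive order constraints from logged task sequences and keep only the direct edges, yielding a small workflow graph. The traces below are made up for illustration.

```python
# A few (made-up) traces of a business process, each a list of tasks in the
# order they were logged.
traces = [
    ["receive", "approve", "pack", "ship", "invoice"],
    ["receive", "approve", "invoice", "pack", "ship"],
    ["receive", "approve", "pack", "invoice", "ship"],
]

tasks = sorted({t for trace in traces for t in trace})

def always_precedes(a, b):
    """True if, in every trace containing both tasks, a occurs before b."""
    pairs = [(tr.index(a), tr.index(b)) for tr in traces if a in tr and b in tr]
    return bool(pairs) and all(i < j for i, j in pairs)

# Order constraints observed in the log.
constraints = {(a, b) for a in tasks for b in tasks
               if a != b and always_precedes(a, b)}

# Keep only the "direct" edges: drop a -> c whenever some b gives a -> b -> c.
edges = {(a, c) for (a, c) in constraints
         if not any((a, b) in constraints and (b, c) in constraints
                    for b in tasks if b not in (a, c))}

for a, b in sorted(edges):
    print(f"{a} -> {b}")
```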

A technique for representing an information need and employing one or more filters to select documents that satisfy the represented information need, including a technique of creating filters that involves (a) dividing a set of documents into one or more subsets such that each subset can be used as the source of features for creating a filtering profile or used to set or validate the score threshold for the profile and (b) determining whether multiple profiles are required and how to combine them to create an effective filter. Multiple profiles can be incorporated into an individual filter and the individual filters combined to create an ensemble filter. Ensemble filters can then be further combined to create meta filters.

An information need can be modeled by a binary classifier such as support vector machine (SVM). SVMs can exhibit very conservative precision oriented behavior when modeling information needs. This conservative behavior can be overcome by adjusting the position of the hyperplane, the geometric representation of a SVM. The present invention describes a couple of automatic techniques for adjusting the position of an SVM model based upon a beta-gamma thresholding procedure, cross fold validation and retrofitting. This adjustment technique can also be applied to other types of learning strategies.

A system provides a plurality of controls for enriching the content of a meta-document. A meta-document includes document content and personalities that describe enrichment themes. The system is adapted to automatically determine, with user-settable constraints, whether to propagate enrichment between documents using an interaction history. In addition, the system is adapted to automatically determine, with user-settable constraints, the form that markup is to take in the document content.

A text categorizer classifies a text object into one or more classes. The text categorizer includes a pre-processing module, a knowledge base, and an approximate reasoning module. The pre-processing module performs feature extraction, feature reduction, and fuzzy set generation to represent an unlabelled text object in terms of one or more fuzzy sets. The approximate reasoning module uses a measured degree of match between the one or more fuzzy sets and categories represented by fuzzy rules in the knowledge base to assign labels of those categories that satisfy a selected decision making rule.
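A toy sketch of the fuzzy matching step, with invented categories and memberships: documents are fuzzified into term-membership sets, each category in the knowledge base is a fuzzy set over indicative terms, and a label is assigned when the degree of match clears a threshold. The match measure and decision rule here are simple stand-ins for the approximate reasoning module described above.

```python
from collections import Counter

def fuzzify(text):
    """Represent a text as a fuzzy set of terms: membership = normalised term frequency."""
    counts = Counter(text.lower().split())
    peak = max(counts.values())
    return {term: c / peak for term, c in counts.items()}

# Toy knowledge base: each category is a fuzzy set over indicative terms.
knowledge_base = {
    "sports":  {"match": 1.0, "team": 0.9, "score": 0.8, "coach": 0.7},
    "finance": {"market": 1.0, "stock": 0.9, "bank": 0.8, "price": 0.6},
}

def degree_of_match(doc_fs, cat_fs):
    """Fuzzy match: sum of min-memberships, normalised by the category's mass."""
    overlap = sum(min(doc_fs.get(t, 0.0), mu) for t, mu in cat_fs.items())
    return overlap / sum(cat_fs.values())

def categorize(text, threshold=0.2):
    doc_fs = fuzzify(text)
    matches = {cat: degree_of_match(doc_fs, cat_fs)
               for cat, cat_fs in knowledge_base.items()}
    # Decision rule: keep every category whose match clears the threshold.
    return {cat: round(m, 3) for cat, m in matches.items() if m >= threshold}

print(categorize("the team won the match after the coach changed the score line"))
print(categorize("the stock market fell as the bank cut its price target"))
```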

An information space is created using a document. Entities from the document and its information space are used to create a database of entities. An auto-completion system uses contextual information surrounding a fragment from the document to formulate a query. The query is used to identify a set of entities in the database of entities that complete the fragment. An auto-correction system uses contextual information from identified errors in the document to formulate a query. The query is used to identify a set of entities in the database of entities that correct the error.

A system generates a query using an entity extractor, a categorizer, a query generator, and a short-run aspect vector generator. The entity extractor identifies a set of entities in selected document content for searching information related thereto using an information retrieval system. The categorizer defines an organized classification of document content, with each class in the organization of content having associated therewith a classification label that corresponds to a category of information in the information retrieval system. The categorizer assigns the selected document content a classification label from the organized classification of content. The query generator formulates a query that restricts a search at the information retrieval system to the category of information identified by the assigned classification label. The short-run aspect vector generator generates terms for further refining the query using context information surrounding the set of entities in the selected document content.

A system operates using meta-documents which include document content associated with one or more personalities. Each personality is associated with a set of document service requests. Users are provided different techniques for creating personalities and modifying existing personalities. These techniques include: the use of an algebra to tailor existing personalities, the use of a list of links or documents to create a personality, the use of predefined personalities and knowledge levels in a field to create new personalities, the use of question answering techniques, and the use of learning personalities. Specified personalities are then used to enrich document content by integrating into corresponding meta-documents the results received from their document service requests.