Publications

Welcome to the MIRROR Project website’s Publications section! Below you can find all the publications related to the H2020 project Migration-Related Risks Caused by Misconceptions of Opportunities and Requirement.

 

Automated Text Analysis for Intelligence Purposes: A Psychological Operations Case Study

With the abundance of data available through the Internet, the conditions for solving some intelligence analysis tasks have changed for the better. The study presented herein examines whether and how a data-driven approach can contribute to solving intelligence tasks. During a full-day observational study, an ordinary military intelligence unit was divided into two uniform teams. Each team was independently asked to solve the same realistic intelligence analysis task. Both teams were allowed to use their ordinary set of tools, but one team was also given access to a novel text analysis prototype tool specifically designed to support data-driven intelligence analysis of social media data. The results, obtained from a case study with high ecological validity, suggest that the prototype tool provided valuable insights by bringing forth information from a more diverse set of sources, specifically from private citizens, that would not have been easily discovered otherwise. Also, regardless of its objective contribution, the capabilities and the usage of the tool were embraced and subjectively perceived as useful by all involved analysts.

 

 

Towards an Aspect-based Ranking Model for Clinical Trial Search

Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments. Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials. However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects: Relevancy, Adversity, Recency, and Popularity. We aim to develop a clinical trial search system that covers multiple disease classes, instead of focusing only on the retrieval of oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries across five disease categories. Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a high negative Spearman’s rank correlation coefficient between popularity and recency.
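As an illustration of the two ideas named in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes each query and trial has already been mapped to a set of concept identifiers, uses Jaccard similarity as one possible overlap measure (the abstract does not say which measure is used), and computes Spearman's rank correlation for two tie-free rankings. The function names and concept labels are hypothetical placeholders, not real UMLS identifiers.

```python
from typing import Dict, List, Set


def rank_by_concept_overlap(query: Set[str], trials: Dict[str, Set[str]]) -> List[str]:
    """Rank trial IDs by Jaccard overlap between query and trial concept sets."""
    def jaccard(a: Set[str], b: Set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Highest overlap first; ties are broken alphabetically by trial ID.
    return sorted(trials, key=lambda t: (-jaccard(query, trials[t]), t))


def spearman_rho(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    n = len(ranking_a)
    if n < 2 or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items (at least two)")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    squared_diffs = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


# Hypothetical concept sets (placeholders, not real UMLS concept IDs).
query = {"concept_a", "concept_b"}
trials = {
    "T1": {"concept_a", "concept_b", "concept_c"},
    "T2": {"concept_a"},
    "T3": {"concept_z"},
}
relevancy_ranking = rank_by_concept_overlap(query, trials)
```

A coefficient close to -1 between two aspect rankings, as reported for popularity and recency, means that trials ranked high on one aspect tend to be ranked low on the other.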

 

Summarizing Situational Tweets in Crisis Scenarios: An Extractive-Abstractive Approach

Microblogging platforms such as Twitter are widely used by eyewitnesses and affected people to post situational updates during mass convergence events such as natural and man-made disasters. These crisis-related messages are spread across multiple classes/categories such as infrastructure damage, shelter needs, and information about missing, injured, and dead people. In addition, we observe that people sometimes post information about their missing relatives and friends with details such as name and last known location. This kind of information is time-critical, and its pace and quantity do not match those of other kinds of generic situational updates. Also, the requirements of different stakeholders (government, NGOs, rescue workers, etc.) vary considerably. This brings a two-fold challenge: (i) extracting important high-level situational updates from these messages, assigning them to appropriate categories, and finally summarizing the large trove of information in each category, and (ii) extracting small-scale, time-critical, sparse updates related to missing or trapped persons. In this paper, we propose a classification-summarization framework that first assigns tweets to different situational classes and then summarizes those tweets. In the summarization phase, we propose a two-stage extractive-abstractive summarization framework. In the first step, it extracts a set of important tweets from the whole set of information, develops a bigram-based word graph from those tweets, and generates paths by traversing the word graph. Next, it uses an Integer Linear Programming (ILP) based optimization technique to select the most important tweets and paths based on different optimization parameters such as informativeness and coverage of content words. Apart from general class-wise summarization, we also show how our summarization model can be customized to address time-critical sparse information needs (e.g., missing relatives). Our proposed method is time and memory efficient and shows better performance than state-of-the-art methods in terms of both quantitative and qualitative judgement.
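The word-graph step described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation: it assumes that nodes are word bigrams, that edges link bigrams that occur consecutively in some tweet, and that candidate sentences are paths from a start bigram to an end bigram. The names `build_bigram_graph`, `generate_paths`, and `max_bigrams` are hypothetical, and the ILP selection step is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Bigram = Tuple[str, str]
START, END = "<s>", "</s>"


def build_bigram_graph(tweets: List[str]) -> Dict[Bigram, Set[Bigram]]:
    """Nodes are bigrams; an edge links bigrams that are adjacent in a tweet."""
    graph: Dict[Bigram, Set[Bigram]] = defaultdict(set)
    for tweet in tweets:
        tokens = [START] + tweet.lower().split() + [END]
        bigrams = list(zip(tokens, tokens[1:]))
        for left, right in zip(bigrams, bigrams[1:]):
            graph[left].add(right)
    return graph


def generate_paths(graph: Dict[Bigram, Set[Bigram]], max_bigrams: int = 12) -> List[str]:
    """Enumerate sentences along paths from a start bigram to an end bigram."""
    sentences: Set[str] = set()

    def walk(path: List[Bigram]) -> None:
        node = path[-1]
        if node[1] == END:
            # Each bigram contributes its second word; the last one is the end marker.
            sentences.add(" ".join(b[1] for b in path[:-1]))
            return
        if len(path) >= max_bigrams:
            return  # cap the path length so traversal always terminates quickly
        for nxt in graph.get(node, ()):
            if nxt not in path:  # never revisit a bigram within one path
                walk(path + [nxt])

    for node in list(graph):
        if node[0] == START:
            walk([node])
    return sorted(sentences)


tweets = ["bridge collapsed near station", "road collapsed near market"]
paths = generate_paths(build_bigram_graph(tweets))
```

Because the two example tweets share the bigram "collapsed near", the traversal yields the two original sentences plus two recombined ones, which is what makes this stage abstractive rather than purely extractive.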

 

Going Beyond Content Richness: Verified Information Aware Summarization of Crisis-Related Microblogs

High-impact catastrophic events (bomb attacks, shootings) trigger the posting of a large volume of information on social media platforms such as Twitter. Recent works have proposed content-aware systems for summarizing this information, thereby facilitating post-disaster services. However, a significant proportion of the posted content is unverified, which restricts the practical usage of the existing summarization systems. In this paper, we work on the novel task of generating verified summaries of information posted on Twitter during disasters. We first jointly learn representations of content classes and expression classes of tweets posted during disasters using a novel LDA-based generative model. These representations of content and expression classes are used in conjunction with pre-disaster user behavior and temporal signals (replies) for training a Tree-LSTM based tweet-verification model. The model infers tweet verification probabilities, which are used, alongside the information content of tweets, in an Integer Linear Programming (ILP) framework for generating the desired verified summaries. The summaries are fine-tuned using the class information of the tweets as obtained from the LDA-based generative model. Extensive experiments on a publicly available labeled dataset of man-made disasters demonstrate the effectiveness of our tweet-verification (3-13% gain over baselines) and summarization (12-48% gain in verified content proportion, 8-13% gain in ROUGE score over the state of the art) systems.
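To make the selection step more concrete, here is a small sketch of how verification probabilities and information content might be combined under a length budget. It is an assumption-laden illustration, not the paper's formulation: the objective (information score times verification probability), the word budget, and all names and numbers are hypothetical. Exhaustive search stands in for an ILP solver, which is only feasible for a handful of tweets.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Tweet(NamedTuple):
    text: str
    info_score: float   # hypothetical information-content score
    p_verified: float   # hypothetical verification probability from a classifier


def select_summary(tweets: List[Tweet], word_budget: int) -> List[Tweet]:
    """Pick the subset maximizing sum(info_score * p_verified) within the budget.

    This is a 0/1 knapsack problem, i.e. a small ILP instance; brute force
    replaces the solver here, so use it only for very small inputs.
    """
    best_value = 0.0
    best_subset: Tuple[int, ...] = ()
    for size in range(1, len(tweets) + 1):
        for subset in combinations(range(len(tweets)), size):
            length = sum(len(tweets[i].text.split()) for i in subset)
            if length > word_budget:
                continue
            value = sum(tweets[i].info_score * tweets[i].p_verified for i in subset)
            if value > best_value:
                best_value, best_subset = value, subset
    return [tweets[i] for i in best_subset]


candidates = [
    Tweet("explosion reported near central station", 0.9, 0.9),
    Tweet("unconfirmed reports of second blast downtown", 0.8, 0.2),
    Tweet("police confirm road closures", 0.5, 0.8),
]
summary = select_summary(candidates, word_budget=9)
```

In this toy input the informative but unlikely-to-be-verified second tweet is left out, because weighting by verification probability lowers its contribution to the objective.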

 

Identifying Deceptive Reviews: Feature Exploration, Model Transferability and Classification Attack

The temptation to influence and sway public opinion most certainly increases with the growth of open online forums where anyone can anonymously express their views and opinions. Since online review sites are a popular venue for opinion-influencing attacks, there is a need to automatically identify deceptive posts. The main focus of this work is on the automatic identification of deceptive reviews, both positively and negatively biased. With this objective, we build an SVM-based classification model for deceptive reviews and explore the performance impact of using different feature types (TF-IDF, word2vec, PCFG). Moreover, we study the transferability of trained classification models applied to review data sets of other types of products, as well as the classifier robustness, i.e., the accuracy impact, against attacks by stylometry obfuscation through machine translation. Our findings show that i) we achieve an accuracy of over 90% using different feature types, ii) the trained classification models do not perform well when applied to other data sets containing reviews of different products, and iii) machine translation only slightly impacts the results and cannot be used as a viable attack method.
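Of the feature types listed in this abstract, TF-IDF is the simplest to illustrate. The sketch below computes TF-IDF weights with the standard library only, assuming whitespace tokenization, relative term frequency, and an unsmoothed logarithmic inverse document frequency; the abstract does not state which variant was used. The function name and the example reviews are hypothetical, and the SVM classifier itself is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy reviews, for illustration only.
reviews = ["great hotel great staff", "terrible hotel"]
vectors = tfidf_vectors(reviews)
```

A term that appears in every review, such as "hotel" here, receives a weight of zero, so the features emphasize words that distinguish one review from another.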

 

Extracting Account Attributes for Analyzing Influence on Twitter

Recent years have witnessed a surge of autogenerated content on social media. While many uses are legitimate, bots have also been deployed in influence operations to manipulate election results, affect public opinion in a desired direction, or divert attention from a specific event or phenomenon. Today, many approaches exist to automatically identify bot-like behaviour in order to curb illegitimate influence operations. While progress has been made, existing models are exceedingly complex and nontransparent, rendering validation and model testing difficult. We present a transparent and parsimonious method to study influence operations on Twitter. We define nine different attributes that can be used to describe and reason about different characteristics of a Twitter account. The attributes can be used to group accounts that have similar characteristics, and the result can be used to identify accounts that are likely to be used to influence public opinion. The method has been tested on a Twitter data set consisting of 66,000 accounts. Clustering the accounts based on the proposed features shows promising results for separating different groups of reference accounts.

 

Veracity assessment of online data

Fake news, malicious rumors, fabricated reviews, and generated images and videos are today spread at an unprecedented rate, making manual assessment of data veracity for decision-making purposes a daunting task. Hence, it is urgent to explore possibilities for automatic veracity assessment. In this work, we review the literature in search of methods and techniques representing the state of the art in computerized veracity assessment. We study what others have done within the area of veracity assessment, especially work targeting social media and open source data, to understand research trends and determine needs for future research.

The most common veracity assessment method among the studied set of papers is text analysis using supervised learning. Machine learning methods have developed considerably in the last couple of years thanks to advances in deep learning. However, very few papers make use of these advances. Also, the papers in general tend to have a narrow scope, as they focus on solving a small task with only one type of data from one main source. The overall veracity assessment problem is complex, requiring a combination of data sources, data types, indicators, and methods. Only a few papers take on such a broad scope, which demonstrates the relative immaturity of the veracity assessment domain.