Sifting Multi-task Learning for Fake News Detection [3]
Method of fake news detection
Abstract
Recently, neural networks based on multi-task learning have achieved promising performance on fake news detection; they focus on learning features shared among tasks as complementary features that serve the different tasks. However, in most existing approaches, the shared features are assigned to the different tasks in full, without selection, which may integrate useless or even harmful features into specific tasks. In this paper, we design a sifting multi-task learning method with a selected sharing layer for fake news detection. The selected sharing layer adopts a gate mechanism and an attention mechanism to filter and select, respectively, the feature flows between the two tasks. Experiments on two public, widely used competition datasets, i.e., RumourEval and PHEME, demonstrate that our proposed method outperforms state-of-the-art detection models.
1 Introduction
In recent years, the proliferation of fake news, with its varied content, rapid spread, and extensive influence, has become an increasingly daunting problem for society. It has seriously disturbed individuals' everyday lives, the healthy development of the economy, and national cyber security. A concrete instance was cited by Time Magazine in 2013, when a false announcement of Barack Obama's injury in a White House explosion "wiped off 130 Billion US Dollars in stock value in a matter of seconds". In addition, an analysis of the 2016 US Presidential Election by Allcott and Gentzkow [1] revealed that fake news was widely shared during the three months prior to the election, with 30 million total Facebook shares of 115 known pro-Trump fake stories and 7.6 million shares of 41 known pro-Clinton fake stories. Automatic detection of fake news has therefore attracted significant research attention in both industry and academia.
Most existing methods develop deep neural networks to capture credibility features from different perspectives for fake news detection. Some methods provide in-depth analysis of text features, e.g., linguistic [20], semantic [23], emotional [22], and stylistic [28] features. On this basis, some work additionally extracts social context features (a.k.a. meta-data features) as credibility features, including source-based [21], user-centered [24, 26], post-based [13, 25], and network-based [27] features. These methods have been modestly successful. Additionally, recent research [16, 17, 18] finds that doubtful and opposing voices against fake news are always triggered as it propagates. Fake news tends to provoke far more controversy than real news [2] [3]. Therefore, stance analysis of these controversies can serve as a valuable credibility feature for fake news detection.
There are two effective ways to improve fake news detection with stance analysis. One is to capture stance information from the responses to and forwarding of posts as auxiliary features for fake news detection [7, 13, 14]. An obvious drawback is that a high proportion of posts lack stance information in their responses and forwards, which makes the stance features sparse. The other, better route is to build a multi-task learning model that jointly trains stance analysis and fake news detection so that each boosts the other's performance [11, 12]. These approaches model information sharing and representation reinforcement between the two tasks, which not only avoids the sparseness of stance information but also expands the credibility features available to each task. However, these methods, and even typical multi-task learning methods such as the shared-private model, share a prominent drawback: the same shared features obtained from the shared layer are sent to every task equally, without filtering, as shown in Figure 1(a). As a result, the network can be confused by useless or even harmful features, which interferes with effective sharing.
To address this problem, we design a sifting multi-task learning model with a selected sharing layer (Figure 1(b)) for the two tasks of fake news detection and stance detection. Specifically, the selected sharing layer consists of a gated sharing cell and an attention sharing cell, which filter the outputs of the shared layer to select features that are conducive to each task. The gated sharing cell passively filters out useless features through training, while the attention sharing cell actively focuses on task-relevant features among the shared features. In addition, to better capture long-range dependencies and improve the parallelism of the model, we use the transformer encoder module [15] to encode the input representations of both tasks. Experimental results show that the proposed model outperforms other state-of-the-art methods and sets new benchmarks.
In summary, the contributions of this paper are as follows:
1) We explore a selected sharing layer that relies on a gate mechanism and an attention mechanism to selectively capture valuable shared features between the tasks of fake news detection and stance detection for each task.
2) We introduce the transformer encoder into our model to encode the inputs of both tasks, without recurrent or convolutional layers. Our method benefits from the transformer's ability to capture long-range dependencies and from its parallelism. To the best of our knowledge, this is the first work to apply the transformer encoder to the task of fake news detection.
3) We conduct experiments on two public, widely used fake news competition datasets, and the experimental results demonstrate that our proposed model significantly and consistently outperforms previous state-of-the-art methods. We release the source code publicly for further research.
2 Related Work
2.1 Fake News Detection
The task of fake news detection is usually regarded as a text classification problem. Its recent development can be roughly divided into two stages. The first stage extracts or constructs comprehensive and complex features manually. The first systematic work at this stage is by Castillo et al. [29], who extract 68 linguistic features from Twitter posts to analyze the authenticity of news and achieve 86% precision on binary classification of true and fake news. Subsequently, a series of follow-up studies explore valuable features of posts from multiple perspectives, such as source-based [30], content-based [32], user-based [24], and network-based [27] features, to improve the performance of fake news detection.
Instead of obtaining features through labor-intensive manual design, the second stage automatically captures deep features based on neural networks and natural language processing. There are two routes at this stage. One route captures linguistic features from text content, such as semantics [25, 50, 51], writing style [28], and textual entailment [53]. These methods outperform the manual methods of the first stage. The other route focuses on obtaining effective features from the integration of text and user interaction [35, 36, 37]. Typically, Long et al. [31] exploit speaker profiles in detail and incorporate them into text representations through an attention-based LSTM model for rumor detection. Ruchansky et al. [27] develop the CSI model to integrate user behavior, user profiles, and article text into a multi-RNN model for fake news detection. These methods achieve notable improvements over text-only models. In this work, following the second route, we automatically learn representations of text and of user interactions (stance information) from responses and forwards, based on multi-task learning, for fake news detection.
2.2 Stance Detection
Stance detection is the task of automatically determining from text whether its author is in favor of, against, or neutral towards a proposition or target. This task has gained increasing popularity in different research areas [54, 55]. In particular, the research in [38, 39] demonstrates that the stance detected from fake news can serve as an effective credibility indicator to improve the performance of fake news detection. There are two categories of stance detection for rumors. The first category extracts abundant shallow indicative features based on common statistical strategies or machine learning approaches [41, 56]. For example, Qazvinian et al. [40] extract three types of features (content-based, network-based, and microblog-specific memes) and then adopt a Bayesian classifier for stance detection. Subsequently, Hamidian and Diab [42] add time-related features and perform rumor stance classification with a J48 classifier, bringing noticeable performance improvements. The authors also report that, among the many features, the content-based ones performed best. Consequently, the second category captures deep semantics from text content based on neural networks [57, 58]. Kochkina et al. [43] propose a branch-nested LSTM model that treats single-tweet stance detection as a sequential classification problem, in which the stance of each tweet takes into account the features and labels of the previous tweets; this model achieved the best performance in the RumourEval shared task at SemEval 2017. More recently, Pamungkas et al. [37] employ the BERT architecture, combined with token, position, and segment embeddings, to build an end-to-end system that reaches competitive performance (an F1 score of 61.67%). In this work, we use the transformer encoder to acquire semantics from the responses to and forwarding of fake news for stance detection.
2.3 Multi-task Learning
Multi-task learning refers to jointly training a target task and related tasks to boost the performance of each task [59]. It comes in multiple types [44, 45], such as hard-shared [60], soft-shared [61], shared-private [46], and cascade-shared [62]; the shared-private model has been extensively applied because its independent shared layer can specifically capture features common to the tasks. Subsequently, a collection of improved models [47, 48] has been developed based on this model. As a successful example, Liu et al. [46] explore an adversarial shared-private model to keep the shared and private latent feature spaces from interfering with each other. However, these models still share a common drawback: they transmit all shared features in the shared layer to the related tasks without distillation, so some useless or even harmful shared features may disturb specific tasks. How to filter the shared features for different tasks is the main challenge of this work.
3 Model
In this paper, we present a sifting multi-task learning method, built on the shared-private model, that jointly trains the tasks of stance detection and fake news detection and filters the original outputs of the shared layer through a selected sharing layer. The selected sharing layer aims to filter out useless shared features and capture appropriate shared features for each task, instead of transmitting all shared features to every task. An overview of the architecture of our model is illustrated in Figure 2. Besides the selected sharing layer, the model consists of the following major components: the input embeddings, the shared-private feature extractor, and the output layer. Next, we describe each part of the proposed model in detail.
3.1 Input Embeddings
In our notation, a sentence of $l$ tokens is denoted $X=\{x_1,x_2,\dots,x_l\}$. Each token is represented by the concatenation of its word embedding and its position embedding. The word embedding $w_i$ of token $x_i$ is a $d_w$-dimensional vector obtained from a pre-trained word2vec model [66], i.e., $w_i\in\mathbb{R}^{d_w}$. Position embeddings are vector representations of the positions of words in a sentence. We employ one-hot encoding to represent the position embedding $p_i$ of token $x_i$, where $p_i\in\mathbb{R}^{d_p}$ and $d_p$ is the positional embedding dimension. Therefore, the embeddings of a sentence are represented as $E=\{[w_1;p_1],[w_2;p_2],\dots,[w_l;p_l]\}$, $E\in\mathbb{R}^{l\times(d_p+d_w)}$. We deliberately adopt one-hot encoding to embed token positions rather than the sinusoidal position encoding of the original transformer [15]. The reason is that, in our experiments, sinusoidal position encoding not only increases the complexity of the model but also performs worse than one-hot encoding on relatively small datasets.
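The construction of $E$ can be illustrated with a minimal sketch in plain Python. The lookup table `toy_vectors`, the helper names, and the dimensions ($d_w=3$, $d_p=4$) are hypothetical stand-ins chosen for illustration; in the setting described above, the word vectors would come from a pre-trained word2vec model, and the sketch assumes $l \le d_p$ so that every position has its own one-hot slot.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError("position index out of range for d_p")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def embed_sentence(tokens, word_vectors, d_p):
    """Build E = {[w_1; p_1], ..., [w_l; p_l]} for one sentence.

    tokens: list of l strings (assumes l <= d_p).
    word_vectors: dict mapping token -> list of d_w floats
                  (hypothetical stand-in for a word2vec lookup).
    d_p: dimension of the one-hot position embedding.
    """
    # List concatenation plays the role of [w_i; p_i].
    return [word_vectors[tok] + one_hot(i, d_p) for i, tok in enumerate(tokens)]


# Toy lookup table with d_w = 3 (illustrative values only).
toy_vectors = {
    "breaking": [0.1, 0.2, 0.3],
    "fake": [0.7, 0.8, 0.9],
    "news": [0.4, 0.5, 0.6],
}

E = embed_sentence(["breaking", "fake", "news"], toy_vectors, d_p=4)
print(len(E), len(E[0]))  # l rows, each of width d_w + d_p
```

Each row of `E` has width $d_w+d_p$, matching $E\in\mathbb{R}^{l\times(d_p+d_w)}$.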
In addition, fake news on social media is typically short and therefore lacks contextual information. To compensate, in the task of fake news detection we concatenate a piece of news with its related response text to form a longer, context-rich sentence.
3.2 Shared-private Feature Extractor
The state-of-the-art performance of the BERT model [63] on 11 NLP tasks stems not only from training on a large amount of data but also from its underlying architecture, the transformer. In this paper, we apply the encoder module of the transformer (henceforth, transformer encoder) to the shared-private extractor of our model.
Specifically, we employ two transformer encoders to encode the input embeddings of the two tasks as their respective private features. A third transformer encoder encodes the input embeddings of both tasks to produce the features shared by the two tasks. This process is illustrated by the shared-private layer in Figure 2. The red box in the middle denotes the extraction of shared features, and the left and right boxes represent the extraction of the private features of the two tasks. Next, we take the extraction of the private features for fake news detection as an example to describe the transformer encoder in detail.
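Before turning to the encoder internals, the data flow of the shared-private layer can be sketched as follows. This is only an illustration of the wiring: `make_toy_encoder` is a hypothetical placeholder that scales its input and stands in for a real transformer encoder, and the dictionary layout of the result is an assumption made for readability.

```python
def make_toy_encoder(scale):
    """Hypothetical stand-in for a transformer encoder.

    It only multiplies every feature by `scale`, so that the three
    encoders below are distinguishable in the output.
    """
    def encode(embeddings):
        return [[scale * value for value in row] for row in embeddings]
    return encode


# Two private encoders (one per task) and one encoder shared by both tasks.
private_stance = make_toy_encoder(2.0)
private_fake_news = make_toy_encoder(3.0)
shared = make_toy_encoder(0.5)


def extract_features(stance_embeddings, fake_news_embeddings):
    """Return private and shared features for each task."""
    return {
        "stance": {
            "private": private_stance(stance_embeddings),
            "shared": shared(stance_embeddings),
        },
        "fake_news": {
            "private": private_fake_news(fake_news_embeddings),
            "shared": shared(fake_news_embeddings),
        },
    }


features = extract_features([[1.0, 2.0]], [[4.0, 6.0]])
print(features["stance"])
print(features["fake_news"])
```

The same `shared` encoder processes the inputs of both tasks, while each private encoder sees only its own task's inputs; the shared outputs are what a subsequent selection step would filter per task.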
The core of the transformer encoder is scaled dot-product attention, a special case of the attention mechanism. It is defined as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
where $Q\in\mathbb{R}^{l\times(d_p+d_w)}$, $K\in\mathbb{R}^{l\times(d_p+d_w)}$, and $V\in\mathbb{R}^{l\times(d_p+d_w)}$ are the query matrix, key matrix, and value matrix, respectively. In our setting, the queries, keys, and values all stem from the input itself, i.e., $Q=K=V=E$.
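Eq. (1) can be illustrated with a short, dependency-free sketch. The helper names (`matmul`, `transpose`, `softmax`) and the toy input are assumptions made for illustration, and the scaling factor $d_k$ is taken here to be the width of the key vectors; this is not the released implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, with d_k = width of K."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


# Self-attention as in the text: Q = K = V = E (toy 3 x 2 input).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(E, E, E)
print(len(out), len(out[0]))  # same shape as E

# Sanity check: identical keys give uniform weights, so the output
# is the average of the value rows.
uniform = scaled_dot_product_attention(
    [[1.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]
)
```

Because each row of attention weights sums to one, every output row is a convex combination of the rows of $V$.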
To exploit the high parallelizability of attention, the transformer encoder builds a multi-head attention mechanism on top of scaled dot-product attention. More concretely, multi-head attention first linearly projects the queries, keys, and values $h$ times using different linear projections. The $h$ projections then perform scaled dot-product attention in parallel. Finally, the attention results are concatenated and projected once more to obtain the new representation. Formally, multi-head attention can be formulated as follows:
$$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V}) \tag{2}$$
$$H=\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\dots,\mathrm{head}_h)W^{o} \tag{3}$$
where $W_i^{Q}\in\mathbb{R}^{(d_p+d_w)\times d_k}$, $W_i^{K}\in\mathbb{R}^{(d_p+d_w)\times d_k}$, and $W_i^{V}\in\mathbb{R}^{(d_p+d_w)\times d_k}$ are trainable projection parameters, $d_k=(d_p+d_w)/h$, $h$ is the number of heads, and $W^{o}\in\mathbb{R}^{(d_p+d_w)\times(d_p+d_w)}$ is also a trainable parameter.
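Eqs. (2) and (3) can be sketched in the same dependency-free style. The random initialisation, the toy sizes ($l=3$, $d_p+d_w=4$, $h=2$), and all helper names are assumptions for illustration only; in the model these matrices are trained, and the heads are evaluated sequentially here rather than in parallel.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (1)."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for a trained parameter."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(E, heads, W_o):
    """Eqs. (2)-(3) with Q = K = V = E.

    heads: list of h tuples (W_q, W_k, W_v), each of shape d_model x d_k.
    W_o: output projection of shape d_model x d_model.
    """
    head_outputs = [
        attention(matmul(E, W_q), matmul(E, W_k), matmul(E, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the h head outputs along the feature axis: l x (h * d_k).
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(E))]
    return matmul(concat, W_o)


rng = random.Random(0)
l, d_model, h = 3, 4, 2      # d_model plays the role of d_p + d_w
d_k = d_model // h           # d_k = (d_p + d_w) / h
E = random_matrix(l, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(d_model, d_model, rng)

H = multi_head_attention(E, heads, W_o)
print(len(H), len(H[0]))  # l rows of width d_p + d_w
```

Since $h\cdot d_k=d_p+d_w$, the concatenated heads have the same width as the input, so $H$ keeps the shape $l\times(d_p+d_w)$.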