Revision as of 20:59, 4 April 2019

https://docs.google.com/document/d/1UKlEmrAboGozcNB_fTC8bhPYJ8BlgVWbMYxlLuoQnnk/edit

Project proposal - Establishing the authenticity of sports news by Machine Learning Models

Introduction

The idea

In recent years, with the continuous growth of Web 2.0, it has become very easy for almost anyone to publish any kind of information on the Web, because there is no verification process or authenticity check before publishing. This has given rise to fake news on the Web. Using public data from the Twitter API, we are going to develop an application based on Machine Learning techniques that can ultimately detect fake news within a pool of random news.

Key words

Authenticity, Fake News, Machine Learning, Data Mining, Text Mining.

Chapter 1 - Project Proposal

Problem area - Innovation area

«The term fake news is simultaneously too broad and too narrow» (Winston Churchill, 2018).


The problem area of this project starts with defining the main subject of study: "Fake News". It is a term that can be stretched in several directions because of the effects and aspects involved. In this section we take a brief look at its historical foundations in order to explain and narrow down what "Fake News" means for the purposes of this project.


Historical Foundations

"A lie is halfway around the world before the truth has even got its boots on!", Mark Twain


Fake news was named 2017's word of the year. It has raised tensions between nations and may lead to regulation of social media. Fake news is not only one of the most debated socio-political topics of the last three years, but is also seen as one of the greatest threats to democracy, free debate and the Western order (Telegraph, 2018).

Although the term "Fake News" has become widely used only in recent years and may seem to be new terminology, it is actually a rather old matter, known as "disinformation". The history of disinformation has repeated itself over the years, but the main goal has stayed the same.


It has been a concept used by governments and powerful individuals as a weapon for millennia. Octavian famously used a campaign of disinformation to aid his victory over Marc Antony… (Carson, 2018).

There are plenty of examples of false news throughout history. In the 15th century, more specifically in 1439, the year the printing press was invented by Johannes Gutenberg, the diffusion of disinformation and misinformation was facilitated through sensationalist accounts of everyday events; the list of scandals, lies and hoaxes throughout history is long (Darnton, 2017).

In 1522, the papal election of that year was manipulated by writing "wicked sonnets" about all the candidates except Medici; Pietro Aretino was one of the favourite authors of these sonnets. The sonnets were posted on the bust of a figure known as Pasquino, located near the Piazza Navona in Rome, where most fake news about public figures was diffused. "Pasquinade" was the name given to what developed into a genre of news. After the Lisbon Earthquake of 1755, the church and many European authorities blamed the natural disaster on divine retribution against sinners, and fake news pamphlets alleged that some survivors owed their lives to an apparition of the Virgin Mary. These spreading accounts made up one of the more complex news stories of all time (Soll, 2016).

In the 1780s, the capture of a monster in Chile that was allegedly being shipped to Spain was announced. It was recorded in the best-selling "canard", which can be considered the successor of the pasquinade. Canards were the Parisian version of fake news for the next two hundred years.

Moving to New Delhi, India: in July 1983 a remarkable story appeared in a newspaper: "AIDS may invade India. Mystery disease caused by US experiments. It was created as a biological weapon" [Figure.1].

The fake story, which some still believe to be true today, spread around the world for six years: across Africa, Kenya, Bangladesh, Bulgaria, Cameroon, Finland, Pakistan, London and Russia. On March 30th, 1987, it crossed the ocean and reached national television across the United States.

The image of the United States was damaged, with a toxic impact on its culture and policies. It sat at the back of Americans' minds every day, with fear and anger emerging.


According to The New York Times (2018), the disinformation campaign is a weapon created by the KGB (Committee for State Security) in the 1950s. Its main goal is to change every American's perception of reality to the point where no one is able to reach sensible conclusions in the interest of defending themselves or their community. The KGB had a department for Ideological Subversion, also known as Active Measures (Russian: активные мероприятия), which was active from the 1950s. Its target was to subvert anything of value in the United States: to demoralize the fibre of the nation and destabilize it from the inside, like a virus.

The KGB spent 85% of its time creating false stories to influence people, and its rules for spreading them were seven in total:

   • First, look for cracks or social divisions.
   • Second, create a big lie and, per the third rule, wrap it around a kernel of truth.
   • Fourth, conceal your hand; in other words, hide the source.
   • Fifth, use the "useful idiot" to push the message to the population.
   • Sixth, deny everything, because people's attention span is short.
   • Seventh, play the long game: take years to accumulate the news, and by repetition it will become truth.

It was in 1986-87 that the most powerful and effective disinformation campaign was exposed as fake news propaganda (United States Department of State, 1987). The official document revealed the source and how the fake news was detected. In 1991 the Soviet Union was dissolved, and the Active Measures were thought to have ended with it.

Washington, March 27, 2016: the Clinton email scandal took root through a hack attack. The source was traced to "Guccifer 2.0", a GRU officer from the Grizodubovoy unit in Russia. The KGB had returned in another form, using technology as the venue for its most effective weapon.

From 1975 to 1991, Vladimir Putin was a member of the Communist Party of the Soviet Union and the KGB. He had as a mentor Anatoly Sobchak, from his law degree, and according to a KGB director he was among the most talented at creating false stories. To be promoted in the KGB, agents had to spend 25% of their time coming up with ideas for false stories. It is believed that Putin has tested disinformation on the Russians since that period.

In 1999-2000 Vladimir Putin became President of Russia. A global English-language news channel was created to promote sympathy among Americans, and in 2013 the Internet Research Agency was launched. All the pieces were in place, and Donald Trump won the election of 2016.

Technology turned out to be a way of spreading fake news much more quickly than was ever thought possible. A process that used to take six years of work is now done in six months.

With the arrival of the internet in the late 20th century and the rapid evolution of mobile devices, social media has been growing at exponential rates since the early 2000s, transitioning society into a more digital, mobile and social-media-driven environment.

As shown in [Figure.2], the number of Internet users worldwide has skyrocketed: there were 44 million in 1995 and 413 million in 2000. Since then the growth of internet users has accelerated, reaching 3.4 billion in 2016.

The first recognizable social media site, Six Degrees, was created in 1997. It enabled users to upload a profile and make friends with other users. Sites like MySpace and LinkedIn then gained prominence in the early 2000s.

YouTube came out in 2005, followed by Facebook and Twitter, which both became available to users throughout the world in 2006. These sites remain some of the most popular social networks on the Internet, as shown in [Figure.3] (Hendricks, 2013).

In this social media age, marked by technological innovation, news organizations face new challenges, forcing an adaptation to this platform-dominated media environment. It is easier to find news on social media than through traditional news organizations. This accessibility results from its often timelier and less expensive delivery of news, and from the ease of commenting on and sharing it.


The chart in [Figure.3] comes from a report based on a YouGov survey of about 50,000 people across 26 countries. It suggests that Facebook and other social media outlets have moved beyond being "places of news discovery" to become the place where people consume their news (Wakefield, 2016).

Because of this fast and easy access to information, the quality of news on social media is lower than that of traditional news organizations. The low cost and fast dissemination of information through social media has led to an urgency to be informed every minute, multiplying the sharing of fake news compared to past times.

A recent study by Indiana University researchers, published in the journal Nature Communications on Nov 20th, 2018, analysed low-credibility stories posted on Twitter: 14 million messages spreading 400 thousand articles during ten months in 2016 and 2017. The researchers collected 389,569 articles from low-credibility sources and 15,053 articles from fact-checking sources, along with the public posts linking to these articles: 13,617,425 tweets linked to low-credibility sources and 1,133,674 linked to fact-checking sources.

The evidence showed that bots contribute significantly to the spread of low-credibility articles before they go viral. Even though only 6% of accounts were identified as bots, this was enough to spread 31% of the low-credibility content and account for 34% of the retweets of these shared articles. [Figure.4] shows the popularity and bot support for the top sources. Satire websites are shown in orange, fact-checking sites in blue, and low-credibility sources in red. Popularity is measured by total tweet volume (horizontal axis) and median number of tweets per article (circle area).


Bot support is gauged by the median bot score of the 100 most active accounts posting links to articles from each source (vertical axis). Low-credibility sources have greater support from bots, as well as greater median and/or total volume in many cases. A bot is an automated application used to perform simple and repetitive tasks; bots can also operate on social network sites and simulate internet users' behaviour. On Twitter, bots are capable of different social interactions that may resemble the behaviour of people:

   • Based on scripts, they have the ability to reply to postings or questions from users.
   • They can contact users by sending them questions, resulting in an exchange of communication; in this way bots earn the trust of these users.
   • They generate debate by posting messages about trending topics.

Bots' algorithms allow them to respond to particular situations by training on response patterns or input values, and their resemblance to people's behaviour helps the propagation of fake news. Bots can search for and retrieve information that has been neither validated nor authenticated; they also continually post this non-authenticated information using strategies such as "trending topics" or "hashtags" to reach an audience. As history shows, Fake News is a concept whose basic intention is to distort information, spreading it through different channels of communication in order to manipulate people. It has been around for a long time and is repeated constantly by governments or powerful individuals.

In the present project, Fake News will be defined as disinformation, which is believed to be the most suitable term for this proposal. Fake news moulds people's perceptions and impacts people's reality by changing and confusing their thoughts and actions.


"Disinformation is defined as deliberately distorted information that is secretly leaked into the communication process in order to deceive and manipulate.", Vladimir Bitman, KGB director


Solution to Problem/Innovation Solution

Machine Learning Model for Fake News Detection

In short, this project seeks to build machine learning models that allow us to identify whether a sports news article published on the Internet is fake, according to the terms defined above.

In order to do so, we first need to recognize and extract features from the news article that can be related to the veracity of its content. There are many different techniques that have been used in similar projects to build models for fake news detection.

Text Mining techniques
  • Tidy Data Principles: In order to prepare the data, tidy data principles will be used to handle the data more easily and effectively. [Figure.7a] shows a flowchart of a typical analysis using tidy data principles, and [Figure.7b] a small program in R that takes a small piece of text, splits it into tokens and organizes them in a data frame (Silge, 2017a).
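While [Figure.7b] shows this in R with tidytext-style code, the same one-token-per-row ("tidy") structure can be sketched in plain Python. This is only an illustration: the function name, the regex, and the sample lines are our own, not taken from the cited code.

```python
import re

def unnest_tokens(lines):
    """Tidy-text style tokenization: one output row per (line, word) pair."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        # Lowercase, then keep alphabetic tokens (apostrophes allowed)
        for word in re.findall(r"[a-z']+", line.lower()):
            rows.append({"line": line_no, "word": word})
    return rows

# Two toy lines of text become seven tidy rows
tidy = unnest_tokens(["Fake news spreads fast", "News on Twitter"])
print(tidy[:2])  # → [{'line': 1, 'word': 'fake'}, {'line': 1, 'word': 'news'}]
```

A data frame library (pandas in Python, tibbles in R) would hold the same rows; the tidy shape is what makes word counts and joins with sentiment lexicons straightforward.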


As mentioned by Granskogen (2018), this kind of analysis can be done through two approaches:

  • The Linguistic approach: It is based only on the analysis of the content of the text itself. «This approach involves using techniques that analyse frequency, usage, and patterns in the text» (Granskogen, 2018).

This approach is reasonable because fake news articles are usually intentionally created using inflammatory language and sensational headlines for specific purposes: i.e., to tempt readers to click on a link or to incite confusion. So, the linguistic analysis seeks to capture the writing styles of fake news articles (Shu et al, 2016 and Chen et al, 2015).

  • The Contextual approach: «Incorporate most of the information that is not text. This includes data about users, such as comments, likes, re-tweets, shares and so on. It can also be information regarding the origin, both as who created it and where it was first published» (Granskogen, 2018).
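The contrast between the two approaches can be sketched with a toy feature extractor. All field names here (text, retweets, account_age_days) are hypothetical placeholders for whatever the final dataset provides, not a final feature set.

```python
def extract_features(article):
    """Toy features: linguistic ones from the text, contextual ones from metadata."""
    text = article["text"]
    words = text.split()
    return {
        # Linguistic approach: style signals taken from the text itself
        "n_words": len(words),
        "n_exclamations": text.count("!"),
        "caps_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        # Contextual approach: signals taken from the data around the text
        "retweets": article.get("retweets", 0),
        "account_age_days": article.get("account_age_days", 0),
    }

features = extract_features({"text": "SHOCKING result!!", "retweets": 950})
```

A real pipeline would compute many more features of both kinds and feed them to the models listed below.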


Some of the most common techniques that have been used for fake news detection are:

  • Linguistic approach:
    • Sentiment analysis
    • Naive Bayes
    • Support Vector Machines


  • Contextual approach:
    • Network analysis
    • Logistic regression
    • Trust Networks


To build the Machine Learning Model we will need a dataset constituted of true and false news articles [Figure.6]. This data, called the training data, will be analysed using the techniques mentioned above, such as sentiment analysis, Naive Bayes, network analysis, logistic regression, etc.
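As a minimal sketch of one of these techniques, a Naive Bayes classifier can be written from scratch. The headlines and labels below are invented placeholders for illustration only, not real training data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing, built from scratch."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.class_counts = Counter(labels)       # document counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace denominator
            score = math.log(n_docs / total)                # log prior
            for word in doc.lower().split():
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Invented toy headlines; a real model would train on the mined dataset
docs = ["shocking secret scandal revealed", "miracle cure shocks doctors",
        "city wins final 2-1", "coach confirms final squad"]
labels = ["fake", "fake", "real", "real"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("shocking scandal at the club"))  # → fake
```

In practice a library implementation (e.g. scikit-learn's MultinomialNB) would replace this sketch, but the mechanics are the same: word frequencies per class, smoothed and combined with the class prior in log space.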


It is very important to note that there is no unique way of detecting fake news. This is a recent problem that has only been studied in the last few years.

To generate an accurate Machine Learning Model, we will have to perform a lot of tests using different techniques and approaches. That is why, at this point of the project, it is not possible to determine the exact methodology that will be used for building the Machine Learning Model.

In this project proposal we have decided to reduce the domain of fake news detection to «sports news». We have taken this decision because most of the studies we have reviewed confirm that Machine Learning Models for fake news detection have shown good results in closed domains (Conroy et al, 2015). However, more recent research indicates that a contextual approach should improve accuracy in open domains (Granskogen, 2018).


Prototype of an Application to Integrate the Machine Learning Model into a Browser to Access Sports News Articles

The second part of this project consists of creating an application that provides a practical way of detecting fake news at the moment when the reader accesses them.

The idea is to develop an application that works as an extension of the internet browser, or of the application used to access the news, for instance the Twitter app. In this sense, when the reader accesses the news, the application, using the Machine Learning Model, must check the news article and give feedback on its authenticity.

We want the application to display a kind of Pop Up Notification when the reader is accessing the news article. This notification must show a measure of the veracity of the news article, as shown in [Figure.8].
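As an illustration of what that notification could report, here is a hypothetical mapping from the model's estimated probability that an article is authentic to a displayed verdict. The thresholds are arbitrary placeholders, to be tuned once the model exists.

```python
def veracity_label(p_authentic):
    """Map a model score in [0, 1] to the verdict shown in the pop-up."""
    if not 0.0 <= p_authentic <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_authentic >= 0.8:
        return "Likely authentic"
    if p_authentic >= 0.5:
        return "Unverified"
    return "Likely fake"

print(veracity_label(0.92))  # → Likely authentic
```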


Project Goal

Extract Data from Twitter:

  • Authentic News
  • Fake News


Detect:

  • Disinformation that is deliberately distorted.
  • Secretly leaked into the communication process.
  • Intent to deceive and manipulate people.


Build:

  • Machine learning algorithm.


Train:

  • Machine learning model.


Project Objectives

Data will be extracted from Twitter using tweepy in order to build the machine learning algorithm and train the model.


Part I
Part II

Resource Requirements

Project Scope

Extracting and Mining the Data from Sports News

Our first objective is to mine the data that will be used to build the Machine Learning Model. We have decided to start our project using data from Twitter by using tweepy.

Why Twitter data?

  • Twitter is one of the most important social networks. With millions of active users, Twitter provides a huge volume of data.
  • «Unlike other social platforms, almost every user's tweets are completely public and pullable.» (Sistilli, 2015)
  • Another good reason to use Twitter platform is that «Twitter's API allows you to do complex queries like pulling every tweet about a certain topic» (Sistilli, 2015).
  • Mining data from a social network will allow us to use contextual approach techniques for fake news detection.

We are going to use Python code that interacts with Twitter's API to mine the data.
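A sketch of that mining code, assuming tweepy 3.x: the credentials, query, and field selection are illustrative placeholders, and real API keys from a Twitter developer account are needed before mine_tweets can actually run.

```python
def tweet_to_row(status):
    """Flatten the raw tweet JSON into the fields we plan to keep."""
    user = status.get("user", {})
    return {
        "id": status.get("id_str"),
        "text": status.get("text", ""),
        "retweets": status.get("retweet_count", 0),
        "followers": user.get("followers_count", 0),
    }

def mine_tweets(query, limit=100):
    """Pull recent tweets matching `query`; needs real API keys to run."""
    import tweepy  # pip install tweepy
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return [tweet_to_row(s._json)
            for s in tweepy.Cursor(api.search, q=query, lang="en").items(limit)]
```

For example, mine_tweets("football news") would return a list of flattened rows ready to be inserted into the database.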

The data will be stored in a MySQL database [Figure.6].
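The table could look like the sketch below. sqlite3 is used here only so the example is self-contained; the same DDL, with minor type changes, applies to MySQL, and the column names are our own assumptions about the final schema.

```python
import sqlite3

def init_db(conn):
    """Create the tweet store; MySQL would use near-identical DDL."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            id       TEXT PRIMARY KEY,
            text     TEXT NOT NULL,
            retweets INTEGER DEFAULT 0,
            label    TEXT            -- 'real', 'fake', or NULL while unlabelled
        )""")

def save_tweet(conn, row):
    """Upsert one flattened tweet row (sqlite syntax; MySQL uses REPLACE INTO)."""
    conn.execute(
        "INSERT OR REPLACE INTO tweets (id, text, retweets, label) VALUES (?, ?, ?, ?)",
        (row["id"], row["text"], row.get("retweets", 0), row.get("label")),
    )

conn = sqlite3.connect(":memory:")
init_db(conn)
save_tweet(conn, {"id": "1", "text": "City wins the final 2-1", "label": "real"})
```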

Build the Machine Learning Model for Fake News Detection
  1. The first step will be to perform a linguistic-based approach using sentiment analysis.
  2. Then, we will test other linguistic-based algorithms.
  3. Finally, we will extend the analysis using contextual techniques.
  4. We are going to use R language to test and implement the different techniques that will allow us to extract features from the data and build the model.
Below is a simple model to detect fake news by words associated with fake news:

[Figure.9]
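A minimal version of that word-association idea can be sketched as follows; the marker lexicon here is invented for illustration, whereas the real list would be learned from the training data.

```python
# Hypothetical lexicon of words associated with fake news headlines
FAKE_MARKERS = {"shocking", "secret", "miracle", "exposed", "you won't believe"}

def fake_word_score(headline):
    """Fraction of known fake-news markers found in the headline, plus the hits."""
    text = headline.lower()
    hits = sorted(m for m in FAKE_MARKERS if m in text)
    return len(hits) / len(FAKE_MARKERS), hits

score, hits = fake_word_score("SHOCKING: secret injury exposed")
```

This kind of score is far too crude on its own, but it is a useful baseline to compare the trained models against.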


Summary Schedule

Risk Analysis

Conclusion

The Fake News concept has been around for ages and has been used for financial and political gain; it therefore has an extremely negative impact on individuals and society. Shu et al (2016) show that social media has been used to provide low-quality news because it is cheap to produce and much faster and easier to disseminate.

For this reason, fake news detection on social media is a challenging and relevant problem. Machine learning promises to help us, as humans, scale up fake news detection and hopefully protect users from it.


Reference List