Extractive Text Summarization System for News Texts

ABSTRACT


Introduction
Writing is one of the main ways of communication.
Thousands of languages are spoken around the world, and every day people produce enormous numbers of texts in them. To learn the main idea and the essential information a text carries, a reader normally has to read it in full. Often the amount of unimportant material in a text exceeds the amount of relevant information, so readers must go through the entire document just to find out what it is about, which wastes a great deal of time. To extract the main theme and the desired information from texts and solve this problem, the topic of automatic text summarization has attracted much attention.
Automatic summarization systems let readers grasp the main theme of a text with ease and free them from redundant data. Reading time decreases, the useful density of the text increases, and information becomes easier to access.
Automatic summarization systems can be classified as extractive or abstractive, depending on their approach and methods. This work focuses on extractive summarization, because abstractive systems still perform worse and often produce less accurate information than extractive ones. The reason is that extractive systems select information directly from the source text instead of trying to generate new sentences themselves. Even though extractive summarization systems are not perfect, they give readers a good idea of the original text [4]. In addition, many measurement methods are used to evaluate the performance of automatic summarization systems on a given dataset. These evaluation methods fall into two groups: task-independent and task-based. Task-independent methods rely on an expert summary (an ideal or reference summary). Task-based methods do not analyse the sentences of the summary themselves; their main goal is to analyse how useful a summary is for a specific task. There are many approaches to task-based evaluation; the three most important tasks are categorization, information retrieval and question answering [4]. Furthermore, automatic summarization systems can operate on a single document or on multiple documents.
This work evaluates multiple-document summarization with a task-independent method; the algorithm is evaluated according to the ROUGE automated evaluation metrics.
To test the algorithm, a news dataset with five categories was used. The number of documents per category is:

- Business: 510
- Entertainment: 386
- Politics: 417
- Sport: 511
- Technology: 401

One of the most important topics needed in this and similar studies is text mining.
The dataset can be found at [17]. It also contains ideal summaries, which are otherwise hard to obtain, so that the system can be evaluated easily.

Text Mining
Text mining is a data mining method that makes it possible to explore information in raw text. It is mostly used for finding related documents and discovering relationships between concepts [12]. It is a data analysis approach that obtains information from existing data using statistics, machine learning, database systems and similar fields. Word or phrase extraction, feature extraction and data preprocessing, as used in this article, are examples of text mining. It can be used to extract information from large collections, to summarize, or to compute similarities, and it reduces the cost in time and resources. It generally consists of six steps. These are:

Data Acquisition
The first stage of text or data mining is to obtain the data [15].
Data sources suitable for the project can be obtained from an online or offline source. In addition, having a package of expert summaries lets researchers evaluate the system.

Preprocessing Phase
While data is being collected, it often contains unwanted characters or arrives in an incorrect order, which can harm the results. For example, in a sentiment analysis study, replacing micro-texts with their original forms yielded a ~4% performance improvement [6]. Data can be preprocessed with methods such as lowercase conversion, whitespace removal, punctuation deletion, character replacement, minimum-word-length filtering, stop-word elimination, stemming [14], lemmatization and more.

Feature Extraction
This is the phase where the raw dataset is reduced to more manageable pieces that can be processed efficiently. In short, it covers the methods that describe the original dataset accurately and completely while reducing the amount of data that has to be processed.

Data Mining
This is the stage where unprocessed data is turned into useful information; it is based on data collection, storage and mathematical processing.
Data mining is an important phase in which various methods extract patterns from the data. The aim is to find the relationships between groups of knowledge that reveal the key points, and to enable researchers to discover new information that would otherwise be difficult to obtain [15].

Data Visualization
It is the presentation of the values obtained from the preceding steps to the user in a visual form, such as a graphic.

Evaluation
In data mining, the evaluation of the results is provided by precision, recall and the F-score, which depends on the precision and recall values. With

x = number of matching sentences between the reference summary and the system summary
y = number of sentences in the system summary
z = number of sentences in the reference summary

the formulas are as follows [16]:

Precision = x / y
Recall = x / z
F-score = (2 × Precision × Recall) / (Precision + Recall)

*The sources of the x, y and z values may vary according to the project.
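These three measures can be sketched directly from the definitions above:

```python
def precision_recall_f1(x, y, z):
    """x: matching sentences in both summaries,
       y: sentences in the system summary,
       z: sentences in the reference summary."""
    precision = x / y
    recall = x / z
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 3 of the 5 system sentences also appear in a 4-sentence reference.
p, r, f = precision_recall_f1(x=3, y=5, z=4)
```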

Related Works
The first automatic summarization system was developed by Luhn in 1958 based on term frequency. In 1969, Edmundson's automatic text summarization system used standard keyword methods such as word frequency, cue words, title words and sentence position to assign sentence weights. In 1995, the Trainable Document Summarizer performed sentence extraction with a weight-based heuristic. In the 1990s, machine learning techniques in natural language processing applied statistical methods to create document summaries [2].
Zemberek, one of the studies on the Turkish language, is an open-source Turkish natural language processing library that many researchers working on natural language processing have examined and used. It can be used for tasks such as finding word roots or proper names in texts. The currently published second version, Zemberek2, can also be used for other Turkic languages [11] [13].
Automatic text summarization is used today by platforms such as search marketing, search engines, news websites, bots and social media marketing. Google Infographics and Bing News Snippets are known examples of automatic text summarization.
Automatic Text Summarization is examined under two main titles: Extractive Text Summarization and Abstractive Text Summarization.

Extractive Text Summarization
Extractive summarization is based on sentence weighting: the words and phrases in the text are obtained together with their frequencies. The sentences with the highest scores are selected from the document and the remaining, less useful sentences are discarded. Many methods can be used for scoring, and automated methods (e.g. ROUGE) are used for algorithm evaluation.

Abstractive Text Summarization
Abstractive summarization works differently from extractive text summarization: it interprets the text and then creates a new, shorter summary whose sentences differ from the original text. ROUGE or other evaluation methods can likewise be used for algorithm evaluation. It is more difficult to implement than extractive summarization; although the accuracy of the obtained results is lower than with the extractive method, the results resemble human-written summaries more closely.

Data Acquisition and Preprocessing
There are two types of processing: single and multiple file/data processing. Regardless of which is used, the data to be summarized must first be parsed into sentences and words.
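A naive sketch of this parsing step is shown below. The regular expressions here are simplifications; a production system would use a trained sentence tokenizer (e.g. NLTK's punkt) instead:

```python
import re

def split_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_words(sentence):
    # Extract lowercase word tokens.
    return re.findall(r"\w+", sentence.lower())
```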

Creating a Frequency Table
All words are reduced to their roots and placed in a table together with their number of occurrences in the whole text.
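A frequency table of this kind can be built with a counter. The `stem` argument below is a pluggable root-finding function (identity by default); a real system would plug in a stemmer for the target language:

```python
import re
from collections import Counter

def word_frequencies(text, stem=lambda w: w):
    """Count occurrences of each (stemmed) word in the whole text."""
    words = re.findall(r"\w+", text.lower())
    return Counter(stem(w) for w in words)

freqs = word_frequencies("The cat sat. The cat ran.")
```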

Sentence Scoring
A table is needed to keep the sentence scores. Its keys can be trimmed to obtain good matching; for example, only the first twenty characters of each sentence may be used when testing for equality. The scoring process then proceeds using the word frequency table, and can be done with various methods.
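The description above can be sketched as follows, with the trimmed sentence prefix used as the table key:

```python
import re
from collections import Counter

def score_sentences(sentences, freqs, key_length=20):
    """Score each sentence by summing the frequencies of its words.
    Only the first `key_length` characters serve as the table key,
    as described above."""
    scores = {}
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        scores[sentence[:key_length]] = sum(freqs.get(w, 0) for w in words)
    return scores

freqs = Counter({"cat": 2, "sat": 1})
scores = score_sentences(["The cat sat.", "Dogs bark."], freqs)
```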

Term Frequency
The term frequency algorithm is used for weighting the sentences in the scoring process.

Term Weighting
The Term Weighting method divides the term frequency score of each sentence by the highest term frequency score in the document [2].
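This normalization maps every sentence score into the [0, 1] range:

```python
def term_weighting(scores):
    """Divide each sentence score by the maximum score in the document."""
    max_score = max(scores.values())
    return {key: s / max_score for key, s in scores.items()}

# Example with hypothetical raw term-frequency scores.
weights = term_weighting({"s1": 6, "s2": 3, "s3": 2})
```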

Numerical Data
Finding numerical data within sentences is one of the useful ways to judge sentence significance, since sentences containing figures often carry important facts. The numerical data score is calculated as follows [7]:

Numerical Data Score = (number of numerical tokens in the sentence) / (sentence length)

Sentence Length
The importance of a sentence may increase with its length. It is calculated as follows [7]:

Sentence Length Score = (number of words in the sentence) / (number of words in the longest sentence)

Proper Nouns
The quantity of proper nouns can identify dominant sentences in the document. Its value is computed as follows [2]:

Proper Noun Score = (number of proper nouns in the sentence) / (sentence length)

Sentence Location
The location of a sentence is also important for judging its importance. In the following formula, N is the number of sentences in the document and Pᵢ is the location of the sentence. The sentence location value is computed as follows [5]:
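As a sketch, the four surface features above might be computed as below. Note that the proper-noun heuristic (capitalized non-initial words) and the location formula (earlier sentences score higher) are simplifying assumptions, not necessarily the exact formulas of [2] and [5]:

```python
def numerical_data(words):
    # Fraction of tokens that contain a digit.
    return sum(any(c.isdigit() for c in w) for w in words) / len(words)

def sentence_length(words, longest):
    # Length relative to the longest sentence in the document.
    return len(words) / longest

def proper_nouns(words):
    # Crude heuristic: capitalized words after the first token;
    # a real system would use a part-of-speech tagger.
    return sum(w[0].isupper() for w in words[1:]) / len(words)

def sentence_location(p, n):
    # Assumed variant: earlier sentences score higher.
    return (n - p + 1) / n
```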

Sentence Similarity
The next step is scoring the sentences according to their similarity to the first and last sentences of the text. The cosine similarity formula is needed for the similarity computation, and is computed as below [5]:

cos(x, y) = (x · y) / (|x| |y|)

To calculate the similarity of the current sentence to the first sentence, the variables are chosen as follows: x is the current sentence and y is the first sentence of the text. Likewise, for the resemblance to the last sentence, x is the current sentence and y is the last sentence of the text.
Cosine similarity requires the texts in vector form. Practical models such as Bag of Words can be used to convert text to vectors.
You can check out this article for more scoring methods [1].

Evaluation
The test part of the study uses three evaluation methods: ROUGE-N, ROUGE-L and ROUGE-S. The following sections explain each of them.

ROUGE-N
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was used as the automated evaluation method in the DUC evaluations. The n-gram based ROUGE evaluation package was first introduced in 2003 [8].
For example, ROUGE-1 represents the number of matches of unigrams between the system summary and the reference summary. ROUGE-2 stands for the number of matches of the bigrams between the system summary and the reference summary [9].
To understand ROUGE-N, we must first know what an N-gram is. Suppose we have the sentence "Data science is very important". Splitting it according to the N-gram models gives:

- Unigram (N=1): ["Data", "science", "is", "very", "important"]
- Bigram (N=2): ["Data science", "science is", "is very", "very important"]
- Trigram (N=3): ["Data science is", "science is very", "is very important"]
- Fourgram (N=4): ["Data science is very", "science is very important"]

The ROUGE-N evaluation of a system summary is performed as follows:

ROUGE-N = (number of matching N-grams between the system summary and the reference summary) / (number of N-grams in the reference summary)
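The recall-oriented ROUGE-N computation can be sketched as:

```python
from collections import Counter

def ngrams(words, n):
    """All consecutive n-word tuples of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n(system, reference, n):
    """Matching n-grams divided by the number of n-grams
    in the reference summary (clipped counts)."""
    sys_counts = Counter(ngrams(system, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(sys_counts[g], c) for g, c in ref_counts.items())
    return matches / len(ngrams(reference, n))

score = rouge_n("the cat sat on the mat".split(),
                "the cat is on the mat".split(), 1)
```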

ROUGE-L
For the ROUGE-L evaluation, the longest common subsequence (LCS) between the system summary and the reference must be found.
One of the advantages of using LCS is that, unlike the n-gram models, it does not require consecutive matches. A pre-prepared n-gram model is not required, because LCS automatically contains the longest common n-grams [9]. The ROUGE-L test score of a system summary is computed as follows, where X is the reference summary of length m and Y is the system summary of length n:

R_lcs = LCS(X, Y) / m
P_lcs = LCS(X, Y) / n
F_lcs = (1 + β²) · R_lcs · P_lcs / (R_lcs + β² · P_lcs)
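The LCS length can be found with standard dynamic programming, and the F-measure then follows (β = 1 gives the balanced F-score):

```python
def lcs_length(x, y):
    """Dynamic-programming longest common subsequence length."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(system, reference, beta=1.0):
    lcs = lcs_length(system, reference)
    r = lcs / len(reference)   # recall
    p = lcs / len(system)      # precision
    if r + p == 0:
        return 0.0
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

score = rouge_l("the cat sat on the mat".split(),
                "the cat is on the mat".split())
```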

ROUGE-S
A skip-bigram is any ordered pair of words taken in sentence order, allowing arbitrary gaps between them. Skip-bigram co-occurrence statistics measure the matching skip-bigrams between the system summary and the reference summary [3]. For the model calculations, formulas or functions in the NLTK library can optionally be used [10].
After all skip-bigrams in the system summary and the reference summary are found, the matching pairs and their counts must be determined to continue the evaluation. The ROUGE-S test score is calculated as follows [3], where SKIP2(X, Y) is the number of skip-bigram matches and C(·, 2) counts the skip-bigrams of a summary of the given length:

R_skip2 = SKIP2(X, Y) / C(m, 2)
P_skip2 = SKIP2(X, Y) / C(n, 2)
F_skip2 = (1 + β²) · R_skip2 · P_skip2 / (R_skip2 + β² · P_skip2)
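A simplified sketch of this computation is given below. Using sets collapses duplicate word pairs, which is a simplification of the counted version above, and β = 1 is assumed:

```python
from itertools import combinations

def skip_bigrams(words):
    """All ordered word pairs with arbitrary gaps (deduplicated)."""
    return set(combinations(words, 2))

def rouge_s(system, reference):
    sys_sb = skip_bigrams(system)
    ref_sb = skip_bigrams(reference)
    matches = len(sys_sb & ref_sb)
    r = matches / len(ref_sb)   # recall
    p = matches / len(sys_sb)   # precision
    return 2 * r * p / (r + p) if r + p else 0.0

score = rouge_s("the cat sat".split(), "the cat ran".split())
```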

Test Results
As mentioned before, the Business (B) category has 510 documents, Entertainment (E) has 386, Politics (P) has 417, Sports (S) has 511 and Technology (T) has 401. The ROUGE-based test results are shown in the table and graph below. The ROUGE-N (N = 1, 2, 3, 4, …) metric requires the N-gram algorithm to calculate its result, the ROUGE-L metric requires finding the longest common subsequence, and the ROUGE-S metric requires the Skip-bigram algorithm.
You can check Lin's article for the detailed evaluation steps [3].

Conclusion
The automatic extractive text summarization work carried out the steps described above. According to the evaluation result table, the ROUGE-N test result decreases as N increases. The reason for the drop lies in the N-gram algorithm: as N grows, the number of matching N-grams between the system and reference summaries falls faster than the total number of N-grams in the reference summary, so the ratio decreases.
Besides, this work can be integrated into search engines to produce summaries of news or any other text, and it can be used to retrieve information from long and important data collections at a lower time cost. This paper contains simplified formulas and process steps to help anyone who wants to work on the extractive text summarization topic understand and apply them easily.

Future Works
This article describes how an extractive text summarization model works, and we hope it helps and inspires anyone who reads it. In future projects, other articles can be examined alongside the methods used in this study to obtain new features. The Latent Semantic Analysis method can also be applied to obtain new and different results, and additional metrics can be added for a broader evaluation. Moreover, it would be valuable to work on an abstractive text summarization project or to develop a model with deep learning techniques.