The huge number of available documents in digital media makes it difficult to obtain the
necessary information related to the needs of a researcher or a learner. As
users do not have enough time to read documents in detail, shorter versions of
the documents are of interest. Text summarization generates a short text for a
single document or multiple documents and serves this purpose. Using the
summary, a user can decide whether a document is relevant to his or her needs
without reading the whole document. Summaries are very useful and have
become indispensable in day-to-day life. Nowadays, people buy products after
reading summarized reviews. Readers decide which newspaper article to read
by scanning the article titles, which summarize what is explained
below them. Before reading a book, readers skim the summary to find out
whether the book is worth reading. The same applies to online text resources,
and specifically to research papers, because of their high availability. The research
community or the learning community can find out what has been dealt with so far in
a domain and decide on the relevance of a document or the scope for
research. Hence, summaries can take various forms and can sum up various kinds
of content.
Text summarization is one of the core tasks of computational linguistics and has
become one of the significant topics in today's research. The summary should be
informative and as short as possible. It means that the summary should cover
important concepts of the original document or documents and should not include
unnecessary details. Although much research has been carried out on summary
generation, identifying the key sentences and organizing them into a coherent
summary remains difficult, so summarization still offers vast scope for research.
Though we have many types of text documents,
almost all are written to explain a concept, an idea or a topic. The core idea
will be the primary subject, and the whole document will be divided into subsections
in line with the principal point. The purpose of the summary is to convey the
overall idea being dealt with, comprising the significant subject matter.
Content to be summarized can be any information representable in text. Manual
summary writing involves a good understanding of the topic being discussed in
the document. Human summarizers apply
their domain knowledge and comprehension skills to produce a summary. Even
without domain knowledge, humans can produce a summary by applying their
common sense and experience. However, making computers replicate human
performance is very difficult, because computers cannot understand and interpret
natural languages as humans do.
Although a good deal of work has been done to make computers generate summaries
comparable to manual summaries, such systems are restricted to particular domains and require a lot
of training. Hence, most automatic summarization programs analyze a text
statistically and linguistically, to determine important sentences, and then
generate a summary text from these important sentences. The main ideas of most
documents can be described with as little as 20 percent of the original text (Goldstein 1999). Automatic summarization
generates a brief and compressed block of text containing the essential information
content from the input document. Text summarization systems can be classified based
on various factors as follows:
1. Number of input documents
Single document - Generating summary from one document.
Multiple documents - Generating a single summary from two or
more related documents.
2. Methodology applied and nature of output
Extraction - Extracts important text units such as
phrases, sentences and paragraphs from the input document. The summary contains
sentences exactly as they appear in the input document.
Abstraction - The important textual units are identified and
rewritten without loss of information. The summary contains a new set of
sentences.
3. Summary type
Generic - Provides a general overview of the text document.
It focuses on capturing the main topics of the input document and conveying
what the document is about.
Query based - Provides a summary specific to the query term.
Indicative or Descriptive - Provides an idea of what the input
document is about without informative content. It is a description of the input
document.
Informative - Provides the main ideas or topics of the input
document.
Critical - Provides the writer's opinion, evaluation or
review of the input document.
4. Language
Mono lingual - Provides a summary in the same language as that
of the input document.
Cross lingual - Provides a summary in a language different from
that of the input document.
Multi lingual - The input documents are in multiple languages
and the summary is generated in the target language.
5. Learning approach
Supervised - Uses annotated data for training.
Unsupervised - Does not use annotated data,
but uses linguistic and statistical information obtained from the document.
Graph-based summarization is an unsupervised approach in which
a document or a set of documents is represented as a text similarity graph,
constructed by taking text units such as words, phrases or sentences as
vertices and their interconnecting relationships as edges.
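The graph representation described above can be illustrated with a minimal sketch. It assumes sentences as the text units, bag-of-words cosine similarity as the interconnecting relationship, and an arbitrary similarity threshold of 0.1; the function names are hypothetical and this is not the system developed in this thesis.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_similarity_graph(sentences, threshold=0.1):
    """Return an adjacency map {vertex: {neighbour: weight}}.

    Each sentence is a vertex; an undirected weighted edge links two
    sentences whose similarity exceeds the (assumed) threshold.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    graph = {i: {} for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        weight = cosine_similarity(bags[i], bags[j])
        if weight > threshold:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def rank_by_weighted_degree(graph):
    """Order vertices by the sum of their edge weights, highest first."""
    return sorted(graph, key=lambda v: sum(graph[v].values()), reverse=True)


if __name__ == "__main__":
    example = [
        "Graphs represent documents.",
        "Documents are represented as graphs of sentences.",
        "The weather is sunny.",
    ]
    graph = build_similarity_graph(example)
    for vertex in rank_by_weighted_degree(graph):
        print(vertex, example[vertex], graph[vertex])
```

Weighted degree is used here only as the simplest possible vertex score; a ranking algorithm such as TextRank could be run over the same adjacency map.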
1.2 DEFINITION EXTRACTION
Definitions in text documents are used to explain a term, a concept or an idea. The Longman dictionary defines the word 'definition' as 'A
phrase or sentence that says exactly what a word, phrase, or idea means', i.e.,
a definition describes the meaning of a term. In some scenarios, it is not necessary to
give the exact meaning; it is sufficient to describe how the word is
used, as in 'A phrase or sentence that makes clear what a word, phrase, or
idea means'. Phrases that fit the Longman definition are called 'narrow
definitions' and the others are called 'broad definitions'.
Definition extraction is the task of automatically identifying the definitions present in
documents. It is useful in the automatic creation of glossaries for the building
of dictionaries, question answering systems, ontology learning, relation
extraction and eLearning. In question answering, definition extraction deals
with "what is" questions (Saggion 2004; Hang Cui et al. 2007). In eLearning,
definitions are used to help students assimilate knowledge (Westerhout & Monachesi
2007b). In this work, the focus is on the use of definition extraction in
eLearning, where definitions can help learners conceptualise new terms and
understand new concepts encountered in learning material. The
definition extraction system is used to extract definition sentences,
which serve as candidate sentences for summary generation.
Many authors have classified definitions from various perspectives,
but none of the classifications provides a complete listing. This thesis adopts the classification given by Westerhout
& Monachesi (2007a). Definitions
are broadly classified into real and nominal definitions.
Real definitions - tell what the definiendum denotes.
Nominal definitions - tell how the definiendum is used.
These definitions are based on the purpose for which the
definiendum exists, is used or is done, or on its rationale. From this perspective, definitions
are divided into two categories: (a) lexical definition and (b) stipulative definition.
Lexical definition (dictionary or reportive definition) - conveys the
meaning of the definiendum as commonly used by people, i.e., the meaning found in
a dictionary.
Stipulative definition (working or operational definition) - assigns a
meaning to a new term or gives a different meaning to an existing term.
These definitions are based on the way something is done, the technique used or the
procedure applied. Accordingly, the definitions are classified into the
following types:
Intensional definition - gives the meaning of a term by a defined set of properties. (e.g.)
An even number is any number that is divisible by 2.
Extensional definition - defines a term by listing all the members that fit the
definiendum. (e.g.) Non-renewable energy is defined by listing the non-renewable
energy sources such as coal, petroleum, natural gas, etc.
Synonymous definition - the
definiendum is replaced by a term or phrase that conveys the same or
approximately the same idea. (e.g.) A toddler
is a preschooler. The subtypes of synonymous definition are:
Derivative - the definiendum is defined with reference to its
origin. (e.g.) lingua (Latin: language)
Translational - explains an unfamiliar word using a familiar
word having the same or a comparable meaning. (e.g.) amica means friend in
Latin.
Analogic - defines a term by comparing it with a similar
entity for better understanding.
Ostensive definition - provides the
meaning of a term by pointing out examples. (e.g.) A circle can be explained by
showing objects like balls, rings, the full moon, pizzas, etc.
Exemplifying - examples
are used to illustrate the meaning of a word or term. (e.g.) Birds means hens,
ducks, crows and doves, and not bats, bees or aeroplanes.
Analytical definition (genus-differentia
definition) - provides an analysis of the definiendum by identifying the broad
category to which it belongs, and then mentions its distinguishing properties.
(e.g.) A sphere is a round solid figure with every point on its surface
equidistant from its centre.
Classificatory - conveys only the
class to which the definiendum belongs. (e.g.) A sphere is a round solid
figure.
Operational - defines a term by
describing its function. (e.g.) Graphviz is an open
source graph visualization software.
Anatomic - specifies the parts of
the definiendum for clarification. (e.g.) A Central Processing Unit consists of
an Arithmetic Logic Unit and a control unit.
Qualitative - specifies the traits
of the definiendum. (e.g.) A peninsula is a piece of land surrounded by water,
while being connected to a mainland from which it extends.
Quantitative - describes the
definiendum by mentioning its size, height, weight or other measurable traits.
(e.g.) Low birth weight describes babies who
are born weighing less than 2,500 grams.
Enumerative definition - gives the complete listing of all entities that fit the
definiendum. (e.g.) Primary colours are red, yellow and blue.
definition - the definiendum or a part of it appears in the definiens. (e.g.) A singly linked list is a list of
elements, each containing a single pointer that points to the next element.
Precising definition - describes the definiendum
with the aim of reducing its vagueness by imposing certain criteria on
its lexical meaning. (e.g.) In the context of the
public distribution system, poor means a family whose annual income is less
Persuasive definition - a biased description of the definiendum in favor of a particular argument or
point of view. It attaches an emotional meaning to the definiendum. (e.g.) "Abortion"
is the murder of an innocent person during pregnancy.
j. Contextual definition - describes
the definiendum by placing it in a context or by defining a larger expression
containing the definiendum. (e.g.) A square has two diagonals and each of
them divides the square into two right-angled isosceles triangles.
k. Reference definition - a definition in which the author refers to another source
of information. (e.g.) According to SAS, big data "describes the large volume of
data – both structured and unstructured – that inundates a business on a
day-to-day basis" (www.sas.com/en_us/insights/big-data).
definition - describes the definiendum by considering the relations among
Antonymic - the
definiendum is described by contrasting words. (e.g.) Foe is the opposite of
friend.
explains the definiendum by situating it between two other terms, which may refer to
anything between the synonym and the antonym of the word to be defined. (e.g.) Mediocre means that the quality
is between good and bad.
These definitions are based on
the assumption of Walter and Pinkal (2006), which states that every definition
has three parts: a definiendum, a definiens and a connector. The definiendum, often the subject, is the term
or concept to be defined. The definiens provides the meaning of the
definiendum. The definiendum and the definiens are linked via the connector,
which can be a verbal phrase or a punctuation character. The connector signifies
the relation between the definiendum and the definiens. The classification of
definitions based on patterns is as follows:
'is a' definition -
definitions in which a form of the verb
'to be' is used as the connector. (e.g.) Android is an operating system for
mobile phones and tablets.
Verb definition - definitions in which a verb or verbal phrase other than a
form of 'to be' (such as to describe, to explain, to mean, to consist of, is called,
etc.) is used as the connector between the
definiendum and the definiens. (e.g.) The process of reducing inflected or derived
words to their word stem, base or root form is called stemming.
c. Punctuation definition - definitions that contain punctuation patterns as connectors. Four types of characters can be used as
connectors: colon, bracket, comma and dash.
Colon definitions - a colon is used to connect the definiendum and the definiens. (e.g.) HTML: The standard markup language for creating web pages.
Bracket definitions - brackets are used around either the definiendum or the definiens. (e.g.) "NLTK" (a Python package for natural language processing)
Comma definitions - a comma is used as a connector between the definiendum and the definiens. (e.g.) Graphviz, an open
source graph visualization software.
Dash definitions - a dash is used to connect the definiendum with the definiens. (e.g.) Platelets – tiny blood cells that stop bleeding.
Pronoun definitions - definitions in which a pronoun or a phrase
containing a pronoun is used to refer to a definiendum or definiens. (e.g.) A solution
to the fragmentation problem is paging. It is a memory management mechanism that
allows the physical address space of a process to be non-contiguous.
Layout definitions -
the layout of the text is the only
indication that it might be a definition. The position and formatting
styles serve as indicators of definitions. (e.g.)
M2M (Machine to machine)
wireless data communication between machines
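The pattern-based classification above can be sketched as a small rule-based classifier that splits a sentence into definiendum, connector and definiens. The regular expressions, the short verb list and the order in which the rules are tried are simplifying assumptions made for illustration; the names are hypothetical, and pronoun and layout definitions are not handled because they need context beyond a single sentence.

```python
import re
from typing import NamedTuple, Optional


class Definition(NamedTuple):
    kind: str          # pattern type, e.g. "is a", "verb", "colon"
    definiendum: str   # term being defined
    connector: str     # verb phrase or punctuation linking the two parts
    definiens: str     # text that gives the meaning


# Rules are tried in order; verb connectors come first so that
# "is called" is not mistaken for a plain form of "to be".
PATTERNS = [
    ("verb", re.compile(
        r"^(?P<definiens>.+?)\s+(?P<connector>is called|are called)\s+"
        r"(?P<definiendum>.+?)\.?$")),
    ("verb", re.compile(
        r"^(?P<definiendum>.+?)\s+(?P<connector>means|consists of|describes)\s+"
        r"(?P<definiens>.+?)\.?$")),
    ("is a", re.compile(
        r"^(?P<definiendum>.+?)\s+(?P<connector>is|are)\s+"
        r"(?P<definiens>(?:a|an|the)\s.+?)\.?$")),
    ("colon", re.compile(
        r"^(?P<definiendum>[^:]+?)(?P<connector>:)\s*(?P<definiens>.+?)\.?$")),
    ("bracket", re.compile(
        r"^(?P<definiendum>.+?)\s*(?P<connector>\()\s*(?P<definiens>.+?)\)\.?$")),
    ("dash", re.compile(
        r"^(?P<definiendum>.+?)\s+(?P<connector>[\u2013-])\s+(?P<definiens>.+?)\.?$")),
    ("comma", re.compile(
        r"^(?P<definiendum>[^,]+?)(?P<connector>,)\s+"
        r"(?P<definiens>(?:a|an|the)\s.+?)\.?$")),
]


def classify(sentence: str) -> Optional[Definition]:
    """Return the first matching pattern split, or None if no rule matches."""
    text = sentence.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return Definition(
                kind,
                match["definiendum"].strip(),
                match["connector"],
                match["definiens"].strip(),
            )
    return None


if __name__ == "__main__":
    for example in [
        "Android is an operating system for mobile phones and tablets.",
        "Platelets \u2013 tiny blood cells that stop bleeding.",
        "We ran the experiment twice.",
    ]:
        print(classify(example))
```

Rules of this kind over-generate: many sentences containing "is a" are not definitions, which is why surface patterns are usually treated as a first filter rather than as a complete extractor.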
Research and development in
automatic text summarization has been growing in importance with the rapid
growth of online information services. The huge number of available documents in digital media
makes it difficult to obtain the necessary information related to the needs of
a researcher or a learner. The most common
approach is to skim the abstract and the conclusion of the article. However, many
documents do not contain an abstract. Even if an
abstract is present, it gives only an overview of what the document is
about and does not necessarily list the most important ideas or points in
it. Sometimes authors use abstracts to
promote their articles by making claims that are not supported in the
article. The summaries given by the authors may not reflect
the source's content and may be biased. The task of
finding relevant documents can be made easier if summaries can be produced
automatically, so summaries of documents and papers help tremendously while
conducting research. To address this issue, automatic summarization systems can be
used to generate summaries from research articles.
Definitions capture the correct
meaning of terms and provide the researcher or learner with enough information to
understand a term (Westerhout 2009a). Not only do readers learn the most important
terms, but they can also decide whether the document is relevant enough for them to
read further. A method to automatically retrieve definitions from
texts is of great value to the research and learner community. The challenge is
to develop a method that is able to distinguish definitions from
non-definitions automatically. A large amount of research has been done on this problem, but
the results still need improvement.
Named Entity Recognition (NER)
is used to find and classify expressions of special meaning in texts written in
natural language. It can be used as a pre-processing tool for Natural Language
Processing (NLP) tasks such as Machine Translation, Question Answering, Text
Summarization, Language Modelling or Sentiment Analysis. The performance of
an automatic definition extraction system can be improved by using named entities. Definition
extraction systems capture the important terms and concepts that are discussed in a
research article. The extracted terms and their definitions can be used to
identify the candidate sentences for summarization, improving the performance
and output of the automatic summarization system.
To design a methodology for automatic focused summarization from research
articles.
To design, implement and evaluate a system for automatic definition extraction.
To model a summarization system to generate research article summaries via
To design a framework for applying philosophical principles to generate
multi-document summarization via definition extraction.
1.5 PROBLEM STATEMENT
This thesis proposes the design and
implementation of a summary generation system for research articles based on
definition extraction. Various approaches have been
applied to implement and improve definition extraction.
This thesis discusses a mechanism to extract definitions, an algorithm
for rating the definitions based on Nannool, and a graph-based summary
generation system that generates summary graphs.
summary generation from research articles by applying semantic filters.
A definition extraction
system has been developed using word-class lattices.
Formal concept
analysis based definition extraction has been improved by applying Named Entity
Recognition.
Language models and
classification algorithms have been applied to improve the performance of
the definition extraction system.
A framework has been
proposed to improve the performance of the definition extraction system by applying
the philosophical principles from Nannool.
Definition extraction
based multi-document summarization has been carried out by applying the philosophical
principles.
ORGANIZATION OF THE THESIS
This thesis is organized as follows: Chapter 1 provides an introduction to
graph-based summarization, definition extraction and the challenges faced by
existing systems. Chapter 2 presents a survey of the relevant published work
on summarization, the various approaches, graphs in summarization and
the various methods of automatic definition extraction. Chapter 3 discusses
the overall system design and its functional and technical architectures.
Chapter 4 deals with the extraction of definitions using word-class lattices
and language models such as Hidden Markov Models (HMM). It also explains how
formal concept analysis based definition extraction is improved via named entity
recognition and how the definitions are rated by applying the philosophical
principles. Chapter 5 proposes a methodology that uses semantic filters to
generate abstractive summaries of research papers and compares the
proposed approach with TextRank. Chapter 6 concludes the
thesis and outlines the scope for future work.