1.1 TEXT SUMMARIZATION
The huge number of documents available in digital media makes it difficult to obtain the necessary information related to the needs of a researcher or a learner. As users do not have enough time to read documents in detail, shorter versions of the documents are of interest. Text summarization generates a short text for a single document or multiple documents and serves this purpose. By using the summary produced, a user can decide whether a document is related to his/her needs without reading the whole document. Summaries are very useful and have become indispensable in day-to-day life. Nowadays, people buy products after reading summarized reviews. Readers decide which newspaper article to read by scanning the article titles, which summarize what is explained below them. Before reading a book, readers skim through the summary to find out whether the book is worth reading. The same applies to online text resources, and specifically to research papers, because of their high availability. The research community or the learning community can find out what has been dealt with so far in a domain and decide on the relevance of a document or the scope of research. Hence, summaries can take various forms and can sum up various kinds of information.
Text summarization is one of the core tasks of computational linguistics and has become a significant research topic. The summary should be informative and as short as possible; that is, it should cover the important concepts of the original document or documents and should not include unnecessary details. Although much research has been carried out on summary generation, identifying the key sentences and organizing them into a coherent summary remains an open problem with wide scope for research.
Although there are many types of text documents, almost all are written to explain a concept, an idea or a topic. The core idea is the primary subject, and the whole document is divided into subsections in line with that principal point. The purpose of a summary is to convey the overall idea being dealt with, covering the significant subject matter. The content to be summarized can be any information representable in text. Manual summary writing requires a good understanding of the topic discussed in the document. Human summarizers apply their domain knowledge and comprehension skills to produce a summary. Even without domain knowledge, humans can produce a summary by applying their common sense and experience. Making computers replicate human performance, however, is very difficult, because computers cannot understand and interpret natural languages as humans do.
Though a good deal of work has been done to make computers generate summaries comparable to manual summaries, such systems are restricted to specific domains and require a lot of training. Hence, most automatic summarization programs analyze a text statistically and linguistically to determine the important sentences, and then generate a summary text from these sentences. The main ideas of most documents can be described with as little as 20 percent of the original text (Goldstein 1999). Automatic summarization generates a brief, compressed block of text containing the essential information content of the input document. Text summarization systems can be classified based on various factors as follows:
1. Number of input documents
· Single document - Generates a summary from one document.
· Multiple documents - Generates a single summary from two or more related documents.
2. Methodology applied and nature of output
· Extraction - Extracts important text units such as phrases, sentences and paragraphs from the input document. The summary contains sentences exactly as they appear in the input document.
· Abstraction - The important textual units are identified and rewritten without loss of information. The summary contains a new set of sentences.
3. Summary type
· Generic - Provides a general overview of the text document. The main focus is on capturing the main topics of the input document and conveying what the document is about.
· Query based - Provides a summary specific to the query term.
4. Purpose
· Indicative or Descriptive - Provides an idea of what the input document is about without informative content. It is a description of the input document.
· Informative - Provides the main ideas or topics of the input document.
· Critical - Provides the writer's opinion, evaluation or review of the input document.
5. Language used
· Monolingual - Provides a summary in the same language as that of the input document.
· Cross-lingual - Provides a summary in a language different from that of the input document.
· Multilingual - The input documents are in multiple languages and the summary is generated in the target language.
6. Technique used
· Supervised - Uses annotated data for training.
· Unsupervised - Does not use annotated data, but uses linguistic and statistical information obtained from the document itself.
Graph-based summarization is an unsupervised model in which a document or a set of documents is represented as a text similarity graph, constructed by taking text units such as words, phrases or sentences as vertices and their interconnecting relationships as edges.
1.2 DEFINITION EXTRACTION
Definitions in text documents are used to explain a term, a concept or an idea.
The Longman dictionary defines the word 'definition' as 'A phrase or sentence that says exactly what a word, phrase, or idea means'; that is, a definition describes the meaning of a term. In some scenarios it is not necessary to give the exact meaning, and it is sufficient to provide a description of how to employ the word: 'A phrase or sentence that makes clear what a word, phrase, or idea means'. The phrases that fit the Longman dictionary definition are called 'narrow definitions' and the others are called 'broad definitions'. Definition extraction is the task of automatically identifying the definitions present in documents. It is useful in the automatic creation of glossaries for building dictionaries, question answering systems, ontology learning, relation extraction and eLearning. In question answering, definition extraction deals with "what is" questions (Saggion 2004; Hang Cui et al. 2007). In eLearning, definitions are used to help students assimilate knowledge (Westerhout & Monachesi 2007b).
In this work, the focus is on the use of definition extraction in learning, where definitions can help learners conceptualise new terms and understand new concepts encountered in learning material. The definition extraction system has been used to extract the definition sentences, which serve as candidate sentences for summary generation.
Many authors have classified definitions from various perspectives, but none of the classifications provides a complete listing. This thesis adopts the classification given by Westerhout & Monachesi (2007a).
Definitions are broadly classified into real and nominal definitions.
· Real definitions - tell what the definiendum denotes and its nature.
· Nominal definitions - tell about the use of the definiendum.
(i) PURPOSE-BASED DEFINITIONS
These definitions are based on what the definiendum exists for, is used for or is done for, or on its rationale. From this perspective, definitions are divided into two categories: (a) lexical definitions and (b) stipulative definitions.
a. Lexical definition (dictionary or reportive definition) - conveys the meaning of the definiendum as commonly used by people, i.e. the meaning found in a dictionary.
b. Stipulative definition (working or operational definition) - assigns a meaning to a new term or gives a different meaning to an existing term.
(ii) METHOD-BASED DEFINITIONS
These definitions are based on the way something is done, the technique used or the procedure applied.
Accordingly, the definitions are classified into the following categories:
a. Intensional definition - gives the meaning of a term by a defined set of properties. (e.g.) An even number is any number that is divisible by 2.
b. Extensional definition - defines a term by listing all the members that fit the definiendum. (e.g.) Non-renewable energy is defined by listing the non-renewable energy sources such as coal, petroleum, natural gas etc.
c. Synonymous definition - the definiendum is replaced by a term or phrase that conveys the same or approximately the same idea. (e.g.) A toddler is a preschooler. The subtypes of synonymous definition are:
• Derivative - the definiendum is defined with reference to its origin. (e.g.) lingua (Latin: language)
• Translational - explains an unfamiliar word using a familiar word having the same or nearly the same meaning. (e.g.) amica means friend in Latin.
• Analogic - defines a term by comparing it with a similar entity for better understanding.
d. Ostensive definition - provides the meaning of a term by pointing out examples. (e.g.) A circle can be explained by showing objects such as a ball, rings, the full moon, a pizza etc.
• Exemplifying - examples are used to illustrate the meaning of a word or term. (e.g.) Birds means hens, ducks, crows and doves, and not bats, bees or aeroplanes.
e. Analytical definition (genus-differentia definition) - provides an analysis of the definiendum by identifying the broad category to which it belongs and then mentioning its distinguishing properties. (e.g.) A sphere is a round solid figure with every point on its surface equidistant from its centre.
• Classificatory - conveys only the class to which the definiendum belongs. (e.g.) A sphere is a round solid figure.
• Operational - defines a term by describing its function. (e.g.) Graphviz is an open source graph visualization software.
• Anatomic - specifies the parts of the definiendum for clarification. (e.g.) A Central Processing Unit consists of an Arithmetic Logic Unit and a control unit.
• Qualitative - specifies the traits of the definiendum. (e.g.) A peninsula is a piece of land surrounded by water while being connected to a mainland from which it extends.
• Quantitative - describes the definiendum by mentioning its size, height, weight or other measurable traits. (e.g.) Low birth weight describes babies who are born weighing less than 2,500 grams.
f. Enumerative definition - gives the complete listing of all entities that fit the definiendum. (e.g.) Primary colours are red, yellow and blue.
g. Recursive definition - the definiendum or a part of it appears in the definiens. (e.g.) A singly linked list is a list of elements each containing a single pointer pointing to the next element.
h. Precising definition - describes the definiendum with the notion of reducing its vagueness by enforcing certain criteria on its lexical meaning. (e.g.) In the context of the public distribution system, poor means a family whose annual income is less than Rs. 15000.
i. Persuasive definition - a biased description of the definiendum in favor of a particular argument or point of view. It attaches an emotional meaning to the definiendum. (e.g.) "abortion" is the murder of an innocent person during pregnancy.
j. Contextual definition - describes the definiendum by placing it in a context or by defining a larger expression containing the definiendum. (e.g.) A square has two diagonals and each of them divides the square into two right-angled isosceles triangles.
k. Reference definition - a definition in which the author refers to another source of information. (e.g.) According to SAS, big data "describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis" (www.sas.com/en_us/insights/big-data).
l. Relational definition - describes the definiendum by considering the relations among objects.
• Antonymic - the definiendum is described by contrasting words. (e.g.) Foe is the opposite of friend.
• Meronymic - explains the definiendum by situating it between two other terms, which refer to anything in between the synonym and antonym of the word to be defined. (e.g.) Mediocre means the quality is between good and bad.
(iii) PATTERN-BASED DEFINITIONS
These definitions are based on the assumption of Walter and Pinkal (2006) that every definition has three parts: a definiendum, a definiens and a connector. The definiendum, often the subject, is the term or concept to be defined.
The definiens provides the meaning of the definiendum. The definiendum and the definiens are linked via the connector, which can be a verbal phrase or a punctuation character. The connector signifies the relation between the definiendum and the definiens. The classification of definitions based on patterns is as follows:
a. 'is a' definition - definitions in which a form of the verb 'to be' is used as the connector. (e.g.) Android is an operating system for mobile phones and tablets.
b. Verb definition - definitions in which a verb or verbal phrase other than a form of 'to be' (such as to describe, to explain, to mean, to consist of, is called etc.) is used as the connector between the definiendum and the definiens. (e.g.) The process of reducing inflected or derived words to their word stem, base or root form is called stemming.
c. Punctuation definition - definitions that contain punctuation patterns as connectors. Four types of characters can be used as connectors: colon, bracket, comma and dash.
· Colon definition - a colon is used to connect the definiendum and the definiens. (e.g.) HTML: The standard markup language for creating Web pages.
· Bracket definition - brackets are used around either the definiendum or the definiens. (e.g.) "NLTK" (a Python package for natural language processing)
· Comma definition - a comma is used as a connector between the definiendum and the definiens. (e.g.) Graphviz, an open source graph visualization software.
· Dash definition - a dash is used to connect the definiendum with the definiens. (e.g.) Platelets – tiny blood cells that stop bleeding.
d. Pronoun definition - definitions in which a pronoun or a phrase containing a pronoun is used to refer to a definiendum or definiens. (e.g.) A solution to the fragmentation problem is paging. It is a memory management mechanism that allows the physical address space of a process to be non-contiguous.
e. Layout definition - the layout of the sentence is the only indication that it might be a definition. The position and formatting styles serve as indicators of definitions. (e.g.) M2M (Machine to machine) wireless data communication between machines
1.3 MOTIVATION
Research and development in automatic text summarization has been growing in importance with the rapid growth of online information services.
The huge number of documents available in digital media makes it difficult to obtain the necessary information related to the needs of a researcher or a learner. The most common approach is to skim the abstract and the conclusion of the article. Many documents, however, do not contain an abstract. Even if an abstract is present, it gives only an overview of what the document is about and does not necessarily list the most important ideas or points in the document. Sometimes authors use abstracts to promote their articles by making claims that are not supported in the article. The summaries given by the authors therefore do not necessarily reflect the source's content and can be biased. The task of finding relevant documents can be made easier if summaries can be produced automatically. Hence, summaries of documents and papers help tremendously while conducting research. To address this issue, automatic summarization systems can be used to generate summaries from research articles.
Definitions capture the correct meaning of terms and provide the researcher or learner with enough information to understand a term (Westerhout 2009a). Not only do readers learn the most important terms, they can also decide whether the document is relevant enough to read further. A method to automatically retrieve definitions from texts is therefore of great value to the research and learning communities. The challenge is to develop a method that is able to distinguish definitions from non-definitions automatically. A large amount of research has been done on this issue, and it still needs improvement.
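To give a feel for what distinguishing definitions from non-definitions involves, the sketch below matches a sentence against a few connector patterns of the kind listed in Section 1.2 ('is a', verb, colon and dash). It is only an illustration and not the extraction method developed in this thesis; the regular expressions, the names PATTERNS and find_definition, and the restriction to four pattern types are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Illustrative connector patterns (assumptions, not the rules used in this thesis).
# Order matters: verb connectors are tried before the plainer 'is a' form.
PATTERNS = [
    ("verb", re.compile(r"^(.+?)\s+(?:is called|are called|means|refers to|consists of)\s+(.+)$")),
    ("is_a", re.compile(r"^(.+?)\s+(?:is|are)\s+((?:an?|the)\s+.+)$")),
    ("colon", re.compile(r"^([^:]+):\s+(.+)$")),
    ("dash", re.compile(r"^(.+?)\s+[\u2013-]\s+(.+)$")),
]


def find_definition(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (pattern name, left part, right part), or None if nothing matches."""
    text = sentence.strip().rstrip(".")
    for name, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return name, match.group(1), match.group(2)
    return None


examples = [
    "Android is an operating system for mobile phones and tablets.",
    "The process of reducing inflected or derived words to their word stem, "
    "base or root form is called stemming.",
    "HTML: The standard markup language for creating Web pages.",
    "Platelets \u2013 tiny blood cells that stop bleeding.",
    "The experiments were run on a laptop.",
]
for example in examples:
    print(find_definition(example))
```

Surface patterns of this kind over-generate: a sentence such as "The weather is a concern" also matches the 'is a' pattern although it defines nothing, which is one reason why pattern matching alone does not settle the problem.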
Named Entity Recognition (NER) is used to find and classify expressions of special meaning in texts written in natural language. It can be used as a pre-processing tool for Natural Language Processing (NLP) tasks such as Machine Translation, Question Answering, Text Summarization, Language Modelling or Sentiment Analysis. The performance of an automatic definition extraction system can be improved by using named entities. Definition extraction systems capture the important terms and concepts that are being discussed in a research article. The extracted terms and their definitions can be used to identify the candidate sentences for summarization, improving the performance and output of the automatic summarization system.
1.4 OBJECTIVES
1. To design a methodology for automatic focused summarization from research articles.
2. To design, implement and evaluate a system for automatic definition extraction.
3. To model a summarization system to generate research article summaries via definition extraction.
4. To design a framework for applying philosophical principles to generate multi-document summaries via definition extraction.
1.5 PROBLEM STATEMENT
This thesis proposes the design and implementation of a summary generation system for research articles based on definition extraction. Various approaches have been applied to implement and improve definition extraction. This thesis discusses a mechanism to extract definitions, an algorithm for rating the definitions based on Nannool, and a graph-based summary generation system that generates summary graphs.
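As a rough illustration of the graph-based representation mentioned above (text units as vertices, similarity relationships as edges), the following minimal sketch builds a sentence similarity graph. It is not the system proposed in this thesis: the Jaccard word-overlap measure, the threshold value, the weighted-degree ranking (a simpler stand-in for TextRank-style ranking) and the function names tokens, build_graph and rank are all assumptions made for illustration.

```python
import re
from itertools import combinations


def tokens(sentence):
    """Return the set of lowercased words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_graph(sentences, threshold=0.1):
    """Return {i: {j: weight}} with an edge wherever Jaccard overlap >= threshold."""
    bags = [tokens(s) for s in sentences]
    graph = {i: {} for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        union = bags[i] | bags[j]
        weight = len(bags[i] & bags[j]) / len(union) if union else 0.0
        if weight >= threshold:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def rank(graph):
    """Order vertices by weighted degree, highest first."""
    return sorted(graph, key=lambda v: sum(graph[v].values()), reverse=True)


sentences = [
    "Paging is a memory management mechanism.",
    "Paging avoids fragmentation of memory.",
    "The weather was pleasant yesterday.",
]
graph = build_graph(sentences)
for vertex in rank(graph):
    print(vertex, sentences[vertex], graph[vertex])
```

In this toy input the first two sentences share the words "paging" and "memory" and are therefore connected, while the third sentence stays isolated and ranks last.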
1.6 CONTRIBUTIONS
1. Automatic focused summary generation from research articles by applying semantic filters.
2. A definition extraction system has been developed using word-class lattices.
3. Formal concept analysis based definition extraction has been improved by applying Named Entity Recognition.
4. Language models and classification algorithms have been applied to improve the performance of the definition extraction system.
5. A framework has been proposed to improve the performance of the definition extraction system by applying the philosophical principles from Nannool.
6. Definition extraction based multi-document summarization has been carried out by applying the philosophical principles.
1.7 ORGANISATION OF THE THESIS
The thesis is organized as follows: Chapter 1 provides an introduction to graph-based summarization, definition extraction and the challenges faced by existing systems. Chapter 2 presents a survey of the relevant published work on summarization, the various approaches, graphs in summarization and the various methods of automatic definition extraction.
Chapter 3 discusses the overall system design and its functional and technical architectures. Chapter 4 deals with the extraction of definitions using word-class lattices and language models such as Hidden Markov Models (HMM). It also explains how formal concept analysis based definition extraction is improved via named entity recognition and how the definitions are rated by applying the philosophical principles. Chapter 5 proposes a methodology that uses semantic filters to generate abstractive summaries of research papers and compares the proposed approach with TextRank. Chapter 6 gives the conclusion of the thesis, along with the scope for future work.