Human Language Technologies, Course 3: Sub-syntactic Processing. Dan Cristea, 17 October 2005.

Page 1

Human Language Technologies

Course 3: Sub-syntactic Processing

Dan Cristea, 17 October 2005

Page 2

Contents

• Segmentation problems

• Tagging problems

Page 3

Segmentation problems

• Identifying lexical units (word boundaries): tokenization

• Identifying non-recursive groups: chunking

• Identifying sentence and clause boundaries

Page 5

Tokenization problems

• Language dependence

• The case of agglutinative languages: splitting into morphemes

– Finnish: epäjärjestelmällistyttämättömyydellänsäkään (even without his lack of the capacity for organization)

• The case of compound words: binecrescut, bineînţeles (not bineânţeles), …

• The case of prefixes (according to the MDA dictionary): reanalizare, neortodoxă, but not renegociere, nor remorcare, remunerare, etc.

• The case of abbreviations: P.N.L., pt. o mai bună organizare…

Page 6

Tokenization in segmented languages

• Segmented languages: all modern languages that use a Latin-, Cyrillic- or Greek-based writing system

• Traditionally, tokenization rules are written using regular expressions (a minimal regex sketch follows the list of problems below)

• Problems:

– Abbreviations: solved by lists of abbreviations (pre-compiled or automatically extracted from a corpus), guessing rules

– Hyphenated words: “One word or two?”

– Numerical and special expressions (Email addresses, URLs, telephone numbers, etc.) are handled by specialized tokenizers (preprocessors)

– Apostrophe: (they’re => they + ‘re; don’t => do + n’t) solved by language-specific rules
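The regex approach above can be made concrete with a short Python sketch. The abbreviation list, the clitic rules and their ordering are toy assumptions for the illustration, not taken from any of the tools listed on the following slides.

```python
import re

# Tiny illustrative abbreviation list; a real tokenizer would use a much larger,
# language-specific list (pre-compiled or extracted from a corpus).
ABBREVIATIONS = ["Mr.", "Dr.", "Prof.", "etc.", "e.g.", "i.e."]

TOKEN_RE = re.compile("|".join([
    "|".join(re.escape(a) for a in ABBREVIATIONS),  # known abbreviations
    r"(?:[A-Za-z]\.){2,}",       # acronyms written with periods: U.S.A.
    r"\d+(?:[.,]\d+)*",          # numbers: 3.14, 1,000
    r"[A-Za-z]+(?=n't\b)",       # verb stem before the n't clitic (wo in won't)
    r"n't|'\w+",                 # clitics: n't, 're, 's, ...
    r"\w+(?:-\w+)*",             # ordinary, possibly hyphenated, words
    r"[^\w\s]",                  # any other single punctuation character
]))

def tokenize(text):
    """Return tokens left to right, each produced by the first matching rule."""
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Smith won't say they're in the U.S.A., e.g. in Boston."))
# ['Dr.', 'Smith', 'wo', "n't", 'say', 'they', "'re", 'in', 'the',
#  'U.S.A.', ',', 'e.g.', 'in', 'Boston', '.']
```

The order of the alternatives encodes the priorities discussed above: abbreviations and acronyms are recognized before ordinary words, and apostrophes are handled by language-specific clitic rules.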

Page 7

Systems: MtSeg

• http://aune.lpl.univ-aix.fr:16080/projects/multext/MtSeg/MSG1.overview.html

• Set of tools (processes) doing:

– splitting text at spaces
– isolating punctuation
– identifying abbreviations
– recombining compounds, etc.

• The rules determining how to treat punctuation, identify abbreviations, compounds, etc. are provided as data to the appropriate tools via a set of language-specific, user-defined resource files, and are thus entirely customizable.

Page 8

Using the segmenter

• There are three input formats: plain, normalized SGML, tabular. We will use the plain format.

• Consider “infile” containing the plain text (Ro)

Într-un cuvânt, acesta este un exemplu.

• The segmenter can be invoked in three ways, depending on the input format:

– plain text: mtseg -lang ro -input plain <infile >ofile

Page 9

Output format (black & red = full format; red = filtered format)

[CHUNK <DIV FROM="1">
(PAR <P FROM="1">
(SENT <S>
1\1 TOK Într
1\6 PROC -un
1\9 TOK cuvânt
1\15 PUNCT ,
1\17 TOK acesta
1\24 TOK este
1\29 TOK un
1\32 TOK exemplu
1\39 PTERM_P .
)SENT </S>
)PAR </P>
]CHUNK </DIV>

Page 10

Systems: Penn Treebank tokenizer

• http://www.cis.upenn.edu/~treebank/tokenization.html

• Rules (a rough sketch follows the examples below):

– most punctuation is split from adjoining words

– double quotes (") are changed to doubled single forward- and backward-quotes (`` and '')

– verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately

• children's --> children 's
• parents' --> parents '
• won't --> wo n't
• gonna --> gon na
• I'm --> I 'm
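The official Penn Treebank tokenizer is distributed as a sed script; the Python fragment below is only a rough approximation of the rules listed above (quote conversion, punctuation splitting, contraction and genitive splitting), and it assumes the input is already one sentence per call, as the original tool does.

```python
import re

def ptb_like_tokenize(sentence):
    """Rough approximation of the Penn Treebank rules above; treat this only
    as an illustration, not as the official tokenizer."""
    s = sentence
    # Directional quotes: an opening " becomes ``, any remaining " becomes ''.
    s = re.sub(r'(?:^|(?<=\s))"', "`` ", s)
    s = re.sub(r'"', " ''", s)
    # Split most punctuation from adjoining words.
    s = re.sub(r"([,;:?!()\[\]])", r" \1 ", s)
    s = re.sub(r"\.(\s|$)", r" . \1", s)          # only the sentence-final period
    # Verb contractions and the Anglo-Saxon genitive.
    s = re.sub(r"(?i)([a-z])n't\b", r"\1 n't", s)                             # won't -> wo n't
    s = re.sub(r"(?i)(?<=[a-z])('s|'re|'m|'ll|'ve|'d|')(?=\s|$)", r" \1", s)  # children's -> children 's
    return s.split()

print(ptb_like_tokenize('The children\'s dog won\'t bark, "he said".'))
# ['The', 'children', "'s", 'dog', 'wo', "n't", 'bark', ',',
#  '``', 'he', 'said', "''", '.']
```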

Page 11

Tokenization in non-segmented languages

• Non-segmented languages: languages written without spaces between words (e.g. Chinese, Japanese, Thai)

• Problems:

– tokens are written directly adjacent to each other

– almost all characters can be one-character words by themselves, but can also combine into multi-character words

• Solutions (a lexicon-based sketch follows this list):

– pre-existing lexico-grammatical knowledge

– machine learning employed to extract segmentation regularities from pre-segmented data

– statistical methods: character n-grams
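As a concrete, deliberately simplified instance of the lexicon-based solution, here is a greedy longest-match (maximum matching) segmenter. The lexicon is a toy example written with Latin letters so the behaviour is easy to follow.

```python
# Greedy longest-match (maximum matching) segmentation over a toy lexicon.
LEXICON = {"the", "them", "theme", "me", "men", "at", "work", "works", "shop", "workshop"}
MAX_LEN = max(len(w) for w in LEXICON)

def max_match(text):
    """Scan left to right, always taking the longest lexicon entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:                      # no lexicon entry matches: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("themenatworkshop"))   # ['theme', 'n', 'at', 'workshop']
```

The greedy result "theme" instead of "the" + "men" shows why purely lexicon-driven segmentation is ambiguous and why the machine-learning and n-gram methods above are used.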

Page 12

Tokenizers (1)

ALEMBIC
Author(s): M. Vilain, J. Aberdeen, D. Day, J. Burger, The MITRE Corporation
Purpose: Alembic is a multi-lingual text processing system. Among other tools, it incorporates tokenizers for English, Spanish, Japanese, Chinese, French, Thai.
Access: Free by contacting [email protected]

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment. One of the provided components, which can be adapted to various languages, performs tokenization. Supported languages: Unicode.
Access: Free at http://www.ellogon.org/

GATE (General Architecture for Text Engineering)
Author(s): NLP Group, University of Sheffield, UK
Access: Free but requires registration at http://gate.ac.uk/

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

Page 13

Tokenizers (2)

LT TTT
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: LT TTT is a text tokenization system and toolset which enables users to produce a swift and individually-tailored tokenisation of text.
Access: Free at http://www.ltg.ed.ac.uk/software/ttt/

MXTERMINATOR
Author(s): Adwait Ratnaparkhi
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

QTOKEN
Author(s): Oliver Mason, Birmingham University, UK
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtoken.html

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: SProUT provides tokenization for Unicode, Spanish, Japanese, German, French, English, Chinese.
Access: Not free. More information at http://sprout.dfki.de/

Page 14

Tokenizers (3)

THE QUIPU GROK LIBRARY
Author(s): Gann Bierner and Jason Baldridge, University of Edinburgh, UK
Access: Free at https://sourceforge.net/project/showfiles.php?group_id=4083

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.
Access: Not free. More information at http://www.lingsoft.fi/

Page 15

Sentence splitting

• Sentence splitting is the task of segmenting text into sentences

• In the majority of cases it is a simple task: . ? ! usually signal a sentence boundary

• However, in cases when a period denotes a decimal point or is a part of an abbreviation, it does not always signal a sentence break.

• The simplest algorithm is known as ‘period-space-capital letter’ (not very good performance); a sketch follows below. It can be improved with lists of abbreviations, a lexicon of frequent sentence-initial words, and/or machine learning techniques
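A small Python sketch of the ‘period-space-capital letter’ heuristic with a toy abbreviation list; the list and the example sentence are illustrative only.

```python
import re

# Toy abbreviation list; a real splitter would use a larger one plus a lexicon
# of frequent sentence-initial words or a trained classifier.
ABBREVIATIONS = {"dr.", "mr.", "prof.", "etc.", "e.g.", "i.e.", "vs."}

def split_sentences(text):
    """Break after . ? ! followed by whitespace and an upper-case letter,
    unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        last_word = text[start:m.end()].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue                      # e.g. "Dr. Smith" is not a boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 5 p.m. He paid 3.50 euros. Then he left!"))
# ['Dr. Smith arrived at 5 p.m.', 'He paid 3.50 euros.', 'Then he left!']
```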

Page 16

Tagging problems

• Part-of-speech tagging (POS tagging)

• Root recognition (stemming)

• Lemma recognition (lemmatization)

• Entity recognition (named entity recognition)

Page 17

Part of Speech (POS) Tagging

• POS Tagging is the process of assigning a part-of-speech or lexical class marker to each word in a corpus (Jurafsky and Martin)

Example (WORDS mapped to TAGS from the set {N, V, P, DET}):

The/DET couple/N spent/V the/DET honeymoon/N on/P a/DET yacht/N

Page 18

POS Tagger Prerequisites

• Lexicon of words

• For each word in the lexicon, information about all its possible tags according to a chosen tagset

• Different methods for choosing the correct tag for a word:

– Rule-based methods

– Statistical methods

– Transformation Based Learning (TBL) methods

Page 19

POS Tagger Prerequisites: Lexicon of words

• Classes of words

– Closed classes: a fixed set

• Prepositions: in, by, at, of, …

• Pronouns: I, you, he, her, them, …

• Particles: on, off, …

• Determiners: the, a, an, …

• Conjunctions: or, and, but, …

• Auxiliary verbs: can, may, should, …

• Numerals: one, two, three, …

– Open classes: new ones can be created all the time, so it is not possible for all words from these classes to appear in the lexicon

• Nouns

• Verbs

• Adjectives

• Adverbs

Page 20

POS Tagger Prerequisites: Tagsets

• To do POS tagging, need to choose a standard set of tags to work with

• A tagset is normally sophisticated and linguistically well grounded

• One could pick a very coarse tagset

– N, V, Adj, Adv.

• A more commonly used set is finer grained: the “UPenn Treebank tagset”, with 48 tags

• Even more fine-grained tagsets exist

Page 21

POS Tagger Prerequisites: Tagset example – the UPenn tagset

1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition/subord. conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol (mathematical or scientific)
25 TO to
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund/present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd ps. sing. present
32 VBZ Verb, 3rd ps. sing. present
33 WDT wh-determiner
34 WP wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB wh-adverb
37 # Pound sign
38 $ Dollar sign
39 . Sentence-final punctuation
40 , Comma
41 : Colon, semi-colon
42 ( Left bracket character
43 ) Right bracket character
44 " Straight double quote
45 ` Left open single quote
46 `` Left open double quote
47 ' Right close single quote
48 '' Right close double quote

Page 22

POS Tagging: Rule-based methods

• Start with a dictionary

• Assign all possible tags to words from the dictionary

• Write rules by hand to selectively remove tags

• Leaving the correct tag for each word (a toy sketch follows below)
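A toy illustration of this scheme; the lexicon and the single elimination rule are invented for the example, and with so small a rule set some ambiguity remains, as is typical of rule-based taggers.

```python
# Step 1: assign all dictionary tags; step 2: remove tags with hand-written rules.
LEXICON = {
    "the": {"DET"},
    "can": {"MD", "NN", "VB"},
    "fish": {"NN", "VB"},
    "swim": {"VB", "NN"},
}

def rule_based_tag(words):
    # Every word starts with all its possible tags (unknown words default to NN).
    candidates = [set(LEXICON.get(w, {"NN"})) for w in words]
    # Elimination rules selectively remove tags.
    for i, tags in enumerate(candidates):
        prev = candidates[i - 1] if i > 0 else set()
        if prev == {"DET"}:
            tags.discard("VB")     # after a determiner, drop verb readings
            tags.discard("MD")
    return [(w, sorted(t)) for w, t in zip(words, candidates)]

print(rule_based_tag(["the", "can", "can", "swim"]))
# [('the', ['DET']), ('can', ['NN']), ('can', ['MD', 'NN', 'VB']), ('swim', ['NN', 'VB'])]
```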

Page 23

POS Tagging: Statistical methods (1)

The Most Frequent Tag Algorithm

• Training

– Take a tagged corpus

– Create a dictionary containing every word in the corpus together with all its possible tags

– Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities

• Tagging

– Given a new sentence, for each word, pick the most frequent tag for that word from the corpus (a minimal sketch follows below)
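A minimal Python sketch of this algorithm; the tiny tagged corpus is invented for the illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_corpus):
    """tagged_corpus: list of sentences, each a list of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep, for every word, the tag with the highest P(tag|word).
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(sentence, model, default="NN"):
    # Unknown words fall back to a default open-class tag.
    return [(w, model.get(w, default)) for w in sentence]

corpus = [[("the", "DT"), ("queue", "NN"), ("moves", "VBZ")],
          [("they", "PRP"), ("queue", "VB"), ("daily", "RB")],
          [("the", "DT"), ("queue", "NN"), ("is", "VBZ"), ("long", "JJ")]]
model = train_most_frequent(corpus)
print(tag(["the", "queue", "shrinks"], model))
# [('the', 'DT'), ('queue', 'NN'), ('shrinks', 'NN')]
```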

Page 24

POS Tagging: Statistical methods (2)

Bigram HMM Tagger

• Training

– Create a dictionary containing every word in the corpus together with all its possible tags

– Compute the probability of each tag generating a given word, and the probability that each tag is preceded by a specific tag (bigram HMM tagger => the probability depends only on the previous tag)

• Tagging

– Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained after training

– HMM taggers choose the tag sequence that maximizes the formula P(word|tag) * P(tag|previous tag) (a greedy sketch follows below)
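A greedy, word-by-word sketch of this scoring; a real HMM tagger maximises the product over the whole sentence with the Viterbi algorithm. Only the four probabilities quoted on the next slide are taken from the example; every other number is made up for the illustration.

```python
# Greedy left-to-right application of P(word|tag) * P(tag|previous tag).
P_WORD_GIVEN_TAG = {("queue", "NN"): 0.00041, ("queue", "VB"): 0.00003,
                    ("to", "TO"): 0.95, ("the", "DT"): 0.60}       # last two invented
P_TAG_GIVEN_PREV = {("TO", "<s>"): 0.01, ("DT", "<s>"): 0.20,       # invented
                    ("NN", "TO"): 0.021, ("VB", "TO"): 0.34,        # from the next slide
                    ("NN", "DT"): 0.49, ("VB", "DT"): 0.0001}       # invented
POSSIBLE_TAGS = {"queue": ["NN", "VB"], "to": ["TO"], "the": ["DT"]}

def greedy_bigram_tag(words):
    tags, prev = [], "<s>"                      # <s> marks the sentence start
    for w in words:
        best = max(POSSIBLE_TAGS[w],
                   key=lambda t: P_WORD_GIVEN_TAG.get((w, t), 0.0)
                                 * P_TAG_GIVEN_PREV.get((t, prev), 0.0))
        tags.append(best)
        prev = best
    return list(zip(words, tags))

print(greedy_bigram_tag(["to", "queue"]))   # [('to', 'TO'), ('queue', 'VB')]
print(greedy_bigram_tag(["the", "queue"]))  # [('the', 'DT'), ('queue', 'NN')]
```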

Page 25

Bigram HMM Tagging: Example

People/NNS are/VBP expected/VBN to/TO queue/VB at/IN the/DT registry/NN

The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN

• to/TO queue/???
• the/DT queue/???

• t_k = argmax_k P(t_k|t_{k-1}) * P(w_i|t_k), where i is the position of the word in the sentence and k ranges over the possible tags for the word “queue”

• How do we compute P(t_k|t_{k-1})?

– count(t_{k-1}, t_k) / count(t_{k-1})

• How do we compute P(w_i|t_k)?

– count(w_i, t_k) / count(t_k)

• max[P(VB|TO)*P(queue|VB) , P(NN|TO)*P(queue|NN)]

• From the corpus:

– P(NN|TO) = 0.021, P(queue|NN) = 0.00041 => product ≈ 0.0000086
– P(VB|TO) = 0.34, P(queue|VB) = 0.00003 => product ≈ 0.00001

so queue after to/TO is tagged VB.

Page 26

POS Tagging: Transformation-Based Tagging (1)

• Combination of rule-based and stochastic tagging methodologies

– Like rule-based methods because rule templates are used to learn transformations

– Like the stochastic approach because machine learning is used, with a tagged corpus as input

• Input:

– tagged corpus

– lexicon (with all possible tags for each word)

Page 27

POS Tagging: Transformation-Based Tagging (2)

• Basic idea:

– Set the most probable tag for each word as a start value

– Change tags according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”, applied in a specific order

• Training is done on a tagged corpus:

1. Write a set of rule templates
2. Among the rules instantiated from the templates, find the one with the highest score
3. Repeat from step 2 until the best score falls below a threshold
4. Keep the ordered set of rules

• Rules make errors that are corrected by later rules (a schematic training loop follows below)
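A compact, runnable sketch of this training loop, restricted to the single template “change tag FROM to TO when the previous tag is PREV”; the tiny corpus reuses the race example from the next slide.

```python
from itertools import product

def score_rule(rule, tags, gold):
    """Net number of errors the rule fixes: (made correct) - (made wrong)."""
    frm, to, prev = rule
    gain = 0
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            if gold[i] == to:
                gain += 1          # an error gets fixed
            elif gold[i] == frm:
                gain -= 1          # a correct tag gets destroyed
    return gain

def apply_rule(rule, tags):
    frm, to, prev = rule
    return [to if (i > 0 and t == frm and tags[i - 1] == prev) else t
            for i, t in enumerate(tags)]

def tbl_train(tags, gold, tagset, min_score=1):
    """Greedy TBL loop over the single template 'change FROM to TO after PREV'."""
    tags, learned = list(tags), []
    while True:
        candidates = [(score_rule((f, t, p), tags, gold), (f, t, p))
                      for f, t, p in product(tagset, tagset, tagset) if f != t]
        best_score, best_rule = max(candidates)
        if best_score < min_score:
            return learned         # ordered rules, applied in this order at tagging time
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)

# Initial state: every word carries its most frequent tag, so "race" got NN.
words   = ["is", "expected", "to", "race", "tomorrow"]
initial = ["VBZ", "VBN", "TO", "NN", "NN"]
gold    = ["VBZ", "VBN", "TO", "VB", "NN"]
print(tbl_train(initial, gold, {"VBZ", "VBN", "TO", "NN", "VB"}))
# [('NN', 'VB', 'TO')]  i.e. "change NN to VB when the previous tag is TO"
```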

Page 28

Transformation-Based Tagging: Example

• Tagger labels every word with its most-likely tag

– For example: race has the following probabilities in the Brown corpus:

• P(NN|race) = 0.98
• P(VB|race) = 0.02

• Transformation rules make changes to tags

– “Change NN to VB when previous tag is TO”

… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN

Page 29

POS Taggers (1)

ACOPOST
Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah
Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.
Platforms: All POSIX (Linux/BSD/UNIX-like OSes)
Access: Free at http://sourceforge.net/projects/acopost/

BRILL’S TAGGER
Author(s): Eric Brill
Purpose: Transformation Based Learning POS tagger
Access: Free at http://www.cs.jhu.edu/~brill

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

Page 30

POS Taggers (2)

LINGSOFT
Author(s): LINGSOFT, Finland
Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish, English, German, Norwegian, Swedish.
Access: Not free. Demos at http://www.lingsoft.fi/demos.html

LT POS (LT TTT)
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: The LT POS part-of-speech tagger uses a Hidden Markov Model disambiguation strategy. It is currently trained only for English.
Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html

MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/

Page 31

POS Taggers (3)

MXPOST
Author(s): Adwait Ratnaparkhi
Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version includes a Wall St. Journal tagging model for English, but it can also be trained for different languages.
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

MEMORY BASED TAGGER
Author(s): ILK - Tilburg University, CNTS - University of Antwerp
Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.
Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt

µ-TBL
Author(s): Torbjörn Lager
Purpose: The µ-TBL system is a powerful environment in which to experiment with transformation-based learning.
Platforms: Windows
Access: Free at http://www.ling.gu.se/~lager/mutbl.html

Page 32

POS Taggers (4)

QTAG
Author(s): Oliver Mason, Birmingham University, UK
Purpose: QTag is a probabilistic part-of-speech tagger. Resource files for English and German can be downloaded together with the tool.
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html

STANFORD POS TAGGER
Author(s): Kristina Toutanova, Stanford University, USA
Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The downloadable package includes components for command-line invocation and a Java API, both for training and for running a trained tagger.
Platforms: Platform independent
Access: Free at http://nlp.stanford.edu/software/tagger.shtml

SVM TOOL
Author(s): TALP Research Center, University of Catalunya, Spain
Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight implementation of Vapnik's Support Vector Machine, by Thorsten Joachims, has been used to train the models for Catalan, English and Spanish.
Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/

Page 33

POS Taggers (5)

TnT
Author(s): Thorsten Brants, Saarland University, Germany
Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second order Markov models. TnT comes with two language models, one for German and one for English.
Platforms: Platform independent
Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used to tag German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

Page 34

POS Taggers (6)

Xerox XRCE MLTT Part Of Speech Taggers
Author(s): Xerox Research Centre Europe
Purpose: Xerox has developed morphological analysers and part-of-speech disambiguators for various languages including Dutch, English, French, German, Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian, Polish and Russian.
Access: Not free. Demos at http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Page 35

Stemming

• Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form – not necessarily the base form – which can then be used in the retrieval process.

• Frequently, the performance of an IR system will be improved if term groups such as CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.

Page 36

The Porter Stemmer

• A conflation stemmer developed by Martin Porter at the University of Cambridge in 1980

• Idea: the English suffixes (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes

• Can be adapted to other languages (needs a list of suffixes and context-sensitive rules); a toy suffix-stripping sketch follows below
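A toy suffix-stripping function in the spirit of the CONNECT example on the previous slide; it is not the Porter algorithm itself, which applies its rules in several ordered, context-sensitive steps.

```python
# Toy suffix stripper illustrating rule-based conflation; NOT the real Porter
# stemmer. The suffix list and minimum stem length are invented for the example.
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]   # tried in this order

def toy_stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]:
    print(w, "->", toy_stem(w))
# all five forms conflate to the stem "connect"
```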

Page 37

Stemmers (1)

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

Page 38

Stemmers (2)

LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/

SNOWBALL
Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English, Russian, Romance languages (French, Spanish, Portuguese and Italian), German, Dutch, Swedish, Norwegian, Danish and Finnish.
Access: Free at http://www.snowball.tartarus.org/

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese.
Access: Not free. More information at http://sprout.dfki.de/

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.
Access: Not free. More information at http://www.lingsoft.fi/

Page 39

Lemmatization

• The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME

• Dictionary based

– Input: token + pos

– Output: lemma

• Note: needs POS information

• Example:

– left+v -> leave, left+a -> left

• It is the same as looking for a transformation to apply to a word to get its normalized form (word endings: which suffix should be removed and/or added to get the normalized form) => lemmatization can be modeled as a machine learning problem (the sketch below illustrates the simpler dictionary-based variant)
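A minimal sketch of the dictionary-based variant; the dictionary entries are illustrative, with the left example taken from above.

```python
# Dictionary-based lemmatization: lemma = dictionary[(token, POS)].
LEMMA_DICT = {
    ("left", "v"): "leave",   # the example from the slide
    ("left", "a"): "left",
    ("came", "v"): "come",
    ("coming", "v"): "come",
    ("comes", "v"): "come",
}

def lemmatize(token, pos):
    """Return the lemma for a (token, POS) pair; fall back to the token itself."""
    return LEMMA_DICT.get((token.lower(), pos), token.lower())

print(lemmatize("left", "v"))   # leave
print(lemmatize("left", "a"))   # left
print(lemmatize("came", "v"))   # come
```

Note that, as stated above, the POS tag is needed to pick the right lemma for ambiguous forms such as left.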

Page 40

Lemmatizers (1)

CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Demos at http://www.conexor.fi/

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

MBLEM
Author(s): ILK Research Group, Tilburg University
Purpose: MBLEM is a lemmatizer for English, German, and Dutch.
Access: Demo at http://ilk.uvt.nl/mblem/

Page 41

Lemmatizers (2)

SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English.
Access: Free at http://www.euroling.se/produkter/swesum.html

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used for German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon is available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.
Access: Not free. More information at http://www.lingsoft.fi/

Page 42

Shallow Parsing (chunking)

• Partition the input into a sequence of non-overlapping units, or chunks, each a sequence of words labelled with a syntactic category and possibly a marking to indicate which word is the head of the chunk

• How?

– A set of regular expressions over POS labels (see the sketch below)

– Training the chunker on manually marked-up text
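A small sketch of the first option: a regular expression over POS labels that marks non-recursive noun phrases (optional determiner, adjectives, nouns). The pattern and the offset bookkeeping are simplified for the illustration; existing chunkers (for instance NLTK's RegexpParser) implement the same idea more robustly.

```python
import re

# Simple non-recursive NP: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(DT\s)?(JJ\s)*(NN[SP]*\s?)+")

def np_chunks(tagged):
    """tagged: list of (word, tag) pairs. Returns the word spans of NP chunks."""
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        # Map character offsets in the tag string back to token indices.
        start = tags[: m.start()].count(" ")
        end = tags[: m.end()].count(" ")
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
            ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
            ("dog", "NN")]
print(np_chunks(sentence))
# [['The', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]
```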

Page 43

Noun Phrase (NP) Chunkers

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Page 44

Named Entity Recognition

• Identification of proper names in texts, and their classification into a set of predefined categories of interest:

– entities: organizations, persons, locations
– temporal expressions: time, date
– quantities: monetary values, percentages, numbers

• Two kinds of approaches:

Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• small amount of training data
• very time consuming
• some changes may be hard to accommodate

Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• require large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus

Page 45

Named Entity Recognition: Knowledge engineering approach

• identification of named entities in two steps:

– recognition patterns expressed as WFSA (Weighted Finite-State Automata) are used to identify phrases containing potential candidates for named entities (longest match strategy)

– additional constraints (depending on the type of candidate) are used for validating the candidates

• use of an on-line base lexicon for geographical names and first names (a simplified sketch of the two-step scheme follows below)
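A much-simplified sketch of this two-step scheme: ordinary regular expressions stand in for the weighted finite-state patterns, and the validation step only checks a company-suffix pattern or a toy first-name lexicon. All names and lists are invented for the illustration.

```python
import re

FIRST_NAMES = {"John", "Claire", "Andrei"}          # toy base lexicon
COMPANY_SUFFIX = r"(Ltd|Corp|Inc|GmbH)\.?"

CANDIDATE = re.compile(
    r"(?:[A-Z][a-z]+ )+" + COMPANY_SUFFIX            # e.g. "Food Manufacturing Ltd"
    + r"|(?:[A-Z][a-z]+ [A-Z][a-z]+)"                # e.g. "John Smith"
)

def recognize(text):
    entities = []
    for m in CANDIDATE.finditer(text):               # step 1: longest-match candidates
        span = m.group(0)
        # step 2: validate / classify the candidate with additional constraints
        if re.search(COMPANY_SUFFIX + r"$", span):
            entities.append((span, "ORGANIZATION"))
        elif span.split()[0] in FIRST_NAMES:
            entities.append((span, "PERSON"))
    return entities

print(recognize("Mars Ltd is a subsidiary of Food Manufacturing Ltd, said John Smith."))
# [('Mars Ltd', 'ORGANIZATION'), ('Food Manufacturing Ltd', 'ORGANIZATION'),
#  ('John Smith', 'PERSON')]
```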

Page 46

Named Entity Recognition: Problems

• Variation of NEs, e.g. John Smith, Mr. Smith, John

• Since named entities may appear without designators (companies, persons), a dynamic lexicon for storing such named entities is used

Example:

“Mars Ltd is a wholly-owned subsidiary of Food Manufacturing Ltd, a non-trading company registered in England. Mars is controlled by members of the Mars family.”

• Resolution of type ambiguity using the dynamic lexicon:

If an expression can be either a person name or a company name (Martin Marietta Corp.), then use the type of the last entry inserted into the dynamic lexicon to make the decision.

• Issues of style, structure, domain, genre etc.

• Punctuation, spelling, spacing, formatting

Page 47

Named Entity Recognizers (1)

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Available for Unicode.
Access: Free at http://www.ellogon.org/

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

INSIGHT DISCOVERER EXTRACTOR
Author(s): TEMIS
Purpose: Supported languages: Spanish, Russian, Portuguese, Polish, Italian, Hungarian, Greek, German, French, English, Dutch, Czech.
Access: Not free. More information at http://www.temis-group.com/

Page 48

Named Entity Recognizers (2)

LINGPIPE
Author(s): Bob Carpenter, Breck Baldwin, Alias-i
Purpose: Supported languages: Unicode, Spanish, German, French, English, Dutch.
Access: Free at http://www.alias-i.com/lingpipe/

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Page 49

Beyond named entities

Beyond Named Entity Recognition

Semantic labelling for NLP tasks

Workshop in association with the 4th International Conference on Language Resources and Evaluation (LREC 2004)

Page 50

Named Entity Recognition without Gazetteers (1999)  

Andrei Mikheev, Marc Moens, Claire Grover

http://citeseer.ist.psu.edu/284697.html

Page 51

Other types of processing

• Syllabification
• Frequency statistics

– of words
– of lemmas
– of vowels
– of syllables
– of word groups (collocations)
– Zipf's law

• Language identification
• Diacritics insertion