Grundläggande Textanalys: Assignment 2
This lab is about part-of-speech tagging and Markov models. The practical part consists of training and evaluating taggers with the HunPos system. The theoretical part consists of questions about Markov models and related n-gram models. There is also an additional task for those aiming for the grade VG.
1. Part-of-speech tagging with HunPos
In the first task you train a part-of-speech tagger for Swedish on data from the Stockholm-Umeå Corpus (SUC), and then use HunPos to tag and evaluate a test sample from SUC (held out from the training set). The files to be used are in /local/kurs/gta14/ on the STP system:
- suc-train.txt (manually tagged training set)
- suc-test.in (untagged test set)
- suc-test.txt (manually tagged test set for evaluation)
- suc-train.lex (lexicon for evaluation)
To train the part-of-speech tagger, use the command hunpos-train with the name of the model you want to create as the first argument. Training data is read from stdin. For instance, to train the model "mymodel" with default settings, run:
hunpos-train mymodel < suc-train.txt
To tag new text, use the command hunpos-tag with the model name as argument, reading input from stdin and redirecting stdout to a file. For instance, if the output file is named suc-test.out, run:
hunpos-tag mymodel < suc-test.in > suc-test.out
For evaluation, use the command tnt-diff called as follows:
tnt-diff -l suc-train suc-test.txt suc-test.out
The -l suc-train flag is necessary for tnt-diff to report separate statistics for known and unknown words, using the dictionary file suc-train.lex.
HunPos has parameters that can be varied to create different models. You should first compare a bigram and a trigram model, for which you need the flag -t, but you are of course free to test other parameters as well. See the HunPos user manual at https://code.google.com/p/hunpos/wiki/UserManualI for more information.
Errata: the -t flag must be used for the bigram model, not for the trigram model; i.e., for the bigram model, add the flag -t1.
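For instance, to train, tag, and evaluate the bigram model alongside the default trigram one, you could run commands like the following (the placement of -t1 after the model name is an assumption, as are the output file names; check the manual for the exact syntax):
hunpos-train mymodel-bi -t1 < suc-train.txt
hunpos-tag mymodel-bi < suc-test.in > suc-test-bi.out
tnt-diff -l suc-train suc-test.txt suc-test-bi.out
Per the errata, the default settings (no -t flag) give the trigram model, so the commands earlier in this section already cover that case.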
2. N-gram models
For this task, use a small sample of SUC and compute bigram statistics for parts of speech yourself, with additive smoothing (add-1, Laplace). Use only SUC's tags, plus <s> as a dummy tag for sentence-start probabilities. Note that the probabilities must be defined for all possible bigrams, not only for those actually occurring in the sample. Express the result as a table of 26 x 25 = 650 rows, beginning as follows (see also the sketch at the end of this section):
<s> AB : 0.0375
<s> DT : 0.075
...
AB AB : 0.120481927711
AB DT : 0.0240963855422
...
Here is the list of SUC's basic parts of speech: AB DT HA HD HP HS IE IN JJ KN MAD MID NN PAD PC PL PM PN PP PS RG RO SN UO VB. Samples are available on this page.
Errata: only 26 rows are required (more details in the email of 6 May).
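As an illustration (not the required deliverable), the following Python sketch computes such a table with the standard add-1 formula P(t2 | t1) = (C(t1 t2) + 1) / (C(t1) + 25), where 25 is the number of possible following tags; this formula reproduces the sample values above (for instance, 0.120481927711 = 10/83). The file name and the word-TAB-tag, blank-line-between-sentences format are assumptions; adapt them to the sample you were given.

from collections import Counter

TAGS = ["AB", "DT", "HA", "HD", "HP", "HS", "IE", "IN", "JJ", "KN",
        "MAD", "MID", "NN", "PAD", "PC", "PL", "PM", "PN", "PP", "PS",
        "RG", "RO", "SN", "UO", "VB"]

def tag_sequences(path):
    """Yield one list of tags per sentence; assumes one word<TAB>tag
    pair per line, with blank lines separating sentences."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:              # blank line = sentence boundary
                if sent:
                    yield sent
                    sent = []
            else:
                sent.append(line.split("\t")[1])
    if sent:
        yield sent

unigrams, bigrams = Counter(), Counter()
for tags in tag_sequences("suc-sample.txt"):  # hypothetical file name
    prev = "<s>"
    for tag in tags:
        unigrams[prev] += 1           # count t1 only when it has a successor
        bigrams[(prev, tag)] += 1
        prev = tag

# Add-1 smoothing: every one of the 26 x 25 bigrams gets nonzero probability.
for first in ["<s>"] + TAGS:
    for second in TAGS:
        prob = (bigrams[(first, second)] + 1) / (unigrams[first] + len(TAGS))
        print(f"{first} {second} : {prob}")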
3. Markov models
Below is a Markov model with two states, Ko and Anka, as well as a start state and an end state (Slut). Symbol probabilities:
- The state Ko generates the symbol Mu with probability 0.9 and hej with probability 0.1.
- The state Anka generates the symbol Kvack with probability 0.6 and hej with probability 0.4. (The duck is better at Swedish than the cow.)
- The start and end states generate nothing.
- From the start state the model goes to state Ko with probability 0.5 and to state Anka with probability 0.5.
- From state Ko the model stays in Ko with probability 0.5, goes to state Anka with probability 0.3, and goes to the Slut state with probability 0.2.
- From state Anka the model stays in Anka with probability 0.5, goes to state Ko with probability 0.3, and goes to the Slut state with probability 0.2.
A symbol sequence S = s1, ..., sn is generated by the model by leaving the start state, making n visits to Ko and Anka (emitting one symbol in each state visited), and finally moving to the Slut state.
Using the symbol sequence S assigned to you on this page, answer the following questions:
- Question 1: What is the most probable state sequence that generates S?
- Question 2: What is the probability of S given this model?
Both answers must be justified by calculations or logical arguments, but you do not need to use formal algorithms like Viterbi, forward, or backward. (An optional brute-force check is sketched below.)
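Purely as an optional sanity check of your hand calculations, the following Python sketch enumerates every state sequence of length n over {Ko, Anka} and computes its joint probability with S: the maximum answers question 1, and the sum answers question 2. The sequence S below is a placeholder; substitute the sequence assigned to you.

from itertools import product

EMIT = {"Ko": {"Mu": 0.9, "hej": 0.1},
        "Anka": {"Kvack": 0.6, "hej": 0.4}}
START = {"Ko": 0.5, "Anka": 0.5}
TRANS = {"Ko": {"Ko": 0.5, "Anka": 0.3, "Slut": 0.2},
         "Anka": {"Anka": 0.5, "Ko": 0.3, "Slut": 0.2}}

def joint(states, symbols):
    """P(state sequence, symbol sequence), including the final move to Slut."""
    p = START[states[0]] * EMIT[states[0]].get(symbols[0], 0.0)
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        p *= TRANS[prev][cur] * EMIT[cur].get(sym, 0.0)
    return p * TRANS[states[-1]]["Slut"]

S = ["Mu", "hej", "Kvack"]          # placeholder: use your assigned sequence
seqs = list(product(["Ko", "Anka"], repeat=len(S)))
best = max(seqs, key=lambda q: joint(q, S))
print("Most probable state sequence:", best, joint(best, S))   # question 1
print("P(S) =", sum(joint(q, S) for q in seqs))                # question 2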
4. For VG
For the grade VG, you should do one of the following additional tasks:
- Carefully explore how HunPos's suffix analysis affects the tagging results for unknown words.
- Calculate the bigram probabilities in sub-task 2 with one of the other smoothing methods treated in Jurafsky & Martin.
- Show in detail how to compute the solutions to sub-task 3 with the Viterbi algorithm and the forward or backward algorithm.
Report
Write a report that explains how you solved this lab. The report should be 2-4 pages if you have done only the mandatory tasks, and 3-5 pages with the VG task (excluding the large table in task 2). Send your report as a PDF (name format example: BjornSVENSSON_report.pdf) and your table as plain text (name format example: BjornSVENSSON_table.txt) to marie.dubremetz@lingfil.uu.se not later than Thursday 7 May 2015, extended to Monday 11 May (before noon) and subsequently to Wednesday 13 May. Have a look at the FAQ as well.