Grundläggande Textanalys: Assignment 2

This lab is about parts of speech and Markov models. The practical part consists of training and evaluating taggers with the HunPos system. The theoretical part consists of questions about Markov models and related n-gram models. There is also an additional task for those aiming for the grade VG.

1. Part-of-Speech Tagging with HunPos

In the first task you use a part-of-speech tagger for Swedish on data from the Stockholm-Umeå Corpus (SUC). You then use the HunPos system to evaluate the accuracy on a particular test sample from SUC (which is separate from the training set). The files to be used are in /local/kurs/gta14/ on the STP system:

To train the part-of-speech tagger, use the command hunpos-train with the name of the model you want to create as the first argument. Training data is read from stdin. For instance, to train the model "mymodel" with default settings, run:

hunpos-train mymodel < suc-train.txt

To tag new text, use the command hunpos-tag with the model name as argument, reading input from stdin and redirecting stdout to a file. For instance, if we name the output file suc-test.out, run:

hunpos-tag mymodel < suc-test.in > suc-test.out

For evaluation, use the command tnt-diff, called as follows:

tnt-diff -l suc-train suc-test.txt suc-test.out

The -l suc-train flag is necessary for tnt-diff to provide various statistics for known and unknown words, using the dictionary file suc-train.lex.

HunPos has parameters that can be varied to create different models. You should first compare a bigram and a trigram model, for which you need to use the flag -t, although you are of course free to test other parameters as well. Check the HunPos user manual at https://code.google.com/p/hunpos/wiki/UserManualI for more information.

Errata: The flag -t must be used for the bigram model, not for the trigram model; i.e., for the bigram model, add the flag -t1.
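If you want to double-check the overall accuracy that tnt-diff reports, a short script can recompute it. Below is a minimal sketch, assuming the HunPos one-token-per-line format (word and tag separated by a tab, blank lines between sentences); the tags in the toy example are invented:

```python
# Sanity-check the overall accuracy that tnt-diff reports.
# Assumes the HunPos file format: one "word<TAB>tag" token per line,
# with blank lines separating sentences.

def read_tags(path):
    """Return the tags of a tagged file, skipping sentence-boundary lines."""
    tags = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:                        # blank line = sentence boundary
                tags.append(line.split("\t")[1])
    return tags

def accuracy(gold_tags, out_tags):
    """Fraction of positions where the tagger agrees with the gold standard."""
    assert len(gold_tags) == len(out_tags), "token counts must match"
    correct = sum(g == o for g, o in zip(gold_tags, out_tags))
    return correct / len(gold_tags)

# Toy illustration with invented tags; for the lab you would compare
# read_tags("suc-test.txt") against read_tags("suc-test.out").
print(accuracy(["NN", "VB", "AB"], ["NN", "NN", "AB"]))
```

Note that this only reproduces the overall accuracy; tnt-diff additionally breaks the figures down into known and unknown words.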

2. N-gram Models

For this task, use a small sample of SUC and compute yourself the bigram statistics for part-of-speech tags with additive smoothing (add-one, Laplace). Use only SUC's tags, and use <s> as a dummy tag for the sentence-start probabilities. Please note that the probabilities should be defined for all possible bigrams, not only for those actually present in the sample. Express the result as a table of 26 x 25 = 650 rows, which begins as follows:

<s> AB : 0.0375
<s> DT : 0.075
AB AB : 0.120481927711
AB DT : 0.0240963855422
Here is the list of SUC's basic part-of-speech tags:
Samples are available on this page.

Errata: Only 26 rows are required (more details in the email of 6 May).
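The add-one computation can be sketched in Python. The tag set and tagged sentences below are invented placeholders; the real table uses SUC's 25 tags plus the dummy start tag <s>, giving 26 possible first tags and 25 possible second tags per row:

```python
from collections import Counter

# Add-one (Laplace) smoothed bigram probabilities over part-of-speech tags:
#   P(t2 | t1) = (C(t1 t2) + 1) / (C(t1) + V)
# where V is the number of tags that may occur in second position
# (<s> can only occur first). The tag set and sentences are invented.
TAGS = ["AB", "DT", "NN", "VB"]           # hypothetical subset of SUC tags

sentences = [                             # invented toy data
    ["DT", "NN", "VB", "AB"],
    ["NN", "VB", "DT", "NN"],
]

unigrams = Counter()
bigrams = Counter()
for tags in sentences:
    padded = ["<s>"] + tags
    for t1, t2 in zip(padded, padded[1:]):
        unigrams[t1] += 1
        bigrams[(t1, t2)] += 1

V = len(TAGS)                             # <s> cannot occur as second tag
for t1 in ["<s>"] + TAGS:
    for t2 in TAGS:
        p = (bigrams[(t1, t2)] + 1) / (unigrams[t1] + V)
        print(f"{t1} {t2} : {p}")
```

With 4 toy tags this prints 5 x 4 = 20 rows in the same "t1 t2 : p" format as the table above; with the full SUC tag set it would print 26 x 25 rows.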

3. Markov Models

Below is a Markov model with two states, Ko and Anka, as well as a start state and an end state (Slut).  Symbol probabilities:

Transition Probabilities:

A symbol sequence S = s1, ..., sn is generated by the model by leaving the start state and making n transitions through Ko and Anka until it reaches the Slut state.

Using the symbol sequence S assigned to you on this page, answer the following questions:

Both answers must be justified by calculations or logical arguments, but you do not need to use formal algorithms such as Viterbi, forward, or backward.
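Since formal algorithms are not required, one direct way to compute the probability of a symbol sequence is to sum over all possible state paths. The sketch below does exactly that; all probability values in it are invented placeholders, so you would substitute the symbol and transition probabilities from the model's figure:

```python
from itertools import product

# Probability of a symbol sequence in a two-state Markov model with
# symbol (emission) probabilities, computed by brute-force summation
# over all state paths. All numeric values are invented placeholders.
states = ["Ko", "Anka"]

start = {"Ko": 0.5, "Anka": 0.5}                        # placeholder
trans = {("Ko", "Ko"): 0.3, ("Ko", "Anka"): 0.5,        # placeholder
         ("Anka", "Ko"): 0.4, ("Anka", "Anka"): 0.4}
end = {"Ko": 0.2, "Anka": 0.2}                          # P(state -> Slut)
emit = {("Ko", "a"): 0.6, ("Ko", "b"): 0.4,             # placeholder
        ("Anka", "a"): 0.1, ("Anka", "b"): 0.9}

def sequence_probability(symbols):
    """Sum P(path) * P(symbols | path) over every possible state path."""
    total = 0.0
    for path in product(states, repeat=len(symbols)):
        p = start[path[0]] * emit[(path[0], symbols[0])]
        for i in range(1, len(symbols)):
            p *= trans[(path[i - 1], path[i])] * emit[(path[i], symbols[i])]
        p *= end[path[-1]]                 # final transition to Slut
        total += p
    return total

print(sequence_probability(["a", "b", "a"]))
```

For a short assigned sequence you can write out the same sum by hand, which also serves as the required justification; the most probable single path (the largest term in the sum) answers the question of which state sequence best explains S.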

4. For VG

For the grade VG, you should complete one of the following additional tasks:


Write a report that explains how you solved this lab. The report should be 2-4 pages if only the mandatory tasks have been completed and 3-5 pages for VG (excluding the large table in task 2). Send your report as a PDF (name format example: BjornSVENSSON_report.pdf) and your table in plain text (name format example: BjornSVENSSON_table.txt) to marie.dubremetz@lingfil.uu.se no later than Wednesday 13 May 2015 (the deadline was first Thursday 7 May, then Monday 11 May before noon).

Have a look at the FAQ as well.