UPPSALA UNIVERSITET  
Dept. of Linguistics and Philology        Teacher: Marie Dubremetz

Language Technology and Computational Linguistics

http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/lab1.html

Grundläggande textanalys: Assignment 1

Lab - Text Normalization and Sentence Segmentation

Aim

The purpose of this lab is to give you practice with some common problems of text normalization.

You will compare your own normalization with two others: first, a normalization produced by a complete normalization program; second, a manually corrected normalization.

The text you will work with comes from the SUC corpus. The normalization program we use is included in the text processing package stb-light.

Read the full description, including how you should report your work, before you begin.


Task

Your task in this lab consists of the steps described below.

Download materials and software

You can find a tar file at /local/kurs/gta14/stb-light.tar.gz. Download and unpack it. You will get a folder that includes, among other things, the raw text norm.raw, the pipeline script stb-pipeline.py, and the gold standard facit.txt.
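For example (assuming GNU tar and that you work in a directory of your own):

cp /local/kurs/gta14/stb-light.tar.gz .
tar -xzf stb-light.tar.gz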

Browse the folder and make sure that you know which files to use.


Make a cleaned version

You will see in norm.raw that some metadata (@, ‹‹‹/ca04b›››, etc.) appear in the text. We want to keep only the text, not this metadata. Using command-line tools (sed, tr, etc.) with regular expressions, make a clean version of the text in which all this metadata is removed. Call it, for example, norm.cl.
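A minimal sketch of the kind of command this calls for; the patterns below assume the markers look exactly like the examples above, so check norm.raw and adjust them to what you actually find:

# sketch: assumes ‹‹‹...›››-style markers and stray @ signs; adjust to norm.raw
sed -e 's/‹‹‹[^›]*›››//g' -e 's/@//g' norm.raw > norm.cl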

In your report:

List the commands/regular expressions you used for cleaning.

Make your own text normalization

Your task is to tokenize and sentence-split the raw text by putting one token per line and adding a blank line at each sentence boundary. For comparison's sake, you will produce at least two alternative tokenizations:

  1. A version produced with the basic script baseline. Call it, e.g., norm.simple
  2. A version with your own improvements. Call it norm.mine
Don't forget to mark your sentence boundaries! The expected format looks like this:

han
var
17
år
kom
beskedet
att
familjen
fick
åka
hem
.

-
Men
en
deporterad
blir
aldrig
fri
,
säger
Ricardas
och
tar
fram
sitt
gamla
pass
.

Download the baseline script and run:

sh baseline norm.cl norm.simple

(2016-04-07) If baseline.sh did not download, or if it outputs an error, use this command instead:

sed 's/ /\n/g' norm.cl | sed -r 's/([\.\!\?])/\n&\n/g' > norm.simple

This will produce norm.simple. You will notice that the tokenization/sentence splitting is really basic and not adapted to this corpus. Write your own script that improves on this baseline. You can modify baseline.sh or start from scratch. Either way, don't forget to change the output file name: call it 'norm.mine'.
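As an illustration only, one possible direction builds on the baseline: only split punctuation at the end of a token, keep a few abbreviations intact, and separate trailing commas. The abbreviation list here is a made-up example, not a complete solution:

# sketch built on the baseline; the abbreviation list is only an example
sed 's/ /\n/g' norm.cl \
  | sed -r '/^(t\.ex\.|bl\.a\.|dvs\.)$/! s/([.!?])$/\n&\n/' \
  | sed -r 's/([,;:])$/\n&/' \
  > norm.mine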


Tokenization and segmentation with stb-light

In this step, use stb-light. Make sure you are in the directory that contains stb-pipeline.py. Run the program with the following command:

./stb-pipeline.py --output-dir=OUTPUT --tokenized INFIL

OUTPUT is the name (including path) of your output; call it "norm.stb". INFIL is the name (including path) of the file you want to process: norm.cl. You will now get a tokenized file, norm.stb.
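For example, with the names suggested above (assuming both files are in the current directory):

./stb-pipeline.py --output-dir=norm.stb --tokenized norm.cl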


Compare the different tokenizations and evaluate against the manually corrected tokenization

You now have access to four different tokenized versions of the text material:

  1. norm.simple. The "passably tokenized" version.
  2. norm.mine. Your own tokenization.
  3. norm.stb. stb-light's tokenization.
  4. facit.txt. Gold standard: the proofread tokenization according to SUC 2.0.

Now compare all these different tokenizations. Show that you have an idea of which tokenization program performs best. You can use the diff command (example below), but you will probably need to combine other command-line tools to get meaningful statistics (e.g., wc, egrep, etc.).

diff norm.mine facit.txt > UTFIL

Try to get an overview of which problems are common and which are more unusual.
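For instance, a couple of combinations that give rough numbers (a sketch; many others are useful):

wc -l norm.simple norm.mine norm.stb facit.txt   # compare token counts (one token per line)
diff norm.mine facit.txt | grep -c '^[<>]'       # count differing lines against the gold standard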

If you want, you can try to improve your tokenization based on the insights you gained from the comparison.

In your report:

Make sure that you answer all of the questions above: which tokenization performs best, which problems are common, and which are more unusual.


Transducer exercise:

On this page you will find the questions about morphological analysis and transducers. Draw the transducer with your favorite drawing program and send a PDF version of it, or draw it by hand on paper (make sure you give it to me, or put it in my department mailbox, with your name on it).


VG-task

  • Stemming
  • For this task I ask you to stem and lemmatize the text. You will be evaluated on your analysis.

    Download and install the Snowball stemmer on your machine:

    http://snowball.tartarus.org/download.php

    Once installed, run it on the facit.txt file.

    Command line:
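    For example, assuming the C distribution of Snowball, which builds the stemwords program (check the options of the version you actually installed):

    # assumption: stemwords from the Snowball C distribution
    ./stemwords -l swedish -i facit.txt -o facit.stemmed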

  • Lemmatization
  • Then create a sample of facit.txt containing only the first 500 lines; for that you can, for instance, run the command shown below.
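    For instance (the output name facit.500 is just a suggestion):

    # keep the first 500 lines of the gold standard
    head -n 500 facit.txt > facit.500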

    Lemmatize this extract with Granska:

    http://skrutten.nada.kth.se/


    (2016-04-20) Note that the lemmas are displayed only in the XML output.

  • Research and development on lemmatization
  • Look at this webpage:
    http://www.lexiconista.com/datasets/lemmatization/
    You will notice that there is a lemmatization list for Swedish. Download it and look at the lexicon.

    In your report (for VG-Task):

    Compare the lemmas with the stems: what are the differences? Can you find two or three recurrent mistakes in the lemmatization/stemming? Do you have a hypothesis about why they happened?
    On the research and development part: based on the error analysis of Granska's output, can you tell whether the lexicon would be useful for improving it? In other words, do you think it would be worth developing a lemmatizer from this lexicon, and could it outperform Granska? No implementation is needed to pass, but argue your opinion well.


For the development of your script(s) you may collaborate, but your report must be written in your own words; do not abuse copy/paste!

You will upload your report to Studentportalen. It must be written in English.  Deadline: 21 April (14:00).  Format: PDF (no *.zip, *.txt, *.doc(x), or *.odt files, please!). Don't forget the advice I gave in class on improving your report. You can also read the Uppsala report guidelines for inspiration. Write at most 1000 words if no VG task is included, at most 1500 if you do the VG task (wc -w).  Don't forget to include your transducer exercise as well, either as a PDF or by hand on a paper that you drop in my professional mailbox.


For questions, e-mail me. Don't forget to read the FAQ; your answer may already be there! I will update it if I get recurring questions from the class.

References: stb-light package: compiled by Joakim Nivre, based on the package Svannotate compiled by Filip Salomonsson (2009). For your information, in addition to text normalization, stb-light also performs part-of-speech tagging with HunPos (Halácsy, Kornai and Oravecz) and syntactic parsing with MaltParser (Hall, Nilsson and Nivre).