Dept. of Linguistics and Philology    Teacher: Marie Dubremetz
Uppsala University


Basic Text Analysis (Grundläggande textanalys): Assignment 1

Lab: Text Normalization and Sentence Segmentation


The purpose of this lab is to give you hands-on experience with some common problems of text normalization.

You will compare your own normalization with two others: first, a normalization produced by a complete normalization program; second, a manually corrected normalization.

The text you will work with comes from the SUC corpus. The normalization program we use is included in the text-processing package stb-light, compiled by Joakim Nivre and based on the package Svannotate compiled by Filip Salomonsson (2009). For your information, in addition to text normalization, stb-light also performs part-of-speech tagging with HunPos (Halácsy, Kornai and Oravecz) and syntactic parsing with MaltParser (Hall, Nilsson and Nivre).

Read the full description including how you should report your work before you begin.


Your task in this lab is to:

Download materials and software

You can find a tar file at /local/kurs/gta14/stb-light.tar.gz. Download and unpack it. You will get a folder that includes:

Browse the folder and make sure that you know which files to use.

Make a cleaned version

You will see in norm.raw that some metadata (@, ‹‹‹/ca04b›››, etc.) appear in the text. We want to keep only the text, not the metadata. Using command-line tools (sed, tr, etc.) with regular expressions, make a clean version of the text where you remove all of this metadata. Call it, for example, norm.cl.
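A minimal sketch of such a cleaning step, assuming the metadata look like the examples above ('@' markers and ‹‹‹...››› tags); the exact patterns must be adapted to what you actually find in norm.raw. A tiny stand-in sample is used here so the commands can be tried as-is.

```shell
# Stand-in sample; with the real data, skip this line and read norm.raw directly.
printf '@\nDetta ‹‹‹/ca04b››› är texten.\n' > norm.raw
# Remove ‹‹‹...››› tags and '@' markers, squeeze leftover double spaces,
# and drop lines that became empty.
sed -e 's/‹‹‹[^›]*›››//g' -e 's/@//g' -e 's/  */ /g' norm.raw \
  | grep -v '^[[:space:]]*$' > norm.cl
cat norm.cl
```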

In your report:

List which command/regular expressions you used for cleaning.

Make your own text normalization

Your task is to normalize the raw text by putting one token per line and adding a blank line at each sentence boundary. For comparison's sake, make at least two alternative tokenizations:

  1. A version where you simply split tokens on spaces. Call it e.g. norm.simple
  2. A version with a better tokenization that considers other boundaries (punctuation etc.). Call it norm.token
Don't forget to mark your sentence boundaries!



Perform the normalization/tokenization with regular expressions, e.g. using commands like sed and tr. Shell commands are sufficient for this task, but be aware that regular-expression syntax can differ from one command to another.
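The two versions could be sketched roughly as follows; this is a starting point under the assumption that punctuation is limited to `.,!?`, not a complete tokenizer. A one-line stand-in is used for norm.cl.

```shell
# Stand-in sample; with the real data, use your norm.cl directly.
printf 'Hej, världen. Det regnar!\n' > norm.cl
# 1) Naive version: one token per line, splitting on spaces only.
tr ' ' '\n' < norm.cl > norm.simple
# 2) Better version: split off punctuation as its own token,
#    then add a blank line after sentence-final . ! ? tokens.
sed 's/\([.,!?]\)/ \1/g' norm.cl | tr -s ' ' '\n' | sed '/^[.!?]$/G' > norm.token
cat norm.token
```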

Look at:

  1. How many tokens do you get in each version of the normalization, and why?
  2. Use the diff command to look at the differences between the files. How many differences are there?
You run the diff command this way:

diff norm.simple norm.token > UTFIL

You may also use any other comparison tool that suits you; try several and see which one you prefer.
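Two quick ways to quantify the differences are wc for token counts and grep on the diff output. Stand-in files are created here so the commands run as-is; with your real data, point them at your own norm.simple and norm.token.

```shell
# Stand-in files; skip these two lines when working on the real data.
printf 'Hej,\nvärlden.\n' > norm.simple
printf 'Hej\n,\nvärlden\n.\n' > norm.token
wc -l norm.simple norm.token                   # token counts: one token per line
diff norm.simple norm.token | grep -c '^[<>]'  # number of differing lines
```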

Once you are familiar with your two versions, you can proceed to the stb-light normalization.

In your report:

Answer the questions above and list as well the regular expressions you use for that task.

Tokenization and segmentation with stb-light

In this step, use stb-light. Make sure you are in the directory that contains stb-pipeline.py. Run the program with the following command:

./stb-pipeline.py --output-dir=OUTPUT --tokenized INFIL

OUTPUT is the name (including path) of your output; call it "norm.t". INFIL is the name of the file you want to process: norm.cl (including path). You will now get a tokenized file norm.t.

Compare the different tokenizations.

You now have access to (at least) 4 different tokenized versions of the text material:

  1. norm.simple: your "passably tokenized" version.
  2. norm.token: the version you put a little more effort into.
  3. norm.t: stb-light's tokenization.
  4. facit.txt: the proof-read tokenization according to SUC 2.0.

Now compare the tokenization in norm.simple and norm.token with stb-light's tokenization (norm.t), for example using wc and diff. Count the tokens in all the files (i.e., count lines with wc -l) and compare the numbers you get. Then compare the files with diff:

diff norm.token norm.t > UTFIL

Using egrep on the diff file, you can examine how often certain differences appear in stb-light's tokenization. Try to get an overview of which problems seem common and which are more unusual.
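For example, counting how often a particular difference recurs might look like this. The pattern below (a lone period split off on the norm.t side) is just an illustration; adapt it to the differences you actually see. A stand-in UTFIL is created so the command runs as-is.

```shell
# Stand-in diff output; with real data, use the UTFIL produced by diff above.
printf '> etc\n> .\n< etc.\n> dvs\n> .\n< dvs.\n' > UTFIL
egrep -c '^> \.$' UTFIL    # how often norm.t has a lone period as its own token
```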

If you want, you can try to improve your tokenization based on the insights gained from the comparison.

Evaluate against a manually corrected tokenization

When you think your tokenization is ready, compare both your tokenization and stb-light's with the proof-read, manually corrected tokenization called "facit.txt".

In your report:

Examine the following files: norm.token, norm.t, facit.txt:

Transducer exercise:

On this page you will find the questions about morphological analysis and transducers. Draw the transducer with your favorite drawing program and send a PDF version of it, or do it by hand on paper (make sure you hand it to me, or put it in my department mailbox, with your name on it).


  • Stemming
  • For this task I ask you to stem and lemmatize the text. You will be evaluated on your analysis.

    Download and install the Snowball stemmer on your machine:


    Once installed, run it on the facit.txt file.

    Command line:

  • Lemmatization
  • Then create a sample of facit.txt containing only the first 500 lines; for that you can run the command:
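The command itself is missing from the handout; one standard way to take the first 500 lines (an assumption on my part, not necessarily the command the teacher intended) is head. A stand-in facit.txt is generated here so the commands run as-is.

```shell
seq 1000 > facit.txt                  # stand-in; use the real facit.txt
head -n 500 facit.txt > facit500.txt  # keep only the first 500 lines
wc -l < facit500.txt
```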

    Lemmatize this extract with Granska:


    In your report (for VG-Task):

    Compare the lemmas with the stems: what are the differences? Can you find 2 or 3 recurrent mistakes in the lemmatization/stemming? Do you have a hypothesis about why they happened?

    The laboratory report is an individual task, i.e., you should write it alone. During the lab work itself, however, I encourage cooperation. Discuss possible solutions with your peers, but then formulate the written report by yourself.

    Send your report, written in English, together with your transducer exercise to marie.dubremetz@lingfil.uu.se. When writing your report, follow these instructions: report instructions (2-4 pages if no VG task is included, 3-5 pages if you do the VG task). Only the PDF format will be accepted. I will receive 15 different files, so I would appreciate it if you named yours after yourself (like in this example: "BjornSVENSSON_report.pdf"). State your name inside the report file as well. The deadline is April 23rd (Thursday). Don't forget to include your transducer exercise, either as a PDF ("BjornSVENSSON_fst.pdf") or by hand in my department mailbox.

    For questions, e-mail me. Don't forget to read the FAQ; your answer may already be there! I will update it if I get more recurrent questions from the class.