http://stp.lingfil.uu.se/~marie/undervisning/textanalys14/lab1NormOchTok2014.html
Grundläggande textanalys: Uppgift 2
Laboration - Textnormalisering och meningssegmentering
Aim
The purpose of this lab is to make you work on some common problems of text normalization. You will compare your own normalization with two others: first a normalization performed by a complete normalization program, and second a hand-made normalization.
The text you will work with comes from the SUC corpus. The normalization program we use is included in the text processing package stb-light, compiled by Joakim Nivre and based on the package Svannotate compiled by Filip Salomonsson (2009). For your information, in addition to text normalization, stb-light also performs part-of-speech tagging with HunPos (Halácsy, Kornai and Oravecz) and syntactic parsing with the MaltParser parser (Hall, Nilsson and Nivre).
Read the full description including how you should report your work before you begin.
Task
Your task in this lab is to:
- Get the text material and the software.
- Design your own normalization of the text material.
- Perform a normalization with the stb-light program.
- Compare the output of your own normalization with the output from stb-light.
- Finish by comparing your output, stb-light's output, and the proof-read version of the normalization / tokenization / segmentation of the text.
- Write a laboratory report (individual assignment) and hand it in to Marie by the deadline, Friday May 2nd.
- Do your exercise on transducers (individual assignment).
The laboratory report is an individual task, i.e., you should write it alone. During the work itself, however, I encourage cooperation. Discuss possible solutions with your peers while working on the lab, but then write the report itself on your own.
Download materials and software
You can find a tar file in /local/kurs/gta14/stb-light.tar.gz. Download and unpack it. You will get a folder that includes:
- stb-light, with the ability to run HunPos and MaltParser on Swedish texts.
- A raw text from SUC1.0 called norm.raw (slightly edited).
- A tokenized version of this text based on SUC2.0 called facit.txt.
Browse the folder and make sure that you know which files to use.
Make a cleaned version
You will see in norm.raw that some metadata (@, ‹‹‹/ca04b››› etc.) appear in the text. We want to keep only the text, not the metadata. Using command-line tools (sed, tr, etc.) with regular expressions, make a clean version of the text where you remove all this metadata. Call it, for example, norm.cl.
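For example, a cleaning step along these lines might work (a sketch only; the metadata patterns below are assumptions, so inspect norm.raw and adjust the expressions to what you actually find there):

```shell
# Strip ‹‹‹...››› markup blocks and stray @ markers (assumed patterns;
# check norm.raw for the exact metadata forms in your copy)
sed -e 's/‹‹‹[^›]*›››//g' -e 's/@//g' norm.raw > norm.cl
```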
In your report: tell which commands and regular expressions you used for cleaning.

Make your own text normalization
Your task is to normalize the raw text by putting one token per line and adding a blank line at each sentence boundary. For comparison's sake, make at least two alternative tokenizations:
- A version where you just split tokens on spaces. Call it e.g. norm.simple.
- A version where you make a better tokenization by considering other boundaries (punctuation etc.). Call it norm.token.
Då han var 17 år kom beskedet att familjen fick åka hem . - Men en deporterad blir aldrig fri , säger Ricardas och tar fram sitt gamla pass .
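A minimal sketch of the two versions (the heuristics here are assumptions; in particular, the sentence-boundary rule is deliberately naive and will mis-handle abbreviations and decimal numbers):

```shell
# norm.simple: split tokens on spaces only, one token per line
tr -s ' ' '\n' < norm.cl > norm.simple

# norm.token: set punctuation off with spaces, split on spaces, then add
# a blank line after lines holding only sentence-final punctuation
sed 's/\([.,!?:;]\)/ \1 /g' norm.cl | tr -s ' ' '\n' | sed '/^[.!?]$/G' > norm.token
```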
Perform the normalization / tokenization with regular expressions, e.g., with commands like sed and tr. Shell commands are sufficient for this task, but be aware that the regular expression syntax can differ from one command to another.
Suggestions of things to look at:
- How many tokens do you get in each version of the normalization, and why?
- Use the diff command to look at the differences between the files. How many differences are there? Use diff this way:
diff norm.simple norm.token > UTFIL
You can also use whatever comparison tool fits you for this task; try several and see which one you prefer.
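For example (the grep pattern simply counts the changed lines that diff marks with < or >):

```shell
# Token counts per version
wc -l norm.simple norm.token

# Save the diff and count the differing lines
diff norm.simple norm.token > UTFIL
grep -c '^[<>]' UTFIL
```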
When you have become familiar with your two versions, you can proceed to the stb-light normalization.
In your report:
Write down the expressions you used for this task. Your report should explain what they are used for.

Tokenization and segmentation with stb-light
In this step, use stb-light. Make sure you are in the directory that contains stb-pipeline.py. Run the program with the following command:
./stb-pipeline.py --output-dir=OUTPUT --tokenized INFIL
OUTPUT is the name (including path) of your output; call it "norm.t". INFIL is the name of the file you want to process: norm.raw (including path). You will now get a tokenized file norm.t.
Compare the different tokenizations.
You now have access to (at least) 4 different tokenized versions of the text material:
- norm.simple: your "passably tokenized" version.
- norm.token: the version you put a little more effort into.
- norm.t: stb-light's tokenization.
- facit.txt: the proof-read tokenization according to SUC2.0.
Now compare the tokenizations in norm.simple and norm.token with stb-light's tokenization (norm.t). For example, use wc and diff. Count the tokens in all the files (i.e., count rows with wc -l) and compare the numbers you get. Then compare the files with diff:
diff norm.token norm.t > UTFIL
Using egrep on the diff file, you can examine how often certain differences appear in stb-light's tokenization. Try to get an overview of which problems seem common and which are more unusual.
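For instance, you could count how often stb-light splits off a sentence-final period that your version kept attached to the word (the pattern is only an illustration; adapt it to the differences you actually see in UTFIL):

```shell
# Lines that diff marked as present only in norm.t and that consist of
# a lone period token
egrep -c '^> \.$' UTFIL
```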
If you want, you can try to improve your tokenization based on the insights you gained from the comparison.
Evaluate against a manually corrected tokenization
When you think your tokenization is ready, compare both your tokenizations and stb-light's with the proof-read tokenization called "facit.txt", which is manually corrected.
In your report:
Examine the following files: norm.simple, norm.token, norm.t, facit.txt:
- Is the number of tokens the same in all files? If not, how big are the differences?
- How many sentences are there in the different versions?
- Can you tell what the most recurrent differences in tokenization are? In sentence segmentation?
- Can you say which part was more difficult, the tokenization or the sentence segmentation? Why?
- Can you tell which of your tokenization files has more errors?
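Since the format is one token per line with a blank line at each sentence boundary, the token and sentence counts can be read off directly; a sketch (assuming all four files follow that format):

```shell
for f in norm.simple norm.token norm.t facit.txt; do
  # non-blank lines = tokens, blank lines = sentence boundaries
  printf '%s: %s tokens, %s sentences\n' "$f" \
    "$(grep -c . "$f")" "$(grep -c '^$' "$f")"
done
```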
Transducer exercise:
Complete the transducer in this file. You can modify the file directly, or print the PDF version and do it by hand (make sure you give it to me, or put it in my department mailbox, with your name on it).
VG-task
Download and install the Snowball stemmer on your machine:
http://snowball.tartarus.org/download.php
Once installed, run it on the facit.txt file.
Command line:
stemwords -l swedish -i facit.txt -o outputFileName
Then create a sample of facit.txt containing only the first 500 lines; for that you can run the command:
head -n 500 facit.txt > sampleFacit.txt
Lemmatize this extract with Granska:
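One way to inspect the two analyses is to line them up column by column; a sketch, assuming you have saved Snowball's output as stems.txt and Granska's lemmas (one per line, aligned with the input) as lemmas.txt -- both file names are placeholders:

```shell
# word / stem / lemma side by side, first 20 lines
paste sampleFacit.txt stems.txt lemmas.txt | head -20
```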
In your report (for VG-Task):
Compare the lemmas with the stems: what are the differences? Can you find 2 or 3 recurrent mistakes in the lemmatization/stemming? Do you have a hypothesis on why they happened?

The laboratory report is an individual task, i.e., you should write it alone. During the work itself, however, I encourage cooperation. Reason and discuss possible solutions with your peers while working on the lab - it will be more fruitful that way - but then write the report itself on your own.
You will send your report, written in English, together with your exercise on transducers to marie.dubremetz@lingfil.uu.se. For writing your report, try to follow these instructions: report instructions. I accept the formats ".pdf" and ".doc" for your report; name it after yourself, "MyName.doc" or "MyName.pdf". You should also state your name in the report file itself (first name and last name; write your last name in upper-case letters). The deadline is May 2nd (Friday). There is not just the report to send: don't forget to attach your transducer exercise in the format MyName.odg; this is not an optional task. See the section "Transducer exercise". If your operating system cannot read *.odg files, you can put the exercise, done by hand, in my mailbox (but don't forget to write your first name and your last name on it, with your last name in upper-case letters).
For questions you can e-mail me or stop by my office: 9-2041.