http://stp.lingfil.uu.se/~marie/undervisning/textanalys14/lab1NormOchTok2014.html
Grundläggande textanalys: Uppgift 2
Laboration - Textnormalisering och meningssegmentering
Aim
The purpose of this lab is to make you work on some common problems of text normalization. You will compare your own normalization with two others: first a normalization performed by a complete normalization program, and second a hand-made normalization.
The text you will work with comes from the SUC corpus. The normalization program we use is included in the text processing package stb-light, compiled by Joakim Nivre and based on the package Svannotate compiled by Filip Salomonsson (2009). For your information, in addition to text normalization, stb-light also performs part-of-speech tagging with HunPos (Halácsy, Kornai and Oravecz) and syntactic parsing with the MaltParser parser (Hall, Nilsson and Nivre).
Read the full description including how you should report your work before you begin.
Task
Your task in this lab is to:
- Get the text material and the software.
- Design your own normalization of the text material.
- Perform a normalization with the stb-light program.
- Compare the output of your own normalization with the output from stb-light.
- Finish by comparing your output, stb-light's output, and the proof-read version of the normalization / tokenization / segmentation of the text.
- Write a laboratory report (individual assignment) and hand it in to Marie by the deadline, Friday May 2nd.
- Do your exercise on transducers (individual assignment).
The laboratory report is an individual task, i.e., you should write it alone. During the work itself, however, I encourage cooperation. Discuss possible solutions with your peers while working on the lab, but then write the report itself on your own.
Download materials and software
You can find a tar file in /local/kurs/gta14/stb-light.tar.gz. Download and unpack it. You will get a folder that includes:
- stb-light, with the ability to run HunPos and MaltParser on Swedish texts.
- A raw text from SUC1.0 called norm.raw (slightly edited).
- A tokenized version of this text based on SUC2.0 called facit.txt.
Browse the folder and make sure that you know which files to use.
Make a cleaned version
You will see in norm.raw that some metadata (@, ‹‹‹/ca04b››› etc.) appear in the text. We want to keep only the text, not the metadata. Using command-line tools (sed, tr, etc.) with regular expressions, make a clean version of the text where you remove all this metadata. Call it, for example, norm.cl.
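For example, a cleaning step along these lines might work (a sketch only; the metadata patterns below are assumptions, so inspect norm.raw and adjust the expressions to what you actually find there):

```shell
# Strip ‹‹‹...››› markup blocks and stray @ markers (assumed patterns;
# check norm.raw for the exact metadata forms in your copy)
sed -e 's/‹‹‹[^›]*›››//g' -e 's/@//g' norm.raw > norm.cl
```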
In your report: tell which commands and regular expressions you used for cleaning.

Make your own text normalization
Your task is to normalize the raw text by putting one token per line and adding a blank line at each sentence boundary. For comparison's sake, make at least two alternative tokenizations:
- A version where you just split tokens on spaces. Call it e.g. norm.simple.
- A version where you make a better tokenization by considering other boundaries (punctuation etc.). Call it norm.token.
Då han var 17 år kom beskedet att familjen fick åka hem . - Men en deporterad blir aldrig fri , säger Ricardas och tar fram sitt gamla pass .
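A minimal sketch of the two versions (the heuristics here are assumptions; in particular, the sentence-boundary rule is deliberately naive and will mis-handle abbreviations and decimal numbers):

```shell
# norm.simple: split tokens on spaces only, one token per line
tr -s ' ' '\n' < norm.cl > norm.simple

# norm.token: set punctuation off with spaces, split on spaces, then add
# a blank line after lines holding only sentence-final punctuation
sed 's/\([.,!?:;]\)/ \1 /g' norm.cl | tr -s ' ' '\n' | sed '/^[.!?]$/G' > norm.token
```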
Perform the normalization / tokenization with regular expressions, e.g., with commands like sed and tr. Shell commands are sufficient for this task, but be aware that the regular expression syntax can differ from one command to another.
Suggestions of things to look at:
- How many tokens do you get in each version of the normalization, and why?
- Use the diff command to look at the differences between the files. How many differences are there? Use diff this way:
diff norm.simple norm.token > UTFIL
You can also use whatever comparison tool fits you for this task; try several and see which one you prefer.
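For example (the grep pattern simply counts the changed lines that diff marks with < or >):

```shell
# Token counts per version
wc -l norm.simple norm.token

# Save the diff and count the differing lines
diff norm.simple norm.token > UTFIL
grep -c '^[<>]' UTFIL
```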
When you have become familiar with your two versions, you can proceed to the stb-light normalization.
In your report:
Write down the expressions you used for this task. Your report should explain what they are used for.

Tokenization and segmentation with stb-light
In this step, use stb-light. Make sure you are in the directory that contains stb-pipeline.py. Run the program with the following command:
./stb-pipeline.py --output-dir=OUTPUT --tokenized INFIL
OUTPUT is the name (including path) of your output; call it "norm.t". INFIL is the name of the file you want to process: norm.raw (including path). You will now get a tokenized file norm.t.
Compare the different tokenizations.
You now have access to (at least) 4 different tokenized versions of the text material:
- norm.simple: your "passably tokenized" version.
- norm.token: the version you put a little more effort into.
- norm.t: stb-light's tokenization.
- facit.txt: the proof-read tokenization according to SUC2.0.
Now compare the tokenizations in norm.simple and norm.token with stb-light's tokenization (norm.t). For example, use wc and diff. Count the tokens in all the files (i.e., count rows with wc -l) and compare the numbers you get. Then compare the files with diff:
diff norm.token norm.t > UTFIL
Using egrep on the diff file, you can examine how often certain differences appear in stb-light's tokenization. Try to get an overview of which problems seem common and which are more unusual.
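For instance, you could count how often stb-light splits off a sentence-final period that your version kept attached to the word (the pattern is only an illustration; adapt it to the differences you actually see in UTFIL):

```shell
# Lines that diff marked as present only in norm.t and that consist of
# a lone period token
egrep -c '^> \.$' UTFIL
```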
If you want, you can try to improve your tokenization based on the insights you gained from the comparison.
Evaluate against a manually corrected tokenization
When you think your tokenization is ready, compare both your tokenizations and stb-light's with the proof-read tokenization called "facit.txt", which is manually corrected.
In your report:
Examine the following files: norm.simple, norm.token, norm.t, facit.txt:
- Is the number of tokens the same in all files? If not, how big are the differences?
- How many sentences are there in the different versions?
- Can you tell what the most recurrent differences in tokenization are? In sentence segmentation?
- Can you say which part was more difficult, the tokenization or the sentence segmentation? Why?
- Can you tell which of your tokenization files has more errors?
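Since the format is one token per line with a blank line at each sentence boundary, the token and sentence counts can be read off directly; a sketch (assuming all four files follow that format):

```shell
for f in norm.simple norm.token norm.t facit.txt; do
  # non-blank lines = tokens, blank lines = sentence boundaries
  printf '%s: %s tokens, %s sentences\n' "$f" \
    "$(grep -c . "$f")" "$(grep -c '^$' "$f")"
done
```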
Transducer exercise:
Complete the transducer in this file. You can modify the file directly, or print the PDF version and do it by hand (make sure you give it to me, or put it in my department mailbox, with your name on it).
VG-task
Download and install the Snowball stemmer on your machine:
http://snowball.tartarus.org/download.php
Once installed, run it on the facit.txt file.
Command line:
stemwords -l swedish -i facit.txt -o outputFileName
Then create a sample of facit.txt containing only the first 500 lines; for that you can run the command:
head -n 500 facit.txt > sampleFacit.txt
Lemmatize this extract with Granska:
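One way to inspect the two analyses is to line them up column by column; a sketch, assuming you have saved Snowball's output as stems.txt and Granska's lemmas (one per line, aligned with the input) as lemmas.txt -- both file names are placeholders:

```shell
# word / stem / lemma side by side, first 20 lines
paste sampleFacit.txt stems.txt lemmas.txt | head -20
```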
In your report (for VG-Task):
Compare the lemmas with the stems: what are the differences? Can you find 2 or 3 recurrent mistakes in the lemmatization/stemming? Do you have a hypothesis on why they happened?

The laboratory report is an individual task, i.e., you should write it alone. During the work itself, however, I encourage cooperation. Reason and discuss possible solutions with your peers while working on the lab - it will be more fruitful that way - but then write the report itself on your own.
You will send your report, written in English, together with your exercise on transducers to marie.dubremetz@lingfil.uu.se. For writing your report, try to follow these instructions: report instructions. I accept the formats ".pdf" and ".doc" for your report; name it after yourself, "MyName.doc" or "MyName.pdf". You should also state your name in the report file itself (first name and last name; write your last name in upper-case letters). The deadline is May 2nd (Friday). There is not just the report to send: don't forget to attach your transducer exercise in the format MyName.odg; this is not an optional task. See the section "Transducer exercise". If your operating system cannot read *.odg files, you can put the exercise, done by hand, in my mailbox (but don't forget to write your first name and your last name on it, with your last name in upper-case letters).
For questions you can e-mail me or stop by my office: 9-2041.