Grundläggande textanalys: Assignment 1
Lab - Text normalization and sentence segmentation
Aim
The purpose of this lab is to make you work on some common problems of text normalization. You will compare your own normalization with two others: first, a normalization performed by a complete normalization program; second, a manually produced normalization.
The text you will work with comes from the SUC corpus. The normalization program we use is included in the text processing package stb-light, compiled by Joakim Nivre and based on the package Svannotate compiled by Filip Salomonsson (2009). For your information, in addition to text normalization, stb-light also performs part-of-speech tagging with HunPos (Halácsy, Kornai and Oravecz) and syntactic parsing with MaltParser (Hall, Nilsson and Nivre).
Read the full description including how you should report your work before you begin.
Task
Your task in this lab is to:
- Get the text material and the software.
- Design your own tokenization and sentence segmentation of the text material.
- Perform tokenization and sentence segmentation with the stb-light program.
- Compare the output of your own tokenization and sentence segmentation with the output from stb-light.
- Finish by comparing your output, stb-light's output, and the proof-read version of the tokenization/segmentation of the text.
- Write a laboratory report (individual assignment) and hand it in to Marie by the deadline, April 23rd.
- Do the exercise on transducers (individual assignment).
Download materials and software
You can find a tar file in /local/kurs/gta14/stb-light.tar.gz. Download and unpack it (a command sketch follows the list below). You will get a folder that includes:
- stb-light, with the ability to run HunPos and MaltParser on Swedish texts.
- A raw text from SUC 1.0 called norm.raw (slightly edited).
- A tokenized version of this text based on SUC 2.0 called facit.txt.
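For example, downloading and unpacking could look like this (a minimal sketch, assuming you work in a fresh directory under your home directory):
mkdir ~/lab1 && cd ~/lab1
cp /local/kurs/gta14/stb-light.tar.gz .
tar -xzf stb-light.tar.gz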
Browse the folder and make sure that you know which files to use.
Make a cleaned version
You will see in norm.raw that some metadata (@, ‹‹‹/ca04b›››, etc.) appear in the text. We want to keep only the text itself, not the metadata. Using command-line tools (sed, tr, etc.) with regular expressions, make a clean version of the text where you remove all this metadata. Call it, for example, norm.cl.
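As an illustration only (the actual patterns depend on the metadata you find in the file, and this sketch assumes a UTF-8-aware sed), a cleaning command might look like this:
sed -e 's/‹‹‹[^›]*›››//g' -e 's/@//g' norm.raw > norm.cl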
In your report:
List which commands/regular expressions you used for cleaning.
Make your own text normalization
Your task is to normalize the raw text by putting one token per line and adding a blank line at each sentence boundary. For comparison's sake, make at least two alternative tokenizations:
- A version where you just split tokens on spaces. Call it, e.g., norm.simple.
- A version where you make a better tokenization by considering other boundaries (punctuation, etc.). Call it norm.token.
The expected output format, with one token per line and a blank line between sentences, looks like this:
Då
han
var
17
år
kom
beskedet
att
familjen
fick
åka
hem
.
-
Men
en
deporterad
blir
aldrig
fri
,
säger
Ricardas
och
tar
fram
sitt
gamla
pass
.
Perform the normalization/tokenization with regular expressions, for example with commands like sed and tr. Shell commands are sufficient for this task, but be aware that regular expression syntax can differ from one command to another.
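As a rough sketch (assuming GNU tr and GNU sed; the patterns are illustrative, not a complete solution), the two versions might start out like this:
tr -s ' ' '\n' < norm.cl > norm.simple
sed 's/\([.,:;!?]\)/\n\1/g' norm.simple > norm.token
The first command turns runs of spaces into newlines; the second moves common punctuation marks onto their own lines. Neither inserts the blank lines at sentence boundaries or handles abbreviations and numbers, which is where your own refinements come in.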
Look at:
- How many tokens do you get in each version of the normalization, and why?
- Use the diff command to look at the differences between the files. How many differences are there? You can run diff like this:
diff norm.simple norm.token > UTFIL
You can also use any other comparison tool that suits you for this task; try several and see which one you prefer.
When you have become familiar with your two versions, you can proceed to the stb-light normalization.
In your report:
Answer the questions above and also list the regular expressions you used for this task.
Tokenization and segmentation with stb-light
In this step, use stb-light. Make sure you are in the directory that contains stb-pipeline.py. Run the program with the following command:
./stb-pipeline.py --output-dir=OUTPUT --tokenized INFIL
OUTPUT is the name (including path) of your output, call it "norm.t", and INFIL is the name (including path) of the file you want to process: norm.cl. You will now get a tokenized file, norm.t.
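Concretely, the call might look like this (a sketch assuming norm.cl sits in the current directory alongside the script):
./stb-pipeline.py --output-dir=norm.t --tokenized norm.cl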
Compare the different tokenizations
You now have access to (at least) 4 different tokenized versions of the text material:
- norm.simple: your "passably tokenized" version.
- norm.token: the version you put a little more effort into.
- norm.t: stb-light's tokenization.
- facit.txt: the proof-read tokenization according to SUC 2.0.
Now compare the tokenization in norm.simple and norm.token with stb-light's tokenization (norm.t). For example, use wc and diff. Count the tokens in all the files (i.e., count rows with wc -l) and compare the numbers you get. Then compare the files with diff:
diff norm.token norm.t > UTFIL
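For instance, you can count rows in all versions at once (note that wc -l also counts the blank sentence-boundary lines, so keep that in mind when comparing):
wc -l norm.simple norm.token norm.t facit.txt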
Using egrep on the diff file, you can examine how often certain differences appear in stb-light's tokenization. Try to get an overview of which problems seem common and which are more unusual.
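As a small sketch, assuming UTFIL is the diff output from the command above (in diff's default format, lines starting with < come from the first file and lines starting with > from the second):
egrep -c '^<' UTFIL
egrep -c '^>' UTFIL
You can make the pattern more specific, for example egrep -c '^> ,$' UTFIL to count the places where stb-light put a comma on a line of its own and your version did not.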
If you want, you can try to improve your tokenization based on the insights you gained from the comparison.
Evaluate against a manually corrected tokenization
When you think you have improved your tokenization as far as you can, compare both your tokenization and stb-light's with the proof-read, manually corrected tokenization called "facit.txt".
In your report:
Examine the following files (norm.token, norm.t, facit.txt):
- Is the number of tokens the same in all files? If not, how big are the differences?
- How many sentences are there in the different versions? (See the sketch after this list for one way to count them.)
- Can you tell what the most recurrent differences in tokenization are? (Suggestion: pick one or two representative examples of differences to justify your explanation.)
- Can you tell what the most recurrent differences in sentence segmentation are? (Suggestion: pick one or two representative examples of differences to justify your explanation.)
- Can you tell which tokenization file has more errors?
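One way to count sentences, assuming each sentence boundary is marked by a blank line as specified above, is to count the empty lines:
grep -c '^$' norm.token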
Transducer exercise:
On this page you will find the questions about morphological analysis and transducers. Draw the transducer with your favorite drawing program and send a pdf version of it, or do it by hand on paper (make sure you give it to me, or put it in my department mailbox, with your name on it).
VG-task
Download and install the Snowball stemmer on your machine:
http://snowball.tartarus.org/download.php
Once installed, run it on the facit.txt file.
Command line:
stemwords -l swedish -i facit.txt -o outputFileName
Then create a sample of facit.txt with only the first 500 lines; to do so, you can run the command:
head -n 500 facit.txt > sampleFacit.txt
Lemmatize this extract with Granska:
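Once you have both outputs, a side-by-side view can help you spot recurrent differences. A sketch, assuming the hypothetical file names stems.txt for the Snowball output and lemmas.txt for the Granska output (both trimmed to the same 500 lines):
head -n 500 stems.txt > sampleStems.txt
paste sampleFacit.txt sampleStems.txt lemmas.txt | less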
In your report (for VG-task):
Compare the lemmas with the stems: what are the differences? Can you find 2 or 3 recurrent mistakes in the lemmatization/stemming? Do you have a hypothesis about why they happened?
The laboratory report is an individual task, i.e., you should write it alone. During the lab work itself, however, I encourage cooperation. Discuss possible solutions with your peers, but formulate the written report by yourself.
You will send your report, written in English, together with your exercise on transducers to marie.dubremetz@lingfil.uu.se. For writing your report, follow these instructions: report instructions (2-4 pages if no VG task is included, 3-5 pages if you do the VG task). Only the "pdf" format will be accepted. I will receive 15 different files, so I would appreciate it if you name the file after yourself (like in this example: "BjornSVENSSON_report.pdf"). You should also state your name inside the report itself. The deadline is April 23rd (Thursday).
Don't forget to include your transducer exercise, either as a pdf ("BjornSVENSSON_fst.pdf") or by hand in my department mailbox.