Document Type

Technical Report

Publication Date

2002-04-18

Filename

wucse-2002-18.pdf

Technical Report Number

WUCSE-2002-18

Abstract

This thesis describes an unsupervised system for learning natural language morphology, specifically suffix identification, from unannotated text. The system is language independent, so that it can learn the morphology of any human language. For English this means identifying “-s”, “-ing”, “-ed”, “-tion” and many other suffixes, in addition to learning which stems they attach to. The system uses no prior knowledge, such as part-of-speech tags, and learns the morphology simply by reading in a body of unannotated text. The system consists of a generative probabilistic model, which is used to evaluate hypotheses, and a directed search and a hill-climbing search, which are used in conjunction to find a highly probable hypothesis. Experiments applying the system to English and Polish are described.

Comments

Permanent URL: http://dx.doi.org/10.7936/K7RR1WMV
