ASIAN LINGUISTIC SUITE

Teragram accurately and efficiently handles the complex processing of Asian languages such as Japanese, Chinese, and Korean, each of which can be encoded in several formats. Teragram also provides services to Unicode-enable your software. Vendors of text-search, e-commerce search, and many other text application software, as well as Internet service providers, currently use various components of the Teragram Language suites to achieve completeness, accuracy, and speed.


 
 

PRODUCTS

Character Encoding Mapping
Morphological Stemming
Morphological Segmentation

        Chinese
        Japanese
        Korean


RELATED PRODUCTS

Language and Character Encoding Identification


OVERVIEW

Teragram offers a complete set of Asian language tools and libraries that handle the subtle complexities of Asian text processing. Tasks such as character encoding recognition and mapping, word tokenization, and morphological stemming are basic processing requirements for Asian languages such as Chinese, Japanese, and Korean. Asian languages are additionally complex because no single standard character encoding exists. For example, the number of characters used in Japanese is far greater than the 256 values that a single-byte encoding can represent. Therefore, any Japanese encoding system (EUC, Shift-JIS, Unicode, etc.) encodes at least some characters with two or more bytes.
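To make the byte-count point concrete, here is a minimal sketch that uses Python's standard codecs (not Teragram's software) to encode the same three-character Japanese string in several encodings. The sample string and the helper name `byte_lengths` are illustrative assumptions.

```python
# Illustrative only: relies on Python's built-in codecs, not on Teragram's toolkit.
SAMPLE = "日本語"  # three characters meaning "Japanese language"


def byte_lengths(text, encodings=("shift_jis", "euc_jp", "utf-16-le", "utf-8")):
    """Return a dict mapping each encoding name to the bytes needed for text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    for enc, size in byte_lengths(SAMPLE).items():
        print(f"{enc:10s} {size} bytes for {len(SAMPLE)} characters")
```

In every encoding listed, the three characters need more than three bytes, which is why a single-byte scheme cannot represent Japanese text.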

Teragram also provides tools to break texts into sequences of words rather than sequences of characters. This critical functionality is the first step toward any intelligent text processing in languages such as Chinese, Japanese, and Korean. In addition to providing information on word sequences, Teragram's segmenting tools can associate part-of-speech information with each word, further expediting the complex task of classifying Asian language input text.

Teragram offers solutions to the following challenges of processing Asian language texts:

First, there are numerous standards used to represent Asian characters in text. These standards are usually used implicitly, with no obvious way to identify which one a given text uses. Teragram language and character encoding recognition automatically recognizes the language and encoding used by any text.
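As a rough illustration of why encoding recognition is needed, the sketch below tries to decode the same bytes under several candidate encodings using Python's standard codecs. This is a naive trial-decoding approach chosen for brevity, not Teragram's recognition method; the candidate list and the name `plausible_encodings` are assumptions.

```python
# Naive trial decoding with Python's built-in codecs (not Teragram's method).
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "gb2312", "big5", "euc_kr")


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings under which data decodes without error."""
    accepted = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        accepted.append(enc)
    return accepted


if __name__ == "__main__":
    raw = "日本語".encode("euc_jp")
    print(plausible_encodings(raw))
```

Several candidates can accept the same bytes, so trial decoding only narrows the choice. A practical recognizer also needs statistical or linguistic evidence to pick the right one.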

Second, when processing text, it is critical to be able to map and unify all documents into a single encoding. Teragram character mapping software enables you to convert between any two encodings and also allows you to convert text into the portable and flexible Unicode representation.

Third, Asian language texts are written with little or no use of word separators. Chinese and Japanese text is written without any space separations, and Korean with limited space separations. Therefore, the first task of any information processing system is to segment the initial text into a sequence of words (this process is called word segmentation). This task is handled by Teragram word segmentation software.

And finally, there is a need for morphological analysis, which is especially apparent in the context of information retrieval. In English, relating a word like "children" to its root "child" is an obvious necessity. The importance of morphology, however, is even greater in languages like Chinese, Japanese, and Korean. Because Asian text is written with little or no space separation, the task of segmenting the initial text into a sequence of words is strongly related to the morphological analysis process. For example, recognizing that a given sequence of characters is a word usually means that the word has been recognized with a specific part-of-speech (such as Verb, Noun, etc.).



Character Encoding Mapping

Teragram provides solutions for the growing complexity of recognizing, manipulating, and converting numerous character encodings, including single-byte and multi-byte encodings. In particular, Teragram software can map texts into Unicode and handle Unicode documents using UTF-8, UCS-2, UCS-4, or other encodings. The Teragram Character Encoding Mapping Toolkit handles more than 200 language-character encodings (including Unicode, UTF-8, UCS-2, UCS-4, Shift-JIS, JIS, EUC, GB, extended GB, Big5, KSC, EBCDIC, ISO, Microsoft Code Page, IBM, Cyrillic, Latin-1, and MacOS, among many others) and allows mapping between any two of them.

Teragram's Character Encoding API is designed to meet three important requirements: speed, simplicity, and precision. First, the need for speed is obvious, and Teragram's extensive set of character mapping operations is built to be quick. Second, Teragram pushes simplicity to its limit: any character encoding mapping can be done with five functions. Two load and free data, one maps from an encoding to Unicode, one maps from Unicode to another encoding, and the last maps between any pair of encodings. Third, the encoding must be determined precisely. The pivot encoding in this API is the UCS-2 representation of Unicode. Teragram additionally provides a wide range of string manipulation tools for text in the Unicode Standard.
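The pivot design described above can be sketched with Python's standard codecs. This is not Teragram's API: the function names `to_unicode`, `from_unicode`, and `convert` are hypothetical, the pivot here is a Python Unicode string rather than UCS-2, and the load and free steps are omitted because Python manages its codec tables itself.

```python
# Hypothetical sketch of pivot-based mapping; names are not Teragram's API.


def to_unicode(data: bytes, source: str) -> str:
    """Map bytes in the source encoding to the Unicode pivot."""
    return data.decode(source)


def from_unicode(text: str, target: str) -> bytes:
    """Map the Unicode pivot to bytes in the target encoding."""
    return text.encode(target)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Map between any pair of encodings by going through the pivot."""
    return from_unicode(to_unicode(data, source), target)


if __name__ == "__main__":
    sjis = "日本語".encode("shift_jis")
    euc = convert(sjis, "shift_jis", "euc_jp")
    print(sjis.hex(), "->", euc.hex())
```

Routing every conversion through one pivot means each encoding needs only two mapping tables (to and from Unicode) instead of one table per pair of encodings.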


Morphological Stemming and Segmentation

Teragram provides Asian language word-segmentation and stemming software for Chinese, Japanese, and Korean at unmatched accuracy and speed. The importance of morphology is paramount in Asian languages such as Japanese, Korean, and Chinese. Chinese and Japanese text is written without space separations, and Korean with limited space separations. Therefore, the first task of any information processing system is to segment the initial text into a sequence of words (this process is called word segmentation). This task of breaking the input text into a sequence of words is strongly related to the morphological analysis process. For instance, recognizing that a given sequence of characters is a word usually means that the word has been recognized with a specific part-of-speech (such as Verb, Noun, etc.).


CHINESE

A fundamental problem in morphological analysis for Chinese information retrieval is that Chinese does not use spaces to mark word boundaries. An information processing system must therefore first break the original Chinese text into a series of words or phrases (a process called word segmentation) and then recognize each word or phrase as a particular part of speech, such as a noun or noun phrase, verb or verb phrase, or adjective or adjective phrase. Segmenting Chinese text into words is a very difficult task. Many characters form one-character words by themselves, but they can also form multi-character words when combined with other characters. Chinese words vary in length, and the same character may occur in many different words.

The complexity of Chinese segmentation is shown in the following examples. First, consider the following example:

一心一意

This means "single heart and single mind". It is a four-character word, also called a quadrisyllabic chengyu (idiom). The character "一" means "one" in English and occurs twice in this idiom. It can also stand alone as a one-character word, the Chinese numeral one. A second example follows:

中华人民共和国

This means "People's Republic of China". It is a seven-character word. Its components, such as 中华, 人民, and 共和国, are also multi-character words and can join other words to form other compounds in other contexts.

Teragram Chinese segmentation software uses extremely large dictionaries of various types to accurately and efficiently resolve the segmentation of Chinese text. Besides commonly used words, these dictionaries contain compounds, idioms, and company, personal, and product names, among many other kinds of entries.
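To illustrate how a dictionary can drive segmentation, the sketch below implements greedy longest-match segmentation (forward maximum matching), a simple textbook technique. It is not Teragram's algorithm, and the six-entry `LEXICON` is a toy assumption standing in for a very large dictionary.

```python
# Toy lexicon (an assumption); a real system would use a very large dictionary.
LEXICON = {"中华人民共和国", "中华", "人民", "共和国", "一心一意", "一"}


def segment(text, lexicon=LEXICON):
    """Split text into words by greedy longest match against the lexicon."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; an unknown single character
        # is emitted as its own word so the loop always advances.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


if __name__ == "__main__":
    print(segment("中华人民共和国"))
    print(segment("人民一心一意"))
```

Because the longest entry wins, the seven-character name is kept whole even though its parts are also in the lexicon. Greedy matching can still choose wrongly on ambiguous text, which is one reason practical systems combine dictionaries with further analysis.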


JAPANESE

Japanese, like Chinese, does not use spaces to mark word boundaries. As in Chinese, a character can form a one-character word by itself or, when combined with other characters, a multi-character word. This characteristic makes word segmentation difficult but critical. For example, consider the following input sentence:

Jon(John) ga(subject-marker) hon(book) wo(object-marker) katta(bought)
(John bought a book.)

Six words are found. Among these, the word "katta" has been recognized both as an individual segment and as the past tense of the verb "kau" (to buy). The word segmentation and stemming program, together with the library and its API, gives the programmer the ability to break any input sentence into a sequence of words, each associated with a part of speech and any applicable morphological features.
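The kind of output described here, where each word carries a part of speech, a stem, and morphological features, might be represented as in the following sketch. The `Token` type, the `annotate` function, and the toy lexicon built from the romanized gloss above are all hypothetical and do not reflect Teragram's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str          # the word as it appears in the text
    pos: str              # part of speech
    lemma: str            # stem or dictionary form
    features: tuple = ()  # morphological features, if any


# Toy lexicon (an assumption) keyed by lowercase romanized surface form.
TOY_LEXICON = {
    "jon": ("Noun", "jon", ()),
    "ga": ("Particle", "ga", ("subject-marker",)),
    "hon": ("Noun", "hon", ()),
    "wo": ("Particle", "wo", ("object-marker",)),
    "katta": ("Verb", "kau", ("past",)),
}


def annotate(words):
    """Attach part of speech, lemma, and features to already segmented words."""
    tokens = []
    for word in words:
        pos, lemma, feats = TOY_LEXICON.get(word.lower(), ("Unknown", word, ()))
        tokens.append(Token(word, pos, lemma, feats))
    return tokens


if __name__ == "__main__":
    for token in annotate("Jon ga hon wo katta".split()):
        print(token)
```

Relating "katta" to the lemma "kau" is what lets a search for the verb "to buy" match its inflected forms.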


KOREAN

Along with Chinese and Japanese, Korean ranks among the most complex languages to analyze linguistically. In fact, many problems that appear only at the syntax level for languages like English are already present within Korean morphology. In English, a verb typically takes at most five forms (for example, see, sees, saw, seen, seeing), nouns are even simpler (only singular and plural), and everything else is invariable. This means that English morphology consists of relating a small number of word forms to each paradigm (root form).

In Korean, however, a verb or a noun, for example, can be analyzed as follows:

e.g., must have seen --->
   (to see) + (past) + (have) + (must) + (sentence ending)

e.g., from under the table --->
   (table) + (under) + (from)

The complexity inherent in the Korean language makes morphological analysis (stemming) both difficult and very important. It is difficult because no dictionary could practically list every possible segment (this would be tantamount to listing all the noun groups in English), and it is crucial because finding all the occurrences of a noun means that all the noun groups in which it appears have been correctly analyzed.
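Since listing every possible segment is impractical, one simple alternative is to store stems and suffixes separately and strip suffixes at analysis time. The sketch below does this for a different, hypothetical example, 학교에서는 ("at school", with a topic marker). The tiny `STEMS` and `SUFFIXES` tables and the `analyze` function are assumptions for illustration, not Teragram's dictionaries or grammar.

```python
# Toy tables (assumptions); real analyzers use large dictionaries and grammars.
STEMS = {"학교": "school"}
SUFFIXES = {"에서": "at/from", "는": "topic", "도": "also", "를": "object", "가": "subject"}


def analyze(word):
    """Strip known suffixes from the right until a known stem remains.

    Returns (stem, suffixes) with the suffixes in their original order.
    """
    found = []
    while word not in STEMS:
        # Prefer longer suffixes so that a two-syllable particle is not
        # mistaken for a shorter one.
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                found.append(suffix)
                word = word[:-len(suffix)]
                break
        else:
            break  # nothing more can be stripped
    return word, found[::-1]


if __name__ == "__main__":
    stem, suffixes = analyze("학교에서는")
    print(stem, suffixes)
```

With this split, one stem entry covers every suffix combination it can appear with, so a search for the noun finds all of its occurrences.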

Teragram's unique combination of extremely large dictionaries and morpho-syntactic grammars solves Korean morphological analysis and word segmentation at an unsurpassed level of accuracy and speed. Teragram's Korean dictionaries and grammars contain tens of millions of entries and are optimized for very fast processing.



Copyright © Teragram Corporation