languagemodel-slavic

Binaries are available at Maven Central

Please follow this link for project documentation.

Stemming engines available

Hunspell

Dictionary locations:

  • /usr/share/hunspell
  • /usr/local/share/hunspell

On Mac OS X, which also uses hunspell for spell checking, dictionaries may additionally be found in:

  • /System/Library/Spelling
  • /Library/Spelling
  • ~/Spelling
  • /opt/local/share/hunspell (if MacPorts is installed)
  • /sw/share/hunspell (if Fink is installed)
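The directories listed above could be probed roughly as follows. This is only a sketch: the class and method names are hypothetical, and it assumes that a usable dictionary consists of a `<name>.dic` file with a matching `<name>.aff` file in the same directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of this project's API.
final class HunspellDictionaryLocator {
    /** Directories from the lists above, in the order they are given. */
    static List<Path> defaultSearchPath() {
        String home = System.getProperty("user.home");
        List<Path> dirs = new ArrayList<>();
        dirs.add(Paths.get("/usr/share/hunspell"));
        dirs.add(Paths.get("/usr/local/share/hunspell"));
        // Mac OS X locations; on other systems they simply do not exist.
        dirs.add(Paths.get("/System/Library/Spelling"));
        dirs.add(Paths.get("/Library/Spelling"));
        dirs.add(Paths.get(home, "Spelling"));
        dirs.add(Paths.get("/opt/local/share/hunspell")); // MacPorts
        dirs.add(Paths.get("/sw/share/hunspell"));        // Fink
        return dirs;
    }

    /** Returns the first directory that holds both name.dic and name.aff. */
    static Optional<Path> locate(String name, List<Path> dirs) {
        for (Path dir : dirs) {
            if (Files.isRegularFile(dir.resolve(name + ".dic"))
                    && Files.isRegularFile(dir.resolve(name + ".aff"))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "ru_RU" is just an example dictionary name.
        String name = args.length > 0 ? args[0] : "ru_RU";
        Optional<Path> dir = locate(name, defaultSearchPath());
        System.out.println(dir.map(d -> name + " found in " + d)
                              .orElse(name + " not found"));
    }
}
```

Missing directories are skipped silently, so the same search path can be used on Linux and Mac OS X.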

Morphological analysis

If the dictionaries provide the necessary data, hunspell can also perform morphological analysis; see the hunspell(4) man page, section "Optional data fields".

Java API status

Two projects are available: one is based on JNA, the other on BridJ.

Mac OS X status

Although Mac OS X itself relies on hunspell for spell checking and supports third-party hunspell dictionaries, its Objective-C API supports neither stemming nor morphological analysis (see the NSSpellChecker class reference).

Mac OS X Java API

"Mac OS X for Java Geeks" (Chapter 11, "The Mac OS X Spelling Framework") refers to the com.apple.spell.ui Java package, but the book was published in 2003 and covers Mac OS X 10.2 and JDK 1.4. That package is missing from the Mac OS X 10.9 distribution. The Apple-shipped Java packages are instead:

  • apple.applescript
  • apple.awt
  • apple.keychain (JDK 1.4 only)
  • apple.laf
  • apple.launcher
  • apple.security
  • apple.util
  • com.apple.concurrent
  • com.apple.crypto
  • com.apple.dnssd
  • com.apple.eawt
  • com.apple.eio
  • com.apple.java
  • com.apple.jobjc (notably, it contains the com.apple.jobjc.appkit.NSSpellChecker and com.apple.jobjc.foundation.NSSpellServer classes)
  • com.apple.laf
  • com.apple.mrj
  • com.apple.resources

seman by aot.ru

Stemming and morphological analysis (Linux)

$ for w in 'друг' 'друзья' 'люди' 'какая'; do echo $w; done | iconv -t CP1251 | ./TestLem russian  | iconv -f CP1251
Loading..
Input a word..
+ ДРУГ С од мр,им,ед 147889 ДРУ'Г
+ ДРУГ С од мр,им,мн 147889 ДРУЗЬЯ'
+ ЧЕЛОВЕК С од мр,им,мн 135031 ЛЮ'ДИ
+ КАКАТЬ ДЕЕПРИЧАСТИЕ нп,нс дст,нст 151931 КА'КАЯ	+ КАКОЙ МС-П  но,од,жр,им,ед 148987 КАКА'Я
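To consume this output from Java, each line could be split into fields as sketched below (Java 16+ for records). The field meanings are inferred from the sample above only and are assumptions: a leading `+` or `-` flag (presumably "found in the dictionary" versus "predicted"), the lemma, the part of speech, one or two comma-separated grammeme groups, a numeric paradigm id, and the accented word form. Alternative analyses of one word appear to be separated by a tab. The class and record names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for the TestLem output format shown above.
final class TestLemOutputParser {
    record Analysis(boolean found, String lemma, String partOfSpeech,
                    List<String> grammemes, int paradigmId, String accentedForm) {}

    /** Parses one output line; banner lines such as "Loading.." yield an empty list. */
    static List<Analysis> parseLine(String line) {
        List<Analysis> result = new ArrayList<>();
        for (String chunk : line.split("\t")) {          // one chunk per alternative
            String[] t = chunk.trim().split("\\s+");
            if (t.length < 5 || !(t[0].equals("+") || t[0].equals("-"))) {
                continue;                                 // not an analysis
            }
            List<String> grammemes = new ArrayList<>();
            for (int i = 3; i < t.length - 2; i++) {      // everything between POS and id
                grammemes.addAll(Arrays.asList(t[i].split(",")));
            }
            result.add(new Analysis(t[0].equals("+"), t[1], t[2], grammemes,
                    Integer.parseInt(t[t.length - 2]), t[t.length - 1]));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Expects UTF-8 input, i.e. the output of the final iconv in the pipeline above.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            in.lines().map(TestLemOutputParser::parseLine).forEach(System.out::println);
        }
    }
}
```

The parser flattens the grammeme groups into a single list; if the distinction between the groups matters, they would have to be kept separate.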

Syntax analysis (Linux)

$ echo 'Варкалось, хливкие шорьки пырялись по наве' | iconv -t CP1251 | ./TestSynan russian | iconv -f CP1251
ok
sentences count: 1
sentences count: 1
<chunk>
<input>Варкалось, хливкие шорьки пырялись по наве</input>
<sent>
	<synvar>
		<clause type="ГЛ_ЛИЧН">Варкалось , хливкие шорьки пырялись по наве</clause>
		<group type="ПРИЛ_СУЩ">хливкие шорьки</group>
		<group type="ОДНОР_ИГ">Варкалось , хливкие шорьки</group>
		<group type="ПГ">по наве</group>
	</synvar>
	<rel name="ПРИЛ_СУЩ" gramrel="вн,им,мн," lemmprnt="ШОРЕК" grmprnt="но,мр,вн,им,мн," lemmchld="ХЛИВКИЙ" grmchld="но,од,вн,им,мн," > шорьки -> хливкие </rel>
	<rel name="ПГ" gramrel="пр," lemmprnt="ПО" grmprnt="" lemmchld="НАВ" grmchld="но,мр,пр,ед," > по -> наве </rel>
	<rel name="ОДНОР_ИГ" gramrel="вн,им,мн," lemmprnt="," grmprnt="" lemmchld="ВАРКАЛОСЬ" grmchld="но,ср,жр,мр,пр,тв,вн,дт,рд,им,ед,мн," > , -> Варкалось </rel>
	<rel name="ОДНОР_ИГ" gramrel="вн,им,мн," lemmprnt="," grmprnt="" lemmchld="ШОРЕК" grmchld="но,мр,вн,им,мн," > , -> шорьки </rel>
	<rel name="ПОДЛ" gramrel="" lemmprnt="ПЫРЯТЬСЯ" grmprnt="дст,нп,нс,прш,мн," lemmchld="ВАРКАЛОСЬ" grmchld="но,ср,жр,мр,пр,тв,вн,дт,рд,им,ед,мн," > пырялись -> Варкалось </rel>
</sent>
</chunk>
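The `<chunk>` part of this output can be read with the JDK's built-in DOM parser. The sketch below (Java 16+) extracts the `rel` elements; it assumes the tool's XML is well formed, which holds for the sample above but is not verified for input containing characters such as `&` or `<`. The class and record names are hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for the TestSynan output shown above.
final class SynanRelations {
    record Relation(String name, String parentLemma, String childLemma) {}

    static List<Relation> parse(String output) throws Exception {
        // Skip the non-XML preamble ("ok", "sentences count: ...").
        int start = output.indexOf("<chunk>");
        int end = output.lastIndexOf("</chunk>");
        if (start < 0 || end < start) {
            return List.of();
        }
        // Wrap in a synthetic root so that several chunks still form one document.
        String xml = "<root>" + output.substring(start, end + "</chunk>".length()) + "</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList rels = doc.getElementsByTagName("rel");
        List<Relation> result = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            result.add(new Relation(rel.getAttribute("name"),
                    rel.getAttribute("lemmprnt"), rel.getAttribute("lemmchld")));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Expects the UTF-8 output of the pipeline above on standard input.
        String output = new String(System.in.readAllBytes(), StandardCharsets.UTF_8);
        parse(output).forEach(System.out::println);
    }
}
```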

mystem by Yandex

Setting up

Version 2.1 for Mac OS X is incorrectly linked against /usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib:

$ otool -L mystem
mystem:
	/usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.17.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

and dumps a core when run. The problem can be fixed with install_name_tool:

$ install_name_tool -change /usr/local/Cellar/gcc47/4.7.2/gcc/lib/libstdc++.6.dylib /usr/lib/libstdc++.6.dylib mystem
$ otool -L mystem
mystem:
	/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.17.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

Invocation example:

$ echo -e 'какая\nдрузья\nлюди\nваркалось\nхливкие\nшорьки\nглокая\nкуздра' | ./mystem -n -e utf-8 -i -l
какать=V,несов,нп=непрош,деепр|какой=APRO=им,ед,жен
друг=S,муж,од=им,мн
человек=S,муж,од=им,мн
варкаться?=V,несов,нп=прош,ед,изъяв,сред
хливкий?=A=им,мн,полн|?=A=вин,мн,полн,неод
шорька?=S,жен,неод=им,мн|?=S,жен,неод=род,ед|?=S,жен,неод=вин,мн
глокать?=V,несов,нп=непрош,деепр|глокий?=A=им,ед,полн,жен
куздра?=S,ед,жен,неод=им|куздра?=S,гео,жен,неод=им,ед
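The line format above can be parsed along these lines (Java 16+). The interpretation is inferred from the sample only and is an assumption: alternatives are separated by `|`, each alternative has the shape `lemma=lexeme tags=form tags`, a trailing `?` on the lemma appears to mark a word that is not in the dictionary, and an alternative that starts with a bare `?` reuses the lemma of the preceding one. The class and record names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the mystem output format shown above.
final class MystemOutputParser {
    record Analysis(String lemma, boolean guessed,
                    List<String> lexemeTags, List<String> formTags) {}

    static List<Analysis> parseLine(String line) {
        List<Analysis> result = new ArrayList<>();
        String previousLemma = "";
        for (String alternative : line.trim().split("\\|")) {
            if (alternative.isEmpty()) {
                continue;
            }
            String[] parts = alternative.split("=", 3);
            boolean guessed = parts[0].endsWith("?");
            String lemma = guessed
                    ? parts[0].substring(0, parts[0].length() - 1)
                    : parts[0];
            if (lemma.isEmpty()) {
                lemma = previousLemma;   // bare "?": same lemma as the previous alternative
            }
            previousLemma = lemma;
            result.add(new Analysis(lemma, guessed, tags(parts, 1), tags(parts, 2)));
        }
        return result;
    }

    /** Splits the comma-separated tag group at the given index, if it is present. */
    private static List<String> tags(String[] parts, int index) {
        if (index >= parts.length || parts[index].isEmpty()) {
            return List.of();
        }
        return List.of(parts[index].split(","));
    }

    public static void main(String[] args) throws IOException {
        // Expects mystem output in UTF-8 (the -e utf-8 option used above).
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            in.lines().map(MystemOutputParser::parseLine).forEach(System.out::println);
        }
    }
}
```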

Apache Lucene

Apache Lucene contains a Java port of the C++ hunspell API; see the API documentation.

LanguageTool

Feature comparison

Human languages support

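| Product | Russian | Ukrainian | English | German | Morphological analysis | Syntax analysis |
| --- | --- | --- | --- | --- | --- | --- |
| hunspell | yes | yes | yes | yes | yes (if supported by dictionaries) | no |
| seman | yes | no | yes | yes | yes | yes |
| mystem | yes | no | no | no | yes | no |
| LanguageTool | yes | yes | yes | yes | yes | no |
| Lucene | ? | ? | ? | ? | ? | no |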

Programming languages support

| Product | C++ | Java |
| --- | --- | --- |
| hunspell | yes | yes |
| seman | yes | no |
| mystem | yes | no |
| LanguageTool | no | yes |
| Lucene | no | yes |

OS support

| Product | Windows | Linux | Mac OS X |
| --- | --- | --- | --- |
| hunspell | yes | yes | yes |
| seman | yes | yes | no |
| mystem | yes | yes | yes |
| LanguageTool | yes | yes | yes |
| Lucene | yes | yes | yes |

License

| Product | License | Can be distributed with Caché? |
| --- | --- | --- |
| hunspell | GPL/LGPL/MPL | yes |
| seman | LGPL | yes |
| mystem | non-commercial | no |
| LanguageTool | LGPL | yes |
| Lucene | Apache License | yes |

Native (C++) implementation

For the C++ implementation, it is possible to link against hunspell, seman, or mystem and return the results of morphological analysis as a JSON object using Boost Property Tree.

The ZWARRAYP type can be used to pass strings to and from Caché.