Currently, this project scrapes Wiktionary to return dictionary entries, and stores data about parts of speech and verb inflections. The ultimate goal is to use this data to build an automatic syntax tree generator.
Currently, there is one main module:

- `dictsearch.py`
This project was tested and developed on Python 3.10.2. The following libraries must be installed:

- Beautiful Soup 4 - used by the scraper in `dictsearch.py`
- (tentative) Natural Language Toolkit (NLTK) - used by `psr.py` and `parser.py`.
Unfortunately, I can't make any promises regarding which versions of Python 3.x this project targets. (tentative): I haven't tested thoroughly on Python 3.7, but it is the minimum Python version supported by NLTK.
Aims to provide an extremely basic interface to scrape English language dictionary entries from Wiktionary.
The class constructor automatically does the scraping in the background, so the interface consists solely of the constructor and a few instance variables. As such, you'll want to make a new instance of the class when you want to search a new word, instead of modifying the instance variables.
Searches for `word` on Wiktionary, scrapes the data if an English entry exists, and stores a dictionary entry along with part-of-speech and verb inflection data.

- If a page for `word` does not exist on Wiktionary, a `PageNotFound` exception is raised.
- If a page for `word` exists, but an English entry for it does not, a `NoEnglishEntry` exception is raised.
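The English-entry check at the heart of this flow can be sketched with Beautiful Soup. This is not the actual `dictsearch.py` code: the selector, the exception class bodies, and the HTML snippet below are all assumptions standing in for a live Wiktionary fetch (MediaWiki pages conventionally anchor each language section with an `id="English"` headline, though the exact markup may differ).

```python
# Hedged sketch of the English-section check; not the project's real scraper.
from bs4 import BeautifulSoup


class PageNotFound(Exception):
    """Raised when no Wiktionary page exists for the word (assumed definition)."""


class NoEnglishEntry(Exception):
    """Raised when the page exists but has no English section (assumed definition)."""


def has_english_entry(html: str) -> bool:
    # Look for the anchor MediaWiki typically attaches to a language heading.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id="English") is not None


sample = '<h2><span class="mw-headline" id="English">English</span></h2>'
print(has_english_entry(sample))                      # True
print(has_english_entry("<p>no sections here</p>"))   # False
```

In the real module, a `False` result here would presumably be the point where `NoEnglishEntry` is raised.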
`set` class attribute that holds the parts of speech relevant for forming syntax trees. If a word has a part of speech that does not belong to `DictSearch.POSList` but is included in `DictSearch._POSList`, `.entry` will include the definition, but this library does not intend to automatically generate phrase structure rules for `DictSearch` instances with an empty `.POS`.
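The filtering idea behind `POSList` can be sketched as follows. The set contents and the function name here are invented for illustration; the actual members of `DictSearch.POSList` may differ.

```python
# Hedged sketch of POS filtering; TREE_POS stands in for DictSearch.POSList.
TREE_POS = {"noun", "verb", "adjective", "adverb", "determiner", "preposition"}


def relevant_pos(scraped_pos: set[str]) -> set[str]:
    # Keep only tags useful for phrase structure rules. A word whose
    # intersection is empty would still get a definition in .entry,
    # but no rules would be generated for it.
    return scraped_pos & TREE_POS


print(relevant_pos({"noun", "interjection"}))  # {'noun'}
print(relevant_pos({"interjection"}))          # set()
```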
`str` instance variable that holds the word used to initialize this instance.

`str` instance variable that holds the dictionary entry on Wiktionary for the word.

`set` instance variable that holds the parts of speech associated with `.word` in English.

`bool` instance variable that is `True` if `.word` is an auxiliary/modal verb and is `False` otherwise.

`bool` instance variable that is `True` if `.word` is a plural noun and is `False` otherwise.

`str` instance variable that holds the root form of `.word` if it is a verb inflected for tense/aspect and is `False` otherwise.

`str` instance variable that holds the root form of `.word` if it is a noun inflected for plurality and is `False` otherwise.

`set` instance variable that holds any verb inflections associated with `.word` in English and is an empty `set` if `.word` is not an inflected verb.
(tentative): Holds the phrase structure rules later imported by `parser.py`. This module expands the functionality offered by NLTK's implementation of context-free grammars in `nltk.grammar.CFG` to demonstrate phrasal and head movement.
My implementation may be hacky, and from a (computational) linguistics perspective there are a number of potential complaints about it. For one, I violate the recursive spirit of Chomsky's generative syntax approach by defining multiple specialized phrase structure rules for, say, subject DPs/TPs/CPs. Second, to the best of my knowledge, CFGs are unable to account for phrasal/head movement and subject/object control structures, so there may be fundamental concerns with using CFGs here in the first place. Unfortunately, due to the limits of my theoretical knowledge, this is a case of the ends taking precedence over the means.
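To make the "specialized rules" complaint concrete, here is a minimal plain-Python illustration of phrase structure rules, including a non-recursive subject rule of the kind described above. This is not the project's `psr.py`: the rule names, lexicon, and expansion strategy are all invented for the example.

```python
# Hedged illustration of specialized phrase structure rules (not psr.py).
# DP_subj is a dedicated subject rule, violating the fully recursive
# spirit of generative syntax, as discussed in the text.
RULES = {
    "TP": [["DP_subj", "T'"]],
    "T'": [["T", "VP"]],
    "DP_subj": [["D", "NP"]],  # specialized, non-recursive subject DP
    "NP": [["N"]],
    "VP": [["V"]],
}
LEXICON = {"D": ["the"], "N": ["cat"], "T": ["will"], "V": ["sleep"]}


def expand(symbol: str) -> list[str]:
    # Expand a nonterminal depth-first, always taking the first production;
    # terminals are drawn from the toy lexicon.
    if symbol in LEXICON:
        return [LEXICON[symbol][0]]
    words: list[str] = []
    for child in RULES[symbol][0]:
        words.extend(expand(child))
    return words


print(" ".join(expand("TP")))  # the cat will sleep
```

Note that nothing in this formalism relates `DP_subj` to a moved or controlled position elsewhere in the tree, which is exactly the movement limitation of CFGs mentioned above.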
Given a target configuration, aims to output potential parse trees for an input phrase/sentence.
I'm wondering whether it will be possible to show raising/control structures if I maintain a list of words that trigger this behavior; until I get the basic functionality working, I can't say whether this will pan out.