SearchInMemory is research on full-text (fault-tolerant) search

RaVbaker/search_in_memory_php

SearchInMemory as research on full-text/fault-tolerant search

SearchInMemory is my first take on how full-text search engines, or fault-tolerant searches (FTS), work and how they should work. It began as research into how stuff works.

It was also a test: is it hard to write your own full-text search engine?

You may consider it an experiment to uncover how full-text search engines work.

I also tried a BinaryTree implementation of the index, but it did not work as well for me as HashIndex.
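
To illustrate the hash-based approach, here is a minimal sketch of an inverted index kept in a hash map. It is written in Python for brevity and is not the HashIndex code from this project; the names `tokenize`, `build_index` and `search` and the sample documents are assumptions made up for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data, only for illustration.
documents = {
    1: "Billy Bob plays guitar",
    2: "Bob builds search engines",
    3: "Billy reads about inverted indexes",
}
index = build_index(documents)
print(search(index, "billy bob"))
```

A term lookup in a hash map takes constant time on average, while a balanced binary tree needs a logarithmic number of comparisons. The tree keeps terms sorted, which would help with prefix or range queries, so the trade-off depends on which queries matter.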

Even though I finished a really large part of the code, there is still room for improvement (you can use this as a roadmap for your own FTS), such as:

- n-gram indexing and searching with Levenshtein sorting
- wildcard searching
- excluding some phrases from results
- caching generated indexes in some memory-based structure on disk
- stemming words to their base forms
- an improved and more complex way to do faceting
- a socket connector to the searcher at the Unix level
- updating the index and deleting particular records
- whole phrases in " " signs, like "billy bob", to match exactly that phrase (requires improving the inverted indexes in HashIndex)
- the possibility to import/export index data in JSON/XML/CSV formats
- tweaking results based on special criteria or queries
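
As a starting point for the first roadmap item, here is a minimal sketch of n-gram indexing with Levenshtein sorting. It is written in Python for brevity and is not part of this project. The function names, the `$` padding marker, the trigram size and the `max_distance` cutoff are all assumptions chosen for the example.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Return the character n-grams of a word padded with '$' boundary markers."""
    padded = "$" + word.lower() + "$"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_ngram_index(words, n=3):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def fuzzy_search(index, query, n=3, max_distance=2):
    """Collect words sharing at least one n-gram with the query, then keep
    those within max_distance edits, sorted by edit distance."""
    candidates = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    scored = [(levenshtein(query.lower(), word.lower()), word) for word in candidates]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


# Hypothetical vocabulary, only for illustration.
words = ["search", "searching", "research", "memory", "march"]
index = build_ngram_index(words)
print(fuzzy_search(index, "serch"))  # ['search', 'march']
```

The n-gram index only narrows the candidates cheaply; the edit distance is then computed for that short list instead of the whole vocabulary.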

Cheers,
Rafal "RaVbaker" Piekarski

Contact:
web: http://about.me/ravbaker
twitter: ravbaker
github: https://github.com/RaVbaker


Great starting points for your own research:
- http://en.wikipedia.org/wiki/Levenshtein_distance - the minimal knowledge needed for comparing similar words
- http://en.wikipedia.org/wiki/Inverted_index - a good start for building indexes, especially full inverted indexes
- http://en.wikipedia.org/wiki/N-gram - N-grams, what they are and why they matter
- http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html - a quite old post from the Google Research department about n-grams in practice, with a large dataset available.
- http://ngrams.googlelabs.com/ - a practical use of n-grams: the Books Ngram Viewer from Google.
- http://today.java.net/pub/a/today/2005/08/09/didyoumean.html - a great article about how "Did you mean" works with Lucene. A very inspiring post, but mainly about Java.
- http://framework.zend.com/code/filedetails.php?repname=Zend+Framework&path=%2Ftrunk%2Flibrary%2FZend%2FSearch%2FLucene.php - source code from Zend Framework with their PHP implementation of Lucene. A nice source of ideas.
- http://www.ir.uwaterloo.ca/book/ - A book for when you think BIG. It's about building your own full-text search service that scales almost like Google/Bing. Lots of theory, C/C++ code and algorithms. To begin with, I suggest reading an excerpt from chapter 4, Static inverted indices: http://www.ir.uwaterloo.ca/book/04-static-inverted-indices.pdf

- Helpful PHP functions: http://www.php.net/manual/en/function.levenshtein.php, http://www.php.net/manual/en/function.metaphone.php, http://php.net/manual/en/function.soundex.php, http://docs.php.net/manual/en/language.types.array.php :)
