Document Indexer

A convenient Swifty wrapper for Apple's Search Kit.

Search Kit is Apple's content indexing and searching solution. It is widely used in OS X, for example in System Preferences, Address Book, Help Viewer, Xcode, and Mail; even Spotlight is built on top of it. Search Kit features:

  • Fast indexing and asynchronous searching
  • Google-like query syntax, including phrase-based, prefix/suffix/substring, and Boolean searching
  • Text summarization
  • Control over index characteristics, like minimum term length, stopwords, synonyms, and substitutions
  • Flexible management of document hierarchies and indexes
  • Unicode support
  • Relevance ranking and statistical analysis of documents
  • Thread-safe

The goal of Document Indexer is to simplify working with the Core Foundation-based Search Kit in Swift by giving it a more Swift-friendly interface. It provides:

  • In-memory (for lightning-fast search) and on-disk (for persistent storage) thread-safe text document indexers with all the functionality provided by Apple's Search Kit
  • Option to automatically use standard stopwords lists (custom stopwords can be provided too)
  • Auto-flushing capability, etc

Usage

Creating an in-memory document index

import DocumentIndexer
...
// Create an inverted index (by default)
let indexer = InMemoryDocumentIndexer()
// Or create an inverted vector index, for example
let vectorIndexer = InMemoryDocumentIndexer(indexType: .invertedVector)

For the details on search indexes, refer to Search Basics.

Creating a persistent (on-disk) document index

import DocumentIndexer
...
let fileURL = "file:/INDEX_STORAGE_PATH"
// Create an inverted index (by default)
let indexer = PersistentDocumentIndexer(creatingAtURL: fileURL)
// Or create an inverted vector index, for example
let vectorIndexer = PersistentDocumentIndexer(creatingAtURL: fileURL, indexType: .invertedVector)

For the details on search indexes, refer to Search Basics.

Opening a persistent (on-disk) document index that already exists

import DocumentIndexer
...
let fileURL = "file:/INDEX_STORAGE_PATH"
let indexer = PersistentDocumentIndexer(openingAtURL: fileURL)

Creating a document URL object

import DocumentIndexer
...
let documentURLObject = DocumentURL(URL(string: ":document-name")!)! // where "document-name" is an arbitrary document identifying string adhering to the URI syntax

Note: DocumentURL is simply a wrapper around SKDocument.

Indexing a document explicitly

import DocumentIndexer
...
let documentURLObject = DocumentURL(URL(string: ":document-name")!)!
let documentTextualContent = "Lorem ipsum ..."
try indexer.indexDocument(at: documentURLObject, withText: documentTextualContent)
// Commit all in-memory changes to backing store
try indexer.flush() 

The last call explicitly flushes the index state to the backing store. Which flushing strategy is appropriate depends on your implementation (see Index flushing).

Indexing a file document

import DocumentIndexer
...
let textContentFileURL = URL(string: "file:/FILE_PATH")!
let fileURLObject = FileDocumentURL(textContentFileURL)!
try indexer.indexFileDocument(at: fileURLObject)
// Commit all in-memory changes to backing store
try indexer.flush() 

The last call explicitly flushes the index state to the backing store. Which flushing strategy is appropriate depends on your implementation (see Index flushing).

Removing a document from an index

import DocumentIndexer
...
let documentURLObject = DocumentURL(URL(string: ":document-name")!)!
try indexer.removeDocument(at: documentURLObject)

Searching

Document Indexer provides two ways of searching: sequence-based search and completion-based search. It is good practice not to present all search results at once, but to deliver them gradually, in blocks. For that reason, Search Kit's search is block-oriented.

Document Indexer lets you specify the number of hits in a block with the hitsAtATime parameter. You can also provide search options and a maximum search time.

Hits (or hit objects) are represented by the SearchHit struct, which contains documentURL, a document URL object associated with the original document, and score, a non-normalized relevance score for the hit.
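
Because the score is not normalized, it is often scaled before being shown to a user or compared across hits. The following is a minimal sketch of one common approach, dividing every score by the highest score in the result set. The Hit struct is a hypothetical stand-in for SearchHit, and the Float score type is an assumption.

```swift
import Foundation

// Hypothetical stand-in for the library's SearchHit; the field types are assumptions.
struct Hit {
    let documentURL: URL
    let score: Float
}

/// Scales raw scores into 0...1 by dividing each one by the largest score in the set.
func normalizedScores(of hits: [Hit]) -> [Float] {
    guard let top = hits.map({ $0.score }).max(), top > 0 else {
        // No hits, or no positive score to scale against.
        return hits.map { _ in 0 }
    }
    return hits.map { $0.score / top }
}

let sampleHits = [
    Hit(documentURL: URL(fileURLWithPath: "/tmp/a.txt"), score: 8),
    Hit(documentURL: URL(fileURLWithPath: "/tmp/b.txt"), score: 2),
]
let scaled = normalizedScores(of: sampleHits)
```

With this scheme the best hit always gets 1.0, so the values are only comparable within a single result set.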

For query format description see Search Kit - Queries.

Sequence-based search

import DocumentIndexer
...
for hits in indexer.makeSearch(for: "foo bar", hitsAtATime: 100) {
    hits.forEach { print("\($0.documentURL) \($0.score)") }
}

The makeSearch method returns a searcher sequence that provides search result hits in hitsAtATime-sized blocks of hit objects for the given query string. If you don't need the search results broken into blocks, the following one-liner demonstrates getting a search result's hits all at once:

let allHits = indexer.makeSearch(for: "foo bar").reduce([], +)

The searcher sequence returned by the makeSearch method supports lazy evaluation: when used in a lazy context, it performs the actual search for the next block of hits only on demand.
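
To illustrate what on-demand block production means, here is a self-contained sketch. BlockSearch and BlockCounter are hypothetical stand-ins, not the library's types; they only mimic a sequence that computes each block when the consumer asks for it.

```swift
// Counts how many blocks were actually computed (hypothetical helper).
final class BlockCounter { var produced = 0 }

// Hypothetical stand-in for a block-producing searcher; not the library's type.
struct BlockSearch: Sequence, IteratorProtocol {
    private var pending: ArraySlice<String>
    private let blockSize: Int
    private let counter: BlockCounter

    init(results: [String], blockSize: Int, counter: BlockCounter) {
        self.pending = results[...]
        self.blockSize = blockSize
        self.counter = counter
    }

    // Each call simulates one search step that yields the next block of hits.
    mutating func next() -> [String]? {
        guard !pending.isEmpty else { return nil }
        let block = Array(pending.prefix(blockSize))
        pending = pending.dropFirst(blockSize)
        counter.produced += 1
        return block
    }
}

let counter = BlockCounter()
let search = BlockSearch(results: ["a", "b", "c", "d", "e"], blockSize: 2, counter: counter)

var firstPage: [String] = []
for block in search {
    firstPage = block
    break // stop after the first block; later blocks are never computed
}
```

After the loop, only one block has been produced, which is the behavior you want when showing the first page of results.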

Completion-based search

import DocumentIndexer
...
indexer.search(for: "foo bar", hitsAtATime: 100) { hits, hasMore, shouldStop in
    hits.forEach { print("\($0.documentURL) \($0.score)") }
}

The search completion closure receives the results in the form of a hit object array.

Search Kit is thread-safe and was designed with asynchronous use in mind, so dispatching the search onto a background DispatchQueue is the way to go.
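
A minimal sketch of that pattern follows. The runInBackground helper is hypothetical and not part of Document Indexer; the work closure stands in for a blocking search call, and the result is handed to a callback queue (the main queue by default, as a UI would need).

```swift
import Foundation

/// Runs `work` on a background queue and passes its result to `completion` on `callbackQueue`.
func runInBackground<T>(
    work: @escaping () -> T,
    callbackQueue: DispatchQueue = .main,
    completion: @escaping (T) -> Void
) {
    DispatchQueue.global(qos: .userInitiated).async {
        let result = work()
        callbackQueue.async { completion(result) }
    }
}

// The semaphore only keeps this script alive until the callback fires;
// an app with a running main loop would not need it.
let finished = DispatchSemaphore(value: 0)
let resultsQueue = DispatchQueue(label: "search.results")

runInBackground(
    work: { ["doc-1", "doc-2"] }, // stand-in for the blocking search
    callbackQueue: resultsQueue
) { hits in
    print("Received \(hits.count) hits")
    finished.signal()
}
finished.wait()
```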

Text analysis properties

Currently, 8 text analysis properties are available. They affect aspects of indexing such as phrase-search support, index size, and search efficiency, and they are supplied to the index at creation time. Document Indexer groups these properties in the TextAnalysisProperties struct. The struct can be customized right at the declaration site: its customized method takes a closure in which you modify the properties. For example:

import DocumentIndexer
...
let indexer = InMemoryDocumentIndexer(textAnalysisProperties: TextAnalysisProperties().customized({
    $0.minTermLength = 4
    $0.substitutions = ["bar": "the"]
    $0.stopwords = .custom(isoLanguageCode: "en")
}))

Essentially, Document Indexer mirrors Search Kit's text analysis properties, which are described in Text Analysis Keys.

Finally, Document Indexer can take on the heavy lifting of supplying stopwords for indexing. As shown in the InMemoryDocumentIndexer example above, setting stopwords to a specific language makes Document Indexer use a standard stopword list for that language. The .auto() option is also available: it has Document Indexer determine the user's preferred language automatically (if that is unavailable, it uses the system language).
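
The closure-based customized style used earlier in this section can be pictured with a small sketch. AnalysisSettings is a hypothetical stand-in with only two fields and placeholder defaults; the real TextAnalysisProperties may be implemented differently.

```swift
// Hypothetical stand-in; the real TextAnalysisProperties has more fields and may differ.
struct AnalysisSettings {
    var minTermLength = 1
    var substitutions: [String: String] = [:]

    /// Returns a copy of `self` after letting `handler` mutate that copy in place.
    func customized(_ handler: (inout AnalysisSettings) -> Void) -> AnalysisSettings {
        var copy = self
        handler(&copy)
        return copy
    }
}

let defaults = AnalysisSettings()
let tuned = defaults.customized {
    $0.minTermLength = 4
    $0.substitutions = ["bar": "the"]
}
```

Because the struct is a value type, the original defaults value is left untouched and the customized copy can be built inline as an initializer argument.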

Index flushing

The index becomes stale when the application updates it by indexing or removing a document. A search on an index in that state won't see the unflushed updates. Calling the flush() method makes the state consistent by flushing index-update information and committing index caches to the backing store.

Document Indexer provides the option to enable automatic flushing either before each search or after each index update. The following code illustrates how to turn on the automatic flushing before each search:

import DocumentIndexer
...
let indexer = InMemoryDocumentIndexer(autoflushStrategy: .beforeEachSearch)

For a persistent (on-disk) document indexer, the autoflushStrategy has to be specified both when creating and when opening the index. Flushing is not a cheap operation, so performing it on the main thread is not recommended. Handle flushing with care; which approach is appropriate depends on your implementation.
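
As one illustration of a manual strategy, the sketch below flushes once per batch of updates rather than after every change. Everything here is hypothetical: the Flushable protocol stands in for any indexer exposing a throwing flush(), and FlushBatcher is not part of Document Indexer. The class is not thread-safe as written.

```swift
// Hypothetical protocol standing in for any indexer that exposes a throwing flush().
protocol Flushable {
    func flush() throws
}

/// Flushes the target once every `threshold` recorded updates instead of after each one.
final class FlushBatcher<Target: Flushable> {
    private let target: Target
    private let threshold: Int
    private var pendingUpdates = 0

    init(target: Target, threshold: Int) {
        precondition(threshold > 0, "threshold must be positive")
        self.target = target
        self.threshold = threshold
    }

    /// Call after each index update; flushes when enough updates have accumulated.
    func recordUpdate() throws {
        pendingUpdates += 1
        guard pendingUpdates >= threshold else { return }
        try target.flush()
        pendingUpdates = 0
    }
}

// Test double that only counts how often flush() was called.
final class CountingIndex: Flushable {
    private(set) var flushCount = 0
    func flush() throws { flushCount += 1 }
}

let countingIndex = CountingIndex()
let batcher = FlushBatcher(target: countingIndex, threshold: 3)
for _ in 0..<7 { try batcher.recordUpdate() }
```

Seven updates with a threshold of three trigger two flushes; the seventh update stays pending until the next batch completes or an explicit flush is issued.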

See Also: SKIndexFlush

Index compacting

The index can develop fragmentation (that is, it can become bloated with unused data) as documents are indexed and removed. Compacting an index is done with the method compact(). Because this operation typically takes significant time, it should only be done when an index is significantly fragmented.

Document Indexer provides an uncompactedDocuments property, which does its best to report how many uncompacted documents the index contains. It works by tracking the fragmentation state, which requires some extra work from the user: the fragmentation state has to be preserved. This is optional and is delegated to a fragmentation state preserver implemented by the user. The uncompactedDocuments property returns a non-nil value only if such a preserver is provided.
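
For intuition only, here is one plausible way to estimate fragmentation from a maximum document ID and a document count, the two values that make up the fragmentation state in the preserver example in this section. This is an assumption for illustration, not the library's documented algorithm, and estimatedStaleDocuments is a hypothetical function.

```swift
/// Estimates how many document IDs are no longer backed by a live document.
func estimatedStaleDocuments(maximumDocumentID: Int, documentCount: Int) -> Int {
    // Assumption: IDs are handed out incrementally and never reused, so the gap
    // between the highest ID and the live document count approximates removals.
    return max(0, maximumDocumentID - documentCount)
}

let stale = estimatedStaleDocuments(maximumDocumentID: 120, documentCount: 75)
let shouldCompact = stale > 50
```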

The fragmentation state is preserved by a user-implemented preserver that conforms to the FragmentationStatePreserver protocol. An instance is passed in when the document indexer is created (for an on-disk indexer, both when creating and when opening). The preserver's only responsibility is to persist the given piece of information, in whatever way it chooses, so that it can store and restore it on request.

Here's an example of how a fragmentation state preserver can be implemented and then provided to a document indexer:

import Foundation
import DocumentIndexer
...
struct IndexerFragmentationStatePreserver: FragmentationStatePreserver {
    // Persist both fields of the fragmentation state in UserDefaults
    func preserve(_ state: FragmentationState) {
        UserDefaults.standard.setValue(state.maximumDocumentID, forKey: "maximumDocumentID")
        UserDefaults.standard.setValue(state.documentCount, forKey: "documentCount")
    }
    // Read the state back; this traps if nothing has been preserved yet
    func restore() -> FragmentationState {
        guard let maximumDocumentID = UserDefaults.standard.object(forKey: "maximumDocumentID") as? Int,
              let documentCount = UserDefaults.standard.object(forKey: "documentCount") as? Int
        else { preconditionFailure("No fragmentation state has been preserved") }

        return (maximumDocumentID: maximumDocumentID,
                documentCount: documentCount)
    }
}
let statePreserver = IndexerFragmentationStatePreserver()
let indexer = InMemoryDocumentIndexer(fragmentationStatePreserver: statePreserver)

Now that fragmentation state preservation is implemented, it can be used for compacting like so:

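import Foundation
import DocumentIndexer
...
let uncompactedDocumentsAllowance = 50
// uncompactedDocuments is nil when no fragmentation state preserver was provided
if let uncompactedDocuments = indexer.uncompactedDocuments,
   uncompactedDocuments > uncompactedDocumentsAllowance {
    DispatchQueue.global().async {
        // The closure passed to async cannot throw, so errors are handled here
        do {
            try indexer.compact()
        } catch {
            print("Index compaction failed: \(error)")
        }
    }
}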

Note: for a persistent document indexer, the fragmentation state preserver must be passed to both the creating and the opening initializers.

Installation

Swift Package as dependency in Xcode 11+

  1. Go to "File" -> "Swift Packages" -> "Add Package Dependency"
  2. Paste the Document Indexer repository URL into the search field:

https://github.com/SerhiyButz/DocumentIndexer.git

  3. Click "Next"
  4. Ensure that the "Rules" field is set to something like this: "Version: Up To Next Major: 1.3.0"
  5. Click "Next" to finish

For more info, check out here.
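
If you manage dependencies in a Package.swift manifest rather than through the Xcode UI, the same version rule can be sketched as follows. The package and target name MyApp, the tools version, and the platform setting are assumptions for illustration. The product name DocumentIndexer is assumed to match the module imported throughout this README. The requirement from: "1.3.0" corresponds to "Up To Next Major: 1.3.0".

```swift
// swift-tools-version:5.1
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    platforms: [.macOS(.v10_15)], // assumption; adjust to your deployment target
    dependencies: [
        // "Up To Next Major" starting at 1.3.0
        .package(url: "https://github.com/SerhiyButz/DocumentIndexer.git", from: "1.3.0")
    ],
    targets: [
        // Product name assumed to be "DocumentIndexer"
        .target(name: "MyApp", dependencies: ["DocumentIndexer"])
    ]
)
```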

License

This project is licensed under the MIT license.
