do-me/copernicus-services-semantic-search

Copernicus Services Semantic Search

Tutorial here with lots of tips

A basic semantic search app based on 834 entries from the Copernicus Services Catalogue. Each entry is chunked and indexed with all-MiniLM-L6-v2 (as the mean embedding of all its content chunks), and the index is stored in a ~2.4MB gzipped JSON file. Enter any query and hit submit or press Enter. The app loads ~27MB of data and scripts. The ML model runs entirely in the browser thanks to transformers.js.
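To illustrate the idea behind the search step, here is a minimal sketch of ranking documents by cosine similarity between a query embedding and each document's mean embedding. It is written in Python for readability, while the app itself runs in the browser. The record fields (`title`, `embedding`), the function names and the 3-dimensional toy vectors are assumptions made up for this example, and cosine similarity is assumed as the ranking metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_embedding, documents, top_k=3):
    """Return the top_k (score, title) pairs, best match first."""
    scored = [
        (cosine_similarity(query_embedding, doc["embedding"]), doc["title"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy index with invented titles and 3-dimensional vectors (hypothetical data).
toy_index = [
    {"title": "Sea surface temperature", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Land cover map", "embedding": [0.0, 0.2, 0.9]},
    {"title": "Ocean colour", "embedding": [0.7, 0.6, 0.1]},
]

results = rank_documents([1.0, 0.0, 0.0], toy_index, top_k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

With one mean vector per document, a query only needs one comparison per catalogue entry, which keeps the search cheap enough to run client-side.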

Advanced search

If you'd like to search within the content of a result, consider installing the SemanticFinder Chrome extension (see its GitHub repo).

It finds the most relevant sections to your query in the actual content of the results by performing semantic search on the fly.

Data mining tutorial

The process of creating the data dump can be repeated with the included Jupyter Notebook. It includes the whole processing pipeline:

  • data mining with requests and beautifulsoup
  • preprocessing in pandas
  • chunking the document text into smaller paragraphs of the right size for the ML model
  • creating embeddings for each chunk
  • calculating the mean embedding for each document
  • saving as gzipped json (small file size & easy and fast to read in js with pako.js)
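
The last few steps of the pipeline can be sketched with the Python standard library alone. This is not the notebook's code: the function names, the word-based chunking rule and the `max_words` value are assumptions for illustration, and the embeddings are taken as given (in the notebook they come from the ML model).

```python
import gzip
import json


def chunk_text(text, max_words=100):
    """Split text into consecutive chunks of at most max_words words (assumed rule)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def mean_embedding(chunk_embeddings):
    """Average a non-empty list of equal-length vectors, component by component."""
    n = len(chunk_embeddings)
    return [sum(values) / n for values in zip(*chunk_embeddings)]


def save_index(records, path):
    """Write the records as gzipped JSON (text mode, UTF-8)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f)


def load_index(path):
    """Read the gzipped JSON back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

For example, `mean_embedding([[1.0, 2.0], [3.0, 4.0]])` gives `[2.0, 3.0]`, so each document ends up with one vector regardless of how many chunks it was split into.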

You can re-run the process for updates (if you do so, please open a pull request for this repo or write to me so I can keep the data dump updated) or use other indexing models, such as the current MTEB leaders from the bge or gte families. You could also use a multilingual model to run search queries in languages other than English. The current dump holds 834 entries as of 21 October 2023.

If you like this project, ⭐ the repo or give a shoutout on social media!