Replies: 19 comments 69 replies
-
What is your use case? How often do you need to add and/or update documents? How critical is it to have your data indexed as quickly as possible? What is an acceptable duration between the time a document is sent to Meilisearch and the time it is searchable, and why?

We are a marketing automation solution looking to move our mail search to Meilisearch. Some of our customers also use us for transactional email analytics (AWS SES, Mailgun...), and many of them send tens of thousands of emails daily. The goal is to let them search their mail, which is why indexing has to be fast. That hasn't been the case. We have also noticed that tasks get stuck and do not process at all: I just checked, and I have tasks that have been stuck in the queue for four days. I added a test document, and it has still not been indexed after more than 50 minutes.
PS: you may need to paginate the task APIs.
-
The version of Meilisearch you are using / How do you host Meilisearch? Is it on a Cloud provider? If yes, which one? / If you send your documents by batch, how big are these batches?

Dataset information
- Contact: 100k records
- Booking: 56k records
- Invoices: 235k records
- Emails: 416k records
The language of the dataset. The settings of your index(es):

What is your use case: How often do you need to add and/or update documents? Which type is it: Both. Documents are created and changed by user actions, and the data is then typically needed in subsequent processes.

How critical is it to have your data indexed as quickly as possible? Critical. Users often create records such as a customer contact, then search for that contact to add it to other records such as orders. If the contact is unavailable because it is still being indexed, it impacts the user experience.

What is an acceptable duration between the time a document is sent to Meilisearch and the time it is searchable? A few seconds. When users manage a larger CRM-type dataset, search becomes critical to navigation and to linking resources together. Say you create a contact, then an order: if you search for the contact to link the order and it is not found, or you leave the contact page and a delay prevents you from finding that contact until it is indexed, the experience is jarring and reflects badly on the search, as users perceive this as "I can't find things."

Misc: PHP SDK, via Laravel. The Laravel integration uses Laravel Scout, which includes no support for batching and simply pushes an update each time an object is updated. This suits most use cases, so it is understandable, and it works for us. We previously used Algolia, but as our dataset grew (like most) it became unviable.
Having battled with index performance for a long time, we are eagerly awaiting the next release. However, even on a large server the performance can be quite slow (updates, I imagine, are still processed one index at a time), which would need to be addressed lest we end up with multiple Meilisearch instances with one index per instance. Lastly, with complete ignorance of how your internal queue/batching is planned: it would be beneficial if index delete commands were actionable immediately, rather than processing old tasks before clearing the index. The use case: if the queue falls behind with smaller updates (ours was/is weeks behind), we could delete the index out of hours, recreate it, and batch-update it rapidly. As it stands, if we issued a delete command it could be weeks before the index is deleted, unless we physically delete the index on disk.
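Since Scout-style integrations push one update per object, one common workaround (sketched below in Python; the class name and `flush_fn` hook are hypothetical, not part of any SDK) is to buffer per-object updates and flush them to Meilisearch in batches, deduplicating by primary key so only the latest version of each document is sent:

```python
import time


class DocumentBuffer:
    """Collect per-object updates and flush them in batches.

    `flush_fn` is whatever actually sends the batch to Meilisearch,
    e.g. a wrapper around the SDK's update-documents call (assumed here).
    """

    def __init__(self, flush_fn, max_docs=1000, max_age_seconds=5.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self._docs = {}      # keyed by primary key: later updates win
        self._oldest = None  # timestamp of the oldest buffered update

    def add(self, doc, primary_key="id"):
        if self._oldest is None:
            self._oldest = self.clock()
        self._docs[doc[primary_key]] = doc
        # Flush when the buffer is full or the oldest update is too old.
        if (len(self._docs) >= self.max_docs
                or self.clock() - self._oldest >= self.max_age_seconds):
            self.flush()

    def flush(self):
        if self._docs:
            self.flush_fn(list(self._docs.values()))
            self._docs = {}
            self._oldest = None
```

Usage would be `buf = DocumentBuffer(lambda docs: index.update_documents(docs))` and `buf.add(doc)` on every model change, with a final `buf.flush()` on shutdown; this turns thousands of single-document tasks into a handful of batched ones.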
-
I'm testing and running Meilisearch on an ECS s6.xlarge.4 instance: 4 CPUs, 16 GB RAM. It looks like (just a guess): I restarted Meilisearch and updated one document. I can't believe it. Meilisearch version: v0.25.2. Index settings:
Document example:
The total is 1,364,100 documents, imported 10,000 per batch.
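Splitting a large import into fixed-size batches like this is simple to do client-side; a minimal sketch (the `chunked` helper is not part of any SDK, just plain Python):

```python
def chunked(documents, batch_size=10_000):
    """Yield successive batches of at most `batch_size` documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


# Each batch would then be sent as one task, e.g. (assuming an SDK
# index object):  for batch in chunked(docs): index.add_documents(batch)
```

Smaller batches produce more tasks (more per-task overhead), while very large batches increase memory pressure during indexing, so the batch size is worth tuning for the machine.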
-
I have an issue where updating a document takes a long time to take effect: with 627,786 documents, updating a single document takes about 1 minute. I have around 15 million documents on the prod server; at that rate, updating a single document would take about 30 minutes! My Meilisearch version is:
I have 627786 documents:
as you can see the
After sending this payload, I received this response:
When I checked the task status:
As you can see, it took almost one minute to complete the update. This way I can't depend on Meilisearch 100% because of the update delay. Is there any environment variable I can set to make document updates apply instantly?
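There is no environment variable that makes updates synchronous; document updates are asynchronous tasks, so the usual pattern is to poll the task until it reaches a terminal status. A generic polling sketch (the `get_task` callable stands in for the SDK/HTTP call that fetches a task, e.g. GET /tasks/:uid; the terminal status names `succeeded`/`failed` match recent Meilisearch versions):

```python
import time


def wait_for_task(get_task, uid, timeout=60.0, poll_interval=0.5,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll a task until it reaches a terminal status or the timeout expires.

    `get_task(uid)` must return the task payload as a dict with a
    'status' key; everything else here is generic polling logic.
    """
    deadline = clock() + timeout
    while True:
        task = get_task(uid)
        if task["status"] in ("succeeded", "failed"):
            return task
        if clock() >= deadline:
            raise TimeoutError(
                f"task {uid} still {task['status']} after {timeout}s")
        sleep(poll_interval)
```

This at least bounds how long the application waits and surfaces stuck tasks as explicit timeouts instead of silent delays.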
-
Indexing speed problems: it takes ~25 seconds to index 5,000 documents on a 6-core CPU. [2022-03-09T06:00:53Z INFO meilisearch_lib::index::updates] document addition done: DocumentAdditionResult { indexed_documents: 5000, number_of_documents: 95000 }
-
Update: after adjusting the searchable attributes it became much quicker; hope this helps.

Hello guys, I found that updating documents in Meilisearch is really slow. Is there a quicker way to do this? Would Mongo be more suitable, or should I delete the doc first and then add a new one? This is my machine configuration: I use the latest Meilisearch Docker image, getmeili/meilisearch:latest. There are almost 600,000 docs in Meilisearch, and it took 180 seconds to update 22 docs. Here is the update log (task status queried every 10 seconds):

# update code
index.update_documents_in_batches(data, 1000)

INFO:root:====================================================================================================
INFO:root:Begining Procssing comment
INFO:root:Total 22 items to update...
INFO:root:Total 1 tasks, processing the 1 task.
INFO:root:{'uid': 2052, 'indexUid': 'users_cangdian', 'status': 'enqueued', 'type': 'documentPartial', 'details': {'receivedDocuments': 22, 'indexedDocuments': None}, 'duration': None, 'enqueuedAt': '2022-03-13T10:13:34.531720510Z', 'startedAt': None, 'finishedAt': None}
INFO:root:Total 1 tasks, processing the 1 task.
INFO:root:{'uid': 2052, 'indexUid': 'users_cangdian', 'status': 'processing', 'type': 'documentPartial', 'details': {'receivedDocuments': 22, 'indexedDocuments': None}, 'duration': None, 'enqueuedAt': '2022-03-13T10:13:34.531720510Z', 'startedAt': '2022-03-13T10:13:34.536439205Z', 'finishedAt': None}
[... the same 'processing' status line repeats every 10 seconds until the task finishes, ~180 seconds in total ...]
INFO:root:Successfully Procssed comment
INFO:root:====================================================================================================
-
Wow! Indexing with 0.26.1 seems so much faster than 0.25. It finished indexing ~8.6 million documents in under 16 hours on a 48 GB / 2-CPU VPS. Previously I'd move the index over to a beefier VPS to do the indexing, and it still took longer. I've enabled autobatching, and I load in ~50k records per batch when building the index. Great work, guys!

{"databaseSize"=>39974383616,
"lastUpdate"=>"2022-03-19T00:02:24.969411905Z",
"indexes"=>
{"books"=>
{"numberOfDocuments"=>8591382,
"isIndexing"=>false,
"fieldDistribution"=>
{"author"=>8591382,
"id"=>8591382,
"series"=>8591382,
"title"=>8591382,
"work_no"=>8591382}}}}
-
I can confirm that indexing is faster in the latest version with auto-batching turned on, but it is still about 5x slower than Typesense, which basically does it in real time (we use both in production).
-
This is an update to my previous replies. This is a customer service app running in production. It adds 1-3 documents within a 5-minute interval. The queue is being processed about 2 days later (and the delay is getting bigger). Each update now takes around ~530 seconds (nearly 9 minutes). I am pretty sure the documents are very small. The batching-related plugins are not stable enough to get started with; I will see what is needed to implement that myself.
-
This is a further update to my previous replies. I have tried the following settings. Version: 0.27 rc3. Tweaks:
Results: Log:
-
@vprelovac Edit: see the comments below. This article was measuring indexing speed in the wrong way, and the Meilisearch time is wrong.
-
Hello everyone watching this issue! We have just released v0.29.0rc1, a release candidate of v0.29.0 🔥 Binaries are attached to the release, or you can use the Docker image:

docker run -it --rm \
  -p 7700:7700 \
  getmeili/meilisearch:v0.29.0rc1

Let us know about any bugs or feedback! 😄 It would be really helpful. FYI, the official v0.29.0 release will be available on 3rd October.
-
Hello everyone here!
-
Indexing is fast enough, but it crashes after running for a while. I have an instance on my VPS with docker-compose:

meilisearch:
  container_name: "chii-base-meilisearch"
  image: "getmeili/meilisearch:v0.30.5"
  command: meilisearch --env production
  restart: always
  environment:
    MEILI_ENABLE_METRICS_ROUTE: "true" # yes, I know it's not working
    MEILI_MASTER_KEY: "..."
    MEILI_LOG_LEVEL: "WARN"

This has happened on 0.28, 0.29, and now v0.30.5. My use case is searching about 400k documents, and indexing speed is not important. The data file was about 25 GB before the crash; after removing the old data and re-adding all documents, the data file takes only 8.2 GB.
Technical information
Meilisearch version: happens in v0.28, v0.29, and v0.30.5.
Additional context: x64, Ubuntu 20.04, 4 cores / 8 GB RAM, swap off.

How often do you need to add and/or update documents? Which type is it?
Fewer than 50 adds per day; about 10k updates per day (about 1.3/s). Previously I sent payloads (updates and additions) one by one; these days I send them in batches of 1k. And this happens whether I send payloads to Meilisearch in batches or not (data queues up waiting to flush to Meilisearch). It crashes after running for a while (weeks or months) without any useful logging, and the SDK doesn't return any error.

Dataset information
My documents look like this:

type subjectIndex struct {
ID uint32 `json:"id"`
Summary string `json:"summary"`
Tag []string `json:"tag,omitempty" filterable:"true"`
Name []string `json:"name"`
Date int `json:"date,omitempty" filterable:"true" sortable:"true"`
Score float64 `json:"score" filterable:"true" sortable:"true"`
PageRank float64 `json:"page_rank" sortable:"true"`
Heat uint32 `json:"heat" sortable:"true"`
Rank uint32 `json:"rank" filterable:"true" sortable:"true"`
Platform uint16 `json:"platform,omitempty"`
Type uint8 `json:"type" filterable:"true"`
NSFW bool `json:"nsfw" filterable:"true"`
}

setting:
{
"displayedAttributes": ["*"],
"searchableAttributes": ["name", "summary", "tag", "type", "id"],
"filterableAttributes": ["date", "nsfw", "rank", "score", "tag", "type"],
"sortableAttributes": ["date", "heat", "page_rank", "rank", "score"],
"rankingRules": [
"exactness",
"words",
"typo",
"proximity",
"attribute",
"sort",
"id:asc",
"rank:asc",
"score:desc",
"nsfw:asc"
],
"stopWords": [],
"synonyms": {},
"distinctAttribute": null,
"typoTolerance": {
"enabled": true,
"minWordSizeForTypos": {
"oneTypo": 5,
"twoTypos": 9
},
"disableOnWords": [],
"disableOnAttributes": []
},
"faceting": {
"maxValuesPerFacet": 100
},
"pagination": {
"maxTotalHits": 1000
}
}

stats:
{
"numberOfDocuments": 413702,
"isIndexing": false,
"fieldDistribution": {
"date": 345888,
"heat": 413702,
"id": 413702,
"name": 413702,
"nsfw": 413702,
"page_rank": 413702,
"platform": 285143,
"rank": 413702,
"score": 413702,
"summary": 413702,
"tag": 230165,
"type": 413702
}
}

I'm running Meilisearch in Docker, so dockerd restarts it after it crashes. There is high CPU usage and high IO read, but not high memory usage, so I don't think it was killed by the OS. I also have a Prometheus exporter exporting some data; hopefully it's useful. You can see that Meilisearch's "enqueued" and "processing" task uids keep increasing, but it doesn't finish any task. This happened on 9/26, 10/20, 12/7, and today; Meilisearch didn't give any useful logging in the previous crashes either. I have already removed my old data file; if you need it, I can only share it the next time this happens.
-
Meilisearch version: 1.0.2. Runs on: a VirtualBox VM with 10 GB RAM and 8 cores of a Ryzen 3700X. The document structure is like this:
The problem:
What are those tasks waiting for? I've tried small chunks, large chunks, manual chunking, and auto-batching; the behavior is the same. Indexing works for the first portion only, and then for all the rest.
-
I have a question: when I add some documents to an index, does Meilisearch reindex the whole index from scratch?
-
Meilisearch Version
Machine Details
Batch Size

Dataset Information
An example of data we index is as follows:

Blocks
Wallets
Index settings for the
{
"displayedAttributes": [
"*"
],
"searchableAttributes": [
"id"
],
"filterableAttributes": [],
"sortableAttributes": [
"timestamp"
],
"rankingRules": [
"typo",
"words",
"proximity",
"attribute",
"sort",
"exactness"
],
"stopWords": [],
"synonyms": {},
"distinctAttribute": null,
"typoTolerance": {
"enabled": true,
"minWordSizeForTypos": {
"oneTypo": 5,
"twoTypos": 9
},
"disableOnWords": [
"id"
],
"disableOnAttributes": []
},
"faceting": {
"maxValuesPerFacet": 100
},
"pagination": {
"maxTotalHits": 1000
}
}

Usecase
Issue Description
What we noticed is that the data import is fine in terms of speed; after a couple of hours the data is available and ready to go. The issue we run into is when we try to add new documents to the existing index, mainly the one containing
The question here is whether these times are expected, or whether Meilisearch should be able to handle faster additions/updates with a change in configuration. Another thing we noticed is that adding new documents only ever utilises a single CPU core/thread. As a result, both server instances need around the same time to add documents to the index, even though one has 32 cores while the other has only 4. Is this also expected with the way Meilisearch handles indexes, or can we somehow make it use more of the available resources to increase the speed? Looking forward to a reply to better understand the limitations we may be running into.
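One way to keep incremental additions small, whatever the server's core count, is to send only documents that actually changed since the last sync rather than re-pushing the whole export. A sketch of that idea (plain Python, not an SDK feature; the fingerprint store here is an in-memory dict, but it could equally be a database table):

```python
import hashlib
import json


def changed_documents(docs, fingerprints):
    """Return only the documents whose content changed since the last sync.

    `fingerprints` maps primary key -> content hash from the previous run
    and is updated in place. Sending fewer documents keeps each Meilisearch
    task small, which matters when updates are processed serially.
    """
    to_send = []
    for doc in docs:
        digest = hashlib.sha256(
            json.dumps(doc, sort_keys=True).encode()).hexdigest()
        if fingerprints.get(doc["id"]) != digest:
            fingerprints[doc["id"]] = digest
            to_send.append(doc)
    return to_send
```

This does not change how many cores Meilisearch uses for a given task, but it reduces the volume of work queued in the first place.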
-
I'm seeing an unexplained anomaly using Meilisearch 1.3.1: after indexing, the first search request is about 100 times slower. After that, search is super fast again. Log output added below:
-
Closing this discussion. We made huge improvements with v1.6.0 and v1.7.0. We recommend that anyone encountering indexing issues upgrade their Meilisearch instance to the latest version. If that is not enough, please open an issue directly in the Meilisearch repository with your indexing time and expectations: https://github.com/meilisearch/meilisearch/issues
-
Hello everyone 👋
If you are here, it probably means you have run into issues during indexing with Meilisearch: document addition might be really slow or might even have led to a crash due to memory consumption.
The whole Meilisearch team is really sorry for the inconvenience. Rest assured, we are always working on making our search engine better on these points.
Before posting your problem, please read the whole post.
Current issues
We have currently identified 3 types of issues during indexing:
Current solutions to fix indexing issues
Here are some solutions we have already implemented and documented to fix them.
Please let us know about your experience with this experimental feature in this discussion; it would definitely help us improve our search engine!
If you still have indexing issues
If, after testing all the previous points, you still have an issue (bad performance or a crash), please let us know about your use case in this discussion.
We need the following information to process your feedback as efficiently as possible.
Technical information
Replace http://127.0.0.1:7700 with the server address of your Meilisearch instance if you don't run Meilisearch locally.
The specs of your machine (number of cores, RAM, distribution, etc.)
How do you host Meilisearch? Is it on a Cloud provider? If yes, which one?
Is your Meilisearch running with Kubernetes? Is your Meilisearch running in a Docker container?
If you send your documents by batch, how big are these batches?
Dataset information
If possible, provide your dataset. We completely understand you cannot share your data publicly, but you can still send it to me in private by email. Rest assured we will use it only for test purposes and will delete it right after the tests.
If you cannot share your dataset, please let us know about the following points.
The size of your dataset and the number of documents.
For example, the movies.json dataset we provide in the documentation has a size of 9.1 MB.
The composition of your documents: the number of fields per document and the number of words per field.
Ex:
This dataset of 6 documents contains between 3 and 4 fields per document and fewer than 10 words per field.
The language of the dataset. For example, Chinese is slower to tokenize than Latin languages.
The settings of your index(es):
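To report the composition numbers asked for above (number of documents, fields per document, words per field), a quick sketch like this could compute them from a list of JSON documents loaded as Python dicts (the function name and the chosen summary keys are just for illustration):

```python
def dataset_composition(documents):
    """Summarise fields per document and words per field for a report."""
    fields_per_doc = [len(doc) for doc in documents]
    # Count whitespace-separated words in each field's string form.
    words_per_field = [
        len(str(value).split())
        for doc in documents
        for value in doc.values()
    ]
    return {
        "documents": len(documents),
        "min_fields": min(fields_per_doc),
        "max_fields": max(fields_per_doc),
        "max_words_per_field": max(words_per_field),
    }
```

Run over the real dataset, the output gives exactly the kind of "between 3 and 4 fields per document, fewer than 10 words per field" summary the template asks for.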
Your use case
What is your use case:
How often do you need to add or/and update documents? Which type is it:
How critical is it to have your data indexed as quickly as possible?
What is an acceptable duration between the time a document is sent to Meilisearch and the time it is searchable?
Most of all, regarding the last answers: why?
Misc
Thanks for reading this, and most of all, thanks for your time for your feedback 🙂