bug(node/vectordb): schema mismatch when using custom embedding function #1281

Open
wjones127 opened this issue May 8, 2024 · 4 comments
Labels: bug, typescript

Comments

@wjones127
Contributor

LanceDB version

v0.4.19

What happened?

When adding data, we get the error:

TypeError: Table and inner RecordBatch schemas must be equivalent.

This comes from the line:

https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/js/src/table.ts#L136-L137

The one difference is that in schema, vector is not nullable, while in batch.schema, vector is nullable.
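
To make that difference concrete, here is a minimal sketch of a field-by-field comparison that surfaces a nullability mismatch of this kind. It uses plain objects instead of real Arrow `Schema` instances, the two schemas are simplified stand-ins for the ones in this report, and `diffSchemas` is a hypothetical helper that exists in neither apache-arrow nor vectordb.

```javascript
// Hypothetical helper: compares two simplified schema descriptions.
// Each schema is an array of { name, type, nullable } objects. These are
// stand-ins for Arrow fields, not real apache-arrow classes.
function diffSchemas(expected, actual) {
  const diffs = [];
  const actualByName = new Map(actual.map((f) => [f.name, f]));
  for (const field of expected) {
    const other = actualByName.get(field.name);
    if (other === undefined) {
      diffs.push(`${field.name}: missing from batch schema`);
      continue;
    }
    if (field.type !== other.type) {
      diffs.push(`${field.name}: type ${field.type} vs ${other.type}`);
    }
    if (field.nullable !== other.nullable) {
      diffs.push(`${field.name}: nullable ${field.nullable} vs ${other.nullable}`);
    }
  }
  return diffs;
}

// Simplified versions of the table schema and the batch schema.
const tableSchema = [
  { name: "id", type: "Int32", nullable: false },
  { name: "vector", type: "FixedSizeList[384]<Float64>", nullable: false },
];
const batchSchema = [
  { name: "id", type: "Int32", nullable: false },
  { name: "vector", type: "FixedSizeList[384]<Float64>", nullable: true },
];

// Reports only the field whose nullability differs.
console.log(diffSchemas(tableSchema, batchSchema));
```

A comparison like this reports the `vector` field alone, which matches the observation that nullability is the only thing separating the two schemas.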

Are there known steps to reproduce?

Here is the user provided repro: https://paste.mozilla.org/udbe1bNs

Original message: https://discord.com/channels/1030247538198061086/1197630540271067258/1237552085525074000

Copy of repro
import lancedb from 'vectordb'
import express from 'express'
import { pipeline } from '@xenova/transformers'
import { Schema, Field, FixedSizeList, Float64, Int32, Utf8 } from "apache-arrow";

const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const db = await lancedb.connect('data/sample-lancedb')

const embed_fun = {
    sourceColumn: 'text',
    embed: async function (batch) {
        let result = []
        for (let text of batch) {
            const res = await pipe(text, { pooling: 'mean', normalize: true })
            result.push(Array.from(res['data']))
        }
        return (result)
    }
}

const schema = new Schema([
    new Field("id", new Int32()),
    new Field("text", new Utf8()),
    new Field("type", new Utf8()),
    new Field("vector", new FixedSizeList(384, new Field("item", new Float64())))
]);

debugger;

const tables = await db.tableNames()
let table
if (!tables.includes("food_table")) {
    table = await db.createTable({ name: "food_table", schema, embed_fun })
} else {
    table = await db.openTable('food_table', embed_fun)
}

const app = express()
app.use(express.json())

app.get('/', async (req, res) => {
    const results = await table
        .search("a sweet fruit to eat")
        .metricType("cosine")
        .limit(2)
        .execute()
    res.json(results)
})

app.post('/', async (req, res) => {
    await table.add(req.body)
    res.send('OK')
})

const port = 3000
app.listen(port, () => {
    console.log(`Listening port on ${port}`)
})

curl --location 'http://localhost:3000' \
--header 'Content-Type: application/json' \
--data '[
    { "id": 2, "text": "Carrot", "type": "vegetable" },
    { "id": 3, "text": "Potato", "type": "vegetable" },
    { "id": 4, "text": "Apple",  "type": "fruit" },
    { "id": 5, "text": "Banana", "type": "fruit" }
]'

TypeError: Table and inner RecordBatch schemas must be equivalent.

@wjones127 added the bug and typescript labels May 8, 2024
@universalmind303
Contributor

This appears to happen because the user's schema specifies the "vector" column, which is also the output column of the embed function. The application logic is unable to handle this scenario.

@wjones127
Contributor Author

Does that mean we should be validating this at the time of create_table()?

@universalmind303
Contributor

I guess it depends on how we want to handle this. Do we want to treat this as a user error, or do we want to add logic to check for columns matching the embedding functions?
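
One way to picture the second option is a check that runs at table creation and reports whether the schema already declares the column the embedding function will fill. This is only a sketch under stated assumptions: `checkEmbeddingColumns` is a hypothetical helper, the schema is reduced to a plain array of column names instead of an Arrow `Schema`, and the output column is assumed to be `vector` unless a `destColumn` property is supplied.

```javascript
// Hypothetical helper, not part of vectordb: inspects a list of schema
// column names against an embedding function description.
function checkEmbeddingColumns(columnNames, embedFn) {
  // Assumption: the embedding output goes to "vector" unless overridden.
  const outputColumn = embedFn.destColumn ?? "vector";
  return {
    outputColumn,
    // True when the user schema already declares the embedding output column.
    outputDeclared: columnNames.includes(outputColumn),
    // True when the column to embed is absent from the schema.
    sourceMissing: !columnNames.includes(embedFn.sourceColumn),
  };
}

// Column names taken from the schema in the repro above.
const columns = ["id", "text", "type", "vector"];
// Placeholder embedding function; the embed body is a stub for illustration.
const embedFn = {
  sourceColumn: "text",
  embed: async (batch) => batch.map(() => []),
};

const report = checkEmbeddingColumns(columns, embedFn);
if (report.outputDeclared) {
  // Either reject here (treat it as a user error) or reconcile the
  // declared field with what the embedding function produces.
  console.log(`schema already declares "${report.outputColumn}"`);
}
```

Rejecting when `outputDeclared` is true corresponds to treating this as a user error; using the flag to align the declared field with the generated one corresponds to adding matching logic.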

@wjones127
Contributor Author

If you look at the Python examples, we have users providing the vector column in the Pydantic schema: https://lancedb.github.io/lancedb/embeddings/#openai-embedding-function

So it feels inconsistent that in the JS library we would tell them not to do that, right?

@universalmind303 self-assigned this May 9, 2024