Thoughts on 3.0 data layout, new data model, indexes, and write protocol #24979

pauldix opened this issue May 8, 2024 · 18 comments

@pauldix
Member

pauldix commented May 8, 2024

This is a bit of a large catch-all issue to capture the breadth of new/additional/modified functionality we'd like to bring to InfluxDB v3. InfluxDB 3 will feature a new write API and data model to be able to store additional data types. Further, it will give the user more explicit control over how data is organized on disk, with optional structures for specialized indexes for time series data.

Physical Layout and Blocks

How data is physically laid out (either in object storage or on local disk):

<database name>/<table name>/<block start time>/<parquet files>

The block matches up to a period of time based on the time column of the data (with a start and an end). Just like previous versions of InfluxDB, all tables will have a required time column with a timestamp (more on this later). The duration of a block of data initially lines up with WAL segments on the server. WAL segment durations can be 5m, 15m, 30m, 1h, or 2h, and are a global configuration for the WAL shared by all logical databases on the server, so block start times will always align on those intervals. At a minimum, a single parquet file will be written for every segment period for every table that receives data. If more data comes in during the segment period than the database can buffer in memory, the largest tables will have files persisted before the segment finishes, and the segment could have many files in it. If data for a block of time comes in as a historical backfill (i.e. the time of the data isn't around the wall clock time of the server), new segments will be created for it, adding parquet files to the block.
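To make this concrete, here is a hypothetical object store listing (database, table, and file names are invented, and the exact timestamp format in the block path is an assumption):

mydb/cpu/2024-05-08T10-00/0000000001.parquet
mydb/cpu/2024-05-08T10-00/0000000002.parquet
mydb/cpu/2024-05-08T11-00/0000000003.parquet
mydb/memory/2024-05-08T10-00/0000000004.parquet

Here the cpu table's 10:00 block holds two files, for example because the table was large enough to be persisted before its segment finished, or because a backfill wrote a second segment into that block.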

InfluxDB Pro, the commercial upgrade, will feature a compactor that can combine the Parquet files of a single block and of multiple blocks together to form longer duration blocks (4h, 8h, 1 day, 3 days, 7 days, 1 month) with data clustered together based on the series key (more on this later). This will also combine block indexes together into longer periods. The purpose of the compactor is to reorganize the data as it ages so that it is optimized for fast historical query performance. Leading edge performance should be roughly equivalent in both open source and Pro, i.e. querying the last N blocks of time, for some reasonable value of N, should be about as fast without compaction.
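As a sketch of what that might look like, assuming four 1h blocks compacted into one 4h block (paths hypothetical, as above):

before: mydb/cpu/2024-05-08T08-00/  2024-05-08T09-00/  2024-05-08T10-00/  2024-05-08T11-00/
after:  mydb/cpu/2024-05-08T08-00/0000000042.parquet  (one 4h block: rows re-clustered by series key, block indexes merged)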

Table Schema and Types

Tables have columns and the following data types will be supported:

  • scalars: int64, uint64, float64, bool, microsecond time (more on this later)
  • string and bytes
  • array
  • map

Within an array, only scalars may be stored. A map has string keys, and its values can be scalars, string, bytes, array, or map; for child maps, only scalars, string, bytes, or arrays may appear beneath them (limiting the nesting level). I believe these limits will make things easier to implement and reason about, while still giving users a lot of flexibility in how they represent schemas. We'll have to test how many columns a table can support. We'll also have to test how wide an array or child struct can be and set limits on these things.
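To illustrate the nesting rules (field names are hypothetical, and this is pseudo-notation rather than the write protocol described later):

allowed:    attrs = {region: "us-west", ports: [80, 443], meta: {rack: "r1"}}
disallowed: attrs = {meta: {inner: {too: "deep"}}}   (a map two levels below the top-level map)
disallowed: ports = [{port: 80}]                     (a map inside an array)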

Table and column names must be alphanumeric but may also contain ., -, _, or %. The % character is useful to represent encoding when translating from external systems that might have names that aren't otherwise allowed. Ideally, users won't ever use the % character in their names.

**Open Question:** should arrays support structs inside them? If so, should those structs be limited to specific value types to limit nesting?

**Open Question:** should we support deeper nesting? If so, how deep? Is there an advantage/disadvantage to setting explicit limits here?

**Open Question:** do we want to represent measurement units? Is it enough to push for naming conventions?

**Open Question:** should we have other data types like histograms or hyperloglog? Or just have logical types on top of the types we support?

Series Keys

Within a block, data for each table will be sorted by the series key then time. The series key is an ordered set of columns that represent the key for an individual time series. The series key and timestamp are meant to uniquely identify a set of values collected at that timestamp. The series key is intentionally hierarchical so that data is organized on disk clustered together based on that hierarchy. In the new line protocol, this will be represented as a single string of the form <column name>/<value>/<column name>/<value>... Some examples (each one would be in a different table):

trace_id/0x5b8aa5a2d2c872e8321cf37308d69df2/span_id/0x051581bf3cb55c13
host_name/ip-10-24-34-0.us-west-2.compute.internal
country/us/city/nyc/tower_id/234

InfluxDB 3.0 will continue to be schema on write. However, once the columns that represent the series key for a table are set, the series key columns are immutable. New data columns can be added and old data columns can be removed, but the series key cannot change. Every column specified as part of the series key MUST be non-null when writing data (an empty string is fine). The columns that represent the key should be thought out when the table is initially written to and should not evolve. If they must evolve, then a new table is required to represent this evolution.

The series key plus the time represent the primary key for every table. None of these columns can have null values.
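In relational terms, this is roughly the following hypothetical DDL for the country/city/tower_id example (illustrative syntax only, not a proposed API; the table and column names are invented):

CREATE TABLE cell_towers (
  country  VARCHAR NOT NULL,   -- series key, position 1
  city     VARCHAR NOT NULL,   -- series key, position 2
  tower_id VARCHAR NOT NULL,   -- series key, position 3
  signal   DOUBLE,             -- data column: nullable, can be added or removed later
  time     TIMESTAMP NOT NULL, -- microsecond precision by default
  PRIMARY KEY (country, city, tower_id, time)
);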

Timestamps

Importantly, times will be kept at microsecond precision in Parquet by default. If users require nanosecond precision timestamps, they can request it explicitly. This is an attempt to achieve the broadest possible compatibility for third parties reading Parquet files produced by InfluxDB 3. We will want to explicitly test this data model and the generated files against Databricks, Snowflake, BigQuery, Redshift, Trino, and others; ideally, Parquet files created by InfluxDB 3 will be readable by as many third parties as possible.

Series Indexes

Indexes are mostly applicable to Pro where compaction can run to create large blocks of data. These indexes make queries that run against a single series or a handful of series run much faster than they otherwise would, particularly for historical queries.

For any of the columns in the series key, the user will be able to specify that the column/value should be indexed, pointing to the individual parquet files (or row groups when supported) that contain data for that column/value. Buffered data that has yet to be persisted to Parquet will also be indexed. The use case is to take column/value pairs that identify a handful of individual series and narrow down the set of parquet files/row groups that must be queried. For example, in the series keys above we might index trace_id, host_name, and tower_id.
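Conceptually, a block's series index might map each indexed column/value pair to file locations like this (file names hypothetical; row-group granularity only once supported):

host_name=ip-10-24-34-0.us-west-2.compute.internal -> [0000000001.parquet, 0000000003.parquet]
tower_id=234 -> [0000000002.parquet (row groups 1, 4)]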

These indexes would be kept per block and when the compactor runs to combine multiple blocks, it would rewrite the index for the larger block.

Last Value Caches

A very common use case is to request the last value, or last N values, for a given series or for many of them. Users expect these types of queries to return in single digit millisecond response times. The user should be able to specify that they would like the last value cached in RAM for some set of data columns. Data will be cached by series key, but the user can specify which columns they want to look up series by. For example, with the country/us/city/nyc/tower_id/234 series key, we know we'll be looking up the last values for individual series by tower_id and for all of the series in a city by city (which would return many series). The last value cache will make either of those lookups very fast.

In open source, this will be based only on what has been written since the server started up or what is replayed from unpersisted WAL segments (given the delay in segment persistence, this gives at least half a segment duration of look-back for the cache on restart). In the Pro version, these caches can optionally be filled on startup from historical data, with some configured limit on the look-back time range.

To query this data, I imagine that we'll add a new function that would appear where the table name appears in the FROM clause to specify that we want to select from the last value cache to bypass all the other query machinery. I think that having it there in the query language will be useful because the results can then be used in a CTE to do further processing against the rest of the data. And it'll save the user from having to call a separate API to get at the last value cache.
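As a hypothetical sketch of that in SQL (the function name last_values and its argument are invented here, not a settled API; cell_towers is the invented table from earlier):

-- select straight from the cache, bypassing the usual query machinery
SELECT tower_id, time, signal
FROM last_values('cell_towers')
WHERE city = 'nyc';

-- or feed the cached rows into a CTE for further processing
WITH latest AS (
  SELECT tower_id, time, signal FROM last_values('cell_towers')
)
SELECT * FROM latest WHERE signal < -100;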

Reference Data

In the IoT space, we get frequent requests to store reference data related to time series so that it can be joined at query time to either filter series out or to provide extra context about series. Now that we have a fully featured SQL engine, we could potentially store this reference data in addition to the time series data. We could potentially just do this by convention by writing to a table where all timestamps are always zero. But we should give some thought for how we might want to handle this situation. I believe the number of rows in this kind of table will generally be less than tens of millions.
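For example, under an all-timestamps-zero convention, a hypothetical tower_info reference table could be joined to the time series at query time (all names invented):

SELECT m.tower_id, m.time, m.signal, r.owner, r.install_date
FROM cell_towers m
JOIN tower_info r ON r.tower_id = m.tower_id
WHERE m.time >= now() - INTERVAL '1 hour'
  AND r.owner = 'acme';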

Write Protocol

One of the lasting strengths of InfluxDB has been Line Protocol. It is easy to read, easy to construct and can be represented as plain text. I think we'll want to do the same thing with InfluxDB 3. While we will probably also want a binary protocol at some point, I think a text protocol will be beneficial to begin with.

Alternatively, we could opt for JSON and have a structure that enables writing this kind of schema and information. This would speed initial development up since we wouldn't have to create our own text based protocol.

I think that capturing what represents the series key in the write protocol is useful and user friendly. I'm not sure that we'll want to capture which columns we're creating series indexes or last value caches on. That information will have to be set by the user via an API (either in the query language or REST; we should pick one method). That does limit what we're able to do with schema on write, as users will have to make additional API calls to get everything fully set up, but those indexes and caches are optimizations. The initial out-of-the-box, try-it-out experience would be able to run without calls to create databases, tables, indexes, or caches.

Here's an initial sketch of what the write protocol could look like:

<table name> <series key> <data block> <time>

Each section is separated by a space; table and column names must follow the rules specified earlier. The series key is the same as described above. Values that appear in the series key that contain a space, a /, or a \ character must be escaped with a \. This is the structure of the data block:

<column name>=<value block>[,<column name>=<value block>]...

So the data block is a set of column key/value pairs separated by commas. Values start with a character that specifies their type, immediately followed by the value. Thus the first character:

  • i - int64
  • u - uint64
  • f - float64
  • t - RFC3339 date/time
  • e - epoch in microseconds
  • " - string (must terminate in "). If ", \ appears in string, must be escaped by \
  • b - bytes (immediately followed by base64 encoded binary
  • { - map (must terminate in }
  • [ - array (must terminate in ]

Arrays and maps must follow the value type and nesting limitations mentioned at the beginning of this issue.

Lastly is the time, which can be either a microsecond epoch or an RFC3339 date/time.

If the user requires nanosecond times, they must be represented as RFC3339.
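Putting the pieces together, hypothetical writes in this protocol might look like the following (table, columns, and values are invented; details such as whether the trailing time carries a type prefix and the exact notation inside arrays and maps are assumptions, since the sketch above leaves them open):

cell_towers country/us/city/nyc/tower_id/234 signal=f-51.3,connections=i102,vendor="acme",bands=[i4,i5] 1715126400000000
cell_towers country/us/city/new\ york/tower_id/17 signal=f-47.9 2024-05-08T00:00:00.000000001Z

The first line uses a microsecond epoch time; the second escapes the space in the city value and uses an RFC3339 time with nanoseconds, which per the rule above opts that row into nanosecond precision.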

@pauldix pauldix added the v3 label May 8, 2024
@hiltontj
Contributor

hiltontj commented May 9, 2024

So, the series key is similar to, but seems like a departure from, tags (which are notably absent above). I reckon this is intentional: we are trying to move away from the typical associations users may have with tags, and narrowing in on what tags were ultimately used for, i.e., identifying a time series.

One key distinction I see is that the ordering is decided by the user, who defines the hierarchy. Would the series key elements always need to be written in the same order?

For example, I am assuming that

trace_id/foo/span_id/bar

and

span_id/bar/trace_id/foo

are not equal in terms of the resulting order/organization of data, but as columnar data, i.e., with trace_id and span_id columns, they would be treated as equal. So, unless we have a means of rejecting the second version after receiving the first, then it is really important that the user writes the key elements in the correct order the first time. I think this is manageable, but does put a burden on the client library implementations to get this concept right.

@pauldix
Member Author

pauldix commented May 9, 2024

Yes, functionally they're very similar to tags. The critical differences are that ordering matters with series key columns, that series key columns can't be added later, and values can't be null. In v1 and v2, tags were all indexed. This design gives us the ability to decouple the identifier from what gets indexed. Also, in InfluxQL, there was a limitation that you were only able to group by tags. This isn't the case with SQL and the v3 query engine, so we can lift that limitation.

Given the order has importance, I think the write endpoint should validate the same order in subsequent writes. We already validate that the schema is consistent, so this should fit in the same way. In v1, tag ordering technically didn't have significance, but there was a concept of a series key in the TSM engine itself, which was the tagset followed by a field name. To ensure that rows with the same tag key/value pairs had the same series key, the tagset would always be sorted lexicographically by tag key. This led to us telling client library implementors to always write tags in that order in line protocol so that we wouldn't have to re-sort on the server during ingest (it was a performance optimization that had very visible impact). For example, all writes from Telegraf do this.
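For reference, this is the v1 convention being described, with tags sorted lexicographically by key in the line protocol (values invented):

cpu,datacenter=us-west,host=serverA,rack=r1 usage_user=91.5 1715126400000000000

Writing host before datacenter would describe the same row, but would force the server to re-sort the tagset on ingest to build the TSM series key.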

@jacobmarble
Member

How data is physically laid out (either in object storage or on local disk):

<database name>/<table name>/<block start time>/<parquet files>

Using database/table name in file paths limits future ability to rename databases and tables. Why not use IDs in paths, and map names in the catalog?

Within an array, only scalars may be stored.

Why not strings and bytes?

should arrays support structs inside them? If so, should those structs be limited to only be able to have specific types to limit nesting?

You haven't defined structs. A column type similar to map, but with explicit key and value types?

should we support deeper nesting? If so, how deep? Is there an advantage/disadvantage to setting explicit limits here?

I vote we start with zero nesting allowed, iterate from there.

do we want to represent measurement units? Is it enough to push for naming conventions?

should we have other data types like histograms, hyperloglog? Or just have logical types on top of the types we support?

If the user can define structs, then we don't really need to define other composite types, at least not right away.

@pauldix
Member Author

pauldix commented May 9, 2024

@jacobmarble what's the difference between structs and maps?

@hiltontj
Contributor

hiltontj commented May 9, 2024

@pauldix - I think that you used the terms map and struct interchangeably in the original issue, but it may be clearer to stick to one or the other. struct implies fixed structure with explicit key/value types, while map implies variable key/value pairs.

@jacobmarble
Member

@jacobmarble what's the difference between structs and maps?

The same as the difference between struct and map in C, Go, Parquet, etc. A struct is a map with predefined keys, each with predefined value type.

Iceberg

The Iceberg spec makes this very easy:

A struct is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. Fields may have an optional comment or doc string. Fields can have default values.

A list is a collection of values with some element type. The element field has an integer id that is unique in the table schema. Elements can be either optional or required. Element types may be any type.

A map is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

Parquet

The Parquet spec doesn't make this easy. The Parquet group (list of fields, mainly used to describe schemas) can be nested. Read carefully the Parquet spec regarding Lists and Maps -- they are defined with group.

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

A Parquet schema which includes an int64 column, a map column, and a struct column looks something like this:

group schema {
  int64 column_a;
  group column_b_map {
    repeated group key_value {
      string key;
      string value;
    };
  };
  group column_c_struct {
    string name_first;
    string name_last;
    int64 account_id;
  };
};

@jacobmarble
Member

jacobmarble commented May 10, 2024

Importantly, times will be kept at microsecond precision in Parquet by default. If users require nanosecond precision timestamps, they can do so explicitly.

How do you propose the user do this? Here's an idea:

Schema-on-write tables have some type limitations (e.g. microsecond timestamps, no structs, all numbers are floats, no byte slices, ...) and explicit schemas unlock the entire type system (CREATE TABLE t ('time' nanoseconds, 'name_first' string, ...)).

One of the lasting strengths of InfluxDB has been Line Protocol.
...
Alternatively, we could opt for JSON and have a structure that enables writing this kind of schema and information. This would speed initial development up since we wouldn't have to create our own text based protocol.

By limiting schema-on-write as proposed above, the new line protocol can be plain-old-JSON. In particular:

@alamb
Contributor

alamb commented May 10, 2024

High Level Thought 1: Explicitly list use cases

  1. It might help to separate out the new use cases from the specific proposal / ideas of how to implement them. I think this would clarify the discussion and make it easier to evaluate "how does this proposal achieve the desired aims".

For example, I don't understand what is driving the change to the series key (aka what problem is it solving)?

High Level Thought 2: Separate discussion of logical and physical representations

This proposal seems to mix the logical data model with an implied physical representation. For example, the new data types are logical changes, whereas series indexes and last value caches seem like physical representation.

I think it would help to explain the proposal if you separated the changes to the logical data model (e.g. map support) and then how they would be physically implemented (e.g. series index, last value cache, etc)

This would also make it easier to allow thought experiments about how the physical representation could evolve over time, without changing the logical model

@alamb
Contributor

alamb commented May 10, 2024

Thoughts on DataType:

what's the difference between structs and maps?

I think the important distinction is if the keys of the map are dynamic (new key values can appear at any time) or fixed (all possible keys (fields) are known up front), as @jacobmarble says in #24979 (comment).

One benefit of a struct is that both Parquet and Arrow support projection and filter pushdown to structs efficiently. This is not the case for maps.

Some downsides of a struct are that

  1. the fields must all be known up front, and
  2. there are practical limits on the number of distinct fields (a few thousand at most) as each is effectively stored as a column

@alamb
Contributor

alamb commented May 10, 2024

Series Keys

The series key is intentionally hierarchical so that data is organized on disk clustered together based on that hierarchy.

It seems to me tags in line protocol already achieve clustering by (single) series.

I am guessing (see comment above!) that the hierarchy use case is to control the clustering of multiple series together.

For example, if you have data with city and state (where a state has multiple cities) and you want predicates on individual cities to be fast AND queries on an individual state to be fast. To achieve this you want the data to be clustered by state AND city.

For example

city=Andover,state=MA
city=Boston,state=MA
city=NYC,state=NY

To make predicates like WHERE state = 'NY' fast, you need the data stored so that all NY cities are together (aka ORDER BY state, city).

Existing Line Protocol

I think you can do this with the existing data model and InfluxDB 3.0 with a user defined sort order. For example

ORDER BY state, city, time

The downside, as @pauldix has pointed out, is that the user would have to provide this ordering somehow which might be a UX challenge.

Proposed Hierarchy

I believe the proposal requires encoding the hierarchy explicitly in the data. So users would have to supply data like

/MA/Andover
/MA/Boston
/NY/NYC

to ensure that the data in MA and NY are clustered together and that predicates like WHERE state = 'NY' are fast.

It is not clear to me that this is any better UX (the user still has to define the clustering explicitly), and furthermore users now have to encode it explicitly in their data.

@alamb
Contributor

alamb commented May 10, 2024

Schema Evolution

The columns that represent the key should be thought out when the table is initially written to and not evolve. If they must evolve, then a new table is required to represent this evolution.

I think this would be a major usability challenge -- in the real world, schemas do evolve over time (by adding new columns). The fact that schema evolution support is present in all major table formats is evidence of how widespread it is.

@alamb
Contributor

alamb commented May 10, 2024

Last Value Caches

In addition to last value caches, I recommend storing data sorted by ... TIME DESC so that the first row stored in a series in a parquet file is the most recent

@alamb
Contributor

alamb commented May 10, 2024

Write Protocol

Here's an initial sketch of what the write protocol could look like:

Have you considered simply extending the V2 protocol to support additional data types?

For example using the same notation, perhaps we could extend line protocol to support maps like this (would be the same for lists):

measurement,tag1=foo,tag2=bar value={'key': 12}

@jacobmarble
Member

One benefit of a struct is that both Parquet and Arrow support projection and filter pushdown to structs efficiently. This is not the same as maps

Assuming we use the Parquet Map LogicalType, another benefit of "structs" is that the member fields each have their own type, while the Map LogicalType has one type for all values.

@jacobmarble
Member

Series Keys, Series Indexes, Last Value Caches

Performance of selective queries (queries with WHERE A=B predicates) would certainly benefit from structuring data around series. However, this does nothing for other categories of query, such as range filters (WHERE foo < 0.5), and is very difficult to evolve.

Would you consider a more general Secondary Index framework, with specific types for various query patterns?

  • Writing and storing data files does not change. Partitioning does not change. Primary keys do not change.
  • Secondary Indexes are files independent of data files.
  • A user adds and removes Secondary Indexes at any time in a table's lifecycle.
    • This is very different to partitioning, which is not as flexible.
  • A background process (compaction perhaps) generates and removes Secondary Index files.
  • Secondary Indexes are referenced at query time, per partition, only when available AND useful.
    • A given query that spans multiple partitions may find a helpful index in one partition but not another, until the background indexing process catches up. That's OK, the effect is that the query gets faster over time.
  • For some queries, some Secondary Indexes can answer the query without reading data files.

Specific column Secondary Indexes types might be:

  • Bloom filter
  • B-tree
  • Hash
  • Statistics / summaries - min/max, first/last, average

Secondary Indexes are created by the user as ordered composites of column/type. For example:

  • Column A / Hash, Column B / Hash, Column C / statistics
    • Index Column A with Hash
    • For each entry in ^^ Index Column B with Hash
    • For each entry in ^^ Index Column C statistics
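In hypothetical DDL terms (syntax invented purely for illustration, not a proposal), that composite might read:

CREATE SECONDARY INDEX idx_abc ON my_table (
  column_a USING HASH,
  column_b USING HASH,
  column_c USING STATISTICS  -- min/max, first/last, average per (column_a, column_b) entry
);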

@pauldix
Member Author

pauldix commented May 14, 2024

@jacobmarble on the timestamp question for how the user specifies that it's nanosecond when they write the data, that's why I mention that they can only do that with an RFC3339 time. If it includes the nanoseconds, then they want that precision, otherwise it's assumed microseconds.

@alamb to your question on what the series key design solves, it's multiple things. First, we already have that concept in 1.x (where it is the tagset in lexicographical tag order). This just formalizes it and adds ordering. Technically, users could achieve this today with weird tag naming: a_..., b_..., etc. You're correct in that the goal is to let the user specify how data is clustered together. The other thing it makes clear is how the last value cache interacts with the idea of a series. This is a very common pattern that we've seen pop up: users think of the individual series and they want to return data for the individual series.

For schema evolution, my thinking is that we still give users the flexibility to evolve their schema, but not when it comes to what identifies a row. That's the only part that's fixed. They can add new columns at will. I think the series key is something that gets set at initial development time and never gets changed. Or in cases where they want to change it, they end up wanting to backfill all previous data with whatever the new value is. In which case, they should just create a new table with the new primary key and run a historical backfill on it.

@alamb for separating out physical from logical, I mainly just wanted to do a brain dump here so people could understand the totality of things I'm thinking about with respect to the new model. The indexes and last value cache aren't about the data model explicitly, but their existence and how they're used by the user has to be considered in conjunction with the model.

@jacobmarble on the topic of secondary indexes, we're definitely going to have those. I don't think those are mutually exclusive to the idea of the last value cache. It's a specific optimization for a very common pattern in sensor data use cases that I think is worthy of its own explicit concept and user experience.

@alamb I'm not sure what the benefit of extending the old LP to support these new types would be. For backwards compatibility, we already have support for the v1 and v2 write endpoints. The v3 model has a more specific concept of series which means that any client library or v3LP generator should have an API that corresponds to the v3 model. v1LP data can be ingested into the v3 model with the caveat that tags can't be added on later.

So some changes I would propose:

  • Keep the series key concept and have its ordering in LP have significance
  • Primitives: bool, int64, uint64, float64, timestamp (microsecond), timestamp_ns, uuid, fixed-length bytes, bytes, string
  • Nested types: list (array), and map, allowing only primitives as values

I notice that Iceberg doesn't have unsigned ints. Uhhh?

More thoughts?

@alamb
Contributor

alamb commented May 16, 2024

to your question on what the series key design solves, it's multiple things.

I still don't understand the problem (formalizing something we already do isn't really a problem 🤔, and I don't understand how a single key vs a multi-part key is related to a last value cache) -- maybe best discussed face to face.

For schema evolution, my thinking is that we still give users the flexibility to evolve their schema, but not when it comes to what identifies a row. That's the only part that's fixed. They can add new columns at will. I think the series key is something that gets set at initial development time and never gets changed.

I see -- so in 2.0 terminology the key difference is that you can't add new tag columns, but you could add new fields.

@alamb I'm not sure what the benefit of extending the old LP to support these new types would be.

One benefit might be that you could augment existing clients like Telegraf incrementally to support additional structured data (rather than requiring they use an entirely new endpoint / data model)

I agree if you make the assumption that the data model will be incompatible (e.g. series_key instead of a tag) then extending existing Line Protocol makes much less sense

I notice that Iceberg doesn't have unsigned ints. Uhhh?

Java!

@pauldix
Member Author

pauldix commented May 16, 2024

@alamb I think maybe you're thinking that the series key is one thing/value? That is not what I'm thinking. The series key is a logical concept: an ordered set of columns and their values. I was thinking that it would be limited to string columns; we could add bytes, int64, bool, and uint64, but I'm not sure that gets us much. Anyway, it gets written in line protocol with the (<column name>/<value>)+ format, but on the back end it just pulls those values into their columns.

I think the advantage of having it in the line protocol with the ordering is that it makes the user think about how their data is organized and what they're doing. This translates directly to how data is stored on disk, like picking a clustered index in a regular DB.

We do this already with TSM and tags; this just makes it explicit, with the ordering specified by the user rather than lexicographically by the server. I expect that data will be partitioned automatically by the system based on this hierarchy. Partitioning falls out from the series key and whatever the target number of rows might be per partition. Well, that and some idea of blocks of time (the blocks referenced above).
