
POC of Version Store with pluggable backend #586

Open - wants to merge 7 commits into base: master
Conversation

pablojim

First implementation of a Version Store with pluggable backends. POC in S3 with read, write, soft delete & snapshots, using the existing VersionStore chunking and serialisation mechanisms. Append and hard deletes are not implemented.

This implementation stands alone and has no effect on existing functionality. It duplicates a lot of code from the existing implementation and has limited error checking and cleanup functionality. This PR is mostly for discussion at this point.

@bmoscon
Collaborator

bmoscon commented Jul 18, 2018

@pablojim awesome - I'll take a look this week!

@pablojim
Author

General implementation notes:

  • Uses forward pointers everywhere. Versions point to segments, Snapshots point to Versions
  • For version documents, native S3 object versioning is used. Snapshotting is just asking S3 for the latest version key of every version document (see the sketch after this list).
  • The VersionStore has knowledge of the backing store, while the serialisation classes remain stateless and are handed a backing store for every operation.
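
A minimal sketch of the snapshot idea, assuming boto3 and an illustrative key layout (version documents under '<library>/versions/', snapshots under '<library>/snapshots/'; none of these names are taken from the PR):

import json
import boto3

s3 = boto3.client("s3")

def snapshot(bucket, library, snap_name):
    """Record the latest S3 VersionId of every version document."""
    prefix = "{}/versions/".format(library)
    entries = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj_version in page.get("Versions", []):
            if obj_version["IsLatest"]:
                # Forward pointer: the snapshot pins a specific revision
                # of each version document.
                entries[obj_version["Key"]] = obj_version["VersionId"]
    s3.put_object(Bucket=bucket,
                  Key="{}/snapshots/{}".format(library, snap_name),
                  Body=json.dumps(entries).encode("utf-8"))
    return entries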

Random thoughts/possibilities for improvements:

  • Add an abstract VersionStore base class
  • Implement a backward compatible version of the Mongo VersionStore using this abstraction
  • Allow passing of kwargs from all reads and writes to allow customisation for differing backends, e.g. reading only certain columns from parquet.
  • Need to integrate the new VersionStore with Arctic and libraries - e.g. tie libraries to store type and some specific configuration
  • Make use of the S3 object metadata functionality - especially when writing segments, store metadata about how each was serialised.
  • Switch from BSON for the version document serialisation - maybe YAML? Or JSON if we add some date handling.
  • Can we achieve chunk sharing with parquet? So we get fast appends/modifications and lower storage usage. It seems possible but would require deep integration when writing the parquet files.
  • Multithread the S3 uploads & downloads? (See the sketch after this list.)
  • Handling of different S3 profiles - e.g. multiple S3 endpoints
  • Add error checking and verification of S3 writes?
  • Add cleanup methods and hard deletes as per existing VersionStore
  • Think about fallbacks for parquet serialisation - dataframes in parquet then everything else in pickle?
  • Is there any value in hybrid approaches? e.g. data on NFS and metadata in S3, Mongo or Oracle. Could use transparent URLs for reading segment data, e.g. s3:// or file://. Configuration would be complex.
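
On the multithreading bullet above, a minimal sketch of parallel segment uploads with a thread pool, assuming boto3 (the bucket and key layout are illustrative):

import concurrent.futures
import boto3

s3 = boto3.client("s3")

def upload_segments(bucket, segments, max_workers=8):
    """Upload {key: bytes} segments in parallel and return the keys written."""
    def _put(item):
        key, data = item
        s3.put_object(Bucket=bucket, Key=key, Body=data)
        return key
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion so any upload error is raised here.
        return list(pool.map(_put, segments.items()))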

segment_keys = version['segment_keys']
assert len(segment_keys) == 1, "should only be one segment for parquet"
# TODO this is S3 functionality bleeding out of the backing store.
# Currently reading a Pandas dataframe from a parquet bytes array fails as it only takes a file path.
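
One possible way around the file-path limitation in the TODO is to read the parquet bytes through an in-memory buffer; a sketch assuming pyarrow, where segment_bytes is the payload fetched from the backing store:

import io
import pyarrow.parquet as pq

def dataframe_from_parquet_bytes(segment_bytes):
    """Deserialise a parquet payload without writing a temp file."""
    return pq.read_table(io.BytesIO(segment_bytes)).to_pandas()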

return sorted(dirs, key=mtime)

def delete_symbol(self, library_name, symbol):
"""Soft deletes a symbol - no data is removed, snapshots still work.
Contributor

Snapshots can be done with hard-links which would make deletions safe
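
For a filesystem-backed store that could look roughly like this (a sketch; paths and names are illustrative). The hard link keeps the segment data reachable from the snapshot even after the original path is deleted:

import os

def snapshot_segment(segment_path, snapshot_dir, name):
    """Hard-link a segment into a snapshot directory."""
    os.makedirs(snapshot_dir, exist_ok=True)
    link_path = os.path.join(snapshot_dir, name)
    # Both paths now refer to the same inode; deleting segment_path
    # later removes only one link, not the data.
    os.link(segment_path, link_path)
    return link_path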

@jamesblackburn
Contributor

Quite a nice bit of work - any comments on performance of the implementations?

@pablojim
Author

@jamesblackburn
From some early results for the parquet store: reading some large objects shows dramatic improvements - 3 seconds vs 90 seconds. These are probably worst-case scenarios for Arctic. Write performance is not so dramatically affected; I need to test more though.

There would also be large improvements from being able to load partial frames, e.g. only loading selected columns and row groups (see the sketch below). This may help cases such as #609.

Still some work and implementation decisions to do though.
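
For reference, a sketch of the partial-frame reads mentioned above, assuming segments are plain parquet files readable by pyarrow (paths and column names are illustrative):

import pyarrow.parquet as pq

def read_columns(path, columns):
    """Load only the requested columns from a parquet segment."""
    return pq.read_table(path, columns=columns).to_pandas()

def read_first_row_group(path, columns=None):
    """Load a single row group instead of the whole file."""
    pf = pq.ParquetFile(path)
    return pf.read_row_group(0, columns=columns).to_pandas()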

" Does it exist and is versioning enabled?".format(bucket_name))


class S3KeyValueStore(object):
Contributor

Would it make sense to have an abstract class KeyValueStore which has the methods for the KV API, and then have e.g. S3KeyValueStore as one concrete implementation, maybe in a separate file?

Contributor

So if e.g. someone wants to implement a KV store on different backing storage, they could simply extend the KeyValueStore class rather than the S3-specific one?
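
A minimal sketch of that shape (method names other than read_version, which appears elsewhere in this PR, are illustrative):

from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """Backend-agnostic key/value API used by the pluggable VersionStore."""

    @abstractmethod
    def write_segment(self, library_name, symbol, segment_data):
        """Persist one segment and return its key."""

    @abstractmethod
    def read_segment(self, library_name, segment_key):
        """Return the raw bytes for a segment key."""

    @abstractmethod
    def write_version(self, library_name, symbol, version_doc):
        """Persist a version document."""

    @abstractmethod
    def read_version(self, library_name, symbol):
        """Return the latest version document for a symbol."""

class S3KeyValueStore(KeyValueStore):
    """Concrete S3 implementation, ideally in its own module."""
    # ... the S3-specific code from this PR would live here ...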

version['segment_count'] = len(segment_keys) # on appends this value is incorrect but is updated later on
version['append_size'] = 0
version['append_count'] = 0
version['segment_keys'] = segment_keys
Contributor

aha, this is the forward-pointer implementation, where the version keeps all the segment keys, cool

previous_version = self._backing_store.read_version(self.library_name, symbol)

handler = self._write_handler(version, symbol, data, **kwargs)
handler.write(self._backing_store, self.library_name, version, symbol, data, previous_version, **kwargs)
Contributor

Would be nice to think about decoupling further and have:

  • the VersionStore top level api (looks great, as shown here)
  • a read/write handler being a composite of:
    1. version metadata handling. Metadata to be attached to the version is returned in the form of a dict by the handler. The handler is not aware of versions/version documents, only handler-specific metadata. The write handler's write() expects metadata to be passed in, and returns new metadata to the top level to be attached to the version document.
    2. serialization handler. Can be a numpy recarray serializer, an Arrow serializer, anything.
    3. segmentation policy. How to segment data, size of segments (upper size bound probably dictated by the backing store)
    4. compression handler
  • As you already have in the code, a backing_store handler, which is responsible for writing individual segments + associated per-segment metadata at the underlying storage.

With such a well-separated model, one could create custom implementations of individual handlers for serialization/segmentation/compression/backing_store.

The logic of e.g. converting a numpy array to a recarray, segmenting, producing byte arrays and finally writing chunks is currently very tightly integrated, which makes it hard to add different/new implementations. A rough sketch of such a split follows below.
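
A rough sketch of what that decomposition could look like (all class and method names here are illustrative, not taken from the PR):

from abc import ABC, abstractmethod

class Serializer(ABC):
    @abstractmethod
    def serialize(self, data):
        """Return (payload_bytes, metadata_dict) for the given object."""

class SegmentationPolicy(ABC):
    @abstractmethod
    def split(self, payload):
        """Yield segment-sized chunks of the serialized payload."""

class Compressor(ABC):
    @abstractmethod
    def compress(self, segment):
        """Return compressed segment bytes."""

class WriteHandler(object):
    """Composes the pieces; only the backing store touches storage."""

    def __init__(self, serializer, segmentation, compressor, backing_store):
        self.serializer = serializer
        self.segmentation = segmentation
        self.compressor = compressor
        self.backing_store = backing_store

    def write(self, library_name, symbol, version, data):
        payload, metadata = self.serializer.serialize(data)
        segment_keys = []
        for segment in self.segmentation.split(payload):
            compressed = self.compressor.compress(segment)
            key = self.backing_store.write_segment(library_name, symbol, compressed)
            segment_keys.append(key)
        version['segment_keys'] = segment_keys
        version['segment_count'] = len(segment_keys)
        version.update(metadata)
        return version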

Author

I thought about this approach. One worry is that it becomes very complicated, with an explosion of possible interactions between the different parts which may or may not make sense, e.g. mongodb metadata with parquet serialisation with 2MB chunking on an S3 backend, etc.

Each library would then have to be configured with a particular combination of these to work correctly. I think it might be an abstraction too far.

@yschimke
Contributor

@pablojim Shame to let this bitrot - can we discuss later this week with @willdealtry and @shashank88?

@shashank88
Contributor

shashank88 commented Jan 24, 2019

> @pablojim Shame to let this bitrot - can we discuss later this week with @willdealtry and @shashank88?

Yeah, this seems pretty good; will go through it tonight. Have fixed the merge conflict. Will see if the tests are fine.

@yschimke
Contributor

If we don’t think this is prod-ready or a feature we want to support long term, maybe we can segregate it as an example and make sure our API allows this sort of flexibility.


@pablojim
Author

> If we don’t think this is prod-ready or a feature we want to support long term, maybe we can segregate it as an example and make sure our API allows this sort of flexibility.

Apart from some dependencies it is completely isolated from the rest of Arctic. It's all in the "pluggable" package and duplicates some code from the main APIs.

One option would be to merge it but mark it in the code and documentation as Beta until it is deemed ready for wider use.

@shashank88
Contributor

Will move it to a contrib directory to unblock this PR, without committing to it being used as a production store.
