v1.3.0 ingest performance poor compared to v1.2.2 #676
Hello. I recently updated from 1.2.2 to 1.3.0 (for the rollup issue #457 fix) and am now seeing only about 20% of the previous http.ingest_count and a doubling of http.ingest_time, which results in Kairos ingest falling behind rabbitmq.
For reference, the data flow is rabbitmq -> python(pika) -> kairosdb -> scylladb 5.0.
I translated the previous kairosdb.properties to the new kairosdb.conf HOCON format and I believe I have all the settings the same (threads, batch sizes, connections, etc.).
Java is 1.8.0; I have tried 11 but it makes no difference.
Starting with a fresh keyspace makes no difference.
Kairosdb 1.3.0 is the only change to the stack, and when I change back to 1.2.2 it's all happy again.
I've trawled the commits and issues but nothing stands out. Any ideas?
Comments
Well, this is disappointing. Tell me a bit about your setup. How many clients, how many kairos nodes, how big is your scylla cluster? What are your ingest numbers before and after 1.3.0? You can email me your config files directly and I'll take a look to see if anything sticks out as wrong.
For reference: 4 kairosdb nodes, a 6-node scylla cluster. All 20(40) core, 64GB RAM, SSD.
1.2.2 ingest approx 6.5mil/min
1.3.0 approx 2mil/min
Also seeing a corresponding drop in CPU usage on the kairos nodes.
Rollups are the primary source of queries, so ingest patterns are generally consistent.
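For reference, per-minute rates like the above can be pulled from Kairos's own kairosdb.http.ingest_count metric over the REST API. A minimal sketch, assuming the Python requests library and a hypothetical node address of localhost:8080:

import requests

# Hypothetical endpoint - point this at one of your Kairos nodes.
KAIROS_URL = "http://localhost:8080/api/v1/datapoints/query"

# Sum kairosdb.http.ingest_count into 1-minute buckets over the last hour.
query = {
    "start_relative": {"value": 1, "unit": "hours"},
    "metrics": [{
        "name": "kairosdb.http.ingest_count",
        "aggregators": [{
            "name": "sum",
            "align_sampling": True,
            "sampling": {"value": 1, "unit": "minutes"},
        }],
    }],
}

resp = requests.post(KAIROS_URL, json=query)
resp.raise_for_status()
for result in resp.json()["queries"][0]["results"]:
    for timestamp_ms, count in result["values"]:
        print(timestamp_ms, count)  # data points ingested per minute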
kairosdb.conf:
kairosdb: {
# To set the traffic type allowed for the server, change the kairosdb.server.type entry to INGEST, QUERY, or DELETE,
# or a comma delimited list of two of them to enable different sets of API methods.
# The default setting is ALL, which will enable all API methods.
#server.type: "ALL"
# Specify a map of custom tags to add to KairosDB's own internal metrics. Example tags might
# include environment, data center or server role/type. This should resolve to a JSON object of key/value pairs
# Example defining two custom tags:
# metrics.custom_tags.environment: "AWSLAB"
# metrics.custom_tags."data center": "US-EAST-1"
# The default is to not have any custom tags defined.
#metrics.custom_tags."server.type": "INGEST"
# Properties that start with kairosdb.service are services that are started
# when kairos starts up. You can disable services in your custom
# kairosdb.conf file by setting the value to <disabled> ie
#kairosdb.service.telnet=<disabled>
#service.telnet: "org.kairosdb.core.telnet.TelnetServerModule"
#telnetserver: {
# port: 4242
# address: "0.0.0.0"
# max_command_size: 1024
#}
#===============================================================================
service.http = org.kairosdb.core.http.WebServletModule
jetty: {
# Set to 0 to turn off HTTP port
port: 8080
address: "0.0.0.0"
#timeout for idle sockets between requests.
socket_idle_timeout: 120000
static_web_root: "webroot"
# Show stack trace for debug
show_stacktrace: false
# To enable SSL uncomment the following lines and specify the path to the keyStore and its password and port
#ssl: {
#port: 443
#protocols: "TLSv1.1, TLSv1.2"
#cipherSuites: "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV"
#keystore.path=
#keystore.password=
#Truststore may need to be set if you are using a form of auth that uses another server/service for authentication
#You may need to add a truststore with that servers certificate so KairosDB can connect to that server.
#truststore.path=
#}
# To enable Jetty Request Logging uncomment the following lines. Such logging configuration is detailed here https://www.eclipse.org/jetty/documentation/current/configuring-jetty-request-logs.html
# Days to retain the logs defaults to 30 if it is not set.
# ignore_paths allows you to set which paths you want jetty to not log, if any. The below would cause jetty to not log calls made to the version and health endpoints.
#request_logging: {
#enabled: true
#retain_days: 30
#ignore_paths: ["/api/v1/version", "/api/v1/health/*"]
#}
# KairosDB now uses the Jetty JAAS Framework to do Authentication. You can extend the Jetty JAAS framework to do many kinds of authentication.
# Under the auth folder you will find 3 files that allow configuration of file based basic auth or LDAP auth
# Samples of how to configure the various auth versions can be found here https://www.eclipse.org/jetty/documentation/current/jaas-support.html
# basicAuth.conf and auth.props will configure the PropertyFileLoginModule and ldap-loginModule.conf will configure the ldap module.
# the auth_module_name below should match the name outside the curly braces in the respective conf file.
# The basics of how to configure two or more modules to work together in logic and/or combinations can be found here https://docs.oracle.com/javase/8/docs/api/javax/security/auth/login/Configuration.html and here https://github.com/rundeck/rundeck/wiki/Multiple-authentication-modules
#auth_module_name: "basicAuth"
# To enable thread pooling uncomment the following lines and specify the limits
#threads.queue_size: 6000
#threads.min: 1000
#threads.max: 2500
#threads.keep_alive_ms: 10000
}
#===============================================================================
# Each factory must be bound in a guice module. The definition here defines what
# protocol data type the factory services.
datapoints.factory: {
# Default data point implementation for long - class must implement LongDataPointFactory
long: "org.kairosdb.core.datapoints.LongDataPointFactoryImpl"
# Default data point implementation for double - class must implement DoubleDataPointFactory
double: "org.kairosdb.core.datapoints.DoubleDataPointFactoryImpl"
string: "org.kairosdb.core.datapoints.StringDataPointFactory"
}
#===============================================================================
service.reporter = org.kairosdb.core.reporting.MetricReportingModule
reporter: {
# Uses Quartz Cron syntax - default is to run every minute
schedule: "0 */1 * * * ?"
# TTL to apply to all kairos reported metrics
#ttl: 0
ttl: 7776000
}
#===============================================================================
#Configure the datastore
#service.datastore: "org.kairosdb.datastore.h2.H2Module"
service.datastore: "org.kairosdb.datastore.cassandra.CassandraModule"
datastore.concurrentQueryThreads: 5
datastore.h2.database_path: "build/h2db"
datastore.cassandra: {
#This lets you add additional parameters to the CQL create statement
#For example if you want to change compression or compaction strategies
#This only takes effect the first time Kairos starts and tries to create
#the tables if they do not exist
table_create_with: {
data_points: "WITH COMPACT STORAGE"
row_key_index: ""
row_key_time_index: ""
row_keys: ""
tag_indexed_row_keys: ""
string_index: ""
service_index: ""
}
#For a single metric query this dictates the number of simultaneous cql queries
#to run (ie one for each partition key of data). The larger the cluster the higher you may want
#this number to be.
simultaneous_cql_queries: 20
# query_reader_threads is the number of threads to use to read results from
# each cql query. You may want to change this number depending on your environment
query_reader_threads: 6
# When set, the query_limit will prevent any query reading more than the specified
# number of data points. When the limit is reached an exception is thrown and an
# error is returned to the client. Set this value to 0 to disable (default)
#query_limit: 10000000
# When set, the query_time_limit_sec will try to prevent any query from taking
# longer than the number of seconds specified. This time is measured while
# doing actual work against Cassandra. A query could be blocked on slower queries
# at a higher level and actually take longer than the specified time.
#query_time_limit_sec: 60
//Todo this is wrong
#Size of the row key cache size. This can be monitored by querying
#kairosdb.datastore.cassandra.write_batch_size.sum and filtering on the tag table = row_keys
#Ideally the data written to the row_keys should stabilize to zero except
#when data rolls to a new row
#row_key_cache_size: 50000
row_key_cache_size: 3100000
string_cache_size: 50000
#the time to live in seconds for datapoints. After this period the data will be
#deleted automatically. If not set the data will live forever.
#TTLs are added to columns as they're inserted so setting this will not affect
#existing data, only new data.
#datapoint_ttl: 31536000
datapoint_ttl: 2592000
#When start_async is set to true a background thread is created to try and
#connect to cassandra when starting up Kairos. This allows Kairos to start
#even if Cassandra is not yet available. The background thread repeatedly
#attempts to connect every 1sec until it is successful.
#Setting start_async to false means kairos will fail to start if Cassandra
#is not available.
start_async: false
# This identifies the cluster that metrics are written to. The write_cluster also
# participates in any metric query. If you only have one C* cluster then
# it must be specified as the write_cluster
write_cluster: {
# name of the cluster as it shows up in client specific metrics
name: "Scylla Cluster"
keyspace: "kairosdb"
replication: "{'class': 'NetworkTopologyStrategy','dc1' : '2'}"
#cql_host_list: ["localhost"]
cql_host_list: ["node1", "node2", "node3", "node4", "node5", "node6"]
# Set this if this kairosdb node connects to cassandra nodes in multiple datacenters.
# Not setting this will select cassandra hosts using the RoundRobinPolicy, while setting this will use DCAwareRoundRobinPolicy.
#local_dc_name: "<local dc name>"
#Control the required consistency for cassandra operations.
#Available settings are cassandra version dependent:
#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html
read_consistency_level: "ONE"
#write_consistency_level: "QUORUM"
write_consistency_level: "ONE"
#protocol compression to use in the Cassandra client. Available values are
#LZ4, SNAPPY, NONE. Defaults to LZ4
protocol_compression: "LZ4"
#The number of times to retry a request to C* in case of a failure.
request_retry_count: 1
connections_per_host: {
local.core: 36
local.max: 36
remote.core: 36
remote.max: 36
}
# If using cassandra 3.0 or later consider increasing this value
max_requests_per_connection: {
local: 8
remote: 8
}
#max_queue_size: 500
max_queue_size: 500000
#for cassandra authentication use the following
#auth.[prop name]=[prop value]
#example:
#auth.user_name=
#auth.password=
# Set this property to true to enable SSL connections to your C* cluster.
# Follow the instructions found here: http://docs.datastax.com/en/developer/java-driver/3.1/manual/ssl/
# to create a keystore and pass the values into Kairos using the -D switches
use_ssl: false
# Set this property to a list of metric names for which the tag-indexed row key lookup table should be used
# to improve query speed when high-cardinality tags are involved. Setting this to a single entry of ['*']
# will enable it for all metrics.
# Turning this on can significantly increase the number of writes to Cassandra.
# Row keys in Cassandra should only be written once every 3 weeks for
# every unique tag combination, if your cache is large enough.
# Say you have a metric with 3 tags: inserting a value with tags not seen
# before will generate 2 inserts, one for the data and one for the row index.
# If you turn on the tag index lookup for that metric it will create 5 inserts:
# the same as before plus one for each tag.
# The property can be set as a list of metric names where all tags get indexed
# or it can be an object where you specify the tags to index for each metric
# tag_indexed_row_key_lookup_metrics: {
# metric1: [ tag1 ]
# metric2: [ * ]
# metric3: []
# }
# Only tag1 for metric1 will be indexed. For metric2 and metric3 all tags will be indexed.
tag_indexed_row_key_lookup_metrics: []
# The row_time_unit and row_width configurations are one time use and only read
# from a write cluster configuration. These configurations are used when
# creating the schema and how data is written to Cassandra.
# row_time_unit - can either be set to SECONDS or MILLISECONDS and
# determines the granularity of the data stored in Cassandra. If set to
# SECONDS data sent to Kairos will still be sent according to the API used
# and millisecond timestamps will be truncated to seconds.
# The row_width parameter determines how wide the rows are in Cassandra, in
# other words how long data is written to a row. The default is just about
# 3 weeks.
#row_time_unit: "MILLISECONDS"
#row_width: 1814400000
}
# Rename this to read_clusters in order for it to be used
# This is for additional clusters of old data that you want to make available
# for queries. The cql_host_list SHOULD NOT point to the same cluster as
# the write_cluster above.
# All properties found in the write_cluster section can be used here as well
# As this property is a list you can specify 0 or more read clusters.
# The idea behind read_clusters is so you can manage data growth, so instead
# of adding more nodes to an older C* cluster you can create new clusters
# on newer versions of C*. Older clusters can then be turned off or shrunk.
read_clusters_not: [
{
name: "read_cluster"
keyspace: "kairosdb"
replication: "{'class': 'SimpleStrategy','replication_factor' : 1}"
//cql_host_list: ["kairos01", "kairos02", "kairos03"]
cql_host_list: ["localhost"]
#local_dc_name: "<local dc name>"
read_consistency_level: "ONE"
write_consistency_level: "QUORUM"
connections_per_host: {
local.core: 5
local.max: 100
remote.core: 1
remote.max: 10
}
max_requests_per_connection: {
local: 128
remote: 128
}
max_queue_size: 500
use_ssl: false
# Start and end date are optional configuration parameters
# The start and end date set bounds on the data in this cluster
# queries that do not include this time range will not be sent
# to this cluster.
start_time: "2001-07-04T12:08-0700"
end_time: "2001-07-04T12:08-0700"
}
]
}
#===============================================================================
#Uncomment this line to require oauth connections to http server
#service.oauth: "org.kairosdb.core.oauth.OAuthModule"
#OAuth consumer keys and secrets in the form
#oauth.consumer: {
# [consumer key]: "[consumer secret]"
#}
#===============================================================================
# Determines if cache files are deleted after being used or not.
# In some cases the cache file can be used again to speed up subsequent queries
# that query the same metric but aggregate the results differently.
query_cache.keep_cache_files: false
# Cache file cleaning schedule. Uses Quartz Cron syntax - this only matters if
# keep_cache_files is set to true
query_cache.cache_file_cleaner_schedule: "0 0 12 ? * SUN *"
#By default the query cache is located in kairos_cache under the system temp folder as
#defined by java.io.tmpdir system property. To override set the following value
#query_cache.cache_dir: ""
query_cache.cache_dir: "/kairoscache"
#===============================================================================
# Log long running queries, set this to true to record long running queries
# into kairos as the following metrics.
# kairosdb.log.query.remote_address - String data point that is the remote address
# of the system making the query
# kairosdb.log.query.json - String data point that is the query sent to Kairos
log.queries: {
enable: false
# Time to live to apply to the above metrics. This helps limit the amount of space
# used for storing the query information
ttl: 86400
# Time in seconds. If the query request time is longer than this value the query
# will be written to the above metrics
greater_than: 60
}
# When set to true the query stats are aggregated into min, max, avg, sum, count
# Setting to true will also disable the above log feature.
# Set this to true on Kairos nodes that receive large numbers of queries to save
# from inserting data with each query
queries.aggregate_stats = false
# If a tag filter value begins with this string the remaining is considered a
# regex to match against those tag values. ie {"host": "regex:server1[0-2]"}
# matches host tag values server10, server11 and server12
# set the value to an empty string to disable
queries.regex_prefix = "regex:"
#When set to true Kairos will insert the query into the response json as
#original_query. This is useful for some processes that send queries asynchronously
#and need a way to identify responses.
queries.return_query_in_response = false
#===============================================================================
# Health Checks
service.health: "org.kairosdb.core.health.HealthCheckModule"
#Response code to return from a call to /api/v1/health/check
#Some load balancers want 200 instead of 204
health.healthyResponseCode: 204
#===============================================================================
#Ingest queue processor
# The MemoryQueueProcessor keeps everything in memory before batching to
# cassandra and blocks when the queue is full
# The FileQueueProcessor uses a hybrid memory queue and a file backed queue
# Data is placed in both memory and in the file queue before a client response
# is sent. Data is read from file only when the lag is greater than what the
# memory queue can hold
queue_processor: {
#class: "org.kairosdb.core.queue.MemoryQueueProcessor"
class: "org.kairosdb.core.queue.FileQueueProcessor"
# The number of data points to send to Cassandra
# For the best performance you will want to set this to 10000 but first
# you will need to change the following values in your cassandra.yaml
# batch_size_warn_threshold_in_kb: 50
# batch_size_fail_threshold_in_kb: 70
# You may need to adjust the above numbers to fit your insert patterns.
# The CQL batch has a hard limit of 65535 items in a batch, make sure to stay
# under this as a single data point can generate more than one insert into Cassandra
# You will want to multiply this number by the number of hosts in the Cassandra
# cluster. A batch is pulled from the queue and then divided up depending on which
# host the data is destined for.
# If you set this value higher you may also get warnings in C* about Unlogged batches
# covering x number of partitions. You can remove this warning by increasing the value
# of this property in cassandra.yaml
# unlogged_batch_across_partitions_warn_threshold: 100
#batch_size: 10
#batch_size: 25
#batch_size: 50
#batch_size: 100
#batch_size: 200
#batch_size: 400
batch_size: 800
#batch_size: 1600
#batch_size: 3200
#batch_size: 6400
# If the queue doesn't have at least this many items to process the process thread
# will pause for .5 seconds to wait for more before grabbing data from the queue.
# This is an attempt to prevent chatty inserts which can cause extra load on
# Cassandra
min_batch_size: 100
# If the number of items in the process queue is less than {min_batch_size} the
# queue processor thread will wait this many milliseconds before flushing the data
# to C*. This is to prevent single chatty inserts. This only has effect when
# data is trickling in to Kairos.
min_batch_wait: 500
# The size (number of data points) of the memory queue
# In the case of FileQueueProcessor:
# Ingest data is written to the memory queue as well as to disk. If the system gets
# behind the memory queue is overrun and data is read from disk until it can
# catch up.
# In the case of MemoryQueueProcessor it defines the size of the memory queue.
#memory_queue_size: 100000
memory_queue_size: 2000000000
# The number of seconds before checkpointing the file backed queue. In the case of
# a crash the file backed queue is read from the last checkpoint
# Only applies to the FileQueueProcessor
seconds_till_checkpoint: 90
# Path to the file backed queue
# Only applies to the FileQueueProcessor
#queue_path: "queue"
queue_path: "/kairosqueue"
# Page size of the file backed queue 50Mb
# Only applies to the FileQueueProcessor
page_size: 52428800
}
#Number of threads allowed to insert data to the backend
#CassandraDatastore is the only use of this executor
#ingest_executor.thread_count = 10
ingest_executor.thread_count = 32
# The HostManager service keeps track of other kairos nodes in the cluster
# (ie that are talking to the same cassandra cluster). It does this by
# writing data to a service key and updating it periodically. The following
# settings define how often it checks the key for other hosts and when to mark
# them as inactive. This is primarily used for balancing rollup jobs
host_service_manager: {
check_delay_time_millseconds: 30000
inactive_time_seconds: 30
}
# This filters and prevents specified metrics from being ingested.
# This can be used to turn off kairos internal metrics or stop a flow of
# metrics that have too many tags, etc. Uncomment the module and then
# specify the filters to put in place.
#service.filter: "org.kairosdb.filter.FilterModule"
filter: {
# this does exact match filtering
list: [
]
# this does prefix match filtering
prefix: [
]
# this filters using regex's
regex: [
]
}
# sets the priority of the filter plugin so it can remove events before
# the datastore gets them.
eventbus.filter.priority.org.kairosdb.filter.FilterPlugin: 25
#===============================================================================
# Roll-ups
service.rollups=org.kairosdb.rollup.RollUpModule
# How often the Roll-up Manager queries for new or updated roll-ups
rollups: {
server_assignment {
check_update_delay_millseconds = 60000
}
}
#===============================================================================
#===============================================================================
#Demo and stress modules
# The demo module will load one year of data into kairos. The code goes back
# one year from the present and inserts data every minute. The data makes a
# pretty little sine wave.
#service.demo: "org.kairosdb.core.demo.DemoModule"
demo: {
metric_name: "demo_data"
# number of rows = number of host tags to add to the data. Setting the number_of_rows
# to 100 means 100 hosts reporting data every minute, each host has a different tag.
number_of_rows: 100
ttl: 0
}
# This just inserts data as fast as it can for the duration specified. Good for
# stress testing your backend. I have found that a single Kairos node will only
# send about 500k/sec because of a limitation in the cassandra client.
#service.blast: "org.kairosdb.core.blast.BlastModule"
# The number_of_rows translates into a random number between 0 and number_of_rows
# that is added as a tag to each data point, trying to simulate an even
# distribution of data in the cluster.
blast: {
number_of_rows: 1000
duration_seconds: 30
metric_name: "blast_load"
ttl: 600
}
}
#########################################################################
kairosdb.properties:
kairosdb.telnetserver.port=0
kairosdb.telnetserver.address=0.0.0.0
kairosdb.telnetserver.max_command_size=1024
# Properties that start with kairosdb.service are services that are started
# when kairos starts up. You can disable services in your custom
# kairosdb.properties file by setting the value to <disabled> ie
kairosdb.service.telnet=<disabled>
kairosdb.service.telnet=org.kairosdb.core.telnet.TelnetServerModule
kairosdb.service.http=org.kairosdb.core.http.WebServletModule
kairosdb.service.reporter=org.kairosdb.core.reporting.MetricReportingModule
#===============================================================================
#Each factory must be bound in a guice module. The definition here defines what
#protocol data type the factory services.
#Default data point implementation for long - class must implement LongDataPointFactory
kairosdb.datapoints.factory.long=org.kairosdb.core.datapoints.LongDataPointFactoryImpl
#Default data point implementation for double - class must implement DoubleDataPointFactory
kairosdb.datapoints.factory.double=org.kairosdb.core.datapoints.DoubleDataPointFactoryImpl
kairosdb.datapoints.factory.string=org.kairosdb.core.datapoints.StringDataPointFactory
#===============================================================================
# Uses Quartz Cron syntax - default is to run every minute
kairosdb.reporter.schedule=0 */1 * * * ?
# TTL to apply to all kairos reported metrics
kairosdb.reporter.ttl=7776000
#===============================================================================
# Set to 0 to turn off HTTP port
kairosdb.jetty.port=8080
kairosdb.jetty.address=0.0.0.0
kairosdb.jetty.static_web_root=webroot
# To enable SSL uncomment the following lines and specify the path to the keyStore and its password and port
#kairosdb.jetty.ssl.port=443
#kairosdb.jetty.ssl.protocols=TLSv1, TLSv1.1, TLSv1.2
#kairosdb.jetty.ssl.cipherSuites=TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA
#kairosdb.jetty.ssl.keystore.path=
#kairosdb.jetty.ssl.keystore.password=
#To enable http basic authentication uncomment the following lines and specify
#the user name and password for authentication.
#kairosdb.jetty.basic_auth.user=
#kairosdb.jetty.basic_auth.password=
# To enable thread pooling uncomment the following lines and specify the limits
#kairosdb.jetty.threads.queue_size=6000
#kairosdb.jetty.threads.min=1000
#kairosdb.jetty.threads.max=2500
#kairosdb.jetty.threads.keep_alive_ms=10000
# To set the traffic type allowed for the server, change the kairosdb.server.type entry to INGEST, QUERY, or DELETE,
# or a comma delimited list of two of them to enable different sets of API methods.
# The default setting is ALL, which will enable all API methods.
#kairosdb.server.type=ALL
#===============================================================================
#kairosdb.service.datastore=org.kairosdb.datastore.h2.H2Module
kairosdb.datastore.concurrentQueryThreads=5
kairosdb.service.datastore=org.kairosdb.datastore.cassandra.CassandraModule
#kairosdb.service.datastore=org.kairosdb.datastore.remote.RemoteModule
#===============================================================================
#H2 properties
kairosdb.datastore.h2.database_path=build/h2db
#===============================================================================
#Cassandra properties
#host list is in the form: 1.1.1.1:9042,1.1.1.2
#if the port is omitted it defaults to 9042
#kairosdb.datastore.cassandra.cql_host_list=localhost
kairosdb.datastore.cassandra.cql_host_list=node1,node2,node3,node4,node5,node6
kairosdb.datastore.cassandra.keyspace=kairosdb
#Sets the replication for the keyspace. This is only used the first time Kairos
#starts up and needs to create the schema in Cassandra. Later changes
#to this property have no effect.
#kairosdb.datastore.cassandra.replication={'class': 'SimpleStrategy','replication_factor' : 1}
kairosdb.datastore.cassandra.replication={'class': 'NetworkTopologyStrategy','dc1': '2'}
#For a single metric query this dictates the number of simultaneous cql queries
#to run (ie one for each partition key of data). The larger the cluster the higher you may want
#this number to be.
kairosdb.datastore.cassandra.simultaneous_cql_queries=20
# query_reader_threads is the number of threads to use to read results from
# each cql query. You may want to change this number depending on your environment
kairosdb.datastore.cassandra.query_reader_threads=6
# When set, the query_limit will prevent any query reading more than the specified
# number of data points. When the limit is reached an exception is thrown and an
# error is returned to the client. Set this value to 0 to disable (default)
#kairosdb.datastore.cassandra.query_limit=10000000
#Size of the row key cache size. This can be monitored by querying
#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index
#Ideally the data written to the row_key_index should stabilize to zero except
#when data rolls to a new row
#kairosdb.datastore.cassandra.row_key_cache_size=50000
kairosdb.datastore.cassandra.row_key_cache_size=3100000
kairosdb.datastore.cassandra.string_cache_size=50000
#Control the required consistency for cassandra operations.
#Available settings are cassandra version dependent:
#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html
kairosdb.datastore.cassandra.read_consistency_level=ONE
#kairosdb.datastore.cassandra.write_consistency_level=QUORUM
kairosdb.datastore.cassandra.write_consistency_level=ONE
# Set this if this kairosdb node connects to cassandra nodes in multiple datacenters.
# Not setting this will select cassandra hosts using the RoundRobinPolicy, while setting this will use DCAwareRoundRobinPolicy.
#kairosdb.datastore.cassandra.local_datacenter=
kairosdb.datastore.cassandra.connections_per_host.local.core=36
kairosdb.datastore.cassandra.connections_per_host.local.max=36
kairosdb.datastore.cassandra.connections_per_host.remote.core=36
kairosdb.datastore.cassandra.connections_per_host.remote.max=36
kairosdb.datastore.cassandra.max_requests_per_connection.local=8
kairosdb.datastore.cassandra.max_requests_per_connection.remote=8
#This is the number of retries the Cassandra Client RetryPolicy will use before
#giving up on a cql query. A good rule is to set it to the number of replicas you have
kairosdb.datastore.cassandra.request_retry_count=1
kairosdb.datastore.cassandra.max_queue_size=500
#for cassandra authentication use the following
#kairosdb.datastore.cassandra.auth.[prop name]=[prop value]
#example:
#kairosdb.datastore.cassandra.auth.user_name=
#kairosdb.datastore.cassandra.auth.password=
# Set this property to true to enable SSL connections to your C* cluster.
# Follow the instructions found here: http://docs.datastax.com/en/developer/java-driver/3.1/manual/ssl/
# to create a keystore and pass the values into Kairos using the -D switches
kairosdb.datastore.cassandra.use_ssl=false
#the time to live in seconds for datapoints. After this period the data will be
#deleted automatically. If not set the data will live forever.
#TTLs are added to columns as they're inserted so setting this will not affect
#existing data, only new data.
#kairosdb.datastore.cassandra.datapoint_ttl=31536000
kairosdb.datastore.cassandra.datapoint_ttl=2592000
# Set this property to true to align each datapoint ttl with its timestamp.
# example: datapoint_ttl is set to 30 days; ingesting a datapoint with timestamp '25 days ago'
# - Without this setting, the datapoint will be stored for 30 days from now, so it can be queried for 30 + 25 days, which may not be the intended behaviour with a 30-day ttl
# - Setting this property to true will only store this datapoint for 5 days (30 - 25 days).
# default: false
# Additional note: consider setting force_default_datapoint_ttl as well for full control
kairosdb.datastore.cassandra.align_datapoint_ttl_with_timestamp=true
# Set this property to true to force the default datapoint_ttl for all ingested datapoints, effectively ignoring any ttl information they provide.
# This gives you full control over the timespan the datapoints are stored in K*.
# default: false
# Additional note: consider setting align_datapoint_ttl_with_timestamp as well for full control
kairosdb.datastore.cassandra.force_default_datapoint_ttl=true
# Tells kairos to try and create the keyspace and tables on startup.
kairosdb.datastore.cassandra.create_schema=true
# Milliseconds for java driver to wait for a connection from a C* node
kairosdb.datastore.cassandra.connection_timeout=5000
# Milliseconds for java driver to wait for a response from a C* before giving up
kairosdb.datastore.cassandra.read_timeout=12000
#===============================================================================
# Remote datastore properties
# Load the RemoteListener modules instead of RemoteDatastore if you want to
# fork the flow of data. This module allows you to continue writing to your
# configured Datastore as well as send data on to a remote Kairos cluster
# Sample use case is to run clusters in parallel before migrating to a larger cluster
# Cannot be used in conjunction with the RemoteModule
#kairosdb.service.remote=org.kairosdb.datastore.remote.ListenerModule
# Location to store data locally before it is sent off
kairosdb.datastore.remote.data_dir=.
kairosdb.datastore.remote.remote_url=
# quartz cron schedule for sending data (currently set to 30 min)
kairosdb.datastore.remote.schedule=0 */30 * * * ?
# delay the sending of data for a random number of seconds.
# this prevents all remote kairos nodes from sending data at the same time
# the default of 600 means the data will be sent every half hour plus some
# delay up to 10 minutes.
kairosdb.datastore.remote.random_delay=0
# Optional prefix filter for remote module. If present, only metrics that start with the
# values in this comma-separated list are forwarded on.
#kairosdb.datastore.remote.prefix_filter=
#===============================================================================
#Uncomment this line to require oauth connections to http server
#kairosdb.service.oauth=org.kairosdb.core.oauth.OAuthModule
#OAuth consumer keys and secrets in the form
#kairosdb.oauth.consumer.[consumer key]=[consumer secret]
#===============================================================================
# Determines if cache files are deleted after being used or not.
# In some cases the cache file can be used again to speed up subsequent queries
# that query the same metric but aggregate the results differently.
kairosdb.query_cache.keep_cache_files=false
# Cache file cleaning schedule. Uses Quartz Cron syntax - this only matters if
# keep_cache_files is set to true
kairosdb.query_cache.cache_file_cleaner_schedule=0 0 12 ? * SUN *
#By default the query cache is located in kairos_cache under the system temp folder as
#defined by java.io.tmpdir system property. To override set the following value
#kairosdb.query_cache.cache_dir=
kairosdb.query_cache.cache_dir=/kairoscache
#===============================================================================
# Log long running queries, set this to true to record long running queries
# into kairos as the following metrics.
# kairosdb.log.query.remote_address - String data point that is the remote address
# of the system making the query
# kairosdb.log.query.json - String data point that is the query sent to Kairos
kairosdb.log.queries.enable=false
# Time to live to apply to the above metrics. This helps limit the amount of space
# used for storing the query information
kairosdb.log.queries.ttl=86400
# Time in seconds. If the query request time is longer than this value the query
# will be written to the above metrics
kairosdb.log.queries.greater_than=60
# When set to true the query stats are aggregated into min, max, avg, sum, count
# Setting to true will also disable the above log feature.
# Set this to true on Kairos nodes that receive large numbers of queries to save
# from inserting data with each query
kairosdb.queries.aggregate_stats=false
#===============================================================================
# Health Checks
kairosdb.service.health=org.kairosdb.core.health.HealthCheckModule
#Response code to return from a call to /api/v1/health/check
#Some load balancers want 200 instead of 204
kairosdb.health.healthyResponseCode=204
#===============================================================================
#Ingest queue processor
#kairosdb.queue_processor.class=org.kairosdb.core.queue.MemoryQueueProcessor
kairosdb.queue_processor.class=org.kairosdb.core.queue.FileQueueProcessor
# The number of data points to send to Cassandra
# For the best performance you will want to set this to 10000 but first
# you will need to change the following values in your cassandra.yaml
# batch_size_warn_threshold_in_kb: 50
# batch_size_fail_threshold_in_kb: 70
# You may need to adjust the above numbers to fit your insert patterns.
# The CQL batch has a hard limit of 65535 items in a batch, make sure to stay
# under this as a single data point can generate more than one insert into Cassandra
# You will want to multiply this number by the number of hosts in the Cassandra
# cluster. A batch is pulled from the queue and then divided up depending on which
# host the data is destined for.
#kairosdb.queue_processor.batch_size=200
#kairosdb.queue_processor.batch_size=400
kairosdb.queue_processor.batch_size=800
#kairosdb.queue_processor.batch_size=1600
#kairosdb.queue_processor.batch_size=3200
# If the queue doesn't have at least this many items to process the process thread
# will pause for {min_batch_wait} milliseconds to wait for more before grabbing data from the queue.
# This is an attempt to prevent chatty inserts which can cause extra load on
# Cassandra
kairosdb.queue_processor.min_batch_size=50
# If the number of items in the process queue is less than {min_batch_size} the
# queue processor thread will wait this many milliseconds before flushing the data
# to C*. This is to prevent single chatty inserts. This only has effect when
# data is trickling in to Kairos.
kairosdb.queue_processor.min_batch_wait=500
# The size (number of data points) of the memory queue
# In the case of FileQueueProcessor:
# Ingest data is written to the memory queue as well as to disk. If the system gets
# behind the memory queue is overrun and data is read from disk until it can
# catch up.
# In the case of MemoryQueueProcessor it defines the size of the memory queue.
#kairosdb.queue_processor.memory_queue_size=100000
kairosdb.queue_processor.memory_queue_size=2000000000
# The number of seconds before checkpointing the file backed queue. In the case of
# a crash the file backed queue is read from the last checkpoint
# Only applies to the FileQueueProcessor
kairosdb.queue_processor.seconds_till_checkpoint=90
# Path to the file backed queue
# Only applies to the FileQueueProcessor
kairosdb.queue_processor.queue_path=/kairosqueue
# Page size of the file backed queue 50Mb
# Only applies to the FileQueueProcessor
kairosdb.queue_processor.page_size=52428800
#Number of threads allowed to insert data to the backend
#CassandraDatastore is the only use of this executor
#kairosdb.ingest_executor.thread_count=10
kairosdb.ingest_executor.thread_count=32
#===============================================================================
# Roll-ups
#kairosdb.service.rollups=org.kairosdb.rollup.RollUpModule
# How often the Roll-up Manager queries for new or updated roll-ups
kairosdb.rollups.server_assignment.check_update_delay_millseconds=60000
#===============================================================================
#===============================================================================
# Host Manager Service
# How often the host service checks for active hosts
kairosdb.host_service_manager.check_delay_time_millseconds=60000
# How long before a host is considered inactive
kairosdb.host_service_manager.inactive_time_seconds=300
#===============================================================================
#===============================================================================
#Demo and stress modules
# The demo module will load one year of data into kairos. The code goes back
# one year from the present and inserts data every minute. The data makes a
# pretty little sine wave.
#kairosdb.service.demo=org.kairosdb.core.demo.DemoModule
kairosdb.demo.metric_name=demo_data
# number of rows = number of host tags to add to the data. Setting the number_of_rows
# to 100 means 100 hosts reporting data every minute, each host has a different tag.
kairosdb.demo.number_of_rows=100
kairosdb.demo.ttl=0
# This just inserts data as fast as it can for the duration specified. Good for
# stress testing your backend. I have found that a single Kairos node will only
# send about 500k/sec because of a limitation in the cassandra client.
#kairosdb.service.blast=org.kairosdb.core.blast.BlastModule
# The number_of_rows translates into a random number between 0 and number_of_rows
# that is added as a tag to each data point, trying to simulate an even
# distribution of data in the cluster.
kairosdb.blast.number_of_rows=1000
kairosdb.blast.duration_seconds=30
kairosdb.blast.metric_name=blast_load
kairosdb.blast.ttl=600
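Since the thread hinges on the kairosdb.conf above being a faithful translation of the old kairosdb.properties, one way to sanity-check it is to flatten both files and diff the effective keys. A rough sketch, assuming the pyhocon library; it may not handle every Kairos-specific construct (such as the <disabled> placeholder), so treat mismatches as leads rather than proof:

from pyhocon import ConfigFactory

def flatten(tree, prefix=""):
    # Flatten nested config trees into dotted keys.
    flat = {}
    for key, value in tree.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat

# HOCON side: the file's root "kairosdb" block flattens to kairosdb.* keys.
hocon = flatten(ConfigFactory.parse_file("kairosdb.conf").as_plain_ordered_dict())

# Properties side: later duplicate keys win, as in Java properties.
props = {}
with open("kairosdb.properties") as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()

for key in sorted(set(hocon) | set(props)):
    if str(hocon.get(key)) != str(props.get(key)):
        print(f"{key}: conf={hocon.get(key)!r} properties={props.get(key)!r}")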
I've done some testing on a single Kairos node and a single Cassandra 4 node. I'm getting almost identical performance from Kairos 1.2 and 1.3. I'm working on the 1.4 release right now and am upgrading to the latest Cassandra driver. I'll be testing it along with the other versions before I release to see if it makes any difference.
Thanks for that, Brian. I'll have a play with those thresholds and get back to you.
Also, someone else made a comment that made me think this may be the issue: key caching may have changed. Have a look at the kairos cassandra metrics for writes to the different tables: kairosdb.datastore.cassandra.write_batch_size.sum, then group by table. In an ideal state you are only writing to the data_points table and everything else gets cached. If the cache is too small you will see a lot of writes to the other tables. It would be interesting to compare the two versions and see if there is a difference there.
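Concretely, that check can be run as a group-by query against the REST API. A sketch under the same assumptions as the earlier example (hypothetical localhost:8080 address); in a healthy steady state almost all of the writes should land on data_points:

import requests

KAIROS_URL = "http://localhost:8080/api/v1/datapoints/query"  # hypothetical host

# Total write volume per Cassandra table over the last hour.
query = {
    "start_relative": {"value": 1, "unit": "hours"},
    "metrics": [{
        "name": "kairosdb.datastore.cassandra.write_batch_size.sum",
        "group_by": [{"name": "tag", "tags": ["table"]}],
        "aggregators": [{
            "name": "sum",
            "align_sampling": True,
            "sampling": {"value": 1, "unit": "hours"},
        }],
    }],
}

resp = requests.post(KAIROS_URL, json=query)
resp.raise_for_status()
for result in resp.json()["queries"][0]["results"]:
    table = result.get("tags", {}).get("table", ["?"])[0]
    print(table, sum(v for _, v in result["values"]))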
I've played with the batch size thresholds but I see no improvement. Looking at kairosdb.datastore.cassandra.write_batch_size.sum grouped by table, the only writes are to data_points; writes to the other tables are negligible or zero, so key caching doesn't seem to be the problem. I'll update Scylla.