
LLM As Database Administrator

Features | News | QuickStart | Customization | Cases | FAQ | Community | Contributors

[English | 中文]

🧗 D-Bot, an LLM-based DBA, can acquire database maintenance experience from textual sources and provide reasonable, well-founded, and timely diagnosis and optimization advice for target databases.



Features

DBA Team (D-Bot)

  • Well-Founded Diagnosis: D-Bot can provide well-founded diagnoses by utilizing relevant database knowledge (with document2experience).

  • Practical Tool Utilization: D-Bot can utilize both monitoring and optimization tools to improve its maintenance capability (with tool learning and tree of thought).

  • In-depth Reasoning: Compared with vanilla LLMs, D-Bot achieves competitive reasoning capability for analyzing root causes (with multi-llm communications).



What's New

  • [2023/9/10] Add diagnosis logs and replay button 🔗 link

  • [2023/9/09] Add typical anomalies 🔗 link

  • [2023/9/05] A unified framework is available! Start the diag+tool service with a single command and experience a 5x speedup!! 🚀 link

  • [2023/8/25] Support vue-based website interface. More flexible and beautiful! 🔗 link

  • [2023/8/22] Support tool retrieval for 60+ APIs 🔗 link

  • [2023/8/16] Support multi-level optimization functions 🔗 link

  • [2023/8/10] Our vision papers are released (continuously updated)

D-Bot is evolving with new features 👫👫
Don't forget to star ⭐ and watch 👀 to stay up to date :)

A demo of using D-Bot

db_diag.mp4

QuickStart

D-Bot

Folder Structure

```
.
├── multiagents
│   ├── agent_conf                        # Settings of each agent
│   ├── agents                            # Implementation of different agent types
│   ├── environments                      # E.g., chat orders / chat updates / termination conditions
│   ├── knowledge                         # Diagnosis experience from documents
│   ├── llms                              # Supported models
│   ├── memory                            # The content and summary of chat history
│   ├── response_formalize_scripts        # Removal of useless content from model responses
│   ├── tools                             # External monitoring/optimization tools for models
│   └── utils                             # Other functions (e.g., database/json/yaml operations)
```

1. Prerequisites

  • PostgreSQL v12 or higher

    Additionally, install extensions such as pg_stat_statements (track slow queries), pg_hint_plan (optimize physical operators), and hypopg (create hypothetical indexes); a quick way to check that they are installed is sketched after this list.

  • Prometheus and Grafana (tutorial)

    Check prometheus.md for detailed installation guides.

    Grafana is no longer a necessity with our vue-based frontend.
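As a quick sanity check for the PostgreSQL prerequisites above, the following minimal sketch (not part of the repository) queries pg_extension for the three required extensions; psycopg2 and the placeholder connection settings are assumptions.

```python
# Hypothetical helper: verify the required PostgreSQL extensions are installed.
import psycopg2

REQUIRED = {"pg_stat_statements", "pg_hint_plan", "hypopg"}

# Placeholder credentials; substitute the values from your tool_config.yaml.
conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT extname FROM pg_extension;")
    installed = {row[0] for row in cur.fetchall()}

missing = REQUIRED - installed
if missing:
    print(f"Missing extensions: {missing} (run CREATE EXTENSION for each)")
```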

2. Package Installation

Step 1: Install python packages.

```bash
pip install -r requirements.txt
```

Step 2: Configure environment variables.

  • Export your OpenAI API key

```bash
# macOS / Linux
export OPENAI_API_KEY="your_api_key_here"

# Windows (cmd); note: no quotes, or they become part of the value
set OPENAI_API_KEY=your_api_key_here
```

Step 3: Add database/anomaly/Prometheus settings to tool_config_example.yaml and rename it to tool_config.yaml:

```yaml
POSTGRESQL:
  host: 182.92.xxx.x
  port: 5432
  user: xxxx
  password: xxxxx
  dbname: postgres

BENCHSERVER:
  server_address: 8.131.xxx.xx
  username: root
  password: xxxxx
  remote_directory: /root/benchmark

PROMETHEUS:
  api_url: http://8.131.xxx.xx:9090/
  postgresql_exporter_instance: 172.27.xx.xx:9187
  node_exporter_instance: 172.27.xx.xx:9100
```

You can ignore the BENCHSERVER settings, which are not used in this version.
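For reference, here is a minimal sketch (an assumption, not repository code) of reading this config with PyYAML; the key names mirror tool_config_example.yaml above:

```python
# Hypothetical loader for tool_config.yaml using PyYAML.
import yaml

with open("tool_config.yaml") as f:
    config = yaml.safe_load(f)

pg = config["POSTGRESQL"]
print("PostgreSQL:", f"{pg['host']}:{pg['port']}/{pg['dbname']}")
print("Prometheus API:", config["PROMETHEUS"]["api_url"])
```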

  • If you access the OpenAI service via a VPN, execute this command:

```bash
# macOS / Linux
export https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890 all_proxy=socks5://127.0.0.1:7890
```
  • Test your OpenAI API key:

```bash
cd others
python openai_test.py
```
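If you prefer an inline check, the snippet below is similar in spirit to others/openai_test.py (assuming the pre-1.0 openai package that this project targeted in 2023):

```python
# Minimal connectivity check for the OpenAI API (openai<1.0 interface).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp["choices"][0]["message"]["content"])
```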

3. Diagnosis & Optimization

Website Interface

We also provide a local website demo for this environment. You can launch it with:

```bash
# cd website
cd front_demo
rm -rf node_modules/
rm -r package-lock.json
# install dependencies for the first run (Node.js ^16.13.1 is recommended)
npm install --legacy-peer-deps
# back to the root directory
cd ..
# launch the local server and open the website
sh run_demo.sh
```

Modify the "python app.py" command within run_demo.sh if multiple versions of Python are installed.

After successfully launching the local server, visit http://127.0.0.1:9228/ to trigger the diagnosis procedure.

Command-line Interface

```bash
python main.py
```

4. Supported Anomalies

Within the anomaly_trigger directory, we aim to offer scripts that can trigger typical anomalies, e.g.:

| Type | Root Cause | Description |
| --- | --- | --- |
| Query Issues | INSERT_LARGE_DATA | Long execution time for large data insertions |
| Query Issues | FETCH_LARGE_DATA | Long execution time for large data fetching |
| Query Issues | MISSING_INDEXES | Missing indexes causing performance issues |
| Query Issues | REDUNDANT_INDEX | Unnecessary and redundant indexes in tables |
| Query Issues | VACUUM | Unused space caused by data modifications |
| Query Issues | POOR_JOIN_PERFORMANCE | Poor performance of join operators |
| Query Issues | CORRELATED_SUBQUERY | Non-promotable subqueries in SQL |
| Query Issues | LACK_STATISTIC_INFO | Outdated statistical info affecting execution plans |
| Concurrency Issues | FETCH_LARGE_DATA | Fetching of large data volumes |
| Concurrency Issues | LOCK_CONTENTION | Lock contention issues |
| Concurrency Issues | CPU_CONTENTION | Severe external CPU resource contention |
| Concurrency Issues | IO_CONTENTION | I/O resource contention affecting SQL performance |
| Concurrency Issues | WORKLOAD_CONTENTION | Workload concentration affecting SQL execution |
| Configuration Issues | SMALL_MEMORY_ALLOCATION | Too small allocated memory space |
| Configuration Issues | IO_SATURATION | Reaching the maximum I/O capacity or throughput |
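The real triggers live in anomaly_trigger; purely as an illustration of the idea, the hedged sketch below simulates a MISSING_INDEXES-style anomaly by dropping an index and running repeated filtered scans (the table, column, and index names are hypothetical):

```python
# Hypothetical MISSING_INDEXES trigger: repeated scans on an unindexed column.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    # Ensure the supporting index is absent (hypothetical index name).
    cur.execute("DROP INDEX IF EXISTS idx_orders_customer_id;")
    # Each query now forces a sequential scan over the large table.
    for _ in range(100):
        cur.execute("SELECT count(*) FROM orders WHERE customer_id = %s;", (42,))
```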

Customize Your KnowledgeBase And Tools

1. Knowledge Preparation

  • Extract knowledge from both code (./knowledge_json/knowledge_from_code) and documents (./knowledge_json/knowledge_from_document).
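A minimal sketch (an assumption about the file layout, not repository code) of loading the extracted knowledge chunks from those directories:

```python
# Hypothetical loader: gather all JSON knowledge files under knowledge_json/.
import json
from pathlib import Path

knowledge = []
for path in Path("./knowledge_json").rglob("*.json"):
    with open(path, encoding="utf-8") as f:
        knowledge.append(json.load(f))
print(f"Loaded {len(knowledge)} knowledge files.")
```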

2. Tool Preparation

  • Tool APIs (for optimization)

    | Module | Functions |
    | --- | --- |
    | index_selection (equipped) | heuristic algorithm |
    | query_rewrite (equipped) | 45 rules |
    | physical_hint (equipped) | 15 parameters |

    For functions within [query_rewrite, physical_hint], you can use the api_test.py script to verify their effectiveness.

    If a function actually works, append it to the api.py of the corresponding module; a sketch of such a verification follows this list.

  • Tool Usage Algorithm (tree of thought)

```bash
cd tree_of_thought
python test_database.py
```

    History messages may consume many tokens, so choose the number of dialogue turns carefully; a minimal truncation sketch is also given below.
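First, the verification idea mentioned under Tool APIs: a hedged sketch (not the actual api_test.py) that compares the planner's estimated cost before and after a rewrite. The connection settings and the example rewrite are assumptions.

```python
# Hypothetical effectiveness check: does a rewritten query get a cheaper plan?
import json
import psycopg2

def plan_cost(cur, sql: str) -> float:
    """Return the planner's estimated total cost for a SQL statement."""
    cur.execute("EXPLAIN (FORMAT JSON) " + sql)
    plan = cur.fetchone()[0]
    if isinstance(plan, str):  # depending on the driver, may arrive as text
        plan = json.loads(plan)
    return plan[0]["Plan"]["Total Cost"]

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    original = "SELECT * FROM t WHERE id IN (SELECT id FROM s)"   # hypothetical
    rewritten = "SELECT t.* FROM t JOIN s ON t.id = s.id"         # e.g., an IN-to-JOIN rule
    print("rewrite effective:", plan_cost(cur, rewritten) < plan_cost(cur, original))
```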
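Second, the turn-number concern: a minimal sketch (an assumption, not the repo's memory module) that caps chat history at the most recent turns so prompts stay within the token budget.

```python
# Hypothetical history cap: keep the system prompt plus the last MAX_TURNS turns.
MAX_TURNS = 4  # one turn = a user message plus the assistant reply

def truncate_history(messages: list[dict]) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * MAX_TURNS:]
```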

Robust Response Mechanisms

Refined Diagnostics

Remove redundant content from LLM responses.

```bash
cd response_formalize_scripts
python combine_similar_answer.py  # test
```
Before Cleaning
[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
Optimize the slow queries by analyzing their execution plans, identifying any missing indexes, and tuning the queries accordingly
Polisih the slow queries by checking their execution plans, identifying any missing indexes, and tuning the queries accordingly
Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits
Investigate and address any system or infrastructure issues that may be contributing to the high load averages
Consider scaling up or adding more resources to the system if the current CPU usage is consistently high and impacting performance
Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost
Further investigate and optimize the slow queries for better memory utilization
Monitor and adjust the memory allocation parameters like 'work_mem' and 'shared_buffers' to optimize memory usage
Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling
After Cleaning
[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
Polisih the slow queries by checking their execution plans, identifying any missing indexes, and tuning the queries accordingly.
Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost.
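A hedged sketch of the idea behind combine_similar_answer.py: drop a suggestion when it is nearly identical to one already kept. The similarity threshold, and which of two near-duplicates survives, are assumptions rather than the script's actual behavior.

```python
# Hypothetical near-duplicate filter using difflib's similarity ratio.
from difflib import SequenceMatcher

def combine_similar(lines: list[str], threshold: float = 0.8) -> list[str]:
    kept: list[str] = []
    for line in lines:
        # Keep a line only if it is sufficiently different from all kept lines.
        if all(SequenceMatcher(None, line, k).ratio() < threshold for k in kept):
            kept.append(line)
    return kept
```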

Cases

MISSING_INDEXES → INDEX_SELECTION

Note that optimize_index_selection takes minutes to explore beneficial index combinations.

[Chief DBA]: The anomaly started at 1691897340 and ended at 1691897430.

[CPU Agent]: The anomaly is caused by high CPU usage due to the high number of running processes and the relatively high load averages, especially in the past 1 minute. In addition, slow queries also contribute to the high CPU usage.
1. Optimize the slow queries by analyzing their execution plans, identifying any missing indexes, and tuning the queries accordingly.
2. Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
3. Investigate and address any system or infrastructure issues that may be contributing to the high load averages.
4. Consider scaling up or adding more resources to the system if the current CPU usage is consistently high and impacting performance.
5. Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost from 46856716.92000003 to 31393898.919999983.
----------------------------
(matched knowledge) The high CPU usage may be due to the high number of running processes and the relatively high load averages. Slow queries can also contribute to the high CPU usage. The missing_index function checks for the presence of a required index using a workload-index-recommend interface.

[Memory Agent]: The potential causes of high memory usage could be complex queries, insufficient memory allocation, and possibility of disk spill. The slow queries mentioned involve complex operations such as filtering, grouping, and joining multiple tables, which can consume a significant amount of memory, especially if they involve large data sets. The abnormal memory metric values suggest that the available memory might be insufficient to handle the workload. Adjusting parameters like 'work_mem' and 'shared_buffers' might help alleviate memory issues. The complexity and size of the slow queries could potentially lead to disk spill during execution, which can significantly impact performance.
1. Further investigate and optimize the slow queries mentioned above for better memory utilization.
2. Monitor and adjust the memory allocation parameters like 'work_mem' and 'shared_buffers' to optimize memory usage.
3. Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling.
----------------------------
(matched knowledge) Disk spill is a function that checks whether there is a possibility of disk spill during the execution of SQL. If the plan parse information is not available, it checks whether the sort spill count or hash spill count exceeds the sort rate threshold. If the plan parse information is available, it calculates the total cost of the plan and checks whether the cost rate of the sort or hash operators exceeds the cost rate threshold. If abnormal operator details are found and the sort or hash spill count is greater than 0, it indicates that the SORT/HASH operation may spill to disk. The suggestion is to analyze whether the business needs to adjust parameters like work_mem and shared_buffers. If disk spill is detected, it is a root cause, otherwise it is not a root cause.

[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
----------------------------
(solution) 1. Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
2. Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost.
3. Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling.
4. Investigate and address any system or infrastructure issues that may be contributing to the high load averages.

INSUFFICIENT_MEMORY_ALLOCATION

[Chief DBA]: The anomaly started at 1691897340 and ended at 1691897430.

[CPU Agent]: Based on the CPU relevant metric values, the root cause of high CPU usage could be a high load value, too many active processes, blocked processes, high CPU scrape collector duration, and low entropy.
I suggest reviewing system logs and identifying any recent changes or updates. We should also analyze other relevant metrics such as memory usage, disk I/O, network traffic, and check for any abnormal behavior from specific processes or applications.
----------------------------
(matched knowledge) missing_index: This function checks for the presence of a required index using a workload-index-recommend interface. If the recommended index information is available, it indicates that a required index is missing and provides a suggestion for the recommended index. If the information is not available, it is not a root cause for the issue.

[Memory Agent]: The root cause of the high memory usage could be disk spill and insufficient memory allocation.
I suggest analyzing the actual resource usage during query execution, performing a thorough analysis of the query plans, and evaluating the system configuration settings related to memory allocation, such as 'work_mem' and 'shared_buffers.' More information about the sort spill count or hash spill count and the threshold values is needed to confirm if disk spill is the root cause. If the free memory is consistently low and close to the total memory capacity, consider increasing the memory allocation for the system.
----------------------------
(matched knowledge) disk_spill: This function checks whether there is a possibility of disk spill during the execution of SQL. If the plan parse information is not available, it checks whether the sort spill count or hash spill count exceeds the sort rate threshold. If the plan parse information is available, it calculates the total cost of the plan and checks whether the cost rate of the sort or hash operators exceeds the cost rate threshold. If abnormal operator details are found and the sort or hash spill count is greater than 0, it indicates that the SORT/HASH operation may spill.

[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to high load value, too many active processes, blocked processes, high CPU scrape collector duration, and low entropy. The high memory usage could be due to disk spill and insufficient memory allocation.
----------------------------
(solution) To resolve the high CPU usage, we should review system logs and identify any recent changes or updates. We should also analyze other relevant metrics such as memory usage, disk I/O, network traffic, and check for any abnormal behavior from specific processes or applications.
To mitigate the high memory usage, we should analyze the actual resource usage during query execution, perform a thorough analysis of the query plans, and evaluate the system configuration settings related to memory allocation, such as 'work_mem' and 'shared_buffers.' More information about the sort spill count or hash spill count and the threshold values is needed to confirm if disk spill is the root cause. If the free memory is consistently low and close to the total memory capacity, consider increasing the memory allocation for the system.

POOR_JOIN_PERFORMANCE

case_poor_join5.mp4

FAQ

🤨 The '.sh' script cannot be executed on Windows systems. Switch the shell to *git bash* or use *git bash* to execute the '.sh' script.

🤨 "No module named 'xxx'" on Windows systems. This error is caused by issues with the Python runtime environment path. Perform the following steps:

Step 1: Check Environment Variables.

You must add Python's "Scripts" directory to the PATH environment variable.

Step 2: Check IDE Settings.

For VS Code, install the Python extension. For PyCharm, specify the Python interpreter for the current project.

Todo

  • Project cleaning
  • Support more anomalies
  • Add more communication mechanisms
  • Publish generated anomaly training data
  • Fine-tune localized private models
  • Integrate preparation components into the demo website
  • Support other databases like MySQL
  • Collect more knowledge and store it in a vector database (./knowledge_vector_db)

Community

Relevant Projects

https://github.com/OpenBMB/AgentVerse

https://github.com/OpenBMB/BMTools

Citation

Feel free to cite us if you like this project.

```bibtex
@misc{zhou2023llm4diag,
      title={LLM As DBA},
      author={Xuanhe Zhou and Guoliang Li and Zhiyuan Liu},
      year={2023},
      eprint={2308.05481},
      archivePrefix={arXiv},
      primaryClass={cs.DB}
}

@misc{zhou2023dbgpt,
      title={DB-GPT: Large Language Model Meets Database},
      author={Xuanhe Zhou and Zhaoyan Sun and Guoliang Li},
      year={2023},
      journal={Data Science and Engineering}
}
```

Contributors

Other Collaborators: Wei Zhou, Kunyi Li.

We thank all the contributors to this project. Do not hesitate if you would like to get involved or contribute!
