
LLM As Database Administrator

Features | News | QuickStart | Customization | Cases | FAQ | Community | Contributors

[English | 中文]

🧗 D-Bot, an LLM-based DBA, can acquire database maintenance experience from textual sources and provide reasonable, well-founded, and timely diagnosis and optimization advice for target databases.



Features

DBA Team (D-Bot)

  • Well-Founded Diagnosis: D-Bot can provide well-founded diagnoses by utilizing relevant database knowledge (with document2experience).

  • Practical Tool Utilization: D-Bot can utilize both monitoring and optimization tools to improve its maintenance capability (with tool learning and tree of thought).

  • In-depth Reasoning: Compared with vanilla LLMs, D-Bot achieves competitive reasoning capability for analyzing root causes (with multi-llm communications).



What's New

  • [2023/9/10] Add diagnosis logs and replay button 🔗 link

  • [2023/9/09] Add typical anomalies 🔗 link

  • [2023/9/05] A unified framework is available! Start the diag+tool service with a single command and experience a 5x speedup!! 🚀 link

  • [2023/8/25] Support vue-based website interface. More flexible and beautiful! 🔗 link

  • [2023/8/22] Support tool retrieval for 60+ APIs 🔗 link

  • [2023/8/16] Support multi-level optimization functions 🔗 link

  • [2023/8/10] Our vision papers are released (continuously updated)

D-Bot is evolving with new features 👫👫
Don't forget to star ⭐ and watch 👀 to stay up to date :)

A demo of using D-Bot

db_diag.mp4

QuickStart

D-Bot

Folder Structure

```
.
├── multiagents
│   ├── agent_conf                        # Settings of each agent
│   ├── agents                            # Implementation of different agent types
│   ├── environments                      # E.g., chat orders / chat updates / termination conditions
│   ├── knowledge                         # Diagnosis experience from documents
│   ├── llms                              # Supported models
│   ├── memory                            # The content and summary of chat history
│   ├── response_formalize_scripts        # Removal of useless content from model responses
│   ├── tools                             # External monitoring/optimization tools for models
│   └── utils                             # Other functions (e.g., database/json/yaml operations)
```

1. Prerequisites

  • PostgreSQL v12 or higher

    Additionally, install extensions such as pg_stat_statements (track slow queries), pg_hint_plan (optimize physical operators), and hypopg (create hypothetical indexes); a quick way to check that they are installed is sketched after this list.

  • Prometheus and Grafana (tutorial)

    Check prometheus.md for detailed installation guides.

    Grafana is no longer a necessity with our vue-based frontend.
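As a quick sanity check for the PostgreSQL prerequisites above, the following minimal sketch (not part of the repository) queries pg_extension for the three required extensions; psycopg2 and the placeholder connection settings are assumptions.

```python
# Hypothetical helper: verify the required PostgreSQL extensions are installed.
import psycopg2

REQUIRED = {"pg_stat_statements", "pg_hint_plan", "hypopg"}

# Placeholder credentials; substitute the values from your tool_config.yaml.
conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT extname FROM pg_extension;")
    installed = {row[0] for row in cur.fetchall()}

missing = REQUIRED - installed
if missing:
    print(f"Missing extensions: {missing} (run CREATE EXTENSION for each)")
```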

2. Package Installation

Step 1: Install python packages.

```bash
pip install -r requirements.txt
```

Step 2: Configure environment variables.

  • Export your OpenAI API key

```bash
# macOS / Linux
export OPENAI_API_KEY="your_api_key_here"

# Windows (cmd); note: no quotes, or they become part of the value
set OPENAI_API_KEY=your_api_key_here
```

Step 3: Add database/anomaly/Prometheus settings to tool_config_example.yaml and rename it to tool_config.yaml:

```yaml
POSTGRESQL:
  host: 182.92.xxx.x
  port: 5432
  user: xxxx
  password: xxxxx
  dbname: postgres

BENCHSERVER:
  server_address: 8.131.xxx.xx
  username: root
  password: xxxxx
  remote_directory: /root/benchmark

PROMETHEUS:
  api_url: http://8.131.xxx.xx:9090/
  postgresql_exporter_instance: 172.27.xx.xx:9187
  node_exporter_instance: 172.27.xx.xx:9100
```

You can ignore the BENCHSERVER settings, which are not used in this version.
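For reference, here is a minimal sketch (an assumption, not repository code) of reading this config with PyYAML; the key names mirror tool_config_example.yaml above:

```python
# Hypothetical loader for tool_config.yaml using PyYAML.
import yaml

with open("tool_config.yaml") as f:
    config = yaml.safe_load(f)

pg = config["POSTGRESQL"]
print("PostgreSQL:", f"{pg['host']}:{pg['port']}/{pg['dbname']}")
print("Prometheus API:", config["PROMETHEUS"]["api_url"])
```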

  • If you access the OpenAI service via a VPN, execute this command:

```bash
# macOS / Linux
export https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890 all_proxy=socks5://127.0.0.1:7890
```
  • Test your OpenAI API key:

```bash
cd others
python openai_test.py
```
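If you prefer an inline check, the snippet below is similar in spirit to others/openai_test.py (assuming the pre-1.0 openai package that this project targeted in 2023):

```python
# Minimal connectivity check for the OpenAI API (openai<1.0 interface).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp["choices"][0]["message"]["content"])
```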

3. Diagnosis & Optimization

Website Interface

We also provide a local website demo for this environment. You can launch it with:

```bash
# cd website
cd front_demo
rm -rf node_modules/
rm -r package-lock.json
# install dependencies for the first run (Node.js ^16.13.1 is recommended)
npm install --legacy-peer-deps
# back to the root directory
cd ..
# launch the local server and open the website
sh run_demo.sh
```

Modify the "python app.py" command within run_demo.sh if multiple versions of Python are installed.

After successfully launching the local server, visit http://127.0.0.1:9228/ to trigger the diagnosis procedure.

Command-line Interface

```bash
python main.py
```

4. Supported Anomalies

Within the anomaly_trigger directory, we aim to offer scripts that can trigger typical anomalies, e.g.:

| Type | Root Cause | Description |
| --- | --- | --- |
| Query Issues | INSERT_LARGE_DATA | Long execution time for large data insertions |
| Query Issues | FETCH_LARGE_DATA | Long execution time for large data fetching |
| Query Issues | MISSING_INDEXES | Missing indexes causing performance issues |
| Query Issues | REDUNDANT_INDEX | Unnecessary and redundant indexes in tables |
| Query Issues | VACUUM | Unused space caused by data modifications |
| Query Issues | POOR_JOIN_PERFORMANCE | Poor performance of join operators |
| Query Issues | CORRELATED_SUBQUERY | Non-promotable subqueries in SQL |
| Query Issues | LACK_STATISTIC_INFO | Outdated statistical info affecting execution plans |
| Concurrency Issues | FETCH_LARGE_DATA | Fetching of large data volumes |
| Concurrency Issues | LOCK_CONTENTION | Lock contention issues |
| Concurrency Issues | CPU_CONTENTION | Severe external CPU resource contention |
| Concurrency Issues | IO_CONTENTION | I/O resource contention affecting SQL performance |
| Concurrency Issues | WORKLOAD_CONTENTION | Workload concentration affecting SQL execution |
| Configuration Issues | SMALL_MEMORY_ALLOCATION | Too small allocated memory space |
| Configuration Issues | IO_SATURATION | Reaching the maximum I/O capacity or throughput |
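The real triggers live in anomaly_trigger; purely as an illustration of the idea, the hedged sketch below simulates a MISSING_INDEXES-style anomaly by dropping an index and running repeated filtered scans (the table, column, and index names are hypothetical):

```python
# Hypothetical MISSING_INDEXES trigger: repeated scans on an unindexed column.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    # Ensure the supporting index is absent (hypothetical index name).
    cur.execute("DROP INDEX IF EXISTS idx_orders_customer_id;")
    # Each query now forces a sequential scan over the large table.
    for _ in range(100):
        cur.execute("SELECT count(*) FROM orders WHERE customer_id = %s;", (42,))
```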

Customize Your KnowledgeBase And Tools

1. Knowledge Preparation

  • Extract knowledge from both code (./knowledge_json/knowledge_from_code) and documents (./knowledge_json/knowledge_from_document).
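A minimal sketch (an assumption about the file layout, not repository code) of loading the extracted knowledge chunks from those directories:

```python
# Hypothetical loader: gather all JSON knowledge files under knowledge_json/.
import json
from pathlib import Path

knowledge = []
for path in Path("./knowledge_json").rglob("*.json"):
    with open(path, encoding="utf-8") as f:
        knowledge.append(json.load(f))
print(f"Loaded {len(knowledge)} knowledge files.")
```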

2. Tool Preparation

  • Tool APIs (for optimization)

    | Module | Functions |
    | --- | --- |
    | index_selection (equipped) | heuristic algorithm |
    | query_rewrite (equipped) | 45 rules |
    | physical_hint (equipped) | 15 parameters |

    For functions within [query_rewrite, physical_hint], you can use the api_test.py script to verify their effectiveness.

    If a function actually works, append it to the api.py of the corresponding module; a sketch of such a verification follows this list.

  • Tool Usage Algorithm (tree of thought)

```bash
cd tree_of_thought
python test_database.py
```

    History messages may consume many tokens, so choose the number of dialogue turns carefully; a minimal truncation sketch is also given below.
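First, the verification idea mentioned under Tool APIs: a hedged sketch (not the actual api_test.py) that compares the planner's estimated cost before and after a rewrite. The connection settings and the example rewrite are assumptions.

```python
# Hypothetical effectiveness check: does a rewritten query get a cheaper plan?
import json
import psycopg2

def plan_cost(cur, sql: str) -> float:
    """Return the planner's estimated total cost for a SQL statement."""
    cur.execute("EXPLAIN (FORMAT JSON) " + sql)
    plan = cur.fetchone()[0]
    if isinstance(plan, str):  # depending on the driver, may arrive as text
        plan = json.loads(plan)
    return plan[0]["Plan"]["Total Cost"]

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="xxxxx", dbname="postgres")
with conn, conn.cursor() as cur:
    original = "SELECT * FROM t WHERE id IN (SELECT id FROM s)"   # hypothetical
    rewritten = "SELECT t.* FROM t JOIN s ON t.id = s.id"         # e.g., an IN-to-JOIN rule
    print("rewrite effective:", plan_cost(cur, rewritten) < plan_cost(cur, original))
```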
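Second, the turn-number concern: a minimal sketch (an assumption, not the repo's memory module) that caps chat history at the most recent turns so prompts stay within the token budget.

```python
# Hypothetical history cap: keep the system prompt plus the last MAX_TURNS turns.
MAX_TURNS = 4  # one turn = a user message plus the assistant reply

def truncate_history(messages: list[dict]) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * MAX_TURNS:]
```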

Robust Response Mechanisms

Refined Diagnostics

Remove redundant content from LLM responses.

```bash
cd response_formalize_scripts
python combine_similar_answer.py  # test
```
Before Cleaning
[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
Optimize the slow queries by analyzing their execution plans, identifying any missing indexes, and tuning the queries accordingly
Polisih the slow queries by checking their execution plans, identifying any missing indexes, and tuning the queries accordingly
Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits
Investigate and address any system or infrastructure issues that may be contributing to the high load averages
Consider scaling up or adding more resources to the system if the current CPU usage is consistently high and impacting performance
Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost
Further investigate and optimize the slow queries for better memory utilization
Monitor and adjust the memory allocation parameters like 'work_mem' and 'shared_buffers' to optimize memory usage
Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling
After Cleaning
[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
Polisih the slow queries by checking their execution plans, identifying any missing indexes, and tuning the queries accordingly.
Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost.
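A hedged sketch of the idea behind combine_similar_answer.py: drop a suggestion when it is nearly identical to one already kept. The similarity threshold, and which of two near-duplicates survives, are assumptions rather than the script's actual behavior.

```python
# Hypothetical near-duplicate filter using difflib's similarity ratio.
from difflib import SequenceMatcher

def combine_similar(lines: list[str], threshold: float = 0.8) -> list[str]:
    kept: list[str] = []
    for line in lines:
        # Keep a line only if it is sufficiently different from all kept lines.
        if all(SequenceMatcher(None, line, k).ratio() < threshold for k in kept):
            kept.append(line)
    return kept
```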

Cases

MISSING_INDEXES → INDEX_SELECTION

Note that optimize_index_selection takes minutes to explore beneficial index combinations.

[Chief DBA]: The anomaly started at 1691897340 and ended at 1691897430.

[CPU Agent]: The anomaly is caused by high CPU usage due to the high number of running processes and the relatively high load averages, especially in the past 1 minute. In addition, slow queries also contribute to the high CPU usage.
1. Optimize the slow queries by analyzing their execution plans, identifying any missing indexes, and tuning the queries accordingly.
2. Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
3. Investigate and address any system or infrastructure issues that may be contributing to the high load averages.
4. Consider scaling up or adding more resources to the system if the current CPU usage is consistently high and impacting performance.
5. Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost from 46856716.92000003 to 31393898.919999983.
----------------------------
(matched knowledge) The high CPU usage may be due to the high number of running processes and the relatively high load averages. Slow queries can also contribute to the high CPU usage. The missing_index function checks for the presence of a required index using a workload-index-recommend interface.

[Memory Agent]: The potential causes of high memory usage could be complex queries, insufficient memory allocation, and possibility of disk spill. The slow queries mentioned involve complex operations such as filtering, grouping, and joining multiple tables, which can consume a significant amount of memory, especially if they involve large data sets. The abnormal memory metric values suggest that the available memory might be insufficient to handle the workload. Adjusting parameters like 'work_mem' and 'shared_buffers' might help alleviate memory issues. The complexity and size of the slow queries could potentially lead to disk spill during execution, which can significantly impact performance.
1. Further investigate and optimize the slow queries mentioned above for better memory utilization.
2. Monitor and adjust the memory allocation parameters like 'work_mem' and 'shared_buffers' to optimize memory usage.
3. Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling.
----------------------------
(matched knowledge) Disk spill is a function that checks whether there is a possibility of disk spill during the execution of SQL. If the plan parse information is not available, it checks whether the sort spill count or hash spill count exceeds the sort rate threshold. If the plan parse information is available, it calculates the total cost of the plan and checks whether the cost rate of the sort or hash operators exceeds the cost rate threshold. If abnormal operator details are found and the sort or hash spill count is greater than 0, it indicates that the SORT/HASH operation may spill to disk. The suggestion is to analyze whether the business needs to adjust parameters like work_mem and shared_buffers. If disk spill is detected, it is a root cause, otherwise it is not a root cause.

[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to the high number of running processes and high load averages, slow queries, and high memory usage due to complex queries, insufficient memory allocation, and the possibility of disk spill.
----------------------------
(solution) 1. Monitor and manage the number of running processes on the node, ensuring that it remains within acceptable limits.
2. Add the recommended indexes: ['movie_info#movie_id', 'movie_companies#movie_id', 'keyword#keyword', 'aka_name#person_id'] to reduce cost.
3. Analyze the query plans and consider adjustments to parameters like 'work_mem' and 'shared_buffers' to help reduce or eliminate disk spilling.
4. Investigate and address any system or infrastructure issues that may be contributing to the high load averages.

INSUFFICIENT_MEMORY_ALLOCATION

[Chief DBA]: The anomaly started at 1691897340 and ended at 1691897430.

[CPU Agent]: Based on the CPU relevant metric values, the root cause of high CPU usage could be a high load value, too many active processes, blocked processes, high CPU scrape collector duration, and low entropy.
I suggest reviewing system logs and identifying any recent changes or updates. We should also analyze other relevant metrics such as memory usage, disk I/O, network traffic, and check for any abnormal behavior from specific processes or applications.
----------------------------
(matched knowledge) missing_index: This function checks for the presence of a required index using a workload-index-recommend interface. If the recommended index information is available, it indicates that a required index is missing and provides a suggestion for the recommended index. If the information is not available, it is not a root cause for the issue.

[Memory Agent]: The root cause of the high memory usage could be disk spill and insufficient memory allocation.
I suggest analyzing the actual resource usage during query execution, performing a thorough analysis of the query plans, and evaluating the system configuration settings related to memory allocation, such as 'work_mem' and 'shared_buffers.' More information about the sort spill count or hash spill count and the threshold values is needed to confirm if disk spill is the root cause. If the free memory is consistently low and close to the total memory capacity, consider increasing the memory allocation for the system.
----------------------------
(matched knowledge) disk_spill: This function checks whether there is a possibility of disk spill during the execution of SQL. If the plan parse information is not available, it checks whether the sort spill count or hash spill count exceeds the sort rate threshold. If the plan parse information is available, it calculates the total cost of the plan and checks whether the cost rate of the sort or hash operators exceeds the cost rate threshold. If abnormal operator details are found and the sort or hash spill count is greater than 0, it indicates that the SORT/HASH operation may spill.

[Chief DBA]: The identified root causes of the anomaly are high CPU usage due to high load value, too many active processes, blocked processes, high CPU scrape collector duration, and low entropy. The high memory usage could be due to disk spill and insufficient memory allocation.
----------------------------
(solution) To resolve the high CPU usage, we should review system logs and identify any recent changes or updates. We should also analyze other relevant metrics such as memory usage, disk I/O, network traffic, and check for any abnormal behavior from specific processes or applications.
To mitigate the high memory usage, we should analyze the actual resource usage during query execution, perform a thorough analysis of the query plans, and evaluate the system configuration settings related to memory allocation, such as 'work_mem' and 'shared_buffers.' More information about the sort spill count or hash spill count and the threshold values is needed to confirm if disk spill is the root cause. If the free memory is consistently low and close to the total memory capacity, consider increasing the memory allocation for the system.

POOR_JOIN_PERFORMANCE

case_poor_join5.mp4

FAQ

🤨 The '.sh' script cannot be executed on Windows systems. Switch the shell to *git bash* or use *git bash* to execute the '.sh' script.

🤨 "No module named 'xxx'" on Windows systems. This error is caused by issues with the Python runtime environment path. Perform the following steps:

Step 1: Check Environment Variables.

You must add Python's "Scripts" directory to the PATH environment variable.

Step 2: Check IDE Settings.

For VS Code, install the Python extension. For PyCharm, specify the Python interpreter for the current project.

Todo

  • Project cleaning
  • Support more anomalies
  • Add more communication mechanisms
  • Publish generated anomaly training data
  • Fine-tune localized private models
  • Integrate preparation components into the demo website
  • Support other databases like MySQL
  • Collect more knowledge and store it in a vector database (./knowledge_vector_db)

Community

Relevant Projects

https://github.com/OpenBMB/AgentVerse

https://github.com/OpenBMB/BMTools

Citation

Feel free to cite us if you like this project.

```bibtex
@misc{zhou2023llm4diag,
      title={LLM As DBA},
      author={Xuanhe Zhou and Guoliang Li and Zhiyuan Liu},
      year={2023},
      eprint={2308.05481},
      archivePrefix={arXiv},
      primaryClass={cs.DB}
}

@misc{zhou2023dbgpt,
      title={DB-GPT: Large Language Model Meets Database},
      author={Xuanhe Zhou and Zhaoyan Sun and Guoliang Li},
      year={2023},
      journal={Data Science and Engineering}
}
```

Contributors

Other Collaborators: Wei Zhou, Kunyi Li.

We thank all the contributors to this project. Do not hesitate if you would like to get involved or contribute!
