Reformat and improve RAG module and agents #184

Open
wants to merge 93 commits into
base: main

Collaborator

@ZiTao-Li ZiTao-Li commented Apr 28, 2024

Description

Updates

Changes on code structure

  • migrate and reformat RAG/knowledge module(s) and RAG agent(s) from examples to a module in src
  • add llama-index as rag_requires in setup.py

Changes on the RAG agent module

  • be compatible with the new KnowledgeBank feature
  • the configurations for the RAG-related functionalities are relocated back to knowledge modules
  • the retrieve method merges the retrievers from the KnowledgeBank members

Changes on the RAG/knowledge module

  • Rename the RAG modules to Knowledge (e.g., LlamaIndexRAG -> LlamaIndexKnowledge)
  • store and persist processed embeddings/indices/documents
  • support loading multiple doc types and dirs for one index
  • support docs management in the obtained (persisted) index
  • add a refresh function to update the index when needed
  • enable agents to reset or add new retrievers

Improving utility of knowledge module

  • reformat the knowledge module config to be easier to use: the new format only configures the KnowledgeBank
  • introduce KnowledgeBank:
    • KnowledgeBank provides an easier way to initialize a knowledge object: call add_data_as_knowledge with knowledge_id (a string identifying this knowledge object), emb_model_name (the name of the embedding model config), and data_dirs_and_types (a dictionary mapping data directories to the wanted file extensions), as shown in rag_example.py:
      knowledge_bank.add_data_as_knowledge(
          knowledge_id="agentscope_tutorial_rag",
          emb_model_name="qwen_emb_config",
          data_dirs_and_types={
              "../../docs/sphinx_doc/en/source/tutorial": [".md"],
          },
      )

    • Knowledge objects in KnowledgeBank can be shared or duplicated across multiple agents, which avoids embedding the same documents more than once.
    • RAG agents can load multiple Knowledge objects (selected by the "knowledge_id" in knowledge_config.json) with their associated retrievers to perform multi-source information retrieval. To do so, pass the agent to the KnowledgeBank.equip function.
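
A minimal sketch of the sharing idea described in this list, not the actual AgentScope implementation. `ToyKnowledgeBank`, `ToyKnowledge`, and `ToyAgent` are hypothetical stand-ins. Only the method names add_data_as_knowledge and equip and the three argument names come from this description; the explicit `knowledge_ids` argument of the toy `equip` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyKnowledge:
    """Hypothetical stand-in for a Knowledge object (not the real class)."""

    knowledge_id: str
    emb_model_name: str
    data_dirs_and_types: Dict[str, List[str]]


@dataclass
class ToyAgent:
    """Hypothetical agent that only holds references to knowledge objects."""

    name: str
    knowledge_list: List[ToyKnowledge] = field(default_factory=list)


class ToyKnowledgeBank:
    """Hypothetical registry of knowledge objects keyed by knowledge_id."""

    def __init__(self) -> None:
        self._stored: Dict[str, ToyKnowledge] = {}

    def add_data_as_knowledge(
        self,
        knowledge_id: str,
        emb_model_name: str,
        data_dirs_and_types: Dict[str, List[str]],
    ) -> None:
        # The ID can be any string, but it must be unique within the bank.
        if knowledge_id in self._stored:
            raise ValueError(f"knowledge_id {knowledge_id!r} already exists")
        self._stored[knowledge_id] = ToyKnowledge(
            knowledge_id, emb_model_name, data_dirs_and_types
        )

    def equip(self, agent: ToyAgent, knowledge_ids: List[str]) -> None:
        # Attach the stored objects themselves, so nothing is embedded twice.
        for knowledge_id in knowledge_ids:
            agent.knowledge_list.append(self._stored[knowledge_id])


bank = ToyKnowledgeBank()
bank.add_data_as_knowledge(
    knowledge_id="agentscope_tutorial_rag",
    emb_model_name="qwen_emb_config",
    data_dirs_and_types={"../../docs/sphinx_doc/en/source/tutorial": [".md"]},
)

tutor = ToyAgent("tutor")
helper = ToyAgent("helper")
bank.equip(tutor, ["agentscope_tutorial_rag"])
bank.equip(helper, ["agentscope_tutorial_rag"])
```

In this toy version both agents end up referencing the same object, which is the property that lets one set of embeddings serve several agents.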

Tutorial

Both English and Chinese tutorials are added as 209-rag.md.


Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

Collaborator

@garyzhang99 garyzhang99 left a comment

Please see the inline comments.

docs/sphinx_doc/zh_CN/source/tutorial/209-rag.md (outdated, resolved)
src/agentscope/rag/llama_index_knowledge.py (outdated, resolved)
src/agentscope/rag/llama_index_knowledge.py (outdated, resolved)
Collaborator

@pan-x-c pan-x-c left a comment

Please see the inline comments.

The current version of RAG is not compatible with distributed mode. We can add support in future PRs.

src/agentscope/rag/knowledge.py (outdated, resolved)
src/agentscope/agents/rag_agents.py (outdated, resolved)
src/agentscope/agents/rag_agents.py (outdated, resolved)
src/agentscope/rag/knowledge_bank.py (outdated, resolved)
ZiTao-Li and others added 4 commits May 22, 2024 19:05
…s used in previous versions, but no longer needed.
# Conflicts:
#	examples/conversation_with_RAG_agents/README.md
Collaborator

@DavdGao DavdGao left a comment

please see inline comments

* `KnowledgeBank.add_data_as_knowledge`: creates a Knowledge module. The simplest way only requires knowledge_id, emb_model_name, and data_dirs_and_types.
```python
knowledge_bank.add_data_as_knowledge(
    knowledge_id="agentscope_tutorial_rag",
```
Collaborator

It's more like a knowledge name rather than ID here

Collaborator

ID (identity) could be either a name or number, as long as it is unique in the set-up.

examples/conversation_with_RAG_agents/README.md (outdated, resolved)
examples/conversation_with_RAG_agents/README.md (outdated, resolved)
docs/sphinx_doc/zh_CN/source/tutorial/209-rag.md (outdated, resolved)
```json
[
  {
    "knowledge_id": "{your_knowledge_id}",
```
Collaborator

If we use a config to set up the RAG module, should we consider adding a config file that explains the usage of each parameter? For example, like this file in FederatedScope:
https://github.com/alibaba/FederatedScope/blob/master/federatedscope/core/configs/config.py#L258
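
One possible way to act on this suggestion is to keep each parameter's explanation next to its definition. The sketch below is hypothetical: `KnowledgeConfig` and `describe` are not part of AgentScope, the `help` metadata key is an assumed convention, and only the field names knowledge_id, emb_model_name, and data_dirs_and_types appear in this PR.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class KnowledgeConfig:
    """Hypothetical self-documenting config for one knowledge object."""

    knowledge_id: str = field(
        metadata={"help": "Unique identifier (a name or number) of the knowledge object."}
    )
    emb_model_name: str = field(
        metadata={"help": "Name of the embedding model config to use."}
    )
    data_dirs_and_types: Dict[str, List[str]] = field(
        default_factory=dict,
        metadata={"help": "Maps each data directory to the file extensions to load."},
    )


def describe(config_cls) -> str:
    """Render one 'name: help' line per config parameter."""
    return "\n".join(f"{f.name}: {f.metadata['help']}" for f in fields(config_cls))


print(describe(KnowledgeConfig))
```

Keeping the help text in the field metadata means the documentation and the accepted parameters cannot drift apart silently.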

"description": "Code-Search-Assistant is an agent that can provide answer based on AgentScope code base. It can answer questions about specific modules in AgentScope.",
"sys_prompt": "You're a coding assistant of AgentScope. The answer starts with appreciation for the question, then provide details regarding the functionality and features of the modules mentioned in the question. The language should be in a professional and simple style. The answer is limited to be less than 100 words.",
"model_config_name": "qwen_config",
"rag_config": {
Collaborator

Can we do this pre-processing outside the agent? For example (taking get as example):

knowledge_bank = KnowledgeBank(...)

knowledges = knowledge_bank.get(knowledge_ids=["kb1", "kb2"], similarity_top_k=5, log_retrieval=5, recent_n_mem=1)

AgentClass(name="assistant", knowledges=knowledges, ...)

Alternatively, users can set up their own knowledge within the agent object's constructor.

There are two advantages:

  1. No need to know what parameters should be written in a rag config. All parameters are in the declaration of this get() function, which can be accessed easily.
  2. The agent is not required to have a rag config attribute.

Collaborator Author

Supported after the update. The knowledge can now be obtained with the get_knowledge function, and a list of knowledge objects can be assigned to an agent at initialization.

In this update, agents are changed to call the knowledge.retrieve function directly (the retriever is removed). The retriever is built inside knowledge.retrieve each time it is called, using the parameters provided.
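
The behavior described here, building the retriever inside retrieve on each call, can be sketched roughly as follows. This is a toy illustration rather than the PR's code: `ToyKnowledge`, `retrieve_from_all`, and the word-overlap scoring are assumptions, and the `similarity_top_k` parameter name is borrowed from the reviewer's suggestion earlier in this thread. The real module relies on LlamaIndex.

```python
from typing import Callable, List


class ToyKnowledge:
    """Hypothetical knowledge object holding plain-text documents."""

    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def _build_retriever(self, similarity_top_k: int) -> Callable[[str], List[str]]:
        # A fresh retriever is built from the parameters of this call,
        # so no retriever state needs to live on the agent.
        def retriever(query: str) -> List[str]:
            query_words = set(query.lower().split())
            # Rank by word overlap (a stand-in for embedding similarity).
            ranked = sorted(
                self.docs,
                key=lambda doc: len(query_words & set(doc.lower().split())),
                reverse=True,
            )
            return ranked[:similarity_top_k]

        return retriever

    def retrieve(self, query: str, similarity_top_k: int = 2) -> List[str]:
        return self._build_retriever(similarity_top_k)(query)


def retrieve_from_all(
    knowledge_list: List[ToyKnowledge], query: str, similarity_top_k: int = 2
) -> List[str]:
    """Agent-side helper: query every knowledge object and merge the results."""
    merged: List[str] = []
    for knowledge in knowledge_list:
        merged.extend(knowledge.retrieve(query, similarity_top_k))
    return merged


tutorial = ToyKnowledge(["how to build an agent", "how to configure a model"])
code = ToyKnowledge(["class AgentBase defines an agent", "utility helpers"])
results = retrieve_from_all([tutorial, code], "build an agent", similarity_top_k=1)
```

Because the parameters travel with each call, two agents sharing one knowledge object can use different `similarity_top_k` values without interfering with each other.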

while True:
# The workflow is the following:
# 1. user input a message,
# 2. if it mentions one of the agents, then the agent will be called
Collaborator

Should we tell the user that the word "mention" refers to the @ operation here?
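
If "mention" does mean the @ operation, the routing step in the workflow comment could look roughly like the sketch below. It is only an illustration under assumptions: `find_mentioned_agents` and the agent name "Tutorial-Assistant" are hypothetical, and the example's actual matching logic may differ.

```python
import re
from typing import List


def find_mentioned_agents(message: str, agent_names: List[str]) -> List[str]:
    """Return the known agent names that appear as @name in the message."""
    known = set(agent_names)
    # Names may contain letters, digits, underscores, and hyphens.
    candidates = re.findall(r"@([\w-]+)", message)
    mentioned: List[str] = []
    for name in candidates:
        # Keep first-mention order and skip unknown or repeated names.
        if name in known and name not in mentioned:
            mentioned.append(name)
    return mentioned


agents = ["Tutorial-Assistant", "Code-Search-Assistant"]
print(find_mentioned_agents("@Code-Search-Assistant what does KnowledgeBank do?", agents))
```

A message without any @name yields an empty list, which the surrounding loop could treat as "no agent was mentioned".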

src/agentscope/rag/knowledge_bank.py (resolved)
Set the transformations as needed, or just use the default setting.

Args:
config (dict): a dictionary containing configurations
Collaborator

Not an issue, since the function won't be exposed to users, but it would be better if the config arg were specified in more detail (e.g., is the store_and_index field required?).

Collaborator

@DavdGao DavdGao left a comment

plz see inline comments

```python
knowledge_bank.add_data_as_knowledge(
    knowledge_id="agentscope_tutorial_rag",
    emb_model_name="qwen_emb_config",
```
Collaborator

config_name or model_name here?

Collaborator

It is emb_model_name

name: str,
sys_prompt: str,
model_config_name: str,
memory_config: Optional[dict] = None,
Collaborator

Consider removing memory_config since we never use it.


6 participants