In the previous article, we started building an LLM-powered analyst. We decided to focus on descriptive analytics and reporting tasks since they are the most common for analysts: most analysts start their careers with such tasks, and most companies start building their analytical function with reporting and BI tools.

Our first prototype can use ready-made tools to answer questions related to the defined metrics, like in the example below.

Illustration by author

The next step is to teach our LLM-powered analyst to retrieve any metric. Analysts usually use SQL to get data, so the most helpful skill for the LLM analyst would be interacting with SQL databases.

We've already discussed OpenAI functions and learned how LLMs can use tools to integrate with the world. In this article, I would like to focus on LLM agents and discuss them in more detail. We will learn how to build agents using LangChain and try different agent types.

Setting up a database

First, let's set up a database we will be interacting with. My choice is ClickHouse. ClickHouse is an open-source column-oriented SQL database management system for online analytical processing (OLAP). It's a good option for big data and analytical tasks.

If you want to learn more about ClickHouse, please check my article. However, you can use any database; you will just need to tweak the code for the functions that get data.

Installing ClickHouse takes just one line of code: the initial command executes a script provided by the ClickHouse team that downloads the proper binary for your platform. Then, you need to launch the server, and that's it.

curl https://clickhouse.com/ | sh # downloads appropriate binary file
./clickhouse server # starts clickhouse server

You can access ClickHouse via the HTTP API. By default, it listens on port 8123.

import requests

CH_HOST = 'http://localhost:8123' # default address

def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
    # send the query to ClickHouse via the HTTP API and return the raw text response
    r = requests.post(host, params = {'query': query}, 
      timeout = connection_timeout)
    
    return r.text

I usually check that r.status_code == 200 to ensure the request has completed successfully and raise an error otherwise. However, here we will pass the results of this function to the LLM, so getting whatever output the DB returns is okay, regardless of whether it's an error or not. The LLM will be able to handle it properly.
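The tool definitions below also rely on a get_clickhouse_df helper that returns a pandas DataFrame instead of raw text. It's not shown in the snippet above, so here's a minimal sketch, assuming every query it receives ends with format TabSeparatedWithNames:

import io
import pandas as pd

def get_clickhouse_df(query, host = CH_HOST, connection_timeout = 1500):
    # reuse the raw HTTP helper and parse the TSV response into a DataFrame
    data = get_clickhouse_data(query, host, connection_timeout)
    return pd.read_csv(io.StringIO(data), sep = '\t')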

I've generated synthetic data for this example. If you would like to learn more about data simulation, you can find the code here. I used retention curves to model sessions for customers, taking into account the number of days since account creation and weekly seasonality. It might be a bit of an overcomplicated approach for now, since we don't use the data much, but I hope that in future prototypes, we will be able to get some insights from this data using LLMs.

We need just a couple of tables representing a data model for a basic e-commerce product. We will work with the list of users (ecommerce.users) and their sessions (ecommerce.sessions).

Let's look at the ecommerce.sessions table.

Screenshot by author

And here, you can see what features we have for users.

Screenshot by author

Now, we have data to work with and are ready to move on and discuss LLM agents in more detail.

Agents overview

The core idea of LLM agents is to use an LLM as a reasoning engine that defines the set of actions to take. In the classic approach, we hardcode a sequence of actions, but with agents, we give the model tools and a task and let it decide how to achieve it.

One of the most foundational papers regarding LLM agents is "ReAct: Synergizing Reasoning and Acting in Language Models" by Shunyu Yao et al. The ReAct (Reasoning + Acting) approach suggests combining:

  • reasoning that helps to create the plan and update it in case of exceptions,
  • actions that allow the model to leverage external tools or gather data from external sources.

Such an approach shows much better performance on different tasks. One of the examples from the paper is below.

Example from the paper by Yao et al.

Actually, that's how human intelligence works: we combine inner-voice reasoning with task-oriented actions. Suppose you need to cook dinner. You will use reasoning to define a plan ("guests will arrive in 30 minutes, I only have time to cook pasta"), adjust it ("Ben has become a vegan, I should order something for him"), or decide to delegate, which is the equivalent of using external tools ("there's no pasta left, I need to ask my partner to buy it"). At the same time, you will use actions to operate tools (ask a partner for help or use a mixer) or to get information (look up on the internet how long pasta needs to be cooked to be al dente). So, it's reasonable to use a similar approach with LLMs since it works for humans (who are, no doubt, AGI).

Quite a lot of different approaches to LLM agents have emerged since ReAct. They differ in the prompts used to set up the model's reasoning, the way we define tools, the output format, the handling of memory about intermediate steps, and so on.

Some of the most popular approaches are:

  • OpenAI functions,
  • AutoGPT,
  • BabyAGI,
  • Plan-and-execute agent.

We will use these approaches later on for our task and see how they work and what the differences are.

Building an Agent from Scratch

Let's start building an agent. We will do it from scratch to understand how everything works under the hood. Then, we will use LangChain's tools for faster prototyping, which is handy if you don't need any customisation.

The core components of LLM agents are:

  • Prompt to guide the model's reasoning.
  • Tools that the model can use.
  • Memory — a mechanism to pass previous iterations to the model.

For the first version of the LLM agent, we will use OpenAI functions as a framework to build an agent.

Defining tools

Let's start by defining the tools for our agent. Let's think about what information our LLM-powered analyst might need to be able to answer questions:

  • List of tables — we can put it in the system prompt so that the model has some view on what data we have and doesn't need to execute a tool for it every time,
  • List of columns for the table so that the model can understand the data schema,
  • Top values for the column in the table so that the model can look up values for filters,
  • Results of SQL query execution to be able to get actual data.

To define a tool in LangChain, we need to use the @tool decorator on a function. We will use Pydantic to specify the argument schema for each function so that the model knows what to pass to it.

We've discussed tools and OpenAI functions in detail in the previous article. So don't hesitate to read it if you need to revise this topic.

The code below defines three tools: execute_sql, get_table_columns and get_table_column_distr.

from langchain.agents import tool
from pydantic import BaseModel, Field
from typing import Optional

class SQLQuery(BaseModel):
    query: str = Field(description="SQL query to execute")

@tool(args_schema = SQLQuery)
def execute_sql(query: str) -> str:
    """Returns the result of SQL query execution"""
    return get_clickhouse_data(query)

class SQLTable(BaseModel):
    database: str = Field(description="Database name")
    table: str = Field(description="Table name")

@tool(args_schema = SQLTable)
def get_table_columns(database: str, table: str) -> str:
    """Returns list of table column names and types in JSON"""
    
    q = '''
    select name, type
    from system.columns 
    where database = '{database}'
        and table = '{table}'
    format TabSeparatedWithNames
    '''.format(database = database, table = table)
    
    return str(get_clickhouse_df(q).to_dict('records'))

class SQLTableColumn(BaseModel):
    database: str = Field(description="Database name")
    table: str = Field(description="Table name")
    column: str = Field(description="Column name")
    n: Optional[int] = Field(description="Number of rows, default limit 10")

@tool(args_schema = SQLTableColumn)
def get_table_column_distr(database: str, table: str, column: str, n: int = 10) -> str:
    """Returns top n values for the column in JSON"""

    q = '''
    select {column}, count(1) as count
    from {database}.{table} 
    group by 1
    order by 2 desc 
    limit {n}
    format TabSeparatedWithNames
    '''.format(database = database, table = table, column = column, n = n)
    
    return str(list(get_clickhouse_df(q)[column].values))

It's worth noting that the code above uses Pydantic v1. In June 2023, Pydantic released v2, which is incompatible with v1, so check your version if you see validation errors. You can find more details on Pydantic compatibility in the LangChain documentation.
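If you already have Pydantic v2 installed, one workaround (assuming Pydantic ≥ 2.0, which ships a compatibility layer for the old API) is to import the v1 models from their dedicated namespace:

# Pydantic v2 exposes the old API under the pydantic.v1 namespace
from pydantic.v1 import BaseModel, Field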

We will be working with OpenAI functions, so we need to convert our tools into that format. Also, I saved our toolkit in a dictionary, which will come in handy when executing tools to get observations.

from langchain.tools.render import format_tool_to_openai_function

# converting tools into OpenAI functions
sql_functions = list(map(format_tool_to_openai_function, 
    [execute_sql, get_table_columns, get_table_column_distr]))

# saving tools into a dictionary for the future
sql_tools = {
    'execute_sql': execute_sql,
    'get_table_columns': get_table_columns,
    'get_table_column_distr': get_table_column_distr
}

Defining a chain

We've created tools for the model. Now, we need to define the agent chain. We will use the latest GPT-4 Turbo, which has also been fine-tuned for function calling. Let's initialise a chat model.

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview')\
  .bind(functions = sql_functions)

The next step is to define a prompt consisting of a system message and a user question. We also need a MessagesPlaceholder to set up a place for the list of observations the model will be working with.

from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

system_message = '''
You are working as a product analyst for the e-commerce company. 
Your work is very important, since your product team makes decisions based on the data you provide. So, you are extremely accurate with the numbers you provide.
If you're not sure about the details of the request, you don't provide the answer and ask follow-up questions to have a clear understanding.
You are very helpful and try your best to answer the questions.

All the data is stored in SQL Database. Here is the list of tables (in the format <database>.<table>) with descriptions:
- ecommerce.users - information about the customers, one row - one customer
- ecommerce.sessions - information about the sessions customers made on our web site, one row - one session
'''

analyst_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_message),
        ("user", "{question}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

As we discussed, I've added the list of tables in the database to the prompt so that the model has at least some knowledge about our data.

We have all the building blocks and are ready to set up the agent chain. The input parameters are a user message and the intermediate steps (previous function calls and observations). We pass them to the prompt, using format_to_openai_function_messages to convert the intermediate steps into the expected message format. Then, we pass everything to the LLM and, in the end, use the OpenAIFunctionsAgentOutputParser output parser for convenience.


from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser

analyst_agent = (
    {
        "question": lambda x: x["question"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(x["intermediate_steps"]),
    }
    | analyst_prompt
    | llm
    | OpenAIFunctionsAgentOutputParser()
)

We've defined our primary agent chain. Let's try to invoke it. I've passed an empty list since we have no intermediate steps at the beginning.

analyst_agent.invoke({"question": "How many active customers from the United Kingdom do we have?", 
    "intermediate_steps": []})

# AgentActionMessageLog(
#    tool='execute_sql', 
#    tool_input={'query': "SELECT COUNT(DISTINCT user_id) AS active_customers_uk FROM ecommerce.sessions WHERE country = 'United Kingdom' AND active = TRUE"}, 
#    log='\nInvoking: `execute_sql` with `{\'query\': "SELECT COUNT(DISTINCT user_id) AS active_customers_uk FROM ecommerce.sessions WHERE country = \'United Kingdom\' AND active = TRUE"}`\n\n\n', 
#    message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"query":"SELECT COUNT(DISTINCT user_id) AS active_customers_uk FROM ecommerce.sessions WHERE country = \'United Kingdom\' AND active = TRUE"}', 'name': 'execute_sql'}})]
# )

We got an AgentActionMessageLog object, which means the model wants to call the execute_sql function. When the model is ready to return the final answer to the user, it returns an AgentFinish object instead.

If we look at the tool_input, we can see that the model wants to execute the following query.

SELECT COUNT(DISTINCT user_id) AS active_customers_uk 
FROM ecommerce.sessions 
WHERE country = 'United Kingdom' AND active = TRUE

The query looks pretty good but uses the wrong column name: active instead of is_active. It will be interesting to see whether the LLM can recover from this error and return the result.

We could execute each step manually, but it's more convenient to automate this loop:

  • If an AgentActionMessageLog object is returned, we need to call the corresponding tool, add the observation to the agent_scratchpad, and invoke the chain one more time.
  • If we get an AgentFinish object, we can terminate execution since we have the final answer.

I will also add a break after ten iterations to avoid potential endless loops.

from langchain_core.agents import AgentFinish

# setting initial parameters
question = "How many active customers from the United Kingdom do we have?"
intermediate_steps = []
num_iters = 0

while True:
    # breaking if there were more than 10 iterations
    if num_iters >= 10:  
        break

    # invoking the agent chain
    output = analyst_agent.invoke(
        {
            "question": question,
            "intermediate_steps": intermediate_steps,
        }
    )
    num_iters += 1
    
    # returning the final result if we got the AgentFinish object
    if isinstance(output, AgentFinish):
        model_output = output.return_values["output"]
        break
    # calling tool and adding observation to the scratchpad otherwise
    else:
        print(f'Executing tool: {output.tool}, arguments: {output.tool_input}')
        observation = sql_tools[output.tool](output.tool_input)
        print(f'Observation: {observation}')
        print()
        intermediate_steps.append((output, observation))
        
print('Model output:', model_output)

I added some logging of the tools' usage to the output to see how the execution is going. Also, you can always use LangChain debug mode to see all the calls.
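Enabling debug mode is a one-liner:

import langchain

langchain.debug = True # prints all prompts, calls and outputs of the chains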

As a result of the execution, we got the following output.

Executing tool: execute_sql, arguments: {'query': "SELECT COUNT(*) AS active_customers_uk FROM ecommerce.users WHERE country = 'United Kingdom' AND active = TRUE"}
Observation: Code: 47. DB::Exception: Missing columns: 'active' 
while processing query: 'SELECT count() AS active_customers_uk 
FROM ecommerce.users WHERE (country = 'United Kingdom') AND (active = true)', 
required columns: 'country' 'active', maybe you meant: 'country'. 
(UNKNOWN_IDENTIFIER) (version 23.12.1.414 (official build))

Executing tool: get_table_columns, arguments: {'database': 'ecommerce', 'table': 'users'}
Observation: [{'name': 'user_id', 'type': 'UInt64'}, {'name': 'country', 'type': 'String'}, 
{'name': 'is_active', 'type': 'UInt8'}, {'name': 'age', 'type': 'UInt64'}]

Executing tool: execute_sql, arguments: {'query': "SELECT COUNT(*) AS active_customers_uk FROM ecommerce.users WHERE country = 'United Kingdom' AND is_active = 1"}
Observation: 111469

Model output: We have 111,469 active customers from the United Kingdom.

Note: there's no guarantee that the agent won't execute DML operations on your database. So, if you're using it in a production environment, ensure that the LLM either doesn't have permissions to change data or that your tool implementation doesn't allow it.
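For example, here's a minimal sketch of such a guard: a hypothetical read-only wrapper around our SQL tool that rejects everything except SELECT statements (the keyword check is purely illustrative, not a complete defence):

@tool(args_schema = SQLQuery)
def execute_sql_readonly(query: str) -> str:
    """Returns the result of SQL query execution (SELECT statements only)"""
    normalized = query.strip().lower()
    forbidden = ['insert', 'update', 'delete', 'drop', 'alter', 'truncate']
    # allow only queries that start with SELECT and contain no DML/DDL keywords
    if not normalized.startswith('select') or any(kw in normalized for kw in forbidden):
        return 'Error: only read-only SELECT queries are allowed.'
    return get_clickhouse_data(query)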

So, the model tried to execute SQL but got an error that there was no column active. Then, it decided to see the table schema, corrected the query accordingly, and got the result.

It's a pretty decent performance; I behave the same way myself. I usually try recalling or guessing column names first and check the documentation only if the first attempt fails.

However, in most cases, we don't need to write the execution loop ourselves. We can use the LangChain AgentExecutor class for it. Check the documentation to learn about all the possible parameters for this class.

You need to write your own executor only if you want to customise something, for example, add conditions to terminate the execution or custom logic for using the tools.

You can find the same code using the AgentExecutor below.

from langchain.agents import AgentExecutor

analyst_agent_executor = AgentExecutor(
    agent=analyst_agent, 
    tools=[execute_sql, get_table_columns, get_table_column_distr], 
    verbose=True,
    max_iterations=10, # early stopping criteria
    early_stopping_method='generate', 
    # to ask model to generate the final answer after stopping
)

analyst_agent_executor.invoke(
  {"question": "How many active customers from the United Kingdom do we have?"}
)

As a result, we got an easy-to-trace output with the same result. You can note that LangChain's formatting for the agent's output is very convenient.

Image by author

We've built the LLM agent from scratch, so now we understand how it works and know how to customise it. However, LangChain provides a high-level function, initialize_agent, that does all of it in just one call. You can find all the details in the documentation.

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.schema import SystemMessage

agent_kwargs = {
    "system_message": SystemMessage(content=system_message)
}

analyst_agent_openai = initialize_agent(
    llm = ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview'),
    agent = AgentType.OPENAI_FUNCTIONS, 
    tools = [execute_sql, get_table_columns, get_table_column_distr], 
    agent_kwargs = agent_kwargs,
    verbose = True,
    max_iterations = 10,
    early_stopping_method = 'generate'
)

Note that we passed the ChatOpenAI model without functions bound to it. We've passed tools separately, so we don't need to link them to the model.

Different Agent Types

We've built an LLM agent based on OpenAI functions from scratch. However, there are quite a lot of other approaches. So let's try them out as well.

We will look at the ReAct approach (the initial one from the paper we discussed earlier) and several experimental approaches provided by LangChain: Plan-and-execute, BabyAGI and AutoGPT.

ReAct agent

Let's start by looking at ReAct agents. With the current implementation, we can easily change the agent type and try the ReAct approach described in the paper.

The most general ReAct implementation is Zero-shot ReAct. However, it won't work for us because it only supports tools with a single string input. Since our tools require multiple arguments, we need to use Structured Input ReAct.

Here we can leverage the advantage of a modular framework: we only need to change one parameter, agent = AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION, and that's it.

system_message = system_message + '''\nYou have access to the following tools:'''

agent_kwargs = {
    "prefix": system_message
}

analyst_agent_react = initialize_agent(
    llm = ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview'),
    agent = AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION, 
    tools = [execute_sql, get_table_columns, get_table_column_distr], 
    agent_kwargs = agent_kwargs,
    verbose = True,
    max_iterations = 10,
    early_stopping_method = 'generate'
)

You might wonder how to find the arguments you can specify for the agent. Unfortunately, they are not documented, so we need to dive into the source code to understand them. Let's discuss it step by step.

  • We can see that analyst_agent_react is an object of the AgentExecutor class.
  • This class has an agent field. In our case, it's an object of the StructuredChatAgent class. The class depends on the specified agent type.
  • Let's find a StructuredChatAgent class implementation and see how it works. In this case, LangChain creates a prompt consisting of prefix, tools' description, formatted instructions and suffix.
  • You can find the complete list of parameters you can pass as agent_kwargs in the code.
  • So, we can override the default PREFIX value from here and pass it as a prefix in agent_kwargs. Also, if you're interested, you can read through the default ReAct prompt here and think about how to tweak it for your task.

If you are interested, you can see the final prompt using the following call.

for message in analyst_agent_react.agent.llm_chain.prompt.messages:
    print(message.prompt.template)

Let's invoke our agent and see the result.

analyst_agent_react.run("How many active customers from the United Kingdom do we have?")

We can see that the model follows a slightly different reasoning framework. The model starts each iteration by writing down a thought (reasoning), then moves to an action (a function call) and an observation (the result of the function call). Then the iteration repeats. In the end, the model returns action = Final Answer.

> Entering new AgentExecutor chain...
Thought: To answer this question, I need to define what is meant by 
"active customers" and then query the database for users from 
the United Kingdom who meet this criteria. I will first need to know 
the structure of the `ecommerce.users` table to understand what columns 
are available that could help identify active customers and their location.

Action:
```
{
  "action": "get_table_columns",
  "action_input": {
    "database": "ecommerce",
    "table": "users"
  }
}
```


Observation: [{'name': 'user_id', 'type': 'UInt64'}, {'name': 'country', 'type': 'String'}, {'name': 'is_active', 'type': 'UInt8'}, {'name': 'age', 'type': 'UInt64'}]
Thought:The `ecommerce.users` table contains a column named `is_active` 
which likely indicates whether a customer is active or not, and a `country` 
column which can be used to filter users by their location. Since we 
are interested in active customers from the United Kingdom, we can use 
these two columns to query the database.

Action:
```
{
  "action": "execute_sql",
  "action_input": {
    "query": "SELECT COUNT(*) AS active_customers_uk FROM ecommerce.users WHERE country = 'United Kingdom' AND is_active = 1"
  }
}
```
Observation: 111469

Thought:I have the information needed to respond to the user's query.
Action:
```
{
  "action": "Final Answer",
  "action_input": "There are 111,469 active customers from the United Kingdom."
}
```

> Finished chain.
'There are 111,469 active customers from the United Kingdom.'

Even though the model followed a different path (starting with understanding the table schema and then executing SQL), it came to the same result.

Now, let's move on to the experimental approaches. LangChain provides several experimental agent types. They are not advised for production usage yet, but it will be interesting to try them and see how they work.
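The experimental agent types live in a separate langchain_experimental package, so if the imports below fail, you might need to install it first:

pip install langchain-experimental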

Plan-and-execute agent

The code below is based on the example from LangChain's cookbook.

This agent follows a "Plan-and-execute" approach in contrast to the "Action" agents we looked at previously. This approach was inspired by the BabyAGI framework and the paper "Plan-and-Solve Prompting".

The characteristic feature of this approach is that the agent first plans the next steps and then executes them.

There are two components in this approach:

  • Planner — a regular Large Language Model whose primary goal is just to reason and plan,
  • Executor — an Action agent, an LLM equipped with a set of tools it can use to act.

The advantage of this approach is the separation of concerns: one model focuses on planning (reasoning), while the other focuses on execution (action). It's more modular, and potentially, you could use smaller, cheaper models fine-tuned for your specific tasks. However, this approach also generates more LLM calls, so it's more expensive if we're using ChatGPT.

Let's initialise the planner and the executor.

from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_planner

model = ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview')
planner = load_chat_planner(model)
executor = load_agent_executor(model, 
    tools = [execute_sql, get_table_columns, get_table_column_distr], 
    verbose=True)

There's currently no way to specify a custom prompt for the executor since you can't pass one to the function. However, we can hack the prompt and prepend our initial system message, which gives some context about the task, to the default prompt.

Disclaimer: overriding objects' fields is a bad practice because we might bypass some prompt validations. We are doing it now only to experiment with this approach. Such a solution is not suitable for production.

executor.chain.agent.llm_chain.prompt.messages[0].prompt.template = system_message + '\n' + \
    executor.chain.agent.llm_chain.prompt.messages[0].prompt.template

Now, it's time to define an agent and execute the same query we were asking before.

analyst_agent_plan_and_execute = PlanAndExecute(
    planner=planner, 
    executor=executor
)
analyst_agent_plan_and_execute.run("How many active customers from the United Kingdom do we have?")

The call returned an error: RateLimitError: Error code: 429 — {'error': {'message': 'Request too large for gpt-4-1106-preview in organization on tokens_usage_based per min: Limit 150000, Requested 235832.', 'type': 'tokens_usage_based', 'param': None, 'code': 'rate_limit_exceeded'}}. It looks like the model tried to send too many tokens to OpenAI.

Let's try to understand what happened by looking at the model's output (you can find it below):

  • First, the model decided to look at the ecommerce.users and ecommerce.sessions columns to determine the criteria for "active" customers.
  • It realised that it needed to use is_active in the ecommerce.users table. However, the model decided it should also use sessions data to define customer activity.
  • Then, the model went down this rabbit hole, trying to define criteria for recent activity in ecommerce.sessions. It looked at the distributions for action_date, session_duration and revenue.
  • Finally, it defined active customers as those who have had a session within the last 30 days, with a session duration and revenue above certain thresholds, neglecting the fact that it could just use is_active. The model planned to look at the data to define these thresholds.
  • The final error was caused by an attempt to load all user_ids for UK customers, which was far too many tokens for ChatGPT.

Even though there is quite a lot of text, I encourage you to read it to see how the model thinks through the problems.

> Entering new AgentExecutor chain...
Action:
```
{
  "action": "Final Answer",
  "action_input": "The data regarding customers is stored in a SQL Database, specifically in the table named 'ecommerce.users'."
}
```

> Finished chain.


> Entering new AgentExecutor chain...
Thought: To determine the criteria for an "active" customer, I need 
to understand what data is available in the 'ecommerce.users' and 
'ecommerce.sessions' tables. This will help me identify which columns 
could be used to define an "active" customer based on recent purchases, 
account activity, or other engagement metrics.

Action:
```
{
  "action": "get_table_columns",
  "action_input": {
    "database": "ecommerce",
    "table": "users"
  }
}
```


Observation: [{'name': 'user_id', 'type': 'UInt64'}, {'name': 'country', 'type': 'String'}, {'name': 'is_active', 'type': 'UInt8'}, {'name': 'age', 'type': 'UInt64'}]
Thought:The 'ecommerce.users' table contains columns for 'user_id', 'country',
 'is_active', and 'age'. The 'is_active' column seems to directly indicate 
whether a user is considered active or not. However, to understand 
the criteria behind this designation, we need to see if there are 
any engagement metrics in the 'ecommerce.sessions' table that could 
correlate with the 'is_active' status.

Action:
```
{
  "action": "get_table_columns",
  "action_input": {
    "database": "ecommerce",
    "table": "sessions"
  }
}
```

Observation: [{'name': 'user_id', 'type': 'UInt64'}, {'name': 'session_id', 'type': 'UInt64'}, {'name': 'action_date', 'type': 'Date'}, {'name': 'session_duration', 'type': 'UInt64'}, {'name': 'os', 'type': 'String'}, {'name': 'browser', 'type': 'String'}, {'name': 'is_fraud', 'type': 'UInt8'}, {'name': 'revenue', 'type': 'Float32'}]
Thought:The 'ecommerce.sessions' table includes columns such as 'user_id', 
'session_id', 'action_date', 'session_duration', 'os', 'browser', 'is_fraud', 
and 'revenue'. The 'action_date' could be used to determine recent activity, 
and 'session_duration' might indicate engagement level. 'Revenue' could 
be used to identify recent purchases. To define an "active" customer, 
we could consider users with recent 'action_date', longer 'session_duration', 
and positive 'revenue'. However, to set specific criteria, we might need 
to understand the distribution of these values, such as 
the most recent 'action_date' or average 'session_duration'. 
Let's first check the distribution of 'action_date' to see how recent 
the sessions are.

Action:
```
{
  "action": "get_table_column_distr",
  "action_input": {
    "database": "ecommerce",
    "table": "sessions",
    "column": "action_date",
    "n": 10
  }
}
```
Observation: ['2023-12-25', '2023-12-26', '2023-12-27', '2023-12-18', '2023-12-28', '2023-12-21', '2023-12-20', '2023-12-04', '2023-12-19', '2023-12-14']
Thought:The distribution of 'action_date' shows recent dates, with 
the latest being '2023-12-28'. This suggests that the data includes 
very recent sessions. To define an "active" customer, we could consider 
a timeframe such as the last 30 days. However, before finalizing 
this criterion, it would be beneficial to also understand 
the distribution of 'session_duration' and 'revenue' to see if there 
are common thresholds that could further define an "active" customer. 
Let's check the distribution of 'session_duration' next.

Action:
```
{
  "action": "get_table_column_distr",
  "action_input": {
    "database": "ecommerce",
    "table": "sessions",
    "column": "session_duration",
    "n": 10
  }
}
```
Observation: [772, 860, 887, 678, 405, 671, 670, 707, 982, 907]
Thought:The distribution of 'session_duration' shows a range of values, 
with durations varying from 405 to 982 seconds. These values could 
be used to set a threshold for engagement level, but without context, 
it's difficult to determine what would be considered a high or 
low session duration. To get a better understanding, we should look 
at the distribution of 'revenue' to see if there's a common revenue 
threshold that could be used in conjunction with 'action_date' and 
'session_duration' to define an "active" customer. Let's check 
the distribution of 'revenue' next.

Action:
```
{
  "action": "get_table_column_distr",
  "action_input": {
    "database": "ecommerce",
    "table": "sessions",
    "column": "revenue",
    "n": 10
  }
}
```

Observation: [0.0, 1556.9, 7803.2, 4936.5, 619.6, 9947.7, 646.2, 3251.1, 2860.9, 4376.5]
Thought:The distribution of 'revenue' shows a range of values including zero, 
which could indicate sessions without purchases, and values up to 9947.7,
 which suggests significant purchases. To define an "active" customer, 
we could consider those with non-zero revenue within the last 30 days 
and with a session duration above a certain threshold. However, to set 
a precise threshold for 'session_duration' and 'revenue', we would need 
to analyze the data further, possibly calculating averages or percentiles. 
For now, we can propose a preliminary definition of an "active" customer
 as one who has had a session within the last 30 days, with a session duration 
and revenue above certain thresholds to be determined.

Action:
```
{
  "action": "Final Answer",
  "action_input": "Based on the data available in the 'ecommerce.users' and 'ecommerce.sessions' tables, an 'active' customer could preliminarily be defined as one who has had a session within the last 30 days, with a session duration and revenue above certain thresholds. The 'is_active' column in the 'users' table may already reflect this or a similar definition, but further analysis would be required to set specific thresholds for 'session_duration' and 'revenue'. These thresholds could be determined by calculating averages or percentiles based on the data distribution."
}
```

> Finished chain.


> Entering new AgentExecutor chain...
Action:
```
{
  "action": "get_table_columns",
  "action_input": {
    "database": "ecommerce",
    "table": "users"
  }
}
```

Observation: [{'name': 'user_id', 'type': 'UInt64'}, {'name': 'country', 'type': 'String'}, {'name': 'is_active', 'type': 'UInt8'}, {'name': 'age', 'type': 'UInt64'}]
Thought:The 'ecommerce.users' table contains a 'country' column which 
can be used to filter the customer records based on the location 
being the United Kingdom. I will write and execute an SQL query 
to retrieve the user IDs of customers located in the United Kingdom.

Action:
```
{
  "action": "execute_sql",
  "action_input": {
    "query": "SELECT user_id FROM ecommerce.users WHERE country = 'United Kingdom'"
  }
}
```

Observation: 
1000001
1000011
1000021
1000029
1000044
... <many more lines...>

That's an excellent example of a situation where the agent overcomplicates the question and goes into too much detail. Human analysts also make such mistakes from time to time, so it's interesting to see similar patterns in LLM behaviour.

If we reflect on how we could potentially fix this issue, there are a couple of options:

  • First, we could prevent cases where the agent tries to pull too much data from the database by returning an error if the output of the execute_sql function contains more than 1K rows (see the sketch after this list).
  • The other thing I would consider is allowing the LLM to ask follow-up questions and instructing it not to make assumptions.
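Here is a minimal sketch of the first idea, assuming the ClickHouse response is newline-separated rows (the 1K threshold is an arbitrary choice):

MAX_OUTPUT_ROWS = 1000 # arbitrary safety threshold

@tool(args_schema = SQLQuery)
def execute_sql_limited(query: str) -> str:
    """Returns the result of SQL query execution (up to 1K rows)"""
    result = get_clickhouse_data(query)
    # refuse to pass huge outputs to the LLM
    if len(result.strip().split('\n')) > MAX_OUTPUT_ROWS:
        return 'Error: the query returned more than 1K rows. Please aggregate the data or add a LIMIT clause.'
    return result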

Let's move on to the BabyAGI approach that inspired the current one.

BabyAGI agent with Tools

The code below is based on the example from LangChain's cookbook.

Similar to the previous approach, our other experimental one, BabyAGI, tries to plan first and then execute.

This approach uses retrieval, so we need to set up vector storage and an embedding model. I use the open-source and lightweight Chroma for storage and OpenAI embeddings.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings()
persist_directory = 'vector_store'

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

Retrieval allows the model to store all the results long-term while extracting and passing only the most relevant ones. If you want to learn more about retrieval, read my article on RAG (Retrieval Augmented Generation).
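For intuition, here's roughly what such a lookup does under the hood, sketched with our Chroma store (the stored text and the query are made up for illustration):

# storing a past result and retrieving the entries most similar to a query
vectordb.add_texts(['We have 111,469 active customers from the United Kingdom.'])
relevant_docs = vectordb.similarity_search('active customers in the UK', k=3)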

First, we will create a TO-DO chain that we will later use as a tool for our executor.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

todo_prompt_message = '''
You are a planner who is an expert at coming up with a todo list for 
a given objective. Come up with a todo list for this objective: {objective}
'''

todo_prompt = PromptTemplate.from_template(todo_prompt_message)
# gpt-4-1106-preview is a chat model, so we use ChatOpenAI rather than OpenAI here
todo_chain = LLMChain(llm=ChatOpenAI(temperature=0.1, 
    model = 'gpt-4-1106-preview'), prompt=todo_prompt)

Then, we will create an agent specifying tools and prompts.

from langchain.agents import AgentExecutor, Tool, ZeroShotAgent
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

tools = [
    execute_sql, 
    get_table_columns, 
    get_table_column_distr,
    Tool(
        name="TODO",
        func=todo_chain.run,
        description="useful for when you need to come up with todo lists. Input: an objective to create a todo list for. Output: a todo list for that objective. Please be very clear what the objective is!",
    )
]


prefix = """
You are an AI who performs one task based on the following objective: {objective}. Take into account these previously completed tasks: {context}.

You are asked questions related to analytics for the e-commerce product.
Your work is very important, since your product team makes decisions based on the data you provide. So, you are extremely accurate with the numbers you provide. 
If you're not sure about the details of the request, you don't provide the answer and ask follow-up questions to have a clear understanding.
You are very helpful and try your best to answer the questions.

All the data is stored in SQL Database. Here is the list of tables (in the format <database>.<table>) with descriptions:
- ecommerce.users - information about the customers, one row - one customer
- ecommerce.sessions - information about the sessions customers made on our web site, one row - one session
"""

suffix = """Question: {task}
{agent_scratchpad}"""

prompt = ZeroShotAgent.create_prompt(
    tools,
    prefix=prefix,
    suffix=suffix,
    input_variables=["objective", "task", "context", "agent_scratchpad"],
)

llm = OpenAI(temperature=0.1)
llm_chain = LLMChain(llm=llm, prompt=prompt)
tool_names = [tool.name for tool in tools]
analyst_agent_babyagi = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=tool_names)
analyst_agent_babyagi_executor = AgentExecutor.from_agent_and_tools(
    agent=analyst_agent_babyagi, tools=tools, verbose=True
)

The last step is to define the BabyAGI executor and run it.

from langchain_experimental.autonomous_agents import BabyAGI
baby_agi = BabyAGI.from_llm(
    llm=llm,
    vectorstore=vectordb,
    task_execution_chain=analyst_agent_babyagi_executor,
    verbose=True,
    max_iterations=10
)
baby_agi("Find, how many active customers from the United Kingdom we have.")

Again, the model failed to return results because it wasn't able to follow the input schema for the tool.

Also, surprisingly, the model decided not to use the TO-DO function to create a to-do list but to jump straight into querying SQL. However, the first query wasn't correct. The model tried to recover and called the get_table_columns function to get the column names, but it failed to follow the tool's schema.

Let's look at the log.

*****TASK LIST*****

1: Make a todo list

*****NEXT TASK*****

1: Make a todo list


> Entering new AgentExecutor chain...
Thought: I need to find out how many active customers from the United Kingdom 
we have
Action: execute_sql
Action Input: SELECT COUNT(*) FROM ecommerce.users WHERE country = 'UK' AND active = 1
Observation: Code: 47. DB::Exception: Missing columns: 'active' while processing query: 
'SELECT count() FROM ecommerce.users WHERE (country = 'UK') AND (active = 1)', 
required columns: 'country' 'active', maybe you meant: 'country'. 
(UNKNOWN_IDENTIFIER) (version 23.12.1.414 (official build))

Thought: I need to get the columns of the ecommerce.users table
Action: get_table_columns
Action Input: ecommerce.users

So, we've seen another problem that is pretty common for agents not powered by OpenAI functions: they fail to follow the required output structure.

AutoGPT agent with Tools

The code below is based on the example from LangChain's cookbook.

Let's look at another experimental approach — the implementation of AutoGPT using the LangChain framework.

Again, we need to set up vector storage for the intermediate steps.

from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings()
persist_directory = 'autogpt'

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In this case, again, we can't specify a custom prompt for the model. Let's try to use it without any specific guidance. But let's add the get_tables tool so the model can see all the available tables. I hope it will help the model write correct SQL queries.

@tool()
def get_tables() -> str:
    """Returns list of tables in the format <database>.<table>"""
    
    # stringify the list to match the declared str return type
    return str(['ecommerce.users', 'ecommerce.sessions'])

Let's create an AutoGPT agent. It's as easy as one function call. Then, let's execute it and see how it works.


from langchain_experimental.autonomous_agents import AutoGPT

analyst_agent_autogpt = AutoGPT.from_llm_and_tools(
    ai_name="Harry",
    ai_role="Assistant",
    tools= [execute_sql, get_table_columns, 
        get_table_column_distr, get_tables],
    llm=ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview'),
    memory=vectordb.as_retriever(),
)

analyst_agent_autogpt.chain.verbose = True

analyst_agent_autogpt.run(["Find how many active customers from the United Kingdom we have."])

The model was able to come up with the right answer: "The number of active customers from the United Kingdom is 111,469."

Reading through the prompt is interesting since we used the default one. You can access it via analyst_agent_autogpt.chain.prompt.

System: You are Harry, Assistant
Your decisions must always be made independently without seeking user 
assistance.
Play to your strengths as an LLM and pursue simple strategies with 
no legal complications.
If you have completed all your tasks, make sure to use the "finish" command.

GOALS:

1. Find how many active customers from the United Kingdom we have.


Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, 
so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall 
past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"

Commands:
1. execute_sql: execute_sql(query: str) -> str - Returns the result of SQL query execution, args json schema: {"query": {"title": "Query", "description": "SQL query to execute", "type": "string"}}
2. get_table_columns: get_table_columns(database: str, table: str) -> str - Returns list of table column names and types in JSON, args json schema: {"database": {"title": "Database", "description": "Database name", "type": "string"}, "table": {"title": "Table", "description": "Table name", "type": "string"}}
3. get_table_column_distr: get_table_column_distr(database: str, table: str, column: str, n: int = 10) -> str - Returns top n values for the column in JSON, args json schema: {"database": {"title": "Database", "description": "Database name", "type": "string"}, "table": {"title": "Table", "description": "Table name", "type": "string"}, "column": {"title": "Column", "description": "Column name", "type": "string"}, "n": {"title": "N", "description": "Number of rows, default limit 10", "type": "integer"}}
4. get_tables: get_tables() -> str - Returns list of tables in the format <database>.<table>, args json schema: {}
5. finish: use this to signal that you have finished all your objectives, args: "response": "final response to let people know you have finished your objectives"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are 
performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete 
tasks in the least number of steps.

You should only respond in JSON format as described below 
Response Format: 
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
} 
Ensure the response can be parsed by Python json.loads
System: The current time and date is Sat Dec 16 18:39:49 2023
System: This reminds you of these events from your past:
[]

So, we can see that this model has long-term memory via vector storage. Also, it produces comprehensive reasoning consisting of thoughts, reasoning, a plan, criticism and a summary to speak to the user.

Let's look at the model's reasoning (I've filtered out everything except the model's responses). It followed the same path as the previous successful agents.

{
    "thoughts": {
        "text": "To find the number of active customers from the United Kingdom, I need to identify the relevant database and table that contains customer information. Once I have that, I can look for columns that might indicate customer activity and country. The first step is to list all the available tables.",
        "reasoning": "Understanding the structure of the database is essential before I can execute a query to find the active customers. Listing tables is a logical first step to identify where customer data might be stored.",
        "plan": "- Use the get_tables command to list all available tables.\n- Identify the table that likely contains customer information.\n- Use get_table_columns to understand the structure of the customer table.\n- Execute an SQL query to count active UK customers.",
        "criticism": "I need to ensure that I'm efficient with the commands I use, aiming to gather the necessary information in as few steps as possible.",
        "speak": "I am now going to list all the available tables to find where customer information is stored."
    },
    "command": {
        "name": "get_tables",
        "args": {}
    }
}

{
    "thoughts": {
        "text": "The 'ecommerce.users' table is likely to contain information about the customers. To find the number of active customers from the United Kingdom, I need to understand the structure of this table, specifically looking for columns that might indicate whether a customer is active and their country.",
        "reasoning": "Knowing the structure of the 'users' table will help me determine which columns to use in an SQL query to count active UK customers.",
        "plan": "- Use the get_table_columns command to get the structure of the 'ecommerce.users' table.\n- Identify columns that indicate customer activity and country.\n- Formulate an SQL query to count active UK customers based on the identified columns.",
        "criticism": "I must continue to be efficient with my commands, ensuring that each step brings me closer to the goal without unnecessary actions.",
        "speak": "I will now retrieve the structure of the 'ecommerce.users' table to identify the relevant columns for active customers and their country."
    },
    "command": {
        "name": "get_table_columns",
        "args": {
            "database": "ecommerce",
            "table": "users"
        }
    }
}

{
    "thoughts": {
        "text": "With the structure of the 'ecommerce.users' table known, I can see that there are columns for 'country' and 'is_active', which are necessary to identify active customers from the United Kingdom.",
        "reasoning": "The 'country' column will allow me to filter customers by their country, and the 'is_active' column will allow me to filter customers by their activity status. With this information, I can construct an SQL query to count the number of active customers from the UK.",
        "plan": "- Execute an SQL query using the 'execute_sql' command to count the number of active customers from the United Kingdom.\n- The query will count rows where 'country' is 'United Kingdom' and 'is_active' is true (1).",
        "criticism": "I need to ensure that the SQL query is correctly formulated to retrieve the exact information needed without any errors.",
        "speak": "I am now ready to execute an SQL query to count the number of active customers from the United Kingdom."
    },
    "command": {
        "name": "execute_sql",
        "args": {
            "query": "SELECT COUNT(*) FROM ecommerce.users WHERE country = 'United Kingdom' AND is_active = 1"
        }
    }
}

{
    "thoughts": {
        "text": "I have successfully executed the SQL query and obtained the number of active customers from the United Kingdom, which is 111,469.",
        "reasoning": "The goal was to find the number of active customers from the UK, and the SQL query has provided this information.",
        "plan": "- Since the goal has been achieved, the next step is to use the 'finish' command to signal that I have completed all my objectives.",
        "criticism": "There is no need for further action or commands as the objective has been met.",
        "speak": "I have found the number of active customers from the United Kingdom and am ready to finish the task."
    },
    "command": {
        "name": "finish",
        "args": {
            "response": "The number of active customers from the United Kingdom is 111,469."
        }
    }
}

Overall, AutoGPT looks like a working approach. However, I still believe that the most reliable way to build agents right now is through OpenAI functions.

Do we need to build everything from scratch?

We've spent some wonderful learning time building an agent integrated with a SQL database. However, I must mention that LangChain has its own SQL agent implementation built around SQLDatabaseToolkit and create_sql_agent.

This approach uses SQLAlchemy to interact with databases, so we need to install the clickhouse-sqlalchemy package to connect to ClickHouse.

pip install clickhouse-sqlalchemy

We can set up a connection to the database and initialize a toolkit.

from langchain.sql_database import SQLDatabase
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

uri = 'clickhouse+native://localhost/ecommerce'
db = SQLDatabase.from_uri(uri)
toolkit = SQLDatabaseToolkit(db=db, llm=OpenAI(temperature=0))

A toolkit is a collection of useful tools related to some topic. You can find lots of examples in the documentation.

We can see the list of tools we have in the toolkit. There are tools to list the tables, look up table schemas and execute SQL queries.

toolkit.get_tools()

Then, we can quickly create and run an agent based on OpenAI functions.

from langchain.agents import create_sql_agent

agent_executor = create_sql_agent(
    llm=ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview'),
    toolkit=toolkit,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS
)

agent_executor.run("How many active customers from the United Kingdom do we have?")

We got the correct answer without much hassle on our side.

> Entering new AgentExecutor chain...

Invoking: `sql_db_list_tables` with ``


sessions, users
Invoking: `sql_db_schema` with `users`

CREATE TABLE users (
 user_id UInt64, 
 country String, 
 is_active UInt8, 
 age UInt64
) ENGINE = Log

/*
3 rows from users table:
user_id country is_active age
1000001 United Kingdom 0 70
1000002 France 1 87
1000003 France 1 88
*/
Invoking: `sql_db_query` with `SELECT COUNT(*) FROM users WHERE country = 'United Kingdom' AND is_active = 1`


[(111469,)]We have 111,469 active customers from the United Kingdom.

> Finished chain.
'We have 111,469 active customers from the United Kingdom.'

We can use langchain.debug = True to see what prompt was used.

System: You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct clickhouse query 
to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, 
always limit your query to at most 10 results.
You can order the results by a relevant column to return the most interesting 
examples in the database.
Never query for all the columns from a specific table, only ask for 
the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned 
by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get 
an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) 
to the database.

If the question does not seem related to the database, just return 
"I don't know" as the answer.

Human: How many active customers from the United Kingdom do we have? 
AI: I should look at the tables in the database to see what I can query.  
Then I should query the schema of the most relevant tables.

So, we have a pretty convenient and working implementation of an SQL analyst. If you don't need any custom changes, you can just use the LangChain implementation.

Also, you can tweak it a bit, for example, by passing a custom prompt to the create_sql_agent function (documentation).
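Here's a sketch of how that could look, assuming the prefix argument of create_sql_agent (as in the 2023-era LangChain API) and reusing the system_message we defined earlier:

agent_executor_custom = create_sql_agent(
    llm=ChatOpenAI(temperature=0.1, model = 'gpt-4-1106-preview'),
    toolkit=toolkit,
    prefix=system_message, # our custom instructions defined earlier
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS
)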

Summary

Today, we've learned how to create different types of agents. We've implemented an LLM-powered agent that can work with SQL databases entirely from scratch. Then, we leveraged high-level LangChain tools to achieve the same result with a couple of function calls.

So, now our LLM-powered analyst can use data from your DB and answer questions. It's a significant improvement. We can add the SQL database agent as a tool for our LLM-powered analyst: it will be its first skill.

The agent can now answer data-related questions and work on its own. However, the cornerstone of analytics work is collaboration. So, in the following article, we will add memory and teach our agent to ask follow-up questions. Stay tuned!

Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.