Welcome to my blog post about using an LLM, Chainlit and LangChain to converse about the best options for lunch.
As I sat in a coworking space located in the center of one of Poland’s most touristic cities, I was faced with a simple dilemma: where to find a quality lunch? I know the city center very well, so I know that most places count on the influx of tourists. Since I wanted something more authentic at a reasonable price, my best option was to google the Facebook pages of local venues. And I’m not really a Facebook fan.
This quest sparked an idea for my pet project: a website that presents a map and chat functionality to effortlessly seek out appealing lunch options without the hassle of navigating Facebook.
Project In a Nutshell
My project journey took me through various stages of development: I started by creating a website with Webflow and an interactive map via Mapbox, then moved on to conceptualising a chat feature.
In this post, I’ll focus on the backend processes of scraping, data extraction, and chat integration:
- Scrape Google for details on local restaurants
- Scrape the Facebook pages of these venues
- Employ a Language Model to decide whether a post is about a lunch menu
- Scrape and analyse daily posts from sites known to share lunch menus
- Extract the lunch menus and their prices with the aid of a Language Model
- Store meal data in a vector database
- Engage in a dialogue with the dataset through an LLM, augmented with chat history and document retrieval capabilities
Technologies I Used
- LangChain – framework to generalise the use of LLMs
- OpenAI – the LLM for this project
- Chainlit – frontend for the LLM application
- Crawlbase – scraper for Facebook pages
- SerpAPI – scraper for Google results
- ChromaDB – vectorstore
- SQLite – SQL database
Project details
For detailed installation instructions, please refer to the README.md and the code repo.
Here, I’ll concentrate on pivotal moments that defined this project — which, to be clear, is a proof of concept and nothing more.
Database schema
The schema primarily revolves around data scraped from Google and Facebook. For a few hundred records, SQLite serves as a sufficient database to house the scraped data.
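The exact schema lives in the repo; as a rough sketch (the table and column names below are illustrative, not the real schema), it boils down to venues and their scraped posts:

import sqlite3

# Illustrative schema only -- the real one is in the repo.
conn = sqlite3.connect("lunch.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS venues (
    id INTEGER PRIMARY KEY,
    name TEXT,
    address TEXT,
    google_maps_url TEXT,
    facebook_url TEXT
);
CREATE TABLE IF NOT EXISTS posts (
    id INTEGER PRIMARY KEY,
    venue_id INTEGER REFERENCES venues(id),
    scraped_at TEXT,
    content TEXT,
    is_lunch_menu TEXT
);
""")
conn.commit()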
Scrapers
I began with SerpAPI for Google scraping and have not altered the code since its inception. If I were to start over, I might opt for a unified platform for all my scrapes.
With Facebook, I initially used Apify, which can return as many posts as necessary. However, due to its cost, it wasn’t suitable for my pet project. Thus, I turned to Crawlbase, which lets me scrape only the most recent post after extracting the link from the main page. While it means double the work and lacks the neat JSON formatting provided by Apify, it’s cost-effective, which is paramount for a non-revenue-generating project.
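In code, that two-step flow looks roughly like this. This is a sketch assuming Crawlbase’s Python client; extract_latest_post_link and extract_post_text stand in for hypothetical HTML-parsing helpers of my own:

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def latest_post(page_url: str) -> str:
    # step 1: fetch the venue's main page and pull out the link
    # to the most recent post (extract_latest_post_link is a
    # hypothetical helper that parses the returned HTML)
    response = api.get(page_url)
    post_url = extract_latest_post_link(response['body'])

    # step 2: fetch the post itself and return its text
    response = api.get(post_url)
    return extract_post_text(response['body'])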
LangChain, Prompts & OpenAI
That’s the crux of the solution.
There are three places where I engage the LLM in the process.
First, I assess posts from venues that serve food but are not specifically known for their lunch menus.
Tagging
Prompt: Think carefully, and then tag the text as instructed
Pydantic class:
from typing import Optional
from pydantic import BaseModel, Field

class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    is_lunch_menu: str = Field(description="text contains lunch menu, should be 'yes' or 'no'")
    is_daily: Optional[str] = Field(description="is the lunch menu only for today, should be 'yes' or 'no'")
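One way to run such a schema against a post is LangChain’s create_tagging_chain_pydantic, which wires the class into OpenAI function calling. A minimal sketch, with a made-up example post:

from langchain.chat_models import ChatOpenAI
from langchain.chains import create_tagging_chain_pydantic

llm = ChatOpenAI(temperature=0)
chain = create_tagging_chain_pydantic(Tagging, llm)

# returns a populated Tagging instance
tags = chain.run("Lunch today only: zurek 15 PLN, pierogi 20 PLN!")
# -> Tagging(is_lunch_menu='yes', is_daily='yes')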
Next, for venues confirmed to offer lunch, I use an extractor to get precise details about meals and prices.
Extracting lunch details
Prompt: Extract the relevant information, if not explicitly provided do not guess.
Extract partial info. Return empty string if info not provided.
Pydantic class:
from typing import List, Optional
from pydantic import BaseModel, Field

class Meal(BaseModel):
    """Information about meals mentioned"""
    meal: Optional[str] = Field(description="Meal")
    price: Optional[str] = Field(description="Price of meal with currency")

class Menu(BaseModel):
    """Information to extract"""
    menu: List[Meal] = Field(description="List of info about meals")
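Again, a minimal sketch of applying this schema with LangChain’s create_extraction_chain_pydantic (the example post is made up):

from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain_pydantic

llm = ChatOpenAI(temperature=0)
chain = create_extraction_chain_pydantic(pydantic_schema=Menu, llm=llm)

# returns a list of populated Menu instances
menus = chain.run("Today's lunch: tomato soup 12 PLN, schabowy with potatoes 28 PLN")
# -> [Menu(menu=[Meal(meal='tomato soup', price='12 PLN'), ...])]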
From this, I obtain a list of meals and their prices for each venue I scrape daily. I’ve chosen to create a separate vector store for each day’s data, which allows me to preserve historical data for analysis or to purge it to save space.
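A sketch of how such a per-day store can be populated; the meal text, embedding model name and path here are illustrative:

from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Chroma

embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# one short document per meal, with price and venue link as metadata
docs = [
    Document(
        page_content="Zurek with egg and sausage",
        metadata={"price": "15 PLN", "url": "https://facebook.com/some-venue"},
    ),
]

# persist_directory is suffixed with the date, giving one store per day
Chroma.from_documents(docs, embedding, persist_directory="./chroma/2023-11-20")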
For this project to truly succeed, I’d need to teach the LLM about Polish cuisine, as the composition and naming of the meals surprise even me, and I am a native. 🙂
Chatbot
As for chatting with the data, Chainlit provides the frontend. Since their getting-started instructions are straightforward, I’ll skip that step and move on to the backend of the chat functionality.
Data is stored in ChromaDB, with each meal described in one sentence, alongside price and venue link metadata. This structure was chosen upon discovering the SelfQueryRetriever. However, I’m looking forward to revisiting my original idea of storing data like prices and opening hours in the SQLite3 database, allowing the LLM to determine the source of information.
The code responsible for the chatbot:
from dotenv import load_dotenv
load_dotenv()

import yaml
from operator import itemgetter

import chainlit as cl
from langchain.vectorstores import Chroma
from langchain.schema.runnable import (
    RunnablePassthrough,
    RunnableLambda,
    RunnableMap,
    RunnableConfig,
)

from utils import (
    get_path,
    load_file,
    date_today,
    metadata_info,
    _embedding_function,
    _llm,
    _output_parser,
    _memory,
    _retriever,
    _prompt_template,
    _combine_documents,
)

with open('../config.yaml') as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)


@cl.on_chat_start
async def main():
    # initiate asynchronous connection to the day's vectorstore
    db_file = get_path('..', cfg['db']['dir'], cfg['db']['vector_db']) + date_today()
    embedding = _embedding_function(model_name=cfg['embedding_model_name'])
    docsearch = await cl.make_async(Chroma)(
        persist_directory=db_file,
        embedding_function=embedding,
    )

    # memory
    memory = _memory(return_message=False, output_key="answer", input_key="question")

    llm = _llm(chat=False, temp=0)
    chat_llm = _llm(chat=True, temp=0)
    output_parser = _output_parser()
    metadata_field_info, document_content_description = metadata_info()

    # retriever
    retriever = _retriever(llm, docsearch, document_content_description, metadata_field_info)

    _template = load_file(get_path('..', cfg['prompts']['dir'], cfg['prompts']['condense']))
    CONDENSE_QUESTION_PROMPT = _prompt_template(chat=False, template=_template)
    template = load_file(get_path('..', cfg['prompts']['dir'], cfg['prompts']['answer']))
    ANSWER_PROMPT = _prompt_template(chat=True, template=template)

    # chain element: condense question + chat history into a standalone question
    standalone_question = {
        "standalone_question": {
            "question": lambda x: x["question"],
            "chat_history": lambda x: x["chat_history"],
        }
        | CONDENSE_QUESTION_PROMPT
        | chat_llm
        | output_parser,
    }

    # This adds a "chat_history" key to the input object
    loaded_memory = RunnablePassthrough.assign(
        chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
    )

    # retrieve documents relevant to the standalone question
    retrieved_documents = RunnableMap({
        "docs": lambda x: retriever.get_relevant_documents(x["standalone_question"]),
        "question": lambda x: x["standalone_question"],
    })

    final_inputs = {
        "context": lambda x: _combine_documents(x["docs"]),
        "question": itemgetter("question"),
    }

    runnable = (
        loaded_memory
        | standalone_question
        | retrieved_documents
        | final_inputs
        | ANSWER_PROMPT
        | chat_llm
        | output_parser
    )
    cl.user_session.set("runnable", runnable)


@cl.on_message
async def on_message(message: cl.Message):
    runnable = cl.user_session.get("runnable")
    msg = cl.Message(content="")

    async for chunk in runnable.astream(
        {"question": message.content},
        config=RunnableConfig(callbacks=[cl.LangchainCallbackHandler()]),
    ):
        await msg.stream_token(chunk)

    await msg.send()
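With both Chainlit handlers in place, the app starts locally via Chainlit’s CLI; the module name below is whatever you saved the file as, and -w reloads the app on file changes:

chainlit run chatbot.py -w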
These are the functions that define the LLM, memory, embeddings, prompts and output parser:
from typing import List

from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.retrievers import SelfQueryRetriever
from langchain.schema.output_parser import StrOutputParser


def _embedding_function(model_name: str):
    """Initiate embedding function"""
    return SentenceTransformerEmbeddings(model_name=model_name)


def _llm(chat: bool, temp: int):
    """Return handle to LLM or Chat LLM"""
    return ChatOpenAI(temperature=temp) if chat else OpenAI(temperature=temp)


def _output_parser():
    """Return handle for output parser"""
    return StrOutputParser()


def _memory(return_message: bool, output_key: str, input_key: str):
    """Return memory handle"""
    return ConversationBufferMemory(
        return_messages=return_message,
        output_key=output_key,
        input_key=input_key,
    )


def _retriever(llm, db, document_descr: str, metadata: List):
    """Return retriever handle"""
    return SelfQueryRetriever.from_llm(
        llm,
        db,
        document_descr,
        metadata,
        verbose=True,
    )


def _prompt_template(chat: bool, template: str):
    """Return formatted template"""
    return ChatPromptTemplate.from_template(template) if chat else PromptTemplate.from_template(template)
This is the information I pass to the LLM to decide on metadata; it’s required by the SelfQueryRetriever:
from langchain.chains.query_constructor.base import AttributeInfo


def metadata_info():
    """Return metadata info and document content description"""
    metadata_field_info = [
        AttributeInfo(
            name="price",
            description="price of the meal",
            type="string",
        ),
        AttributeInfo(
            name="url",
            description="facebook page of the restaurant",
            type="string",
        ),
    ]
    document_content_description = "lunch menu from local restaurants"
    return metadata_field_info, document_content_description
And finally, code I modified from the LangChain website to process the documents as a string, while including the metadata I need to answer questions about the price and the place offering a particular lunch:
def _combine_documents(documents):
    formatted_documents = []
    for doc in documents:
        # Assuming 'doc' has 'page_content' and 'metadata' attributes
        page_content = doc.page_content
        price = doc.metadata['price']
        url = doc.metadata['url']
        formatted_doc = f"{page_content} at the price of {price} served by {url}"
        formatted_documents.append(formatted_doc)
    return "\n\n".join(formatted_documents)
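So each retrieved meal reaches the answer prompt as a line like (illustrative values): Pierogi with spinach at the price of 25 PLN served by https://facebook.com/some-venue.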
Thoughts on the Solution
I believe the Retriever could function more effectively. If it weren’t for the metadata reliance, there would be more efficient retrieval methods available, which is why I’m considering moving the metadata back to the SQLite database. This change would grant me greater flexibility in terms of retrieval methods.
Moreover, while the LLM is generally adept at translating the original query, it sometimes errs, leading to the retrieval of irrelevant data. One challenge I’ve observed is that LLMs are primarily trained in English, yet many users will interact in Polish, especially since the menu data is in Polish. This is why I think the LLM needs to become more acquainted with the variations of meals and the ways in which users may request specific dishes.
Certainly, the retrieval method needs to be refined to yield better results; otherwise, even an LLM with a temperature setting of zero can’t work wonders.
So the project isn’t quite finished, but with OpenAI’s recent announcements, I might be able to overcome the current limitations of LangChain and the retriever, and build a robust product with less hassle.
The near future will reveal whether my optimism is well-founded. 🙂