I decided to try to build a chatbot that will answer questions based on the content of my blog posts.
In order to do that I used:
- Langchain, for components to load data, process text, embeddings, and conversational modeling
- OpenAI, for embeddings and the chat model
- Streamlit to publish the app.
Below is a step-by-step guide detailing the process, code, and explanations of my development journey. (PS: I used VSCode)
⚠️ Important note:
LangChain has updated its library since I first wrote this. I revised the article on 23/11/24 to reflect that, but I didn’t check for any other changes. The app is currently live and functional.
- Install packages
- Import necessary libraries
- (Optional: How to generate the CSV)
- Load the data
- Set OpenAI key
- Split content into chunks
- Embed the data into a database
- Create the conversation chain
- Handle the user input
- Main function
- Create the HTML template
- Run it locally
- Upload it on Github
- Push it on streamlit
Install packages
First, create a requirements.txt file to specify the necessary packages, and include the following:
langchain
openai
faiss-cpu
tiktoken
langchain-community
These packages are essential for building the chatbot. When you deploy, Streamlit will automatically detect and install these packages from the file. To run the program locally, you need to run the following commands in the command prompt (Windows):
pip install langchain
pip install openai
pip install faiss-cpu
pip install tiktoken
pip install langchain-community
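Before moving on, it can help to confirm that the packages are importable from the environment the app will run in. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical, and the list uses import names rather than pip names (faiss-cpu is imported as faiss, langchain-community as langchain_community). Streamlit is included because it is needed for a local run.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for the packages in requirements.txt, plus streamlit itself.
REQUIRED = ["streamlit", "langchain", "openai", "faiss", "tiktoken", "langchain_community"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", missing if missing else "none")
```

If the script reports a missing module, install the corresponding package with pip and run the check again.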
Import necessary libraries
Next, open your main file (streamlitapp.py) and import the necessary libraries. You need:
import streamlit as st
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from HTMLTemplate import css, user_template, bot_template
import os
Libraries explained:
import streamlit as st:
Streamlit is an open-source Python library used to create web applications for data projects with minimal effort. It's especially popular among data scientists and engineers for building data dashboards, visualization tools, and interactive reports.
By importing it as st, you can use the shorthand st to access Streamlit's functions and methods.
from langchain.document_loaders import CSVLoader:
LangChain is an open-source Python framework that is widely used to develop LLM applications.
CSVLoader is a LangChain document loader that reads a CSV file and returns one document per row.
📖 There are a lot of document loaders in LangChain; you can find them here.
from langchain.text_splitter import CharacterTextSplitter:
CharacterTextSplitter is a LangChain text splitter (a kind of document transformer) that splits text on a separator and measures chunk size in characters. This is useful for preparing data for embeddings.
📖 Here you can find other splitters that can be used with LangChain.
from langchain.embeddings import OpenAIEmbeddings:
OpenAIEmbeddings is an embedding class in LangChain that generates embeddings using OpenAI. Embeddings transform textual data into numerical vectors that machines can compare.
📖 Find more info about OpenAI embeddings here; it’s a very popular method.
from langchain.chat_models import ChatOpenAI:
ChatOpenAI is a chat model wrapper in LangChain that uses OpenAI's models to generate conversational responses.
from langchain.vectorstores import FAISS:
FAISS is a library developed by Meta AI for efficient similarity search and clustering of dense vectors. Within LangChain, it is used to store and retrieve embeddings efficiently.
📖 Check the vector stores that LangChain supports here.
from langchain.memory import ConversationBufferMemory:
ConversationBufferMemory in LangChain provides memory for conversations, so the chatbot can reference previous parts of a conversation.
📖 Here is more information about memory types in LangChain.
from langchain.chains import ConversationalRetrievalChain:
ConversationalRetrievalChain in LangChain combines conversation and retrieval: it looks up the stored chunks most relevant to the user's question and passes them, together with the chat history, to the chat model to generate a response.
📖 Chains are a fundamental concept in LangChain; you can find more information here.
import os:
os is a built-in Python module that provides a way of using operating system-dependent functionality, such as reading or writing to the filesystem, managing paths, and accessing environment variables.
(Optional: How to generate the CSV)
The CSV I used includes the titles and content of all the blog posts I have written so far on kgiamalis.co. To build it, I used the following libraries:
from langchain.document_loaders import AsyncChromiumLoader
from bs4 import BeautifulSoup
from langchain.document_transformers import BeautifulSoupTransformer
import csv
For those who are interested, I ran the AsyncChromiumLoader over my blog’s sitemap to load the content of every URL that contains “blog”.
Then I used BeautifulSoupTransformer to extract the h1 (title), the page_content (copy), and the URL of each post.
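Once the titles, copy, and URLs are extracted, they need to end up as rows in a CSV. The following is a minimal sketch of that last step using only the csv module. The column names (title, content, url), the helper name write_posts_csv, and the sample post are assumptions for illustration, not the actual data or schema behind personal_posts.csv.

```python
import csv
import io


def write_posts_csv(posts, file_obj):
    """Write one row per blog post with title, content, and url columns."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "content", "url"])
    writer.writeheader()
    for post in posts:
        writer.writerow(post)


# Hypothetical scraped data standing in for the real blog posts.
sample_posts = [
    {
        "title": "First post",
        "content": "Some text,\nwith a newline.",
        "url": "https://example.com/blog/first",
    },
]

# An in-memory buffer keeps the example self-contained. For a real file, use
# open("personal_posts.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_posts_csv(sample_posts, buffer)

# Read the rows back to confirm that commas and newlines survive the round trip.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

The csv module quotes fields that contain commas or newlines, which matters here because blog copy usually contains both.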
Load the data
- CSVLoader Initialization: loaders = CSVLoader('personal_posts.csv', encoding='utf-8')
  - This line initializes an instance of the CSVLoader class from the LangChain library, which is designed to load data from CSV files.
  - The instance is created with two arguments:
    - 'personal_posts.csv': The name of the CSV file that contains the data to be loaded. The file contains textual data from kgiamalis.co posts.
    - encoding='utf-8': Specifies the character encoding used in the CSV file. UTF-8 is a widely used character encoding that can represent any character in the Unicode standard.
- Loading the Data: docs = loaders.load()
  - This line calls the load method of the loaders instance, reading the content of the personal_posts.csv file.
  - The loaded data is stored in the docs variable.
#Load Data with LangChain CSVLoader
loaders=CSVLoader('personal_posts.csv', encoding='utf-8')
docs=loaders.load()
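To get a feel for what docs contains, here is a rough standard-library imitation of the loader's output: one document per CSV row, with each column rendered as a "name: value" line and the source file and row index kept as metadata. This is a sketch of the output shape only, not LangChain's implementation. The function name rows_to_documents and the sample columns are hypothetical.

```python
import csv
import io


def rows_to_documents(file_obj, source):
    """Imitate the shape of CSVLoader output: one document per CSV row."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(file_obj)):
        # Each column becomes a "name: value" line in the document text.
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": page_content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


sample_csv = "title,content\nHello,First post body\nBye,Second post body\n"
documents = rows_to_documents(io.StringIO(sample_csv), source="personal_posts.csv")
```

Because every row becomes its own document, a long blog post stored in one row is still one large document at this point, which is why the splitting step follows.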
Set OpenAI key
I added the OpenAI API key, which the OpenAI embedding and chat models need in order to run. Here you can see how I read it from Streamlit's secrets so that the deployed app can use it.
#Set OpenAI API Key
openai_key = st.secrets["openai"]["openai_api_key"]
os.environ["OPENAI_API_KEY"] = st.secrets["openai"]["openai_api_key"]
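If you want the same script to work both locally (with an environment variable) and on Streamlit (with secrets), one option is a small lookup helper. The sketch below is an assumption-laden illustration: resolve_api_key is a hypothetical name, and it is written against a plain nested dict. st.secrets can be passed in its place, but it may raise a different exception than KeyError when no secrets file exists, so treat the error handling as a starting point.

```python
import os


def resolve_api_key(secrets, environ=os.environ):
    """Prefer an environment variable, then fall back to the nested secrets entry."""
    key = environ.get("OPENAI_API_KEY")
    if key:
        return key
    try:
        return secrets["openai"]["openai_api_key"]
    except KeyError:
        raise RuntimeError("No OpenAI API key found in the environment or in secrets") from None


# Example with stand-in values (not real keys).
from_secrets = resolve_api_key({"openai": {"openai_api_key": "sk-from-secrets"}}, environ={})
from_env = resolve_api_key({}, environ={"OPENAI_API_KEY": "sk-from-env"})
```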
Split content into chunks
The get_text_chunks function prepares the text for embedding by splitting it into smaller, overlapping chunks. This keeps each piece small enough to embed while preserving context across chunk boundaries.
- Function Definition: def get_text_chunks(docs):
  - Defines the get_text_chunks function. It takes a single argument, docs, which is the list of documents loaded above.
- CharacterTextSplitter Initialization: text_splitter = CharacterTextSplitter(...)
  - An instance of the CharacterTextSplitter class is created and assigned to the text_splitter variable.
  - The purpose of this class is to split text into smaller chunks according to the parameters below.
  - Parameters:
    - separator="\n": The text is split wherever a newline character appears, and the resulting pieces are then merged into chunks.
    - chunk_size=1000: The target maximum size of each chunk, measured in characters. A single piece that contains no newline can still exceed it.
    - chunk_overlap=200: Consecutive chunks share up to 200 characters, so that important context is not lost at chunk boundaries.
    - length_function=len: Chunk length is measured with Python's built-in len, that is, in characters.
- Splitting the Documents: text_chunks = text_splitter.split_documents(docs)
  - The split_documents method of the text_splitter object is called with the docs argument. It splits the documents into smaller chunks based on the parameters set when text_splitter was created. The resulting chunks are stored in the text_chunks variable.
- Return Statement: return text_chunks
  - The function concludes by returning text_chunks, the smaller segments of the original docs.
#Prepare data for embedding
def get_text_chunks(docs):
    text_splitter=CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)
    text_chunks=text_splitter.split_documents(docs)
    return text_chunks
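To make chunk_size and chunk_overlap concrete, here is a simplified sliding-window splitter in plain Python. It is not how CharacterTextSplitter works internally (that class splits on the separator first and then merges pieces); it only illustrates how a size limit and an overlap interact. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by the overlap so neighbouring chunks share some text.
        start = end - chunk_overlap
    return chunks


# 2,500 characters of sample text made of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
```

With these settings each new chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are also the first 200 characters of the next.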
Embed the data into a database
While there are several libraries available for storing vectors, such as Pinecone and ChromaDB, I chose FAISS for the following reasons:
- Efficiency: FAISS is optimized for memory usage and speed, making it highly efficient for similarity searches.
- Ease of Use: FAISS offers a straightforward API that makes it easy to store and retrieve vectors.
The get_vector_store function takes the text chunks, converts them into vector embeddings using OpenAI models, stores the embeddings in a FAISS index, and returns that store. This allows for efficient similarity searches on the embedded data.
- Function Definition: def get_vector_store(text_chunks):
  - Defines a function named get_vector_store that accepts the text_chunks produced in the previous step.
- Initializing OpenAI Embeddings: embeddings = OpenAIEmbeddings()
  - Creates an instance of LangChain's OpenAIEmbeddings class. This object converts text into vector representations using OpenAI's embedding models. These vectors capture the semantics and context of the text.
- Embedding the Text and Storing in FAISS: vectorstore = FAISS.from_documents(text_chunks, embeddings)
  - This line does two things:
    - Embedding: The text_chunks are converted into vectors using the embeddings object.
    - Storing in FAISS: The from_documents method of the FAISS class takes the text chunks and the embeddings object, computes a vector for each chunk, and builds a vector store from which the chunks can later be retrieved by similarity.
- Return the Vector Store: return vectorstore
  - The function concludes by returning the vectorstore object, which now contains the vector embeddings of the provided text chunks.
#Embed the data in FAISS
def get_vector_store(text_chunks):
    embeddings=OpenAIEmbeddings()
    vectorstore=FAISS.from_documents(text_chunks, embeddings)
    return vectorstore
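The idea behind a similarity search can be shown without FAISS or OpenAI at all. The sketch below ranks a few stored texts by Euclidean distance to a query vector. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and the names nearest_chunks and toy_store are hypothetical. FAISS does the same kind of lookup with optimized index structures.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, store, k=2):
    """Return the k stored texts whose vectors are closest to the query vector."""
    ranked = sorted(store, key=lambda item: euclidean_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for embeddings of three chunks.
toy_store = [
    ("post about user acquisition KPIs", [0.9, 0.1, 0.0]),
    ("post about the curse of knowledge", [0.1, 0.9, 0.0]),
    ("post about dashboards", [0.0, 0.2, 0.9]),
]

results = nearest_chunks([0.8, 0.2, 0.1], toy_store, k=2)
```

The query vector sits closest to the first entry, so that chunk is returned first, followed by the next closest.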
Create the conversation chain
This function showcases the interplay between embeddings (vectorstore), a chat model (llm), and memory management to create a conversational agent that's both informed by past interactions and capable of retrieving relevant information from a dataset.
- The Function: get_conversation_chain(vectorstore)
  - The function takes a vectorstore as an argument, which contains embedded representations of our data. This store is used to retrieve relevant chunks during conversations.
- Language Model Initialization: llm=ChatOpenAI(temperature=0.0)
  - Here, we instantiate a chat model using the ChatOpenAI class. The temperature parameter controls the randomness of the model's output. A low value like 0.0 makes the output more deterministic, giving consistent and focused replies.
- Memory Management: memory=ConversationBufferMemory(memory_key='chat_history', return_messages=True)
  - The conversation's context is vital. The ConversationBufferMemory class stores the chat history under the chat_history key, allowing the model to reference previous interactions and provide contextually relevant responses. The return_messages=True argument makes the memory return the history as a list of message objects rather than as a single string.
- Creating the Conversation Chain: conversation_chain=ConversationalRetrievalChain.from_llm(...)
  - This line is the heart of the function. The ConversationalRetrievalChain class brings together the chat model, the embedded text data in vectorstore, and the conversation memory to form a cohesive conversation chain. This chain can take a user's query, search through the embedded data for relevant information, and craft a contextually appropriate response.
#Create a Conversation Chain
def get_conversation_chain(vectorstore):
    llm=ChatOpenAI(temperature=0.0)
    memory=ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain=ConversationalRetrievalChain.from_llm(llm=llm, retriever=vectorstore.as_retriever(), memory=memory)
    return conversation_chain
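Conceptually, one turn through a conversational retrieval chain has three stages: rewrite a follow-up into a standalone question using the chat history, retrieve matching chunks, and generate an answer from them. The sketch below mimics that flow with stand-in functions so it runs without any API calls. Every name in it (answer_with_retrieval, fake_rephrase, fake_retrieve, fake_generate) is hypothetical, and the real chain uses LLM calls and the vector store where the stand-ins appear.

```python
def answer_with_retrieval(question, chat_history, rephrase, retrieve, generate):
    """Run one turn: condense the question, fetch context, answer, and record the turn."""
    # With earlier turns present, rewrite the follow-up as a standalone question.
    standalone = rephrase(question, chat_history) if chat_history else question
    context = retrieve(standalone)
    answer = generate(standalone, context)
    chat_history.append((question, answer))
    return answer


# Stand-in callables; in the real chain these are LLM calls and a vector store lookup.
def fake_rephrase(question, history):
    return f"{question} (regarding: {history[-1][0]})"


def fake_retrieve(query):
    return [f"chunk matching '{query}'"]


def fake_generate(query, context):
    return f"Answer to '{query}' using {len(context)} chunk(s)"


history = []
first = answer_with_retrieval(
    "What is curse of knowledge?", history, fake_rephrase, fake_retrieve, fake_generate
)
second = answer_with_retrieval(
    "How do I avoid it?", history, fake_rephrase, fake_retrieve, fake_generate
)
```

The second question only makes sense in light of the first, which is why the history is consulted before retrieval rather than after.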
Handle the user input
This function is responsible for managing the interaction between a user and a chatbot within a Streamlit application. It takes the user's input, processes it through an active chatbot conversation, stores the chat history, and then displays the ongoing conversation in a formatted manner. If the chatbot isn't active, it alerts the user to start the conversation.
- Function Definition: def handle_user_input(user_question):
  - The function is defined to take one argument, user_question, which represents the query or message input by the user.
- Checking for an Active Conversation: if st.session_state.conversation:
  - This line checks if there is an active conversation in the Streamlit session state. If st.session_state.conversation is set, the chatbot is active and ready to process user input.
- Getting the Chatbot's Response: response = st.session_state.conversation({'question': user_question})
  - If there's an active conversation, the user's question is passed to the chatbot, and the response is stored in the response variable.
- Storing Chat History: st.session_state.chat_history = response['chat_history']
  - The chat history from the bot's response is stored in the session state. This allows the application to keep track of the ongoing conversation.
- Displaying the Conversation:
  - The loop for i, message in enumerate(st.session_state.chat_history): iterates through each message in the chat history.
  - The conditional if i % 2 == 0: decides whether the message is from the user or the bot based on its position in the chat history. If the index (i) is even, it's assumed to be a user message; otherwise, it's a bot message.
  - The messages are displayed using st.write(...), and the content of the message replaces the {{MSG}} placeholder in the respective templates (user_template for user messages and bot_template for bot messages).
- Warning for No Active Conversation: else: st.warning("Please press 'Start' before asking a question.")
  - If there's no active conversation (i.e., the user hasn't pressed the 'Start' button in the sidebar yet), a warning is displayed prompting the user to start the conversation.
#Handle User Input
def handle_user_input(user_question):
    if st.session_state.conversation:
        response = st.session_state.conversation({'question': user_question})
        st.session_state.chat_history = response['chat_history']
        for i, message in enumerate(st.session_state.chat_history):
            if i % 2 == 0:
                st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
            else:
                st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
    else:
        st.warning("Please press 'Start' before asking a question.")
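Because the messages are written with unsafe_allow_html=True, any < or > in a question or answer is interpreted as HTML. One optional hardening step is to escape the text before placing it in the template. The sketch below shows the alternating-template logic on plain strings with html.escape applied. The shortened templates and the function name render_history are assumptions for illustration, and the article's own code does not do this escaping.

```python
import html

# Shortened stand-ins for user_template and bot_template.
USER_TEMPLATE = '<div class="chat-message user"><div class="message">{{MSG}}</div></div>'
BOT_TEMPLATE = '<div class="chat-message bot"><div class="message">{{MSG}}</div></div>'


def render_history(messages):
    """Return one HTML snippet per message, alternating user and bot templates."""
    rendered = []
    for index, text in enumerate(messages):
        # Even positions are user messages, odd positions are bot messages.
        template = USER_TEMPLATE if index % 2 == 0 else BOT_TEMPLATE
        # Escape the text so characters like < and > are shown rather than parsed as HTML.
        rendered.append(template.replace("{{MSG}}", html.escape(text)))
    return rendered


snippets = render_history(["Is 2 < 3?", "Yes, 2 < 3."])
```

In the app, the equivalent change would be to wrap message.content in html.escape before calling replace.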
Main function
The main function serves as the backbone of the Streamlit-based chatbot application. It sets up the user interface, initializes session variables, handles user input, and manages the chatbot's backend processes. Users can type their questions, start the chatbot's processing, and receive responses in an interactive environment.
- Function Definition: def main():
  - Defines the main function. It serves as the primary execution point for the Streamlit application, setting up the user interface and handling interactions.
- Page Configuration: st.set_page_config(...)
  - Sets the configuration for the Streamlit page.
    - page_title="kgiamalis.co chatbot - press start button to initiate": Sets the title of the web page.
    - page_icon=":chatbot:": Sets the favicon for the web page using an emoji.
- Styling the Application: st.write(css, unsafe_allow_html=True)
  - Injects CSS to style the application. The unsafe_allow_html=True argument allows the inclusion of raw HTML (in this case, CSS).
- Page Header: st.header("kgiamalis.co chatbot 💬")
  - Sets a header for the application with a chatbot emoji.
- Initializing Session State:
  - The next two conditionals check whether the conversation and chat_history keys exist in the session state. If not, they are initialized to None.
  - This ensures that the app can keep track of the ongoing conversation and its history.
- User Input Handling:
  - user_question=st.text_input("Ask your question"): A text input box is displayed where users can type in their questions.
  - if user_question: handle_user_input(user_question): If the user provides input, the handle_user_input function is called to process the user's question and manage the chatbot's response.
- Sidebar Configuration:
  - The with st.sidebar: context creates a sidebar for the application. Inside this sidebar:
    - A title and a brief description of the chatbot are displayed.
    - Sample questions are provided as guidance for users.
    - A "Start" button is available to initiate the chatbot's processes.
- Starting the Chatbot: if st.button("Start"):
  - Checks if the "Start" button has been pressed. Inside this conditional, several processes are initiated:
    - Data is loaded.
    - Text data is split into chunks using the get_text_chunks function.
    - A vector store is created using the get_vector_store function to hold the embeddings.
    - A conversation chain is established using the get_conversation_chain function, allowing the chatbot to retrieve relevant information and generate conversational responses.
    - Finally, a success message is displayed, signaling that the chatbot is ready.
- Main Execution Point: if __name__ == '__main__': main()
  - This is a common Python construct. If this script is run as the main program (and not imported as a module elsewhere), the main function will be executed, setting up the Streamlit application.
#Main Function
def main():
    st.set_page_config(page_title="kgiamalis.co chatbot - press start button to initiate", page_icon=":chatbot:")
    st.write(css, unsafe_allow_html=True)
    st.header("kgiamalis.co chatbot 💬")
    if "conversation" not in st.session_state:
        st.session_state.conversation=None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history=None
    user_question=st.text_input("Ask your question")
    if user_question:
        handle_user_input(user_question)
    with st.sidebar:
        st.title("LLM Chatapp using LangChain - Press Start to begin.")
        st.markdown('''
This app is an LLM powered Chatbot that answers questions based on kgiamalis.co
Here are some questions that you can ask:
- What is curse of knowledge?
- What is a good user acquisition KPI?
''')
        if st.button("Start"):
            with st.spinner("Processing"):
                # Load the Data
                data=docs
                #Split the Text into Chunks
                text_chunks = get_text_chunks(docs)
                print(len(text_chunks))
                #Create a Vector Store
                vectorstore=get_vector_store(text_chunks)
                #Create a Conversation Chain
                st.session_state.conversation=get_conversation_chain(vectorstore)
            st.success("Completed")
if __name__ == '__main__':
    main()
Create the HTML template
Use this code to create your HTML template. Save it in the same folder as your Streamlit file, in a separate file named HTMLTemplate.py:
- CSS: This part styles the chat messages. User and bot messages get different background colors, which makes the two sides of the conversation visually distinct.
- Bot Template: This HTML structure is for the bot's messages. It includes an avatar image and a message section.
- User Template: Similar to the bot template, this is for the user's messages. It also includes an avatar image and a message section.
css = '''
<style>
.chat-message {
padding: 1.5rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex
}
.chat-message.user {
background-color: #2f2b3e
}
.chat-message.bot {
background-color: #6d7b99
}
.chat-message .avatar {
width: 20%;
}
.chat-message .avatar img {
max-width: 78px;
max-height: 78px;
border-radius: 50%;
object-fit: cover;
}
.chat-message .message {
width: 80%;
padding: 0 1.5rem;
color: #fff;
}
</style>
'''
bot_template = '''
<div class="chat-message bot">
<div class="avatar">
<img src="https://i.ibb.co/jMf7sB0/idea.png">
</div>
<div class="message">{{MSG}}</div>
</div>
'''
user_template = '''
<div class="chat-message user">
<div class="avatar">
<img src="https://i.ibb.co/TcgRhzg/question-mark.png">
</div>
<div class="message">{{MSG}}</div>
</div>
'''
Run it locally
Before running the code locally, make sure you've set up your Python environment properly and installed all the required libraries listed in your requirements.txt file.
Here's how you can do that:
Install Required Packages:
Run the following commands in the command prompt (Streamlit itself is needed for a local run):
pip install streamlit
pip install langchain
pip install openai
pip install faiss-cpu
pip install tiktoken
pip install langchain-community
Set Up Environment Variables:
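You might have sensitive information like API keys, and it's good to keep them out of your code. Streamlit offers st.secrets to manage secrets, and the script above reads the key from st.secrets["openai"]["openai_api_key"]. For a local run, create a .streamlit/secrets.toml file in the project folder containing the same [openai] section used in the deployment step below. If you prefer your system's environment variables, set the key as follows and change the script to read it from os.environ. Open a new command prompt afterwards, because setx only affects new sessions.
setx OPENAI_API_KEY "your-openai-api-key-here"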
Run Streamlit App:
Navigate to the directory where your Streamlit script is located (streamlitapp.py in this guide), and run:
streamlit run streamlitapp.py
A new tab should automatically open in your web browser displaying the Streamlit app for you to interact with.
Upload it on Github
There is a way to do it via VS Code as well, but here is the simplest method:
- Navigate to the GitHub website
- Create an account if you don’t have one
- Click on "New Repository"
- Fill in the repository name and description
- Choose to make it public or private
- Click "Create Repository"
- Upload all the files to the repository
Push it on streamlit
Again, there is a way to do it via VS Code, but here is the simplest method:
- Go to https://streamlit.io/
- Create an account
- Click on “new app”
- Fill in the required information
- Add your secret keys by clicking “Manage app → Menu → Settings → Secrets” and entering:
[openai]
openai_api_key = "add-your-api-key"
- Let Streamlit run it.
- You’re live. You can check the logs from “Manage app” at the bottom right of your screen.
Here is my repository:
Here is my chatbot: