Part 3: Implementing a RAG chatbot with Vector Search, BGE, LangChain and Mixtral 8x7B on Databricks
Part 3: Deploying your chatbot with LangChain and Mixtral or Llama 2
In the previous parts, we covered how to ingest and prepare our knowledge database (Part 1) and how to create a Vector Search index on top of it (Part 2).
We’re now ready to deploy our chatbot model in this Part 3.
We’ll create a LangChain chatbot that takes the customer’s question, crafts a prompt extended with the chunks returned by our Vector Search index, and sends the query to a Foundation Model (e.g. Llama 2 or Mixtral). Once the model is built, we’ll register it to Unity Catalog and deploy it as a Model Serving Endpoint.
This model will perform the following steps: retrieve the chunks most similar to the question, craft a prompt with this context, and send it to the LLM to generate the answer.
Let’s get started!
1/ Create the Retriever to find similar content from our Index
First, we need to query our index in real time to find the chunks relevant to the customer’s question. LangChain uses retrievers to do that: a retriever automatically looks up similar chunks and injects them into the prompt. Databricks makes it easy with a built-in DatabricksVectorSearch retriever:
from databricks.vector_search.client import VectorSearchClient
from langchain.vectorstores import DatabricksVectorSearch

# Get the vector search index created in part 2
vsc = VectorSearchClient()
vs_index = vsc.get_index(endpoint_name="demo_endpoint", index_name="index_name")

# Create the retriever from the index; "content" is the text column of our chunks
retriever = DatabricksVectorSearch(vs_index, text_column="content").as_retriever()
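You can sanity-check the retriever before wiring it into the chain; a quick sketch (the question is just an example):
# Return the chunks most similar to a test question
docs = retriever.get_relevant_documents("How can I track billing usage on my workspaces?")
print(docs[0].page_content)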
2/ Define your LLM: Llama 2, Mixtral or others
This defines the LLM you’ll use to answer your prompt. Databricks provides multiple Foundation Models as a Service. Note that you can use 3 types of models to answer your prompt:
- Databricks Foundation Models, serverless (Llama 2, Mixtral 8x7B, MPT…)
- Your own LLM, fine-tuned and served as an accelerated model on GPU
- An external model such as Claude or Azure OpenAI
Here, we’ll use the Databricks Llama 2 70B Foundation Model. Databricks provides a built-in ChatDatabricks chat model that you can use as part of LangChain:
from langchain.chat_models import ChatDatabricks

# Use the Llama 2 70B chat Foundation Model endpoint, capping the answer length
chat_model = ChatDatabricks(endpoint="databricks-llama-2-70b-chat", max_tokens=200)
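With recent LangChain versions you can send a quick test prompt to make sure the endpoint answers (the question is just an example):
# Smoke test: call the foundation model endpoint directly
print(chat_model.invoke("What is Apache Spark?").content)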
3/ Build your prompt and assemble the chain
This is where the dark art of prompt engineering starts. You’ll have to define a prompt matching your requirements and make sure your model behaves as expected. If it’s your first prompt, you can use the CO-STAR framework (Context, Objective, Style, Tone, Audience, Response) to get started.
Here is a prompt example:
TEMPLATE = """You are an assistant for Databricks users. You are answering python, coding, SQL, data engineering, spark, data science, DW and platform, API or infrastructure administration question related to Databricks. If the question is not related to one of these topics, kindly decline to answer. If you don't know the answer, just say that you don't know.
Use the following pieces of context to answer the question at the end: {context}
Question: {question}
Answer:"""
Once your prompt is defined, you can assemble all the previous steps to create your LangChain chain:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt = PromptTemplate(template=TEMPLATE, input_variables=["context", "question"])

# "stuff" simply stuffs the retrieved chunks into the {context} placeholder
chain = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt})
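You can then try the full chain locally before deploying it (the question is a placeholder):
# The chain retrieves similar chunks, builds the prompt and calls the LLM
answer = chain.invoke({"query": "How can I start a Databricks cluster?"})
print(answer["result"])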
4/ Save your model to MLflow and Unity Catalog
Now that the chain is ready, you can save it to Unity Catalog using the UI or with a few lines of code:
import mlflow
mlflow.set_registry_uri("databricks-uc")  # save the model under Unity Catalog governance
mlflow.langchain.log_model(chain, "chain", registered_model_name="catalog.schema.dbdemos_chatbot_model")
5/ Deploy your model as a REST Endpoint
With the model saved and governed by Unity Catalog, you can now deploy it as a REST endpoint. Your endpoint input will be the customer question, and the output will be the answer. Under the hood, the deployed model will call the chain we created (retrieve the similar chunks, build the prompt, and send the prompt to Llama 2).
Deploying a model is a 2-click operation: select the model in the Unity Catalog Explorer and click on “Serve this model”.
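Once deployed, any application can query the endpoint over REST. Here is a minimal sketch using the requests library; the workspace URL, token and endpoint name are placeholders you’ll need to adapt:
import requests

# Placeholders: your workspace URL, a personal access token and your endpoint name
host = "https://<your-workspace-url>"
token = "<your-personal-access-token>"

response = requests.post(
    f"{host}/serving-endpoints/dbdemos_chatbot/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"query": "How can I start a Databricks cluster?"}]})
print(response.json())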
6/ Deploy your front-end chatbot with Databricks Apps
The last bit is to deploy the chatbot front-end. Depending on your requirements, this can be a front-end application that you host as part of your web/mobile app, or one served as a Databricks Lakehouse App (more details on that soon!).
Want to try this yourself end-to-end, with code ready to run?
Extra: comparing Llama 2 and Mixtral 8x7B with the Databricks AI Playground
Databricks provides state-of-the-art LLMs, letting you experiment with different capabilities. Mixtral 8x7B was made available as a Databricks Foundation Model just a few weeks after its release. Leveraging a Mixture of Experts (MoE) architecture, Mixtral provides excellent results at a lower inference cost.
If you don’t know where to start, you can easily leverage the Databricks AI Playground to try multiple models in parallel and pick the one you think works best.
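You can also compare the models from a notebook by pointing ChatDatabricks at each Foundation Model endpoint; a minimal sketch (the question is just an example):
from langchain.chat_models import ChatDatabricks

question = "How can I optimize a Delta table?"
# Compare the answers from the Llama 2 and Mixtral foundation model endpoints
for endpoint in ["databricks-llama-2-70b-chat", "databricks-mixtral-8x7b-instruct"]:
    model = ChatDatabricks(endpoint=endpoint, max_tokens=200)
    print(f"=== {endpoint} ===\n{model.invoke(question).content}\n")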
That’s it, you’re now ready to build your GenAI application on the lakehouse, end to end.
You now have the building blocks to deploy your own GenAI applications:
- Ingest and prepare your knowledge database
- Create your Vector Search index for similarity search
- Deploy your LangChain bot crafting a complete prompt using RAG
Don’t miss out on our next newsletter - we’ll take a deep dive into the latest Databricks updates!