Problem: Make a chatbot which answers complex questions based on a terabyte of mixed media documents.

Prototype: here it is! It reads some text files, processes (embeds?) them into vectors (caching to disk) and then passes the prompt and these vectors to OpenAI which answers the question.

Fail: Can’t pass shitloads (terabytes) of pass for each query to OpenAI

Possible solution: RAG (see everywhere for a description).

Alternatives

So, can we start with a ton of PDFs, XLSs and TXT files, and end up with a domain-specific chat bot? Let’s check some options:

Ready-made e.g. Co-pilot from Microsoft. You can upload documents. Probably not a terabyte. This vaguely worked and was quick, integrating into Teams for colleagues. Used Co-pilot studio. The problem is feeding it terabytes of data.
Some assembly required: various commercial no-code offerings, maybe AWS too. Hard to pick a winner or know if it’s worth committing. Unsure about these. Maybe
Assembly required: e.g. LlamaIndex or Lang-chain. Use their github libraries and docs, and optionally use enterprise services for hosting, help, etc.
DIY: write the whole thing yourself. In theory use local GPUs to process. Super slow and hard probably.

Winner: #3 Assembly Required

Current understanding is that these are libraries and they abstract the some of the implementation-specific details (e.g. format for vectors) and common tasks (e.g. “save to index disk”).

Langchain is a toolbox for AI information ‘chains’. It is apparently ‘multi purpose’.

LlamaIndex is more targeted at search, index, extracting. It has nice data readers (e.g. XLS, PDF)

They are be used together it seems. See the Langchain docs and third-party intro using Langchain.

LlamaIndex

I’ll start with LlamaIndex as it seems better suited to my problem. It’s a company with an open-source piece.

Choices choice choices

Database (review of options and nice diagram): heaps! Native vector dbases leaders include Pinecone (paid), Chroma (open), Milveus (open), Faiss (open, FacePalm) or use Redis/PostgresSQL adaptors.

Jump In

I’m a fan of trying it out. So I’m working through:

Easy intro with runthrough
Similiar vibe as above
Official Llama Docs – big picture options for big RAG
Tutorial about assessing performance

<in progress>

Questions

Embedding with OpenAI: don’t we do this locally. Isn’t it reading files and making n-dim vectors? Do we need to upload all our corpus files to OpenAI?
HuggingFace?