How do you use your GitHub stars?
I’d guess that if you’ve been programming for a few years you’ve probably hit the star button at the top of a few of your favourite repos. I know some people I follow have done it thousands of times. Do you go back to them though? Do you review them for inspiration for your next project, or go to them when you’re stuck on a particular problem?
I’ve always assumed I would use them but I never have. I found myself doing some research recently into how to build software that uses LLMs, with the deliberate goal of building an as yet undefined side-project. I wanted to build something I hadn’t built before, something that was hopefully a little original, and maybe even useful! So yet again I was starring repos like LangChain and Chroma, swearing this time would be different.
As I was running through blog posts and diligently smashing the star buttons I realised that I had just hit on exactly what I wanted to try. I wanted to bring my GitHub stars right into my editor. I wanted to be able to have them next to me as I was working and get a sensible set of suggestions on what might be useful for my needs at that moment, and I had just been starring the exact repos that could make this happen!
Use a dataset of your personal stars to inform retrieval augmented generation for a question and answer large language model deployed in a command line interface.
I thought this would be useful for a few reasons:
A keyword search looks for the exact letters in a string, or potentially a partial match. As an example, the query "Data Science" would find things that exactly matched the characters in the string "Data Science" and maybe also ["Data", "Science", "DS"].
A semantic search looks for the conceptual similarity between things, so in this context "Data Science" would find things that matched the vector embedding of "Data Science" and perhaps also the vector embeddings associated with ["Machine Learning", "Artificial Intelligence"].
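To make the difference concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented by hand purely for illustration (a real embedding model would produce vectors with hundreds of dimensions), and the function names are hypothetical rather than taken from any library.

```python
import math

# Hand-made toy "embeddings", invented for illustration only.
TOY_EMBEDDINGS = {
    "Data Science": [0.9, 0.8, 0.1],
    "Machine Learning": [0.8, 0.9, 0.2],
    "Artificial Intelligence": [0.7, 0.8, 0.3],
    "Web Framework": [0.1, 0.2, 0.9],
}


def keyword_search(query, documents):
    """Return documents containing the query's exact characters (case-insensitive)."""
    return [doc for doc in documents if query.lower() in doc.lower()]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query, embeddings, top_k=3):
    """Rank every document by the similarity of its vector to the query's vector."""
    query_vector = embeddings[query]
    ranked = sorted(
        embeddings,
        key=lambda doc: cosine_similarity(query_vector, embeddings[doc]),
        reverse=True,
    )
    return ranked[:top_k]


documents = list(TOY_EMBEDDINGS)
keyword_hits = keyword_search("Data Science", documents)
semantic_hits = semantic_search("Data Science", TOY_EMBEDDINGS)
print(keyword_hits)
print(semantic_hits)
```

With these toy vectors the keyword search only finds the literal match, while the semantic search also ranks "Machine Learning" and "Artificial Intelligence" ahead of "Web Framework".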
And so I ran poetry new starpilot.
Retrieval Augmented Generation (RAG) is a technique used with large language models to cope with some of the limitations inherent in what are also sometimes referred to as ‘Foundational’ models.
When a model like GPT-3 is trained, it is fed large amounts of textual data written by humans. These get translated into ‘weights’ in a neural net. To oversimplify, these weights tell the model what the next most likely text is that follows the text it has already been shown.
However, these models don’t know much about what has happened recently, which programming resources really exist rather than just sounding like they should exist, or exactly where to get a specific repo or webpage.
Retrieval augmented generation solves this by allowing you to feed the large language model information that is known to be real, up to date and relevant.
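As a rough sketch of that flow: retrieve a few relevant documents first, then put them in the prompt alongside the question. Everything here is an assumption for illustration. The knowledge base entries are made-up one-line descriptions, and the retriever is a crude word-overlap stand-in for a real vector similarity search.

```python
# Made-up one-line descriptions standing in for real, trusted documents.
KNOWLEDGE_BASE = [
    "langchain: framework for building applications with large language models",
    "chroma: embedding database for vector search",
    "typer: library for building command line tools",
]


def retrieve(question, knowledge_base, top_k=2):
    """Stand-in retriever: rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(doc):
        return len(question_words & set(doc.lower().split()))

    return sorted(knowledge_base, key=overlap, reverse=True)[:top_k]


def build_prompt(question, context_documents):
    """Combine the retrieved documents and the question into a single prompt string."""
    context = "\n".join(f"- {doc}" for doc in context_documents)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


question = "which library helps with building command line tools"
retrieved = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, retrieved)
print(prompt)  # this string is what would be sent to the language model
```

The model never has to "remember" the repos: whatever the retriever returns is placed directly in front of it at question time.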
A type of database called a vectorstore is commonly used for this because it is deliberately optimised for the similarity search use case. Vectorstores achieve this in a few ways:
With this set of goals and new knowledge I set about working out which puzzle pieces I needed and how to fit them together. This time I did go through my stars (and a few other things), though maybe this is for the last time!
I figured I could get started using 4 main open source repos. My first commit to my pyproject.toml used these projects:
typer is a pretty trendy framework for building CLI tools in Python right now. It embraces typing, uses function decorators to magically turn your functions into CLI commands, and has relatively clear documentation.
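A minimal sketch of that decorator style, assuming typer is installed. The command names borrow from the tool's own commands, but the bodies are placeholders and this is not starpilot's actual source.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Illustrative skeleton only.")


@app.command()
def read(user: str):
    """Placeholder: would load the starred repos of USER."""
    typer.echo(f"Would read the stars of {user}")


@app.command()
def shoot(query: str):
    """Placeholder: would search the stored stars for QUERY."""
    typer.echo(f"Would search for: {query}")


# CliRunner invokes the app in-process, which is handy for trying commands
# out (and for testing them) without going through a shell.
runner = CliRunner()
result = runner.invoke(app, ["shoot", "data science"])
print(result.output)
```

The type hints do the work here: each `str` parameter becomes a required positional argument, and the docstrings become the help text. To expose this as a script you would call `app()` under an `if __name__ == "__main__":` guard.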
typer specifically because:
langchain is the most mature and well embraced large language model orchestration framework. Langchain itself doesn’t supply you with any specific LLM, vector store or embedding approach. Instead it is deliberately ‘vendor agnostic’. It provides a common set of APIs and abstractions across a staggering number of vector databases, large language models and embedding engines.
chroma is a vectorstore that has great support from Langchain. There are many others as well but Chroma won out at this stage because:
chroma works as an ‘embedded’ data store, i.e. it runs locally on the user’s machine
chroma was the most often used vectorstore in the Langchain docs for RAG tasks
gpt4all provides a set of LLM models and embedding engines that are also well supported by Langchain.
gpt4all was appealing because:
Soon after this I realised that pygithub would be an easy way to go to GitHub to get the information I needed and bring it back into starpilot to load into the vectorstore. I had initially thought I might be able to use the GitHub Document Loader built into langchain, though once I sat down to really work it out I realised that this doesn’t give access to a user’s stars, so I needed an alternative.
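A rough sketch of what that might look like with pygithub. The helper names and the shape of the resulting document are my own assumptions for illustration, not starpilot's actual code, and the fake repo at the bottom (including its placeholder URL) only exists so the conversion step can be tried without network access.

```python
from types import SimpleNamespace


def fetch_starred_repos(username):
    """Fetch a user's starred repositories via PyGithub (needs network access)."""
    from github import Github  # imported lazily so the rest runs without PyGithub

    client = Github()  # unauthenticated, so subject to GitHub's lower rate limits
    return client.get_user(username).get_starred()


def repo_to_document(repo):
    """Flatten a repository into text plus metadata, ready for a vectorstore."""
    topics = ", ".join(repo.get_topics())
    description = repo.description or "No description"
    return {
        "text": f"{repo.full_name}: {description}. Topics: {topics}",
        "metadata": {"url": repo.html_url},
    }


# A stand-in object exposing the same attributes the sketch reads from a real repo.
fake_repo = SimpleNamespace(
    full_name="someone/example",
    description="An example repository",
    html_url="https://example.com/someone/example",
    get_topics=lambda: ["cli", "python"],
)
document = repo_to_document(fake_repo)
print(document["text"])
```

With network access, `[repo_to_document(r) for r in fetch_starred_repos("MyCoolUserName")]` would build one document per starred repo.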
There were alternatives in all these choices. I think these are all totally viable parts to build effectively the same system:
I actually am using click, sort of. typer is built on top of click, but to be honest I didn’t really know that before I’d mostly decided. click looks like a really great project, but it wasn’t as clear how to get started.
llama_index is probably a great project, but I only found it late in my thinking on this project. If I start another project that it suits any time soon, I’m definitely going to try it out as a comparison.
I’ve used faiss in a tutorial on vectorstores before. It didn’t strike me as hugely intuitive to use or as simple to set up (its recommended installation path is via conda). I also don’t particularly like Facebook so I’m happy to use an alternative.
I’ve used openai for a handful of tutorials and notebook experiments already and been very happy with it. However, for a project like this I wasn’t really sure what the operational costs would be, or whether they would be worth it for the benefit the tool provides. That, combined with the requirement to have network connectivity while using the tool, pushed me towards experimenting with alternatives. Luckily with langchain I should be able to provide it as an optional backend in the future?
“actively developed”, “v0.1.0”, “untested” and “it runs on my machine” are good descriptions of the project right now.
I’ve spent a few evenings this month on it, and see myself spending at least a few more on it next month. The API is getting breaking changes almost every time I open the project. It’s got 0 real tests, though it should get some soon. It requires a few manual installation steps that are documented in README.md but haven’t yet even been attempted on any machine other than the one I’m on right now.
It also doesn’t yet achieve exactly what I want it to, but I see no reason yet that it can’t with some more development time.
starpilot read MyCoolUserName
This will connect to GitHub and read the starred repos of the user MyCoolUserName. Then it will go to each of those repos and get the topics and descriptions (and optionally the readmes) and load these into chroma, which is persisted on the local hard drive.
starpilot shoot "insert topic here"
This will spin up the chroma database and perform a semantic similarity search on the string given in the command, then return the documents that seem to be the most relevant.
starpilot fortuneteller "Insert a question here"
This will perform the exact same search as the shoot command, but then spin up a large language model and pass the results into it for processing. It then returns the documents it found as well as the response from the LLM.
That’s where this project is at. I’ve learnt a tonne about the available tools and relevant techniques in this space already, which was really the main goal of starting to begin with!
That said, the progress I’ve made so far only makes me more curious about what else can be done with this and what else can be solved towards the vision of “Making your GitHub stars more valuable in your daily coding”. Here are some ideas that I’ve found exciting while getting my hands dirty that might show up in the future. These are along with the obvious things like any testing at all, a simpler way to set up the project on your machine, better error handling, a more sensible way to update the vectorstore than dropping everything and rebuilding each time, etc.
Does this sound like something interesting to you, maybe even something useful? Did this just spark inspiration in you for a new project? Does this actually already exist somewhere and I’m just being an idiot? Let me know :)