Siddhant Chaudhary

Originally published at codetalker7.github.io

ColBERT meets Julia: Fast, efficient information retrieval using late-interaction!

TL;DR

With my GSoC 2024 project coming to an end, I still feel excited about all the future directions this work could go in. For those who don't want to read everything: I implemented the ColBERT [1, 2, 3] information retrieval system in pure Julia, with no dependencies other than the julia binary and possibly a GPU backend. While there's still a long way to go, the first version of the package has been released, and we plan to optimize it further and possibly make it faster than the Python implementation! The eventual goal is to make this insanely fast and accurate retrieval method available to the Julia community as an easy-to-use and hackable package. For reference, here's a tweet by one of the authors of ColBERT highlighting some key optimizations that the architecture uses (it explains things really simply!):

A 🧵 on beating the hardware lottery for retrieval: the internals of the late interaction stack.

ColBERT introduced a quirky multi-vector retrieval architecture. It does wonders for quality.

But how can it search 100M docs in 0.1 sec on CPU? Or store 1 billion embeddings in 20GB? pic.twitter.com/Nc3MDFxrj6

– Omar Khattab (@lateinteraction) December 20, 2023

Why do we need this?

ColBERT has gained immense popularity in the neural IR/GenAI community. What does it do differently from conventional neural IR algorithms?

Suppose you have a large corpus of N documents (in plain text; throughout this blog post, we'll only work with textual data, though architectures like ColPali [4] exist which implement the ColBERT retrieval logic for vision data). We assume that N is really large (say, N > 10M). We want to perform a top-k retrieval search on this corpus, i.e., given a query Q, find the top k documents closest to the query. There can be several meanings of closest, and coming up with a good one is a problem of its own. For our discussion, we'll score vectors against each other via their dot product. For this to work, a nice assumption to have is that all our vectors are normalized (w.r.t. the L2 norm). Now, conventional document retrievers do one of the following:

  1. Implement a cross-encoder. In simple words, take an LM whose input is a query-doc pair, and run the LM N times to get scores for all the N documents. For a practical use case, this is way too inefficient, especially if scoring a query-doc pair requires computing scores for all pairs of tokens in the query and document texts. Really inefficient!

  2. To alleviate the quadratic blowup of considering all possible query/doc pairs, some retrievers represent each query/document as a single vector. Now, computing scores for all documents becomes much easier, since we need to compute only one dot product per document (see the toy sketch after this list)! However, a huge limitation of this method is expressivity: how flexible/accurate will such a method be if we're asking complicated queries about the documents? Will such a compressed representation encode each document well?
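To make point 2 concrete, here's a toy sketch of single-vector retrieval in Julia (the names and shapes are mine, purely for illustration; this isn't ColBERT.jl code). With one normalized embedding per document, top-k search is a single matrix-vector product followed by a partial sort:

```julia
using LinearAlgebra

# Toy single-vector retrieval: E is a d × N matrix whose columns are
# L2-normalized document embeddings, q is a normalized query embedding.
function topk_single_vector(q::AbstractVector, E::AbstractMatrix, k::Int)
    scores = E' * q                               # one dot product per document
    idx = partialsortperm(scores, 1:k; rev=true)  # indices of the k best scores
    idx, scores[idx]
end
```

This is fast, but the whole document is crammed into one vector, which is exactly the expressivity limitation above.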

ColBERT kind of solves both of these problems. Broadly, its late-interaction design works based on the following:

  1. It works on a token-level. So, no need to cram/encode documents/queries to single vectors!

  2. It encodes all documents before performing search for a query (this is called indexing a collection of documents); i.e., it separates the encoding of documents from the interaction between queries and documents. This allows for offline indexing of the documents too!

  3. ColBERT uses a search algorithm that is sublinear in the number of documents it needs to process to get the top-k (a sketch of the idea follows this list). This makes search really efficient, since we don't have to compute scores for unnecessary documents.
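To give a flavor of how that sublinearity can be achieved, here's a hedged sketch of inverted-file candidate generation, the idea underlying ColBERTv2/PLAID (all names and signatures here are illustrative, not the package's actual API): at indexing time, every document token embedding is assigned to its nearest centroid; at query time, we only consider documents whose tokens share a centroid with some query token.

```julia
# C is a d × K matrix of L2-normalized centroids, Q a d × |q| matrix of
# normalized query token embeddings, and ivf an inverted file mapping each
# centroid id to the ids of documents having a token in that cluster.
function candidate_pids(Q::AbstractMatrix, C::AbstractMatrix,
                        ivf::Dict{Int,Vector{Int}}; nprobe::Int=4)
    sims = C' * Q                   # K × |q| centroid/query-token similarities
    pids = Int[]
    for i in axes(Q, 2)             # for each query token...
        best = partialsortperm(view(sims, :, i), 1:nprobe; rev=true)
        for c in best               # ...probe its nprobe closest centroids
            append!(pids, get(ivf, c, Int[]))
        end
    end
    unique!(pids)                   # candidate set, typically much smaller than N
end
```

Only the candidates returned here get scored fully, so the expensive scoring never touches most of the corpus.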

The main scoring function used by ColBERT is the so-called maximum similarity (aka "MaxSim") operator:

$$S_{q,d} := \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^T$$

where $E_q$ and $E_d$ are the matrices of query and document token embeddings respectively.

In simple words, for each query token, we find the closest document token, and we sum the results across all query tokens. This operator is easy to implement, and parts of the ColBERT candidate generation method don't even need to compute the full "MaxSim" score between a query and a doc; they can also do well by computing only lower bounds to this score! [2, 3]
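Since the operator is just a max over token-pair similarities followed by a sum over query tokens, it fits in a few lines of Julia. A minimal sketch (my own illustrative code, assuming token embeddings are stored as normalized matrix columns):

```julia
# Q is a d × |q| matrix of query token embeddings and D a d × |d| matrix of
# document token embeddings, with all columns L2-normalized.
function maxsim(Q::AbstractMatrix, D::AbstractMatrix)
    S = D' * Q                 # S[j, i] = dot(d_j, q_i) for every token pair
    sum(maximum(S; dims=1))    # best document token per query token, summed
end
```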

At this point, I'd suggest that the reader just read the papers [1, 2, 3] to dive into more details! In the rest of the blog I'll talk about the design of ColBERT.jl, the philosophy behind the package, and its future directions.

ColBERT.jl

As of v0.1.0, ColBERT.jl implements the ColBERTv2.0 [2] indexer and searcher. The package design is based on the following philosophy:

  1. Don't introduce types unless absolutely necessary.

  2. Let functions handle the core operations, keeping each function small and focused on one important operation.

The only three types in the package are the ColBERTConfig, the Indexer and the Searcher. The names are self-explanatory: a ColBERTConfig is an object which you'll define only once, to specify all the configuration you want to build the index/run search with. The Indexer is a high-level wrapper which you'll want to use to build an index (with just a single call to index), and the Searcher is another high-level wrapper for any search operations you want to perform. Note that these are just convenience wrappers, and you can perform all the actual operations by individually calling each component! The ColBERTConfig has the right defaults for most use cases, enabling it to be used off-the-shelf. For instance, for most of your use cases, you can just define a ColBERTConfig like this:



```julia
using ColBERT

config = ColBERTConfig(
    use_gpu = true,                                             # if you're using a GPU
    checkpoint = "/home/codetalker7/models/colbertv2.0",        # I usually store model weights in $HOME/models/
    collection = "some/path/to/file/containing/documents",
    index_path = "path/where/you/want/the/index"
)
```

To get a complete understanding of the basic interface of the package, I recommend going through the package README. It's a really small example and doesn't require any special hardware!

The package structure

The package structure is also straightforward to understand:

  1. src/indexing.jl contains the high-level Indexer type, its constructor, and the index function.
  2. src/indexing contains all indexing operations we'll need to perform, plus functions to work with binary data (needed to work with the codec, aka the functions used to compress/decompress matrices).
  3. src/searching.jl contains the high-level Searcher type, its constructor and the search function.
  4. src/searching contains all the search operations we'll need.
  5. src/modelling contains all the Transformers.jl-related code (used to call the underlying transformer, tokenization etc.).
  6. src/infra contains the config specification. Consult this directory if you want to understand the configuration parameters in more detail.
  7. src/loaders.jl and src/savers.jl implement some simple logic to load/save some stuff in a pre-specified format. This is nothing serious, just normal convenience functions.
  8. src/local_loading.jl contains logic to load ColBERT models from HuggingFace locally; currently, Transformers.jl doesn't support local loading of these models.
  9. Finally, src/utils.jl contains functions which I didn't want to put in any other file. These are some really helpful and generic functions which anyone might need to use. One such function is kmeans_gpu_onehot!, our implementation of the $k$-means clustering algorithm using GPU acceleration (I couldn't find any nice Julia packages with GPU support for clustering); a CPU sketch of the underlying one-hot trick follows this list.
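To illustrate the idea behind kmeans_gpu_onehot!, here's a CPU sketch of the one-hot formulation (my own illustrative code, not the package's actual implementation): encoding cluster assignments as a one-hot matrix turns the centroid update into a single matrix multiplication, which is exactly the kind of operation a GPU is good at.

```julia
using LinearAlgebra, Random

# X is a d × n matrix of L2-normalized embedding columns. Returns the k
# centroids and the one-hot assignment matrix A (A[c, j] = 1 iff point j
# belongs to cluster c).
function kmeans_onehot(X::AbstractMatrix{T}, k::Integer; iters::Integer=10) where {T<:Real}
    n = size(X, 2)
    C = X[:, randperm(n)[1:k]]                    # random initial centroids
    A = zeros(T, k, n)
    for _ in 1:iters
        sims = C' * X                             # cosine similarities (all normalized)
        fill!(A, zero(T))
        for j in 1:n
            A[argmax(view(sims, :, j)), j] = one(T)   # one-hot: nearest centroid
        end
        counts = max.(sum(A; dims=2)', one(T))    # points per cluster (1 × k)
        C = (X * A') ./ counts                    # centroid update is one matmul
        C = C ./ max.(sqrt.(sum(abs2, C; dims=1)), eps(T))  # renormalize columns
    end
    C, A
end
```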

Core operations are designed to be test-friendly and easily hackable. Writing test-friendly code is essential for a package like this, and it inevitably requires writing small functions, each of which implements one core operation.

Indexing and search

The major components of the package are the Indexer and Searcher. Using both of these components is really simple, thanks to the ColBERTConfig which specifies everything we'll need; there are only two functions to focus on: index and search.

Once you've built a ColBERTConfig (again, I recommend going through the README and trying the small example), building the index only requires two lines:



```julia
indexer = Indexer(config)
@time index(indexer)
```

Both lines are easy to understand: the first just builds an Indexer using the config we specified. It is during this step that we load the ColBERT model weights into the Indexer, tweak the tokenizer to our needs, and set up the other objects which are needed during indexing. The second line then actually builds the index (timed with @time).

Search is even simpler. Once you've built the index, just create a Searcher, and call the search method:



```julia
searcher = Searcher("/path/where/you/built/the/index");
query = "what was Cesar Milan's trick?"
@time pids, scores = search(searcher, query, 2)
```

The third line searches the documents (the index we just built) for the given query. Here, we're searching for the top-$2$ documents relevant to our query, and we get two things back: pids, a list of indices of the two most relevant documents (in order), and scores, their "MaxSim" scores against the query.
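As a hypothetical follow-up (my own snippet, not part of the package API): assuming your collection is a plain-text file with one passage per line, mapping the returned pids back to text is straightforward.

```julia
# Map the returned pids back to passage text; assumes one passage per line,
# with pids being 1-based line numbers into the collection file.
passages = readlines("some/path/to/file/containing/documents")
for (pid, score) in zip(pids, scores)
    println("[score = $score] ", passages[pid])
end
```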

Going forward and other learnings

This was the first time I ported a library from one technology to another (in this case, a Python codebase to a Julia implementation). It isn't easy, and I learned some really valuable lessons from the experience. Among them, the most important ones were:

  1. Don't blindly replicate the design you're trying to port. This is crucial. In my case, the way Julia works is really different from how a typical Python project is structured. There's really no need to introduce any types to the package besides the ones I mentioned above; everything else should be delegated to simple functions, which also makes the code easy to test and verify.

  2. Understand why you'd want to port a library from one language to another. In our case, we believe that, if optimized to its full capabilities, a pure Julia implementation of ColBERT would not only be faster than the Python implementation, but also easier to hack, test and build on.

I thoroughly enjoyed learning about ColBERT and building this package simultaneously, but this is just the first step. There's a tremendous amount of work still left to be done for this package, including (but not limited to) the following:

  1. More optimizations to the index/search operations (things like strided tensor views, vector pooling etc.).
  2. Implementation of PLAID [3] to make the search even more efficient.
  3. Indexing/searching with multiple GPUs. Many functions can also be parallelized.

I'd love for the Julia community and other open source enthusiasts to check out the package, and potentially to build on/optimize it further! If you liked this post, or like the package or its potential, it'll be great if you can give us a star!

Acknowledgements

I really want to thank my mentor for the project, Jan Siml. Jan is an incredibly helpful mentor, and he made my life really easy (to the reader: if you're planning to do a future GSoC project in GenAI and have a chance to work with Jan, please don't miss it!). I'd also like to thank the JuliaGPU org for giving me access to their GPUs, and julialang.org for giving me the opportunity to work on this amazing project! Finally, I'd like to thank the incredible community of Julia users, particularly Peter Cheng for patiently answering my never-ending questions about Transformers.jl, and all the other users who helped me out with other queries!

References


  1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR '20).

  2. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL '22).

  3. PLAID: An Efficient Engine for Late Interaction Retrieval (CIKM '22).

  4. ColPali: Efficient Document Retrieval with Vision Language Models.
