About DataCat

DataCat, started in 2017, is an embeddings-focused project that showcases a scalable tech stack for custom AI solutions, specializing in vector retrieval for both classification and knowledge bases.

History

DataCat, initiated as a modest venture in 2017, was primarily designed to augment the indexing and classification capabilities of the Dark Sentinel crawler. The endeavor predates the advent of Large Language Models (LLMs), BERT, and even GPT-1; its first iteration was developed using Tensor2Tensor. The core concept behind DataCat was grounded in transfer learning: leveraging one of the terminal activation layers as a feature vector, then training smaller, specialized models on that vector. This approach is now widely recognized and employed in the form of embeddings.
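For the curious, here is a minimal sketch of that idea, not the original Tensor2Tensor code: a frozen, pretrained encoder yields one fixed-size feature vector per input, and a small specialized model is trained on top. The `pretrained_encode` function is a hypothetical stand-in for any such encoder, and scikit-learn stands in for the original stack.

```python
# Minimal sketch of the transfer-learning idea described above:
# use a frozen pretrained encoder's terminal activations as feature
# vectors ("embeddings") and train a small specialized model on top.
# `pretrained_encode` is a hypothetical stand-in for a real encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pretrained_encode(texts: list[str]) -> np.ndarray:
    """Hypothetical: return the encoder's penultimate-layer
    activations, one fixed-size vector per input text."""
    rng = np.random.default_rng(0)          # placeholder activations
    return rng.normal(size=(len(texts), 512))

texts = ["spam offer, click now", "meeting moved to 3pm"]
labels = [1, 0]                              # 1 = spam, 0 = not spam

features = pretrained_encode(texts)          # frozen encoder -> vectors
classifier = LogisticRegression().fit(features, labels)  # small head

print(classifier.predict(pretrained_encode(["free prize inside"])))
```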

Over the years, DataCat has evolved from developing its own embedding models to providing comprehensive support for a range of models, including GPT-1, GPT-2, BERT, the Universal Sentence Encoder, and more recently, GPT-3.5, GPT-4, and Gemini. We are very open about our tech stack, since this platform primarily exists as a demonstration of its capabilities to clients.

Tech

Our architecture is built upon Node.js for frontend and performance nodes, and Python for worker nodes, with a design that allows for full horizontal scalability, including the database layer.

When a file uploaded for model or knowledge base training exceeds 100MB (post vectorization), it is partitioned and processed by multiple worker nodes. These nodes can compute embeddings using several strategies, with ada-002 and the Universal Sentence Encoder being the most commonly employed.
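As an illustration of the partition-and-embed step (not our actual worker code), the sketch below splits an upload into chunks and embeds each chunk with ada-002 via the OpenAI Python client; the chunk size and helper names are made up for the example.

```python
# Illustrative sketch of the partition-and-embed step, not DataCat's
# actual worker code. Assumes the OpenAI Python client for ada-002;
# the chunk size and helper names are invented for the example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def partition(lines: list[str], chunk_size: int = 1000) -> list[list[str]]:
    """Split the uploaded file's lines into chunks, one per worker task."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def embed_chunk(chunk: list[str]) -> list[list[float]]:
    """What a single worker node would do: embed its chunk with ada-002."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunk,
    )
    return [item.embedding for item in response.data]

# In production the chunks would be fanned out to worker nodes;
# here they are processed sequentially for clarity.
lines = ["first document line", "second document line"]  # stand-in upload
vectors = [v for chunk in partition(lines) for v in embed_chunk(chunk)]
```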

The selection of a particular strategy to create an artifact depends on available resources and may include:

  1. Building a Hierarchical Navigable Small World (HNSW) index using C++ bindings (see the sketch after this list).
  2. Constructing a brute force index optimized for high-speed inference in C++.
  3. Creating a brute force index (essentially a buffer) for JavaScript inference for smaller models using our proprietary library. While slower, this approach is highly cost-effective for managing numerous small, infrequently accessed public models.
  4. Developing an HNSW index in the database, particularly useful for scenarios expecting frequent index updates.
  5. Generating a model trained on the vectors, which saves only a minimal model representation.
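Strategy 1 maps naturally onto libraries like hnswlib, a Python binding over a C++ HNSW implementation. A minimal sketch, with illustrative parameters rather than our production settings:

```python
# Minimal sketch of strategy 1 using hnswlib, a Python binding over a
# C++ HNSW implementation. Parameters are illustrative defaults, not
# DataCat's production settings.
import hnswlib
import numpy as np

dim, num_vectors = 1536, 10_000            # e.g. ada-002 vectors
vectors = np.float32(np.random.random((num_vectors, dim)))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_vectors))

index.set_ef(50)                           # recall/speed trade-off at query time
labels, distances = index.knn_query(vectors[:1], k=5)
index.save_index("artifact.hnsw")          # the persisted artifact
```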

For public usage, we limit file uploads to smaller sizes, so free requests are typically handled by our JavaScript matching library through a full brute-force K-Nearest Neighbors (KNN) approach. The native C++ library is 40x faster, and HNSW faster still, but those require us to reserve instances for you, since they involve native code.
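The matching library itself is proprietary, but brute-force KNN is simple at its core; below is a NumPy sketch of an exact scan of that kind (cosine similarity is assumed as the metric).

```python
# Sketch of a brute-force KNN like the one used for small public
# models. The production version is a proprietary JavaScript library;
# this NumPy version only illustrates the algorithm: an exact scan,
# here with cosine similarity assumed as the metric.
import numpy as np

def knn(index: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest vectors by cosine similarity."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm           # one dot product per row
    return np.argsort(-scores)[:k]             # exact, O(n * dim) per query

vectors = np.random.random((1_000, 1536)).astype(np.float32)
print(knn(vectors, vectors[42]))               # vector 42 is its own nearest
```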

As you may have guessed, most operations on this site are retrieval-based, employing semantic search. The type of inference varies depending on the artifact, ranging from KNN and Approximate Nearest Neighbors (ANN) to distinct Machine Learning models.

Additionally, to optimize resource utilization, models not actively in use are not retained in memory. Instead, during an inference request we fetch the embeddings and the model (typically within a hundred milliseconds), so the model is in RAM by the time the input embeddings have been computed. This decoupling of compute and storage also allows us to serve the same model from multiple nodes under heavy load.
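A sketch of that overlap, with `fetch_model` and `embed_query` as hypothetical stand-ins for our internals:

```python
# Sketch of the overlap described above: fetch the model artifact from
# storage while the input embeddings are computed, so the model is in
# RAM by the time the query vector is ready. `fetch_model` and
# `embed_query` are hypothetical stand-ins for DataCat's internals.
import asyncio

async def fetch_model(model_id: str):
    await asyncio.sleep(0.1)                 # ~100 ms storage fetch
    return f"model-{model_id} loaded"

async def embed_query(text: str):
    await asyncio.sleep(0.1)                 # embedding round-trip
    return [0.1, 0.2, 0.3]                   # placeholder vector

async def infer(model_id: str, text: str):
    # Run both in parallel; total latency ~= max of the two, not the sum.
    model, query_vec = await asyncio.gather(
        fetch_model(model_id), embed_query(text)
    )
    return model, query_vec

print(asyncio.run(infer("cat-42", "what is a knowledge base?")))
```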

For certain clients, we offer single-tenant nodes that constantly keep models in memory. This is particularly advantageous for large HNSW indexes and single lookup queries, where loading the index into memory can be significantly more time-consuming than the inference itself.

Interestingly, as a vector retrieval system, DataCat doesn't differentiate that much between ML Models and Knowledge Bases. Any model is instantly deployable as a knowledge base, and vice versa.

Our technology suite is robust enough to support custom Text Assistants and Retrieval-Augmented Generation (RAG). Classification endpoints facilitate intent detection, retrieval operations can source data from knowledge bases, and label endpoints (using GenAI) can seamlessly integrate these into a highly adaptable RAG system.
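A sketch of how those three endpoint types might compose into a RAG flow; `classify`, `retrieve`, and `generate_label` are hypothetical stand-ins, not our actual API surface.

```python
# Sketch of how the endpoint types compose into a RAG flow. The
# `classify`, `retrieve`, and `generate_label` functions are
# hypothetical stand-ins for the classification, retrieval, and GenAI
# label endpoints; the real API surface may differ.

def classify(question: str) -> str:
    """Intent detection via a classification endpoint (stubbed)."""
    return "billing" if "invoice" in question else "general"

def retrieve(question: str, kb: str, k: int = 3) -> list[str]:
    """Semantic search against a knowledge base (stubbed)."""
    return [f"[{kb} passage {i} relevant to: {question}]" for i in range(k)]

def generate_label(prompt: str) -> str:
    """GenAI label endpoint turning the prompt into an answer (stubbed)."""
    return f"answer generated from: {prompt[:60]}..."

def answer(question: str) -> str:
    intent = classify(question)                    # 1. intent detection
    passages = retrieve(question, kb=intent)       # 2. retrieval from the KB
    prompt = ("Answer using only these passages:\n"
              + "\n".join(passages) + f"\nQ: {question}")
    return generate_label(prompt)                  # 3. generation

print(answer("Where can I find my last invoice?"))
```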

For those who are still here

If this piques your interest, we invite you to contact us via email. Besides maintaining this service, we have extensive experience deploying large-scale AI projects, and our team can assist with everything from data infrastructure architecture to cross-domain projects involving other media.