🗣 Retrieval Vocabulary

  • Chunk: A piece of data or information, often a subset of a larger document, that will be handled as a single unit by a retrieval system
  • Chunking: The act of splitting your long body of text into smaller parts. Similar to text splitting.
  • Cosine Similarity: A metric used to measure the similarity between two vectors (often representing text embeddings) in a multi-dimensional space, by calculating the cosine of the angle between them
  • Dimension Size: The number of features or axes in the space, which corresponds to the size of the embeddings used for representing documents or chunks
  • Document Loader: A component of a retrieval system that is responsible for importing documents into the system and preparing them for indexing and retrieval
  • Document Store (DocStore): A specialized database for storing, managing, and retrieving documents within a retrieval system
  • Document: A unit of data or information that can be text, image, audio, or video, which the system can retrieve and present in response to a query
  • Embedding: A mathematical representation of a document or chunk, often in a high-dimensional space, where each dimension captures a feature of the content (a word or phrase in sparse representations, a learned feature in dense ones). Similar to vector
  • Full Stack Retrieval: The entire retrieval system, handling everything from data ingestion and processing to query handling and information delivery
  • Index: A data structure that allows for fast retrieval of documents or chunks within a large dataset. It maps key terms or features to their locations in the dataset
  • Knowledge Base: A structured database of facts, information, and rules that a retrieval system can draw upon to answer queries or perform tasks
  • Maximum Marginal Relevance (MMR): An algorithm used to provide a set of search results that are both relevant to the query and diverse, minimizing content overlap to offer a broader information range
  • Reranker: A model that improves the precision of document retrieval by reevaluating and scoring the relevance of a pre-selected set of documents to a specific query, aiming to refine the results for higher accuracy.
  • Retriever: A component that fetches relevant documents from a corpus or database based on a query, often using embeddings and similarity measures.
  • Sentiment: The emotional tone or meaning behind a series of words, used to understand the attitudes, opinions, and emotions expressed in a chunk of text
  • Text Splitting: Another way to say chunking. The act of splitting up your long body of text into smaller parts
  • Vector Store (VectorStore): A database or storage system where vectors are kept. It allows for efficient retrieval and comparison of vectors for operations like similarity searching
  • Vector: An ordered list of numbers. In retrieval, the mathematical representation of a document or chunk, often in a high-dimensional space, where each dimension captures a feature of the content. Similar to embedding
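
To make the Chunking and Text Splitting entries concrete, here is a minimal sketch of one common approach: fixed-size character windows with a small overlap. The function name `chunk_text` and its `chunk_size` and `overlap` parameters are illustrative assumptions, not part of any particular library. Real systems often split on sentences, tokens, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character chunks of at most chunk_size, sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.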
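
The Cosine Similarity and Maximum Marginal Relevance (MMR) entries can also be sketched in a few lines. The code below uses only the Python standard library and plain lists as vectors. The names `cosine_similarity`, `mmr`, and `lambda_mult` are assumptions chosen for illustration, and the greedy scoring shown is one common formulation of MMR rather than the only one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr(query: list[float], candidates: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Greedily pick up to k candidate indices, trading relevance against redundancy.

    lambda_mult = 1.0 ranks purely by similarity to the query;
    lower values penalize candidates that resemble ones already selected.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine_similarity(query, candidates[i])
        # Redundancy: highest similarity to anything already chosen (0 if none yet).
        redundancy = max(
            (cosine_similarity(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    # The first two candidates are near-duplicates; the third points elsewhere.
    candidates = [[1.0, 0.1], [1.0, 0.11], [1.0, -1.0]]
    print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
    print(mmr(query, candidates, k=2, lambda_mult=0.5))  # relevance plus diversity
```

With `lambda_mult=1.0` the two near-duplicate candidates are returned together. With `lambda_mult=0.5` the second pick shifts to the less similar third candidate, which is the overlap-minimizing behavior the MMR entry describes.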