{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started with RedisVL\n", "`redisvl` is a versatile Python library with an integrated CLI, designed to enhance AI applications using Redis. This guide will walk you through the following steps:\n", "\n", "1. Defining an `IndexSchema`\n", "2. Preparing a sample dataset\n", "3. Creating a `SearchIndex` object\n", "4. Testing `rvl` CLI functionality\n", "5. Loading the sample data\n", "6. Building `VectorQuery` objects and executing searches\n", "7. Updating a `SearchIndex` object\n", "\n", "...and more!\n", "\n", "Prerequisites:\n", "- Ensure `redisvl` is installed in your Python environment.\n", "- Have a running instance of [Redis Stack](https://redis.io/docs/install/install-stack/) or [Redis Cloud](https://redis.com/try-free).\n", "\n", "_____" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define an `IndexSchema`\n", "\n", "The `IndexSchema` maintains crucial **index configuration** and **field definitions** to\n", "enable search with Redis. For ease of use, the schema can be constructed from a\n", "Python dictionary or YAML file.\n", "\n", "### Example Schema Creation\n", "Consider a dataset with user information, including `job`, `age`, `credit_score`,\n", "and a 3-dimensional `user_embedding` vector.\n", "\n", "You must also decide on a Redis index name and key prefix to use for this\n", "dataset.
Below are example schema definitions in both YAML and Python dictionary format.\n", "\n", "**YAML Definition:**\n", "\n", "```yaml\n", "version: '0.1.0'\n", "\n", "index:\n", "  name: user_simple\n", "  prefix: user_simple_docs\n", "\n", "fields:\n", "  - name: user\n", "    type: tag\n", "  - name: credit_score\n", "    type: tag\n", "  - name: job\n", "    type: text\n", "  - name: age\n", "    type: numeric\n", "  - name: user_embedding\n", "    type: vector\n", "    attrs:\n", "      algorithm: flat\n", "      dims: 3\n", "      distance_metric: cosine\n", "      datatype: float32\n", "```\n", "> Store this in a local file, such as `schema.yaml`, for RedisVL usage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python Dictionary:**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "schema = {\n", "    \"index\": {\n", "        \"name\": \"user_simple\",\n", "        \"prefix\": \"user_simple_docs\",\n", "    },\n", "    \"fields\": [\n", "        {\"name\": \"user\", \"type\": \"tag\"},\n", "        {\"name\": \"credit_score\", \"type\": \"tag\"},\n", "        {\"name\": \"job\", \"type\": \"text\"},\n", "        {\"name\": \"age\", \"type\": \"numeric\"},\n", "        {\n", "            \"name\": \"user_embedding\",\n", "            \"type\": \"vector\",\n", "            \"attrs\": {\n", "                \"dims\": 3,\n", "                \"distance_metric\": \"cosine\",\n", "                \"algorithm\": \"flat\",\n", "                \"datatype\": \"float32\"\n", "            }\n", "        }\n", "    ]\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Dataset Preparation\n", "\n", "Below, create a mock dataset with `user`, `job`, `age`, `credit_score`, and\n", "`user_embedding` fields. The `user_embedding` vectors are synthetic examples\n", "for demonstration purposes.\n", "\n", "For more information on creating real-world embeddings, refer to this\n", "[article](https://mlops.community/vector-similarity-search-from-basics-to-production/)."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "\n", "data = [\n", "    {\n", "        'user': 'john',\n", "        'age': 1,\n", "        'job': 'engineer',\n", "        'credit_score': 'high',\n", "        'user_embedding': np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes()\n", "    },\n", "    {\n", "        'user': 'mary',\n", "        'age': 2,\n", "        'job': 'doctor',\n", "        'credit_score': 'low',\n", "        'user_embedding': np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes()\n", "    },\n", "    {\n", "        'user': 'joe',\n", "        'age': 3,\n", "        'job': 'dentist',\n", "        'credit_score': 'medium',\n", "        'user_embedding': np.array([0.9, 0.9, 0.1], dtype=np.float32).tobytes()\n", "    }\n", "]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">As seen above, the sample `user_embedding` vectors are converted into bytes. With NumPy, this takes a single `tobytes()` call per vector." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Create a `SearchIndex`\n", "\n", "With the schema and sample dataset ready, instantiate a `SearchIndex`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from redisvl.index import SearchIndex\n", "\n", "index = SearchIndex.from_dict(schema)\n", "# or use .from_yaml('schema.yaml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index also needs a Redis connection.
There are a few ways to do this:\n", "\n", "- Create & manage your own client connection (recommended)\n", "- Provide a simple Redis URL and let RedisVL connect on your behalf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bring your own Redis connection instance\n", "\n", "This is ideal in scenarios where you have custom settings on the connection instance or if your application will share a connection pool:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from redis import Redis\n", "\n", "client = Redis.from_url(\"redis://localhost:6379\")\n", "\n", "index.set_client(client)\n", "# optionally provide an async Redis client object to enable async index operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let the index manage the connection instance\n", "\n", "This is ideal for simple cases:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index.connect(\"redis://localhost:6379\")\n", "# optionally use an async client by passing use_async=True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the underlying index\n", "\n", "Now that we are connected to Redis, we need to run the create command." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "index.create(overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">Note that at this point, the index has no entries. Data loading follows." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect with the `rvl` CLI\n", "Use the `rvl` CLI to inspect the created index and its fields:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m11:53:23\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", "\u001b[32m11:53:23\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. user_simple\n" ] } ], "source": [ "!rvl index listall" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Index Information:\n", "╭──────────────┬────────────────┬──────────────────────┬─────────────────┬────────────╮\n", "│ Index Name │ Storage Type │ Prefixes │ Index Options │ Indexing │\n", "├──────────────┼────────────────┼──────────────────────┼─────────────────┼────────────┤\n", "│ user_simple │ HASH │ ['user_simple_docs'] │ [] │ 0 │\n", "╰──────────────┴────────────────┴──────────────────────┴─────────────────┴────────────╯\n", "Index Fields:\n", "╭────────────────┬────────────────┬─────────┬────────────────┬────────────────╮\n", "│ Name │ Attribute │ Type │ Field Option │ Option Value │\n", "├────────────────┼────────────────┼─────────┼────────────────┼────────────────┤\n", "│ user │ user │ TAG │ SEPARATOR │ , │\n", "│ credit_score │ credit_score │ TAG │ SEPARATOR │ , │\n", "│ job │ job │ TEXT │ WEIGHT │ 1 │\n", "│ age │ age │ NUMERIC │ │ │\n", "│ user_embedding │ user_embedding │ VECTOR │ │ │\n", "╰────────────────┴────────────────┴─────────┴────────────────┴────────────────╯\n" ] } ], "source": [ "!rvl index info -i user_simple" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data to `SearchIndex`\n", "\n", "Load the sample dataset to Redis:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "['user_simple_docs:d424b73c516442f7919cc11ed3bb1882', 'user_simple_docs:6da16f88342048e79b3500bec5448805', 'user_simple_docs:ef5a590ef85e4d4888fd8ebe79ae1e8c']\n" ] } ], "source": [ "keys = index.load(data)\n", "\n", "print(keys)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">By default, `load` creates a unique Redis \"key\" by combining the index key `prefix` with a UUID. You can also customize the key by providing keys directly or by pointing to a specified `id_field` on load." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upsert the index with new data\n", "Upsert data by calling the `load` method again. With the default UUID keys described above, new records are added under new keys:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['user_simple_docs:9806a362604f4700b17513cc94fcf10d']\n" ] } ], "source": [ "# Add more data\n", "new_data = [{\n", "    'user': 'tyler',\n", "    'age': 9,\n", "    'job': 'engineer',\n", "    'credit_score': 'high',\n", "    'user_embedding': np.array([0.1, 0.3, 0.5], dtype=np.float32).tobytes()\n", "}]\n", "keys = index.load(new_data)\n", "\n", "print(keys)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Creating `VectorQuery` Objects\n", "\n", "Next we will create a vector query object for our newly populated index. This example uses a simple vector to demonstrate how vector similarity works. Vectors in production will likely be much larger than 3 floats and are usually produced by machine learning models (e.g., Hugging Face sentence transformers) or an embeddings API (such as Cohere or OpenAI). `redisvl` provides a set of [Vectorizers](https://www.redisvl.com/user_guide/vectorizers_04.html#openai) to assist in vector creation."
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from redisvl.query import VectorQuery\n", "from jupyterutils import result_print\n", "\n", "query = VectorQuery(\n", " vector=[0.1, 0.1, 0.5],\n", " vector_field_name=\"user_embedding\",\n", " return_fields=[\"user\", \"age\", \"job\", \"credit_score\", \"vector_distance\"],\n", " num_results=3\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Executing queries\n", "With our `VectorQuery` object defined above, we can execute the query over the `SearchIndex` using the `query` method." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
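<table><tr><th>vector_distance</th><th>user</th><th>age</th><th>job</th><th>credit_score</th></tr><tr><td>0</td><td>john</td><td>1</td><td>engineer</td><td>high</td></tr><tr><td>0</td><td>mary</td><td>2</td><td>doctor</td><td>low</td></tr><tr><td>0.0566299557686</td><td>tyler</td><td>9</td><td>engineer</td><td>high</td></tr></table>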
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "results = index.query(query)\n", "result_print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using an Asynchronous Redis Client\n", "\n", "The `AsyncSearchIndex` class along with an async Redis python client allows for queries, index creation, and data loading to be done asynchronously. This is the\n", "recommended route for working with `redisvl` in production-like settings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from redisvl.index import AsyncSearchIndex\n", "from redis.asyncio import Redis\n", "\n", "client = Redis.from_url(\"redis://localhost:6379\")\n", "\n", "index = AsyncSearchIndex.from_dict(schema)\n", "index.set_client(client)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
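<table><tr><th>vector_distance</th><th>user</th><th>age</th><th>job</th><th>credit_score</th></tr><tr><td>0</td><td>john</td><td>1</td><td>engineer</td><td>high</td></tr><tr><td>0</td><td>mary</td><td>2</td><td>doctor</td><td>low</td></tr><tr><td>0.0566299557686</td><td>tyler</td><td>9</td><td>engineer</td><td>high</td></tr></table>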
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# execute the vector query async\n", "results = await index.query(query)\n", "result_print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Updating a schema\n", "In some scenarios, it makes sense to update the index schema. With Redis and `redisvl`, this is easy because Redis can keep the underlying data in place while you change or make updates to the index configuration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So for our scenario, let's imagine we want to reindex this data in 2 ways:\n", "- by using a `Tag` type for `job` field instead of `Text`\n", "- by using an `hnsw` vector index for the `user_embedding` field instead of a `flat` vector index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Modify this schema to have what we want\n", "\n", "index.schema.remove_field(\"job\")\n", "index.schema.remove_field(\"user_embedding\")\n", "index.schema.add_fields([\n", " {\"name\": \"job\", \"type\": \"tag\"},\n", " {\n", " \"name\": \"user_embedding\",\n", " \"type\": \"vector\",\n", " \"attrs\": {\n", " \"dims\": 3,\n", " \"distance_metric\": \"cosine\",\n", " \"algorithm\": \"hnsw\",\n", " \"datatype\": \"float32\"\n", " }\n", " }\n", "])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11:53:25 redisvl.index.index INFO Index already exists, overwriting.\n" ] } ], "source": [ "# Run the index update but keep underlying data in place\n", "await index.create(overwrite=True, drop=False)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
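<table><tr><th>vector_distance</th><th>user</th><th>age</th><th>job</th><th>credit_score</th></tr><tr><td>0</td><td>john</td><td>1</td><td>engineer</td><td>high</td></tr><tr><td>0</td><td>mary</td><td>2</td><td>doctor</td><td>low</td></tr><tr><td>0.0566299557686</td><td>tyler</td><td>9</td><td>engineer</td><td>high</td></tr></table>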
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Execute the vector query async\n", "results = await index.query(query)\n", "result_print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Index Stats\n", "Use the `rvl` CLI to check the stats for the index:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Statistics:\n", "╭─────────────────────────────┬─────────────╮\n", "│ Stat Key │ Value │\n", "├─────────────────────────────┼─────────────┤\n", "│ num_docs │ 4 │\n", "│ num_terms │ 0 │\n", "│ max_doc_id │ 4 │\n", "│ num_records │ 20 │\n", "│ percent_indexed │ 1 │\n", "│ hash_indexing_failures │ 0 │\n", "│ number_of_uses │ 2 │\n", "│ bytes_per_record_avg │ 1 │\n", "│ doc_table_size_mb │ 0.00044632 │\n", "│ inverted_sz_mb │ 1.90735e-05 │\n", "│ key_table_size_mb │ 0.000138283 │\n", "│ offset_bits_per_record_avg │ nan │\n", "│ offset_vectors_sz_mb │ 0 │\n", "│ offsets_per_term_avg │ 0 │\n", "│ records_per_doc_avg │ 5 │\n", "│ sortable_values_size_mb │ 0 │\n", "│ total_indexing_time │ 1.796 │\n", "│ total_inverted_index_blocks │ 11 │\n", "│ vector_index_sz_mb │ 0.235603 │\n", "╰─────────────────────────────┴─────────────╯\n" ] } ], "source": [ "!rvl stats -i user_simple" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# clean up the index\n", "await index.delete()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('redisvl2')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": 
"9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316" } } }, "nbformat": 4, "nbformat_minor": 2 }