{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Vectorizers\n", "\n", "In this notebook, we will show how to use RedisVL to create embeddings using the built-in text embedding vectorizers. Today RedisVL supports:\n", "1. OpenAI\n", "2. HuggingFace\n", "3. Vertex AI\n", "4. Cohere\n", "\n", "Before running this notebook, be sure to\n", "1. Have installed ``redisvl`` and have that environment active for this notebook.\n", "2. Have a running Redis Stack instance with RediSearch > 2.4 active.\n", "\n", "For example, you can run Redis Stack locally with Docker:\n", "\n", "```bash\n", "docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", "```\n", "\n", "This will run Redis on port 6379 and RedisInsight at http://localhost:8001." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import necessary modules\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Text Embeddings\n", "\n", "This example will show how to create an embedding from 3 simple sentences with a number of different text vectorizers in RedisVL.\n", "\n", "- \"That is a happy dog\"\n", "- \"That is a happy person\"\n", "- \"Today is a nice day\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OpenAI\n", "\n", "The ``OpenAITextVectorizer`` makes it simple to use RedisVL with the embeddings models at OpenAI. For this you will need to install ``openai``. \n", "\n", "```bash\n", "pip install openai\n", "```\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "\n", "# setup the API Key\n", "api_key = os.environ.get(\"OPENAI_API_KEY\") or getpass.getpass(\"Enter your OpenAI API key: \")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vector dimensions: 1536\n" ] }, { "data": { "text/plain": [ "[-0.001025049015879631,\n", " -0.0030993607360869646,\n", " 0.0024536605924367905,\n", " -0.004484387580305338,\n", " -0.010331203229725361,\n", " 0.012700922787189484,\n", " -0.005368996877223253,\n", " -0.0029411641880869865,\n", " -0.0070833307690918446,\n", " -0.03386051580309868]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from redisvl.utils.vectorize import OpenAITextVectorizer\n", "\n", "# create a vectorizer\n", "oai = OpenAITextVectorizer(\n", " model=\"text-embedding-ada-002\",\n", " api_config={\"api_key\": api_key},\n", ")\n", "\n", "test = oai.embed(\"This is a test sentence.\")\n", "print(\"Vector dimensions: \", len(test))\n", "test[:10]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.01747742109000683,\n", " -5.228330701356754e-05,\n", " 0.0013870716793462634,\n", " -0.025637786835432053,\n", " -0.01985435001552105,\n", " 0.016117358580231667,\n", " -0.0037306349258869886,\n", " 0.0008945261361077428,\n", " 0.006577865686267614,\n", " -0.025091219693422318]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create many embeddings at once\n", "sentences = [\n", " \"That is a happy dog\",\n", " \"That is a happy person\",\n", " \"Today is a sunny day\"\n", "]\n", "\n", "embeddings = oai.embed_many(sentences)\n", "embeddings[0][:10]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Azure OpenAI\n", "\n", "The ``AzureOpenAITextVectorizer`` is a variation of the OpenAI vectorizer that calls OpenAI models within Azure. If you've already installed ``openai``, then you're ready to use Azure OpenAI.\n", "\n", "The only practical difference between OpenAI and Azure OpenAI is the variables required to call the API." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# in addition to the API key, set up the API endpoint and version\n", "api_key = os.environ.get(\"AZURE_OPENAI_API_KEY\") or getpass.getpass(\"Enter your AzureOpenAI API key: \")\n", "api_version = os.environ.get(\"OPENAI_API_VERSION\") or getpass.getpass(\"Enter your AzureOpenAI API version: \")\n", "azure_endpoint = os.environ.get(\"AZURE_OPENAI_ENDPOINT\") or getpass.getpass(\"Enter your AzureOpenAI API endpoint: \")\n", "deployment_name = os.environ.get(\"AZURE_OPENAI_DEPLOYMENT_NAME\", \"text-embedding-ada-002\")\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vector dimensions: 1536\n" ] }, { "data": { "text/plain": [ "[-0.0010088568087667227,\n", " -0.003142790636047721,\n", " 0.0024922797456383705,\n", " -0.004522906616330147,\n", " -0.010369433090090752,\n", " 0.012739036232233047,\n", " -0.005365503951907158,\n", " -0.0029668458737432957,\n", " -0.007141091860830784,\n", " -0.03383301943540573]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from redisvl.utils.vectorize import AzureOpenAITextVectorizer\n", "\n", "# create a vectorizer\n", "az_oai = AzureOpenAITextVectorizer(\n", " model=deployment_name, # Must be your CUSTOM deployment name\n", " api_config={\n", " \"api_key\": api_key,\n", " \"api_version\": api_version,\n", " \"azure_endpoint\": azure_endpoint\n", " },\n", ")\n", "\n", "test = az_oai.embed(\"This is a test sentence.\")\n", "print(\"Vector dimensions: \", len(test))\n", "test[:10]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-0.017460526898503304,\n", " -6.895032856846228e-05,\n", " 0.0013909287517890334,\n", " -0.025688467547297478,\n", " -0.019813183695077896,\n", " 0.016087085008621216,\n", " -0.003729278687387705,\n", " 0.0009211922879330814,\n", " 0.006606514099985361,\n", " -0.025128915905952454]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Just like OpenAI, Azure OpenAI supports batching embeddings and asynchronous requests.\n", "sentences = [\n", " \"That is a happy dog\",\n", " \"That is a happy person\",\n", " \"Today is a sunny day\"\n", "]\n", "\n", "embeddings = await az_oai.aembed_many(sentences)\n", "embeddings[0][:10]" ] },
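{ "cell_type": "markdown", "metadata": {}, "source": [ "Since both vectorizers above call the same underlying ``text-embedding-ada-002`` model, the vectors they return should be nearly (though not necessarily exactly) identical. Below is a minimal sketch, assuming ``numpy`` is installed, that checks this with cosine similarity; it is only an illustration, not part of RedisVL." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# embed the same sentence with both providers and compare the vectors\n", "v1 = np.array(oai.embed(\"This is a test sentence.\"))\n", "v2 = np.array(az_oai.embed(\"This is a test sentence.\"))\n", "\n", "similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))\n", "print(f\"Cosine similarity between OpenAI and Azure OpenAI embeddings: {similarity:.4f}\")" ] },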
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Huggingface\n", "\n", "[Huggingface](https://huggingface.co/models) is a popular NLP platform that has a number of pre-trained models you can use off the shelf. RedisVL supports using Huggingface \"Sentence Transformers\" to create embeddings from text. To use Huggingface, you will need to install the ``sentence-transformers`` library.\n", "\n", "```bash\n", "pip install sentence-transformers\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/tyler.hutcherson/RedisVentures/redisvl/.venv/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n", " return self.fget.__get__(instance, owner)()\n" ] }, { "data": { "text/plain": [ "[0.00037810884532518685,\n", " -0.05080341175198555,\n", " -0.03514723479747772,\n", " -0.02325104922056198,\n", " -0.044158220291137695,\n", " 0.020487844944000244,\n", " 0.0014617963461205363,\n", " 0.031261757016181946,\n", " 0.05605152249336243,\n", " 0.018815357238054276]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", "from redisvl.utils.vectorize import HFTextVectorizer\n", "\n", "\n", "# create a vectorizer\n", "# choose your model from the huggingface website\n", "hf = HFTextVectorizer(model=\"sentence-transformers/all-mpnet-base-v2\")\n", "\n", "# embed a sentence\n", "test = hf.embed(\"This is a test sentence.\")\n", "test[:10]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# You can also create many embeddings at once\n", "embeddings = hf.embed_many(sentences, as_buffer=True)\n" ] },
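{ "cell_type": "markdown", "metadata": {}, "source": [ "With ``as_buffer=True``, each embedding is returned as a raw bytes buffer rather than a list of floats, which is the form used when loading vectors into Redis hash fields later in this notebook. As a minimal sketch, assuming RedisVL's default ``float32`` encoding, you can decode a buffer back into floats with ``numpy``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# decode the first buffer back into a float array (assumes the default float32 dtype)\n", "vector = np.frombuffer(embeddings[0], dtype=np.float32)\n", "print(vector.shape)  # all-mpnet-base-v2 produces 768-dimensional vectors" ] },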
{ "cell_type": "markdown", "metadata": {}, "source": [ "### VertexAI\n", "\n", "[VertexAI](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) is GCP's fully-featured AI platform, which includes a number of pretrained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you will first need to install the ``google-cloud-aiplatform`` library.\n", "\n", "```bash\n", "pip install google-cloud-aiplatform>=1.26\n", "```\n", "\n", "1. You will then need access to a [Google Cloud Project](https://cloud.google.com/gcp?hl=en) and [application credentials](https://cloud.google.com/docs/authentication/application-default-credentials). Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of a JSON key file downloaded from your service account on GCP.\n", "2. Lastly, find your [project ID](https://support.google.com/googleapi/answer/7014113?hl=en) and [geographic region for VertexAI](https://cloud.google.com/vertex-ai/docs/general/locations).\n", "\n", "**Make sure the following env vars are set:**\n", "\n", "```\n", "GOOGLE_APPLICATION_CREDENTIALS=\n", "GCP_PROJECT_ID=\n", "GCP_LOCATION=\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.04373306408524513,\n", " -0.05040992051362991,\n", " -0.011946038343012333,\n", " -0.043528858572244644,\n", " 0.021510830149054527,\n", " 0.028604144230484962,\n", " 0.014770914800465107,\n", " -0.01610461436212063,\n", " -0.0036560404114425182,\n", " 0.013746795244514942]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from redisvl.utils.vectorize import VertexAITextVectorizer\n", "\n", "\n", "# create a vectorizer\n", "vtx = VertexAITextVectorizer(api_config={\n", " \"project_id\": os.environ.get(\"GCP_PROJECT_ID\") or getpass.getpass(\"Enter your GCP Project ID: \"),\n", " \"location\": os.environ.get(\"GCP_LOCATION\") or getpass.getpass(\"Enter your GCP Location: \"),\n", " \"google_application_credentials\": os.environ.get(\"GOOGLE_APPLICATION_CREDENTIALS\") or getpass.getpass(\"Enter your Google App Credentials path: \")\n", "})\n", "\n", "# embed a sentence\n", "test = vtx.embed(\"This is a test sentence.\")\n", "test[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cohere\n", "\n", "[Cohere](https://dashboard.cohere.ai/) allows you to integrate language AI into your product. The `CohereTextVectorizer` makes it simple to use RedisVL with the embedding models at Cohere. For this you will need to install `cohere`.\n", "\n", "```bash\n", "pip install cohere\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "# set up the API key\n", "api_key = os.environ.get(\"COHERE_API_KEY\") or getpass.getpass(\"Enter your Cohere API key: \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Special attention needs to be paid to the `input_type` parameter for each `embed` call. For example, for embedding queries you should set `input_type='search_query'`, and for embedding documents, `input_type='search_document'`. See more information [here](https://docs.cohere.com/reference/embed)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vector dimensions: 1024\n", "[-0.010856628, -0.019683838, -0.0062179565, 0.003545761, -0.047943115, 0.0009365082, -0.005924225, 0.016174316, -0.03289795, 0.049194336]\n", "Vector dimensions: 1024\n", "[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834]\n" ] } ], "source": [ "from redisvl.utils.vectorize import CohereTextVectorizer\n", "\n", "# create a vectorizer\n", "co = CohereTextVectorizer(\n", " model=\"embed-english-v3.0\",\n", " api_config={\"api_key\": api_key},\n", ")\n", "\n", "# embed a search query\n", "test = co.embed(\"This is a test sentence.\", input_type='search_query')\n", "print(\"Vector dimensions: \", len(test))\n", "print(test[:10])\n", "\n", "# embed a document\n", "test = co.embed(\"This is a test sentence.\", input_type='search_document')\n", "print(\"Vector dimensions: \", len(test))\n", "print(test[:10])" ] },
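{ "cell_type": "markdown", "metadata": {}, "source": [ "Because `input_type` changes how the text is embedded, the query and document vectors for the same sentence are not identical, as the two outputs above show. Below is a minimal sketch, assuming ``numpy`` is installed, that quantifies how close the two representations actually are." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# embed the same sentence as a query and as a document, then compare\n", "q = np.array(co.embed(\"This is a test sentence.\", input_type='search_query'))\n", "d = np.array(co.embed(\"This is a test sentence.\", input_type='search_document'))\n", "\n", "similarity = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))\n", "print(f\"Cosine similarity between query and document embeddings: {similarity:.4f}\")" ] },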
See\n", "more information [here](https://docs.cohere.com/reference/embed)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vector dimensions: 1024\n", "[-0.010856628, -0.019683838, -0.0062179565, 0.003545761, -0.047943115, 0.0009365082, -0.005924225, 0.016174316, -0.03289795, 0.049194336]\n", "Vector dimensions: 1024\n", "[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834]\n" ] } ], "source": [ "from redisvl.utils.vectorize import CohereTextVectorizer\n", "\n", "# create a vectorizer\n", "co = CohereTextVectorizer(\n", " model=\"embed-english-v3.0\",\n", " api_config={\"api_key\": api_key},\n", ")\n", "\n", "# embed a search query\n", "test = co.embed(\"This is a test sentence.\", input_type='search_query')\n", "print(\"Vector dimensions: \", len(test))\n", "print(test[:10])\n", "\n", "# embed a document\n", "test = co.embed(\"This is a test sentence.\", input_type='search_document')\n", "print(\"Vector dimensions: \", len(test))\n", "print(test[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learn more about using RedisVL and Cohere together through [this dedicated user guide](https://docs.cohere.com/docs/redis-and-cohere)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search with Provider Embeddings\n", "\n", "Now that we've created our embeddings, we can use them to search for similar sentences. We will use the same 3 sentences from above and search for similar sentences.\n", "\n", "First, we need to create the schema for our index.\n", "\n", "Here's what the schema for the example looks like in yaml for the HuggingFace vectorizer:\n", "\n", "```yaml\n", "version: '0.1.0'\n", "\n", "index:\n", " name: vectorizers\n", " prefix: doc\n", " storage_type: hash\n", "\n", "fields:\n", " - name: sentence\n", " type: text\n", " - name: embedding\n", " type: vector\n", " attrs:\n", " dims: 768\n", " algorithm: flat\n", " distance_metric: cosine\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from redisvl.index import SearchIndex\n", "\n", "# construct a search index from the schema\n", "index = SearchIndex.from_yaml(\"./schema.yaml\")\n", "\n", "# connect to local redis instance\n", "index.connect(\"redis://localhost:6379\")\n", "\n", "# create the index (no data yet)\n", "index.create(overwrite=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m20:22:42\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", "\u001b[32m20:22:42\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cleanup\n", "index.delete()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('redisvl2')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316" } } }, "nbformat": 4, "nbformat_minor": 2 }