{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hash vs JSON Storage\n",
"\n",
"\n",
"Out of the box, Redis provides a [variety of data structures](https://redis.com/redis-enterprise/data-structures/) that can adapt to your domain specific applications and use cases.\n",
"In this notebook, we will demonstrate how to use RedisVL with both [Hash](https://redis.io/docs/data-types/hashes/) and [JSON](https://redis.io/docs/data-types/json/) data.\n",
"\n",
"\n",
"Before running this notebook, be sure to\n",
"1. Have installed ``redisvl`` and have that environment active for this notebook.\n",
"2. Have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.\n",
"\n",
"For example, you can run [Redis Stack](https://redis.io/docs/install/install-stack/) locally with Docker:\n",
"\n",
"```bash\n",
"docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n",
"```\n",
"\n",
"Or create a [FREE Redis Cloud](https://redis.com/try-free)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import necessary modules\n",
"import pickle\n",
"\n",
"from redisvl.redis.utils import buffer_to_array\n",
"from redisvl.index import SearchIndex\n",
"\n",
"\n",
"# load in the example data and printing utils\n",
"data = pickle.load(open(\"hybrid_example_data.pkl\", \"rb\"))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
user | age | job | credit_score | office_location | user_embedding |
---|
john | 18 | engineer | high | -122.4194,37.7749 | b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' |
derrick | 14 | doctor | low | -122.4194,37.7749 | b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' |
nancy | 94 | doctor | high | -122.4194,37.7749 | b'333?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' |
tyler | 100 | engineer | high | -122.0839,37.3861 | b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc>\\x00\\x00\\x00?' |
tim | 12 | dermatologist | high | -122.0839,37.3861 | b'\\xcd\\xcc\\xcc>\\xcd\\xcc\\xcc>\\x00\\x00\\x00?' |
taimur | 15 | CEO | low | -122.0839,37.3861 | b'\\x9a\\x99\\x19?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' |
joe | 35 | dentist | medium | -122.0839,37.3861 | b'fff?fff?\\xcd\\xcc\\xcc=' |
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from jupyterutils import result_print, table_print\n",
"\n",
"table_print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hash or JSON -- how to choose?\n",
"Both storage options offer a variety of features and tradeoffs. Below we will work through a dummy dataset to learn when and how to use both."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with Hashes\n",
"Hashes in Redis are simple collections of field-value pairs. Think of it like a mutable single-level dictionary contains multiple \"rows\":\n",
"\n",
"\n",
"```python\n",
"{\n",
" \"model\": \"Deimos\",\n",
" \"brand\": \"Ergonom\",\n",
" \"type\": \"Enduro bikes\",\n",
" \"price\": 4972,\n",
"}\n",
"```\n",
"\n",
"Hashes are best suited for use cases with the following characteristics:\n",
"- Performance (speed) and storage space (memory consumption) are top concerns\n",
"- Data can be easily normalized and modeled as a single-level dict\n",
"\n",
"> Hashes are typically the default recommendation."
]
},
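{
"cell_type": "markdown",
"metadata": {},
"source": [
"When source records are nested, they have to be flattened before they fit the single-level hash model. The sketch below shows one possible way to do that. `flatten` is a hypothetical helper (it is not part of Redis or RedisVL), and the underscore separator is an arbitrary choice:\n",
"\n",
"```python\n",
"def flatten(record, parent_key='', sep='_'):\n",
"    # Recursively collapse nested dicts into a single-level dict\n",
"    flat = {}\n",
"    for key, value in record.items():\n",
"        full_key = parent_key + sep + key if parent_key else key\n",
"        if isinstance(value, dict):\n",
"            flat.update(flatten(value, full_key, sep))\n",
"        else:\n",
"            flat[full_key] = value\n",
"    return flat\n",
"\n",
"\n",
"bike = {\n",
"    'name': 'bike',\n",
"    'metadata': {'model': 'Deimos', 'brand': 'Ergonom', 'price': 4972},\n",
"}\n",
"flat_bike = flatten(bike)\n",
"print(flat_bike)\n",
"```\n",
"\n",
"Each nested key ends up prefixed with its parent key, so the result can be stored as a single hash."
]
},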
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# define the hash index schema\n",
"hash_schema = {\n",
" \"index\": {\n",
" \"name\": \"user-hash\",\n",
" \"prefix\": \"user-hash-docs\",\n",
" \"storage_type\": \"hash\", # default setting -- HASH\n",
" },\n",
" \"fields\": [\n",
" {\"name\": \"user\", \"type\": \"tag\"},\n",
" {\"name\": \"credit_score\", \"type\": \"tag\"},\n",
" {\"name\": \"job\", \"type\": \"text\"},\n",
" {\"name\": \"age\", \"type\": \"numeric\"},\n",
" {\"name\": \"office_location\", \"type\": \"geo\"},\n",
" {\n",
" \"name\": \"user_embedding\",\n",
" \"type\": \"vector\",\n",
" \"attrs\": {\n",
" \"dims\": 3,\n",
" \"distance_metric\": \"cosine\",\n",
" \"algorithm\": \"flat\",\n",
" \"datatype\": \"float32\"\n",
" }\n",
"\n",
" }\n",
" ],\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# construct a search index from the hash schema\n",
"hindex = SearchIndex.from_dict(hash_schema)\n",
"\n",
"# connect to local redis instance\n",
"hindex.connect(\"redis://localhost:6379\")\n",
"\n",
"# create the index (no data yet)\n",
"hindex.create(overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show the underlying storage type\n",
"hindex.storage_type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Vectors as byte strings\n",
"One nuance when working with Hashes in Redis, is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example of that can be seen below:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'user': 'john',\n",
" 'age': 18,\n",
" 'job': 'engineer',\n",
" 'credit_score': 'high',\n",
" 'office_location': '-122.4194,37.7749',\n",
" 'user_embedding': b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show a single entry from the data that will be loaded\n",
"data[0]"
]
},
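{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `user_embedding` value above is the vector's `float32` components packed back to back, 4 bytes each. As a minimal sketch, the same kind of byte string can be produced with the standard library `struct` module. This assumes little-endian byte order, which is consistent with the sample output above; it is an illustration rather than the way the example data was necessarily generated:\n",
"\n",
"```python\n",
"import struct\n",
"\n",
"vector = [0.1, 0.1, 0.5]\n",
"\n",
"# Pack three little-endian float32 values into one byte string\n",
"as_bytes = struct.pack('<3f', *vector)\n",
"\n",
"print(as_bytes)\n",
"print(len(as_bytes))  # 3 dims * 4 bytes per float32\n",
"```\n",
"\n",
"The byte length is always `dims * 4` for `float32`, which is a quick sanity check before loading data into a hash index."
]
},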
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# load hash data\n",
"keys = hindex.load(data)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Statistics:\n",
"╭─────────────────────────────┬─────────────╮\n",
"│ Stat Key │ Value │\n",
"├─────────────────────────────┼─────────────┤\n",
"│ num_docs │ 7 │\n",
"│ num_terms │ 6 │\n",
"│ max_doc_id │ 7 │\n",
"│ num_records │ 44 │\n",
"│ percent_indexed │ 1 │\n",
"│ hash_indexing_failures │ 0 │\n",
"│ number_of_uses │ 1 │\n",
"│ bytes_per_record_avg │ 3.40909 │\n",
"│ doc_table_size_mb │ 0.000767708 │\n",
"│ inverted_sz_mb │ 0.000143051 │\n",
"│ key_table_size_mb │ 0.000248909 │\n",
"│ offset_bits_per_record_avg │ 8 │\n",
"│ offset_vectors_sz_mb │ 8.58307e-06 │\n",
"│ offsets_per_term_avg │ 0.204545 │\n",
"│ records_per_doc_avg │ 6.28571 │\n",
"│ sortable_values_size_mb │ 0 │\n",
"│ total_indexing_time │ 1.053 │\n",
"│ total_inverted_index_blocks │ 18 │\n",
"│ vector_index_sz_mb │ 0.0202332 │\n",
"╰─────────────────────────────┴─────────────╯\n"
]
}
],
"source": [
"!rvl stats -i user-hash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Performing Queries\n",
"Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"vector_distance | user | credit_score | age | job | office_location |
---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from redisvl.query import VectorQuery\n",
"from redisvl.query.filter import Tag, Text, Num\n",
"\n",
"t = (Tag(\"credit_score\") == \"high\") & (Text(\"job\") % \"enginee*\") & (Num(\"age\") > 17)\n",
"\n",
"v = VectorQuery([0.1, 0.1, 0.5],\n",
" \"user_embedding\",\n",
" return_fields=[\"user\", \"credit_score\", \"age\", \"job\", \"office_location\"],\n",
" filter_expression=t)\n",
"\n",
"\n",
"results = hindex.query(v)\n",
"result_print(results)\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# clean up\n",
"hindex.delete()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with JSON\n",
"Redis also supports native **JSON** objects. These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub elements:\n",
"\n",
"```python\n",
"{\n",
" \"name\": \"bike\",\n",
" \"metadata\": {\n",
" \"model\": \"Deimos\",\n",
" \"brand\": \"Ergonom\",\n",
" \"type\": \"Enduro bikes\",\n",
" \"price\": 4972,\n",
" }\n",
"}\n",
"```\n",
"\n",
"JSON is best suited for use cases with the following characteristics:\n",
"- Ease of use and data model flexibility are top concerns\n",
"- Application data is already native JSON\n",
"- Replacing another document storage/db solution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Full JSON Path support\n",
"Because Redis enables full JSON path support, when creating an index schema, elements need to be indexed and selected by their path with the desired `name` AND `path` that points to where the data is located within the objects.\n",
"\n",
"> By default, RedisVL will assume the path as `$.{name}` if not provided in JSON fields schema."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# define the json index schema\n",
"json_schema = {\n",
" \"index\": {\n",
" \"name\": \"user-json\",\n",
" \"prefix\": \"user-json-docs\",\n",
" \"storage_type\": \"json\", # JSON storage type\n",
" },\n",
" \"fields\": [\n",
" {\"name\": \"user\", \"type\": \"tag\"},\n",
" {\"name\": \"credit_score\", \"type\": \"tag\"},\n",
" {\"name\": \"job\", \"type\": \"text\"},\n",
" {\"name\": \"age\", \"type\": \"numeric\"},\n",
" {\"name\": \"office_location\", \"type\": \"geo\"},\n",
" {\n",
" \"name\": \"user_embedding\",\n",
" \"type\": \"vector\",\n",
" \"attrs\": {\n",
" \"dims\": 3,\n",
" \"distance_metric\": \"cosine\",\n",
" \"algorithm\": \"flat\",\n",
" \"datatype\": \"float32\"\n",
" }\n",
"\n",
" }\n",
" ],\n",
"}"
]
},
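{
"cell_type": "markdown",
"metadata": {},
"source": [
"The schema above relies on the default `$.{name}` paths because every field sits at the top level of each record. For nested documents, a field would need an explicit `path`. The sketch below is illustrative only: the field layout and the sample document are made up, and `resolve` is a hypothetical helper that handles simple dotted paths, not full JSONPath:\n",
"\n",
"```python\n",
"nested_fields = [\n",
"    {'name': 'name', 'type': 'tag'},  # no path given, so $.name is assumed\n",
"    {'name': 'brand', 'type': 'tag', 'path': '$.metadata.brand'},\n",
"    {'name': 'price', 'type': 'numeric', 'path': '$.metadata.price'},\n",
"]\n",
"\n",
"doc = {'name': 'bike', 'metadata': {'brand': 'Ergonom', 'price': 4972}}\n",
"\n",
"\n",
"def resolve(document, path):\n",
"    # Follow a simple dotted path such as $.metadata.price\n",
"    value = document\n",
"    for part in path.split('.')[1:]:\n",
"        value = value[part]\n",
"    return value\n",
"\n",
"\n",
"for field in nested_fields:\n",
"    path = field.get('path', '$.' + field['name'])\n",
"    print(field['name'], '->', resolve(doc, path))\n",
"```\n",
"\n",
"The `name` is what queries refer to, while the `path` tells the index where to find the value inside each document."
]
},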
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# construct a search index from the json schema\n",
"jindex = SearchIndex.from_dict(json_schema)\n",
"\n",
"# connect to local redis instance\n",
"jindex.connect(\"redis://localhost:6379\")\n",
"\n",
"# create the index (no data yet)\n",
"jindex.create(overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m11:54:18\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n",
"\u001b[32m11:54:18\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. user-json\n"
]
}
],
"source": [
"# note the multiple indices in the same database\n",
"!rvl index listall"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Vectors as float arrays\n",
"Vectorized data stored in JSON must be stored as a pure array (python list) of floats. We will modify our sample data to account for this below:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"json_data = data.copy()\n",
"\n",
"for d in json_data:\n",
" d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype=np.float32)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'user': 'john',\n",
" 'age': 18,\n",
" 'job': 'engineer',\n",
" 'credit_score': 'high',\n",
" 'office_location': '-122.4194,37.7749',\n",
" 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# inspect a single JSON record\n",
"json_data[0]"
]
},
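{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the conversion above less opaque, here is a rough standard-library sketch of decoding a `float32` byte string into a list of floats. It assumes little-endian byte order and is not meant to describe how `buffer_to_array` is implemented:\n",
"\n",
"```python\n",
"import json\n",
"import struct\n",
"\n",
"# Stand-in for a byte-string vector as used with hash storage\n",
"raw = struct.pack('<3f', 0.1, 0.1, 0.5)\n",
"\n",
"# Each float32 occupies 4 bytes\n",
"count = len(raw) // 4\n",
"floats = list(struct.unpack('<' + str(count) + 'f', raw))\n",
"\n",
"# A list of floats serializes to JSON; a bytes object would raise TypeError\n",
"print(json.dumps({'user_embedding': floats}))\n",
"```\n",
"\n",
"The decoded values carry `float32` rounding, which is why `0.1` shows up with extra digits in the record above."
]
},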
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"keys = jindex.load(json_data)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"vector_distance | user | credit_score | age | job | office_location |
---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# we can now run the exact same query as above\n",
"result_print(jindex.query(v))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"jindex.delete()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 ('redisvl2')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}