Data infrastructure has evolved over the last 15 years from Hadoop’s batch systems, to streaming systems like Spark and Kafka, and now to real-time systems like Rockset and ClickHouse. Automatic decision making based on massive data sets demands a data infrastructure that is real-time. These decisions are made either by hand-crafted rules or by machine-learned models that operate on large data sets and return results in milliseconds.
We dive into the design and architecture of one such real-time data processing platform, Rockset. Rockset is a real-time indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset is automatically indexed, and a fully featured SQL engine powers fast queries over that data without requiring any database tuning. Rockset uses open-source RocksDB as its storage engine. In this talk, we discuss some of the key design aspects of Rockset, such as:
* Smart Schema: Smart Schemas can take any semi-structured data set with deeply nested objects and arrays and automatically turn it into a SQL table. This becomes especially important when serving machine learning models in production, because the models frequently create new columns or change the schema of existing columns. We show how this feature reduces the need for data cleaning or data preparation before data can be used to generate insights or serve models in production.
* Converged indexing: A novel storage format (unlike Parquet or ORC) that is built for millisecond latency on massive data sets. This format builds multiple indices, including an inverted index, a column index, a row index, a range index, and a time index, with minimal overhead. This allows model serving to operate on large, fast-changing data sets because a query automatically picks the best index to use, making it faster than brute-force, scan-based systems.
* The Aggregator Leaf Tailer architecture: A novel systems architecture that implements a three-way disaggregation among storage, query compute and ingest compute.
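
To make the first two ideas above concrete, here is a minimal, illustrative sketch in Python. It is not Rockset's storage format or API: the names `flatten` and `ToyConvergedIndex`, the `[]` path notation for array elements, and the in-memory dictionaries are all assumptions made for illustration. It flattens a nested document into field paths without a declared schema, writes every field into a row index, a column index, and an inverted index, and answers each kind of query from the index that suits it.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Yield (field_path, scalar_value) pairs from a nested dict/list document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(value, path)
    elif isinstance(doc, list):
        for item in doc:
            # Hypothetical notation: "[]" marks "any element of this array".
            yield from flatten(item, f"{prefix}[]")
    else:
        yield prefix, doc


class ToyConvergedIndex:
    """Toy model: every field of every document lands in three indices."""

    def __init__(self):
        self.row = defaultdict(dict)      # doc_id -> {path: [values]}
        self.column = defaultdict(dict)   # path -> {doc_id: [values]}
        self.inverted = defaultdict(set)  # (path, value) -> {doc_id, ...}

    def add(self, doc_id, doc):
        # No schema is declared up front; new paths simply appear.
        for path, value in flatten(doc):
            self.row[doc_id].setdefault(path, []).append(value)
            self.column[path].setdefault(doc_id, []).append(value)
            self.inverted[(path, value)].add(doc_id)

    def find_equal(self, path, value):
        """Selective equality predicate: answered from the inverted index."""
        return sorted(self.inverted.get((path, value), set()))

    def column_sum(self, path):
        """Aggregate over one field: answered from the column index."""
        per_doc = self.column.get(path, {})
        return sum(v for values in per_doc.values() for v in values)

    def fetch(self, doc_id):
        """Whole-document lookup: answered from the row index."""
        return dict(self.row.get(doc_id, {}))


index = ToyConvergedIndex()
index.add(1, {"user": {"name": "ann", "tags": ["a", "b"]}, "score": 10})
index.add(2, {"user": {"name": "bob", "tags": ["b"]}, "score": 5, "extra": True})

print(index.find_equal("user.tags[]", "b"))  # [1, 2]
print(index.column_sum("score"))             # 15
print(index.fetch(2))
```

In this sketch the caller chooses the index by calling a specific method; in a real engine that choice would be made by the query optimizer. The second document introduces a new field (`extra`) without any schema change, which is the behavior the Smart Schema bullet describes.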
We describe how Rockset uses SIMD instructions in a vectorized engine to improve query performance, and draw a parallel to how machine-learning training infrastructure can leverage a similar approach. We explain how Rockset manages the on-disk format of data with automatic splitting of RocksDB-based, column-based clusters for better compression and faster decoding, a technique that general-purpose machine-learning training infrastructure can use as well.