What would be the best database, or database + data engine architecture, for the following requirements:
- I will receive IoT sensor data that is almost always ordered by the timestamp of the signal itself. The exception is network delay between packets, in which case a signal will most likely have to be inserted between the last 1-3 signals.
- If no changes occur, the latest row is updated instead of inserting a new one.
- A signal is received from each IoT sensor every 1-5 seconds, across a total of ~100,000 devices.
- Each IoT sensor signal is between 200 B and 2,000 B, with an average size of around 500 B.
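To make the scale concrete, the numbers above pin down a rough ingest rate. A quick back-of-the-envelope calculation (assuming signals are spread evenly over the 1-5 second interval):

```python
# Back-of-the-envelope ingest rate from the stated numbers.
DEVICES = 100_000
INTERVAL_S = (1, 5)          # each device sends every 1-5 seconds
AVG_SIGNAL_BYTES = 500

# Writes per second: one signal per device per interval (fastest, slowest).
writes_per_s = tuple(DEVICES / s for s in INTERVAL_S)
# Raw ingest bandwidth at the average signal size.
mb_per_s = tuple(w * AVG_SIGNAL_BYTES / 1e6 for w in writes_per_s)

print(writes_per_s)  # (100000.0, 20000.0)
print(mb_per_s)      # (50.0, 10.0)
```

So the cluster must sustain roughly 20,000-100,000 writes/sec and 10-50 MB/s of raw payload, well within what a distributed write-optimized store is built for.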
- Each IoT sensor returns a well-defined set of fields. There are around 200 possible field types; although a sensor can send all 200, each one usually sends a set of 20-40 of those fields depending on user configuration. So, in terms of storage, most columns in each row will be NULL.
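A minimal sketch of why NULL-free storage matters here: if a signal is materialized as a full 200-column row, only 20-40 columns carry values. The field names below are made up for illustration:

```python
# Hypothetical field catalogue: ~200 possible fields, of which a given
# sensor only sends the subset it is configured for.
ALL_FIELDS = ["temp", "rpm"] + [f"field_{i}" for i in range(198)]

# A signal as received: only the configured fields are present.
signal = {"temp": 21.5, "rpm": 900}

# Widening it to the full schema shows how sparse the dense row is.
dense_row = {f: signal.get(f) for f in ALL_FIELDS}
non_null = sum(v is not None for v in dense_row.values())
print(non_null, "of", len(dense_row), "columns populated")  # 2 of 200
```

A storage engine that charges nothing for absent columns effectively stores just the sparse map, not the 200-wide row.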
- A smaller subset of around 10 fields is always provided by all IoT sensors.
- Availability matters more than consistency, since most data will be consumed by the client for real-time monitoring rather than through the database. Historical report queries against the database will mostly reach well into the past, so eventual consistency is enough.
- Old data (say, 3+ months) is deleted.
- I need partition tolerance with distributed writes on all nodes.
Conclusions so far:
- Partition tolerance needed
- Availability > consistency
- NULL fields should take no space
- Row-oriented storage on disk, so that write speed is high
Based on these, I am considering Cassandra/ScyllaDB with partitions based on device ID and day, like 01_01_2024.
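A minimal sketch of that partition key idea, with the key as a (device ID, day bucket) pair; the function name and exact day format are illustrative assumptions:

```python
from datetime import datetime, timezone

def partition_key(device_id: str, ts: float) -> tuple:
    """Composite partition key: (device_id, day bucket), matching the
    device-ID-plus-day scheme described above."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%d_%m_%Y")
    return (device_id, day)

# All signals from one device on one day land in the same partition,
# keeping writes append-friendly and bounding partition size to one day.
key = partition_key("dev-42", 1704103200.0)  # 2024-01-01 10:00 UTC
print(key)  # ('dev-42', '01_01_2024')
```

Bounding each partition to a single device-day also makes the 3-month retention cheap: whole day partitions can be dropped via TTL or bucket deletion rather than row-by-row deletes.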
However, I have some concerns regarding query patterns:
Users can retrieve a list of complete signals over a time range, like a 1-week period, for a specific device.
- Parallel queries across the different partitions solve this.
- This would be the most-used query, as users request a history playback of sensor data.
Basic aggregation/conditionals over a specific device, like displaying a graph of a certain field with 1-hour averages as data points over the span of a month or more.
- Parallel queries could work here too, but less performantly: we would scan far more partitions and disk blocks than necessary, since each query only needs one field at a time. A column-oriented database solves this... but Cassandra/ScyllaDB is not one.
Basic aggregation/conditionals over all or many devices on a specific field, like points where a custom limit was exceeded over a time range of 1 month.
- Benefits from a column-oriented database as well.
What is the best database(s)/architecture for this, or is it maybe all solvable with Cassandra/ScyllaDB?