Cassandra Data Model
The data model in Cassandra is quite different from what we normally see in an RDBMS. Let's see how Cassandra stores its data.
Cluster
A Cassandra database is distributed over several machines that operate together. The outermost container is the cluster, which contains the nodes. Each piece of data is replicated to more than one node, so if a node fails, a replica on another node can still serve requests. Cassandra arranges the nodes of a cluster in a ring and assigns data to them.
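The ring arrangement can be illustrated with a minimal Python sketch. This is a simplified model and not Cassandra's actual implementation: the node names, the 32-bit token space, the evenly spaced node tokens, and the Ring class itself are assumptions made for illustration, and MD5 stands in for the real partitioner.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed small token space, for illustration only


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 used as a stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """Hypothetical model of a cluster: nodes placed at even intervals on a ring."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.nodes = list(nodes)
        self.replication_factor = replication_factor
        step = RING_SIZE // len(self.nodes)
        # Node i owns the token i * step; the list is already sorted.
        self.tokens = [i * step for i in range(len(self.nodes))]

    def replicas(self, key: str):
        """Return the nodes holding copies of `key`, walking clockwise."""
        # First node whose token is >= the key's token, wrapping to node 0.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.nodes)
        return [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
print(ring.replicas("user:42"))  # three distinct, consecutive nodes on the ring
```

If the first node in the returned list is down, the remaining nodes in the list still hold copies of the same data, which is the sense in which a replica can take over.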
Keyspace
A keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace are:
- Replication factor: It specifies the number of machines in the cluster that will receive copies of the same data.
- Replica placement strategy: It specifies how replicas are placed in the ring. There are three strategies:
1) Simple strategy (rack-unaware strategy)
2) Old network topology strategy (rack-aware strategy)
3) Network topology strategy (datacenter-aware strategy)
- Column families: Column families are placed under a keyspace. A keyspace is a container for one or more column families, and a column family is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
In Cassandra, a good data model is very important because a bad data model can degrade performance, especially when you try to apply RDBMS concepts to Cassandra.
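The containment hierarchy described in the list above (keyspace, column family, row, ordered columns) can be sketched in Python as nested containers. This is only an illustration of the structure, not how Cassandra stores data on disk; the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnFamily:
    """A container of rows; each row maps column names to values."""
    name: str
    rows: dict = field(default_factory=dict)  # row key -> {column name: value}

    def insert(self, row_key, columns):
        row = self.rows.setdefault(row_key, {})
        row.update(columns)
        # Keep the columns of the row ordered by column name.
        self.rows[row_key] = dict(sorted(row.items()))


@dataclass
class Keyspace:
    """The outermost data container, carrying the replication settings."""
    name: str
    replication_factor: int
    placement_strategy: str
    column_families: dict = field(default_factory=dict)

    def create_column_family(self, name):
        cf = ColumnFamily(name)
        self.column_families[name] = cf
        return cf


# Hypothetical keyspace with one column family and one row.
ks = Keyspace("demo", replication_factor=3, placement_strategy="SimpleStrategy")
users = ks.create_column_family("users")
users.insert("u1", {"name": "Ada", "email": "ada@example.com"})
print(list(users.rows["u1"]))  # column names, sorted: ['email', 'name']
```

Note that the replication settings live on the keyspace, so every column family inside it shares them.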
Cassandra Data Model Rules
Data Modeling Goals
You should have the following goals while modeling data in Cassandra:
- Spread data evenly around the cluster: Data is distributed to nodes based on a hash of the partition key, which is the first part of the primary key. To give each node of the cluster a roughly equal amount of data, choose a partition key with many distinct values that rows are spread across fairly evenly. The data type of the key matters less than how many distinct values it takes.
- Minimize the number of partitions read while querying data: A partition is a group of records that share the same partition key. A read query may have to collect data from several partitions, and those partitions may be on different nodes.
When a query touches many partitions, all of them need to be visited to collect the result, which makes the read slower. This does not mean that you should avoid creating partitions. If your data is very large, you cannot keep all of it in a single partition, because that partition would become slow to read and write. So you must balance the number of partitions against their size.
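Both goals can be explored with a small Python sketch. It is a simplification under stated assumptions: the node list and the event data are hypothetical, placement uses a hash modulo the node count instead of Cassandra's token ring, and MD5 stands in for the real partitioner.

```python
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster


def node_for(partition_key) -> str:
    """Pick a node from a hash of the partition key (simplified placement)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def rows_per_node(partition_keys) -> Counter:
    """Count how many rows land on each node, one partition key per row."""
    return Counter(node_for(key) for key in partition_keys)


def partitions_read(rows, partition_key_of, matches) -> int:
    """Count the distinct partitions that hold rows matching a query."""
    return len({partition_key_of(row) for row in rows if matches(row)})


# Hypothetical events: one row per (user_id, day).
events = [(user, day) for user in range(1000) for day in range(30)]

# Goal 1: a key with many distinct values spreads rows over the nodes,
# while a key with a single value puts every row on one node.
print(rows_per_node(user for user, _ in events))
print(rows_per_node("US" for _ in events))


# Goal 2: the query "all events of user 7" under two partition key choices.
def is_user_7(row):
    return row[0] == 7


by_user = partitions_read(events, lambda row: row[0], is_user_7)
by_user_and_day = partitions_read(events, lambda row: row, is_user_7)
print(by_user, by_user_and_day)  # 1 partition versus 30 partitions
```

Partitioning by user alone answers the query from one partition, but that partition grows with every event the user produces. Partitioning by user and day keeps each partition small at the cost of reading 30 of them, which is the balance described above.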