This guide provides instructions for setting up and running a Cassandra database using Docker. It covers the creation of the necessary Docker environment, starting the application, and configuring the database. Additionally, it explains the advantages of using Cassandra over other databases and provides insights into Cassandra's architecture and data modeling. ### Environment Configuration
Ensure you have an .env
file in your project directory to manage environment variables. Below is an example of what the .env
file might look like: dotenv Copy code # .env CASSANDRA_VERSION=4.0.7 CASSANDRA_PORT=9042
- Run Cassandra Container:
docker run -p 9042:9042 --rm --name cassandra -d cassandra:4.0.7
- Access the Cassandra Container:
docker exec -it cassandra bash
- Configure Cassandra Inside the Container:
cqlsh -u cassandra -p cassandra
- Create Keyspace:
CREATE KEYSPACE spring_cassandra WITH replication = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
-
Linearly Scalable: Achieves horizontal scalability for higher throughput.
-
Multi-Datacenter Deployment: Facilitates distributed data and replication.
-
High Availability: Ensured by clustering architecture. If a node fails, data retrieval is still possible without a master-slave setup.”
-
Data Modeling: Based on expected query patterns, emphasizing performance and denormalization of data.
-
Cluster Components:
- Datacenter: Location of a cluster deployment.
- Cluster: Collection of nodes.
- Node: Physical or virtual machine.
-
Request Handling:
- Application requests are load-balanced between nodes using a hash function.
- The coordinator node determines the location of the desired node via the snitch protocol.
- Data is replicated to other nodes based on the replication factor.
-
Inter-Node Communication:
- Nodes share state information using the gossip protocol.
- In case of cluster or datacenter failure, Cassandra handles data recovery automatically.
-
Cassandra:
- High availability through clustering.
- No master-slave architecture, avoiding master election delays.
- Sharding is database-managed, reducing application-level complexity.
-
Relational Databases:
- Client-aware sharding can become cumbersome with scale.
- Sharding management often falls on the application team.
- Query-Based Design: Data models are created based on application workflow and query requirements.
- Denormalization: Data is often duplicated across tables to optimize performance.
- Tunable Consistency: Adjust the number of nodes data must be written to for a write operation to be considered successful.
- No Joins: Data is denormalized, influencing table naming conventions to reflect different access patterns.
- AP System: Guarantees availability and partition tolerance.
- Eventual Consistency: Achieved through data replication across nodes.
- No Transactional Support: Provides row-level transaction abilities with all-or-nothing writes.
-
Node Management:
- Nodes have a unified and consistent view of the cluster.
- New nodes are added easily and rebalanced automatically.
-
Failure Detection and Recovery:
- Detected using the gossip protocol.
- Coordinator nodes ensure data availability during node failures.
- Replication Model: Ensures durability through data replication.
- Load Balancing: Operations are distributed among nodes to maximize throughput.
- Eventual Consistency: Balances high availability with consistent data replication.
- Lightweight Transactions: Achieved with the Paxos consensus protocol.
- Batched Writes: Data protection mechanism before committing writes to all nodes.
- Secondary Indexes: Enable querying of previously unqueryable tables, maintaining consistency with local replicas.
https://developers.google.com/identity/openid-connect/openid-connect#java