What is a Distributed Database
- A distributed database is a database that is stored across multiple physical locations (servers or cloud).
- It appears to users as one single database, even though it's spread across sites.
Types of Distributed Databases
- Homogeneous
- All sites use the same DBMS, schema, and OS.
- Easier to manage.
- Heterogeneous
- Sites may use different DBMSs, schemas, and OSs.
- Requires translation layers and is harder to maintain.
Methods of Distribution
- Fragmentation
- Database is split into parts, and each site manages a fragment.
- Useful when local data is accessed more frequently.
- Replication
- Copies of the data (or whole database) are stored at multiple sites.
- Improves reliability and availability, but increases complexity.
The Need for Data Consistency
- Data Consistency ensures that all users see the same data, regardless of which part of the database they access.
- In a distributed system, data consistency is challenging due to:
- Network Latency: Delays in data synchronization across locations.
- Concurrent Access: Multiple users updating the same data simultaneously.
Without data consistency, a user booking a hotel room in one location might find it unavailable in another, leading to errors and customer dissatisfaction.
The Role of ACID in Distributed Databases
- ACID stands for Atomicity, Consistency, Isolation, and Durability.
- These properties ensure reliable transaction processing, even in distributed environments.
Atomicity
- Atomicity guarantees that a transaction is all-or-nothing.
- In a distributed database, this means:
- If a transaction updates data in multiple locations, either all updates succeed, or none do.
In a hotel booking system, if a payment is processed but the room reservation fails, atomicity ensures the payment is rolled back.
Consistency
- Consistency ensures that a transaction brings the database from one valid state to another.
- In distributed systems, this requires:
- Synchronizing updates across all locations.
After a room is booked, all database copies should reflect the updated availability.
Isolation
- Isolation ensures that concurrent transactions do not interfere with each other.
- Techniques like locking and timestamping are used to maintain isolation.
Two users booking the last available room simultaneously should not both succeed.
Durability
- Durability guarantees that once a transaction is committed, it remains so, even in the event of a system failure.
- This is achieved through replication and backup strategies.
If a power outage occurs after a booking is confirmed, the reservation should still be recorded when the system restarts.
Key Features of Distributed Databases
Concurrency Control
- Concurrency Control ensures that multiple transactions can occur simultaneously without causing data inconsistencies.
- Two main approaches:
- Pessimistic Concurrency Control (PCC): Locks resources to prevent conflicts.
- Optimistic Concurrency Control (OCC): Assumes conflicts are rare and checks for them at commit time.
Use PCC in high-conflict environments and OCC when conflicts are unlikely.
Data Consistency Models
- Strong Consistency: Updates are immediately visible across all locations.
- Eventual Consistency: Updates propagate over time, allowing temporary inconsistencies.
- Causal Consistency: Preserves the order of causally related updates.
Strong consistency is ideal for financial transactions, while eventual consistency suits social media updates.
Data Partitioning
- Data Partitioning divides the database into smaller, manageable sections.
- Common partitioning methods:
- Range Partitioning: Based on value ranges (e.g., dates).
- Hash Partitioning: Uses a hash function to distribute data.
- List Partitioning: Based on specific values (e.g., regions).