What is a Distributed Database
- A distributed database is a database that is stored across multiple physical locations (servers or cloud).
- It appears to users as one single database, even though it's spread across sites.
Types of Distributed Databases
- Homogeneous
- All sites use the same DBMS, schema, and OS.
- Easier to manage.
- Heterogeneous
- Sites may use different DBMSs, schemas, and OSs.
- Requires translation layers and is harder to maintain.
Methods of Distribution
- Fragmentation
- Database is split into parts, and each site manages a fragment.
- Useful when local data is accessed more frequently.
- Replication
- Copies of the data (or whole database) are stored at multiple sites.
- Improves reliability and availability, but increases complexity.
The Need for Data Consistency
- Data Consistency ensures that all users see the same data, regardless of which part of the database they access.
- In a distributed system, data consistency is challenging due to:
- Network Latency: Delays in data synchronization across locations.
- Concurrent Access: Multiple users updating the same data simultaneously.
Without data consistency, a user booking a hotel room in one location might find it unavailable in another, leading to errors and customer dissatisfaction.
The Role of ACID in Distributed Databases
- ACID stands for Atomicity, Consistency, Isolation, and Durability.
- These properties ensure reliable transaction processing, even in distributed environments.
Atomicity
- Atomicity guarantees that a transaction is all-or-nothing.
- In a distributed database, this means:
- If a transaction updates data in multiple locations, either all updates succeed, or none do.
In a hotel booking system, if a payment is processed but the room reservation fails, atomicity ensures the payment is rolled back.
Consistency
- Consistency ensures that a transaction brings the database from one valid state to another.
- In distributed systems, this requires:
- Synchronizing updates across all locations.
After a room is booked, all database copies should reflect the updated availability.
Isolation
- Isolation ensures that concurrent transactions do not interfere with each other.
- Techniques like locking and timestamping are used to maintain isolation.
Two users booking the last available room simultaneously should not both succeed.
Durability
- Durability guarantees that once a transaction is committed, it remains so, even in the event of a system failure.
- This is achieved through replication and backup strategies.
If a power outage occurs after a booking is confirmed, the reservation should still be recorded when the system restarts.
Key Features of Distributed Databases
Concurrency Control
- Concurrency Control ensures that multiple transactions can occur simultaneously without causing data inconsistencies.
- Two main approaches:
- Pessimistic Concurrency Control (PCC): Locks resources to prevent conflicts.
- Optimistic Concurrency Control (OCC): Assumes conflicts are rare and checks for them at commit time.
Use PCC in high-conflict environments and OCC when conflicts are unlikely.
Data Consistency Models
- Strong Consistency: Updates are immediately visible across all locations.
- Eventual Consistency: Updates propagate over time, allowing temporary inconsistencies.
- Causal Consistency: Preserves the order of causally related updates.
Strong consistency is ideal for financial transactions, while eventual consistency suits social media updates.
Data Partitioning
- Data Partitioning divides the database into smaller, manageable sections.
- Common partitioning methods:
- Range Partitioning: Based on value ranges (e.g., dates).
- Hash Partitioning: Uses a hash function to distribute data.
- List Partitioning: Based on specific values (e.g., regions).
- Round-Robin Partitioning: Evenly distributes data across partitions.
Partitioning improves performance by reducing the amount of data each query must process.
Data Security
- Data Security is critical in distributed systems due to multiple access points.
- Key security measures:
- Authentication: Verifying user identities.
- Data Encryption: Protecting data in transit and at rest.
- Validated Input: Ensuring data meets predefined rules.
Always use encryption to protect sensitive data, especially when transmitting across networks.
Distribution Transparency
- Distribution Transparency makes the distributed system appear as a single database to users.
- Types of transparency:
- Access Transparency: Users interact with data the same way, regardless of location.
- Location Transparency: Data can be accessed without knowing its physical location.
- Replication Transparency: Users are unaware of data copies across sites.
- Failure Transparency: The system continues to function despite component failures.
- Concurrency Transparency: Multiple users can access data safely.
- Think of distribution transparency like using a cloud storage service.
- You can access your files from any device without knowing where they are physically stored.
Fault Tolerance
- Fault Tolerance ensures the system remains operational despite failures.
- Achieved through:
- Redundancy: Storing multiple copies of data.
- Automatic Failover: Switching to backup systems when a failure occurs.
Fault tolerance is essential for critical systems like banking and healthcare, where downtime is unacceptable.
Global Query Processing
- Global Query Processing optimizes queries across the entire distributed system.
- Involves:
- Local Optimization: Processing queries within individual partitions.
- Global Optimization: Combining results from all partitions.
Design queries to minimize data transfer between locations, reducing latency and improving performance.
Replication
- Replication involves storing copies of data at multiple sites.
- Types of replication:
- Full Replication: Entire database is copied to every site.
- Partial Replication: Only critical data is replicated.
- No Replication: Each site stores unique data.
- Replication Models:
- Master–Slave: One master processes all changes; others follow.
- Multi-Master: All nodes can accept updates and sync with others.
- Peer-to-Peer: All nodes can act as both master and slave.
- Single-Source: One authoritative source replicates to others.
Full replication ensures high availability but increases complexity in maintaining consistency.
Scalability
- Scalability allows the system to grow by adding more nodes or partitions.
- Distributed databases are inherently scalable, making them ideal for large, dynamic environments.
- When designing a distributed database, prioritize features based on your specific use case.
- For example, a financial system may prioritize strong consistency, while a social media platform might focus on scalability and eventual consistency.
Advantages of Distributed Databases
- Faster performance through local access
- Easier to scale by adding sites
- Improved fault tolerance and reliability
- Easier data sharing and local autonomy
Disadvantages of Distributed Databases
- More complex to manage and design
- Requires strong consistency and concurrency control
- Harder to secure due to multiple access points
- Translation issues with heterogeneous systems
Real-World Applications
- Hotel chains with local hotel data and central reporting
- Military and air traffic control systems
- Corporate systems with global departments
- Pharmaceutical companies tracking treatments globally
- Explain the difference between fragmentation and replication in distributed databases.
(Hint: Think about how data is divided vs duplicated across sites.) - What are the advantages and disadvantages of using full replication across all nodes in a distributed system?
- Why is the two-phase commit protocol important when maintaining ACID properties in distributed transactions?
- Describe how transparency (e.g. access, location, failure) improves the user experience in a distributed database system.