Key Features of the Web Graph
- The web graph is a directed graph, meaning the edges have a direction.
- If page A links to page B, there is an edge from A to B, but not necessarily from B to A.
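One common way to represent a directed graph in code is an adjacency list. The sketch below is illustrative only: the page names and the has_edge helper are made up for this example.

```python
# Directed graph as an adjacency list: each key links to the pages in its list.
# The page names "A" and "B" are hypothetical.
web_graph = {
    "A": ["B"],  # A links to B
    "B": [],     # B has no outgoing links
}

def has_edge(graph, source, target):
    """Return True if `source` links directly to `target`."""
    return target in graph.get(source, [])

print(has_edge(web_graph, "A", "B"))  # True
print(has_edge(web_graph, "B", "A"))  # False
```

Because the edge is stored only under its source page, the link from A to B does not imply a link from B to A.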
Bowtie Structure
The bowtie structure is a model that describes the overall shape of the web graph. It divides the web into several distinct regions:
- Strongly Connected Core (SCC): A large central component where every page can be reached from any other page via a series of hyperlinks.
- IN Component: Pages that can reach the SCC but cannot be reached from it.
- OUT Component: Pages that can be reached from the SCC but cannot reach it.
- Tendrils: Pages that hang off the IN or OUT components without any path to or from the SCC. They can be reached from IN, or can reach OUT, but never pass through the core.
- Disconnected Components: Pages that are completely isolated from the main structure.
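The regions above can be sketched in code, assuming we already know one page that lies in the core (the seed). Pages the seed can reach and that can also reach the seed form the SCC; pages that can only reach it form IN; pages that can only be reached from it form OUT. The function names, the page names, and the "OTHER" bucket (which lumps tendrils and disconnected pages together) are assumptions made for this illustration, not part of the bowtie model itself.

```python
from collections import deque

def reachable(graph, start):
    """Return every node reachable from `start` by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return a copy of the graph with every edge flipped."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def bowtie(graph, seed):
    """Classify pages relative to the strongly connected region containing `seed`."""
    forward = reachable(graph, seed)            # pages the seed can reach
    backward = reachable(reverse(graph), seed)  # pages that can reach the seed
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": set(graph) - forward - backward,  # tendrils and disconnected pages
    }

# Hypothetical six-page web; every page appears as a key.
example_graph = {
    "core1": ["core2", "out1"],
    "core2": ["core1"],
    "in1": ["core1", "tendril1"],
    "out1": [],
    "tendril1": [],
    "island1": [],
}

regions = bowtie(example_graph, seed="core1")
for name, pages in regions.items():
    print(name, sorted(pages))
```

Here tendril1 is reachable from the IN page but has no path to or from the core, so it lands in the "OTHER" bucket alongside the isolated island1.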
Consider a university website:
- The homepage is likely part of the SCC: it links to many core pages, and those pages link back, so paths run in both directions.
- An old event page might be in the OUT component if it is linked from the homepage but has no links leading back into the SCC.
- A personal blog that links to the university but isn't linked from any page in the SCC would be in the IN component.
Strongly Connected Core (SCC)
The SCC is the heart of the web graph. It has several important properties:
- Mutual Reachability: Any page in the SCC can be reached from any other page within the SCC.
- High Connectivity: The SCC tends to contain many of the web's most important and frequently visited pages, since such pages both attract and provide many links.
- Stability: The SCC is relatively stable over time, even as the web grows and changes.
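Mutual reachability is what defines a strongly connected component in graph theory, and the core is the largest such component. As a minimal sketch, the code below uses Kosaraju's algorithm (one standard technique, chosen here for illustration) to find all strongly connected components of a small graph and pick the largest. The graph and its page names are hypothetical, and the code assumes every page appears as a key.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm: return a list of components, each a set of nodes."""
    # Pass 1: iterative DFS, recording nodes in order of finishing time.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph (every edge flipped).
    reversed_graph = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            reversed_graph[dst].append(src)

    # Pass 2: explore the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in reversed_graph[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

# Hypothetical graph: A, B and C form a cycle; D is only linked to.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}

components = strongly_connected_components(graph)
core = max(components, key=len)
print(sorted(core))  # ['A', 'B', 'C']
```

Page D is reachable from the cycle but cannot reach it, so it forms its own single-page component and stays outside the core.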