Web Crawling
- The first step in accessing information is web crawling.
- Web crawlers (also known as spiders or bots) are automated programs that:
  - Start from a set of seed URLs.
  - Visit these pages.
  - Follow the hyperlinks to discover new pages.
This process continues recursively, building a comprehensive map of the web.
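Below is a minimal sketch of this crawling loop in Python. It assumes the third-party requests and beautifulsoup4 libraries are available, and it leaves out politeness features (robots.txt checks, rate limiting, content deduplication) that real crawlers need.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl starting from a set of seed URLs."""
    frontier = deque(seed_urls)      # URLs waiting to be visited
    visited = set()                  # URLs already fetched
    web_graph = {}                   # page -> list of outgoing links

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                 # skip unreachable pages
        visited.add(url)

        # Extract hyperlinks and queue the ones we have not seen yet.
        soup = BeautifulSoup(response.text, "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        links = [l for l in links if urlparse(l).scheme in ("http", "https")]
        web_graph[url] = links
        for link in links:
            if link not in visited:
                frontier.append(link)

    return web_graph
```

The frontier queue makes this a breadth-first traversal: each discovered link is visited in turn, which is how the recursive "follow the hyperlinks" step plays out in practice.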
Indexing
- As crawlers visit pages, they collect and store information about the content and structure of each page.
- This data is then indexed, creating a searchable database that allows search engines to quickly retrieve relevant results for user queries.
- When a crawler visits a webpage, it’s like a librarian receiving a new book.
- The librarian reads the book’s contents (title, author, keywords, chapters).
- This information is then entered into a catalogue.
- When a reader asks for “books on climate change,” the librarian doesn’t flip through every book on the shelf; instead, they quickly check the catalogue (the index) and retrieve the relevant titles.
- In the same way, search engines don’t scan the entire web each time a query is made.
- Instead, they rely on the index, a structured database of terms and their locations, to find results in milliseconds.
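The sketch below shows the core idea in Python: an inverted index that maps each term to the pages containing it, queried with simple AND semantics. The example pages and the whitespace tokenisation are illustrative assumptions; real engines use far more sophisticated parsing, ranking, and storage.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index: term -> set of page URLs containing it.
    `pages` maps a URL to that page's extracted text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return pages containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Toy example: two "pages" and a query, mirroring the librarian analogy.
pages = {
    "https://example.com/a": "Climate change and rising sea levels",
    "https://example.com/b": "A field guide to library catalogues",
}
index = build_index(pages)
print(search(index, "climate change"))   # {'https://example.com/a'}
```

Because the lookup touches only the query terms' entries rather than every page, answering a query stays fast no matter how many pages have been indexed.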
PageRank Algorithm
- One of the most influential algorithms developed for search engines is PageRank, created by Google's founders, Larry Page and Sergey Brin.
- PageRank uses the web graph to evaluate the importance of web pages based on their link structure.
How PageRank Works
- Basic Idea:
- A page is considered important if it is linked to by other important pages.
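The following is a simplified power-iteration sketch of this idea. The damping factor of 0.85, the fixed iteration count, and the handling of dangling pages are common illustrative choices, not a description of Google's production system.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores for a small link graph by power iteration.
    `graph` maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}   # start with uniform scores

    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for p in pages:
                    new_ranks[p] += share
            else:
                # A page passes its importance to the pages it links to.
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    if target in new_ranks:
                        new_ranks[target] += share
        ranks = new_ranks
    return ranks


# A tiny web graph: C is linked to by A, B, and D, so it ends up most important.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Running the example shows the basic idea in action: C receives links from several pages, including the relatively important A, so its score rises above the others, while D, which nothing links to, keeps only the baseline share.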