Web Crawling
- The first step in accessing information is web crawling.
- Web crawlers (also known as spiders or bots) are automated programs that:
  - Start from a set of seed URLs.
  - Visit these pages.
  - Follow the hyperlinks to discover new pages.
This process continues recursively, building a comprehensive map of the web.
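Below is a minimal sketch of this crawling loop in Python. It assumes the third-party requests and beautifulsoup4 libraries are available, and it leaves out politeness features (robots.txt checks, rate limiting, content deduplication) that real crawlers need.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl starting from a set of seed URLs."""
    frontier = deque(seed_urls)      # URLs waiting to be visited
    visited = set()                  # URLs already fetched
    web_graph = {}                   # page -> list of outgoing links

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                 # skip unreachable pages
        visited.add(url)

        # Extract hyperlinks and queue the ones we have not seen yet.
        soup = BeautifulSoup(response.text, "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        links = [l for l in links if urlparse(l).scheme in ("http", "https")]
        web_graph[url] = links
        for link in links:
            if link not in visited:
                frontier.append(link)

    return web_graph
```

The frontier queue makes this a breadth-first traversal: each discovered link is visited in turn, which is how the recursive "follow the hyperlinks" step plays out in practice.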
Indexing
- As crawlers visit pages, they collect and store information about the content and structure of each page.
- This data is then indexed, creating a searchable database that allows search engines to quickly retrieve relevant results for user queries.
- When a crawler visits a webpage, it’s like a librarian receiving a new book.
- The librarian reads the book’s contents (title, author, keywords, chapters).
- This information is then entered into a catalogue.
- When a reader asks for “books on climate change,” the librarian doesn’t flip through every book on the shelf; instead, they quickly check the catalogue (the index) and retrieve the relevant titles.
- In the same way, search engines don’t scan the entire web each time a query is made.
- Instead, they rely on the index, a structured database of terms and their locations, to find results in milliseconds.
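The sketch below shows the core idea in Python: an inverted index that maps each term to the pages containing it, queried with simple AND semantics. The example pages and the whitespace tokenisation are illustrative assumptions; real engines use far more sophisticated parsing, ranking, and storage.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index: term -> set of page URLs containing it.
    `pages` maps a URL to that page's extracted text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return pages containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Toy example: two "pages" and a query, mirroring the librarian analogy.
pages = {
    "https://example.com/a": "Climate change and rising sea levels",
    "https://example.com/b": "A field guide to library catalogues",
}
index = build_index(pages)
print(search(index, "climate change"))   # {'https://example.com/a'}
```

Because the lookup touches only the query terms' entries rather than every page, answering a query stays fast no matter how many pages have been indexed.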
PageRank Algorithm
- One of the most influential algorithms developed for search engines is PageRank, created by Google's founders, Larry Page and Sergey Brin.
- PageRank uses the web graph to evaluate the importance of web pages based on their link structure.
How PageRank Works
- Basic Idea:
- A page is considered important if it is linked to by other important pages.
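The following is a simplified power-iteration sketch of this idea. The damping factor of 0.85, the fixed iteration count, and the handling of dangling pages are common illustrative choices, not a description of Google's production system.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores for a small link graph by power iteration.
    `graph` maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}   # start with uniform scores

    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for p in pages:
                    new_ranks[p] += share
            else:
                # A page passes its importance to the pages it links to.
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    if target in new_ranks:
                        new_ranks[target] += share
        ranks = new_ranks
    return ranks


# A tiny web graph: C is linked to by A, B, and D, so it ends up most important.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Running the example shows the basic idea in action: C receives links from several pages, including the relatively important A, so its score rises above the others, while D, which nothing links to, keeps only the baseline share.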