Preface#
Since I started working full time, I no longer have much time to browse "Juejin"; most of the time I can only lie in bed on weekends and skim the hot list. The trouble is that the hot list is always changing, so its history is lost, and we can easily miss some excellent articles. So, is there a way to record "Juejin"'s historical hot lists with a single command? This is where Puppeteer comes in.
Brief Introduction#
Puppeteer is a Node.js library developed and maintained by Google. It provides a high-level API for automating web page operations through controlling a Headless Chrome (a Chrome browser without a user interface). It can be used to perform various tasks, including taking screenshots, generating PDFs, crawling data, automating form filling and interaction on web pages, etc.
Let's dwell on Headless Chrome for a moment, since the code later relies on this concept. A headless browser is a web browser without a visible user interface: it runs in the background and performs page operations and browsing behaviors without a graphical UI. A traditional browser provides an interface for the user to interact with, such as entering a URL to navigate or clicking a submit button to log in or register; a headless browser automates those same operations in the background, without windows popping up on screen. This lets developers script all kinds of tasks, such as:
- Test automation in web applications
- Taking web page screenshots
- Running automated tests on JavaScript libraries
- Collecting website data
- Automating web page interactions
Next, I will demonstrate how to quickly get started with Puppeteer and obtain information about the hot list articles on "Juejin" with just one command.
Configure Puppeteer#
The configuration here is actually just following the quick start guide in the official documentation. I will briefly go through it here.
Here, we can install Puppeteer directly:

```shell
npm i puppeteer
```
One thing to note is that we can configure Puppeteer by creating a `puppeteer.config.cjs` file:

```js
const { join } = require('path');

/**
 * @type {import("puppeteer").Configuration}
 */
module.exports = {
  // Changes the cache location for Puppeteer.
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};
```
After creating this configuration file and running the installation command again, you will see a new `.cache` folder containing a number of binary files. This is a startup-speed optimization: when Puppeteer is installed (before its first use), it downloads the Chrome binary that matches your operating system into the `.cache` folder, so subsequent launches do not need to download anything, which improves startup speed.
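As a side note, if I read the configuration documentation correctly, the cache location can also be set through the `PUPPETEER_CACHE_DIR` environment variable instead of a config file; treat the exact variable name as an assumption to verify against the docs for your Puppeteer version:

```shell
# Assumed equivalent of the cacheDirectory option in puppeteer.config.cjs:
# point Puppeteer's download cache at a project-local folder during install.
PUPPETEER_CACHE_DIR="$(pwd)/.cache/puppeteer" npm i puppeteer
```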
Getting Started with Puppeteer#
Now, let's create a `test.js` file and enter the following content. I will explain each line:
```js
// Import the Puppeteer library so that we can use its functions.
// ESM syntax also works here: import puppeteer from 'puppeteer';
const puppeteer = require('puppeteer');

(async () => {
  // Launch a Chrome browser instance. By default, Puppeteer launches in
  // headless mode, i.e. the same as: puppeteer.launch({ headless: true }).
  const browser = await puppeteer.launch();
  // Create a new page object.
  const page = await browser.newPage();
  // Navigate to the specified URL, simulating typing an address into the bar.
  await page.goto('https://example.com');
  // Take a screenshot of the current page. Note: to keep content consistent
  // across devices, the default viewport size is 800x600.
  await page.screenshot({ path: 'example.png' });
  // Close Chrome.
  await browser.close();
})();
```
After running the `node .\test.js` command, if you get the following image, it means you have succeeded:
Great! Now you have started using Puppeteer and mastered the most basic operations. Next, we will implement the requirement mentioned above.
Implementing the Functionality#
First, we need to know where the article titles and links live on the "Juejin" hot list page. More precisely, it is not enough for us to know their positions; we have to let Puppeteer know them. Puppeteer exposes the selector method `Page.$$()`, which runs `document.querySelectorAll` in the page, and the DevTools console offers the same query as the shorthand `$$()`. So open the console on the hot list page and enter `$$('a')`: you get a pile of `<a>` tags, which is obviously not yet what we want.
Now, we need to narrow down the scope and place the mouse on the hot list, then right-click and select "Inspect".
Now we can quickly locate this part of the content in the console. Change the selector to `$$('.hot-list>a')` and we get the link elements. Getting the titles follows the same principle, with a slight modification: `$$('.article-title').map(x => x.innerText)` returns the titles of the "Juejin" hot list.
Pitfalls#
If we run the code right away, there is a high probability we will get `[]`. This touches on an important point: web pages take time to load.
Here, we turn off the default headless mode and change the launch call to:

```js
const browser = await puppeteer.launch({ headless: false });
```
Running the code again, we can watch the browser and see that our script finishes its work before the page has fully loaded. We therefore need to set `waitUntil` or delay the script. Modify the code as follows:
```js
await page.goto("https://juejin.cn/hot/articles", {
  waitUntil: "domcontentloaded",
});
await page.waitForTimeout(2000);
```
However, the documentation points out that `page.waitForTimeout` is deprecated and recommends `Frame.waitForSelector` instead: it waits for an element matching the given selector to appear in the frame before the code proceeds, which is more efficient than delaying execution by a fixed amount. For now, let's leave it like this; I will provide the complete code later.
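To make the difference concrete, here is a minimal, Puppeteer-free sketch of the idea behind `waitForSelector`: poll for a condition and resolve as soon as it holds, instead of always sleeping for a fixed duration. The `waitFor` helper below is hypothetical, for illustration only.

```javascript
// Hypothetical helper: resolve when predicate() becomes true, checking
// every `interval` ms, and reject if `timeout` ms pass without success.
function waitFor(predicate, timeout = 2000, interval = 50) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const timer = setInterval(() => {
      if (predicate()) {
        clearInterval(timer);
        resolve();
      } else if (Date.now() - start > timeout) {
        clearInterval(timer);
        reject(new Error("waitFor: timed out"));
      }
    }, interval);
  });
}

// Usage: the promise settles ~100 ms in, rather than after a full 2 s sleep.
let loaded = false;
setTimeout(() => { loaded = true; }, 100);
waitFor(() => loaded).then(() => console.log("condition met"));
```

This is exactly why condition-based waiting beats a fixed `waitForTimeout(2000)`: on a fast connection you stop waiting early, and on a slow one you do not give up too soon.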
Completing and Optimizing the Functionality#
Once we can successfully scrape the content, we need to save it to a local file. Here we introduce Node.js's file system module to write out what we scraped:
```js
import puppeteer from "puppeteer";
import fs from "fs";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://juejin.cn/hot/articles", {
    waitUntil: "domcontentloaded",
  });
  await page.waitForTimeout(2000);
  const hotList = await page.$$eval(".article-title[data-v-cfcb8fcc]", (titles) => {
    return titles.map((x) => x.innerText);
  });
  console.log(hotList);
  // Save the article titles to a text file
  fs.writeFile("titles.txt", hotList.join("\n"), (err) => {
    if (err) throw err;
    console.log("The article titles have been saved to the titles.txt file");
  });
  await browser.close();
})();
```
Now we have all the article titles. But titles alone are not enough: if we still have to search for each article by hand on the weekend, that is a chore. We need to save both the title and the link of every article. Since each title element sits inside an `<a>` tag, we can use `closest("a").href` to get the link:
```js
const articleList = await page.$$eval(
  ".article-title[data-v-cfcb8fcc]",
  (articles) => {
    return articles.map((article) => ({
      title: article.innerText,
      // The title element sits inside an <a>, so walk up to it for the href.
      link: article.closest("a").href,
    }));
  }
);
console.log(articleList);
// Save the article titles and links to a text file
const formattedData = articleList.map(
  (article) => `${article.title} - ${article.link}`
);
fs.writeFile("articles.txt", formattedData.join("\n"), (err) => {
  if (err) throw err;
  console.log("The article titles and links have been saved to the articles.txt file");
});
```
Great! But now we find that running this script again the next day overwrites the previous day's file, which won't do. We need to separate the hot list by date, so that each day gets its own file. We also add the behavior mentioned earlier of waiting for an element matching the given selector to appear in the frame. The final code is as follows:
```js
import puppeteer from "puppeteer";
import fs from "fs";

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://juejin.cn/hot/articles", {
    waitUntil: "domcontentloaded",
  });
  // Build the file name from the current date
  const currentDate = new Date().toLocaleDateString();
  const fileName = `${currentDate.replace(/\//g, "-")}.txt`;
  // Wait until the hot list entries are actually in the DOM
  await page.waitForSelector(".article-title[data-v-cfcb8fcc]");
  const articleList = await page.$$eval(
    ".article-title[data-v-cfcb8fcc]",
    (articles) => {
      return articles.map((article) => ({
        title: article.innerText,
        link: article.closest("a").href,
      }));
    }
  );
  console.log(articleList);
  const formattedData = articleList.map(
    (article) => `${article.title} - ${article.link}`
  );
  fs.writeFile(fileName, formattedData.join("\n"), (err) => {
    if (err) throw err;
    console.log(`The article titles and links have been saved to the file: ${fileName}`);
  });
  await browser.close();
})();
```
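One detail worth hedging in the final code: `toLocaleDateString()` depends on the machine's locale ("7/4/2024" vs "04/07/2024"), so the generated file names may differ across machines and will not sort chronologically. A small sketch of a locale-independent alternative, using an ISO-style `YYYY-MM-DD` name (the `hotListFileName` helper is my own, not part of the original script):

```javascript
// Build a stable, sortable file name for a given day's hot list.
// toISOString() always yields UTC in the form YYYY-MM-DDTHH:mm:ss.sssZ,
// so the first 10 characters are a locale-independent date.
function hotListFileName(date = new Date()) {
  return `${date.toISOString().slice(0, 10)}.txt`;
}

console.log(hotListFileName(new Date("2024-07-04T12:00:00Z"))); // "2024-07-04.txt"
```

Swapping this in for the `currentDate`/`fileName` lines keeps one file per day while making the names sort naturally in a directory listing.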
After running the code, you will get the following content:
Summary#
As a Node.js library developed and maintained by the Google team, Puppeteer greatly facilitates various automation operations. Just imagine, in the future, you only need to run a simple node command to store the current hot list articles and information. Isn't it great? 🐱
However, web scraping is just one of Puppeteer's many features. As the official documentation says, it can also be used to automate form submission, run UI tests, capture a timeline trace of a site, and crawl SPAs to produce pre-rendered content (I will write an article about front-end first-screen optimization in the future).