In this article, I'll walk you through our process of creating a Chrome extension to scrape comments from LinkedIn posts. This project emerged from a need to analyze engagement on LinkedIn content more effectively, and while the journey had its challenges, we ultimately developed a functional solution that extracts meaningful data.
The Initial Challenge
LinkedIn's dynamic interface doesn't make it easy to extract comments at scale. Whether you're conducting social media analysis, gathering feedback on company announcements, or researching professional discourse, manually copying comments is impractical for posts with dozens or hundreds of responses.
Our goal was to build a browser extension that could:
- Load all comments on a LinkedIn post
- Extract the comment text along with author information
- Save the data in a structured format for analysis
The full code is available in the project's repository.
Setting Up the Chrome Extension
We started by creating a basic Chrome extension structure with these files:
- manifest.json - Configuration file for the extension
- popup.html - The user interface for our extension
- popup.js - The script that handles user interactions and communicates with the content script
Our manifest.json defined the necessary permissions:
{
  "manifest_version": 3,
  "name": "LinkedIn Comments Scraper",
  "version": "1.0",
  "description": "Extract comments from LinkedIn posts",
  "action": {
    "default_popup": "popup.html",
    "default_icon": {
      "16": "icons/icon16.png",
      "48": "icons/icon48.png",
      "128": "icons/icon128.png"
    }
  },
  "permissions": [
    "activeTab",
    "scripting",
    "downloads"
  ],
  "host_permissions": [
    "https://*.linkedin.com/*"
  ]
}
The popup interface was kept simple: a button to trigger the scraping and a checkbox to enable auto-loading of all comments:
<button id="scrapeButton">Scrape Comments</button>
<div class="option">
  <label>
    <input type="checkbox" id="autoLoadComments" checked>
    Auto-load all comments
  </label>
</div>

First Roadblock: Injecting the Content Script
Our first challenge came when we tried to execute the script on the LinkedIn page. We initially used a background script approach, but ran into issues with Manifest V3 limitations. After several attempts, we simplified our approach to directly inject the script from the popup:
const results = await chrome.scripting.executeScript({
  target: { tabId: tab.id },
  func: scrapeLinkedInPost,
  args: [autoLoadComments]
});

This approach worked better, though we still encountered syntax errors and had to ensure our script was well-formatted.
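For context, here's a minimal sketch of how that call sits in popup.js. The click handler and the chrome.tabs.query lookup are our assumptions about the surrounding code, not an excerpt from the project:

document.getElementById('scrapeButton').addEventListener('click', async () => {
  // Read the user's auto-load preference from the popup checkbox
  const autoLoadComments = document.getElementById('autoLoadComments').checked;

  // Find the active tab so we know where to inject the scraper
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

  // Inject and run the scraping function directly in the LinkedIn page
  const results = await chrome.scripting.executeScript({
    target: { tabId: tab.id },
    func: scrapeLinkedInPost,
    args: [autoLoadComments]
  });

  console.log('Scrape result:', results[0]?.result);
});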
Building the Comment Scraper Logic
The heart of our extension was the scrapeLinkedInPost function. This function had several key components:
1. Auto-scrolling the Page
LinkedIn loads comments dynamically as you scroll, so we implemented an auto-scroll function:
async function autoScroll() {
  return new Promise((resolve) => {
    const maxScrolls = 20;
    let scrollCount = 0;
    let lastHeight = document.body.scrollHeight;

    const timer = setInterval(() => {
      window.scrollBy(0, 800);
      scrollCount++;

      // Check if we've reached the bottom
      setTimeout(() => {
        const newHeight = document.body.scrollHeight;
        if (newHeight === lastHeight && scrollCount > 3) {
          clearInterval(timer);
          resolve();
        }
        lastHeight = newHeight;
      }, 300);

      // Safety cap: stop after maxScrolls even if the page keeps growing.
      // resolve() may fire both here and in the timeout above; calling it
      // on an already-settled Promise is a harmless no-op.
      if (scrollCount >= maxScrolls) {
        clearInterval(timer);
        resolve();
      }
    }, 600);
  });
}
2. Finding "Load More Comments" Buttons
We needed to click "Load More Comments" buttons to expand the comment section fully:
// Helper function to find buttons by text content
function findButtonsByText(text) {
  const allButtons = document.querySelectorAll('button');
  return Array.from(allButtons).filter(button =>
    button.textContent &&
    button.textContent.toLowerCase().includes(text.toLowerCase())
  );
}
// Using the function to find comment loading buttons
const textButtons1 = findButtonsByText("Load more comments");
const textButtons2 = findButtonsByText("Show more comments");

Each matched button then had to be clicked, with a short pause so newly loaded comments could render before we searched again; a sketch of that loop appears in the Final Implementation section below.

3. Finding Comment Elements
LinkedIn's DOM structure is complex and can change, so we used multiple selectors to identify comments:
const commentSelectors = [
  '.comments-comment-item',
  '[data-test-id^="comments-comment-"]',
  '.scaffold-finite-scroll__content > div',
  '.comments-comments-list > div'
];
let allCommentElements = [];
for (const selector of commentSelectors) {
  try {
    const elements = document.querySelectorAll(selector);
    if (elements.length > 0) {
      allCommentElements = [...allCommentElements, ...Array.from(elements)];
    }
  } catch (e) {
    console.log('Error with comment selector:', selector, e);
  }
}
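Each collected element then had to be parsed into an author, profile URL, and comment text. The selectors in this sketch are illustrative assumptions rather than LinkedIn's actual, stable markup; the real script probed several fallbacks per field:

function extractComment(element) {
  // Placeholder selectors: LinkedIn's class names change frequently,
  // so each field should be probed with multiple fallbacks
  const authorEl = element.querySelector('.comments-post-meta__name-text, a[href*="/in/"]');
  const textEl = element.querySelector('.comments-comment-item__main-content');
  const profileLink = element.querySelector('a[href*="/in/"]');

  return {
    author: authorEl ? authorEl.textContent.trim() : 'Unknown',
    profileUrl: profileLink ? profileLink.href : null,
    text: textEl ? textEl.textContent.trim() : ''
  };
}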
Major Challenges and Solutions
Challenge 1: Invalid CSS Selectors
We initially used jQuery-style :contains() selectors, which aren't supported in standard DOM APIs:
// This doesn't work in standard JavaScript
'button:contains("Load more comments")'

Solution: We created a custom function to find elements by their text content:
function findButtonsByText(text) {
  const allButtons = document.querySelectorAll('button');
  return Array.from(allButtons).filter(button =>
    button.textContent &&
    button.textContent.toLowerCase().includes(text.toLowerCase())
  );
}
Challenge 2: Duplicate Comments
Our initial implementation picked up profile elements as comments and created duplicate entries.
Solution: We improved our comment processing logic:
- Added filtering to exclude profile sections
- Created a tracking system using Map to prevent duplicates
- Extracted profile titles into a separate field
// processedCommentKeys and processedComments are initialized once,
// before the loop over comment elements:
//   const processedCommentKeys = new Map();
//   const processedComments = [];

// Create a unique key for this comment to avoid duplicates
const commentStart = comment.text.substring(0, 30);
const commentKey = `${comment.author}:${commentStart}`;

// Check if we've seen this comment before
if (!processedCommentKeys.has(commentKey)) {
  processedCommentKeys.set(commentKey, true);
  processedComments.push(comment);
}
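The profile-section filtering from the first bullet isn't shown above. Here's a minimal sketch of the kind of heuristic involved (the class names are placeholders, not an excerpt from the project):

function looksLikeProfileSection(element) {
  // Skip anything inside the post author's header, or anything that
  // contains the reply editor instead of comment text (selectors assumed)
  if (element.closest('.feed-shared-actor')) return true;
  if (element.querySelector('.comments-comment-texteditor')) return true;
  return false;
}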
Challenge 3: Download Mechanism
We ran into issues with URL.createObjectURL in the extension context.
Solution: We used a data URI approach instead:
// jsonString is the serialized scrape result, e.g. JSON.stringify(result, null, 2)
chrome.downloads.download({
  url: 'data:application/json;charset=utf-8,' + encodeURIComponent(jsonString),
  filename: `linkedin_post_${Date.now()}.json`,
  saveAs: true
});
Final Implementation and Testing
After resolving these challenges, we had a working extension that successfully scraped LinkedIn post comments. Our process for each scrape was:
- User clicks the "Scrape Comments" button in the extension popup
- The script is injected into the current LinkedIn post page
- The page is scrolled and "Load more comments" buttons are clicked
- Comment elements are identified and processed
- A JSON file is generated and downloaded with the post data and comments
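Putting it all together, the process above maps onto the injected scrapeLinkedInPost function roughly as sketched below. This is a simplified reconstruction, not the exact source: the sleep helper, the loop bound, and collectAndProcessComments (a stand-in for the selector and deduplication logic shown earlier) are our assumptions. Note that because chrome.scripting.executeScript serializes the injected function, the helpers all have to be defined inside it rather than referenced from the popup's scope.

async function scrapeLinkedInPost(autoLoadComments) {
  // Helpers like autoScroll and findButtonsByText (shown earlier) are
  // defined here, inside the injected function, so they serialize with it.
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  if (autoLoadComments) {
    // Step 1: scroll so LinkedIn lazy-loads the comment section
    await autoScroll();

    // Step 2: keep clicking "load more" buttons until none remain
    for (let round = 0; round < 10; round++) {
      const buttons = [
        ...findButtonsByText('Load more comments'),
        ...findButtonsByText('Show more comments')
      ];
      if (buttons.length === 0) break;
      buttons.forEach((button) => button.click());
      await sleep(1500); // give new comments time to render
    }
  }

  // Step 3: collect, extract, and deduplicate comments
  // (collectAndProcessComments is a hypothetical wrapper around the
  // selector loop and Map-based dedup shown earlier)
  const comments = collectAndProcessComments();

  return { url: window.location.href, comments };
}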
The JSON included:
- Post author and content
- Post metadata (timestamp, URL, like count)
- Comments with author name, profile URL, text, and timestamp
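As an illustration, with made-up values and field names chosen for readability, a downloaded file was shaped roughly like this:

{
  "author": "Jane Example",
  "content": "Excited to announce ...",
  "timestamp": "2d",
  "url": "https://www.linkedin.com/posts/...",
  "likeCount": 128,
  "comments": [
    {
      "author": "John Sample",
      "profileUrl": "https://www.linkedin.com/in/john-sample/",
      "text": "Congratulations!",
      "timestamp": "1d"
    }
  ]
}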
Lessons Learned
Throughout this project, we learned several important lessons:
- Browser Extensions Have Limitations: Manifest V3 introduces constraints on how scripts can be executed and communicate.
- DOM Traversal Requires Robustness: LinkedIn's DOM structure can vary, so using multiple selector approaches provides resilience.
- Duplicate Detection is Critical: When scraping content, implementing proper deduplication logic is essential.
- Error Handling Matters: Building in extensive error handling and logging helped identify and fix issues quickly.
- Testing in Real Scenarios: What works in a development environment may fail in the real LinkedIn interface, making thorough testing crucial.
Conclusion
While our LinkedIn comment scraper isn't perfect, it successfully extracts valuable data that would be tedious to collect manually. With each iteration, we improved its reliability and accuracy. The extension now provides a solid foundation for analyzing LinkedIn engagement, though like any web scraping tool, it may require updates as LinkedIn's interface evolves.
This project demonstrates that with persistence and problem-solving, it's possible to build effective tools for extracting and analyzing social media data, even from complex platforms like LinkedIn.