Spotify playlist data can be useful for collecting information on songs, artists, and other track details. Spotify does not expose some of this data without an API key, but by leveraging Playwright for Python we can render the site's dynamic content and extract playlist data directly from Spotify's website. This article shows how to use Playwright to scrape track names, artist names, links, and track durations from Spotify playlists.
To start, install Playwright and lxml to handle the dynamic page content and HTML parsing:
pip install playwright
pip install lxml
Then download the browser binaries Playwright requires:
playwright install
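If you only need Chromium, the browser this article's script uses, you can optionally limit the download to it:
playwright install chromium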
With these set up, we’re ready to scrape Spotify.
Spotify’s content is dynamically loaded, so using requests or other simple HTTP libraries won’t capture all the information rendered by JavaScript. Playwright is a browser automation library that lets us interact with dynamic websites as if we’re browsing them ourselves, which means we can wait for JavaScript to load before scraping.
Playwright also supports proxy configuration, enabling us to use different IPs if needed to avoid rate-limiting or geo-restrictions on Spotify.
Below, you'll find a detailed, step-by-step guide to scraping, complete with code examples for a clearer, more visual understanding of the process.
The fetch_html_content function initializes the Playwright environment, launches a browser, and navigates to the Spotify playlist URL. Here, we set headless=False so that the browser interface remains visible (useful for debugging); for automated tasks, setting it to True will speed up execution.
The wait_until='networkidle' option waits for network activity to stabilize before capturing the page content, ensuring all elements load correctly.
from playwright.async_api import async_playwright

async def fetch_html_content(playlist_url):
    async with async_playwright() as p:
        # Launch browser with proxy setup if needed
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Navigate to the URL and wait for network activity to stabilize
        await page.goto(playlist_url, wait_until='networkidle')

        # Allow extra time for late-loading elements
        await page.wait_for_timeout(3000)

        # Capture the rendered page content
        page_content = await page.content()

        # Close the browser
        await browser.close()

        return page_content
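One caveat: Spotify's web player virtualizes long tracklists, so rows outside the viewport may not exist in the DOM when the page first settles, and a single page.content() call can miss tracks. If you need a full playlist, a scrolling pass before capturing the content can help. The helper below is a sketch of my own, not part of the original script; the step count and scroll distance are guesses you may need to tune:

from playwright.async_api import Page

async def scroll_tracklist(page: Page, steps: int = 10) -> None:
    # Hypothetical helper: scroll down repeatedly so the virtualized
    # tracklist renders rows that start outside the viewport.
    for _ in range(steps):
        await page.mouse.wheel(0, 2000)   # scroll down ~2000 px per step
        await page.wait_for_timeout(500)  # let newly rendered rows attach

Call await scroll_tracklist(page) inside fetch_html_content, just before the page.content() call.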
To use a proxy with IP address authentication in Playwright, configure the launch function as follows:
browser = await p.chromium.launch(headless=True, proxy={"server": "http://your-proxy-server:port"})
This will route requests through the specified proxy server, masking your original IP and helping avoid potential scraping restrictions.
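If your proxy requires username/password authentication instead, Playwright accepts the credentials in the same proxy dictionary, as the full script at the end of this article does (the values below are placeholders):

browser = await p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://your-proxy-server:port",
        "username": "your-username",
        "password": "your-password",
    },
)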
With lxml's fromstring function, we create a parser for the fetched HTML content. This allows us to locate and extract specific data using XPath expressions.
import asyncio
from lxml.html import fromstring

page_content = asyncio.run(fetch_html_content('https link'))
parser = fromstring(page_content)
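As an optional sanity check (not part of the original walkthrough), you can count the tracklist rows the page actually rendered before extracting individual fields; a count of zero suggests the page needs more load time or scrolling:

rows_loaded = parser.xpath('//div[@data-testid="tracklist-row"]')
print(f"Tracklist rows found: {len(rows_loaded)}")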
With XPath selectors, we collect the following details for each track in the playlist:
track_names = parser.xpath('//div[@data-testid="tracklist-row"]//a[@data-testid="internal-track-link"]/div/text()')
track_urls = parser.xpath('//div[@data-testid="tracklist-row"]//a[@data-testid="internal-track-link"]/@href')
# hrefs are root-relative (e.g. /track/...), so prepend the domain without an extra slash
track_urls_with_domain = [f"https://open.spotify.com{url}" for url in track_urls]
artist_names = parser.xpath('//div[@data-testid="tracklist-row"]//div[@data-encore-id="text"]/a/text()')
artist_urls = parser.xpath('//div[@data-testid="tracklist-row"]//div[@data-encore-id="text"]/a/@href')
artist_urls_with_domain = [f"https://open.spotify.com{url}" for url in artist_urls]
# Note: this class name is auto-generated and may change between Spotify deployments
track_durations = parser.xpath('//div[@data-testid="tracklist-row"]//div[@class="PAqIqZXvse_3h6sDVxU0"]/div/text()')
The URL lists are completed with the Spotify domain to create fully qualified links.
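One caveat with these flat XPath queries: a track with multiple credited artists yields more artist entries than track entries, so the columns can drift out of alignment when zipped together. A more defensive sketch (my own variation, not the article's original approach) iterates row by row, reusing the same selectors, and joins co-credited artists per track:

# Per-row extraction keeps fields aligned even when a track
# has several credited artists. Assumes `parser` from the step above.
tracks = []
for row in parser.xpath('//div[@data-testid="tracklist-row"]'):
    name = row.xpath('.//a[@data-testid="internal-track-link"]/div/text()')
    artists = row.xpath('.//div[@data-encore-id="text"]/a/text()')
    tracks.append({
        "track_name": name[0] if name else "",
        "artist_names": ", ".join(artists),  # join co-credited artists
    })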
After gathering data, we write it into a CSV file. Each row in the file contains the track name, track URL, artist name, artist URL, and track duration.
import csv

rows = zip(track_names, track_urls_with_domain, artist_names, artist_urls_with_domain, track_durations)

csv_filename = "spotify_playlist.csv"
with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write header
    writer.writerow(["track_names", "track_urls", "artist_names", "artist_urls", "track_durations"])
    # Write rows
    writer.writerows(rows)

print(f"Data successfully written to {csv_filename}")
This creates a well-structured CSV file that is easy to analyze and use in further applications.
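To verify the output, the file can be read back with csv.DictReader, for example:

import csv

with open("spotify_playlist.csv", newline='', encoding='utf-8') as file:
    for record in csv.DictReader(file):
        print(record["track_names"], "-", record["artist_names"])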
Here is the full code, combining all steps for a streamlined Spotify scraping process:
import asyncio
import csv

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def fetch_html_content(playlist_url):
    async with async_playwright() as p:
        # Launch browser, with proxy option if needed
        browser = await p.chromium.launch(headless=False, proxy={"server": "http://your-proxy-server:port", "username": "username", "password": "password"})
        page = await browser.new_page()

        # Navigate to the URL and wait for network activity to settle
        await page.goto(playlist_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        # Capture page content
        page_content = await page.content()

        # Close the browser
        await browser.close()

        return page_content
page_content = asyncio.run(fetch_html_content('https link'))
parser = fromstring(page_content)
# Extract details
track_names = parser.xpath('//div[@data-testid="tracklist-row"]//a[@data-testid="internal-track-link"]/div/text()')
track_urls = parser.xpath('//div[@data-testid="tracklist-row"]//a[@data-testid="internal-track-link"]/@href')
track_urls_with_domain = [f"https://open.spotify.com{url}" for url in track_urls]
artist_names = parser.xpath('//div[@data-testid="tracklist-row"]//div[@data-encore-id="text"]/a/text()')
artist_urls = parser.xpath('//div[@data-testid="tracklist-row"]//div[@data-encore-id="text"]/a/@href')
artist_urls_with_domain = [f"https://open.spotify.com{url}" for url in artist_urls]
track_durations = parser.xpath('//div[@data-testid="tracklist-row"]//div[@class="PAqIqZXvse_3h6sDVxU0"]/div/text()')
# Write to CSV
rows = zip(track_names, track_urls_with_domain, artist_names, artist_urls_with_domain, track_durations)

csv_filename = "spotify_playlist.csv"
with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["track_names", "track_urls", "artist_names", "artist_urls", "track_durations"])
    writer.writerows(rows)

print(f"Data successfully written to {csv_filename}")
Collecting Spotify playlist data with Python and Playwright gives us access to dynamically loaded content for extracting and analyzing track information. Configuring Playwright with proxies helps handle rate-limiting and geo-restrictions, making the collection process more reliable. This setup also opens possibilities for detailed analysis and can be easily adapted to other types of Spotify content.