Screaming Stone Design - Blog - A CD Ripper in Python

A CD Ripper in Python

Technologies: REST API, Python, IDLE, Linux

I have already written about how to rip a CD to MP3 with Linux. Now I am going to write about ripping a CD with Python.

The shell script I created before was simple and easy to use. I actually made it even easier by converting it to an executable binary with SHC, which meant I never had to open a terminal window at all - I merely had to double-click an icon and come back when the job was finished. The only problem I had with that system was I had to rename all of the MP3 files manually. I wanted to create something which would do all of that for me automatically. I was learning Python at the time and I decided it would make an excellent project to help me do that.

To do this I needed to accomplish several things:

- Rip the CD to MP3 files or to some other intermediary files
  - If ripping to intermediary files then a way to convert those to MP3
- Gather the names of the artist, the album and all of the tracks
- Rename all of the tracks
- Add ID3 tags (extra bonus points for this one)

Initially, to acquire all of the track names and other album data, I decided to query the database of gnudb.org (a replacement for freedb.org). To do that I would need to "calculate" a disc ID for the CD I wanted to rip. It is important to understand that compact discs do not actually have an ID number or any identifying data stored anywhere on them. Services such as gnudb.org (freedb.org) or Gracenote (CDDB) have each come up with their own method of calculating an ID for compact discs. It turned out to be very easy to generate this ID with a small utility called CD-DISCID, which Python was able to run through its built-in subprocess module and whose output it was able to read from the standard output (the text which would normally be printed in the terminal window):

import subprocess


def run_cddisc_id():
    # run cd-discid and capture what it prints instead of showing it in a terminal
    discid_data = subprocess.Popen('cd-discid',
                                   shell=True,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE
                                   )
    return discid_data


discid_data = run_cddisc_id()

This returned a line with a series of numbers which listed the frame offsets of the tracks on the CD and also a valid gnudb disc ID:

[Image: CD-DISCID output with valid gnudb.org disc ID]
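
As a rough illustration of what comes back, the sketch below splits a line in the layout CD-DISCID prints by default: the disc ID, the number of tracks, one frame offset per track and finally the total length in seconds. The sample line and the function name parse_cddiscid_line are made up for this example and are not part of the original program.

```python
def parse_cddiscid_line(line):
    """Split one line of default cd-discid output into its parts."""
    fields = line.split()
    disc_id = fields[0]
    number_of_tracks = int(fields[1])
    # one frame offset per track follows the track count
    frame_offsets = [int(field) for field in fields[2:2 + number_of_tracks]]
    # the last field is the playing time of the whole disc in seconds
    length_in_seconds = int(fields[2 + number_of_tracks])
    return disc_id, number_of_tracks, frame_offsets, length_in_seconds


# hypothetical sample line for a three-track disc, not taken from a real CD
sample_line = '1b037b03 3 150 21052 38417 893'
disc_id, number_of_tracks, frame_offsets, length_in_seconds = parse_cddiscid_line(sample_line)
print(disc_id, number_of_tracks, frame_offsets, length_in_seconds)
```

In the real program the line would come from the stdout of the process returned by run_cddisc_id, decoded from bytes to text first.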

Unfortunately it also transpired that disc IDs were not actually unique and there were quite a lot of ID collisions. For each disc ID queried, data for as many as 15 different discs might be returned. On top of this, the information was returned as plain text. After experimenting with this I decided it was not the correct route to follow, so I looked around for another online music database.
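
Part of the reason for the collisions is that the freedb-style ID is only 32 bits long and is built from a checksum of the track start times, the total playing time and the track count, so different discs can quite easily produce the same number. A minimal sketch of that calculation, assuming the classic CDDB scheme, is below. The function names and the sample offsets are invented for illustration.

```python
def digit_sum(number):
    """Add up the decimal digits of a number, e.g. 200 -> 2."""
    return sum(int(digit) for digit in str(number))


def freedb_disc_id(track_offsets, lead_out_offset):
    """Calculate a freedb-style disc ID from frame offsets (75 frames per second)."""
    # checksum: digit sums of every track's start time in whole seconds
    checksum = sum(digit_sum(offset // 75) for offset in track_offsets)
    # playing time from the start of the first track to the lead-out, in seconds
    total_seconds = lead_out_offset // 75 - track_offsets[0] // 75
    number_of_tracks = len(track_offsets)
    disc_id = ((checksum % 255) << 24) | (total_seconds << 8) | number_of_tracks
    return '%08x' % disc_id


# hypothetical two-track disc
example_id = freedb_disc_id([150, 15000], 30000)
print(example_id)
```

Only one byte of the result depends on where the individual tracks start, which goes some way towards explaining why so many unrelated discs share an ID.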

I eventually settled on the MusicBrainz database as it seemed to have a larger amount of information, its disc IDs were more likely to be unique and its data was returned in JSON format instead of plain text. In order to query the MusicBrainz API I would need a valid MusicBrainz disc ID, and the best way to generate or "calculate" one was to install the LIBDISCID Python module.

However, due to the setup of my laptop - the particular combination of the processor, the OS and the Python version - I was unable to do so. After a couple of hours of persistent struggling I decided to code my own routine to calculate a MusicBrainz disc ID and I am glad I decided to do this as it turned out to be very easy to do.

I achieved this by again using CD-DISCID to return the frame offsets of the tracks on the CD and then feeding them into my own calculate_disc_id function:

import base64
import hashlib


def calculate_disc_id(toc):
    # toc = the frame offset of every track followed by the lead-out offset

    shash = hashlib.sha1()

    first_track_number = 1
    last_track_number  = len(toc) - 1
    lead_out_offset    = int(toc[-1])

    # the hash input is upper-case hexadecimal text, not raw numbers
    shash.update(b'%02X' % first_track_number)
    shash.update(b'%02X' % last_track_number)
    shash.update(b'%08X' % lead_out_offset)

    # always hash 99 track offsets, using 0 for tracks the disc does not have
    for i in range(0, 99):
        if i < last_track_number:
            this_offset = int(toc[i])
        else:
            this_offset = 0
        shash.update(b'%08X' % this_offset)

    disc_id_b64 = base64.b64encode(shash.digest()).decode()

    # MusicBrainz swaps the three base64 characters which are awkward in URLs
    disc_id_b64 = disc_id_b64.replace('+', '.').replace('/', '_').replace('=', '-')

    return disc_id_b64


def run_cddisc_id():

    discid_data = subprocess.Popen('cd-discid --musicbrainz',
                                   shell=True,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT
                                   )
    return discid_data


def rip_disc_advanced():

    discid_data = run_cddisc_id()

    for line in discid_data.stdout.readlines():

        # the pipe returns bytes so decode them and drop the trailing newline
        shell_output = line.decode().strip()

        if 'No medium found' in shell_output:
            display_popup('error', 'Error', 'No CD in drive')
        else:
            disc_frames      = shell_output.split()
            # the first number is only the track count, the rest are offsets
            number_of_tracks = disc_frames.pop(0)
            disc_id          = calculate_disc_id(disc_frames)
            disc_data        = get_disc_info(disc_id)
            album_info       = parse_disc_info(disc_data)
            rip_disc_with_names(album_info)
            clean_up()
            os.chdir(current_directory)

    return None

The last function (in this case called rip_disc_advanced) calls the run_cddisc_id function to get a line containing the frame offsets and splits that line into a list. The first number in the list is just the number of tracks, so it is removed, and the rest of the list is fed into the calculate_disc_id function which generates the MusicBrainz ID.

[Image: CD-DISCID output with frame offsets]
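
To make the layout of the hashed data easier to see, here is a small sketch which builds the same text that calculate_disc_id feeds into SHA-1, but as one single string. The function names and the sample offsets are invented for the example. The point is only that the input is always 2 + 2 + 100 x 8 = 804 hexadecimal characters, however many tracks the disc has.

```python
import base64
import hashlib


def build_hash_input(toc):
    """Return the text which gets hashed: track numbers, lead-out, then 99 offsets."""
    track_offsets = [int(offset) for offset in toc[:-1]]
    lead_out_offset = int(toc[-1])
    # pad the list with zeros so there are always 99 track slots
    padded_offsets = track_offsets + [0] * (99 - len(track_offsets))
    parts = ['%02X' % 1, '%02X' % len(track_offsets), '%08X' % lead_out_offset]
    parts += ['%08X' % offset for offset in padded_offsets]
    return ''.join(parts)


def disc_id_from_hash_input(hash_input):
    """SHA-1 the text and encode the digest the way MusicBrainz expects."""
    digest = hashlib.sha1(hash_input.encode('ascii')).digest()
    encoded = base64.b64encode(digest).decode('ascii')
    return encoded.replace('+', '.').replace('/', '_').replace('=', '-')


# hypothetical three-track disc: three track offsets followed by the lead-out
sample_toc = ['150', '15363', '32314', '95462']
hash_input = build_hash_input(sample_toc)
sample_disc_id = disc_id_from_hash_input(hash_input)
print(len(hash_input))
print(hash_input[:20])
print(sample_disc_id)
```

Because a SHA-1 digest is always 20 bytes long, the finished ID is always 28 characters and always ends with the "-" which replaces the base64 padding.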

The MusicBrainz ID is then, in turn, fed into the get_disc_info function which contacts the MusicBrainz API and retrieves all the required data about the disc:

import json
from urllib.request import Request, urlopen


def get_disc_info(disc_id):

    url = 'https://musicbrainz.org/ws/2/discid/' + disc_id + '?inc=artists+recordings&fmt=json'
    headers = {'User-Agent': 'Python Ripper/0.3 (http://www.screamingstonedesign.com)'}
    request = Request(url, headers=headers)

    with urlopen(request) as page:
        disc_info_bytes = page.read()
        disc_info = json.loads(disc_info_bytes.decode('UTF-8'))

    return disc_info
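
The function above assumes the request always succeeds. A slightly more defensive variation might look like the sketch below. It assumes that MusicBrainz answers an unknown disc ID with HTTP status 404, and the names build_discid_url and get_disc_info_or_none are hypothetical and do not appear in the original program.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_discid_url(disc_id):
    """Build the lookup URL, leaving the '+' between the inc values untouched."""
    query = urlencode({'inc': 'artists+recordings', 'fmt': 'json'}, safe='+')
    return 'https://musicbrainz.org/ws/2/discid/' + quote(disc_id, safe='') + '?' + query


def get_disc_info_or_none(disc_id, user_agent, timeout=10):
    """Return the decoded JSON, or None if the disc is unknown or the site is unreachable."""
    request = Request(build_discid_url(disc_id), headers={'User-Agent': user_agent})
    try:
        with urlopen(request, timeout=timeout) as page:
            return json.loads(page.read().decode('UTF-8'))
    except HTTPError as error:
        if error.code == 404:
            return None  # assumption: 404 means the disc ID is not in the database
        raise
    except URLError:
        return None  # no connection, DNS failure and so on


# building the URL needs no network access ('EXAMPLE.disc_ID-' is not a real disc ID)
example_url = build_discid_url('EXAMPLE.disc_ID-')
print(example_url)
```

Returning None in both cases would let the caller fall back to ripping without names instead of crashing.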

When this data is retrieved it is then fed into the parse_disc_info parsing function which extracts the artist, album and track names from the JSON data and collects them in a list for easy retrieval:

def parse_disc_info(disc_data):

    # only the first release which MusicBrainz returns is used
    release = disc_data['releases'][0]

    # cycle through the list of discs (media) to find the correct disc by matching the disc-id
    discs_in_box = len(release['media'])
    this_disc_number = 0    # fall back to the first disc if nothing matches
    for disc_number in range(0, discs_in_box):
        disc_versions_released = release['media'][disc_number]['discs']
        if any(disc_version['id'] == disc_data['id'] for disc_version in disc_versions_released):
            this_disc_number = disc_number
            break

    artist     = release['artist-credit'][0]['name']
    album_name = release['title']

    # the date is optional and only its first four characters (the year) are needed
    if 'date' in release:
        year = release['date'][:4]
    else:
        year = ''

    number_of_tracks = int(release['media'][this_disc_number]['track-count'])

    track_list = []

    for track in range(0, number_of_tracks):
        track_list.append(release['media'][this_disc_number]['tracks'][track]['title'])

    album_info = [artist, album_name, year, this_disc_number, discs_in_box, track_list]

    return album_info
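
To show which parts of the JSON the parser actually touches, here is a heavily trimmed, entirely made-up response for a two-disc set, followed by the same lookups done by hand. Real responses contain many more keys. Every name and ID in this sample is hypothetical.

```python
# hypothetical response: only the keys which parse_disc_info reads are included
sample_response = {
    'id': 'EXAMPLE-DISC-ID',
    'releases': [{
        'title': 'Example Album',
        'date': '1999-05-17',
        'artist-credit': [{'name': 'Example Artist'}],
        'media': [
            {'track-count': 1,
             'discs': [{'id': 'SOME-OTHER-ID'}],
             'tracks': [{'title': 'Disc One Song'}]},
            {'track-count': 2,
             'discs': [{'id': 'EXAMPLE-DISC-ID'}],
             'tracks': [{'title': 'First Song'}, {'title': 'Second Song'}]},
        ],
    }],
}

release = sample_response['releases'][0]

# find the disc in the box whose ID matches the one which was queried
matching_discs = [number for number, medium in enumerate(release['media'])
                  if any(disc['id'] == sample_response['id'] for disc in medium['discs'])]
disc_number = matching_discs[0]

artist = release['artist-credit'][0]['name']
year = release.get('date', '')[:4]
track_titles = [track['title'] for track in release['media'][disc_number]['tracks']]

print(artist, year, disc_number, track_titles)
```

The disc number is zero-based here, which is why the ripping function adds 1 to it before naming the subfolder.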

At this point I had everything I needed to rename the folder and the tracks when I ripped them. All this information was then fed into the rip_disc_with_names ripping function which in turn called the rip_cd_to_wav function before then running the convert_wav_to_mp3 function on each WAV file which was created:

def rip_disc_with_names(album_info):

    folder_name = album_info[0] + ' - ' + album_info[1]
    folder_path = os.path.join(current_directory, folder_name)

    if not os.path.isdir(folder_path):
        os.makedirs(folder_name)
    os.chdir(folder_name)

    if album_info[4] > 1:
        subfolder_name = 'disc %d of %d' % (album_info[3] + 1, album_info[4])
        subfolder_path = os.path.join(folder_path, subfolder_name)
        if not os.path.isdir(subfolder_path):
            os.makedirs(subfolder_name)
        os.chdir(subfolder_name)

    rip_cd_to_wav()

    # number the tracks from 1 as they are converted
    for track_number, track in enumerate(album_info[5], start=1):
        convert_wav_to_mp3(album_info[0],
                           album_info[1],
                           album_info[2],
                           track_number,
                           len(album_info[5]),
                           track)

    # adjacent string literals are joined, so no stray indentation ends up in the message
    message = ("{} successfully ripped to highest quality, joint stereo, "
               "variable bit rate encoded MP3").format(folder_name)
    display_popup('success', 'Success', message)

    return None
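
The folder handling could also be written without changing the working directory at all. The sketch below is one possible way of doing that with pathlib. The function name make_rip_folder and its parameters are hypothetical, and the demonstration writes into a temporary directory rather than a real music folder.

```python
import tempfile
from pathlib import Path


def make_rip_folder(base_directory, artist, album_name, disc_index, discs_in_box):
    """Create 'Artist - Album' (plus a 'disc x of y' subfolder for box sets) and return its path."""
    folder = Path(base_directory) / (artist + ' - ' + album_name)
    if discs_in_box > 1:
        # disc_index is zero-based, so add 1 for the name shown to the user
        folder = folder / ('disc %d of %d' % (disc_index + 1, discs_in_box))
    folder.mkdir(parents=True, exist_ok=True)
    return folder


with tempfile.TemporaryDirectory() as base_directory:
    folder = make_rip_folder(base_directory, 'Example Artist', 'Example Album', 1, 2)
    folder_was_created = folder.is_dir()
    relative_folder = folder.relative_to(base_directory).as_posix()

print(relative_folder)
```

The ripping and encoding commands would then need to be given this folder explicitly, for example through the cwd argument of subprocess.Popen.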

It turned out that LAME, which I used to create MP3 files from the WAV files, could also add ID3 tags to the MP3 files as it encoded them. This is very useful as ID3 tags can contain all the weird special characters which can't be included in file names. Each MP3 file is created individually with the convert_wav_to_mp3 function:

import os
import subprocess


def convert_wav_to_mp3(artist,
                       album_name,
                       year,
                       track_number,
                       number_of_tracks,
                       track_name):

    wav_filename = 'track%02d.cdda.wav' % track_number
    mp3_name = '%02d - %s.mp3' % (track_number, track_name)
    track_x_of_xx = '%d/%d' % (track_number, number_of_tracks)
    # pass the arguments as a list so that quotes or other special characters
    # in a name are not interpreted by the shell
    command = ['lame', '-mj', '-V0',
               '--tt', track_name,
               '--tn', track_x_of_xx,
               '--ta', artist,
               '--tl', album_name,
               '--ty', year,
               '--id3v2-only', wav_filename, mp3_name]
    wav_to_mp3 = subprocess.Popen(command, close_fds=True)
    wav_to_mp3.wait()
    os.remove(wav_filename)

    return None
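
One thing the function above does not deal with is a track title which contains a character that is not allowed in a file name. On Linux that is essentially just the forward slash. A minimal sketch of a clean-up step is below. The names safe_filename and mp3_filename are invented for this example, and the ID3 tag would still be given the untouched title.

```python
def safe_filename(name, replacement='-'):
    """Replace characters which cannot appear in a Linux file name."""
    cleaned = name.replace('/', replacement).replace('\0', '')
    # never return an empty name
    return cleaned.strip() or 'untitled'


def mp3_filename(track_number, track_name):
    """Build the numbered file name used for one track."""
    return '%02d - %s.mp3' % (track_number, safe_filename(track_name))


example_name = mp3_filename(3, 'AC/DC "Live" Medley')
print(example_name)
```

Quotation marks are left alone because they are legal in file names, and they are only a problem when a command is built as one string for the shell.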

Once the disc has been ripped the program needs to clean up the WAV files which were created by CDPARANOIA. There is a function to do this, so the only files left in the folder are the MP3 files.
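
The clean-up function itself is not shown in this post, so the following is only a guess at what such a function could look like, assuming the WAV files follow the track01.cdda.wav naming pattern used earlier. The name remove_leftover_wav_files is hypothetical and the demonstration runs in a temporary folder with empty stand-in files.

```python
import glob
import os
import tempfile


def remove_leftover_wav_files(folder):
    """Delete every trackNN.cdda.wav file in folder and return the names removed."""
    removed = []
    pattern = os.path.join(glob.escape(folder), 'track*.cdda.wav')
    for wav_path in sorted(glob.glob(pattern)):
        os.remove(wav_path)
        removed.append(os.path.basename(wav_path))
    return removed


with tempfile.TemporaryDirectory() as folder:
    # create two fake WAV files and one fake MP3 file
    for name in ('track01.cdda.wav', 'track02.cdda.wav', '01 - Example Song.mp3'):
        open(os.path.join(folder, name), 'wb').close()
    removed_files = remove_leftover_wav_files(folder)
    remaining_files = sorted(os.listdir(folder))

print(removed_files, remaining_files)
```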

There are several other functions in this program which account for what to do in the event of no internet connection being available or if there isn't any disc in the drive.

In the former situation the program calls the rip_disc_basic function which entirely skips disc ID calculation and contacting the MusicBrainz API and instead asks the user for the name of the album and uses that to name the folder to which it rips the disc.

In the latter situation the program needs to be able to display a message that the CD drive is empty.
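
How the program decides that there is no internet connection is not shown here. One simple approach is to try to open the MusicBrainz site with a short timeout, as in the sketch below. The function name internet_available is hypothetical, and the opener parameter exists only so that the check can be demonstrated with stand-in functions instead of a real network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def internet_available(url='https://musicbrainz.org', timeout=5, opener=urlopen):
    """Return True if the server at url could be reached."""
    try:
        with opener(url, timeout=timeout):
            return True
    except HTTPError:
        # the server answered, even if only with an error page
        return True
    except (URLError, OSError):
        return False


class _FakeResponse:
    """Stand-in for the object which urlopen returns."""
    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def _working_opener(url, timeout):
    return _FakeResponse()


def _broken_opener(url, timeout):
    raise URLError('network is unreachable')


online = internet_available(opener=_working_opener)
offline = internet_available(opener=_broken_opener)
print(online, offline)
```

The main function could then call rip_disc_advanced when the check succeeds and rip_disc_basic when it does not.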

To get all this to work in a way which is simple and easy to use I created a GUI with a single button:

[Image: the Python Ripper single button]

I used Tkinter for this job as this was my first time creating any kind of program with a GUI, and even though it is extremely basic it is also very easy to use. So at the top of my script I need to import Tkinter:

import tkinter as tk
import tkinter.messagebox as popup
from tkinter.simpledialog import askstring

I wanted to convert this program into a binary executable, but I discovered it is not possible (or at least extremely difficult) to include external binary files such as images. So I converted my image to a base 64 string, assigned it to the variable image_for_button and saved that in a file called image_for_button.py, which I then imported into my main program:

from image_for_button import image_for_button
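
For anyone wondering how such a file can be produced, the sketch below reads an image, encodes it as base 64 and builds the one line of Python source which image_for_button.py would contain. The function name image_to_module_source is hypothetical, and the demonstration uses a few stand-in bytes instead of a real image file.

```python
import base64
import os
import tempfile


def image_to_module_source(image_path, variable_name='image_for_button'):
    """Return Python source which assigns the base 64 text of the image to a variable."""
    with open(image_path, 'rb') as image_file:
        encoded = base64.b64encode(image_file.read()).decode('ascii')
    return "%s = '%s'\n" % (variable_name, encoded)


# demonstration with stand-in bytes, not a real image
with tempfile.TemporaryDirectory() as folder:
    image_path = os.path.join(folder, 'button.png')
    with open(image_path, 'wb') as image_file:
        image_file.write(b'not really a PNG')
    module_source = image_to_module_source(image_path)

# running the generated source and decoding the variable gives the bytes back
namespace = {}
exec(module_source, namespace)
round_trip = base64.b64decode(namespace['image_for_button'])
print(round_trip)
```

Writing module_source to a file called image_for_button.py would make the import above work.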

I then needed to set up a few basic variables to call later on, including making a reference to an image called button_image by using Tkinter's PhotoImage class to convert the base 64 string back into an image:

main_window = tk.Tk()
main_window.title('Python Ripper v0.5')

window_left = int((main_window.winfo_screenwidth() - 210) / 2)
window_top = int((main_window.winfo_screenheight() - 90) / 2)
# geometry = width x height + x offset + y offset
main_window.geometry(f"210x90+{window_left}+{window_top}")
main_window.resizable(0, 0)
main_window.config(bg='#d9d9d9')

button_image = tk.PhotoImage(data=image_for_button)
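
The centring arithmetic can be pulled out into a small function of its own, which makes it easy to check without opening a window. This is only a sketch and the name centred_geometry is invented for the example.

```python
def centred_geometry(screen_width, screen_height, window_width, window_height):
    """Return a Tk geometry string which centres the window on the screen."""
    # never use a negative offset, even if the window is bigger than the screen
    left = max(0, (screen_width - window_width) // 2)
    top = max(0, (screen_height - window_height) // 2)
    # geometry = width x height + x offset + y offset
    return '%dx%d+%d+%d' % (window_width, window_height, left, top)


example_geometry = centred_geometry(1920, 1080, 210, 90)
print(example_geometry)
```

In the program the first two arguments would come from main_window.winfo_screenwidth() and main_window.winfo_screenheight().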

At the very end of my program I created the button which, when clicked, calls the main function which begins the whole ripping process, followed by the mainloop call which starts the Tkinter GUI:

ripper_button = tk.Button(main_window, image=button_image, command=main)

ripper_button.pack(padx=10,pady=10)

main_window.mainloop()

And this is the final result, a program with just one single big button:

[Image: the Python Ripper GUI running on Ubuntu]

If you want, you can see the entire code of this program on GitHub.

Please note this was created for my own personal use and as an educational exercise. Note also that there are 2 bugs which I know about and which I will get around to fixing. Eventually.