UTF-8 release

- bump version
- update readme
- add utf-8 convert option
- make sure files are parsed as utf-8
- pull out playlist name extractor
- first update outline & tests with m3u8 extension
- add overall test
- validate file extension
- validate empty playlist file
radujica · Jan 22, 2021 · 8821833
2 parents 96ea90a + e70b73c
Showing 17 changed files with 251 additions and 32 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -16,4 +16,12 @@ Check out the Makefile for the common commands.
 
 Publishing to pypi is handled through github releases and action, 
 though version in setup.py needs to be manually bumped.
+
+# Current ideas
+
+- Use more data from the playlist or even the song tags themselves to more accurately find songs, e.g. song duration,
+year, album artist
+- Try removing the "the" from song names; `The Wolven Storm` is not the same as `Wolven Storm`
+- Compute similarity metrics after finding matches on Spotify; in the example above, there is currently no way to know
+which variant to search first, but after seeing the artist returned by each attempt, one could deduce the correct one
 
diff --git a/README.md b/README.md
@@ -1,13 +1,16 @@
-# Convert local playlist to Spotify playlist
+# Convert local playlists to Spotify playlists
 
 ![Build, Test, Lint](https://github.com/radujica/tospotify/workflows/Build,%20Test,%20Lint/badge.svg)
 [![PyPI version](https://badge.fury.io/py/tospotify.svg)](https://badge.fury.io/py/tospotify)
 
-Currently works for m3u files; m3u8 support to come!
+Supports m3u and m3u8 files in the [Extended format](https://en.wikipedia.org/wiki/M3U) encoded as UTF-8.
+
+Take a look [below](#help) for more details and debugging tips.
 
 ## Usage
 
-    usage: tospotify [-h] [-v] [--public] [--playlist-id PLAYLIST_ID]
+    usage: tospotify [-h] [--verbose] [--public] [--convert]
+                     [--playlist-id PLAYLIST_ID]
                      spotify_username playlist_path
 
     Create/update a Spotify playlist from a local m3u playlist
@@ -20,8 +23,9 @@ Currently works for m3u files; m3u8 support to come!
 
     optional arguments:
       -h, --help            show this help message and exit
-      -v, --verbose         print all the steps when searching for songs
+      --verbose             print all the steps when searching for songs
       --public              playlist is public, otherwise private
+      --convert             convert from locale default to utf-8
       --playlist-id PLAYLIST_ID
                             do not create a new playlist, instead update the
                             existing playlist with this id
@@ -56,4 +60,64 @@ Currently works for m3u files; m3u8 support to come!
 
 ### Windows
 Same as linux but use `set` instead of `export`
-
+
+
+## Help
+
+### Encoding
+
+Seeing unexpected characters in the log messages is a sign of faulty encoding.
+
+This tool uses the [m3u8](https://github.com/globocom/m3u8) library to parse the files, which expects utf-8 input.
+
+Encoding can be checked by opening the playlist in a text editor, such as Notepad++.
+If the playlist is in a different encoding, 
+try using the `--convert` argument which will attempt to convert it to utf-8.
+
+Alternatively, try importing the playlist into your music player and using its export function
+to export as utf-8, if it has one. AIMP, for example, can do this.
+
+### Songs missing
+
+This might happen when the file is not actually in the [extended m3u](https://en.wikipedia.org/wiki/M3U) format. 
+This format looks like
+
+    #EXTM3U
+    #EXTINF:277,Faun - Sieben Raben
+    /Music/Selection/Faun - Sieben Raben.mp3
+
+and is populated from file tags. For this example, the mp3 file contains the tags 
+artist = Faun and title = Sieben Raben which then populate the `#EXTINF` line.
+
+If your playlist only contains paths, try importing it into a (different) music player and exporting it again.
+AIMP, for example, exports in the expected format.
+
+### What does tospotify actually do internally?
+
+It tries various cleaning steps and search queries in an attempt to find the correct songs on Spotify.
+
+The [extended m3u](https://en.wikipedia.org/wiki/M3U) format is important. As mentioned above, the ground truth
+is actually the artist and title tags stored in the songs themselves which are then reflected in the playlist.
+Looking at the example above, the format is essentially `artist - title`; this implicitly means that dashes `-`
+in the artist or title cannot be interpreted properly at the moment. Sorry, AC-DC :(
+
+The tool then uses rules to compute various queries. Take for example the song 
+`Every Breath You Take` by `Sting and the Police`. This can be stored in many ways. The artist could be
+`Sting and the Police`, `Sting;The Police`, `Sting & the Police`, `The Police`, etc.
+Then the title could be `Every Breath You Take` but also `Every Breath You Take feat. Sting` and other
+variations. Many of these are not found exactly as such on Spotify.
+
+There can also be live versions, e.g. with title `Every Breath You Take [live]`,
+covers by other artists, separate recordings of the song,
+and the list goes on. This song was actually recorded both by The Police with Sting
+and solo by Sting; both versions are available on Spotify!
+
+Bit more complex than it initially seems :)
+
+So this is what tospotify does; it will try to find the correct song through various rules derived from the data in the 
+playlist.
+
+
+## Contributing
+
+Take a look at the [CONTRIBUTING](CONTRIBUTING.md) file for more details. Pull requests are welcome!
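
Note on the README's "What does tospotify actually do internally?" section: the artist variations it lists (`Sting and the Police`, `Sting;The Police`, `Sting & the Police`) can be turned into search candidates with a small rule. The following is a minimal sketch only; `artist_variants` and its separator list are hypothetical and are not taken from the tospotify code base, whose actual rules may differ.

```python
import re
from typing import List

# Hypothetical separator rule: ';', '&' and the standalone word 'and'.
_SEPARATORS = re.compile(r'\s*(?:;|&|\band\b)\s*', flags=re.IGNORECASE)


def artist_variants(artist: str) -> List[str]:
    """Return the full artist string first, then each individual artist, without duplicates."""
    candidates = [artist.strip()]
    for part in _SEPARATORS.split(artist):
        part = part.strip()
        if part and part not in candidates:
            candidates.append(part)
    return candidates
```

Each candidate could then be tried in turn as the artist part of a search query. A rule like this also splits band names that legitimately contain "and", which is one reason several queries have to be attempted rather than one.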
diff --git a/requirements.txt b/requirements.txt
@@ -1,2 +1,2 @@
-m3u8==0.5.4
+m3u8>=0.7.1
 spotipy==2.11.1
diff --git a/setup.py b/setup.py
@@ -8,7 +8,7 @@ def readme():
 
 setup(
     name='tospotify',
-    version='0.2',
+    version='0.3',
     description='Create/update a Spotify playlist from a local m3u playlist',
     url='https://github.com/radujica/tospotify',
     author='Radu Jica',
@@ -29,7 +29,7 @@ def readme():
     packages=['tospotify', 'tospotify.types'],
     include_package_data=True,
     zip_safe=True,
-    install_requires=['spotipy', 'm3u8'],
+    install_requires=['spotipy', 'm3u8>=0.7.1'],
     entry_points={
         'console_scripts': ['tospotify=tospotify.run:main'],
     }

diff --git a/test/data/cp1252_playlist.m3u b/test/data/cp1252_playlist.m3u
@@ -0,0 +1,3 @@
+#EXTM3U
+#EXTINF:290,Eivør - Trøllabundin
+..\..\Music\Selection\Nordic\Eivør - Trøllabundin.mp3
diff --git a/test/data/empty_playlist.m3u b/test/data/empty_playlist.m3u
diff --git a/test/data/path_playlist.m3u b/test/data/path_playlist.m3u
@@ -0,0 +1 @@
+/Music/Selection/Faun - Sieben Raben.mp3
diff --git a/test/data/utf8_playlist.m3u8 b/test/data/utf8_playlist.m3u8
@@ -0,0 +1,5 @@
+#EXTM3U
+#EXTINF:200,Marcin Przybyłowicz - The Wolven Storm (Priscilla's Song)
+/Music/Marcin Przybyłowicz - The Wolven Storm (Priscilla's Song).mp3
+#EXTINF:290,Eivør - Trøllabundin
+/Music/Eivør - Trøllabundin.mp3
diff --git a/test/data/valid_playlist.m3u b/test/data/valid_playlist.m3u
@@ -0,0 +1,3 @@
+#EXTM3U
+#EXTINF:277,Faun - Sieben Raben
+/Music/Selection/Faun - Sieben Raben.mp3
diff --git a/test/test_integration.py b/test/test_integration.py
@@ -0,0 +1,32 @@
+import os
+from unittest.mock import patch
+
+import pytest
+
+from tospotify.run import main
+
+
+class MockArgs:
+    def __init__(self, playlist_path, playlist_id=None):
+        self.verbose = False
+        self.convert = False
+        self.spotify_username = 'test_username'
+        self.public = True
+        self.playlist_path = playlist_path
+        self.playlist_id = playlist_id
+
+
+# the point here is to run through the whole flow
+@patch('tospotify.run._parse_args')
+@patch('tospotify.run.prompt_for_user_token', lambda x, y: 'token')
+@patch('tospotify.search._find_track', lambda x, y, z: 'uri')
+@patch('tospotify.search.add_tracks', lambda x, y, z: None)
+@pytest.mark.parametrize('playlist', [
+    os.path.join('test', 'data', 'valid_playlist.m3u'),
+    os.path.join('test', 'data', 'empty_playlist.m3u'),
+    os.path.join('test', 'data', 'path_playlist.m3u'),
+    os.path.join('test', 'data', 'utf8_playlist.m3u8')
+])
+def test_integration(mock_function, playlist):
+    mock_function.return_value = MockArgs(playlist_path=playlist, playlist_id=1)
+    main()
diff --git a/test/test_parser.py b/test/test_parser.py
@@ -0,0 +1,5 @@
+from tospotify.parser import parse_songs
+
+
+def test_parse_songs():
+    pass
diff --git a/test/test_processing.py b/test/test_processing.py
@@ -6,9 +6,8 @@
 @pytest.mark.parametrize('name,expected', [
     ('The Police', 'the police'),
     (' The  Police   ', 'the police'),
-    ('St-ing; The, Police', 'sting; the, police'),
-    ('ßtïngé', 'tng'),
-    ('é', ''),
+    ('St-ing; {The}, Police', 'st-ing; the, police'),
+    ('ßtïngé', 'ßtïngé'),
     ('Every Breath You Take (feat. Sting)', 'every breath you take (feat sting)'),
     ('Every Breath You Take [Acoustic]', 'every breath you take [acoustic]')
 ])

diff --git a/test/test_run.py b/test/test_run.py
@@ -1,9 +1,10 @@
+import argparse
 import os
 from unittest.mock import patch
 
 import pytest
 
-from tospotify.run import _parse_path
+from tospotify.run import _parse_path, _m3u_file
 
 
 @patch('os.getcwd', lambda: '/test/path')
@@ -13,3 +14,23 @@
 ])
 def test__parse_path(path, expected):
     assert _parse_path(path) == os.sep.join(expected)
+
+
+@pytest.mark.parametrize('playlist_path', [
+    'path/to/file.m3u',
+    'file.m3u',
+    'path/to/file.m3u8',
+    'file.m3u8'
+])
+def test__m3u_extension(playlist_path):
+    _m3u_file(playlist_path)
+
+
+@pytest.mark.parametrize('playlist_path', [
+    'path/file.mp3',
+    '.m3u',
+    'file'
+])
+def test__m3u_extension_invalid(playlist_path):
+    with pytest.raises(argparse.ArgumentTypeError):
+        _m3u_file(playlist_path)
diff --git a/tospotify/parser.py b/tospotify/parser.py
@@ -0,0 +1,39 @@
+import logging
+
+import m3u8
+
+
+def convert_utf8(playlist_path: str) -> str:
+    """ Convert file to utf-8 after parsing with locale.getpreferredencoding
+
+    :param playlist_path: absolute path of the playlist
+    :return: Path of converted file
+    """
+    logging.warning('Converting file to utf-8')
+    path_without_extension = playlist_path.rsplit('.', 1)[0]
+    output_path = path_without_extension + '_utf8.m3u8'
+
+    # no encoding is passed here, so open() uses locale.getpreferredencoding(), e.g. cp1252 on Windows
+    with open(playlist_path, mode='r') as input_:
+        with open(output_path, encoding='utf-8', mode='w') as output_:
+            for line in input_:
+                output_.write(line)
+
+    return output_path
+
+
+def parse_songs(playlist_path: str) -> m3u8.SegmentList:
+    """ Parse and return the songs found in the file
+
+    :param playlist_path: absolute path of the playlist
+    :type playlist_path: str
+    :return: segments (songs) found in the playlist; empty if none were found
+    """
+    # m3u8 uses open(..., encoding='utf-8'), which will throw an exception when the file cannot be parsed as utf-8
+    playlist = m3u8.load(playlist_path)
+    segments = playlist.segments
+
+    if len(segments) <= 0:
+        logging.error('Could not find any songs in the file!')
+
+    return segments
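
Note on `convert_utf8`: the conversion depends on the locale default matching the encoding the playlist was actually saved in. A variant that lets the caller name the source encoding could look like the sketch below. It is an illustration under assumptions: `convert_to_utf8` and its `source_encoding` parameter are hypothetical and do not exist in tospotify.

```python
import locale
from typing import Optional


def convert_to_utf8(playlist_path: str, source_encoding: Optional[str] = None) -> str:
    """Re-encode a playlist as UTF-8 and return the path of the new file."""
    # Fall back to the locale default, which is what open() uses when no encoding is given.
    encoding = source_encoding or locale.getpreferredencoding(False)
    output_path = playlist_path.rsplit('.', 1)[0] + '_utf8.m3u8'

    # newline='' leaves the original line endings untouched on both sides.
    with open(playlist_path, mode='r', encoding=encoding, newline='') as input_:
        with open(output_path, mode='w', encoding='utf-8', newline='') as output_:
            for line in input_:
                output_.write(line)

    return output_path
```

Calling `convert_to_utf8(path, source_encoding='cp1252')` would then convert a Windows-exported playlist correctly even on a machine whose locale default is already utf-8.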
diff --git a/tospotify/processing.py b/tospotify/processing.py
@@ -28,16 +28,16 @@ def clean_title(title: str) -> str:
 def clean_name(name: str) -> str:
     """ Clean either artist or title:
 
-    - keep only ascii and some relevant characters: \\,&()[]
-    - note that single quotes are also removed since Spotify seems to handle those well
+    - removes these characters: .{}
+    - removes extra spaces
     - strip and lowercase
 
     :param name: artist or song title to clean
     :type name: str
     :return: cleaned name
     :rtype: str
     """
-    cleaned_name = re.sub(r'[^a-zA-Z0-9\s,;&()\[\]]', '', name)
+    cleaned_name = re.sub(r'[.{\}]', '', name)
     cleaned_name = re.sub(r'\s+', ' ', cleaned_name)
     cleaned_name = cleaned_name.strip()
     cleaned_name = cleaned_name.lower()
@@ -49,12 +49,16 @@ def clean_name(name: str) -> str:
 
 
 def process_song_name(song_name: str) -> Tuple[str, str]:
-    """ Splits m3u line of artist - title and cleans using clean_name
+    """ Splits m3u line of artist - title and cleans using clean_name.
+    Note that Extended M3U playlists are expected to use this layout, i.e. artist - title based on file tags.
+
+    !Since '-' delimits the artist from the title, this character must not appear inside the artist or title!
 
     :param song_name:
     :type song_name: str
     :return: tuple of artist and title
     :rtype: (str, str)
+    :raises ProcessingException: if a song does not obey the Extended M3U8 formatting of "artist - title"
     """
     song_split = song_name.split('-')
 

diff --git a/tospotify/run.py b/tospotify/run.py
@@ -1,22 +1,42 @@
 import argparse
 import logging
 import os
+from typing import Optional
 
 from spotipy import Spotify
 from spotipy.util import prompt_for_user_token
 
 from .search import create_spotify_playlist, update_spotify_playlist
 
 
+def _m3u_file(path: str) -> Optional[str]:
+    if not isinstance(path, str):
+        raise argparse.ArgumentTypeError('Path must be a string. Encountered type={}'.format(str(type(path))))
+
+    splits = path.rsplit('.', 1)
+    if len(splits) == 1:
+        raise argparse.ArgumentTypeError('Could not determine file extension')
+
+    filename, extension = splits[0], splits[1]
+    if len(filename) == 0:
+        raise argparse.ArgumentTypeError('Filename without extension cannot be empty')
+
+    if extension in {'m3u', 'm3u8'}:
+        return path
+
+    raise argparse.ArgumentTypeError('Only m3u and m3u8 files are supported. Encountered={}'.format(extension))
+
+
 def _parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description='Create/update a Spotify playlist from a local m3u playlist')
     parser.add_argument('spotify_username',
                         help='Spotify username where playlist should be updated. '
                              'Your email address should work just fine, or could find your user id '
                              'through e.g. the developer console', type=str)
-    parser.add_argument('playlist_path', help='full path to the playlist', type=str)
+    parser.add_argument('playlist_path', help='full path to the playlist', type=_m3u_file)
     parser.add_argument('--verbose', help='print all the steps when searching for songs', action='store_true')
     parser.add_argument('--public', help='playlist is public, otherwise private', action='store_true')
+    parser.add_argument('--convert', help='convert from locale default to utf-8', action='store_true')
     parser.add_argument('--playlist-id', help='do not create a new playlist, '
                                               'instead update the existing playlist with this id', type=str)
     parsed_args = parser.parse_args()
@@ -31,6 +51,13 @@ def _parse_path(path: str) -> str:
     return os.path.join(os.getcwd(), *path.split(os.sep))
 
 
+def _extract_playlist_name(playlist_path: str) -> str:
+    _, filename = os.path.split(playlist_path)
+    playlist_name = str(filename.split('.')[0])
+
+    return playlist_name
+
+
 def main() -> None:
     """ Main entry point to the script """
     args = _parse_args()
@@ -50,12 +77,11 @@ def main() -> None:
     spot = Spotify(auth=token)
 
     if args.playlist_id is None:
-        _, filename = os.path.split(playlist_path)
-        playlist_name = str(filename.split('.')[0])
+        playlist_name = _extract_playlist_name(playlist_path)
         playlist_id = create_spotify_playlist(spot, playlist_name)
         logging.info('Created playlist with name={} at id={}'.format(playlist_name, playlist_id))
     else:
         playlist_id = args.playlist_id
         logging.info('Updating existing playlist with id={}'.format(playlist_id))
 
-    update_spotify_playlist(spot, playlist_path, playlist_id)
+    update_spotify_playlist(spot, playlist_path, playlist_id, args.convert)
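
Note on `_m3u_file` and `_extract_playlist_name`: both work by splitting the path on `.` by hand. The same checks can be expressed with `pathlib`, as in the sketch below. The names `m3u_file` and `extract_playlist_name` are hypothetical stand-ins, and the behaviour differs from the commit in two labelled ways: the extension check is case-insensitive, and a file name containing several dots keeps everything before the final extension.

```python
import argparse
from pathlib import PurePath

_SUPPORTED_SUFFIXES = {'.m3u', '.m3u8'}


def m3u_file(path: str) -> str:
    """argparse `type=` callable that accepts only .m3u/.m3u8 paths."""
    # PurePath('.m3u').suffix and PurePath('file').suffix are both '', so those are rejected too.
    suffix = PurePath(path).suffix.lower()
    if suffix not in _SUPPORTED_SUFFIXES:
        raise argparse.ArgumentTypeError(
            'Only m3u and m3u8 files are supported. Encountered={!r}'.format(suffix))
    return path


def extract_playlist_name(playlist_path: str) -> str:
    """Return the file name without its final extension."""
    return PurePath(playlist_path).stem
```

For a file named `my.mix.m3u`, `filename.split('.')[0]` yields `my`, whereas `PurePath(...).stem` yields `my.mix`, which is likely closer to the playlist name a user expects to see on Spotify.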