Goodbye os.path: 15 Pathlib Tricks to Quickly Master The File System in Python | by Bex T. | Apr, 2023


A robot pal. — Via Midjourney

Pathlib may be my favorite library (after Sklearn, obviously). And given there are over 130 thousand libraries, that’s saying something. Pathlib helps me turn code like this written in os.path:

import os

dir_path = "/home/user/documents"

# Find all text files inside a directory
files = [
    os.path.join(dir_path, f)
    for f in os.listdir(dir_path)
    if os.path.isfile(os.path.join(dir_path, f)) and f.endswith(".txt")
]

into this:

from pathlib import Path

dir_path = Path("/home/user/documents")

# Find all text files inside a directory
files = list(dir_path.glob("*.txt"))

Pathlib came out in Python 3.4 as a replacement for the nightmare that was os.path. It also marked an important milestone for the Python language as a whole: they finally turned every single thing into an object (even nothing).

The biggest drawback of os.path was treating system paths as strings, which led to unreadable, messy code and a steep learning curve.

By representing paths as fully-fledged objects, Pathlib solves all these issues and introduces elegance, consistency, and a breath of fresh air into path handling.

And this long-overdue article of mine will outline some of the best functions/features and tricks of pathlib to perform tasks that would have been truly horrible experiences in os.path.

Learning these features of Pathlib will make everything related to paths and files easier for you as a data professional, especially during data processing workflows where you have to move around thousands of images, CSVs, or audio files.

Let’s get started!

Working with paths

1. Creating paths

Almost all features of pathlib are accessible through its Path class, which you can use to create paths to files and directories.

There are a few ways you can create paths with Path. First, there are class methods like cwd and home for the current working directory and the user's home directory:

from pathlib import Path

Path.cwd()

PosixPath('/home/bexgboost/articles/2023/4_april/1_pathlib')
Path.home()
PosixPath('/home/bexgboost')

You can also create paths from string paths:

p = Path("documents")

p

PosixPath('documents')

Joining paths is a breeze in Pathlib with the forward slash operator:

data_dir = Path(".") / "data"
csv_file = data_dir / "file.csv"

print(data_dir)
print(csv_file)

data
data/file.csv

Please, don’t let anyone ever catch you using os.path.join after this.
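As a small sketch of the same idea, the / operator can chain several segments at once, and the joinpath method does the same job as a method call. I'm using PurePosixPath here only so the output looks identical on every operating system, and the directory and file names are made up:

```python
from pathlib import PurePosixPath

# PurePosixPath never touches the disk, so these names are purely illustrative
base = PurePosixPath("data")

# Chain several segments with the / operator
report = base / "2023" / "april" / "report.csv"

# joinpath builds the same path with a method call
same_report = base.joinpath("2023", "april", "report.csv")

print(report)                 # data/2023/april/report.csv
print(report == same_report)  # True
```

The operator form tends to read better inline, while joinpath is handy when the segments already live in a list and can be unpacked with *.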

To check whether a path exists, you can use the exists method, which returns a boolean:

data_dir.exists()
True
csv_file.exists()
True

Sometimes you can't tell from the path alone whether it points to a directory or a file. In that case, you can use the is_dir or is_file methods:

data_dir.is_dir()
True
csv_file.is_file()
True
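If you want to try these checks without touching your own files, here is a minimal sketch that runs inside a temporary directory. The file names are hypothetical, and touch (not covered above) simply creates an empty file:

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    csv_file = data_dir / "file.csv"
    missing = data_dir / "missing.csv"  # hypothetical file that is never created

    data_dir.mkdir()
    csv_file.touch()  # create an empty file

    checks = {
        "dir exists": data_dir.exists(),
        "dir is_dir": data_dir.is_dir(),
        "file is_file": csv_file.is_file(),
        "file is_dir": csv_file.is_dir(),
        "missing exists": missing.exists(),
    }

print(checks)
```

Note that is_dir and is_file return False for a path that doesn't exist, rather than raising an error.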

Most paths you work with will be relative to your current directory. But, there are cases where you have to provide the exact location of a file or a directory to make it accessible from any Python script. This is when you use absolute paths:

csv_file.absolute()
PosixPath('/home/bexgboost/articles/2023/4_april/1_pathlib/data/file.csv')
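One detail worth knowing: absolute only prepends the current working directory and leaves any .. segments in place. The resolve method, which this article doesn't otherwise cover, also collapses them and follows symlinks. A minimal sketch with a made-up relative path:

```python
from pathlib import Path

# A hypothetical relative path containing a ".." segment
messy = Path("data") / ".." / "images" / "midjourney.png"

absolute = messy.absolute()  # anchors the path at the current directory, keeps ".."
resolved = messy.resolve()   # also collapses ".." and follows symlinks

print(".." in absolute.parts)  # True
print(".." in resolved.parts)  # False
```

Neither call requires the file to exist, so this runs from any directory.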

Lastly, if you have the misfortune of working with libraries that still require string paths, you can call str(path):

str(Path.home())
'/home/bexgboost'

Most libraries in the data stack have long supported Path objects, including sklearn, pandas, matplotlib, seaborn, etc.

2. Path attributes

Path objects have many useful attributes. Let’s see some examples using this path object that points to an image file.

image_file = Path("images/midjourney.png").absolute()

image_file

PosixPath('/home/bexgboost/articles/2023/4_april/1_pathlib/images/midjourney.png')

Let’s start with parent. It returns a path object that is one level up from the path itself, i.e., the directory that contains it.

image_file.parent
PosixPath('/home/bexgboost/articles/2023/4_april/1_pathlib/images')

Sometimes, you may want only the file name instead of the whole path. There is an attribute for that:

image_file.name
'midjourney.png'

which returns only the file name with the extension.

There is also stem for the file name without the suffix:

image_file.stem
'midjourney'

Or the suffix itself with the dot for the file extension:

image_file.suffix
'.png'

If you want to divide a path into its components, you can use parts instead of str.split('/'):

image_file.parts
('/',
'home',
'bexgboost',
'articles',
'2023',
'4_april',
'1_pathlib',
'images',
'midjourney.png')

If you want the ancestors of a path as Path objects in themselves, you can use the parents attribute, which returns an immutable sequence of every parent directory, closest first:

for i in image_file.parents:
    print(i)
/home/bexgboost/articles/2023/4_april/1_pathlib/images
/home/bexgboost/articles/2023/4_april/1_pathlib
/home/bexgboost/articles/2023/4_april
/home/bexgboost/articles/2023
/home/bexgboost/articles
/home/bexgboost
/home
/
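Here is a hedged sketch of how these attributes behave on a file with two extensions. The path is invented, suffixes is an extra attribute not shown above, and PurePosixPath is used so nothing has to exist on disk:

```python
from pathlib import PurePosixPath

# A made-up archive path; PurePosixPath keeps the output identical on any OS
archive = PurePosixPath("/home/user/backups/photos.tar.gz")

print(archive.name)      # photos.tar.gz
print(archive.suffix)    # .gz  (only the last extension)
print(archive.suffixes)  # ['.tar', '.gz']
print(archive.stem)      # photos.tar

# parents supports indexing and len(), like a tuple
print(archive.parents[0])    # /home/user/backups
print(archive.parents[1])    # /home/user
print(len(archive.parents))  # 4
```

So stem only strips the last suffix, which matters for names like .tar.gz or .csv.zip.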

Working with files

Classified files. — Midjourney

To create files and write to them, you don’t have to use the open function anymore. Just create a Path object and call write_text or write_bytes on it:

markdown = data_dir / "file.md"

# Create (or overwrite) the file and write text
markdown.write_text("# This is a test markdown")

Or, if you already have a file, you can read_text or read_bytes:

markdown.read_text()
'# This is a test markdown'
len(image_file.read_bytes())
1962148

However, note that write_text and write_bytes overwrite the existing contents of a file.

# Write new text to existing file
markdown.write_text("## This is a new line")
# The file is overwritten
markdown.read_text()
'## This is a new line'

To append new information to an existing file, you should use the open method of Path objects in a (append) mode:

# Append text
with markdown.open(mode="a") as file:
    file.write("\n### This is the second line")

markdown.read_text()

'## This is a new line\n### This is the second line'
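To see the overwrite-versus-append behavior in one place, here is a self-contained sketch that runs in a temporary directory. The file name is hypothetical, and passing encoding explicitly is my own habit rather than something the examples above require:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.md"  # hypothetical file name

    # write_text creates the file, or replaces its contents if it exists
    notes.write_text("# Title", encoding="utf-8")
    notes.write_text("## Replaced", encoding="utf-8")

    # Opening in append mode keeps what is already there
    with notes.open(mode="a", encoding="utf-8") as file:
        file.write("\n### Appended")

    content = notes.read_text(encoding="utf-8")

print(repr(content))  # '## Replaced\n### Appended'
```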

It is also common to rename files. The rename method accepts the destination path for the renamed file.

To create the destination path in the same directory, i.e., to rename the file in place, you can use with_stem (available from Python 3.9) on the existing path, which replaces the stem of the original file:

renamed_md = markdown.with_stem("new_markdown")

markdown.rename(renamed_md)

PosixPath('data/new_markdown.md')

Above, file.md is turned into new_markdown.md.
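If you are stuck on a Python version older than 3.9, one possible workaround is with_name, which replaces the whole file name. The sketch below assumes a throwaway directory and also shows that the original Path object is not updated by rename:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    markdown = Path(tmp) / "file.md"
    markdown.write_text("# This is a test markdown", encoding="utf-8")

    # with_name swaps the whole file name, so the suffix is re-attached by hand
    target = markdown.with_name("new_markdown" + markdown.suffix)

    # rename moves the file on disk; the old Path object still holds the old name
    markdown.rename(target)

    old_exists = markdown.exists()
    new_exists = target.exists()
    new_name = target.name

print(old_exists, new_exists, new_name)  # False True new_markdown.md
```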

Let’s see the file size through stat().st_size:

# Display file size
renamed_md.stat().st_size
49 # in bytes

or the last time the file was modified, which was a few seconds ago:

from datetime import datetime

modified_timestamp = renamed_md.stat().st_mtime

datetime.fromtimestamp(modified_timestamp)

datetime.datetime(2023, 4, 3, 13, 32, 45, 542693)

st_mtime returns a timestamp, which is the count of seconds since January 1, 1970. To make it readable, you can use the fromtimestamp function of datetime.
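Since every call to stat asks the operating system again, a reasonable pattern is to call it once and read several fields from the result. A minimal sketch with a hypothetical file of known size:

```python
import tempfile
from datetime import datetime
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file name
    sample.write_bytes(b"hello pathlib")  # 13 bytes

    info = sample.stat()  # query the file system once, read many attributes
    size_in_bytes = info.st_size
    modified_at = datetime.fromtimestamp(info.st_mtime)

print(size_in_bytes)  # 13
print(modified_at.isoformat(timespec="seconds"))
```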

To remove unwanted files, you can unlink them:

renamed_md.unlink(missing_ok=True)

Setting missing_ok to True won’t raise any alarms if the file doesn’t exist.
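Here is a quick sketch of the difference, using a made-up file in a temporary directory. Keep in mind that the missing_ok parameter only exists from Python 3.8 onward:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch.txt"  # hypothetical file name
    scratch.touch()

    scratch.unlink()                 # file exists, so it is removed
    scratch.unlink(missing_ok=True)  # already gone: silently does nothing

    try:
        scratch.unlink()             # default is missing_ok=False
        raised = False
    except FileNotFoundError:
        raised = True

print(raised)  # True
```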

Working with directories

Folders in an office. — Midjourney

There are a few neat tricks to work with directories in Pathlib. First, let’s see how to create directories recursively.

new_dir = (
Path.cwd()
/ "new_dir"
/ "child_dir"
/ "grandchild_dir"
)

new_dir.exists()

False

The new_dir doesn’t exist, so let’s create it with all its children:

new_dir.mkdir(parents=True, exist_ok=True)

By default, mkdir creates the last child of the given path. If the intermediate parents don’t exist, you have to set parents to True.

To remove empty directories, you can use rmdir. If the given path object is nested, only the last child directory is deleted:

# Removes the last child directory
new_dir.rmdir()
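The following sketch walks through the same steps in a temporary directory, including the two failure cases: mkdir without parents=True, and rmdir on a directory that still has contents. The directory names mirror the ones above:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "new_dir" / "child_dir" / "grandchild_dir"

    try:
        nested.mkdir()  # parents are missing, so this fails
        failed_without_parents = False
    except FileNotFoundError:
        failed_without_parents = True

    nested.mkdir(parents=True, exist_ok=True)  # creates every missing level
    created = nested.is_dir()

    nested.rmdir()  # removes only grandchild_dir
    parent_survives = nested.parent.is_dir()

    try:
        nested.parent.parent.rmdir()  # new_dir still contains child_dir
        removed_non_empty = True
    except OSError:
        removed_non_empty = False

print(failed_without_parents, created, parent_survives, removed_non_empty)
# True True True False
```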

To list the contents of a directory like ls on the terminal, you can use iterdir. Again, the result will be a generator object, yielding directory contents as separate path objects one at a time:

for p in Path.home().iterdir():
    print(p)
/home/bexgboost/.python_history
/home/bexgboost/word_counter.py
/home/bexgboost/.azure
/home/bexgboost/.npm
/home/bexgboost/.nv
/home/bexgboost/.julia
...

To capture all files with a specific extension or name pattern, you can use the glob method with a shell-style wildcard pattern.

For example, below, we will find all text files inside my home directory with glob("*.txt"):

home = Path.home()
text_files = list(home.glob("*.txt"))

len(text_files)

3 # Only three

To search for text files recursively, meaning inside all child directories as well, you can use recursive glob with rglob:

all_text_files = list(home.rglob("*.txt"))

len(all_text_files)

5116 # Now much more
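Your home directory will give different counts than mine, so here is a reproducible sketch on a tiny, made-up directory tree. It also shows that a ** segment inside a glob pattern gives the same matches as rglob here:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deeper").mkdir(parents=True)

    # Three text files at different depths, plus one CSV (all made-up names)
    for relative in ["a.txt", "sub/b.txt", "sub/deeper/c.txt", "sub/data.csv"]:
        (root / relative).touch()

    top_level = sorted(p.name for p in root.glob("*.txt"))
    recursive = sorted(p.name for p in root.rglob("*.txt"))
    # A "**/" prefix makes glob search every subdirectory as well
    double_star = sorted(p.name for p in root.glob("**/*.txt"))

print(top_level)                 # ['a.txt']
print(recursive)                 # ['a.txt', 'b.txt', 'c.txt']
print(double_star == recursive)  # True
```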

Glob patterns support the wildcards *, ?, and [abc], plus ** for matching at any depth.

You can also use rglob('*') to list directory contents recursively. It is like the supercharged version of iterdir().

One of the use cases of this is counting the number of file formats that appear within a directory.

To do this, we import the Counter class from collections and provide all file suffixes to it within the articles folder of home:

from collections import Counter

file_counts = Counter(
path.suffix for path in (home / "articles").rglob("*")
)

file_counts

Counter({'.py': 12,
'': 1293,
'.md': 1,
'.txt': 7,
'.ipynb': 222,
'.png': 90,
'.mp4': 39})
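The large count for the empty suffix '' above is partly because rglob("*") also yields directories. If you only care about files, one option is to filter with is_file, as in this sketch on a made-up directory tree:

```python
import tempfile
from collections import Counter
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notebooks").mkdir()

    # Made-up files; "LICENSE" has no suffix at all
    for relative in ["a.py", "b.py", "LICENSE", "notebooks/c.ipynb"]:
        (root / relative).touch()

    everything = Counter(p.suffix for p in root.rglob("*"))
    files_only = Counter(p.suffix for p in root.rglob("*") if p.is_file())

print(everything[""])     # 2  (LICENSE and the notebooks directory)
print(files_only[""])     # 1  (LICENSE only)
print(files_only[".py"])  # 2
```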

Operating system differences

Sorry, but we have to talk about this nightmare of an issue.

Up until now, we have been dealing with PosixPath objects, which are the default for UNIX-like systems:

type(Path.home())
pathlib.PosixPath

If you were on Windows, you would get a WindowsPath object instead. Let’s see what happens when we try to create one on Linux:

from pathlib import WindowsPath

# Use raw strings that start with r to write Windows paths
path = WindowsPath(r"C:\users")
path

NotImplementedError: cannot instantiate 'WindowsPath' on your system

Instantiating another system’s path raises an error like the above.

But what if you were forced to work with paths from another system, like code written by coworkers who use Windows?

As a solution, pathlib offers pure path objects like PureWindowsPath or PurePosixPath:

from pathlib import PurePosixPath, PureWindowsPath

path = PureWindowsPath(r"C:\users")
path

PureWindowsPath('C:/users')

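These are pure path objects: they handle paths purely as text and never touch the file system. You have access to the attributes and methods that manipulate the path itself, but anything that needs the disk, like rename, is unavailable: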

path / "bexgboost"
PureWindowsPath('C:/users/bexgboost')
path.parent
PureWindowsPath('C:/')
path.stem
'users'
path.rename(r"C:\losers") # Unsupported
AttributeError: 'PureWindowsPath' object has no attribute 'rename'
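One place where pure paths earn their keep is translating a path written on one system into something usable on another. The sketch below assumes a hypothetical Windows path from a coworker's config and rebuilds the part below the user folder as a POSIX-style relative path:

```python
from pathlib import PurePosixPath, PureWindowsPath

# A hypothetical Windows path, e.g. copied from a coworker's config file
win_path = PureWindowsPath(r"C:\Users\bexgboost\data\file.csv")

print(win_path.drive)  # C:
print(win_path.name)   # file.csv
print(win_path.parts)  # ('C:\\', 'Users', 'bexgboost', 'data', 'file.csv')

# Strip the Windows-specific prefix, then rebuild the rest as a POSIX path
relative = win_path.relative_to(r"C:\Users\bexgboost")
posix_path = PurePosixPath(*relative.parts)

print(posix_path)  # data/file.csv
```

Because no file system access is involved, this runs the same way on Linux, macOS, and Windows.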

Conclusion

As you may have noticed, I lied in the title of the article. Instead of 15, I believe the count of new tricks and functions was 30ish.

I didn’t want to scare you off.

But I hope I’ve convinced you enough to ditch os.path and start using pathlib for much easier and more readable path operations.

Forge a new path, if you will :)

Path. — Midjourney

