An introduction to Python for R users

This introduction to Python assumes you know R and uses it as an analogy to explain Python for data analysis.

Categories: python, r

Author: Marc Dotson

Published: 13 Jun 2025

In classes and workshops over many years, I’ve taught data analytics using R. But in my new position, I teach data analytics using Python. This introduction to Python is for R users—primarily me, though I hope it proves useful to others as well.

There are incredible resources in this space, and I’ve drawn liberally from a number of them. As an overall introduction to Python, Python for Data Analysis is a go-to resource and the spiritual equivalent of R for Data Science. I also really appreciate the work in Python and R for the Modern Data Scientist, especially for the authors’ clear espousing that this isn’t an either/or situation—you can (and arguably should) use Python and R as complements.

I am especially indebted to Emily Riederer’s blog series beginning with Python Rgonomics and subscribe to her philosophy of using tools in Python that are genuinely “Pythonic” while being consistent with the workflow and ergonomics of the best R has to offer. I am also grateful for extra help from posit::conf workshop instructors and colleagues in my new position at Utah State University.

Different mindsets

When you start working with Python, it’s essential that you approach it with the right mindset. R is a specialized language developed by statisticians for data analysis. Python is a big tent, a general programming language developed by computer scientists for many things, with only a small portion of it dedicated to data analysis.

To summarize some key differences:

Python | R
General language | Specialized language
Developed by computer scientists | Developed by statisticians
Object-oriented programming | Functional programming
Obsessed with efficiency | Lazy about efficiency
Obsessed with namespacing | Lazy about namespacing
Small, medium, and large data | Small and medium data
Machine learning and deep learning | Data wrangling and visualization
Spacing is part of the syntax | Spacing is for convenience
Indices start with 0 | Indices start with 1
No single authority | Dominated by Posit
Jupyter Notebooks | R Markdown and Quarto
Inconsistent (i.e., no tidyverse) | Consistency in the tidyverse
“Pythonistas” and “Pythonic” code | “R Users” and “Tidy” code

While these are broad strokes and not 100% accurate in every case, they help provide some high-level context for how the two languages deviate in their approaches to common problems.

Functions, methods, and attributes

With the right mindset, it’s easier to understand some of the things that Python is obsessed with that R simply isn’t. The most important difference is that Python is an object-oriented programming language while R is all about functional programming. While everything in R is a function (for a typical user), using Python requires frequently keeping track of the difference between functions, methods, and attributes.

Functions in Python and R are equivalent, although functions in Python are typically namespaced with the library name or alias like library.function(). Note that we can, but often don’t, similarly namespace R functions with package::function(). Methods are object-specific functions. In other words, methods are functions nested within object types and are namespaced with an object name of the given type as in object.method(). While it’s possible to import a specific function such that we can call it without referencing its library name or alias, we can never call a method without reference to an object name of the necessary type. In other words, we may see a function() like in R, but we will always see a .method() with an object. One more set of definitions: just like packages in Python are typically referred to as libraries, function (and method) arguments are referred to as parameters.

Attributes are object-specific features and are, like methods, namespaced with an object name of the given type as in object.attribute, but without any parentheses. For example, the dimensions of a NumPy array could be referenced with array.shape while the equivalent in base R would be a function like dim(array).
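
To make the distinction concrete, here’s a quick illustration using NumPy (which we’ll meet again later): a function namespaced with the library alias, a method namespaced with an object name, and a couple of attributes.

import numpy as np

array = np.arange(6)          # np.arange() is a function, namespaced with the library alias
matrix = array.reshape(2, 3)  # .reshape() is a method, namespaced with the object name

matrix.shape  # an attribute: the dimensions, like dim() in R
matrix.size   # an attribute: the total number of elements, like length() in R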

Get started

The first hurdle as we work to apply our new mindset is simply getting Python and our project environment installed. This is a big departure from what we’re used to in R, where there is one way to install R and we usually ignore our project environment, let alone make it reproducible. Remember, Python is a big tent with lots of uses and, unsurprisingly, lots of ways to do everything I’m covering. However, from the perspective of someone coming from R and with a focus on data analytics, I recommend the following.

While I started using pyenv and venv for managing Python versions and libraries, respectively, there’s a new(er) kid on the block that has been receiving lots of attention: uv, a single unified tool for managing Python project environments. (As a bellwether, {reticulate} and Positron now use uv.) Get started by installing uv via the command line.

The Command Line

Using the command line (i.e., terminal or shell) isn’t as common for R users who are comfortable with a dedicated console for running code. Be patient, take your time, and follow any instructions from a trusted source closely. A few things to help:

  • The command line is the programming interface into your OS itself. You don’t have to know everything about it to follow instructions.
  • Instructions can be different based on the type of command line. If you’re on a Mac that’s running macOS Catalina 10.15.7 or later, the default shell is Zsh. If you’re using Linux, the shell is Bash (and you probably already know that). And if you’re using Windows, you’re working with PowerShell.

Install Python

Unlike your experience with R, Python comes pre-installed on some operating systems. This version should not be used by anyone except the OS itself. For this and other reasons, you’ll need the ability to maintain multiple versions of Python on the same computer. Once you have uv installed, it’s easy to install and manage Python versions.

  • To install the latest stable release of Python, on the command line, run uv python install. To see which versions of Python you have installed, run uv python find; none of these will be the off-limits OS version.
  • You can also install specific versions of Python, such as uv python install 3.13.4 to install Python 3.13.4. To view Python versions that are available to install, run uv python list.

Positron IDE

As you well know, an integrated development environment (IDE), outside of an open source language, is arguably your most important tool as a data analyst. There are many options, but I recommend Positron, a next-generation data science IDE. Built by Posit on VS Code’s open source core, Positron combines the multilingual extensibility of VS Code with essential data tools common to language-specific IDEs.

If RStudio is too specific and VS Code is too general, you may find that Positron is just right and becomes your only IDE for both Python and R. And unless you’re comfortable navigating between directories on the command line, you’ll appreciate that Positron’s built-in terminal is already tied to the working directory you have open in the IDE.

Initialize a project environment

A project environment is composed of the language(s) and libraries (including the dependencies) used for a given project. What makes a project environment reproducible is keeping track of which version of the language(s) and the libraries we’re using for a given project so that it can be easily reproduced on another machine by you (including future you) or someone else.

  • After navigating to a project working directory, run uv init to initialize a project environment. This creates a pyproject.toml file with metadata about the project and a hidden .python-version file that specifies the default version of Python for the project. (It also creates main.py and README.md files that you can use or delete.)
  • With the project environment initialized, you can install libraries. For example, to install the Polars library, run uv add polars. This installs Polars, and any dependencies, and creates both a uv.lock file that keeps track of the versions of the libraries you’ve installed and a hidden .venv reproducible (or virtual, hence the “v” in venv) environment folder that serves as the project library.

Just like with R, Python libraries are by default installed in a single, global library on your computer known as the system library. The fact that we now have a project library highlights an important feature of making project environments reproducible: Each project will have its own project library and thus be isolated. If two projects use different versions of the same library, they won’t conflict with each other because they’ll each have their own project library. (Well, not exactly. uv employs a global cache to avoid having to install the same version of a given library more than once. The project library will reference the global cache.) Whenever you install new libraries, the uv.lock file is automatically updated. And if you’re starting with an existing project, run uv run for the libraries included in uv.lock to be automatically installed.

It might seem like a lot just to get started, but it’s something we should be doing in R as well. Along with a project’s code, all someone would need is the pyproject.toml, .python-version, and uv.lock files to reproduce your code, including the project environment. Well, assuming they’re also using uv to manage their project environments. If they’re using another tool to install libraries instead (yes, there are many ways to install libraries in Python), they will likely need a requirements.txt file or a pylock.toml file to reproduce the project environment, which you can create with uv export --format requirements.txt or uv export -o pylock.toml, respectively.

Data wrangling

Whatever the language, the most common task we have for any data analysis is data wrangling (i.e., cleaning, munging, etc.). The NumPy library is more or less equivalent to what we see in base R, introducing arrays and efficient computation across arrays—but, perhaps surprisingly, not data frames. Data frames (or DataFrames as they are referred to in Python) came later with pandas (short for “panel data”). Still the most popular library for data wrangling in Python, pandas was built to supplement NumPy, with all of the syntax baggage that entails. However, growing in popularity is Polars (an anagram of the query language it uses, OLAP, and the language it’s built in, Rust or rs). Polars is something of an answer to pandas’ problems. Its core is written free of any Python code (yes, you can even use it from R), and it offers a glimpse of what the polyglot future might look like.

My take is that Polars provides a more self-consistent data wrangling experience than pandas, echoing the experience of many who come to Polars for the speed and stay for the syntax. To illustrate the tidyverse spiritual connections, if not the deeper shared roots in SQL syntax, there are a number of great side-by-side comparisons of Polars and {dplyr}. I’ll illustrate a few common tasks in both Polars and {dplyr} and point out the differences to be mindful of.

Import data

Libraries and modules (a kind of sub-library) are imported with commonly accepted aliases in order to shorten the namespace reference. For example, the Polars alias convention is pl. We’re also importing the os library, which needs no alias, to write out a relative file path that will work for any user or operating system that has the same relative directory structure: a “Pythonic” version of {here}.

Note the difference between the pl.read_csv() function and the .shape and .columns attributes.

import polars as pl
import os

customer_data = pl.read_csv(os.path.join('data', 'customer_data.csv'))
customer_data.shape
(10531, 13)
customer_data.columns
['customer_id', 'birth_year', 'gender', 'income', 'credit', 'married', 'college_degree', 'region', 'state', 'star_rating', 'review_time', 'review_title', 'review_text']
library(tidyverse)

customer_data <- read_csv(here::here("posts", "python-intro", "data", "customer_data.csv"))
glimpse(customer_data)
Rows: 10,531
Columns: 13
$ customer_id    <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1…
$ birth_year     <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1…
$ gender         <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M…
$ income         <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000…
$ credit         <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6…
$ married        <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"…
$ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"…
$ region         <chr> "South", "West", "South", "Midwest", "West", "Midwest",…
$ state          <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "…
$ star_rating    <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N…
$ review_time    <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",…
$ review_title   <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al…
$ review_text    <chr> "everything's fine", NA, NA, NA, "I looked all over the…

Filter observations

Polars DataFrames have methods that are similar to {dplyr} (e.g., .filter() and filter()). DataFrames are composed of columns called Series (equivalent to R’s vectors). Note that unlike pandas DataFrames, Polars DataFrames don’t have a row index.

In pandas, we would need to reference column names with data['column_name'] (like base R’s data$column_name or data["column_name"]), but Polars allows for pl.col('column_name'). Yes, we use quotation marks for every column name. The pl.col() expression offers a helper-function-like consistency.

customer_data.filter(pl.col('gender') == 'Female', pl.col('income') > 70000)
shape: (3_970, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1010 1994 "Female" 77000 605.157869 "Yes" "No" "Northeast" "VT" 5 "10 18, 2015" "nice strong enough" "Yeah"
1012 1953 "Female" 126000 672.59663 "Yes" "No" "South" "DC" 5 "01 6, 2015" "Five Stars" "easy setup  works great"
1013 1974 "Female" 197000 680.211781 "Yes" "Yes" "West" "UT" null null null null
1022 1979 "Female" 155000 804.896188 "No" "Yes" "West" "UT" 5 "04 22, 2017" "Yak Attack is great!" "Very simple to use. Nice acces…
11515 1969 "Female" 203000 645.425436 "Yes" "Yes" "Northeast" "MA" null null null null
11519 1977 "Female" 149000 676.361119 "Yes" "No" "West" "UT" null null null null
11524 1994 "Female" 114000 644.463202 "Yes" "No" "South" "AL" null null null null
11526 1987 "Female" 109000 611.095352 "No" "No" "South" "MS" 4 "04 22, 2016" "Overall - A good deal" "Not the highest quality in the…
11529 1966 "Female" 112000 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null
filter(customer_data, gender == "Female", income > 70000)
# A tibble: 3,970 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female  73000   742. No      No             South    
 2        1010       1994 Female  77000   605. Yes     No             Northeast
 3        1012       1953 Female 126000   673. Yes     No             South    
 4        1013       1974 Female 197000   680. Yes     Yes            West     
 5        1022       1979 Female 155000   805. No      Yes            West     
 6        1023       1995 Female 137000   539. No      Yes            Northeast
 7        1024       1974 Female 285000   685. Yes     Yes            Midwest  
 8        1028       1980 Female  87000   715. No      No             West     
 9        1030       1969 Female 163000   636. Yes     Yes            West     
10        1036       1978 Female 227000   614. Yes     Yes            Northeast
# ℹ 3,960 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Slice observations

Note that Python is zero-indexed. This is probably the most problematic (and very computer science-based) difference and why it’s nice to avoid indexing if you can!

Remember that function (and method) arguments are called parameters. Some parameters are positional and have to be, as the name suggests, specified in an exact position. Others are keyword or named parameters, which is what’s commonplace in R. The positional parameters for Polars’ .slice() method are the start index and the slice length.
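
A quick illustration of zero-based indexing with a plain Python list (made up for illustration), plus the same .slice() call using keyword rather than positional parameters (Polars names these offset and length):

regions = ['South', 'West', 'Midwest', 'Northeast']  # a plain Python list
regions[0]    # 'South', the first element is at index 0
regions[1:3]  # ['West', 'Midwest'], the endpoint of a slice is excluded

customer_data.slice(offset=0, length=5)  # same as customer_data.slice(0, 5) below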

customer_data.slice(0, 5)
shape: (5, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1002 1970 "Female" 31000 749.351433 "Yes" "No" "West" "WA" null null null null
1003 1988 "Male" 35000 542.239859 "No" "No" "South" "AR" null null null null
1004 1984 "Other" 64000 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null
1005 1987 "Male" 58000 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet…
slice(customer_data, 1:5)
# A tibble: 5 × 13
  customer_id birth_year gender income credit married college_degree region 
        <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>  
1        1001       1971 Female  73000   742. No      No             South  
2        1002       1970 Female  31000   749. Yes     No             West   
3        1003       1988 Male    35000   542. No      No             South  
4        1004       1984 Other   64000   574. Yes     Yes            Midwest
5        1005       1987 Male    58000   644. No      Yes            West   
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Sort observations

It can be strange at first, but namespacing is critical. Remember that a function is preceded by the library name or alias (e.g., pl.col()), unless you’ve imported the specific function (e.g., from polars import col), while a method is preceded by an object name of a certain type (e.g., customer_data.sort()). Since object types are tied to libraries, the chain back to the corresponding library is always present, explicitly for functions and implicitly for methods.
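
For example, we could import col directly and drop the pl. prefix, though the .sort() method still hangs off the DataFrame:

from polars import col

customer_data.sort(col('birth_year'), descending=True)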

Note that it’s True and False, not TRUE and FALSE or true and false.

customer_data.sort(pl.col('birth_year'), descending = True)
shape: (10_531, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1026 1999 "Male" 66000 643.030421 "No" "No" "South" "AL" null null null null
1049 1999 "Other" 88000 630.037632 "No" "Yes" "West" "WA" null null null null
1092 1999 "Other" 77000 664.000947 "No" "Yes" "Midwest" "MO" null null null null
1107 1999 "Female" 97000 579.437133 "No" "Yes" "West" "CO" null null null null
1113 1999 "Male" 190000 660.594015 "Yes" "Yes" "Northeast" "PA" 5 "07 10, 2014" "Five Stars" "this is awesome"
8277 1947 "Male" 92000 839.039929 "Yes" "No" "West" "OR" null null null null
8525 1947 "Female" 170000 723.312232 "Yes" "Yes" "West" "CO" 3 "09 4, 2016" "Knuckles are not hard and they… "Knuckles  are not hard and the…
9437 1947 "Female" 95000 850.0 "No" "Yes" "West" "ID" null null null null
10069 1947 "Other" 84000 742.581417 "Yes" "Yes" "West" "WY" null null null null
9585 1939 "Female" 182000 850.0 "Yes" "No" "Northeast" "PA" null null null null
arrange(customer_data, desc(birth_year))
# A tibble: 10,531 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1026       1999 Male    66000   643. No      No             South    
 2        1049       1999 Other   88000   630. No      Yes            West     
 3        1092       1999 Other   77000   664. No      Yes            Midwest  
 4        1107       1999 Female  97000   579. No      Yes            West     
 5        1113       1999 Male   190000   661. Yes     Yes            Northeast
 6        1126       1999 Female 121000   639. No      Yes            Midwest  
 7        1132       1999 Male    53000   669. No      No             West     
 8        1139       1999 Female 293000   659. Yes     Yes            Northeast
 9        1143       1999 Female  74000   587. No      Yes            West     
10        1147       1999 Other  109000   429. Yes     No             West     
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Select variables

Using single square brackets [ ] creates a list. This is similar to creating a vector in R with c(). A list is a fundamental Python object type and can be turned into a Series.
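
A quick illustration with made-up values:

star_ratings = [4, 5, 3, 5]      # a Python list, akin to c(4, 5, 3, 5) in R
stars = pl.Series(star_ratings)  # a Polars Series, akin to an R vector
stars.mean()                     # 4.25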

customer_data.select(pl.col(['region', 'review_text']))
shape: (10_531, 2)
region review_text
str str
"South" "everything's fine"
"West" null
"South" null
"Midwest" null
"West" "I looked all over the Internet…
"Northeast" null
"Northeast" null
"Northeast" null
"Midwest" null
"West" "Husband really appreciates the…
select(customer_data, region, review_text)
# A tibble: 10,531 × 2
   region    review_text                                                        
   <chr>     <chr>                                                              
 1 South     everything's fine                                                  
 2 West      <NA>                                                               
 3 South     <NA>                                                               
 4 Midwest   <NA>                                                               
 5 West      I looked all over the Internet to find a non-plastic water bottle.…
 6 Midwest   I ordered these sweat pants for my 12-year old daughter to wear fo…
 7 Midwest   <NA>                                                               
 8 South     Super comfortable mini pack. Bought it for hiking..large enough to…
 9 West      <NA>                                                               
10 Northeast Yeah                                                               
# ℹ 10,521 more rows

Create new variables

Polars is actually a query language, like SQL. So it’s not surprising to see methods with names that more closely mirror queries, like the .with_columns() method.

customer_data.with_columns(income = pl.col('income') / 1000)
shape: (10_531, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str f64 f64 str str str str i64 str str str
1001 1971 "Female" 73.0 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1002 1970 "Female" 31.0 749.351433 "Yes" "No" "West" "WA" null null null null
1003 1988 "Male" 35.0 542.239859 "No" "No" "South" "AR" null null null null
1004 1984 "Other" 64.0 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null
1005 1987 "Male" 58.0 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet…
11527 1982 "Other" 169.0 603.366567 "Yes" "Yes" "Northeast" "NY" null null null null
11528 1986 "Female" 24.0 647.671383 "Yes" "Yes" "Northeast" "RI" null null null null
11529 1966 "Female" 112.0 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null
11530 1971 "Male" 234.0 790.440173 "Yes" "Yes" "Midwest" "OH" null null null null
11531 1990 "Male" 160.0 641.77857 "No" "Yes" "West" "MT" 5 "08 21, 2014" "These are definitely one of th… "Husband really appreciates the…
mutate(customer_data, income = income / 1000)
# A tibble: 10,531 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female     73   742. No      No             South    
 2        1002       1970 Female     31   749. Yes     No             West     
 3        1003       1988 Male       35   542. No      No             South    
 4        1004       1984 Other      64   574. Yes     Yes            Midwest  
 5        1005       1987 Male       58   644. No      Yes            West     
 6        1006       1994 Male      164   554. Yes     Yes            Midwest  
 7        1007       1968 Male       39   608. No      No             Midwest  
 8        1008       1994 Male       69   710. No      No             South    
 9        1009       1958 Male      233   702. No      No             West     
10        1010       1994 Female     77   605. Yes     No             Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Join data frames

Note that missing values are identified as null (Polars distinguishes null from floating-point NaN). Series types include str, bytes (binary), i64 (integer), f64 (float), and bool.
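
If you want to check the types and missing values for yourself, the .schema attribute and the .null_count() method are handy:

customer_data.schema        # maps each column name to its Series type
customer_data.null_count()  # counts the null values in each column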

store_transactions = pl.read_csv(os.path.join('data', 'store_transactions.csv'))

customer_data.join(store_transactions, on='customer_id', how='left')
shape: (10_531, 181)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text jan_2005 feb_2005 mar_2005 apr_2005 may_2005 jun_2005 jul_2005 aug_2005 sep_2005 oct_2005 nov_2005 dec_2005 jan_2006 feb_2006 mar_2006 apr_2006 may_2006 jun_2006 jul_2006 aug_2006 sep_2006 oct_2006 nov_2006 dec_2006 dec_2015 jan_2016 feb_2016 mar_2016 apr_2016 may_2016 jun_2016 jul_2016 aug_2016 sep_2016 oct_2016 nov_2016 dec_2016 jan_2017 feb_2017 mar_2017 apr_2017 may_2017 jun_2017 jul_2017 aug_2017 sep_2017 oct_2017 nov_2017 dec_2017 jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018 jul_2018 aug_2018 sep_2018 oct_2018 nov_2018 dec_2018
i64 i64 str i64 f64 str str str str i64 str str str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine" 0 0 0 4 0 4 0 0 0 4 0 3 0 2 0 0 0 3 0 0 0 0 0 5 0 0 0 0 3 0 0 4 0 0 3 5 2 0 0 0 0 0 0 3 0 2 0 0 0 1 0 0 0 0 4 0 0 0 0 0 0
1002 1970 "Female" 31000 749.351433 "Yes" "No" "West" "WA" null null null null 0 0 4 0 0 0 0 0 0 0 2 0 3 0 0 0 0 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 5
1003 1988 "Male" 35000 542.239859 "No" "No" "South" "AR" null null null null 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 4 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 5 4 4 0 0 0 0 0 0 0 0 0 4 0
1004 1984 "Other" 64000 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4 0 1 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 5 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
1005 1987 "Male" 58000 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet… 0 0 0 0 0 4 0 5 0 3 4 0 0 0 0 0 0 0 0 0 5 0 1 0 0 0 0 0 0 0 0 0 0 0 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 3 2
11527 1982 "Other" 169000 603.366567 "Yes" "Yes" "Northeast" "NY" null null null null 4 0 4 0 0 0 0 3 0 5 0 5 3 0 0 0 0 0 0 0 0 2 4 0 5 0 0 0 0 0 2 0 0 3 3 0 0 0 0 5 0 2 3 0 0 0 0 3 4 0 3 0 0 0 0 0 0 0 2 0 0
11528 1986 "Female" 24000 647.671383 "Yes" "Yes" "Northeast" "RI" null null null null 0 0 0 0 0 0 0 0 0 0 3 0 2 0 5 0 2 0 4 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 5 0 4 5
11529 1966 "Female" 112000 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null 0 0 0 0 0 0 0 4 1 0 3 5 5 0 0 0 0 0 3 0 0 0 0 0 0 0 5 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
11530 1971 "Male" 234000 790.440173 "Yes" "Yes" "Midwest" "OH" null null null null 3 0 3 0 0 0 0 3 0 0 3 0 0 0 0 4 0 0 0 0 0 0 0 0 3 4 0 0 0 3 5 0 0 0 0 4 1 0 0 4 0 0 0 0 0 0 0 3 2 2 0 0 0 0 2 0 0 2 0 0 3
11531 1990 "Male" 160000 641.77857 "No" "Yes" "West" "MT" 5 "08 21, 2014" "These are definitely one of th… "Husband really appreciates the… 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0 0 5 0 0 0 0 4 3 4 0 2 0 3 0 3 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 3
store_transactions <- read_csv(here::here("posts", "python-intro", "data", "store_transactions.csv"))

left_join(customer_data, store_transactions, join_by(customer_id))
# A tibble: 10,531 × 181
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female  73000   742. No      No             South    
 2        1002       1970 Female  31000   749. Yes     No             West     
 3        1003       1988 Male    35000   542. No      No             South    
 4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
 5        1005       1987 Male    58000   644. No      Yes            West     
 6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
 7        1007       1968 Male    39000   608. No      No             Midwest  
 8        1008       1994 Male    69000   710. No      No             South    
 9        1009       1958 Male   233000   702. No      No             West     
10        1010       1994 Female  77000   605. Yes     No             Northeast
# ℹ 10,521 more rows
# ℹ 173 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
#   mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
#   jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
#   nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
#   mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …

Consecutive lines of code

While possible with Python code generally, Polars embraces writing consecutive lines of code using method chaining, which is clearly akin to piping in R. Note that each line starts with . (rather than ending with |>) and the entire chain needs to be surrounded with ( ).

(customer_data
  .join(store_transactions, on='customer_id', how='left')
  .filter(pl.col('region') == 'West', pl.col('feb_2005') == pl.col('feb_2005').max())
  .with_columns(age = 2024 - pl.col('birth_year'))
  .select(pl.col(['age', 'feb_2005']))
  .sort(pl.col('age'), descending=True)
  .slice(0, 1)
)
shape: (1, 2)
age feb_2005
i64 i64
67 5
customer_data |> 
  left_join(store_transactions, join_by(customer_id)) |> 
  filter(region == "West", feb_2005 == max(feb_2005)) |> 
  mutate(age = 2024 - birth_year) |> 
  select(age, feb_2005) |> 
  arrange(desc(age)) |> 
  slice(1)
# A tibble: 1 × 2
    age feb_2005
  <dbl>    <dbl>
1    67        5

Summarize discrete data

The .agg() method stands for aggregate, which is exactly what summarize() does in R.

(customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)
shape: (8, 3)
region college_degree n
str str u32
"West" "Yes" 4106
"South" "No" 891
"South" "Yes" 220
"Northeast" "No" 640
"West" "No" 989
"Midwest" "Yes" 872
"Midwest" "No" 229
"Northeast" "Yes" 2584
customer_data |> 
  count(region, college_degree)
# A tibble: 8 × 3
  region    college_degree     n
  <chr>     <chr>          <int>
1 Midwest   No               229
2 Midwest   Yes              872
3 Northeast No               640
4 Northeast Yes             2584
5 South     No               891
6 South     Yes              220
7 West      No               989
8 West      Yes             4106

Summarize continuous data

This is a good example of where object-oriented programming requires a different mindset. You might think that there is a general mean() function like in R, but there isn’t and you’d have to load a specific library and reference its namespace to use such a function. Instead, .mean() here is a method for Polars Series and DataFrames.
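
If you do want a standalone function, it has to come from somewhere and carry that namespace with it—for example, a sketch using the standard library’s statistics module or NumPy:

import statistics
import numpy as np

statistics.mean([1, 2, 3])                   # a namespaced function from the standard library
np.mean(customer_data['income'].to_numpy())  # NumPy's mean applied to a column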

(customer_data
  .select(pl.col(['income', 'credit']))
  .mean()
)
shape: (1, 2)
income credit
f64 f64
138622.637926 667.160128
customer_data |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
# A tibble: 1 × 2
  avg_income avg_credit
       <dbl>      <dbl>
1    138623.       667.

Summarize discrete and continuous data

Combining .agg() with .group_by() is just as powerful as it is in {dplyr}.

(customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    n = pl.len(), 
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending=True)
)
shape: (12, 5)
gender region n avg_income avg_credit
str str u32 f64 f64
"Other" "Midwest" 124 154637.096774 663.269593
"Male" "Midwest" 420 152466.666667 666.03375
"Other" "Northeast" 337 150563.79822 665.445377
"Male" "Northeast" 1285 150498.054475 665.087794
"Male" "West" 2079 149452.621453 666.591336
"Female" "West" 2497 133818.582299 668.158516
"Female" "Northeast" 1602 133332.709114 669.02095
"Other" "South" 118 119067.79661 659.819318
"Male" "South" 430 117988.372093 669.075539
"Female" "South" 563 105888.099467 663.69759
customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups:   gender [3]
   gender region        n avg_income avg_credit
   <chr>  <chr>     <int>      <dbl>      <dbl>
 1 Other  Midwest     124    154637.       663.
 2 Male   Midwest     420    152467.       666.
 3 Other  Northeast   337    150564.       665.
 4 Male   Northeast  1285    150498.       665.
 5 Male   West       2079    149453.       667.
 6 Other  West        519    144420.       667.
 7 Female Midwest     557    134083.       671.
 8 Female West       2497    133819.       668.
 9 Female Northeast  1602    133333.       669.
10 Other  South       118    119068.       660.
11 Male   South       430    117988.       669.
12 Female South       563    105888.       664.

Lazy evaluation

Remember how Polars is incredibly fast? By tagging a data frame with .lazy(), we are asking Polars to not evaluate the code until triggered and to optimize the code for us in the underlying query engine. Before the code is triggered with something like .collect(), you can even see the underlying optimized query using .explain().

This is essentially what happens when you use {dplyr} to connect to and communicate with a database using SQL (minus the underlying query optimization).

df = (customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    n = pl.len(), 
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending=True)
).lazy()

df.explain()
'DF ["gender", "region", "n", "avg_income", ...]; PROJECT */5 COLUMNS'
df.collect()
shape: (12, 5)
gender region n avg_income avg_credit
str str u32 f64 f64
"Other" "Midwest" 124 154637.096774 663.269593
"Male" "Midwest" 420 152466.666667 666.03375
"Other" "Northeast" 337 150563.79822 665.445377
"Male" "Northeast" 1285 150498.054475 665.087794
"Male" "West" 2079 149452.621453 666.591336
"Female" "West" 2497 133818.582299 668.158516
"Female" "Northeast" 1602 133332.709114 669.02095
"Other" "South" 118 119067.79661 659.819318
"Male" "South" 430 117988.372093 669.075539
"Female" "South" 563 105888.099467 663.69759
data_db <- customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))

data_db |> show_query()
<SQL>
SELECT
    gender,
    region,
    COUNT(*) AS n,
    AVG(income) AS avg_income,
    AVG(credit) AS avg_credit
FROM
    customer_data
GROUP BY
    gender,
    region
ORDER BY
    avg_income DESC;
data_db |> collect()
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups:   gender [3]
   gender region        n avg_income avg_credit
   <chr>  <chr>     <int>      <dbl>      <dbl>
 1 Other  Midwest     124    154637.       663.
 2 Male   Midwest     420    152467.       666.
 3 Other  Northeast   337    150564.       665.
 4 Male   Northeast  1285    150498.       665.
 5 Male   West       2079    149453.       667.
 6 Other  West        519    144420.       667.
 7 Female Midwest     557    134083.       671.
 8 Female West       2497    133819.       668.
 9 Female Northeast  1602    133333.       669.
10 Other  South       118    119068.       660.
11 Male   South       430    117988.       669.
12 Female South       563    105888.       664.
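
One more note on laziness: the optimizer has the most room to work when the query is lazy from the start. Here’s a minimal sketch (assuming the same customer_data.csv as before) that begins with pl.scan_csv(), so nothing is read from disk until .collect() is called:

lazy_summary = (pl.scan_csv(os.path.join('data', 'customer_data.csv'))
  .group_by('gender', 'region')
  .agg(avg_income = pl.col('income').mean())
)

lazy_summary.collect()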

Visualization

There’s no way around it—visualizing data is where {ggplot2} simply reigns supreme, so much so that Posit has been investing in plotnine, a {ggplot2} port for Python. Maybe Posit will eventually facilitate a polyglot grammar of graphics. However, in the spirit of this post, let’s consider a genuinely “Pythonic” tool.

If NumPy is base R, matplotlib is plotting in base R. If that analogy holds, matplotlib is an acquired taste, so much so that seaborn was developed to supplement matplotlib, much like pandas was developed to supplement NumPy. While we don’t have a Polars-like replacement (come on Hadley, a polyglot {ggvis} is written in the stars!), we do have seaborn.objects, a still-in-development module deliberately built with the consistency of the grammar of graphics in mind that also attempts to eliminate the need to invoke matplotlib for fine-tuning.

As a reminder, the grammar of graphics is a philosophical approach to visualizing data created by Leland Wilkinson that inspired the creation of {ggplot2} and seaborn.objects. It’s about composing a visualization a layer at a time, specifically:

  1. Data to visualize
  2. Mapping graphical elements to data
  3. A specific graphic representing the data and mappings
  4. Additional fine-tuning via facets, labels, scales, etc.

Having this principled approach to guide the development and consistency of a plotting library is what distinguishes {ggplot2} and seaborn.objects. (I have often made the argument that SQL does something similar in providing a kind of grammar of data manipulation.) Let’s illustrate some common visualizations using seaborn.objects and {ggplot2} and gain some intuition for how they are related and divergent.

Column plots

Once again, we see the use of an alias to shorten the namespace reference. Here, the alias convention for the seaborn.objects module is so. We also see one of the limitations of method chaining (and thus object-oriented programming): Methods are specific to objects defined by libraries. Thus we can’t directly method chain data wrangled using Polars objects to be visualized by seaborn.objects like we can pipe data from {dplyr} to {ggplot2} in R.

However, the plot itself starts with a familiar-looking so.Plot() function (seaborn.objects’ version of ggplot()), which instantiates a Plot object and specifies (1) our data and (2) the mapping between that data and graphical elements. Then, with a consistency that isn’t present in {ggplot2} (I’m looking at you, |> vs. +), there is a set of methods applicable to the Plot object, starting with .add(), that can be method chained. Finally, the (3) specific graphic is created with another familiar-looking so.Bar(), which is a specific example of an object called a Mark (seaborn.objects’ version of geom_*()).

import seaborn.objects as so

region_count = (customer_data
  .group_by(pl.col('region'))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n')
  .add(so.Bar())
)

customer_data |> 
  count(region) |> 
  ggplot(aes(x = region, y = n)) +
  geom_col()

In certain instances we can have the necessary data wrangling done as part of the visualization. For example, in {ggplot2} we can call geom_bar() instead of geom_col() to produce the same plot, while in seaborn.objects we still use the same Mark so.Bar() but add another object called a Stat, in this instance so.Hist(), to produce the counts for us.

(so.Plot(customer_data, x = 'region')
  .add(so.Bar(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = region)) +
  geom_bar()

Another Stat object is so.Agg(). We can join this with an object type called a Move to further customize our visualization—for example, using the Move object so.Dodge() to create a dodged column plot.

region_count = (customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n', color = 'college_degree')
  .add(so.Bar(), so.Agg(), so.Dodge())
)

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "dodge")

Histograms

Accepting so.Hist() as a Stat rather than a Mark (whereas in {ggplot2} a histogram gets its own geometry, geom_histogram()) may seem awkward for the R user. However, what results in many specific geometries in {ggplot2} is reduced by the composability of Mark, Stat, and Move objects in seaborn.objects. For example, a histogram also uses the so.Hist() Stat to bin data but uses the so.Bars() Mark instead of so.Bar() to produce an actual histogram.

(so.Plot(customer_data, x = 'income', color = 'region')
  .add(so.Bars(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = income)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatterplots

Again we see the one-to-one correspondence of Mark objects and geometries, so.Dot() and geom_point().

(so.Plot(customer_data, x = 'income', y = 'credit')
  .add(so.Dot())
)

customer_data |> 
  ggplot(aes(x = income, y = credit)) +
  geom_point()

We see again that a specialized geometry in {ggplot2} corresponds to some combination of Mark, Stat, and Move objects in seaborn.objects (so.Jitter() is another Move object, like so.Dodge()), each with parameters that can be further modified within the function call.

(so.Plot(customer_data, x = 'star_rating', y = 'income')
  .add(so.Dot(pointsize = 10, alpha = 0.5), so.Jitter(0.75))
)

customer_data |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_jitter(size = 3, alpha = 0.5)
Warning: Removed 7372 rows containing missing values or values outside the scale range
(`geom_point()`).

Line plots

It’s comforting to know that working with dates is just as problematic in Python as it is in R.

(so.Plot(customer_data, x = 'review_time', y = 'star_rating')
  .add(so.Line())
)

customer_data |> 
  ggplot(aes(x = review_time, y = star_rating)) +
  geom_line()
Warning: Removed 7372 rows containing missing values or values outside the scale range
(`geom_line()`).

Since we can’t method chain between different object classes (including those from different libraries), we still rely on the familiar back-and-forth between wrangling data and visualizing it.

rating_data = (customer_data
  .drop_nulls(pl.col('star_rating'))
  .select(pl.col(['review_time', 'star_rating']))
  .with_columns(
    pl.col('review_time').str.to_date(format='%m %d, %Y').alias('review_time')
  )
  .with_columns(
    pl.col('review_time').dt.year().alias('review_year')
  )
  .group_by('review_year')
  .agg(pl.mean('star_rating').alias('avg_star_rating'))
)

(so.Plot(rating_data, x = 'review_year', y = 'avg_star_rating')
  .add(so.Line())
)

customer_data |> 
  drop_na(star_rating) |> 
  select(review_time, star_rating) |> 
  mutate(
    review_time = mdy(review_time),
    review_year = year(review_time)
  ) |> 
  group_by(review_year) |> 
  summarize(avg_star_rating = mean(star_rating)) |>
  ggplot(aes(x = review_year, y = avg_star_rating)) +
  geom_line()

Density plots

Instead of a specific geom_density(), we can get a similar look by combining so.Area() and so.Hist() (seaborn.objects also has a so.KDE() Stat if you want a true density estimate).

(so.Plot(customer_data, x = 'income', color = 'gender')
  .add(so.Area(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = income, fill = gender)) +
  geom_density(alpha = 0.5)

Facets

Finally, faceting, like .add(), is just another method on the Plot object, .facet().

region_count = (customer_data
  .group_by(pl.col(['region', 'college_degree', 'gender']))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n', color = 'college_degree')
  .facet('gender')
  .add(so.Bar(), so.Agg(), so.Dodge())
)

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "dodge") +
  facet_wrap(~ gender)

Modeling

If the adage is true that Python is “the second best language for everything,” it’s machine learning that is arguably where it should be first. If new statistical models first appear in R, then it can be said that the latest and greatest in machine learning and deep learning is incubated in Python and Python-adjacent libraries. At a high level, this should make sense. If R is tied closely to statistics, the big tent that is Python should naturally lend itself to the learning algorithms developed in computer science.

The most popular modeling library in Python is scikit-learn (referred to as sklearn in code). This machine learning library is built on NumPy, matplotlib, and SciPy—a library for scientific computing. The name scikit-learn comes from its origin as a “scikit” or “SciPy Toolkit,” a collection of extensions built for SciPy to provide specialized functions and methods.

When it comes to modeling, R is equally diverse in its approaches. However, I am partial to the consistency of the {tidymodels} ecosystem of packages, which clearly draw inspiration from scikit-learn. Let’s do a simple comparison of the two.

Prepare data

We’ll start by specifying some parameter values and simulating data. Here we see the obsession with namespacing and efficiency on full display. In R, I load namespace-free access to all of the {tidymodels} functions with library(tidymodels). In Python, sklearn is so large and has so many different modules that the convention is to import specific functions and classes one at a time, as in from sklearn.linear_model import LinearRegression.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error

# Set seed, variable, and parameter values.
np.random.seed(42)
nobs = 500
beta0 = -5
beta1 = 5
beta2 = 2
beta3 = 0

# Simulate data.
x1 = np.round(np.random.uniform(0, 20, nobs))
x2 = np.random.choice(['level01', 'level02', 'level03'], nobs, p = [0.7, 0.2, 0.1])

sim_data = pl.DataFrame({
    'x1': x1,
    'x2': x2,
    'y': beta0 + beta1 * x1 + beta2 * (x2 == 'level02') + beta3 * (x2 == 'level03') + np.random.normal(0, 3, nobs)
})

sim_data
shape: (500, 3)
x1 x2 y
f64 str f64
7.0 "level01" 15.07666
19.0 "level01" 51.125741
15.0 "level01" 83.961397
12.0 "level02" 68.491231
3.0 "level01" 75.0641
7.0 "level01" 62.539493
12.0 "level03" 58.44809
2.0 "level01" 42.765411
19.0 "level03" 33.206444
20.0 "level01" 83.561412
library(tidymodels)

# Set seed, variable, and parameter values.
set.seed(42)
nobs <- 500
beta0 <- -5
beta1 <- 5
beta2 <- 2
beta3 <- 0

# Simulate data.
sim_data <- tibble(
  x1 = round(runif(nobs, min = 0, max = 20)),
  x2 = rbinom(nobs, size = 2, prob = c(0.7, 0.3)) |> 
    as.factor() |> fct_recode("level01" = "0", "level02" = "1", "level03" = "2"),
  y = beta0 + beta1 * x1 + beta2 * ifelse(x2 == "level02", 1, 0) + beta3 * ifelse(x2 == "level03", 1, 0) + rnorm(nobs, mean = 0, sd = 3)
)

sim_data
# A tibble: 500 × 3
      x1 x2          y
   <dbl> <fct>   <dbl>
 1    18 level03 88.1 
 2    19 level01 92.7 
 3     6 level02 27.0 
 4    17 level02 82.4 
 5    13 level03 57.8 
 6    10 level02 46.4 
 7    15 level02 68.9 
 8     3 level01  7.10
 9    13 level02 58.3 
10    14 level01 67.5 
# ℹ 490 more rows

Here we can see that scikit-learn expects pandas DataFrames (or NumPy arrays) rather than Polars, hence the .to_pandas() conversions. We also see an interesting reversal where splitting the data into training and testing data produces separate datasets in Python while R creates a single object that contains both.

For feature engineering, recipe steps in {tidymodels} mirror specific transformers in scikit-learn.

# Training and testing split.
X = sim_data.drop('y').to_pandas()
y = sim_data['y'].to_pandas()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Feature engineering.
categorical_features = ['x2']
categorical_transformer = Pipeline(steps=[
  ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
  transformers=[
      ('cat', categorical_transformer, categorical_features)
  ],
  remainder='passthrough'  # keep x1 instead of dropping untransformed columns
)
# Training and testing split.
sim_split <- initial_split(sim_data, prop = 0.90)

# Feature engineering.
sim_recipe <- training(sim_split) |>
  recipe(y ~ .) |> 
  step_dummy(all_nominal_predictors())

Specify and fit a model

Here we can see some equivalence between scikit-learn’s Pipelines and {tidymodels}’ workflows where both feature engineering and model fitting are composed and executed at once.

# Model specification.
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the model.
model.fit(X_train, y_train)
# Model specification.
sim_lm <- linear_reg() |> 
  set_engine("lm")

# Compose a workflow.
sim_wf_lm <- workflow() |> 
  add_recipe(sim_recipe) |> 
  add_model(sim_lm)

# Fit the model.
sim_lm_fit <- fit(sim_wf_lm, data = training(sim_split))

Evaluate model fit

It’s in evaluating model fit that the differences are most apparent. While both scikit-learn and {tidymodels} can compute predictive fit metrics, scikit-learn doesn’t have a built-in way to summarize or visualize parameter estimates, especially interval estimates. At some level this shouldn’t be surprising given the different mindsets, but it is still a bit jarring that the most popular modeling library in Python does prediction but not (statistical) inference. There are other libraries, of course. The statsmodels library is {stats}-like, including formula notation. The Bambi library is Python’s version of {brms} for Bayesian modeling.
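
You can still dig point estimates out of the fitted Pipeline (a minimal sketch using the step names defined above), but there are no standard errors or interval estimates attached:

# Point estimates only: the coefficients and intercept from the fitted regressor.
fitted_lm = model.named_steps['regressor']
feature_names = model.named_steps['preprocessor'].get_feature_names_out()

dict(zip(feature_names, fitted_lm.coef_))
fitted_lm.intercept_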

# Compute RMSE.
y_pred = model.predict(X_test)
root_mean_squared_error(y_test, y_pred)
30.556378031804933
# Visualize slope parameter estimates.
tidy(sim_lm_fit, conf.int = TRUE) |> 
  ggplot(aes(x = term)) + 
  geom_point(aes(y = estimate)) + 
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .1) +
  geom_hline(yintercept = 0, color = "red")

# Compute RMSE.
sim_lm_fit |> 
  predict(new_data = testing(sim_split)) |>
  bind_cols(testing(sim_split)) |>
  rmse(truth = y, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        3.16

Communication

When it comes to communication, the elephant in the room is Jupyter notebooks. Even though Jupyter was designed to be polyglot (Julia, Python, and R), it really is the domain of Python. To an R user, Jupyter is weird. It appears that Jupyter is what happens when you don’t have a great IDE to work with: IDE functionality gets absorbed into the document type itself (e.g., an embedded kernel selector). But just because you can use Jupyter notebooks doesn’t mean you must.

It shouldn’t be surprising that I recommend Quarto for communication. It’s plain text (so it plays well with version control), supports Python and R natively with code cells that behave like actual scripts, and is designed as a means to whatever document type you need—PDFs, slides, websites, dashboards, etc. That said, if you really love or are required to work with Jupyter notebooks, Quarto can convert any .ipynb into a .qmd via the command line with quarto convert notebook.ipynb as well as render any .ipynb into whatever document type you need using quarto render notebook.ipynb --to format.

Final thoughts

There’s honestly a lot to love about both Python and R. Don’t be afraid to use the best of both interchangeably. I’ve found it’s easiest to switch between the different mindsets by maintaining a few deliberate syntax differences between them. For example, in R I use “double quotes” and in Python I use ‘single quotes.’

If I could change one thing about Python it would be for the community to embrace the fact that it was named after Monty Python, not a snake. Like R has Peanuts references for each release, here’s hoping we eventually see a stable Python release code-named “It’s Just a Flesh Wound.”