An introduction to Python for R users

This introduction to Python assumes you know R and uses it as an analogy to explain Python for data analysis.

Categories: python, r

Author: Marc Dotson

Published: 13 Jun 2025

In classes and workshops over many years, I’ve taught data analytics using R. But in my new position, I teach data analytics using Python. This introduction to Python is for R users—primarily me, though I hope it proves useful to others as well.

There are incredible resources in this space, and I’ve drawn liberally from a number of them. As an overall introduction to Python, Python for Data Analysis is a go-to resource and the spiritual equivalent of R for Data Science. I also really appreciate the work in Python and R for the Modern Data Scientist, especially for the authors’ clear espousing that this isn’t an either/or situation—you can (and arguably should) use Python and R as complements.

I am especially indebted to Emily Riederer’s blog series beginning with Python Rgonomics and subscribe to her philosophy of using tools in Python that are genuinely “Pythonic” while being consistent with the workflow and ergonomics of the best R has to offer. I am also grateful for extra help from posit::conf workshop instructors and colleagues in my new position at Utah State University.

Different mindsets

When you start working with Python, it’s essential that you approach it with the right mindset. R is a specialized language developed by statisticians for data analysis. Python is a big tent, a general programming language developed by computer scientists for many things, with only a small portion of it dedicated to data analysis.

To summarize some key differences:

Python | R
General language | Specialized language
Developed by computer scientists | Developed by statisticians
Object-oriented programming | Functional programming
Obsessed with efficiency | Lazy about efficiency
Obsessed with namespacing | Lazy about namespacing
Small, medium, and large data | Small and medium data
Machine learning and deep learning | Data wrangling and visualization
Spacing is part of the syntax | Spacing is for convenience
Indices start with 0 | Indices start with 1
No single authority | Dominated by Posit
Jupyter Notebooks | R Markdown and Quarto
Inconsistent (i.e., no tidyverse) | Consistency in the tidyverse
“Pythonistas” and “Pythonic” code | “R Users” and “Tidy” code

While these are broad strokes and not 100% accurate in every case, they help provide some high-level context for how the two languages deviate in their approaches to common problems.

Functions, methods, and attributes

With the right mindset, it’s easier to understand some of the things that Python is obsessed with that R simply isn’t. The most important difference is that Python is an object-oriented programming language while R is all about functional programming. While everything in R is a function (for a typical user), using Python requires frequently keeping track of the difference between functions, methods, and attributes.

Functions in Python and R are equivalent, although functions in Python are typically namespaced with the library name or alias like library.function(). Note that we can, but often don’t, similarly namespace R functions with package::function(). Methods are object-specific functions. In other words, methods are functions nested within object types and are namespaced with an object name of the given type as in object.method(). While it’s possible to import a specific function such that we can call it without referencing its library name or alias, we can never call a method without reference to an object name of the necessary type. In other words, we may see a function() like in R, but we will always see a .method() with an object. One more set of definitions: just like packages in Python are typically referred to as libraries, function (and method) arguments are referred to as parameters.

Attributes are object-specific features and are, like methods, namespaced with an object name of the given type as in object.attribute, but without any parentheses. For example, the dimensions of a NumPy array could be referenced with array.shape while the equivalent in base R would be a function like dim(array).
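
To make the distinction concrete, here’s a quick illustration using NumPy (which we’ll meet again later): a function namespaced with the library alias, a method namespaced with an object name, and a couple of attributes.

import numpy as np

array = np.arange(6)          # np.arange() is a function, namespaced with the library alias
matrix = array.reshape(2, 3)  # .reshape() is a method, namespaced with the object name

matrix.shape  # an attribute: the dimensions, like dim() in R
matrix.size   # an attribute: the total number of elements, like length() in R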

Get started

The first hurdle as we work to apply our new mindset is simply getting Python and our project environment installed. This is a big departure from what we’re used to in R, where there is one way to install R and we usually ignore our project environment, let alone make it reproducible. Remember, Python is a big tent with lots of uses and, unsurprisingly, lots of ways to do everything I’m covering. However, from the perspective of someone coming from R and with a focus on data analytics, I recommend the following.

While I started using pyenv and venv for managing Python versions and libraries, respectively, there’s a new(er) kid on the block that has been receiving lots of attention: uv, a single unified tool for managing Python project environments. (As a bellwether, {reticulate} and Positron now use uv.) Get started by installing uv via the command line.

The Command Line

Using the command line (i.e., terminal or shell) isn’t as common for R users who are comfortable with a dedicated console for running code. Be patient, take your time, and follow any instructions from a trusted source closely. A few things to help:

  • The command line is the programming interface into your OS itself. You don’t have to know everything about it to follow instructions.
  • Instructions can be different based on the type of command line. If you’re on a Mac that’s running macOS Catalina 10.15.7 or later, the default shell is Zsh. If you’re using Linux, the shell is Bash (and you probably already know that). And if you’re using Windows, you’re working with PowerShell.

Install Python

Unlike your experience with R, Python comes pre-installed on some operating systems. This version should not be used by anyone except the OS itself. For this and other reasons, you’ll need the ability to maintain multiple versions of Python on the same computer. Once you have uv installed, it’s easy to install and manage Python versions.

  • To install the latest stable release of Python, on the command line, run uv python install. To see which versions of Python you have installed, run uv python find; none of these will be the off-limits OS version.
  • You can also install specific versions of Python, such as uv python install 3.13.4 to install Python 3.13.4. To view Python versions that are available to install, run uv python list.

Positron IDE

As you well know, an integrated development environment (IDE), outside of an open source language, is arguably your most important tool as a data analyst. There are many options, but I recommend Positron, a next-generation data science IDE. Built by Posit on VS Code’s open source core, Positron combines the multilingual extensibility of VS Code with essential data tools common to language-specific IDEs.

If RStudio is too specific and VS Code is too general, you may find that Positron is just right and becomes your only IDE for both Python and R. And unless you’re comfortable navigating between directories on the command line, you’ll appreciate that Positron’s built-in terminal is already tied to the working directory you have open in the IDE.

Initialize a project environment

A project environment is composed of the language(s) and libraries (including the dependencies) used for a given project. What makes a project environment reproducible is keeping track of which version of the language(s) and the libraries we’re using for a given project so that it can be easily reproduced on another machine by you (including future you) or someone else.

  • After navigating to a project working directory, run uv init to initialize a project environment. This creates a pyproject.toml file with metadata about the project and a hidden .python-version file that specifies the default version of Python for the project. (It also creates main.py and README.md files that you can use or delete.)
  • With the project environment initialized, you can install libraries. For example, to install the Polars library, run uv add polars. This installs Polars, and any dependencies, and creates both a uv.lock file that keeps track of the versions of the libraries you’ve installed and a hidden .venv reproducible (or virtual, hence the “v” in venv) environment folder that serves as the project library.

Just like with R, Python libraries are by default installed in a single, global library on your computer known as the system library. The fact that we now have a project library highlights an important feature of making project environments reproducible: Each project will have its own project library and thus be isolated. If two projects use different versions of the same library, they won’t conflict with each other because they’ll each have their own project library. (Well, not exactly. uv employs a global cache to avoid having to install the same version of a given library more than once. The project library will reference the global cache.) Whenever you install new libraries, the uv.lock file is automatically updated. And if you’re starting with an existing project, run uv run for the libraries included in uv.lock to be automatically installed.

It might seem like a lot just to get started, but it’s something we should be doing in R as well. Along with a project’s code, all someone would need is the pyproject.toml, .python-version, and uv.lock files to reproduce your code, including the project environment. Well, assuming they’re also using uv to manage their project environments. If they’re using another tool to install libraries instead (yes, there are many ways to install libraries in Python), they will likely need a requirements.txt file or a pylock.toml file to reproduce the project environment, which you can create with uv export --format requirements.txt or uv export -o pylock.toml, respectively.

Data wrangling

Whatever the language, the most common task we have for any data analysis is data wrangling (i.e., cleaning, munging, etc.). The NumPy library is more or less equivalent to what we see in base R, introducing arrays and efficient computation across arrays—but, perhaps surprisingly, not data frames. Data frames (or DataFrames as they are referred to in Python) came later with pandas (short for “panel data”). Still the most popular library for data wrangling in Python, pandas was built to supplement NumPy, with all of the syntax baggage that entails. However, growing in popularity is Polars (an anagram of the query language it uses, OLAP, and the language it’s built in, Rust or rs). Polars is something of an answer to pandas’ problems. Its core is written free of any Python code (yes, you can even use it from R), and it offers a glimpse of what the polyglot future might look like.

My take is that Polars provides a more self-consistent data wrangling experience than pandas, echoing the experience of many who come to Polars for the speed and stay for the syntax. To illustrate the tidyverse spiritual connections, if not the deeper shared roots in SQL syntax, there are a number of great side-by-side comparisons of Polars and {dplyr}. I’ll illustrate a few common tasks in both Polars and {dplyr} and point out the differences to be mindful of.

Import data

Libraries and modules (a kind of sub-library) are imported with commonly accepted aliases in order to shorten the namespace reference. For example, the Polars alias convention is pl. We’re also importing the os library, which needs no alias, to write out a relative file path that will work for any user or operating system that has the same relative directory structure: a “Pythonic” version of {here}.

Note the difference between the pl.read_csv() function and the .shape and .columns attributes.

import polars as pl
import os

customer_data = pl.read_csv(os.path.join('data', 'customer_data.csv'))
customer_data.shape
(10531, 13)
customer_data.columns
['customer_id', 'birth_year', 'gender', 'income', 'credit', 'married', 'college_degree', 'region', 'state', 'star_rating', 'review_time', 'review_title', 'review_text']
library(tidyverse)

customer_data <- read_csv(here::here("posts", "python-intro", "data", "customer_data.csv"))
glimpse(customer_data)
Rows: 10,531
Columns: 13
$ customer_id    <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1…
$ birth_year     <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1…
$ gender         <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M…
$ income         <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000…
$ credit         <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6…
$ married        <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"…
$ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"…
$ region         <chr> "South", "West", "South", "Midwest", "West", "Midwest",…
$ state          <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "…
$ star_rating    <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N…
$ review_time    <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",…
$ review_title   <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al…
$ review_text    <chr> "everything's fine", NA, NA, NA, "I looked all over the…

Filter observations

Polars DataFrames have methods that are similar to {dplyr} (e.g., .filter() and filter()). DataFrames are composed of columns called Series (equivalent to R’s vectors). Note that unlike pandas DataFrames, Polars DataFrames don’t have a row index.

In pandas, we would need to reference column names with data['column_name'] (like base R’s data$column_name or data["column_name"]), but Polars allows for pl.col('column_name'). Yes, we use quotation marks for every column name. The pl.col() expression offers a helper-function-like consistency.

customer_data.filter(pl.col('gender') == 'Female', pl.col('income') > 70000)
shape: (3_970, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1010 1994 "Female" 77000 605.157869 "Yes" "No" "Northeast" "VT" 5 "10 18, 2015" "nice strong enough" "Yeah"
1012 1953 "Female" 126000 672.59663 "Yes" "No" "South" "DC" 5 "01 6, 2015" "Five Stars" "easy setup  works great"
1013 1974 "Female" 197000 680.211781 "Yes" "Yes" "West" "UT" null null null null
1022 1979 "Female" 155000 804.896188 "No" "Yes" "West" "UT" 5 "04 22, 2017" "Yak Attack is great!" "Very simple to use. Nice acces…
11515 1969 "Female" 203000 645.425436 "Yes" "Yes" "Northeast" "MA" null null null null
11519 1977 "Female" 149000 676.361119 "Yes" "No" "West" "UT" null null null null
11524 1994 "Female" 114000 644.463202 "Yes" "No" "South" "AL" null null null null
11526 1987 "Female" 109000 611.095352 "No" "No" "South" "MS" 4 "04 22, 2016" "Overall - A good deal" "Not the highest quality in the…
11529 1966 "Female" 112000 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null
filter(customer_data, gender == "Female", income > 70000)
# A tibble: 3,970 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female  73000   742. No      No             South    
 2        1010       1994 Female  77000   605. Yes     No             Northeast
 3        1012       1953 Female 126000   673. Yes     No             South    
 4        1013       1974 Female 197000   680. Yes     Yes            West     
 5        1022       1979 Female 155000   805. No      Yes            West     
 6        1023       1995 Female 137000   539. No      Yes            Northeast
 7        1024       1974 Female 285000   685. Yes     Yes            Midwest  
 8        1028       1980 Female  87000   715. No      No             West     
 9        1030       1969 Female 163000   636. Yes     Yes            West     
10        1036       1978 Female 227000   614. Yes     Yes            Northeast
# ℹ 3,960 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Slice observations

Note that Python is zero-indexed. This is probably the most problematic (and very computer science-based) difference and why it’s nice to avoid indexing if you can!

Remember that function (and method) arguments are called parameters. Some parameters are positional and have to be, as the name suggests, specified in an exact position. Others are keyword or named parameters, which is what’s commonplace in R. The positional parameters for Polars’ .slice() method are the start index and the slice length.
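
A quick illustration of zero-based indexing with a plain Python list (made up for illustration), plus the same .slice() call using keyword rather than positional parameters (Polars names these offset and length):

regions = ['South', 'West', 'Midwest', 'Northeast']  # a plain Python list
regions[0]    # 'South', the first element is at index 0
regions[1:3]  # ['West', 'Midwest'], the endpoint of a slice is excluded

customer_data.slice(offset=0, length=5)  # same as customer_data.slice(0, 5) below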

customer_data.slice(0, 5)
shape: (5, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1002 1970 "Female" 31000 749.351433 "Yes" "No" "West" "WA" null null null null
1003 1988 "Male" 35000 542.239859 "No" "No" "South" "AR" null null null null
1004 1984 "Other" 64000 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null
1005 1987 "Male" 58000 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet…
slice(customer_data, 1:5)
# A tibble: 5 × 13
  customer_id birth_year gender income credit married college_degree region 
        <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>  
1        1001       1971 Female  73000   742. No      No             South  
2        1002       1970 Female  31000   749. Yes     No             West   
3        1003       1988 Male    35000   542. No      No             South  
4        1004       1984 Other   64000   574. Yes     Yes            Midwest
5        1005       1987 Male    58000   644. No      Yes            West   
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Sort observations

It can be strange at first, but namespacing is critical. Remember that a function is preceded by the library name or alias (e.g., pl.col()), unless you’ve imported the specific function (e.g., from polars import col), while a method is preceded by an object name of a certain type (e.g., customer_data.sort()). Since object types are tied to libraries, the chain back to the corresponding library is always present, explicitly for functions and implicitly for methods.
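
For example, we could import col directly and drop the pl. prefix, though the .sort() method still hangs off the DataFrame:

from polars import col

customer_data.sort(col('birth_year'), descending=True)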

Note that it’s True and False, not TRUE and FALSE or true and false.

customer_data.sort(pl.col('birth_year'), descending = True)
shape: (10_531, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str i64 f64 str str str str i64 str str str
1026 1999 "Male" 66000 643.030421 "No" "No" "South" "AL" null null null null
1049 1999 "Other" 88000 630.037632 "No" "Yes" "West" "WA" null null null null
1092 1999 "Other" 77000 664.000947 "No" "Yes" "Midwest" "MO" null null null null
1107 1999 "Female" 97000 579.437133 "No" "Yes" "West" "CO" null null null null
1113 1999 "Male" 190000 660.594015 "Yes" "Yes" "Northeast" "PA" 5 "07 10, 2014" "Five Stars" "this is awesome"
8277 1947 "Male" 92000 839.039929 "Yes" "No" "West" "OR" null null null null
8525 1947 "Female" 170000 723.312232 "Yes" "Yes" "West" "CO" 3 "09 4, 2016" "Knuckles are not hard and they… "Knuckles  are not hard and the…
9437 1947 "Female" 95000 850.0 "No" "Yes" "West" "ID" null null null null
10069 1947 "Other" 84000 742.581417 "Yes" "Yes" "West" "WY" null null null null
9585 1939 "Female" 182000 850.0 "Yes" "No" "Northeast" "PA" null null null null
arrange(customer_data, desc(birth_year))
# A tibble: 10,531 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1026       1999 Male    66000   643. No      No             South    
 2        1049       1999 Other   88000   630. No      Yes            West     
 3        1092       1999 Other   77000   664. No      Yes            Midwest  
 4        1107       1999 Female  97000   579. No      Yes            West     
 5        1113       1999 Male   190000   661. Yes     Yes            Northeast
 6        1126       1999 Female 121000   639. No      Yes            Midwest  
 7        1132       1999 Male    53000   669. No      No             West     
 8        1139       1999 Female 293000   659. Yes     Yes            Northeast
 9        1143       1999 Female  74000   587. No      Yes            West     
10        1147       1999 Other  109000   429. Yes     No             West     
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Select variables

Using single square brackets [ ] creates a list. This is similar to creating a vector in R with c(). A list is a fundamental Python object type and can be turned into a Series.
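
A quick illustration with made-up values:

star_ratings = [4, 5, 3, 5]      # a Python list, akin to c(4, 5, 3, 5) in R
stars = pl.Series(star_ratings)  # a Polars Series, akin to an R vector
stars.mean()                     # 4.25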

customer_data.select(pl.col(['region', 'review_text']))
shape: (10_531, 2)
region review_text
str str
"South" "everything's fine"
"West" null
"South" null
"Midwest" null
"West" "I looked all over the Internet…
"Northeast" null
"Northeast" null
"Northeast" null
"Midwest" null
"West" "Husband really appreciates the…
select(customer_data, region, review_text)
# A tibble: 10,531 × 2
   region    review_text                                                        
   <chr>     <chr>                                                              
 1 South     everything's fine                                                  
 2 West      <NA>                                                               
 3 South     <NA>                                                               
 4 Midwest   <NA>                                                               
 5 West      I looked all over the Internet to find a non-plastic water bottle.…
 6 Midwest   I ordered these sweat pants for my 12-year old daughter to wear fo…
 7 Midwest   <NA>                                                               
 8 South     Super comfortable mini pack. Bought it for hiking..large enough to…
 9 West      <NA>                                                               
10 Northeast Yeah                                                               
# ℹ 10,521 more rows

Create new variables

Polars is actually a query language, like SQL. So it’s not surprising to see methods with names that more closely mirror queries, like the .with_columns() method.

customer_data.with_columns(income = pl.col('income') / 1000)
shape: (10_531, 13)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text
i64 i64 str f64 f64 str str str str i64 str str str
1001 1971 "Female" 73.0 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine"
1002 1970 "Female" 31.0 749.351433 "Yes" "No" "West" "WA" null null null null
1003 1988 "Male" 35.0 542.239859 "No" "No" "South" "AR" null null null null
1004 1984 "Other" 64.0 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null
1005 1987 "Male" 58.0 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet…
11527 1982 "Other" 169.0 603.366567 "Yes" "Yes" "Northeast" "NY" null null null null
11528 1986 "Female" 24.0 647.671383 "Yes" "Yes" "Northeast" "RI" null null null null
11529 1966 "Female" 112.0 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null
11530 1971 "Male" 234.0 790.440173 "Yes" "Yes" "Midwest" "OH" null null null null
11531 1990 "Male" 160.0 641.77857 "No" "Yes" "West" "MT" 5 "08 21, 2014" "These are definitely one of th… "Husband really appreciates the…
mutate(customer_data, income = income / 1000)
# A tibble: 10,531 × 13
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female     73   742. No      No             South    
 2        1002       1970 Female     31   749. Yes     No             West     
 3        1003       1988 Male       35   542. No      No             South    
 4        1004       1984 Other      64   574. Yes     Yes            Midwest  
 5        1005       1987 Male       58   644. No      Yes            West     
 6        1006       1994 Male      164   554. Yes     Yes            Midwest  
 7        1007       1968 Male       39   608. No      No             Midwest  
 8        1008       1994 Male       69   710. No      No             South    
 9        1009       1958 Male      233   702. No      No             West     
10        1010       1994 Female     77   605. Yes     No             Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>

Join data frames

Note that missing values are identified as null (Polars distinguishes null from floating-point NaN). Series types include str, bytes (binary), i64 (integer), f64 (float), and bool.
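
If you want to check the types and missing values for yourself, the .schema attribute and the .null_count() method are handy:

customer_data.schema        # maps each column name to its Series type
customer_data.null_count()  # counts the null values in each column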

store_transactions = pl.read_csv(os.path.join('data', 'store_transactions.csv'))

customer_data.join(store_transactions, on='customer_id', how='left')
shape: (10_531, 181)
customer_id birth_year gender income credit married college_degree region state star_rating review_time review_title review_text jan_2005 feb_2005 mar_2005 apr_2005 may_2005 jun_2005 jul_2005 aug_2005 sep_2005 oct_2005 nov_2005 dec_2005 jan_2006 feb_2006 mar_2006 apr_2006 may_2006 jun_2006 jul_2006 aug_2006 sep_2006 oct_2006 nov_2006 dec_2006 dec_2015 jan_2016 feb_2016 mar_2016 apr_2016 may_2016 jun_2016 jul_2016 aug_2016 sep_2016 oct_2016 nov_2016 dec_2016 jan_2017 feb_2017 mar_2017 apr_2017 may_2017 jun_2017 jul_2017 aug_2017 sep_2017 oct_2017 nov_2017 dec_2017 jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018 jul_2018 aug_2018 sep_2018 oct_2018 nov_2018 dec_2018
i64 i64 str i64 f64 str str str str i64 str str str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
1001 1971 "Female" 73000 742.082667 "No" "No" "South" "DC" 4 "06 11, 2015" "Four Stars" "everything's fine" 0 0 0 4 0 4 0 0 0 4 0 3 0 2 0 0 0 3 0 0 0 0 0 5 0 0 0 0 3 0 0 4 0 0 3 5 2 0 0 0 0 0 0 3 0 2 0 0 0 1 0 0 0 0 4 0 0 0 0 0 0
1002 1970 "Female" 31000 749.351433 "Yes" "No" "West" "WA" null null null null 0 0 4 0 0 0 0 0 0 0 2 0 3 0 0 0 0 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 5
1003 1988 "Male" 35000 542.239859 "No" "No" "South" "AR" null null null null 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 4 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 5 4 4 0 0 0 0 0 0 0 0 0 4 0
1004 1984 "Other" 64000 573.935783 "Yes" "Yes" "Midwest" "MN" null null null null 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4 0 1 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 5 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
1005 1987 "Male" 58000 644.243856 "No" "Yes" "West" "HI" 5 "03 25, 2008" "Great Product!!" "I looked all over the Internet… 0 0 0 0 0 4 0 5 0 3 4 0 0 0 0 0 0 0 0 0 5 0 1 0 0 0 0 0 0 0 0 0 0 0 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 3 2
11527 1982 "Other" 169000 603.366567 "Yes" "Yes" "Northeast" "NY" null null null null 4 0 4 0 0 0 0 3 0 5 0 5 3 0 0 0 0 0 0 0 0 2 4 0 5 0 0 0 0 0 2 0 0 3 3 0 0 0 0 5 0 2 3 0 0 0 0 3 4 0 3 0 0 0 0 0 0 0 2 0 0
11528 1986 "Female" 24000 647.671383 "Yes" "Yes" "Northeast" "RI" null null null null 0 0 0 0 0 0 0 0 0 0 3 0 2 0 5 0 2 0 4 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 5 0 4 5
11529 1966 "Female" 112000 820.539578 "Yes" "Yes" "Northeast" "NJ" null null null null 0 0 0 0 0 0 0 4 1 0 3 5 5 0 0 0 0 0 3 0 0 0 0 0 0 0 5 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
11530 1971 "Male" 234000 790.440173 "Yes" "Yes" "Midwest" "OH" null null null null 3 0 3 0 0 0 0 3 0 0 3 0 0 0 0 4 0 0 0 0 0 0 0 0 3 4 0 0 0 3 5 0 0 0 0 4 1 0 0 4 0 0 0 0 0 0 0 3 2 2 0 0 0 0 2 0 0 2 0 0 3
11531 1990 "Male" 160000 641.77857 "No" "Yes" "West" "MT" 5 "08 21, 2014" "These are definitely one of th… "Husband really appreciates the… 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0 0 5 0 0 0 0 4 3 4 0 2 0 3 0 3 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 3
store_transactions <- read_csv(here::here("posts", "python-intro", "data", "store_transactions.csv"))

left_join(customer_data, store_transactions, join_by(customer_id))
# A tibble: 10,531 × 181
   customer_id birth_year gender income credit married college_degree region   
         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
 1        1001       1971 Female  73000   742. No      No             South    
 2        1002       1970 Female  31000   749. Yes     No             West     
 3        1003       1988 Male    35000   542. No      No             South    
 4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
 5        1005       1987 Male    58000   644. No      Yes            West     
 6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
 7        1007       1968 Male    39000   608. No      No             Midwest  
 8        1008       1994 Male    69000   710. No      No             South    
 9        1009       1958 Male   233000   702. No      No             West     
10        1010       1994 Female  77000   605. Yes     No             Northeast
# ℹ 10,521 more rows
# ℹ 173 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
#   review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
#   mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
#   jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
#   nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
#   mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …

Consecutive lines of code

While possible with Python code generally, Polars embraces writing consecutive lines of code using method chaining, which is clearly akin to piping in R. Note that each line starts with . (rather than ending with |>) and the entire chain needs to be surrounded with ( ).

(customer_data
  .join(store_transactions, on='customer_id', how='left')
  .filter(pl.col('region') == 'West', pl.col('feb_2005') == pl.col('feb_2005').max())
  .with_columns(age = 2024 - pl.col('birth_year'))
  .select(pl.col(['age', 'feb_2005']))
  .sort(pl.col('age'), descending=True)
  .slice(0, 1)
)
shape: (1, 2)
age feb_2005
i64 i64
67 5
customer_data |> 
  left_join(store_transactions, join_by(customer_id)) |> 
  filter(region == "West", feb_2005 == max(feb_2005)) |> 
  mutate(age = 2024 - birth_year) |> 
  select(age, feb_2005) |> 
  arrange(desc(age)) |> 
  slice(1)
# A tibble: 1 × 2
    age feb_2005
  <dbl>    <dbl>
1    67        5

Summarize discrete data

The .agg() method stands for aggregate, which is exactly what summarize() does in R.

(customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)
shape: (8, 3)
region college_degree n
str str u32
"West" "Yes" 4106
"South" "No" 891
"South" "Yes" 220
"Northeast" "No" 640
"West" "No" 989
"Midwest" "Yes" 872
"Midwest" "No" 229
"Northeast" "Yes" 2584
customer_data |> 
  count(region, college_degree)
# A tibble: 8 × 3
  region    college_degree     n
  <chr>     <chr>          <int>
1 Midwest   No               229
2 Midwest   Yes              872
3 Northeast No               640
4 Northeast Yes             2584
5 South     No               891
6 South     Yes              220
7 West      No               989
8 West      Yes             4106

Summarize continuous data

This is a good example of where object-oriented programming requires a different mindset. You might think that there is a general mean() function like in R, but there isn’t and you’d have to load a specific library and reference its namespace to use such a function. Instead, .mean() here is a method for Polars Series and DataFrames.
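
If you do want a standalone function, it has to come from somewhere and carry that namespace with it—for example, a sketch using the standard library’s statistics module or NumPy:

import statistics
import numpy as np

statistics.mean([1, 2, 3])                   # a namespaced function from the standard library
np.mean(customer_data['income'].to_numpy())  # NumPy's mean applied to a column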

(customer_data
  .select(pl.col(['income', 'credit']))
  .mean()
)
shape: (1, 2)
income credit
f64 f64
138622.637926 667.160128
customer_data |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
# A tibble: 1 × 2
  avg_income avg_credit
       <dbl>      <dbl>
1    138623.       667.

Summarize discrete and continuous data

Combining .agg() with .group_by() is just as powerful as it is in {dplyr}.

(customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    n = pl.len(), 
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending=True)
)
shape: (12, 5)
gender region n avg_income avg_credit
str str u32 f64 f64
"Other" "Midwest" 124 154637.096774 663.269593
"Male" "Midwest" 420 152466.666667 666.03375
"Other" "Northeast" 337 150563.79822 665.445377
"Male" "Northeast" 1285 150498.054475 665.087794
"Male" "West" 2079 149452.621453 666.591336
"Female" "West" 2497 133818.582299 668.158516
"Female" "Northeast" 1602 133332.709114 669.02095
"Other" "South" 118 119067.79661 659.819318
"Male" "South" 430 117988.372093 669.075539
"Female" "South" 563 105888.099467 663.69759
customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups:   gender [3]
   gender region        n avg_income avg_credit
   <chr>  <chr>     <int>      <dbl>      <dbl>
 1 Other  Midwest     124    154637.       663.
 2 Male   Midwest     420    152467.       666.
 3 Other  Northeast   337    150564.       665.
 4 Male   Northeast  1285    150498.       665.
 5 Male   West       2079    149453.       667.
 6 Other  West        519    144420.       667.
 7 Female Midwest     557    134083.       671.
 8 Female West       2497    133819.       668.
 9 Female Northeast  1602    133333.       669.
10 Other  South       118    119068.       660.
11 Male   South       430    117988.       669.
12 Female South       563    105888.       664.

Lazy evaluation

Remember how Polars is incredibly fast? By tagging a data frame with .lazy(), we are asking Polars to not evaluate the code until triggered and to optimize the code for us in the underlying query engine. Before the code is triggered with something like .collect(), you can even see the underlying optimized query using .explain().

This is essentially what happens when you use {dplyr} to connect to and communicate with a database using SQL (minus the underlying query optimization).

df = (customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    n = pl.len(), 
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending=True)
).lazy()

df.explain()
'DF ["gender", "region", "n", "avg_income", ...]; PROJECT */5 COLUMNS'
df.collect()
shape: (12, 5)
gender region n avg_income avg_credit
str str u32 f64 f64
"Other" "Midwest" 124 154637.096774 663.269593
"Male" "Midwest" 420 152466.666667 666.03375
"Other" "Northeast" 337 150563.79822 665.445377
"Male" "Northeast" 1285 150498.054475 665.087794
"Male" "West" 2079 149452.621453 666.591336
"Female" "West" 2497 133818.582299 668.158516
"Female" "Northeast" 1602 133332.709114 669.02095
"Other" "South" 118 119067.79661 659.819318
"Male" "South" 430 117988.372093 669.075539
"Female" "South" 563 105888.099467 663.69759
data_db <- customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))

data_db |> show_query()
<SQL>
SELECT
    gender,
    region,
    COUNT(*) AS n,
    AVG(income) AS avg_income,
    AVG(credit) AS avg_credit
FROM
    customer_data
GROUP BY
    gender,
    region
ORDER BY
    avg_income DESC;
data_db |> collect()
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups:   gender [3]
   gender region        n avg_income avg_credit
   <chr>  <chr>     <int>      <dbl>      <dbl>
 1 Other  Midwest     124    154637.       663.
 2 Male   Midwest     420    152467.       666.
 3 Other  Northeast   337    150564.       665.
 4 Male   Northeast  1285    150498.       665.
 5 Male   West       2079    149453.       667.
 6 Other  West        519    144420.       667.
 7 Female Midwest     557    134083.       671.
 8 Female West       2497    133819.       668.
 9 Female Northeast  1602    133333.       669.
10 Other  South       118    119068.       660.
11 Male   South       430    117988.       669.
12 Female South       563    105888.       664.
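
One more note on laziness: the optimizer has the most room to work when the query is lazy from the start. Here’s a minimal sketch (assuming the same customer_data.csv as before) that begins with pl.scan_csv(), so nothing is read from disk until .collect() is called:

lazy_summary = (pl.scan_csv(os.path.join('data', 'customer_data.csv'))
  .group_by('gender', 'region')
  .agg(avg_income = pl.col('income').mean())
)

lazy_summary.collect()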

Visualization

There’s no way around it—visualizing data is where {ggplot2} simply reigns supreme, so much so that Posit has been investing in plotnine, a {ggplot2} port for Python. Maybe Posit will eventually facilitate a polyglot grammar of graphics. However, in the spirit of this post, let’s consider a genuinely “Pythonic” tool.

If NumPy is base R, matplotlib is plotting in base R. If that analogy holds, matplotlib is an acquired taste, so much so that seaborn was developed to supplement matplotlib, much like pandas was developed to supplement NumPy. While we don’t have a Polars-like replacement (come on Hadley, a polyglot {ggvis} is written in the stars!), we do have seaborn.objects, a still-in-development module deliberately built with the consistency of the grammar of graphics in mind that also attempts to eliminate the need to invoke matplotlib for fine-tuning.

As a reminder, the grammar of graphics is a philosophical approach to visualizing data created by Leland Wilkinson that inspired the creation of {ggplot2} and seaborn.objects. It’s about composing a visualization a layer at a time, specifically:

  1. Data to visualize
  2. Mapping graphical elements to data
  3. A specific graphic representing the data and mappings
  4. Additional fine-tuning via facets, labels, scales, etc.

Having this principled approach to guide the development and consistency of a plotting library is what distinguishes {ggplot2} and seaborn.objects. (I have often made the argument that SQL does something similar in providing a kind of grammar of data manipulation.) Let’s illustrate some common visualizations using seaborn.objects and {ggplot2} and gain some intuition for how they are related and divergent.

Column plots

Once again, we see the use of an alias to shorten the namespace reference. Here, the alias convention for the seaborn.objects module is so. We also see one of the limitations of method chaining (and thus object-oriented programming): Methods are specific to objects defined by libraries. Thus we can’t directly method chain data wrangled using Polars objects to be visualized by seaborn.objects like we can pipe data from {dplyr} to {ggplot2} in R.

However, the plot itself starts with a familiar-looking so.Plot() function (seaborn.objects’ version of ggplot()), which instantiates a Plot object and specifies (1) our data and (2) the mapping between that data and graphical elements. Then, with a consistency that isn’t present in {ggplot2} (I’m looking at you, |> vs. +), there is a set of methods applicable to the Plot object, starting with .add(), that can be method chained. Finally, the (3) specific graphic is created with another familiar-looking so.Bar(), which is a specific example of an object called a Mark (seaborn.objects’ version of geom_*()).

import seaborn.objects as so

region_count = (customer_data
  .group_by(pl.col('region'))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n')
  .add(so.Bar())
)

customer_data |> 
  count(region) |> 
  ggplot(aes(x = region, y = n)) +
  geom_col()

In certain instances we can have the necessary data wrangling done as part of the visualization. For example, in {ggplot2} we can call geom_bar() instead of geom_col() to produce the same plot, while in seaborn.objects we still use the same Mark so.Bar() but add another object called a Stat, in this instance so.Hist(), to produce the counts for us.

(so.Plot(customer_data, x = 'region')
  .add(so.Bar(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = region)) +
  geom_bar()

Another Stat object is so.Agg(). We can join this with an object type called a Move to further customize our visualization—for example, using the Move object so.Dodge() to create a dodged column plot.

region_count = (customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n', color = 'college_degree')
  .add(so.Bar(), so.Agg(), so.Dodge())
)

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "dodge")

Histograms

Accepting so.Hist() as a Stat rather than a Mark (whereas in {ggplot2} a histogram gets its own geometry, geom_histogram()) may seem awkward for the R user. However, what results in many specific geometries in {ggplot2} is reduced by the composability of Mark, Stat, and Move objects in seaborn.objects. For example, a histogram also uses the so.Hist() Stat to bin data but uses the so.Bars() Mark instead of so.Bar() to produce an actual histogram.

(so.Plot(customer_data, x = 'income', color = 'region')
  .add(so.Bars(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = income)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatterplots

Again we see the one-to-one correspondence of Mark objects and geometries, so.Dot() and geom_point().

(so.Plot(customer_data, x = 'income', y = 'credit')
  .add(so.Dot())
)

customer_data |> 
  ggplot(aes(x = income, y = credit)) +
  geom_point()

We see again that a specialized geometry in {ggplot2} corresponds to some combination of Mark, Stat, and Move objects in seaborn.objects (so.Jitter() is another Move object, like so.Dodge()), each with parameters that can be further modified within the function call.

(so.Plot(customer_data, x = 'star_rating', y = 'income')
  .add(so.Dot(pointsize = 10, alpha = 0.5), so.Jitter(0.75))
)

customer_data |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_jitter(size = 3, alpha = 0.5)
Warning: Removed 7372 rows containing missing values or values outside the scale range
(`geom_point()`).

Line plots

It’s comforting to know that working with dates is just as problematic in Python as it is in R.

(so.Plot(customer_data, x = 'review_time', y = 'star_rating')
  .add(so.Line())
)

customer_data |> 
  ggplot(aes(x = review_time, y = star_rating)) +
  geom_line()
Warning: Removed 7372 rows containing missing values or values outside the scale range
(`geom_line()`).

Since we can’t method chain between different object classes (including those from different libraries), we still rely on the familiar back-and-forth between wrangling data and visualizing it.

rating_data = (customer_data
  .drop_nulls(pl.col('star_rating'))
  .select(pl.col(['review_time', 'star_rating']))
  .with_columns(
    pl.col('review_time').str.to_date(format='%m %d, %Y').alias('review_time')
  )
  .with_columns(
    pl.col('review_time').dt.year().alias('review_year')
  )
  .group_by('review_year')
  .agg(pl.mean('star_rating').alias('avg_star_rating'))
)

(so.Plot(rating_data, x = 'review_year', y = 'avg_star_rating')
  .add(so.Line())
)

customer_data |> 
  drop_na(star_rating) |> 
  select(review_time, star_rating) |> 
  mutate(
    review_time = mdy(review_time),
    review_year = year(review_time)
  ) |> 
  group_by(review_year) |> 
  summarize(avg_star_rating = mean(star_rating)) |>
  ggplot(aes(x = review_year, y = avg_star_rating)) +
  geom_line()

Density plots

Instead of a specific geom_density(), we can get a similar look by combining so.Area() and so.Hist() (seaborn.objects also has a so.KDE() Stat if you want a true density estimate).

(so.Plot(customer_data, x = 'income', color = 'gender')
  .add(so.Area(), so.Hist())
)

customer_data |> 
  ggplot(aes(x = income, fill = gender)) +
  geom_density(alpha = 0.5)

Facets

Finally, faceting, like .add(), is just another method on the Plot object, .facet().

region_count = (customer_data
  .group_by(pl.col(['region', 'college_degree', 'gender']))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n', color = 'college_degree')
  .facet('gender')
  .add(so.Bar(), so.Agg(), so.Dodge())
)

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "dodge") +
  facet_wrap(~ gender)

Modeling

If the adage is true that Python is “the second best language for everything,” it’s machine learning that is arguably where it should be first. If new statistical models first appear in R, then it can be said that the latest and greatest in machine learning and deep learning is incubated in Python and Python-adjacent libraries. At a high level, this should make sense. If R is tied closely to statistics, the big tent that is Python should naturally lend itself to the learning algorithms developed in computer science.

The most popular modeling library in Python is scikit-learn (referred to as sklearn in code). This machine learning library is built on NumPy, matplotlib, and SciPy—a library for scientific computing. The name scikit-learn comes from its origin as a “scikit” or “SciPy Toolkit,” a collection of extensions built for SciPy to provide specialized functions and methods.

When it comes to modeling, R is equally diverse in its approaches. However, I am partial to the consistency of the {tidymodels} ecosystem of packages, which clearly draw inspiration from scikit-learn. Let’s do a simple comparison of the two.

Prepare data

We’ll start by specifying some parameter values and simulating data. Here we see the obsession with namespacing and efficiency on full display. In R, I load namespace-free access to all of the {tidymodels} functions with library(tidymodels). In Python, sklearn is so large and has so many different modules that the convention is to import specific functions and classes one at a time, as in from sklearn.linear_model import LinearRegression.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error

# Set seed, variable, and parameter values.
np.random.seed(42)
nobs = 500
beta0 = -5
beta1 = 5
beta2 = 2
beta3 = 0

# Simulate data.
x1 = np.round(np.random.uniform(0, 20, nobs))
x2 = np.random.choice(['level01', 'level02', 'level03'], nobs, p = [0.7, 0.2, 0.1])

sim_data = pl.DataFrame({
    'x1': x1,
    'x2': x2,
    'y': beta0 + beta1 * x1 + beta2 * (x2 == 'level02') + beta3 * (x2 == 'level03') + np.random.normal(0, 3, nobs)
})

sim_data
shape: (500, 3)
x1 x2 y
f64 str f64
7.0 "level01" 15.07666
19.0 "level01" 51.125741
15.0 "level01" 83.961397
12.0 "level02" 68.491231
3.0 "level01" 75.0641
7.0 "level01" 62.539493
12.0 "level03" 58.44809
2.0 "level01" 42.765411
19.0 "level03" 33.206444
20.0 "level01" 83.561412
library(tidymodels)

# Set seed, variable, and parameter values.
set.seed(42)
nobs <- 500
beta0 <- -5
beta1 <- 5
beta2 <- 2
beta3 <- 0

# Simulate data.
sim_data <- tibble(
  x1 = round(runif(nobs, min = 0, max = 20)),
  x2 = rbinom(nobs, size = 2, prob = c(0.7, 0.3)) |> 
    as.factor() |> fct_recode("level01" = "0", "level02" = "1", "level03" = "2"),
  y = beta0 + beta1 * x1 + beta2 * ifelse(x2 == "level02", 1, 0) + beta3 * ifelse(x2 == "level03", 1, 0) + rnorm(nobs, mean = 0, sd = 3)
)

sim_data
# A tibble: 500 × 3
      x1 x2          y
   <dbl> <fct>   <dbl>
 1    18 level03 88.1 
 2    19 level01 92.7 
 3     6 level02 27.0 
 4    17 level02 82.4 
 5    13 level03 57.8 
 6    10 level02 46.4 
 7    15 level02 68.9 
 8     3 level01  7.10
 9    13 level02 58.3 
10    14 level01 67.5 
# ℹ 490 more rows

Here we can see that scikit-learn expects pandas DataFrames (or NumPy arrays) rather than Polars, hence the .to_pandas() conversions. We also see an interesting reversal where splitting the data into training and testing data produces separate datasets in Python while R creates a single object that contains both.

For feature engineering, recipe steps in {tidymodels} mirror specific transformers in scikit-learn.

# Training and testing split.
X = sim_data.drop('y').to_pandas()
y = sim_data['y'].to_pandas()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Feature engineering.
categorical_features = ['x2']
categorical_transformer = Pipeline(steps=[
  ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
  transformers=[
      ('cat', categorical_transformer, categorical_features)
  ],
  remainder='passthrough'  # keep x1 instead of dropping untransformed columns
)
# Training and testing split.
sim_split <- initial_split(sim_data, prop = 0.90)

# Feature engineering.
sim_recipe <- training(sim_split) |>
  recipe(y ~ .) |> 
  step_dummy(all_nominal_predictors())

Specify and fit a model

Here we can see some equivalence between scikit-learn’s Pipelines and {tidymodels}’ workflows where both feature engineering and model fitting are composed and executed at once.

# Model specification.
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the model.
model.fit(X_train, y_train)
# Model specification.
sim_lm <- linear_reg() |> 
  set_engine("lm")

# Compose a workflow.
sim_wf_lm <- workflow() |> 
  add_recipe(sim_recipe) |> 
  add_model(sim_lm)

# Fit the model.
sim_lm_fit <- fit(sim_wf_lm, data = training(sim_split))

Evaluate model fit

It’s in evaluating model fit that the differences are most apparent. While both scikit-learn and {tidymodels} can compute predictive fit metrics, scikit-learn doesn’t have a built-in way to summarize or visualize parameter estimates, especially interval estimates. At some level this shouldn’t be surprising given the different mindsets, but it is still a bit jarring that the most popular modeling library in Python does prediction but not (statistical) inference. There are other libraries, of course. The statsmodels library is {stats}-like, including formula notation. The Bambi library is Python’s version of {brms} for Bayesian modeling.
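
You can still dig point estimates out of the fitted Pipeline (a minimal sketch using the step names defined above), but there are no standard errors or interval estimates attached:

# Point estimates only: the coefficients and intercept from the fitted regressor.
fitted_lm = model.named_steps['regressor']
feature_names = model.named_steps['preprocessor'].get_feature_names_out()

dict(zip(feature_names, fitted_lm.coef_))
fitted_lm.intercept_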

# Compute RMSE.
y_pred = model.predict(X_test)
root_mean_squared_error(y_test, y_pred)
30.556378031804933
# Visualize slope parameter estimates.
tidy(sim_lm_fit, conf.int = TRUE) |> 
  ggplot(aes(x = term)) + 
  geom_point(aes(y = estimate)) + 
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .1) +
  geom_hline(yintercept = 0, color = "red")

# Compute RMSE.
sim_lm_fit |> 
  predict(new_data = testing(sim_split)) |>
  bind_cols(testing(sim_split)) |>
  rmse(truth = y, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        3.16

Communication

When it comes to communication, the elephant in the room is Jupyter notebooks. Even though Jupyter was designed to be polyglot (Julia, Python, and R), it really is the domain of Python. To an R user, Jupyter is weird. It appears that Jupyter is what happens when you don’t have a great IDE to work with: IDE functionality gets absorbed into the document type itself (e.g., an embedded kernel selector). But just because you can use Jupyter notebooks doesn’t mean you must.

It shouldn’t be surprising that I recommend Quarto for communication. It’s plain text (so it plays well with version control), supports Python and R natively with code cells that behave like actual scripts, and is designed as a means to whatever document type you need—PDFs, slides, websites, dashboards, etc. That said, if you really love or are required to work with Jupyter notebooks, Quarto can convert any .ipynb into a .qmd via the command line with quarto convert notebook.ipynb as well as render any .ipynb into whatever document type you need using quarto render notebook.ipynb --to format.

Final thoughts

There’s honestly a lot to love about both Python and R. Don’t be afraid to use the best of both interchangeably. I’ve found it’s easiest to switch between the different mindsets by maintaining a few deliberate syntax differences between them. For example, in R I use “double quotes” and in Python I use ‘single quotes.’

If I could change one thing about Python it would be for the community to embrace the fact that it was named after Monty Python, not a snake. Like R has Peanuts references for each release, here’s hoping we eventually see a stable Python release code-named “It’s Just a Flesh Wound.”