Deploying cross-language in high impact projects

Robin Lovelace

University of Leeds, Active Travel England

September 22, 2024

Introduction

Contents

  • Introduction
  • Approaches to cross-language interoperability
  • Containerisation
  • Cross-language pain points
  • Cross-language priorities

Case study for reference

The Network Planning Tool for Scotland

The NPT web app

NPT stack

Backend: R + targets for ‘cross-language’ data pipeline workflow automation

  • Relies on Rust crates
  • Experiments with Python

Frontend: JS + MapLibre for visualisation

Progressive Web App (PWA)

Map controls

User interface

Layer Controls

Definitions

“Deploying”

  • Code runs on more than one computer
  • Results are published on a website that is maintained
  • Project is “in production”
    • Users: Hosted on a trusted and well-used website
    • Performance: updates to ‘production’ planned and documented
    • Money changing hands
    • Expectations

Deploying the NPT

Deployment workflow of NPT

Based on the workflow.yml file for GitHub Actions.

“High impact projects”

  • Broadly: tangible change results from the work
    • With measurable impact on environmental, social (or economic?) outcomes
      • E.g. Reduction in greenhouse gas emissions
      • E.g. better diets resulting in fewer DALYs (disability-adjusted life years lost)
    • Identifiable ‘pathway to impact’
      • E.g. New methods -> new evidence -> investment in active travel more effective than it would have been otherwise -> more people cycling -> tangible benefits

Technical requirements for impact

  • Users: The web application has users
  • Scale: covers a large geographic area, which needs big data inputs
  • Trusted: code review and multiple contributors
  • User-friendly: if it’s going to have lots of users
  • Future proof: technology needs to last a long time
  • Community: needed to ensure it lasts

Why cross-language projects?

  • Specific tool written in a particular language (odjitter)
  • Required for ‘best of both worlds’ (JS for visualisation, R for statistical modelling)
  • Having multiple implementations ensures robustness
    • Aeroplane flight software has 3 implementations
    • Redundancy common in mission-critical applications

Comparing approaches

Adding new languages

Source: github.com/Robinlovelace/opengeohub2023

Practical session?

  • Who has written and visualised side-by-side code like this?
  • What did you use?
  • Would you like to learn?

Approaches to cross-language projects

Checked boxes indicate approaches used in the NPT project.

Loose coupling

Example: odjitter R package

  msg = glue::glue("{odjitter_location} jitter --od-csv-path {od_csv_path} \\
  --zones-path {zones_path} \\
  --zone-name-key {zone_name_key} \\
  --origin-key {origin_key} \\
  --destination-key {destination_key} \\
  --subpoints-origins-path {subpoints_origins_path} \\
  --subpoints-destinations-path {subpoints_destinations_path} \\
  --disaggregation-key {disaggregation_key} \\
  --disaggregation-threshold {disaggregation_threshold} \\
  --rng-seed {rng_seed} \\
  {deduplicate_pairs}  \\
  --output-path {output_path}")
  if(show_command) {
    message("command sent to the system:")
    cat(msg)
  }
  system(msg)
  res = sf::read_sf(output_path)
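
The same loose-coupling pattern can be sketched in Python: build the command as a list of arguments, optionally print it, and hand it to the operating system. This is a minimal sketch, not the odjitter package's code. The helper names `build_jitter_command` and `run_command` are hypothetical, the flags are a subset of those in the R snippet above, and whether that subset is enough for a real odjitter run is an assumption.

```python
import shlex
import subprocess


def build_jitter_command(odjitter_location, od_csv_path, zones_path,
                         output_path, rng_seed=None):
    """Return the argument list for an `odjitter jitter` call (hypothetical helper)."""
    cmd = [
        str(odjitter_location), "jitter",
        "--od-csv-path", str(od_csv_path),
        "--zones-path", str(zones_path),
        "--output-path", str(output_path),
    ]
    # Only add the optional flag when a seed was supplied.
    if rng_seed is not None:
        cmd += ["--rng-seed", str(rng_seed)]
    return cmd


def run_command(cmd, show_command=False):
    """Run an external program; raises CalledProcessError on a non-zero exit."""
    if show_command:
        print("command sent to the system:")
        print(shlex.join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Build (but do not run) a command, so the sketch works without odjitter installed.
cmd = build_jitter_command("odjitter", "od.csv", "zones.geojson",
                           "output.geojson", rng_seed=42)
print(shlex.join(cmd))
```

Passing a list of arguments rather than one interpolated string avoids shell-quoting problems when paths contain spaces.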

Loose coupling

Example: qgisprocess R package

qgis_run <- function(args = character(), ..., env = qgis_env(), path = qgis_path()) {
  if (is.null(path)) {
    message(
      "The filepath of 'qgis_process' is not present in the package cache, ",
      "so the package is not well configured.\n",
      "Restart R and reload the package; run `qgis_configure()` if needed.\n",
      "For now, will try to fix it on the fly, but some functionality may not work.\n"
    )
    path <- qgis_path(query = TRUE, quiet = FALSE)
    # typically the version will also be missing, so fixing that as well:
    if (is.null(qgisprocess_cache$version)) {
      invisible(qgis_version(query = TRUE, quiet = FALSE))
    }
  }
  # workaround for running Windows batch files where arguments have spaces
  # see https://github.com/r-lib/processx/issues/301
  if (is_windows()) {
    withr::with_envvar(
      env,
      processx::run("cmd.exe", c("/c", "call", path, args), ...),
    )
  } else {
    withr::with_envvar(
      env,
      processx::run(path, args, ...)
    )
  }
}
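
The core of `qgis_run()` is "run an external program with some extra environment variables, without changing the caller's environment". A rough Python equivalent, using only the standard library, might look like the sketch below. The name `run_with_env` is hypothetical, the Windows batch-file workaround from the R code is left out, and the Python interpreter stands in for `qgis_process` so that the example runs anywhere.

```python
import os
import subprocess
import sys


def run_with_env(path, args=(), env=None):
    """Run `path` with `args`, adding `env` to a copy of the current environment."""
    # Copy the parent's environment so the extra variables only affect the child.
    full_env = {**os.environ, **(env or {})}
    cmd = [str(path), *args]
    return subprocess.run(cmd, env=full_env, check=True,
                          capture_output=True, text=True)


# Demo: the child process sees DEMO_VAR, the parent process is left untouched.
result = run_with_env(
    sys.executable,
    ["-c", "import os; print(os.environ['DEMO_VAR'])"],
    env={"DEMO_VAR": "hello"},
)
print(result.stdout.strip())
```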

Tight coupling 1

Source: geoarrow-c



# cython: language_level = 3
# cython: linetrace=True


"""Low-level geoarrow Python bindings."""


from libc.stdint cimport uint8_t, int32_t, int64_t, uintptr_t
from cpython cimport Py_buffer, PyObject
from libcpp cimport bool
from libcpp.string cimport string


cdef extern from "geoarrow_type.h":
    struct ArrowSchema:
        const char* format
        const char* name
        const char* metadata
        int64_t flags
        int64_t n_children
        ArrowSchema** children
        ArrowSchema* dictionary
        void (*release)(ArrowSchema*)
        void* private_data
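
For comparison, the same struct layout can be declared from Python's standard library with ctypes instead of Cython. This is a sketch of the idea only, not how the geoarrow-c bindings are built; the field names and types follow the `ArrowSchema` declaration above.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """ctypes mirror of the ArrowSchema C struct (fields are set below)."""


# Fields are assigned after the class exists because the struct refers to itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    # c_char_p reads up to the first NUL byte. Arrow's metadata is a binary
    # buffer, so real code would likely treat this field as a raw pointer.
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

# Unset fields are zero-initialised (NULL pointers, 0 for integers).
schema = ArrowSchema(format=b"i", name=b"value")
print(schema.format, schema.name, schema.n_children)
```

The trade-off is typical of tight coupling: ctypes needs no compiler, while Cython checks the declarations against the C header at build time.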

Tight coupling 2

Source: geoarrow-rs

use crate::error::{PyGeoArrowError, PyGeoArrowResult};
use crate::{PyCoordType, PyDimension};


use geoarrow::array::CoordType;
use geoarrow::datatypes::{Dimension, GeoDataType};
use pyo3::exceptions::PyValueError;
use pyo3::intern;
use pyo3::prelude::*;
use pyo3::types::{PyCapsule, PyType};
use pyo3_arrow::ffi::to_schema_pycapsule;
use pyo3_arrow::PyField;


#[pyclass(module = "geoarrow.rust.core._rust", name = "GeometryType", subclass)]
pub struct PyGeometryType(pub(crate) GeoDataType);


impl PyGeometryType {
    pub fn new(data_type: GeoDataType) -> Self {
        Self(data_type)
    }


    /// Import from a raw Arrow C Schema capsules
    pub fn from_arrow_pycapsule(capsule: &Bound<PyCapsule>) -> PyGeoArrowResult<Self> {
        PyField::from_arrow_pycapsule(capsule)?.try_into()
    }


    pub fn into_inner(self) -> GeoDataType {
        self.0
    }
}

Tight coupling 3

Source: RcppExports.R in sf

CPL_geos_union <- function(sfc, by_feature = FALSE, is_coverage = FALSE) {
    .Call(`_sf_CPL_geos_union`, sfc, by_feature, is_coverage)
}


CPL_geos_snap <- function(sfc0, sfc1, tolerance) {
    .Call(`_sf_CPL_geos_snap`, sfc0, sfc1, tolerance)
}

Tight coupling 4

Source: geos-unary-geometry.R in geos

#' @rdname geos_centroid
#' @export
geos_unary_union <- function(geom) {
  geom <- sanitize_geos_geometry(geom)
  new_geos_geometry(.Call(geos_c_unary_union, geom), crs = attr(geom, "crs", exact = TRUE))
}


#' @rdname geos_centroid
#' @export
geos_unary_union_prec <- function(geom, grid_size) {
  geom <- sanitize_geos_geometry(geom)
  recycled <- recycle_common(list(geom, sanitize_double(grid_size)))
  new_geos_geometry(
    .Call(geos_c_unary_union_prec, recycled[[1]], recycled[[2]]),
    crs = attr(geom, "crs", exact = TRUE)
  )
}

Tight coupling 5

Source: geos_functions.jl in LibGEOS.jl

function union(obj1::Geometry, obj2::Geometry, context::GEOSContext = get_context(obj1))
    result = GEOSUnion_r(context, obj1, obj2)
    if result == C_NULL
        error("LibGEOS: Error in GEOSUnion")
    end
    geomFromGEOS(result, context)
end


function unaryUnion(obj::Geometry, context::GEOSContext = get_context(obj))
    result = GEOSUnaryUnion_r(context, obj)
    if result == C_NULL
        error("LibGEOS: Error in GEOSUnaryUnion")
    end
    geomFromGEOS(result, context)
end
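
The pattern in these Julia wrappers (call the C function, check for a null pointer, then raise an error or wrap the result) looks much the same in other languages. The Python sketch below uses only the standard library and calls libc's `getenv` rather than GEOS, so that it runs without extra dependencies. The name `c_getenv` is hypothetical, and loading libc via `ctypes.CDLL(None)` assumes a POSIX system such as Linux or macOS.

```python
import ctypes
import os

# Load the symbols already linked into the current process (POSIX only).
libc = ctypes.CDLL(None)

# Declare the C signature: char *getenv(const char *name);
libc.getenv.argtypes = [ctypes.c_char_p]
libc.getenv.restype = ctypes.c_char_p


def c_getenv(name):
    """Call C getenv and turn a NULL result into a Python exception."""
    result = libc.getenv(name.encode())
    if result is None:  # ctypes maps a NULL char* to None
        raise KeyError(f"libc: getenv returned NULL for {name!r}")
    return result.decode()


# Setting os.environ also updates the C-level environment of this process.
os.environ["CROSS_LANG_DEMO"] = "works"
print(c_getenv("CROSS_LANG_DEMO"))
```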

Project environments

pixi init test-project --format pyproject
✔ Initialized project in ~/test-project
cd test-project
pixi add geopandas
pixi add --pypi --feature test pytest
✔ Added pytest
Added these as pypi-dependencies.
Added these only for feature: test
pixi install

Cross-language support (source: pixi.sh)

pixi add r-ggplot2

Although I found some issues in my own tests (see prefix-dev/pixi#2066)

Pixi disk space usage

114M         ┌── bin                             6%
101M         │         ┌── locale-archive.tmpl   6%
101M         │       ┌─┴ locale                  6%
116M         │     ┌─┴ lib64                     7%
142M         │   ┌─┴ usr                         8%
148M         │ ┌─┴ sysroot                       8%
174M         ├─┴ x86_64-conda-linux-gnu         10%
245M         │     ┌── 14.1.0                   14%
245M         │   ┌─┴ x86_64-conda-linux-gnu     14%
246M         │ ┌─┴ gcc                          14%
246M         ├─┴ libexec                        14%
108M         │     ┌── 14.1.0                    6%
108M         │   ┌─┴ x86_64-conda-linux-gnu      6%
108M         │ ┌─┴ gcc                           6%
341M         │ │ ┌── site-packages              19%
391M         │ ├─┴ python3.12                   22%
1.0G         ├─┴ lib                            58%
1.7G       ┌─┴ default                         100%
1.7G     ┌─┴ envs                              100%
1.7G   ┌─┴ .pixi

Containerisation

Source: github.com/geocompx/docker

Containerisation 2

Source: pangeo-data/pangeo-docker-images

Containerisation 3

Source: https://github.com/b-data

bdata: All images

| R | Python | Jupyter Hub | Jupyter Lab | code‑server (Code) | RStudio | Neovim | Git | Git LFS | Pandoc | CRAN date | Linux distro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.4.1 | 3.12.6 | 5.1.0 | 4.2.5 | 4.92.2 (1.92.2) | 2024.04.2+764 | 0.10.1 | 2.46.1 | 3.5.1 | 3.2 | | Debian 12 |
| 4.4.0 | 3.12.4 | 5.0.0 | 4.2.2 | 4.90.0 (1.90.0) | 2024.04.2+764 | n/a | 2.45.2 | 3.5.1 | 3.1.11 | 2024‑06‑14 | Debian 12 |
| 4.3.3 | 3.11.9 | 4.1.5 | 4.1.6 | 4.23.0 (1.88.0) | n/a | n/a | 2.44.0 | 3.5.1 | 3.1.11 | 2024‑04‑24 | Debian 12 |
| 4.3.2 | 3.11.8 | 4.0.2 | 4.1.2 | 4.21.2 (1.86.2) | n/a | n/a | 2.44.0 | 3.4.1 | 3.1.11 | 2024‑02‑29 | Debian 12 |
| 4.3.1 | 3.11.6 | 4.0.2 | 3.6.6 | 4.18.0 (1.83.1) | n/a | n/a | 2.42.0 | 3.4.0 | 3.1.1 | 2023‑10‑31 | Debian 12 |
| 4.3.0 | 3.11.4 | 4.0.1 | 3.6.4 | 4.13.0 (1.78.2) | n/a | n/a | 2.41.0 | 3.3.0 | 3.1.1 | n/a | Debian 12 |
| 4.2.3 | 3.10.11 | 4.0.0 | 3.6.3 | 4.9.1 (1.73.1) | n/a | n/a | 2.40.0 | 3.3.0 | 2.19.2 | n/a | Debian 11 |
| 4.2.2 | 3.10.10 | 3.1.1 | 3.6.1 | 4.9.1 (1.73.1) | n/a | n/a | 2.40.0 | 3.3.0 | 2.19.2 | n/a | Debian 11 |
| 4.2.1 | 3.9.2 | 2.3.1 | 3.5.0 | 4.8.1 (1.72.1) | n/a | n/a | 2.38.1 | 3.2.0 | 2.19.2 | n/a | Debian 11 |
| 4.2.0 | 3.9.2 | 2.3.1 | 3.4.3 | 4.4.0 (1.66.2) | n/a | n/a | 2.36.1 | 3.2.0 | 2.18 | n/a | Debian 11 |

bdata: verse+ images

| R | CTAN date | Quarto |
|---|---|---|
| 4.4.1 | | 1.5.57 |
| 4.4.0 | 2024‑06‑14 | 1.4.555 |
| 4.3.3 | 2024‑04‑24 | 1.4.553 |
| 4.3.2 | 2024‑02‑29 | 1.4.550 |
| 4.3.1 | 2023‑10‑31 | 1.3.450 |
| 4.3.0 | 2023‑06‑16 | 1.3.361 |
| 4.2.3 | 2023‑04‑21 | 1.2.475[^1] |
| 4.2.2 | 2023‑03‑15 | 1.2.335[^1] |
| 4.2.1 | 2022‑10‑31 | 1.1.251[^1] |
| 4.2.0 | 2022‑06‑23 | n/a |

bdata: qgisprocess images

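| R | QGIS | SAGA | OTB[^1] |
|---|---|---|---|
| 4.4.1 | 3.38.3 | 9.1.3 | 9.0.0 |
| 4.4.0 | 3.36.3 | 9.1.3 | 9.0.0 |
| 4.3.3 | 3.36.2 | 9.1.3 | 9.0.0 |
| 4.3.2 | 3.36.0 | 9.1.3 | 8.1.2 |
| 4.3.1 | 3.34.0 | 9.1.3 | 8.1.2 |
| 4.3.0 | 3.30.3 | 8.5.0 | 8.1.1 |
| 4.2.3 | n/a | n/a | n/a |
| 4.2.2 | n/a | n/a | n/a |
| 4.2.1 | n/a | n/a | n/a |
| 4.2.0 | n/a | n/a | n/a |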

bdata: demo

Source: https://github.com/b-data/jupyterlab-r-docker-stack/tree/main?tab=readme-ov-file

docker run -it --rm \
  -p 8888:8888 \
  -u root \
  -v "${PWD}/jupyterlab-jovyan":/home/jovyan \
  -e NB_UID=$(id -u) \
  -e NB_GID=$(id -g) \
  -e CHOWN_HOME=yes \
  -e CHOWN_HOME_OPTS='-R' \
  glcr.b-data.ch/jupyterlab/r/geospatial

Rocker images (on which geocompx docker builds):

The Docker images built from this repository describe the software installation method in standalone scripts rather than directly in the Dockerfiles. These files are under the scripts directory, and these files are copied in all Docker images, under a top-level /rocker_scripts directory. This allows users to extend images by selecting additional modules to install on top of any pre-built images.

E.g.

FROM rocker/rstudio:4.0.0
RUN /rocker_scripts/install_python.sh
RUN /rocker_scripts/install_julia.sh

Cross-language pain points

Too many options: IDE

  • VSCode is market leader with strong community
    • Pro: many extensions
    • Pro: Live Share
    • Pro: GitHub integration (including Copilot)
    • Pro: devcontainers work out-of-the-box
    • Con: setup time, can be intimidating
  • Positron
    • Pro: More batteries included
    • Con: Missing great VSCode features (e.g. Live Share)
    • Con: Early days, fewer developers
    • Thought: why didn’t they put energy into great extensions for modularity?
  • RStudio, Jupyter Lab, Zed, …

Resource requirements

  • Each new environment can take up a few GB of space
  • Lack of shared binaries: could be overcome with pixi and judicious use of Docker containers
  • Headspace requirements of learning new syntax and quirks
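
To see where the disk space goes in a given environment, a small script can total file sizes per sub-directory, in the spirit of the disk usage listing shown earlier. This is a minimal sketch using only the standard library; the function names are hypothetical and the demo runs on a throwaway directory, whereas in practice you might point it at a folder such as `.pixi/envs/default`.

```python
import os
import tempfile


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def size_by_child(path):
    """Return {child directory name: size in bytes}, largest first."""
    sizes = {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(path)
        if entry.is_dir(follow_symlinks=False)
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


# Demo on a temporary directory standing in for an environment folder.
with tempfile.TemporaryDirectory() as env_dir:
    os.makedirs(os.path.join(env_dir, "lib"))
    os.makedirs(os.path.join(env_dir, "bin"))
    with open(os.path.join(env_dir, "lib", "big.bin"), "wb") as f:
        f.write(b"\0" * 2048)
    with open(os.path.join(env_dir, "bin", "small.bin"), "wb") as f:
        f.write(b"\0" * 512)
    report = size_by_child(env_dir)
    print(report)
```

The totals are apparent file sizes. If an environment manager hard-links files from a shared cache, each link is counted in full, so the real cost on disk may be lower.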

Cross-language priorities

Interactive session

  • What would your top ask be to use multiple languages in a single session?
  • What would you like to see in a shared environment, e.g. pixi?

Bonus: cross-language publishing

Quickstart: .devcontainer in GitHub codespaces

Open in GitHub Codespaces

Cross-language tabsets

::: {.panel-tabset group="language"}
...
:::

Output

library(sf)
sf::write_sf(spData::world, "world.gpkg")
plot(spData::world)

import geopandas as gpd
world = gpd.read_file('world.gpkg')
world.plot()

Setting-up gh-pages branch

You can do this with a single quarto command:

quarto publish gh-pages

Which leads to the following text and eventually auto-opens the deployed website!

The website

The previous command creates a gh-pages branch with the slides in the docs folder. This is then automatically deployed to GitHub Pages, and the website is opened in your browser when ready:

How awesome is that?

🤯🤯🤯

Debugging

Source of quarto publish gh-pages hint: Error message from GitHub Actions when trying to publish slides to GitHub Pages:

ERROR: Unable to publish to GitHub Pages (the remote origin does not have a branch named "gh-pages". Use first `quarto publish gh-pages` locally to initialize the remote repository for publishing.)

Adding citations

With the Quarto extension

  • You need to be in Visual Editor mode (Ctrl+Shift+F4)
  • Then it’s Ctrl+Shift+F8

Creates citations like this: (Peng 2011)

With “Citation Picker for Zotero” extension

  • Issue with this approach: doesn’t generate the .bib file

Alt+Shift+Z

Support in IDEs

References

Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science (New York, N.Y.) 334 (6060): 1226–27. https://doi.org/10.1126/science.1213847.