Databricks Systems Tables Quickstart

[Views and opinions are my own.]

Databricks released System Tables, which allow a user with sufficient GRANTs to inspect system-level data about their account. I am writing this on the heels of a piece by one of my colleagues that provides a comprehensive overview of:

  1. how these tables work,

  2. what one may find within these tables, and

  3. how to operationalize their use.

This post does not rehash those already well-covered topics. Instead, it serves as a sort of Quickstart guide for those who are technically savvy, up against a deadline, and just want to jump right in. I have written a great deal of SQL against these tables for a number of use cases, and can reasonably state that the specific uses shown below are very common and can form the foundation for a dashboard of operational insights. We divide it into two sections: Billing and Informational.

We take some variables as given:

  • tag - Some queries search and organize information by the custom tags set on the securable object

  • interval_hours - The number of hours to look back for queries. Of course, this can be changed by the reader to fit the need.

  • discount - If a discount exists, we want to compute the accurate discounted price, not the list price.

The queries also make plenty of use of built-in SQL functions, which can be read about at this link. The code below can be found in GitHub here.
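For orientation, every query below uses the same lookback predicate built from make_dt_interval. As a minimal illustration (with :interval_hours hypothetically set to 24, i.e. a one-day window), the filter keeps any usage record that started or ended within that window:

SELECT *
FROM system.billing.usage
WHERE usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
   OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours)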

Billing

Here there are two main tables:

  1. system.billing.usage

  2. system.billing.list_prices

DBUs

DBUs are the basic unit for measuring usage; please read the content behind this link if you do not understand them. Without an understanding of how DBUs are emitted, no billing dashboard would be possible. The usage table can be queried to give a great deal of information. Since this is a Quickstart, we’ll just show how to aggregate DBUs by the common features: cluster, tag, Serverless SQL, and SKU. Aggregating over other features is a simple find-and-replace through these queries.

DBUs by cluster:

SELECT usage_metadata.cluster_id, SUM(usage_quantity) AS DBUs
FROM system.billing.usage
WHERE (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
   OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
   AND usage_metadata.cluster_id IS NOT NULL
   AND usage_unit = 'DBU'
GROUP BY (usage_metadata.cluster_id)
ORDER BY 2 DESC

DBUs by tag:

SELECT tag, SUM(DBUs) AS DBUs
FROM (
  SELECT CONCAT(:tag, '=', IFNULL(custom_tags[:tag], 'null')) AS tag, usage_quantity AS DBUs
  FROM system.billing.usage
  WHERE (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
    OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
    AND usage_metadata.cluster_id IS NOT NULL
    AND usage_unit = 'DBU')
GROUP BY (tag)
ORDER BY 2 DESC

DBUs by SKU for Classic Compute:

SELECT sku_name, SUM(usage_quantity) AS DBUs
FROM system.billing.usage
WHERE (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
   OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
   AND usage_metadata.cluster_id IS NOT NULL
   AND usage_unit = 'DBU'
GROUP BY (sku_name)
ORDER BY 2 DESC

DBUs by SKU for Serverless SQL:

SELECT sku_name, SUM(usage_quantity) AS DBUs
FROM system.billing.usage
WHERE (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
   OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
  AND usage_unit = 'DBU'
  AND sku_name LIKE '%SERVERLESS_SQL%'
GROUP BY (sku_name)
ORDER BY 2 DESC

Cost

To find cost, we need the usage from above multiplied by the price. This requires a join between the usage and pricing tables. Further, since pricing can change, we need to ensure we split the usage amounts and times across price change boundaries. To keep things simple, we employ CTEs.

WITH current_pricing AS (
  SELECT
    sku_name, price_start_time, price_end_time,
    (
      pricing.effective_list.default
      * (1.000 - (:discount / 100.000))
    ) AS price_with_discount
  FROM system.billing.list_prices
  WHERE usage_unit='DBU'
    AND currency_code='USD'
    AND (price_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
      OR price_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
      OR price_end_time IS NULL)
),
current_usage AS (
  SELECT sku_name, usage_start_time, usage_end_time, usage_quantity AS DBUs
  FROM system.billing.usage
  WHERE (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
    OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
    AND usage_metadata.cluster_id IS NOT NULL
    AND usage_unit = 'DBU'
),
current_usage_with_pricing AS (
  SELECT
    p.sku_name,
    UNIX_MILLIS(p.price_start_time) AS price_start_time,
    UNIX_MILLIS(p.price_end_time) AS price_end_time,
    UNIX_MILLIS(u.usage_start_time) AS usage_start_time,
    UNIX_MILLIS(u.usage_end_time) AS usage_end_time,
    u.DBUs,
    p.price_with_discount
  FROM   current_pricing p
    JOIN current_usage u
    ON p.sku_name = u.sku_name
      AND (CASE
        WHEN p.price_end_time IS NOT NULL THEN
          (u.usage_start_time BETWEEN p.price_start_time AND p.price_end_time
          OR u.usage_end_time BETWEEN p.price_start_time AND p.price_end_time)
        ELSE
          (u.usage_start_time >= p.price_start_time
          OR u.usage_end_time >= p.price_start_time)
        END
      )
)
SELECT SUM(
  (
    (ARRAY_MIN(ARRAY(usage_end_time, price_end_time))
     - ARRAY_MAX(ARRAY(usage_start_time, price_start_time))
    )
    / (usage_end_time-usage_start_time)
  ) * DBUs * price_with_discount
) AS cost
FROM current_usage_with_pricing

Here, we get the relevant pricing as current_pricing and the relevant usage as current_usage, join the two together, and then aggregate, prorating each usage record by how much of it overlaps the price period. This should produce a value in dollars for the cost of clusters within the time interval. Note the usage_metadata.cluster_id IS NOT NULL filter, which restricts the result to cluster usage.
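For intuition on the proration term, take a hypothetical usage record of 4 DBUs spanning 10:00 to 11:00 that joins to a price row starting at 10:15 with a discounted price of $0.55 per DBU: the overlap fraction is (11:00 - 10:15) / (11:00 - 10:00) = 0.75, so that pairing contributes 0.75 * 4 * 0.55 = $1.65 to the total.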

Just for good measure, here are a few more common uses.

%DBUs by SKU

WITH usage AS (
  SELECT sku_name, usage_quantity
  FROM system.billing.usage
  WHERE usage_unit = 'DBU'
    AND (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
      OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
),
total_usage AS (
  -- OK to cross-join against this, it is a singleton dataset
  SELECT SUM(usage_quantity) AS total
  FROM usage
),
by_sku AS (
  SELECT sku_name, SUM(usage_quantity) AS DBUs
  FROM usage, total_usage
  GROUP BY sku_name
)
SELECT sku_name, DBUs/total AS ratio
FROM by_sku, total_usage
ORDER BY 2 DESC

DLT Maintenance Usage in DBUs

SELECT SUM(usage_quantity) dlt_maintenance_dbus
FROM system.billing.usage
WHERE usage_unit = 'DBU'
  AND (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
    OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
  AND usage_metadata.dlt_maintenance_id IS NOT NULL

Warehouse Usage in DBUs

SELECT usage_metadata.warehouse_id, SUM(usage_quantity) warehouse_dbus
FROM system.billing.usage
WHERE usage_unit = 'DBU'
  AND (usage_start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
    OR usage_end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
  AND usage_metadata.warehouse_id IS NOT NULL
GROUP BY usage_metadata.warehouse_id
ORDER BY warehouse_dbus DESC

Informational

Operational complexity eventually rises to a level that no human can reasonably crunch down into a report. Always. Bet on it. We need the machine to do this work for us.

We add in more tables:

  • system.query.history

  • system.compute.node_types

  • system.compute.clusters

  • system.access.audit

Node Types that match certain criteria

This can help a developer set up a suite of benchmarking tests for nodes within certain minimum and maximum size characteristics. A test tool can easily resubmit jobs with different node types and collect results using some of the queries above for review once complete.

SELECT
  node_type,
  memory_mb/1024 AS memory_gb,
  core_count,
  gpu_count
FROM system.compute.node_types
WHERE
  -- Memory from 64GB to 512GB
  memory_mb/1024 BETWEEN 64 AND 512
  AND core_count > 16
ORDER BY
  -- Some metric to track approx cost
  core_count * memory_mb * (1+gpu_count) DESC

Largest Clusters and their Owners

This enumerates recent clusters, orders by size, and shows the most significant first. This can help track either the processing leader or waste leader, depending upon the circumstances.

WITH clusters AS (
  SELECT *
  FROM system.compute.clusters
  WHERE delete_time IS NULL -- current
    OR create_time >= (current_timestamp() - make_dt_interval(0, :interval_hours)) -- recent
  QUALIFY 1 = (ROW_NUMBER() OVER (
    PARTITION BY cluster_id
    ORDER BY
      change_time DESC
  ))
),
drivers AS (
  SELECT c.owned_by, c.cluster_id, nt.core_count, nt.memory_mb/1024 AS memory_gb, nt.gpu_count
  FROM system.compute.node_types nt,
       clusters c
  WHERE nt.node_type = c.driver_node_type
),
workers AS (
  SELECT c.owned_by, c.cluster_id, nt.core_count, nt.memory_mb/1024 AS memory_gb, nt.gpu_count, COALESCE(c.worker_count, c.max_autoscale_workers) AS qty
  FROM system.compute.node_types nt,
       clusters c
  WHERE nt.node_type = c.worker_node_type
)
SELECT DISTINCT
  drivers.owned_by,
  drivers.cluster_id,
  drivers.core_count + (workers.core_count * workers.qty) AS total_cores,
  drivers.memory_gb + (workers.memory_gb * workers.qty) AS total_memory_gb,
  drivers.gpu_count + (workers.gpu_count * workers.qty) AS total_gpus
FROM drivers, workers
WHERE drivers.cluster_id = workers.cluster_id
ORDER BY total_cores*total_memory_gb*(1+total_gpus) DESC

Minimum DBR in use and by whom

This chooses the minimum DBR version currently being used by any cluster and which users own those low-version clusters. Useful during an upgrade cycle.

WITH dbrs AS (
  SELECT regexp_extract(dbr_version, '(\\d+\\.\\d+\\.(\\d|\\w)+)') AS dbr
  FROM system.compute.clusters
  WHERE delete_time IS NULL -- current
    OR create_time >= (current_timestamp() - make_dt_interval(0, :interval_hours)) -- recent
),
min_dbr AS(
  SELECT MIN(dbr) AS dbr
  FROM dbrs
  WHERE dbr IS NOT NULL
    AND '' <> dbr
)
SELECT DISTINCT owned_by, dbr
FROM system.compute.clusters, min_dbr
WHERE dbr_version LIKE CONCAT('%',dbr,'%')

Relative Query rates by Status

Provides the percentage of total queries that finished successfully, were cancelled, or failed.

WITH totals AS (
  SELECT execution_status, COUNT(*) qty
  FROM system.query.history
  WHERE execution_status IN ('FINISHED','CANCELED','FAILED') -- Completed only
    AND (start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
      OR end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
  GROUP BY execution_status
),
total_runs AS (
  SELECT SUM(qty) AS total
  FROM totals
)
SELECT execution_status, qty/total AS ratio
FROM totals, total_runs

Why Queries Fail

A high failure rate in the query above means extra cost: clusters are consumed without returning results. It is best to find the most common failure reasons and educate the team about them.

WITH errors AS (
  SELECT
    COALESCE(
      NULLIF(REGEXP_EXTRACT(error_message, '\\[(.*?)\\]'), ''),
      NULLIF(REGEXP_EXTRACT(error_message, '(\\w+((Exception)|(Error)):)'), ''),
      error_message
    ) AS error_type
  FROM system.query.history
  WHERE error_message IS NOT NULL
    AND (start_time >= current_timestamp() - make_dt_interval(0, :interval_hours)
      OR end_time >= current_timestamp() - make_dt_interval(0, :interval_hours))
)
SELECT error_type, COUNT(*) AS qty
FROM errors
GROUP BY error_type
ORDER BY qty DESC

Clusters that do not bear the owner’s name

It is common to enforce a naming convention. Queries like this can help find holes in your policies.

SELECT cluster_id, cluster_name, owned_by
FROM system.compute.clusters
WHERE cluster_name IS NOT NULL -- clusters with names
  AND owned_by IS NOT NULL -- clusters with Owners
  AND NOT REGEXP(LOWER(cluster_name), 'job-\\d+-run-\\d+.*') -- ignore jobs
  AND NOT REGEXP(
    REPLACE(LOWER(cluster_name), ' '),
    REPLACE(SUBSTR(LOWER(owned_by) FROM 1 FOR INSTR(owned_by, '@')-1), '.', '.?')
  )

Job Triggers by creator and type

Who is triggering jobs, through which means, and who is triggering the most?

SELECT
  request_params['runCreatorUserName'] AS creator,
  request_params['jobTriggerType'] AS trigger,
  COUNT(*) AS qty
FROM system.access.audit
WHERE service_name = 'jobs'
  AND action_name = 'runTriggered'
  AND event_time >= (current_timestamp() - make_dt_interval(0, :interval_hours))
GROUP BY all
ORDER BY qty DESC

Active Users

How widespread is adoption?

SELECT COUNT(DISTINCT user_identity)
FROM system.access.audit
WHERE event_time >= (current_timestamp() - make_dt_interval(0, :interval_hours))
  AND user_identity.email LIKE '%@%'

Failure Leaders

Which usernames are causing the most failure responses in the audit trail? This could be indicative of networking issues or malicious intent.

SELECT user_identity.email AS email, COUNT(*) AS qty
FROM system.access.audit
WHERE event_time >= (current_timestamp() - make_dt_interval(0, :interval_hours))
  AND user_identity.email LIKE '%@%'
  AND NOT (response.status_code BETWEEN 200 AND 300)
GROUP BY ALL
ORDER BY qty DESC

Successful DB SQL Downloads by Hour

DB SQL exfiltration rate tracker. Exfil can be extremely costly depending upon cloud vendor terms; this is a hidden direct cost that is often overlooked until it is too late.

SELECT DATE_FORMAT(event_time, 'yyyyMMddHH') AS the_hour, COUNT(*) AS qty
FROM system.access.audit
WHERE event_time >= (current_timestamp() - make_dt_interval(0, :interval_hours))
  AND service_name = 'databrickssql'
  AND action_name = 'downloadQueryResult'
  AND response.status_code = 200
GROUP BY ALL
ORDER BY the_hour DESC

I Hate Python

[Views and opinions are my own.]

First and foremost, I love Python. I believe Python is one of the best choices the mainstream development community provides for doing Functional Programming. It provides all of the bells and whistles necessary to do solid, well-thought (dare I say lawful?) FP.

One could say the hallmark features of FP are

  • functional values

  • free monoid with a fold

  • lambdas

  • strong types

Python does a wonderful job of providing all of these to a Developer. From these building blocks, we can mostly get to Type Classes in Python.

“But Hey! Wait just a moment! I thought you hated Python!!!” I do. Well… not properly, but I hate how Python is typically used. It appears someone out there is teaching a large subset of the Python community to write Python like it is a worse version of Java (Yes, things can get worse than Java, so count your blessings!). Just a few points.

  • Some data does not need to be stored in a class

  • Some classes can be alone in the world with no Parent or Child classes defined

  • No one ever said Functions/Methods must be coupled to data

All too often I find myself in a sea of poorly conceived class hierarchies wherein “inheritance” is misappropriated to restrict rather than to extend capabilities and behaviors. This is the type of Python I hope to avoid in the future but know I will be faced with in short order.

Python is Good for FP

Yes. It is.

Python has Functions as Values

def add(a,b):
    return a + b   
compose = add
print(f"compose is add: {compose(1, 2) == add(1, 2)}")
def same(f, g, a, b):
    print(f"{f(a, b) == g(a, b)} for  and ")
same(add, compose, 1, 2)

This displays

compose is add: True
True for 1 and 2

Python has a Free Monoid with a Fold

from functools import reduce
free = [1,2,3,4,5,6]
def c1(a, b): return a + b
def c2(a, b): return a * b
def c3(a, b): return str(a) + str(b)
print(reduce(c1, free, 0))
print(reduce(c2, free, 1))
print(reduce(c3, free, ""))

This displays

21
720
123456

Python has Lambdas

from functools import reduce
free = [1,2,3,4,5,6]
print(reduce(lambda a,b: a + b, free))
print(reduce(lambda a,b: a * b, free))
print(reduce(lambda a,b: str(a) + str(b), free))

This displays

21
720
123456

Python has Strong Types

Strong types are important for FP. Static types are not.

Strong types ensure the data one function writes is the data another function reads. They ensure the type of a value is static through the lifetime of that value.

Static types ensure the type of a name is static through the lifetime of that name. Static vs Dynamic typing is only important in applications where there is use of the assignment operator outside of declaration. As functional programmers, we do not do this, even if it is allowed. A “limitation” or “inconvenience” caused by dynamic typing can simply be overcome by discipline.

a = '1'
b = 1
print(f"a= is a ; b= is a ; a == b is {a == b}")
print(f"a+a gives ")
print(f"b+b gives ")
print(f"a+b error")
print(a+b)

This displays

a=1 is a <class 'str'>; b=1 is a <class 'int'>; a == b is False
a+a gives 11
b+b gives 2
a+b error
Traceback (most recent call last):
  File "C:\Users\dread\OneDrive\DreadedSoftware\Talks\y2022\hatepython\04.py", line 12, in <module>
    print(a+b)
TypeError: can only concatenate str (not "int") to str

In Python, Everything is Mutable

This, like Dynamic Typing, is not a problem for a disciplined functional programmer.

a = [1,2,3,4,5,6,7]
print(a)
for i in range(len(a)):
    if 0 == i % 2:
        a[i] = -a[i]
print(a)

This displays

[1, 2, 3, 4, 5, 6, 7]
[-1, 2, -3, 4, -5, 6, -7]

If we avoid the = operator, things should work out great.
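As a minimal sketch (my own, not from the original post), the same sign flip can be written without mutating anything by building a new list with a single assignment at declaration:

a = [1,2,3,4,5,6,7]
# no element of a is ever reassigned; b is bound exactly once
b = [-x if 0 == i % 2 else x for i, x in enumerate(a)]
print(a)
print(b)

This displays

[1, 2, 3, 4, 5, 6, 7]
[-1, 2, -3, 4, -5, 6, -7]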

Applying the Basics

Say we have two computations which apply an operation across a list of numbers to produce a new number.

from functools import reduce
a = [1,2,3,4]
def black_box_extreme_computation_1(sequence):
    return reduce(lambda a,b: a+b, sequence)
def black_box_extreme_computation_2(sequence):
    return reduce(lambda a,b: a*b, sequence)
print(f"add: ")
print(f"multiply: ")

This displays

add: 10
multiply: 24

The problem is the functions differ only by the operation applied. This is very WET code, and we like DRY. Luckily, we can pass functions as values in Python.

from functools import reduce
a = [1,2,3,4]
def add(a,b):
    return a + b
def multiply(a,b):
    return a*b
def black_box_extreme_computation(sequence):
    def inner(f):
        return reduce(f, sequence)
    return inner
print(f"add: ")
print(f"multiply: ")

This displays

add: 10
multiply: 24

We have gotten rid of the repetition, but that is not good enough. Why? Code is meant to be read. A function which takes a simple functional value f is not good enough. We want to provide enough information to the reader so that the reader does not need to read the entire function to understand how and why the f parameter is used.

Further, gone are the days of "I write code so that a competent engineer can understand it." There are a lot of boot camp developers out there. These boot camps are like 3 months long; these people have literally been coding for 3 months. Even if someone with 3 months of experience reads most functions, they will likely misunderstand something, misinterpret something, or miss something important. We need to do some work up front to ensure a "competent engineer" (whatever that means) can understand the function just by reading the prototype AND a fresh-out-of-camp colleague will either understand the function just by the prototype or immediately ask someone "What do I do here?"

More succinctly, we need to ensure we put enough information into the function prototype so that someone who does not "just know" what things mean, will know to ask a question.

For example, let's require our f to be Associative. That information is not easily ascertainable, and someone, at some point in time, will try to shove something non-associative into the mix. Like maybe:

def alternate():
    def switcher():
        n = True
        while True:
            n = not n
            yield n
    gen = switcher()
    def inner(a, b):
        if next(gen): return add(a,b)
        else: return multiply(a,b)
    return inner

Any application which requires associativity will produce odd results in this case, and we want to avoid these types of bugs because they are incredibly difficult to find and fix in a language with no compiler.
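To see the problem concretely, here is a small sketch of mine (assuming the add, multiply, and alternate definitions above are in scope); regrouping the same three operands changes the result:

left = alternate()
right = alternate()
print(left(left(1, 2), 3))    # evaluates (1 * 2) + 3
print(right(1, right(2, 3)))  # evaluates 1 + (2 * 3)

This displays

5
7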

To solve this, we will use a class and type annotations. Even though types are not enforced in Python, annotations make it easy to tell future developers (including yourself 3 weeks from now) what your intention was in writing the code.

from abc import ABC, abstractmethod
class AssociativeOperator(ABC):
    @abstractmethod
    def combine(self, a, b): pass
from functools import reduce
a = [1,2,3,4]
class Add(AssociativeOperator):
    def combine(self, a,b):
        return a+b
class Multiply(AssociativeOperator):
    def combine(self, a,b):
        return a*b
def black_box_extreme_computation(sequence):
    def inner(f: AssociativeOperator):
        return reduce(f.combine, sequence)
    return inner

Now, it is super clear that we want an associative operation. For good measure, let’s do something similar with a Monoid.

from abc import ABC, abstractmethod
class Monoid(ABC):
    a = None
    @abstractmethod
    def combine(self, a, b): pass
from functools import reduce
a = [1,2,3,4]
class Add(Monoid):
    a = 0
    def combine(self, a,b):
        return a+b
class Multiply(Monoid):
    a = 1
    def combine(self, a,b):
        return a*b
def black_box_extreme_computation(sequence):
    def inner(f: Monoid):
        return reduce(f.combine, sequence, f.a)
    return inner

A simple change, and we went from requiring just a closed Associative Operator (Semigroup), to requiring both a closed Associative Operator and an identity item.
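As a quick usage sketch (mine, not from the original post), at this point the instance is still passed in explicitly:

print(black_box_extreme_computation(a)(Add()))
print(black_box_extreme_computation(a)(Multiply()))

This displays

10
24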

These very much are type classes; however, this encoding still leaves something to be desired. In programming languages that support "real functional programming", a function that looks like f(input)(typeclass_instance) can nearly automatically discover which typeclass_instance to choose.

For instance in Scala arguments can be automagically inferred by using the implicit keyword. In Haskell, typeclasses are well-supported at the compiler level and just declaring a type class instance in a function prototype delivers the instance.

What is needed is a semi-automatic way to pass in a function which is enhanced in some way by another function.

Enter the Decorator

def noDecorator():
    def log(f, message):
        print(f(message))
    def formatter(message):
        import datetime
        delim = {
            'begin': '~** ',
            'end'  : ' **~'
        }
        return f"[] "
    log(formatter, "first")
    log(formatter, "second")
    log(formatter, "third")
noDecorator()
def withDecorator():
    def logger(f):
        def inner(*args):
            print(f(*args))
        return inner
    @logger
    def log(message):
        import datetime
        delim = {
            'begin': '~** ',
            'end'  : ' **~'
        }
        return f"[] "
    log("first")
    log("second")
    log("third")
withDecorator()

This displays

[2022-07-17 02:38:08.810432] ~** first **~
[2022-07-17 02:38:08.811466] ~** second **~
[2022-07-17 02:38:08.811466] ~** third **~
[2022-07-17 02:38:08.811466] ~** first **~
[2022-07-17 02:38:08.812467] ~** second **~
[2022-07-17 02:38:08.812467] ~** third **~

Since we are discussing Typeclasses, it is important to note we can stack decorators to apply multiple enhancements to the same function.

def truncate(f):
    def inner(message):
        return f(message[:3])
    return inner
def logger(f):
    def inner(message):
        print(f(message))
    return inner
@logger
@truncate
def log(message):
    import datetime
    delim = {
        'begin': '~** ',
        'end'  : ' **~'
    }
    return f"[] "
print(log("first"))

This displays

[2022-07-17 02:41:15.011042] ~** fir **~

Let’s enhance the same Monoid typeclass example with decorators to almost automatically pass in our instances.

Semi-Automatic Choice

from functools import reduce
from abc import ABC, abstractmethod
import random
a = [1,2,3,4]
class Monoid(ABC):
    def combine(a, b): pass
class Add(Monoid):
    def combine(a,b):
        return a+b
class Multiply(Monoid):
    def combine(a,b):
        return a*b
def monoid(F: Monoid):
    def decorator(f):
        def inner(*args, **kwargs):
            return f(*args, **kwargs)(F)
        return inner
    return decorator
def black_box_extreme_computation(sequence):
    def inner(f: Monoid):
        return reduce(f.combine, sequence)
    return inner
@monoid(Multiply)
def pi(sequence):
    return black_box_extreme_computation(sequence)
@monoid(Add)
def sigma(sequence):
    return black_box_extreme_computation(sequence)
print(f"pi(a) =\n   ")
print(f"pi(random.sample(a, len(a))) =\n   {pi(random.sample(a, len(a)))}")
print(f"pi([*a, 5]) =\n   {pi([*a, 5])}")
print(f"sigma(a) =\n   ")
print(f"sigma(random.sample(a, len(a))) =\n   {sigma(random.sample(a, len(a)))}")
print(f"sigma([*a, 5]) =\n   {sigma([*a, 5])}")

This displays

pi(a) =
   24
pi(random.sample(a, len(a))) =
   24
pi([*a, 5]) =
   120
sigma(a) =
   10
sigma(random.sample(a, len(a))) =
   10
sigma([*a, 5]) =
   15

To get rid of the need to explicitly pass the typeclass instance as an argument, we created a function which takes a typeclass instance and returns a decorator. Decorators are simply functions, and functions are values, so why not return a decorator as the result of a function? Our decorator is indeed a currying operator.

This gets us to semi-automatic typeclass use. There is a fair bit of pageantry and ceremony in defining typeclasses; there is no auto derivation; there is no compiler help. However, at least we have marked functions as requiring well-known typeclasses, and we can write Python code that looks at least a little bit like it was written in a language which enforces the types of discipline we take for granted in other environments.

Monads

Setup

from functools import reduce
from abc import ABC
a = [1,2,3,4]

A Functor is just a function mapping an item in our source category (types) into an item in our target category (types inside of some other type). We will make one for List and for “Optional”. Note: in Python everything is Optional so this is a fancy version of the identity, but it still makes for a nice example.

class Functor(ABC):
    def lift(a): pass
    def map(f): pass
class ListFunctor(Functor):
    def lift(a): return [a]
    def map(f):
        def inner(a):
            return map(f, a)
        return inner
class AnyFunctor(Functor):
    def lift(a): return a
    def map(f):
        def inner(a):
            if a != None: return f(a)
            else: return a
        return inner

A Functor and some special Natural Transformations make up a Monad. So let’s do it for List and Optional.

class Monad(ABC):
    F: Functor = None
    def bind(a, f): pass
class ListMonad(Monad):
    F = ListFunctor
    def bind(self, a, f):
        listList = self.F.map(f)(a)
        return [b for a in listList for b in a]
class AnyMonad(Monad):
    F = AnyFunctor
    def bind(self, a, f):
        optOpt = self.F.map(f)(a)
        return optOpt
def monad(F: Monad):
    def decorator(f):
        def inner(*args, **kwargs):
            return f(*args, **kwargs)(F)
        return inner
    return decorator

Just for completeness, we need an operation to chain monadic computations easily. Every “real functional language” provides such a mechanism (Haskell: >>=; Scala: for), so let’s do it here too.

def chain(a, *args):
    def inner(F: Monad):
        return reduce(lambda a,b: F.bind(F, a, b), args, a)
    return inner
@monad(ListMonad)
def flatMap(a, *args):
    return chain(a, *args)
@monad(AnyMonad)
def noneCheck(a, *args):
    return chain(a, *args)

Some functions to chain together

def toRange(a): return list(range(a))
def repeater(a): return [a] * 2
def doubler(a): return [a*2]
def stringToInt(a):
    try:
        return int(a)
    except:
        return None
def intToList(a):
    return list(range(a))

And we have…

print(f"flatMap(a, doubler, repeater, toRange) =\n   {flatMap(a, doubler, repeater, toRange)}")
print(f"noneCheck('a', stringToInt, intToList) =\n   {noneCheck('a', stringToInt, intToList)}")
print(f"noneCheck('7', stringToInt, intToList) =\n   {noneCheck('7', stringToInt, intToList)}")
print(f"flatMap(noneCheck('7', stringToInt, intToList), intToList) =\n   {flatMap(noneCheck('7', stringToInt, intToList), intToList)}")

Which displays

flatMap(a, doubler, repeater, toRange) =
   [0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
noneCheck('a', stringToInt, intToList) =
   None
noneCheck('7', stringToInt, intToList) =
   [0, 1, 2, 3, 4, 5, 6]
flatMap(noneCheck('7', stringToInt, intToList), intToList) =
   [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5]

Conclusion

We have typeclass-based, Monadic operations in Python, which is a dynamically typed language. This requires strong typing and functions as values.

We use classes to model Typeclasses.

We use the facts that functions are values and decorators are functions to give us a currying function which we then use to give us semi-automatic typeclass instance choice.

We have proven this sound for Semigroups, Monoids, Functors and Monads. There is no reason to believe it would fail for Groupoids, Groups, Applicatives or any other Typeclasses that are representable in Python.

Scala: Stable Identifier Required

This error message seems to be the source of a lot of confusion and pain across the Scala ecosystem. Here, we will attempt to make it less scary.

Take the following:

val b = 2
def simpleMatch(a: Int): String = a match{
  case `b` => "OK"
  case _ => a + " NOK"}
(0 to 5).map(simpleMatch)

The result is:

Vector(0 NOK, 1 NOK, OK, 3 NOK, 4 NOK, 5 NOK)

When we do something as seemingly benign as changing the val to a var, our work life is immediately made worse:

var b = 2
def simpleMatch(a: Int): String = a match{
  case `b` => "OK"
  case _ => a + " NOK"}
(0 to 5).map(simpleMatch)

And the compiler attempts to help us by kindly letting us know error: stable identifier required, but this.b found. What?!?!?!

The Scala Language Specification

When an error message is too terse for understanding, a few great places to start are Stack Overflow, Gitter, Twitter, or any other messaging or questioning service where a lot of Scala developers gather. Assuming you've already exhausted those options, we dig into the Scala Language Specification (SLS) itself.

Section 3.1 of the SLS states very clearly

A stable identifier is a path which ends in an identifier.

Ok. Not super helpful immediately, but it does lead us to further reading; we need to understand both paths and identifiers.

Identifiers

Section 1.1 defines identifiers and chapter 2 describes them. It is very dry, however illuminating, reading. Identifiers are the names of things. The word name here is applied broadly; operators (such as *, +, =, ++=) are also names.

Paths

Section 3.1 of the SLS is on Paths. It gives four cases

  • The empty path ε
  • C.this, where C references a class. The path this is taken as a shorthand for C.this where C is the name of the class directly enclosing the reference.
  • p.x where p is a path and x is a stable member of p. Stable members are packages or members introduced by object definitions or by value definitions of non-volatile types.
  • C.super.x or C.super[M].x where C references a class and x references a stable member of the super class or designated parent class M of C. The prefix super is taken as a shorthand for C.super where C is the name of the class directly enclosing the reference.

The third case helps us here. Our error states error: stable identifier required, but this.b found (this.b looks like p.x). The difference is our b is not a package, object definition or val; it is a var. This makes some semantic sense; the word stable can hardly be used to describe something like a var which can change at any time.

Basically, a stable identifier is simply a name which is bound statically to a value. They are required for certain tasks (like pattern matching) so the compiler can make sense of the code it is generating and the types it is inferring. Next time you see this error, just check that your types are well defined and that you are not shadowing any stable names with unstable ones. A quick workaround is, instead of matching on the unstable identifier, to set it equal to a stable one:

val definitelyStable = b
def simpleMatch(a: Int): String = a match{
  case `definitelyStable` => "OK"
  case _ => a + " NOK"}
(0 to 5).map(simpleMatch)

Introductory TyDD in Scala: Deconstructing complex Types

These types are ugly and cumbersome; they are not at all human readable. How do we get data out of such a type and format it in a way that is useful to the operator? Let's again start with a naive example.

def stringify1[A, B, C](
  fa: A => String, fb: B => String, fc: C => String,
  in: List[(A, (B, C))]): String = {
  in.map{case (a, (b, c)) =>
    fa(a) + ", " + fb(b) + ", " + fc(c)
  }.mkString("(", "; ", ")")
}

Here we take a List of nested pairs and return a string that is hopefully more human readable than the toString method would provide.

Abstract the F

Like before, we will abstract the F from our function so that it may be used with any type constructor of arity 1 rather than being hard-coded to List.

import cats.Functor
val functorList = new Functor[List]{
  override def map[A, B](fa: List[A])(f: A => B): List[B] =
    fa.map(f)
}
def stringify2[F[_]: Functor, A, B, C](
  fa: A => String, fb: B => String, fc: C => String,
  in: F[(A, (B, C))]): String = {
  val F = implicitly[Functor[F]]
  F.map(in){case (a, (b, c)) =>
    fa(a) + ", " + fb(b) + ", " + fc(c)
  }
  ???
}

The cats library provides a nifty Functor typeclass for us, so we can abstract the map call pretty easily. What of the mkString? As it turns out, cats provides a typeclass for this as well! It is called Show; let's see how it works.

import cats.Show
def stringify3[F[_]: Functor, A, B, C](
  fa: A => String, fb: B => String, fc: C => String,
  in: F[(A, (B, C))])(implicit
  FS: Show[F[String]]): String = {
  val F = implicitly[Functor[F]]
  val result = F.map(in){case (a, (b, c)) =>
    fa(a) + ", " + fb(b) + ", " + fc(c)
  }
  FS.show(result)
}

And extending this idea of Show to the Function1 instances we have

def stringify4[F[_]: Functor, A: Show, B: Show, C: Show](
  in: F[(A, (B, C))])(implicit
  FS: Show[F[String]]): String = {
  val F = implicitly[Functor[F]]
  val fa = implicitly[Show[A]].show _
  val fb = implicitly[Show[B]].show _
  val fc = implicitly[Show[C]].show _
  val result = F.map(in){case (a, (b, c)) =>
    fa(a) + ", " + fb(b) + ", " + fc(c)
  }
  FS.show(result)
}

Abstracting over Arity

We'll attempt to build a recursive version of this function using the same principles we've used in previous posts.

def stringify5[F[_]: Functor, A: Show, B: Show](
  in: F[(A, B)])(implicit
  FS: Show[F[String]]): String = {
  val F = implicitly[Functor[F]]
  val fa = implicitly[Show[A]].show _
  val fb = implicitly[Show[B]].show _
  val result = F.map(in){case (a, b) =>
    fa(a) + ", " + fb(b)
  }
  FS.show(result)
}

This is not what we want! We need to recurse on the Show instance inside the Functor. Let's make a recursive function for Show. This will follow the code we read from Shapeless very closely.

implicit def makeShow[A: Show, B: Show]: Show[(A, B)] = {
  val fa = implicitly[Show[A]].show _
  val fb = implicitly[Show[B]].show _
  new Show[(A, B)]{
    override def show(t: (A, B)): String = {
      val (a, b) = t
      "(" + fa(a) + ", " + fb(b) + ")“
    }
  }
}

This says: given any two Show instances, a Show instance for their pair can be produced. Alternatively, it can be said that given a proof for Show[A] and a proof for Show[B], a proof for Show[(A, B)] follows.

So, now we have the following stringify function:

def stringify[F[_]: Functor, A: Show](
  in: F[A])(implicit
  FS: Show[F[String]]): String = {
  val F = implicitly[Functor[F]]
  val fa = implicitly[Show[A]].show _
  val result = F.map(in)(fa)
  FS.show(result)
}

Notice how simple our code has become. All of the specific type information has been absorbed into the recursive implicit functions. This is simply the Show instance for Functor itself. Furthermore, our makeShow function will produce Show instances for any nesting of Tuple2 instances; it generates Show instances for binary trees. The makeShow function is 10 lines of code (even including the Scala boilerplate) and gave us a giant boost in usefulness for our library code.

What's the Point?

Let's see an example of how this can be used. Given our individual Show instances:

implicit val ShowListString = new Show[List[String]]{
  def show(in: List[String]): String =
    in.mkString("(", "; ", ")")
}
implicit val showInt = new Show[Int]{
  override def show(in: Int): String = in.toString}
implicit val showLong = new Show[Long]{
  override def show(in: Long): String = in.toString}
implicit val showString = new Show[String]{
  override def show(in: String): String = in}
implicit val showDouble = new Show[Double]{
  override def show(in: Double): String = f"$in%.2f"}
implicit val showFloat = new Show[Float]{
  override def show(in: Float): String = f"$in%.2f"}
implicit val showArrayByte = new Show[Array[Byte]]{
  override def show(in: Array[Byte]): String = new String(in)}

Our writer and reader code becomes:

//Write the thing with our writer
val result = implicitly[Result1]
//Read the thing back with our reader
println(stringify(result))

For anyone to create an application which reads and writes data with our library they need only define the specific business logic and types for their application. If they miss writing an instance, the compiler will tell them. If they rearrange the order of the types in IO, the compiler will figure it out for them. Many common pain points of writing data readers and writers have been taken care of by the library implementor giving the rest of the team the ability to focus on business logic.

The goal here is not to have an entire company of developers who write this kind of code. Just like the goal of any business is to have employees who each bring something new to the team, the goal here is to have a few developers who write these libraries that other developers can use to develop business applications. The benefit of this is the libraries help limit the kinds of errors that can make it to production; the entire production cycle gets a confidence boost.

Introductory TyDD in Scala: Writing Type Level Functions

Type level functions consume types and produce types. This allows us to tell the compiler exactly what we want in our code that errors may be caught at compile time rather than at later stages in our product deployment cycle. The true power of this approach lies in the fact that with Scala no application can be produced without compilation. No matter how undisciplined a developer may be, she cannot skip compilation. Moving error checking into compilation gives us more confidence in the performance of our binary than confidence gained by any other means. In short, when there is a failure in production we always ask "were the tests run?" but never do we ask "was it compiled?"

A Basic Zipper

Here we have a zipper.

def zipper[A, B, C](
  a: List[A], b: List[B], c: List[C]
): List[(A, (B, C))] = a.zip(b.zip(c))

From previous posts in this series we know this can be generalized with a type class. Let's exercise this muscle immediately.

trait Zip[F[_]]{
  def apply[A, B](a: F[A], b: F[B]): F[(A, B)]
}
def zipper[F[_]: Zip, A, B, C](
  a: F[A], b: F[B], c: F[C]): F[(A, (B, C))] = {
  val F = implicitly[Zip[F]]
  F(a, F(b, c))
}

This is nicer than the first version but it is still super restrictive. It only works for zipping exactly 3 values. If we want to zip 2 or 4 or 70 values, we are out of luck! We saw how the shapeless HList allowed us to compose an arbitrary number of arbitrary types into a single type. Let's try to use the same kinds of methods here to produce a zipper of arbitrary arity.

Step 1 - Simplify as much as is possible

We will simplify a zipper to the purest form we can. The purest zipper takes two instances of a particular type constructor and produces a single instance of that type constructor on a pair. Let's write that.

def zipper[F[_]: Zip, H, T](
  h: F[H], t: F[T]): F[(H, T)] = {
  val F = implicitly[Zip[F]]
  F(h, t)
}

Now, we don't need separate functions for different arity versions of a zipper. We can simply call this function recursively to produce the desired result.

val (list1, list2, list3, list4, list5, list6) = ...
implicit val ZipList = new Zip[List]{
  override def apply[A, B](
    a: List[A], b: List[B]): List[(A, B)] = a.zip(b)
}
val with2 = zipper(list1, list2)
val with3 = zipper(list1,
            zipper(list2, list3))
val with6 = zipper(list1,
            zipper(list2,
            zipper(list3,
            zipper(list4,
            zipper(list5, list6)))))

If we recall the shapeless HList code we learned to read, we see the same pattern here of a recursive type being produced from recursive calls. This can be done in the compiler using implicits.

Step 2 - Replace explicit recursion with implicit

implicit def zipper[F[_]: Zip, H, T](implicit
  h: F[H], t: F[T]): F[(H, T)] = {
  val F = implicitly[Zip[F]]
  F(h, t)
}

This is the same code; just the keyword implicit is placed in two locations. In Scala we can promote an explicit call to an implicit call simply by adding a keyword. We mark the inputs implicit so the compiler knows to find the head and tail by itself. We mark the function implicit so the compiler knows to call the function implicitly if a value of the necessary type is not found.

We communicate our intent to the compiler with implicits and types. Type aliases help simplify business logic.

type F[A] = List[A]
type Result =
  F[
    (Int, (Long, (String, (Double, (Float, Array[Byte])
  ))))]

To tell the compiler which values it can use during the application of functions, we mark the values as implicit

implicit val list1: List[Int] = ???
implicit val list2: List[Long] = ???
implicit val list3: List[String] = ???
implicit val list4: List[Double] = ???
implicit val list5: List[Float] = ???
implicit val list6: List[Array[Byte]] = ???

The implicitly function tells the compiler what it needs to execute.

implicitly[Result]

And that's it! The compiler assembles the recursive calls for us. No more errors from placing things in the wrong order; refactoring is as simple as rearranging the order of the types in our alias Result. One change on one line propagates throughout the entire code base automatically.

Why stop at Tuple2?

There is nothing here that requires that a tuple be used. In fact, the only thing we need is a type constructor of arity 2. We can express this as

trait ZipG[F[_], G[_, _]]{
  def apply[A, B](a: F[A], b: F[B]): F[G[A, B]]
}
implicit def zipper[F[_], G[_, _], H, T](implicit
  F: ZipG[F, G], h: F[H], t: F[T]): F[G[H, T]] = {
  F(h, t)
}

And we can zip Tuple2s or Eithers by creating type class instances

implicit val zipListTuple2 = new ZipG[List, Tuple2]{
  override def apply[A, B](
    a: List[A], b: List[B]): List[(A, B)] = a.zip(b)
}
implicit val zipListEither = new ZipG[List, Either]{
  override def apply[A, B](
    a: List[A], b: List[B]): List[Either[A, B]] =
    for{a <- a; b <- b}yield{
      if(a.toString.size < b.toString.size) Left(a)
      else Right(b)
    }
}

For Tuple2 we have business logic like

type F[A] = List[A]
type Result =
  F[
    (Int, (Long, (String, (Double, (Float, Array[Byte])
  ))))]

implicit val list1: List[Int] = ???
implicit val list2: List[Long] = ???
implicit val list3: List[String] = ???
implicit val list4: List[Double] = ???
implicit val list5: List[Float] = ???
implicit val list6: List[Array[Byte]] = ???

implicitly[Result]

This is the same as before. Commonly, further abstractions at the type level have little or no effect on code at the value level. Abstractions should allow the code to be more expressive than it was prior to the abstraction exercise, never less expressive.

Changing our business logic to use Either as our Zipping class is simple

type F[A] = List[A]
type Result =
  F[
    Either[Int, Either[Long, Either[String,
    Either[Double, Either[Float, Array[Byte]]
  ]]]]]

implicit val list1: List[Int] = ???
implicit val list2: List[Long] = ???
implicit val list3: List[String] = ???
implicit val list4: List[Double] = ???
implicit val list5: List[Float] = ???
implicit val list6: List[Array[Byte]] = ???

implicitly[Result]

This is very powerful. By changing our type aliases, we were able to entirely change the meaning of our business logic without complex refactorings. As long as there are the correct type classes in implicit scope, the business logic need not be bothered by those implementation details.

What we have here is a sort of data pipeline and writer. Now that we have formatted data that we can work with in code, how do we present that data to an operator? Next, we'll write a reader for our types.

 

Introductory TyDD in Scala: Reading Type Level Functions

This is shapeless' HList

sealed trait HList extends Product with Serializable

final case class ::[+H, +T <: HList](head : H, tail : T) extends HList {
  override def toString = head match {
    case _: ::[_, _] => "("+head+") :: "+tail.toString
    case _ => head+" :: "+tail.toString
  }
}

sealed trait HNil extends HList {
  def ::[H](h : H) = shapeless.::(h, this)
  override def toString = "HNil"
}

case object HNil extends HNil

The important thing to note is that it is very similar to a standard Scala List, just at the type level. It has a head and a tail as type parameters, and the tail is itself an HList. The implementation details are not important for our purposes here. All we need to take from this is that an HList is a recursive list (head, tail) which is either empty (HNil) or whose elements can be of different types.
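For a concrete feel, here is a tiny snippet of my own (not from the shapeless source):

import shapeless.{::, HNil}

val xs: Int :: String :: HNil = 1 :: "one" :: HNil
// xs.head is an Int; xs.tail is a String :: HNil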

A Type Level Operation on HList

This is a function defined in shapeless which helps client code work with HLists

trait Mapped[L <: HList, F[_]] extends Serializable {
  type Out <: HList
}
object Mapped {
  …
  type Aux[L <: HList, F[_], Out0 <: HList] =
    Mapped[L, F] { type Out = Out0 }
 implicit def hnilMapped[F[_]]: Aux[HNil, F, HNil] =
    new Mapped[HNil, F] { type Out = HNil }
  …
  implicit def hlistMapped1[
  H, T <: HList, F[_], OutM <: HList](implicit
  mt: Mapped.Aux[T, F, OutM]): Aux[H :: T, F, F[H] :: OutM] =
    new Mapped[H :: T, F] { type Out = F[H] :: OutM }
}

We can see all the parts we mentioned in the first post of this series. Recall:

  1. Keywords
  2. Input types
  3. Output types
  4. Body

Note there are multiple implicit def declarations. The modifiers implicit def are similar to case statements in value function bodies. Let's take this piece by piece.

The First Case

implicit def hnilMapped[F[_]]: Aux[HNil, F, HNil] =
new Mapped[HNil, F] { type Out = HNil }

When focusing on the types, the intent becomes clear

implicit def hnilMapped[F[_]]: Aux[HNil, F, HNil] =
  new Mapped[HNil, F] { type Out = HNil }

This reads something like: Given a type constructor, yield the empty HList. Simple enough! This stuff is comprehensible after all!

The Second Case

implicit def hlistMapped1[
H, T <: HList, F[_], OutM <: HList](implicit
mt: Mapped.Aux[T, F, OutM]): Aux[H :: T, F, F[H] :: OutM] =
  new Mapped[H :: T, F] { type Out = F[H] :: OutM }

This is admittedly more complex. Let's take it in stride again focusing on the types.

implicit def hlistMapped1[
H, T <: HList, F[_], OutM <: HList](implicit
mt: Mapped.Aux[T, F, OutM]): Aux[H :: T, F, F[H] :: OutM] =
  new Mapped[H :: T, F] { type Out = F[H] :: OutM }
  • Given a head, a tail which is an HList, a type constructor and another HList
  • Also, given proof that this type class holds good for our tail
  • We can generate an HList where our type constructor is applied to our head and the tail follows

A brief Word on Induction

Many of the type level functions encountered in the wild are of this shape, where there is at the very least

  1. a trivial case which produces some terminal element
  2. a more complex case which, given a type of arity n, produces a type of arity n+1

In other words, recursive types can be generated inductively assuming you have an instance of the recursive type and all the necessary parts to wrap that instance within the target type.

In the next post we'll create some type level code using these methods.

Introductory TyDD in Scala: Basic Type Class Development

(A more complete treatment of type classes and higher kinds.)

A simple Type Class

Take the following

trait Mapping[A, B]{
  def map(a: A): B
}

and an instance for it

val mapping: Mapping[List[Int], List[String]] =
  new Mapping[List[Int], List[String]]{
    override def map(a: List[Int]): List[String] =
      a.map(_.toString)
  }

This instance is super restrictive. It only works for taking a List[Int] into a List[String]. We want to map a List of any type. Since we know what our type parameters are, we can achieve our goal by passing in a function

trait ListMapping[A, B]{
  def map(list: List[A])(f: A => B): List[B]
}

So, given a List[A] and a function, A => B, we can get a List[B]. And by taking the type parameters from the trait definition and placing them onto the function definition, we can squeeze out a bit more freedom.

trait ListMapping{
  def map[A, B](list: List[A])(f: A => B): List[B]
}
val mapping: ListMapping =
  new ListMapping{
    override def map[A, B](a: List[A])(f: A => B): List[B] =
      a.map(f)
  }

Now, why would anyone ever do this? The List type provides a map function which does exactly this. With this approach one may provide any number of methods for mapping a list. Take this alternative:

val reverseMapping: ListMapping =
  new ListMapping{
    override def map[A, B](a: List[A])(f: A => B): List[B] =
      a.reverse.map(f)
  }

Through type classes, we can define new functionality for old data structures on the fly. Similar code can be written for sort order, string formatting or just about anything else.
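For instance, a sort-order variant might look like the following sketch (the ListSorting name and its instances are mine, purely illustrative):

trait ListSorting{
  def sort[A](list: List[A])(lt: (A, A) => Boolean): List[A]
}
val forwardSorting: ListSorting =
  new ListSorting{
    override def sort[A](list: List[A])(lt: (A, A) => Boolean): List[A] =
      list.sortWith(lt)
  }
val reverseSorting: ListSorting =
  new ListSorting{
    override def sort[A](list: List[A])(lt: (A, A) => Boolean): List[A] =
      list.sortWith(lt).reverse
  }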

Making things even more general

While the ability to map Lists of any type in any number of ways is fairly abstract, it is not abstract enough for our purposes. What if we want to map a different data structure such as an Option, a Stream, or a Spark Dataset?

Luckily, Scala has a language feature which can help us out here.

trait WithMap[F[_]]{
  def map[A, B](m: F[A])(f: A => B): F[B]
}

The type parameter, F[_], itself has a type parameter, _; this tells the compiler that our type parameter itself requires a type parameter. Notice that in our definition all mention of List has been replaced by our parameter, F. This just says that given a type, F, which itself takes a type parameter, we can change the inner type of F without changing F. We can do this with any parameterized type of arity 1.

implicit val listWithMap = new WithMap[List]{
  override def map[A, B](m: List[A])(f: A => B): List[B] =
    m.map(f)
}
implicit val optionWithMap = new WithMap[Option]{
  override def map[A, B](m: Option[A])(f: A => B): Option[B] =
    m.map(f)
}
implicit val streamWithMap = new WithMap[Stream]{
  override def map[A, B](m: Stream[A])(f: A => B): Stream[B] =
    m.map(f)
}
val reverseListWithMap = new WithMap[List]{
  override def map[A, B](m: List[A])(f: A => B): List[B] =
    m.reverse.map(f)
}


With these techniques we can define super polymorphic functions. Take this pretty stringify function:

def prettyString[F[_]: WithMap, A](m: F[A])(f: A => String): String = {
  implicitly[WithMap[F]].map(m)(f).toString
}

This takes two type parameters, F[_]: WithMap and A. The `:` character in the first type parameter tells the compiler that it needs an implicit instance of WithMap defined for our type F.
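A quick usage sketch (mine), assuming the listWithMap instance above is in implicit scope:

prettyString(List(1, 2, 3))(i => (i * 2).toString)
// yields "List(2, 4, 6)"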

And here is a data processor defined in the same way

def processData[F[_]: WithMap, A, B, C, D](
  m1: F[A])(f1: A => B)(f2: B => C)(f3: C => D): F[D] = {
  val F = implicitly[WithMap[F]] 
  val m2 = F.map(m1)(f1)
  val m3 = F.map(m2)(f2)
  F.map(m3)(f3)
}

We have taken an implementation detail (the map function on List, Option, etc...) and brought it outside the type. This has given us the ability to talk about data which has a sensible map function without knowing what that data necessarily looks like.

Next we'll learn how to read some of the Type Level functions that exist in the shapeless library.

Introductory TyDD in Scala: Anatomy of a Type Level Function

Value Level Functions

A value level function typically looks like

def f(a: Int, b: Int): String = {
  a.toString + b.toString
}

The key parts are

  1. A keyword: def
  2. Inputs: a: Int, b: Int
  3. Outputs: String
  4. Body: a.toString + b.toString

When you see these 4 parts, you know you are reading a value level function. Nothing surprising here. Now, let's see what a similar definition looks like at the type level.

Type Level Functions

A type level function looks like

trait MyTrait[A, B]{type Out}
object MyTrait{
  def apply[A, B, C](): MyTrait[A, B]{type Out = C} =
    new MyTrait[A, B]{override type Out = C}
}

In this definition there is a type refinement, MyTrait[A, B]{type Out = C}. These are undesirable artifacts of type level development. To simplify these definitions we use the Aux alias (there is a document about this). Aux helps us remove type refinements from our logic.

A type level function looks like

trait MyTrait[A, B]{type Out}
object MyTrait{
  type Aux[A, B, C] = MyTrait[A, B]{type Out = C}
  def apply[A, B, C](): Aux[A, B, C] =
    new MyTrait[A, B]{override type Out = C}
}

The type refinement from the previous example is replaced by the nicer (more readable, fewer braces, less code) Aux type.

Type level functions have the same 4 key parts as value level functions

  1. Keywords: trait object
  2. Inputs: A, B
  3. Outputs: type Out
  4. Body: def apply[A, B, C](): Aux[A, B, C]

Here the inputs are type parameters and outputs are type members. This is so the output types are not erased and can be referenced later in business logic. This is similar to value level functions as the result of a value level function does not expose the inputs required by the function.

Bodies of type level functions are value level functions. They are typically only a few lines long. Their purpose is to present the compiler with a way to construct a new type from the types provided. This is what these blog posts will focus on.

Whenever you see a set of definitions which have these 4 qualities, you know you are looking at a type level function.

The type class is the fundamental element of this style of type driven development. The next post will give an overview of this concept.

Introductory TyDD in Scala

Here are the slides (pptx, pdf) and code from the PHASE talk on Type Driven Development in Scala. Long form blog post version is forthcoming.

  1. Anatomy of a Type Level Function
  2. Type Classes & Higher Kinds
  3. Reading Type Level Functions
  4. Writing Type Level Functions
  5. Reading Data that is this strongly typed

Abstraction in F[_]: Implicits and Induction

Now we have a set of Pipeline instances

trait P1; trait P2; trait P3
implicit def guardP1 = new Guard[P1]{
  final override def name: String = "p1"
}
implicit def guardP2 = new Guard[P2]{
  final override def name: String = "p2"
}
implicit def guardP3 = new Guard[P3]{
  final override def name: String = "p3"
}
implicit def p1: Pipeline[P1, Stream, ...]
implicit def p2: Pipeline[P2, Stream, ...]
implicit def p3: Pipeline[P3, Stream, ...]

And we need to construct a Pipeline which represents choosing one of these Pipeline instances. In pseudocode:

val pipeline: Pipeline[P1 or P2 or P3, Stream, ...]

In the previous post, given three Pipeline instances, we produced an Either[Either[Either[...]]] from a URI. More abstractly given three items we produced a recursive structure with three levels.

Building Recursive Structures with Shapeless

The shapeless library provides machinery for composing these kinds of recursive applications.

A Brief Intro to Shapeless

Two of the more widely used structures in shapeless are HList and Coproduct.

HList is a heterogeneous list. It is a List that retains type information for each element rather than for all elements. For example, in the REPL:

scala> 1 :: "1" :: 1.0 :: Nil
res1: List[Any] = List(1, 1, 1.0)

scala> 1 :: "1" :: 1.0 :: HNil
res2: shapeless.::[Int,shapeless.::[String,shapeless.::[Double,shapeless.HNil]]] = 1 :: 1 :: 1.0 :: HNil

The List comes back with a type parameter of Any while the HList knows it has an Int, a String and a Double. An HList is a Product: it maintains that all of its types exist at once.

A Coproduct is the Categorical Dual of a Product: it maintains that exactly one of its types exists. In the REPL (here T is a Coproduct such as Int :+: Char :+: Double :+: CNil):

scala> val t: T = Inl(1)
t: T = Inl(1)

scala> val t: T = Inr(Inl('1'))
t: T = Inr(Inl(1))

scala> val t: T = Inr(Inr(Inl(1.0)))
t: T = Inr(Inr(Inl(1.0)))

We can see these are both recursive data structures. Shapeless excels at building and manipulating these kinds of structures in Scala.

Building a Recursive Either Inductively

Induction is performed by noting two things:

  1. A base case from which one may begin the process
  2. An inductive case from which one may continue the process

We can model this in Scala using functions and implicits.

What we have in our nested Either pipeline from the previous post, going from the inside (simplest) case outward, is

  1. an empty Stream
  2. or a p3(uri)
  3. or a p2(uri)
  4. or a p1(uri)

One way to picture this is to say given a way to produce an empty Stream we can produce a p3(uri) or an empty Stream. Given a (p3(uri) or an empty Stream) we can produce a (p2(uri) or p3(uri) or an empty Stream). And finally, given a (p2(uri) or p3(uri) or an empty Stream) we can produce a (p1(uri) or p2(uri) or p3(uri) or an empty Stream).

Our base case is Stream() and our inductive step prepends the possibility of a Pipeline instance.

First, a preliminary type alias

object Pipeline{
  type Aux[T, A, B, O] = Pipeline[T, A, B]{type Out = O}
  ...
}

Adding the Aux alias allows us to skip the type refinement each time we need to refer to the type member of our Pipeline trait. It is a nifty trick for "lifting" a type member into a type parameter to get around an annoyance of the language.

Generalizing our inductive Stream Pipeline, we have a base Pipeline instance:

implicit def PNil: Pipeline.Aux[CNil, CNil, CNil, Unit:+:CNil] = {
  new Pipeline[CNil, CNil, CNil]{
    type Out = Unit:+:CNil
    override def apply(uri: URI): Out = Inl(())
  }
}

This basically implies: Given nothing, we can get a meaningless Pipeline instance. This is just a stub or placeholder so that we may begin the inductive process.

Our inductive Pipeline step builds upon a previous step by prepending the possibility of a particular Pipeline:

implicit def inductivePipeline[TH, F[_], AH, BH,
  TT <: Coproduct, AT <: Coproduct, BT <: Coproduct, OT <: Coproduct
](implicit
  head: Pipeline.Aux[TH, AH, BH, Either[Unit, F[Unit]]],
  tail: Pipeline.Aux[TT, AT, BT, OT]
): Pipeline.Aux[TH:+:TT, AH:+:AT, BH:+:BT, F[Unit]:+:OT] = {
  new Pipeline[TH:+:TT, AH:+:AT, BH:+:BT]{
    final override type Out = F[Unit]:+:OT
    final override def apply(uri: URI): Out = {
      head(uri).fold(
        {_ =>
          Inr(tail(uri))
        },
        s => Inl(s)
      )
    }
  }
}

This basically says: given two Pipeline instances, the second of which has type parameters all of which are themselves Coproduct instances, we can produce a Pipeline which has type parameters all of which are Coproducts, whose left case is the first input Pipeline instance's type parameters. This requires unpacking.

Given two Pipeline instances

This is not too surprising: our head and tail.

The second of which has type parameters all of which are themselves Coproduct instances

Our tail is a Pipeline constructed of Coproducts. This tail is the first ingredient for our inductive steps. Recall the base case which is a Pipeline.Aux[CNil, CNil, CNil, Unit:+:CNil]. All of the type parameters here are indeed Coproduct instances. So the first step of our example can take the base case as its tail instance.

We can produce a Pipeline which has type parameters all of which are Coproducts

The return value is a Pipeline with Coproducts for type parameters. This is important because this implies each step of the inductive process can be used as the tail in another step of the inductive process.

Whose left case is the first input Pipeline instance's type parameters

The head becomes the leftmost case in the resulting Pipeline instance.

Using our Inductive Pipeline

Now that we've done all this work, it is time to reap the benefits!

implicit val pipeline1 = Pipeline[One, …]
implicit val pipeline2 = Pipeline[Two, …]
implicit val pipeline3 = Pipeline[Three, …]
val pipeline4: Pipeline.Aux[
  One :+: Two :+: Three :+: CNil, …,
  Stream[Unit] :+: Stream[Unit] :+: Stream[Unit] :+: Unit :+: CNil] = implicitly
pipeline4(uri)

This. Is. Awful...

The type annotations here are brutal. Of course, with more shapeless machinery we can overcome this but a much simpler solution arises when we look first at what we are trying to accomplish.

What we want

implicit val pipeline1 = Pipeline[One, …]
implicit val pipeline2 = Pipeline[Two, …]
implicit val pipeline3 = Pipeline[Three, …]
val pipeline4 = pipeline1 op pipeline2 op pipeline3 op PNil
pipeline4(uri)

If we can develop an operator which can stitch pipelines together, we are finished. No fancy macros. No fancy generics. No fancy anything. Just an operator.

We can accomplish this with an implicit AnyVal class and an operator which ends with a ':' character (so that it is right-associative):

implicit class Ops[TT <: Coproduct,
  AT <: Coproduct, BT <: Coproduct, OT <: Coproduct
](
  val tail: Pipeline.Aux[TT, AT, BT, OT]
) extends AnyVal{
  def +:[TH, F[_], AH, BH](
    head: Pipeline.Aux[TH, AH, BH, Pipeline.Out[F]]
  ): Pipeline.Aux[TH:+:TT,AH:+:AT, BH:+:BT,F[Unit]:+:OT] =
    inductivePipeline[
      TH, F, AH, BH, TT, AT, BT, OT
    ](head, tail)
}

This says the same thing as the inductivePipeline function but it introduces a nice infix operator rather than requiring a lot of complicated machinery to get the compiler to "do it for us". Thus, we have our combined pipeline:

implicit val pipeline1 = Pipeline[One, …]
implicit val pipeline2 = Pipeline[Two, …]
implicit val pipeline3 = Pipeline[Three, …]
val pipeline4 = pipeline1 +: pipeline2 +: pipeline3 +: PNil
pipeline4(uri)

This has reasonable type annotations and, more importantly, is straightforward. The syntax is exactly like the syntax of Seq construction in the Collections library.
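
For comparison, right-associative prepending on a standard Seq reads the same way (just the standard library, nothing Pipeline-specific):

val xs: Seq[Int] = 1 +: 2 +: 3 +: Seq.empty[Int]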

Now we have a scalable pipeline library. It is easy to construct pipelines and merge them. Also, rearranging the order of the pipelines is trivial. There are, however, two things to consider.

1. PNil is a def rather than a val

I am totally lost here. For some reason, the typer cannot unify the types at the implicit conversion call site when PNil is a val. This is easy enough to resolve but perplexing nonetheless.

2. This will not work with spark Dataset

Spark Dataset requires its type parameters to have Encoder context bounds. cats Functor does not put context bounds on its type parameters. There is an easy workaround for this using cats Id and a custom Functor-like type class:

trait BoundedFunctor[F[_], B[_]]{
  def map[U: B, V: B](fu: F[U])(f: U => V): F[V]
}
trait Functor[F[_]] extends BoundedFunctor[F, Id]

Then we can redefine our apply method as

def apply[T: Guard, F[_]: BoundedFunctor, A, B]...

The rest of the code takes on similar changes and we can use Dataset with our pipeline by creating an instance thus:

implicit val sparkFunctor = new BoundedFunctor[Dataset, Encoder]{
  override def map[U: Encoder, V: Encoder](
    fu: Dataset[U])(f: U => V): Dataset[V] = fu.map(f)
}

Abstraction in F[_]: Lift Implementation Details Outside the Class

We now know that we can lift implementation details outside the class to gain some freedom in the client code; we have gained abstraction through decoupling. Why not apply a similar abstraction to the rest of the code?

We had:

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI)(implicit F: Functor[F]): F[Unit] = {
    val in = read(uri)
    val computed = F.map(in)(computation)
    F.map(computed)(write)
  }
  def read(uri: URI): F[A]
  def computation(in: A): B
  def write(in: B): Unit
}

And we can abstract it into:

trait Read[F[_], A] extends Function1[URI, F[A]]
trait Computation[A, B] extends Function1[A, B]
trait Write[B] extends Function1[B, Unit]

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI)(implicit
    F: Functor[F],
    read: Read[F, A],
    computation: Computation[A, B],
    write: Write[B]): F[Unit] = {
    val in = read(uri)
    val computed = F.map(in)(computation)
    F.map(computed)(write)
  }
}

While we're at it, let's abstract out the apply method as well.

sealed trait Pipeline[F[_], A, B]{
  def apply(uri: URI): F[Unit]
}
object Pipeline{
  final def apply[F[_]: Functor, A, B](implicit
    read: Read[F, A],
    computation: Computation[A, B],
    write: Write[B]) = {
    val F: Functor[F] = implicitly
    new Pipeline[F, A, B]{
      override def apply(uri: URI): F[Unit] = {
        val in = read(uri)
        val computed = F.map(in)(computation)
        F.map(computed)(write)
      }
    }
  }
}

Now our trait is a simple expression of inputs to outputs. But why would we ever perform this kind of abstraction? It seems like we are complicating the problem, not making it simpler.


The benefits here are more subtle and only appear in certain use cases. Say you have two pipelines:

val p1: Pipeline[Stream, Int, String] = ???
val p2: Pipeline[Stream, String, Array[Char]] = ???

And you wanted to combine them into a single

val pipeline: Pipeline[Stream, Int, Array[Char]] = ???

Taking two Pipeline instances and composing them is not a readily understood idea. However, taking two Function1 instances and composing them is very well understood. Notice we brought the functions read, computation and write outside the class as simple Function1 instances. Abstracting the Pipeline functions outside the trait provides the developer who is writing the client code with a clear and well understood method for composing multiple Pipelines.
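
As a small sketch (using the Computation trait from above; c1 and c2 are illustrative), composing the computation steps is just ordinary Function1 composition:

val c1 = new Computation[Int, String]{ def apply(i: Int): String = i.toString }
val c2 = new Computation[String, Array[Char]]{ def apply(s: String): Array[Char] = s.toCharArray }

//andThen comes for free because Computation extends Function1
val c3: Int => Array[Char] = c1 andThen c2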

This is still an incomplete implementation. We can see a path forward for composing any number of Pipelines whose computations can be composed, but how do we compose Pipelines that accept different inputs?

A simple switching mechanism

Say we have three Pipeline instances which require separate inputs.

val p1: Pipeline[Stream, ...] = ...
val p2: Pipeline[Stream, ...] = ...
val p3: Pipeline[Stream, ...] = ...

Our application would need to accept a URI and choose which pipeline (if any) should run it.

def perform(uri: URI): Stream[Unit] = {
  if(uri.toString.contains("p1")) p1(uri)
  else if(uri.toString.contains("p2")) p2(uri)
  else if(uri.toString.contains("p3")) p3(uri)
  else Stream()
}

This is a lot of boilerplate. Especially when you consider the number of Pipelines (for any successful business) is expected to increase. Let's unpack what we have and see if we can't abstract it into our Pipeline definition.

  1. Uniform Input: URI is the input to ALL Pipeline instances
  2. Guards checking a URI against some constant value
  3. Constant value defining a Pipeline for use in a Guard
  4. Default case in case the input matches no Pipeline

Our uniform input means we don't have to worry about which Pipeline can take what Types of values. This is already abstract enough.

We'll build a typeclass to model Guards and Constants associated with each pipeline.

trait Guard[-T]{
  def name: String
}
sealed trait Pipeline[-T, A, B]{
  def apply(uri: URI): F[Unit]//F[_] is not in scope here; this definition does not yet compile as written
}
object Pipeline{
  final def apply[T: Guard, F[_]: Functor, A, B](implicit
    read: Read[F, A],
    computation: Computation[A, B],
    write: Write[B]): Pipeline[T, A, B] = {
    val G: Guard[T] = implicitly
    val F: Functor[F] = implicitly
    new Pipeline[T, A, B]{
      override def apply(uri: URI): F[Unit] = ???
    }
  }
}

We have an issue here. The last else case of our function returns an empty Stream. In the Pipeline object we don't know what our effect type is, so we cannot return an empty version thereof. This problem takes me back to a talk given by Runar Bjarnason last year wherein he describes how, when we liberate our types, we constrain our implementation, and when we constrain our types, we liberate our implementation. We have liberated all of our types here (except URI), leaving ourselves no room to implement what we need. So, we need to constrain a type so that we may regain our ability to implement our function. Let's constrain our output type.

 
trait Guard[-T]{
  def name: String
}
sealed trait Pipeline[-T, A, B]{
  type Out
  def apply(uri: URI): Out
}
object Pipeline{
  final def apply[T: Guard, F[_]: Functor, A, B](implicit
    read: Read[F, A],
    computation: Computation[A, B],
    write: Write[B]): Pipeline[T, A, B]{type Out = Either[Unit, F[Unit]]} = {
    val G: Guard[T] = implicitly
    val F: Functor[F] = implicitly
    new Pipeline[T, A, B]{
      type Out = Either[Unit, F[Unit]]
      override def apply(uri: URI): Out = {
        val from = uri.toString
        if(from.contains(G.name)) Right{
          val in = read(uri)
          val computed = F.map(in)(computation)
          F.map(computed)(write)
        } else Left(())
      }
    }
  }
}

So our client code becomes

trait P1
trait P2
trait P3
implicit def guardP1 = new Guard[P1]{
  final override def name: String = "p1"
}
implicit def guardP2 = new Guard[P2]{
  final override def name: String = "p2"
}
implicit def guardP3 = new Guard[P3]{
  final override def name: String = "p3"
}
implicit def p1: Pipeline[P1, Stream, ...]
implicit def p2: Pipeline[P2, Stream, ...]
implicit def p3: Pipeline[P3, Stream, ...]
def perform(uri: URI): Either[Either[Either[Unit, Stream[Unit]], Stream[Unit]], Stream[Unit]] = {
  p1(uri).fold(
    _ => Left(p2(uri).fold(
      _ => Left(p3(uri).fold(
        _ => Left(()),
        a => Right(a)
      )),
      a => Right(a)
    )),
    a => Right(a)
  )
}

This has made things much worse. There is even more boilerplate and the nesting will become unreasonable in short order. But we have something we can easily reason about at the type level:

  • Given a known set of Pipeline instances
  • We created a computation which runs at most one Pipeline
  • The result is a nested data structure

These characteristics indicate we can take an inductive approach to building our Pipeline library. Enter Shapeless.


Abstraction in F[_]: Abstract Your Functions

We have a reasonably abstract pipeline in

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI): F[Unit] =
    write(computation(read(uri)))
  def read(uri: URI): F[A]
  def computation(in: F[A]): F[B]
  def write(in: F[B]): F[Unit]
}

Recognizing Higher-Kinded Duplication

Taking a close look at the trait, we see the computation and write functions are the same aside from their type variables. In fact, if we rename them to have the same name, the compiler complains

scala> :paste
// Entering paste mode (ctrl-D to finish)
trait Pipeline[F[_], A, B]{
  def perform(in: F[A]): F[B]
  def perform(in: F[B]): F[Unit]
}
// Exiting paste mode, now interpreting.
<console>:9: error: double definition:
method perform:(in: F[B])F[Unit] and
method perform:(in: F[A])F[B] at line 8
have same type after erasure: (in: Object)Object
         def perform(in: F[B]): F[Unit]

Since these are the same, we can build an abstraction to simplify our API even further.

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI): F[Unit] = {
    val in = read(uri)
    val computed = convert(in, computation)
    convert(computed, write)
  }
  def read(uri: URI): F[A]
  def computation(in: A): B
  def write(in: B): Unit
  def convert[U, V](in: F[U], f: U => V): F[V]
}

We've removed the need for the developer to understand the effect type in order to reason about a computation or write step. Now, let's focus on this new function

def convert[U, V](in: F[U], f: U => V): F[V]

This is super abstract. Like so abstract it is meaningless without context. I am reminded of this video in which Rob Norris explains how he continued to abstract his database code until some mathematical principles sort of arose from the work. In this talk, he points out that anytime he writes something sufficiently abstract he checks a library for it, as he probably has not himself discovered some new basic principle of mathematics. We do the same here.

Inside the cats library we find the following def within the Functor type class

def map[A, B](fa: F[A])(f: A => B): F[B]

This is the same as our convert function, just written curried rather than with multiple arguments. We replace our custom function with one from a library; the chance is greater that a developer is well-informed about cats than about our internal library (see the post on implicits and type classes below).
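
Concretely, the curried shape of our convert would have been (shown only for comparison with cats' map):

def convert[U, V](in: F[U])(f: U => V): F[V]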

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI)(implicit F: Functor[F]): F[Unit] = {
    val in = read(uri)
    val computed = F.map(in)(computation)
    F.map(computed)(write)
  }
  def read(uri: URI): F[A]
  def computation(in: A): B
  def write(in: B): Unit
}

Here we were able to replace an internal (thus constrained) implementation detail with an external (thus liberated) argument. In other words, we have lifted an implementation detail outside our class giving the client code freedom to use the same instantiated Pipeline in a multitude of contexts.

Abstraction in F[_]: Abstract your Types

A Pipeline of Stream from BigInt to String

Say we have a data pipeline:

trait Pipeline{
  final def apply(uri: URI): Stream[Unit] =
    write(computation(read(uri)))
  def read(uri: URI): Stream[BigInt]
  def computation(in: Stream[BigInt]): Stream[String]
  def write(in: Stream[String]): Stream[Unit]
}

The limitations here are many. The most important limitation is this only works for data pipelines your team can model as streams of an input to a BigInt to a String to an output. This is not very useful. The first step is abstracting over your computation types.

A Pipeline of Stream

Removing the constraint on BigInt and String requires type parameters on our Pipeline trait:

trait Pipeline[A, B]{
  final def apply(uri: URI): Stream[Unit] =
    write(computation(read(uri)))
  def read(uri: URI): Stream[A]
  def computation(in: Stream[A]): Stream[B]
  def write(in: Stream[B]): Stream[Unit]
}

We have gained a bit of freedom in implementation. We can now write Pipelines that can be modeled as streams of an input to an A to a B to an output, for any A and B. Now instead of being constrained to BigInt and String, we have gained some liberty through our abstraction.

Still, we are constrained to the scala Stream type. This, too, is a nuisance: what if we require Pipelines that effect through fs2 Stream or spark Dataset or any other suitable effect? Similar to how we abstracted away from BigInt and String by making them type parameters A and B, we can do the same with our Stream.

A Pipeline

Using a higher-kinded type parameter, we can abstract over any effect assuming the effect has a single type parameter.

trait Pipeline[F[_], A, B]{
  final def apply(uri: URI): F[Unit] =
    write(computation(read(uri)))
  def read(uri: URI): F[A]
  def computation(in: F[A]): F[B]
  def write(in: F[B]): F[Unit]
}

Now, we can make a data Pipeline using any such types! We can have our original Pipeline of Stream from BigInt to String

val pipeline: Pipeline[Stream, BigInt, String] = ???

We can have a Pipeline of fs2 Stream with some type construction:

type MyStream[A] = fs2.Stream[fs2.Task, A]
val pipeline: Pipeline[MyStream, BigInt, String] = ???

We can even do this with spark

val pipeline: Pipeline[Dataset, BigInt, String] = ???

Any Pipeline your team can model as a read effect, a computation and a write effect can be expressed with Pipeline defined this way.

Abstraction in F[_]

I gave a talk at typelevel today about abstracting data pipelines and some ways to ease the use of Spark in purely functional applications. It seemed to go pretty well; here you will find the video, deck (pptx, pdf) and code.

In the coming weeks I'll be posting the long-form version of the talk.

  1. Abstract Your Types
  2. Abstract Your Functions
  3. Lift Implementation Details Outside the Class
  4. Implicits and Induction

Simple Generic Derivation with Shapeless

(Code on GitHub)

Shapeless is a library which makes generic programming in Scala easier. Generic Programming gives you the ability to build abstractions that the language does not directly support.

First things first:

import shapeless._

For the purposes of this post, this is the only import needed.

HList

The HList type provides a way to maintain a collection of items whose types may be different. For example say you have an Int, a String and a Double that you want inside a list. Using a standard Scala List one has:

scala> List(1, "1", 1.0)
res0: List[Any]

The problem is the type information is lost, so when we use this data, we need to cast the appropriate index to the appropriate type. HList is a list which holds type information for each element separately. With the shapeless import, HLists are constructed using the :: operator

scala> 1 :: "1" :: 1.0 :: HNil//very similar syntax to standard List
res1: shapeless.::[Int,shapeless.::[String,shapeless.::[Double,shapeless.HNil]]]

We can see the type information for Int, String and Double are kept as part of the overall type. This implies we can produce concrete types for lists of arbitrary elements. This is the cornerstone of Generic Derivation with Shapeless.

Note: Yes, Scala has perfectly reasonable Product types with tuples and case classes and whatnot; however, for reasons that will become clear later in this post, these types are insufficient for our purposes here.

Typeclass Boilerplate

Let's take a typeclass

trait Foo[Type]{
  def bar(t: Type): String
}

Not super exciting, but it will get the point across. Say we create instances for Int, String and Double:

implicit def fooInt = new Foo[Int]{
  def bar(t: Int): String = t + ": Int"
}
implicit def fooString = new Foo[String]{
  def bar(t: String): String = t + ": String"
}
implicit def fooDouble = new Foo[Double]{
  def bar(t: Double): String = t + ": Double"
}

Now we have instances we can use for all things that are Int or String or Double and we are happy with this for a time. Then we get a request for a Foo instance that can be used for both Int and String, so we produce one using our new friend the HList:

implicit def fooIntString = new Foo[Int :: String :: HNil]{
  def bar(t: Int :: String :: HNil): String = {
    val i :: s :: HNil = t //very similar syntax to standard List
    fooInt.bar(i) + ", " + fooString.bar(s)
  }
}

But, this is coming in from user input on a webservice and users are notoriously inconsistent with the ordering of their arguments. We need one for String then Int as well:

implicit def fooStringInt = new Foo[String :: Int :: HNil]{
  def bar(t: String :: Int :: HNil): String = {
    val s :: i :: HNil = t
    fooString.bar(s) + ", " + fooInt.bar(i)
  }
}

Great! But what about combinations with Double? And what about when we need to support Long or List[String] or any other type? The combinations here blow up quite quickly in any reasonable application. Luckily, Scala and Shapeless give us a few tricks to get rid of all the tedious boilerplate present in any typeclass polymorphic system.

Implicits to the Rescue

All of the typeclass definitions thus far have been declared implicit. There is a fantastic reason for this: Implicits get the Scala compiler to produce boilerplate declarations for us at compile time. We can greatly reduce the amount of boilerplate needed to create typeclass instances for combinations of types. All we have to do is tell the compiler what to do.

Recall, all our HList declarations end with HNil. We need an instance for HNil

implicit def fooNil = new Foo[HNil]{def bar(t: HNil): String = ""}

Recall too, what the type of our HList looked like

scala> 1 :: "1" :: 1.0 :: HNil
res1: shapeless.::[Int,shapeless.::[String,shapeless.::[Double,shapeless.HNil]]]

removing the noise, we can see a pattern of nesting: [Int,[String,[Double,HNil]]]. We know from math that when we have nested syntax, we evaluate from the inside out; this applies for Scala code as well. So logically, we start with an HNil, prepend a Double, prepend a String then prepend an Int.

We have given the compiler implicit typeclass instances for each of these types; all we need is to give it an implicit operation for prepending an instance onto another instance:

implicit def fooPrepend[Head, Tail<:HList](implicit
  head: Foo[Head],
  tail: Foo[Tail]): Foo[Head :: Tail] = new Foo[Head :: Tail]{
  def bar(t: Head :: Tail): String = {
    val hd :: tl = t
    head.bar(hd) + ", " + tail.bar(tl)
  }
}

This says, Given a Foo instance for some type, Head, and some HList, Tail, we can produce a new Foo instance for the HList, Head :: Tail. We can see this work in the REPL:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val a: Foo[Int :: HNil] = implicitly
val b: Foo[String :: HNil] = implicitly
val c: Foo[String :: Int :: HNil] = implicitly
val d: Foo[Double :: String :: Int :: HNil] = implicitly
val e: Foo[Double :: String :: Int :: String :: HNil] = implicitly

// Exiting paste mode, now interpreting.

a: Foo[shapeless.::[Int,shapeless.HNil]]
b: Foo[shapeless.::[String,shapeless.HNil]]
c: Foo[shapeless.::[String,shapeless.::[Int,shapeless.HNil]]]
d: Foo[shapeless.::[Double,shapeless.::[String,shapeless.::[Int,shapeless.HNil]]]]
e: Foo[shapeless.::[Double,shapeless.::[String,shapeless.::[Int,shapeless.::[String,shapeless.HNil]]]]]

There was no need to define by hand Foo instances for any HList type that is a combination of types for which implicit Foo instances exist in scope.
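
To see the derived instances doing work (using the fooInt, fooString, fooDouble, fooNil and fooPrepend definitions from above):

val fooISD: Foo[Int :: String :: Double :: HNil] = implicitly
//prints "1: Int, 1: String, 1.0: Double, " (the trailing separator comes from the HNil case)
println(fooISD.bar(1 :: "1" :: 1.0 :: HNil))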

The Reason for HList

Why not just use any old Product type? HList, unlike Product, has a way to take any HList and combine it with another type to produce another HList. There is no such capability baked into Product.

Recall, the last thing we did was build a Foo instance for a larger HList out of a Foo instance for a smaller HList. This process, started with the smallest possible HList, HNil, and built larger and larger HLists until the required type was produced.

This talk of Products leads us to our next step. Most applications have a lot of Product types and need typeclass instances for these Products. Given that an HList is just a really fancy Product, can we generically derive instances for tuples and case classes?

Note: There is an excellent talk on how to do this with nested Tuples.

Generic

Shapeless provides a typeclass, Generic, which helps convert between a Product and its similar (isomorphic) HList. It is fairly straightforward to produce a Generic instance for a Product and get an HList:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val gen = Generic[(Int, String)]
val asH = gen.to((1, "1"))

// Exiting paste mode, now interpreting.

gen: shapeless.Generic[(Int, String)]
asH: gen.Repr = 1 :: 1 :: HNil

With this, we can take any Product and create a Foo instance for an HList that is similar:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val genIS = Generic[(Int, String)]
val genDS = Generic[(Double, String)]
val a = implicitly[Foo[genIS.Repr]].bar(genIS.to(1, "1"))
val b = implicitly[Foo[genDS.Repr]].bar(genDS.to(1.0, "1"))

// Exiting paste mode, now interpreting.

genIS: shapeless.Generic[(Int, String)]
genDS: shapeless.Generic[(Double, String)]
a: String = 1: Int, 1: String
b: String = "1.0: Double, 1: String, "

This implies given:

  1. a Product, P
  2. a Generic instance for P, gen
  3. and an implicit Foo instance for the HList, gen.Repr

We can produce a result that would be valid for an instance of Foo[P]. All we need to do is find a way to delegate the work Foo[P] needs to do to an instance of Foo[gen.Repr].

Generic Derivation

Like before, we'll use an implicit def so the compiler helps us along.

implicit def fooProduct[P<:Product, H<:HList](implicit
  gen: Generic[P]{type Repr = H},//compiler needs to know Generic converts P to H
  foo: Foo[H]): Foo[P] = {
  new Foo[P]{
    def bar(p: P): String = foo.bar(gen.to(p))
  }
}

This states, Given a Product and an HList, P and H, a Generic instance, gen, which can convert between P and H and a Foo instance for H, we can construct a Foo instance for P. There are two things to note:

  1. We do not need to write implicit instances for Generic by hand like we had to for Foo
  2. The implicit Generic needs a structural bound

The shapeless library provides, out of the box, automatic Generic instances for any Product type and brings these into implicit scope. The catch with Generic is that the type Repr is a member (dependent) type, not a type parameter. In order to make sure the compiler can prove Repr is indeed our parameter H, we need to state it explicitly. And with this, we have our generic derivation for Foo:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val f: Foo[(String, Int)] = implicitly
val g: Foo[(Double, String, Int)] = implicitly
val h: Foo[(Double, String, Int, String)] = implicitly

case class A(i1: Int, i2: Int, s: String)
case class B(d: Double, s: String)
case class C(i1: Int, d1: Double, s: String, d2: Double, i2: Int)
val i: Foo[A] = implicitly
val j: Foo[B] = implicitly
val k: Foo[C] = implicitly

// Exiting paste mode, now interpreting.

f: Foo[(String, Int)]
g: Foo[(Double, String, Int)]
h: Foo[(Double, String, Int, String)]
defined class A
defined class B
defined class C
i: Foo[A]
j: Foo[B]
k: Foo[C]

And that's that! We can derive instances of our type class for any Product, given that each individual type within the Product has an implicit instance defined.
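
As one last usage sketch (reusing case class A and the implicit instances from the session above):

val fooA: Foo[A] = implicitly
//prints "1: Int, 2: Int, three: String, "
println(fooA.bar(A(1, 2, "three")))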

In Sum

Shapeless provides a framework for Generic Programming which can help us remove boilerplate from our applications and derive instances for types that would be too tedious to write out ourselves. To take advantage of the derivation one needs a few parts:

  1. Implicit instances for basic types like Int, String, Double, etc...
  2. An implicit instance for HNil
  3. An implicit def which when given implicit instances for a type and an HList, Head and Tail, can produce an implicit instance for the HList, Head :: Tail.
  4. An implicit def which for a Product, P, when given implicit instances for an HList, H, and a Generic which can convert between P and H, can produce an implicit instance for P

Asynchronicity

(video, slides, code)

I gave a talk at PHASE (PHiladelphia Area Scala Enthusiasts) about Asynchronous programming. Specifically, we covered how to transform a basic linear and sequential computation (factorial) into one which could be processed in parallel. Moreover, we added an extra layer of complexity onto the computation (n choose k) to show how Actors and Futures in Scala can work together to help a developer build an asynchronous system.
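
As a rough flavor of the Futures half of that idea (a minimal sketch, not the code from the talk; parFactorial and its chunking are illustrative): the factorial product can be split into independent chunks whose partial products are computed in parallel and then combined.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

//split 1..n into chunks, take each chunk's product in its own Future, then combine the partial products
def parFactorial(n: Int, chunks: Int = 4): BigInt = {
  val groups   = (1 to n).map(BigInt(_)).grouped(math.max(1, n / chunks)).toList
  val partials = groups.map(g => Future(g.product))
  Await.result(Future.sequence(partials).map(_.product), 1.minute)
}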

What the Hell is an "Effect Type"?

I was reading through the fs2 documentation and user guide and thought to myself, this is really straight forward! And, to their credit, it is. Anyone who is used to FP and Scalaz or Haskell will take to the documentation with little friction. However, when an FP novice or someone from a language with less powerful FP implementations (F#, C#, Java) encounters this documentation, it's a pain to trudge through.

Through many attempts at explaining fs2, I have found the main topic of concern is the "effect type". On its surface, this term seems rather benign:

  1. Everyone knows FP strives to have no side effects
  2. We all know certain things (IO for example) are fundamentally effectful
  3. In order to encode these effects into FP style, we build abstractions
  4. These abstractions for effects are our effect types

So, taking the canonical example from the fs2 documentation

import fs2.{io, text, Task}
import java.nio.file.Paths

def fahrenheitToCelsius(f: Double): Double =
(f - 32.0) * (5.0/9.0)

val converter: Task[Unit] =
  io.file.readAll[Task](Paths.get("testdata/fahrenheit.txt"), 4096)
    .through(text.utf8Decode)
    .through(text.lines)
    .filter(s => !s.trim.isEmpty && !s.startsWith("//"))
    .map(line => fahrenheitToCelsius(line.toDouble).toString)
    .intersperse("\n")
    .through(text.utf8Encode)
    .through(io.file.writeAll(Paths.get("testdata/celsius.txt")))
    .run

// at the end of the universe...
val u: Unit = converter.unsafeRun()

We see a file read, a parse, a transformation then a file write (reading and writing files are side effect heavy). The effect type is Task and the result is Unit. This can be read as

The converter exists to create a Task which reads a file as bytes, converts those bytes to utf-8 Strings, transforms those Strings and writes them back to disk in a separate file, returning no result. The converter is purely effectful.

One can do this with a standard scala.collection.immutable.Stream as well.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.Try

val path = Paths.get("testdata/fahrenheit.txt")
val out = Paths.get("testdata/celsiusStream.txt")
val readerT =
  Try(Files.newBufferedReader(path, StandardCharsets.UTF_8))
val writerT =
  Try(Files.newBufferedWriter(out, StandardCharsets.UTF_8))
val result = for{
  reader <- readerT
  writer <- writerT
}yield{
  Stream.continually{reader.readLine}
    .takeWhile(null != _)
    .filter(s => !s.trim.isEmpty && !s.startsWith("//"))
    .map(line => fahrenheitToCelsius(line.toDouble).toString)
    .flatMap{Stream(_, "\n")}
    .foreach(writer.write)
}
readerT.foreach(_.close())
writerT.foreach(_.close())

So, other than the obvious fs2 io convenience functions, to most of the FP-unindoctrinated it seems the standard Stream version is about as useful and as safe as the fs2 Stream version. However, the fs2 Stream is much better FP practice.

Function Parameters should be Declared

Upon Stream creation, there is a big difference between fs2 and standard. With fs2, the creation mechanism is a pure function

io.file.readAll[Task](Paths.get("testdata/fahrenheit.txt"), 4096)

It takes all of its necessary data as parameters and returns a Stream with effect type Task. On the other hand, the standard Stream requires a closure to be initialized

Stream.continually{reader.readLine}

It requires a by-name parameter, the by-name returns a different result upon each invocation and the body of the by-name depends on the closure within which the function is called. This line of code is impossible to understand by itself; taken out of context, it is meaningless. In other words the function lacks referential transparency.
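
A tiny sketch of what that means in practice (using an in-memory iterator in place of the file reader):

val lines = Iterator("120.0", "65.3", "98.6")
val s = Stream.continually(if (lines.hasNext) lines.next() else null).takeWhile(null != _)
//the meaning of s depends on the current state of lines (a value in the enclosing scope),
//not on the expression alone, so the expression is not referentially transparent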

Function Duties should be Declared

Lines from a file in fs2 are produced with

scala> io.file.readAll[Task](Paths.get("testdata/fahrenheit.txt"), 4096).
     |       through(text.utf8Decode).
     |       through(text.lines)
res3: fs2.Stream[fs2.Task,String] = evalScope(Scope(Free)).flatMap(<function1>)

and with the standard Stream we have

scala> val path = Paths.get("testdata/fahrenheit.txt")
path: java.nio.file.Path = testdata\fahrenheit.txt

scala> val reader = Files.newBufferedReader(path,
 | java.nio.charset.StandardCharsets.UTF_8)
reader: java.io.BufferedReader = java.io.BufferedReader@3aeed31e

scala> Stream.continually{reader.readLine}
res4: scala.collection.immutable.Stream[String] = Stream(120, ?)

There are two important differences here. The first is evaluation; even though standard streams are considered lazy, they evaluate the head value eagerly. We can see fs2 gives us a computation where standard Streams gives us a value. The second difference is the type.

In fs2 we have a type of Stream[Task, String]; standard gives us Stream[String]. The fs2 Stream explicitly expresses the intention for effectful computation, whereas the standard Stream hides this implementation detail. Another way of looking at this is, standard Streams hide their effects where fs2 Streams surface their effects. In fact, fs2 Streams (if effectful) only allow the developer to find its result through the effect type. This gives the developer a sort of heads up about what's going on in the application behind the scenes.

What the Effect Type Represents

The effect type in any such functional library represents the intent to perform an operation outside the scope of the return type. This is very common for IO (as we've seen) and other effectful operations like Logging or showing the user a pop up. The computation usually produces some sort of a result but, before returning the result writes it to disk, logs it or tells the user about a completion state. In FP, we like to express all data in a function through the function definition and effects are just weird data.
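
To make that intent concrete, here is a minimal sketch assuming the old fs2 Task API used above (Task.delay suspends a computation rather than running it):

import fs2.Task

val describe: Task[Unit] = Task.delay(println("hello"))//nothing is printed yet
//...until the end of the universe
val ran: Unit = describe.unsafeRun()//"hello" is printed here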

Implicits and Type Classes for Mortals

Implicits and Type Classes can be fairly hard to nail down for a developer learning Functional Programming. They can seem more like magic than like provably-correct code, and clever compiler operations can exacerbate the issue.

For those of us who have been working with these concepts for a while, implicit scope and type class polymorphism are second nature. They are like an extension of the fingers. An expression of this divide appeared in my twitter feed a few months ago:

This post aims to impart some of this natural feeling to those who are just breaking in. Hopefully, type class code will be easier to work with after reading.

A Crash Course in Implicit Resolution

Implicit scope in scala is insanely complex. Luckily, understanding the subtleties of the compiler's algorithm is not a necessity. We only need to understand a few simple points:

  1. The implicit keyword
  2. Implicit values
  3. Implicit function parameters

The Implicit Keyword

This keyword tells the compiler that a function or value should be searchable in implicit scope. Implicit scope is the set of all implicit declarations in scope.

We'll now build up to using identifiers that are searchable in implicit scope.

Implicit Values

An implicit value is a value defined using the implicit keyword.

implicit val x = 7

Now, the value x is searchable in implicit scope. We'll make use of it later.

Implicit Function Parameters

An implicit function parameter, like an implicit value, is a parameter to a function that is marked with implicit.

def mkList(begin: Int)(implicit end: Int): List[Int] = {
  (begin to end).toList
}
assert(List(1,2,3,4,5,6,7) == mkList(1)(7))

An implicit parameter works just like a parameter without the implicit declaration. So, what's the point?

Implicit parameters do not need to be explicitly written; if a value of the correct type is found in implicit scope, the compiler will automatically pass that value to the function for the developer. The following is valid:

implicit val x = 7
def mkList(begin: Int)(implicit end: Int): List[Int] = {
  (begin to end).toList
}
assert(List(1,2,3,4,5,6,7) == mkList(1))
assert(List(5,6,7) == mkList(5))

So What?

Implicit resolution is very handy for applying contexts to an application. A simple example is logging.

import scala.concurrent._
import ExecutionContext.Implicits.global
case class Logger(location: String){
  def log(message: String) =
    println(s"$message\t$location")
}

def monitoredOperation(a: String, b: Int)(implicit logger: Logger) = {
  logger.log(a)
  logger.log(b.toString)
  logger.log(a + (2 * b))
}

val logA = Logger("logDir/a.log")
val logB = Logger("logDir/b.log")

def performA() = Future{
  implicit val log = logA
  monitoredOperation("A1", 5)
  monitoredOperation("A2", 4)
  monitoredOperation("A3", 3)
  monitoredOperation("A4", 6)
}
def performB() = Future{
  implicit val log = logB
  monitoredOperation("B1", 4)
  monitoredOperation("B2", 6)
  monitoredOperation("B3", 5)
  monitoredOperation("B4", 7)
}

implicit val logMain = Logger("logDir/.log")
for{
  a <- performA()
  b <- performB()
}{
  monitoredOperation(".1", 8)
}

There are three different paths for logging which are necessary at three separate points in the application. The implicit dependency monitoredOperation has on the logger lets the developer define a logging context at an appropriate scope and then code without giving thought to logging. This helps in two ways:

  1. The developer is free to code to the business case without being further burdened by logging
  2. Forces the developer to decouple code which logs to separate locations

The second benefit is illustrated by the following

implicit val logA = Logger("logDir/a.log")
implicit val logB = Logger("logDir/b.log")
monitoredOperation("", 0)//compile error: ambiguous implicit values

Bringing two values of the same type into implicit scope causes the compiler to fail on implicit resolution. The compiler cannot reason about which is the one you want.

Type Classes for Mortals

Type Classes are a technique for purely functional polymorphism. They are a powerful tool for using the scala compiler to help the developer group data types with similar use cases.

Building Type Classes

Say you need to extract lengths from GUIDs for validation; however, the GUIDs are presented in multiple formats:

  1. String
  2. List[Char]
  3. List[String]

For example:

val valid = 32
val id1 = "3F2504E0-4F89-41D3-9A0C-0305E82C3301"
val id2 = List('3', 'F', '2', '5', '0', '4', 'E', '0',
  '4', 'F', '8', '9',
  '4', '1', 'D', '3',
  '9', 'A', '0', 'C',
  '0', '3', '0', '5', 'E', '8', '2', 'C', '3', '3', '0', '1')
val id3 = List("3F2504E0", "4F89", "41D3", "9A0C", "0305E82C3301")
println(valid == id1.size)//false
println(valid == id2.size)//true
println(valid == id3.size)//false

In scala there is a built-in size member for all three of these classes. The issue is, the three values logically represent the same GUID but their size methods return very different results. Most of the time in OO, we would gain access to the classes, create a trait which validates them as GUIDs and have them each implement that trait. This is not possible since we are dealing with standard library classes. Type Classes provide a performant solution to this gap in functionality.

What is a Type Class?

A type class in scala is a trait with a type parameter and at least one abstract method which operates over the parameterized type. In our example it would look like:

trait Guid[T]{
  def isGuid(item: T): Boolean
}

This interface allows us to write a function for validating a GUID given an implementation of the Guid trait.

def guidIsValid[T](item: T, guid: Guid[T]): Boolean = {
  guid.isGuid(item)
}

Now, we just need to implement this trait for our three classes and use them as required. All together we have

val valid = 32
val id1 = "3F2504E0-4F89-41D3-9A0C-0305E82C3301"
val id2 = List('3', 'F', '2', '5', '0', '4', 'E', '0',
  '4', 'F', '8', '9',
  '4', '1', 'D', '3',
  '9', 'A', '0', 'C',
  '0', '3', '0', '5', 'E', '8', '2', 'C', '3', '3', '0', '1')
val id3 = List("3F2504E0", "4F89", "41D3", "9A0C", "0305E82C3301")
//use case

trait Guid[T]{
  def isGuid(item: T): Boolean
}
def guidIsValid[T](item: T, guid: Guid[T]): Boolean = {
  guid.isGuid(item)
}
val stringGuid = new Guid[String]{
  override def isGuid(str: String): Boolean = {
    val stripped = str.filter('-' != _)
    valid == stripped.size
  }
}
val lCharGuid = new Guid[List[Char]]{
  override def isGuid(chars: List[Char]): Boolean = {
    valid == chars.size
  }
}
val lStringGuid = new Guid[List[String]]{
  override def isGuid(strs: List[String]): Boolean = {
    val size = strs.map(_.size).sum
    valid == size
  }
}
guidIsValid(id1, stringGuid)//true
guidIsValid(id2, lCharGuid)//true
guidIsValid(id3, lStringGuid)//true

Here, the compiler makes sure we have a valid implementation of our type class, Guid, for each call. This is nice, but the extra argument to the guidIsValid function makes the syntax more cumbersome than the OO version (the OO version would have a validGuid method on String, List[Char] and List[String]). What we would like is for the last three lines to read

guidIsValid(id1)//true
guidIsValid(id2)//true
guidIsValid(id3)//true

Remember Implicits

Let's redefine our guidIsValid function so we can leverage the implicit functionality baked into the scala compiler to supply our type class argument for us

def guidIsValid[T](item: T)(implicit guid: Guid[T]): Boolean = {
  guid.isGuid(item)
}

Now, if we redefine our type class implementations as implicits we have

trait Guid[T]{
  def isGuid(item: T): Boolean
}
def guidIsValid[T](item: T)(implicit guid: Guid[T]): Boolean = {
  guid.isGuid(item)
}
implicit val stringGuid = new Guid[String]{
  override def isGuid(str: String): Boolean = {
    val stripped = str.filter('-' != _)
    valid == stripped.size
  }
}
implicit val lCharGuid = new Guid[List[Char]]{
  override def isGuid(chars: List[Char]): Boolean = {
    valid == chars.size
  }
}
implicit val lStringGuid = new Guid[List[String]]{
  override def isGuid(strs: List[String]): Boolean = {
    val size = strs.map(_.size).sum
    valid == size
  }
}
guidIsValid(id1)//true
guidIsValid(id2)//true
guidIsValid(id3)//true

Better compiler messages

To make it easier on implementors to reason about compile errors, type class library designers should always use the implicit not found annotation on their type classes.

@annotation.implicitNotFound("${T} has no implicit Guid instance defined in scope.")
trait Guid[T]{
  def isGuid(item: T): Boolean
}

Working With Type Classes

We finally get to the main point of the tweet we referenced at the very beginning of this post: 

how hard this code is ... to work with

First let's pick out just the type class and a single implementation from our code.

val valid = 32
trait Guid[T]{
  def isGuid(item: T): Boolean
}
implicit val stringGuid = new Guid[String]{
  override def isGuid(str: String): Boolean = {
    val stripped = str.filter('-' != _)
    valid == stripped.size
  }
}

val id1 = "3F2504E0-4F89-41D3-9A0C-0305E82C3301"

This is the most basic kind of type class; it's just a type class. The library has no functionality or guidance about how to use it. Most library designers add in a lot of extra implementation to the companion object for the type class. The companion object typically holds

  1. convenience functions for performing operations functionally
  2. something called Ops which wraps values in a context for the type class
  3. an implicit def into the Ops class for quickly performing operations with OO-style

trait Guid[T]{
  def isGuid(item: T): Boolean
}
object Guid{
  def apply[T](item: T)(implicit guid: Guid[T]): Boolean = {
    guid.isGuid(item)
  }
  class GuidOps[T](item: T, guid: Guid[T]){
    def isGuid(): Boolean = guid.isGuid(item)
  }
  implicit def lift[T](item: T)(implicit guid: Guid[T]): GuidOps[T] = {
    new GuidOps[T](item, guid)
  }
}

With this definition in the library, the developer can choose which style to use

import Guid._
implicit val stringGuid = new Guid[String]{
  override def isGuid(str: String): Boolean = {
    val stripped = str.filter('-' != _)
    valid == stripped.size
  }
}
val id1 = "3F2504E0-4F89-41D3-9A0C-0305E82C3301"
Guid(id1)//functional
id1.isGuid()//object oriented

Type classes are simply the FP version of the decorator pattern from OO (this is made clear by the implicit def from T to GuidOps[T]: a direct mapping from the type class for some T to a decorator for T). They take a class and apply new uses to a single instance of that class without affecting other instances of that class. Together with implicits, they seem to augment a class directly (js prototype style), but this appearance is simply an illusion created by the compiler as it decorates functions and objects for the developer.

Type Variance - Why it Matters

The company I work for has a robust Intern program; as a result, I work with a lot of young engineers and computer scientists. To date:

  1. 100% of their resumes mention they have a tight grasp on Object Oriented Programming
  2. 100% of them fail to understand the finer points of subtyping and furthermore subclassing

I have given more explanations of variance than I have given explanations of anything else on the job. So, in an effort to practice the DRY principle in all my affairs, I decided to put it into documentation I can point to.

Note: Type Variance has A LOT of math (type theory & category theory) behind it. This post will focus on its usage in the Scala language not on the math.

Sub Classes

class Foo[T]
def check[A, B](a:A, b:B)(implicit ev: A <:< B): Unit = {}

Here the class Foo is parameterized by T and the check function simply checks if the type of its first argument is a subclass of the type of its second argument. Don't worry about the check function; its implementation details are beyond the scope of this post, but it's a nifty trick!

val str = ""
val obj = new Object()
check(str, obj)//compiles

So, the check here compiles which tells us String is a subclass of Object.

val fStr = new Foo[String]
val fObj = new Foo[Object]
check(fStr, fObj)//error: Cannot prove that zzz.simple.Foo[String] <:< zzz.simple.Foo[Object].

This doesn't compile because the type parameter of Foo allows for no variation in its relationship; Foo is invariant in T.

We will begin by briefly describing variance.

Variance

Variance is a huge part of programming with types. It is an intrinsic property of class hierarchies and can be witnessed as such in languages like C++ and Scala, which have compiler errors along the lines of:

  • covariant whatever in contravariant position
  • whatever is in covariant position but is not a subclass of whatever

In short, type variance describes the types that may be substituted in place of another type.

Covariance: We say a substitution is covariant with a type, Foo, if Foo or any other class with a subclass relationship to Foo is valid.
Contravariance: We say a substitution is contravariant with a type, Bar, if Bar or any other class with a superclass relationship to Bar is valid.
Invariance: We say a substitution is invariant with a type, Foo, if only types that are exactly Foo are valid.

Variance in Practice

We already saw invariance above. In Scala covariance and contravariance are denoted by using the + and - symbols respectively.

Covariance

case class Foo[+T]()//covariant in T

Redeclaring Foo in this way makes it covariant, so our check now compiles

val str = ""
val obj = new Object()
check(str, obj)//compiles
val fStr = new Foo[String]
val fObj = new Foo[Object]
check(fStr, fObj)//compiles

Declaring the type variable with a + (as covariant) tells the compiler that the subclass relationship between type parameters gives rise to a direct subclass relationship in Foo. So any def, val or var requiring a Foo[Object] can take a Foo[String] as an argument in place.
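
For example (needsFooObject is an illustrative definition using the covariant Foo declared just above):

def needsFooObject(f: Foo[Object]): Unit = ()
needsFooObject(new Foo[String]())//accepted because Foo[String] <: Foo[Object]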

Contravariance

case class Foo[-T]()//contravariant in T

This redeclaration makes Foo contravariant and breaks our test again

val str = ""
val obj = new Object()
check(str, obj)//compiles
val fStr = new Foo[String]
val fObj = new Foo[Object]
check(fStr, fObj)//error: Cannot prove that zzz.simple.Foo[String] <:< zzz.simple.Foo[Object].

This is what we expect! Contravariance implies a superclass relationship not a subclass relationship. We can fix this by reversing our input arguments

check(fObj, fStr)//compiles

This declaration is a hint to the compiler that the subclass relationship between type parameters gives rise to a superclass relationship in Foo. So any def, val or var requiring a Foo[String] can take a Foo[Object].
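
And the mirror image for the contravariant case (needsFooString is again illustrative):

def needsFooString(f: Foo[String]): Unit = ()
needsFooString(new Foo[Object]())//accepted because Foo[Object] <: Foo[String] when Foo is contravariant in T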

How to use Variance

Where covariance preserves the subclass relationship from the type parameter into the type, contravariance reverses this relationship.

Covariance is used a lot in Scala by the collections library. Most of the immutable collections are covariant. This makes working with your data types inside the collection the same as working with them outside the collection when writing interfaces.

Contravariance is less prominent. I use contravariance for typeclasses a lot.

trait Bar[-T]{
  def bar(t:T): Unit
}
implicit val bar = new Bar[Object]{def bar(o:Object): Unit = ()}
def procBar[T:Bar](t: T){
  implicitly[Bar[T]].bar(t)
}
procBar(obj)//compiles
procBar(str)//pulled the superclass instance in

If a type does not have a typeclass instance of its own in implicit scope, it can use a contravariant instance defined for one of its supertypes, if one is in scope.


Crashing F#

There's been a lot of buzz about using F# on the CLR for functional programming. This week, I decided to take the language for a spin.

The Basics

F# has a wonderfully light and succinct syntax. It uses the same keyword, let, for both values and functions.

let str = "Hello"
let func a = a + str 
let another = func " World!"
printfn "%s" another

Also, discriminated unions (sum types) are super simple

type SumType =
  |First
  |Second
  |Third of string * int
  |Fourth of a:string * b:int * c:double

Anytime I try out a new language I try to explore five things:

  1. Tail recursive functions
  2. Functor
  3. Monad
  4. Free Monad
  5. Pattern Matching

F#, as far as I can tell after using it for a total of about 10 hours, can be used for 1-3 and 5 but not 4. Here is a brief synopsis of my trials.

Tail Recursion

This was extremely simple in F#. It's two steps:

  1. Mark a function as recursive using the rec keyword
  2. Put the recursive step as the last operation

module Recursion
    //need a keyword for recursion, otherwise boilerplate free
    //very nice
    let rec check many acc =
        if(0 = many) then acc
        else
            let a = many - 1
            let b = acc + many
            check a b

The Recursion.check function uses the accumulator pattern to add int values from 1 to many to the initial accumulator value acc.

Functor & Monad

This took some work! F# lacks higher kinded polymorphism. My first attempt at this was:

type Functor<'A> =
    abstract map: 'A<'B> -> ('B -> 'C) -> 'A<'C>

Which produced a compile time error: "Type parameter cannot be used as type constructor." Ouch!!!

Luckily, F# type inference is good enough that, with a little boilerplate, we don't really need type classes at all. Here is a Functor & Monad example:

//cannot do higher kinds so no type classes
//do not need higher kinds; type syntax takes care of it
module CategoryList
    type Functor<'Type>() =
        member this.map(l: list<'Type>)(f: 'Type -> 'Return) =
            List.map f l
    type Monad<'Type>(func: Functor<'Type>) =
        member this.point(t: 'Type) = [t]
        member this.flatMap(l: list<'Type>)(f: 'Type -> list<'Return>) =
            List.concat(func.map l f)

Defining a Functor and Monad instance is simple. We don't need a type hierarchy here; we only need 3 functions.

module Category
    let check() =
        //instantiate monad
        let func = CategoryList.Functor<int>()
        let monad = CategoryList.Monad(func)

        //define monadic chain functions
        let flatMap l f = monad.flatMap f l
        let map l f = func.map f l

        //define helpers
        let mapper (x:int) = [1 .. x]
        let mapperStr (x:int) = x.ToString()

        let hundred = [1 .. 100]//data

        //(|>) language level monad support
        let hundredMore = (hundred
                           |> flatMap mapper)
        printfn "%A" hundredMore

        let bigString = (hundredMore
                         |> flatMap mapper
                         |> map mapperStr)
        printfn "%A" bigString

The key here is the |> operator (also known as pipe). By piping a list into a flatMap into another flatMap and so on, we can get language level monad support.

Free Monad

Here I have failed. The closest I've gotten is:

module Free
    type FreeMonad<'F, 'A> =
        |Pure of 'A
        |Suspend of 'F

    type FreeMonad<'F, 'A> with
        member this.fold fpure fsuspend =
            match this with
            |Pure(a) -> fpure(a)
            |Suspend(free) -> fsuspend(free)

    let inline liftF(t:'F)(func: 'T): FreeMonad<'F, 'A>
        when ^T: (member map: (^F -> (^A -> ^AA) -> ^FF)) = Suspend(t)
    let inline point(a: 'A)(func: 'T): FreeMonad<'F, 'A>
        when ^T: (member map: (^F -> (^A -> ^AA) -> ^FF)) = Pure(a)

    let check(): Unit =
        let func = CategoryList.Functor<int>()
        let lst: list<int> = [1..10]
        let monad:FreeMonad<list<int>, int> = liftF(lst)(func)
        ()

I could never get the liftF function to work out. The combination of type parameters never type checks against the parameterized types AA and FF. The compiler just cannot understand the second layer of type inference that needs to happen for the map method to be recognized.

Pattern Matching

Pattern matching is pretty straightforward. You can do direct matching

let func a =
  match a with
  |1 -> "one"
  |2 -> "two"
  |3 | 5 | 10 | 30 -> "large"
  |_ -> "else"

with guards

let func a =
    match a with
    |value when 0 > value -> "negative"
    |value when 0 < value -> "positive"
    |_ -> "zero"

type matching

let func (a: obj) =
    match a with
    | :? string -> "string"
    |_ -> "not"

discriminated unions

type Union =
    |Simple
    |Pair of int * int
    |Triple of int * int * int
let func a =
    match a with
    |Simple -> "simple"
    |Pair(a, b) -> "(" + a.ToString() + "," + b.ToString() + ")"
    |Triple(a, b, c) -> "(" + a.ToString() + "," + b.ToString() + "," + c.ToString() + ")"

That's most of pattern matching; by combining these patterns you can get some pretty serious control flow moving through your application.