Efficient Computing at Carnegie Mellon

A list of recent open-source research systems developed by the research group of Professor David G. Andersen and Dr. Michael Kaminsky in the Carnegie Mellon University Computer Science Department.

Data structures

Feed-Forward Bloom Filter

Memory efficient and cache-optimized algorithm for simultaneously searching for a large number of patterns in a very large corpus [SIAM ALENEX 2011]

Reference implementation

Rank & Select Structures on Uncompressed Bit Sequences

Rank & select structures that use a cache-centric design approach to achieve performance competitive with the highest-performance prior designs while imposing space overhead as low as the most space-efficient, but slower, prior designs [SEA 2013]

Reference implementation

Cuckoo Filter

Data structure that can replace Bloom filters for approximate set membership tests, supporting dynamic item addition and removal items while achieving even higher performance than Bloom filters [CoNEXT 2014]

Reference implementation

libcuckoo

High-throughput and memory-efficient concurrent hash table that supports multiple readers and writers [EuroSys 2014]

Version presented in paper

Updated implementation

SuRF (Succinct Range Filter)

Fast and compact data structure for approximate membership tests that supports both single-key lookups and common range queries [SIGMOD 2018]

SuRF data structure

Live interactive demo (source code)

Underlying FST data structure

RocksDB equipped with SuRF

HOPE (High-speed Order-Preserving Encoder)

Fast dictionary-based compressor that encodes arbitrary keys while preserving their order [SIGMOD 2020]

Reference implementation

Distributed consensus

EPaxos (Egalitarian Paxos)

Distributed consensus algorithm based on Paxos that efficiently achieves optimal commit latency in the wide-area when tolerating one and two failures, uniform load balancing across all replicas, and graceful performance degradation when replicas are slow or crash [SOSP 2013]

Reference implementation

Paxos Quorum Leases

Technique that allows Paxos-based systems to perform reads with high throughput and low latency without sacrificing consistency or write latency or accepting more than minimal impact on system availability [CMU PDL 2014]

Reference implementation

Isolation

CATBench

Benchmark demonstrating that cache partitioning can protect the performance of latency-sensitive networked applications from local resource contention [CMU CSD 2017]

Benchmarks in paper

Microservice Microbenchmarks

Benchmark demonstrating a likely explanation for contemporary serverless platforms' high microservice invocation latencies in the hundreds of milliseconds for cold starts and tens of milliseconds for warm starts [ATC 2018]

Benchmarks in paper

Lightweight Preemptible Functions

Program primitives for limiting function execution time and isolating memory within a process

Thesis

CMU CSD 2022

Conference paper

Mechanism for synchronously performing a function call with a precise timeout that is lightweight, efficient, and composable, all while being portable between programming languages [ATC 2020]

libinger implementation

libturquoise (preemptive tokio-threadpool)

libgotcha implementation

libas-safe library

libas-safe example

hyper benchmark from paper

libpng benchmark from paper

Key-value stores

MemC3

Set of architecturally and workload inspired algorithmic and engineering improvements to the popular Memcached system that substantially improve both its memory efficiency and throughput [NSDI 2013]

MemC3 implementation

HERD

Key-value system designed to make the best use of an RDMA network by reducing network round trips while using efficient RDMA primitives [SIGCOMM 14]

HERD implementation

MICA

Scalable in-memory key-value store that achieves a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server, 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims [NSDI 2014, ISCA 2015]

Version presented in NSDI paper

Version presented in ISCA paper

Updated implementation (MICA 2)

Cicada

Single-node multi-core in-memory transactional database with serializability that provides high performance under diverse workloads by leveraging optimistic and multi-version concurrency control schemes and multiple loosely synchronized clocks [SIGMOD 2017]

Cicada implementation

Benchmarks in paper

Memory and storage

Building Blocks for Main Memory Data Stores

Systems design proposals for consistent, durable, and safe memory management for future byte-addressable non-volatile (NV) memory [CMU PDL 2011]

Reference implementations

Evaluation of Multi-Stage Log-Structured Designs

New analytic primitives and MSLS design models that quickly give accurate performance estimates for evaluation of systems such as LevelDB, RocksDB, HBase, and Cassandra [FAST 2016]

Benchmarks in paper

Networking

CuckooSwitch

Software-based Ethernet switch design built around a memory-efficient, high-performance, and highly-concurrent hash table for compact and fast FIB lookup [CoNEXT 2013]

CuckooSwitch implementation

G-Opt

Exploration of a hypothesis that many of the benefits of using Graphics Processing Units (GPUs) as accelerators for software-based routing and packet handling applications arise less from the GPU hardware itself as from the expression of the problem in a language such as CUDA or OpenCL that facilitates memory latency hiding and vectorization through massive concurrency [NSDI 2015]

Benchmarks in paper

Design Guidelines for High Performance RDMA Systems

Framework for understanding RDMA performance and helping system designers to navigate the RDMA design space [ATC 2016]

Benchmarks in paper

FaSST (Fast, Scalable and Simple distributed Transactions)

RDMA-based system that provides distributed in-memory transactions with serializability and durability, yet eschews one-sided RDMA for fast RPCs using two-sided unreliable datagrams [OSDI 2016]

Benchmarks in paper