Cool projects that are worth exploring. Contact me if you’re interested.
Need some quotes? Get inspired by Richard Feynman!
- “What I cannot create, I do not understand.”
- “You’re unlikely to discover something new without a lot of practice on old stuff.”
1. Understand auto-vectorization in Arrow
Background
Vectorized execution processes multiple data elements with a single instruction (SIMD).
Manually vectorized code is typically challenging to implement and maintain. Consequently, arrow-rs has removed all hand-written SIMD and now relies on LLVM to auto-vectorize its scalar code.
jayzhan211 noted that not all code is auto-vectorized, while tustvold elaborated on common conditions for auto-vectorization.
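To make this concrete, here is a minimal sketch of the kind of scalar reduction LLVM's loop vectorizer typically turns into SIMD instructions when optimizations are enabled (the function is illustrative, not an arrow-rs kernel):

```rust
/// Plain scalar reduction: no intrinsics, no `unsafe`. Compiled with
/// optimizations (`--release`), LLVM's loop vectorizer usually lowers this
/// loop to packed SIMD additions (e.g. AVX2 on x86_64, NEON on aarch64).
pub fn sum_i64(values: &[i64]) -> i64 {
    let mut acc = 0i64;
    for &v in values {
        // Wrapping arithmetic avoids the overflow-check branch that is
        // emitted with `overflow-checks = true` and can block vectorization.
        acc = acc.wrapping_add(v);
    }
    acc
}
```

Whether LLVM actually vectorized the loop can only be confirmed by inspecting the release-mode assembly, for example with cargo-asm or the Compiler Explorer, which is exactly what the first goal below asks for.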
Project Goals
- Understand what is being auto-vectorized:
  - Manually inspect the generated assembly with the cargo-asm tool, focusing on SIMD-friendly operations like `sum`, `find`, `min`/`max`, etc.
  - List common coding mistakes that prevent LLVM from auto-vectorizing a loop (see the first sketch after this list).
- Understand the benefits of auto-vectorization:
  - Write manually vectorized code and compare its performance against LLVM's auto-vectorized output (see the benchmark sketch after this list).
  - Evaluate performance on Intel x86, AMD x86, Apple M chips, and other ARM chips on cloud providers.
- Improve the situation:
  - Share findings with the world in a blog post or paper.
  - (Extended goal) Develop a tool to detect common mistakes.
  - (Extended goal) Develop a tool to highlight auto-vectorized regions of code, similar to code coverage.
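To illustrate the kind of coding mistake the first goal is hunting for, here is a hedged, hypothetical example (not taken from arrow-rs): an integer reduction that LLVM may reorder and therefore vectorize, next to a single-accumulator `f32` reduction that it generally will not, because floating-point addition is not associative and vectorizing would change the rounding order.

```rust
/// Usually auto-vectorized: integer addition is associative, so LLVM is free
/// to sum the lanes in any order.
pub fn sum_i32(values: &[i32]) -> i32 {
    values.iter().sum()
}

/// Usually *not* auto-vectorized: a single accumulator fixes a strict
/// left-to-right evaluation order that a SIMD reduction would change.
pub fn sum_f32_ordered(values: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &v in values {
        acc += v;
    }
    acc
}

/// A common workaround: several independent accumulators hand LLVM the
/// reordering freedom it needs (at the cost of slightly different rounding).
pub fn sum_f32_unrolled(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let chunks = values.chunks_exact(8);
    let remainder = chunks.remainder();
    for chunk in chunks {
        for i in 0..8 {
            acc[i] += chunk[i];
        }
    }
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}
```

The multi-accumulator version trades the sequential sum's exact rounding for the reordering freedom SIMD needs, which is exactly the kind of trade-off worth documenting.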
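For the performance comparison and the per-architecture measurements, a minimal Criterion harness along these lines would be enough (Criterion is the de-facto Rust benchmarking crate; the bench file and kernel names here are illustrative):

```rust
// Hypothetical bench file, e.g. benches/sum.rs, comparing the auto-vectorized
// scalar kernel against a hand-written one. Running it on Intel/AMD x86 and
// ARM machines yields the per-architecture numbers the project asks for.
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let data: Vec<i32> = (0..1_000_000).collect();

    c.bench_function("sum_i32 (auto-vectorized)", |b| {
        b.iter(|| black_box(&data).iter().sum::<i32>())
    });

    // Swap in the manually vectorized kernel here once it exists, e.g.:
    // c.bench_function("sum_i32 (intrinsics)", |b| b.iter(|| sum_i32_avx2(black_box(&data))));
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```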
Related readings:
- https://github.com/apache/arrow-rs/pull/6554
- https://github.com/apache/datafusion/pull/12809#discussion_r1802871921
Related side-project: implement AVX512 encoding for FSST in Rust, as a way to understand what it takes to write SIMD intrinsics.
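To get a feel for what hand-written intrinsics involve before tackling FSST/AVX512, here is a hedged sketch using the stable AVX2 intrinsics from `std::arch` (the function names are made up for illustration, and the real FSST work is considerably more involved):

```rust
// A hedged sketch (not the FSST or arrow-rs code): summing `i32`s with
// explicit AVX2 intrinsics, to contrast with letting LLVM auto-vectorize the
// scalar loop. x86_64-only; other targets fall back to the scalar path.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_i32_avx2(values: &[i32]) -> i32 {
    use std::arch::x86_64::*;

    let chunks = values.chunks_exact(8);
    let remainder = chunks.remainder();
    let mut acc = _mm256_setzero_si256();

    for chunk in chunks {
        // Unaligned load of 8 lanes, then a lane-wise add into the accumulator.
        let lanes = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        acc = _mm256_add_epi32(acc, lanes);
    }

    // Horizontal reduction: spill the vector register and sum its 8 lanes.
    let mut spilled = [0i32; 8];
    _mm256_storeu_si256(spilled.as_mut_ptr() as *mut __m256i, acc);
    spilled.iter().sum::<i32>() + remainder.iter().sum::<i32>()
}

pub fn sum_i32_simd(values: &[i32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just verified at runtime.
            return unsafe { sum_i32_avx2(values) };
        }
    }
    values.iter().sum()
}
```

Contrasting this against the one-line scalar version makes the maintenance cost of intrinsics (runtime feature detection, `unsafe`, per-architecture fallbacks) very tangible.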