Saved: 2026/4/4 09:04:05
In modern CPUs, I've heard there are so-called SIMD instructions, i.e. one op on multiple data. Say I want to add 4 ints to another 4 ints in a vector or array; instead of doing it one by one, they can be passed together to one register with one instruction. In C/C++, how can I do this? I know there must be compiler support, since the target machine must support such instructions in its CPU model, and the compiling machine may be different from the running machine, so this involves both the build side and the run side of the toolchain. At minimum, what instruction set or pseudo-assembly can do this, or can I do it in a high-level language like C/C++?
Okay, let's break down how you can leverage SIMD (Single Instruction, Multiple Data) instructions in C/C++. You've got the right idea – it's about performing the same operation on multiple data elements simultaneously for significant performance gains, especially in areas like multimedia processing, scientific computing, and cryptography.
Here are the primary ways to achieve this in C/C++:
Compiler Auto-Vectorization:
Concept: Modern compilers (like GCC, Clang, MSVC, Intel C++ Compiler) are quite smart. When you enable optimizations (e.g., -O2, -O3, or specific vectorization flags like -ftree-vectorize in GCC or -fvectorize in Clang), the compiler analyzes your code, particularly loops, to see if it can automatically rewrite parts of it using SIMD instructions without you explicitly telling it to.
How: You write standard C/C++ code, often focusing on simple loops that process arrays sequentially.
void add_arrays(int *a, int *b, int *result, int size) {
    for (int i = 0; i < size; ++i) {
        result[i] = a[i] + b[i]; // Compiler might vectorize this loop
    }
}
Pros: Easiest method, requires no platform-specific code, potentially portable (compiler handles the details).
Cons: Not guaranteed – the compiler might fail to vectorize if the loop is too complex, has dependencies, non-aligned memory access (sometimes), or uses function calls that aren't vectorizable. Performance can vary significantly between compilers, versions, and optimization flags. Less direct control.
Build/Run & Target CPU: You tell the compiler the target architecture it should optimize for using flags.
GCC/Clang: -march=native (optimizes for the machine you're compiling on), -march=x86-64-v3 (targets CPUs with AVX/AVX2), -mavx2, -msse4.2 (enable specific instruction sets).
MSVC: /arch:AVX2, /arch:AVX512.
When cross-compiling (building on one machine to run on another), you must specify the target architecture features (e.g., -mavx2) rather than using -march=native.
Intrinsics:
Concept: This is the most direct way to use SIMD from C/C++. Compilers provide special functions (intrinsics) that map almost directly to specific SIMD assembly instructions. You explicitly tell the compiler which SIMD operation to perform on which data.
How: You include specific header files and use special data types and functions.
For x86 (SSE, AVX, AVX-512): Include <immintrin.h> (a common header that includes others like <xmmintrin.h>, <emmintrin.h>, <pmmintrin.h>, <smmintrin.h>, <nmmintrin.h>, <avxintrin.h>, <avx2intrin.h>, <avx512fintrin.h>, etc.).
Data types: __m128i (128 bits, often 4x 32-bit ints or 8x 16-bit shorts, etc.), __m256i (256 bits, AVX/AVX2), __m512i (512 bits, AVX-512). Also float/double versions: __m128, __m256, __m512.
Functions (example for adding 4x 32-bit integers using SSE/AVX):
_mm_loadu_si128((__m128i*) &a[i]): Load 128 bits (4 ints) from memory (unaligned access). _mm_load_si128 for aligned.
_mm_add_epi32(vec_a, vec_b): Add packed 32-bit integers.
_mm_storeu_si128((__m128i*) &result[i], vec_result): Store 128 bits back to memory (unaligned). _mm_store_si128 for aligned.
For ARM (NEON): Include <arm_neon.h>.
Data types: int32x4_t (4x 32-bit integers), float32x4_t, etc.
Functions (example for adding 4x 32-bit integers):
vld1q_s32(&a[i]): Load quad-word (128 bits, 4 ints).
vaddq_s32(vec_a, vec_b): Add quad-word signed 32-bit integers.
vst1q_s32(&result[i], vec_result): Store quad-word.
Example (x86 SSE/AVX - adding 4 ints):

#include <immintrin.h> // For x86 SSE/AVX

// Assume arrays are aligned or use unaligned loads/stores
void add_4_ints_simd(const int* a, const int* b, int* result) {
    // Load 4 ints from a and b into 128-bit registers
    // Use _mm_load_si128 if you KNOW data is 16-byte aligned
    __m128i vec_a = _mm_loadu_si128((const __m128i*)a);
    __m128i vec_b = _mm_loadu_si128((const __m128i*)b);

    // Add the packed 32-bit integers (4 additions in one instruction)
    __m128i vec_result = _mm_add_epi32(vec_a, vec_b);

    // Store the 4 resulting ints back to memory
    // Use _mm_store_si128 if result is 16-byte aligned
    _mm_storeu_si128((__m128i*)result, vec_result);
}

int main() {
    alignas(16) int array_a[4] = {1, 2, 3, 4}; // alignas might help compiler/loads
    alignas(16) int array_b[4] = {10, 20, 30, 40};
    alignas(16) int array_result[4];

    add_4_ints_simd(array_a, array_b, array_result);
    // array_result will contain {11, 22, 33, 44}
    // ... print results ...
    return 0;
}
Pros: Explicit control over SIMD usage, potential for maximum performance, allows targeting specific instructions.
Cons: Code becomes platform-specific (x86 intrinsics are different from ARM NEON intrinsics), less readable, requires understanding the specific instruction set capabilities and limitations (latency, throughput, alignment requirements).
Build/Run & Target CPU: Similar to auto-vectorization, you need compiler flags (-mavx2, /arch:AVX2, etc.) to tell the compiler it's okay to use these instructions. GCC and Clang will refuse to compile AVX2 intrinsics unless AVX2 code generation is enabled (MSVC is more permissive at compile time, but executing the instruction on a CPU without AVX2 still crashes with an illegal-instruction fault).
SIMD Libraries (Abstraction Layers):
Concept: Various libraries provide C++ wrappers around intrinsics, often using operator overloading and templates to give you a more "C++-like" way to write SIMD code that can potentially be portable across different architectures (x86, ARM).
How: Include the library header, use its vector types and overloaded operators.
// Pseudo-code example using a hypothetical library
#include <SomeSimdLibrary.h>

void add_arrays_lib(float *a, float *b, float *result, int size) {
    using Vec = SomeSimdLibrary::vector<float>; // Vector of floats
    int vec_size = Vec::size(); // How many floats fit in one vector (e.g., 4 for SSE, 8 for AVX)

    for (int i = 0; i < size; i += vec_size) {
        Vec vec_a(&a[i]);               // Load from memory
        Vec vec_b(&b[i]);               // Load from memory
        Vec vec_result = vec_a + vec_b; // Overloaded operator+ uses SIMD intrinsics
        vec_result.store(&result[i]);   // Store back to memory
    }
    // Handle remaining elements if size is not a multiple of vec_size
}
Examples: the Vc library, Highway (from Google), std::experimental::simd (an evolving standard; compiler support varies), and libraries like Eigen (which use SIMD internally for matrix/vector ops).
Pros: More readable and potentially more portable than raw intrinsics, leverages C++ features like templates and operator overloading.
Cons: Adds an external dependency, might have slightly more overhead than hand-tuned intrinsics (usually minor), learning curve for the specific library API.
Build/Run & Target CPU: The library itself often handles detecting the CPU features or requires you to set compiler flags that it uses internally.
Summary & Choice:
Easiest: Rely on auto-vectorization. Write clean, simple loops and enable compiler optimizations (-O3 -march=native or specific target flags). Check vectorization reports (-fopt-info-vec in GCC, -Rpass=loop-vectorize in Clang) to see if it worked.
Max Performance/Control: Use intrinsics. This requires learning the specific instruction sets (SSE, AVX, NEON) and writing platform-dependent code but gives you the most direct control.
Balance: Use a SIMD library. Offers a C++ abstraction over intrinsics, improving readability and portability at the cost of a dependency.
Remember the build vs. run distinction: always compile with flags appropriate for the target machine where the code will eventually run.
What are BLAS and OpenBLAS? Are they libraries for parallel computation?
Okay, let's clarify BLAS and OpenBLAS. They are related but distinct concepts, and yes, parallelism is a key feature of implementations like OpenBLAS.
1. BLAS (Basic Linear Algebra Subprograms)
What it is: BLAS is not a software library itself, but rather a specification or an Application Programming Interface (API) standard. It defines a set of low-level, fundamental routines for performing common linear algebra operations.
Purpose: The goal of BLAS is to provide a standard interface for basic vector and matrix operations so that:
Higher-level numerical libraries (like LAPACK, NumPy, SciPy, MATLAB, R) can be built on top of them.
Hardware vendors and performance experts can create highly optimized implementations of these core routines tailored to specific CPU architectures (like Intel, AMD, ARM) or even GPUs, without requiring changes to the higher-level libraries that call the BLAS routines.
Levels: The BLAS specification is divided into three levels:
Level 1: Vector-vector operations (e.g., dot product, vector addition, scaling: y = αx + y). O(n) complexity.
Level 2: Matrix-vector operations (e.g., matrix-vector multiplication: y = αAx + βy). O(n²) complexity.
Level 3: Matrix-matrix operations (e.g., matrix-matrix multiplication: C = αAB + βC). O(n³) complexity. Level 3 operations are particularly important for performance because they offer the most opportunity for data reuse (cache efficiency) and parallelism.
Parallelism: The BLAS specification itself doesn't strictly mandate parallelism. It just defines what operations should do. However, performance is the primary reason for its existence, so high-quality implementations of BLAS are almost always parallel.
2. OpenBLAS
What it is: OpenBLAS is a specific, popular, open-source software library that provides an implementation of the BLAS API (and also LAPACK, which builds upon BLAS for more complex operations like solving linear systems or eigenvalue problems).
Purpose: OpenBLAS aims to be a highly optimized implementation of BLAS/LAPACK for various modern CPU architectures (including Intel Sandy Bridge, Haswell, Skylake, AMD Zen, various ARM architectures, etc.).
Optimization: It achieves high performance through several techniques:
SIMD Instructions: Extensive use of architecture-specific SIMD instructions (like SSE, AVX, AVX2, AVX-512 on x86; NEON on ARM) to perform multiple calculations per instruction cycle (connecting back to your previous question!).
Cache Optimization: Careful management of data movement to maximize the use of CPU caches.
Multi-threading (Parallelism): This directly answers your question. OpenBLAS is designed for parallelism. It automatically uses multiple threads (typically based on the number of CPU cores available) to execute computationally intensive routines (especially Level 3 operations) in parallel, significantly speeding them up on multi-core processors. This is shared-memory parallelism – using multiple cores on the same machine.
Usage: When you install scientific computing packages like NumPy or SciPy, they often link against an optimized BLAS library like OpenBLAS (or Intel MKL, ATLAS, Accelerate Framework on macOS) behind the scenes to ensure that numerical operations run fast.
In Summary:
BLAS: A standard API for basic linear algebra operations.
OpenBLAS: A high-performance, open-source implementation of the BLAS (and LAPACK) standard.
Parallelism: While the BLAS standard doesn't require it, optimized implementations like OpenBLAS heavily rely on multi-threading (parallelism), along with SIMD instructions and cache optimization, to achieve their speed on modern multi-core CPUs.
So, yes, OpenBLAS is effectively a library for (among other optimization techniques) parallel execution of linear algebra routines on shared-memory systems (multi-core CPUs).
CMake reports this: "CMake Warning: Manually-specified variables were not used by the project:"
Okay, that CMake warning:
CMake Warning: Manually-specified variables were not used by the project:
VARIABLE_NAME_1
VARIABLE_NAME_2
...
means exactly what it says: You provided one or more variables to CMake during the configuration step (likely using the -D flag on the command line, e.g., cmake -DMY_OPTION=ON .., or through the CMake GUI), but the project's CMakeLists.txt files never actually referenced or used those specific variables during the configuration process.
Why it happens and what to check:
Typo in Variable Name: This is the most common cause. You might have misspelled the variable name when passing it via -D or in the GUI. CMake variable names are case-sensitive.
Action: Carefully double-check the spelling and casing of the variable name you provided against the name the project actually expects. Look in the project's documentation (README, INSTALL files) or search its CMakeLists.txt files for the intended variable name.
Incorrect Variable Name: You might be using a variable name that doesn't actually exist or isn't recognized by this specific project's build scripts. Perhaps you're thinking of a variable used in a different project or an older version of this one.
Action: Consult the project's documentation to find the correct variable names for the options you want to set. Common examples are CMAKE_INSTALL_PREFIX, CMAKE_BUILD_TYPE, BUILD_SHARED_LIBS, BUILD_TESTING, etc., but projects define many custom ones using the option() command.
Variable Used Conditionally (and condition is false): The variable might only be referenced inside an if(...) ... endif() block in the CMakeLists.txt, and the condition for that block is currently false. Therefore, the code path that uses your variable is skipped.
Action: Examine the CMakeLists.txt files to see if the variable is used conditionally. Understand why that condition might be false. Perhaps you need to enable another option first.
Variable Used in an Excluded Subdirectory/Component: The variable might be specific to an optional component or subdirectory of the project that isn't currently being included or built (perhaps controlled by another CMake option).
Action: Check if the variable pertains to an optional part of the build and ensure that part is enabled if you need it.
Obsolete Variable: The variable might have been used in a previous version of the project but has since been removed or renamed.
Action: Check the project's documentation or recent changelogs for information about deprecated or changed CMake options.
Variable Meant for Underlying Tool: In rare cases, a variable might be intended for a tool called by CMake rather than CMake itself, but this is less common for typical -D options.
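The typo and option() cases above can be seen in a minimal sketch (MY_FEATURE is a hypothetical option name). If the project contains a declaration like this, cmake -DMY_FEATURE=ON .. is "used" and no warning appears; if the option() line is missing or the name is spelled differently, the warning is emitted:

```cmake
cmake_minimum_required(VERSION 3.16)
project(demo)

# Declaring the cache variable means a matching -DMY_FEATURE=ON is consumed.
# Remove this line (or misspell the name on the command line) and CMake warns
# that the manually-specified variable was not used.
option(MY_FEATURE "Build the optional feature" OFF)

if(MY_FEATURE)
    message(STATUS "MY_FEATURE enabled")
    # add_subdirectory(feature)  # hypothetical optional component
endif()
```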
Is it a problem?
It's a warning, not necessarily an error.
If you intended for that variable to have an effect: Yes, it's a problem. The warning indicates your configuration attempt for that specific setting failed because the project ignored it. The build might proceed, but it won't be configured the way you wanted (e.g., a feature you tried to enable with -DMY_FEATURE=ON won't be built if MY_FEATURE wasn't used).
If you passed the variable by mistake or it's truly obsolete: No, it's not functionally a problem, just noise. The build will proceed without using that specific variable. However, it's good practice to clean up your CMake command line or script to remove unused variables to avoid confusion.
How to Fix:
Identify the variable(s) listed in the warning.
Verify the correct spelling, case, and existence of the variable in the project's documentation or CMakeLists.txt.
Correct any typos or use the proper variable name.
Remove the -DVAR_NAME=VALUE argument from your cmake command if the variable is unnecessary or obsolete.
Investigate conditions or dependencies if the variable is used conditionally or in optional components.