DTypes & Devices: Choose Your Weapons¶
Module 1 | Lesson 3
Professor Torchenstein's Grand Directive¶
Mwahahaha! You've sliced, fused, and reshaped tensors with the skill of a master surgeon! You can command their form, but what of their soul? What of their very essence?
Today, we delve deeper! We shall master the two most fundamental properties of any tensor: its data type (dtype), which determines its precision and power, and its device, the very dimension it inhabits—be it the humble CPU or the roaring, incandescent GPU! Choose your weapons wisely, for these choices dictate the speed, precision, and ultimate success of your grand experiments!

Your Mission Briefing¶
By the end of this electrifying session, you will have mastered the arcane arts of:
- 🔬 Understanding the soul of a neural network: the floating-point `dtype`.
- ⚖️ Analyzing the critical trade-offs between `float64`, `float32`, `float16`, and `bfloat16`.
- ✨ Transmuting floats to balance precision, range, and performance for training and inference.
- ⚡ Teleporting your tensors to the most powerful `device` (CPU, GPU, MPS) to unleash their speed.
- ⚠️ Diagnosing the catastrophic errors that arise from floating-point overflow and mismatched devices.
Previously in the Lab... (A Quick Recap)¶
In our last experiment, we mastered Tensor Metamorphosis, transforming tensor shapes with reshape, view, squeeze, and unsqueeze. We learned that a tensor's shape is merely an illusion—a view into a contiguous block of 1D memory.
Now that you command a tensor's external form, we shall master its internal essence. The journey continues!
The Alchemist's Arsenal - Mastering Data Types (dtype)¶
Behold, apprentice! Not all tensors are forged from the same ethereal stuff. The very essence of a tensor—its .dtype—determines what kind of numbers it can hold, its precision in the arcane arts of mathematics, and the amount of precious memory it consumes!
A wise choice of dtype can mean the difference between a lightning-fast model and a sluggish, memory-guzzling behemoth. Let us inspect the primary weapons in our arsenal!
Checking the soul of your tensor¶
We will summon tensors of different dtypes, transmute them, and witness the performance implications firsthand!
To perform these miracles, you must master two key tools:
- The `.dtype` attribute: a tensor's inherent property that reveals its data type. You can't change it directly, but you can inspect it to understand your tensor's essence.
- The `.to()` method: your transmutation spell! A powerful and versatile method that not only changes a tensor's `dtype` but can also teleport it to a different `device` at the same time!
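A quick taste of both tools together (a minimal sketch; the tensor name `potion` is just for illustration). Note that `.to()` returns a new tensor rather than mutating the original:

```python
import torch

# Summon a tensor; with float inputs, PyTorch defaults to float32
potion = torch.tensor([1.5, 2.5, 3.5])
print(potion.dtype)        # torch.float32

# .to() transmutes the dtype and returns a NEW tensor;
# the original is left untouched
half_potion = potion.to(torch.float16)
print(half_potion.dtype)   # torch.float16
print(potion.dtype)        # still torch.float32
```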
Floating-Point Types (The Elixirs of Learning)¶
The very lifeblood of neural networks! These are essential for representing real numbers, calculating gradients, and enabling your models to learn.
- `torch.float32` (`torch.float`): The 32-bit workhorse. This is the default `dtype` for a reason: it offers a fantastic balance between precision and performance. Most of your initial experiments will thrive on this reliable elixir.
- `torch.float64` (`torch.double`): 64-bit, for when you require the utmost, surgical precision. Its use in deep learning is rare, as it doubles memory usage and can slow down computations, but for certain scientific calculations, it is indispensable. A powerful tool, but often overkill for our purposes!
- `torch.float16` (`torch.half`): A 16-bit potion for speed and memory efficiency. Halving the precision can dramatically accelerate training on modern GPUs and cut your memory footprint in half! But beware: its limited range can sometimes lead to numerical instability.
- `torch.bfloat16`: The new favorite in the high council of AI! Also 16-bit, but with a crucial difference from `float16`. It sacrifices some precision to maintain the same dynamic range as `float32`, making it far more stable for training large models like Transformers.
import torch
# A single number for our comparison
pi_number = torch.pi
# Summoning tensors of different float dtypes
tensor_fp64 = torch.tensor(pi_number, dtype=torch.float64)
tensor_fp32 = torch.tensor(pi_number, dtype=torch.float32)
tensor_fp16 = torch.tensor(pi_number, dtype=torch.float16)
tensor_bf16 = torch.tensor(pi_number, dtype=torch.bfloat16)
print("--- Floating-Point digits after decimal point and Memory Footprints ---")
print(f"{tensor_fp64.dtype}: {tensor_fp64.item():.10f} | Memory: {tensor_fp64.element_size()} bytes ")
print(f"{tensor_fp32.dtype}: {tensor_fp32.item():.10f} | Memory: {tensor_fp32.element_size()} bytes")
print(f"{tensor_fp16.dtype}: {tensor_fp16.item():.10f} | Memory: {tensor_fp16.element_size()} bytes (Half the size of fp32!)")
print(f"{tensor_bf16.dtype}: {tensor_bf16.item():.10f} | Memory: {tensor_bf16.element_size()} bytes (Half the size of fp32!)")
--- Floating-Point digits after decimal point and Memory Footprints ---
torch.float64: 3.1415926536 | Memory: 8 bytes
torch.float32: 3.1415927410 | Memory: 4 bytes
torch.float16: 3.1406250000 | Memory: 2 bytes (Half the size of fp32!)
torch.bfloat16: 3.1406250000 | Memory: 2 bytes (Half the size of fp32!)
The Limits of floats: torch.finfo Spell¶
How, you ask, can a master alchemist know the precise limits of their elixirs? You need not guess! PyTorch provides a powerful incantation for this very purpose: torch.finfo.
This spell reveals the deepest secrets of any floating-point dtype:
| Attribute | Description |
|---|---|
| `bits` | The number of bits of memory the dtype occupies (e.g., 16, 32, 64). |
| `min` / `max` | The smallest and largest numbers that can be represented. Exceeding this causes overflow (`inf`)! |
| `eps` | Epsilon: the smallest possible difference between 1.0 and the next representable number. This is a pure measure of precision around the number 1. A smaller `eps` means higher precision. |
| `tiny` | The smallest positive normal number that can be represented. Numbers much smaller than this are lost to the void (rounded to zero)! |
| `resolution` | The approximate number of decimal digits of precision you can trust. |
Let us now cast this spell and gaze upon the true nature of our floating-point dtypes!
# details about floats
print("--- Details about floats ---")
# Let's print out torch.finfo attributes in a beautiful aligned table
def print_finfo(dtype):
    finfo = torch.finfo(dtype)
    print(f"{str(dtype):<14} | {finfo.bits:<4} | {finfo.eps:<10.4e} | {finfo.tiny:<12.4e} | {finfo.min:<12.4e} | {finfo.max:<12.4e}")

print(f"{'dtype':<14} | {'Bits':<4} | {'Epsilon':<10} | {'Tiny':<12} | {'Min':<12} | {'Max':<12}")
print("-" * 80)
for dtype in [torch.float64, torch.float32, torch.float16, torch.bfloat16]:
    print_finfo(dtype)
--- Details about floats ---
dtype          | Bits | Epsilon    | Tiny         | Min          | Max
--------------------------------------------------------------------------------
torch.float64  | 64   | 2.2204e-16 | 2.2251e-308  | -1.7977e+308 | 1.7977e+308
torch.float32  | 32   | 1.1921e-07 | 1.1755e-38   | -3.4028e+38  | 3.4028e+38
torch.float16  | 16   | 9.7656e-04 | 6.1035e-05   | -6.5504e+04  | 6.5504e+04
torch.bfloat16 | 16   | 7.8125e-03 | 1.1755e-38   | -3.3895e+38  | 3.3895e+38
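The table also reveals the other edge of each realm: values far below `tiny` simply vanish. A small illustrative experiment (not part of the lesson's own code; the value 1e-8 was chosen because it falls below even `float16`'s smallest subnormal number, roughly 6e-8):

```python
import torch

# 1e-8 is comfortably representable in float32 (tiny is ~1.18e-38),
# but it falls below float16's entire representable range,
# so it is flushed to zero when cast down.
speck = 1e-8
as_fp32 = torch.tensor(speck, dtype=torch.float32)
as_fp16 = torch.tensor(speck, dtype=torch.float16)

print(as_fp32.item())  # a tiny but nonzero number
print(as_fp16.item())  # 0.0: lost to the void!
```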
Let's test the range of floats and see what happens when we push the limits. What is the maximum number that can be represented in each dtype, and what happens when we go beyond it?
# A large number for our comparison
large_number = 70000.0
tensor_fp64 = torch.tensor(large_number, dtype=torch.float64)
tensor_fp32 = torch.tensor(large_number, dtype=torch.float32)
tensor_fp16 = torch.tensor(large_number, dtype=torch.float16)
tensor_bf16 = torch.tensor(large_number, dtype=torch.bfloat16)
print("--- Floating-Point Higher range, precision loss ---")
print(f"{tensor_fp64.dtype}: {tensor_fp64.item():.10f} ")
print(f"{tensor_fp32.dtype}: {tensor_fp32.item():.10f} ")
print(f"{tensor_fp16.dtype}: {tensor_fp16.item():.10f} ")
print(f"{tensor_bf16.dtype}: {tensor_bf16.item():.10f} ")
--- Floating-Point Higher range, precision loss ---
torch.float64: 70000.0000000000
torch.float32: 70000.0000000000
torch.float16: inf
torch.bfloat16: 70144.0000000000
The Curious Case of bfloat16 and the Number 70,144¶
Mwahahaha! Apprentice, you have sharp eyes! You witnessed a strange transmutation: our number 70000.0 became 70144.0 when cast to bfloat16. Is this a bug? A flaw in our alchemy? No! This is a profound secret about the very fabric of digital reality!
To understand this, we must journey into the heart of the machine and see how it stores floating-point numbers.
The Blueprint of a Float: Scientific Notation in Binary¶
Every floating-point number in your computer's memory is stored like a secret formula with three parts:
- The Sign (S): A single bit (0 for positive, 1 for negative).
- The Exponent (E): A set of bits that represent the number's magnitude or range, like the `10^x` part in scientific notation.
- The Mantissa (M): A set of bits that represent the actual digits of the number, i.e. its precision.
The number is roughly reconstructed as: `(-1)^S * 1.M * 2^(E - bias)`.
The mantissa is the key here. It's a binary fraction that always starts with an implicit `1.`, followed by a sum of fractional powers of 2: `1.M = 1 + m1/2 + m2/4 + m3/8 + ... + mk/2^k`, where each `mi` is 0 or 1 and `k` is the number of mantissa bits (23 for `float32`).
The Meaning of "Precision" for floating-point numbers¶
When we say bfloat16 has "less precision" than float32, we don't mean fewer decimal places in the way humans think. We mean it has fewer bits in its mantissa.
- `float32`: 1 sign bit, 8 exponent bits, 23 mantissa bits.
- `float16`: 1 sign bit, 5 exponent bits, 10 mantissa bits. More mantissa bits means finer spacing (more precision), but the narrow exponent means a much smaller range (min to max).
- `bfloat16`: 1 sign bit, 8 exponent bits, 7 mantissa bits. Fewer mantissa bits means coarser spacing (less precision) than `float16`, but the full 8-bit exponent gives it the same range as `float32`.
This means bfloat16 can only represent a much smaller, coarser set of numbers between any two powers of two. For small numbers (like 3.14), the representable values are very close together. But for large numbers, the "gaps" between representable values become huge!
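We can put numbers on those "gaps" directly (a small sketch; the spacing between adjacent representable values near a magnitude of `2^n` is `finfo.eps * 2^n`):

```python
import torch

# The spacing between adjacent representable numbers grows with magnitude.
# Near 1.0 the gap is finfo.eps; near 2^16 (~65536) it is eps * 2^16.
eps_bf16 = torch.finfo(torch.bfloat16).eps   # 2^-7 = 0.0078125
gap_near_one = eps_bf16
gap_near_65536 = eps_bf16 * 2**16            # a gap of 512!

print(gap_near_one, gap_near_65536)

# Adding something smaller than half the gap changes nothing:
big = torch.tensor(70144.0, dtype=torch.bfloat16)
nudged = big + torch.tensor(100.0, dtype=torch.bfloat16)
print(nudged.item())  # still 70144.0
```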
Detailed explanation: Why 70,144?¶
The number 70000 is simply not one of the numbers that can be perfectly formed with bfloat16's limited 7-bit mantissa at that large exponent range.
Let's write the number 70,000 in binary: 1 0001 0001 0111 0000.
To put it in binary scientific notation, we normalize it to start with `1.` by moving the binary point 16 places to the left (a factor of 2^16):
$$ 1. \underbrace{0001000101110000}_{\text{16 binary digits}} \times 2^{16} $$
The mantissa (the part after the 1.) is where the precision limit strikes.
A float32 has 23 bits for its mantissa. It can easily store those binary digits with room to spare. The number 70,000 is stored perfectly.
- Sign bit: 0
- Mantissa bits (23): 0001000101110000 0000000
- Exponent bits: 00010000 (2^16; bias omitted for simplicity)
A bfloat16 only has 7 bits for its mantissa:
- Sign bit: 0
- Mantissa bits (7): the first 7 digits, 0001000, rounded up to 0001001
- Exponent bits: 00010000 (2^16; bias omitted for simplicity)

Reconstructing the stored value: 1*2^16 + (0/2 + 0/4 + 0/8 + 1/16 + 0/32 + 0/64 + 1/128)*2^16 = 65536 + 4096 + 512 = 70144.
bfloat16 must take that 16-digit binary sequence and round it to fit into just 7 bits. This forces a loss of information, even for a whole number.
- Original mantissa: 0001000101110000
- `bfloat16` capacity: can only store the first 7 digits: 0001000.
- Rounding: it checks the 8th digit (1) and, following rounding rules, rounds the 7-bit number up. The new mantissa becomes 0001001.
So, bfloat16 ends up storing the number as $1.0001001 \times 2^{16}$.
Think of it like trying to measure 70,000 millimeters with a ruler that only has markings every 512 millimeters. You can't land on 70,000 exactly. You must choose the closest mark.
The two closest "marks" that bfloat16 can represent in that range are:
- 69,632 (stored mantissa 0001000)
- 70,144 (stored mantissa 0001001)
Since 70,000 is closer to 70,144, the transmutation spell rounds it to that value. It is not an error, but the result of sacrificing precision to maintain the vast numerical range of float32. This robustness is exactly why it is the preferred elixir for training colossal neural networks! You have witnessed the fundamental trade-off of modern AI hardware!
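We can re-derive the whole story in a few lines (a hand check of the arithmetic above, not a PyTorch bit-inspection API):

```python
import torch

# The rounded mantissa 0001001 equals 9/128, so the stored value is
# (1 + 9/128) * 2^16
reconstructed = (1 + 9 / 128) * 2**16
print(reconstructed)             # 70144.0

# PyTorch agrees when we cast 70000 down to bfloat16:
cast_down = torch.tensor(70000.0).to(torch.bfloat16)
print(cast_down.item())          # 70144.0

# And the neighboring "ruler marks" really are 512 apart at this magnitude:
below = (1 + 8 / 128) * 2**16    # 69632.0
print(reconstructed - below)     # 512.0
```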
Why does float16 behave so strangely around the number 65504?¶
Armed with this knowledge, try to figure out what happened with the following numbers:
- Why does `float16` behave so strangely around 65504? Why is 65504 + 1.0 not `inf`?
- Why is 65504 + 1.0 in `bfloat16` equal to 65504 + 20.0 in `bfloat16`?
almost_max_number = 65500.0
nearest_max_number = 65500.0 + torch.pi
print(f"----\nTesting {almost_max_number} + {torch.pi} = {nearest_max_number} < 65504 (max for float16)")
tensor_fp64 = torch.tensor(nearest_max_number, dtype=torch.float64)
tensor_fp32 = torch.tensor(nearest_max_number, dtype=torch.float32)
tensor_fp16 = torch.tensor(nearest_max_number, dtype=torch.float16)
tensor_bf16 = torch.tensor(nearest_max_number, dtype=torch.bfloat16)
print(f"{tensor_fp64.dtype}: {tensor_fp64.item():.10f} ")
print(f"{tensor_fp32.dtype}: {tensor_fp32.item():.10f} ")
print(f"{tensor_fp16.dtype}: {tensor_fp16.item():.10f} - in range of float16, but precision is lost")
print(f"{tensor_bf16.dtype}: {tensor_bf16.item():.10f} - in range of bfloat16, precision is lost more than float16")
float16_max = 65504.0
add_to_max = 1.0
overflow_number = float16_max + add_to_max
print(f"----\nTesting {float16_max} + {add_to_max} = {overflow_number} > 65504 (max for float16) it should overflow?")
tensor_fp16 = torch.tensor(overflow_number, dtype=torch.float16)
tensor_bf16 = torch.tensor(overflow_number, dtype=torch.bfloat16)
print(f"{tensor_fp16.dtype}: {tensor_fp16.item():.10f} - could you explain why it is not inf?")
print(f"{tensor_bf16.dtype}: {tensor_bf16.item():.10f} - in range of bfloat16, precision is lost more than float16")
add_to_max = 16.0
overflow_number = float16_max + add_to_max
print(f"----\nTesting {float16_max} + {add_to_max} = {overflow_number} > 65504 (max for float16) it should overflow?")
tensor_fp16 = torch.tensor(overflow_number, dtype=torch.float16)
tensor_bf16 = torch.tensor(overflow_number, dtype=torch.bfloat16)
print(f"{tensor_fp16.dtype}: {tensor_fp16.item():.10f} - could you explain why this is inf?")
print(f"{tensor_bf16.dtype}: {tensor_bf16.item():.10f} - in range of bfloat16, precision is lost more than float16")
----
Testing 65500.0 + 3.141592653589793 = 65503.14159265359 < 65504 (max for float16)
torch.float64: 65503.1415926536
torch.float32: 65503.1406250000
torch.float16: 65504.0000000000 - in range of float16, but precision is lost
torch.bfloat16: 65536.0000000000 - in range of bfloat16, precision is lost more than float16
----
Testing 65504.0 + 1.0 = 65505.0 > 65504 (max for float16) it should overflow?
torch.float16: 65504.0000000000 - could you explain why it is not inf?
torch.bfloat16: 65536.0000000000 - in range of bfloat16, precision is lost more than float16
----
Testing 65504.0 + 16.0 = 65520.0 > 65504 (max for float16) it should overflow?
torch.float16: inf - could you explain why this is inf?
torch.bfloat16: 65536.0000000000 - in range of bfloat16, precision is lost more than float16
Why bfloat16 is Better for Transformers: A Tale of Range and Rebellion¶
Now for a secret that separates the masters from the mere dabblers! Both float16 and bfloat16 use 16 bits, but they do so with diabolically different strategies. Understanding this is key to training modern marvels like Transformers!
The world of float32 is a stable, predictable realm. But it is slow and memory-hungry! When we attempt to accelerate our dark arts with float16, we encounter a terrible problem: The Tyranny of a Tiny Range.
The Peril of float16: An Unstable Concoction¶
float16 dedicates more bits to its mantissa (precision), but starves its exponent (range). Its numerical world is small, spanning from roughly 6.1 x 10^-5 to 65,504. Anything outside this narrow window becomes an inf (infinity) or vanishes to zero.
During the chaotic process of training a massive Transformer, values can fluctuate wildly. This is where the tyranny of float16 strikes hardest:
- Exploding Gradients: Imagine a scenario deep within your network where a series of large gradients are multiplied. Even with normalization, an intermediate calculation can easily exceed 65,504. For instance, the Adam optimizer tracks the variance of gradients (the `v` term), which can grow very large. If this value overflows to `inf`, the weight update becomes `NaN` (Not a Number), and your entire training process collapses into a fiery numerical singularity!
- Vanishing Activations: Inside a Transformer, attention scores are passed through a Softmax function. If the input values (logits) are very large negative numbers, the resulting probabilities can become smaller than `float16`'s minimum representable value. They are rounded down to zero, and that part of your model stops learning entirely!
To combat this, alchemists of old used a crude technique called loss scaling: manually multiplying the loss to keep gradients within float16's safe range. It is a messy, unreliable hack!
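The core idea of loss scaling can be shown in miniature (a conceptual toy, not the real machinery; in practice you would reach for `torch.cuda.amp.GradScaler` rather than scale by hand):

```python
import torch

# A gradient-sized value below float16's floor (~6e-8) vanishes outright:
tiny_grad = 1e-8
vanished = torch.tensor(tiny_grad, dtype=torch.float16)
print(vanished.item())                # 0.0: the gradient is lost

# Scaling the loss by a power of two shifts gradients back into range...
scale = 2.0**16
scaled = torch.tensor(tiny_grad * scale, dtype=torch.float16)
print(scaled.item())                  # survives in float16

# ...and we unscale in float32 before applying the weight update.
unscaled = scaled.to(torch.float32) / scale
print(unscaled.item())                # back near 1e-8
```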
The bfloat16 Rebellion: Sacrificing Precision for Power!¶
The great minds at Google Brain, in their quest for ultimate power, forged a new weapon: the Brain Floating-Point Format, or bfloat16! They looked at the chaos of float16 and made a brilliant, rebellious choice.
They designed bfloat16 to have the same number of exponent bits as float32 (8 bits). This gives it the exact same colossal dynamic range, spanning from 1.18 x 10^-38 to 3.4 x 10^38. It can represent gargantuan numbers and infinitesimally small ones without breaking a sweat.
The price? It has fewer mantissa bits (7 bits) than float16 (10 bits), giving it less precision. But here is the profound secret, backed by countless experiments in the deepest labs: neural networks are incredibly resilient to low precision.
Why do the inaccuracies not hurt?
- Stochastic Nature of Training: We train models using stochastic gradient descent on mini-batches of data. This process is inherently noisy! The tiny inaccuracies introduced by `bfloat16`'s rounding are like a single drop of rain in a hurricane: statistically insignificant compared to the noise already present in the training process.
- Error Accumulation is Not Catastrophic: As researchers from The Hardware Lottery blog and other deep learning practitioners have noted, the errors from low precision tend to average out over millions of updates. The network's learning direction isn't meaningfully altered. The gradient still points downhill, even if the path is a little wobblier.
"For the volatile, chaotic world of deep learning, a vast and stable range is far more important than surgical precision."
— Prof. Torchenstein
The Transformer's Elixir of Choice¶
For training and fine-tuning, bfloat16 is the undisputed champion, the elixir that fuels the titans of AI.
- Training Stability: Its `float32`-like range means no more exploding gradients in optimizer states or vanishing activations in softmax. You can throw away the clumsy crutch of loss scaling.
- Memory Efficiency: Like `float16`, it cuts your model's memory footprint in half compared to `float32`. This allows you to train larger models or use larger batch sizes, accelerating your path to discovery.
- Hardware Acceleration: It is natively supported on the most powerful instruments in any modern laboratory: Google TPUs and NVIDIA's latest GPUs (Ampere architecture and newer, like the A100 or RTX 30/40 series).
The Rogues' Gallery: Who Uses bfloat16?
The most powerful creations of our time were forged in the fires of bfloat16. Giants like Google's T5 and BERT, Meta's Llama 2, the Falcon models, and many more rely on bfloat16 for stable and efficient training.
The Verdict for Your Lab:
- For Training & Fine-Tuning: `bfloat16` is your weapon of choice. It is the modern standard for a reason.
- For Inference: `float16` is often perfectly acceptable. After a model is trained, the range of values it processes is more predictable, making `float16`'s higher precision and wider hardware support a safe and efficient option.
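Putting the verdict into practice is the same `.to()` spell, applied to a whole model (a minimal sketch with a single `nn.Linear`; real models cast the same way):

```python
import torch
import torch.nn as nn

# Parameters default to float32...
model = nn.Linear(4, 2)
print(model.weight.dtype)     # torch.float32

# ...and .to() transmutes every parameter, halving the memory footprint
model = model.to(torch.bfloat16)
print(model.weight.dtype)     # torch.bfloat16

# Inputs must match the model's dtype, or the forward pass will complain
x = torch.randn(3, 4, dtype=torch.bfloat16)
y = model(x)
print(y.dtype)                # torch.bfloat16
```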
Integer Types (The Counting Stones)¶
For when you need to count, index, or represent discrete information like image pixel values or class labels.
- `torch.int64` (`torch.long`): The 64-bit grandmaster of integers. This is the default for indexing operations and is crucial for embedding layers, where you need to look up values from a large vocabulary.
- `torch.int32` (`torch.int`): A solid 32-bit integer, perfectly suitable for most counting tasks.
- `torch.uint8`: An 8-bit unsigned integer, representing values from 0 to 255. The undisputed king for storing image data, where each pixel in an RGB channel has a value in this exact range!
Now for the counting stones—the integer types. Their purpose is not precision, but to hold whole numbers. Observe their varying sizes.
# Summoning tensors of different integer dtypes
tensor_i64 = torch.tensor(1000, dtype=torch.int64)
tensor_i32 = torch.tensor(1000, dtype=torch.int32)
tensor_i16 = torch.tensor(1000, dtype=torch.int16)
tensor_i8 = torch.tensor(100, dtype=torch.int8)
tensor_ui8 = torch.tensor(255, dtype=torch.uint8)
print("--- Integer Memory Footprints ---")
print(f"torch.int64: Memory: {tensor_i64.element_size()} bytes")
print(f"torch.int32: Memory: {tensor_i32.element_size()} bytes")
print(f"torch.int16: Memory: {tensor_i16.element_size()} bytes")
print(f"torch.int8: Memory: {tensor_i8.element_size()} bytes")
print(f"torch.uint8: Memory: {tensor_ui8.element_size()} bytes")
--- Integer Memory Footprints ---
torch.int64: Memory: 8 bytes
torch.int32: Memory: 4 bytes
torch.int16: Memory: 2 bytes
torch.int8: Memory: 1 bytes
torch.uint8: Memory: 1 bytes
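One caution the counting stones deserve: unlike floats, which overflow to `inf`, PyTorch's integer arithmetic wraps around modulo `2^bits`. A small experiment (not part of the lesson's own code) makes the hard edges visible:

```python
import torch

# uint8 holds 0..255; pushing past the top wraps back around
pixel = torch.tensor(250, dtype=torch.uint8)
brighter = pixel + 10          # 260 does not fit in 0..255
print(brighter.item())         # wraps around to 4

# int8 wraps too, from +127 straight to -128
small = torch.tensor(127, dtype=torch.int8)
print((small + 1).item())      # -128
```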
The Transmutation Spell: Witnessing the Effects of Casting¶
Now that you understand the properties of each dtype, witness what happens when we perform the transmutation! Casting from a higher precision dtype to a lower one is a lossy operation. You gain speed and save memory, but at the cost of precision!
Observe the fate of our high-precision number as we cast it down the alchemical ladder. Compare the precision loss of each dtype, especially float16 and bfloat16 (sometimes they match, sometimes they don't).
# Our original, high-precision tensor
pi_fp64 = torch.tensor(3.141592653589793, dtype=torch.float64)
for i in range(12):
    pi_2power = pi_fp64**(i+1)
    print("--------------------------------")
    print(f"pi^{i+1}: {pi_2power.item():.10f} float64")
    # Cast it down
    pi_fp32 = pi_2power.to(torch.float32)
    print(f"Casted to float32: {pi_fp32.item():.10f} (Precision lost!)")
    pi_fp16 = pi_2power.to(torch.float16)
    print(f"Casted to float16: {pi_fp16.item():.10f} (More precision lost!)")
    pi_bf16 = pi_2power.to(torch.bfloat16)
    print(f"Casted to bfloat16: {pi_bf16.item():.10f} (More precision loss!)")
    # Casting floats to integers truncates the decimal part entirely!
    integer_pi = pi_2power.to(torch.int)
    print(f"Casted to integer: {integer_pi.item()} (Decimal part vanished!)")
--------------------------------
pi^1: 3.1415926536 float64
Casted to float32: 3.1415927410 (Precision lost!)
Casted to float16: 3.1406250000 (More precision lost!)
Casted to bfloat16: 3.1406250000 (More precision loss!)
Casted to integer: 3 (Decimal part vanished!)
--------------------------------
pi^2: 9.8696044011 float64
Casted to float32: 9.8696041107 (Precision lost!)
Casted to float16: 9.8671875000 (More precision lost!)
Casted to bfloat16: 9.8750000000 (More precision loss!)
Casted to integer: 9 (Decimal part vanished!)
--------------------------------
pi^3: 31.0062766803 float64
Casted to float32: 31.0062770844 (Precision lost!)
Casted to float16: 31.0000000000 (More precision lost!)
Casted to bfloat16: 31.0000000000 (More precision loss!)
Casted to integer: 31 (Decimal part vanished!)
--------------------------------
pi^4: 97.4090910340 float64
Casted to float32: 97.4090881348 (Precision lost!)
Casted to float16: 97.4375000000 (More precision lost!)
Casted to bfloat16: 97.5000000000 (More precision loss!)
Casted to integer: 97 (Decimal part vanished!)
--------------------------------
pi^5: 306.0196847853 float64
Casted to float32: 306.0196838379 (Precision lost!)
Casted to float16: 306.0000000000 (More precision lost!)
Casted to bfloat16: 306.0000000000 (More precision loss!)
Casted to integer: 306 (Decimal part vanished!)
--------------------------------
pi^6: 961.3891935753 float64
Casted to float32: 961.3892211914 (Precision lost!)
Casted to float16: 961.5000000000 (More precision lost!)
Casted to bfloat16: 960.0000000000 (More precision loss!)
Casted to integer: 961 (Decimal part vanished!)
--------------------------------
pi^7: 3020.2932277768 float64
Casted to float32: 3020.2932128906 (Precision lost!)
Casted to float16: 3020.0000000000 (More precision lost!)
Casted to bfloat16: 3024.0000000000 (More precision loss!)
Casted to integer: 3020 (Decimal part vanished!)
--------------------------------
pi^8: 9488.5310160706 float64
Casted to float32: 9488.5312500000 (Precision lost!)
Casted to float16: 9488.0000000000 (More precision lost!)
Casted to bfloat16: 9472.0000000000 (More precision loss!)
Casted to integer: 9488 (Decimal part vanished!)
--------------------------------
pi^9: 29809.0993334462 float64
Casted to float32: 29809.0996093750 (Precision lost!)
Casted to float16: 29808.0000000000 (More precision lost!)
Casted to bfloat16: 29824.0000000000 (More precision loss!)
Casted to integer: 29809 (Decimal part vanished!)
--------------------------------
pi^10: 93648.0474760830 float64
Casted to float32: 93648.0468750000 (Precision lost!)
Casted to float16: inf (More precision lost!)
Casted to bfloat16: 93696.0000000000 (More precision loss!)
Casted to integer: 93648 (Decimal part vanished!)
--------------------------------
pi^11: 294204.0179738905 float64
Casted to float32: 294204.0312500000 (Precision lost!)
Casted to float16: inf (More precision lost!)
Casted to bfloat16: 294912.0000000000 (More precision loss!)
Casted to integer: 294204 (Decimal part vanished!)
--------------------------------
pi^12: 924269.1815233737 float64
Casted to float32: 924269.1875000000 (Precision lost!)
Casted to float16: inf (More precision lost!)
Casted to bfloat16: 925696.0000000000 (More precision loss!)
Casted to integer: 924269 (Decimal part vanished!)
Boolean Type (The Oracle)¶
Represents the fundamental truths of the universe: True or False.
- `torch.bool`: The result of all your logical incantations (`>`, `<`, `==`). Essential for creating masks to filter and select elements from your tensors.
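The Oracle in action (a minimal sketch; the tensor name `readings` is just for illustration):

```python
import torch

# A comparison yields a torch.bool mask...
readings = torch.tensor([-2.0, 7.5, 0.0, 3.2, -8.1])
mask = readings > 0
print(mask.dtype)      # torch.bool
print(mask)            # tensor([False,  True, False,  True, False])

# ...which can then filter elements from the original tensor
print(readings[mask])  # only the positive readings survive
```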
The Lair of Computation - Mastering Devices (device)¶
What devices PyTorch supports?¶
A tensor's dtype is its soul, but its .device is its home—the very dimension where its calculations will be performed. Choosing the right device is the key to unlocking diabolical computational speed! While many alchemical experiments can be run on a standard CPU, true power lies in specialized hardware.
Here are the primary realms you can command:
- `cpu`: The Central Processing Unit. The reliable, ever-present brain of your machine. It's a generalist, capable of any task, but it performs calculations sequentially. For small tensors and simple operations, it's our trusty home base.
- `cuda`: The NVIDIA GPU! This is the roaring heart of the deep learning revolution. A GPU is a specialist, containing thousands of cores designed for one purpose: massively parallel computation. Moving your tensors here is essential for training any serious neural network.
- `mps`: Metal Performance Shaders. Apple's answer to CUDA for their M-series chips. If you are wielding a modern Mac, this device will unleash the power of its integrated GPU for accelerated training.
- `xpu`: A realm forged by Intel for their own line of GPUs and AI accelerators, showing PyTorch's expanding hardware support.
- `xla`: A gateway to Google's powerful Tensor Processing Units (TPUs), often used for training colossal models in the cloud.
How PyTorch Supports So Many Devices: The Magic of ATen¶
How can a single command like torch.matmul() work on a CPU, an NVIDIA GPU, and an Apple chip? The secret lies in PyTorch's core library: ATen.
Think of ATen as a grand dispatcher in our laboratory. When you issue a command, ATen inspects the tensor's .device and redirects the command to a highly optimized, device-specific library:
- If `device='cpu'`, ATen calls libraries like `oneDNN`.
- If `device='cuda'`, ATen calls NVIDIA's legendary `cuDNN` library.
- If `device='mps'`, ATen calls Apple's `Metal` framework.
- For other devices like `xpu` or `xla`, it calls their respective specialized backends.
This brilliant design makes your PyTorch code incredibly portable. You write the incantation once, and ATen ensures it is executed with maximum power on whatever hardware you possess!
The Ritual: Dynamic Device Placement¶
A true PyTorch master does not hardcode their device! That is the way of the amateur. We shall write a glorious, platform-agnostic spell that automatically detects and selects the most powerful computational device available.
The hierarchy is clear: CUDA is the sanctum sanctorum, MPS is the respected wizard's tower, and CPU is our reliable home laboratory. Our code shall seek the most powerful realm first.
# Our Grand Spell for Selecting the Best Device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Mwahahaha! We have awakened the CUDA beast!")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Behold! The power of Apple's Metal Performance Shaders!")
else:
    device = torch.device("cpu")
    print("The humble CPU will have to suffice for today's experiments.")
print(f"Selected device: {device}\n")
# --- Summoning and Teleporting Tensors ---
# 1. Summon a tensor directly on the chosen device
tensor_on_device = torch.randn(2, 3, device=device)
print(f"Tensor summoned directly on '{tensor_on_device.device}'")
print(tensor_on_device)
# 2. Teleport a CPU tensor to the device using the .to() spell
cpu_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(f"A CPU tensor, minding its own business: {cpu_tensor.device}")
teleported_tensor = cpu_tensor.to(device)
print(f"Teleported to '{teleported_tensor.device}'!")
print(teleported_tensor)
# IMPORTANT: Operations between tensors on different devices will FAIL!
# This would cause a RuntimeError:
# try:
#     result = cpu_tensor + teleported_tensor
# except RuntimeError as e:
#     print(f"\nAs expected, chaos ensues: {e}")
Mwahahaha! We have awakened the CUDA beast!
Selected device: cuda
Tensor summoned directly on 'cuda:0'
tensor([[-0.1802, 0.6344, 0.5375],
[ 0.4829, -0.6208, -0.6716]], device='cuda:0')
A CPU tensor, minding its own business: cpu
Teleported to 'cuda:0'!
tensor([[1, 2, 3],
[4, 5, 6]], device='cuda:0')
Speed Trials¶
Comparison of speed of different floating-point types on CPU and GPU¶
Now for a truly electrifying experiment! We shall create colossal tensors of different floating-point dtypes and subject them to a barrage of intense, element-wise mathematical operations. This will reveal the dramatic speed differences between our alchemical elixirs when processed in the massively parallel crucible of a GPU.
import time
import torch
# A utility for our CPU speed trials
def time_cpu_operation(tensor):
    start_time = time.time()
    # A sequence of intense, element-wise mathematical transformations!
    torch.exp(torch.cos(torch.sin(tensor)))
    end_time = time.time()
    return end_time - start_time
# A colossal tensor for our experiment!
size = 20000 # Larger size to make the computation more intensive
large_tensor_cpu = torch.randn(size, size)
print(f"--- CPU Speed Trials ({size}x{size} Element-wise Operations) ---")
time_fp32_cpu = time_cpu_operation(large_tensor_cpu.clone())
print(f"Float32 on CPU took: {time_fp32_cpu:.6f} seconds (Our baseline)")
time_fp16_cpu = time_cpu_operation(large_tensor_cpu.clone().to(torch.float16))
print(f"Float16 on CPU took: {time_fp16_cpu:.6f} seconds ")
time_bf16_cpu = time_cpu_operation(large_tensor_cpu.clone().to(torch.bfloat16))
print(f"BFloat16 on CPU took: {time_bf16_cpu:.6f} seconds ")
# --- SPEED TRIALS ON GPU (if available) ---
if torch.cuda.is_available():
    print(f"--- GPU Speed Trials ({size}x{size} Element-wise Operations) ---")
    large_tensor_gpu = large_tensor_cpu.to("cuda")

    # Define the GPU timing utility using CUDA events for accuracy
    def time_gpu_operation(tensor):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        # Warm-up run to compile kernels, etc.
        torch.exp(torch.cos(torch.sin(tensor)))
        # The actual timed operation
        start_event.record()
        torch.exp(torch.cos(torch.sin(tensor)))
        end_event.record()
        torch.cuda.synchronize()  # Wait for the GPU operation to complete
        return start_event.elapsed_time(end_event) / 1000  # Return time in seconds

    # Time Float32
    time_fp32_gpu = time_gpu_operation(large_tensor_gpu.clone())
    speedup_vs_cpu = time_fp32_cpu / time_fp32_gpu
    print(f"Float32 on GPU took: {time_fp32_gpu:.6f} seconds ({speedup_vs_cpu:.2f}x faster than CPU!)")

    # Time Float16
    try:
        large_tensor_gpu_fp16 = large_tensor_gpu.clone().to(torch.float16)
        time_fp16_gpu = time_gpu_operation(large_tensor_gpu_fp16)
        speedup_vs_gpu_fp32 = time_fp32_gpu / time_fp16_gpu
        speedup_vs_cpu_fp32 = time_fp32_cpu / time_fp16_gpu
        speedup_vs_cpu_fp16 = time_fp16_cpu / time_fp16_gpu
        print(f"Float16 on GPU took: {time_fp16_gpu:.6f} seconds ({speedup_vs_gpu_fp32:.2f}x vs GPU FP32, {speedup_vs_cpu_fp32:.2f}x vs CPU FP32, {speedup_vs_cpu_fp16:.2f}x vs CPU FP16!)")
    except RuntimeError as e:
        print(f"Float16 not supported on this GPU: {e}")

    # Time BFloat16
    try:
        large_tensor_gpu_bf16 = large_tensor_gpu.clone().to(torch.bfloat16)
        time_bf16_gpu = time_gpu_operation(large_tensor_gpu_bf16)
        speedup_vs_gpu_fp32 = time_fp32_gpu / time_bf16_gpu
        speedup_vs_cpu_fp32 = time_fp32_cpu / time_bf16_gpu
        speedup_vs_cpu_fp16 = time_bf16_cpu / time_bf16_gpu
        print(f"BFloat16 on GPU took: {time_bf16_gpu:.6f} seconds ({speedup_vs_gpu_fp32:.2f}x vs GPU FP32, {speedup_vs_cpu_fp32:.2f}x vs CPU FP32, {speedup_vs_cpu_fp16:.2f}x vs CPU BFP16!)")
    except RuntimeError as e:
        print(f"BFloat16 not supported on this GPU: {e}")
else:
    print("GPU not available for speed trials. A true pity!")
--- CPU Speed Trials (20000x20000 Element-wise Operations) ---
Float32 on CPU took: 1.031577 seconds (Our baseline)
Float16 on CPU took: 0.360738 seconds
BFloat16 on CPU took: 0.360255 seconds
--- GPU Speed Trials (20000x20000 Element-wise Operations) ---
Float32 on GPU took: 0.030761 seconds (33.54x faster than CPU!)
Float16 on GPU took: 0.015344 seconds (2.00x vs GPU FP32, 67.23x vs CPU FP32, 23.51x vs CPU FP16!)
BFloat16 on GPU took: 0.015259 seconds (2.02x vs GPU FP32, 67.61x vs CPU FP32, 23.61x vs CPU BFP16!)
Professor Torchenstein's Outro¶
Mwahahaha! Do you feel it? The hum of raw computational power at your fingertips? You have transcended the mundane world of default settings and seized control of the very essence of your tensors. You are no longer a mere summoner; you are an alchemist and a dimensional traveler!
You have learned to choose your weapons—the precise dtype for the task at hand and the mightiest device for your computations. This knowledge is the bedrock upon which all great neural architectures are built.
But do not rest easy! Our journey has just begun. The tensors are humming, eager for the next lesson where we shall unleash their raw mathematical power with Elemental Tensor Alchemy!
Until then, keep your learning rates high and your devices hotter! The future... is computational!