Public
Full Commit Hash: 64c701a0b38bd4aca30ae532ab361e21c3b651f9
Commit Details
183 Added

Initial commit - Upload project 'flashmla'

Authored by WebDev on January 23, 2026, 4:51 am
Statistics
Files Added: 183
Files Modified: 0
Files Deleted: 0
Files Renamed: 0
Changed Files 183 files
.gitignore 110 B
A Added · .
.gitmodules 93 B
A Added · .
benchmark
A Added · .
bench_flash_mla.py 18.71 KB
A Added · benchmark
visualize.py 745 B
A Added · benchmark
csrc
A Added · .
api
A Added · csrc
api.cpp 507 B
A Added · csrc/api
common.h 8.54 KB
A Added · csrc/api
dense_decode.h 9.59 KB
A Added · csrc/api
dense_fwd.h 78 B
A Added · csrc/api
sparse_decode.h 17.86 KB
A Added · csrc/api
sparse_fwd.h 7.57 KB
A Added · csrc/api
cutlass
A Added · csrc
defines.h 564 B
A Added · csrc
kerutils
A Added · csrc
include
A Added · csrc/kerutils
kerutils
A Added · csrc/kerutils/include
common
A Added · csrc/kerutils/include/kerutils
common.h 143 B
A Added · csrc/kerutils/include/kerutils/common
device
A Added · csrc/kerutils/include/kerutils
common.h 1.56 KB
A Added · csrc/kerutils/include/kerutils/device
device.cuh 320 B
A Added · csrc/kerutils/include/kerutils/device
sm100
A Added · csrc/kerutils/include/kerutils/device
gemm.cuh 25.49 KB
A Added · csrc/kerutils/include/kerutils/device/sm100
helpers.cuh 4.29 KB
A Added · csrc/kerutils/include/kerutils/device/sm100
intrinsics.cuh 19.18 KB
A Added · csrc/kerutils/include/kerutils/device/sm100
tma_cta_group2_nosplit.cuh 11.08 KB
A Added · csrc/kerutils/include/kerutils/device/sm100
sm80
A Added · csrc/kerutils/include/kerutils/device
helpers.cuh 1.19 KB
A Added · csrc/kerutils/include/kerutils/device/sm80
intrinsics.cuh 7.2 KB
A Added · csrc/kerutils/include/kerutils/device/sm80
sm90
A Added · csrc/kerutils/include/kerutils/device
helpers.cuh 4.47 KB
A Added · csrc/kerutils/include/kerutils/device/sm90
intrinsics.cuh 4.59 KB
A Added · csrc/kerutils/include/kerutils/device/sm90
host
A Added · csrc/kerutils/include/kerutils
host.h 6.4 KB
A Added · csrc/kerutils/include/kerutils/host
kerutils.cuh 66 B
A Added · csrc/kerutils/include/kerutils
supplemental
A Added · csrc/kerutils/include/kerutils
torch_tensors.h 3.21 KB
A Added · csrc/kerutils/include/kerutils/supplemental
params.h 6.12 KB
A Added · csrc
sm100
A Added · csrc
decode
A Added · csrc/sm100
head128
A Added · csrc/sm100/decode
README.md 185 B
A Added · csrc/sm100/decode/head128
head64
A Added · csrc/sm100/decode
config.h 7.54 KB
A Added · csrc/sm100/decode/head64
instantiations
A Added · csrc/sm100/decode/head64
model1.cu 176 B
A Added · csrc/sm100/decode/head64/instantiations
v32.cu 173 B
A Added · csrc/sm100/decode/head64/instantiations
kernel.cuh 50.67 KB
A Added · csrc/sm100/decode/head64
kernel.h 189 B
A Added · csrc/sm100/decode/head64
helpers.h 717 B
A Added · csrc/sm100
prefill
A Added · csrc/sm100
dense
A Added · csrc/sm100/prefill
collective
A Added · csrc/sm100/prefill/dense
fmha_common.hpp 4.99 KB
A Added · csrc/sm100/prefill/dense/collective
fmha_fusion.hpp 12.74 KB
A Added · csrc/sm100/prefill/dense/collective
sm100_fmha_fwd_epilogue_tma_warpspecialized.hpp 8.34 KB
A Added · csrc/sm100/prefill/dense/collective
sm100_fmha_fwd_mainloop_tma_warpspecialized.hpp 44.56 KB
A Added · csrc/sm100/prefill/dense/collective
sm100_fmha_load_tma_warpspecialized.hpp 11.3 KB
A Added · csrc/sm100/prefill/dense/collective
sm100_fmha_mla_fwd_mainloop_tma_warpspecialized.hpp 45.53 KB
A Added · csrc/sm100/prefill/dense/collective
sm100_fmha_mla_load_tma_warpspecialized.hpp 12.57 KB
A Added · csrc/sm100/prefill/dense/collective
common
A Added · csrc/sm100/prefill/dense
gather_tensor.hpp 6.91 KB
A Added · csrc/sm100/prefill/dense/common
helper.h 3.56 KB
A Added · csrc/sm100/prefill/dense/common
mask.cuh 132 B
A Added · csrc/sm100/prefill/dense/common
pipeline_mla.hpp 9.96 KB
A Added · csrc/sm100/prefill/dense/common
pow_2.hpp 3.3 KB
A Added · csrc/sm100/prefill/dense/common
utils.hpp 593 B
A Added · csrc/sm100/prefill/dense/common
device
A Added · csrc/sm100/prefill/dense
fmha_device_bwd.hpp 12.44 KB
A Added · csrc/sm100/prefill/dense/device
fmha.hpp 9.61 KB
A Added · csrc/sm100/prefill/dense/device
fmha_cutlass_bwd_sm100.cu 3.74 KB
A Added · csrc/sm100/prefill/dense
fmha_cutlass_bwd_sm100.cuh 8.91 KB
A Added · csrc/sm100/prefill/dense
fmha_cutlass_fwd_sm100.cu 3.61 KB
A Added · csrc/sm100/prefill/dense
fmha_cutlass_fwd_sm100.cuh 13.37 KB
A Added · csrc/sm100/prefill/dense
interface.h 876 B
A Added · csrc/sm100/prefill/dense
kernel
A Added · csrc/sm100/prefill/dense
fmha_causal_tile_scheduler.hpp 6.8 KB
A Added · csrc/sm100/prefill/dense/kernel
fmha_kernel_bwd_convert.hpp 6.55 KB
A Added · csrc/sm100/prefill/dense/kernel
fmha_kernel_bwd_sum_OdO.hpp 6.8 KB
A Added · csrc/sm100/prefill/dense/kernel
fmha_options.hpp 2.79 KB
A Added · csrc/sm100/prefill/dense/kernel
fmha_tile_scheduler.hpp 5.41 KB
A Added · csrc/sm100/prefill/dense/kernel
sm100_fmha_bwd_kernel_tma_warpspecialized.hpp 75.87 KB
A Added · csrc/sm100/prefill/dense/kernel
sm100_fmha_bwd_mla_kernel_tma_warpspecialized.hpp 76.27 KB
A Added · csrc/sm100/prefill/dense/kernel
sm100_fmha_fwd_kernel_tma_warpspecialized.hpp 25.52 KB
A Added · csrc/sm100/prefill/dense/kernel
sparse
A Added · csrc/sm100/prefill
common_subroutine.h 6.13 KB
A Added · csrc/sm100/prefill/sparse
fwd
A Added · csrc/sm100/prefill/sparse
fwd_for_small_topk
A Added · csrc/sm100/prefill/sparse
head128
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk
config.h 4.9 KB
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128
instantiations
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128
phase1_decode_k512.cu 233 B
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128/instantiations
phase1_prefill_k512.cu 220 B
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128/instantiations
phase1.cuh 55.63 KB
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128
phase1.h 215 B
A Added · csrc/sm100/prefill/sparse/fwd_for_small_topk/head128
head128
A Added · csrc/sm100/prefill/sparse/fwd
config.h 4.69 KB
A Added · csrc/sm100/prefill/sparse/fwd/head128
instantiations
A Added · csrc/sm100/prefill/sparse/fwd/head128
phase1_k512.cu 162 B
A Added · csrc/sm100/prefill/sparse/fwd/head128/instantiations
phase1_k576.cu 162 B
A Added · csrc/sm100/prefill/sparse/fwd/head128/instantiations
phase1.cuh 29.84 KB
A Added · csrc/sm100/prefill/sparse/fwd/head128
phase1.h 153 B
A Added · csrc/sm100/prefill/sparse/fwd/head128
head64
A Added · csrc/sm100/prefill/sparse/fwd
config.h 4.76 KB
A Added · csrc/sm100/prefill/sparse/fwd/head64
instantiations
A Added · csrc/sm100/prefill/sparse/fwd/head64
phase1_k512.cu 161 B
A Added · csrc/sm100/prefill/sparse/fwd/head64/instantiations
phase1_k576.cu 161 B
A Added · csrc/sm100/prefill/sparse/fwd/head64/instantiations
phase1.cuh 28.19 KB
A Added · csrc/sm100/prefill/sparse/fwd/head64
phase1.h 152 B
A Added · csrc/sm100/prefill/sparse/fwd/head64
sm90
A Added · csrc
decode
A Added · csrc/sm90
dense
A Added · csrc/sm90/decode
config.h 199 B
A Added · csrc/sm90/decode/dense
instantiations
A Added · csrc/sm90/decode/dense
bf16.cu 176 B
A Added · csrc/sm90/decode/dense/instantiations
fp16.cu 210 B
A Added · csrc/sm90/decode/dense/instantiations
splitkv_mla.cuh 56.97 KB
A Added · csrc/sm90/decode/dense
splitkv_mla.h 148 B
A Added · csrc/sm90/decode/dense
traits.h 3.56 KB
A Added · csrc/sm90/decode/dense
sparse_fp8
A Added · csrc/sm90/decode
components
A Added · csrc/sm90/decode/sparse_fp8
config.h 652 B
A Added · csrc/sm90/decode/sparse_fp8/components
dequant.h 3.48 KB
A Added · csrc/sm90/decode/sparse_fp8/components
helpers.h 4.2 KB
A Added · csrc/sm90/decode/sparse_fp8/components
config.h 9.07 KB
A Added · csrc/sm90/decode/sparse_fp8
instantiations
A Added · csrc/sm90/decode/sparse_fp8
model1_persistent_h128.cu 189 B
A Added · csrc/sm90/decode/sparse_fp8/instantiations
model1_persistent_h64.cu 189 B
A Added · csrc/sm90/decode/sparse_fp8/instantiations
v32_persistent_h128.cu 186 B
A Added · csrc/sm90/decode/sparse_fp8/instantiations
v32_persistent_h64.cu 185 B
A Added · csrc/sm90/decode/sparse_fp8/instantiations
splitkv_mla.cuh 38.44 KB
A Added · csrc/sm90/decode/sparse_fp8
splitkv_mla.h 207 B
A Added · csrc/sm90/decode/sparse_fp8
helpers.h 6.5 KB
A Added · csrc/sm90
prefill
A Added · csrc/sm90
sparse
A Added · csrc/sm90/prefill
config.h 4.22 KB
A Added · csrc/sm90/prefill/sparse
fwd.cu 851 B
A Added · csrc/sm90/prefill/sparse
fwd.h 112 B
A Added · csrc/sm90/prefill/sparse
instantiations
A Added · csrc/sm90/prefill/sparse
phase1_k512_topklen.cu 326 B
A Added · csrc/sm90/prefill/sparse/instantiations
phase1_k512.cu 327 B
A Added · csrc/sm90/prefill/sparse/instantiations
phase1_k576_topklen.cu 158 B
A Added · csrc/sm90/prefill/sparse/instantiations
phase1_k576.cu 159 B
A Added · csrc/sm90/prefill/sparse/instantiations
phase1.cuh 26.76 KB
A Added · csrc/sm90/prefill/sparse
phase1.h 175 B
A Added · csrc/sm90/prefill/sparse
smxx
A Added · csrc
decode
A Added · csrc/smxx
combine
A Added · csrc/smxx/decode
combine.cu 9.41 KB
A Added · csrc/smxx/decode/combine
combine.h 150 B
A Added · csrc/smxx/decode/combine
get_decoding_sched_meta
A Added · csrc/smxx/decode
get_decoding_sched_meta.cu 5.61 KB
A Added · csrc/smxx/decode/get_decoding_sched_meta
get_decoding_sched_meta.h 139 B
A Added · csrc/smxx/decode/get_decoding_sched_meta
utils.h 3.35 KB
A Added · csrc
docs
A Added · .
20250422-new-kernel-deep-dive.md 8.16 KB
A Added · docs
20250929-hopper-fp8-sparse-deep-dive.md 7.09 KB
A Added · docs
assets
A Added · docs
MLA Kernel Sched.drawio.svg 73.64 KB
A Added · docs/assets
flash_mla
A Added · .
__init__.py 452 B
A Added · flash_mla
flash_mla_interface.py 18.65 KB
A Added · flash_mla
LICENSE 1.04 KB
A Added · .
README.md 10.41 KB
A Added · .
setup.py 5.87 KB
A Added · .
tests
A Added · .
kernelkit
A Added · tests
__init__.py 590 B
A Added · tests/kernelkit
.gitignore 74 B
A Added · tests/kernelkit
bench.py 9.15 KB
A Added · tests/kernelkit
compare.py 4.32 KB
A Added · tests/kernelkit
generate.py 1.03 KB
A Added · tests/kernelkit
precision.py 997 B
A Added · tests/kernelkit
utils.py 1.35 KB
A Added · tests/kernelkit
lib.py 16.29 KB
A Added · tests
quant.py 7.97 KB
A Added · tests
ref.py 4.45 KB
A Added · tests
test_flash_mla_dense_decoding.py 8.84 KB
A Added · tests
test_flash_mla_sparse_decoding.py 13.98 KB
A Added · tests
test_flash_mla_sparse_prefill.py 5.55 KB
A Added · tests
test_fmha_sm100.py 7.38 KB
A Added · tests
Commit Information
Hash: 64c701a0b38b
Commit ID: 75
Created: 2026-01-23 04:51:56
Age: Jan 23, 2026
Repository: flashmla
Total Files: 183