As you can see, the effect you are expecting is present for Float32:
julia> rnd64 = rand(Float64, 1000);
julia> rnd32 = rand(Float32, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> @btime $rnd64.^2;
616.495 ns (1 allocation: 7.94 KiB)
julia> @btime $rnd32.^2;
330.769 ns (1 allocation: 4.06 KiB) # faster!!
julia> @btime $rnd16.^2;
2.067 μs (1 allocation: 2.06 KiB) # slower!!
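Float64 and Float32 arithmetic is supported in hardware on most platforms, but Float16 arithmetic usually is not, so it has to be emulated in software. That emulation is why Float16 is slower here despite its smaller memory footprint.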
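If you want the smaller memory footprint of Float16 anyway, one option is to keep it as the storage type and do the arithmetic in Float32. This is only a sketch: the variable names are illustrative, and whether it is actually faster than `x16.^2` depends on your Julia version and CPU, so benchmark it yourself.

```julia
# Store the data as Float16 (2 bytes per element).
x16 = rand(Float16, 1000)

# Widen each element to Float32, square it, and narrow the result back.
# The nested dot calls fuse into one loop, so no intermediate Float32 array is allocated.
y16 = Float16.(Float32.(x16).^2)

# Storage cost per element in bytes for the three types compared above.
sizes = (sizeof(Float64), sizeof(Float32), sizeof(Float16))
```

The element sizes (8, 4 and 2 bytes) account for the allocation sizes reported by the benchmarks: 1000 elements plus a small array header.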
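Note also that you should interpolate global variables with $ when micro-benchmarking. Without it, rnd32 is accessed as a non-constant global, so the benchmark also measures the dynamic dispatch that this causes. The difference is significant here, not least in terms of allocations: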
julia> @btime $rnd32.^2;
336.187 ns (1 allocation: 4.06 KiB)
julia> @btime rnd32.^2;
930.000 ns (5 allocations: 4.14 KiB)
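For reference, here are a few ways to keep the global-variable overhead out of the measurement. This is a sketch that assumes BenchmarkTools is installed; `square_all` is a made-up helper name, and I have left the timings out because they vary by machine.

```julia
using BenchmarkTools

rnd32 = rand(Float32, 1000);   # non-constant global, as in the session above

# Option 1: interpolate the global so the benchmark sees a concretely typed value.
@btime $rnd32.^2;

# Option 2: create the input in a setup phase, which is not included in the timing.
@btime x.^2 setup=(x = rand(Float32, 1000));

# Option 3: put the work in a function and interpolate the argument.
square_all(x) = x.^2   # hypothetical helper, only for illustration
@btime square_all($rnd32);
```

All three measure the squaring itself rather than the lookup of an untyped global.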