The main reason is: performance. Another reason is power consumption.
Separate dCache and iCache make it possible to fetch instructions and data in parallel.
Instructions and data have different access patterns.
Writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. For example, the AMD Software Optimization Guide for 10h and 12h Processors states that:
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and stored alongside the instruction cache.
The Intel Nehalem CPU features a loop buffer, and in addition to this the Sandy Bridge CPU features a µop cache (see Agner Fog's "The microarchitecture of Intel, AMD and VIA CPUs"). Note that these are features related to code, and have no direct counterpart in relation to data. They benefit performance, and since Intel "prohibits" CPU designers from introducing features which result in an excessive increase of power consumption, they presumably also benefit total power consumption.
Most CPUs feature a data forwarding network (store-to-load forwarding). There is no "store-to-load forwarding" in relation to code, simply because code is modified much less frequently than data.
Code exhibits different access patterns than data: instruction fetch is mostly sequential (fall-through plus predicted branches), while data access is far less predictable.
That said, most CPUs nowadays have a unified L2 cache which holds both code and data. The reason for this is that separate L2I and L2D caches would pointlessly consume the transistor budget while failing to deliver any measurable performance gains.
(Surely, the reason for having separate iCache and dCache isn't reduced complexity, because if the reason were reduced complexity then there wouldn't be any pipelining in any of the current CPU designs. A CPU with pipelining is more complex than a CPU without pipelining. We want the increased complexity. The fact is: the next CPU design is (usually) more complex than the previous design.)