Custom computer hardware too expensive for universities to build

According to Mary Jane Irwin, interviewed by Leah Hoffman

Most of our work since then has been this simulation-based work. There’s been very little building.

Is that because of cost?

It’s really, really expensive these days to build custom hardware at the current or near-current technology node. You just can’t afford to do it at a university, so most architects do simulation-based research. Some design with FPGAs, and that’s great, but the number of people building custom hardware in universities has dwindled to almost nothing.

The missing memristor

According to Vongehr, S. and Meng, X., “The Missing Memristor has Not been Found,” Sci. Rep. 5, 11657 (2015); doi: 10.1038/srep11657

In 2008, the discovery of a “missing memristor” was announced with three (!) basically simultaneously timed, overlapping Nature group articles [3,4,5] and on the front pages of major newspapers, all as if an almost 40 years ago predicted, deeply scientifically significant hypothesis had been finally proven. There was immediately controversy around that the devices are neither new nor the 1971 proposed real memristor device [6,7]. Similar devices were already discovered in 1995 [8], but those early discoverers do still not think that their devices are real memristors. Memristive behavior is known from thin films since even before 1971 [9]. Before 2008, such devices and nonvolatile memory applications [10] were correctly not called memristors [11,12]. The 2008 claim showed that the films of TiO2 between metals, well known since the 1960s, can be described as resistors with memory [13], and “memristors” understood merely as nonlinear resistors with memory have been described by Kubo theory in the 1950s [14]. Novel in 2008 was merely the widely emphasized claim that such devices are the long sought “missing memristor,” but looking closer, it turns out that the authors were apparently missing something they call “perfect memristor,” the meaning of which accords to the mentioned tacit redefinition of “memristor” as not-just-memristive. It is highly doubtful that wide media attention would be given to such a mere technicality described around long known devices. Even if the existence of a perfect memristor was firstly recognized in 2008, the world is missing something else entirely: a real memristor device as suggested in 1971 on grounds of EM symmetry, namely on par with the known real inductor, the fourth next to that known third, both of them impossible without magnetic flux.
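For readers who don’t have the 1971 symmetry argument at hand, here it is in one display (standard circuit-theory background, summarized by me rather than quoted from the paper): the four circuit variables, charge q, flux φ, voltage v, and current i, are linked pairwise by the three classical elements, two further pairings (dq = i dt and dφ = v dt) are just the definitions of current and voltage, and Chua postulated a fourth element to supply the one remaining pairing:

\[
\begin{aligned}
dv &= R\,di &&\text{(resistor)}\\
dq &= C\,dv &&\text{(capacitor)}\\
d\varphi &= L\,di &&\text{(inductor)}\\
d\varphi &= M(q)\,dq &&\text{(the postulated memristor)}
\end{aligned}
\]

The complaint in the quote is that the 2008 devices realize a nonlinear resistance with memory rather than this charge-to-flux relation involving actual magnetic flux.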

10x slowdown in high-performance computing progress

According to Strohmaier, Meuer, Dongarra and Simon in “The TOP500 List and Progress in High-Performance Computing” in the November 2015 issue of Computer

As we approach the exascale era, the rate of increase in peak and application performance of our largest systems has clearly slowed. Our analysis suggests that over the next decade we will likely fall short of expected performance increases by almost an order of magnitude (100x instead of 1,000x). An even more substantial slowdown can be expected once any change or an end to Moore’s law affects increases in per-socket performance.

Any such slowdown will eventually open up opportunities for companies to explore competitive advantages through stronger architectural differentiation.
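To put the projection in per-year terms (my arithmetic, not the authors’): 1,000x per decade is the familiar doubling every year, while 100x per decade is only about 1.6x per year, and the ratio between the two decade-end figures is the order-of-magnitude shortfall in the heading:

\[
1000^{1/10} \approx 2.0, \qquad 100^{1/10} \approx 1.58, \qquad \frac{1000}{100} = 10.
\]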

“Half of the DRAM in the world will disappear,” says Sehat Sutardja

(Tip of the hat to “ippisl”.) According to Sehat Sutardja, CEO of Marvell, in a June 2015 interview with Linley Gwennap

Linley: How much can this approach reduce the amount of DRAM you need?

Sehat: Drastically. I used to say 90%, but that scared the hell out of the DRAM companies, so now I just say 50%. Half of the DRAM in the world will disappear. We’re talking $23 billion in savings to the OEMs [global DRAM revenue in 2014 was $46 billion], meaning $40 billion or $60 billion to you and I, because we never pay the OEM price.

Linley: But the incremental cost is going to be extra flash memory. How much flash do you need?

Sehat: As much as you want. If you want a system to have 16GB of main memory, you use 16GB of flash.

Linley: You’re replacing DRAM with flash, but flash is cheaper.

Sehat: Less than one-tenth of the price.

The Rex Neo chip: 10x to 25x increase in energy efficiency for same performance

According to Nicole Hemsoth

When Rex Computing CEO Thomas Sohmers was working with embedded computing systems for military applications at MIT at age 13, his thoughts turned to much larger scale systems. For exascale-class computing, he realized there were many lessons to be carried over from embedded computing that could potentially have an impact on the toughest challenges that lie ahead—balancing the performance demands with overall power efficiency and scalability of both the hardware and software.

The result of his research is an up-and-coming chip called Neo, which brings to bear a new architecture, instruction set, and core design that scraps a great deal of what Sohmers deems unnecessary about current cache architecture, snaps in a new interconnect, and, if his assumptions are correct, can do this in a power envelope and performance target that goes beyond the current Department of Energy requirements for exascale computing goals, which they hope to realize in the 2020 to 2023 timeframe.

Sohmers says that the national labs are already expressing early interest in the 64-bit Neo cores, which are 1/145 the size of a fourth generation Haswell core and 1/27 the size of a 32-bit ARM Cortex A-15. He expects to deliver a 256 core chip by the end of 2016 at the earliest using a 28 nanometer process, which will offer 65 gigaflops per watt. Successive generations will use 10 nanometer or 7 nanometer processes as those roll out. “Current proposals for exascale in 2022 are for 20 megawatts, but it’s definitely possible to do better than that within five years,” he noted.
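A quick sanity check on those figures (my arithmetic, not Hemsoth’s or Sohmers’): at the quoted 65 gigaflops per watt, a sustained exaflop would draw

\[
\frac{10^{18}\ \text{flop/s}}{65 \times 10^{9}\ \text{flop/s per W}} \approx 1.5 \times 10^{7}\ \text{W} \approx 15\ \text{MW},
\]

which is consistent with his claim that it is possible to do better than the 20-megawatt exascale target.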

Tip o’ the hat to Greg Jaxon.

Functional programming in C++14, instead of templates

According to Christoph Kohlhepp (tip o’ the hat to Greg Jaxon)

C++14 generic lambdas enable a degree of functional programming that has, until now, been impossible in C++.

and

They make C++ constructs more concise.

and

They eliminate many use cases of template syntax that has been a hallmark of C++ since the introduction of the STL.

and

When the C++ standard introduced templates, an entire coding style of generic programming unfolded on this, which made “<>” as ubiquitous as the pointer symbol “*” had been in C land. With the introduction of generic lambdas to the C++ standard, a genuine functional coding style is about to unfold and perhaps make the keyword auto just as ubiquitous.
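The snippets below are mine rather than Kohlhepp’s, but they illustrate the shift he is describing: a C++14 generic lambda (a lambda whose parameters are declared auto) does the work that used to require a function template, and higher-order functions can be written and composed without a single “<>”.

#include <iostream>
#include <numeric>
#include <string>
#include <vector>

int main() {
    // Pre-C++14, a reusable "twice" meant a function template:
    //     template <typename T> T twice(T x) { return x + x; }
    // A generic lambda expresses the same thing with no template syntax:
    auto twice = [](auto x) { return x + x; };
    std::cout << twice(21) << '\n';                 // 42
    std::cout << twice(std::string("ab")) << '\n';  // abab

    // Higher-order functional style: compose returns a new function.
    auto compose = [](auto f, auto g) {
        return [=](auto x) { return f(g(x)); };
    };
    auto add_one = [](auto x) { return x + 1; };
    std::cout << compose(twice, add_one)(20) << '\n';  // twice(add_one(20)) == 42

    // Generic lambdas also drop straight into the STL algorithms.
    std::vector<int> v{1, 2, 3, 4};
    std::cout << std::accumulate(v.begin(), v.end(), 0,
                                 [](auto acc, auto x) { return acc + x; })
              << '\n';                              // 10
}

Everything above compiles with -std=c++14, and, as Kohlhepp predicts, auto shows up everywhere the template angle brackets used to.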


Spending VLSI area to buy energy efficiency: the dark silicon design regime

According to Michael B. Taylor in “A landscape of the new dark silicon design regime”

Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark — either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip’s transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that “spend” area to “buy” energy efficiency. This saved energy can then be applied to increase performance, or to have longer battery life or lower operating temperatures.

The way forward may be to learn lessons from biological brains, such as severely limiting multiplexing and

Fast, static, “gather, reduce, and broadcast” operators. Neurons have fan out and fan in of approximately 7,000 to other neurons that are located significant distances away. Effectively, they can perform efficient operations that combine vector-style gather memory accesses to large numbers of static-memory locations, with a vector-style reduction operator and a broadcast. Do more efficient ways exist for implementing these operations in silicon? It could be useful for computations that operate on finite-sized static graphs.
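As a concrete (software-only, and entirely my own) reading of that operator: each node has a static list of locations it reads (the gather), combines them with a reduction such as a weighted sum, and its result is then visible to every node whose static list includes it (the broadcast). A minimal sketch:

#include <cstddef>
#include <cstdio>
#include <vector>

// Node in a finite, static graph: a fixed fan-in list plus per-input weights.
struct Node {
    std::vector<int>    fan_in;   // static indices of the nodes we gather from
    std::vector<double> weight;   // one weight per gathered input
};

// One synchronous step of the gather / reduce / broadcast operator.
void step(const std::vector<Node>& graph,
          const std::vector<double>& in,
          std::vector<double>& out) {
    for (std::size_t n = 0; n < graph.size(); ++n) {
        const Node& node = graph[n];
        double acc = 0.0;                                    // reduction accumulator
        for (std::size_t k = 0; k < node.fan_in.size(); ++k)
            acc += node.weight[k] * in[node.fan_in[k]];      // gather + reduce
        out[n] = acc;   // "broadcast": visible to every node that gathers from n next step
    }
}

int main() {
    // Tiny static graph: node 2 gathers from nodes 0 and 1.
    std::vector<Node> graph(3);
    graph[2].fan_in = {0, 1};
    graph[2].weight = {0.5, 2.0};

    std::vector<double> in = {4.0, 3.0, 0.0}, out(3, 0.0);
    step(graph, in, out);
    std::printf("node 2 = %.1f\n", out[2]);   // 0.5*4 + 2*3 = 8.0
}

The hardware question in the quote is whether this whole loop nest, for thousands of static fan-in locations per node, can be done as one efficient primitive instead of being multiplexed through caches and load/store units.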

RCU (read-copy-update): synchronization via structured deferral

According to Paul E. McKenney, Silas Boyd-Wickizer & Jonathan Walpole in a practical tech report about “the RCU API and how to apply it”

Most major Linux kernel subsystems use RCU as a synchronization mechanism. […] Understanding RCU is now a prerequisite for understanding the Linux implementation and its performance.

The success of RCU is, in part, due to its high performance in the presence of concurrent readers and updaters. The RCU API facilitates this with two relatively simple primitives: readers access data structures within RCU read-side critical sections, while updaters use RCU synchronization to wait for all pre-existing RCU read-side critical sections to complete. When combined, these primitives allow threads to concurrently read data structures, even while other threads are updating them. (LINK)

Then see McKenney’s more general (and fanciful) treatment in “Structured Deferral: Synchronization via Procrastination”, which also discusses hazard pointers.
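The kernel’s RCU implementation is far too large to quote, so here is a deliberately naive userspace sketch of the pattern described in the excerpt above, written in C++ for illustration. The function names mirror the kernel API (rcu_read_lock, rcu_read_unlock, synchronize_rcu), but the bodies are toy stand-ins built on a shared mutex: readers take the lock in shared mode, and synchronize_rcu momentarily takes it exclusively, which cannot succeed until every pre-existing read-side critical section has finished.

#include <atomic>
#include <cstdio>
#include <shared_mutex>
#include <thread>

// Toy stand-ins for the kernel primitives of the same names (illustration only).
static std::shared_timed_mutex rcu_mutex;
void rcu_read_lock()   { rcu_mutex.lock_shared(); }
void rcu_read_unlock() { rcu_mutex.unlock_shared(); }
void synchronize_rcu() { rcu_mutex.lock(); rcu_mutex.unlock(); }  // waits out pre-existing readers

struct Config { int value; };                      // hypothetical shared data
std::atomic<Config*> current_config{new Config{1}};

void reader() {
    rcu_read_lock();                               // begin read-side critical section
    Config* c = current_config.load(std::memory_order_acquire);
    std::printf("reader sees %d\n", c->value);     // c stays valid until we unlock
    rcu_read_unlock();                             // end read-side critical section
}

void updater() {
    Config* fresh = new Config{2};                 // build the new version off to the side
    Config* old = current_config.exchange(fresh, std::memory_order_release);
    synchronize_rcu();                             // wait for all pre-existing readers
    delete old;                                    // no reader can still be using 'old'
}

int main() {
    std::thread r1(reader), r2(reader), u(updater);
    r1.join(); r2.join(); u.join();
    delete current_config.load();
}

This captures the semantics only: real RCU readers never take a lock at all, which is where the high read-side performance the authors describe comes from.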

Programming brain-teasers — don’t wait for a job interview to enjoy them

Arden Dertat, when he was a student at Brown, blogged a nice series of 28 programming interview questions (and answers) HERE that are fun even if you’re not preparing for an interview.

A favorite interview-style question of mine is one that a friend used to ask about logic design: Consider a stairwell with a switch at each floor to control the lights. If the lights are on, flipping any switch turns the lights off, and if the lights are off, flipping any switch turns the lights on. How do they wire that up?
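Without spoiling the wiring answer, the specified behavior itself is easy to pin down in code: since flipping any one switch toggles the light, the light is simply the parity (XOR) of all the switch positions. A tiny model, with names of my own choosing:

#include <cstdio>
#include <vector>

// The light is on exactly when an odd number of switches are "up".
bool light_is_on(const std::vector<bool>& switches) {
    bool on = false;
    for (bool s : switches) on ^= s;   // each flipped switch toggles the result
    return on;
}

int main() {
    std::vector<bool> sw = {false, false, false};  // three floors, all switches down
    std::printf("%d\n", light_is_on(sw));          // 0: off
    sw[1] = true;                                  // flip any one switch
    std::printf("%d\n", light_is_on(sw));          // 1: on
    sw[2] = true;                                  // flip another
    std::printf("%d\n", light_is_on(sw));          // 0: off again
}

The interview question, of course, is how to realize that parity function with actual switches and wire.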

Arden Dertat, now at Microsoft, still tweets sometimes about interviews.