Single Die Cost ~ $1000/670 ~ $1.5. What’s Really Motivating the Matrix Engine Movement in HPC?

Because of the long training times of neural networks, often days or weeks, throughput is critical. Nvidia has been offering distinct Tesla GPU products for training and inference for some time, and Intel is following suit with its Neural Network Processor (NNP) chips. Habana chief business officer Eitan Medina recently described his company's architectural strategy at this year's Hot Chips event. By Habana's account, Goya's inference throughput is three times what Nvidia's Tesla T4 inference GPU can manage with a batch size of 128, where the T4 delivers 26 milliseconds of latency.
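As a rough sanity check on figures like these, throughput and latency can be related through the batch size. The sketch below assumes the 26 milliseconds is the T4's latency for one full batch of 128 and that batches complete back to back with no overlap. The function name is illustrative, and the results are estimates, not vendor-published numbers.

```python
def batch_throughput(batch_size: int, latency_ms: float) -> float:
    """Images per second if one batch completes every `latency_ms` milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size / (latency_ms / 1000.0)


# Figures quoted in the text: batch size 128, 26 ms latency.
# Assumption: the latency applies to a whole batch, not to a single image.
t4_estimate = batch_throughput(128, 26.0)

# "Three times as fast" would then imply roughly this many images per second.
implied_3x = 3 * t4_estimate

if __name__ == "__main__":
    print(f"T4 estimate: {t4_estimate:,.0f} images/sec")
    print(f"Implied 3x figure: {implied_3x:,.0f} images/sec")
```

If real hardware pipelines several batches at once, sustained throughput can exceed this simple ratio, so treat the estimate as a lower-bound style check only.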

As we reported back in June, when Habana unveiled the Gaudi chip, seven of its ten network ports are used to connect Gaudis within a node, leaving three to link up to other servers. The company is testing first silicon and expects all three Gaudi products to sample by the end of 2019, with volume production expected to start in mid-2020.

At a batch size of one, Goya handles 8,500 ResNet-50 …

SANTA CLARA, Calif.--(BUSINESS WIRE)--Intel Corporation today announced that it has acquired Habana Labs, an Israel-based developer of programmable deep learning accelerators for the data center, for approximately $2 billion.

At this point, those diverging hardware requirements are fairly well understood. The Goya silicon is wrapped in a PCI-Express 4.0 card meant to plug into standard datacenter servers. The GEMM, TPC, and DMA engines can operate concurrently with the shared memory, offering a latency-hiding mechanism. With the ResNet-50 model, the use of 8-bit integer (INT8) provides the best image recognition throughput, but with an accuracy loss of 0.4 percent compared to a GPU baseline.

... Additionally, Habana’s Goya AI Inference Processor, which is …

Below are performance results measured on Goya for a question answering task: identifying the answer to an input question within a paragraph, based on the Stanford Question Answering Dataset (SQuAD).

Software: Ubuntu v-16.04; SynapseAI v-0.1.6

Habana is also the first vendor to announce hardware for Facebook’s OCP form factor and Glow software. Still, with Facebook working with several other accelerator chip start-ups, there is, of course, no guarantee that Habana will receive major orders from the social media giant.

This combination gives Habana access to Intel AI capabilities, including significant resources built over the last three years with deep expertise in AI software, algorithms and research that will help Habana scale and accelerate. Habana will report to Intel's Data Platforms Group, home to Intel's broad portfolio of data center class AI technologies.

Software-wise, Gaudi comes with Habana's AI software stack, known as SynapseAI, which comprises a graph compiler, runtime, debugger, deep learning library and drivers. The model is converted into an internal representation. Large-node training systems based on Gaudi are expected to deliver up to a 4x increase in throughput versus systems built with the equivalent number of GPUs. Even so, Habana's performance advantage claim may be short-lived.

Gaudi frees customers from NVIDIA's proprietary software and interfaces. With that freedom, and probably a much lower price, it should appeal to cloud data center customers who currently buy expensive NVIDIA GPUs and are anxious to see alternative suppliers.

Habana claims it has already shipped Goya to 20 select clients.

Goya Configuration: Hardware: Goya HL-100; CPU Xeon Gold 6152@2.10GHz
1,527 sentences/sec on BERT.

Of the two, training neural networks is the more computationally demanding, while inference places the greater demand on low-latency performance.

Habana withheld most of the information regarding the GEMM engine. Anyone can stuff a chip full of multipliers; that doesn’t mean they can utilize them well. Gaudi uses a cluster of eight TPC 2.0 cores.

TSMC 12nm FFN, 754 mm² die, 18.6 bn transistors.

Not only do these two application areas present significantly different requirements for the hardware, but the markets for these systems also present their own particular needs. Goya, the inference chip, was released in January 2018, followed by Gaudi, the training processor, in June 2019. Gaudi represents Habana’s second attempt to break into the AI market following the commercial launch of its Goya inference chips in Q4 2018.

For that you get full bandwidth, with non-blocking communication. It requires a manageable amount of bandwidth, since the devices only need to exchange updates to the model as it’s trained.

Hardware Configuration: T4; Host Supermicro SYS-4029GP – TRT T4

Goya supports numerical formats commonly used for inference work, including FP32, INT32, INT16, INT8, UINT32, UINT16, and UINT8. This can be managed through a software API, which modifies the data type based on the desired accuracy.
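To make the accuracy-driven choice of data type concrete, here is a minimal sketch of the general idea: quantize a set of values at a given bit width, measure the reconstruction error, and pick the narrowest format that stays within a tolerance. This is not Habana's SynapseAI API, whose interface is not described here. The function names, the symmetric per-tensor scheme, and the tolerance values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(values: Sequence[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed integers of width `bits`."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def max_abs_error(values: Sequence[float], bits: int) -> float:
    """Largest absolute difference between the inputs and their dequantized codes."""
    codes, scale = quantize(values, bits)
    return max((abs(v - c * scale) for v, c in zip(values, codes)), default=0.0)


def choose_format(values: Sequence[float], tolerance: float) -> str:
    """Hypothetical helper: narrowest format whose error stays within `tolerance`."""
    for name, bits in (("INT8", 8), ("INT16", 16)):
        if max_abs_error(values, bits) <= tolerance:
            return name
    return "FP32"  # fall back to floating point when integers are too coarse


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.003]
    for tol in (1e-2, 1e-3, 1e-9):
        print(tol, choose_format(weights, tol))
```

A production toolchain would typically judge accuracy on model outputs over a calibration dataset, not on raw weight error, but the trade-off has the same shape: narrower types give higher throughput at some cost in accuracy.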

In fact, Gaudi is the only processor we know of that incorporates RDMA directly onto the package and certainly the only one that offers 1 Tb/sec of connectivity to each processor. In addition, the company claims that Gaudi uses only 140 Watts of power when running the benchmark, around half that of the V100. Since inference is usually performed in large-scale cloud setups, an inference chip's power envelope must enable it to fit into standard server gear.

"More specifically, Habana turbo-charges our AI offerings for the data center with a high-performance training processor family and a standards-based programming environment to address evolving AI workloads."

"Together, we will deliver our customers more AI innovation, faster."

Hardware: Goya HL-100; CPU Xeon Gold 6152@2.10GHz
Software Configuration: TensorRT 5.1; Synthetic dataset; Container – 19.03-py3

Habana will remain an independent business unit and will continue to be led by its current management team. "This acquisition advances our AI strategy, which is to provide customers with solutions to fit every performance need – from the intelligent edge to the data center," said Navin Shenoy, executive vice president and general manager of the Data Platforms Group at Intel.

The processor is characterized as a heterogeneous architecture, comprising a general matrix multiply (GEMM) engine and eight Tensor Processor Cores (TPCs).

Hardware: Goya HL-100 PCIe Card; CPU Xeon E5

A 128-Gaudi processor cluster can be built with 16 HLS-1 systems using 10 Ethernet switches.
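The cluster and port figures quoted in this article can be tied together with a little arithmetic. The sketch below takes only numbers stated in the text (16 HLS-1 systems, 128 Gaudi processors, seven intra-node and three scale-out ports per processor, 1 Tb/sec per processor) and derives the rest. The class and property names are illustrative. How the scale-out links are wired to the 10 Ethernet switches is not specified in the text and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaudiClusterSketch:
    """Back-of-the-envelope model built from figures quoted in the article."""

    systems: int = 16                         # HLS-1 systems in the cluster
    total_processors: int = 128               # Gaudi processors in the cluster
    intra_node_ports: int = 7                 # ports linking Gaudis within a node
    scale_out_ports: int = 3                  # ports linking to other servers
    processor_bandwidth_gbps: float = 1000.0  # 1 Tb/sec per processor

    @property
    def processors_per_system(self) -> int:
        return self.total_processors // self.systems

    @property
    def ports_per_processor(self) -> int:
        return self.intra_node_ports + self.scale_out_ports

    @property
    def per_port_gbps(self) -> float:
        # Derived value: assumes the bandwidth is split evenly across ports.
        return self.processor_bandwidth_gbps / self.ports_per_processor

    @property
    def scale_out_gbps_per_system(self) -> float:
        return self.processors_per_system * self.scale_out_ports * self.per_port_gbps


if __name__ == "__main__":
    cluster = GaudiClusterSketch()
    print("Gaudi processors per HLS-1:", cluster.processors_per_system)
    print("Per-port bandwidth (Gb/s):", cluster.per_port_gbps)
    print("Scale-out bandwidth per HLS-1 (Gb/s):", cluster.scale_out_gbps_per_system)
```

Under these assumptions each HLS-1 holds eight Gaudi processors, each port carries 100 Gb/s, and a single system exposes 2,400 Gb/s of scale-out bandwidth.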

