Towards Democratized and Reproducible AI for EDA Research: Open Datasets and Benchmarks in Various Aspects

Artificial intelligence (AI) techniques have demonstrated remarkable effectiveness in electronic design automation (EDA) and agile IC design. For such data-driven technology, access to high-quality, diverse, and representative circuit data is essential for both ML model development and evaluation. However, the lack of circuit data remains a long-standing and primary technical bottleneck. First, the lack of open datasets raises a high barrier to the development of AI for EDA solutions. For ML model training, the label collection process can be highly time-consuming and resource-demanding. Also, the limited number of open-source circuit designs may not provide sufficient diversity in training, limiting ML model performance. Second, the lack of open benchmarks makes fair comparisons among different AI for EDA solutions highly challenging, so it is difficult for truly outstanding AI solutions to stand out.

Our special session (SS) delves into this long-standing circuit data availability challenge by presenting four open-source works from different perspectives. Topics in our SS include 1) an RTL-to-GDS digital dataset and benchmark, 2) an analog circuit synthesis benchmark, 3) a dataset and benchmark for LLM-aided IC design, and 4) AI-assisted circuit data generation techniques and corresponding datasets. They cover both digital and analog design, as well as emerging techniques such as LLM-aided design and AI-generated circuit datasets. By releasing these four recent explorations to the public, we hope to contribute to more democratized and reproducible AI for EDA techniques. Specifically, open datasets will allow every researcher to train their AI solutions without an EDA license, and open benchmarks with a leaderboard will facilitate fair comparisons and encourage replications of AI solutions.

- Duke University, USA
Dataset and Benchmark for Digital Design
The application of Machine Learning (ML) in Electronic Design Automation (EDA) for Very Large-Scale Integration (VLSI) design has garnered significant research attention. Although effective ML models require extensive datasets, most studies are limited to smaller, internally generated datasets due to the lack of comprehensive public resources. In response, we introduce EDALearn, a holistic, open-source dataset and benchmark suite specifically for ML tasks in EDA. It presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages. It fosters reproducibility and promotes research into ML transferability across different technology nodes. Accommodating a wide range of VLSI design instances and sizes, our dataset and benchmark aptly represent the complexity of contemporary VLSI designs. Additionally, we provide an in-depth data analysis, enabling users to fully comprehend the attributes and distribution of our data, which is essential for creating efficient ML models. Our contributions aim to encourage further advances in the ML-EDA domain.
- Fudan University, China
Benchmark for Analog Design
Recent advancements in machine learning (ML) for automating analog circuit synthesis have been significant, yet challenges remain. A critical gap is the lack of a standardized evaluation framework, compounded by diverse process design kits (PDKs), simulation tools, and a limited variety of circuit topologies. These factors hinder direct comparisons and validation of algorithms. To address these shortcomings, we introduce ACOB, an open-source benchmark suite designed to provide fair and comprehensive evaluations. ACOB includes 30 circuit topologies across five categories: sensing elements, voltage references, oscillators, amplifiers, and phase-locked loops. It supports several technology nodes for both academic and commercial applications and is compatible with commercial simulators like Cadence Spectre and Synopsys HSPICE, as well as the open-source simulator Xyce. This suite not only standardizes the assessment of ML algorithms in analog circuit synthesis but also promotes reproducibility with its open datasets and detailed benchmark specifications. ACOB’s user-friendly design ensures that researchers can easily adopt it for robust, transparent comparisons of state-of-the-art methods. Furthermore, it exposes researchers to real-world problems in industrial design cycles, enhancing the relevance and impact of their work in practical settings. Additionally, we provide a comprehensive comparison study of representative analog sizing methods on ACOB, showcasing the capabilities and advantages of various approaches.
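To make the idea of a simulator-agnostic sizing benchmark concrete, the following is a minimal sketch of how a sizing task and a baseline method might be evaluated against target specifications. It is an illustration only and does not reflect ACOB's actual interface: the names `SizingTask`, `meets_specs`, `random_search`, `toy_amplifier`, and `TOY_TASK` are hypothetical, and the closed-form `toy_amplifier` function merely stands in for a real circuit simulation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SizingTask:
    """Hypothetical description of one sizing task (not ACOB's schema)."""
    name: str
    bounds: Dict[str, Tuple[float, float]]  # parameter -> (low, high)
    specs: Dict[str, Tuple[str, float]]     # metric -> (">=" or "<=", target)
    simulate: Callable[[Dict[str, float]], Dict[str, float]]


def meets_specs(task: SizingTask, metrics: Dict[str, float]) -> bool:
    """Return True only if every specification is satisfied."""
    for metric, (op, target) in task.specs.items():
        value = metrics[metric]
        if op == ">=" and value < target:
            return False
        if op == "<=" and value > target:
            return False
    return True


def random_search(
    task: SizingTask, budget: int, seed: int = 0
) -> Tuple[Optional[int], Optional[Dict[str, float]]]:
    """Baseline sizing method: sample uniformly within the bounds.

    Returns (number of simulations used, feasible sizing), or (None, None)
    if no feasible sizing is found within the budget.
    """
    rng = random.Random(seed)
    for n in range(1, budget + 1):
        candidate = {p: rng.uniform(lo, hi) for p, (lo, hi) in task.bounds.items()}
        if meets_specs(task, task.simulate(candidate)):
            return n, candidate
    return None, None


def toy_amplifier(params: Dict[str, float]) -> Dict[str, float]:
    """Toy stand-in for a simulator call; the formulas are made up."""
    return {"gain": params["w"] / params["l"], "power": 0.1 * params["w"]}


TOY_TASK = SizingTask(
    name="toy_amplifier",
    bounds={"w": (1.0, 10.0), "l": (1.0, 2.0)},
    specs={"gain": (">=", 2.0), "power": ("<=", 0.8)},
    simulate=toy_amplifier,
)

if __name__ == "__main__":
    used, sizing = random_search(TOY_TASK, budget=1000)
    print(f"feasible sizing found after {used} simulations: {sizing}")
```

Keeping the simulator behind a single callable is one way different sizing methods could be compared under the same task definition and simulation budget, regardless of which simulator backs the call.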
- HKUST, China
Dataset and Benchmark for LLM-Assisted Design
The automated generation of design RTL from natural language instructions using large language models (LLMs) has recently demonstrated great potential in agile design. However, the lack of datasets and benchmarks prevents the development and fair evaluation of LLM solutions. This work highlights our latest advances in open datasets and benchmarks from three perspectives. (1) RTLCoder (latest update) is an automated data generation flow that has generated an open-source dataset with 30 thousand RTL generation samples. Open-source LLMs fine-tuned on this dataset outperform the commercial GPT-3.5. (2) RTLLM 2.0 is an updated benchmark for HDL generation, augmented to 50 hand-crafted designs. Each design provides the design description at different granularities, test cases, and correct RTL code. (3) AssertLLM is an open-source benchmark assessing the assertion generation capabilities of LLMs. The benchmark includes 20 designs, each providing a specification, signal definitions, and correct RTL code. These three studies are integrated into one framework, providing off-the-shelf support for the development and evaluation of LLMs for RTL code generation and verification.
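As an illustration of how a benchmark entry with a description, test cases, and reference RTL could be used to score a generator, consider the minimal sketch below. It is not the RTLLM or AssertLLM tooling: `DesignEntry`, `pass_rate`, and the stub functions are hypothetical names, and the text-comparison check only stands in for running a real testbench in an HDL simulator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DesignEntry:
    """Hypothetical benchmark entry; field names are assumptions."""
    name: str
    description: str    # natural-language design description
    testbench: str      # test cases for the design
    reference_rtl: str  # known-correct RTL code


def pass_rate(
    entries: List[DesignEntry],
    generate: Callable[[str], str],
    passes_tests: Callable[[DesignEntry, str], bool],
) -> Tuple[float, Dict[str, bool]]:
    """Generate RTL for each description and report the fraction that passes."""
    results = {e.name: passes_tests(e, generate(e.description)) for e in entries}
    rate = sum(results.values()) / len(results) if results else 0.0
    return rate, results


# Stubs for illustration only. A real flow would call an LLM in `generate`
# and simulate the testbench against the generated RTL in `passes_tests`.
ENTRIES = [
    DesignEntry("and2", "2-input AND gate", "tb_and2",
                "assign y = a & b;"),
    DesignEntry("or2", "2-input OR gate", "tb_or2",
                "assign y = a | b;"),
]


def stub_generate(description: str) -> str:
    # Pretend the model always answers with an AND gate.
    return "assign y = a & b;"


def stub_passes_tests(entry: DesignEntry, rtl: str) -> bool:
    # Stand-in for simulation: exact match against the reference RTL.
    return rtl.strip() == entry.reference_rtl.strip()


if __name__ == "__main__":
    rate, per_design = pass_rate(ENTRIES, stub_generate, stub_passes_tests)
    print(rate, per_design)
```

Separating generation from checking keeps the scoring loop independent of both the model under evaluation and the simulator used to run the test cases.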
AI-Assisted Dataset Generation
The EDA community has recently begun recognizing the potential of generative artificial intelligence (AI) in chip design. However, this potential has not yet been fully exploited, owing to the limited availability of the publicly accessible datasets that are crucial for advancing research in EDA. This paper highlights the dual role of generative AI. In particular, it showcases (i) BeGAN, the use of a generative AI strategy to create thousands of realistic benchmarks for power grid synthesis and analysis to advance power-related research, and (ii) EDA Corpus, an expert-curated and generative AI-enhanced dataset to serve research and development of EDA tool assistants. These two case studies emphasize the ability of generative methods to both create and utilize datasets to advance research and lower the barriers to entry in EDA.
- University of Minnesota, USA
- Arizona State University, USA