Thursday, February 24, 2011

The Inside Story of the R&D: Our Godson-2

Editor's Note: the author's own account follows. I sat alone in Room 105 of the North Building, waiting uneasily. Mr. Cheng and a student had been keeping me company, but a few minutes earlier I had sent them back to the small chip building to fetch the logic analyzer. In the small chip building, across the basketball court from the North Building, ten or so brothers from the research group were waiting as uneasily as I was.
At 0:30 AM on the 17th, rhythmic footsteps came from the corridor, sounding especially loud in the empty, silent hallway. As the footsteps drew closer my heart rose with them. Each step seemed to knock on my heart, because I knew that our Godson-2 was back.
The door of Room 105 opened and Chang Heng burst in, carrying a square box. I had sent him specially to the packaging manufacturer in Shanghai to bring the Godson-2 chips back for testing. We carefully opened the box: dozens of Godson-2 chips lay neatly packed in a dedicated tray, like soldiers awaiting review. One phone call to the small chip building, and in less than two minutes six or seven people had gathered in Room 105. They were members of the joint-debugging group that had been set up that afternoon.
I picked out a few chips and ran some simple static tests on them with a multimeter, then chose one, placed it in the socket on a daughter card, closed the cover, and plugged the daughter card into the motherboard. I carefully pressed the power switch. The display did not respond, and my heart tightened. After a few tries we switched to another daughter card, put the chip in, and plugged it into the motherboard. This time, when I pressed the power switch, the display flickered and a string of characters appeared. We cheered, and the hearts that had been in our throats settled back into place.
Once a simple BIOS had booted the system, we started the Linux operating system. Everything went smoothly. At 1:10 the Linux login prompt appeared on the screen. We hurried to phone the news to the other students waiting in the small chip building. Zhongshi Jiang also sent a text message to my wife, who was waiting up for our news. A few minutes later, Godson-2 received its first blessing after birth.
By 4:30 AM, Godson-2 had passed the remaining tests. Using a computer fitted with the Godson-2 CPU, we logged on to our group's internal BBS, posted the first message after Godson-2's birth, and sent a few e-mails. I decided that Godson-2's first joint-debugging session had come to an end. In the meeting room of the small chip building we opened a bottle of XO that a friend from Silicon Valley had given me long before, and each of us had half a cup to celebrate. After the drink everyone was still in high spirits, so as agreed beforehand we took taxis to Tiananmen Square to watch the flag-raising and to report to Chairman Mao at the Chairman Mao Memorial Hall. This year is the 110th anniversary of Chairman Mao's birth, and we had named the chip MZD110.
At 6:25 we stood once again under the flagpole in Tiananmen Square, watching the bright red flag rise to the national anthem. Last year, after the successful development of Godson-1, I had also watched the flag-raising, and I truly cannot remember what I was thinking then. This time I tried to think of something worthwhile, some fine words perhaps. But as the flag rose my mind went blank. Only at the moment the flag reached the top of the staff did a picture suddenly appear in my mind: Yang Liwei stepping out of Shenzhou-5 twenty-four hours earlier, the screen full of waving red flags. On the way to breakfast at Qianmen the picture kept coming back before my eyes and lingered for a long time.
At the Godson-1 launch conference on September 28, 2002, Director Li Guojie spoke of "something from nothing" and of a figure of more than 10 times. In fact this figure came from the targets set out in our application for the CAS Knowledge Innovation Project major project on computer hardware and software, and in the 863 key project applied for by Tang Zhimin. From the thick application documents and contracts for these two projects I remembered just two figures: a clock frequency of 500MHz or more, and a SPEC CPU2000 score of 300 points or more. From the day we took on the project, these two figures sat on my head like two band-tightening spells. (I have always thought this one of the best-defined of the 863 projects: two numbers were enough to state it clearly.)
Looking back, a frequency of 500MHz or more is the more manageable of the two: with enough brute effort it can always be done, and with a 0.13 micron process it is easy. The hard part is a SPEC CPU2000 score of 300 points or more. SPEC CPU2000 is a set of internationally recognized standard benchmark programs. The programs are run on the target machine, and the machine's actual speed is computed from their running times. The benchmark has gone through several generations: SPEC CPU89, SPEC CPU92, SPEC CPU95, and SPEC CPU2000. SPEC CPU2000 covers the widest range of applications, including file compression, FPGA placement and routing, compilers, combinatorial optimization, chess, word processing, computer vision, programming languages, interpreters, databases, layout simulators, quantum dynamics, shallow water modeling, three-dimensional potential solvers, partial differential equations, three-dimensional graphics libraries, computational fluid dynamics, image recognition / neural networks, seismic wave propagation simulation, computational chemistry, and number theory / primality testing, with source code on the order of millions of lines. The scoring baseline is a 300MHz Sun machine with a four-issue UltraSPARC II: its running time is taken as the standard running time and scored as 100 points, and other machines are scored by comparing their running times with the standard. In general, the currently popular four-issue RISC processors, such as the Alpha 21264, MIPS R12000, and IBM Power III, reach about 300 points on SPEC CPU2000 at 400MHz to 500MHz, while the Pentium III at 800MHz scored 200 points on SPEC CPU2000 floating point. For Godson-2 to reach 300 points, it must at least match the performance of a PIII or PIV running at more than 1GHz. So although 500MHz is not easy, a SPEC CPU2000 score of 300 points or more is harder still.
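
The scoring rule described above can be written down as a small sketch. Each program's score is the reference machine's running time divided by the measured running time, scaled so that the reference machine gets 100. Combining the per-program scores with a geometric mean is how SPEC reports its overall figures, though the text above does not spell that out. The function names and the timings in the demo are hypothetical, chosen only for illustration.

```python
from math import prod


def spec_ratio(reference_seconds, measured_seconds):
    """Score for one program: the reference machine itself scores 100."""
    return 100.0 * reference_seconds / measured_seconds


def overall_score(ratios):
    """Combine per-program scores with a geometric mean."""
    return prod(ratios) ** (1.0 / len(ratios))


if __name__ == "__main__":
    # Hypothetical timings in seconds: (reference machine, machine under test).
    timings = [(1400.0, 700.0), (1800.0, 450.0)]
    ratios = [spec_ratio(ref, run) for ref, run in timings]
    print(ratios, overall_score(ratios))
```

Under this rule, a machine that finishes every program three times faster than the reference machine scores 300, which is the target the project had committed to.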
To improve processor performance, neither raising the frequency nor optimizing the hardware and software structure can be neglected. Stressing frequency alone, or structure alone, is not enough. It is like carrying 100 pieces of wood from one place to another: worker A makes a round trip every 10 minutes and carries one piece each time; worker B makes a round trip every 20 minutes and carries four pieces each time; worker C takes 60 minutes per round trip and carries six pieces each time. We cannot say that A has the highest performance because A runs fastest (the highest frequency), nor that C has the highest performance because C carries the most each trip (the most instructions executed per cycle). Performance is a composite. Of course there are other considerations, such as the hourly pay of A, B, and C (processor power and area). Although the success of Godson-1 was a huge step forward for us, one thing left me deeply ashamed: the performance of Godson-1 did not reach the desired goal.
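
The wood-carrying analogy above can be checked with a few lines of arithmetic. This is only a sketch of the analogy, not a processor model: it assumes every trip is a full round trip (the last one included), and the function and variable names are made up for the illustration.

```python
from math import ceil


def minutes_to_move(total_pieces, pieces_per_trip, minutes_per_trip):
    """Total time, counting every trip as a full round trip."""
    trips = ceil(total_pieces / pieces_per_trip)
    return trips * minutes_per_trip


# (pieces carried per trip, minutes per round trip), as in the analogy.
WORKERS = {"A": (1, 10), "B": (4, 20), "C": (6, 60)}


def fastest_worker(total_pieces, workers):
    """Name of the worker who finishes the whole job first."""
    return min(workers, key=lambda name: minutes_to_move(total_pieces, *workers[name]))


if __name__ == "__main__":
    for name, (pieces, minutes) in WORKERS.items():
        print(name, minutes_to_move(100, pieces, minutes), "minutes")
    print("fastest:", fastest_worker(100, WORKERS))
```

With these numbers A needs 1000 minutes, B needs 500, and C needs 1020. B wins although A has the shortest trip time and C carries the most per trip, which is the point of the analogy: performance is work per trip multiplied by trips per unit of time.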
Although Godson-1's frequency was not low, when running programs it still fell somewhat short of RISC processors with a similar structure and of a PII at the same frequency, and its SPEC CPU2000 score was not high. I used to stay in the machine room day and night running all kinds of test programs and trying software optimizations to improve performance. There was some effect, but not an ideal one. Later performance analysis showed that some of Godson-1's performance bottlenecks could in fact have been overcome by quite simple optimizations. Unfortunately the project had been pushed forward too quickly, leaving too little time for performance analysis and optimization. These things depressed me greatly, so I took insufficient performance analysis as an important lesson of the Godson-1 design process and vowed to make up for it in the design of Godson-2. To know shame is close to courage: performance analysis at every step of Godson-2 really did play a significant role in its performance.
At the meeting where the leadership of the Chinese Academy of Sciences reviewed our major Godson-2 project, Li Guojie, Tang Zhimin, and I had finished reporting to the Academy's Party group and were about to leave when Jiangyuan Zhang came out after us and said to Li: "We are placing our bet on you." The goal was that by the end of the project in 2004 the SPEC CPU2000 score would reach 300 points. On this basis we settled on a 64-bit architecture and a four-issue design. Based on the project objectives and on the experience and lessons of developing Godson-1, we set the following three design principles for Godson-2.
First, the principle of hardware/software co-design that makes full use of parallelism. That is, performance is improved by exploiting parallelism at every level of the processor, including instruction-level parallelism, data-level parallelism, and thread-level parallelism. Instruction-level parallelism is achieved mainly through a four-issue structure, in which every stage of the instruction pipeline handles four instructions per cycle. To make the multiple issue paths effective, full out-of-order execution must be implemented to reduce the time instructions spend waiting on one another. Data-level parallelism is exploited mainly through SIMD vector instructions. Thread-level parallelism includes multithreading on a single processor and multithreading across multiple processors. In exploiting parallelism, Godson-2 concentrates on instruction-level parallelism, and exploits data-level parallelism with SIMD by having media processing share the floating-point units. Hardware/software co-design refers mainly to compiler optimization and hardware support for the compiler: we neither one-sidedly pursue complex hardware that takes care of everything, nor place the whole burden of optimization on the compiler. The compiler matters a great deal for performance. We had this experience: on the same machine, the same program compiled with different compilers differed in running time by as much as 75%.
Second, the principle that physical design guides structural design. To begin with, the longest logic path is not determined by the needs of the structural design but by the requirements of the physical design. That is, the maximum delay of each pipeline stage is fixed first and imposed as a constraint on the structural design. Beyond that, structural designers should keep the physical design in mind, that is, understand what their logic will look like physically. In the Godson-2 design flow, structural designers work at least down to the gate-level netlist.
Third, the principle of steady, step-by-step design and implementation. To begin with, we pay close attention to the cycle-by-cycle C simulator. Using the C simulator as the detailed design document is one of the most important lessons from developing Godson-1: far from holding back progress, it accelerates it. In addition, Godson-2's functional design and physical design proceed in several steps. In the first step, the design is still mostly standard-cell, with only a very limited part done in full custom (for example, the register file). The target frequency is 200-300MHz or more, the L2 cache is not implemented, and the aim is to complete a tapeout. In the second step, an L2 cache interface and/or a DDR interface is added. The physical design uses more macros, but the method is still standard-cell based, with a frequency of 300-400MHz. In the third step, support for multiprocessor systems is added and full-custom cells are used in more places, for a frequency of 400-500MHz or more. The final Godson-2 will be based on full-custom design.
The design of Godson-2 comprises three stages: structural design, logic design, and physical design. These three stages overlap, and the first of them also overlapped with the design of Godson-1. The structural design of Godson-2 went on intermittently for several months. It began in April and May 2002, when, alongside the physical design of Godson-1, we gave preliminary consideration to the structure of Godson-2. Drawing on the main processors on the market, such as the Alpha 21264, MIPS R10000, Ultra Sparc III, Power III, HP PA8700, PIV, and IA64, and on the main academic research, we established a framework for Godson-2's register renaming, dynamic scheduling, and functional units. By June and July, as the physical design and system development of Godson-1 got going, the structural design of Godson-2 almost stopped. At that time our group had only twenty or thirty people in all, many of them doubling up on tasks, and we did not have the strength to do two things at once.
In mid-July 2002, after the Godson-1 tapeout, we used the wait for the chips to come back to put the structural design of Godson-2 on the agenda again. On July 15, Jin Xiaoming of the graduate department phoned to invite me to give a talk at a graduate seminar in Guangyuan, Sichuan. The talk was to have been given by Xu Zhiwei, but he could not make it at short notice, so I was sent to fill in. Tang Zhimin had told Jin that I had just taped out a chip and should be free, so I could not get out of it. After the meeting there was to be a trip to Jiuzhaigou, and it would be three or four days before we got back. Before leaving I decided on the spot to take some of the Godson-2 designers to Guangyuan, so that we could discuss the structure of Godson-2 along the way. I set off by train with two teachers from the Graduate School of the Chinese Academy of Sciences. An Hong, Zhang Fuxin, and Fan Dongrui flew the next day, and we reached Guangyuan at almost the same time. My daughter was on summer vacation, and my wife had started work at a company after the Godson-1 tapeout, so I brought my 6-year-old daughter along as well.
It later proved that the decision I made before leaving was very much the right one. The days in Guangyuan and Jiuzhaigou were highly efficient, and we settled the basic structural framework of Godson-2. During the day we followed the conference's group activities. In the evening we discussed the structure of Godson-2 and wrote up the results as a preliminary structural design document, working until one or two in the morning every day. As the processor's register renaming and dynamic scheduling structure had already been basically established, the discussion focused on the structure of the instruction fetch and memory access components.
On the road from Guangyuan to Jiuzhaigou, Fan Dongrui and I sat in the last row of the bus discussing the structure of Godson-2's fetch and decode section. The bus bumped along the road for a whole day, and we discussed for a whole day. Fetch and decode has a large design space: which branch prediction algorithm to use; how to eliminate the delay slot after a branch instruction in a multiple-issue design; whether fetch and branch prediction work on a single instruction or on a block (four instructions) as the unit; whether branch prediction is done in the fetch stage or the decode stage; when to update the BHT and BTB; how to improve instruction cache performance; and the relationship between the instruction TLB and the data TLB. The question of how to eliminate the delay slot after branch instructions took the longest, mainly in repeatedly comparing and analyzing the traditional BTB method and the line prediction method used in the Alpha 21264. I have always liked riding in bumpy vehicles: the bumpier the ride, the more alert I am. Fan Dongrui was in good spirits too, so we were very efficient along the way. When the bus reached Jiuzhaigou at 20:00, the structure of Godson-2's fetch section had been basically settled.
The structure of the memory access section is more complex than that of the fetch section. On the one hand, it is the part most closely tied to the operating system, and the completeness of its functions is a key factor in supporting a general-purpose operating system. On the other hand, it is one of the core components for improving processor performance. If cache access is inefficient, it does not matter how well the rest of the pipeline is designed. We had learned this from Godson-1. Both academia and industry have done a great deal of research on improving memory access performance, and the design space is large. The core issues include how to reduce the pipeline delay of cache access, how to raise the cache hit rate and reduce the waiting caused by misses, and how to resolve RAW, WAR, and WAW dependences among memory accesses. Over the next few days we weighed and discussed these aspects repeatedly. By the time we boarded the train back to Beijing, we had only a relatively rough idea.
Interestingly, having watched us work all along the way, my daughter came away with her own understanding of CPU design. To this day, when I ask her what a CPU is, she says: "... and finally you burn it, and out comes a shiny little box."
Although system development and performance analysis for Godson-1 delayed things for some time, I gained a great deal from nearly a month of working with Godson-1, in particular a deeper understanding of the relationship between performance and frequency. For example, for some memory-intensive applications, performance with the motherboard at 83MHz and the CPU at 250MHz was not as good as with the motherboard at 100MHz and the CPU at 200MHz. I now think of a processor's performance as the throughput of a city's transport system: a few blockages may limit the throughput of the whole city, and clearing just those few takes little effort but greatly increases throughput.
After the Godson-1 launch conference on September 28, 2002, the design of Godson-2 went into full swing. On October 2, using the National Day holiday, I took Zhang Fuxin and Li Zusong to my alma mater, USTC, for closed-door development of the Godson-2 C simulator and, incidentally, to report on our work there. At USTC we spent more than a week in half of a former storeroom and basically finished writing the C simulator code. In refining the structure we ran into many problems that had not been considered originally, and sometimes we argued about them very heatedly. For example, when a branch misprediction is cancelled, one must determine which of the instructions in flight are ahead of the branch instruction and which are behind it. Zhang Fuxin and Li Zusong favored following the method used in the MIPS R10000, while I thought that method too cumbersome and hoped for something more concise. We argued for two days, each inspired by the other in the course of the debate, and finally found a simple and efficient method.
When we returned to Beijing on October 8, 2002, the Godson-2 C simulator had basically taken shape. We continued semi-closed development in my office, mainly improving the C simulator and starting to debug it. During that time we rested only on Tuesday, Thursday, and Saturday evenings each week and spent the rest of the time debugging. The debugging also mobilized many other people in the group to write test vectors. In mid-November the C simulator successfully booted the Linux operating system, and we began optimizing the simulator to speed up simulation and using it to analyze the performance of the Godson-2 structure.
During this period and the following few months, we ran on the C simulator almost all of the SPEC CPU2000 programs, as well as dhrystone and whetstone, the benchmarks popular in the eighties, and made a preliminary analysis of Godson-2's performance. Running these programs also exposed a number of design bugs and ill-considered points. One that left a deep impression was that out-of-order execution of memory accesses could reorder two or more memory operations and cause a deadlock on a cache block. Another concerned the branch delay slot. The MIPS instruction set specifies that the instruction in a branch delay slot must not itself be a branch, otherwise the processor's behavior is undefined. But we found on the C simulator that in our design, a branch in the delay slot of another branch could deadlock the processor. Although this is caused by an incorrect program, it was also an ill-considered point in the structural design: for an incorrect program we may give a wrong result, but we must not hang the machine.
With Zhang Fuxin and Li Zusong on board, the Godson-2 C simulator was far more complete than the Godson-1 one. Many features, including checkpointing, were added, and the simulator's running speed was greatly improved. Zhang Fuxin also developed a number of handy small tools.
By the end of November 2002, I judged the C simulator to be basically stable and convened a meeting to summarize Godson-1 and plan Godson-2, at which the RTL design of Godson-2 was fully launched.
In early December 2002 we set up the RTL design team. As our manpower was limited, the RTL writers were drawn from various groups, and I myself took charge of register renaming and several queue modules. The RTL design of Godson-2 can be divided into three stages.
The first stage was the design stage. From early December we spent about half a month coming to understand the structure of Godson-2, and I started on the top-level module design, mainly the interconnection between modules and the definition of the interface buses and flip-flops. On December 28 the top-level module design was complete and the writing of the RTL modules began. With the cycle-by-cycle C simulator as a reference, all the RTL modules were written and compiled by January 14, 2003, and the first instruction ran successfully on January 21. On this basis, after three days and three nights of effort, by January 25 we had successfully run the functional test programs used for Godson-1, which cover all the MIPS instructions. As we had not taken any holidays since the 2002 Chinese New Year, the whole group went on holiday after January 25.
The second stage was the joint-debugging stage. After the New Year holiday we began running the Linux operating system in the RTL simulation environment. After more than a week of continuous effort, Linux ran successfully on February 18. In the joint debugging of Godson-1, once Linux was running the pipeline design showed no further problems, and only some floating-point issues were found. But with Godson-2, after Linux ran, we met great difficulty trying to run whetstone, and at one point progress stalled. The error occurred during a dynamic library call, and without the dynamic library's source code we could not debug it. As a last resort I organized the RTL writers for two days of closed-door self-review on March 7 and 8. The self-review found more than 20 errors large and small, and with that the joint debugging broke through and whetstone ran. Later we held two more closed-door self-reviews, which found only one or two minor errors.
The third stage was the adjustment and optimization stage, a critical stage in the logic design of Godson-2. Compared with the joint-debugging stage, fewer bugs were found in this stage. Instead, based on synthesis of the RTL and on performance analysis with the C simulator, we continuously optimized the delay, area, and performance of the whole design. Through the initial optimization, Godson-2's delay was cut by more than half, its area was reduced by 30%, and its performance at the same frequency increased by 30% or more. In this stage every week brought exciting improvements, and I came to better appreciate the profound truth in a saying of Confucius, which is even more true of processor design: it takes 1% of the effort to complete a correct design, but 99% of the effort to optimize it.
In optimizing the Godson-2 RTL we drew three lessons. The first is to strive for excellence. Producing a correct design and producing a fine design are very different things. To achieve excellence one must never be satisfied and must keep improving. When facing a complex problem, one must not be content with a complex solution, but strive to simplify the problem and then solve it in the simplest way. The second lesson is that while devoting oneself to understanding and mastering the details, it is essential to step back and observe and reflect on the whole. Many optimizations in Godson-2 came as inspiration when we stepped back from pushing the project forward to tidy up documents, read papers, or hold closed-door self-reviews. Neither the micro nor the macro understanding of the design can be neglected. Without a solid grasp of the design's details, tidying documents or reading papers would inspire nothing. On the other hand, if one is too absorbed in the details, one may see only the trees and miss some big improvements. The third lesson is to rely on facts. Continuous performance analysis, physical synthesis, and simulation of the design provided the factual basis for substantial improvements and corrections to Godson-2. Improving a design according to the facts requires a large body of facts and data (a small, unrepresentative sample will not do) and in-depth analysis to find the essence hidden behind the facts. This is what design optimization and improvement depend on.
In parallel with RTL design and verification, we built the FPGA verification environment. Here I made a mistake. Because we already had FPGA verification experience from Godson-1, I assumed that FPGA verification of Godson-2 would pose no problem, and put just one person, Fan Baoxia, in charge of it. I did not expect that Godson-2's larger scale and more complex design would make FPGA verification difficult. The main difficulty was that the design does not fit in a single FPGA, so multiple FPGAs are needed, and so many signals cross between them that the interfaces between the FPGAs have to transmit at a multiplied frequency. In addition, the multi-ported register file required by multiple issue is hard to implement in an FPGA. Not until the end of April did I realize that too little effort had gone into FPGA verification, and strengthen the team. FPGA verification was completed at the end of June, two weeks before the first Godson-2 tapeout, in time to find one design error.
While the processor's structure and logic were being designed, other work also went ahead: Wang Jian and Zheng Bao-Jian led the continued development of the Godson-1 system and the Godson-2 software development environment; Zheng Weimin led the development of the Godson-2 motherboard; and Xu Tong, Zhao Jiye, Zhongshi Jiang, and Zhang Heng were responsible for physical design and for research on verification methods, and so on.
During the RTL design of Godson-2, the SARS outbreak in Beijing put us to a severe test. The policy at the time was that there would be no unified holiday, but each department could arrange leave according to its own circumstances. After discussion, Tang Zhimin and I decided to take certain preventive measures and to reduce the intensity of work appropriately. We asked everyone not to commute by public transport and to finish work before 9 pm, and the Institute arranged for lunch and dinner to be eaten in the office every day. Visitors from outside had long been barred from the North Building. In addition, the Academy and the Institute issued us preventive medicine, and we bought some ourselves. In those days our progress was forced to slow somewhat, but we kept moving forward. I was inspired by the united spirit with which people across the country faced the disaster and overcame SARS, and moved by the whole group's sticking to their posts in such difficult circumstances.
In March 2003 we began to arrange the full-custom design of the 9-port register file used in Godson-2. To be safe, we pursued two options. The preferred option was to ask a large company to do the register file for us. At the same time we cooperated with the Microelectronics Center of the Chinese Academy of Sciences, asking them to design the same register file as a backup. As the first silicon was mainly meant to validate the correctness of the design and the properties of the structure, everything in the first silicon other than the register file used an ASIC design method. To reduce the cost of buying layout tools, we prepared to use the Synopsys EDA tools at the Chinese Academy of Sciences EDA Centre, so before May the physical design team also familiarized themselves with the Synopsys tools.
In May 2003 the physical design of Godson-2 began. From early May to late June we repeatedly tested, compared, and evaluated the methods and flows to be used, in particular whether to use a hierarchical design method, which wireload model to use, and the floorplan, and finalized the methodology and flow. By the end of June the floorplan was fixed and placement and routing were complete, and we contacted the foundry to prepare for tapeout at TSMC on July 10. Everything would have been ready, but while running SPEC CPU2000 programs for performance analysis we found that one program's floating-point results were sometimes right and sometimes wrong. As the other programs ran correctly, and the operating system's support for the virtually addressed cache still had bugs, at first I did not think the RTL was at fault. On the afternoon of July 2, Zhang Fuxin said in the machine room: [...]. For dozens of hours afterwards we tracked the error using FPGA verification, the C simulator, and RTL simulation, and on the morning of July 4 we finally found an RTL bug.
Fortunately, the problem involved only a local part of the design. After modifying the RTL we changed the netlist by hand, and spent a day completing the ECO placement and routing.
one after another. ... was about to Xie Hui
