
      Design of RFSoC-based Digital Phased Array Feed (PAF) and Hybrid Architecture Beamforming System

      2022-05-24 08:10:36

      Xin Pei, Na Wang, Dan Werthimer, Xue-Feng Duan, Jian Li, Toktonur Ergesh, Qi Liu, and Ming-Hui Cai

      1 Xinjiang Astronomical Observatory, Chinese Academy of Sciences, Urumqi 830011, China; peixin@xao.ac.cn, na.wang@xao.ac.cn

      2 University of Chinese Academy of Sciences, Beijing 100049, China

      3 Xinjiang Key Laboratory of Microwave Technology, Urumqi 830011, China

      4 Department of Astronomy, University of California, Berkeley, CA 94720, USA

      Abstract As the number of array elements and the bandwidth increase, the design challenges of the Phased Array Feed (PAF) front-end and its signal processing system grow. Aiming at the ng-PAF of the 110 m radio telescope, this article introduces the concept of fully digital receivers and attempts to use Radio Frequency System-on-Chip (RFSoC) technology to digitize close to the feed array, reduce the complexity and the number of analog components of the front-end, and improve signal fidelity. The article discusses the digital beamforming topology and designs a PAF signal processing experimental system based on an RFSoC+GPU hybrid architecture. The system adopts a ZCU111 board as the RF-direct digitization and preprocessing front-end, which can sample eight signals at up to 2.048 GSPS, 12 bit, channelize the signals into 1024 chunks, reorder them into four data streams, and select one of the 256 MHz frequency bands for output through four 10 Gb links. A GPU server is equipped with four RTX 3090 GPUs running four HRBF_HASHPIPE instances, each receiving a 64 MHz bandwidth signal for high-throughput real-time beamforming. The experimental system uses a signal generator to emulate Sa-like signals and propagates them through rod antennas, which verifies the effectiveness of the beamforming algorithm. Performance tests show that, after algorithm optimization, the average processing time for 4 ms of data is less than 3 ms in the four-GPU parallel processing mode. According to the measurement results, the RFSoC integrated design shows significant advantages in power consumption and electromagnetic radiation compared with discrete circuits.

      Key words: instrumentation: miscellaneous – techniques: miscellaneous – telescopes

      1. Introduction

      A Phased Array Feed (PAF) is a multi-beam receiver technology that has been widely developed in radio astronomy in recent years. A PAF places an array of small feed elements and receivers at the focal plane of a radio telescope; their outputs are combined through analog or digital signal processing to form multiple simultaneous beams, thus increasing the telescope's Field of View (FoV) and improving the efficiency of sky surveys. In addition, these densely overlapping beams provide continuous sky coverage, a variety of flexible observation modes can be realized through dynamic beamforming algorithms, and an antenna gain close to or even exceeding that of traditional feeds can be obtained.

      The beamformer is the core of PAF signal processing. It adjusts the phases of the element signals so that they align for a specific direction, and then sums them to obtain a beam pointing in that direction. Repeating this for multiple directions forms multiple beams. PAF signal processing requires real-time computing, and the processing capability of the beamformer has long been a bottleneck restricting the development of PAFs.

      Early digital PAF beamforming systems could use a single device to sample a small number of array signals and perform narrowband signal processing (Landon et al. 2010). As the number of PAF array elements and the signal processing bandwidth increased, the signal acquisition and transmission rates rose sharply, and beamforming platforms based on multiple Field Programmable Gate Array (FPGA) processors began to be applied (van Cappellen & Bakker 2010).

      In 2015, the Australia Telescope National Facility (ATNF) developed the Mk-II feed array with 188 patch dipoles for the Australian Square Kilometre Array Pathfinder (ASKAP, Tuthill et al. 2016). Forty-eight Xilinx Kintex-7 FPGAs are employed for real-time beamforming calculation (Hampson et al. 2014); they can process a 384 MHz bandwidth signal and form 36 dual-polarization beams, reaching an FoV of 30 square degrees at 1.4 GHz (McConnell et al. 2016).

      With the significant advancement of GPU parallel computing capabilities, hybrid architecture clusters that use FPGAs for signal acquisition and channelization and GPUs for beamforming calculations have begun to be adopted. In 2017, Brigham Young University (BYU) collaborated with the National Radio Astronomy Observatory (NRAO) and West Virginia University (WVU) to develop the Focal L-band Array Feed (FLAG) PAF receiver system for the Green Bank Telescope (GBT, Roshi et al. 2018). The system incorporates five GPU computing nodes for correlation and beamforming calculation (CBF), which can process 150 MHz bandwidth signals in real time and form 7–13 beams.

      The phase stability of the PAF array signals in the transmission link is essential for high-precision beamforming. The Mk-II system uses Radio Frequency (RF) over Fiber (RFoF) to transmit signals. One hundred eighty-eight RF signals are converted into optical signals at the receiver end and transmitted to the signal processing room through analog optical fibers. This technology achieves higher phase stability than coaxial cable transmission; however, environmental changes along long-distance transmission lines still cause signal gain and phase fluctuations. In addition, strong electromagnetic interference can saturate the signal transmission link.

      To avoid the shortcomings of analog signal transmission and improve signal fidelity, an effective method is to digitize the RF signal directly at the receiver end. NRAO developed a dedicated analog-digital-optical module in the design of FLAG (Morgan et al. 2013). Thirty-eight PAF RF signals are sampled in the receiver cabin (sampling bandwidth 150 MHz, 8 bit) and then transmitted to the signal processing room through single-mode digital optical fiber links. The design relies on a double high-performance shielding box to reduce electromagnetic radiation and has successfully applied digitization close to the receiver on a large-aperture radio telescope.

      The next generation of PAFs has more elements and wider signal bandwidth, which poses challenges to the design of the PAF front-end and its signal processing system.

      The QiTai radio Telescope (QTT, Wang 2014) is planned to be built in Qitai county of Xinjiang, China. The telescope will be a fully steerable 110 m aperture antenna covering the frequency band from 270 MHz to 115 GHz (Ma et al. 2019). A 20 cm band PAF receiver will be installed on the QTT to improve the sky survey efficiency of the telescope and enhance its observation capabilities in pulsar search, transient detection, extended source scanning, molecular line surveys, etc. The planned PAF receiver will contain 96 elements (dual polarization) with a frequency range from 0.7 to 1.8 GHz. This article takes the planned QTT PAF as the research object, carries out the architecture design, algorithm and platform development, experimentation and result analysis, and aims to provide advanced and feasible signal processing solutions for the next-generation PAF receivers of large-aperture radio telescopes.

      2. Fully Digital PAF Front-end

      2.1. Solutions for the Large-scale PAF Front-end Design

      In the traditional super-heterodyne design, each feed needs to be connected to analog mixing, filtering and amplifying devices, so the number of analog components becomes very large as the number of PAF array elements increases. On the one hand, the complexity and cost of the system increase, and the analog links take up more space, which makes a compact PAF harder to design. On the other hand, because analog devices are not perfectly matched, the signal responses of different channels differ, which affects the signal quality of PAF beamforming.

      The fully digital receiver is a new receiver design concept based on software-defined radio that has emerged with the development of digital technology. The idea is to place digitization as close as possible to the feed, as illustrated in Figure 1, in order to reduce or eliminate analog links, reduce complexity and weight, save space, avoid the signal gain and phase fluctuations caused by environmental changes in analog links, and eventually improve signal fidelity. It is very suitable for the compact design of large-scale feed arrays and is the development direction of future large-scale PAF receivers.

      Figure 1. Block diagram of the fully digital receiver design.

      2.2. Fully Digital PAF Front-end Based on the Xilinx RFSoC Technology

      The key point of digital receiver design is to avoid radio frequency interference from digital devices to high-sensitivity large-aperture telescopes while ensuring high-speed, multi-channel signal sampling. FLAG uses a discrete structure of low-power devices and a double-layer shielding design to meet the electromagnetic radiation requirements, but its sampling bandwidth reaches only 150 MHz and its transmission rate only 2.5 Gbps (Morgan et al. 2013). This signal acquisition bandwidth falls short of what the next-generation ultra-wideband PAF requires.

      In 2018, Xilinx launched a new generation of integrated signal acquisition and processing chip, the Radio Frequency System-on-Chip (RFSoC), which integrates UltraScale+ FPGA, ARM, ADC/DAC, 100 Gb Ethernet and other resources (currently upgraded to the third generation). A single chip can achieve up to eight channels of 5 GSPS, 14 bit sampling or 16 channels of 2.5 GSPS, 14 bit sampling, and the power consumption of signal acquisition and transmission is much lower than that of a discrete design (Xilinx White Paper W489, 2019). According to the test results for characteristic parameters of the Xilinx ZCU111 evaluation board, such as the effective number of bits (ENOB), spurious-free dynamic range (SFDR), signal-to-noise and distortion (SINAD), intermodulation distortion (IMD), cross-talk between adjacent channels, and stability (Liu et al. 2021), the performance of this board is suitable for large-scale PAF signal acquisition and processing applications.

      3. Experimental System Design

      To provide a development reference for the QTT large-scale beamforming network, we analyzed the digital beamforming network architecture and designed a PAF signal processing experimental system based on the RFSoC+GPU hybrid architecture, as detailed below.

      3.1. Digital Beamforming Topology Analysis

      The digital beamforming network generally consists of two parts: the F-engine and the B-engine. The F-engine performs sampling, channelization and data distribution on the array element signals, and the B-engine completes the correlation and beamforming calculations. Because of the large number of PAF signals and the huge amount of data to process, multiple F-engine and B-engine nodes are generally used for distributed computing. According to the beamforming calculation principle, the calculation of a single beam requires multiple array signals, and the signals need to be shared between different beams. In view of this feature, there are two common digital beamforming network topologies.

      One is the ring-connected architecture (see Figure 2), in which each B-engine processes the data from the corresponding F-engine and then transmits them to the next B-engine for further calculation, so that the B-engines are finally connected into a ring. This architecture is suitable for narrowband PAF signal processing (Gunst & Kant 2005) due to the asymmetric data processing volume of the B-engines.

      Figure 2. Ring-connected digital beamforming network.

      Figure 3. Cross-connected digital beamforming network.

      The other is the cross-connected architecture (see Figure 3). This architecture divides the broadband signal into multiple narrowband signals in the F-engine and then distributes the narrowband signals to the corresponding B-engines for correlation and beamforming through a high-speed network. Each B-engine thus acts as an independent narrowband beamformer, and the combination of multiple nodes constitutes a wideband PAF signal processing system. An example of this design is the ASKAP beamformer (DeBoer et al. 2009), where each B-engine handles 19 MHz of the total 304 MHz bandwidth. The number of F-engines can be adjusted according to the number of PAF array elements, and the number of B-engines can be configured according to the number of PAF signals, the signal processing bandwidth, the number of formed beams and the processing capacity of a single node. This architecture therefore offers better scalability and flexibility. The experimental system is designed with the cross-connected architecture: the F-engine is based on RFSoC and the B-engine runs on a GPU server.

      Figure 4. The block diagram of the firmware design for the ZCU111 board. The ADC collects eight RF signals with a sampling rate of 2.048 GSPS, 12 bit. The signals are channelized into 1024 chunks and then reordered into four data streams, and one of the 256 MHz frequency bands is selected and output through four 10 Gb links.

      Figure 5. Internal structure of the Reorder module, including three stages to realize data stream combination and frequency band selection.

      3.2. Design of the RFSoC Digital Receiver

      The digital receiver of this system is implemented on a Xilinx ZCU111 evaluation board, which features the Zynq UltraScale+ RFSoC ZU28DR chip. The chip supports eight ADC channels with sampling at up to 4 GSPS, 12 bit. The FPGA firmware is designed with the CASPER toolkit (https://casper.ssl.berkeley.edu/wiki/Main_Page) using the ZCU111 library branch (https://github.com/liuweiseu/ZCU111) with the PYNQ v2.5 image. CASPER is the Collaboration for Astronomy Signal Processing and Electronics Research. Its main goal is to simplify the design process of radio astronomy instruments by developing platform-independent open source hardware and software, thereby promoting design reuse. Over the past ten years, CASPER technology and related instruments have been installed on hundreds of telescopes around the world (Hickish et al. 2016).

      3.2.1. FPGA Firmware Design

      The block diagram of the firmware design is displayed in Figure 4. The ADC of the ZCU111 collects eight RF signals with a sampling rate of 2.048 GSPS, 12 bit. After processing by the 2048-point Polyphase Filter Bank (PFB) and Fast Fourier Transform (FFT) modules, each of the eight input signals is channelized into 1024 chunks with a bandwidth of 1 MHz. The complex output data from the FFT module are separated into real and imaginary parts by the c_to_ri modules and then converted to 8 bits in the Quant module.
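
      The channelization arithmetic above can be checked with a short NumPy sketch. It is an illustration only and not the CASPER firmware: a Hann window stands in for the polyphase filter, the quantizer uses signed 8-bit values with an arbitrary scale, and the function names are hypothetical.

```python
import numpy as np

FS_HZ = 2.048e9              # ADC sampling rate stated in the text
N_FFT = 2048                 # transform length stated in the text
N_CHAN = N_FFT // 2          # 1024 usable channels for a real-valued input
CHAN_BW_HZ = FS_HZ / N_FFT   # 1 MHz per channel


def channelize(samples):
    """Cut real samples into 2048-point frames and return spectra (frames, 1024).

    Assumption: a Hann window replaces the firmware's PFB coefficients.
    """
    n_frames = len(samples) // N_FFT
    frames = np.reshape(samples[: n_frames * N_FFT], (n_frames, N_FFT))
    spectra = np.fft.rfft(frames * np.hanning(N_FFT), axis=1)
    return spectra[:, :N_CHAN]  # drop the Nyquist bin


def quantize_8bit(spectra, scale):
    """Scale, round and clip real and imaginary parts to signed 8-bit values."""
    re = np.clip(np.round(spectra.real * scale), -128, 127).astype(np.int8)
    im = np.clip(np.round(spectra.imag * scale), -128, 127).astype(np.int8)
    return re, im


# An 800 MHz test tone, four frames long.
t = np.arange(4 * N_FFT) / FS_HZ
tone = np.sin(2 * np.pi * 800e6 * t)

spectra = channelize(tone)
peak_channel = int(np.argmax(np.abs(spectra).mean(axis=0)))
re, im = quantize_8bit(spectra, scale=127 / np.abs(spectra).max())
print(f"channel width {CHAN_BW_HZ / 1e6:.0f} MHz, tone in channel {peak_channel}")
```

      With 1 MHz channels the channel index equals the frequency in MHz, which is why a single band and bin can later be addressed directly by frequency.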

      The next critical step is to recombine the data. According to the calculation principle of the cross-connected beamformer, the narrowband signals need to be combined into multiple groups and transmitted to the corresponding B-engines for further processing. Here, the Reorder module combines 64 channels into one group, so the 1024 channels are divided into 16 groups. To realize this function, a three-stage structure is adopted inside the Reorder module, as depicted in Figure 5. Reorder Stage 1 receives the 64-channel data stream output by the Quant module and combines the data into eight channels through the two-stage Concat modules, with each analog signal corresponding to one channel. The data type is converted from UFix_8_0 to UFix_64_0, and the data sequence is then transformed by eight Reorder and Transpose modules; the 1024 sub-channels are divided into four groups and output through four data streams. Reorder Stage 2 merges the separate analog signal streams: the eight analog signals are combined into one data stream, and the four groups of data are combined through four Concat modules to generate four data streams. As this is an experimental system, we currently utilize only four 10 Gb network ports for data output; transmission over dual 100 Gb network ports (using an FMC-QSFP High Speed Links Daughter Card) is under consideration. Because the four 10 Gb ports cannot carry the full bandwidth, we use Mux modules for frequency band selection. The 1024 MHz bandwidth is divided into four 256 MHz subbands (corresponding to the four data streams m_fb0 to m_fb3), and one of the subbands can be selected for output through the Mux module. A Reorder and Transpose module then separates the combined data and rearranges the data sequence into eight data streams. Reorder Stage 3 combines the eight parallel data streams into four and outputs them through four network modules for subsequent transmission.
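
      The grouping described above amounts to simple index arithmetic. The sketch below is a hypothetical helper, not part of the firmware; it assumes that the four 64-channel groups of a selected subband are assigned to the four links in ascending frequency order.

```python
N_CHAN = 1024            # 1 MHz channels after the FFT
CHAN_PER_GROUP = 64      # channels combined into one group by the Reorder module
GROUPS_PER_BAND = 4      # one group per 10 Gb link (assumed ordering)
CHAN_PER_BAND = CHAN_PER_GROUP * GROUPS_PER_BAND  # 256 channels = 256 MHz


def locate_channel(channel):
    """Return (band, link, bin_in_group) for a channel index in 0..1023."""
    if not 0 <= channel < N_CHAN:
        raise ValueError("channel index out of range")
    band, offset = divmod(channel, CHAN_PER_BAND)
    link, bin_in_group = divmod(offset, CHAN_PER_GROUP)
    return band, link, bin_in_group


def band_edges_mhz(band):
    """Lower and upper edge in MHz of one 256 MHz subband (1 MHz per channel)."""
    return band * CHAN_PER_BAND, (band + 1) * CHAN_PER_BAND


n_groups = N_CHAN // CHAN_PER_GROUP
print(n_groups, locate_channel(800), band_edges_mhz(3))
```

      Under these assumptions an 800 MHz tone lies in subband 3, on the first link, in bin 32 of its group.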

      The output data are packaged into VDIF format (https://vlbi.org/wp-content/uploads/2019/03/VDIF-specification-Release-1.0-ratified.pdf) by the Packetizer modules and output through four Eth modules. The band select number, IP address and port number can be configured by writing the internal registers through an external Python script.

      3.2.2. FPGA Synthesis Report

      The utilization report of FPGA resources is shown in Table 1. The utilization rate of every individual resource is less than 50%. The utilization rates of LUT and Block Random Access Memory (BRAM) are relatively high, at 43.704% and 41.157%, respectively. This is because modules such as FFT, Reorder and Eth all need memory for data buffering.

      Table 1 FPGA Utilization Report

      The power consumption report of the FPGA is displayed in Figure 6. The total power consumption is 26.886 W, of which the dynamic and static power consumption are 25.236 W and 1.65 W, respectively. The report counts only the power consumed by running the designed firmware; the basic power consumption, including running the Linux operating system, is not included. To compare the actual power consumption of the ZCU111 with the predicted value in the report, we measured the power consumption of the ZCU111. First, we start the ZCU111 running the Linux operating system and take the power consumption at this point as the basic power consumption, which is 20.3 W. Then, we load the FPGA firmware and measure the total power consumption, which is 53.7 W. The measured firmware power consumption is the total minus the basic power consumption, i.e., 33.4 W, which is 6.514 W higher than the predicted value. Compared with a discrete ADC design, the power consumption of RFSoC is greatly reduced (Xilinx White Paper W489, 2019).

      Figure 6. The power consumption report of the FPGA.

      3.2.3. Electromagnetic Compatibility (EMC) Characteristics

      Generally, the electromagnetic radiation of electronic equipment decreases as its power consumption is reduced. However, the electromagnetic radiation generated by different types of electronic devices and circuits differs. To clarify the difference in electromagnetic radiation characteristics between the RFSoC integrated ADC and discrete circuits, we tested and compared the electromagnetic radiation characteristics of the ZCU111 and SNAP2 (https://casper.ssl.berkeley.edu/wiki/SNAP2), which is a discrete ADC design. The measurement methods and procedures conform to the GJB151B-2013 standard. The test is carried out in a microwave anechoic chamber. The distance between the device under test and the receiving antenna is 1 m, and two antennas are used to cover the frequency bands of 100 MHz to 1000 MHz and 1000 MHz to 6000 MHz, respectively. The electric field power is collected using the maximum detection method. Figures 7(a) and (b) show the electromagnetic radiation characteristics of the ZCU111 and SNAP2 in the vertical polarization from 100 MHz to 1000 MHz and from 1000 MHz to 6000 MHz, respectively. (The horizontal polarization test results are similar.) The power consumption of the ZCU111 and SNAP2 during the measurement is 20.3 W and 55 W, respectively. Over the entire frequency band, the electromagnetic radiation intensity of SNAP2 is significantly stronger than that of the ZCU111; in particular, below about 4000 MHz the radiation intensity from SNAP2 fluctuates very sharply, resulting in a lot of broadband radiation. The electromagnetic radiation from the ZCU111 is relatively stable, and its intensity decreases with increasing frequency, which is most obvious in the range of about 400–1000 MHz. Above 1000 MHz the ZCU111 spectrum is largely consistent with the background, although many narrowband emissions remain visible (possibly from the on-board clock); their number and intensity are much smaller than those of SNAP2. These narrowband emissions should nevertheless be well contained by electromagnetic shielding boxes.

      Figure 7. The electromagnetic radiation characteristics of ZCU111 and SNAP2 in the vertical polarization.

      3.3. Beamforming Algorithm Implementation

      The basic principle of beamforming is to sum the element signals for each main lobe steering angle Ω_i to form a beam, and the beam output can be expressed as

      $$y_i[n] = \mathbf{w}_i^{H}\,\mathbf{x}[n], \quad i = 1, \ldots, k. \qquad (1)$$

      Here, x[n] is a matrix containing the array element signals and w_i is the weight vector, whose entries represent the gain and phase weights of the individual array elements and adjust the delays of the array signals along different transmission paths. It also compensates for differences in the signal gain response of the different array links. H denotes the conjugate transpose, and k is the total number of formed beams. Multiple beams are calculated using multiple weight vectors w_i.
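
      As a minimal numerical illustration of the weighted sum y_i[n] = w_i^H x[n], the NumPy sketch below forms two beams from eight elements. The inter-element phase step, the test signal and the function name are assumptions made for the example and do not come from the experimental system.

```python
import numpy as np


def form_beams(weights, x):
    """Apply y = W^H x.

    weights: complex array (elements, beams), one weight vector per column.
    x:       complex array (elements, samples).
    Returns  complex array (beams, samples).
    """
    return weights.conj().T @ x


n_elements, n_samples = 8, 16
phase_step = np.pi / 6                      # hypothetical inter-element phase
element_phase = phase_step * np.arange(n_elements)

# Unit-amplitude narrowband signal arriving with a linear phase gradient.
s = np.exp(2j * np.pi * 0.1 * np.arange(n_samples))
x = np.exp(1j * element_phase)[:, None] * s[None, :]

w_matched = np.exp(1j * element_phase)      # w^H cancels the arrival phases
w_mismatched = np.exp(-1j * element_phase)  # steered away from the source
W = np.stack([w_matched, w_mismatched], axis=1)

y = form_beams(W, x)
print(np.abs(y[:, 0]))  # amplitude of each beam at the first sample
```

      The matched beam adds the eight element signals coherently, while the mismatched beam largely cancels, which is the behavior the weights are meant to control.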

      The beamforming algorithm is developed on the High Availability SHared PIPeline Engine (HASHPIPE, MacMahon et al. 2018; Pei et al. 2021) framework, which provides a C API for designing parallel pipelines in which processing blocks run in separate threads and are connected by ring buffers. Three threads (net thread, bf thread and output thread) are designed to achieve real-time stream data acquisition, processing and storage. The algorithm is named High-throughput Real-time BeamForming HASHPIPE (HRBF_HASHPIPE) and is publicly available online (https://github.com/SparkePei/hrbf_hashpipe). Multiple HRBF_HASHPIPE instances can be started on the server simultaneously to process multiple data streams in parallel. When the program is initialized, multiple input and output ring buffers are allocated so that data can be read and written efficiently between threads.

      The flow chart of the beamforming algorithm is drawn in Figure 8. Light green stands for the GPU beamforming functions and pink for the HRBF_HASHPIPE threads; the BF thread calls the GPU beamforming core function at runtime. After beamforming initialization, the program loads the weights and copies the data from host to device in parallel. The weights can be obtained from pre-stored files, and the data are received by the net thread from the network port and transferred into the ring buffer. The program then decides whether to perform a transpose according to the structure of the received data. Once the weights and data are ready, the cublasCgemmBatched() function is called to perform the beamforming calculation. The power spectrum of the beamformed data can then be calculated and integrated. Two options are provided to handle different polarizations: Stokes calculation, transposition and accumulation for dual-polarization data, and power calculation, transposition and accumulation for single-polarization data. Finally, the integration result is saved to disk. The server is equipped with two Non-Volatile Memory express (NVMe) Solid-State Drive (SSD) cards, which can optionally store the raw high-speed data after beamforming. The algorithm reserves an interface for calling the xGPU program (https://github.com/GPU-correlators/xGPU) for correlation calculation.
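
      The dual-polarization branch can be illustrated with the NumPy sketch below, which computes and accumulates Stokes parameters from two beamformed voltage arrays. It is not the HRBF_HASHPIPE kernel: the function name is hypothetical, the transposition step is omitted, and the sign convention for V is an assumption, since conventions differ between instruments.

```python
import numpy as np


def stokes_integrate(bx, by, n_int):
    """Stokes I, Q, U, V from two polarizations, summed over n_int time samples.

    bx, by: complex beamformed voltages with layout [F, T, B].
    Returns a real array with layout [4, F, T // n_int, B].
    """
    xx = np.abs(bx) ** 2
    yy = np.abs(by) ** 2
    xy = bx * np.conj(by)
    # Assumed convention for linear feeds; the sign of V varies in the literature.
    stokes = np.stack([xx + yy, xx - yy, 2 * xy.real, -2 * xy.imag])
    n_f, n_t, n_b = bx.shape
    n_out = n_t // n_int
    stokes = stokes[:, :, : n_out * n_int, :]
    return stokes.reshape(4, n_f, n_out, n_int, n_b).sum(axis=3)


# Two frequency bins, eight time samples, three beams of constant test data.
bx = np.ones((2, 8, 3), dtype=complex)
by = 1j * np.ones((2, 8, 3), dtype=complex)   # 90 degree phase offset
result = stokes_integrate(bx, by, n_int=4)
print(result.shape, result[:, 0, 0, 0])
```

      For single-polarization data, only the power term |x|^2 would be accumulated in the same way.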

      Following the beamforming formula (1), the data structure of the beamforming calculation is displayed in Figure 9. The weight vector can be represented by a three-dimensional array W[F,E,B], where B stands for the formed beams, from 0 to M, E signifies the array elements, from 0 to N, and F corresponds to the frequency bins, from 0 to L. The array element input data x(n) can be represented by a three-dimensional array X[F,T,E], where E represents the array elements, from 0 to N, T signifies the time samples, from 0 to H, and F corresponds to the frequency bins, from 0 to L. The beamforming output y(n) can be represented by a three-dimensional array Y[F,T,B], where B corresponds to the formed beams, from 0 to M, T represents the time samples, from 0 to H, and F stands for the frequency bins, from 0 to L. The integration output y(n/i) can be represented by a three-dimensional array Y[F,Ti,B], where B represents the Stokes beams, from 0 to K, Ti corresponds to the time samples after integration, from 0 to S, and F stands for the frequency bins, from 0 to L.
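
      The array layouts above map directly onto a batched matrix product with one multiplication per frequency bin. The NumPy sketch below mirrors that structure on the CPU; it is not the GPU implementation, the array sizes are illustrative, and whether the weights are conjugated when loaded or inside the matrix product is an assumption made here.

```python
import numpy as np


def beamform_batched(X, W):
    """Y[F,T,B] = sum over E of X[F,T,E] * conj(W[F,E,B]), one product per bin."""
    return np.matmul(X, W.conj())


def integrate_power(Y, n_int):
    """Sum |Y|^2 over blocks of n_int time samples: [F,T,B] -> [F,T//n_int,B]."""
    n_f, n_t, n_b = Y.shape
    n_out = n_t // n_int
    power = np.abs(Y[:, : n_out * n_int, :]) ** 2
    return power.reshape(n_f, n_out, n_int, n_b).sum(axis=2)


rng = np.random.default_rng(1)
F, T, E, B = 64, 8, 8, 3                     # illustrative sizes
X = rng.standard_normal((F, T, E)) + 1j * rng.standard_normal((F, T, E))
W = np.ones((F, E, B), dtype=complex)        # equal weights for every element

Y = beamform_batched(X, W)
P = integrate_power(Y, n_int=4)
print(Y.shape, P.shape)
```

      Keeping F as the leading (batch) axis means every frequency bin is an independent small matrix product, which is the shape of problem a batched GEMM routine is designed for.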

      Figure 8. The flow chart of the beamforming algorithm. Light green stands for the GPU beamforming functions and pink for the HASHPIPE threads; the BF thread calls the GPU beamforming core function at runtime.

      Figure 10. Plots of generated signals and calculated results. (a) Seven generated sine wave signals, in which the sine waves are phase shifted by π/12. (b) Time domain plot of the simulated 7-element signals with noise. (c) Frequency domain plot of the simulated 7-element signals with noise. (d) Frequency domain plot of the beamformed signal, which is at bin No. 40.

      4. Experimentation

      The experiments proceed in two major steps. First, simulation data are generated, transferred to the GPU server and processed by an HRBF_HASHPIPE instance to test the GPU algorithm. Second, a signal generator is used to emulate antenna signals, and the ZCU111 board and the GPU server collect, transmit, receive and process the signals to test the entire system. The details are as follows.

      4.1. GPU Beamforming Algorithm Verification

      We wrote the simulated-signal generation and network transmission program in C, and generated seven groups of sine wave signals combined with noise to simulate seven array elements. The sine waves are phase shifted by π/12, as depicted in Figure 10(a). The simulated element signals with noise are displayed in the time domain and frequency domain in Figures 10(b) and (c). The 7-element signals are formed into one signal (beam); the beamforming factors are preset according to the amplitude and phase of the simulated signals, and the prepared weights are shown in Table 2. We then start the HRBF_HASHPIPE beamforming instance on the server, and the signals are received by the 100 GbE network card and processed. The beamformed result is plotted in the frequency domain in Figure 10(d); the formed signal falls in the same frequency bin as the 7-element signals (bin 40), and the signal-to-noise ratio is significantly improved.
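
      The verification can be reproduced in outline with the NumPy sketch below. It is a stand-in for the C generator and the GPU pipeline, not a copy of them: the noise level, the number of samples, and the phase-only weights are assumptions (the actual weights are those of Table 2), and the beam is formed in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ELEM, N_SAMP, TONE_BIN = 7, 1024, 40
PHASE_STEP = np.pi / 12          # phase shift between elements, as in the text
NOISE_SIGMA = 1.0                # assumed noise level

n = np.arange(N_SAMP)
phases = PHASE_STEP * np.arange(N_ELEM)
signals = np.sin(2 * np.pi * TONE_BIN * n / N_SAMP + phases[:, None])
signals = signals + NOISE_SIGMA * rng.standard_normal(signals.shape)

# Channelize each element, then apply w^H across elements in every bin.
spectra = np.fft.rfft(signals, axis=1)       # shape (7, 513)
weights = np.exp(1j * phases)                # assumed phase-only weights
beam = weights.conj() @ spectra              # shape (513,)


def peak_to_floor(spectrum):
    """Ratio of the strongest bin to the median bin magnitude."""
    mag = np.abs(spectrum)
    return mag.max() / np.median(mag)


peak_bin = int(np.argmax(np.abs(beam)))
print(peak_bin, peak_to_floor(spectra[0]), peak_to_floor(beam))
```

      The tone adds coherently across the seven elements while the noise adds incoherently, so the peak stays in bin 40 and stands further above the noise floor than in any single element.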

      Table 2 The Prepared Weights for Each Element

      4.2. Real-time Signal Sampling and Processing

      4.2.1. Experimental System

      A block diagram of the experimental system is displayed in Figure 11. We use a signal generator (R&S SMA100B) to generate a sine wave signal and combine it with a noise source through a power splitter connected in reverse. The signal is then divided into eight channels by a power splitter. The ZCU111 ADC collects the eight input signals and, after channelization and reordering, a selected 256 MHz frequency band is output through four 10 Gb links. The ZCU111 and the GPU server are connected through a 40/100 GbE switch. The GPU server receives data through two 100 Gb links, and each link receives 128 MHz bandwidth data through two ports. The server runs four HRBF_HASHPIPE instances; each receives single-port data, calls one GPU for beamforming processing and finally saves the calculation results to the NVMe SSD.

      The signal generator runs in level sweep mode with a scan list prepared in advance; the output signal is shown in Figure 12. The amplitude of the output signal follows an Sa-like function to emulate the direction of arrival for beam pattern measurements.
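
      One way such a scan list could be prepared is sketched below. This is a guess at the procedure rather than the script used in the experiment: it assumes that Sa(x) denotes the sampling function sin(x)/x, that the levels form an 11 × 11 grid, and that the pattern is mapped linearly onto the range from -10 to +10 dBm. The grid extent and the function name are hypothetical.

```python
import numpy as np


def sa_level_grid(n_points=11, extent=2.0, min_dbm=-10.0, max_dbm=10.0):
    """Return an (n_points, n_points) grid of output levels in dBm.

    The grid follows |sin(pi r) / (pi r)| of the radial distance r from the
    grid center, mapped linearly onto [min_dbm, max_dbm] (assumed mapping).
    """
    axis = np.linspace(-extent, extent, n_points)
    u, v = np.meshgrid(axis, axis)
    r = np.hypot(u, v)
    pattern = np.abs(np.sinc(r))     # np.sinc(r) = sin(pi r) / (pi r)
    return min_dbm + (max_dbm - min_dbm) * pattern


levels = sa_level_grid()
scan_list = levels.ravel()           # row by row, as a sweep list
print(levels.shape, scan_list.size, levels.max(), levels.min())
```

      Stepping through the list row by row would make the received power trace out a main lobe with side lobes as a function of the two-dimensional sample index.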

      For comparison, the experiment is run with and without injected noise. In addition, to simulate an actual antenna signal propagation link, the experiment is also run with rod antennas. A photo of the experimental system is shown in Figure 13. One rod antenna is connected to the output port of the signal generator as the transmitting antenna, and eight rod antennas serve as receiving antennas that feed the received signals to the ADC.

      Figure 11. A block diagram of the experimental system. A signal generator emulates the measurement signals; the ZCU111 collects the signals and outputs the selected 256 MHz bandwidth data to the GPU server through four 10 Gb links. The server runs four HRBF_HASHPIPE instances; each receives 64 MHz bandwidth data and calls one RTX 3090 GPU for the beamforming calculation.

      Figure 12. The output signal power pattern generated by a signal generator (R&S SMA100B). The amplitude of the output signal follows an Sa-like function to emulate the direction of arrival for beam pattern measurements. The frequency of the output signal is 800 MHz, and the amplitude is scanned from -10 to +10 dBm.

      4.2.2. ADC Raw Data Monitor

      To monitor the input signals and the status of the ADCs, a Python script reads the ADC sampling data, which are captured by the snapshot module, from the ZCU111's BRAM and plots them every second. A plot of the ADC sampling data is depicted in Figure 14. The three rows from the top are the sample histogram, the raw voltage samples and the power spectrum after the FFT, and each column corresponds to one ADC. The frequency of the output signal from the signal generator is 800 MHz, and the amplitude is scanned from -10 to +10 dBm. After passing through the power splitter, the signal is attenuated by about 10 dB. ADC 0–3 are single-ended inputs, among which ADC 2–3 are connected to 1–2 GHz filters, so their signals are slightly weaker. ADC 4–7 are differential inputs, which use RF transformers (Mini-Circuits TB-654+) for conversion.

      4.2.3. Beamforming Results

      We start the HRBF_HASHPIPE instance on the server to test the data receiving and processing functions. To receive the 800 MHz signal generated by the signal generator, we select frequency band 3 by setting the register reg_band_select to 3, which selects the band from 768 to 1023 MHz. The signal is received through four ports of a 100 Gb network link, with a total data rate of 32,928 Mbps. The frequency band of the data received by port 0 is 768–831 MHz, the frequency of the input signal is 800 MHz and the bandwidth of each frequency bin is 1 MHz (1024 MHz bandwidth divided by 1024 channels), so the test signal falls exactly in frequency bin 32 of port 0. The received data packets can be represented by a three-dimensional array data[T,F,E], where E signifies the array elements, from 0 to 7, F represents the frequency bins, from 0 to 63, and T denotes the time samples, from 0 to 7. This differs from the beamforming data structure X[F,T,E], so a transpose is required before the calculation. The contour map of the eight single-frequency signals is shown in Figure 15, in which the X-axis and Y-axis represent a two-dimensional sampling sequence; every 11 points constitute a row, for a total of 11 rows. The eight input signals simulate the beam pattern of the direction of arrival measured by the eight array elements.
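
      The transpose and the data rate can be illustrated with the NumPy sketch below. The packet contents are dummy values and the function name is hypothetical. The payload rate computed here is 32,768 Mbps; the slightly higher total quoted above would be consistent with packet header overhead, although the breakdown is not given in the text.

```python
import numpy as np

N_T, N_F, N_E = 8, 64, 8     # time samples, frequency bins, elements per packet


def packet_to_beamformer_layout(data_tfe):
    """Reorder data[T,F,E] into a contiguous X[F,T,E] array."""
    return np.ascontiguousarray(np.transpose(data_tfe, (1, 0, 2)))


# Dummy packet whose values encode their own position.
packet = np.arange(N_T * N_F * N_E).reshape(N_T, N_F, N_E)
X = packet_to_beamformer_layout(packet)

# Payload rate of the selected band: 8 signals x 256 channels, each channel
# delivering 1e6 complex samples per second with 8-bit real and imaginary parts.
N_SIGNALS, N_CHANNELS_SELECTED = 8, 256
CHANNEL_RATE_HZ = 2.048e9 / 2048
BITS_PER_COMPLEX_SAMPLE = 2 * 8
payload_mbps = (N_SIGNALS * N_CHANNELS_SELECTED * CHANNEL_RATE_HZ
                * BITS_PER_COMPLEX_SAMPLE / 1e6)

tone_bin_on_port0 = 800 - 768   # 1 MHz bins, port 0 starts at 768 MHz
print(X.shape, payload_mbps, tone_bin_on_port0)
```

      Making the frequency axis the leading dimension lets each frequency bin be handed to the beamformer as one contiguous [T,E] matrix.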

      Since the inputs are simulated as single-polarization signals, beamforming, power calculation and integration are performed on the eight signals. Here, we simply set the weight of each element to the same value to verify the correctness of the algorithm; how to obtain optimized weights is beyond the scope of this article. The contour maps of the formed beam obtained without noise, with noise, and using rod antennas are displayed in Figures 16(a)–(c), respectively. The frequency of the formed signals is the same as the frequency generated by the signal generator, and the signal patterns are similar to the generated signals and are all enhanced. Because of propagation attenuation, the signals received by the rod antennas are weaker than the signals input directly from the signal generator.

      Figure 13. A photo of the experimental system. The signal generator is used to generate measurement signals and can be combined with the noise source. For experiments using direct signal input, the signal is divided into eight channels by a power splitter. The ZCU111 collects the eight input signals and outputs the processed data stream to the GPU server through a 40/100 GbE switch. For experiments using rod antennas to simulate the actual antenna signal propagation link, one rod antenna is connected to the output port of the signal generator as the transmitting antenna, and eight rod antennas serve as receiving antennas that feed the received signals to the ADCs.

      5. Performance Analysis and Optimization

      HRBF_HASHPIPE mainly comprises six processing stages: prepare weights, prepare data, matrix multiplication kernel, calculate Stokes and accumulate, copy Stokes to host, and write Stokes (Stokes processing is taken as the example; power data processing is similar). Table 3 compares the processing time of each stage before and after optimization. The processing time for preparing data is shortened by a factor of ~6384 after optimization. This stage mainly consists of copying data from host to device, converting real data to complex, and performing the data transpose. With page-locked (pinned) memory, the time for copying data from host to device dropped from 9.035 to 0.003 ms. After adopting CUDA multi-threading and optimizing the dimBlock and dimGrid settings, the time for converting real data to complex and for performing the data transpose decreased from 31.768 ms and 237.547 ms to 0.018 ms and 0.065 ms, respectively.

      Table 3 Comparison of Processing Time Before and After Optimization

      In addition, the program uses cublasCgemmBatched() for batch processing, which speeds up the matrix multiplication by a factor of ~38. Implementing CUDA multi-threading and optimizing the dimBlock and dimGrid settings accelerates the calculate Stokes and accumulate stage by a factor of ~500. Parallel copying makes the copy Stokes to host stage ~15 times faster, and recording data on an NVMe SSD instead of a hard disk makes the write Stokes to file stage ~4 times faster.

      The share of the total processing time taken by each task after optimization is shown in Figure 17. Copy Stokes to host and write Stokes to file account for the largest shares, 42.78% and 34.76% respectively. Page-locked memory and memory mapping could not be used to reduce the data exchange time of copy Stokes to host: in our experiments these methods took more time and made the processing unstable. The time taken by these two tasks depends on the number of accumulations. In the test, 32 accumulations are performed, so the data volume is reduced by a factor of 32. With a longer integration time, the time consumed by these two tasks would be greatly reduced. Prepare weights refers to the time taken to load a new weights file; if the weights file is not changed frequently, the time consumed by this task would also be much lower.

      A large-scale PAF has a large number of array elements, wide bandwidth and a high data rate, which require real-time beamforming calculations. The computational complexity of the beamforming can be expressed as

      where k is the sampling factor, N_b is the number of formed beams, N_p is the average number of ports forming each beam (the number of sub-array elements) and B_sub is the bandwidth of the sub-channel processed by each node (the number of frequency bins). Each complex number calculation requires two real number multiplications.

      We tested the calculation performance for different numbers of beams, elements, frequency bins and time samples. First, we fix the number of elements and time samples to see how the numbers of frequency bins and beams affect the calculation performance. The GPU processing time is plotted in Figure 18. As the number of frequency bins varies over 16, 32, 64, 128 and 256, and the number of beams over 16, 32, 64 and 128, the processing time increases linearly with both.

      Figure 14. A plot of the ADC sampling data, captured by the snapshot module and plotted every second. The three rows, from the top, show the sample histogram, the original voltage samples, and the power spectrum after the FFT; each column corresponds to one ADC. The frequency of the signal is 800 MHz, and the amplitude is scanned from -10 to +10 dBm. After passing through the power splitter, the signal is attenuated by about 10 dB. ADC 0–3 are single-ended inputs, among which ADC 2–3 are connected to 1–2 GHz filters, so their signals are slightly weaker. ADC 4–7 are differential inputs, which use RF transformers for conversion.

      Figure 15. The contour map of the eight single-frequency signals. The eight input signals simulate the beam pattern of the direction of arrival measured by the eight array elements. The X-axis and Y-axis represent a two-dimensional sampling sequence; every 11 points constitute a row, for a total of 11 rows. The signal intensity is in dB.

      Next, we fix the number of elements and beams and vary the number of time samples together with the number of frequency bins. Figure 19 confirms that the processing time increases linearly with the number of time samples and frequency bins. Then, we fix the number of beams and time samples, but change the number of elements and frequency bins. Figure 20 demonstrates that the processing time also increases linearly with the number of elements and frequency bins.

      Figure 16. The contour map of the formed beam, in which the X-axis and Y-axis represent an 11×11 two-dimensional sampling sequence. The signal intensity is in dB. (a) The input signals come directly from the signal generator without noise. (b) The input signals come from the signal generator and are combined with noise. (c) Eight rod antennas are used to receive the signals.

      Figure 17. The share of the total processing time taken by each task after optimization. Copy Stokes to host and write Stokes to file account for the largest shares, 42.78% and 34.76% respectively. In the test, 32 accumulations are configured. As the number of accumulations increases, the time consumed by these two tasks can be greatly reduced. Prepare weights refers to the time required to load a new weights file.

      From the above tests, we find that the processing time scales linearly with each parameter. When the number of array elements or the bandwidth increases, the processing resources only need to be increased linearly.

      According to the planned PAF parameters of QTT, the numbers of elements, beams, frequency bins and time samples are set to 192, 64, 64 and 4096, respectively. We then start four threads to test the processing performance of multi-GPU parallel computing. Each time sample lasts 1 μs (1 MHz chunk), and 4096 samples are combined into a batch, so the data duration is about 4 ms. The test results for the multi-GPU parallel threads are depicted in Figure 21: the maximum time of a single calculation is around 3.3 ms and the average processing time of all instances is less than 3 ms, which meets the requirements of real-time processing.
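
      The size of this workload and its real-time margin can be estimated with a short sketch. It assumes that beamforming is a dense matrix product, so that each (beam, frequency bin, time sample) output needs one complex multiply-accumulate per element; the function names are hypothetical and the count ignores the Stokes and copy stages.

```python
# Back-of-envelope sketch under the stated assumptions; not a measurement.
N_ELEMENTS, N_BEAMS, N_BINS, N_SAMPLES = 192, 64, 64, 4096
SAMPLE_PERIOD_US = 1.0   # one sample per microsecond in each 1 MHz bin


def cmacs_per_batch(elements, beams, bins, samples):
    """Complex multiply-accumulates for one batch on one GPU."""
    return elements * beams * bins * samples


def batch_duration_ms(samples, sample_period_us):
    """Wall-clock span of the data contained in one batch."""
    return samples * sample_period_us / 1000.0


cmacs = cmacs_per_batch(N_ELEMENTS, N_BEAMS, N_BINS, N_SAMPLES)
duration_ms = batch_duration_ms(N_SAMPLES, SAMPLE_PERIOD_US)
required_rate = cmacs / (duration_ms * 1e-3)   # CMAC/s to keep up

print(cmacs)                                    # 3221225472
print(f"{required_rate:.3e}")                   # 7.864e+11
```

      Because the count is a plain product of the four parameters, doubling any one of them doubles the work, which is consistent with the linear trends reported in Figures 18 to 20. The measured worst case of about 3.3 ms stays below the batch duration, so each GPU finishes a batch before the next one arrives.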

      Figure 18. Performance test: number of frequency bins vs. number of beams. The processing time increases linearly with the number of frequency bins and beams.

      Figure 19. Performance test: number of frequency bins vs. number of time samples. The processing time increases linearly with the number of time samples and frequency bins.

      Figure 20. Performance test: number of frequency bins vs. number of elements. The processing time increases linearly with the number of elements and frequency bins.

      Figure 21. Test results of multi-GPU parallel threads. The maximum time of a single calculation is around 3.3 ms and the average processing time of all instances is less than 3 ms, which meets the requirements of real-time processing.

      6. QTT PAF Digital Beamforming System Topology Design

      The PAF receiver that QTT plans to assemble contains 96 dual-polarized array elements with a frequency range of 0.7–1.8 GHz. We use the cross-connected architecture to plan an RFSoC+GPU hybrid cluster for QTT PAF signal processing. The system consists of two parts, the Digital Receiver (DRX) and the CBF, corresponding to the RFSoC and the GPU server respectively. The DRX samples, channelizes and packetizes the array signal at the front end, divides the wideband signal into multiple narrowband signals, and then transmits the digital signal to the CBF through the high-speed network for correlation and beamforming. Table 4 lists the number and parameters of DRX nodes, CBF nodes and switches for different sampling rates, data precisions, data output rates, etc.

      Table 4 Equipment Quantity and Parameters Under Different Settings

      If we choose the first configuration in Table 4 for the fully digital receiver front-end design, the QTT PAF signal processing and beamforming network topology is as illustrated in Figure 22. A total of 24 DRX nodes are required to collect the 192 array element signals. Each DRX node collects eight channels at a sampling rate of 4.096 GSPS, and a 1.024 GHz band is selected from the 0.7 to 1.8 GHz bandpass signal for processing. (The ADC mixer of the RFSoC allows the signal to be shifted up or down in frequency by an arbitrary amount.) When the output data precision is 8 bit, the total DRX output data rate reaches 3072 Gbps. Each DRX node outputs through three 100 Gb network ports, at a rate of 42.6 Gbps per port. Forty-eight CBF nodes are needed for the beamforming calculations; each node processes a 21.3 MHz band and is equipped with two 100 Gb network ports, each with an input data rate of 32 Gbps. Three switches are required, each with sixty-four 100 Gb network ports.
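
      The rates quoted for the first configuration can be reproduced with a short budget calculation. The sketch rests on two assumptions that are not stated in the text: each channelized sample is complex, with the quoted precision applied to both the real and imaginary parts, and 1 Gbps is counted as 1024 Mbps. The function name is hypothetical.

```python
# Data-rate budget sketch; the two conventions below are assumptions.
def drx_cbf_budget(n_elements, bandwidth_mhz, bits,
                   n_drx, ports_per_drx, n_cbf, ports_per_cbf):
    """Return total and per-port rates for a DRX/CBF layout."""
    # Assumption: complex samples, `bits` each for real and imaginary parts,
    # critically sampled so the sample rate equals the bandwidth.
    total_mbps = n_elements * bandwidth_mhz * 2 * bits
    # Assumption: 1 Gbps = 1024 Mbps, which matches the quoted 3072 Gbps.
    total_gbps = total_mbps / 1024
    return {
        "total_gbps": total_gbps,
        "drx_port_gbps": total_gbps / n_drx / ports_per_drx,
        "cbf_bandwidth_mhz": bandwidth_mhz / n_cbf,
        "cbf_port_gbps": total_gbps / n_cbf / ports_per_cbf,
    }


cfg1 = drx_cbf_budget(n_elements=192, bandwidth_mhz=1024, bits=8,
                      n_drx=24, ports_per_drx=3, n_cbf=48, ports_per_cbf=2)
for name, value in cfg1.items():
    print(f"{name}: {value:.1f}")
```

      Under these assumptions the sketch gives 3072 Gbps in total, about 42.7 Gbps per DRX port, about 21.3 MHz per CBF node and 32 Gbps per CBF port, in line with the figures above.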

      Figure 22. QTT PAF signal processing and beamforming topology, based on the first configuration in Table 4. A total of 24 DRX nodes collect 192 array element signals. Each DRX node collects eight channels at a sampling rate of 4.096 GSPS, and a 1.024 GHz band is selected from the 0.7 to 1.8 GHz bandpass signal. Each DRX node outputs through three 100 Gb network ports, at a rate of 42.6 Gbps per port (at 8 bit data precision). Forty-eight CBF nodes are used for beamforming; each node is equipped with two 100 Gb network ports to process a 21.3 MHz band, and the input data rate is 32 Gbps per port. There are three switches for data exchange, each with sixty-four 100 Gb ports.

      If the sampling rate is reduced to 2.048 GSPS and the second Nyquist zone (1.024–2.048 GHz) is used for sampling, the actual processing band is from 1.024 to 1.8 GHz (the second configuration in Table 4). The Xilinx ZUx9DR series can then be selected to collect 16 array signal channels on a single DRX, which halves the number of DRX nodes to 12. The total DRX output data rate is still 3072 Gbps; however, the rate per port doubles to 85.3 Gbps because the number of DRX nodes is halved. The number of CBF nodes and switches remains unchanged. Compared with the first configuration, giving up 224 MHz of signal processing bandwidth saves 12 DRX nodes. Furthermore, we can reduce the data precision to lower the budget. For example, if the precision of the DRX output data is reduced to 4 bit, the number of CBF nodes is halved to 24 and the number of switches is reduced to 2, which saves half of the cost (the third configuration in Table 4).
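
      Sampling in the second Nyquist zone folds the 1.024–2.048 GHz band into the first zone with the spectrum reversed. The following sketch shows the standard folding arithmetic for a real-sampled signal; the function name is hypothetical, and the example is not specific to the RFSoC data converter configuration.

```python
# Standard Nyquist-zone folding for real sampling; illustrative only.
def alias_frequency_mhz(f_mhz, fs_mhz):
    """Return (apparent frequency in MHz, Nyquist zone, spectrum inverted)."""
    half = fs_mhz / 2
    zone = int(f_mhz // half) + 1      # zone 1 is 0 .. fs/2
    offset = f_mhz % half
    inverted = zone % 2 == 0           # even zones are spectrally reversed
    apparent = half - offset if inverted else offset
    return apparent, zone, inverted


# The upper edge of the processing band, sampled at 2.048 GSPS:
print(alias_frequency_mhz(1800, 2048))  # (248.0, 2, True)
```

      For the second zone the apparent frequency is simply the sampling rate minus the input frequency, so the 1.024–1.8 GHz processing band appears between 1.024 GHz and 0.248 GHz in reversed order, and the channel ordering has to be flipped accordingly.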

      7. Discussion and Conclusion

      The ZCU111 is the first RFSoC evaluation board launched by Xilinx. It uses the first-generation ZU28DR chip and is officially supported by many resources such as development libraries, demos and extension daughter cards. CASPER has also ported this board to its development tool-flow and provides development libraries such as the ADC and 10 Gb modules. To obtain more support and save development time, the design in this article is based on the ZCU111 platform. RFSoC products have now reached Gen 3; for example, the ZCU216 board (https://www.xilinx.com/products/boards-and-kits/zcu216.html), equipped with the ZU49DR chip, can sample up to 16 channels at 2.5 GSPS, 14 bit. This board has more channels and a higher dynamic range, which makes it more economical for PAF multi-channel acquisition and more resistant to saturation by interference. For actual engineering applications, we recommend designing on the Gen 3 RFSoC boards. To cope with the ultrahigh-speed data flow, we will develop a 100 Gb network module for RFSoC data output, and use the latest HASHPIPE version (https://github.com/david-macmahon/hashpipe/tree/ibverbs-branch), which supports InfiniBand Verbs (IBVerbs) for high-efficiency packet reception on a GPU server.

      The digital beamforming network architecture is essential to the overall design of the signal processing system. Based on the characteristics of different architectures and the experimental results, this article designs an RFSoC+GPU hybrid cluster with a distributed signal processing and beamforming network topology for the QTT PAF system. The DRX nodes divide the wideband signals into multiple groups and transmit them through separate links and switches. A certain number of CBF nodes are connected to the corresponding switch to receive and process data in real time. This architecture processes the full bandwidth of all the array signals, ensuring the integrity of data transmission and reducing redundant transmission. Each CBF node processes a section of narrowband signals, and the number of beams formed in a single CBF node can be adjusted according to the beam size in different frequency bands. For signals with more array elements and wider bandwidth, more processing units can be added, so the architecture scales well.

      The PAF receiver is a powerful tool for large-scale radio telescope surveys, which can greatly enhance the capability to detect faint celestial bodies. Aiming at the design requirements of the QTT PAF system, our research attempts to use RFSoC to provide a technical reference for digital PAF front-end design, and to make breakthrough improvements in receiver architecture that improve PAF front-end performance, save space and weight, and reduce costs. The article designs a real-time digital beamforming experimental platform based on an RFSoC+GPU hybrid architecture, which demonstrates the advantages of RFSoC integrated designs over discrete circuits in terms of power consumption and electromagnetic radiation, and verifies the feasibility of the technology and the effectiveness of the algorithm. It provides economical and advanced digital design and real-time signal processing solutions for the next generation of large-scale, ultra-wideband PAF receivers.

      Acknowledgments

      We thank Wei Liu, Jack Hickish and the CASPER community for developing the ZCU111 libraries, David MacMahon and Jeff Cobb for their help in designing the HRBF_HASHPIPE software, Karl Warnick, Brian Jeffs and Mitch Burnett for their advice on the design of the PAF experimental platform, and Yang Wu for his advice on the PAF signal emulation and testing. Finally, we thank the referee for helpful comments. This work was funded by the National Natural Science Foundation of China (NSFC, Grant No. 12073066), the National Key R&D Program of China under No. 2021YFC2203502, the Youth Innovation Promotion Association of CAS under No. 2020063, the NSFC (Grant Nos. 61931002, 12073067 and 11973077), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region under No. 2021D01E07. The research is partly supported by the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences (CAS).
