Wan Fang Big Data Platform

Cloud-based and big data-based solutions help businesses transform successfully

1 Platform Overview

Data now permeates every industry and business function and is gradually becoming an important factor of production. The use of vast amounts of data heralds a new wave of productivity growth and change across all sectors of society as the world fully enters the information age. According to IDC, the global big data market is expected to reach $53 billion by 2017 and to maintain a compound annual growth rate of more than 30% over the following years.
The big data machine is developed mainly for the national common cloud computing base platform and for large enterprise applications. It builds on distributed algorithms and data management technology to improve big data mining and intelligent service capabilities. Its development follows the guidance of national information security policy: from CPU chip and server system design and manufacturing through the operating system, common supporting software, virtualization technology and system clustering, it provides full-stack, integrated data governance support.

Figure 1 Domestic big data platform

The WFCloud big data platform sits in the platform service layer of the system. It is big data processing software customized and developed for processor platforms such as Loongson, Sunway and Feiteng. On the big data machine cluster, resource pools provided through virtualization form a big data processing cluster, and the software is fully adapted and optimized for that cluster. Search queries, graph computation, machine learning, data mining, real-time data processing and other models are unified under one base platform and exposed through a consistent API. The platform provides big data services for ingesting information from various business applications and for processing multi-source data, together with big data processing and analysis tools that analyze and extract business information and multi-source data, giving effective support to decision-support systems.

2 Platform Design

The WFCloud big data platform handles distributed storage and the underlying implementation of computing. It uses a distributed cluster as its foundation, stores data in a distributed file system, and processes big data tasks with distributed computing. In-memory computing is used in addition to offset the slowdown caused by distributed computations writing to the file system. By providing data storage, computation and mining interfaces, the platform supplies computation and data support for business services, so that teams working with large amounts of data can focus on business development without worrying about the underlying data organization. In particular, existing programs based on Hadoop, HBase and Hive can be migrated more easily to server systems built on processor architectures such as Loongson, Sunway and Feiteng.

2.1 Platform architecture

The WFCloud big data platform is built on servers with processor architectures such as Loongson, Sunway and Feiteng. Extensive adaptation and optimization has been carried out on these servers, and the architecture has been rewritten according to the hardware characteristics to meet big data workloads. The main focus of this work is the reliability and performance of the big data software. In production environments, the platform provides high availability (HA) for as many software components as possible, using either an active/standby or a load-sharing configuration so that a single point of failure does not affect system reliability. An automated deployment tool provides one-click installation and one-click cluster control. The software architecture of the big data platform is shown in the following figure.
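The active/standby configuration described above can be illustrated with a minimal sketch. The class and node names (`ActiveStandbyPair`, `master-1`, `master-2`) are hypothetical and are not part of the WFCloud platform; the sketch only shows the general idea of promoting a standby node when the active node fails.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActiveStandbyPair:
    """Hypothetical active/standby pair: requests go to the active node,
    and the standby is promoted when the active node is unhealthy."""

    def __init__(self, active: Node, standby: Node) -> None:
        self.active = active
        self.standby = standby

    def route(self) -> str:
        # Promote the standby if the current active node has failed.
        if not self.active.healthy:
            if not self.standby.healthy:
                raise RuntimeError("both nodes are down")
            self.active, self.standby = self.standby, self.active
        return self.active.name


pair = ActiveStandbyPair(Node("master-1"), Node("master-2"))
first = pair.route()            # served by master-1
pair.active.healthy = False     # simulate a failure of the active node
second = pair.route()           # the standby is promoted and serves the request
```

A load-sharing configuration would instead spread requests across several healthy nodes; the failure check stays the same.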

Figure 2 Software architecture of big data platform

2.2 Core components

2.2.1 WFCloud big data infrastructure platform

The WFCloud big data infrastructure platform is built on the open source big data architecture Apache Hadoop and runs on servers with processor architectures such as Loongson, Sunway and Feiteng. It builds a distributed file system based on HDFS for massive storage, performs distributed parallel processing with the MapReduce framework, and combines these with a primary/secondary backup architecture to achieve high availability. It gives big data processing systems distributed computing and distributed storage capabilities and provides platform support for upper-layer database systems and other application systems.

Figure 3 Distributed storage architecture
Distributed storage uses a Master/Slave architecture, as shown in the figure above. A storage cluster has an active control node, a standby control node and several data nodes. The control node manages the metadata of the file system, while the data nodes store the actual data. A client accesses the file system by interacting with both: it contacts the control node to obtain the metadata of a file, while the actual file I/O operations go directly to the data nodes.
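The split between metadata and data paths can be sketched as follows. This is a simplified illustration under stated assumptions, not the HDFS or WFCloud implementation: the class names, the round-robin placement, the block size of 4 characters and the replica count of 2 are all chosen only for the example.

```python
class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class ControlNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, data_nodes, replicas=2):
        self.node_names = [node.name for node in data_nodes]
        self.replicas = replicas
        self.file_blocks = {}   # path -> list of (block_id, [node names])
        self._next = 0

    def allocate(self, path, num_blocks):
        layout = []
        for i in range(num_blocks):
            # Round-robin placement; real placement policies are more involved.
            targets = [self.node_names[(self._next + r) % len(self.node_names)]
                       for r in range(self.replicas)]
            self._next += 1
            layout.append((f"{path}#{i}", targets))
        self.file_blocks[path] = layout
        return layout

    def locate(self, path):
        return self.file_blocks[path]


class Client:
    def __init__(self, control, data_nodes):
        self.control = control
        self.nodes = {node.name: node for node in data_nodes}

    def write(self, path, data, block_size=4):
        chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        layout = self.control.allocate(path, len(chunks))    # metadata request
        for (block_id, targets), chunk in zip(layout, chunks):
            for name in targets:                              # direct data I/O
                self.nodes[name].write_block(block_id, chunk)

    def read(self, path):
        parts = []
        for block_id, targets in self.control.locate(path):  # metadata request
            parts.append(self.nodes[targets[0]].read_block(block_id))
        return "".join(parts)


data_nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
control = ControlNode(data_nodes)
client = Client(control, data_nodes)
client.write("/logs/a.txt", "hello world")
text = client.read("/logs/a.txt")
```

Note that the control node never sees file contents; it only records block locations, which is what keeps it from becoming a bottleneck for data transfer.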
The WFCloud big data infrastructure ensures the reliability of the distributed file system through redundant backup, replica storage, heartbeat detection, safe mode, data integrity checking, space reclamation, handling of metadata disk failures, and snapshots. The platform uses YARN as its resource management system to manage and schedule resources for the various applications. The MapReduce framework, optimized for Loongson, Sunway, Feiteng and other processor platforms, serves as a distributed data processing model and execution environment that processes large amounts of data quickly and in parallel.
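The MapReduce processing model mentioned above can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce phases using a word count; it does not use the Hadoop API, and the function names are chosen only for this example.

```python
from collections import defaultdict


def map_phase(record):
    # Emit (word, 1) for every word in one input line.
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data platform", "big data machine"])
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the data flow is the same.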
The WFCloud big data infrastructure platform is designed for different application scenarios and priorities, such as storage, offline computing and distributed computing. Its configuration can be tuned for each of these, giving it a high degree of customization and scalability.

2.2.2 WFCloud big data in-memory computing framework

The WFCloud big data in-memory computing framework is based on the open source framework Apache Spark; the framework and its related cluster and monitoring software have been re-customized and developed for processor platforms such as Loongson, Sunway and Feiteng. Spark is a big data processing framework built around speed, ease of use and sophisticated analytics. It provides a comprehensive, unified framework for managing the big data processing needs of datasets that differ in nature (text data, graph data, etc.) and in source (batch data or real-time streaming data). Spark uses in-memory computing to analyze and compute on data in memory before it is written to disk. The Spark project is mainly composed of RDDs (Resilient Distributed Datasets), Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX.
The WFCloud big data in-memory computing framework features the following:
● Supports distributed in-memory computing
● Supports iterative computation
● Compatible with Hadoop file read and write modes
● Fault tolerance during computation
● Supports application development in multiple languages (Scala/Java/Python)
● Linear scaling of computing power
The WFCloud big data in-memory computing framework is a memory-based iterative computing framework (shown in Figure 4). It suits applications that operate on the same data sets many times, such as machine learning, graph mining algorithms and interactive data mining algorithms. The more often the data is reused and the larger the volume of data to be read, the greater the benefit; where the data volume is small but the computation is intensive, the benefit is relatively small. Because of the nature of resilient distributed datasets, the framework is not suitable for applications that perform asynchronous, fine-grained state updates, such as the data store behind a web application service.
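Why repeated operations benefit from in-memory data can be shown with a small sketch. It does not use the Spark API: `Dataset`, `cache` and `iterate_mean` are hypothetical names, and the "storage read" is simulated with a counter so that the effect of caching on an iterative computation is visible.

```python
class Dataset:
    """Hypothetical dataset that counts how often it is read from storage."""

    def __init__(self, loader):
        self._loader = loader
        self._cached = None
        self.loads = 0

    def _read(self):
        self.loads += 1
        return list(self._loader())

    def cache(self):
        # Read once and keep the records in memory for later iterations.
        self._cached = self._read()
        return self

    def records(self):
        return self._cached if self._cached is not None else self._read()


def iterate_mean(dataset, iterations):
    # Simple iterative refinement: move an estimate towards the data mean.
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.records()
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def source():
    return [1.0, 2.0, 3.0, 4.0]


uncached = Dataset(source)
cached = Dataset(source).cache()
iterate_mean(uncached, 10)   # reads the source on every iteration
iterate_mean(cached, 10)     # reads the source once, then reuses memory
```

After ten iterations `uncached.loads` is 10 while `cached.loads` is 1, so the saving grows with the number of passes over the same data.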

Figure 4 In-memory computing architecture diagram
The data processed by the WFCloud big data in-memory computing framework can come from multiple data sources, such as local files and HDFS. The WFCloud cloud computing platform uses HDFS as its underlying data storage, so users can switch quickly from MapReduce to the in-memory computing framework, which reads large-scale data in one pass for parallel computation and stores the results in HDFS when the computation is complete. The framework can provide 10 to 100 times better performance than MapReduce. As a computing engine it also supports micro-batch stream processing, offline batch processing, SQL queries and data mining, which avoids the storage and performance overhead of loading the same data into several different types of systems.
Given the performance gap between servers based on Loongson, Sunway, Feiteng and similar processors and x86 devices, the in-memory computing framework can, to some extent, make up for MapReduce's shortcomings in execution performance, such as intermediate result output, data format and memory layout, execution strategy, and task scheduling overhead.

2.2.3 WFCloud large database system

In all types of military information systems, databases support the storage, query and statistical analysis of various kinds of data. As the volume of certain data types keeps growing, such as sensor, target trajectory and log data, the storage and access limits of conventional databases have been reached, and the access performance and storage scalability of NoSQL databases become the key to solving the problem. The relational database is no longer the only choice: the database field is entering an era of mixed (polyglot) persistence, in which several database solutions with different data storage models are used together, and this mixed approach to data persistence is gradually being adopted.
The WFCloud Large Database System (WFBase) is built on the open source database Apache HBase. It is a highly reliable, high-performance, column-oriented and scalable distributed database that provides massive data storage. The general architecture is shown in Figure 5. Following the design idea of "One Rule Them All", the big data database handles the storage and retrieval of semi-structured and unstructured data and provides database-level storage and retrieval for business systems, data warehouse construction and data mining, which simplifies application development. The system is closely matched to the server characteristics of Loongson, Sunway, Feiteng and similar platforms, making full use of the hardware and improving the overall performance of the database system.

Figure 5 WFBase architecture
WFBase uses HDFS as its file storage system. Except for some log files generated by WFBase, all WFBase data files can be stored on HDFS, which provides highly reliable underlying storage for WFBase.
WFBase is suitable for storing large tables (a table can reach billions of rows and millions of columns), and read and write access to such tables can reach a real-time level. It provides a database system that is highly reliable, high-performance, column-oriented, scalable and capable of real-time reads and writes. WFBase uses ZooKeeper as its coordination service, and the massive data in WFBase can be processed with the WFCloud big data in-memory computing framework and with MapReduce.
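The column-oriented, sparse table model that makes very wide tables practical can be sketched as below. This is a toy in-memory illustration of an HBase-style data model (row key, column family and qualifier, timestamped versions), not the WFBase or HBase client API; the class name, column families and row keys are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy column-family table: row key -> "family:qualifier" -> versions."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Only cells that are written take up space, so rows can be very sparse.
        self.rows[row_key].setdefault(column, []).append((timestamp, value))

    def get(self, row_key, column):
        # Return the newest version of the cell, or None if it was never written.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_key, stop_key):
        # Rows are ordered by key, so a range scan is a sorted slice.
        return [key for key in sorted(self.rows) if start_key <= key < stop_key]


table = SparseTable(["track", "meta"])
table.put("sensor-001#0001", "track:lat", "30.1", timestamp=1)
table.put("sensor-001#0001", "track:lat", "30.2", timestamp=2)
table.put("sensor-002#0001", "meta:type", "radar", timestamp=1)

latest = table.get("sensor-001#0001", "track:lat")
in_range = table.scan("sensor-001", "sensor-002")
```

Designing row keys so that related records sort next to each other, as the sensor prefix does here, is what keeps range scans cheap in this kind of store.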

2.2.4 WFCloud big data warehouse

The WFCloud big data warehouse is based on the open source data warehouse Apache Hive. It mainly provides structured data storage services operated through an SQL-like language, together with basic data analysis services, and runs as a single-instance service process.
As a data warehouse built on the HDFS and MapReduce architecture (shown in Figure 6), its main capability is to parse and compile WFCloud Query Language (WQL) statements and to generate and execute the corresponding MapReduce jobs or HDFS operations.
The main features of the WFCloud big data warehouse are as follows:
● Analysis and summarization of massive structured data
● Simplifies complex MapReduce programming tasks into SQL statements
● Supports flexible data storage formats such as JSON, CSV, TEXTFILE, RCFILE and SEQUENCEFILE
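How an SQL-style statement reduces to a MapReduce job can be sketched for a simple aggregate. The query `SELECT station, AVG(temp) FROM readings GROUP BY station`, the table and its values are hypothetical, and the code below is a single-process illustration of the idea rather than the plan that Hive or the WFCloud big data warehouse would actually generate.

```python
from collections import defaultdict

# Hypothetical table rows: (station, temperature)
readings = [("A", 10.0), ("B", 20.0), ("A", 14.0), ("B", 22.0), ("C", 5.0)]


def map_row(row):
    station, temp = row
    return station, temp          # the GROUP BY column becomes the shuffle key


def reduce_group(station, temps):
    return station, sum(temps) / len(temps)   # AVG(temp) for one group


def run_query(rows):
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())


result = run_query(readings)
```

The user writes one declarative statement; choosing the shuffle key and the per-group function is the part the warehouse's compiler takes over.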

Figure 6 Data warehouse architecture diagram
The WFCloud big data warehouse includes the following components:
● User interface: the WFCloud shell, Thrift client and web management
● Thrift server: when the WFCloud big data warehouse runs in server mode, it can act as a Thrift server for clients to connect to
● Metadata: typically stored in a relational database (MySQL, Derby, and so on)
● Parser: includes the interpreter, compiler, optimizer and executor, which take HiveQL query statements through lexical analysis, syntax analysis, compilation, optimization and query plan generation. Query plans are executed through MapReduce

3 Case Studies

3.1 Information service center big data fusion platform

The big data fusion platform is deployed on the network and mainly provides real-time ingestion, real-time retrieval, real-time analysis and other functions for massive multi-source heterogeneous data. It also provides a distributed data processing platform with stream processing and data mining capabilities. The data processing layer structure of the big data fusion platform is shown in the following figure:

Figure 7 Information center big data fusion platform framework
The big data fusion platform is based on a distributed file system and integrates the Hadoop distributed computing platform. It supports both the traditional MapReduce and the in-memory distributed computing architectures, and has strong distributed computing capabilities that support fast, efficient processing of TB-level or PB-level data.
The core of the big data fusion platform is the database system, which mainly solves two problems: storing massive data and retrieving it at high speed. The platform uses an independently developed big data database system based on SQL on Hadoop that stores both structured and unstructured data. Incoming data is indexed in real time, then analyzed, split and extracted before being stored in the big data database system. The database is also closely matched to the hardware platform and optimized for it, making full use of the hardware and improving database performance.
The data processing layer supports real-time processing, stream processing, graph computing and data mining. Data mining can retrieve, process and model the data held in the database, supporting deep data mining and business intelligence analysis.

3.2 Target area meteorological support system

The target area meteorological support system is a special-purpose system that supports environmental assessment of the target area. It consists of nearly 17 subsystems, including information receiving and processing, fine-grained forecast and early warning, decision support, security application and business support. The background processing unit of each subsystem runs on Loongson, Sunway, Feiteng and other server equipment.
Meteorological data is typical unstructured data, and in practice its daily increment can reach tens of TB. To meet the requirements of this project, a meteorological support big data processing platform was established that integrates application services, data preprocessing, real-time storage, fast retrieval, intelligent analysis, and two- and three-dimensional visual display.
The software framework of the meteorological support system is shown below:

Figure 8 Meteorological support big data platform application topology
The data storage layer is an important part of the business. Memory storage is built as a cluster of the in-memory database Redis and handles data that requires real-time processing efficiently and quickly. Persistent storage is built as a traditional Dameng database cluster, which stores and backs up data that must be persisted and acts as a safeguard. Distributed file storage is built as a MongoDB database cluster, which stores non-relational data quickly and efficiently for real-time access by multiple users. Nearline storage is built as a WFBase cluster and is mainly used for data that is accessed infrequently but still requires high access performance; it also needs considerable storage capacity and flexible cluster scalability.
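The division of labor between the four storage tiers can be summarized as a small routing rule. The tiers come from the description above, but the `DataProfile` fields and the order in which the rules are checked are assumptions made for this sketch; the source does not state how the system actually assigns data to a tier.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    realtime: bool = False        # must be processed with very low latency
    must_persist: bool = False    # needs durable storage and backup
    non_relational: bool = False  # non-relational data read by many users


def choose_tier(profile: DataProfile) -> str:
    # Assumed precedence: real-time first, then durability, then data shape.
    if profile.realtime:
        return "memory storage (Redis cluster)"
    if profile.must_persist:
        return "persistent storage (Dameng cluster)"
    if profile.non_relational:
        return "distributed file storage (MongoDB cluster)"
    # Large, infrequently accessed data falls through to nearline storage.
    return "nearline storage (WFBase cluster)"


live_feed = choose_tier(DataProfile(realtime=True))
archive = choose_tier(DataProfile())
```

In practice a single data item may be written to more than one tier, for example to memory for immediate use and to persistent storage for backup.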
The platform service layer provides basic services and system platforms for business applications, mainly the cloud computing big data platform and the two- and three-dimensional geographic information system platform. The data service layer can be extended flexibly with plug-ins for specific applications. The data processing service has two parts: data distribution and data receiving.
The business visualization layer is the display unit that gives users data analysis and deduction, and enables real-time analysis and service monitoring of meteorological data from the terminal.
The core data storage and processing part of the entire meteorological support system is built mainly on the WFCloud big data platform, which achieves localization of the system while ensuring its processing performance.

3.3 Construction of a data center

Based on the Sunway big data machine and the Risi operating system, the project uses virtualization and big data processing technology to build a storage and retrieval platform for massive unstructured data. The platform provides basic support for traditional database applications, data mining applications and data visualization.
The distributed processing platform is built on the Sunway big data machine cluster, and the cluster size is expanded with Sunway virtualization technology. A distributed file system provides distributed storage, and the distributed computing framework is designed and implemented with distributed computing and MapReduce. Combined with a primary/secondary backup architecture for high availability, this gives the Sunway big data processing system its distributed computing and storage capabilities. The specific software architecture is shown in the figure.

Figure 9 Data center software architecture diagram
The project is implemented in the following steps:
1) Port and optimize the distributed processing platform for the Sunway platform;
2) Use the WFCloud big data platform to build the distributed processing platform system and the WFBase database, then implement and test them;
3) After the big data platform is built, integrate it with the Avatar database: provide the related data mining and retrieval interfaces, provide porting support for base platform application systems, and provide the data interaction module interface;
4) Complete database testing together with Avatar Database;
5) Complete the GBase 8a database test with NTU General.

3.4 Construction of a Sunway big data platform for a college

The project builds a national defense big data information fusion platform based on the Sunway big data machine and the Risi operating system, with virtualization and big data processing technology as its core support.
National defense, as a sector with high security requirements, particularly favors basic hardware and software such as Loongson, Feiteng and Sunway. The Sunway big data integration solution uses self-developed technology across hardware, operating system, big data software, virtualization software and application interfaces, and integrates security middleware and a security database to build a new information integration platform for national defense big data.
To meet the information development needs of the college's information fusion center, three levels of construction must be completed: the basic environment, platform applications and system services. Within the platform application layer, the core application support environment is one of the more important parts; it covers the integration and construction of system software such as base libraries, base middleware, the base development and runtime environment, and base development drivers. The WFCloud big data infrastructure platform, the WFCloud big data in-memory computing framework and the WFBase system are built on the hardware and software infrastructure (Sunway servers). Combined with advanced domestic cloud computing and big data architecture and technology, and through refactoring of the source code and the software architecture, the Sunway big data platform architecture has been formed, roughly as shown in the figure.

Figure 10 Sunway big data platform architecture
The operating system depends on the hardware platform but has its own particularities. The project first solved the porting of open source Linux, the base libraries and the drivers, after which technical staff optimized the adaptation. The core application support environment is designed as an integrated platform for user feedback and technical optimization: based on the users' application performance targets, and combined with optimization of the operating system's open source base software, it resolves the users' problems.