© Hitachi, Ltd. 2016. All rights reserved.
Hitachi, Ltd. OSS Solution Center
2016/10/27
Yusuke Furuyama, Yang Xie

Leveraging smart meter data for electric utilities:
Comparison of Spark SQL with Hive
Contents

1. Leveraging smart meter data [Sample use case for electric utilities]
2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)
3. Additional evaluation with Spark 2.0
4. Summary
1. Leveraging smart meter data [Sample use case for electric utilities]
1-1 Hitachi’s Social innovation business approach
1-2 Situation of electric utilities and their needs
Liberalization of the retail electric power market
• Electric utilities have to adapt to a competitive free market
• The government has requested cuts in the power transmission fee

Need to cut the cost of transmission and distribution equipment
• Transmission and distribution equipment used to be replaced periodically
• The replacement timing is now decided by the condition of the equipment

Maintenance team needs: obtain the load status of each piece of equipment
[Diagram: Electric Power Generation Companies → Transmission and Distribution Companies (power plant, transmission lines, substation, distribution lines, home) → Electricity retailers; labels: electricity rate, power transmission fee]
1-3 Situation of electric utilities and their needs (future)
Unstable power supply
• Nuclear plants, a stable power source, are decreasing
• Renewable energy supply is increasing

Need for high-level demand response
• Rates by time zone (current demand response)
• Many small renewable energy suppliers
• Near real-time demand response for each distribution system

Planning team needs: obtain the near real-time load status of each piece of equipment
1-4 Leveraging big data for electric utilities
The Meter Data Management System receives data from smart meters (every 30 min.).

Analyze the data from smart meters to grasp the load status of equipment → meet the needs of electric utilities.

[Diagram: power grid hierarchy feeding the Data Analysis System]
• Substation (1,000)
• Distribution lines (5/substation)
• Switch (20/distribution line)
• Transformer (500/distribution line)
• Smart meter (4/transformer): 10,000,000 in total

Concern about the performance needed for near real-time processing
1-5 System component

[Diagram: Meter Data Management System → Data Analysis System → planners (equipment / demand response)]

Data Analysis System (target of this session):
• Data processing platform (Hadoop, Spark): raw data → processed data
• Data visualization tool (Pentaho)
2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)
2-1 Use Case for electric utilities
Needs of electric utilities (recap)
• Analyze the data from smart meters to grasp the load status of equipment
• Near real-time

Points of analysis
Needs → Point of analysis
• Find the equipment that needs to be replaced → Find the equipment that has a heavy workload
• Estimate the timing for replacement → Check the trend of load status
• Select the proper capacity of new equipment → Extract the peak of the load
2-2 Contents of performance evaluation
Background
• Concern about performance for processing 10,000,000 meters
• Data comes from each smart meter every 30 min (48 readings/day, per the smart meter spec)

Point of evaluation for near real-time processing
• Check whether MapReduce and Spark can process the data within 30 min.

Items for evaluation
Point of analysis → Aggregate per
• Find the equipment that has a heavy workload → Equipment (distribution system, substation, switch, transformer, meter)
• Check the trend of load status → Term (1 day, 1 month (30 days), 1 year (365 days))
• Extract the peak of the load → Time zone (specific 30 min of each day, 24 h)
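The aggregation items above all reduce to SQL aggregations over the meter table. As a minimal sketch of the "specific 30 min of each day" case, here is a toy query with hypothetical table and column names, using SQLite in place of Hive / Spark SQL:

```python
import sqlite3

# Toy stand-in for the aggregation batch: the table name, column names, and
# data are hypothetical, and SQLite stands in for Hive / Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meter_data (meter_id INTEGER, day TEXT, "
    "usage_0000 REAL, usage_0030 REAL)"  # the real table has 48 slot columns
)
conn.executemany(
    "INSERT INTO meter_data VALUES (?, ?, ?, ?)",
    [(1, "2016-10-01", 0.5, 0.75), (2, "2016-10-01", 1.0, 1.25)],
)

# "Specific 30 min of each day": sum a single 30-min slot column per day.
day, total_0000 = conn.execute(
    "SELECT day, SUM(usage_0000) FROM meter_data GROUP BY day"
).fetchone()
print(day, total_0000)  # 2016-10-01 1.5
```

The "term" and "per equipment" items would widen the same pattern: more slot columns in the SUM, and joins up the equipment hierarchy before the GROUP BY.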
2-3 Target of performance evaluation
Target of evaluation

[Diagram: Meter Data Management System → Data Analysis System]
• Target: the aggregation batch (Hive / Spark SQL) on the data processing platform (MapReduce, Spark 1.6)
• Measured: time from the start of the aggregation batch for meter data to the end of the batch
• Out of scope: the data visualization tool (Pentaho)
2-4 Target of performance evaluation: system configuration

1 master node (virtual machine):
• CPU cores: 2
• Memory: 8 GB
• Disk capacity: 80 GB
• # of disks: 1

4 slave nodes (physical machines), per node / total:
• CPU cores: 16 / 64
• Memory: 128 GB / 512 GB
• Disk capacity: 900 GB per disk
• # of disks: 6 / 24
• Total disk capacity: 5.4 TB (5,400 GB) / 21.6 TB (21,600 GB)

Network: 10 Gbps LAN and 10 Gbps switch between slave nodes; 1 Gbps LAN to the master node
2-5 Dataset
Smart meter data size

Term               # of records    Size (CSV)   Size (ORC File)
365 days (1 year)  3,650 million   2.475 TB     1.325 TB
30 days (1 month)    300 million   0.205 TB     0.158 TB
1 day                 10 million   0.007 TB     0.005 TB

Record layout (one record per meter per day):
• Meter mgmt. info
• 48 power-usage columns, one per 30 min (0:00–0:30, 0:30–1:00, …, 23:30–0:00)

10,000,000 meters → 10,000,000 records/day
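The per-day record layout above (meter info plus 48 half-hour slots) can be sketched as a CSV schema; the column names here are illustrative, not the real system's:

```python
import csv, io

# Sketch of the per-day record: a meter id plus 48 half-hour usage slots.
# Column names are hypothetical examples, not from the original system.
slots = [f"{h:02d}:{m:02d}" for h in range(24) for m in (0, 30)]
header = ["meter_id"] + [f"usage_{s.replace(':', '')}" for s in slots]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerow([1] + [0.5] * len(slots))  # one meter, one day of readings

print(len(slots))  # 48 half-hour columns
```

At 10,000,000 meters this layout yields 10,000,000 such rows per day, matching the dataset sizes in the table.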
2-6 Contents of performance evaluation (recap)

Point of evaluation
• Check whether MapReduce and Spark can process the data within 30 min.
• Measured: time from the start of the aggregation batch for meter data to the end of the batch

Items for evaluation
Point of analysis → Aggregate per
• Find the equipment that has a heavy workload → Equipment (distribution system, substation, switch, transformer, meter)
• Check the trend of load status → Term (1 day, 1 month (30 days), 1 year (365 days))
• Extract the peak of the load → Time zone (specific 30 min of each day, 24 h)

+ File type
• Text (CSV)
• ORCFile (column-based)
2-7 Comparison of TXT with ORCFile (MapReduce)

• Couldn't finish processing within 30 min (1,800 sec)
• Performance improved with ORCFile

Result (24h)
            Hive + TXT   Hive + ORCFile
1 day           99 sec        62 sec
30 days        338 sec       224 sec
365 days     6,962 sec     3,286 sec

Result (0.5h)
            Hive + TXT   Hive + ORCFile
1 day           32 sec        31 sec
30 days        138 sec        69 sec
365 days     3,015 sec       492 sec
2-8 Comparison of TXT with ORCFile (Spark 1.6)

• Could finish processing within 30 min (1,800 sec)
• Performance improved with ORCFile

Result (24h)
            Spark SQL + TXT   Spark SQL + ORCFile
1 day            47 sec            28 sec
30 days         154 sec            50 sec
365 days      1,408 sec           993 sec

Result (0.5h)
            Spark SQL + TXT   Spark SQL + ORCFile
1 day            30 sec            26 sec
30 days         111 sec            32 sec
365 days      1,263 sec           169 sec
2-9 Review of the results: why processing was fast with ORCFile

Features of ORC File
• Compression
• Reading specific columns

Processing 0.5h data: uses only 1 of the 48 half-hour columns of the smart meter data.
Processing 24h data: uses all 48 columns (0:00 + 0:30 + … + 23:30 → SUM/day).

Results
• Processing big data with ORCFile was more effective than processing small data
• Processing 0.5h data with ORCFile was more effective than processing 24h data
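The column-pruning effect can be sketched with a toy columnar layout: when each half-hour slot is stored as its own column, as in ORCFile, a 0.5h query touches 1/48 of the values a 24h query does (plain Python, sizes and names illustrative only):

```python
# Toy columnar store: one list per 30-min slot, mimicking ORCFile's
# column-oriented layout. The meter count and names are illustrative.
n_meters = 1000
columns = {f"slot_{i:02d}": [0.5] * n_meters for i in range(48)}

# 0.5h aggregation: only one column needs to be read from storage.
half_hour_values_read = len(columns["slot_00"])

# 24h aggregation: all 48 columns must be read and summed per meter.
day_values_read = sum(len(col) for col in columns.values())

print(day_values_read // half_hour_values_read)  # 48x more data touched
```

Compression narrows the gap further in practice, since each column holds similar values and compresses well; this sketch only models column pruning.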
2-10 Comparison of MapReduce with Spark 1.6

• Could finish processing the data from 10,000,000 meters within 30 min using Spark 1.6!
• Spark performed especially well on "per equipment" processing

Result (30 days / 24h)
                      Spark SQL    Hive
Distribution system      50 sec    224 sec   (78% down)
Per equipment           145 sec    718 sec   (80% down)

Result (30 days / 0.5h)
                      Spark SQL    Hive
Distribution system      32 sec     69 sec   (54% down)
Per equipment           104 sec    556 sec   (81% down)

Result (365 days / 24h; 30-min target: 1,800 sec)
                      Spark SQL    Hive
Distribution system     993 sec    3,286 sec
Per equipment         1,359 sec    4,131 sec
2-11 Review of the results: why was processing per equipment more effective than processing the entire distribution system when using Spark?

Per equipment: the smart meter data (10,000,000) is joined and aggregated step by step up the hierarchy:
smart meter → transformer (2,500,000) → switch (100,000) → distribution line (5,000) → substation (1,000) → distribution system

For the entire distribution system: the smart meter data is joined directly with the distribution system.

Why Spark handled this well:
• Less disk I/O than MapReduce
• The data (including re-distributed data) was smaller than the total memory of the cluster
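The first step of the per-equipment join chain can be sketched as a toy query, again with made-up table and column names and SQLite standing in for Spark SQL; the real processing repeats this JOIN + GROUP BY pattern up through switch, line, and substation:

```python
import sqlite3

# Toy version of the meter -> transformer step of the join chain.
# Table names, ids, and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meter (meter_id INTEGER, trans_id INTEGER, usage REAL);
CREATE TABLE trans (trans_id INTEGER, switch_id INTEGER);
INSERT INTO meter VALUES (1, 10, 0.5), (2, 10, 1.0), (3, 11, 2.0);
INSERT INTO trans VALUES (10, 100), (11, 100);
""")

# Aggregate usage per transformer; the result (2,500,000 rows in the real
# system) would then be rolled up per switch, and so on up the hierarchy.
rows = conn.execute("""
    SELECT t.trans_id, SUM(m.usage)
    FROM meter m JOIN trans t ON m.trans_id = t.trans_id
    GROUP BY t.trans_id ORDER BY t.trans_id
""").fetchall()
print(rows)  # [(10, 1.5), (11, 2.0)]
```

Each rollup step shrinks the data (10,000,000 → 2,500,000 → 100,000 → …), which is why the intermediate data stays within cluster memory and Spark's in-memory processing pays off.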
3. Additional evaluation with Spark 2.0
3-1 Evaluation environment: system configuration

1 master node (physical):
• CPU cores: 16
• Memory: 12 GB
• Disk capacity: 900 GB
• # of disks: 8

6 slave nodes (physical), per node / total:
• CPU cores: 20 / 120
• Memory: 384 GB / 2,304 GB
• Disk capacity: 1,200 GB per disk
• # of disks: 10 / 60
• Total disk capacity: 12.0 TB (12,000 GB) / 72.0 TB (72,000 GB)

Network: 10 Gbps LAN, 10 Gbps switch
3-2 Comparison of Spark 2.0 with Spark 1.6

• Performance improvement of roughly 20% (including disk I/O)
• More effective with small data

Per-equipment aggregation, time from the start of the aggregation batch for meter data to the end of the batch:

                         Spark 2.0   Spark 1.6
Result (365 days / 24h)   316 sec     322 sec
Result (30 days / 24h)     87 sec     102 sec
Result (1 day / 24h)       71 sec      89 sec
Demo
Demo: Data aggregation and visualization

① Show 29 days of aggregated data in the data visualization tool (Pentaho)
② Execute the aggregation batch (Spark SQL) on 1 day of raw data
③ Show 30 days of aggregated data
④ Outlier detected!!
4. Summary
4 Summary

Leveraging data from 10,000,000 smart meters for electric utilities
- Built a data analysis system: Meter Data Management System → aggregation batch (Hive / Spark SQL) on the data processing platform (MapReduce, Spark) → data visualization tool (Pentaho)
- Concern about performance

Evaluated the performance of batch processing
- Spark could process the data from 10,000,000 meters within 30 min (4 nodes)

Evaluated the performance of Spark 2.0
- Performance improvement of roughly 20% (compared to 1.6)
END
Notice regarding references to other companies' product names and trademarks

Hadoop, Hive, and Spark are registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
Other company and product names mentioned are trademarks or registered trademarks of their respective companies.
Appendix. Difficulty with large data to be shuffled
Attempted to aggregate raw (48-column) meter data per equipment
- Extremely slow (Spark 2.0) or the job failed (Spark 1.6)
- Processing: iteration of JOIN and GROUP BY + SUM
- Huge amount of data to be shuffled (spilled out from the page cache)
- Heavy load on a single local disk (the OS disk) from shuffle

Countermeasure: added the HDFS disks as disks for shuffle (Spark 2.0.0)
- Performance improved (365 days)
- Performance degraded (1 day / 30 days)

Processing time by target data, before/after adding shuffle disks:
           Before adding   After adding
1 day           76 sec         85 sec
30 days        223 sec        236 sec
365 days    13,281 sec      3,650 sec

Lessons
- Data for Spark (including temporary data) should be smaller than memory
- Better to run a small trial first to estimate the resources needed
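Spreading shuffle files across extra disks is typically done through Spark's `spark.local.dir` setting (a comma-separated list of directories, which otherwise defaults to a single temp directory on the OS disk). A sketch of what the change could look like in `spark-defaults.conf`, with hypothetical mount points:

```properties
# spark-defaults.conf: spread shuffle/spill files across several disks
# instead of the single OS disk. The paths are hypothetical examples.
spark.local.dir  /data1/spark-tmp,/data2/spark-tmp,/data3/spark-tmp
```

Whether this helps depends on the shuffle volume, as the results above show: extra spindles paid off only for the 365-day run, where shuffle data overflowed the page cache.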