Data Science for Data Act Data Harmonization

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
Semantic Community Data Science
Data Science for the DataAct Datathon
August 7, 2015

First Meetup: Data Science for the Data Act at Treasury, December 15, 2014, CGI Federal

• DATA Act Requirements Thoughts and Open Discussion, Art Nicewick, Executive Consultant, CGI Federal (Slides)
• Data Science for the Data Act at Treasury, Brand Niemann (Slides)
• Web Sites:
  • http://fedspendingtransparency.github.io/
  • http://fedspendingtransparency.github.io/dataelements/
• Questions: At this time, we are asking for comments in response to the following questions:
  • Which data elements are most crucial to your current reporting and/or analysis?
  • In setting standards, what are industry standards the Treasury and OMB should be considering?
  • What are some of the considerations that Treasury and OMB should take into account when establishing data standards?

http://www.meetup.com/Virginia-Big-Data-Meetup/events/218682974/

Congressional Testimony

• Try as it might, the federal government doesn't have the best track record on publicly reporting spending data, Gene Dodaro, comptroller general of the Government Accountability Office, told lawmakers December 3, 2014.
• USASpending.gov's success thus far could serve as a cautionary tale for the implementation of the Digital Accountability and Transparency Act, or DATA Act, said Dodaro during a hearing of the House Oversight and Government Reform Committee.
• "Our recent report on USASpending.gov really illustrates the challenge, here," said Dodaro.
• GAO's recent report found, five years after USASpending.gov launched, much of the information remains incomplete and inaccurate, with 324 programs not recorded in the database, $619 billion omitted and many of the data elements required for reporting missing, said Dodaro.

Source: FierceGov

Blogs

• In a speech at the 2014 Financial Stability Conference last week in Washington, the Director of the Office of Financial Research at Treasury, Dick Berner, called for universal adoption of Legal Entity Identifiers (LEI) throughout the federal government.
  • Source: Treasury.gov Web Site
• OMB’s Mark Reger compared the DATA Act to the Full Employment Act, noting, “there is a ton of work to be done.” Reger said that the input from data transparency consultants, contractors, and data specialists is needed to tell the implementing federal executives what data is most important and help with analysis.
  • Source: Data Coalition Blog

Data Transparency Breakfasts

• December 8, 2014: Federal Financial Management and the DATA Act
  • The fourth Data Transparency Breakfast, presented by PwC, will explore the transformation of the U.S. government's spending information from disconnected documents into standardized data, as required by the DATA Act of 2014, from the perspective of federal financial managers.
  • Join the financial officers who will be responsible for applying government-wide DATA Act data standards to make federal financial reports fully searchable, interoperable, and open to all. Our panel will explore the challenges and opportunities of the DATA Act transformation.

• My Note: I attended the Data Transparency Breakfast this morning in preparation for our December 15th Meetup. Please see additions to the agenda above, especially the slides, Web Site Links and Questions we will be discussing to provide feedback to Mark Reger, Deputy Controller, OMB, at his request to me at the breakfast.

Source: Data Coalition

Government Technology & Innovation Incubator for Big Data Analytics II Meetup, March 25th, Eastern Foundry

• 6:30 p.m. Welcome and Introduction (Preview of Proposed DATA Act Elements, Standardized Formulas, and Agency Implementation Challenges)
• 6:45 p.m. Brief Member Introductions
• 7:00 p.m. Chris Garner, Paxata, Inc., Presentation and Demo (Slides)
• 7:20 p.m. Steve Hanmer, Gov PATH Solution, Presentation and Demo
• 7:40 p.m. Open Discussion
• 8:00 p.m. Government Technology & Innovation Incubator: Eastern Foundry Tour, Geoff Orazem
• 8:30 p.m. Networking
• 9:00 p.m. Depart

http://www.meetup.com/Federal-Big-Data-Working-Group/events/221283174/

Newly Appointed U.S. CIO Tony Scott Speaks

• U.S. Chief Information Officer Tony Scott, in his first day of public appearances after his appointment by President Obama last month, described the President's 2013 Open Data Policy. Though the Open Data Policy is not mandatory for independent regulatory agencies, including most financial regulators, Scott said financial regulators can bring benefits to investors, their own operations, and the financial industry by voluntarily following it. View slideshow presentation here: http://www.datacoalition.org/wp-content/uploads/2015/03/Open-data-and-financial-regulationv2.pdf

Financial Regulation Summit Highlights

• Over 300 public and private sector open data leaders gathered at Union Market in Washington, D.C. on Tuesday for the Coalition's Financial Regulation Summit - aimed at building a consensus for the transformation of U.S. financial regulatory reporting from disconnected documents into open, standardized data. Participants included Members of Congress; U.S. Chief Information Officer Tony Scott; Treasury Office of Financial Research Director Dick Berner; and representatives of nearly every major financial regulator. The Financial Regulation Summit was presented by RR Donnelley, with additional sponsorship by Workiva, Booz Allen Hamilton, PwC, RDG Filings, and Socrata. In coming weeks, the Coalition will publish video of all Summit presentations and a full analysis of the MADOFF Transparency Act.

Parties Interested in the DATA Act 1

• You are invited to participate in a webinar hosted by the DATA Act Section 5 Pilot Team to discuss the Digital Accountability and Transparency Act (DATA Act) Section 5 Pilot. This online event is being held on April 1, 2015 from 1:00PM to 2:00PM EDT. The Chief Acquisition Officers Council, General Services Administration, and the Department of Health and Human Services are sponsoring a dialogue and pilot to identify clear recommendations for (1) standardizing grant and contractor awardee reporting, (2) eliminating duplicative and/or unnecessary reporting, and (3) reducing awardee compliance costs. The open dialogue, which will launch in spring of 2015, is iterative and will first ask interested parties to weigh in on these ideas, then we will apply those ideas in a pilot, and finally we will ask participants to again weigh in on the next iteration of ideas. Participation in the dialogue will provide federal contract and grant recipient organizations a unique opportunity to guide the future of the government-wide implementation of the DATA Act.

Parties Interested in the DATA Act 2

• Attendees will learn the background and goals of the DATA Act Section 5 Pilot, expected outcomes, and participant opportunities and requirements. The event also will address commonly asked questions about the pilot. DATA Act Section 5 Pilot Grants Lead Lora Kutkat and DATA Act Program Management Office Communications Lead Christopher Zeleznik will be leading the discussion, which will include ample time for questions and answers.
• A recording and documentation from the event will be posted to the Outreach section of http://www.grants.gov following the event.
• Please send any questions regarding the DATA Act Section 5 Pilot Webinar to Emily Gartland at [email protected].

National Webcast on Implementation of the Data Act

• On March 27th at 3:30pm EDT, please join us for a national webcast about implementation of the Digital Accountability and Transparency Act (DATA Act). Sponsored by a number of national organizations representing a broad cross-section of DATA Act stakeholders, the webcast will feature Federal leaders responsible for the Act's implementation. Hear from OMB Controller Dave Mader and Treasury Fiscal Assistant Secretary David Lebryk about plans for implementing this important legislation, which will have an impact on Federal agencies and all those who receive Federal funds. In particular, learn about the Federal government's approach to setting the required data element definition standards. There is no cost for participating in the webcast.

Source: Postponed
See: GitHub Site for National Dialogue

https://actiac.org/project/data-act-transparency-federal-financials-project

Art Nicewick, Executive Consultant, CGI Federal

• I have been talking with Mike Wood about pulling something together for the Data Act demo day in June. I have some ideas, but no time. I'm still unclear on the goals of the Act. From what I see, it’s a five-headed monster with many goals, many of which are divergent. Everybody has a lot of ideas on what it can be, and all the ideas are good. However, partitioning the problem into actionable components, defining the cost benefits of the components, and then setting the priorities is a challenge. I'd love to hear your thoughts.
• Art, Thanks and hopefully we could discuss this at the Meetup on Wednesday.

http://www.datacoalition.org/events/summits/finreg-2015/

So Many Activities About Financial Data, But Not with Financial Data!

• But See: Data Science for Financial Data by Dr. Brand Niemann, Published by AOL Government in 2011-2013:
  • Recovery.gov: A Good Start But Show Us All the Missing Data, By Brand Niemann, on September 08, 2011 at 3:00 PM
    • http://breakinggov.com/2011/09/08/recovery-gov-a-good-start-but-show-us-all-the-missing-data/
• But See: Semantic Community showed A USASpending.gov Dashboard with All the USA Spending Data in 2011.
  • A USASpending.gov Dashboard, December 18, 2013
    • http://semanticommunity.info/A_USASpending.gov_Dashboard
• But See: Semantic Community showed for the 2014 Data Transparency Summit that the Federal Digital Government Strategy accomplishes the Data Act. Hudson Hollister, Executive Director, Data Transparency Coalition, agreed.
  • Data Science for Financial Data Transparency (with Ontologies)
    • http://semanticommunity.info/Data_Science/Data_Transparency_Summit
• But See: Data Science for the Data Act at Treasury
  • Data Act at US Department of Treasury
    • http://semanticommunity.info/Data_Science/Data_Science_for_the_Data_Act_at_Treasury

http://semanticommunity.info/A_USASpending.gov_Dashboard

http://semanticommunity.info/Data_Science/Data_Transparency_Summit

Data Science uses the Data Mining Ontology (suggested by Dr. Barry Smith) and Data Mining Standard Process (CRISP-DM) to structure the content into a knowledge base using semantic web standards for big data.

http://semanticommunity.info/Data_Science/Data_Science_for_the_Data_Act_at_Treasury

Data Science for the Data Act at Treasury

• My Questions For the Fourth Data Transparency Breakfast Panel:
  • My EPA Experience:
    • Why not have a Federal Chief Data Officer and Agency Chief Data Officers with Data Scientists Mining Agency Data Assets?
  • Federal Spending Data Elements:
    • Will they support more than just reporting? Data analysis and even predictive analytics?
• Some results highlights are:
  • There are 59 data elements in the Data Act and 46 in the USASpending Data Dictionary.
  • The USASpending data set with 149,110 rows and 46 columns was geocoded by Spotfire using the PlaceofPerformanceCity column. There were other columns like Congressional District, ZIP Code, and County that were available.

Data Science for the DataAct Datathon

• Finally a Data Act Activity with Actual Financial Data Where a Data Scientist Can Actually Get Ready Access to the Data!
  • Just by happenstance, I discovered the DATA Act Forum Datathon Call for Participants, DATA Act Forum-The Art of the Possible, and the DATA Act Forum Data Zoo Technology Showcase Application on July 27-28, and July 29, respectively.
  • The three events (July 27-29) will be summarized for our future meetup (Data Science for the Data Act at Treasury?) and this Data Science for the Data Act Datathon will be extended by our Data Act Data Science team to make recommendations to OMB and other agencies.
  • The next step is to render the data dictionaries and the OMB Standard Data Act Data Elements in spreadsheet form so we can begin the semantic harmonization and mediation process in Spotfire.

My Conclusions and Recommendations

• The Federal Big Data Working Group Meetup Data Mining – Data Science Process was Applied to the DataAct Datathon Data Sets.
• A Data Ecosystem was Built by Downloading 19 Files from the IAC/ACT Datathon Socrata Catalog and Using Spotfire to Inventory Their Characteristics in an Excel Spreadsheet.
• There are many duplicate files in the IAC/ACT Datathon Socrata Catalog.
• The 14 unique files were imported into 3 Spotfire files for analytics and visualizations.
• Screen Capture Samples Are Shown to Help the Datathon Participants and in Preparation for Another Federal Big Data Working Group/Virginia Big Data Meetup on the Data Act.
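
The inventory and de-duplication step described above was done in Spotfire and Excel. As a rough illustration only, the same idea can be sketched in Python: record each file's row and column counts, and treat files with identical content as duplicates. The folder layout, the CSV format, and the function names `inventory` and `unique_files` are assumptions made for this sketch, not part of the Datathon materials.

```python
import csv
import hashlib
from pathlib import Path


def inventory(folder):
    """Return one record per CSV file: name, row count, column count, content hash."""
    records = []
    for path in sorted(Path(folder).glob("*.csv")):
        data = path.read_bytes()
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        records.append({
            "file": path.name,
            "rows": max(len(rows) - 1, 0),  # exclude the header row
            "columns": len(rows[0]) if rows else 0,
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return records


def unique_files(records):
    """Keep the first file seen for each distinct content hash."""
    seen = {}
    for record in records:
        seen.setdefault(record["sha256"], record)
    return list(seen.values())
```

Hashing file contents only catches byte-for-byte duplicates; files that differ in column order or formatting but carry the same data would still need to be compared by hand.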

http://semanticommunity.info/Data_Science/Data_Science_for_the_DataAct_Datathon

My Suggested Harmonization Process 1

• What I am suggesting is the opposite of the usual case, where you have, say, an Access or MySQL database with multiple tables and key fields to join them, and you issue a SQL command to extract the joined subset of data you want to analyze.
• We have the reverse problem: trying to make 20 or so Datathon data sets, and ultimately multiple tables for every agency with their financial data, into an integrated database that supports the same kind of queries.
• I showed this in a recent Meetup for multiple Harmful Algal Bloom data sets that had been purposely designed with key fields.
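
For contrast, here is a minimal sketch of the "easy" case, where tables were designed with a shared key field and one SQL command extracts the joined subset. SQLite (via Python's standard `sqlite3` module) stands in for Access or MySQL, and the `agencies` and `awards` tables, their columns, and their values are hypothetical.

```python
import sqlite3

# Hypothetical tables: names, columns, and values are illustrative only.
SCHEMA = """
CREATE TABLE agencies (agency_code TEXT PRIMARY KEY, agency_name TEXT);
CREATE TABLE awards (award_id TEXT PRIMARY KEY, agency_code TEXT, amount REAL);
INSERT INTO agencies VALUES ('A01', 'Agency One'), ('A02', 'Agency Two');
INSERT INTO awards VALUES ('W1', 'A01', 500.0), ('W2', 'A02', 2500.0), ('W3', 'A01', 1200.0);
"""


def joined_subset(conn, minimum):
    """Join awards to agencies on the shared key field and filter by amount."""
    query = (
        "SELECT a.award_id, g.agency_name, a.amount "
        "FROM awards AS a JOIN agencies AS g ON a.agency_code = g.agency_code "
        "WHERE a.amount >= ? ORDER BY a.award_id"
    )
    return conn.execute(query, (minimum,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = joined_subset(conn, 1000)
```

The join works only because `agency_code` means the same thing in both tables. The Datathon data sets come without that guarantee, which is the reverse problem described above.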

My Suggested Harmonization Process 2

• But what if the data sets have not been purposely designed with well-defined key fields, or it is very difficult to match the “key fields” because of missing data dictionaries, slightly different wording, etc.? These are what I call semantic interoperability problems.
• Well, I, or a team, can do this by hand using data dictionaries and the data sets in Spotfire, and/or get a tool like TAMR that we had demonstrated recently in a Meetup.
• First you match as many of the data elements as possible to the new OMB standard data elements (57, as I recall from work in our earlier Data Act for Treasury Meetup), and then you implement those matches in the Spotfire Tools, Data Relationships feature so you can then “query” (without any SQL) a new merged, semantically harmonized table or tables.
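
The matching step above is described as manual work in Spotfire or as a job for a tool like TAMR. Purely to illustrate the idea, the sketch below uses simple string similarity from Python's standard `difflib` module to propose matches between source column names and a list of standard elements, then renames the matched columns. This is not how Spotfire or TAMR work internally. The standard element names, the 0.8 threshold, and the function names are assumptions for this example, and proposed matches would still need human review against the data dictionaries.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop non-alphanumerics so 'Award_ID' and 'award id' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_elements(columns, standard, threshold=0.8):
    """Map each source column to its closest standard element, or None if below threshold."""
    mapping = {}
    for col in columns:
        best, best_score = None, 0.0
        for element in standard:
            score = SequenceMatcher(None, normalize(col), normalize(element)).ratio()
            if score > best_score:
                best, best_score = element, score
        mapping[col] = best if best_score >= threshold else None
    return mapping


def harmonize(rows, mapping):
    """Rename matched columns to the standard names; drop unmatched ones."""
    return [{mapping[k]: v for k, v in row.items() if mapping.get(k)} for row in rows]


# Hypothetical standard elements and source columns, for illustration only.
standard = ["Award ID", "Awarding Agency Name", "Obligation Amount"]
mapping = match_elements(["award_id", "obligated_amount", "zip"], standard)
merged = harmonize([{"award_id": "A1", "obligated_amount": "500", "zip": "20001"}], mapping)
```

String similarity only catches "slightly different wording". Columns that mean the same thing under unrelated names still have to be matched from the data dictionaries.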

http://www.tamr.com/tamr-catalog-alpha-download/