Data Pitch Project - Data catalogue for the challengers

Data Pitch is for startups and SMEs who:

  • Want to create products and services with data.

  • Want to work alongside established businesses and corporates.

  • Are registered in an EU country.


Why join Data Pitch?

  • Build new solutions to challenges that affect businesses and the public with data.

  • Receive equity-free funding of up to €100k.

  • Demonstrate your data expertise to potential clients and investors.

  • Be part of an ecosystem where you can share ideas and collaborate with other startups.

  • Benefit from mentoring, office space, networking opportunities, communications support and other business resources.

Explore new data and build innovative solutions


Explore the data catalogue



Submit your proposal



Build your solution


Timeline

1st July

Call open

October & November

Selection of the challengers

December

Beginning of the accelerator program

The Data Catalogue

Dataset n°1 - Sonae Supply Chain Data (DPC1-2017) BIG DATA, LOGISTIC, RETAIL
DESCRIPTION: Supply chain data: one huge denormalized table with one line per product flow between locations. Although the format is specific to Sonae, datasets of this type are common across the retail sector. All the datasets are created in our operational systems, collected in our on-premises data warehouse, and made available to third parties through Amazon AWS S3/Redshift.
INDUSTRY SECTOR: Retail
DATA PROVIDER COUNTRY: Portugal
UPDATES: The dataset used in the experiment will have a bespoke update frequency to be decided with the challenge winner.
DATASET SIZE: 20TB of compressed data (1/10 ratio)
NUMBER OF ATTRIBUTES: >120
DATA FORMAT AND STORAGE: CSV files stored in Amazon AWS
ATTRIBUTES: Supply chain data – One denormalized table with one line per product flow between locations
PERSONAL DATA: No data relating to persons present
SYNTHETIC DATA: No Synthetic data present
GEOGRAPHIC COVERAGE: Portugal
TIMESPAN & PRODUCTION:
Timespan: Jan 2017 – present
Production: live

LEVEL OF AGGREGATION: Raw data
DATA ACCESS: Bulk download
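
As an illustration of working with a table of this shape, the sketch below streams a CSV export row by row and totals the quantity moved per pair of locations. The column names (`origin`, `destination`, `product_id`, `quantity`) and the in-memory sample are assumptions made for the example; the actual Sonae schema has more than 120 attributes and is not described in this catalogue.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: the real column names are not published in this catalogue.
SAMPLE = """origin,destination,product_id,quantity
WH-01,STORE-17,P100,40
WH-01,STORE-17,P200,10
WH-02,STORE-03,P100,25
"""


def total_flow_by_route(lines):
    """Sum quantities per (origin, destination), reading one row at a time."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[(row["origin"], row["destination"])] += int(row["quantity"])
    return dict(totals)


if __name__ == "__main__":
    for route, qty in sorted(total_flow_by_route(io.StringIO(SAMPLE)).items()):
        print(route, qty)
```

Because rows are consumed one at a time, the same function could be pointed at an open file handle for a downloaded CSV part without loading the whole file into memory.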

Dataset n°2 - IMIN Dataset (DPC2-2017) API BASED ACCESS, BOOKING SYSTEM, MULTIPLE SOURCES AGGREGATION, REAL-TIME, RECREATIONAL ACTIVITIES, SPORT AND WELLBEING
DESCRIPTION: IMIN offers challenge winners free access to its API. This includes a real-time availability API, which supplies event data, and a booking API, which allows events to be booked and paid for. These give access to data about physical activities and sports classes from providers such as Fusion Lifestyle Sports, Open Sessions and Go Mammoth Sports.
DATASET SIZE: ~20 sources of sport events and venues
DATA FORMAT AND STORAGE: JSON format
ATTRIBUTES:
  • Availability

  • Location

  • Scheduling

SYNTHETIC DATA: No Synthetic data present
GEOGRAPHIC COVERAGE: Great Britain
DATA ACCESS: API (see the IMIN API documentation)
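
To give a feel for how availability, location and scheduling attributes might be combined, the sketch below filters a JSON list of events down to those that can still be booked and orders them by start time. The payload and its field names (`name`, `venue`, `start`, `spaces_left`) are hypothetical and are not taken from the IMIN API documentation.

```python
import json

# Hypothetical payload: field names are illustrative, not the IMIN API schema.
PAYLOAD = """
[
  {"name": "Yoga", "venue": "Leisure Centre A", "start": "2017-12-01T18:00:00", "spaces_left": 3},
  {"name": "5-a-side", "venue": "Sports Hall B", "start": "2017-12-01T19:30:00", "spaces_left": 0},
  {"name": "Spin", "venue": "Leisure Centre A", "start": "2017-12-02T07:00:00", "spaces_left": 8}
]
"""


def bookable_events(raw_json):
    """Return events with free spaces, earliest first."""
    events = json.loads(raw_json)
    available = [e for e in events if e["spaces_left"] > 0]
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    return sorted(available, key=lambda e: e["start"])


if __name__ == "__main__":
    for event in bookable_events(PAYLOAD):
        print(event["start"], event["name"], event["venue"])
```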

Dataset n°3 - SpazioDati Data (DPC3-2017) KNOWLEDGE GRAPHS, MULTIPLE SOURCES AGGREGATION, SALES AND MARKETING, TEXT
DESCRIPTION: Information about persons and companies is dispersed across a number of sources. Crawlers collect this information and make it available in different formats. The source types range from basic firmographics to financial data, marketing data, key persons and services offered.
DATASET SIZE: Hundreds of gigabytes
NUMBER OF ATTRIBUTES: There are different types of entities, and the number of attributes differs per entity type. As an approximate estimate, each entity has 15 attributes on average.
DATA FORMAT AND STORAGE: Data are serialized as JSON.
ATTRIBUTES:
  • Data from our corporate Web crawl: websites and contact information collected from the websites, e.g., phones, emails, description, links to social web.

  • Data about companies/legal entities: basic firmographics, directors and managers, locations of companies’ sites, matches to the websites + entities extracted from various textual descriptions.

  • Financial data (e.g. important indicators, ratings)

  • Target data

  • house number

PERSONAL DATA: The dataset contains pseudonymized data derived from personal data
SYNTHETIC DATA: No Synthetic data present
GEOGRAPHIC COVERAGE: Italy and UK
LEVEL OF AGGREGATION: Access is to raw-level data, and no aggregation is performed before the analysis. Ranges of values might be provided instead of actual values for the most sensitive parameters.
DATA ACCESS: Subject to negotiation
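
Since the number of attributes varies by entity type, a first profiling step could be to measure it directly. The sketch below assumes one JSON object per line and a `type` field identifying the kind of entity; both the layout and every field name are assumptions for illustration, as the actual SpazioDati schema is not given here.

```python
import json
from collections import defaultdict

# Hypothetical records, one JSON object per line; the real fields are not listed here.
RECORDS = [
    '{"type": "company", "name": "Acme Srl", "city": "Trento", "revenue_range": "1M-5M"}',
    '{"type": "company", "name": "Beta Ltd", "city": "London"}',
    '{"type": "website", "url": "https://example.com", "emails": 2}',
]


def mean_attributes_by_type(lines):
    """Average number of attributes per entity, grouped by entity type."""
    counts = defaultdict(list)
    for line in lines:
        entity = json.loads(line)
        # Count every key except the type discriminator itself.
        counts[entity["type"]].append(len(entity) - 1)
    return {t: sum(c) / len(c) for t, c in counts.items()}


if __name__ == "__main__":
    print(mean_attributes_by_type(RECORDS))
```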

Dataset n°4 - Deutsche Bahn Data (DPC4-2017) API BASED ACCESS, BIG DATA, MOBILITY, REAL-TIME, TRANSPORT
DESCRIPTION: Deutsche Bahn collects a large amount of information covering all the main aspects of its business, ranging from rail network condition to business information and logistics. Some of the information is provided through a real-time API.
UPDATES: The dataset used in the experiment will have a bespoke update frequency to be decided with the challenge winner.
ATTRIBUTES: The following list is an extract of the most relevant information about the data for the challenge:
  • Master data

    • Rail network DB

    • Station data (addresses, GPS, plus various additional information such as length of platforms, etc.)

    • Opening times of travel centers

    • Our entire operations location register (RIL 100/DS 100)

    • Service facilities

  • Business information

    • Real “historic” booking data from Call-a-bike and Flinkster (anonymized and without customer data) covering 2.5 years

    • Network radar (availability of mobile networks from app measurements)

    • Air pollutant register/cadaster

  • Logistics

    • So far, exemplary shipment data for 8 containers around the world from DB Schenker (location, temperature)

    • Data from DB Cargo (aggregated à 10 trains per operating location per day)

  • Target data

    • Target data car position diagram/indicator

    • Target timetable Fernverkehr (long distance trains)

  • Real-time-APIs

    • Condition of elevators and escalators (working/not working) + master data (successor of the legendary ADAM-API)

    • Booking options of Call-a-bike and Flinkster

    • Actual occupancy of DB Bahnpark car parks

    • APIs to master data, i.e. station master data

GEOGRAPHIC COVERAGE: Germany
LEVEL OF AGGREGATION: Raw data
DATA ACCESS: All the data are published on the DB data portal, which also provides a description of the datasets and of the APIs. Additional datasets are periodically added. An English version is going to be produced by DB in the near future.
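
As one example of what the real-time elevator and escalator status could support, the sketch below counts out-of-service facilities per station from a status snapshot. The record layout (`station`, `kind`, `working`) and the station names are hypothetical; they are not taken from the DB API description.

```python
from collections import Counter

# Hypothetical snapshot: field names and values are illustrative, not the DB API schema.
FACILITIES = [
    {"station": "Station A", "kind": "elevator", "working": True},
    {"station": "Station A", "kind": "escalator", "working": False},
    {"station": "Station B", "kind": "elevator", "working": True},
]


def out_of_service_by_station(facilities):
    """Count facilities reported as not working, grouped by station."""
    broken = Counter(f["station"] for f in facilities if not f["working"])
    return dict(broken)


if __name__ == "__main__":
    print(out_of_service_by_station(FACILITIES))
```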

Dataset n°5 - Uniserv Data (DPC5-2017) COMPANY DATA, ENTITY RESOLUTION, NAMES AND ADDRESSES, RETAIL, TEXT
DESCRIPTION: The data set represents typical customer master data in enterprise applications like CRM or ERP Systems.
INDUSTRY SECTOR: Retail
DATA PROVIDER COUNTRY: Germany
UPDATES: Static test data set; no updates
DATASET SIZE: The full data set contains 34 Million records
NUMBER OF ATTRIBUTES: ~ 10
DATA FORMAT AND STORAGE: Flat file, text format, CSV
ATTRIBUTES:
PERSONAL DATA: Anonymized or pseudonymized data derived from personal data
SYNTHETIC DATA: Yes (first name and last name)
GEOGRAPHIC COVERAGE: Germany
LEVEL OF AGGREGATION: No aggregation. The dataset involved in the experiment includes raw data.
DATA ACCESS: Defined during the negotiation phase
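
The entity-resolution angle of this dataset can be illustrated with a very simple blocking step: build a normalized key from name and address fields and group records that share it. This is only a minimal sketch of exact-key matching, not Uniserv's method, and the column names (`id`, `first_name`, `last_name`, `street`, `city`) and sample rows are assumptions, since the attribute list is not given above.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical sample: column names and values are invented for illustration.
SAMPLE = """id,first_name,last_name,street,city
1,Anna,Muster,Hauptstrasse 5,Musterstadt
2,ANNA,Muster,Hauptstr. 5,Musterstadt
3,Ben,Beispiel,Bahnhofstrasse 12,Berlin
"""


def match_key(row):
    """Build a normalized key so trivially different spellings collide."""
    # casefold() lowercases and also folds the German sharp s to "ss".
    street = row["street"].casefold()
    street = re.sub(r"strasse|str\.", "str", street)
    parts = (row["first_name"], row["last_name"], street, row["city"])
    return tuple(re.sub(r"\s+", " ", p.casefold()).strip() for p in parts)


def candidate_duplicates(lines):
    """Return lists of record ids that share the same normalized key."""
    groups = defaultdict(list)
    for row in csv.DictReader(lines):
        groups[match_key(row)].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    print(candidate_duplicates(io.StringIO(SAMPLE)))
```

Exact-key grouping only catches formatting differences; typos or swapped fields would need a fuzzy comparison on top of it.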
