A short summary of the project and its deliverables
Project Ladd’s main deliverable is to establish a lab for data-driven innovation, aimed at lowering the barrier for creating commercial and non-commercial innovations that promote social, economic and ecological sustainability for cities and their hinterland.
This main deliverable is divided into a number of sub-deliverables:
1) Ground rules
Establish ground rules that enable and balance the influence that commercial and non-commercial innovation efforts and actors have on how the lab is operated and managed.
2) Professional profiles and skills
Define the competency profiles/skills needed to create data-driven innovations and link people with those skills to the lab.
3) Tools / technology
Deploy tools and technology, build the skills to handle them, and test the technical components/tools needed in the lab.
4) The meeting point
Develop a venue and meeting procedures so that innovators based in industry, academia, government, non-profit organizations and even individual citizens can meet to innovate together.
5) Knowledge building
Build knowledge about data-driven innovation among different actors in the region.
6) Innovation Events
Implement events where data-driven innovations are developed.
In the establishing phase, there are limited opportunities to create a technical environment that covers everything needed in a data-driven lab. For cost reasons, the lab will be limited to what we consider to be most essential for creating data-driven innovations.
The aim is to use open source where possible. The project also wants to use cloud services without fixed charges as far as possible and to minimize the need for our own hardware. Our time constraints make it necessary to use ready-made solutions to a large extent, even if these lack transparency and openness.
At least during the relatively short establishment phase, it is therefore likely that solutions not based on open-source and transparency principles will be used. After the establishment phase, it is reasonable to decide, on the basis of the ground rules and the economic and personnel conditions at that time, whether the use of open-source solutions can be extended.
Solutions developed from scratch would be too costly to produce and operate. We have therefore chosen to build on existing cloud services, specifically Amazon’s, for two reasons. First, we assess that their technical environment is more mature regarding support for machine learning and the handling of sensor data. Second, their pricing model uses only variable costs, which will save us money since the intensity of the lab’s activities will vary over time. Microsoft’s cloud services are said to have a price model based on fixed costs, which, at least in this initial stage, is not ideal.
Lab environment, not a development environment
The lab consists of solutions for developing concepts and working code. The transfer of an innovation to a fully functional development environment outside of the lab should be facilitated as far as possible. Providing a fully functional development environment might be part of a future upscaling of the lab.
The lab needs data. The data available in the lab is a combination of data from sensors supplied to the lab over a LoRaWAN network, data files that the project acquires from our stakeholders, and existing data sources containing open data.
All data is stored in Amazon’s S3 storage solution. This is a cost-effective solution that can smoothly handle the volumes of data that are relevant at this stage. It does mean that real-time analysis cannot be performed. This need will probably arise, and if the budget allows, components for this will also be added, possibly sooner rather than later.
The value creation from the data is primarily done using the libraries provided by Apache Spark and other libraries that can be used from Python. To test parallelization and build knowledge about how efficient a program remains if the amount of data increases significantly, a small cluster with at least two worker nodes will exist in the lab environment.
Data from the sensors is handled by LoRaWAN, a technology for transmitting sensor data. The lab will have sensors available for lending to anyone who wants to try out an idea involving this type of sensor.
Besides sensors using LoRaWAN, the project will also have sensors that use Bluetooth technology to count the number of mobile phones within a limited area. If possible, this data will also be transferred over LoRaWAN. If that is impossible, they will be connected to a common wireless network or a cellular network.
The interface towards the innovator is primarily Jupyter, an online “notebook”. In this software you can mix text and program code, so that you can describe the challenge and other interesting pieces of text and then have an adjoining area of executable program code, which is the concrete artefact that does what must be done to solve the challenge.
From the notebook, Python programs can access the different libraries included in, for example, Spark, as well as other libraries used for the analysis and presentation of data.
The sensors connected to LoRaWAN capture the following types of data:
- The temperature, amount of light, CO2 and the number of people in a room
- Position, motion and temperature
- Number of people (mobile phones) in a defined area
- Fill ratio and/or weight of the contents of containers
Data from these sensors is fixed test data, but the sensors can also be activated to deliver data in real time if the innovator needs this type of data. We also hope to be able to buy sensors and lend them to innovators who need to have them in their own facilities, buildings etc.
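To illustrate, a minimal sketch of how one stored sensor reading might be handled in Python in the notebook. The JSON record layout below is a hypothetical example for the room-sensor data types listed above, not the lab's actual format:

```python
import json

# Hypothetical JSON record for a room sensor, as one line might be stored in S3.
raw = '{"sensor_id": "room-12", "temperature": 21.4, "light": 310, "co2": 612, "people": 4}'

def parse_reading(line):
    """Parse one JSON-encoded sensor reading into a Python dict."""
    record = json.loads(line)
    # Basic sanity check on the fields this sketch assumes.
    for field in ("sensor_id", "temperature", "co2", "people"):
        if field not in record:
            raise ValueError(f"missing field: {field}")
    return record

reading = parse_reading(raw)
print(reading["sensor_id"], reading["co2"])
```

The same function works whether the line comes from a fixed test file or from a live LoRaWAN feed written to storage.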
Data from the “Bluetooth pucks” that measure the number of mobile phones in the vicinity is handled in the same way. The ambition is that all sensor data in the lab will be open, but if an innovator wishes to keep their own captured sensor data private, it will not be published as open data in the lab environment.
Open data from existing CKAN portals is retrieved by the innovator from the respective portal, preferably via the CKAN API. Subsets of this data may be stored in the lab’s S3 solution or in the lab’s database server if this makes it easier for the innovator.
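A small sketch of such a retrieval, assuming the portal named below exposes the standard CKAN action API (`package_search`); the canned response and the dataset name in it are invented for illustration:

```python
import json
from urllib.parse import urlencode

# Portal base URL taken from the list of sources; the action API path is standard CKAN.
CKAN_BASE = "http://opendata.opennorth.se"

def package_search_url(query, rows=5):
    """Build a CKAN package_search request URL for a free-text dataset query."""
    return f"{CKAN_BASE}/api/3/action/package_search?" + urlencode({"q": query, "rows": rows})

# A canned response in the shape CKAN's action API returns, standing in for a real call.
sample_response = json.loads(
    '{"success": true, "result": {"count": 1, "results": [{"name": "air-quality"}]}}'
)

names = []
if sample_response["success"]:
    names = [ds["name"] for ds in sample_response["result"]["results"]]
print(package_search_url("air"), names)
```

In the notebook, the URL would be fetched with an HTTP library and the JSON parsed the same way.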
The open data that will be provided is:
- All data from the municipalities of Umeå and Skellefteå on http://opendata.opennorth.se/dataset
- Data from the Environmental Protection Agency found on http://data.naturvardsverket.se/
- Data from the National Land Survey (GSD Roadmap, mountain map and terrain map, in vector and Shape format) from ftp://download-opendata.lantmateriet.se/
- Map data via the National Land Survey API “Geodatatjänsten topographical site map”
- Traffic data through Trafiklab’s APIs https://www.trafiklab.se/api/ and the related APIs available on https://www.trafiklab.se/api/ovriga
It might happen that innovators want to combine their own data with open data. There may therefore be private data that is only accessible to those who have received the explicit permission of the innovator/owner of the data. At present, no such data has been identified.
The algorithms/logic that process data are written in the web-based “notebook” Jupyter. Through various Python libraries, visualizations can also be produced, but there will be a need for more tools in the toolbox than Python and Jupyter. Which additional tools should be added is not clear at present.
Jupyter’s interface looks similar to the image below.
Link with FIWARE
FIWARE is a middleware that encapsulates complexity behind open APIs, which makes it easier for application developers to create solutions based on data of various types. For managing big data and various types of processing of distributed data, FIWARE offers COSMOS, which seems to be based on a Hadoop cluster.
This seems not to be on par with current developments. The mainstream efforts regarding Big Data/AI seem to be directed towards Apache Spark, which is faster and has a simpler interface for the developer, while supporting fault-tolerant distributed data processing like Hadoop.
The MapReduce function in Hadoop is disk-based, which seems to be too slow for many use cases. Hadoop seems to be on its way to retirement, except for its file system, HDFS.
Based on this, a possibility that we might eventually pursue is to create a common lab environment that combines Jupyter and Spark, through a cooperation between project Ladd and the Sense Smart Region project, which is about to launch an instance of FIWARE.