Simplifying data science to make it easier and less expensive: the goal of University of Cantabria (UC) researchers
The Journal of Computer Languages has named as its best article of 2020 a paper from the University of Cantabria, where the Lavoisier language was developed to simplify data preparation processes
Alfonso de la Vega, Diego Garcia-Saez, Marta Zorrilla and Pablo Sanchez are members of the Software Engineering and Real-Time research group at the University of Cantabria and authors of the article “Lavoisier: A DSL for increasing the level of abstraction of data selection and formatting in data mining”, which the Journal of Computer Languages selected as its best publication of 2020.
Lavoisier is the name of the language, which cuts the complexity of the data preparation operations that precede analysis by 50% and, in some cases, by up to 80%, explains researcher Pablo Sanchez.
The co-author of the award-winning article notes that data science applications, such as the recommendation algorithms used by Facebook or YouTube, are still developed with fairly traditional techniques and require the involvement of people with deep knowledge of the field. “It’s as if, to change a wheel on your car, you needed to hire someone with a doctorate in industrial engineering,” he says.
The input data for a data mining algorithm must conform to a very specific format. Data scientists arrange data into this format by writing long and complex scripts. Sanchez adds that this data preparation process is “too literal”: any engineer who has to prepare data before analyzing it must perform many low-level operations, which makes the work long, complex and tedious.
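To give a sense of the low-level work described here, the following is a minimal Python sketch of hand-written data preparation. It is not taken from the article and does not show Lavoisier syntax; the customer records, field names and the `to_flat_rows` helper are all hypothetical, chosen only to illustrate how nested data has to be flattened into the one-row-per-entity table that mining algorithms typically expect.

```python
# Hypothetical nested input: each customer carries a list of orders.
customers = [
    {"id": 1, "country": "ES", "orders": [{"total": 20.0}, {"total": 35.5}]},
    {"id": 2, "country": "FR", "orders": []},
]


def to_flat_rows(customers):
    """Flatten nested customer records into one fixed-width row per customer."""
    rows = []
    for customer in customers:
        # Collect the order totals so they can be summarized as columns.
        totals = [order["total"] for order in customer["orders"]]
        rows.append({
            "id": customer["id"],
            "country": customer["country"],
            "n_orders": len(totals),
            "total_spent": sum(totals),
            # Guard against customers with no orders.
            "avg_order": sum(totals) / len(totals) if totals else 0.0,
        })
    return rows


rows = to_flat_rows(customers)
for row in rows:
    print(row)
```

Even in this toy case, the loop, the aggregation and the empty-list guard all have to be written by hand, and each extra nested collection or derived column adds more of the same code.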
To address this situation, Lavoisier was created as a declarative language for selecting and formatting data in the context of data mining. With Lavoisier, the size of data preparation scripts can be reduced by 50% on average, and by as much as 80% in some cases. In addition, the accidental complexity of the technologies currently used for this task is greatly reduced.
In this way, the Lavoisier language abstracts away many of the low-level problems found in data preparation and significantly reduces its complexity. “Therefore, through this work, these data management processes can be implemented faster and more easily, with the consequent savings in time and money,” Sanchez concludes.
The Software Engineering and Real-Time research group, coordinated by Michael González Harbour, has spent years applying various software engineering techniques to data science.
The main outcome of this work was Alfonso de la Vega’s thesis, “Domain-Specific Languages for the Democratization of Data Mining”, which he defended in 2019 and whose results have been published in several high-impact international journals.
Another important result is Pinset, a language similar to Lavoisier that is included in the Epsilon model transformation toolkit, which is used in universities around the world and at companies such as Rolls-Royce.
In addition, the researchers are already working with organizations in the region to provide them with data analysis services, as in the case of Cantabria’s 112 emergency service.