Sensor Fusion
Assembly Line
Connecting the Dots: How AI Can Make Sense of the Real World
Working with Infineon, a global semiconductor manufacturer and leader in sensing and IoT, we are exploring how such powerful human-like functions can be developed and deployed in real-world applications using generative physical AI models like Newton. These models seamlessly integrate real-time events captured by simple, ubiquitous sensors (such as radars, microphones, proximity sensors, and environmental sensors) with high-level contextual information to generate rich and detailed interpretations of real-world behaviors. Importantly, this is achieved without requiring developers to explicitly define such interpretations and without relying on complex, expensive, and privacy-invasive sensors like cameras.
Generative physical AI models, such as Newton, are able to overcome these challenges for the first time, unlocking a boundless range of applications. We explored Newton's ability to interpret real-world context and human activities by combining radar and microphone data. In our demo scenarios, Newton powers a home assistant in a kitchen setting, helping a user through their morning routine in one situation and in another helping to keep residents safe when the smoke alarm goes off.
When fused with additional contextual data (such as location, time, day of the week, weather, news, or user preferences), Newton can provide personalized and relevant recommendations or services. This capability makes it possible to go beyond basic sensor interpretations, offering meaningful insights tailored to the needs of individual users or organizations.
IBM and AWS partnering to transform industrial welding with AI and machine learning
IBM Smart Edge for Welding on AWS utilizes audio and visual capturing technology developed in collaboration with IBM Research. Using visual and audio recordings taken at the time of the weld, state-of-the-art artificial intelligence and machine learning models analyze the quality of the weld. If the quality does not meet standards, alerts are sent, and remediation action can take place without delay.
The solution substantially reduces the time between detection and remediation of defects, as well as the number of defects on the manufacturing line. By leveraging a combination of optical, thermal, and acoustic insights during the weld inspection process, two key manufacturing personas, the weld technician and the process engineer, can better determine whether a welding discontinuity may result in a defect that will cost time and money.
Sensor Fusion with AI Transforms the Smart Manufacturing Era
Bosch calls its new semiconductor fab in Dresden a smart factory with highly automated, fully connected machines and integrated processes combined with AI and internet of things (IoT) technologies for facilitating data-driven manufacturing. With machines that think for themselves and glasses with built-in cameras, maintenance work in this fab can be performed from 9,000 kilometers (about 5,592 miles) away.
STMicroelectronics (STMicro) has added compute power to sensing in what it calls an intelligent sensor processing unit (ISPU). It combines a DSP suited to running AI algorithms and a MEMS sensor on the same chip. The merger of sensors and AI puts electronic decision-making on the edge, while enabling smart sensors to sense, process and take actions, bridging the fusion of technology and the physical world.
Meta-Transformer: A Unified Framework for Multimodal Learning
Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g. natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers.
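The three components named in the abstract (a data tokenizer per modality, a modality-shared encoder with frozen parameters, and task-specific heads) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the token width `D`, the random weights, the character-level and per-sample tokenizers, and the task names are hypothetical, and the shared encoder is a single linear layer with mean pooling standing in for the frozen transformer encoder the paper describes.

```python
import random

D = 4  # shared token width (assumed for this sketch)
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(vec, w):
    """Multiply a length-k vector by a k x m weight matrix."""
    return [sum(v * w[i][j] for i, v in enumerate(vec)) for j in range(len(w[0]))]

# 1) Modality-specific tokenizers: each maps raw data into the same D-wide token space.
W_TEXT = rand_matrix(1, D)
W_IMU = rand_matrix(3, D)

def tokenize_text(text):
    # One token per character (hypothetical scheme).
    return [project([ord(ch) / 128.0], W_TEXT) for ch in text]

def tokenize_imu(samples):
    # One token per (ax, ay, az) accelerometer sample (hypothetical scheme).
    return [project(list(s), W_IMU) for s in samples]

# 2) Modality-shared encoder: weights are created once and never updated ("frozen").
W_ENC = rand_matrix(D, D)

def shared_encoder(tokens):
    # Stand-in for a frozen transformer: per-token linear map + ReLU, then mean pooling.
    hidden = [[max(0.0, x) for x in project(t, W_ENC)] for t in tokens]
    return [sum(col) / len(hidden) for col in zip(*hidden)]

# 3) Task-specific heads: only these (and the tokenizers) would be trained.
HEADS = {
    "text_sentiment": rand_matrix(D, 2),  # 2 classes (assumed)
    "imu_activity": rand_matrix(D, 3),    # 3 classes (assumed)
}

def predict(task, tokens):
    return project(shared_encoder(tokens), HEADS[task])

text_logits = predict("text_sentiment", tokenize_text("hello"))
imu_logits = predict("imu_activity", tokenize_imu([(0.0, 0.0, 9.8), (0.1, 0.0, 9.7)]))
```

The point of the structure is that both modalities pass through the same `shared_encoder`; only the tokenizers and heads differ per modality or task.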
ImageBind: One Embedding Space To Bind Them All
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications "out-of-the-box" including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
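Once every modality lives in one embedding space, cross-modal retrieval and "composing modalities with arithmetic" reduce to vector operations. The sketch below illustrates that idea only: the 3-dimensional vectors are hand-made toy values (axes loosely meaning dog, beach, street), not outputs of ImageBind, and the helper names `retrieve` and `compose` are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

def compose(a, b):
    """Combine two embeddings by adding their unit vectors."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

# Toy "image" embeddings in the shared space (made-up values).
image_gallery = {
    "dog_on_beach": [0.7, 0.7, 0.0],
    "dog_on_street": [0.7, 0.0, 0.7],
    "empty_beach": [0.0, 1.0, 0.1],
}

# Toy query embeddings from other inputs (made-up values).
image_dog = [1.0, 0.1, 0.1]
audio_waves = [0.0, 1.0, 0.0]

# Cross-modal retrieval: an audio query against an image gallery.
print(retrieve(audio_waves, image_gallery))                      # empty_beach
# Embedding arithmetic: dog image + wave audio.
print(retrieve(compose(image_dog, audio_waves), image_gallery))  # dog_on_beach
```

With real encoders, the same two functions would operate on high-dimensional embeddings produced by the trained model for each modality.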
The ingenious micro-mechanisms inside your phone
Perceiver: General Perception with Iterative Attention
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
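The asymmetric attention described above can be sketched as cross-attention in which a small latent array supplies the queries and the large input array supplies the keys and values, so the score matrix is N x M rather than M x M. This is a minimal single-head illustration under stated assumptions: the sizes, random weights, and residual update are chosen for the example, the same projection weights are reused on every iteration, and the latent self-attention layers that the Perceiver applies between cross-attention steps are omitted.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Queries come from the N latents; keys and values come from the M inputs."""
    q = matmul(latents, w_q)                  # N x D
    k = matmul(inputs, w_k)                   # M x D
    v = matmul(inputs, w_v)                   # M x D
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]      # D x M
    scores = matmul(q, k_t)                   # N x M, never M x M
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights        # N x D summary of the inputs

M, C = 200, 3   # many input elements with few channels (e.g. pixels)
N, D = 4, 8     # tight latent bottleneck
inputs = rand_matrix(M, C, scale=1.0)
latents = rand_matrix(N, D)
w_q, w_k, w_v = rand_matrix(D, D), rand_matrix(C, D), rand_matrix(C, D)

# Iteratively distill the inputs into the latents with a residual update.
for _ in range(3):
    update, weights = cross_attend(latents, inputs, w_q, w_k, w_v)
    latents = [[l + u for l, u in zip(lrow, urow)]
               for lrow, urow in zip(latents, update)]

print(len(latents), len(latents[0]), len(weights), len(weights[0]))  # 4 8 4 200
```

Because the cost of each step grows with N times M instead of M squared, the input size can increase substantially while the latent array, and any processing applied to it, stays fixed.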