
    Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

    Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) that separates these two types of features so that each downstream task can choose what it really needs. Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the supervision design: we split the visual representation into style and content features, supervise the content features with a text recognition loss, and align the style features of each image pair with an alignment loss. The style features are then used to reconstruct the counterpart image via an image decoder, guided by a prompt that indicates the counterpart's content. This operation effectively decouples the features according to their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing. Comment: Accepted to CVPR 2024
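
    The supervision design described above can be sketched compactly. Below is a minimal, illustrative PyTorch sketch of the style/content split and the two losses; the module and function names, the even feature split, and the use of MSE for the alignment loss are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Split a backbone's visual representation into style and content halves."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone    # any image encoder producing (B, feat_dim)
        self.split = feat_dim // 2  # first half: style, second half: content

    def forward(self, x: torch.Tensor):
        feat = self.backbone(x)
        return feat[:, :self.split], feat[:, self.split:]  # (style, content)

def style_content_loss(style_a, style_b, content_logits, text_targets):
    # Content branch: supervised by a text recognition loss (plain
    # cross-entropy as a stand-in for the paper's recognizer head).
    rec_loss = F.cross_entropy(content_logits, text_targets)
    # Style branch: the image pair shares style but not content,
    # so the two style features are pulled together.
    align_loss = F.mse_loss(style_a, style_b)
    return rec_loss + align_loss
```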

    How Control Information Influences Multilingual Text Image Generation and Editing?

    Visual text generation has advanced significantly through diffusion models aimed at producing images with readable and realistic text. Recent works primarily use a ControlNet-based framework, employing standard-font text images to control the diffusion model. Recognizing the critical role of control information in generating high-quality text, we investigate its influence from three perspectives: input encoding, role at different stages, and output features. Our findings reveal that: 1) input control information has unique characteristics compared to conventional inputs like Canny edges and depth maps; 2) control information plays distinct roles at different stages of the denoising process; and 3) output control features differ significantly from the base and skip features of the U-Net decoder in the frequency domain. Based on these insights, we propose TextGen, a novel framework designed to enhance generation quality by optimizing control information. We improve the input and output features using Fourier analysis to emphasize relevant information and reduce noise. Additionally, we employ a two-stage generation framework to align with the different roles of control information at different stages. Furthermore, we introduce an effective and lightweight dataset for training. Our method achieves state-of-the-art performance in both Chinese and English text generation. The code and dataset are available at https://github.com/CyrilSterling/TextGen
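
    The Fourier-domain refinement of control features can be illustrated with a short sketch. The low-pass mask and the keep_ratio parameter below are assumptions for illustration; TextGen's actual filtering may differ.

```python
import torch

def fourier_refine(feat: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the central (low-frequency) region of a (B, C, H, W) feature
    map's spectrum and suppress the rest, then transform back."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    _, _, H, W = feat.shape
    mask = torch.zeros(H, W, device=feat.device)
    h, w = int(H * keep_ratio) // 2, int(W * keep_ratio) // 2
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0
    spec = spec * mask
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)), norm="ortho")
    return out.real
```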

    Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

    In text recognition, self-supervised pre-training has emerged as a good solution for reducing dependence on expensive annotated real data. Previous studies primarily focus on local visual representations by leveraging masked image modeling or sequence contrastive learning, but they neglect to model the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct direction-specific pixel and feature signals from a symmetrically superimposed input. Specifically, we add the original image to its inverted view to create the symmetrically superimposed input. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the same original and inverted images under different augmentations to model semantic-level linguistic context and local character discrimination. Because the superimposition disrupts character shapes and linguistic rules, the dual-level reconstruction encourages understanding of character shapes and linguistic information from the perspectives of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM. Comment: Accepted to IJCAI 2024
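
    A minimal sketch of the superimposed input and the dual-level objective follows; the 180° rotation as the "inverted view", the L1/MSE loss choices, and the equal loss weights are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def superimpose(img: torch.Tensor) -> torch.Tensor:
    """img: (B, C, H, W). Average the image with its inverted view."""
    inverted = torch.flip(img, dims=(-2, -1))  # flip H and W = 180° rotation
    return 0.5 * (img + inverted)

def ssm_loss(pred_orig, pred_inv, img, feat_orig, feat_inv, tgt_orig, tgt_inv):
    inverted = torch.flip(img, dims=(-2, -1))
    # Pixel level: reconstruct both the original and the inverted image.
    pixel = F.l1_loss(pred_orig, img) + F.l1_loss(pred_inv, inverted)
    # Feature level: match features of differently augmented views.
    feature = F.mse_loss(feat_orig, tgt_orig) + F.mse_loss(feat_inv, tgt_inv)
    return pixel + feature
```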

    On-chip Q-factor greater than 1 billion

    A record Q-factor of 1.1 billion is demonstrated in on-chip silica whispering-gallery resonators. Using these resonators, a sub-milliwatt parametric oscillation threshold is measured in 9 GHz free-spectral-range devices.
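
    For scale, a Q of 1.1 billion implies a resonance linewidth of roughly 176 kHz, assuming a ~1550 nm pump (the wavelength is not stated above):

```python
c = 299_792_458.0   # speed of light, m/s
lam = 1550e-9       # assumed pump wavelength, m
nu = c / lam        # optical frequency, ~193.4 THz
Q = 1.1e9
linewidth = nu / Q  # linewidth implied by Q = nu / linewidth
print(f"linewidth ~ {linewidth / 1e3:.0f} kHz")  # ~176 kHz
```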

    Impact of spatio-temporal thermal decoherence on soliton microcombs in multimode microresonators

    The phase noise of the soliton repetition rate is experimentally characterized in silica microresonators. In conjunction with dispersive-wave quieting of pump technical noise, spatio-temporal fluctuations of distinct transverse modes are shown to set a limit on performance.

    An all-photonic, dynamic device for flattening the spectrum of a laser frequency comb for precise calibration of radial velocity measurements

    Laser frequency combs are fast becoming critical to reaching the highest radial velocity precisions. One shortcoming is the highly variable brightness of the comb lines across the spectrum (up to 4-5 orders of magnitude), which can cause some lines to saturate while others sit at low signal, lost in the noise. Losing lines to either effect reduces the precision and hence the effectiveness of the comb. In addition, the brightness of the comb lines can vary with time, which can drive lines with initially reasonable SNRs into the two regimes described above. To mitigate these two effects, laser frequency combs use optical flatteners. Flatteners are typically bulk-optic setups that disperse the comb light with a grating, use a spatial light modulator to control the amplitude across the spectrum, and then recombine the light into another single-mode fiber before sending it to the spectrograph. These setups can be large (a small bench top), expensive (several hundred thousand dollars), and of limited stability. To address these issues, we have developed an all-photonic spectrum flattener on a chip. The device is constructed from optical waveguides on a SiN chip. The output optical fiber of the laser frequency comb can be connected directly to the chip, where the light is first dispersed by an arrayed waveguide grating. To control the brightness of each channel, the light passes through a Mach-Zehnder interferometer before being recombined by a second arrayed waveguide grating. Thermo-optic phase modulators in each channel path-length match the channels as needed before recombination. Here we present results from our first-generation prototype. The device operates from 1400-1800 nm (covering the H band), with twenty 20 nm-wide channels. Comment: 7 pages, 5 figures, conference
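
    The per-channel amplitude control stage admits a simple model: a balanced Mach-Zehnder interferometer's power transmission as a function of the thermo-optically tuned phase difference between its arms. This is the idealized, lossless textbook relation, not code from the paper.

```python
import numpy as np

def mzi_transmission(phase: np.ndarray) -> np.ndarray:
    """Power transmission of an ideal, balanced Mach-Zehnder interferometer
    versus the phase difference (radians) between its two arms."""
    return np.cos(phase / 2.0) ** 2

phases = np.linspace(0.0, np.pi, 5)
print(mzi_transmission(phases))  # 1.0 at phase 0, falling to 0.0 at phase pi
```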

    Flattening laser frequency comb spectra with a high dynamic range, broadband spectral shaper on-a-chip

    Spectral shaping is critical to many fields of science. In astronomy, for example, the detection of exoplanets via the Doppler effect hinges on the ability to calibrate a high-resolution spectrograph. Laser frequency combs can be used for this, but the wildly varying intensity across the spectrum can make it impossible to optimally utilize the entire comb, reducing the overall precision of calibration. To circumvent this, astronomical applications of laser frequency combs rely on a bulk-optic setup that flattens the output spectrum before sending it to the spectrograph. Such flatteners require complex and expensive optical elements like spatial light modulators and have non-negligible bench-top footprints. Here we present an alternative in the form of an all-photonic spectral shaper that can be used to flatten the spectrum of a laser frequency comb. The device consists of a circuit etched into a silicon nitride wafer that supports an arrayed-waveguide grating to disperse the light over hundreds of nanometers in wavelength, followed by Mach-Zehnder interferometers to control the amplitude of each channel, thermo-optic phase modulators to phase the channels, and a second arrayed-waveguide grating to recombine the spectrum. The demonstrator device operates from 1400 to 1800 nm (covering the astronomical H band), with twenty 20 nm-wide channels. The device allows for nearly 40 dB of dynamic modulation of the spectrum via the Mach-Zehnders, which is greater than that offered by most spatial light modulators. With a superluminescent diode, we reduced the static spectral variation to ~3 dB, limited by the properties of the components used in the circuit; on a laser frequency comb, we reduced the modulation to 5 dB, sufficient for astronomical applications. Comment: 15 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:2209.0945
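
    The flattening logic itself reduces to choosing a per-channel attenuation. A minimal sketch, assuming per-channel powers can be measured and using the quoted ~40 dB dynamic range as a clip limit (the function name and interface are hypothetical):

```python
import numpy as np

def flatten_attenuations_db(channel_powers_dbm: np.ndarray,
                            max_range_db: float = 40.0) -> np.ndarray:
    """Attenuation (dB) each channel's MZI should apply so that all
    channels match the weakest one, clipped to the shaper's range."""
    target = channel_powers_dbm.min()
    return np.clip(channel_powers_dbm - target, 0.0, max_range_db)

powers = np.array([-10.0, 5.0, 0.0, -3.0])  # example comb-line powers, dBm
print(flatten_attenuations_db(powers))      # [ 0. 15. 10.  7.]
```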

    Hertz-linewidth semiconductor lasers using CMOS-ready ultra-high-Q microresonators

    Driven by narrow-linewidth bench-top lasers, coherent optical systems spanning optical communications, metrology and sensing provide unrivalled performance. To transfer these capabilities from the laboratory to the real world, a key missing ingredient is a mass-produced integrated laser with superior coherence. Here, we bridge conventional semiconductor lasers and coherent optical systems using CMOS-foundry-fabricated microresonators with a record-high Q factor over 260 million and finesse over 42,000. Five orders of magnitude of noise reduction in the pump laser is demonstrated, and for the first time, fundamental noise below 1 Hz²/Hz is achieved in an electrically pumped integrated laser. Moreover, the same configuration is shown to relax the dispersion requirements for microcomb generation that have handicapped certain nonlinear platforms. The simultaneous realization of a record-high Q factor, highly coherent lasers, and frequency combs using foundry-based technologies paves the way for volume manufacturing of a wide range of coherent optical systems. Comment: 19 pages, 11 figures
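
    As a rough consistency check on the quoted figures: finesse F = FSR/linewidth and Q = nu/linewidth together give FSR = F * nu / Q. Assuming a ~1550 nm carrier (not stated above), the quoted Q and finesse imply a linewidth near 744 kHz and a free spectral range of order 31 GHz; this is back-of-the-envelope arithmetic, not a value reported above.

```python
c = 299_792_458.0         # speed of light, m/s
nu = c / 1550e-9          # assumed carrier frequency, ~193.4 THz
Q, F = 2.6e8, 4.2e4       # quoted Q factor and finesse
linewidth = nu / Q        # ~744 kHz
fsr = F * nu / Q          # ~31 GHz
print(f"linewidth ~ {linewidth / 1e3:.0f} kHz, FSR ~ {fsr / 1e9:.0f} GHz")
```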