<!doctype html>
<html lang="en">

<!-- === Header Starts === -->
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title> BEVGen: Street-View Image Generation from a Bird's-Eye View Layout</title>
    <link type="image/png" sizes="96x96" rel="icon" href="assets/favicon.png">
    <link href="./assets/bootstrap.min.css" rel="stylesheet">
    <link href="./assets/font.css" rel="stylesheet" type="text/css">
    <link href="./assets/style.css" rel="stylesheet" type="text/css">
    <script src="./assets/jquery.min.js"></script>
    <script type="text/javascript" src="assets/corpus.js"></script>

    <!-- <a target="_blank" href="https://icons8.com/icon/7bGlJrKnisOw/car-top-view">Car</a> icon by <a target="_blank" href="https://icons8.com">Icons8</a>-->

</head>
<!-- === Header Ends === -->

<script>
    var lang_flag = 1;
</script>

<body>

<!-- === Home Section Starts === -->
<div class="section" style="margin-top: 15pt">
    <!-- === Title Starts === -->
    <div class="header">
        <table>
            <tr>
                <td>
                    <div style="padding-top: 0pt;padding-left: 140pt;padding-bottom: 12pt; text-align: center" class="title" id="lang" >
                        <b> BEVGen: Street-View Image Generation from <br>a Bird's-Eye View Layout </b>
                    </div>

                </td>
            </tr>
        </table>


    </div>
    <!-- === Title Ends === -->
    <div class="author">
        <a href="https://aswerdlow.com" target="_blank">Alexander Swerdlow</a>,
        <a href="https://derrickxunu.github.io/">Runsheng Xu</a>,&nbsp;
        <a href="https://boleizhou.github.io/" target="_blank">Bolei Zhou</a>&nbsp;&nbsp;
    </div>

    <div class="institution" style="font-size: 11pt;">
        <div>
         University of California, Los Angeles
        </div>
    </div>
    <table border="0" align="center">
        <tr>
            <td align="center" style="padding: 0pt 0 15pt 0">
                <a class="bar" href="https://metadriverse.github.io/bevgen/"><b>Webpage</b></a> |
                <a class="bar" href="https://github.com/alexanderswerdlow/BEVGen"><b>Code</b></a> |
                <a class="bar" href="https://arxiv.org/abs/2301.04634"><b>Paper</b></a>
            </td>
        </tr>
    </table>
    <center>
            <img src="assets/images/argoverse_intro.png" alt="Image 1" width="100.4%" style="margin-bottom: 10px;">
            <img src="assets/images/nuscenes_intro.png" alt="Image 2" width="100%">
    </center>
</div>


<!-- === Overview Section Starts === -->
<div class="section">
    <div class="title" id="method">Overview</div>
    <div class="body">
        <div class="teaser">
            <img src="assets/images/figure_1.png">
            <div class="text">
                <br>
                Fig. 1: BEVGen framework. A BEV layout and source multi-view images are encoded into a discrete representation and flattened before being passed to the autoregressive transformer. Spatial embeddings are added to both camera and BEV tokens inside each transformer block, and the learned pairwise camera bias is added to the attention weights. A weighted cross-entropy (CE) loss is applied during training; during inference, the tokens are passed to the decoder to obtain the generated images.
            </div>
        </div>
        <div class="text">
            <p>
                In this work, we tackle the new task of generating street-view images from a BEV layout. To address the underlying challenges, we develop BEVGen, an autoregressive neural model that generates a set of realistic and spatially consistent images. BEVGen has two technical novelties: (i) it incorporates spatial embeddings using camera intrinsics and extrinsics to allow the model to attend to relevant portions of the images and HD map, and (ii) it contains a novel attention bias and decoding scheme that maintains both image consistency and correspondence.
            </p>
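            <p>
                As a rough illustration of how an additive camera bias can enter attention, the sketch below adds a bias term to the attention scores before the softmax, under a causal mask as in autoregressive decoding. It is a simplified toy example, not the BEVGen implementation: the function name <code>attention_with_camera_bias</code>, the per-camera-pair bias table <code>cam_bias</code>, and the choice to add the bias before the softmax are all assumptions made for illustration.
            </p>
            <pre>
```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_camera_bias(q, k, v, cam_ids, cam_bias):
    """Causal single-head attention with an additive camera-pair bias.

    q, k, v  : lists of n token vectors (lists of floats)
    cam_ids  : camera index of each token (hypothetical bookkeeping)
    cam_bias : table where cam_bias[a][b] is the bias applied when a
               token from camera a attends to a token from camera b
               (assumed here to be one scalar per camera pair)
    """
    n, d = len(q), len(q[0])
    out, weights = [], []
    for i in range(n):
        scores = []
        # Causal mask: token i only attends to tokens 0..i.
        for j in range(i + 1):
            dot = sum(a * b for a, b in zip(q[i], k[j]))
            bias = cam_bias[cam_ids[i]][cam_ids[j]]
            scores.append(dot / math.sqrt(d) + bias)
        w = softmax(scores)
        # Pad masked positions with zero weight for a full n-by-n matrix.
        weights.append(w + [0.0] * (n - i - 1))
        out.append([sum(w[j] * v[j][c] for j in range(i + 1))
                    for c in range(len(v[0]))])
    return out, weights
```
            </pre>
            <p>
                In this toy setting, raising <code>cam_bias[a][b]</code> shifts attention mass from tokens of camera <code>a</code> toward tokens of camera <code>b</code>, which is one way a learned bias could encourage consistency between overlapping views.
            </p>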

            <div class="teaser" style="text-align: center;">
                <img src="assets/images/3_6a467240b14745cd8ab11701bd4f08d3.png" style="width: 49.5%; display: inline-block; margin-bottom: 7px;">
                <img src="assets/images/1_8a51ed0367034945927c7c70eda2fd59.png" style="width: 49.5%; display: inline-block; margin-bottom: 7px;">
                <br>
                <img src="assets/images/a7e83e3b5cd24948a58c93c66ba54d77.png" style="width: 49.5%; display: inline-block; margin-bottom: 7px;">
                <img src="assets/images/a0214502bae94c408f317587614cac49.png" style="width: 49.5%; display: inline-block; margin-bottom: 7px;">
                <br>
                <img src="assets/images/f8883f89a30a4cafb79e49f6f533e4f5_W4I0EH2L3D.png" style="width: 49.5%; display: inline-block; margin-bottom: 15px;">
                <img src="assets/images/f8883f89a30a4cafb79e49f6f533e4f5_FZBU6TYAWV.png" style="width: 49.5%; display: inline-block; margin-bottom: 15px;">
                <br>
                <div class="text"><br>Fig. 2: Synthesized multi-view images from BEVGen on nuScenes. Image contents are diverse and realistic. The two instances in the bottom row use the same BEV layout for synthesizing the same location in day and night.</div>
             </div>
        </div>
    </div>
</div>

<div class="section" style=" text-align: left">
    <div class="title" id="Demo Video">Demo Video</div>
    To further demonstrate the spatial disentanglement of the model, we compare the generated images to the source images over multiple frames from the same scene. We use the original BEV layouts from the validation set and create a video by concatenating the generated frames. We observe that cars and road markings generally stay consistent between frames, with the layout visually matching the source images. Note that our model does not enforce temporal consistency, so the generated vehicles and background scenery may differ between two adjacent frames. We leave incorporating temporal consistency for future work.
    &nbsp;

    <div class="body" style="display: flex; justify-content: center;">
        <iframe style="margin-top: 30px;" width="900" height="600" src="https://www.youtube.com/embed/8AFKpCFwwOo" title="BEVGen — Continuous Scene Generation" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
    </div>
</div>

<!-- === Reference Section Starts === -->
<div class="section">
    <div class="bibtex">
        <div class="text">Reference</div>
    </div>
    <pre>
@article{swerdlow2024streetview,
    title={Street-View Image Generation from a Bird's-Eye View Layout}, 
    author={Alexander Swerdlow and Runsheng Xu and Bolei Zhou},
    year={2024},
    journal={IEEE Robotics and Automation Letters},
}
    </pre>
</div>

<div class="section" style=" text-align: left">
    <div class="title" id="Ack">Acknowledgement </div>
    This work was supported by the National Science Foundation under Grant No. 2235012.
</div>

</body>
</html>