Reimagining Software Development with Amazon Q Developer Agent

Machine Learning


Amazon Q Developer is an AI-powered software development assistant that reimagines the experience across the entire software development lifecycle. It helps you build, secure, manage, and optimize applications on and off AWS faster. Amazon Q Developer includes a feature development agent that uses natural language input to automatically implement multi-file features, bug fixes, and unit tests in your integrated development environment (IDE) workspace. When you enter a query, the software development agent analyzes your code base and develops a plan to fulfill your request. You can accept the plan or ask the agent to iterate on it. Once the plan is validated, the agent generates the code changes required to implement the requested functionality. You can then review and accept the code changes or request a revision.

Amazon Q Developer uses generative artificial intelligence (AI) to bring state-of-the-art accuracy to every developer and has achieved first place on the leaderboard for SWE-bench, a dataset that tests a system's ability to automatically resolve GitHub issues. In this post, we explain how to get started with the feature development agent, provide an overview of the underlying mechanisms that make it a state-of-the-art agent, and discuss how it performs on public benchmarks.

Get started

To get started, you need to have an AWS Builder ID or belong to an organization that has an AWS IAM Identity Center instance set up that can use Amazon Q. To use the Amazon Q Developer Agent for feature development in Visual Studio Code, first install the Amazon Q extension. The extension is also available for JetBrains, Visual Studio (preview), and command line on macOS. You can find the latest version on the Amazon Q Developer page.

Amazon Q App Card in VS Code

After authenticating, you can invoke the feature development agent by entering /dev in the chat box.

Calling /dev on Amazon Q

The feature development agent is now ready to act on your request. We use Amazon's Chronos forecasting model repository to illustrate how the agent works. The Chronos code is already of high quality, but its unit test coverage could be improved in places. We ask the agent to improve the unit test coverage of the chronos.py file. Communicating your request as clearly and precisely as possible helps the agent deliver the best possible solution.

/dev initial prompt

The agent returns a detailed plan for adding the missing tests to the existing test suite in test/test_chronos.py. To generate the plan (and later the code changes), the agent examines your code base to understand how to fulfill your request. The agent works best when file and function names are expressive of their intent.

The plan generated by the agent

You are asked to review the plan. If the plan looks good, choose Generate code to continue. If you see something that can be improved, you can provide feedback and request an improved plan.

The agent requesting validation of the plan

Once the code is generated, the software development agent lists the files for which it has created a diff (in this example, test/test_chronos.py). You can review the code changes and decide either to insert them into your code base or to provide feedback on possible improvements and regenerate the code.

A list of files that were modified by the agent.

When you select a modified file, the IDE opens a diff view showing the lines that were added or changed. The agent added several unit tests for parts of chronos.py that were not previously covered.

The diff generated by the agent.
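As an illustration of the style of change the agent proposes, below is a hypothetical pytest-style test in the spirit of those added to test/test_chronos.py. The helper function and the test names are our own illustrative stand-ins, not code from the Chronos repository or from the agent's actual diff.

# Illustrative sketch only: neither the helper nor the tests below come from the
# actual Chronos repository or the agent's real diff. They show the style of
# coverage the agent adds: small, focused test cases for untested behavior.
# Run with: pytest test_padding_sketch.py
import torch


def left_pad_series(series: torch.Tensor, length: int, value: float = float("nan")) -> torch.Tensor:
    """Hypothetical helper: left-pad a 1D series to `length` with `value`."""
    if series.numel() >= length:
        return series[-length:]
    pad = torch.full((length - series.numel(),), value)
    return torch.cat([pad, series])


def test_left_pad_series_pads_short_input():
    series = torch.tensor([1.0, 2.0, 3.0])
    padded = left_pad_series(series, length=5)
    assert padded.shape == (5,)
    assert torch.isnan(padded[:2]).all()
    assert torch.equal(padded[2:], series)


def test_left_pad_series_truncates_long_input():
    series = torch.arange(10, dtype=torch.float32)
    padded = left_pad_series(series, length=4)
    assert torch.equal(padded, torch.tensor([6.0, 7.0, 8.0, 9.0]))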

After reviewing the code changes, you can decide to insert them, provide feedback and generate the code again, or discard them entirely. That's it; there is nothing else to do. If you would like to request another feature, invoke /dev again in Amazon Q Developer.

System Overview

Now that we've described how to use the Amazon Q Developer Agent for software development, let's look at how it works. This is an overview of the system as of May 2024. We are continually improving the agent, and the logic described in this section will evolve and change.

When you submit a query, the agent generates a structured representation of the repository's file system in XML. Below is example output truncated for brevity:

<tree>
  <directory name="requests">
    <file name="README.rst"/>
    <directory name="requests">
      <file name="adapters.py"/>
      <file name="api.py"/>
      <file name="models.py"/>
      <directory name="packages">
        <directory name="chardet">
          <file name="charsetprober.py"/>
          <file name="codingstatemachine.py"/>
        </directory>
        <file name="__init__.py"/>
        <file name="README.rst"/>
        <directory name="urllib3">
          <file name="connectionpool.py"/>
          <file name="connection.py"/>
          <file name="exceptions.py"/>
          <file name="fields.py"/>
          <file name="filepost.py"/>
          <file name="__init__.py"/>
        </directory>
      </directory>
    </directory>
    <file name="setup.cfg"/>
    <file name="setup.py"/>
  </directory>
</tree>
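The post does not disclose how this XML representation is produced. As a rough sketch of the idea, a tree like the one above could be assembled with Python's standard library roughly as follows; the traversal logic and element names are our own assumptions, not Amazon Q's implementation.

# Minimal sketch (assumption, not Amazon Q's actual code): build an XML view
# of a repository's file system similar to the example above.
import os
import xml.etree.ElementTree as ET


def build_tree(path: str) -> ET.Element:
    """Recursively represent `path` as nested <directory>/<file> elements."""
    root = ET.Element("directory", name=os.path.basename(path.rstrip(os.sep)))
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            root.append(build_tree(full))
        else:
            root.append(ET.Element("file", name=entry))
    return root


if __name__ == "__main__":
    tree = ET.Element("tree")
    tree.append(build_tree("./requests"))  # hypothetical repository checkout
    print(ET.tostring(tree, encoding="unicode"))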

The LLM then uses this representation in its prompt to determine which files are relevant and need to be retrieved. The agent programmatically verifies that every file identified by the LLM is valid. Using the retrieved files, the agent generates a plan for solving the assigned task and returns it to you for validation or iteration. Once the plan is validated, the agent moves on to the next step and ultimately finishes by proposing code changes that resolve the issue.
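The validation step is not described in detail. A minimal sketch of the idea, with our own function and variable names, is to keep only the LLM-proposed paths that resolve to real files inside the repository:

# Sketch (assumption): filter LLM-proposed paths down to those that exist in the repository.
from pathlib import Path


def validate_paths(repo_root: str, candidate_paths: list[str]) -> list[str]:
    """Keep only candidate paths that resolve to real files under repo_root."""
    root = Path(repo_root).resolve()
    valid = []
    for candidate in candidate_paths:
        path = (root / candidate).resolve()
        # Guard against paths escaping the repository and against non-existent files.
        if path.is_file() and str(path).startswith(str(root)):
            valid.append(candidate)
    return valid


# Example: files the LLM asked for, one of which does not exist.
print(validate_paths(".", ["requests/api.py", "requests/does_not_exist.py"]))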

The contents of each retrieved code file are parsed with a syntactic parser to obtain an XML syntax tree representation of the code. The LLM can use this representation more efficiently than the source code itself, consuming far fewer tokens. Non-code files are encoded and chunked using logic commonly used in Retrieval Augmented Generation (RAG) systems, allowing efficient retrieval of chunks of documentation. The following is an example of the syntax tree representation.

The following screenshot shows a portion of the Python code:

A snippet of Python code

Below is its syntax tree representation.

Syntax tree representation of Python code
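The screenshots above show the representation rather than its exact schema. As a hedged sketch of how such a compact view could be derived for Python files, the standard ast module can extract classes, functions, and their line ranges without including the bodies; the element and attribute names below are assumptions, not the agent's actual format.

# Sketch (assumption, not the actual Amazon Q schema): summarize a Python file
# as an XML tree of its classes and functions, with line ranges but no bodies.
import ast
import xml.etree.ElementTree as ET


def file_to_syntax_tree(path: str) -> ET.Element:
    source = open(path).read()
    module = ast.parse(source)
    root = ET.Element("file", name=path)

    def visit(node: ast.AST, parent: ET.Element) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                tag = "class" if isinstance(child, ast.ClassDef) else "function"
                elem = ET.SubElement(
                    parent, tag, name=child.name,
                    start=str(child.lineno), end=str(child.end_lineno),
                )
                visit(child, elem)  # capture nested methods and classes

    visit(module, root)
    return root


print(ET.tostring(file_to_syntax_tree("chronos.py"), encoding="unicode"))  # illustrative file name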

The LLM is then prompted again with the problem description, the plan, and the XML syntax tree of each retrieved file to identify the ranges of lines that need to be updated to resolve the issue. This approach makes more economical use of LLM bandwidth.

The software development agent is now ready to generate the code that resolves the issue. The LLM rewrites the relevant sections of code directly rather than attempting to generate a patch, a task that is much closer to what LLMs are optimized to perform. The agent then runs syntax validation on the generated code and attempts to fix any issues before moving to the final step. The original and rewritten code are passed through a diff library, which programmatically generates a patch. This is the final output, shared with you for review and approval.
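The post does not name the diff library. As an illustrative sketch, Python's built-in difflib can turn the original and rewritten code into a unified-diff patch of the kind shown to you for review:

# Sketch: programmatically generate a patch from original and rewritten code
# using Python's standard difflib (the library actually used by the agent is not named).
import difflib

original = """def add(a, b):
    return a + b
"""

rewritten = """def add(a: int, b: int) -> int:
    # Add two integers and return the sum.
    return a + b
"""

patch = difflib.unified_diff(
    original.splitlines(keepends=True),
    rewritten.splitlines(keepends=True),
    fromfile="a/example.py",  # illustrative file names
    tofile="b/example.py",
)
print("".join(patch))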

System Accuracy

As announced in the press release for the launch of the Amazon Q Developer Agent for feature development, our model scored 13.82% on SWE-bench and 20.33% on SWE-bench Lite, placing it at the top of the SWE-bench leaderboard as of May 2024. SWE-bench is a public dataset of more than 2,000 tasks drawn from 12 popular Python open source repositories. The primary metric reported on the SWE-bench leaderboard is the pass rate: how often all unit tests relevant to a given problem pass after the AI-generated code changes are applied. This is an important metric, because our customers want to use our agents to solve real-world problems, and we are proud to report a state-of-the-art pass rate.

No single metric tells the whole story. We see agent performance as one point on the Pareto front of multiple metrics. The Amazon Q Developer Agent for Software Development is not specifically optimized for SWE-bench. Our approach focuses on optimizing for different metrics and datasets. For example, we aim to balance correctness and resource efficiency, such as the number of LLM calls and input/output tokens used, as this has a direct impact on execution time and cost. In this regard, we pride ourselves on the ability of our solution to consistently deliver results within minutes.

The limitations of public benchmarks

Public benchmarks such as SWE-bench are an extremely useful contribution to the AI code generation community and present interesting scientific challenges. We thank the team that released and maintains this benchmark, and we are proud to share our state-of-the-art results on it. However, we would like to point out a few limitations, which are not unique to SWE-bench.

The success metric for SWE-bench is binary: either a code change passes all tests or it doesn't. We believe this does not fully capture the value that feature development agents bring to developers. Agents save developers a lot of time, even when they don't implement the entire feature at once. Latency, cost, number of LLM calls, and number of tokens are all highly correlated indicators of the computational complexity of a solution. This aspect is as important to our customers as accuracy.

The test cases included in the SWE-bench benchmark are publicly available on GitHub and may therefore have been included in the training data of various large language models. Although LLMs have the ability to memorize parts of their training data, it is difficult to quantify the extent to which this memorization occurs and whether the models are inadvertently leaking this information at test time.

To investigate this potential concern, we conducted multiple experiments to evaluate the potential for data leakage across a range of popular models. One way to test for memorization is to have a model predict the next line of a problem description given a very short context, a task it should in principle struggle with in the absence of memorization. Our findings show signs that recent models were trained on the SWE-bench dataset.

The following figure shows the distribution of ROUGE-L scores when we ask each model to complete the next sentence of a SWE-bench problem statement, given the preceding sentences.

ROUGE-L scores measuring information leakage of SWE-bench across different models.
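As a rough sketch of such a memorization probe (our illustration, not the exact experimental setup), one can ask a model to continue a problem statement and score the completion against the true next sentence with ROUGE-L, for example with the rouge-score package; complete_next_sentence below stands in for whatever model API is being evaluated.

# Sketch (assumption): score model completions of SWE-bench problem statements
# against the true next sentence with ROUGE-L; high scores hint at memorization.
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def leakage_score(context: str, true_next_sentence: str, complete_next_sentence) -> float:
    """`complete_next_sentence` is a placeholder for the model under test."""
    prediction = complete_next_sentence(context)
    return scorer.score(true_next_sentence, prediction)["rougeL"].fmeasure


# Hypothetical usage with a dummy "model" that simply echoes a fixed sentence.
dummy_model = lambda context: "The bug occurs when the session is closed twice."
print(leakage_score("Problem statement so far ...",
                    "The bug occurs when the session is closed twice.",
                    dummy_model))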

We shared measurements of our software development agent's performance on SWE-bench to provide a reference point. We encourage you to benchmark agents on private code repositories that were not used to train LLMs and to compare the results with our publicly available baseline. We will continue to benchmark our system on SWE-bench, but our emphasis is on testing against private benchmark datasets that are not used to train our models and that better represent the tasks submitted by our customers.

Conclusion

In this post, we discussed how to get started with the Amazon Q Developer Agent for software development. The agent automatically implements features that you describe in natural language in your IDE. We provided an overview of how the agent works behind the scenes and discussed its state-of-the-art accuracy and its position at the top of the SWE-bench leaderboard.

Now you are ready to explore the capabilities of the Amazon Q Developer Agent for software development and make it your personal AI coding assistant. Install the Amazon Q extension in your IDE of choice and use your AWS Builder ID to start using Amazon Q (including the software development agent) for free, or subscribe to Amazon Q to unlock higher limits.


About the Author

Christian Bock is an Applied Scientist at Amazon Web Services working on AI for code.

Laurent Caro is a Principal Applied Scientist at Amazon Web Services, where he leads a team that creates AI solutions for developers.

Tim Esler is a Senior Applied Scientist at Amazon Web Services, where he works on generative AI and coding agents to build developer and foundational tooling for Amazon Q products.

Prabhu Teja is an Applied Scientist at Amazon Web Services, working on LLM-assisted code generation with a focus on natural language interaction.

Martin Wistuba is a Senior Applied Scientist at Amazon Web Services and part of the Amazon Q Developer team, helping developers write more code in less time.

Giovanni Zappella is a Principal Applied Scientist working on creating intelligent agents for code generation. During his time at Amazon, he has also contributed to new algorithms for continual learning, AutoML, and recommender systems.


