
Thursday, February 27, 2025

Kepler’s Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding

Johannes Kepler’s analysis of Mars’ orbit stands as one of the greatest achievements in scientific history, revealing the elliptical nature of planetary paths and establishing the foundational laws of planetary motion. In this post, we’ll explore how to recreate Kepler’s revolutionary findings using Python’s robust data science ecosystem.
 

Our goal is not to produce a specialized scientific paper but to provide a clear, interactive, and visually appealing demonstration suitable for a broad audience.
Python libraries like NumPy, Pandas, SciPy, and Matplotlib provide an efficient environment for numerical computations, data manipulation, and visualization. Jupyter Notebooks further enhance this process by providing an interactive and user-friendly platform to run code, visualize results, and document your insights clearly.

Additionally, AI-assisted coding significantly simplifies technical tasks such as ellipse fitting, data interpolation, and creating insightful visualizations. This integration allows us to focus more on understanding the insights behind Kepler’s discoveries, making complex analyses accessible and engaging.

This project showcases:

  • A structured approach to data analysis using a handful of short Jupyter Notebooks.
  • How Python’s ecosystem (NumPy, Pandas, SciPy, Matplotlib) facilitates computational research.
  • The benefits of AI-assisted coding in accelerating development and improving workflow efficiency.
  • An interactive, visually engaging reproduction of Kepler’s findings.

The full code and notebooks are available at: GitHub Repository


Jupyter Notebooks and AI-Assisted Coding: A Powerful Combination for Data Science

Jupyter Notebooks have become the standard environment for data science, offering an interactive and flexible platform for scientific computing. They can be run on local machines or on cloud services such as Google Colab, Amazon SageMaker, IBM Watson Studio, Microsoft Azure, GitHub Codespaces, and Databricks. CERN users can also run notebooks on SWAN (Service for Web-based ANalysis), the CERN-hosted Jupyter notebook service widely used by engineers and physicists across CERN for large-scale scientific analysis.

How Python and AI Tools Enhance This Project

  • Data Interpolation & Curve Fitting: Python libraries like SciPy and AI-assisted tools help generate optimal curve fits in seconds (see the short example after this list).

  • Plotting & Visualization: AI-driven code completion and Matplotlib make it easier and faster to generate plots.

  • Error Handling & Debugging: AI suggestions help identify and fix errors quickly, improving workflow efficiency.

  • Exploring Alternative Approaches: AI can suggest different computational methods, allowing for a more robust and exploratory approach to the analysis.
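To make the first point concrete, here is a minimal curve-fitting sketch with SciPy. It fits a conic section in polar form, r = p / (1 + e*cos(theta - theta0)), to synthetic noisy samples; the Mars-like parameter values below are illustrative only and are not taken from the notebooks, which fit the triangulated Mars positions instead.

# Minimal curve-fitting sketch with SciPy (synthetic data, illustrative only).
import numpy as np
from scipy.optimize import curve_fit

def conic_radius(theta, p, e, theta0):
    """Heliocentric distance for a conic section in polar form (focus at the Sun)."""
    return p / (1.0 + e * np.cos(theta - theta0))

rng = np.random.default_rng(42)
theta = np.linspace(0.0, 2.0 * np.pi, 60)
r_true = conic_radius(theta, p=1.51, e=0.093, theta0=0.3)       # Mars-like parameters
r_obs = r_true + rng.normal(scale=0.005, size=theta.size)       # add measurement noise

(p_fit, e_fit, theta0_fit), _ = curve_fit(conic_radius, theta, r_obs, p0=[1.5, 0.1, 0.0])
a_fit = p_fit / (1.0 - e_fit**2)                                # semi-major axis from p and e
print(f"eccentricity ~ {e_fit:.3f}, semi-major axis ~ {a_fit:.3f} AU")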

Why Use Jupyter Notebooks and AI-Assisted Coding?

  • Saves Time: Avoids writing repetitive, boilerplate code.

  • Enhances Accuracy: Reduces human error in complex calculations.

  • Boosts Creativity: Frees up cognitive resources to focus on insights rather than syntax.

  • Flexible & Scalable: Python notebooks can be used locally or on powerful cloud-based platforms for large-scale computations.

  • Widely Adopted: Used by researchers, engineers, and data scientists across academia, industry, and institutions like CERN.


Overview of the Analysis

The project is structured into a series of Jupyter notebooks, each building on the previous one to triangulate Mars' orbit and verify Kepler’s laws.  

Click on the notebook links below to explore the details of each step.

  1. Notebook Generating Mars Ephemeris

    Generate the measurements of Mars' celestial positions

    • Data is key to the success of this analysis. Kepler used Tycho Brahe's observations; here we use NASA JPL's DE421 ephemeris via the Skyfield library to generate accurate planetary positions over a period of 12 Martian years (approximately 22 Earth years), starting from January 1, 2000.

    • Determine the ecliptic longitudes of Mars and the Sun in the plane of Earth's orbit, and filter out observations where Mars is obscured by the Sun.

    • Save the filtered ephemeris data into a CSV file (ephemeris_mars_sun.csv).

    • Key attributes in the saved data are: Date, Mars Ecliptic Longitude (deg), Sun Ecliptic Longitude (deg)
  2. Notebook Key Insight of Kepler's Analysis

    Understand how Earth-based observations reveal Mars’ trajectory

    • Mars completes one full revolution around the Sun in 687 days (one Mars year). During this period, Earth occupies a different position in its orbit at each observation. By selecting measurements taken exactly one Mars year apart, we capture Mars' apparent position from varied vantage points. With enough observations over several Mars years, these multiple perspectives enable us to triangulate the position of Mars.


    • Figure 1, Triangulating Mars' Position:
      • Select observations spaced 687 days apart (one Mars year) so that Mars is observed at nearly the same position relative to the Sun for each measurement.

      • For each observation, compute Earth's position in the ecliptic and derive Mars' line-of-sight vectors.
      • Apply least-squares estimation to solve for Mars' ecliptic coordinates (a minimal code sketch of this step appears after the notebook list below).
  3. Notebook Computing Mars' Orbit

    Calculate Mars' orbit by triangulating its position using all available observations.

    • Load the dataset (line_of_sight_mars_from_earth.csv) with Mars and Sun observations, notably the fields Date, Mars Ecliptic Longitude (deg), and Sun Ecliptic Longitude (deg). Compute Mars' heliocentric coordinates and estimate its orbit.

    • Generalized Triangulation

      • For each start date within the first Mars year, iterate through subsequent measurements at 687-day intervals (one Mars year), so that Mars is observed at nearly the same position relative to the Sun for each measurement.
      • Triangulate Mars' position from the accumulated data when at least two valid measurements are available.
      • Gracefully handle missing data and singular matrices to ensure robust estimation.
    • Compile the computed Mars positions into a results DataFrame and save the results to a CSV file (computed_Mars_orbit.csv) for further analysis.
  4. Notebook Kepler’s Laws

    Verify Kepler’s three laws with real data

    • Figure 2: Demonstrate Kepler's First Law by fitting an elliptical model to confirm that Mars' orbit is an ellipse with the Sun at one focus. The fitted parameters match accepted values, notably eccentricity e ~ 0.09 and semi-major axis a ~ 1.52 AU.

    • Second Law: Demonstrate that Mars sweeps out equal areas in equal time intervals using the measured values of Mars' orbit.

    • Third Law: Validate the harmonic law by comparing the ratio T^2/a^3 for Mars and Earth.

  5. Notebook Estimating Earth's Orbit

    Use Mars' ephemeris and line-of-sight data to determine Earth’s orbit

    • Earth Position Computation:

      • For each selected observation, compute Earth's heliocentric position by solving for the Earth-Sun distance using the observed Sun and Mars ecliptic longitudes and the estimated Mars position (found in notebook 3 of this series, "Computing Mars' Orbit").
      • Utilize a numerical solver (via fsolve) to ensure that the computed Earth position yields the correct LOS angle towards Mars.
    • Fit Earth's computed positions to an elliptical model and compare the results with accepted astronomical values.

    • Visualize Earth's orbit alongside the positions of Mars and the Sun.
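As referenced in the notebook descriptions above, the core triangulation step can be sketched in a few lines of NumPy. This is a simplified illustration rather than the notebooks' code: it assumes a circular Earth orbit of radius 1 AU, and the longitude values passed in at the end are illustrative instead of being read from the CSV files.

# Minimal triangulation sketch: intersect Earth->Mars lines of sight by least squares.
# Simplifying assumption: Earth on a circular 1 AU orbit; all angles are ecliptic longitudes in degrees.
import numpy as np

def earth_position(sun_longitude_deg):
    """Approximate heliocentric Earth position (AU): Earth lies opposite the Sun's geocentric longitude."""
    lon = np.radians(sun_longitude_deg + 180.0)
    return np.cos(lon), np.sin(lon)

def triangulate_mars(sun_longitudes_deg, mars_longitudes_deg):
    """Least-squares solution for Mars' heliocentric (x, y), given pairs of Sun and Mars
    geocentric ecliptic longitudes observed one Mars year apart.
    Each observation constrains Mars to the line through Earth with direction
    (cos(lam), sin(lam)):  sin(lam)*x - cos(lam)*y = sin(lam)*Ex - cos(lam)*Ey."""
    A, b = [], []
    for sun_lon, mars_lon in zip(sun_longitudes_deg, mars_longitudes_deg):
        ex, ey = earth_position(sun_lon)
        lam = np.radians(mars_lon)
        A.append([np.sin(lam), -np.cos(lam)])
        b.append(np.sin(lam) * ex - np.cos(lam) * ey)
    (x, y), *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return x, y

# Illustrative call with two hypothetical observations (real inputs come from the CSV files):
x, y = triangulate_mars([100.0, 145.0], [150.0, 120.0])
print(f"Mars at ({x:+.2f}, {y:+.2f}) AU, heliocentric distance {np.hypot(x, y):.2f} AU")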



Conclusion

Kepler’s groundbreaking work reshaped our understanding of planetary motion, and today, we can revisit his analysis with modern computational tools. By combining Jupyter Notebooks, Python’s scientific libraries, and AI-assisted coding, we demonstrate how complex data analysis can be performed efficiently and interactively.

This project serves as an example of how AI and open-source tools empower researchers, educators, and enthusiasts to explore scientific discoveries with greater ease and depth.


👉 Check out the full project and try the notebooks yourself! GitHub Repository



References

This work is directly inspired by Terence Tao's project Climbing the Cosmic Distance Ladder. In particular, see the two-part video series with Grant Sanderson (3Blue1Brown): Part 1 and Part 2.

Further details on Kepler's analysis can be found in Tao's draft book chapter "Chapter 4: Fourth Rung - The Planets" (download here).

Another insightful video on Kepler’s discoveries is How the Bizarre Path of Mars Reshaped Astronomy [Kepler's Laws Part 2] by Welch Labs.

Mars-Orbit-Workshop contains material to conduct a workshop recreating Kepler's analysis.

The original work of Kepler was published in Astronomia Nova (New Astronomy) in 1609. The book is available on archive.org; see, for example, this link to Chapter 42 of Astronomia Nova.

Figure 3: An illustration from Chapter 42 of Astronomia Nova (1609) by Kepler, depicting the key concept of triangulating Mars' position using observations taken 687 days apart (one Martian year). This is the original version of Figures 1 and 2 in this post.



Acknowledgements

This work has been conducted in the context of the Databases and Analytics activities at CERN. In particular, I'd like to thank my colleagues in the SWAN (Service for Web-based ANalysis) team.

Thursday, June 22, 2023

Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

This blog post is about building a getting-started example for semantic search using vector databases and large language models (LLMs), an example of the retrieval-augmented generation (RAG) architecture. You can find the accompanying notebook at this link. See also the SWAN gallery.

CERN users can run the notebooks using the SWAN platform and GPU resources.

Other options for running the notebooks in the cloud with a GPU include Google's Colab.


Goals and Scope

Our primary goal is to demonstrate the implementation of a search engine that focuses on understanding the meaning of documents rather than relying solely on keywords.

The proposed implementation uses resources currently available to CERN users: Jupyter notebooks with GPUs, Python packages from the open source ecosystem, and a vector database.

Limitations: this example does not cover building a fully fledged search service or chat engine. We leave those topics for future work; here we limit the discussion to a getting-started example and a technology demonstrator.


Understanding Key Concepts

Semantic search: Semantic search involves searching for meaning rather than just literal matches of query words. By understanding the context and intent behind the query, semantic search engines can provide more accurate and relevant results.  

Vector Database: A vector database is a specialized type of database designed to handle vector embeddings. These embeddings represent data in a way that captures essential semantic information. They are widely used in applications such as large language models, generative AI, and semantic search.  

Large Language Models (LLMs): LLMs are powerful language models built using artificial neural networks with a vast number of parameters (ranging from tens of millions to billions). These models are trained on extensive amounts of unlabeled text data using self-supervised or semi-supervised learning techniques.  


Implementation details

Building a semantic search prototype has become more accessible due to recent advancements in natural language processing and applied ML/AI. Using off-the-shelf components and integrating them effectively can accelerate the development process. Here are some key ingredients that facilitate this implementation:

  • Large Language Models (LLMs) and embedding Libraries
    • The availability of powerful LLMs such as OpenAI's GPT-3.5 and GPT-4 and Google's PaLM 2, together with embedding libraries, significantly simplifies the implementation of semantic search and natural language processing in general. These models provide comprehensive language understanding and generation capabilities, enabling us to extract meaning from text inputs.
  • Platforms: 
    • Platforms and cloud services such as Hugging Face offer valuable resources for working with ML models: they provide pre-trained models, tokenization utilities, and interfaces to interact with LLMs, reducing implementation complexity.
  • Open Source Libraries like LangChain:
    • Open source libraries like LangChain provide a convenient way to integrate and orchestrate the different components required for building applications in the semantic search domain. These libraries often offer pre-defined pipelines, data processing tools, and easy-to-use APIs, allowing developers to focus on the core logic of their applications.
  • Vector Databases  and Vector Libraries:  
    • Vector libraries play a crucial role in working with semantic embeddings. They provide functionalities for vector manipulation, similarity calculations, and operations necessary for processing and analyzing embedding data. Additionally, vector databases are recommended for advanced deployments, as they offer storage and querying capabilities for embeddings, along with metadata storage options. Several solutions are available in this area, ranging from mature products offered as cloud services to open source alternatives.

Back-end: prepare the embeddings and indexes in a vector database

To ensure factual accuracy and preserve the original document references, we will prepare the embeddings and indexes in a vector database for our semantic search query engine. Additionally, we aim to enable indexing of private documents, which necessitates storing the embeddings rather than relying on the LLM model directly.

Transforming document chunks into embedding vectors is a crucial step in the process. There are specialized libraries available that utilize neural networks for this task. These libraries can be accessed as cloud services or downloaded to run on local GPU resources. The accompanying notebook includes an example demonstrating this process.

A second important part is storing the embeddings. For this, a vector library or a vector database can be quite useful. A library like FAISS is a good choice if you have a small number of documents and/or are just prototyping. A vector database provides more features than a simple library, in particular when handling large numbers of documents. In the accompanying notebook we use the FAISS library and, as an alternative option, OpenSearch k-NN indexing. Note that several other vector database products can readily be substituted to offer comparable and, in some cases, extended functionality.

Note: CERN users have the option to contact the OpenSearch service to request an instance of OpenSearch equipped with the plugin for k-NN search. This can be a valuable resource for your semantic search implementation.


Figure 1: A schematic diagram of how to prepare a set of documents for semantic search. The documents are split into chunks; for each chunk, embeddings are computed with a specialized library and then stored in a vector database.

When using FAISS as the Vector library, this is how embedding and indexing can be done:
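The following is a minimal sketch of that step, assuming the langchain, pypdf, sentence-transformers, and faiss-cpu packages; the file name, embedding model, and chunking parameters are illustrative and may differ from the accompanying notebook.

# Split a document into chunks, compute embeddings, and index them with FAISS.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

loader = PyPDFLoader("hep_software_roadmap.pdf")            # hypothetical input document
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(loader.load())            # one chunk = one indexed unit

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
faiss_index = FAISS.from_documents(chunks, embeddings)      # embed and index the chunks
faiss_index.save_local("faiss_index")                       # persist the index to disk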




This is the equivalent code when using OpenSearch as Vector DB:
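Again as a hedged sketch, this time using LangChain's OpenSearchVectorSearch integration (which relies on the opensearch-py client); the connection details below are placeholders to be replaced with your own OpenSearch instance, for example one provided by the CERN OpenSearch service.

# Same chunks and embeddings as above, stored in an OpenSearch k-NN index instead of FAISS.
from langchain.vectorstores import OpenSearchVectorSearch

opensearch_index = OpenSearchVectorSearch.from_documents(
    chunks,
    embeddings,
    opensearch_url="https://my-opensearch-host:9200",        # placeholder endpoint
    http_auth=("username", "password"),                      # placeholder credentials
    index_name="semantic_search_demo",                       # name of the k-NN index to create
)

Only the storage back end changes: the same chunks and embeddings objects are reused from the FAISS example.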



Semantic querying using similarity search and vector DB indexes


This uses a key functionality of vector libraries and vector databases: similarity search. The general idea is to create a vector embedding for the query and find, among the stored embedding vectors, the elements closest to the query. For large document collections this can be slow, so vector libraries and databases implement specialized indexes and algorithms for this, for example approximate k-nearest-neighbor search.

Figure 2:  A diagram of the similarity query process. The query is converted into embeddings and similarity search via the specialized indexes is performed using a vector database or vector library. Algorithms such as k-nearest neighbors are used to find the matching document chunks for the given query.

Semantic search provides a list of relevant documents for a user query, listing the page and text-chunk references, as in this example:
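As an illustration of what such a query looks like in code, the snippet below reuses the FAISS index built earlier; the query text and the number of returned chunks are arbitrary choices.

# Embed the query and retrieve the k most similar document chunks.
query = "How will machine learning be used for LHC data analysis?"   # example question
results = faiss_index.similarity_search(query, k=4)

for doc in results:
    page = doc.metadata.get("page")                      # page reference stored by the loader
    print(f"page {page}: {doc.page_content[:200]} ...")  # print a short snippet of each chunk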



Grand Finale: a Large Language Model for natural language query capabilities


Semantic search returns a list of relevant document snippets; as the last (optional) step, we want to convert that into a coherent text answer. For this we can use an LLM. The technique is simple: we feed the query and the relevant pieces of text to the LLM and take the answer from the model. This step requires a rather capable LLM. The best ones currently run as cloud services (some free, some charging per use); other models available for free download currently require rather powerful GPUs to run locally.

This is the final result: a system capable of querying the indexed text(s) using natural language. In the following example, we apply it to answer queries about the future of LHC computing, based on the document A Roadmap for HEP Software and Computing R&D for the 2020s.
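A minimal sketch of this last step, assuming LangChain's RetrievalQA chain with an OpenAI-hosted model (an API key is required; the accompanying notebook may wire the components differently, and another LLM back end can be swapped in):

# Combine the retriever with an LLM to turn the retrieved chunks into a text answer.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)                                  # requires OPENAI_API_KEY in the environment
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                      # stuff the retrieved chunks into the prompt
    retriever=faiss_index.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What are the main computing challenges for the LHC in the 2020s?"))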




Conclusions

In this blog post, we have demonstrated how to build a beginner's semantic search system using vector databases and large language models (LLMs). Our example has utilized Jupyter notebooks with GPUs, Python packages, and a vector database, proving that a semantic search engine that queries documents for meaning, instead of just keywords, can be feasibly built using existing resources.

In our implementation, we demonstrated how embeddings and indexing can be performed using FAISS as the vector library, or alternatively with OpenSearch as the vector database. We then moved on to the semantic query process using similarity search and vector DB indexes. To finalize the results, we utilized an LLM to convert the relevant document snippets into a coherent text answer.

Though the example provided is not intended to function as a fully-developed search service, it serves as an excellent starting point and technological demonstrator for those interested in semantic search engines. Additionally, we acknowledge the potential of these methods to handle private documents and produce factually accurate results with original document references.

We believe the combination of semantic search, vector databases, and large language models holds great potential for transforming how we approach information retrieval and natural language processing tasks.

The accompanying notebook, providing step-by-step code and more insights, is accessible on GitHub and via the CERN SWAN Gallery. For researchers and developers interested in delving into this exciting area of applied ML/AI, it offers a working example that can be run using CERN resources on SWAN and can also run on Colab.


Acknowledgements

I would like to express my sincere gratitude to my colleagues at CERN for their invaluable assistance and insightful suggestions. In particular, I'd like to acknowledge the CERN data analytics and web notebook services, the OpenSearch service, and the ATLAS database and data engineering teams. Their expertise and support have played a crucial role in making this collection of notebooks possible. Thank you for your contributions and dedication.